This package was made to bridge the gap between the raw output of HotNet2 and further visualization and analysis. This vignette describes how I ran the example provided by the Raphael group on the HotNet2 GitHub repository.
The first step is to clone the HotNet2 repository from GitHub and set up an environment in which to use it.
I began by making a specific directory for this example analysis.
I then cloned the GitHub repository.
The HotNet2 tool was written in Python 2.7. Therefore, I set up a virtual environment, activated it, and then installed the necessary libraries (listed in the “requirements.txt” file).
# create virtual environment
virtualenv venv-hotnet2
# activate virtual environment
source venv-hotnet2/bin/activate
# install the libraries for HotNet2
cd hotnet2
pip install -r requirements.txt
To speed up HotNet2, the team implemented some of the algorithms (namely the edge-swapping algorithm) in Fortran and C. The two commands below pre-compile that code. They produce a few warnings, but no errors.
python hotnet2/ build_src build_ext --inplace
python hotnet2/ build_src build_ext --inplace
The heat file contains the starting “heat” for each node. The input data needs to follow the correct format, so make sure to take a look at “hotnet2/example/example.heat” to see what it should look like.
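A quick format check can catch problems before the file is handed to HotNet2. The sketch below assumes the two-column, tab-separated layout described in the help text further down (gene name, then heat score). The function name `parse_heat_lines` and the sample rows are my own illustration, not part of HotNet2.

```python
def parse_heat_lines(lines):
    """Parse 'gene<TAB>score' lines into a dict, raising on malformed rows."""
    heat = {}
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines (a choice made for this sketch)
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 2 tab-separated fields, got %d"
                % (line_number, len(fields))
            )
        gene, score = fields
        heat[gene] = float(score)  # raises ValueError if the score is not numeric
    return heat


# Illustrative rows in the same two-column layout.
sample = ["gene1\t15\n", "gene2\t6\n", "gene3\t5\n"]
heat = parse_heat_lines(sample)
print(heat["gene1"])
```

For a real file, the same function can be applied to an open file handle, for example `parse_heat_lines(open("example/example.heat"))`.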
The heat file is first transformed into a JSON file in the format HotNet2 expects. The help information for the script that does this is shown below.
python --help
#> usage: [-h] {scores,mutation,oncodrive,mutsig,music} ...
#> Generates a JSON heat file for input to runHotNet2.
#> optional arguments:
#> -h, --help show this help message and exit
#> Heat score type:
#> {scores,mutation,oncodrive,mutsig,music}
#> scores Pre-computed heat scores
#> mutation Mutation data
#> oncodrive Oncodrive scores
#> mutsig MutSig scores
#> music MuSiC scores
The script can take several types of input. Here, I will demonstrate two of them. The first is scores; below is the help information.
python scores --help
#> usage: scores [-h] [-o OUTPUT_FILE] [-n NAME] -hf HEAT_FILE
#> optional arguments:
#> -h, --help show this help message and exit
#> -o OUTPUT_FILE, --output_file OUTPUT_FILE
#> Output file. If none given, output will be written to
#> stdout.
#> -n NAME, --name NAME Name/Label describing the heat scores.
#> -hf HEAT_FILE, --heat_file HEAT_FILE
#> Path to a tab-separated file containing a gene name in
#> the first column and the heat score for that gene in
#> the second column of each line.
#> -ms MIN_HEAT_SCORE, --min_heat_score MIN_HEAT_SCORE
#> Minimum heat score for genes to have their original
#> heat score in the resulting output file. Genes with
#> score below this value will be assigned score 0.
#> -gff GENE_FILTER_FILE, --gene_filter_file GENE_FILTER_FILE
#> Path to file listing genes whose heat scores should be
#> preserved, one per line. If present, all other heat
#> scores will be discarded.
Using the above information, I wrote the following command to make the JSON heat file from “example/example.heat” and saved it to “example_run/heatfiles/scores_heatfile.json”.
python scores \
--heat_file example/example.heat \
--name scoresheat \
--output_file example_run/heatfiles/scores_heatfile.json
#> * Loading heat scores for 25 genes
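Because the output is ordinary JSON, it can be inspected with a few lines of Python. The sketch below assumes the layout shown in this vignette: a top-level "heat" object mapping gene names to scores, next to a "parameters" object. The inline string is a shortened stand-in for the real file, and `top_genes` is a hypothetical helper.

```python
import json

# Shortened stand-in for the JSON heat file (a subset of the example scores).
heat_json = """
{
  "heat": {"gene1": 15.0, "gene2": 6.0, "gene3": 5.0, "gene10": 5.0, "gene8": 4.0},
  "parameters": {"name": "HN2example", "min_heat_score": 0}
}
"""


def top_genes(document, n):
    """Return the n hottest (gene, score) pairs, ties broken by gene name."""
    heat = document["heat"]
    ranked = sorted(heat.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


document = json.loads(heat_json)
for gene, score in top_genes(document, 3):
    print("%s\t%s" % (gene, score))
```

To run this on the actual output, replace `json.loads(heat_json)` with `json.load(open("example_run/heatfiles/scores_heatfile.json"))`.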
Here is what the resulting JSON looked like.
{
  "heat": {
    "gene8": 4.0,
    "gene9": 2.0,
    "gene1": 15.0,
    "gene2": 6.0,
    "gene3": 5.0,
    "gene4": 1.0,
    "gene5": 3.0,
    "gene6": 2.0,
    "gene7": 1.0,
    "gene12": 1.0,
    "gene13": 1.0,
    "gene10": 5.0,
    "gene11": 1.0,
    "gene16": 1.0,
    "gene17": 1.0,
    "gene14": 1.0,
    "gene15": 1.0,
    "gene18": 1.0,
    "gene19": 1.0,
    "gene23": 1.0,
    "gene22": 1.0,
    "gene21": 1.0,
    "gene20": 1.0,
    "gene25": 1.0,
    "gene24": 1.0
  },
  "parameters": {
    "name": "HN2example",
    "gene_filter_file": null,
    "heat_file": "example/example.heat",
    "output_file": "example/example_heatfile.json",
    "min_heat_score": 0,
    "heat_fn": "load_direct_heat"
  }
}
Below, I show the help information and the command I used to create a heat file using CNA and SNV information.
python mutation --help
#> usage: mutation [-h] [-o OUTPUT_FILE] [-n NAME] --snv_file
#> SNV_FILE [--cna_file CNA_FILE]
#> [--sample_file SAMPLE_FILE]
#> [--sample_type_file SAMPLE_TYPE_FILE]
#> [--gene_file GENE_FILE] [--min_freq MIN_FREQ]
#> [--cna_filter_threshold CNA_FILTER_THRESHOLD]
#> optional arguments:
#> -h, --help show this help message and exit
#> -o OUTPUT_FILE, --output_file OUTPUT_FILE
#> Output file. If none given, output will be written to
#> stdout.
#> -n NAME, --name NAME Name/Label describing the heat scores.
#> --snv_file SNV_FILE Path to a tab-separated file containing SNVs where the
#> first column of each line is a sample ID and
#> subsequent columns contain the names of genes with
#> SNVs in that sample. Lines starting with "#" will be
#> ignored.
#> --cna_file CNA_FILE Path to a tab-separated file containing CNAs where the
#> first column of each line is a sample ID and
#> subsequent columns contain gene names followed by
#> "(A)" or "(D)" indicating an amplification or deletion
#> in that gene for the sample. Lines starting with "#"
#> will be ignored.
#> --sample_file SAMPLE_FILE
#> File listing samples. Any SNVs or CNAs in samples not
#> listed in this file will be ignored. If HotNet is run
#> with mutation permutation testing, all samples in this
#> file will be eligible for random mutations even if the
#> sample did not have any mutations in the real data. If
#> not provided, the set of samples is assumed to be all
#> samples that are provided in the SNV or CNA data.
#> --sample_type_file SAMPLE_TYPE_FILE
#> File listing type (e.g. cancer, datasets, etc.) of
#> samples (see --sample_file). Each line is a space-
#> separated row listing one sample and its type. The
#> sample types are used for creating the HotNet(2) web
#> output.
#> --gene_file GENE_FILE
#> File listing tested genes. SNVs or CNAs in genes not
#> listed in this file will be ignored. If HotNet is run
#> with mutation permutation testing, every gene in this
#> file will be eligible for random mutations even if the
#> gene did not have mutations in any samples in the
#> original data. If not provided, the set of tested
#> genes is assumed to be all genes that have mutations
#> in either the SNV or CNA data.
#> --min_freq MIN_FREQ The minimum number of samples in which a gene must
#> have an SNV to be considered mutated in the heat score
#> calculation.
#> --cna_filter_threshold CNA_FILTER_THRESHOLD
#> Proportion of CNAs in a gene across samples that must
#> share the same CNA type in order for the CNAs to be
#> included. This must either be > .5, or the default,
#> None, in which case all CNAs will be included.
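To make the SNV file layout and the --min_freq option more concrete, here is a small sketch. It only illustrates the layout described in the help text (sample ID first, then gene names, with "#" lines ignored) and a simple frequency filter. It is not HotNet2's heat calculation, and the function names and sample data are hypothetical.

```python
from collections import Counter


def count_snv_samples(lines):
    """Count, for each gene, the number of samples with an SNV in it."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # comment and blank lines are skipped
        fields = line.rstrip("\n").split("\t")
        genes = set(fields[1:])  # the first field is the sample ID
        genes.discard("")        # tolerate trailing tabs
        counts.update(genes)     # each gene counts once per sample
    return counts


def genes_passing_min_freq(counts, min_freq):
    """Genes mutated in at least min_freq samples, sorted by name."""
    return sorted(gene for gene, n in counts.items() if n >= min_freq)


# Hypothetical SNV rows: sample ID followed by mutated genes.
snv_lines = [
    "# sample\tgenes\n",
    "sampleA\tgene1\tgene2\n",
    "sampleB\tgene1\n",
    "sampleC\tgene1\tgene3\n",
]
counts = count_snv_samples(snv_lines)
print(genes_passing_min_freq(counts, 2))
```

With a threshold of 2, only genes mutated in at least two of the three hypothetical samples remain.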
The next step is to prepare the real and permuted network files. Note that this step can take a very long time on a personal computer with a real PPI network, though the example data takes only a few seconds to run.
Below I show the help information for the script.
python --help
#> usage: [-h] -e EDGELIST_FILE -i GENE_INDEX_FILE -nn
#> -b BETA [-op] [-q Q] [-ps PERMUTATION_START_INDEX]
#> Create the personalized pagerank matrix and 100 permuted PPR matrices for
#> thegiven network and restart probability beta.
#> optional arguments:
#> -h, --help show this help message and exit
#> -e EDGELIST_FILE, --edgelist_file EDGELIST_FILE
#> Path to TSV file listing edges of the interaction
#> network, whereeach row contains the indices of two
#> genes that are connected in thenetwork.
#> -i GENE_INDEX_FILE, --gene_index_file GENE_INDEX_FILE
#> Path to tab-separated file containing an index in the
#> first columnand the name of the gene represented at
#> that index in the secondcolumn of each line.
#> -nn NETWORK_NAME, --network_name NETWORK_NAME
#> Name of network.
#> -p PREFIX, --prefix PREFIX
#> Output prefix.
#> -is INDEX_FILE_START_INDEX, --index_file_start_index INDEX_FILE_START_INDEX
#> Minimum index in the index file.
#> -b BETA, --beta BETA Beta is the restart probability for the insulated heat
#> diffusion process.
#> -op, --only_permutations
#> Only permutations, i.e., do not generate influence
#> matrix forobserved data. Useful for generating
#> permuted network files onmultiple machines.
#> -q Q, --Q Q Edge swap constant. The script will attempt Q*|E| edge
#> swaps
#> Index at which to start permutation file names.
#> -np NUM_PERMUTATIONS, --num_permutations NUM_PERMUTATIONS
#> Number of permuted networks to create.
#> -o OUTPUT_DIR, --output_dir OUTPUT_DIR
#> Output directory.
#> -c CORES, --cores CORES
#> Use given number of cores. Pass -1 to use all
#> available.
I then used it on the two provided edge lists, “example/example_edgelist.txt” and “example/example_edgelist2.txt”. The output of each is saved to a different directory. Note that I just used 0.5 for beta without trying different values. There is a method for programmatically deciding on a value for the \(\beta\) of the RWR (described in the paper’s Supplementary), though it doesn’t seem to be too important to get a high-precision value (judging from personal experience and the authors’ comments in the Supplementary).
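To build some intuition for what beta does, the sketch below simulates a random walk that, at every step, returns to its starting gene with probability beta and otherwise moves to a uniformly chosen neighbor. This is my reading of the restart process. The four-node graph is hypothetical, and the code is a plain-Python illustration rather than HotNet2's implementation.

```python
def diffuse_from(source, neighbors, beta, tol=1e-12, max_iter=10000):
    """Stationary distribution of a walk restarting at `source` with prob. beta.

    Assumes every node has at least one neighbor.
    """
    nodes = sorted(neighbors)
    p = {v: 0.0 for v in nodes}
    p[source] = 1.0
    for _ in range(max_iter):
        nxt = {v: 0.0 for v in nodes}
        nxt[source] += beta  # mass that restarts at the source
        for u in nodes:
            # the remaining mass at u is split evenly among u's neighbors
            share = (1.0 - beta) * p[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        if max(abs(nxt[v] - p[v]) for v in nodes) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical path graph: a - b - c - d
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
for beta in (0.3, 0.5, 0.7):
    column = diffuse_from("a", neighbors, beta)
    print(beta, [round(column[v], 3) for v in "abcd"])
```

A larger beta keeps more of the heat at the starting gene, while a smaller beta lets it spread further along the network.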
python \
--edgelist_file example/example_edgelist.txt \
--gene_index_file example/example_gene_index.txt \
--network_name network1 \
--prefix network1 \
--beta 0.5 \
--num_permutations 100 \
--output_dir example_run/network1 \
--cores 1
#> Creating PPR matrix for real network
#> --------------------------------------
#> Creating edge lists for permuted networks
#> -------------------------------------------
#> * Loading edge list..
#> - 31 edges among 25 nodes.
#> - No. swaps to attempt = 3565.0
#> * Creating permuted networks...
#> 100/100
#> * Avg. No. Swaps Made: 1842
#> Creating PPR matrices for permuted networks
#> ---------------------------------------------
#> 100/100%
python \
--edgelist_file example/example_edgelist2.txt \
--gene_index_file example/example_gene_index.txt \
--network_name network2 \
--prefix network2 \
--beta 0.5 \
--num_permutations 100 \
--output_dir example_run/network2 \
--cores 1
#>Creating PPR matrix for real network
#>Creating edge lists for permuted networks
#>* Loading edge list..
#> - 30 edges among 25 nodes.
#> - No. swaps to attempt = 3450.0
#>* Creating permuted networks...
#>* Avg. No. Swaps Made: 1820
#>Creating PPR matrices for permuted networks
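The help text says the script attempts Q*|E| edge swaps, so the logs above can be used to back out the value of Q that was in effect. The short calculation below uses only the numbers printed in the two logs. That this value is the script's default is my inference, since no -q option was passed.

```python
# (number of edges, swaps attempted, average swaps made) from the two logs
runs = {
    "network1": (31, 3565.0, 1842),
    "network2": (30, 3450.0, 1820),
}


def implied_q(num_edges, swaps_attempted):
    """Edge swap constant implied by: swaps attempted = Q * |E|."""
    return swaps_attempted / num_edges


for name, (num_edges, attempted, made) in sorted(runs.items()):
    q = implied_q(num_edges, attempted)
    success = made / attempted
    print("%s: Q = %g, fraction of swaps made = %.2f" % (name, q, success))
```

Both networks give the same Q, and in both cases roughly half of the attempted swaps were actually made.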
Here is what the file structure for the first network looks like.
HotNet2 is run using the helper script whose help information is shown below.
python --help
#> usage: [-h] -nf [NETWORK_FILES [NETWORK_FILES ...]] -pnp
#> [-dnf DISPLAY_NAME_FILE] [--output_hierarchy]
#> [--verbose {0,1,2,3,4}]
#> Helper script for simple runs of generalized HotNet2, including
#> automatedparameter selection.
#> optional arguments:
#> -h, --help show this help message and exit
#> Path to HDF5 (.h5) file containing influence matrix
#> and edge list.
#> Path to influence matrices for permuted networks, one
#> path per network file. Include ##NUM## in the path to
#> be replaced with the iteration number
#> -hf [HEAT_FILES [HEAT_FILES ...]], --heat_files [HEAT_FILES [HEAT_FILES ...]]
#> Path to heat file containing gene names and scores.
#> This can eitherbe a JSON file created by
#>, in which case the filename must end
#> in .json, or a tab-separated file containing a
#> genename in the first column and the heat score for
#> that gene in thesecond column of each line.
#> -ccs MIN_CC_SIZE, --min_cc_size MIN_CC_SIZE
#> Minimum size connected components that should be
#> returned.
#> -d [DELTAS [DELTAS ...]], --deltas [DELTAS [DELTAS ...]]
#> Delta value(s).
#> Number of permutations to be used for delta parameter
#> selection.
#> Number of permutations to be used for consensus
#> statistical significance testing.
#> Number of permutations to be used for statistical
#> significance testing.
#> Output directory. Files results.json, components.txt,
#> andsignificance.txt will be generated in
#> subdirectories for each delta.
#> -c NUM_CORES, --num_cores NUM_CORES
#> Number of cores to use for running permutation tests
#> in parallel. If-1, all available cores will be used.
#> -dsf DISPLAY_SCORE_FILE, --display_score_file DISPLAY_SCORE_FILE
#> Path to a tab-separated file containing a gene name in
#> the firstcolumn and the display score for that gene in
#> the second column ofeach line.
#> -dnf DISPLAY_NAME_FILE, --display_name_file DISPLAY_NAME_FILE
#> Path to a tab-separated file containing a gene name in
#> the firstcolumn and the display name for that gene in
#> the second column ofeach line.
#> --output_hierarchy Output the hierarchical decomposition of the HotNet2
#> similarity matrix.
#> --verbose {0,1,2,3,4}
#> Set verbosity of output (minimum: 0, maximum: 5).
The first run of HotNet2 that I show in this vignette runs one heat file on one network. This is the simplest use case for HotNet2 and is likely to be sufficient for most users.
python \
--network_files example_run/network1/network1_ppr_0.5.h5 \
--permuted_network_paths example_run/network1/permuted/network1_ppr_0.5_##NUM##.h5 \
--heat_files example_run/heatfiles/mutations_heatfile.json \
--output_directory example_run/output_simple \
--num_cores 1
#> * Running HotNet2 in consensus mode...
#> - network1 mutationsheat
#> * Outputting results to file...
#> * Generating and outputting visualization data...
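According to the help text, the ##NUM## placeholder in --permuted_network_paths is replaced with the iteration number. A minimal sketch of that substitution follows. The helper name `expand_permuted_paths` and the assumption that numbering starts at 1 are mine, not taken from HotNet2.

```python
def expand_permuted_paths(template, num_permutations, start_index=1):
    """Replace ##NUM## with consecutive permutation numbers.

    start_index=1 is an assumption made for this sketch.
    """
    if "##NUM##" not in template:
        raise ValueError("template must contain the ##NUM## placeholder")
    return [
        template.replace("##NUM##", str(i))
        for i in range(start_index, start_index + num_permutations)
    ]


template = "example_run/network1/permuted/network1_ppr_0.5_##NUM##.h5"
paths = expand_permuted_paths(template, 3)
for path in paths:
    print(path)
```

Listing the expanded paths this way and checking that each file exists is a cheap sanity check before starting a long run.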
The second example run, shown below, runs two heat files on two networks. HotNet2 then creates a “consensus” output for the two networks.
python \
--network_files example_run/network1/network1_ppr_0.5.h5 \
example_run/network2/network2_ppr_0.5.h5 \
--permuted_network_paths example_run/network1/permuted/network1_ppr_0.5_##NUM##.h5 \
example_run/network2/permuted/network2_ppr_0.5_##NUM##.h5 \
--heat_files example_run/heatfiles/mutations_heatfile.json \
example_run/heatfiles/scores_heatfile.json \
--output_directory example_run/output_consensus \
--num_cores 1
#> * Running HotNet2 in consensus mode...
#> - network1 mutationsheat
#> - network1 scoresheat
#> - network2 mutationsheat
#> - network2 scoresheat
#> * Outputting results to file...
#> * Generating and outputting visualization data...
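Judging from the log above, HotNet2 runs every heat file on every network, so two networks and two heat files yield four individual runs before the consensus step. The snippet below reproduces that pairing. The "network-heat" label format is taken from the output directory names in this vignette, not from the HotNet2 source.

```python
from itertools import product

networks = ["network1", "network2"]
heat_names = ["mutationsheat", "scoresheat"]

# One run per (network, heat file) pair, in the order the log reports them.
run_labels = ["%s-%s" % pair for pair in product(networks, heat_names)]
for label in run_labels:
    print(label)
```

The number of individual runs therefore grows as the product of the two lists, which is worth keeping in mind for real PPI networks.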
Here is a diagram showing the file structure for the simple output.
tree example_run/output_simple
#> example_run/output_simple
#> ├── consensus
#> │ ├── stats.tsv
#> │ ├── subnetworks.json
#> │ ├── subnetworks.tsv
#> │ └── viz-data.json
#> └── network1-mutationsheat
#> ├── delta_0.007907676044851542
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> ├── delta_1.024861412588507e-05
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> └── viz-data.json
#> 4 directories, 11 files
The output for the run with multiple networks is similar in structure, but it contains one directory for each pair of network and heat file.
tree example_run/output_consensus
#> example_run/output_consensus
#> ├── consensus
#> │ ├── stats.tsv
#> │ ├── subnetworks.json
#> │ ├── subnetworks.tsv
#> │ └── viz-data.json
#> ├── network1-mutationsheat
#> │ ├── delta_0.007907676044851542
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_1.024861412588507e-05
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ └── viz-data.json
#> ├── network1-scoresheat
#> │ ├── delta_0.03640584065578878
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_0.04944752808660269
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_0.052240293473005295
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_0.10117589682340622
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ └── viz-data.json
#> ├── network2-mutationsheat
#> │ ├── delta_0.005027415044605733
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_5.711058292945381e-06
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ └── viz-data.json
#> └── network2-scoresheat
#> ├── delta_0.051236389204859734
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> ├── delta_0.05293082073330879
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> ├── delta_0.05503008887171745
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> ├── delta_0.10069134272634983
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> └── viz-data.json
#> 17 directories, 44 files
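The delta values are encoded in the directory names, so they are easy to recover for downstream analysis. The sketch below parses them from names copied out of the tree above. `parse_delta` is a hypothetical helper, and in practice the names would come from something like `os.listdir`.

```python
def parse_delta(dirname):
    """Extract the numeric delta from a directory name like 'delta_0.0079'."""
    prefix = "delta_"
    if not dirname.startswith(prefix):
        raise ValueError("not a delta directory: %r" % dirname)
    return float(dirname[len(prefix):])


# Directory names from the network1-mutationsheat output above.
names = ["delta_0.007907676044851542", "delta_1.024861412588507e-05"]

deltas = sorted(parse_delta(name) for name in names)
print(deltas)
```

Note that sorting the directory names as strings does not sort them numerically, because of the scientific notation: the second name above sorts last as text but holds the smaller delta.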