HotNet2-setup-and-example.Rmd
library(HotNetvieweR)
This package was made to bridge the gap between the raw output of HotNet2 and further visualization and analysis. This vignette describes how I ran the example provided by the Raphael group on the HotNet2 GitHub repository.
The first step is to clone the HotNet2 repository from GitHub and set up the environment to use it in.
I began by making a specific directory for this example analysis.
I then cloned the GitHub repository.
The HotNet2 tool was written in Python 2.7. Therefore, I set up a virtual environment, activated it, and then installed the necessary libraries (listed in the “requirements.txt” file).
# create virtual environment
virtualenv venv-hotnet2
# activate virtual environment
source venv-hotnet2/bin/activate
# install the libraries for HotNet2
cd hotnet2
pip install -r requirements.txt
To speed up HotNet2, the authors implemented some of the algorithms (namely the edge-swapping algorithm) in Fortran and C. The two commands below compile that code. They produce a few warnings, but no errors.
python hotnet2/setup_fortran.py build_src build_ext --inplace
python hotnet2/setup_c.py build_src build_ext --inplace
The heat file contains the starting “heat” for each node. The input data needs to follow the correct format, so make sure to take a look at “hotnet2/example/example.heat” to see what it should look like.
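As a quick check before handing a heat file to HotNet2, here is a minimal sketch of a format validator. It assumes the two-column, tab-separated layout (gene name, then heat score) that the `makeHeatFile.py scores` help text describes. The function name `read_heat_file` and the sample rows are made up for illustration, and the check may be stricter than HotNet2's own loader. It is written for Python 3, so run it outside the Python 2.7 virtual environment.

```python
import io

def read_heat_file(handle):
    """Parse a two-column, tab-separated heat file into {gene: score}."""
    heat = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError("line %d: expected 2 tab-separated fields, got %d"
                             % (line_no, len(fields)))
        gene, score = fields
        heat[gene] = float(score)  # raises ValueError on a non-numeric score
    return heat

# Made-up rows for illustration (not the contents of example.heat)
sample = io.StringIO("gene1\t15\ngene2\t6\n")
heat = read_heat_file(sample)
print(heat)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.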
The heat file is first transformed into a JSON file in a specific format for HotNet2. This is done with the makeHeatFile.py script; the help information is shown below.
python makeHeatFile.py --help
#> usage: makeHeatFile.py [-h] {scores,mutation,oncodrive,mutsig,music} ...
#>
#> Generates a JSON heat file for input to runHotNet2.
#>
#> optional arguments:
#> -h, --help show this help message and exit
#>
#> Heat score type:
#> {scores,mutation,oncodrive,mutsig,music}
#> scores Pre-computed heat scores
#> mutation Mutation data
#> oncodrive Oncodrive scores
#> mutsig MutSig scores
#> music MuSiC scores
The makeHeatFile.py script can take several types of input. Here, I will demonstrate two of them. The first is scores; below is the help information.
python makeHeatFile.py scores --help
#> usage: makeHeatFile.py scores [-h] [-o OUTPUT_FILE] [-n NAME] -hf HEAT_FILE
#> [-ms MIN_HEAT_SCORE] [-gff GENE_FILTER_FILE]
#>
#> optional arguments:
#> -h, --help show this help message and exit
#> -o OUTPUT_FILE, --output_file OUTPUT_FILE
#> Output file. If none given, output will be written to
#> stdout.
#> -n NAME, --name NAME Name/Label describing the heat scores.
#> -hf HEAT_FILE, --heat_file HEAT_FILE
#> Path to a tab-separated file containing a gene name in
#> the first column and the heat score for that gene in
#> the second column of each line.
#> -ms MIN_HEAT_SCORE, --min_heat_score MIN_HEAT_SCORE
#> Minimum heat score for genes to have their original
#> heat score in the resulting output file. Genes with
#> score below this value will be assigned score 0.
#> -gff GENE_FILTER_FILE, --gene_filter_file GENE_FILTER_FILE
#> Path to file listing genes whose heat scores should be
#> preserved, one per line. If present, all other heat
#> scores will be discarded.
Using the above information, I wrote the following command to make the JSON heat file from “example/example.heat” and saved it to “example_run/heatfiles/scores_heatfile.json”.
python makeHeatFile.py scores \
--heat_file example/example.heat \
--name scoresheat \
--output_file example_run/heatfiles/scores_heatfile.json
#> * Loading heat scores for 25 genes
Here is what the resulting JSON looked like.
{
"heat": {
"gene8": 4.0,
"gene9": 2.0,
"gene1": 15.0,
"gene2": 6.0,
"gene3": 5.0,
"gene4": 1.0,
"gene5": 3.0,
"gene6": 2.0,
"gene7": 1.0,
"gene12": 1.0,
"gene13": 1.0,
"gene10": 5.0,
"gene11": 1.0,
"gene16": 1.0,
"gene17": 1.0,
"gene14": 1.0,
"gene15": 1.0,
"gene18": 1.0,
"gene19": 1.0,
"gene23": 1.0,
"gene22": 1.0,
"gene21": 1.0,
"gene20": 1.0,
"gene25": 1.0,
"gene24": 1.0
},
"parameters": {
"name": "HN2example",
"gene_filter_file": null,
"heat_file": "example/example.heat",
"output_file": "example/example_heatfile.json",
"min_heat_score": 0,
"heat_fn": "load_direct_heat"
}
}
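To inspect a heat JSON like the one above without reading it by eye, the following sketch loads it and ranks genes by heat. It relies only on the top-level `"heat"` key visible in the output above. The helper name `top_genes` is hypothetical, and the inline string is a shortened copy of that output used in place of the real file.

```python
import json

def top_genes(heat_json_text, n=3):
    """Return the n hottest (gene, score) pairs from a HotNet2-style heat JSON."""
    data = json.loads(heat_json_text)
    # Sort by descending score, breaking ties alphabetically by gene name
    ranked = sorted(data["heat"].items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]

# Shortened copy of the JSON shown above (four of the 25 genes)
example = ('{"heat": {"gene1": 15.0, "gene2": 6.0, "gene3": 5.0, "gene4": 1.0},'
           ' "parameters": {"name": "HN2example"}}')
top = top_genes(example)
print(top)  # [('gene1', 15.0), ('gene2', 6.0), ('gene3', 5.0)]
```

To use it on the real file, read the file contents into a string and pass that string instead of `example`.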
Below, I show the help information and the command I used to create a heat file using CNA and SNV information.
python makeHeatFile.py mutation --help
#> usage: makeHeatFile.py mutation [-h] [-o OUTPUT_FILE] [-n NAME] --snv_file
#> SNV_FILE [--cna_file CNA_FILE]
#> [--sample_file SAMPLE_FILE]
#> [--sample_type_file SAMPLE_TYPE_FILE]
#> [--gene_file GENE_FILE] [--min_freq MIN_FREQ]
#> [--cna_filter_threshold CNA_FILTER_THRESHOLD]
#>
#> optional arguments:
#> -h, --help show this help message and exit
#> -o OUTPUT_FILE, --output_file OUTPUT_FILE
#> Output file. If none given, output will be written to
#> stdout.
#> -n NAME, --name NAME Name/Label describing the heat scores.
#> --snv_file SNV_FILE Path to a tab-separated file containing SNVs where the
#> first column of each line is a sample ID and
#> subsequent columns contain the names of genes with
#> SNVs in that sample. Lines starting with "#" will be
#> ignored.
#> --cna_file CNA_FILE Path to a tab-separated file containing CNAs where the
#> first column of each line is a sample ID and
#> subsequent columns contain gene names followed by
#> "(A)" or "(D)" indicating an amplification or deletion
#> in that gene for the sample. Lines starting with "#"
#> will be ignored.
#> --sample_file SAMPLE_FILE
#> File listing samples. Any SNVs or CNAs in samples not
#> listed in this file will be ignored. If HotNet is run
#> with mutation permutation testing, all samples in this
#> file will be eligible for random mutations even if the
#> sample did not have any mutations in the real data. If
#> not provided, the set of samples is assumed to be all
#> samples that are provided in the SNV or CNA data.
#> --sample_type_file SAMPLE_TYPE_FILE
#> File listing type (e.g. cancer, datasets, etc.) of
#> samples (see --sample_file). Each line is a space-
#> separated row listing one sample and its type. The
#> sample types are used for creating the HotNet(2) web
#> output.
#> --gene_file GENE_FILE
#> File listing tested genes. SNVs or CNAs in genes not
#> listed in this file will be ignored. If HotNet is run
#> with mutation permutation testing, every gene in this
#> file will be eligible for random mutations even if the
#> gene did not have mutations in any samples in the
#> original data. If not provided, the set of tested
#> genes is assumed to be all genes that have mutations
#> in either the SNV or CNA data.
#> --min_freq MIN_FREQ The minimum number of samples in which a gene must
#> have an SNV to be considered mutated in the heat score
#> calculation.
#> --cna_filter_threshold CNA_FILTER_THRESHOLD
#> Proportion of CNAs in a gene across samples that must
#> share the same CNA type in order for the CNAs to be
#> included. This must either be > .5, or the default,
#> None, in which case all CNAs will be included.
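The SNV layout described in the help text (sample ID first, then gene names, `#` lines ignored) can be sanity-checked with a few lines of Python. The sketch below only counts how many samples carry an SNV in each gene and applies a `min_freq`-style cutoff. It is not HotNet2's heat calculation, and the function name and sample rows are made up for illustration.

```python
import io
from collections import defaultdict

def snv_sample_counts(handle, min_freq=1):
    """Count the samples with an SNV in each gene, dropping genes below min_freq."""
    samples_by_gene = defaultdict(set)
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # comment or blank line
        fields = line.rstrip("\n").split("\t")
        sample, genes = fields[0], fields[1:]
        for gene in genes:
            if gene:
                samples_by_gene[gene].add(sample)
    return {gene: len(samples)
            for gene, samples in samples_by_gene.items()
            if len(samples) >= min_freq}

# Made-up SNV rows for illustration
snv_text = "# sample\tgenes\nS1\tgeneA\tgeneB\nS2\tgeneA\n"
counts = snv_sample_counts(io.StringIO(snv_text))
filtered = snv_sample_counts(io.StringIO(snv_text), min_freq=2)
print(counts)
print(filtered)
```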
The next step is to prepare the real and permuted network files using makeNetworkFiles.py. Note that this step can take a very long time with a real PPI network on a personal computer. The example data only takes a few seconds to run, though.
Below I show the help information for the makeNetworkFiles.py script.
python makeNetworkFiles.py --help
#> usage: makeNetworkFiles.py [-h] -e EDGELIST_FILE -i GENE_INDEX_FILE -nn
#> NETWORK_NAME -p PREFIX [-is INDEX_FILE_START_INDEX]
#> -b BETA [-op] [-q Q] [-ps PERMUTATION_START_INDEX]
#> [-np NUM_PERMUTATIONS] -o OUTPUT_DIR [-c CORES]
#>
#> Create the personalized pagerank matrix and 100 permuted PPR matrices for
#> thegiven network and restart probability beta.
#>
#> optional arguments:
#> -h, --help show this help message and exit
#> -e EDGELIST_FILE, --edgelist_file EDGELIST_FILE
#> Path to TSV file listing edges of the interaction
#> network, whereeach row contains the indices of two
#> genes that are connected in thenetwork.
#> -i GENE_INDEX_FILE, --gene_index_file GENE_INDEX_FILE
#> Path to tab-separated file containing an index in the
#> first columnand the name of the gene represented at
#> that index in the secondcolumn of each line.
#> -nn NETWORK_NAME, --network_name NETWORK_NAME
#> Name of network.
#> -p PREFIX, --prefix PREFIX
#> Output prefix.
#> -is INDEX_FILE_START_INDEX, --index_file_start_index INDEX_FILE_START_INDEX
#> Minimum index in the index file.
#> -b BETA, --beta BETA Beta is the restart probability for the insulated heat
#> diffusion process.
#> -op, --only_permutations
#> Only permutations, i.e., do not generate influence
#> matrix forobserved data. Useful for generating
#> permuted network files onmultiple machines.
#> -q Q, --Q Q Edge swap constant. The script will attempt Q*|E| edge
#> swaps
#> -ps PERMUTATION_START_INDEX, --permutation_start_index PERMUTATION_START_INDEX
#> Index at which to start permutation file names.
#> -np NUM_PERMUTATIONS, --num_permutations NUM_PERMUTATIONS
#> Number of permuted networks to create.
#> -o OUTPUT_DIR, --output_dir OUTPUT_DIR
#> Output directory.
#> -c CORES, --cores CORES
#> Use given number of cores. Pass -1 to use all
#> available.
I then used it on the two provided edge lists, “example/example_edgelist.txt” and “example/example_edgelist2.txt”. The output for each is saved to a different directory. Note that I just used 0.5 for beta without trying different values. There is a method for programmatically choosing a value for the \(\beta\) of the RWR (described in the paper’s Supplementary), though a high-precision value does not seem to be too important (judging from personal experience and the authors’ comments in the Supplementary).
python makeNetworkFiles.py \
--edgelist_file example/example_edgelist.txt \
--gene_index_file example/example_gene_index.txt \
--network_name network1 \
--prefix network1 \
--beta 0.5 \
--num_permutations 100 \
--output_dir example_run/network1 \
--cores 1
#> Creating PPR matrix for real network
#> --------------------------------------
#>
#> Creating edge lists for permuted networks
#> -------------------------------------------
#> * Loading edge list..
#> - 31 edges among 25 nodes.
#> - No. swaps to attempt = 3565.0
#> * Creating permuted networks...
#> 100/100
#> * Avg. No. Swaps Made: 1842
#>
#> Creating PPR matrices for permuted networks
#> ---------------------------------------------
#> 100/100%
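For intuition about what the restart probability beta controls, here is a small pure-Python sketch of a random walk with restart on a toy graph. It iterates h ← βe + (1 − β)Wh, where W spreads each node's heat evenly over its neighbours. My understanding is that this matches the diffusion described in the HotNet2 paper, but treat that as an assumption. It is not the code that makeNetworkFiles.py runs, and the graph and function name are made up.

```python
def rwr_heat(adj, beta, source, iters=200):
    """Approximate the stationary heat of a random walk with restart from `source`."""
    nodes = sorted(adj)
    # All restart mass sits on the source node
    restart = {n: (1.0 if n == source else 0.0) for n in nodes}
    heat = dict(restart)
    for _ in range(iters):
        # With probability beta, the walker jumps back to the source
        nxt = {n: beta * restart[n] for n in nodes}
        # Otherwise it moves to a uniformly chosen neighbour
        for u in nodes:
            share = (1.0 - beta) * heat[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        heat = nxt
    return heat

# Toy path graph a - b - c (made up for illustration)
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
heat = rwr_heat(adj, beta=0.5, source="a")
print(heat)
```

A larger beta keeps more of the heat on and near the source node, while a smaller beta lets it spread further through the network.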
python makeNetworkFiles.py \
--edgelist_file example/example_edgelist2.txt \
--gene_index_file example/example_gene_index.txt \
--network_name network2 \
--prefix network2 \
--beta 0.5 \
--num_permutations 100 \
--output_dir example_run/network2 \
--cores 1
#> Creating PPR matrix for real network
#> --------------------------------------
#>
#> Creating edge lists for permuted networks
#> -------------------------------------------
#> * Loading edge list..
#> - 30 edges among 25 nodes.
#> - No. swaps to attempt = 3450.0
#> * Creating permuted networks...
#> 100/100
#> * Avg. No. Swaps Made: 1820
#>
#> Creating PPR matrices for permuted networks
#> ---------------------------------------------
#> 100/100%
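The help text says the script attempts Q*|E| edge swaps, so the "No. swaps to attempt" lines in the two outputs above can be used to back out the value of Q that was in effect. The sketch below does that arithmetic with the numbers printed above. The value it recovers is inferred from this output only, not taken from the HotNet2 documentation.

```python
def infer_q(swaps_attempted, num_edges):
    """Back out the edge swap constant Q from swaps_attempted = Q * |E|."""
    return swaps_attempted / num_edges

q1 = infer_q(3565.0, 31)  # network1: 31 edges
q2 = infer_q(3450.0, 30)  # network2: 30 edges
print(q1, q2)  # 115.0 115.0

# Fraction of attempted swaps that were actually made, on average
made1 = 1842 / 3565.0
made2 = 1820 / 3450.0
print(round(made1, 3), round(made2, 3))
```

Both networks give the same Q, and in both cases roughly half of the attempted swaps were made.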
Here is what the file structure for the first network looks like.
Running HotNet2 is done using the HotNet2.py script. Below is the help information.
python HotNet2.py --help
#> usage: HotNet2.py [-h] -nf [NETWORK_FILES [NETWORK_FILES ...]] -pnp
#> [PERMUTED_NETWORK_PATHS [PERMUTED_NETWORK_PATHS ...]] -hf
#> [HEAT_FILES [HEAT_FILES ...]] [-ccs MIN_CC_SIZE]
#> [-d [DELTAS [DELTAS ...]]] [-np NETWORK_PERMUTATIONS]
#> [-cp CONSENSUS_PERMUTATIONS] [-hp HEAT_PERMUTATIONS] -o
#> OUTPUT_DIRECTORY [-c NUM_CORES] [-dsf DISPLAY_SCORE_FILE]
#> [-dnf DISPLAY_NAME_FILE] [--output_hierarchy]
#> [--verbose {0,1,2,3,4}]
#>
#> Helper script for simple runs of generalized HotNet2, including
#> automatedparameter selection.
#>
#> optional arguments:
#> -h, --help show this help message and exit
#> -nf [NETWORK_FILES [NETWORK_FILES ...]], --network_files [NETWORK_FILES [NETWORK_FILES ...]]
#> Path to HDF5 (.h5) file containing influence matrix
#> and edge list.
#> -pnp [PERMUTED_NETWORK_PATHS [PERMUTED_NETWORK_PATHS ...]], --permuted_network_paths [PERMUTED_NETWORK_PATHS [PERMUTED_NETWORK_PATHS ...]]
#> Path to influence matrices for permuted networks, one
#> path per network file. Include ##NUM## in the path to
#> be replaced with the iteration number
#> -hf [HEAT_FILES [HEAT_FILES ...]], --heat_files [HEAT_FILES [HEAT_FILES ...]]
#> Path to heat file containing gene names and scores.
#> This can eitherbe a JSON file created by
#> generateHeat.py, in which case the filename must end
#> in .json, or a tab-separated file containing a
#> genename in the first column and the heat score for
#> that gene in thesecond column of each line.
#> -ccs MIN_CC_SIZE, --min_cc_size MIN_CC_SIZE
#> Minimum size connected components that should be
#> returned.
#> -d [DELTAS [DELTAS ...]], --deltas [DELTAS [DELTAS ...]]
#> Delta value(s).
#> -np NETWORK_PERMUTATIONS, --network_permutations NETWORK_PERMUTATIONS
#> Number of permutations to be used for delta parameter
#> selection.
#> -cp CONSENSUS_PERMUTATIONS, --consensus_permutations CONSENSUS_PERMUTATIONS
#> Number of permutations to be used for consensus
#> statistical significance testing.
#> -hp HEAT_PERMUTATIONS, --heat_permutations HEAT_PERMUTATIONS
#> Number of permutations to be used for statistical
#> significance testing.
#> -o OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
#> Output directory. Files results.json, components.txt,
#> andsignificance.txt will be generated in
#> subdirectories for each delta.
#> -c NUM_CORES, --num_cores NUM_CORES
#> Number of cores to use for running permutation tests
#> in parallel. If-1, all available cores will be used.
#> -dsf DISPLAY_SCORE_FILE, --display_score_file DISPLAY_SCORE_FILE
#> Path to a tab-separated file containing a gene name in
#> the firstcolumn and the display score for that gene in
#> the second column ofeach line.
#> -dnf DISPLAY_NAME_FILE, --display_name_file DISPLAY_NAME_FILE
#> Path to a tab-separated file containing a gene name in
#> the firstcolumn and the display name for that gene in
#> the second column ofeach line.
#> --output_hierarchy Output the hierarchical decomposition of the HotNet2
#> similarity matrix.
#> --verbose {0,1,2,3,4}
#> Set verbosity of output (minimum: 0, maximum: 5).
The first run of HotNet2 that I show in this vignette just runs one heat file on one network. This is the simplest use-case for HotNet2. It is likely to be satisfactory for most users.
python HotNet2.py \
--network_files example_run/network1/network1_ppr_0.5.h5 \
--permuted_network_paths example_run/network1/permuted/network1_ppr_0.5_##NUM##.h5 \
--heat_files example_run/heatfiles/mutations_heatfile.json \
--output_directory example_run/output_simple \
--num_cores 1
#> /Users/admin/Documents/Python/HotNet2_example/hotnet2/hotnet2/hnio.py:402: H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.
#> dictionary = {key:f[key].value for key in f}
#> * Running HotNet2 in consensus mode...
#> - network1 mutationsheat
#> * Outputting results to file...
#> * Generating and outputting visualization data...
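The `##NUM##` placeholder in `--permuted_network_paths` is replaced with the permutation number, according to the help text. A small sketch like the one below can expand the template into the file names HotNet2 should find, which is handy for checking that every permuted file exists before a long run. The function name is hypothetical, and the start index of 1 is an assumption; it should match whatever `--permutation_start_index` was used with makeNetworkFiles.py.

```python
def expand_permuted_paths(template, num_permutations, start_index=1):
    """Replace ##NUM## in the template with each permutation index in turn."""
    return [template.replace("##NUM##", str(i))
            for i in range(start_index, start_index + num_permutations)]

template = "example_run/network1/permuted/network1_ppr_0.5_##NUM##.h5"
paths = expand_permuted_paths(template, num_permutations=100)
print(paths[0])
print(paths[-1])
print(len(paths))
```

Passing each path to `os.path.exists` then shows whether any permuted network file is missing.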
The second example run, shown below, runs two heat files on two networks. HotNet2 then creates a “consensus” output for the two networks.
python HotNet2.py \
--network_files example_run/network1/network1_ppr_0.5.h5 \
example_run/network2/network2_ppr_0.5.h5 \
--permuted_network_paths example_run/network1/permuted/network1_ppr_0.5_##NUM##.h5 \
example_run/network2/permuted/network2_ppr_0.5_##NUM##.h5 \
--heat_files example_run/heatfiles/mutations_heatfile.json \
example_run/heatfiles/scores_heatfile.json \
--output_directory example_run/output_consensus \
--num_cores 1
#> /Users/admin/Documents/Python/HotNet2_example/hotnet2/hotnet2/hnio.py:402: H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead.
#> dictionary = {key:f[key].value for key in f}
#> * Running HotNet2 in consensus mode...
#> - network1 mutationsheat
#> - network1 scoresheat
#> - network2 mutationsheat
#> - network2 scoresheat
#> * Outputting results to file...
#> * Generating and outputting visualization data...
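As the log above shows, HotNet2 ran every network against every heat file, giving four runs from two networks and two heat files. The sketch below reproduces those pairings and builds labels in the `network-heat` form that also appears in the output directory names. The names are copied from the log, and the labelling is an observation from this run rather than a documented rule.

```python
from itertools import product

networks = ["network1", "network2"]
heats = ["mutationsheat", "scoresheat"]

# One run per (network, heat file) pair, labelled like the output directories
runs = ["%s-%s" % (network, heat) for network, heat in product(networks, heats)]
print(runs)
```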
Here is a diagram showing the file structure for the simple output.
tree example_run/output_simple
#> example_run/output_simple
#> ├── consensus
#> │ ├── stats.tsv
#> │ ├── subnetworks.json
#> │ ├── subnetworks.tsv
#> │ └── viz-data.json
#> └── network1-mutationsheat
#> ├── delta_0.007907676044851542
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> ├── delta_1.024861412588507e-05
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> └── viz-data.json
#>
#> 4 directories, 11 files
The output for the run with multiple networks is similar in structure but contains more files, with one subdirectory for each pairing of a network and a heat file.
tree example_run/output_consensus
#> example_run/output_consensus
#> ├── consensus
#> │ ├── stats.tsv
#> │ ├── subnetworks.json
#> │ ├── subnetworks.tsv
#> │ └── viz-data.json
#> ├── network1-mutationsheat
#> │ ├── delta_0.007907676044851542
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_1.024861412588507e-05
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ └── viz-data.json
#> ├── network1-scoresheat
#> │ ├── delta_0.03640584065578878
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_0.04944752808660269
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_0.052240293473005295
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_0.10117589682340622
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ └── viz-data.json
#> ├── network2-mutationsheat
#> │ ├── delta_0.005027415044605733
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ ├── delta_5.711058292945381e-06
#> │ │ ├── components.txt
#> │ │ ├── results.json
#> │ │ └── significance.txt
#> │ └── viz-data.json
#> └── network2-scoresheat
#> ├── delta_0.051236389204859734
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> ├── delta_0.05293082073330879
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> ├── delta_0.05503008887171745
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> ├── delta_0.10069134272634983
#> │ ├── components.txt
#> │ ├── results.json
#> │ └── significance.txt
#> └── viz-data.json
#>
#> 17 directories, 44 files
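Since each run stores its results in `delta_<value>` subdirectories, a short script can collect the delta values per run for later analysis. The sketch below assumes only the directory layout shown in the trees above. The function name `list_deltas` is hypothetical, and the demo builds a tiny throwaway tree rather than reading real HotNet2 output.

```python
import os
import tempfile

def list_deltas(output_dir):
    """Map each run directory to the sorted delta values found inside it."""
    deltas = {}
    for run in sorted(os.listdir(output_dir)):
        run_path = os.path.join(output_dir, run)
        if not os.path.isdir(run_path):
            continue
        values = [float(name[len("delta_"):])
                  for name in os.listdir(run_path)
                  if name.startswith("delta_")
                  and os.path.isdir(os.path.join(run_path, name))]
        if values:  # the consensus directory has no delta subdirectories
            deltas[run] = sorted(values)
    return deltas

# Build a miniature copy of the layout in a temporary directory
demo_root = tempfile.mkdtemp()
for sub in ["consensus",
            "network1-mutationsheat/delta_0.007907676044851542",
            "network1-mutationsheat/delta_1.024861412588507e-05"]:
    os.makedirs(os.path.join(demo_root, sub))

found = list_deltas(demo_root)
print(found)
```

Pointing `list_deltas` at “example_run/output_consensus” should return one entry for each network and heat-file pair.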