The data used for this example is the same data set used in the “simple” example from the Raphael group’s GitHub repository. The original file format was one line per sample, each followed by a list of the genes mutated in that sample. I have already converted it into a tidy format in “data-raw/simple_dataset.R”, as this is likely the format the user will start with.
simple_dataset
#> # A tibble: 36 x 2
#>    sample_name mutated_gene
#>    <chr>       <chr>
#>  1 1           a
#>  2 1           c
#>  3 1           d
#>  4 1           e
#>  5 1           f
#>  6 2           b
#>  7 2           c
#>  8 2           d
#>  9 2           g
#> 10 2           h
#> # … with 26 more rows
In this vignette, I show how WExT works by running each step separately. The “Simple Example (the easy way)” vignette explains how a typical user would run the analysis.
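The row-exclusivity weight matrix \(W_R\) is simply defined as
\[ \tag{1} W_R = \frac{1}{|\Omega_R|} \sum_{B \in \Omega_R}{B} \]
but since every row of \(B \in \Omega_R\) can be considered independently, this simplifies to
\[ \tag{2} W_R = \left[ w_{ij} = \frac{r_i}{n} \right] \]
where \(r_i\) is the number of samples in which gene \(i\) is mutated and \(n\) is the total number of samples.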
Thus, we compute the weights for the data:
W_R <- calculate_row_exclusivity_weights(dat = simple_dataset,
                                         sample_col = sample_name,
                                         mutgene_col = mutated_gene)
W_R
#> # A tibble: 112 x 3
#>    sample_name mutated_gene row_ex_weights
#>    <chr>       <chr>                 <dbl>
#>  1 1           a                    0.429
#>  2 1           b                    0.429
#>  3 1           c                    0.643
#>  4 1           d                    0.643
#>  5 1           e                    0.143
#>  6 1           f                    0.0714
#>  7 1           g                    0.143
#>  8 1           h                    0.0714
#>  9 10          a                    0.429
#> 10 10          b                    0.429
#> # … with 102 more rows
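Equation \((2)\) is simple enough to check by hand. The following is a minimal base R sketch of that arithmetic, not the package's implementation. The data frame `toy` and the names `gene_weights` and `W_R_toy` are hypothetical, made up for illustration; the sketch assumes each sample-gene pair appears at most once and that every sample has at least one mutation, so that \(n\) can be counted from the data.

```r
# Hypothetical toy data in the same tidy layout as simple_dataset:
# one row per (sample, mutated gene) pair.
toy <- data.frame(
  sample_name  = c("1", "1", "2", "2", "3", "4"),
  mutated_gene = c("a", "b", "a", "c", "a", "b"),
  stringsAsFactors = FALSE
)

# n: total number of samples (assumes every sample appears in the data).
n <- length(unique(toy$sample_name))

# r_i: number of samples in which gene i is mutated.
r <- table(toy$mutated_gene)

# Equation (2): the weight of gene i is r_i / n, identical for every sample.
gene_weights <- as.numeric(r) / n
names(gene_weights) <- names(r)
gene_weights
# a = 0.75, b = 0.50, c = 0.25

# Expand to one row per sample-gene pair, mirroring the layout of W_R.
W_R_toy <- expand.grid(
  sample_name  = sort(unique(toy$sample_name)),
  mutated_gene = names(r),
  stringsAsFactors = FALSE
)
W_R_toy$row_ex_weights <- unname(gene_weights[W_R_toy$mutated_gene])
```

The same arithmetic accounts for the output above: a weight of 0.429 for gene `a` corresponds to \(r_a / n = 6 / 14\), and the 112 rows of `W_R` correspond to 14 samples times 8 genes.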
We still need to calculate equation \((1)\) for the row-column case, but there is no closed-form formula for \(W_{RC}\) as equation \((2)\) provides for \(W_R\). Thus, the authors generate an empirical weight matrix \(W^{N}_{RC}\) by drawing a set of \(N\) matrices, \(\Omega^{N}_{RC}\), from \(\Omega_{RC}\) and calculating equation \((1)\) over those matrices. The algorithm used is explained in the “Creating the Row-Column-Exclusivity Null Distribution” vignette. The process is implemented in the function calculate_row_col_exclusivity_weights().
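The averaging step of this empirical estimate can be illustrated on a tiny case. The sketch below is only an illustration and says nothing about how calculate_row_col_exclusivity_weights() works internally. It assumes \(\Omega_{RC}\) is the set of binary matrices that share the observed row and column sums; the list `draws` and the name `W_RC_hat` are hypothetical, and the two matrices are written out by hand instead of being sampled.

```r
# Hypothetical stand-ins for N = 2 draws from Omega_RC.
# Both 2 x 2 matrices have row sums (1, 1) and column sums (1, 1).
draws <- list(
  matrix(c(1, 0,
           0, 1), nrow = 2, byrow = TRUE),
  matrix(c(0, 1,
           1, 0), nrow = 2, byrow = TRUE)
)

# Equation (1) applied to the drawn matrices: the element-wise mean.
W_RC_hat <- Reduce(`+`, draws) / length(draws)
W_RC_hat
# every entry is 0.5
```

For these margins the two matrices happen to be the only members of \(\Omega_{RC}\), so the average here is exact. For realistic data the set is far too large to enumerate, which is why a sample of \(N\) matrices is used.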