Comparing Edge Swap Algorithms • wext

library(wext)
library(tidygraph)
library(tibble)
library(dplyr)
library(bench)

This is just a quick comparison of the edge swapping algorithms used in this package.

The first step is to create a bipartite graph for the test.

v1 <- sample(seq(2, 1e3, 2), 1e4, replace = TRUE)
v2 <- sample(seq(1, 1e3+1, 2), 1e4, replace = TRUE)
bgr <- tibble(node1 = as.character(v1),
              node2 = as.character(v2)) %>% 
  unique() %>%
  as_tbl_graph(directed = FALSE) %N>% 
  mutate(type = name %in% unique(as.character(v1)))
bgr
#> # A tbl_graph: 1001 nodes and 9799 edges
#> #
#> # A bipartite simple graph with 1 component
#> #
#> # Node Data: 1,001 x 2 (active)
#>   name  type 
#>   <chr> <lgl>
#> 1 898   TRUE 
#> 2 266   TRUE 
#> 3 374   TRUE 
#> 4 574   TRUE 
#> 5 910   TRUE 
#> 6 202   TRUE 
#> # … with 995 more rows
#> #
#> # Edge Data: 9,799 x 2
#>    from    to
#>   <int> <int>
#> 1     1   501
#> 2     2   502
#> 3     3   503
#> # … with 9,796 more rows

This graph has 1001 nodes and 9799 edges.

The first bipartite edge swapping method is bipartite_edge_swap() which relies on the ‘igraph’ and ‘tidygraph’ APIs for handling all of the data and logically manipulates a tidygraph graph object. This was the easiest implementation, though I expected that the manipulation of high-level data types would be a time-suck. This first test runs the edge swapping algorithm to swap 100 edges 20 times.

m1 <- bench::mark(bipartite_edge_swap(bgr, N = 100),
            iterations = 20, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
m1
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1    6.82s    7.38s     0.136    1.96GB     9.44

The second edge swapping algorithm implementation is in bipartite_edge_swap2(). This method took more work on my end, though its use of vector manipulations rapidly increased the speed of the algorithm. (As it is much faster, I now run 50 iterations for the benchmark.)

m2 <- bench::mark(bipartite_edge_swap2(bgr, N = 100),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
m2
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1    506ms    625ms      1.54     227MB     11.7

This first test ran 100 edge swaps on the graph bgr. The first method took 7.38 s on average (a median of 20 iterations) while the second took 625 ms on average. Thus, the second method is much faster. Another striking difference is the amount of memory each method was allocated - the first method took about 2 GB of RAM! It also called the garbage collector an order of magnitude more frequently.

The function bipartite_edge_swap3() implements the same algorithm used in method 2 in C++.

m2_C <- bench::mark(bipartite_edge_swap3(bgr, N = 100),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
m2_C
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1    403ms    457ms      2.18     182MB     4.18

This is the fastest method, requiring 457 ms on average with even less RAM required.

To specifically test everything expect for the edge swapping, I re-ran the same test but only conducting 1 edge swap in each iteration.

# first method
bench::mark(bipartite_edge_swap(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1   52.9ms   80.9ms      10.9    21.9MB     2.85
# second method
bench::mark(bipartite_edge_swap2(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1    404ms    549ms      1.77     160MB     3.46
# second method in C++
bench::mark(bipartite_edge_swap3(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1    411ms    569ms      1.76     160MB     3.59

Interestingly, the first method (based around the ‘igraph’ and ‘tidygraph’ APIs) is substantially faster to do all of the other steps in the edge swapping process. This is likely because the second method must first transform the bipartite graph bgr into the list of vectors while the first method works directly on the tidygraph object (it only does this once per call to bipartite_edge_swap2). Still, it may be worth poking around on the setup steps in the second method to improve its efficiency.

To further test this hypothesis, the difference should shrink if I use a smaller bipartite graph.

v1 <- sample(seq(2, 50, 2), 100, replace = TRUE)
v2 <- sample(seq(1, 51, 2), 100, replace = TRUE)
bgr <- tibble(node1 = as.character(v1),
              node2 = as.character(v2)) %>% 
  unique() %>%
  as_tbl_graph(directed = FALSE) %N>% 
  mutate(type = name %in% unique(as.character(v1)))
bgr
#> # A tbl_graph: 51 nodes and 95 edges
#> #
#> # A bipartite simple graph with 1 component
#> #
#> # Node Data: 51 x 2 (active)
#>   name  type 
#>   <chr> <lgl>
#> 1 10    TRUE 
#> 2 30    TRUE 
#> 3 40    TRUE 
#> 4 38    TRUE 
#> 5 32    TRUE 
#> 6 2     TRUE 
#> # … with 45 more rows
#> #
#> # Edge Data: 95 x 2
#>    from    to
#>   <int> <int>
#> 1     1    26
#> 2     2    27
#> 3     3    28
#> # … with 92 more rows

This smaller graph has 51 nodes and 95 edges. I now re-run the test with only 1 edge swap.

# first method
bench::mark(bipartite_edge_swap(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1   20.8ms   29.1ms      22.5     324KB     5.39
# second method
bench::mark(bipartite_edge_swap2(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1   4.48ms   6.22ms      105.     202KB     4.20
# second method in C++
bench::mark(bipartite_edge_swap3(bgr, N = 1),
            iterations = 50, filter_gc = FALSE) %>% 
  dplyr::select(-expression)
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1   4.67ms   5.19ms      167.     202KB     9.99

Now, the second method is actually faster because it was easier to change to the vector manipulation than lug around the relatively “heavy” tidygraph object (see the mem_alloc columns). You can see that the garbage collection intervened a few times (in the gc/sec and n_gc column).

For comparison, here is the same test using the heavily optimized edge swapping algorithm in ‘igraph’. While fast, this test does not preserve the bipartite nature of the graph, thus I cannot use it for ‘WExT’.

bench::mark(
  igraph::rewire(bgr, igraph::keeping_degseq(loops = FALSE, niter = 100)),
  iterations = 30, filter_gc = FALSE
) %>% 
  dplyr::select(-expression)
#> # A tibble: 1 x 5
#>        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1    174µs    180µs     5259.    44.2KB        0