Commit 284c3cff authored by Cameron

Merge branch 'doc/updated-benchmarking' into 'development'

Doc/updated benchmarking

See merge request !73
parents 410260ce 9c3dd640
Pipeline #4249 passed with stage in 13 seconds
SPN3402 ERR052325 PAIRED
SPN4683 ERR271793 PAIRED
SPN22809 ERR052456 PAIRED
SPN2906 ERR052331 PAIRED
LMG1250 ERR052038 PAIRED
LMG1435 ERR052043 PAIRED
S-044 ERR051898 PAIRED
S-112 ERR051900 PAIRED
SPN8332 ERR271792 PAIRED
SN26575 ERR052511 PAIRED
SPN13522 ERR052440 PAIRED
SPN13151 ERR052439 PAIRED
SPN13633 ERR052441 PAIRED
SN33007 ERR055616 PAIRED
SPN4488 ERR052465 PAIRED
SN34677 ERR055617 PAIRED
SPN22664 ERR052452 PAIRED
SPN4901 ERR052467 PAIRED
SPN4876 ERR052466 PAIRED
SPN4000 ERR052464 PAIRED
0111+21110 ERR107537 PAIRED
0412+17021 ERR107563 PAIRED
spnIC6 ERR028599 PAIRED
9701+10122 ERR107494 PAIRED
0105+29173 ERR107533 PAIRED
9901+13020 ERR107512 PAIRED
0108+03050 ERR107535 PAIRED
spnIC176 ERR028755 PAIRED
spnIC192 ERR028748 PAIRED
spnIC187 ERR028757 PAIRED
9605+07141 ERR107487 PAIRED
9608+02116 ERR107490 PAIRED
0106+11154 ERR107534 PAIRED
9909+21117 ERR107518 PAIRED
spnIC463 ERR028771 PAIRED
0207+15090 ERR107543 PAIRED
9612+28032 ERR107493 PAIRED
0211+13275 ERR107546 PAIRED
spnIC55 ERR028742 PAIRED
0506+29167 ERR107568 PAIRED
9609+24019 ERR107491 PAIRED
9707+14101 ERR107499 PAIRED
9812+11187 ERR107511 PAIRED
0407+30157 ERR107561 PAIRED
0202+01134 ERR107539 PAIRED
0502+25231 ERR107565 PAIRED
9602+15128 ERR107485 PAIRED
0501+31089 ERR107564 PAIRED
9706+10167 ERR107498 PAIRED
spnIC43 ERR028737 PAIRED
spnIC58 ERR028740 PAIRED
spnIC41 ERR028594 PAIRED
spnIC54 ERR028743 PAIRED
9711+04135 ERR107501 PAIRED
0509+09128 ERR107571 PAIRED
9807+09096 ERR107507 PAIRED
spnIC116 ERR028592 PAIRED
0012+25031 ERR107530 PAIRED
0203+19212 ERR107540 PAIRED
0107+17127 ERR107572 PAIRED
0102+01171 ERR107532 PAIRED
0101+26271 ERR107531 PAIRED
9712+30001 ERR107502 PAIRED
spnIC52 ERR028744 PAIRED
spnIC59 ERR028739 PAIRED
spnIC51 ERR028745 PAIRED
spnIC42 ERR028590 PAIRED
spnIC10 ERR028598 PAIRED
spnIC48 ERR028736 PAIRED
spnIC49 ERR028735 PAIRED
spnIC57 ERR028741 PAIRED
spnIC38 ERR028595 PAIRED
spnIC19 ERR028597 PAIRED
spnIC2 ERR028600 PAIRED
spnIC28 ERR028596 PAIRED
spnIC104 ERR028591 PAIRED
spnIC210 ERR028764 PAIRED
spnIC174 ERR028754 PAIRED
spnIC178 ERR028756 PAIRED
spnIC197 ERR028750 PAIRED
spnIC195 ERR028749 PAIRED
spnIC190 ERR028758 PAIRED
spnIC203 ERR028760 PAIRED
spnIC434 ERR028770 PAIRED
spnIC432 ERR028769 PAIRED
spnIC426 ERR028768 PAIRED
spnIC425 ERR028767 PAIRED
spnIC419 ERR028766 PAIRED
spnIC139 ERR028747 PAIRED
spnIC161 ERR028753 PAIRED
spnIC100 ERR028601 PAIRED
0003+13150 ERR107523 PAIRED
9905+21046 ERR107514 PAIRED
9702+03174 ERR107495 PAIRED
9511+06158 ERR107482 PAIRED
9912+17151 ERR107520 PAIRED
0005+24045 ERR107524 PAIRED
0503+03063 ERR107566 PAIRED
0505+04001 ERR107567 PAIRED
0507+08068 ERR107569 PAIRED
0011+09030 ERR107529 PAIRED
0409+02133 ERR107562 PAIRED
0112+27129 ERR107538 PAIRED
0303+13263 ERR107550 PAIRED
0212+05159 ERR107547 PAIRED
0302+20271 ERR107549 PAIRED
0209+24059 ERR107545 PAIRED
0508+26122 ERR107570 PAIRED
9811+05053 ERR107510 PAIRED
9808+20015 ERR107508 PAIRED
9803+09178 ERR107505 PAIRED
9708+26128 ERR107500 PAIRED
9607+10105 ERR107489 PAIRED
9908+13028 ERR107517 PAIRED
9806+08086 ERR107506 PAIRED
9809+25147 ERR107509 PAIRED
0009+01095 ERR107528 PAIRED
9801+23240 ERR107503 PAIRED
9907+14172 ERR107516 PAIRED
0206+21189 ERR107542 PAIRED
0301+23540 ERR107548 PAIRED
9601+30166 ERR107484 PAIRED
9703+24150 ERR107496 PAIRED
9705+05087 ERR107497 PAIRED
9603+12093 ERR107486 PAIRED
9906+15160 ERR107515 PAIRED
0109+28217 ERR107536 PAIRED
9512+07034 ERR107483 PAIRED
0006+20010 ERR107525 PAIRED
9903+09147 ERR107513 PAIRED
0205+05018 ERR107541 PAIRED
9802+26112 ERR107504 PAIRED
9611+04103 ERR107492 PAIRED
0001+05003 ERR107521 PAIRED
9911+18139 ERR107519 PAIRED
0208+13020 ERR107544 PAIRED
spnIC141 ERR028751 PAIRED
spnIC145 ERR028752 PAIRED
K01_071280 ERR029279 PAIRED
K01_071218 ERR029286 PAIRED
K01_071205 ERR029285 PAIRED
K13_0820 ERR029287 PAIRED
K13_0940 ERR029284 PAIRED
K13_082 ERR029289 PAIRED
K13_0827 ERR029282 PAIRED
K13_0913 ERR029283 PAIRED
K13_0810 ERR029278 PAIRED
LMG2888 ERR051966 PAIRED
LMG2926 ERR271773 PAIRED
LMG2230 ERR051955 PAIRED
LMG3367 ERR051968 PAIRED
LMG2311 ERR051965 PAIRED
LMG2290 ERR051960 PAIRED
LMG2302 ERR051963 PAIRED
PT1430 ERR052395 PAIRED
PT582 ERR052396 PAIRED
DCC1738 ERR052391 PAIRED
DCC2613 ERR052400 PAIRED
PT3104 ERR052399 PAIRED
PT2236 ERR052398 PAIRED
DCC2623 ERR052401 PAIRED
PT3536 ERR052394 PAIRED
PT1175 ERR052393 PAIRED
PT3026 ERR052402 PAIRED
DCC1524 ERR052390 PAIRED
DCC1902 ERR052392 PAIRED
spnSP681 ERR028738 PAIRED
spnSP522 ERR028734 PAIRED
SN11927 ERR052506 PAIRED
SPN11926 ERR052438 PAIRED
SN11917 ERR052505 PAIRED
LMG2050 ERR051950 PAIRED
LMG2062 ERR051951 PAIRED
DG_25 ERR051889 PAIRED
BS_06 ERR271785 PAIRED
DG_23 ERR051888 PAIRED
LMG1417 ERR051938 PAIRED
1014-00 ERR051829 PAIRED
1422-00 ERR051833 PAIRED
LMG1978 ERR051948 PAIRED
6575-07 ERR051863 PAIRED
#!/usr/bin/env perl
# Given a list of files in format (name, SRA run, layout), downloads and renames the runs.
# Downloads files to [fastq-dir] using [cores] instances of fastq-dump in parallel.
use warnings;
use strict;
use Parallel::ForkManager;
use FindBin;

my $script_dir = $FindBin::Bin;

my $usage = "Usage: $0 [fastq-dir] [cores] < files.txt";
if (@ARGV != 2) {
    die "$usage\n";
}

my $outdir = $ARGV[0];
my $cores  = $ARGV[1];

my $pm = Parallel::ForkManager->new($cores);
while (my $line = <STDIN>) {
    # Fork one worker per input line, up to $cores at a time
    $pm->start and next;

    chomp $line;
    my @parts = split(/\t/, $line);
    my $files_dir = "$script_dir/$outdir";
    my $name    = $parts[0];
    my $sra_run = $parts[1];
    my $layout  = $parts[2];

    chdir $files_dir;
    my $command = "fastq-dump -F --defline-qual + -A $name ";
    $command .= ($layout eq 'PAIRED') ? '--split-files' : '';
    $command .= " $sra_run";
    print $command."\n";

    if (system($command) != 0) {
        print "Failed for $name\n";
    } else {
        print "Success for $name\n";
    }

    $pm->finish;
}
$pm->wait_all_children;
# Benchmarking Methods
This document describes the methods used to gather the information for benchmarking SNVPhyl. In all cases, SNVPhyl version 1.0.1 was run with default parameters (minimum coverage = 10, minimum mean mapping quality = 30, relative SNV abundance = 0.75, SNV density filtering = 2 SNVs in a 500 bp window).
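The SNV density filter referenced above can be illustrated with a small sketch. This is a hypothetical illustration of the idea only, not SNVPhyl's actual implementation: SNVs are flagged for removal when more than a threshold number fall within any window of the given size.

```python
# Hypothetical sketch of an SNV density filter: flag SNV positions when
# more than `max_snvs` of them fall within any `window` bp region.
# Illustration only -- this is not SNVPhyl's actual implementation.
def density_filtered(positions, window=500, max_snvs=2):
    positions = sorted(positions)
    flagged = set()
    for i, p in enumerate(positions):
        # SNVs within `window` bp starting at position p
        cluster = [q for q in positions[i:] if q < p + window]
        if len(cluster) > max_snvs:
            flagged.update(cluster)
    return sorted(flagged)

# Three SNVs at 100, 200, 300 fall within one 500 bp window (3 > 2),
# so all three are flagged; the isolated SNV at 5000 is kept.
print(density_filtered([100, 200, 300, 5000]))  # -> [100, 200, 300]
```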
## Docker only
The SNVPhyl Docker container was run using:

    docker run -d -p 48888:80 phacnml/snvphyl-galaxy-1.0.1:1.0.1b
The following describes how each field was obtained, mostly from the Docker cgroups filesystem.
1. Max RSS

   Record the `total_rss` value from the following file:

        cat /sys/fs/cgroup/memory/docker/[docker id]/memory.stat
2. Max Mem.

   Record the value of:

        cat /sys/fs/cgroup/memory/docker/[docker id]/memory.max_usage_in_bytes
3. Disk Space

   Obtained by running `du -sh /var/lib/docker/` before and after running the container and recording the difference.
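The `memory.stat` file read above is a plain whitespace-separated key/value listing, so pulling `total_rss` out of it takes only a few lines. The sketch below uses invented sample contents, not output from a real container:

```python
# Extract the total_rss counter (bytes) from a cgroup-v1 memory.stat dump.
# The sample text is invented for illustration.
sample = """cache 1048576
rss 524288
total_cache 2097152
total_rss 694157312
"""

def total_rss_bytes(stat_text):
    for line in stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "total_rss":
            return int(value)
    raise KeyError("total_rss not found")

print(total_rss_bytes(sample))  # -> 694157312
```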
## Other cases
For the other cases, the script `` was used to run SNVPhyl (via the SNVPhyl CLI) and record resource information. The following describes how each field was obtained.
1. Runtime

   The value recorded in `total_seconds` in the file `runSettings.txt`.
2. Max RSS

   `` samples the file `/sys/fs/cgroup/memory/docker/[docker id]/memory.stat` every 5 seconds for as long as SNVPhyl runs. The maximum `total_rss` value observed over the run is recorded after SNVPhyl completes.
3. Max Memory

   The value of `/sys/fs/cgroup/memory/docker/[docker id]/memory.max_usage_in_bytes` after SNVPhyl completes.
4. Disk Space

   The command `du -sh /var/lib/docker/` was run before the SNVPhyl Docker container was launched and again after SNVPhyl completed, and the difference was taken.
For the **Simulated data**, **Density filter**, and **_S._ Heidelberg** cases, the data was taken from the [SNVPhyl manuscript][] (with **_S._ Heidelberg** run using the downsampled dataset, so that the minimum-coverage genome has a mean coverage of 30X).
For the **189 _S. pneumoniae_** genomes, the genomes were obtained from <> (a table of the genomes can be found at [1-pneumo.tsv](1-pneumo.tsv)). The reference genome `CP002176` (670-6B) was used.
## Gubbins analysis
To construct a fasta alignment of polymorphic and monomorphic sites, the `snvTable.tsv` and reference genome `CP002176` were run through the Galaxy tool **Positions to SNV invariant alignment** (provided with SNVPhyl), setting **Keep all positions** to **Yes**.
The following commands were run on the alignment to replace the ambiguous bases `R` and `K` with `N`:

    sed -i -e 's/TK/TN/' alignment-all-positions.fasta
    sed -i -e 's/TR/TN/' alignment-all-positions.fasta
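Note that `sed` substitutions without the `/g` flag replace only the first match on each line. A replace-all equivalent of the masking step, shown here as a Python sketch over an invented mini-alignment:

```python
# Replace-all equivalent of the sed substitutions: every TK or TR pair is
# masked to TN, not just the first match per line.
# The mini-alignment below is invented for illustration.
alignment = ">Reference\nATKGTRACTKA\n"

masked = alignment.replace("TK", "TN").replace("TR", "TN")
print(masked)  # -> >Reference\nATNGTNACTNA
```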
The modified alignment was run through Gubbins with ` alignment.fasta`. The `Reference` label in the produced phylogenetic tree was renamed to `670-6B` for consistency with Microreact.
#!/bin/bash
# Runs SNVPhyl in Docker while recording disk and memory usage from cgroups.
if [ $# -ne 4 ]; then
    echo "Usage: $0 [fastq_dir] [reference] [output_dir] [snvphyl_log]"
    exit 1
fi

fastq_dir=$1
reference=$2
output_dir=$3
snvphyl_log=$4

echo "Disk before docker - `date`"
du -sh /var/lib/docker/

echo "Start docker"
docker_id=`docker run -d -p 48888:80 -v $fastq_dir:$fastq_dir phacnml/snvphyl-galaxy-1.0.1:1.0.1b | tr -d '\n'`

echo -n "Waiting 90s for docker to start..."
sleep 90
echo "started"

echo "Disk before SNVPhyl - `date`"
du -sh /var/lib/docker/

echo "Memory before SNVPhyl - `date`"
cat /sys/fs/cgroup/memory/docker/$docker_id/memory.stat

# Launch SNVPhyl in the background, then sample memory while it runs --galaxy-url http://localhost:48888 --galaxy-api-key admin --fastq-dir $fastq_dir --reference-file $reference --fastq-files-as-links --output-dir $output_dir > $snvphyl_log &

while [ "`pgrep -f`" != "" ]; do
    echo "Memory during SNVPhyl - `date`"
    cat /sys/fs/cgroup/memory/docker/$docker_id/memory.stat
    sleep 5
done

echo "Peak memory usage after SNVPhyl - `date`"
cat /sys/fs/cgroup/memory/docker/$docker_id/memory.max_usage_in_bytes

echo "Disk usage after SNVPhyl - `date`"
du -sh /var/lib/docker/

docker rm -f -v $docker_id
# Benchmarking
A number of datasets have been used to benchmark the runtime, memory, and disk usage of SNVPhyl across a range of scenarios using the Docker version of SNVPhyl on a 16-core machine. The results are presented in the table below.
## Manuscript datasets
| Case | # Genomes | Total read size <br/> (GB) | Runtime <br/> (hrs) | Max Mem. (RSS)<br/> (GB) | Max Mem. (All)<br/> (GB) | Disk Space <br/> (GB) |
|------|-----------|----------------------------|---------------------|--------------------------|--------------------------|-----------------------|
| Docker only | - | - | - | 0.662 | 1.15 | 2.4 |
| Simulated data | 4 | 1.4 | 0.261 | 3.04 | 9.90 | 6.8 |
| *S.* Heidelberg | 59 | 40 | 3.04 | 4.07 | 21.4 | 66.6 |
| Density filter | 11 | 13 | 0.439 | 4.18 | 14.1 | 9.6 |
| *S. pneumoniae* | 189 | 169 | 8.23 | 12.4 | 21.7 | 136 |
The **Docker only** case represents the resource usage of the snvphyl-galaxy Docker image alone, without any data. The next three cases are data analyzed in the [SNVPhyl manuscript][]. The **Simulated data** case was run using a set of simulated reads through SNVPhyl, based on *E. coli* str. Sakai (NC_002695) and two plasmids (NC_002128 and NC_002127). The **SNV density filtering** case was run using a set of 11 *Streptococcus pneumoniae* genomes through SNVPhyl. The **_Salmonella_ Heidelberg** case was run using a set of 59 *Salmonella* Heidelberg genomes. The final case was not analyzed in the SNVPhyl manuscript, but consists of a set of 189 *Streptococcus pneumoniae* genomes analyzed in <>. Additional details on this dataset are provided below.
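As a rough illustrative reading of the table, throughput in GB of reads processed per hour can be computed directly from the runtime and read-size columns (the figures below are taken from the table; the calculation itself is just arithmetic, not part of the benchmark):

```python
# Throughput (GB of reads per hour) for each benchmark case, using the
# runtime and read-size figures from the table above.
cases = {
    "Simulated data": (1.4, 0.261),
    "Density filter": (13, 0.439),
    "S. Heidelberg": (40, 3.04),
    "S. pneumoniae": (169, 8.23),
}

for name, (size_gb, runtime_hr) in cases.items():
    print(f"{name}: {size_gb / runtime_hr:.1f} GB/hr")
```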
All datasets were run using the default SNVPhyl parameters on a 16-core Intel Xeon CPU (W5590) @ 3.33 GHz with 24 GB of RAM. Additional details on the methods used to run SNVPhyl for each case can be found [here][methods].
For the **SNV density filtering** case, the reported runtime was recorded with no SNV density filtering applied; for the **_Salmonella_ Heidelberg** case, it corresponds to a minimum coverage threshold of 10X with all other parameters at their default values.
## Details on *Streptococcus pneumoniae* case
More details on the methods can be found in the [SNVPhyl manuscript][] or in the [snvphyl-validations][] GitHub project.
In addition to running the 189 *Streptococcus pneumoniae* genomes using Docker, we also ran the dataset on a 2000-core cluster, to compare the runtime and give a rough estimate of scalability (runtime on the cluster is highly variable depending on usage). We also provide a comparison of the produced phylogenetic tree to the one available from [Microreact][] for this same dataset.
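The Docker-to-cluster speedup implied by the runtimes reported below is straightforward to compute (keeping in mind that the cluster figure varies with load):

```python
# Rough Docker-vs-cluster speedup for the 189-genome run, using the
# runtimes from the comparison table (cluster time varies with usage).
docker_hr, cluster_hr = 8.23, 2.22
print(f"speedup: {docker_hr / cluster_hr:.1f}x")  # -> speedup: 3.7x
```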
As an additional comparison to the default tree produced by SNVPhyl, we extracted all SNVs from the `snvTable.tsv` file (including SNVs in regions with low coverage in one or more genomes, or in repetitive regions) to construct an alignment with both polymorphic and monomorphic sites. This alignment was then run through [Gubbins][] (with default parameters). This produced the tree shown for the case labeled **Gubbins (all positions detected by SNVPhyl)**.
In both cases, the SNVPhyl tree is tree 1 on the left while the tree available from Microreact - <> is tree 2 on the right.
| Case | SNVs used | % core | Docker runtime <br/> (hrs) | Cluster runtime <br/> (hrs) | Phylogenetic tree comparison |
|------|-----------|--------|----------------------------|-----------------------------|------------------------------|
| Default <br/> (2 SNVs in 500 bp) | 1111 | 36.81 | 8.23 | 2.22 | [Comparison][1-tree-2-500] |
| Gubbins\* <br/> (all positions detected by SNVPhyl) | - | - | - | - | [Comparison][1-tree-gubbins] |
[docker version of SNVPhyl]: ../install/docker
[SNVPhyl manuscript]: