Commit dd7ff138 authored by Aaron Petkau's avatar Aaron Petkau

Finished most of the methods

parent 2827b28a
# 1. Variable recombination dynamics during the emergence, transmission and ‘disarming’ of a multidrug-resistant pneumococcal clone
From <https://www.ncbi.nlm.nih.gov/pubmed/24957517>.
Reference genome <https://www.ncbi.nlm.nih.gov/nuccore/CP002176>.
189 Illumina reads in `1-pneumo.tsv`.
```
perl download_sra_runs.pl fastq-1-pneumo 12 < 1-pneumo.tsv
# Rename _3.fastq to _2.fastq
prename 's/_3.fastq/_2.fastq/' *_3.fastq
# Estimate coverages
(for i in fastq-1-pneumo/*_1.fastq; do name=`basename $i _1.fastq`; forward=`sed -n 2~4p fastq-1-pneumo/${name}_1.fastq|tr -d '\n'|wc -c`; reverse=`sed -n 2~4p fastq-1-pneumo/${name}_2.fastq|tr -d '\n'|wc -c`; ref=`bp_seq_length references/CP002176.fasta | cut -d ' ' -f 2| tr -d '\n'`; cov=`echo "($forward+$reverse)/$ref"|bc -l`; echo -e "$name\t$forward\t$reverse\t$ref\t$cov"; done) | sort -k 5,5n | tee coverages.txt
# Run SNVPhyl
## Case 2 SNVs in 500 bp
### Docker
snvphyl.py --deploy-docker --fastq-dir fastq-1-pneumo/ --reference-file references/CP002176.fasta --output-dir output-1-pneumo-docker-3 --min-coverage 20
### Cluster
snvphyl.py --galaxy-url [URL] --galaxy-api-key [KEY] --fastq-history-name fastq-1-pneumo --reference-file references/CP002176.fasta --output-dir output-1-pneumo-waffles-no-filter --min-coverage 20
## Case no filtering
### Cluster
snvphyl.py --galaxy-url [URL] --galaxy-api-key [KEY] --fastq-history-name fastq-1-pneumo --reference-file references/CP002176.fasta --output-dir output-1-pneumo-waffles-no-filter --min-coverage 20 --filter-density-window 5 --filter-density-threshold 10
# Swap 'Reference' for '670-6B' for comparison with microreact tree
sed -i -e 's/reference/670-6B /' [tree]
```
# Benchmarking Methods
The below describes the methods used to gather the information used for benchmarking SNVPhyl. In all cases, SNVPhyl was run with version 1.0.1, and using default parameters (minimum coverage = 10, min mean mapping = 30, relative snv abundance = 0.75, SNV filtering = 2 SNVs in a 500 bp window).
## Docker only
The SNVPhyl Docker container was run using:
```
docker run -d -p 48888:80 phacnml/snvphyl-galaxy-1.0.1:1.0.1b
```
The following presentes how each field was obtained, mostly from the Docker [cgroups](https://docs.docker.com/engine/admin/runmetrics/).
1. Max RSS
Record `total_rss` from the following file.
```
cat /sys/fs/cgroup/memory/docker/[docker id]/memory.stat
```
2. Max Mem.
```
cat /sys/fs/cgroup/memory/docker/[docker id]/memory.max_usage_in_bytes
```
3. Disk Space
This was obtained by using `du -sh /var/lib/docker/` before and after running the container and recording the difference.
## Other cases
For the other cases, the script [snvphyl-stats.sh](snvphyl-stats.sh) was used to run SNVPhyl (via the [SNVPhyl CLI](https://github.com/phac-nml/snvphyl-galaxy-cli)) and record information. The following describes how each field was obtained.
1. Runtime
The value is the one recorded in `total_seconds` from the file `runSettings.txt.
2. Max RSS
The value is sampled in `snvphyl-stats.sh` every 10 seconds from the file `cat /sys/fs/cgroup/memory/docker/[docker id]/memory.max_usage_in_bytes` for the time SNVPhyl runs. The maximum value (of `total_rss`) after SNVPhyl completes is recorded.
3. Max Memory
The value from `cat /sys/fs/cgroup/memory/docker/[docker id]/memory.max_usage_in_bytes`.
4. Disk Space
The command `du -sh /var/lib/docker/` was run before the SNVPhyl docker container was launched and after SNVPhyl completes and the difference was taken.
For the **Simulated data**, **Density filter**, and **_S._ Heidelberg** the data was taken from the [SNVPhyl Manuscript](http://biorxiv.org/content/early/2016/12/10/092940) (with **_S._ Heidelberg** being run using the downsampled dataset so the minimum genome has a mean coverage of 30X).
For the **189 _S. pneumoniae_** genomes, the genomes were obtained from <http://www.ncbi.nlm.nih.gov/pubmed/24957517> (a table of the genomes can be found at [1-pneumo.tsv](1-pneumo.tsv)). The reference genome `CP002176` (670-6B) was used.
## Gubbins analysis
To construct a fasta alignment of polymorphic and monomorphic sites, the `snvTable.tsv` and reference genome `CP002176` were run through the Galaxy tool **Positions to SNV invariant alignment** (provided with SNVPhyl), setting **Keep all positions** to **Yes**.
The following commands were run on the alignment to replace ambiguous bases `R` and `K` with `N`.
```
sed -i -e 's/TK/TN/' alignment-all-positions.fasta
sed -i -e 's/TR/TN/' alignment-all-positions.fasta
```
The modified alignment was run through Gubbins with `run_gubbins.py alignment.fasta`. The `Reference` label in the produced phylogenetic tree was renamed to `670-6B ` to be consistent with Microreact.
......@@ -16,8 +16,8 @@ du -sh /var/lib/docker/
echo "Start docker"
docker_id=`docker run -d -p 48888:80 -v $fastq_dir:$fastq_dir phacnml/snvphyl-galaxy-1.0.1:1.0.1b | tr -d '\n'`
echo -n "Waiting 60s for docker to start..."
sleep 60
echo -n "Waiting 90s for docker to start..."
sleep 90
echo "started"
echo "Disk before SNVPhyl - `date`"
......
# Benchmarking
A number of datasets have been used to benchmark the runtime of SNVPhyl across a range of scenarios. These are as follows.
A number of datasets have been used to benchmark the runtime, memory, and disk usage of SNVPhyl across a range of scenarios using the Docker version of SNVPhyl on a 16-core machine. The results are presented in the table below.
## Manuscript datasets
| Case | # Genomes | Read size <br/> (GB) | Runtime <br/> (hrs) | Max RSS <br/> (GB) | Max Mem. <br/> (GB) | Disk Space <br/> (GB) |
|:----------------|:---------:|:--------------------:|:-------------------:|:------------------:|:-------------------:|:---------------------:|
| Docker only | - | - | - | 0.662 | 1.15 | 2.4 |
| Simulated data | 4 | 1.4 | 0.261 | 3.04 | 9.90 | 6.8 |
| Density filter | 11 | 13 | 0.439 | 4.18 | 14.1 | 9.6 |
| *S.* Heidelberg | 59 | 40 | 3.04 | 4.07 | 21.4 | 66.6 |
| *S. pneumoniae* | 189 | 169 | 8.23 | 12.4 | 21.7 | 136 |
The datasets from the [SNVPhyl manuscript][] were run on a single machine using the Docker version of the pipeline. The following table presents the run times (to go from sequence reads to a phylogeny) and data sizes of each case.
The **Docker only** case represents the resource useage of the `snvphyl-galaxy` Docker image alone, without any data. The next three cases are data analyzed in the [SNVPhyl manuscript][]. The **Simulated data** case was run using a set of simulated reads through SNVPhyl, based off of *E. coli* str. Sakai (NC_002695) and two plasmids (NC_002128 and NC_002127). The **SNV density filtering** case was run using a set of 11 *Streptococcus pneumoniae* genomes through SNVPhyl. The **_Salmonella_ Heidelberg** case was run using a set of 59 *Salmonella* Heidelberg genomes. The final case was not analyzed in the SNVPhyl manuscript, but consists of a set of 189 *Streptococcus pneumoniae* genomes analyzed in <http://www.ncbi.nlm.nih.gov/pubmed/24957517>. Additional details on this dataset are provided below.
| Case | Number of genomes | Total size of reads (GB) | Runtime (hours) | Peak RSS (GB) | Peak Memory (GB) | Temporary Disk Space (GB) |
|:--------------------------:|:-----------------:|:------------------------:|:---------------:|:-------------:|:----------------:|:-------------------------:|
| Docker no data | - | - | - | 0.662 | | 2.4 |
| Simulated data | 4 | 1.4 | 0.261 | 3.04 | 9.90 | 6.8 |
| SNV density filtering | 11 | 13 | 0.439 | 4.18 | 14.1 | 9.6 |
| *Salmonella* Heidelberg | 59 | 40 | 3.04 | 4.07 | 21.4 | 66.6 |
| *Streptococcus pneumoniae* | 189 | 169 | 8.04 |
All datasets were run using the default SNVPhyl parameters on a 16-core Intel Xeon CPU (W5590) @ 3.33 GHz with 24 GB of RAM. Additional details on the methods used to run SNVPhyl for each case can be found [here][methods].
## 189 *Streptococcus pneumoniae* genomes: details
The **Simulated data** case was run using a set of simulated reads through SNVPhyl, based off of *E. coli* str. Sakai (NC_002695) and two plasmids (NC_002128 and NC_002127). The other two cases were run with real-world data. The **SNV density filtering** case was run using a set of 11 *Streptococcus pneumoniae* genomes through SNVPhyl, in particular the runtime presented was recorded when no SNV density filtering was applied. The **_Salmonella_ Heidelberg** case was run using a set of 59 *Salmonella* Heidelberg genomes, and in particular the runtime presented corresponds to the case of using a minimum coverage threshold of 10X while keeping all other parameters at default values.
In addition to running the 189 *Streptococcus pneumoniae* genomes using Docker, we also ran the dataset on a 2000-core cluster for comparison of the runtimes. We also provide a comparison of the produced phylogenetic tree to the one available from [Microreact][] for this same dataset <https://microreact.org/project/N1TRn11L> (the SNVPhyl tree is tree 1 on the left, Microreact tree is tree 2 on the right).
## 189 *Streptococcus pneumoniae* genomes
For this scenario, we ran 189 *Streptococcus pneumoniae* genomes (169 GB of reads), published in <http://www.ncbi.nlm.nih.gov/pubmed/24957517>, under a number of different parameter settings. We also provide a comparison of the SNVPhyl-produced phylogenetic tree with the tree available on [Microreact][] for this same dataset <https://microreact.org/project/N1TRn11L>.
| Case | SNVs used | Docker runtime (hours) | Cluster runtime (hours) | Phylogenetic tree comparison |
|:------------------------------:|:---------:|:----------------------:|:-----------------------:|:----------------------------:|
| SNVPhyl (2 SNVs in 500 bp) | 800 | 8.04 | 2.33 | [Comparison][1-tree-2-500] |
| SNVPhyl (no density filtering) | 20,185 | 9.28 | 5.09 | [Comparison][1-tree-nf] |
We also extract out the SNVs identified by SNVPhyl from the `snvTable.tsv` file to an alignment with both polymorphic and monomorphic SNVs, which was then ran through Gubbins (with default parameters). We note that this gives phylogenetic trees most closely matching the tree available on Microreact.
| Case | Gubbins runtime (minutes) | Phylogenetic tree comparison |
|:--------------------------------:|:-------------------------:|:----------------------------:|
| SNVs without density filtering | 25.8 | [Comparison][1-tree-gubbins] |
| Case | SNVs used | % core | Docker runtime <br/> (hrs) | Cluster runtime <br/> (hrs) | Phylogenetic tree comparison |
|:--------------------------------:|:---------:|:------:|:--------------------------:|:---------------------------:|:----------------------------:|
| Default <br/> (2 SNVs in 500 bp) | 1111 | 36.81 | 8.23 | 2.33\* | [Comparison][1-tree-2-500] |
We note that this try is not highly resolved when compared to the tree available on Microreact. We suspect this is due to the default filtering criteria (removal of SNVs in any region with at least 2 SNVs in a 500 bp window). For comparison, we extracted all SNVs from the `snvTable.tsv` file (including SNVs in regions with low coverage in one or more genomes, or in repetitive regions) to construct an alignment with both polymorphic and monomorphic SNVs. This alignment was then run through [Gubbins][] (with default parameters). This took approximately x minutes and produced the following [phylogenetic tree][1-tree-gubbins] (Gubbins tree is tree 1 on the left, Microreact tree is tree 2 on the right). We note that this gives a phylogenetic tree that is a bit more resolved and a closer match to the Microreact tree.
[docker version of SNVPhyl]: ../install/docker
[SNVPhyl manuscript]: http://biorxiv.org/content/early/2016/12/10/092940
[snvphyl-validations]: https://github.com/apetkau/snvphyl-validations
[Microreact]: https://microreact.org
[1-tree-2-500]: http://phylo.io/#2e656069ebe8e2ec69374ef75dde8cf7%23bac6e0fc58be1fea772e304682eb34f1
[1-tree-nf]: http://phylo.io/#56eb4dfaed14ca3d086b31c65365c52c%230ea3c6aba4c8d3151f3a8b7a47b163a5
[Gubbins]: https://sanger-pathogens.github.io/gubbins/
[methods]: benchmarking-methods/methods.md
[1-tree-2-500]: http://phylo.io/#db3f0e933657efbab5b732f19c3b3276%23e5f863ba72a3551780bdebc610e87dd1
[1-tree-gubbins]: http://phylo.io/#e6ade76551bfb717c90f5b0e870478cd%23ff42dc3ffc873cd87684d9d44a5afabc
......@@ -22,6 +22,7 @@ pages:
- Evaluation:
- 'ASM NGS Challenge 2015': 'evaluation/asm-ngs-2015.md'
- 'Benchmarking': 'evaluation/benchmarking.md'
- 'Benchmarking Methods': 'evaluation/benchmarking-methods/methods.md'
- Developer:
- 'Upgrading SNVPhyl Toolshed' : 'developer/upgrading.md'
- Citation/Contact: 'citation.md'
......
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment