Commit f6f0dc88 authored by Aaron Petkau's avatar Aaron Petkau

Merge branch 'development'

parents c9dd87db 681cf353
Assembly and Annotation
This workflow uses the software [SPAdes][] and [Prokka][] for assembly and annotation of genomes as well as a few tools for filtering of data and generating assembly statistics. The specific Galaxy tools are listed in the table below.
| Tool Name | Tool Revision | Toolshed Installable Revision | Toolshed |
| **flash** | 4287dd541327 | 0 (2015-05-05) | [IRIDA Toolshed][] |
| **filter_spades_repeats** | f9fc830fa47c | 0 (2015-05-05) | [IRIDA Toolshed][] |
| **assemblystats** | 51b76a5d78a5 | 1 (2015-05-07) | [IRIDA Toolshed][] |
| **spades** | 21734680d921 | 14 (2015-02-27) | [Galaxy Main Shed][] |
| **prokka** | 3ad7ef0ba385 | 6 (2014-10-27) | [Galaxy Main Shed][] |
| **regex_find_replace** | 9ea374bb0350 | 0 (2014-03-29) | [Galaxy Main Shed][] |
To install these tools please proceed through the following steps.
## Step 1: Install Dependencies
Some of these tools require additional dependencies to be installed. For a cluster environment please make sure these are available on all cluster nodes by installing to a shared directory.
1. [Java][]: Please download and install [Java][] version 1.6+ or make sure it is available in your execution environment.
2. [gnuplot][]: Please download and install [gnuplot][] or make sure this is available in your execution environment.
2. **Perl Modules**: Please download and install dependency Perl modules with the command.
cpanm Time::Piece XML::Simple Data::Dumper
In addition, [BioPerl][] version 1.6.901 must be installed. Please run the following command to install.
## Step 2: Install Galaxy Tools
Please install all the Galaxy tools in the table above by logging into Galaxy, navigating to **Admin > Search and browse tool sheds**, searching for the appropriate **Tool Name** and installing the appropriate **Toolshed Installable Revision**.
The install progress can be checked by monitoring the Galaxy log file `$GALAXY_BASE_DIR/main.log`. On completion you should see a message of `Installed` next to the tool when going to **Admin > Manage installed tool shed repositories**.
**Note**: Prokka downloads several large databases and may take some time to install.
## Step 3: Testing Pipeline
A Galaxy workflow and some test data has been included with this documentation to verify that all tools are installed correctly. To test this pipeline, please proceed through the following steps.
1. Upload the [Assembly Annotation Galaxy Workflow][] by going to **Workflow > Upload or import workflow**.
2. Upload the sequence reads by going to **Analyze Data** and then clicking on the **upload files from disk** icon ![upload-icon][]. Select the [test/reads][] files. Make sure to change the **Type** of each file from **Auto-detect** to **fastqsanger**. When uploaded you should see the following in your history.
3. Construct a dataset collection of the paired-end reads by clicking the **Operations on multiple datasets** icon ![datasets-icon][]. Please check off the two **.fastq** files and then go to **For all selected... > Build List of dataset pairs**. You should see a screen that looks as follows.
4. This should have properly paired your data and named the sample **a**. Enter the name of this paired dataset collection at the bottom and click **Create list**.
5. Run the uploaded workflow by clicking on **Workflow**, clicking on the name of the workflow **FLASH, SPAdes and Prokka (imported from uploaded file)** and clicking **Run**. This should auto fill in the dataset collection. At the very bottom of the screen click **Run workflow**.
6. If everything was installed correctly, you should see each of the tools run successfully (turn green). On completion this should look like.
If you see any tool turn red, you can click on the view details icon ![view-details-icon][] for more information.
If everything was successfull then all dependencies for this pipeline have been properly installed.
[Galaxy Main Shed]:
[IRIDA Toolshed]:
[Assembly Annotation Galaxy Workflow]: workflows/AssemblyAnnotation/0.2/
[upload-icon]: test/images/upload-icon.jpg
[test/reads]: test/reads
[upload-history]: test/images/upload-history.jpg
[datasets-icon]: test/images/datasets-icon.jpg
[dataset-pair-screen]: test/images/dataset-pair-screen.jpg
[workflow-success]: test/images/workflow-success.png
[view-details-icon]: test/images/view-details-icon.jpg
......@@ -2,5 +2,8 @@
This repository contains the [Galaxy][] assembly and annotation workflow which is used for [IRIDA][] along with any customized Galaxy tools.
For information on how to install this workflow, please refer to the [Installation][] documentation.
ROOT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
echo "Builds tool and workflow tarballs under $BUILD_DIR"
echo "These can then be uploaded to a Galaxy Tool Shed."
if [ ! -e $BUILD_DIR ]
mkdir $BUILD_DIR
# build all tools under $TOOLS_DIR
for i in $TOOLS_DIR/*
name=`basename $i`
tar -C $TOOLS_DIR/$name -czf $BUILD_DIR/$name.tar.gz .
echo "Successfully built tarballs"
ls -l $BUILD_DIR/*.tar.gz
echo "Please upload these into a Galaxy Tool Shed"
This diff is collapsed.
This diff is collapsed.
#Created 07/01/2011
#Konrad Paszkiewicz, University of Exeter
Assembly stats
This series of scripts calculates various metrics on an input FASTA file. Typically this is most useful on either denovo genomic or transcriptomic data.
1. The bundled perl script
Ideally this should output a composite dataset of some sort
#!/usr/bin/env python
#Version 1.01 - bugs kindly corrected by Jan van Haarst
import pkg_resources
import logging, os, string, sys, tempfile, glob, shutil, types, urllib
import shlex, subprocess
from optparse import OptionParser, OptionGroup
from stat import *
log = logging.getLogger( __name__ )
assert sys.version_info[:2] >= ( 2, 4 )
def stop_err( msg ):
sys.stderr.write( "%s\n" % msg )
def __main__():
#Parse Command Line
s = ' argv = %s\n' % (sys.argv)
argcnt = len(sys.argv)
html_file = sys.argv[1]
working_dir = sys.argv[2]
type = sys.argv[3]
bucket = sys.argv[4]
input = sys.argv[5]
stats = sys.argv[6]
sortedcontigs = sys.argv[7]
histogrampng = sys.argv[8]
summedcontigspng = sys.argv[9]
histogramdata = sys.argv[10]
summedcontigdata = sys.argv[11]
try: # for test - needs this done
except Exception, e:
stop_err( 'Error running ' + str( e ) )
cmdline = '%s/ -i %s -t %s %s -o %s > /dev/null' % (os.path.dirname(sys.argv[0]),input, type, bucket, working_dir)
proc = subprocess.Popen( args=cmdline, shell=True, stderr=subprocess.PIPE )
returncode = proc.wait()
# get stderr, allowing for case where it's very large
stderr = ''
buffsize = 1048576
while True:
stderr += buffsize )
if not stderr or len( stderr ) % buffsize != 0:
except OverflowError:
if returncode != 0:
raise Exception, stderr
except Exception, e:
stop_err( 'Error running ' + str( e ) )
stats_path = os.path.join(working_dir,'stats.txt')
sorted_contigs_path = os.path.join(working_dir,'sorted_contigs.fa')
histogram_png_path = os.path.join(working_dir,'histogram_bins.dat.png')
summed_contigs_path = os.path.join(working_dir,'summed_contig_lengths.dat.png')
histogram_data_path = os.path.join(working_dir,'histogram_bins.dat')
summed_contigs_data_path = os.path.join(working_dir,'summed_contig_lengths.dat')
out = open(stats,'w')
for line in open( stats_path ):
out.write( "%s" % (line) )
out = open(sortedcontigs,'w')
for line in open(sorted_contigs_path ):
out.write( "%s" % (line) )
out = open(histogrampng,'w')
for line in open(histogram_png_path ):
out.write( "%s" % (line) )
out = open(summedcontigspng,'w')
for line in open(summed_contigs_path ):
out.write( "%s" % (line) )
out = open(histogramdata,'w')
for line in open(histogram_data_path ):
out.write( "%s" % (line) )
out = open(summedcontigdata,'w')
for line in open(summed_contigs_data_path ):
out.write( "%s" % (line) )
# rval = ['<html><head><title>Assembly stats Galaxy Composite Dataset </title></head><p/>']
# rval.append('<div>%s<p/></div>' % (cmdline) )
# rval.append('<div>This composite dataset is composed of the following files:<p/><ul>')
# rval.append( '<li><a href="%s" type="text/plain">%s </a>%s</li>' % (stats_path,'stats.txt','stats.txt' ) )
# rval.append( '<li><a href="%s" type="text/plain">%s </a>%s</li>' % (sorted_contigs_path,'sorted_contigs.fa','sorted_contigs.fa' ) )
# rval.append( '<li><a href="%s" type="image/png">%s </a>%s</li>' % (histogram_png_path,'histogram_bins.dat.png','histogram_bins.dat.png' ) )
# rval.append( '<li><a href="%s" type="image/png">%s </a>%s</li>' % (summed_contigs_path,'summed_contig_lengths.dat.png','summed_contig_lengths.dat.png' ) )
# rval.append( '<li><a href="%s" type="text/plain">%s </a>%s</li>' % (histogram_data_path,'histogram_bins.dat','histogram_bins.dat' ) )
# rval.append( '<li><a href="%s" type="text/plain">%s </a>%s</li>' % (summed_contigs_data_path,'summed_contig_lengths.dat','summed_contig_lengths.dat' ) )
# rval.append( '</ul></div></html>' )
# f = file(html_file,'w')
# f.write("\n".join( rval ))
# f.write('\n')
# f.close()
if __name__ == "__main__": __main__()
<tool id="assemblystats" name="assemblystats" version="1.0.2">
<description>Summarise an assembly (e.g. N50 metrics)</description>
<command interpreter="python">
'$input_type' '$stats.extra_files_path'
<param help="Is this from an genomic (contig) or transcriptomic assembly (isotig) or are these raw reads (read)" label="Type of read" name="input_type" type="select">
<option selected="yes" value="contig">Contig (if from genomic assembly)</option>
<option value="isotig">Isotig (if from transcriptomic assembly)</option>
<option value="read">Raw reads from sequencer in FASTA format (useful for 454 data)</option>
<param falsevalue="" help="Use this to specify whether or not bin sizes of 1 should be used when plotting histograms" label="Output histogram with bin sizes=1" name="bucket" truevalue="-b" type="boolean" />
<param format="fasta" label="Source file in FASTA format" name="input" type="data" />
<param checked="false" help="If checked, all output files will be displayed. If not checked, only the file 'Assembly Statistics' will be provided." label="Return all output files" name="all_outputs" type="boolean" />
<data format="tabular" label="Assembly statistics - $input.display_name" name="stats" />
<data format="fasta" label="Sorted contigs - $input.display_name" name="sortedcontigs">
<filter>all_outputs is True</filter>
<data format="png" label="Histogram of contig sizes - $input.display_name" name="histogrampng">
<filter>all_outputs is True</filter>
<data format="png" label="Cumulative sum of contig sizes - $input.display_name" name="summedcontigspng">
<filter>all_outputs is True</filter>
<data format="tabular" label="Histogram data - $input.display_name" name="histogramdata">
<filter>all_outputs is True</filter>
<data format="tabular" label="Cumulative sum of contig size data - $input.display_name" name="summedcontigdata">
<filter>all_outputs is True</filter>
**Summarise assembly overview**
This script is used to give summary statistics of an assembly or set of reads. Typically this is run after an assembly to evaluate gross features.
# Gives back
# - N50
# - num of contigs &gt; 1 kb
# - num of contigs
# - Read or Contig Histogram and graphs.
# - Summed contig length (by number of contigs, in sorted order)
This diff is collapsed.
<tool id="filter_spades_repeat" name="Filter SPAdes repeats" version="1.0.0">
<description>Remove short and repeat contigs/scaffolds</description>
<requirement type="package" version="5.18.1">perl</requirement>
<command interpreter="perl"> -i $fasta_input -t $tab_input -c $cov_cutoff -r $rep_cutoff -l $len_cutoff -o $output_with_repeats -u $output_without_repeats -n $repeat_sequences_only -e $cov_len_cutoff -f $discarded_sequences -s $summary
<param name="fasta_input" type="data" format="fasta" label="Contigs or scaffolds file" help="Contigs/Scaffolds output file from Spades" />
<param name="tab_input" type="data" format="tabular" label="Stats file" help="Enter the corresponding stats file of the fasta file input above" />
<param name="cov_cutoff" type="float" value="0.33" min="0" label="Coverage cut-off ratio" help="This is the average coverage ratio cutoff. For example: if the average coverage is 100 and a coverage cut-off ratio of 0.5 is used, then any contigs with coverage lower than 50 will be eliminated." />
<param name="rep_cutoff" type="float" value="1.75" min="0" label="Repeat cut-off ratio" help="This is the coverage ratio cutoff to determine repeats in contigs. For exmaple: if the average coverage is 100 and a repeat cut-off ratio of 1.75 is used, then any contigs with coverage more than or equal to 175 will be marked as repeats." />
<param name="len_cutoff" type="integer" value="1000" min="0" label="Length cut-off" help="Contigs with length under the chosen cut-off will be eliminated." />
<param name="cov_len_cutoff" type="integer" value="5000" min="0" label="Length for average coverage calculation" help="Only contigs above this length will be used to calculate the average coverage." />
<param name="keep_leftover" type="select" label="Print out a fasta file containing the discarded sequences?">
<option value="yes">Yes</option>
<option value="no">No</option>
<param name="print_summary" type="select" label="Print out a summary of all the results?">
<option value="yes">Yes</option>
<option value="no">No</option>
<data format="fasta" name="output_with_repeats" label="Filtered sequences (with repeats)" />
<data format="fasta" name="output_without_repeats" label="Filtered sequences (no repeats)" />
<data format="fasta" name="repeat_sequences_only" label="Repeat sequences" />
<data format="fasta" name="discarded_sequences" label="Discarded sequences">
<filter>keep_leftover == "yes"</filter>
<data format="txt" name="summary" label="Results summary">
<filter>print_summary == "yes"</filter>
**What it does**
Using the output of SPAdes (a fasta and a stats file, either from contigs or scaffolds), it filters the fasta files, discarding all sequences that are under a given length or under a calculated coverage. Repeated contigs are detected based on coverage.
- **Filtered sequences (with repeats)**
- Will contain the filtered contigs/scaffolds including the repeats. These are the sequences that passed the length and minumum coverage cutoffs.
- For workflows, this output is named **output_with_repeats**
- **Filtered sequences (no repeats)**
- Will contain the filtered contigs/scaffolds excluding the repeats. These are the sequences that passed the length, minimum coverage and repeat cutoffs.
- For workflows, this output is named **output_without_repeats**
- **Repeat sequences**
- Will contain the repeated contigs/scaffolds only. These are the sequences that were exluded for having high coverage (determined by the repeat cutoff).
- For workflows, this output is named **repeat_sequences_only**
- **Discarded sequences**
- If selected, will contain the discarded sequences. These are the sequences that fell below the length and minumum coverage cutoffs, and got discarded.
- For workflows, this output is named **discarded_sequences**
- **Results summary** : If selected, will contain a summary of all the results.
Stats file input:
|#name |length |coverage |
|NODE_1 |2500 |15.5 |
|NODE_2 |102 |3.0 |
|NODE_3 |1300 |50.0 |
|NODE_4 |1000 |2.3 |
|NODE_5 |5000 |14.3 |
|NODE_6 |450 |25.2 |
User Inputs:
- Coverage cut-off ratio = 0.33
- Repeat cut-off ratio = 1.75
- Length cut-off = 500
- Length for average coverage calculation = 1000
**Average coverage will be calculatd based on contigs with length >= 1000bp**
- Average coverage = 15.5 + 50.0 + 2.3 + 14.3 / 4 = 20.5
**Contigs that have coverage in the lower 1/3 of the average coverage will be eliminated.**
- Coverage cut-off = 20.5 * 0.33 = 6.8
**Contigs with high coverage (larger than 1.75 times the average coverage) are considered to be repeated contigs.**
- Repeat cut-off = 20.5 * 1.75 = 35.9
**Number of copies are calculated by dividing the sequence coverage by the average coverage.**
- Number of repeats for NODE_3 = 50 / 20.5 = 2 copies
Output (in fasta format):
**Filtered sequences (with repeats)**
>NODE_3 (2 copies)
**Filtered sequences (no repeats)**
**Repeat sequences**
>NODE_3 (2 copies)
**Discarded sequences**
#!/usr/bin/env perl
use warnings;
use strict;
use Getopt::Long;
use Bio::SeqIO;
use Pod::Usage;
my ($fasta_file, $tab_file, $coverage_co, $length_co, $repeat_co, $out_filtered, $out_repeats, $out_norepeats,$coverage_length_co, $summary_out, $filtered_repeats, $help);
'c|coverage-cutoff=s' => \$coverage_co,
'l|length-cutoff=s' => \$length_co,
'e|coverage-length-cutoff=s' => \$coverage_length_co,
'r|repeat_cutoff=s' => \$repeat_co,
'i|input=s' => \$fasta_file,
't|tab=s' => \$tab_file,
'f|filtered-out=s' => \$out_filtered,
'o|output-repeats=s' => \$out_repeats,
'u|output-norepeats=s' => \$out_norepeats,
'n|filtered-repeats=s' => \$filtered_repeats,
's|summary=s' => \$summary_out,
'h|help' => \$help
pod2usage(-verbose => 2) if ($help);
print "A fasta file is required. Please enter a fasta file using the -i flag.\n" if (!$fasta_file);
print "A spades tabs file is required. Please enter a tabs file using the -t flag\n" if (!$tab_file);
pod2usage(1) unless $fasta_file && $tab_file;
if (!$coverage_co)
$coverage_co = 0.33;
if (!$length_co)
$length_co = 1000;
if (!$coverage_length_co)
$coverage_length_co = 5000;
if (!$repeat_co)
$repeat_co = 1.75;
if (!$out_filtered)
$out_filtered = "Discarded_sequences.fasta";
print "Discarded sequences will be printed out to $out_filtered\n";
if (!$out_repeats)
$out_repeats = "Filtered_sequences_with_repeats.fasta";
print "Filtered sequences with repeats will be printed out to $out_repeats\n";
if (!$out_norepeats)
$out_norepeats = "Filtered_sequences_no_repeats.fasta";
print "Filtered sequences without repeats will be printed out to $out_norepeats\n";
if (!$filtered_repeats)
$filtered_repeats = "Repeat_sequences.fasta";
print "Repeat sequences will be printed out to $filtered_repeats\n";
die ("No tab file specified") unless ($tab_file);
die ("No fasta file specified") unless ($fasta_file);
##Read tab file and discard rows with comments
open TAB, '<', $tab_file or die "Could not open tab file: $?";
open SEQIN, '<', $fasta_file or die "Could not open tab file: $?";
open SEQOUT_REP, '>', $out_repeats or die "Could not open file for writing: $?";
open SEQOUT_NOREP, '>', $out_norepeats or die "Could not open file for writing: $?";
open SEQOUT_FILT, '>', $out_filtered if ($out_filtered);
open SEQOUT_FILT_REP, '>', $filtered_repeats or die "Could not open file for writing: $?";
open SUMMARY, '>', $summary_out if ($summary_out);
my $avg_coverage = 0;
my $num_contigs = 0;
my $cutoff_coverage;
my $cutoff_repeats;
my @stats;
while (<TAB>)
push @stats, $_ unless (/^#/);
#Calculate average coverage.
foreach my $stat(@stats)
my ($length, $coverage);
(undef,$length, $coverage) = split(/\t+/, $stat);
die "length or coverage not defined at $stat\n" unless ($length && ($coverage ne '' && $coverage >= 0));
if ($length >= $coverage_length_co)
$avg_coverage = $avg_coverage + $coverage;
$avg_coverage = $avg_coverage / $num_contigs;
$cutoff_coverage = $avg_coverage * $coverage_co;
$cutoff_repeats = $avg_coverage * $repeat_co;
print SUMMARY "Filter SPAdes repeats Results Summary\n======================================\n\n" if ($summary_out);
print SUMMARY "Paramaters used:\nLength cutoff for calcularing average cutoff: $coverage_length_co\nCoverage cutoff ratio: $coverage_co\nRepeat cutoff ratio: $repeat_co\nLength cutoff: $length_co\n\n" if ($summary_out);
print SUMMARY "Calculations:\nAverage coverage: $avg_coverage\nCoverage cutoff: $cutoff_coverage\nRepeat cutoff: $cutoff_repeats\n\nFile headers:\n" if ($summary_out);
my ($header, $seq_id, $seq);
my $repeated = 0;
my $valid = 0;
#Summary strings:
my $discarded = "";
my $repeats = "";
my $filtered_rep = "";
my $filtered_norep = "";
while (my $line = <SEQIN>)
if ($line =~ />/)
chomp $line;
#Get the sequence name to compare against tab file
$header = $line;
$seq_id = $line =~ /(\w+)_length/;
$seq = "";
my $stat = shift @stats;
die "Less rows in tab than sequences in seq file" unless $stat;
my($name, $length, $coverage) = split(/\t+/, $stat);
die "name or length not defined at $stat\n" unless ($name && $length);
die "coverage is not defined at $stat\n" unless ($coverage ne '' && $coverage >= 0);
die "Unmatched names $header and $name\n" unless ($header =~ /$name/i);
#Entry passes the length and coverage cutoffs?
if ($length >= $length_co && $coverage >= $cutoff_coverage)
$valid = 1;
if ($coverage >= $cutoff_repeats)
my $num_repeats = int($coverage/$avg_coverage);
$header = $header."(".$num_repeats." copies)";
print SEQOUT_REP $header,"\n";
$filtered_rep = $filtered_rep.$header."\n";
print SEQOUT_FILT_REP $header, "\n";
$repeats = $repeats.$header."\n";
$repeated = 1;
print SEQOUT_REP $header, "\n";
$filtered_rep = $filtered_rep.$header."\n";
print SEQOUT_NOREP $header, "\n";
$filtered_norep = $filtered_norep.$header."\n";
$repeated = 0;
elsif ($out_filtered)
$valid = 0;
print SEQOUT_FILT $header,"\n";
$discarded = $discarded.$header."\n";
if ($valid)
print SEQOUT_REP $line;
if (!$repeated)
print SEQOUT_NOREP $line;
print SEQOUT_FILT_REP $line;
elsif ($out_filtered)
print SEQOUT_FILT $line;
close TAB;
close SEQIN;
#Get summary info:
if ($summary_out)
print SUMMARY "Filtered sequences (with repeats):\n$filtered_rep\n";
print SUMMARY "Filtered sequences (no repeats):\n$filtered_norep\n";
print SUMMARY "Repeat sequences:\n$repeats\n";
if ($out_filtered)
print SUMMARY "Discarded sequences:\n$discarded\n";
close SUMMARY;
die "More rows in stats file t