================================================
1. Quality Trimming and Filtering Your Sequences
================================================
Boot up an m1.xlarge machine from Amazon Web Services; this has about
15 GB of RAM and 4 virtual CPUs, and will be enough to complete the
assembly of the example data set.

.. note::

   The raw data for this tutorial is available as public snapshot
   snap-05633504 on EC2/EBS.

.. note::

   Some of these commands may take a very long time. Please see
   :doc:`../amazon/using-screen`.

Install software
================
Install screed::

   pip install screed

Install khmer::

   cd /usr/local/share
   git clone https://github.com/ged-lab/khmer.git
   cd khmer
   git checkout protocols-v0.8.3
   make
   echo 'export PYTHONPATH=/usr/local/share/khmer:$PYTHONPATH' >> ~/.bashrc
   source ~/.bashrc

Install Trimmomatic::

   cd /root
   curl -O http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.30.zip
   unzip Trimmomatic-0.30.zip
   cd Trimmomatic-0.30/
   cp trimmomatic-0.30.jar /usr/local/bin
   cp -r adapters /usr/local/share/adapters

Install libgtextutils and fastx::

   cd /root
   curl -O http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.1.tar.bz2
   tar xjf libgtextutils-0.6.1.tar.bz2
   cd libgtextutils-0.6.1/
   ./configure && make && make install

   cd /root
   curl -O http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.2.tar.bz2
   tar xjf fastx_toolkit-0.0.13.2.tar.bz2
   cd fastx_toolkit-0.0.13.2/
   ./configure && make && make install

In each of these cases, we download the software, unpack it, compile
it where necessary (which we can discuss later), and then install it
for general use. If we don't discuss a package below, a quick web
search will tell you what it is and does.
Create a working directory
==========================
Let's create a place to work::

   cd /mnt
   mkdir assembly
   cd assembly

Link in the data
================
The tutorial data is from Sharon et al. 2013; it consists of two data
points from an infant gut sample.
If you want to follow along with this data (guaranteed to work, and
highly advised on a first run-through!), please see
:doc:`../mrnaseq/0-download-and-save` and follow the instructions to
(a) create a volume from snapshot snap-05633504 and (b) mount it as
/data.
Once you've done that, check that ``ls /data`` shows exactly four FASTQ
files, and then type::

   ln -fs /data/SRR49206?_?.fastq.gz .

This links the data into the /mnt/assembly directory.
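
Before moving on, it may be worth confirming that all four links resolve. The sketch below is one way to do that; ``count_fastq_inputs`` is a hypothetical helper (not part of khmer or any tool used here), and the demonstration runs against a scratch directory of empty placeholder files rather than the real /data volume.

```shell
# Hypothetical helper: count the files in a directory that match the
# tutorial's glob. "-e" follows symlinks, so dangling links are not counted.
count_fastq_inputs() {
    dir=$1
    n=0
    for f in "$dir"/SRR49206?_?.fastq.gz; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Demonstration on a scratch directory; empty files stand in for /data.
demo=$(mktemp -d)
mkdir "$demo/data" "$demo/assembly"
touch "$demo/data/SRR492065_1.fastq.gz" "$demo/data/SRR492065_2.fastq.gz" \
      "$demo/data/SRR492066_1.fastq.gz" "$demo/data/SRR492066_2.fastq.gz"
ln -fs "$demo"/data/SRR49206?_?.fastq.gz "$demo/assembly/"

linked=$(count_fastq_inputs "$demo/assembly")
echo "linked files: $linked"
rm -r "$demo"
```

On the real machine, ``count_fastq_inputs /mnt/assembly`` would be the equivalent call; anything other than 4 suggests the volume is not mounted at /data.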
Trim and quality filter
=======================
Trim the first data set (~20 minutes)::

   mkdir trim
   cd trim

   java -jar /usr/local/bin/trimmomatic-0.30.jar PE ../SRR492065_?.fastq.gz s1_pe s1_se s2_pe s2_se ILLUMINACLIP:/usr/local/share/adapters/TruSeq3-PE.fa:2:30:10

   /usr/local/share/khmer/scripts/interleave-reads.py s?_pe > combined.fq

   fastq_quality_filter -Q33 -q 30 -p 50 -i combined.fq > combined-trim.fq
   fastq_quality_filter -Q33 -q 30 -p 50 -i s1_se > s1_se.trim
   fastq_quality_filter -Q33 -q 30 -p 50 -i s2_se > s2_se.trim

   /usr/local/share/khmer/scripts/extract-paired-reads.py combined-trim.fq

   gzip -9c combined-trim.fq.pe > ../SRR492065.pe.qc.fq.gz
   gzip -9c combined-trim.fq.se s1_se.trim s2_se.trim > ../SRR492065.se.qc.fq.gz

   cd ../
   rm -fr trim

Trim the second data set (~20 minutes)::

   mkdir trim
   cd trim

   java -jar /usr/local/bin/trimmomatic-0.30.jar PE ../SRR492066_?.fastq.gz s1_pe s1_se s2_pe s2_se ILLUMINACLIP:/usr/local/share/adapters/TruSeq3-PE.fa:2:30:10

   /usr/local/share/khmer/scripts/interleave-reads.py s?_pe > combined.fq

   fastq_quality_filter -Q33 -q 30 -p 50 -i combined.fq > combined-trim.fq
   fastq_quality_filter -Q33 -q 30 -p 50 -i s1_se > s1_se.trim
   fastq_quality_filter -Q33 -q 30 -p 50 -i s2_se > s2_se.trim

   /usr/local/share/khmer/scripts/extract-paired-reads.py combined-trim.fq

   gzip -9c combined-trim.fq.pe > ../SRR492066.pe.qc.fq.gz
   gzip -9c combined-trim.fq.se s1_se.trim s2_se.trim > ../SRR492066.se.qc.fq.gz

   cd ../
   rm -fr trim

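
The two runs above differ only in the accession number, so with more samples you might prefer to drive them from a loop. The sketch below is a dry run: it only prints the Trimmomatic command for each accession, copied from the commands above, and runs nothing. ``trim_command`` is a hypothetical helper name introduced for illustration.

```shell
# Hypothetical helper: print (do not run) the Trimmomatic command for one
# accession, using the same arguments as the tutorial commands.
trim_command() {
    acc=$1
    printf 'java -jar /usr/local/bin/trimmomatic-0.30.jar PE ../%s_?.fastq.gz s1_pe s1_se s2_pe s2_se ILLUMINACLIP:/usr/local/share/adapters/TruSeq3-PE.fa:2:30:10\n' "$acc"
}

# Dry run over both data sets.
cmds=$(for acc in SRR492065 SRR492066; do trim_command "$acc"; done)
echo "$cmds"
```

The remaining steps (interleaving, quality filtering, pair extraction, and compression) would go inside the same loop, with ``$acc`` substituted into the output file names.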
Done! Now you have four files: SRR492065.pe.qc.fq.gz, SRR492065.se.qc.fq.gz, SRR492066.pe.qc.fq.gz, and SRR492066.se.qc.fq.gz.

The '.pe' files are interleaved paired-end; you can take a look at them like so::

   gunzip -c SRR492065.pe.qc.fq.gz | head

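
If you want more than a visual check, the interleaving can be tested mechanically. The sketch below assumes that read names end in ``/1`` and ``/2`` and that each record spans exactly four lines; both are assumptions about the files, so adjust the pattern if your read names look different. ``check_interleaved`` is a hypothetical helper, demonstrated here on two tiny made-up samples.

```shell
# Hypothetical helper: read FASTQ on stdin and succeed only if records
# alternate name/1, name/2 with matching base names.
check_interleaved() {
    awk 'NR % 4 == 1 {
             name = substr($1, 2)                 # drop the leading "@"
             base = substr(name, 1, length(name) - 2)
             n++
             if (n % 2 == 1) {
                 if (name !~ /\/1$/) { bad = 1; exit }
                 first = base
             } else if (name !~ /\/2$/ || base != first) {
                 bad = 1; exit
             }
         }
         END { if (bad || n % 2) exit 1 }'
}

# Two made-up samples: a proper pair, and two first-mates in a row.
good=$(printf '@r1/1\nACGT\n+\nIIII\n@r1/2\nTTGA\n+\nIIII\n' | check_interleaved && echo yes || echo no)
bad=$(printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' | check_interleaved && echo yes || echo no)
echo "paired sample ok: $good, broken sample ok: $bad"
```

On the tutorial output, the equivalent call would be ``gunzip -c SRR492065.pe.qc.fq.gz | check_interleaved``.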
The other two ('.se') files contain single-ended reads: reads that
were orphaned because their mates were discarded during trimming or
quality filtering.
All four files are in FASTQ format.
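
Each FASTQ record is four lines: a name line, the sequence, a ``+`` separator, and a quality string. The ``-Q33 -q 30 -p 50`` options used above tell ``fastq_quality_filter`` to decode qualities with a Phred+33 offset and keep a read only if at least 50% of its bases have quality 30 or higher. The awk sketch below illustrates that rule; it is not the FASTX-Toolkit implementation, rounding at the threshold may differ, and ``keep_by_quality`` is a hypothetical name.

```shell
# Hypothetical helper: keep FASTQ records (stdin) in which at least
# minpct percent of bases have Phred quality >= minq (Phred+33 encoding).
keep_by_quality() {
    awk -v minq="${1:-30}" -v minpct="${2:-50}" '
        BEGIN {
            # Map each printable ASCII character to its Phred+33 score.
            for (i = 33; i < 127; i++) phred[sprintf("%c", i)] = i - 33
        }
        { rec[NR % 4] = $0 }                      # buffer the current record
        NR % 4 == 0 {
            len = length($0); ok = 0
            for (i = 1; i <= len; i++)
                if (phred[substr($0, i, 1)] >= minq) ok++
            if (len > 0 && ok * 100 >= minpct * len)
                printf "%s\n%s\n%s\n%s\n", rec[1], rec[2], rec[3], rec[0]
        }'
}

# Made-up reads: "I" encodes Q40 and "#" encodes Q2, so @hi has 100% of
# bases at Q30 or better and @lo has only 25%.
kept=$(printf '@hi\nACGT\n+\nIIII\n@lo\nACGT\n+\nI###\n' | keep_by_quality 30 50 | awk 'NR % 4 == 1')
echo "kept: $kept"
```

This is why stricter ``-q`` or ``-p`` values orphan more reads: a pair is split whenever only one of its mates passes the filter.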
----
Next: :doc:`2-diginorm`