=================================
2. Applying Digital Normalization
=================================

.. shell start

.. note::

   You can start this tutorial with the contents of EC2/EBS snapshot
   snap-126cc847.

.. note::

   You'll need ~15 GB of RAM for this, or more if you have a LOT of
   data.

Link in your data
-----------------

Make sure your data is in /mnt/work/.  If you've loaded it onto
/data, you can do::

   cd /mnt
   mkdir work
   cd /mnt/work
   ln -fs /data/*.qc.fq.gz .

Run digital normalization
-------------------------

.. Fix the working directory location; clean up some stuff

.. ::

   cd /mnt/work
   rm -f *.abundfilt.pe.gz
   rm -f *.keep.abundfilt.fq.gz

.. ::

   echo 2-diginorm normalize1-pe `date` >> /root/times.out

Apply digital normalization to the paired-end reads:

::

   normalize-by-median.py -p -k 20 -C 20 -N 4 -x 3e8 --savetable normC20k20.kh *.pe.qc.fq.gz

.. ::

   echo 2-diginorm normalize1-se `date` >> /root/times.out

and then to the single-end reads:

::

   normalize-by-median.py -C 20 --loadtable normC20k20.kh --savetable normC20k20.kh *.se.qc.fq.gz

Note the '-p' in the first normalize-by-median command -- when run on
PE data, this ensures that no paired ends are orphaned.  However, it
will complain on single-ended data, so you have to give it the
single-end reads separately.

Also note the '-N' and '-x' parameters.  These specify how much
memory diginorm should use: roughly the product of the two, so the
command above uses about 4 x 3e8 bytes, or 1.2 GB.  That product
should be less than the memory size of the machine you selected.  The
maximum needed for *any* transcriptome should be in the ~60 GB range,
e.g. -N 4 -x 15e9; for only a few hundred million reads, 16 GB should
be plenty.  (See the khmer documentation on choosing hash sizes for
more information.)

Trim off likely erroneous k-mers
--------------------------------

.. ::

   echo 2-diginorm filter-abund `date` >> /root/times.out

Now, run through all the reads and trim off low-abundance parts of
high-coverage reads:

::

   filter-abund.py -V normC20k20.kh *.keep

This will turn some reads into orphans, but that's OK -- their
partner read was bad.

Rename files
~~~~~~~~~~~~

You'll have a bunch of 'keep.abundfilt' files -- let's make things
prettier.

.. ::

   echo 2-diginorm extract `date` >> /root/times.out

First, let's break out the orphaned and still-paired reads:

::

   for i in *.pe.*.abundfilt
   do
      extract-paired-reads.py $i
   done

We can combine the orphaned reads into a single file:

::

   for i in *.se.qc.fq.gz.keep.abundfilt
   do
      pe_orphans=$(basename $i .se.qc.fq.gz.keep.abundfilt).pe.qc.fq.gz.keep.abundfilt.se
      newfile=$(basename $i .se.qc.fq.gz.keep.abundfilt).se.qc.keep.abundfilt.fq.gz
      cat $i $pe_orphans | gzip -c > $newfile
   done

We can also rename the remaining PE reads & compress those files:

::

   for i in *.abundfilt.pe
   do
      newfile=$(basename $i .fq.gz.keep.abundfilt.pe).keep.abundfilt.fq
      mv $i $newfile
      gzip $newfile
   done

This leaves you with a whole passel o' files, most of which you want
to go away!

::

   6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz
   6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep
   6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep.abundfilt
   6Hour_CGATGT_L002_R1_005.pe.qc.fq.gz.keep.abundfilt.se
   6Hour_CGATGT_L002_R1_005.se.qc.fq.gz
   6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep
   6Hour_CGATGT_L002_R1_005.se.qc.fq.gz.keep.abundfilt

.. ::

   echo 2-diginorm DONE `date` >> /root/times.out

So, finally, let's get rid of a lot of the old files:

::

   rm *.se.qc.fq.gz.keep.abundfilt
   rm *.pe.qc.fq.gz.keep.abundfilt.se
   rm *.keep
   rm *.abundfilt
   rm *.kh

.. rm *.qc.fq.gz @CTB should we do?
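If you're curious how much data diginorm and the abundance trimming
actually removed, you can compare read counts before and after.
Here's a minimal sketch, assuming bash and the file suffixes used
above; a FASTQ record is exactly four lines, so dividing the line
count by four gives the number of reads:

::

   # count reads in the original QC'd files and in the normalized,
   # abundance-trimmed files (4 lines per FASTQ record)
   for i in *.qc.fq.gz *.qc.keep.abundfilt.fq.gz
   do
      echo "$i: $(( $(gzip -dc $i | wc -l) / 4 )) reads"
   done

Expect the .keep.abundfilt files to contain far fewer reads than the
matching .qc files -- diginorm's whole point is to discard redundant
high-coverage reads.  The count for each .pe. file should also be
even, since '-p' and extract-paired-reads.py keep pairs together.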
Gut check
~~~~~~~~~

You should now have the following files in your directory (after
typing 'ls')::

   6Hour_CGATGT_L002_R1_005.pe.qc.keep.abundfilt.fq.gz
   6Hour_CGATGT_L002_R1_005.se.qc.keep.abundfilt.fq.gz

These files are, respectively, the paired (pe), quality-filtered
(qc), digitally normalized (keep), abundance-trimmed (abundfilt),
gzipped (gz) FASTQ (fq) reads, and the orphaned (se) reads that have
been through the same quality filtering, normalization, and trimming.

Save all these files to a new volume, and get ready to assemble!

.. shell stop

Next: :doc:`3-big-assembly`.