We are using Podar dataset from ...
You’ll want an m1.large or m1.xlarge for this.
First, we need to install the BWA aligner:
cd /root
wget -O bwa-0.7.5.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2/download
tar xvfj bwa-0.7.5.tar.bz2
cd bwa-0.7.5a
make
cp bwa /usr/local/bin
Download the reference genome and the resequencing reads:
cd /mnt
curl -O http://athyra.idyll.org/~mahmoud4/all.fa
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/SRR606249/SRR606249_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/SRR606249/SRR606249_2.fastq.gz
Now let’s map all of the reads to the reference. Start by indexing the reference genome:
cd /mnt
bwa index ref.fa
Now, do the mapping of the raw reads to the reference genome:
bwa aln ref.fa SRR606249_1.fastq.gz > SRR606249_1.sai
bwa aln ref.fa SRR606249_2.fastq.gz > SRR606249_2.sai
Make a SAM file (this would be done with ‘samse’ if these were single reads):
bwa sampe ref.fa SRR606249_1.sai SRR606249_2.sai SRR606249_1.fastq.gz SRR606249_2.fastq.gz> SRR606249.sam
This file contains all of the information about where each read hits on the reference.
Next, index the reference genome with samtools:
samtools faidx ref.fa
Convert the SAM into a BAM file:
samtools import ref.fa.fai SRR606249.sam SRR606249.bam
Sort the BAM file:
samtools sort SRR606249.bam SRR606249.sorted
And index the sorted BAM file:
samtools index SRR606249.sorted.bam
At this point you can visualize with tview or Tablet.
‘samtools tview’ is a text interface that you use from the command line; run it like so:
samtools tview SRR606249.sorted.bam ref.fa
The ‘.’s are places where the reads align perfectly in the forward direction, and the ‘,’s are places where the reads align perfectly in the reverse direction. Mismatches are indicated as A, T, C, G, etc.
You can scroll around using left and right arrows; to go to a specific coordinate, use ‘g’ and then type in the contig name and the position. For example, type ‘g’ and then ‘rel606:553093<ENTER>’ to go to position 553093 in the BAM file.
For the Tablet viewer, click on the link and get it installed on your local computer. Then, start it up as an application. To open your alignments in Tablet, you’ll need three files on your local computer: ref.fa, SRR606249.sorted.bam, and SRR606249.sorted.bam.bai. You can copy them over using Dropbox, for example.
You can use samtools to call SNPs like so:
samtools mpileup -uD -f ref.fa SRR606249.sorted.bam | bcftools view -bvcg - > SRR606249.raw.bcf
(See the ‘mpileup’ docs here.)
Now convert the BCF into VCF:
bcftools view SRR606249.raw.bcf > SRR606249.vcf
You can check out the VCF file by using ‘tail’ to look at the bottom:
tail *.vcf
To further analyze the VCF file, take a look at this IPython notebook: hw5-variant-solutions.ipynb.
5 things to know about the samtools mpileup tool: http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html
VCF file format specification: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41