Variant calling
###############
We are using Podar dataset from ...
Install software
~~~~~~~~~~~~~~~~
You'll want an m1.large or m1.xlarge for this.
First, we need to install the `BWA aligner
`__::
cd /root
wget -O bwa-0.7.5.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2/download
tar xvfj bwa-0.7.5.tar.bz2
cd bwa-0.7.5a
make
cp bwa /usr/local/bin
.. We also need a new version of `samtools `__::
cd /root
curl -O -L http://sourceforge.net/projects/samtools/files/samtools/0.1.19/samtools-0.1.19.tar.bz2
tar xvfj samtools-0.1.19.tar.bz2
cd samtools-0.1.19
make
cp samtools /usr/local/bin
cp bcftools/bcftools /usr/local/bin
cd misc/
cp *.pl maq2sam-long maq2sam-short md5fa md5sum-lite wgsim /usr/local/bin/
Download data
~~~~~~~~~~~~~
Download the reference genome and the resequencing reads::
cd /mnt
curl -O http://athyra.idyll.org/~mahmoud4/all.fa
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/SRR606249/SRR606249_1.fastq.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR606/SRR606249/SRR606249_2.fastq.gz
Do the mapping
~~~~~~~~~~~~~~
Now let's map all of the reads to the reference. Start by indexing the
reference genome::
cd /mnt
bwa index ref.fa
Now, do the mapping of the raw reads to the reference genome::
bwa aln ref.fa SRR606249_1.fastq.gz > SRR606249_1.sai
bwa aln ref.fa SRR606249_2.fastq.gz > SRR606249_2.sai
Make a SAM file (this would be done with 'samse' if these were single
reads)::
bwa sampe ref.fa SRR606249_1.sai SRR606249_2.sai SRR606249_1.fastq.gz SRR606249_2.fastq.gz> SRR606249.sam
This file contains all of the information about where each read hits
on the reference.
Next, index the reference genome with samtools::
samtools faidx ref.fa
Convert the SAM into a BAM file::
samtools import ref.fa.fai SRR606249.sam SRR606249.bam
Sort the BAM file::
samtools sort SRR606249.bam SRR606249.sorted
And index the sorted BAM file::
samtools index SRR606249.sorted.bam
At this point you can visualize with tview or Tablet.
'samtools tview' is a text interface that you use from the command
line; run it like so::
samtools tview SRR606249.sorted.bam ref.fa
The '.'s are places where the reads align perfectly in the forward direction,
and the ','s are places where the reads align perfectly in the reverse
direction. Mismatches are indicated as A, T, C, G, etc.
You can scroll around using left and right arrows; to go to a specific
coordinate, use 'g' and then type in the contig name and the position.
For example, type 'g' and then 'rel606:553093' to go to
position 553093 in the BAM file.
For the `Tablet viewer `__, click on
the link and get it installed on your local computer. Then, start it
up as an application. To open your alignments in Tablet, you'll need
three files on your local computer: ``ref.fa``, ``SRR606249.sorted.bam``,
and ``SRR606249.sorted.bam.bai``. You can copy them over using Dropbox,
for example.
Calling SNPs
~~~~~~~~~~~~
You can use samtools to call SNPs like so::
samtools mpileup -uD -f ref.fa SRR606249.sorted.bam | bcftools view -bvcg - > SRR606249.raw.bcf
(See the 'mpileup' docs `here `__.)
Now convert the BCF into VCF::
bcftools view SRR606249.raw.bcf > SRR606249.vcf
You can check out the VCF file by using 'tail' to look at the bottom::
tail *.vcf
To further analyze the VCF file, take a look at this IPython notebook: `hw5-variant-solutions.ipynb `__.
Other resources
---------------
5 things to know about the samtools mpileup tool: http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html
VCF file format specification: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41