Install khmer, screed, and BLAST. (See 1. Quality Trimming and Filtering Your Sequences and BLASTing your assembled data). I would suggest using an m1.large or m1.xlarge machine.
You’ll also need to install some eel-pond scripts:
git clone https://github.com/ctb/eel-pond.git /usr/local/share/eel-pond
You need to get ahold of your assembled transcriptome (from e.g. 3. Running the Actual Assembly). Put it in /mnt.
For the purposes of your first run through, I suggest just grabbing my copy of the Nematostella assembly:
cd /mnt
curl -O https://s3.amazonaws.com/public.ged.msu.edu/trinity-nematostella-raw.fa.gz
Partitioning runs a de Bruijn graph-based clustering algorithm that will cluster your transcripts by transitive sequence overlap. That is, it will build transcript families :).
/usr/local/share/khmer/scripts/do-partition.py -x 1e9 -N 4 --threads 4 nema trinity-nematostella-raw.fa.gz
This should take about 15 minutes, and outputs a file ending in ‘.part’ that contains the partition assignments. Now, group and rename the sequences:
python /usr/local/share/eel-pond/rename-with-partitions.py nema trinity-nematostella-raw.fa.gz.part
mv trinity-nematostella-raw.fa.gz.part.renamed.fasta.gz trinity-nematostella.renamed.fa.gz
Let’s look at the renamed sequences:
gunzip -c trinity-nematostella.renamed.fa.gz | head
You’ll see that each sequence name looks like this:
>nema.id1.tr16001 1_of_1_in_tr16001 len=261 id=1 tr=16001
Some explanation:
‘nema’ is the prefix that you gave the rename script, above; modify accordingly for your own organism. It’s best to change it each time you do an assembly, just to keep things straight.
file.
‘trN’ is the transcript family, which may contain one or more transcripts.
‘1_of_1_in_tr16001’ tells you that this transcript family has only one transcript in it (this one!) Other transcript families may (will) have more.
‘len’ is the sequence length.
Now let’s assign putative homology & orthology to these transcripts, by doing BLASTs & reciprocal best hit analysis. First, uncompress your transcripts file:
gunzip trinity-nematostella.renamed.fa.gz
Now, grab the latest mouse RefSeq:
curl -O ftp://ftp.ncbi.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz
gunzip mouse.protein.faa.gz
Format both as BLAST databases:
formatdb -i mouse.protein.faa -o T -p T
formatdb -i trinity-nematostella.renamed.fa -o T -p F
And, now, run BLAST in both directions. Note, this may take ~24 hours or longer; you probably want to run it in screen:
blastall -i trinity-nematostella.renamed.fa -d mouse.protein.faa -e 1e-3 -p blastx -o nema.x.mouse -a 8 -v 4 -b 4
blastall -i mouse.protein.faa -d trinity-nematostella.renamed.fa -e 1e-3 -p tblastn -o mouse.x.nema -a 8 -v 4 -b 4
Note
These BLASTs will take a long time, like 24-36 hours. If you want to work with canned BLASTs, do:
curl -O http://athyra.idyll.org/~t/nema.x.mouse.gz
curl -O http://athyra.idyll.org/~t/mouse.x.nema.gz
gunzip nema.x.mouse.gz
gunzip mouse.x.nema.gz
Now, calculate putative homology (best BLAST hit) and orthology (reciprocal best hits):
python /usr/local/share/eel-pond/make-uni-best-hits.py nema.x.mouse nema.x.mouse.homol
python /usr/local/share/eel-pond/make-reciprocal-best-hits.py nema.x.mouse mouse.x.nema nema.x.mouse.ortho
Prepare some of the mouse info:
python /usr/local/share/eel-pond/make-namedb.py mouse.protein.faa mouse.namedb
python -m screed.fadbm mouse.protein.faa
And, finally, annotate the sequences:
python /usr/local/share/eel-pond/annotate-seqs.py trinity-nematostella.renamed.fa nema.x.mouse.ortho nema.x.mouse.homol
This will produce a file ‘trinity-nematostella.renamed.fa.annot’, which will have sequences that look like this:
>nematostella.id1.tr115222 h=43% => suppressor of tumorigenicity 7 protein isoform 2 [Mus musculus] 1_of_7_in_tr115222 len=1635 id=1 tr=115222 1_of_7_in_tr115222 len=1635 id=1 tr=115222
I suggest renaming this file to ‘nematostella.fa’ and using it for BLASTs (see BLASTing your assembled data).
This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.
For an introduction to the documentation format please see the reST primer.