Align your sequences. For RNA-Seq data sets, we have mainly used TopHat, from the University of Maryland.

TopHat is a spliced alignment tool that first runs BowTie (a non-spliced alignment tool) and then attempts to align any reads BowTie couldn't align by splitting them across putative introns. For this reason, to run TopHat you'll have to install BowTie. You'll also have to install samtools, a program that TopHat and BowTie use to generate alignment files called "BAM" (binary alignment) files. Once you have sorted and indexed these files, IGB will be able to display them. More on this later.

Many versions of TopHat have been released over the past couple of years, and each behaves slightly differently. However, a few things seem to remain stable. First, TopHat will typically report multiple alignments for some reads. This is to be expected. However, depending on your experimental goals, you may want to focus on the reads that map exactly once onto the genome. You can tell which reads mapped to multiple locations by looking at the "NH" tag in each alignment, which records the number of alignments reported for that read. More on this later. Second, you should determine the minimum and maximum intron sizes for your genome and provide these as parameters to TopHat. For details on running TopHat, see the TopHat Manual.
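Once you have alignments in hand, the NH-based selection described above might look something like the following. This is only a minimal sketch: the function name keep_unique_alignments and the output file name Sample.unique.bam are made up for illustration, and it assumes your TopHat version writes an NH tag on every alignment (check a few lines of your own output first).

```shell
# Keep SAM header lines plus alignments tagged NH:i:1 (read reported exactly once).
# Reads SAM text on stdin and writes SAM text on stdout.
# keep_unique_alignments is a hypothetical helper name, not part of samtools or TopHat.
keep_unique_alignments() {
    awk -F '\t' '
        /^@/ { print; next }            # pass header lines through unchanged
        {
            for (i = 12; i <= NF; i++)  # optional tags start at column 12
                if ($i == "NH:i:1") { print; break }
        }'
}

# Possible usage (assumes samtools is on your PATH; file names are examples):
#   samtools view -h Sample.bam | keep_unique_alignments | samtools view -bS -o Sample.unique.bam -
```

The filter compares the whole tag, so NH:i:10 and similar values are not mistaken for NH:i:1.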

Here is an example invocation of TopHat, tuned for Arabidopsis thaliana:

tophat --min-intron-length 40 --max-intron-length 2000 -o Sample /data/bowtieindex/A_thaliana_Jun_2009 Sample.fastq

In this example, we've run bowtie-build using a fasta file A_thaliana_Jun_2009.fa (which you can download from a folder on our QuickLoad site here) to create index files bowtie uses to speed up the alignment process.

Please note that the names of chromosomes in the fasta file (and hence the BowTie index files) must match what IGB uses. For Arabidopsis, chromosome names are chr1, chr2, chr3, chr4, chr5, chrC, and chrM.

IGB has a synonyms system that allows it to "understand" that chr1 and Chr1 are really the same thing, but other programs you will run into won't be smart enough to match up names in this way. For this reason, it's a good idea to use the same chromosome names throughout all the different steps of processing data.
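One quick way to see which chromosome names a fasta file will pass along to the index is to list its header lines. The sketch below assumes the usual convention that the sequence name is the text between ">" and the first whitespace; fasta_names is a hypothetical helper name chosen for this example.

```shell
# Print the sequence name from each FASTA header line:
# the text after ">" up to the first whitespace character.
# fasta_names is a hypothetical helper name used for illustration only.
fasta_names() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Possible usage:
#   fasta_names A_thaliana_Jun_2009.fa
# After alignment, the names recorded in the BAM header can be listed with:
#   samtools view -H Sample.bam | grep '^@SQ'
```

If the two lists differ from the names IGB uses, it is usually easier to rename the sequences in the fasta file and rebuild the index than to patch names in every downstream file.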

Process output files

Index (and rename) your alignment file

When TopHat finishes, it will have created a file called "accepted_hits.bam" that contains your alignments. Your next step is to index this file (sorting it first if necessary), creating what's called a "bai" index file. For this, you'll use samtools, a freely available command-line tool with many useful functions. You can learn about samtools at this Web site and download the code from this page hosted at sourceforge.net.

The "accepted_hits.bam" file should already be sorted. So all you should need to do next is rename it and make an index for it, like so:

$ mv accepted_hits.bam Sample.bam
$ samtools index Sample.bam

which will create an index file named Sample.bam.bai.
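If you want to confirm that the file really is sorted before indexing it, one option is to look at the @HD line of the BAM header. The sketch below assumes your TopHat version records the sort order there as SO:coordinate; some versions may omit the field, in which case the check reports "not sorted" even for a sorted file. The function name is_coordinate_sorted is made up for this example.

```shell
# Succeed (exit status 0) if the SAM header on stdin has an @HD line
# containing the field SO:coordinate; fail (exit status 1) otherwise.
# is_coordinate_sorted is a hypothetical helper name used for illustration only.
is_coordinate_sorted() {
    awk -F '\t' '
        /^@HD/ { for (i = 2; i <= NF; i++) if ($i == "SO:coordinate") found = 1 }
        END    { exit(found ? 0 : 1) }'
}

# Possible usage (assumes samtools is on your PATH):
#   samtools view -H Sample.bam | is_coordinate_sorted || echo "Sample.bam is not marked as coordinate-sorted"
```

If the check fails, samtools sort can produce a sorted copy; its argument order differs between samtools releases, so consult the help text for the version you have installed.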

Make a bedgraph (wiggle) file

Older versions of TopHat created a depth (also called coverage) file as part of the TopHat processing step. However, more recent versions of TopHat no longer do this. The program TopHat used to make the coverage graph file was called wiggles; we obtained a copy and made a few modifications to it. You can obtain compiled versions from this wiki page (click "Attachments") or from our subversion repository.

To create a bedgraph file from your BAM alignments file on a Linux machine, do this:

$ samtools view Sample.bam | wiggles.linux --wiggle-name="Sample coverage graph" > Sample.bedgraph

to be continued