...

Introduction

FindJunctions is a Java program that uses gapped RNA-Seq read-to-genome alignments to identify and quantify exon-exon junctions, i.e., introns, in RNA-Seq data. When given a BAM file, it produces a BED file that summarizes every spliced alignment identified in the BAM file. If also given a reference genomic sequence file (in .2bit format), it attempts to identify the strand of origin for each junction by looking for canonical intron splice junction sequences. FindJunctions is implemented both as a visual analytics function within Integrated Genome Browser (IGB) and as a stand-alone, command-line program.

Within IGB, FindJunctions produces a new track showing exon-exon junction features labeled with the number of alignments that supported the junction. The command-line program produces a BED file in which the score field contains the number of reads supporting the junction.

FindJunctions algorithm

FindJunctions operates on RNA-Seq read alignments that contain gaps in the read sequence relative to the genomic sequence. These gaps correspond to introns and typically start and end with the so-called canonical splice site consensus sequences "GT" (5' end) and "AG" (3' end) for genes transcribed from the plus (forward) strand. For genes transcribed from the minus (reverse) strand, the sequences seen on the plus strand are the reverse complement of the consensus splice site sequences, i.e., "CT" at the 5' end of the gap and "AC" at the 3' end. For each alignment containing a gap, FindJunctions inspects the start and end coordinates of the gap and, if the genomic sequence is available, uses it to infer the strand. FindJunctions creates a list of all such gaps, recording the strand and the genomic start and end coordinates of each. For each unique triplet of start, end, and strand, FindJunctions creates a scored junction feature and increments the score each time a gap supporting that feature is encountered in a dataset. Options are available to limit scoring to read alignments that have a given minimum number of bases flanking a gap and/or that have a single unique mapping onto the genomic sequence.
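The strand inference and scoring steps described above can be sketched in Java as follows. This is an illustrative sketch, not the FindJunctions source: the class and method names (`JunctionSketch`, `inferStrand`, `addGap`, `score`) are hypothetical, the genome is passed as a plain plus-strand string rather than read from a .2bit file, coordinates are assumed to be 0-based with an exclusive end, and returning `.` for a gap with no recognizable splice site is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; not the FindJunctions implementation. */
final class JunctionSketch {

    /** Junction scores keyed by "start:end:strand" (hypothetical key format). */
    private final Map<String, Integer> scores = new LinkedHashMap<>();

    /**
     * Infers the strand of a gap from the two bases at each of its ends.
     * genome is plus-strand sequence; gapStart is 0-based inclusive, gapEnd is exclusive.
     * Returns '+', '-', or '.' when neither canonical pattern matches (an assumption).
     */
    static char inferStrand(String genome, int gapStart, int gapEnd) {
        if (gapStart < 0 || gapEnd > genome.length() || gapEnd - gapStart < 4) {
            return '.';
        }
        String left = genome.substring(gapStart, gapStart + 2).toUpperCase();
        String right = genome.substring(gapEnd - 2, gapEnd).toUpperCase();
        if (left.equals("GT") && right.equals("AG")) {
            return '+'; // canonical intron on the plus strand
        }
        if (left.equals("CT") && right.equals("AC")) {
            return '-'; // reverse complement of GT...AG, i.e., a minus-strand intron
        }
        return '.';
    }

    /** Records one gap, incrementing the score for its (start, end, strand) triplet. */
    void addGap(String genome, int gapStart, int gapEnd) {
        char strand = inferStrand(genome, gapStart, gapEnd);
        scores.merge(gapStart + ":" + gapEnd + ":" + strand, 1, Integer::sum);
    }

    /** Returns the number of gaps recorded for the given triplet, or 0 if none. */
    int score(int gapStart, int gapEnd, char strand) {
        return scores.getOrDefault(gapStart + ":" + gapEnd + ":" + strand, 0);
    }
}
```

Keying the map on the full triplet means that two gaps with the same coordinates but different inferred strands would be counted as separate junction features, which matches the "unique triplet" wording above.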

How to use FindJunctions within IGB

...

Follow the instructions to compile FindJunctions and create a "jar" file.

Run the program using java, providing a comma-separated list of BAM files.

FindJunctions takes one argument (the BAM file input) and several options:

  • -u option (for unique) ensures that only uniquely mapping spliced reads (with NH tag equal to 1) will be used to construct junctions. Not required; the default is to use all reads regardless of mapping quality or number of mappings obtained.
  • -n option is the number of bases that must map to either side of a putative intron for a spliced read to be used to create or support a junction feature. Default is 5.
  • -b is the absolute path to a .2bit format genomic sequence file that will be used to identify junction strand. This is required.
  • -o (output) is the name of the junctions file that will be written. Default is to print to stdout.
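To make the effect of the -u and -n options concrete, here is a minimal Java sketch of how gaps might be extracted from one alignment and filtered. It is not the FindJunctions code: `GapFilterSketch` and `gaps` are hypothetical names, the alignment is represented only by a 0-based start position, a CIGAR string, and the value of its NH tag, and "bases that map to either side" is assumed to mean the M, = and X bases between a gap and the neighboring gap or alignment end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the FindJunctions implementation. */
final class GapFilterSketch {

    private static final Pattern CIGAR_OP = Pattern.compile("(\\d+)([MIDNSHP=X])");

    /**
     * Returns {start, end} pairs (0-based, end exclusive) for each N gap that passes the filters.
     * uniqueOnly mimics -u (keep only alignments with NH equal to 1); minFlank mimics -n.
     */
    static List<int[]> gaps(int alignmentStart, String cigar, int nh,
                            boolean uniqueOnly, int minFlank) {
        List<int[]> kept = new ArrayList<>();
        if (uniqueOnly && nh != 1) {
            return kept; // multi-mapping alignment, skipped under -u
        }
        List<int[]> allGaps = new ArrayList<>();
        List<Integer> matched = new ArrayList<>(); // aligned bases in each segment between gaps
        matched.add(0);
        int refPos = alignmentStart;
        Matcher m = CIGAR_OP.matcher(cigar);
        while (m.find()) {
            int len = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            switch (op) {
                case 'M': case '=': case 'X': {
                    int last = matched.size() - 1;
                    matched.set(last, matched.get(last) + len);
                    refPos += len;
                    break;
                }
                case 'D':
                    refPos += len; // deletion consumes reference but adds no aligned bases
                    break;
                case 'N':
                    allGaps.add(new int[] {refPos, refPos + len});
                    matched.add(0); // start counting the segment after this gap
                    refPos += len;
                    break;
                default:
                    break; // I, S, H, P do not consume reference
            }
        }
        for (int i = 0; i < allGaps.size(); i++) {
            // Segment i lies to the left of gap i and segment i + 1 to the right.
            if (matched.get(i) >= minFlank && matched.get(i + 1) >= minFlank) {
                kept.add(allGaps.get(i));
            }
        }
        return kept;
    }
}
```

For example, under these assumptions an alignment starting at position 100 with CIGAR `20M500N30M` yields one gap from 120 to 620 with the default flank of 5, while `3M500N30M` yields none because only three bases map to the left of the gap.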

Output is tab-delimited BED12 format. The name field contains a name constructed from the location of the junction and the score field contains the number of spliced alignments supporting each junction. 
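As an illustration of what one BED12 record for a junction could look like, the Java sketch below formats a single line. The exact name format and block layout that FindJunctions writes are not specified here, so everything beyond "location-based name" and "score equals supporting alignment count" is an assumption: the `chrom:start-end:strand` name, the two equal-sized flanking blocks, and the `Bed12Sketch.junctionLine` helper itself are hypothetical.

```java
/** Illustrative sketch only; the real FindJunctions output layout may differ. */
final class Bed12Sketch {

    /**
     * Formats one junction as a BED12 line with two blocks of size flank on either side of the gap.
     * gapStart is 0-based inclusive and gapEnd is exclusive, as in BED coordinates.
     */
    static String junctionLine(String chrom, int gapStart, int gapEnd,
                               char strand, int score, int flank) {
        int start = gapStart - flank;
        int end = gapEnd + flank;
        // Hypothetical name built from the junction location.
        String name = chrom + ":" + gapStart + "-" + gapEnd + ":" + strand;
        return String.join("\t",
                chrom,
                String.valueOf(start),          // chromStart
                String.valueOf(end),            // chromEnd
                name,
                String.valueOf(score),          // number of supporting spliced alignments
                String.valueOf(strand),
                String.valueOf(start),          // thickStart
                String.valueOf(end),            // thickEnd
                "0",                            // itemRgb
                "2",                            // blockCount
                flank + "," + flank,            // blockSizes
                "0," + (gapEnd - start));       // blockStarts, relative to chromStart
    }
}
```

The second block starts at the end of the gap, so its offset from chromStart is `gapEnd - start`, and the last block ends exactly at chromEnd as the BED format requires.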

...