Introduction

High-throughput sequencing of cDNA, also called RNA-Seq, can provide many levels of information about gene expression, such as information about previously unannotated genes, expression of pseudogenes, differential expression both within and across samples, and alternative splicing.

Making the libraries and sending them out for sequencing is only the first step in performing an RNA-Seq experiment. What most people find is that processing, analyzing, and interpreting the data can often be just as time-consuming. Take heart. These data sets are so rich that you may never fully exhaust their potential. However, there are a few first steps you'll want to perform right away, as well as some quality control steps that will help you assess how well your experiment worked.

What follows is a description of RNA-Seq processing steps we do fairly routinely in the Loraine lab, along with links to software programs we've developed for in-house use. Please be aware that these programs are very much works in progress and so may not always work as advertised. If you find bugs or inconsistencies, please let us know - contact Ann (aloraine@uncc.edu) with feedback, suggestions, and bug reports.

RNA-Seq tutorial

...

The following protocol describes processing data from the Illumina HiSeq pipeline. Whether you should perform exactly these steps with your data sets will depend on your data - its age, whether you've done single-end or paired-end sequencing, the read lengths, your reference genome, and so on. For example, when you run TopHat, you may need to adjust parameters to accommodate smaller read lengths if you are running data from pre-HiSeq instruments.
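
Because the right settings depend on read length, it can help to look at the read lengths in a FASTQ file before aligning. The sketch below is only an illustration: the file name and its three reads are made up, and it assumes the usual four-lines-per-record FASTQ layout with no wrapped sequence lines.

```shell
demo_dir=$(mktemp -d)
# Made-up reads for illustration; substitute your own FASTQ file.
cat > "$demo_dir/demo.fastq" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGTACGT
+
IIIIIIII
@read3
TTGACCATGA
+
IIIIIIIIII
EOF

# Sequence lines are lines 2, 6, 10, ... in a four-line-per-record FASTQ file.
read_lengths=$(awk 'NR % 4 == 2 { print length($0) }' "$demo_dir/demo.fastq" | sort -n)
shortest=$(echo "$read_lengths" | head -n 1)
longest=$(echo "$read_lengths" | tail -n 1)
echo "shortest read: $shortest, longest read: $longest"
```

If the shortest and longest values differ a lot, the reads may have been trimmed already, which is worth knowing before you choose alignment parameters.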

Check sequence quality.

...

Your first step upon downloading your reads will be to check the quality of your sequence. One of the easiest-to-use tools we (in the Loraine lab) have found for quality checking is a terrific program called FastQC, from the Babraham Institute. You can run it interactively or as a command line tool, and it's written in Java, which means you can run it on a Mac, a Linux machine, or a Windows computer. As with IGB, any computer that supports Java can run FastQC.

What you should be looking for in your data is evidence of poor sequencing quality as well as spots in your sequence where you have lots and lots of N characters - these usually correspond to places with poor quality. If your reads are generally low quality, you should ask your sequencing facility to try again. If your data have many, many over-represented sequences (something FastQC can tell you), then you may want to make another library for that sample. In our experience, low yields and low quality generally happen because something went wrong with sequencing. Over-represented sequences and so-called PCR duplicates (many copies of the same sequence) are usually due to problems in library construction.
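
As a rough first look before opening the FastQC report, you can count how many reads contain N characters. This is a minimal sketch under stated assumptions: the FASTQ file below is made up for illustration, and the FastQC step runs only if a `fastqc` command is on your PATH (the `--outdir` option names the folder where reports are written).

```shell
demo_dir=$(mktemp -d)
# Made-up reads for illustration; read2 has a stretch of N calls.
cat > "$demo_dir/demo.fastq" <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGNNNNTAC
+
II####IIII
@read3
TTGACCATGA
+
IIIIIIIIII
EOF

# Count all reads, and reads whose sequence line contains at least one N.
total_reads=$(awk 'NR % 4 == 2 { n++ } END { print n + 0 }' "$demo_dir/demo.fastq")
reads_with_n=$(awk 'NR % 4 == 2 && /N/ { n++ } END { print n + 0 }' "$demo_dir/demo.fastq")
echo "reads: $total_reads, reads containing N: $reads_with_n"

# Run FastQC itself only if it is installed; reports go to demo_qc/.
if command -v fastqc >/dev/null 2>&1; then
    mkdir -p "$demo_dir/demo_qc"
    fastqc --outdir "$demo_dir/demo_qc" "$demo_dir/demo.fastq"
fi
```

The count is no substitute for the FastQC report, which also shows per-base quality and over-represented sequences, but it is quick to run on a server where you have no graphical display.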

Align your sequence.

Let's assume that (happily) you have good-quality sequence. Your next step should be to align your sequences onto a reference genome or reference transcriptome. Indeed, even if you haven't got high-quality sequence, you should still try to align it, because the alignments can tell you a lot about what went wrong.

For this tutorial, we'll use a spliced alignment tool to align the RNA-Seq reads onto a reference genome. For RNA-Seq data sets, we mainly have used TopHat, from the University of Maryland. Others are available, but since we have the most experience with TopHat and it seems to be one of the more widely used programs, the following examples will demonstrate how to run it.

TopHat is a spliced alignment tool that first runs Bowtie (a non-spliced alignment tool from the same group) and then attempts to align any reads Bowtie couldn't align by splitting them across putative introns. For this reason, to run TopHat you'll have to install Bowtie. You'll also have to install samtools, a program that TopHat and Bowtie use to generate alignment files called "BAM" (binary alignment) files. IGB can display data from BAM files, once you've created an index for them. More on indexing BAM files will come later.

Many different versions of TopHat have been released over the past couple of years and each behaves slightly differently. However, a few things seem to remain stable. First, TopHat will typically report multiple alignments for some number of reads. This is to be expected. However, depending on your experimental goals, you may want to focus on the reads that map exactly once onto the genome. You can figure out which reads mapped to multiple locations by looking at the "NH" tag in each alignment, which gives the number of alignments reported for that read. Also, you should determine the minimum and maximum intron sizes for your genome and provide these as parameters to TopHat. For details on running TopHat, see the TopHat Manual.
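
To make the "NH" idea concrete, here is a small sketch that keeps only reads reported exactly once. It works on SAM text, the plain-text form of a BAM file. The three records below are made up for illustration, and the file names are hypothetical.

```shell
demo_dir=$(mktemp -d)
# Made-up SAM records: r1 and r3 align once (NH:i:1), r2 aligns in 12 places.
{
  printf '@SQ\tSN:chr1\tLN:100000\n'
  printf 'r1\t0\tchr1\t3667\t50\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\n'
  printf 'r2\t0\tchr1\t3697\t1\t10M\t*\t0\t0\tTTGACCATGA\tIIIIIIIIII\tNH:i:12\n'
  printf 'r3\t16\tchr1\t3707\t50\t10M\t*\t0\t0\tGGCATTACGA\tIIIIIIIIII\tNH:i:1\tXS:A:-\n'
} > "$demo_dir/demo.sam"

# Keep header lines (they start with @) and records whose NH tag is exactly 1.
# Optional tags start at column 12, so only those columns are checked.
awk -F'\t' '/^@/ { print; next }
            { for (i = 12; i <= NF; i++) if ($i == "NH:i:1") { print; next } }' \
    "$demo_dir/demo.sam" > "$demo_dir/unique.sam"

unique_reads=$(grep -vc '^@' "$demo_dir/unique.sam")
echo "uniquely mapped reads: $unique_reads"
```

With a real alignment file, the same awk filter can read from `samtools view -h Sample.bam` instead of a SAM file on disk. Comparing the whole tag to `NH:i:1` (rather than searching for that text) avoids matching values such as `NH:i:12`.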

Here is an example invocation of TopHat, fine-tuned for Arabidopsis thaliana. TopHat is a command line program, which means you run it by typing commands into a Unix terminal. (On Mac, this is the Terminal program in the Applications > Utilities folder.) For the rest of this tutorial, wherever you see a line that starts with a "$" sign, this means you type the command into a terminal. (The "$" stands for the Unix prompt; you don't type it.)

Code Block

$ tophat2 -o Sample --max-intron-length 2000 /data/bowtieindex/A_thaliana_Jun_2009 Sample.fastq

However, you're probably better off running TopHat on a fairly powerful machine. If you can find a multi-processor server with 16 GB or more of RAM, I would strongly recommend using it to run the alignment step. If you do get access to a multi-processor server, you should tell TopHat to take advantage of the extra processing power using the -p parameter, which allows you to specify the number of processors TopHat can use.
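
Putting the options together, a multi-threaded run might look like the sketch below. The thread count and minimum intron length are hypothetical values chosen for illustration, not recommendations; the index path and read file are the ones from the example above. The command is printed first and only executed if `tophat2` is installed and the read file exists.

```shell
threads=8          # hypothetical: number of processors to give TopHat (-p)
min_intron=30      # hypothetical value; choose one that suits your genome (-i)
max_intron=2000    # same maximum intron length as the example above (-I)
index=/data/bowtieindex/A_thaliana_Jun_2009
reads=Sample.fastq

# Options come before the index and read file, as in the TopHat usage line.
tophat_cmd="tophat2 -p $threads -i $min_intron -I $max_intron -o Sample $index $reads"
echo "$tophat_cmd"

if command -v tophat2 >/dev/null 2>&1 && [ -f "$reads" ]; then
    # Unquoted on purpose so the shell splits the string into arguments;
    # this is safe here because none of the values contain spaces.
    $tophat_cmd
fi
```

Here `-i` and `-I` are the short forms of `--min-intron-length` and `--max-intron-length`, and `-o` names the output folder.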

To run bowtie2, you have to first make an index file for your genome. And to make an index for your genome, you'll need a fasta file for your genome. This is easy to get, however. You can either download the fasta files directly from the genome data provider (e.g., UCSC or NCBI) or you can get a 2bit file from the IGBQuickLoad Web site and convert it to fasta.
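
As a sketch of the indexing step, the commands below check the sequence names in a fasta file and then build a bowtie2 index from it. The two-sequence genome is made up for illustration and the file names are hypothetical. The index is built only if `bowtie2-build` is installed. The 2bit conversion is shown as a comment because it assumes UCSC's `twoBitToFa` utility, which this tutorial does not otherwise cover.

```shell
demo_dir=$(mktemp -d)
genome_fa="$demo_dir/demo_genome.fa"
# Made-up two-sequence genome, for illustration only.
printf '>chr1\nACGTACGTACGTACGT\n>chr2\nTTGACCATGATTGACC\n' > "$genome_fa"

# List the sequence names; these become the chromosome names in your BAM files.
seq_names=$(grep '^>' "$genome_fa" | sed 's/^>//' | tr '\n' ' ' | sed 's/ $//')
echo "sequences: $seq_names"

# If you start from a 2bit file, one way to convert it (assuming twoBitToFa):
#   twoBitToFa genome.2bit genome.fa

# Build the index: first argument is the fasta file, second is the index name.
if command -v bowtie2-build >/dev/null 2>&1; then
    bowtie2-build "$genome_fa" "$demo_dir/demo_index" > /dev/null
fi
```

Checking the sequence names up front is worthwhile because they need to match the names IGB uses for the same genome, or the alignments will not line up with the annotations.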

To get a sequence data file from IGBQuickLoad.org, go to the genome ...

Process output files

Index (and rename) your alignment file

When TopHat finishes, it will have created a file called "accepted_hits.bam" that contains your alignments. Your next step in the process will be to index this file, creating what's called a "bai" index file. For this, you'll use samtools, a freely available command-line tool with many useful functions. You can learn about samtools at this Web site and download the code from this page hosted at sourceforge.net.

Why make an index? This is a crucial step because it allows programs like IGB to do what is called a "region-based query." That is, once you've started IGB, you can load the BAM file, zoom in on a region, and then ask IGB to load in just the reads that overlap your region of interest. This works because IGB reads the index file, which tells IGB exactly where to look in the larger BAM file (which may be 1 gig or more!) for just the subset of reads that are relevant to the region in view.

To make an index (.bai) file for a BAM file, you typically would first have to sort the BAM file. However, the "accepted_hits.bam" file from TopHat should already be sorted. So all you should need to do is rename it and make an index for it, like so:

Code Block

$ mv accepted_hits.bam Sample.bam
$ samtools index Sample.bam

which will create an index file named Sample.bam.bai.
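
You can try a region-based query yourself from the command line, which is also a quick way to confirm that the index works. In this sketch the region is a hypothetical example, and the samtools commands run only if `samtools` is installed and Sample.bam exists in the current folder.

```shell
bam=Sample.bam
region="chr1:3667-3755"    # hypothetical region, written as name:start-end

if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ]; then
    # The @HD header line records the sort order (SO:coordinate if sorted by position).
    samtools view -H "$bam" | grep '^@HD'
    # Region query: uses Sample.bam.bai to fetch only reads overlapping the region.
    samtools view "$bam" "$region" | head -n 5
else
    echo "samtools or $bam not found; skipping region query for $region"
fi
```

If the @HD line does not report SO:coordinate, sort the file before indexing it (in samtools 1.x, `samtools sort -o Sample.sorted.bam Sample.bam`) and then index the sorted file.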

...

Make a bedgraph (wiggle) genome coverage file

...

A genome coverage (or depth) file reports the number of reads overlapping regions or individual bases of the reference genome. Older versions of TopHat used to create a coverage file, but more recent versions of TopHat no longer do this. However, you can create coverage graphs using the bedtools genomecov command. For information on where to get bedtools and how to install it, visit the bedtools documentation.

To create a genome coverage bedgraph file from your BAM alignments file on a Linux machine, do this:

Code Block
$ bedtools genomecov -ibam Sample.bam -split -bg > Sample.bedgraph

A new file named Sample.bedgraph should appear, with a structure that looks a bit like:

Code Block

chr1 3667 3697 2
chr1 3697 3707 3
chr1 3707 3726 5
chr1 3726 3742 6
chr1 3742 3744 4
chr1 3744 3749 5
chr1 3749 3755 6
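
Each bedgraph line gives a sequence name, an interval start, an interval end, and the read depth over that interval. Starts are zero-based and ends are exclusive, so an interval's length is simply end minus start. As a quick sanity check on a coverage file, the sketch below recreates the seven example intervals shown above in a temporary file (a hypothetical location) and computes the number of covered bases and the mean depth over them.

```shell
demo_dir=$(mktemp -d)
# The seven example intervals from above, written as tab-separated columns.
printf '%s\t%s\t%s\t%s\n' \
    chr1 3667 3697 2 \
    chr1 3697 3707 3 \
    chr1 3707 3726 5 \
    chr1 3726 3742 6 \
    chr1 3742 3744 4 \
    chr1 3744 3749 5 \
    chr1 3749 3755 6 > "$demo_dir/Sample.bedgraph"

# For each interval: length = end - start; weight the depth by that length.
summary=$(awk -F'\t' '{ len = $3 - $2; bases += len; depth_sum += len * $4 }
    END { printf "%d %d %.2f", bases, depth_sum, depth_sum / bases }' \
    "$demo_dir/Sample.bedgraph")
echo "covered bases, summed depth, mean depth: $summary"
```

For these intervals the script reports 88 covered bases, a summed depth of 350, and a mean depth of 3.98.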

At this point, you should be able to open the "bedgraph" file directly in IGB. However, IGB will work better if you sort the file, compress it using bgzip, and then make an index for the file using tabix.

Code Block

$ sort -k1,1 -k2,2n Sample.bedgraph | bgzip > Sample.bedgraph.gz
$ tabix -s 1 -b 2 -e 3 Sample.bedgraph.gz

The sort command reads the entire Sample.bedgraph file, sorts the data, and then outputs the data in sorted order. The sorted data will first be ordered by chromosome name (field 1) and then by interval start position (field 2). In the command above, the sorted data are piped into the bgzip compression tool and then the compressed, sorted data are saved to a file called Sample.bedgraph.gz. The tabix command then creates an index file (extension .tbi) for the newly compressed, sorted Sample.bedgraph.gz file.

Note: To view the Sample.bedgraph.gz file in IGB, you should always keep the index file (named Sample.bedgraph.gz.tbi) in the same folder.
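
To see why the second sort key needs the `n` (numeric) modifier, here is a small sketch using made-up intervals in a temporary folder. Without `n`, start position 3667 would sort before 500 because text comparison looks at the first character. The bgzip and tabix steps run only if both tools are installed.

```shell
demo_dir=$(mktemp -d)
# Made-up, deliberately unsorted intervals for illustration.
printf '%s\t%s\t%s\t%s\n' \
    chr2 100 200 1 \
    chr1 3697 3707 3 \
    chr1 500 600 2 \
    chr1 3667 3697 2 > "$demo_dir/unsorted.bedgraph"

# Sort by name (field 1, as text), then by start position (field 2, as a number).
sort -k1,1 -k2,2n "$demo_dir/unsorted.bedgraph" > "$demo_dir/sorted.bedgraph"
sorted_starts=$(cut -f2 "$demo_dir/sorted.bedgraph" | tr '\n' ' ' | sed 's/ $//')
echo "start positions after sorting: $sorted_starts"

if command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
    bgzip -c "$demo_dir/sorted.bedgraph" > "$demo_dir/sorted.bedgraph.gz"
    tabix -s 1 -b 2 -e 3 "$demo_dir/sorted.bedgraph.gz"
    # Region query against the compressed file, using the .tbi index.
    tabix "$demo_dir/sorted.bedgraph.gz" chr1:3667-3755
fi
```

The final tabix command is the same kind of region-based query that IGB performs when you load data for the region in view.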

To get a copy of bgzip and tabix, go to ... . Note that development of tabix has moved to this repository at GitHub.

View data in IGB

Once you've created the files, you should then be able to open them in IGB. However, be sure to keep the index files (.bai from samtools and .tbi from tabix) in the same folder with their corresponding alignments or bedgraph files.
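
A missing index file is an easy mistake to make when copying results between machines, so a quick check before opening IGB can save time. The sketch below uses empty placeholder files in a temporary folder (hypothetical names, for illustration only) and reports any BAM or bedgraph.gz file that lacks its index.

```shell
demo_dir=$(mktemp -d)
# Empty placeholder files standing in for real outputs; the .tbi is left out on purpose.
touch "$demo_dir/Sample.bam" "$demo_dir/Sample.bam.bai" "$demo_dir/Sample.bedgraph.gz"

missing=""
for data_file in "$demo_dir"/*.bam "$demo_dir"/*.bedgraph.gz; do
    # BAM files expect a .bai index; bgzip-compressed bedgraph files expect a .tbi index.
    case "$data_file" in
        *.bam) index_file="$data_file.bai" ;;
        *)     index_file="$data_file.tbi" ;;
    esac
    if [ ! -f "$index_file" ]; then
        missing="$missing $(basename "$index_file")"
    fi
done
echo "missing index files:${missing:- none}"
```

To check your own results, point the loop at the folder that holds your files instead of the temporary one.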

Once you open them in IGB, scroll and zoom to a smallish region of your genome -- a region where you can see maybe five or six different genes. (Use the "search" tab to zoom in on a particular gene of interest.)

Then, click "Load Data" to load data from the files you selected. (All selected files should appear in the Data Management Table on the right side of the Data Access tab.)

If all goes well, the bedgraph file data should appear in a graph track. To change its appearance (e.g., add a y-axis or change the color), select the track label and click the Graph Adjuster tab, which contains various controls for working with graphs.

The reads (from the BAM file) appear in tracks above and below the axis. To change how reads look, choose File->Preferences->Tracks and use the options you see there to merge all the reads into a single +/- track, use color to indicate strand, change the number of reads that are shown in stacks, and so on.

For more information on working with tracks and graphs, please see the IGB User's Guide.

And remember, if you have questions, get in touch with us or use Google. All IGB tutorials and user guide materials are publicly available on our wiki and have (mostly) been indexed in Google.
