
Introduction

High-throughput sequencing of cDNA, also called RNA-Seq, can provide many levels of information about gene expression, such as information about previously unannotated genes, expression of pseudogenes, differential expression both within and across samples, and alternative splicing. Making the libraries and sending them out for sequencing is only the first step in performing an RNA-Seq experiment. What most people find is that processing, analyzing, and interpreting the data can often be just as time-consuming. Take heart. These data sets are so rich that you may never fully exhaust their potential. However, there are a few first steps you'll want to perform right away, as well as some quality control steps that will help you assess how well your experiment worked.

What follows is a description of RNA-Seq processing steps we do fairly routinely in the Loraine lab, along with links to software programs we've developed for in-house use. Please be aware that these programs are very much works in progress and so may not always work as advertised. If you find bugs or inconsistencies, please let us know: contact Ann (aloraine@uncc.edu) with feedback, suggestions, and bug reports.

RNA-Seq tutorial

The following protocol describes processing data from the Illumina HiSeq pipeline. Whether you should perform exactly these steps with your data sets will depend on your data: its age, whether you've done single-end or paired-end sequencing, the read lengths, your reference genome, and so on. For example, when you run TopHat, you may need to adjust parameters to accommodate smaller read lengths if you are running data from pre-HiSeq instruments.
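If you are not sure what read lengths you have, a quick check like the sketch below can tell you before you choose TopHat parameters. It assumes a FASTQ file with four lines per record; the function name read_length_range and the tiny demo file are made up for illustration.

```shell
# Report the shortest and longest read in a FASTQ file.
# Assumes four lines per record, with the sequence on line 2 of each record.
read_length_range() {
    awk 'NR % 4 == 2 {
             n = length($0)
             if (min == "" || n < min) min = n
             if (n > max) max = n
         }
         END { printf "min %d, max %d\n", min, max }' "$1"
}

# Tiny made-up FASTQ file for demonstration (not real data).
demo_reads=$(mktemp)
cat > "$demo_reads" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTAC
+
IIIIII
EOF

length_summary=$(read_length_range "$demo_reads")
echo "$length_summary"
rm -f "$demo_reads"
```

On your own data you would call the function on your reads file, e.g. read_length_range Sample.fastq.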

Check sequence quality.

Your first step upon downloading your reads will be to check the quality of your sequence. One of the easiest-to-use tools we (in the Loraine lab) have found for quality checking is a terrific program called FastQC, from the Babraham Institute. You can run it interactively or as a command line tool, and it's written in Java, which means you can run it on a Mac, a Linux machine, or a Windows computer. As with IGB, any computer that supports Java can run FastQC.

What you should be looking for in your data is evidence of poor sequencing quality as well as spots in your sequence where you have lots and lots of N characters; these usually correspond to places with poor quality. If your reads are generally low quality, you should ask your sequencing facility to try again. If your data have many, many over-represented sequences (something FastQC can tell you), then you may want to make another library for that sample. In our experience, low yields and low quality generally happen because something went wrong with sequencing. Over-represented sequences and so-called PCR duplicates (many copies of the same sequence) are usually due to problems in library construction.
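FastQC gives the full picture, but a rough first look at N content is possible with standard Unix tools. The sketch below counts reads, and reads containing at least one N, in a FASTQ file. It assumes four lines per record; the function name count_n_reads and the demo file are hypothetical.

```shell
# Count total reads and reads containing at least one N base.
# Assumes four lines per FASTQ record; the sequence is line 2 of each record.
count_n_reads() {
    awk 'NR % 4 == 2 { total++; if ($0 ~ /N/) with_n++ }
         END { printf "%d reads, %d with N\n", total, with_n }' "$1"
}

# Tiny made-up FASTQ file for demonstration (not real data).
demo_fastq=$(mktemp)
cat > "$demo_fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTNNNN
+
IIII####
@read3
NACGTACG
+
#IIIIIII
EOF

n_summary=$(count_n_reads "$demo_fastq")
echo "$n_summary"
rm -f "$demo_fastq"
```

This is only a sanity check and does not replace the per-position quality plots FastQC produces.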

Align your sequence.

Let's assume that (happily) you have good-quality sequence. Your next step should be to align your sequences onto a reference genome or reference transcriptome. Indeed, even if you haven't got high-quality sequence, you should still try to align it, because the alignments can tell you a lot about what went wrong. For this tutorial, we'll use a spliced alignment tool to align the RNA-Seq reads onto a reference genome.

For RNA-Seq data sets, we mainly have used TopHat, from the University of Maryland. Others are available, but since we have the most experience with TopHat and it seems to be one of the more widely used programs, the following examples will demonstrate how to run it.

TopHat is a spliced alignment tool that first runs BowTie (a non-spliced alignment tool from the same group) and then attempts to align any reads BowTie couldn't align by splitting them across putative introns. For this reason, to run TopHat you'll have to install BowTie. You'll also have to install samtools, a program that TopHat and BowTie use to generate alignment files called "BAM" (binary alignment) files. IGB can display data from BAM files, once you've created an index for them. More on indexing BAM files will come later.

Many different versions of TopHat have been released over the past couple of years and each behaves slightly differently. However, a few things seem to remain stable. First, TopHat will typically report multiple alignments for some number of reads. This is to be expected. However, depending on your experimental goals, you may want to focus on the reads that map exactly once onto the genome. You can figure out which reads mapped to multiple locations by looking at the "NH" tag in each alignment. Also, you should determine the minimum and maximum intron sizes for your genome and provide these as parameters to TopHat. For details on running TopHat, see the TopHat Manual.
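To make the "NH" idea concrete, here is a minimal sketch that keeps only alignments reported exactly once (NH:i:1) from SAM-formatted text. The function name keep_unique and the demo records are made up for illustration, and the sketch assumes the aligner writes an NH tag on every alignment line.

```shell
# Keep SAM header lines plus alignments whose NH tag equals 1.
# Optional tags start at column 12 of a tab-delimited SAM record.
keep_unique() {
    awk -F '\t' '/^@/ { print; next }
                 { for (i = 12; i <= NF; i++) if ($i == "NH:i:1") { print; next } }'
}

# Made-up SAM records: readA aligns once, readB aligns twice.
unique_hits=$(
    {
        printf '@HD\tVN:1.0\tSO:coordinate\n'
        printf 'readA\t0\tchr1\t100\t50\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:1\n'
        printf 'readB\t0\tchr1\t200\t3\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:2\n'
        printf 'readB\t256\tchr2\t300\t3\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:2\n'
    } | keep_unique | grep -vc '^@'
)
echo "$unique_hits uniquely mapped alignment(s)"
```

With a real BAM file, you would typically feed the function from samtools view -h, which prints the header and alignments as SAM text.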

TopHat is a command line program, which means you run it by typing commands into a Unix terminal. (On a Mac, this is the Terminal program in the Applications Utilities folder.) For the rest of this tutorial, wherever you see a line that starts with a "$" sign, this means you type the command into a terminal. (The "$" is the Unix prompt.)

Here is an example invocation of TopHat, fine-tuned for Arabidopsis thaliana.

Code Block
$ tophat2 --max-intron-length 2000 -o Sample /data/bowtieindex/A_thaliana_Jun_2009 Sample.fastq

However, you're probably better off running TopHat on a fairly powerful machine. If you can find a multi-processor server with 16 GB or more of RAM, I would strongly recommend using it to run the alignment step. If you do get access to a multi-processor server, you should tell TopHat to take advantage of the extra processing power using the -p parameter, which allows you to specify the number of processors TopHat can use.
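As a sketch of what adding -p might look like, the snippet below assembles the same TopHat command as the earlier example with a thread count added, prints it, and runs it only if tophat2 and the reads file are actually present. The thread count of 8 is a hypothetical value; pick one that matches your server.

```shell
# Hypothetical thread count; match it to the cores available on your server.
THREADS=8

# Same index and reads as the earlier example, with -p added.
# "set --" stores the command words so they can be printed and then run unchanged.
set -- tophat2 -p "$THREADS" --max-intron-length 2000 -o Sample \
    /data/bowtieindex/A_thaliana_Jun_2009 Sample.fastq

tophat_cmd="$*"
echo "$tophat_cmd"

# Run only if TopHat is installed and the reads file exists; otherwise this is a dry run.
if command -v tophat2 >/dev/null 2>&1 && [ -f Sample.fastq ]; then
    "$@"
fi
```

Printing the command first makes it easy to keep a record of exactly which parameters were used for each sample.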

To run bowtie2, you have to first make an index file for your genome. And to make an index for your genome, you'll need a fasta file for your genome. This is easy to get, however. You can either download the fasta files directly from the genome data provider (e.g., UCSC or NCBI) or you can get a 2bit file from the IGBQuickLoad Web site and convert it to fasta. To get a sequence data file from IGBQuickLoad.org, go to the genome directory and download the "2bit" file you find there. Then use Jim Kent's 2bit2 ...


TopHat will create a directory called Sample (the -o parameter).

In this example, I'll run bowtie2-build using a fasta file A_thaliana_Jun_2009.fa to create index files bowtie2 uses to speed up the alignment process.
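The bowtie2-build step might look something like the sketch below. It assumes the fasta file sits in the current directory and that you want the index base name to match it; the helper index_complete is hypothetical and simply checks for the six .bt2 files a standard (small-genome) bowtie2 index consists of.

```shell
# Report whether all six bowtie2 index files exist for a given index base name.
# Assumes a "small" index (.bt2 files); very large genomes use .bt2l instead.
index_complete() {
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        [ -f "$1.$suffix" ] || return 1
    done
    return 0
}

# Build the index only if it is missing, bowtie2-build is installed,
# and the fasta file is present.
base=A_thaliana_Jun_2009
if ! index_complete "$base" && command -v bowtie2-build >/dev/null 2>&1 \
        && [ -f "$base.fa" ]; then
    bowtie2-build "$base.fa" "$base"
fi
```

The index base name (here A_thaliana_Jun_2009, without any file extension) is what you then pass to TopHat as the index argument.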

In IGB, we use files in "2bit" format to represent genomic sequence data. It's easy to convert a 2bit file into a fasta file and back again. To create a fasta file, do something like:

Code Block

$ twoBitToFa A_thaliana_Jun_2009.2bit A_thaliana_Jun_2009.fa

Please note that you need to make sure that the names of chromosomes in the fasta file (and hence the BowTie index files) match with what IGB uses. For Arabidopsis, chromosome names are chr1, chr2, chr3, chr4, chr5, chrC, and chrM.

IGB has a synonyms system that allows it to "understand" that chr1 and Chr1 are really the same thing, but other programs you will run into might not be able to match names in this way. For this reason, it's a good idea to use the same names for chromosomes throughout all the different steps of processing data.
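One way to check chromosome names before building an index is to list the sequence names in your fasta file, as in this sketch. The function name fasta_names and the demo file are made up; the check for a lowercase "chr" prefix reflects the Arabidopsis names listed above and would need adjusting for other genomes.

```shell
# List sequence names in a FASTA file: header lines start with ">",
# and the name is the text up to the first whitespace.
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Tiny made-up FASTA file for demonstration (not real sequence).
demo_fa=$(mktemp)
cat > "$demo_fa" <<'EOF'
>chr1 made-up description
ACGT
>Chr2
ACGT
>chrC
ACGT
EOF

# Flag names that do not start with lowercase "chr".
odd_names=$(fasta_names "$demo_fa" | grep -v '^chr')
echo "Names to double-check: $odd_names"
rm -f "$demo_fa"
```

Here the demo flags Chr2, which IGB's synonyms would handle but other tools might treat as a different sequence from chr2.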

Process output files

Index (and rename) your alignment file

When TopHat finishes, it will have created a file called "accepted_hits.bam" that contains your alignments. Your next step in the process will be to index this file, creating what's called a "bai" index file. For this, you'll use samtools, a freely available command-line tool with many useful functions. You can learn about samtools at this Web site and download the code from this page hosted at sourceforge.net.

Why make an index? This is a crucial step because it allows programs like IGB to do what is called a "region-based query." That is, once you've started IGB, you can load the BAM file, zoom in on a region, and then ask IGB to load in just the reads that overlap your region of interest. This works because IGB reads the index file, which tells IGB exactly where to look in the larger BAM file (which may be 1 GB or more!) for just the subset of reads that are relevant to the region in view.

To make an index (.bai) file for a BAM file, you typically would first have to sort the BAM file. However, the "accepted_hits.bam" file from TopHat should already be sorted. So all you should need to do next is rename it and make an index for it, like so:

Code Block

$ mv accepted_hits.bam Sample.bam
$ samtools index Sample.bam

This will create an index file named Sample.bam.bai.
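If you want to confirm that a BAM file really is sorted before indexing it, one option is to look at its header. The sketch below assumes the aligner recorded the sort order in the @HD header line (SO:coordinate); the function name is_coordinate_sorted and the demo header are hypothetical.

```shell
# Succeeds if a SAM header read from standard input declares coordinate sorting,
# that is, an @HD line carrying SO:coordinate.
is_coordinate_sorted() {
    grep '^@HD' | grep -q 'SO:coordinate'
}

# Made-up header for demonstration.
if printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n' | is_coordinate_sorted; then
    sort_state="sorted"
else
    sort_state="unsorted"
fi
echo "$sort_state"
```

With a real file you would pipe in the header from samtools view -H Sample.bam. If the header does not declare coordinate order, sort the file with samtools sort before indexing; the exact sort syntax differs between samtools versions, so check the help for the version you have installed.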

Make a bedgraph (wiggle) genome coverage file

A genome coverage (or depth) file reports the number of reads overlapping regions or individual bases of the reference genome. Older versions of TopHat used to create a coverage file, but more recent versions of TopHat no longer do this. However, you can create coverage graphs using the bedtools genomecov command. For information on where to get bedtools and how to install it, visit the bedtools documentation.

To create a genome coverage bedgraph file from your BAM alignments file on a Linux machine, do this:

Code Block
$ bedtools genomecov -ibam Sample.bam -split -bg > Sample.bedgraph

Next, a new file named Sample.bedgraph should appear, with a structure that looks a bit like:

Code Block

chr1    3667    3697    2
chr1    3697    3707    3
chr1    3707    3726    5
chr1    3726    3742    6
chr1    3742    3744    4
chr1    3744    3749    5
chr1    3749    3755    6
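Each bedgraph line gives a chromosome name, an interval start (0-based), an interval end (exclusive), and the read depth over that interval. As a small worked example, the sketch below uses the seven lines shown above to compute how many bases are covered and the mean depth over them; the function name bedgraph_summary is made up for illustration.

```shell
# Summarize a bedgraph: bases covered and mean depth over covered bases.
# Columns: chromosome, start (0-based), end (exclusive), depth.
bedgraph_summary() {
    awk '{ len = $3 - $2; covered += len; total += len * $4 }
         END {
             if (covered > 0)
                 printf "%d bases covered, mean depth %.2f\n", covered, total / covered
             else
                 print "no intervals found"
         }' "$1"
}

# The example lines from this tutorial, written to a temporary file.
demo_bg=$(mktemp)
cat > "$demo_bg" <<'EOF'
chr1    3667    3697    2
chr1    3697    3707    3
chr1    3707    3726    5
chr1    3726    3742    6
chr1    3742    3744    4
chr1    3744    3749    5
chr1    3749    3755    6
EOF

bg_summary=$(bedgraph_summary "$demo_bg")
echo "$bg_summary"
rm -f "$demo_bg"
```

The intervals span positions 3667 to 3755 (88 bases), and the depth-weighted total is 350, giving a mean depth of about 3.98.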

At this point, you should be able to open the "bedgraph" file directly in IGB. However, IGB will work better if you sort the file, compress it using bgzip, and then make an index for the file using tabix.

Code Block
$ sort -k1,1 -k2,2n Sample.bedgraph | bgzip > Sample.bedgraph.gz
$ tabix -s 1 -b 2 -e 3 Sample.bedgraph.gz

The sort command will read the entire Sample.bedgraph file into memory, sort the data, and then output the data in sorted order. The sorted data will first be ordered by chromosome name (field 1) and then by interval start position (field 2). In the command above, the sorted data are piped into the bgzip compression tool and then the compressed, sorted data are saved to a file called Sample.bedgraph.gz. The tabix command then creates an index file (extension .tbi) for the newly compressed, sorted Sample.bedgraph.gz file.

Note: To view the Sample.bedgraph.gz file in IGB, you should always keep the index file (named Sample.bedgraph.gz.tbi) in the same folder.

To get a copy of bgzip and tabix, go to tabix download at SourceForge. Note that development of tabix has moved to this repository at github.
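Because tabix needs position-sorted input, it can be handy to test whether a bedgraph file is already in order. The sketch below uses sort -c, which checks order without rewriting the file. The function name is_sorted_bedgraph and the demo files are hypothetical, and setting LC_ALL=C is an added precaution (not part of the sort command above) to keep the ordering independent of locale settings.

```shell
# Succeeds if a bedgraph file is already ordered by chromosome name, then numeric start.
is_sorted_bedgraph() {
    LC_ALL=C sort -c -k1,1 -k2,2n "$1" 2>/dev/null
}

# Two tiny made-up files: one in order, one not.
bg_dir=$(mktemp -d)
printf 'chr1\t3667\t3697\t2\nchr1\t3697\t3707\t3\nchr2\t100\t200\t1\n' > "$bg_dir/ordered.bedgraph"
printf 'chr1\t3697\t3707\t3\nchr1\t3667\t3697\t2\n' > "$bg_dir/shuffled.bedgraph"

if is_sorted_bedgraph "$bg_dir/ordered.bedgraph"; then ordered_state="ok"; else ordered_state="needs sorting"; fi
if is_sorted_bedgraph "$bg_dir/shuffled.bedgraph"; then shuffled_state="ok"; else shuffled_state="needs sorting"; fi
echo "ordered: $ordered_state, shuffled: $shuffled_state"
rm -rf "$bg_dir"
```

If the check fails, run the sort and bgzip step shown earlier before calling tabix.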

View data in IGB

Once you've created the files, you should then be able to open them in IGB. However, be sure to keep the index files (.bai from samtools and .tbi from tabix) in the same folder with their corresponding alignments or bedgraph files.
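A quick way to confirm that every data file has its index next to it is sketched below. It assumes the naming used in this tutorial (Sample.bam with Sample.bam.bai, Sample.bedgraph.gz with Sample.bedgraph.gz.tbi); the function name missing_indexes and the demo folder are made up for illustration.

```shell
# List data files in a folder that lack their index file.
# Assumes X.bam is indexed by X.bam.bai and X.gz by X.gz.tbi.
missing_indexes() {
    for f in "$1"/*.bam "$1"/*.gz; do
        [ -e "$f" ] || continue          # skip glob patterns that matched nothing
        case "$f" in
            *.bam) idx="$f.bai" ;;
            *.gz)  idx="$f.tbi" ;;
        esac
        [ -e "$idx" ] || basename "$f"
    done
}

# Made-up folder: the BAM file has its index, the bedgraph file does not.
demo_dir=$(mktemp -d)
: > "$demo_dir/Sample.bam"
: > "$demo_dir/Sample.bam.bai"
: > "$demo_dir/Sample.bedgraph.gz"

missing=$(missing_indexes "$demo_dir")
echo "Missing an index: $missing"
rm -rf "$demo_dir"
```

Any file the function prints needs its index rebuilt (samtools index for BAM, tabix for bedgraph) or moved back into the folder.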

Once you open them in IGB, scroll and zoom to a smallish region of your genome, a region where you can see maybe five or six different genes. (Use the "search" tab to zoom in on a particular gene of interest.) Then, click "Load Data" to load data from the files you selected. (All selected files should appear in the Data Management Table on the right side of the Data Access tab.) If all goes well, the bedgraph file data should appear in a graph track. To change its appearance, e.g., add a y-axis, change the color, etc., select the track label and click the Graph Adjuster tab, which contains various controls for working with graphs. The reads (from the BAM file) appear in tracks above and below the axis. To change how reads look, choose File->Preferences->Tracks and use the options you see there to merge all the reads into a single +/- track, use color to indicate strand, change the number of reads that are shown in stacks, and so on. For more information on working with tracks and graphs, please see the IGB User's Guide.

And remember, if you have questions, get in touch with us or use Google. All IGB tutorials and user guide materials are publicly available on our wiki and have (mostly) been indexed in Google.