Introduction

The following instructions assume you are not going to make a new genome version directory and that all you need to do is update refGene, all_mrna and all_est files.

...

For genomes harvested from UCSC, the foundation gene annotation data typically come from the UCSC refGene table. If a genome does not have a refGene table, we instead use ensGene or whichever other gene data set looks the most complete and useful.

Understand IGB QuickLoad naming conventions

Note

IGB QuickLoad file names should include the genome version and the UCSC table name. The title in the annots.xml file should match the track name but should not include the species name, as that will be obvious to the user and may make linkout patterns harder to maintain.

...

However, the UCSC genome browser sometimes includes species names in the titles of its mRNA tracks. IGB QuickLoad data sets should not include species names in their titles. For example, the title of an mRNA track is always "mRNA" and never "Zebrafish mRNA."

Command-line utilities you'll need

  • tabix and bgzip from htslib
  • UNIX wget (not installed by default on Mac but available on most other UNIX systems)
  • UNIX sort
  • UNIX gunzip

...

Code Block
export PATH=.:$HOME/bin:$PATH
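
Once PATH is set, it may be worth confirming that the utilities listed above can actually be found. The following is a minimal sketch, not part of the QuickLoad tooling; the function name missing_tools is hypothetical.

```shell
# missing_tools TOOL...
# Print each named tool that cannot be found on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   missing_tools tabix bgzip wget sort gunzip
```

If the function prints nothing, every tool was found.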

Step-by-step guide to updating UCSC data sets in IGB QuickLoad

Get the QuickLoad data repo

Check out or update a copy of IGB QuickLoad data and source code directories.

Use git to obtain a copy of the source code:

Code Block
$ git clone https://bitbucket.org/lorainelab/genomesrc

If you already have a copy, then update it. Change into your local copy and run:

...

Use svn to get a copy of the QuickLoad data repository: TO BE UPDATED

Code Block
$ svn co https://svn.transvar.org/repos/genomes/trunk/pub/quickload  # TO BE UPDATED

If you already have a copy, just update using svn up. Change into your checked-out, local copy and run:

Code Block
$ svn up 

Add quickloadsrc to your PATH (to run the Python code there)

Add quickloadsrc to your PATH, as it contains a Python script you'll use to create BED detail files from ordinary BED files. Edit your .bash_profile file as above:

Code Block
export PATH=.:$HOME/quickload_src:$HOME/bin:$PATH

Pick a genome to update and change into the genome version directory

Open a Web browser and find the genome in the Table Browser at UCSC

Go to http://genome.ucsc.edu/cgi-bin/hgTables

Update refGene data set

File name: G_species_strain_MMM_YYYY_refGene.gz
Data set title: RefSeq Genes

Get refGene data from UCSC Table Browser

Configure Table Browser

Configure Table Browser with the following settings:

...

Note

UCSC data set file names saved in IGB QuickLoad should always include the IGB genome version name followed by an underscore character followed by the UCSC table name. The title field in the annots.xml file should always be the UCSC track name because that is what users will recognize from having used the UCSC genome browser.

Click the "get output" button to download the data

Get gene info and accession info files from NCBI ftp site

Code Block
$ wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
$ wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

Create BED detail file with gene information

Use ucscToBedDetail.py (from https://bitbucket.org/lorainelab/genomes_src) to create a new BED file with gene symbol and description.

...

You can also provide these files using options -a and -g if you have saved them to a different location. See the script documentation for details.

Sort, compress, and index

Code Block
$ sort -k1,1 -k2,2n G_species_MMM_YYYY_refGene.bed | bgzip > G_species_MMM_YYYY_refGene.bed.gz
$ tabix -p bed G_species_MMM_YYYY_refGene.bed.gz

Validate the file

Check that it has data

Code Block
$ gunzip -c G_species_MMM_YYYY_refGene.bed.gz | wc -l

The wc command should print the number of lines in the file, which should equal the number of rows in the corresponding refGene table. To find out how many rows the refGene table contains, click "describe table schema" in the Table Browser.
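
The comparison can also be scripted. The following is a minimal sketch, assuming you have noted the row count reported by the Table Browser; the function names and the row count in the example are hypothetical.

```shell
# count_rows FILE
# Print the number of lines in a gzip- or bgzip-compressed file.
count_rows() {
    gunzip -c "$1" | wc -l | tr -d ' '
}

# check_row_count FILE EXPECTED
# Succeed only if FILE contains exactly EXPECTED lines.
check_row_count() {
    actual=$(count_rows "$1")
    if [ "$actual" -eq "$2" ]; then
        echo "OK: $1 has $actual lines"
    else
        echo "MISMATCH: $1 has $actual lines, expected $2" >&2
        return 1
    fi
}

# Example (45000 is a made-up row count):
#   check_row_count G_species_MMM_YYYY_refGene.bed.gz 45000
```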

Try to open it in IGB

Open the file and change the load mode to whole genome. Click the genome row in the Current Sequence table. You should see something above every chromosome.

If some chromosomes have no data, go back to the table browser and use the region text area to confirm there was no data for the given chromosome.

Check the chromosomes

Make sure the number of chromosomes listed in the file does not exceed the number of chromosomes listed in the genome descriptor file genome.txt:

...

The first line counts the number of unique sequences appearing in the first column of the BED file. The second line counts the number of lines in the genome.txt file.
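
Comparing counts will not reveal a sequence name that is simply misspelled or absent from genome.txt. As a stricter check, the sketch below lists names that appear in the data file but not in genome.txt. It assumes genome.txt is non-empty, tab-delimited, and has the sequence name in its first column; the function name missing_chromosomes is hypothetical.

```shell
# missing_chromosomes DATA_GZ GENOME_TXT [COLUMN]
# Print each sequence name found in COLUMN (default 1) of the compressed,
# tab-delimited data file that is not listed in column 1 of GENOME_TXT.
missing_chromosomes() {
    gunzip -c "$1" | awk -F'\t' -v col="${3:-1}" '
        # First input (genome.txt): remember every known sequence name.
        NR == FNR { known[$1] = 1; next }
        # Second input (data rows on stdin): report each unknown name once.
        !($col in known) && !seen[$col]++ { print $col }
    ' "$2" -
}

# Examples:
#   missing_chromosomes G_species_MMM_YYYY_refGene.bed.gz genome.txt
#   missing_chromosomes G_species_MMM_YYYY_all_mrna.psl.gz genome.txt 14
```

No output means every sequence name in the file is known. The second example passes column 14 because that is the sequence column the PSL files in this guide are sorted and indexed on.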

Edit annots.xml

The annots.xml file description for the refGene annotations contains the date the data were downloaded. Edit the file accordingly to reflect today's date.

Check in the new files

Use svn status command to double-check which files you've changed:

...

It should print something like:

Code Block
M  G_species_MMM_YYYY_refGene.bed.gz
M  G_species_MMM_YYYY_refGene.bed.gz.tbi
M  annots.xml

...

Warning

These are the only files that should have changed. If others are different, something has gone wrong.

Check in the files one-by-one:

Code Block
svn ci G_species_MMM_YYYY_refGene.bed.gz -m "Enter a message here."
svn ci G_species_MMM_YYYY_refGene.bed.gz.tbi -m "Enter a message here."
svn ci annots.xml -m "Enter a message here."

Update all_mrna

Get mRNA data from UCSC Table Browser

Configure Table Browser

Configure Table Browser with the following settings:

  • choose assembly using release date to match it with IGB QuickLoad genome version
  • choose group: mRNA and EST tracks
  • choose track: SPECIES mRNA
  • choose table: all_mrna
  • choose output format: selected fields from primary and related tables
  • enter output file G_species_MMM_YYYY_all_mrna.psl.gz
    • The prefix should be THE SAME AS the genome version directory name, with _all_mrna appended.
  • file type returned: choose gzip compressed

Warning

Do not capitalize "rna" in mrna. The data set suffix should be identical to the UCSC table name, and all_mrna is spelled with lower-case letters.
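
The naming rule can be checked mechanically before you go further. This is a minimal sketch; the function name check_quickload_name is hypothetical, and it assumes you run it from inside the genome version directory unless you pass the directory name explicitly.

```shell
# check_quickload_name FILE TABLE [GENOME_DIR]
# Succeed only if FILE starts with "<genome version>_<table>." where the
# genome version defaults to the name of the current directory.
# The comparison is case-sensitive, so all_mRNA will not match all_mrna.
check_quickload_name() {
    genome=${3:-$(basename "$PWD")}
    case "$1" in
        "${genome}_$2".*)
            echo "OK: $1"
            ;;
        *)
            echo "BAD: $1 should start with ${genome}_$2." >&2
            return 1
            ;;
    esac
}

# Example:
#   check_quickload_name G_species_MMM_YYYY_all_mrna.psl.gz all_mrna
```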

Click "get output" and configure fields

The browser will then show a new screen from which you can select specific fields. For this, select all fields except the field named "bin" (top of the list).

Click "get output" again to download the file.

Create sorted, compressed PSL file minus header

Use grep, sort, and bgzip to make a PSL file:

Code Block
$ gunzip -c ~/Downloads/G_species_MMM_YYYY_all_mrna.psl.gz | grep -v '^#' | sort -k14,14 -k16,16n | bgzip > G_species_MMM_YYYY_all_mrna.psl.gz
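
Indexing depends on the file being sorted by sequence name and start position, so it can be useful to confirm the order before moving on. The sketch below reuses the sort keys from the command above with sort -c, which checks order without rewriting anything; the function name psl_is_sorted is hypothetical. Run it in the same locale you used for the original sort.

```shell
# psl_is_sorted FILE
# Succeed only if the compressed PSL file is already ordered by
# column 14 (sequence name) and then numerically by column 16 (start).
# On failure, sort reports the first out-of-order line on stderr.
psl_is_sorted() {
    gunzip -c "$1" | sort -c -k14,14 -k16,16n
}

# Example:
#   psl_is_sorted G_species_MMM_YYYY_all_mrna.psl.gz && echo "sorted"
```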

Use tabix to index the sorted, compressed PSL file

Code Block
$ tabix -s 14 -b 16 -e 17 G_species_MMM_YYYY_all_mrna.psl.gz

Validate the file

Check that it has data

Code Block
$ gunzip -c G_species_MMM_YYYY_all_mrna.psl.gz | wc -l

...

Try to open it in IGB

Open the file and change the load mode to whole genome. Click the genome row in the Current Sequence table. You should see something above every chromosome.

If some chromosomes have no data, go back to the table browser and use the region text area to confirm there was no data for the given chromosome.

Check the chromosomes

Make sure the number of chromosomes listed in the file does not exceed the number of chromosomes listed in the genome descriptor file genome.txt:

Code Block
$ gunzip -c G_species_MMM_YYYY_all_mrna.psl.gz | cut -f14 | sort | uniq | wc -l
$ wc -l genome.txt

...

Edit annots.xml

The annots.xml file description for the all_mrna annotations contains the date the data were downloaded. Edit the file accordingly to reflect today's date.

Annots.xml attributes:

attribute     value
name          G_species_MMM_YYYY_all_mrna.psl.gz
title         mRNA
description   UCSC all mRNA track (date of download)
url           http://www.igbquickload.org/quickload/G_species_MMM_YYYY

All other attributes (e.g., foreground, background, load_hint) should match other UCSC genome release mRNA data sets.

Check in the new files

Use svn status command to double-check which files you've changed:

Code Block
$ svn st

It should print something like:

Code Block
M  G_species_MMM_YYYY_all_mrna.psl.gz
M  G_species_MMM_YYYY_all_mrna.psl.gz.tbi
M  annots.xml

M stands for "modified".

Warning

These are the only files that should have changed. If others are different, something has gone wrong. If you did not intend to change them, use svn revert to undo the local changes and revert to the checked-out version.
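
One way to make this check less error-prone is to filter the svn st output against the list of files you expect to see. The following is a minimal sketch; the function name unexpected_changes is hypothetical, and it assumes file names contain no spaces.

```shell
# unexpected_changes EXPECTED_FILE...
# Read `svn st` output on stdin and print every status line whose path
# (the last field) is not one of the expected files.
# Returns non-zero if any unexpected line was printed.
unexpected_changes() {
    awk -v expected="$*" '
        BEGIN {
            n = split(expected, names, " ")
            for (i = 1; i <= n; i++) ok[names[i]] = 1
        }
        NF > 0 && !($NF in ok) { print; bad = 1 }
        END { exit bad + 0 }
    '
}

# Example:
#   svn st | unexpected_changes G_species_MMM_YYYY_all_mrna.psl.gz \
#       G_species_MMM_YYYY_all_mrna.psl.gz.tbi annots.xml
```

No output means only the expected files changed.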

Check in the files one-by-one:

Code Block
svn ci G_species_MMM_YYYY_all_mrna.psl.gz -m "Enter a message here."
svn ci G_species_MMM_YYYY_all_mrna.psl.gz.tbi -m "Enter a message here."
svn ci annots.xml -m "Enter a message here."

The message you enter should include the date of the download. This is important because researchers who are using these data will need to know this information in order to publish their results.
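
To keep the date format consistent across commits, the message can be generated rather than typed. This is a minimal sketch; the function name download_message and the message wording are hypothetical, not a project convention.

```shell
# download_message TABLE [DATE]
# Print a commit message naming the UCSC table and the download date.
# DATE defaults to today in YYYY-MM-DD form.
download_message() {
    echo "Updated $1 from UCSC Table Browser, downloaded ${2:-$(date +%Y-%m-%d)}"
}

# Example:
#   svn ci annots.xml -m "$(download_message all_mrna)"
```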

Update all_est

Repeat the same steps you used for all_mrna, but choose track SPECIES ESTs instead of SPECIES mRNAs, and choose table all_est.

The data files you create will contain table suffix all_est, e.g., G_species_MMM_YYYY_all_est.psl.gz and its index G_species_MMM_YYYY_all_est.psl.gz.tbi.

The data set file name, title, and description in the annots.xml should be:

...