Introduction

The following instructions assume you are not going to make a new genome version directory and that all you need to do is update the refGene, mRNA, and EST files.

...

For genomes harvested from UCSC, these foundation gene annotation data typically come from the UCSC refGene table. If a genome does not have a refGene table, we instead use ensGene or whichever other gene data set looks the most complete and useful.

...

Note

IGB QuickLoad file names should include the genome version and the UCSC table name. The title in the annots.xml file should match the track name.

For example, gene models data for the UCSC track named RefSeq Genes is stored in a table called refGene. In IGB QuickLoad, the data file should be named G_species_strain_MMM_YYYY_refGene and the title attribute in annots.xml should be RefSeq Genes.

Command-line utilities you'll need

  • tabix from samtools sourceforge site
  • bgzip from samtools sourceforge site
  • UNIX wget (not installed by default on Mac but available on most other UNIX systems)
  • UNIX sort
  • UNIX gunzip

Install these in a directory in your PATH.

If you're doing this on a Mac desktop or laptop computer, best practice is to create a directory called "bin" in your home directory and save all compiled binaries there. Edit your .bash_profile file to include a line like the following to ensure that the shell can find the programs.

Code Block
export PATH=.:$HOME/bin:$PATH
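
After reloading .bash_profile, it can help to confirm that the shell finds every utility listed above. The following is a minimal sketch; the function name check_tools is hypothetical and not part of the QuickLoad source code.

```shell
# Report whether each named command can be found on the current PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_tools tabix bgzip wget sort gunzip
```

Any tool reported as missing needs to be installed, or its directory added to PATH, before continuing.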

Step-by-step guide to updating UCSC data sets in IGB QuickLoad

Get the QuickLoad data repo

Check out or update a copy of IGB QuickLoad data and source code directories.

Code Block
$ svn co https://svn.transvar.org/repos/genomes/pub/quickload quickload
$ svn co https://svn.transvar.org/repos/genomes/pub/src quickload_src

If you already have a copy, just update using svn up.

Add quickload_src to your PATH (to run the python code there)

Add quickload_src to your PATH, as it contains a Python script you'll use to create BED detail files from ordinary BED files.

...

Edit the .bash_profile file as above:

Code Block
export PATH=.:$HOME/quickload_src:$HOME/bin:$PATH

Pick a genome to update and change into the genome version directory

Open a Web browser and find the genome in the Table Browser at UCSC

Go to http://genome.ucsc.edu/cgi-bin/hgTables

...

Update refGene data set

File name: G_species_strain_MMM_YYYY_refGene.gz
Data set title: RefSeq Genes

Configure Table Browser

Configure Table Browser with the following settings:

...

Note

UCSC data set file names saved in IGB QuickLoad always include the IGB genome version name, followed by an underscore character, followed by the UCSC table name. The title field in the annots.xml file should always be the UCSC track name because that is what users will recognize from having used the UCSC Genome Browser.

Click the "get output" button to download the data

Download gene info and accession info files from NCBI ftp site

Code Block
$ wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
$ wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
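
Large downloads are sometimes truncated, so it may be worth testing both archives before using them. The sketch below relies on the standard gzip -t integrity test; the function name check_gz is hypothetical.

```shell
# Test each gzip archive for integrity without extracting it.
# Returns non-zero if any file is missing, empty, or corrupt.
check_gz() {
    status=0
    for f in "$@"; do
        if [ -s "$f" ] && gzip -t "$f" 2>/dev/null; then
            echo "ok:  $f"
        else
            echo "bad: $f"
            status=1
        fi
    done
    return "$status"
}

# Usage: check_gz gene2accession.gz gene_info.gz
```

If a file is reported as bad, download it again before running the conversion script.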

Use ucscToBedDetail.py to create a new BED file with gene symbol and description

For example, do something like this:

...

You can also provide these files using options -a and -g if you have saved them to a different location. See the script documentation for details.

Sort, compress, and index

Code Block
$ sort -k1,1 -k2,2n G_species_MMM_YYYY_refGene.bed | bgzip > G_species_MMM_YYYY_refGene.bed.gz
$ tabix -p bed G_species_MMM_YYYY_refGene.bed.gz
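
tabix expects the records to be ordered by sequence name and then by start position. As a hedged sketch, the order of the compressed file can be checked with the same sort keys; the function name is_bed_sorted is hypothetical, and the check runs in the current locale, so use the same shell session that produced the file.

```shell
# Check that a compressed BED file is ordered by sequence name (column 1)
# and then numerically by start position (column 2).
# sort -c only checks the order; -s limits the comparison to the listed keys.
is_bed_sorted() {
    gzip -t "$1" || return 2          # stop early if the archive is unreadable
    gunzip -c "$1" | sort -c -s -k1,1 -k2,2n
}

# Usage: is_bed_sorted G_species_MMM_YYYY_refGene.bed.gz && echo "sorted"
```

When the file is out of order, sort prints the first offending line to standard error and the function returns a non-zero status.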

Validate the file

Check that it has data

Code Block
$ gunzip -c G_species_MMM_YYYY_refGene.bed.gz | wc -l

The wc command should print the number of lines in the file, which should equal the number of rows in the corresponding refGene table. To find out how many rows the refGene table contains, click "describe table schema" in the Table Browser.
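
The comparison can also be scripted once you have read the row count from the Table Browser. This is a minimal sketch; the function name check_row_count is hypothetical, and the expected count must be supplied by hand.

```shell
# Compare the line count of a compressed BED file with an expected row count.
# $1 = compressed BED file, $2 = row count reported by the Table Browser.
check_row_count() {
    file=$1
    expected=$2
    # tr strips the padding that some wc implementations add
    actual=$(gunzip -c "$file" | wc -l | tr -d '[:space:]')
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual lines"
    else
        echo "MISMATCH: $actual lines in $file, expected $expected"
        return 1
    fi
}

# Usage (12345 is a made-up placeholder, not a real row count):
# check_row_count G_species_MMM_YYYY_refGene.bed.gz 12345
```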

Try to open it in IGB

Open the file and change load mode to whole genome. Click the genome row in the Current Sequence table. You should see something above every chromosome.

If some chromosomes have no data, go back to the Table Browser and use the region text area to confirm there was no data for the given chromosome.

Check the chromosomes

Make sure the number of chromosomes listed in the file does not exceed the number of chromosomes listed in the genome descriptor file genome.txt:

...

The first line counts the number of unique sequences appearing in the first column of the BED file. The second line counts the number of lines in the genome.txt file.
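
Comparing counts shows that something is wrong but not which sequence is at fault. The sketch below counts the distinct sequence names and also lists any name that genome.txt does not contain. Both function names are hypothetical, and the sketch assumes the BED file is the bgzip-compressed one and that genome.txt holds one sequence per line with the name in the first column.

```shell
# Count the distinct sequence names in column 1 of a compressed BED file.
count_bed_sequences() {
    gunzip -c "$1" | cut -f1 | sort -u | wc -l | tr -d '[:space:]'
}

# Print each sequence name used in the BED file ($1) that is not listed
# in the first column of genome.txt ($2). Each name is printed once.
unknown_sequences() {
    gunzip -c "$1" | awk -v g="$2" '
        BEGIN {
            # load the known sequence names from genome.txt
            while ((getline line < g) > 0) { split(line, f); known[f[1]] = 1 }
        }
        !($1 in known) && !seen[$1]++ { print $1 }
    '
}

# Usage:
# count_bed_sequences G_species_MMM_YYYY_refGene.bed.gz
# unknown_sequences G_species_MMM_YYYY_refGene.bed.gz genome.txt
```

If unknown_sequences prints nothing, every sequence in the BED file is present in genome.txt.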

Edit annots.xml

The annots.xml file description for the refGene annotations contains the date the data were downloaded. Edit the file accordingly to reflect today's date.

Check in the new files

Use the svn status command to double-check which files you've changed:

...

Code Block
M  G_species_MMM_YYYY_refGene.bed.gz
M  G_species_MMM_YYYY_refGene.bed.gz.tbi
M  annots.xml

M stands for "modified".

Warning

These are the only files that should have changed. If others are different, something has gone wrong.
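
As a rough sketch, the check can be automated by filtering the svn status output for anything other than the three expected files. The function name unexpected_changes is hypothetical, and the sketch assumes paths contain no spaces.

```shell
# Read `svn status` output on standard input and print every entry that is
# not one of the three files expected to change.
# $1 = genome version prefix, for example G_species_MMM_YYYY.
# Returns non-zero if any unexpected entry is found.
unexpected_changes() {
    awk -v p="$1" '
        {
            path = $NF    # the path is the last field on each status line
            if (path == p "_refGene.bed.gz" ||
                path == p "_refGene.bed.gz.tbi" ||
                path == "annots.xml") next
            print
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Usage: svn status | unexpected_changes G_species_MMM_YYYY
```

No output and a zero exit status mean that only the expected files appear in the status listing.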

...