Introduction
The following instructions assume you are not going to make a new genome version directory and that all you need to do is update RefGene, mRNA, and EST files. The main IGB QuickLoad site is configured so that a set of foundation gene annotations (the canonical gene models) is loaded into IGB as soon as the user selects the corresponding genome version. This is configured through the annots.xml file that resides in every genome version directory. Any data set with attribute "load_model" set to "Whole Genome" will automatically load into IGB.
...
Note |
---|
As of IGB 10.1.0 most UCSC genome versions and tracks should now be available by default in IGB via the UCSC REST data provider. |
Understand IGB QuickLoad naming conventions
Note |
---|
IGB QuickLoad file names should include the genome version and the UCSC table name. The title in the annots.xml file should match the track name but should not include the species name, as that will be obvious to the user and may make linkout patterns harder to maintain. |
For example, gene model data for the UCSC track named RefSeq Genes is stored in a table called refGene. In IGB QuickLoad, the data file should be named G_species_strain_MMM_YYYY_refGene.bed.gz (a compressed BED detail file) and the title attribute in annots.xml should be RefSeq Genes.
However, the UCSC genome browser sometimes includes species names in the title of its mRNA track. IGB QuickLoad data sets should not include species names in their titles. For example, the title of an mRNA track is always "mRNA" and never (for example) "Zebrafish mRNA."
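The file naming rule above can be sketched as a small shell function. This is only an illustration: quickload_file_name is a hypothetical helper name, not part of IGB or the genomesource repository, and it assumes the data set is stored as a compressed BED file.

```shell
# Build a QuickLoad data file name from a genome version and a UCSC table name.
# quickload_file_name is a hypothetical helper used only for illustration.
quickload_file_name() {
    genome_version=$1   # e.g. G_species_strain_MMM_YYYY
    ucsc_table=$2       # e.g. refGene
    printf '%s_%s.bed.gz\n' "$genome_version" "$ucsc_table"
}

# Prints G_species_strain_MMM_YYYY_refGene.bed.gz
quickload_file_name G_species_strain_MMM_YYYY refGene
```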
Command-line utilities you'll need
...
- tabix and bgzip from htslib
- git
- svn
- UNIX wget (not installed by default on Mac but available on most other UNIX systems)
- UNIX sort
- UNIX gunzip
...
If you're doing this on a Mac desktop or laptop computer, create a directory called "bin" in your home directory and save all compiled binaries there. Edit your .bash_profile file to include a line like the following to ensure that the shell can find the programs.
Code Block |
---|
export PATH=.:$HOME/bin:$PATH
|
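Before starting, it may help to confirm that the utilities listed above can be found on your PATH. The following is a minimal sketch; missing_tools is a hypothetical helper name, and it only checks that each program can be found, not that its version is suitable.

```shell
# Print the name of every listed utility that is not found on the PATH.
# missing_tools is a hypothetical helper used only for illustration.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Prints nothing if every utility is available.
missing_tools tabix bgzip git svn wget sort gunzip
```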
Step-by-step guide to updating UCSC RefGene data
...
set in IGB QuickLoad
Get the QuickLoad data repo
Check out or update a copy of IGB QuickLoad data and source code directories.
Use git to obtain a copy of source code:
Code Block |
---|
$ git clone https://bitbucket.org/lorainelab/genomesource
|
If you already have a copy, then update it. Change into your local copy and run:
Code Block |
---|
$ git pull origin main-JDK8
|
Use svn to get a copy of the QuickLoad data repository: TO BE UPDATED
Code Block |
---|
TO BE UPDATED |
If you already have a copy, just update using svn up. Change into your checked-out, local copy and run:
Code Block |
---|
...
$ svn up
|
Add quickloadsource to your PATH (to run the python code there)
Add quickloadsource to your PATH, as it contains a Python script you'll use to create BED detail files from ordinary BED files. Edit the .bash_profile file as above:
Code Block |
---|
export PATH=.:$HOME/quickloadsource:$HOME/bin:$PATH
|
Pick a genome to update and change into the genome version directory
Open a Web browser and find the genome in the Table Browser at UCSC
Go to http://genome.ucsc.edu/cgi-bin/hgTables
Update refGene data set
File name: G_species_strain_MMM_YYYY_refGene.bed.gz
Data set title: RefSeq Genes
Get refGene data from UCSC Table Browser
Configure Table Browser
Configure Table Browser with the following settings:
...
Note |
---|
UCSC data set file names saved in IGB QuickLoad should always include the IGB genome version name followed by an underscore character followed by the UCSC table name. The title field in the annots.xml file should always be the UCSC track name because that is what users will recognize from having used the UCSC genome browser. |
Click the "get output" button to download the data
...
Get gene info and accession info files from NCBI ftp site
Code Block |
---|
$ wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
$ wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
|
Create BED detail file with gene information
Use ucscToBedDetail.py (from https://bitbucket.org/lorainelab/genomesource) to create a new BED file with gene symbol and description
For example, do something like this:
Code Block |
---|
$ ucscToBedDetail.py ~/Downloads/G_species_MMM_YYYY_refGene.bed.gz G_species_MMM_YYYY_refGene.bed
|
...
You can also provide these files using options -a and -g if you have saved them to a different location. See the script documentation for details.
Sort, compress, and index
Code Block |
---|
$ sort -k1,1 -k2,2n G_species_MMM_YYYY_refGene.bed | bgzip > G_species_MMM_YYYY_refGene.bed.gz
$ tabix -p bed G_species_MMM_YYYY_refGene.bed.gz
|
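Because tabix expects position-sorted input, it can be useful to confirm the sort order of a compressed file, for example one produced in an earlier session. The sketch below uses the -c option of sort, which checks order without writing output; is_sorted_bed is a hypothetical helper name.

```shell
# Succeed (exit status 0) if the compressed BED file is already in the
# order produced by sort -k1,1 -k2,2n, and fail otherwise.
# is_sorted_bed is a hypothetical helper used only for illustration.
is_sorted_bed() {
    gunzip -c "$1" | sort -c -k1,1 -k2,2n 2>/dev/null
}

# Usage:
#   is_sorted_bed G_species_MMM_YYYY_refGene.bed.gz && echo "sorted"
```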
Validate the file
Check that it has data
Code Block |
---|
$ gunzip -c G_species_MMM_YYYY_refGene.bed.gz | wc -l
|
The wc command should print the number of lines in the file, which should equal the number of rows in the corresponding refGene table. To find out how many rows the refGene table contains, click "describe table schema" in the Table Browser.
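The comparison can also be scripted once you know the row count reported by the Table Browser. This is a minimal sketch; check_row_count is a hypothetical helper name, and the expected count must be supplied by hand.

```shell
# Compare the number of lines in a compressed BED file with an expected row count.
# check_row_count is a hypothetical helper used only for illustration.
check_row_count() {
    bed_gz=$1
    expected=$2   # row count taken from the Table Browser
    actual=$(gunzip -c "$bed_gz" | wc -l | tr -d ' ')
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual lines"
    else
        echo "MISMATCH: $actual lines, expected $expected"
        return 1
    fi
}

# Usage (replace ROW_COUNT with the number of rows in the refGene table):
#   check_row_count G_species_MMM_YYYY_refGene.bed.gz ROW_COUNT
```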
Try to open it in IGB
Open the file and change the load mode to Whole Genome. Click the genome row in the Current Sequence table. You should see something above every chromosome.
If some chromosomes have no data, go back to the table browser and use the region text area to confirm there was no data for the given chromosome.
Check the chromosomes
Make sure the number of chromosomes listed in the file does not exceed the number of chromosomes listed in the genome descriptor file genome.txt:
Code Block |
---|
$ gunzip -c G_species_MMM_YYYY_refGene.bed.gz | cut -f1 | sort | uniq | wc -l
$ wc -l genome.txt
|
The first line counts the number of unique sequences appearing in the first column of the BED file. The second line counts the number of lines in the genome.txt file.
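Comparing counts will not show which sequence names are the problem. As a complementary check, the sketch below lists names that appear in the BED file but not in genome.txt. It assumes genome.txt is tab-delimited with the sequence name in its first column; extra_chroms is a hypothetical helper name.

```shell
# Print sequence names that occur in the BED file but not in genome.txt.
# Assumption: genome.txt is tab-delimited with the sequence name in column 1.
# extra_chroms is a hypothetical helper used only for illustration.
extra_chroms() {
    bed_gz=$1
    genome_txt=$2
    tmp_dir=$(mktemp -d) || return 1
    gunzip -c "$bed_gz" | cut -f1 | LC_ALL=C sort -u > "$tmp_dir/bed_names"
    cut -f1 "$genome_txt" | LC_ALL=C sort -u > "$tmp_dir/genome_names"
    # comm -23 keeps only the lines unique to the first file
    LC_ALL=C comm -23 "$tmp_dir/bed_names" "$tmp_dir/genome_names"
    rm -rf "$tmp_dir"
}

# Usage:
#   extra_chroms G_species_MMM_YYYY_refGene.bed.gz genome.txt
```

If the function prints nothing, every sequence name in the BED file is also present in genome.txt.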
Edit annots.xml
The annots.xml file description for the refGene annotations contains the date the data were downloaded. Edit the file accordingly to reflect today's date.
Check in the new files
Use the svn status command to double-check which files you've changed:
Code Block |
---|
$ svn st
|
It should print something like:
Code Block |
---|
M G_species_MMM_YYYY_refGene.bed.gz
M G_species_MMM_YYYY_refGene.bed.gz.tbi
M annots.xml
|
...
Check in the files one-by-one:
Code Block |
---|
svn ci G_species_MMM_YYYY_refGene.bed.gz -m "Enter a message here."
svn ci G_species_MMM_YYYY_refGene.bed.gz.tbi -m "Enter a message here."
svn ci annots.xml -m "Enter a message here."
|