...
Note |
---|
IGBQuickLoad file names should include the genome version and the UCSC table name. The title in the annots.xml file should match the track name but should not include the species name, as that will be obvious to the user and may make linkout patterns harder to maintain. |
For example, gene models data for the UCSC track named RefSeq Genes is stored in a table called refGene. In IGB QuickLoad, the data file should be named G_species_strain_MMM_YYYY_refGene and the title attribute in annots.xml should be RefSeq Genes.
However, the UCSC Genome Browser sometimes includes species names in the titles of its mRNA tracks. IGB QuickLoad data sets should not include species names in their titles. For example, the title of an mRNA track is always "mRNA" and never "Zebrafish mRNA."
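The naming rule can be expressed as a small shell helper. This is only a sketch: the function name quickload_file_name is made up for illustration and is not part of IGB QuickLoad, and G_species_MMM_YYYY is the same placeholder used elsewhere on this page.

```shell
# Hypothetical helper (not part of IGB QuickLoad): build a data file name
# from the genome version directory name and the UCSC table name.
quickload_file_name() {
    genome_version=$1   # placeholder, e.g. G_species_MMM_YYYY
    ucsc_table=$2       # UCSC table name, e.g. refGene or all_mrna
    printf '%s_%s\n' "$genome_version" "$ucsc_table"
}

quickload_file_name G_species_MMM_YYYY refGene
```

The file extension (for example .bed.gz or .psl.gz) is appended separately, because it depends on the format of the data you save.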
Command-line utilities you'll need
...
File name: G_species_strain_MMM_YYYY_refGene.gz
Data set title: RefSeq Genes
Get refGene data from UCSC Table Browser
Configure Table Browser
Configure Table Browser with the following settings:
...
Note |
---|
UCSC data set file names saved in IGB QuickLoad should always include the IGB genome version name followed by an underscore character followed by the UCSC table name. The title field in the annots.xml file should always be the UCSC track name because that is what users will recognize from having used the UCSC Genome Browser. |
Click the "get output" button to download the data
...
Get gene info and accession info files from NCBI ftp site
Code Block |
---|
$ wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz
$ wget ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
|
Create BED detail file with gene information
Use ucscToBedDetail.py to create a new BED file with gene symbol and description
For example, do something like this:
...
Code Block |
---|
svn ci G_species_MMM_YYYY_refGene.bed.gz -m "Enter a message here."
svn ci G_species_MMM_YYYY_refGene.bed.gz.tbi -m "Enter a message here."
svn ci annots.xml -m "Enter a message here."
|
Update mRNA
Get mRNA data from UCSC Table Browser
Configure Table Browser
Configure Table Browser with the following settings:
- choose assembly using release date to match it with IGB QuickLoad genome version
- choose group: mRNA and EST tracks
- choose track: SPECIES mRNA
- choose table: all_mrna
- choose output format: selected fields from primary and related tables
- enter output file G_species_MMM_YYYY_all_mrna.psl.gz
- The prefix should be THE SAME AS the genome version directory name, with _all_mrna appended.
- file type returned: choose gzip compressed
Warning |
---|
Do not capitalize "rna" in mrna. The data set suffix should be identical to the UCSC table name, and all_mrna is spelled with lower-case letters. |
Click "get output" and configure fields
The browser will then show a new screen from which you can select specific fields. For this, select all fields except the field named "bin" (top of the list).
Click "get output" again to download the file.
Create sorted, compressed PSL file minus header
Use grep, sort, and bgzip to make a PSL file:
Code Block |
---|
$ gunzip -c ~/Downloads/G_species_MMM_YYYY_all_mrna.psl.gz | grep -v '^#' |
sort -k14,14 -k16,16n | bgzip > G_species_MMM_YYYY_all_mrna.psl.gz
|
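Because tabix needs its input sorted by sequence name and start position, it can be worth confirming the sort order before indexing. The following is a minimal sketch that reuses the same sort keys as the command above; the function name psl_is_sorted is hypothetical. It relies on sort -c, which only checks the order and writes no output file, and it assumes sort runs with the same locale settings that were in effect when the file was sorted.

```shell
# Hypothetical check: succeed only if a gzip- or bgzip-compressed PSL file is
# ordered by target name (column 14), then numerically by target start
# (column 16), matching the sort keys used to create it.
psl_is_sorted() {
    gunzip -c "$1" | sort -c -k14,14 -k16,16n
}

# Usage (file name is the placeholder used on this page):
#   psl_is_sorted G_species_MMM_YYYY_all_mrna.psl.gz && echo "sorted"
```

If the check fails, sort reports the first out-of-order line on standard error.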
Use tabix to index the sorted, compressed PSL file
Code Block |
---|
$ tabix -s 14 -b 16 -e 17 G_species_MMM_YYYY_all_mrna.psl.gz
|
Validate the file
Check that it has data
Code Block |
---|
$ gunzip -c G_species_MMM_YYYY_all_mrna.psl.gz | wc -l
|
The wc command prints the number of lines in the file, which should equal the number of rows in the corresponding all_mrna table. To find out how many rows the table contains, click "describe table schema" in the Table Browser.
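Once you have the expected row count from the Table Browser, the comparison can be scripted. This is a sketch only; check_row_count is a hypothetical helper name, and the expected count is a number you copy by hand from the Table Browser.

```shell
# Hypothetical helper: compare the number of lines in a compressed data file
# with the row count reported by the Table Browser.
check_row_count() {
    file=$1
    expected=$2   # row count copied by hand from the Table Browser
    # tr strips the padding some wc implementations add around the number
    actual=$(gunzip -c "$file" | wc -l | tr -d ' ')
    if [ "$actual" -eq "$expected" ]; then
        echo "OK: $actual rows"
    else
        echo "MISMATCH: file has $actual rows, expected $expected" >&2
        return 1
    fi
}

# Usage: check_row_count G_species_MMM_YYYY_all_mrna.psl.gz EXPECTED_COUNT
```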
Try to open it in IGB
Open the file and change the load mode to whole genome. Click the genome row in the Current Sequence table. You should see data above every chromosome.
If some chromosomes have no data, go back to the table browser and use the region text area to confirm there was no data for the given chromosome.
Check the chromosomes
Make sure the number of chromosomes listed in the file does not exceed the number of chromosomes listed in the genome descriptor file genome.txt:
Code Block |
---|
$ gunzip -c G_species_MMM_YYYY_all_mrna.psl.gz | cut -f14 | sort | uniq | wc -l
$ wc -l genome.txt
|
The first line counts the number of unique sequence names appearing in column 14 (the target name) of the PSL file. The second line counts the number of lines in the genome.txt file.
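Comparing counts will not catch a sequence name that is misspelled or missing from genome.txt, so a name-by-name check can be a useful supplement. The sketch below assumes genome.txt is tab-delimited with sequence names in its first column; the function name unknown_sequences is hypothetical. The column holding sequence names in the data file is passed as an argument (14 for PSL, 1 for BED).

```shell
# Hypothetical helper: print sequence names that occur in a compressed data
# file but not in genome.txt. Prints nothing when every name is known.
# Assumes genome.txt is tab-delimited with sequence names in column 1.
unknown_sequences() {
    data_file=$1
    name_column=$2   # 14 for PSL (target name), 1 for BED
    genome_file=$3
    gunzip -c "$data_file" | cut -f"$name_column" | sort -u |
        awk -F '\t' '
            FILENAME == ARGV[1] { known[$1] = 1; next }  # names from genome.txt
            !($1 in known)                               # names from the data file
        ' "$genome_file" -
}

# Usage: unknown_sequences G_species_MMM_YYYY_all_mrna.psl.gz 14 genome.txt
```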
Edit annots.xml
The description of the all_mrna data set in the annots.xml file contains the date the data were downloaded. Edit the description to reflect today's date.
The data set title should be "mRNA."
Check in the new files
Use the svn status command to double-check which files you've changed:
Code Block |
---|
$ svn st
|
It should print something like:
Code Block |
---|
M G_species_MMM_YYYY_all_mrna.psl.gz
M G_species_MMM_YYYY_all_mrna.psl.gz.tbi
M annots.xml
|
M stands for "modified".
Warning |
---|
These are the only files that should have changed. If others are different, something has gone wrong. |
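If the working copy is large, you can filter the svn st output instead of reading it by eye. This is a rough sketch: unexpected_changes is a hypothetical helper, and it assumes the path is the last whitespace-separated field on each status line, so it will not handle paths containing spaces.

```shell
# Hypothetical helper: read `svn st` output on standard input and print every
# line whose path is not one of the expected files given as arguments.
# Assumes paths contain no spaces.
unexpected_changes() {
    awk -v expected="$*" '
        BEGIN {
            n = split(expected, names, " ")
            for (i = 1; i <= n; i++) ok[names[i]] = 1
        }
        NF && !($NF in ok)   # skip blank lines, print anything unexpected
    '
}

# Usage:
#   svn st | unexpected_changes G_species_MMM_YYYY_all_mrna.psl.gz \
#       G_species_MMM_YYYY_all_mrna.psl.gz.tbi annots.xml
```

No output means only the expected files changed.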
Check in the files one-by-one:
Code Block |
---|
svn ci G_species_MMM_YYYY_all_mrna.psl.gz -m "Enter a message here."
svn ci G_species_MMM_YYYY_all_mrna.psl.gz.tbi -m "Enter a message here."
svn ci annots.xml -m "Enter a message here."
|
Update all_est
Repeat the same steps you used for all_mrna, but choose the track SPECIES ESTs instead of SPECIES mRNAs, and choose the table all_est.