UCSC genome bioinformatics supports mammalian, insect, fish, avian, and some fungal genomes.
IGBQuickLoad contains genome directories with sequence and annotation data for some (not all) genome assemblies supported at UCSC.
The following describes how to add a new genome version to the IGB QuickLoad repository and update IGB species.txt.
Compiled UCSC software tools are available from http://hgdownload.cse.ucsc.edu/admin/exe/.
Download the compiled programs and save them in a directory in your PATH. Make sure they are executable on your system.
If you're doing this on a Mac desktop or laptop computer, create a directory called "bin" in your home directory and save all compiled binaries there. Edit your .bash_profile file to include a line like the following to ensure that the shell can find the programs.
export PATH=.:$HOME/bin:$PATH
For example, the following sequence of commands downloads the software, moves it to a directory named "bin" in the home directory, and then makes it executable using the chmod command.
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/twoBitInfo
$ mv twoBitInfo ~/bin
$ chmod a+x ~/bin/twoBitInfo
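Once the programs are installed, it can help to confirm that the shell really finds them before going further. The following is a minimal sketch; check_tools is a hypothetical helper name, not part of the UCSC tools or of IGB.

```shell
# Print "missing: NAME" for each program that cannot be found on PATH.
# Returns 0 if every program was found, 1 otherwise.
# (check_tools is a hypothetical helper written for this page.)
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For the tools used on this page, a call would look like check_tools twoBitInfo faToTwoBit twoBitToFa. No output means all of them were found.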
Open a terminal (UNIX shell) and change into the directory where you want your checked-out copy of the genomes repository to reside.
For example, you might do this:
$ cd /mydata
Use svn to get a copy of the repo:
$ svn co https://svn.transvar.org/repos/genomes/trunk/pub/quickload
If you already have a copy, just update using svn up.
$ svn up
This will update everything in the current working directory and all the directories beneath it.
Questions about using subversion? See: Version Control with Subversion.
As before, open a terminal and change into the directory where you want your checked-out copy of the genomes source code to reside. A good place for checked-out code is a directory named src in your home directory.
To set up a src directory for checked-out code:
$ cd
$ mkdir src
$ cd src
Then, use svn to get a copy of the QuickLoad source code from the repo and save it to a directory named quickload_src:
$ svn co https://svn.transvar.org/repos/genomes/trunk/pub/src quickload_src
To ensure that you'll be able to run the code, add the new directory to your PATH and PYTHONPATH environment variables by editing your .bash_profile startup script:
export PATH=$HOME/src/quickload_src:$PATH
export PYTHONPATH=$HOME/src/quickload_src:$PYTHONPATH
Test that it worked by opening a new terminal and typing sample.py at the prompt. If your path is correctly configured, the script will run without error.
Go to http://genome.ucsc.edu/cgi-bin/hgTables
Make note of the genome version synonyms UCSC is using. These usually appear in parentheses next to the month and year of the release. They will need to be included in IGB's curated list of genome version synonyms to ensure compatibility with Galaxy, UCSC DAS, or other external resources.
For example, the assembly menu for zebrafish reads Jul. 2010 (Zv9/danRer7). The terms in parentheses are genome version synonyms for this assembly. The one on the right (danRer7) is what UCSC calls the "database" for this assembly and is used as the identifier of the genome in the UCSC DAS1 data source. The term on the left (Zv9) is usually another commonly used term, sometimes assigned by the sequencing consortium that generated the assembly or the original sequence. Sometimes, however, this term is not unique. For example, some genome versions are reported with the term "Broad," which is an organization, not an assembly.
Choose an IGB genome assembly version identifier to represent the UCSC genome.
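IGB uses genus, species, strain (optionally), and release month and year to identify genome assembly versions for a species, individual, strain, or cultivar whose genome was sequenced.
IGB genome assembly version identifiers look like:
G_species_strain_Mmm_YYYY
where G is the first letter of the genus name, species is the species name, strain is the optional strain name, Mmm is the three-letter abbreviation of the release month, and YYYY is the four-digit release year.
Example: D_rerio_Jul_2010 (zebrafish, Jul. 2010 release, UCSC danRer7).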
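A quick way to catch a mistyped identifier is to test it against the naming pattern before creating any directories. This is a sketch only: is_igb_version_name is a hypothetical helper, and the regular expression is an assumption about the pattern (one capital letter for the genus, a lower-case species name, optional alphanumeric strain parts, an English three-letter month, and a four-digit year), not an official IGB rule.

```shell
# Succeed (exit 0) if the argument looks like G_species[_strain]_Mmm_YYYY.
# The pattern below is an assumption made for illustration.
is_igb_version_name() {
    printf '%s\n' "$1" | grep -Eq \
        '^[A-Z]_[a-z]+(_[A-Za-z0-9]+)*_(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)_[0-9]{4}$'
}
```

With this helper, D_rerio_Jul_2010 is accepted, while a UCSC name such as danRer7 is rejected.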
Use svn mkdir to create a new genome directory. The name of the new directory should be identical to the IGB QuickLoad genome version name.
$ svn mkdir G_species_Mmm_YYYY
Don't use the UNIX mkdir command to create the directory and then use svn add to add it to the repo later. If you do this and are not careful, you will accidentally add large sequence data files to the repo.
Change directories into the newly created genome version directory:
$ cd G_species_Mmm_YYYY
Go to UCSC Genome Bioinformatics and click Downloads > Genome Data. Click the link for your species and then click the link labeled Full data set.
This will take you to a directory where you can download files. Typically, the address of the directory is
http://hgdownload.soe.ucsc.edu/goldenPath/UCSCNAME/bigZips/
For example, UCSC's danRer7 genome is in http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/.
Right-click the link to the 2bit file in your browser and select "Copy Link Location."
Return to your terminal (UNIX shell), type wget, and paste the URL into the shell.
For example,
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/danRer7.2bit
The prefix (file name part) of the 2bit file for a genome should be the genome version identifier and the suffix (file extension) should be 2bit.
For example,
$ mv danRer7.2bit D_rerio_Jul_2010.2bit
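The download and rename steps can also be combined. The sketch below assumes the genome follows the typical bigZips address shown above, which is not guaranteed for every assembly; ucsc_2bit_url and fetch_2bit are hypothetical helper names.

```shell
# Build the typical bigZips URL for a UCSC database name such as danRer7.
# Assumes the usual goldenPath/UCSCNAME/bigZips/ layout.
ucsc_2bit_url() {
    printf 'http://hgdownload.soe.ucsc.edu/goldenPath/%s/bigZips/%s.2bit\n' "$1" "$1"
}

# Download the 2bit file for UCSC database $1 and save it under the
# IGB genome version name $2 with the 2bit extension.
fetch_2bit() {
    wget -O "$2.2bit" "$(ucsc_2bit_url "$1")"
}
```

For the zebrafish example, the call would be fetch_2bit danRer7 D_rerio_Jul_2010. If the download fails, check the address in a browser first, since older genomes may not provide a 2bit file at all.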
For older genomes, UCSC provides sequence data in a so-called "bigZip" file which contains each assembled sequence (typically one per physical chromosome) in a separate fasta file.
For example, as of this writing, the dm3 (fruit fly) genome is provided in this format.
Right-click the link to the file in your browser and select "Copy Link Location."
Return to your terminal (UNIX shell), type wget, and paste the URL into the shell.
For example,
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz
Use tar with options xvf to uncompress the file while also extracting its contents. An ".fa" file for each chromosome will appear when completed.
$ tar xvf chromFa.tar.gz
Once you've created the 2bit file for the genome assembly, you'll delete the .fa and the .gz files.
Get faToTwoBit installed:
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/faToTwoBit
$ mv faToTwoBit ~/bin
$ chmod a+x ~/bin/faToTwoBit
Convert the fa files to twoBit; faToTwoBit will read one or more fasta files and convert them to a single 2bit file. To understand how to run it, type the name of the program. If you run it without any arguments, it will print a usage message.
$ faToTwoBit
faToTwoBit - Convert DNA from fasta to 2bit format
usage:
   faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
options:
   -noMask - Ignore lower-case masking in fa file.
   -stripVersion - Strip off version number after . for genbank accessions.
   -ignoreDups - only convert first sequence if there are duplicates
The prefix (file name part) of the 2bit file you create for this genome version should be the genome version identifier and the suffix (file extension) should be "2bit."
$ faToTwoBit *.fa G_species_Mmm_YYYY.2bit
Then delete the fasta files and the downloaded archive:
$ rm *.fa
$ rm chromFa.tar.gz
You can always re-create the .fa files using twoBitToFa, another UCSC tool.
The 2bit file is typically smaller than the compressed chromFa.tar.gz file it replaced. Unlike the .tar.gz file, it supports random access, allowing IGB to support partial loading of sequence from the IGBQuickLoad site into IGB.
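Since the fasta files can only be re-created if the 2bit file was written correctly, it may be worth guarding the delete step. The following sketch removes the fasta files and the archive only when the named 2bit file exists and is not empty; cleanup_fasta is a hypothetical helper, and a non-empty file is only a rough check, not proof that the conversion succeeded.

```shell
# Delete the fasta files and chromFa.tar.gz in the current directory,
# but only if the 2bit file named in $1 exists and is not empty.
# (cleanup_fasta is a hypothetical helper written for this page.)
cleanup_fasta() {
    if [ -s "$1" ]; then
        rm -f ./*.fa chromFa.tar.gz
    else
        echo "refusing to delete: $1 is missing or empty" >&2
        return 1
    fi
}
```

A typical call, run inside the genome version directory, would be cleanup_fasta G_species_Mmm_YYYY.2bit.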
Use twoBitInfo to create a genome.txt file for the genome. This file lists sequences and their sizes for the genome.
$ twoBitInfo G_species_Mmm_YYYY.2bit genome.txt
The order of sequences listed in the genome.txt is how they will appear in the IGB Current Sequence tab when users visit the genome version in IGB.
Check the genome.txt file. Are the chromosomes listed in a reasonable order?
If you created the 2bit file from fasta files, then they will probably be listed in alphabetical order.
Depending on the state of the genome assembly, i.e., how close to complete it is, it's usually much better to list the chromosomes by size, with larger sequences appearing first in the list.
To ensure that the chromosomes are listed with the largest ones first, sort the file:
$ sort -k2,2nr genome.txt > tmp
$ mv tmp genome.txt
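To see what the sort does, it can be tried on a small made-up file first. The sketch below wraps the same sort key in a hypothetical helper, sort_genome_by_size, and uses sort's -o option to write the result back to the input file instead of going through a temporary file; the sequence names and sizes in the usage note are invented for illustration.

```shell
# Sort a genome.txt-style file (name<TAB>size) in place, largest size first.
# -k2,2nr sorts on the second column only, numerically, in reverse order.
# (sort_genome_by_size is a hypothetical helper written for this page.)
sort_genome_by_size() {
    sort -k2,2nr -o "$1" "$1"
}
```

For example, a file containing chr1 (100), chr10 (50), and chr2 (200), in that order, ends up listing chr2, then chr1, then chr10.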
Add the genome.txt file to the repo:
$ svn add genome.txt
Add the new genome to the contents.txt in the top level of your checked-out quickload directory.
The contents.txt file is a tab-separated file with two columns: the genome version name and the genome title.
The genome title is what IGB will display in the title bar when users visit the new genome.
To create the title, follow the same conventions as for the other UCSC genomes. Title begins with genus and species, followed by the date of the release (in parentheses), followed by the common name for the species, followed by the UCSC genome name (in parentheses).
For example,
Cavia porcellus (Feb 2008) guinea pig (cavPor3)
Note that this file is tab-separated, so when you edit the file, be sure to use a tab character to separate the genome version and genome title fields.
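Because many editors silently turn tabs into spaces, one option is to append the new line from the shell and then check the file. This is a sketch under assumptions: add_contents_entry and check_contents are hypothetical helpers, and the genome version name C_porcellus_Feb_2008 in the usage note is inferred from the naming convention rather than taken from the repository.

```shell
# Append "VERSION<TAB>TITLE" to the contents file.
#   $1 path to contents.txt, $2 genome version name, $3 genome title
add_contents_entry() {
    printf '%s\t%s\n' "$2" "$3" >> "$1"
}

# Print every line that does not have exactly two tab-separated fields.
# No output means every line passed.
check_contents() {
    awk -F '\t' 'NF != 2 { print "line " NR ": " $0 }' "$1"
}
```

A call for the guinea pig example might look like add_contents_entry contents.txt C_porcellus_Feb_2008 "Cavia porcellus (Feb 2008) guinea pig (cavPor3)", followed by check_contents contents.txt.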
Click the configure link.
In the species menu, you should see the organism's genus, species, and strain listed. When you hover the mouse over its menu item, you should also see a tooltip reporting the species' colloquial name (in English).
In the genome version menu, you should see the genome version name listed.
If you don't, this only means that IGB doesn't recognize it. To ensure IGB recognizes the genome, add it to the file species.txt that resides at the top level of the IGBQuickLoad directory, the same level as contents.txt.
species.txt is a tab-separated file in which each line represents synonyms for a genome. The first column should list the full Linnean name for the species; this is what will appear in the species menu. It can include spaces. The second column should list the common (colloquial) name for the species, in English. The next column should contain the IGB genome version minus the date and also minus any numbers at the end of the name. The next column should contain the UCSC genome version, minus the number. The final column should contain the genus and species name joined by an underscore.
Ann's note: Is this right?
For example:
Pan troglodytes Chimp P_troglodytes panTro Pan_troglodytes
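The column rules described above can be sketched as a small shell function that derives the last three columns from the names you already have. This is an illustration only: species_line is a hypothetical helper, it follows the column description as written on this page (which is itself flagged as unconfirmed), and the zebrafish values in the usage note come from the examples earlier on this page.

```shell
# Print one tab-separated species.txt line.
#   $1 full species name, e.g. "Danio rerio"
#   $2 common name, e.g. "Zebrafish"
#   $3 IGB genome version, e.g. D_rerio_Jul_2010
#   $4 UCSC genome version, e.g. danRer7
# (species_line is a hypothetical helper written for this page.)
species_line() {
    # Drop the trailing _Mmm_YYYY date, then any trailing digits.
    igb_prefix=$(printf '%s\n' "$3" |
        sed -e 's/_[A-Z][a-z][a-z]_[0-9][0-9][0-9][0-9]$//' -e 's/[0-9]*$//')
    # Drop the trailing number from the UCSC name.
    ucsc_prefix=$(printf '%s\n' "$4" | sed 's/[0-9]*$//')
    # Join genus and species with an underscore.
    joined=$(printf '%s\n' "$1" | tr ' ' '_')
    printf '%s\t%s\t%s\t%s\t%s\n' "$1" "$2" "$igb_prefix" "$ucsc_prefix" "$joined"
}
```

To add a line, append the output to the file, for example species_line "Danio rerio" Zebrafish D_rerio_Jul_2010 danRer7 >> species.txt. Check first whether species.txt already has an entry for the species, since this sketch does not look for duplicates.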
Choose your genome version. The chromosomes from your genome.txt file should then appear in the Current Sequence tab. You should also see the corresponding UCSC DAS1 data source appear in the Data Sets/Data Sources tree on the left. If this step fails, there may be a problem with your genome.txt file.
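If the chromosomes do not appear, one quick thing to rule out is a malformed genome.txt. The sketch below assumes each line should be a sequence name, a tab, and a whole-number size, which is the format twoBitInfo writes; check_genome_txt is a hypothetical helper, and passing this check does not guarantee that IGB will accept the file.

```shell
# Print every line that is not "name<TAB>whole number".
# No output means the file passed this check.
# (check_genome_txt is a hypothetical helper written for this page.)
check_genome_txt() {
    awk -F '\t' 'NF != 2 || $1 == "" || $2 !~ /^[0-9]+$/ { print "line " NR ": " $0 }' "$1"
}
```

Run it as check_genome_txt genome.txt from inside the genome version directory. A line that uses spaces instead of a tab, for instance, is reported with its line number.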
Check that IGB can load sequence data from your local 2bit file.
Visit a chromosome, zoom in, and click Load Sequence. Zoom in to check that the sequence is visible. Note that depending on which sequence file you used, some letters will be lower-case. This is how UCSC masks repetitive or low-complexity sequence.
These include:
When you're finished, submit a ticket to IGB Jira to request testing of the newly added genome on a test site.
If you downloaded the 2bit file from UCSC, include a link to the file. If you created it from a fasta file, include a link to the fasta file instead.