Introduction

UCSC genome bioinformatics supports mammalian, insect, fish, avian, and some fungal genomes.

IGBQuickLoad contains genome directories with sequence and annotation data for some (not all) genome assemblies supported at UCSC.

The following describes how to add a new genome version to the IGB QuickLoad repository and update IGB species.txt.

Command-line utilities you'll need

  • faToTwoBit from UCSC (needed if there is no 2Bit file available)
  • twoBitInfo from UCSC (needed to generate the genome.txt file)
  • UNIX wget (not installed by default on Mac but available on most other UNIX systems)
  • UNIX sort (should be pre-installed on any UNIX system, including Mac)
  • IGBQuickLoad scripts in genomes/pub/src subversion repo.

Compiled UCSC software tools are available from http://hgdownload.cse.ucsc.edu/admin/exe/.

Save the compiled programs in a directory in your PATH. Make sure they are executable on your system.

If you're doing this on a Mac desktop or laptop computer, create a directory called "bin" in your home directory and save all compiled binaries there. Edit your .bash_profile file to include a line like the following to ensure that the shell can find the programs.

Code Block
export PATH=.:$HOME/bin:$PATH

For example, the following sequence of commands downloads the software, moves it to a directory named "bin" in the home directory, and then makes it executable using the chmod command.

Code Block
wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/twoBitInfo
mv twoBitInfo ~/bin
chmod a+x ~/bin/twoBitInfo
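
Before going further, it can help to confirm that the shell can actually find each tool. The following is a minimal sketch, not part of the QuickLoad scripts; the function name missing_tools is made up for illustration, and svn is included because later steps use it.

```shell
# Print the name of every tool given as an argument that is NOT on PATH.
# (missing_tools is a hypothetical helper, not an IGBQuickLoad script.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Tools named in the list above, plus svn, which later steps use.
missing=$(missing_tools faToTwoBit twoBitInfo wget sort svn)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "All tools found."
fi
```

If a tool is reported as missing, check that it was saved in ~/bin, that it is executable, and that you opened a new terminal after editing .bash_profile.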

Step-by-step guide to adding a new UCSC genome to IGBQuickLoad

The following instructions explain how to

  • set up your local copy of the public QuickLoad data repository
  • set up your environment to run QuickLoad scripts
  • get annotation files and sequence files from UCSC Genome Bioinformatics
  • convert files to random access, indexed file formats that enable partial data loading in IGB
  • update meta-data files IGB requires to update its interface and allow users to access the newly added genome
  • update HEADER.html and other files describing the new genome
  • commit your new files and updates to the repo
  • submit a Jira ticket requesting that the main site and mirror sites be updated

If you have questions don't hesitate to ask Ann.

Get the QuickLoad data repo

Check out or update a copy of IGB QuickLoad data directories.

Open a terminal (UNIX shell) and change into the directory where you want your checked-out copy of the genomes repository to reside.

For example, you might do this:

Code Block
cd # change into your home directory
svn co https://svn.transvar.org/repos/genomes/trunk/pub/quickload

Or, if you already have a copy, just update using svn up.

Code Block
svn up

This will update everything in the current working directory and all the directories beneath it. To avoid conflicts with other people's committed changes, be sure to update your local copy of IGB QuickLoad repo before starting work.

Questions about using subversion? See: Version Control with Subversion.

Configure your Apache web server to serve the IGB QuickLoad data directories via http

You'll need this to test that the new genome directory looks OK when visited in a Web browser. How you do this will depend on your computer. The following instructions explain how to do this on a Mac.

  • Use locate to find your local copy of httpd.conf, the Apache configuration file. Probably it's located at /private/etc/apache2/httpd.conf, depending on your system.
  • Open a terminal window and change into the same directory as the configuration file.
  • Make a backup copy of the file:
Code Block
cp httpd.conf httpd.conf.bak
  • Use sudo to open the file in a text editor like pico or emacs and enter your password. *Note* this only works if you have admin privileges on your computer. If you can't edit this file, then you'll need to get help before proceeding.
Code Block
sudo pico httpd.conf
  • Find the place in the file that says DocumentRoot. Comment the current DocumentRoot and substitute the full path to your checked-out copy of the QuickLoad repository:
Code Block
#DocumentRoot "/Library/WebServer/Documents"
DocumentRoot "/Users/username/quickload"

and

Code Block
#<Directory "/Library/WebServer/Documents">
<Directory "/Users/username/quickload">

where quickload is your copy of the checked-out repo.

  • Restart Apache. To restart Apache on a Mac, open Apple > System Preferences ... > Sharing and select Web Sharing. If it is already selected, that means Apache is already running. Unselect it to stop Apache and then select it again to restart Apache.
  • Open a Web browser and enter url http://localhost. (You may need to refresh your browser.)
  • You should now see something that looks exactly like the public IGB QuickLoad site.
Note

Now, you can configure IGB to access your local copy of IGB QuickLoad using either the URL http://localhost *or* the file chooser because IGB supports QuickLoad access via the Web (http) or from local files.

Check out or update a copy of IGB QuickLoad source code (src) directory.

As before, open a terminal and change into the directory where you want your checked-out copy of the genomes src code to reside. A good place for checked-out code is a directory named src in your home directory.

To set up a src directory for checked-out code:

Code Block
cd
mkdir src
cd src

Then, use svn to get a copy of the QuickLoad source code from the repo and save it to a directory named quickload_src:

Code Block
svn co https://svn.transvar.org/repos/genomes/trunk/pub/src quickload_src

To ensure that you'll be able to run the code, add the new directory to your PATH and your PYTHONPATH environmental variables by editing your .bash_profile startup script:

Code Block
export PATH=$HOME/src/quickload_src:$PATH
export PYTHONPATH=$HOME/src/quickload_src:$PYTHONPATH

Test that it worked by opening a new terminal and typing sample.py at the prompt. If your path is correctly configured, the script will run without error.

Get the data from UCSC

Open a Web browser and find the genome you would like to add in the Table Browser at UCSC

Go to http://genome.ucsc.edu/cgi-bin/hgTables

Use the genome version menu to determine the month and year of the genome release you want.

Make note of the genome version synonyms UCSC is using. These are usually in parentheses next to the month and year of the release. They will need to be included in IGB's curated list of genome version synonyms to ensure compatibility with Galaxy, UCSC DAS, or other external resources.

For example, the assembly menu for zebrafish reads Jul. 2010 (Zv9/danRer7). The terms in parentheses are genome version synonyms for this assembly. The one on the right (danRer7) is what UCSC calls the "database" for this assembly and is used as the identifier of the genome in the UCSC DAS1 data source. The term on the left (Zv9) is usually another commonly-used term, sometimes assigned by the sequencing consortium that generated the assembly or the original sequence. Sometimes, however, this term is not unique. For example, some genome versions are reported with the term "Broad," which is an organization, not an assembly.

Name the genome for genus, species, strain (optional), release month and year

Choose an IGB genome assembly version identifier to represent the UCSC genome.

IGB uses genus, species, strain (optionally), and release month and year to identify genome assembly versions for a species, individual, strain, or cultivar whose genome was sequenced.

IGB genome assembly version identifiers look like:

G_species_strain_Mmm_YYYY

where

  • G is the first letter (upper-case) of the genus name
  • species is the species name (lower-case)
  • strain is cultivar, strain, or individual whose genome was sequenced (this is optional and not usually needed for UCSC-managed genomes)
  • Mmm is the three-letter English abbreviation of the month of the release (first letter is upper-case)
  • YYYY is the year of the release

Examples:

  • P_troglodytes_Oct_2010 genome assembly for chimp released Oct 2010
  • Z_mays_B73_Mar_2010 genome assembly for maize plant cultivar B73 released March 2010
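
A candidate name can be checked against this pattern before you create the directory. The sketch below is an illustration only: the function name is_valid_genome_name is made up, and the pattern assumes that the optional strain part contains only letters and digits, which the convention above does not state.

```shell
# Succeed if the argument looks like G_species[_strain]_Mmm_YYYY.
# Assumption: the optional strain part contains only letters and digits.
is_valid_genome_name() {
    printf '%s\n' "$1" | grep -Eq \
        '^[A-Z]_[a-z]+(_[A-Za-z0-9]+)?_(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)_[0-9]{4}$'
}

for name in P_troglodytes_Oct_2010 Z_mays_B73_Mar_2010 G_species_Mmm_YYYYY; do
    if is_valid_genome_name "$name"; then
        echo "ok:  $name"
    else
        echo "bad: $name"
    fi
done
```

The first two names are the examples above and pass; the third fails because Mmm is not a month abbreviation and the year has five characters.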

Use svn mkdir to create a new genome version directory for the genomes repo

Use svn mkdir to create a new genome directory. The name of the new directory should be identical to the IGB QuickLoad genome version name.

Code Block
svn mkdir G_species_Mmm_YYYY
Warning

Don't use the UNIX mkdir command to create the directory and then use svn add to add it to the repo later. If you do this and are not careful, you will accidentally add large sequence data files to the repo.

Change directories into the newly created genome version directory

Code Block
cd G_species_Mmm_YYYY

Download the sequence data

Go to UCSC Genome Bioinformatics and click Downloads > Genome Data. Click the link for your species and then click the link labeled Full data set.

This will take you to a directory where you can download files. Typically, the address of the directory is

http://hgdownload.soe.ucsc.edu/goldenPath/UCSCNAME/bigZips/

For example, UCSC's danRer7 genome is in http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/.

Is the 2bit file available?

For most of the more recent genomes, UCSC is using the 2bit format to distribute sequence data. However, some older versions may not make this available.

If yes, download it using wget.

Right-click the link in your browser and select "Copy Link Location."

Return to your terminal UNIX shell, type wget, and paste the URL into the shell.

For example,

Code Block
wget http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/danRer7.2bit
Rename the file to G_species_Mmm_YYYY.2bit.

The prefix (file name part) of the 2bit file for a genome should be the genome version identifier and suffix (file extension) should be 2bit.

For example,

Code Block
mv danRer7.2bit D_rerio_Jul_2010.2bit
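
The download and rename steps follow the same pattern for every genome, so they can be sketched with two variables. This is a dry run that only prints the commands; the function name ucsc_2bit_url is made up for illustration, and the URL layout is the typical one described above, so check that the file really exists at UCSC before relying on it.

```shell
# Build the typical UCSC download URL for a genome's 2bit file.
# (ucsc_2bit_url is a hypothetical helper; the layout is the "typical"
# bigZips address described above and may not hold for every genome.)
ucsc_2bit_url() {
    printf 'http://hgdownload.soe.ucsc.edu/goldenPath/%s/bigZips/%s.2bit\n' "$1" "$1"
}

ucsc_name=danRer7            # UCSC database name
igb_name=D_rerio_Jul_2010    # IGB genome version identifier

# Dry run: print the commands instead of running them.
echo "wget $(ucsc_2bit_url "$ucsc_name")"
echo "mv $ucsc_name.2bit $igb_name.2bit"
```

For the zebrafish values shown, the printed commands match the wget and mv examples above.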
If not, download the sequence data in fasta format.

For older genomes, UCSC provides sequence data in a so-called "bigZip" file which contains each assembled sequence (typically one per physical chromosome) in a separate fasta file.

For example, as of this writing, the dm3 (fruit fly) genome is provided in this format.

Right-click the link in your browser and select "Copy Link Location."

Return to your terminal UNIX shell, type wget, and paste the URL into the shell.

For example,

Code Block
wget http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz
Unpack the file

Use tar with options xvf to uncompress the file while also extracting its contents. An ".fa" file for each chromosome will appear when completed.

Code Block
tar xvf chromFa.tar.gz
Note

Once you've created the 2bit file for the genome assembly, you'll delete the .fa and the .gz files.

Create 2bit file using faToTwoBit

Get faToTwoBit installed. In the following example, the MacOS version is downloaded:

Code Block
wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/faToTwoBit
mv faToTwoBit ~/bin
chmod a+x ~/bin/faToTwoBit

Convert the fa files to twoBit

faToTwoBit will read one or more fasta files and convert them to a single 2bit file. To understand how to run it, type the name of the program. If you run it without any arguments, it will print a usage message.

Code Block
faToTwoBit
faToTwoBit - Convert DNA from fasta to 2bit format
usage:
   faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
options:
   -noMask       - Ignore lower-case masking in fa file.
   -stripVersion - Strip off version number after . for genbank accessions.
   -ignoreDups   - only convert first sequence if there are duplicates

The prefix (file name part) of the 2bit file you create for this genome version should be the genome version identifier and suffix (file extension) should be "2bit."

Code Block
faToTwoBit *.fa G_species_Mmm_YYYY.2bit
Delete .fa files and the chromFa.tar.gz file
Code Block
rm *.fa
rm chromFa.tar.gz
Note

You can always re-create the .fa files using twoBitToFa, another UCSC tool.

Note

The 2bit file is typically smaller than the compressed chromFa.tar.gz file it replaced. Unlike the .tar.gz file, it supports random access, allowing IGB to support partial loading of sequence from the IGBQuickLoad site into the coordinates track.

Make genome.txt file

Use twoBitInfo to create a genome.txt file for the genome. This file lists sequences and their sizes for the genome.

Code Block
twoBitInfo G_species_Mmm_YYYY.2bit genome.txt
Note

You can use twoBitInfo to make BED files marking the location of N's in the genome or calculate the amount of non-N sequence in an assembly.
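
Because genome.txt is just a two-column, tab-separated list of sequence names and sizes, a quick summary is an easy sanity check. The sketch below is an illustration only; summarize_genome is a made-up name and the sample sizes are invented, not taken from any real assembly.

```shell
# Count the sequences and add up their sizes (column 2) in a genome.txt file.
# (summarize_genome is a hypothetical helper; reads files given as
# arguments, or standard input if none are given.)
summarize_genome() {
    awk -F'\t' '{ total += $2 } END { printf "%d sequences, %.0f bases\n", NR, total }' "$@"
}

# Invented sample data standing in for a real genome.txt.
printf 'chr1\t100\nchr2\t200\nchrM\t50\n' | summarize_genome
# prints: 3 sequences, 350 bases
```

On a real file you would run summarize_genome genome.txt and compare the sequence count with what you expect for the assembly.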

Sort the genome.txt file by sequence size

The order of sequences listed in the genome.txt is how they will appear in the IGB Current Sequence tab when users visit the genome version in IGB.

Check the genome.txt file. Are the chromosomes listed in a reasonable order?

If you created the 2bit file from fasta files, then they will probably be listed in alphabetical order.

Depending on the state of the genome assembly, i.e., how close to complete it is, it's usually much better to list the chromosomes by size, with larger sequences appearing first in the list.

To ensure that the chromosomes are listed with largest ones first, sort the file

Code Block
sort -k2,2nr genome.txt > tmp
mv tmp genome.txt
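
To see what the sort key does, here is the same sort applied to a tiny sample. The sequence names and sizes are invented for illustration, and sort_by_size is a made-up wrapper name. The key -k2,2nr means: use only column 2, compare it numerically (n), and reverse the order (r) so the largest size comes first.

```shell
# Sort genome.txt-style input by column 2, largest first.
# (sort_by_size is a hypothetical wrapper around the sort command above.)
sort_by_size() {
    sort -k2,2nr "$@"
}

# Invented sample: alphabetical order puts chr10 before chr2.
printf 'chr1\t100\nchr10\t300\nchr2\t200\n' | sort_by_size
# prints:
# chr10   300
# chr2    200
# chr1    100
```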

Add genome.txt to the repo

Add the genome.txt file to the repo

Code Block
svn add genome.txt

Edit contents.txt

Add the new genome to the contents.txt in the top level of your checked-out quickload directory.

The contents.txt file is a tab-separated file with two columns:

  • Column 1 - genome directory
  • Column 2 - genome title

The genome title is what IGB will display in the title bar when users visit the new genome.

To create the title, follow the same conventions as for the other UCSC genomes. Title begins with genus and species, followed by the date of the release (in parentheses), followed by the common name for the species, followed by the UCSC genome name (in parentheses).

For example,

Code Block
Cavia porcellus (Feb 2008) guinea pig (cavPor3)
Warning

Note that this file is tab-separated, so when you edit the file, be sure to use a tab character to separate the genome version and genome title fields.
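
Tabs and spaces look the same in many editors, so it is worth checking the file after editing. The following sketch flags any line that does not have exactly two tab-separated fields; check_two_columns is a made-up name, and the sample lines reuse the chicken genome that appears in the .htaccess example later in this guide.

```shell
# Fail (and report the line) if any line lacks exactly two tab-separated fields.
# Note: blank lines are reported too.
# (check_two_columns is a hypothetical helper, not an IGBQuickLoad script.)
check_two_columns() {
    awk -F'\t' '
        BEGIN { bad = 0 }
        NF != 2 { printf "line %d: expected 2 tab-separated fields, found %d\n", NR, NF; bad = 1 }
        END { exit bad }
    ' "$@"
}

# Line 1 uses a tab; line 2 mistakenly uses a space.
printf '%s\t%s\n%s %s\n' \
    G_gallus_Nov_2011 'Gallus gallus (Nov 2011) chicken (galGal4/ICGC Gallus-gallus-4.0)' \
    G_gallus_Nov_2011 'Gallus gallus (Nov 2011) chicken (galGal4/ICGC Gallus-gallus-4.0)' \
    | check_two_columns || echo "contents.txt needs fixing"
```

On the real file, run check_two_columns contents.txt from the top level of your checked-out quickload directory.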

Edit files to make the Web site look nicer.

The main job of a QuickLoad site is to enable users to load data into IGB. However, a QuickLoad site is also a Web site, and so it's important to provide some additional files for users who visit the site in their Web browser. When you add a new set of files to the main IGBQuickLoad site, you need to also:

  • provide appropriate HEADER.html files for the genome version directory
  • edit the .htaccess file in the QuickLoad root directory (one level above the genome version directories)

Add a new HEADER.html file to the genome directory.

For this, use the script named writeQuickLoadHeaderUCSC.py. To run it, make sure you are in the top level of the QuickLoad directory and enter:

Code Block
writeQuickLoadHeaderUCSC.py G_species_Mmm_YYYY > G_species_Mmm_YYYY/HEADER.html

The script will read the contents.txt file, look for the human-readable title of the genome (column 2) for the directory G_species_Mmm_YYYY, and then print text for HEADER.html to stdout.

Edit the .htaccess file

When users open a location in the IGB QuickLoad site using their Web browser, the browser displays a list of files and some descriptive text next to each file. This happens because the IGB QuickLoad site has a file named .htaccess that causes Apache to display this information.

So when you add a new file type or a new genome to the QuickLoad site, you also need to add a new Description to the .htaccess file. Open the file in a text editor and add one new line for the genome, using the same text you added to the contents.txt file, but with the order of the columns reversed. Use the same text to ensure consistency between what IGB shows in the window title bar, what the HEADER.html title displays, and what appears in the directory description when users view the Web site in their Web browser.

For example:

Code Block
AddDescription "Gallus gallus (Nov 2011) chicken (galGal4/ICGC Gallus-gallus-4.0)" G_gallus_Nov_2011

Note: Add the new text to the end of the file.
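
Since the .htaccess line is the contents.txt line with the columns swapped, it can be generated rather than retyped, which avoids inconsistencies. This is a sketch only; make_descriptions is a made-up name and is not one of the QuickLoad scripts.

```shell
# Turn contents.txt lines (directory <tab> title) into AddDescription lines.
# (make_descriptions is a hypothetical helper; lines without exactly
# two tab-separated fields are skipped.)
make_descriptions() {
    awk -F'\t' 'NF == 2 { printf "AddDescription \"%s\" %s\n", $2, $1 }' "$@"
}

printf 'G_gallus_Nov_2011\tGallus gallus (Nov 2011) chicken (galGal4/ICGC Gallus-gallus-4.0)\n' \
    | make_descriptions
# prints:
# AddDescription "Gallus gallus (Nov 2011) chicken (galGal4/ICGC Gallus-gallus-4.0)" G_gallus_Nov_2011
```

You could pipe just the new genome's line through the function, for example with grep, and paste the result at the end of .htaccess.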

Test the new genome

Testing is absolutely critical as it is easy to make an error along the way. Plan to spend at least as much time testing as you spent on building the site!

Test under the released version of IGB

Download the latest release of IGB from http://www.bioviz.org.

Configure data sources under Data Sources tab

Click the configure link.

  • Add your local copy of the IGB QuickLoad site to your list of data sources
  • Remove all data sources EXCEPT the UCSC DAS1 server and your local QL directory

Check that the new genome version appears

In the species menu, you should see the organism's genus, species, and strain listed. When you hover the mouse over its menu item, you should also see a tooltip reporting the species' colloquial name (in English).

In the genome version menu, you should see the genome version name listed.

If not, edit species.txt

If you don't, this only means that IGB doesn't recognize it. To ensure IGB recognizes the genome, add it to the file species.txt that resides at the top level of the IGBQuickLoad directory, the same level as contents.txt.

species.txt is a tab-separated file in which each line represents synonyms for a genome. The first column should list the full Linnean name for the species; this is what will appear in the species menu. It can include spaces. The second column should list the common (colloquial) name for the species, in English. The next column should contain the IGB genome version minus the date and also minus any numbers at the end of the name. The next column should contain the UCSC genome version, minus the number. The final column should contain the genus and species name joined by an underscore.

For example:

Code Block
Pan troglodytes	Chimp	P_troglodytes	panTro	Pan_troglodytes
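
The third and fourth columns can be derived from the names you already have. The sketch below shows one way to do the trimming with sed; both function names are made up for illustration, and the first function only removes the date, so any trailing numbers left in an IGB name would still need to be removed by hand.

```shell
# Remove the trailing _Mmm_YYYY date from an IGB genome version name.
# (igb_species_prefix and ucsc_prefix are hypothetical helpers.)
igb_species_prefix() {
    printf '%s\n' "$1" | sed -E 's/_[A-Z][a-z]{2}_[0-9]{4}$//'
}

# Remove the trailing number from a UCSC genome version name.
ucsc_prefix() {
    printf '%s\n' "$1" | sed -E 's/[0-9]+$//'
}

igb_species_prefix P_troglodytes_Oct_2010   # prints P_troglodytes
ucsc_prefix danRer7                         # prints danRer
```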

Restart IGB and re-test

When you're done, restart IGB and check that the species and version are displayed correctly.

Use the Species and Genome menus (under Current Sequence tab) to select your new genome

Choose your genome version. The chromosomes from your genome.txt file should then appear in the Current Sequence tab. You should also see the corresponding UCSC DAS1 data source appear in the Data Sets/Data Sources tree on the left. If this step fails, there may be a problem with your genome.txt file.

Test that you can load sequence data.

Check that IGB can load sequence data from your local 2bit file.

Visit a chromosome, zoom in, and click Load Sequence. Zoom in to check that the sequence is visible. Note that depending on which sequence file you used, some letters will be lower-case. This is how UCSC masks repetitive or low-complexity sequence.

Repeat the above tests using the latest (trunk) version of IGB

To get a copy of the trunk, check it out using svn from SourceForge OR download it from the test deployment site for IGB (http://test.bioviz.org/igb/). To get the trunk, follow the Download links.

Check the new site looks OK in a Web browser

Open the site (http://localhost if it's on your local computer) in a Web browser and check that

  • there are no typos or errors in the descriptive text next to the new genome directory
  • the genome directory has a header and all the files are listed
  • each file has a description
  • all the links in the HEADER file still work (report any broken links as a Jira issue and notify Ann)

If all tests pass on both versions, use svn to check in your changes.

These include:

  • edits to contents.txt
  • addition of the new genome directory
  • addition of the genome.txt file for the genome
  • edits to species.txt (possibly)

When you're finished, submit a ticket to IGB Jira to request testing of the newly added genome on a test site.

If you downloaded the 2bit file from UCSC, include a link to the file. If you created it from a fasta file, include a link to the fasta file instead.