Introduction

UCSC genome bioinformatics supports mammalian, insect, fish, avian, and some fungal genomes.

...

The following describes how to add a new genome version to the IGB QuickLoad repository and update IGB species.txt.

Command-line utilities you'll need

...

  • faToTwoBit and twoBitInfo from the UCSC Jim Kent tools (faToTwoBit is needed only if no 2bit file is available; twoBitInfo is needed to generate the genome.txt file). Available from http://hgdownload.cse.ucsc.edu/admin/exe/
  • UNIX wget (not installed by default on Mac but available on most other UNIX systems)
  • UNIX sort (should be pre-installed on any UNIX system, including Mac)
  • A version of git for your platform.
  • A version of subversion (svn) for your platform.
  • IGBQuickLoad scripts from the genomes source code (pub/src) repository.

...

Get the compiled programs from UCSC and make sure they are executable on your system.

Code Block
wget http://hgdownload.cse.ucsc.edu/admin/exe/

...

Save these programs in a directory in your PATH and make sure they are executable on your system.

If you're doing this on a Mac desktop or laptop computer, create a directory called "bin" in your home directory and save all compiled binaries there. Edit your .bash_profile file to include a line like the following to ensure that the shell can find the programs.

Code Block

export PATH=.:$HOME/bin:$PATH

For example, the following sequence of commands downloads the software, moves it to a directory named "bin" in the home directory, and then makes it executable using the chmod command.

Code Block

wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/twoBitInfo
mv twoBitInfo ~/bin
chmod a+x ~/bin/twoBitInfo
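Before going further, it can help to confirm that the shell actually finds each utility. The following is a minimal sketch; the function name check_tools is illustrative and not part of the QuickLoad scripts.

```shell
# Report which of the named command-line tools the shell can find on PATH.
# check_tools is a hypothetical helper, not part of QuickLoad or the UCSC tools.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check the utilities this guide relies on.
check_tools twoBitInfo faToTwoBit wget sort git svn \
    || echo "Install the missing tools before continuing."
```

If a tool you just installed is reported as missing, open a new terminal or re-read your .bash_profile so the PATH change takes effect.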

Step-by-step guide to adding a new UCSC genome to IGBQuickLoad

The following instructions explain how to

  • set up your local copy of the public QuickLoad data repository
  • set up your environment to run QuickLoad scripts
  • get annotation files and sequence files from UCSC Genome Bioinformatics
  • convert files to random access, indexed file formats that enable partial data loading in IGB
  • update meta-data files IGB requires to update its interface and allow users to access the newly added genome
  • update HEADER.html and other files describing the new genome
  • commit your new files and updates to the repo
  • submit a Jira ticket requesting that the main site and mirror sites be updated

If you have questions don't hesitate to ask Ann.

Get the QuickLoad data repo

Check out or update a copy of IGB QuickLoad data directories.

Open a terminal (UNIX shell) and change into the directory where you want your checked-out copy of the genomes repository to reside.

For example, you might do this:

Code Block

cd # change into your home directory
svn co https://svn.transvar.org/repos/genomes/trunk/pub/quickload

Or, if you already have a copy, just update using svn up.

Code Block

svn up

This will update everything in the current working directory and all the directories beneath it. To avoid conflicts with other people's committed changes, be sure to update your local copy of IGB QuickLoad repo before starting work.

Questions about using subversion? See: Version Control with Subversion.

Configure your Apache web server to serve the IGB QuickLoad data directories via http

You'll need this to test that the new genome directory looks OK when visited in a Web browser. How you do this will depend on your computer. The following instructions explain how to do this on a Mac.

  • Use locate to find your local copy of httpd.conf, the Apache configuration file. Probably it's located at /private/etc/apache2/httpd.conf, depending on your system.
  • Open a terminal window and change into the same directory as the configuration file.
  • Make a backup copy of the file:
Code Block

cp httpd.conf httpd.conf.bak
  • Use sudo to open the file in a text editor like pico or emacs and enter your password. *Note* this only works if you have admin privileges on your computer. If you can't edit this file, then you'll need to get help before proceeding.
Code Block

sudo pico httpd.conf
  • Find the place in the file that says DocumentRoot. Comment out the current DocumentRoot and substitute the full path to your checked-out copy of the QuickLoad repository:
Code Block

#DocumentRoot "/Library/WebServer/Documents"
DocumentRoot "/Users/username/quickload"

and

Code Block

#<Directory "/Library/WebServer/Documents">
<Directory "/Users/username/quickload">

where quickload is your copy of the checked-out repo.

  • Restart Apache. To restart Apache on a Mac, open Apple > System Preferences ... > Sharing and select Web Sharing. If it is already selected, that means Apache is already running. Unselect it to stop Apache and then select it again to restart Apache.
  • Open a Web browser and enter url http://localhost. (You may need to refresh your browser.)
  • You should now see something that looks exactly like the public IGB QuickLoad site.
Note
Now you can configure IGB to access your local copy of IGB QuickLoad using either the URL http://localhost or the file chooser, because IGB supports QuickLoad access via the Web (http) or from local files.

Get the QuickLoad data repo

Check out or update a copy of IGB QuickLoad data directories.

Open a terminal (UNIX shell) and change into the directory where you want your checked-out copy of the genomes repository to reside.

For example, you might do this:

Code Block
cd # change into your home directory
svn co https://svn.bioviz.org/repos/genomes/quickload/

To ensure you will be able to upload changes, request a user id and password from Dr. Loraine. Alternatively, use user name "guest" and password "guest" for read-only access.

Or, if you already have a copy, just update using svn up. Change into your local copy and run:

Code Block
svn up

This will update everything in the current working directory and all the directories beneath it. To avoid conflicts with other people's committed changes, be sure to update your local copy of IGB QuickLoad repo before starting work.

Questions about using subversion? See: Version Control with Subversion.

Configure your Apache web server to serve the IGB QuickLoad data directories via http [Optional]

You'll need this only if you want to test how the new genome directory looks when visited in a Web browser. How you do this will depend on your computer.

For this, you need to install Apache and then configure it to use the checked-out copy of the QuickLoad content as the "DocumentRoot".

Check out or update a copy of IGB QuickLoad source code (src) directory.

As before, open a terminal and change into the directory where you want your checked-out copy of the genomesource code to reside. A good place for checked-out code is a directory named src in your home directory.

To set up a src directory for checked-out code:

Code Block

cd
mkdir src
cd src

Then, use git to get a copy of the QuickLoad source code from the repo and save it to a directory named quickload_src:

Code Block
git clone https://bitbucket.org/lorainelab/genomesource quickload_src

To ensure that you'll be able to run the code, add the new directory to your PATH and PYTHONPATH environment variables by editing your .bash_profile startup script:

Code Block

export PATH=$HOME/src/quickload_src:$PATH
export PYTHONPATH=$HOME/src/quickload_src:$PYTHONPATH

...

Use svn mkdir to create a new genome directory. The name of the new directory should be identical to the IGB QuickLoad genome version name.

Code Block

svn mkdir G_species_Mmm_YYYY

...

Change directories into the newly created genome version directory:

Code Block

cd G_species_Mmm_YYYY
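Because the directory name doubles as the genome version name, a quick format check before running svn mkdir can catch typos. This is a sketch only: is_genome_version is a hypothetical helper, and the pattern is inferred from the examples in this guide (such as D_rerio_Jul_2010), so it may be too strict for unusual species names.

```shell
# Succeed if the argument looks like G_species_Mmm_YYYY, e.g. D_rerio_Jul_2010.
# is_genome_version is a hypothetical helper; the pattern is an assumption
# based on the genome version names shown in this guide.
is_genome_version() {
    printf '%s\n' "$1" | grep -Eq '^[A-Z]_[a-z]+_[A-Z][a-z]{2}_[0-9]{4}$'
}

is_genome_version D_rerio_Jul_2010 && echo "D_rerio_Jul_2010: ok"
is_genome_version danRer7 || echo "danRer7: not an IGB genome version name"
```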

Download the sequence data

...

Return to your terminal, type wget, and paste the URL into the shell.

For example,

Code Block

wget http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/danRer7.2bit

...

The prefix (file name part) of the 2bit file for a genome should be the genome version identifier, and the suffix (file extension) should be 2bit.

For example,

Code Block

mv danRer7.2bit D_rerio_Jul_2010.2bit

...

Return to your terminal, type wget, and paste the URL into the shell.

For example,

Code Block

wget http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz

...

Use tar with options xvf to uncompress the file while also extracting its contents. An ".fa" file for each chromosome will appear when completed.

Code Block

tar xvf chromFa.tar.gz
Note

Once you've created the 2bit file for the genome assembly, you'll delete the .fa and the .gz files.

...

Install faToTwoBit. In the following example, the macOS version is downloaded:

Code Block

wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/faToTwoBit
mv faToTwoBit ~/bin
chmod a+x ~/bin/faToTwoBit

Convert the fa files to twoBit

faToTwoBit will read one or more fasta files and convert them to a single 2bit file. To understand how to run it, type the name of the program. If you run it without any arguments, it will print a usage message.

Code Block

faToTwoBit
faToTwoBit - Convert DNA from fasta to 2bit format
usage:
   faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
options:
   -noMask       - Ignore lower-case masking in fa file.
   -stripVersion - Strip off version number after . for genbank accessions.
   -ignoreDups   - only convert first sequence if there are duplicates

The prefix (file name part) of the 2bit file you create for this genome version should be the genome version identifier, and the suffix (file extension) should be "2bit".

Code Block

faToTwoBit *.fa G_species_Mmm_YYYY.2bit

Delete the .fa files and the chromFa.tar.gz file.

Code Block

rm *.fa
rm chromFa.tar.gz
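Deleting the inputs is only safe once the 2bit file really exists. The sketch below wraps the cleanup in a guard; cleanup_fasta is a hypothetical helper name, and the check is deliberately simple (it only tests that the 2bit file is present and not empty).

```shell
# Remove the FASTA inputs only when the given 2bit file exists and is not empty.
# cleanup_fasta is a hypothetical helper, not part of the QuickLoad scripts.
cleanup_fasta() {
    local twobit="$1"
    if [ -s "$twobit" ]; then
        rm -f -- *.fa chromFa.tar.gz
        echo "removed FASTA files; kept $twobit"
    else
        echo "refusing to delete: $twobit is missing or empty" >&2
        return 1
    fi
}

# Intended use, run inside the genome version directory:
#   faToTwoBit *.fa G_species_Mmm_YYYY.2bit && cleanup_fasta G_species_Mmm_YYYY.2bit
```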

...

Use twoBitInfo to create a genome.txt file for the genome. This file lists sequences and their sizes for the genome.

Code Block

twoBitInfo G_species_Mmm_YYYY.2bit genome.txt
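twoBitInfo writes one line per sequence: the sequence name, a tab, and the size. A quick sanity check of that layout might look like the following sketch; check_genome_txt is a hypothetical helper and the sample data is made up.

```shell
# Succeed only if every line has exactly two tab-separated fields and the
# second field is a positive integer. check_genome_txt is a hypothetical helper.
check_genome_txt() {
    awk -F'\t' 'NF != 2 || $2 !~ /^[0-9]+$/ || $2 + 0 <= 0 { bad = 1 }
                END { exit bad + 0 }' "$1"
}

# Demonstration with made-up sequence names and sizes.
sample=$(mktemp)
printf 'chr1\t1000\nchr2\t500\n' > "$sample"
check_genome_txt "$sample" && echo "sample looks valid"
rm -f "$sample"
```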

...

To ensure that the chromosomes are listed with the largest ones first, sort the file numerically on the size column (column 2) in descending order:

Code Block

sort -k2,2nr genome.txt > tmp
mv tmp genome.txt
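To see what the sort does, here is a small sketch on made-up data. The wrapper name sort_genome_txt is hypothetical; it runs the same two commands but writes the temporary file next to the input instead of using a file named tmp.

```shell
# Rewrite a genome.txt-style file in place, largest sequence first.
# sort_genome_txt is a hypothetical wrapper around the sort and mv steps.
sort_genome_txt() {
    local file="$1"
    # -k2,2 limits the sort key to column 2; n compares numerically; r reverses.
    sort -k2,2nr "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Demonstration with made-up sizes listed in arbitrary order.
demo=$(mktemp)
printf 'chrM\t16571\nchr2\t2000\nchr1\t30000\n' > "$demo"
sort_genome_txt "$demo"
cat "$demo"   # chr1 first, then chrM, then chr2
rm -f "$demo"
```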

...

Add the genome.txt file to the repo

Code Block

svn add genome.txt

Edit contents.txt

...

To create the title, follow the same conventions as for the other UCSC genomes. Title begins with genus and species, followed by the date of the release (in parentheses), followed by the common name for the species, followed by the UCSC genome name (in parentheses).

For example,

Code Block

Cavia porcellus (Feb 2008) guinea pig (cavPor3)
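The title convention can be written down as a one-line template. The sketch below is only an illustration of the ordering described above; make_title and its argument order are hypothetical.

```shell
# Print "Genus species (Mon YYYY) common name (ucscName)".
# make_title and its argument order are illustrative only.
make_title() {
    printf '%s (%s) %s (%s)\n' "$1" "$2" "$3" "$4"
}

make_title "Cavia porcellus" "Feb 2008" "guinea pig" "cavPor3"
# prints: Cavia porcellus (Feb 2008) guinea pig (cavPor3)
```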

...

  • provide appropriate HEADER.md files for the genome version directory
  • edit the .htaccess file in the QuickLoad root directory (one level above the genome version directories)

Add a new HEADER.md file to the genome directory.

For this, use the script named writeQuickLoadHeaderUCSC.py. To run it, make sure you are in the top level of the QuickLoad directory and enter:

Code Block

writeQuickLoadHeaderUCSC.py G_species_Mmm_YYYY > G_species_Mmm_YYYY/HEADER.md

The script will read the contents.txt file, look for the human-readable title of the genome (column 2) for the directory G_species_Mmm_YYYY, and then print text for HEADER.md to stdout.

Edit the .htaccess file

...

So when you add a new file type or a new genome to the QuickLoad site, you also need to add a new Description to the .htaccess file. Open the file in a text editor and add one new line for the genome, using the same text you added to the contents.txt file, but with the order of the columns reversed. Use the same text to ensure consistency between what IGB shows in the window title bar, what the HEADER.html title displays, and what appears in the directory description when users view the Web site in their Web browser.

For example:

Code Block

AddDescription "Gallus gallus (Nov 2011) chicken (galGal4/ICGC Gallus-gallus-4.0)" G_gallus_Nov_2011
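Since the .htaccess line is the contents.txt entry with its columns swapped, it can be generated rather than retyped. The following sketch assumes contents.txt is tab-separated with the directory name in column 1 and the title in column 2, and that the title contains no double quotes; to_add_description is a hypothetical helper.

```shell
# Read contents.txt-style lines (directory, tab, title) on stdin and print
# the matching AddDescription lines. Assumes tab-separated input and titles
# without double quotes. to_add_description is a hypothetical helper.
to_add_description() {
    awk -F'\t' '{ printf "AddDescription \"%s\" %s\n", $2, $1 }'
}

printf 'G_gallus_Nov_2011\tGallus gallus (Nov 2011) chicken (galGal4/ICGC Gallus-gallus-4.0)\n' \
    | to_add_description
```

Review the generated line before pasting it into .htaccess, since the real contents.txt may carry details this sketch does not handle.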

...

species.txt is a tab-separated file in which each line represents synonyms for a genome. The first column should list the full Linnean name for the species; this is what will appear in the species menu. It can include spaces. The second column should list the common (colloquial) name for the species, in English. The next column should contain the IGB genome version minus the date and also minus any numbers at the end of the name. The next column should contain the UCSC genome version, minus the number. The final column should contain the genus and species name joined by an underscore.

For example:

Code Block

Pan troglodytes	Chimp	P_troglodytes	panTro	Pan_troglodytes
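The column rules above can be sketched as a few small string operations. All function names here are hypothetical, the date pattern is inferred from names like D_rerio_Jul_2010, and the IGB and UCSC version names passed in the example are illustrative inputs only.

```shell
# Drop a trailing _Mmm_YYYY date, then any trailing digits, from an IGB name.
strip_igb_date() {
    printf '%s\n' "$1" | sed -E 's/_[A-Z][a-z]{2}_[0-9]{4}$//; s/[0-9]+$//'
}

# Drop the trailing version number from a UCSC genome name.
strip_ucsc_number() {
    printf '%s\n' "$1" | sed -E 's/[0-9]+$//'
}

# Print the five tab-separated species.txt columns.
# Arguments: Linnean name, common name, IGB genome version, UCSC genome version.
species_line() {
    local linnean="$1" common="$2" igb="$3" ucsc="$4"
    printf '%s\t%s\t%s\t%s\t%s\n' "$linnean" "$common" \
        "$(strip_igb_date "$igb")" "$(strip_ucsc_number "$ucsc")" \
        "$(printf '%s' "$linnean" | tr ' ' '_')"
}

# Illustrative inputs; the version names are placeholders for this example.
species_line "Pan troglodytes" "Chimp" "P_troglodytes_Oct_2010" "panTro3"
```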

...

If you downloaded the 2bit file from UCSC, include a link to the file. If you created it from a fasta file, include a link to the fasta file instead.

Add annotations

The basic structure (sequence and meta-data about contigs and chromosomes) is now available. Next, you need to add the annotations. For this, see:

Updating RefGene UCSC data set for an existing genome in IGB QuickLoad