Introduction

An IGB user contacted us with a request to add a new genome to the IGB QuickLoad system - the Xenopus tropicalis genome assembly dated November, 2009, also called JGI 4.2/xenTro3. (See: http://genome.ucsc.edu/cgi-bin/hgGateway?hgsid=262072157&clade=vertebrate&org=X.+tropicalis&db=0).

Because this genome is supported at the UCSC Genome Browser, adding it to the system should be fairly straightforward. All we need to do is download the data, format it, and move it onto the main IGB QuickLoad site currently hosted at UNC Charlotte, in the Bioinformatics and Genomics Department's server room.

Note

As of IGB 10.1.0, most UCSC genome versions and tracks should be available by default in IGB via the UCSC REST data provider.

Methods

Downloading sequence data (2bit file) from UCSC Genome Bioinformatics

Step one is to download the sequence data.

...

Next, I changed into the new directory and then downloaded this 2bit file using wget in the usual way:

Panel

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/xenTro3/bigZips/xenTro3.2bit

Once the file downloaded, I changed the name and checked the file size:

Panel

$ mv xenTro3.2bit X_tropicalis_Nov_2009.2bit
$ ls -lh *.2bit
-rw-r--r--  1 pi  staff   373M Sep  9  2011 X_tropicalis_Nov_2009.2bit

I changed the name because IGB expects the genome sequence file name to match the genome version name. That is, when a user requests the sequence to be loaded into IGB, IGB will look for a file named X_tropicalis_Nov_2009.2bit and then use that file to retrieve the sequence data.
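
The naming rule can be checked mechanically before committing. The following is a minimal sketch, assuming the genome version directory is itself named after the genome version (as the sibling directory V_vinifera_Mar_2010 used later in this tutorial suggests). The function name check_2bit_name is hypothetical and not part of IGB.

```shell
# Hypothetical helper: verify that a genome version directory contains a
# 2bit file whose base name matches the directory (genome version) name.
check_2bit_name() {
    dir=${1%/}                      # strip one trailing slash, if any
    version=$(basename "$dir")      # e.g. X_tropicalis_Nov_2009
    if [ -f "$dir/$version.2bit" ]; then
        echo "OK: $version.2bit"
    else
        echo "MISSING: $dir/$version.2bit" >&2
        return 1
    fi
}
```

For example, `check_2bit_name ../X_tropicalis_Nov_2009` would be run from a sibling directory. Pass the directory by name rather than as ".", since the check relies on the last path component.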

...

Panel

$ svn add X_tropicalis_Nov_2009.2bit
A  (bin)  X_tropicalis_Nov_2009.2bit
$ svn ci X_tropicalis_Nov_2009.2bit -m "Downloaded from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/xenTro3/bigZips/xenTro3.2bit) today."
Adding  (bin)  X_tropicalis_Nov_2009.2bit

Transmitting file data .
Committed revision 438.

Creating genome.txt using twoBitInfo

Next, I used twoBitInfo to make a "genome.txt" file reporting the names of the assembled chromosomes and contigs and their sizes:

Panel

$ twoBitInfo X_tropicalis_Nov_2009.2bit genome.txt

This creates the genome.txt file IGB needs to display contig and chromosome names and their sizes in the Current Genome tab.
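
twoBitInfo writes one line per sequence, with the sequence name and its size separated by a tab. A quick sanity check on the result might look like the sketch below. The function name check_genome_txt is hypothetical, and the check only validates the two-column layout, not the actual names or sizes.

```shell
# Hypothetical helper: succeed only if every line of the given file has the
# form "<sequence name><TAB><positive integer size>".
check_genome_txt() {
    awk -F '\t' '
        NF != 2 || $2 !~ /^[0-9]+$/ || $2 == 0 {
            bad = 1
            print "bad line " NR ": " $0
        }
        END { exit bad }
    ' "$1"
}
```

Running `check_genome_txt genome.txt` prints any malformed lines and returns a non-zero status if it finds one.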

...

Panel

$ svn ci genome.txt -m "Output from twoBitInfo on current version of X_tropicalis_Nov_2009.2bit"
Adding         genome.txt
Transmitting file data .
Committed revision 439.

Downloading genome annotations (ESTs and RefSeq genes) from UCSC Table Browser

My plan (currently) is to get the RefSeq genes track from UCSC and deploy it on our QuickLoad site. I'll provide meta-data about the data set using the annots.xml file. (More on this later.)

...

I also checked the ESTs data set. There are around 1.5 million ESTs for Xenopus. I'll download that data set as well, but set it up so that users can access it on a region-by-region basis by sorting it and indexing it with the tabix utility. I'll also need to massage the format a bit to get it to match the PSL (BLAT output) format.

Processing ESTs using tabix to support fast access in IGB

To process the EST data set, I used the following command to drop the header line and strip off the first column (the "bin" column):

Panel

$ gunzip -c X_tropicalis_Nov_2009_all_est.gz | grep -v bin | cut -f2- > X_tropicalis_Nov_2009_all_est.psl
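
One caveat with this command: grep -v bin discards every line that contains the text "bin" anywhere, so an alignment whose EST name happened to include those letters would be dropped along with the header. A more targeted variant is sketched below. It assumes the header line starts with "#", which is typical of UCSC table dumps but was not verified for this file. The function name strip_bin_column is hypothetical.

```shell
# Hypothetical helper: read a UCSC-style table dump on stdin, drop header or
# comment lines that start with "#", and remove the first (bin) column.
strip_bin_column() {
    grep -v '^#' | cut -f2-
}
```

It would slot into the same pipeline: `gunzip -c X_tropicalis_Nov_2009_all_est.gz | strip_bin_column > X_tropicalis_Nov_2009_all_est.psl`.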

Next, I sorted and created an index using bgzip and tabix:

Panel

$ sort -k14,14 -k16,16n X_tropicalis_Nov_2009_all_est.psl > sorted.psl

$ mv sorted.psl X_tropicalis_Nov_2009_all_est.psl

$ bgzip X_tropicalis_Nov_2009_all_est.psl

$ tabix -s 14 -b 16 -0 X_tropicalis_Nov_2009_all_est.psl.gz

The sort command first sorts on field 14 alone (-k14,14) and then sorts numerically on field 16 alone (-k16,16n). The first key (field 14) orders the file by target sequence name and the second key (field 16) orders the alignments by start position. After sorting, bgzip block-compresses the file.
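
The sort order can be made reproducible and then verified before indexing. The sketch below is a variation on the command above, not what was actually run: it pins the locale with LC_ALL=C and sets the field separator to a tab explicitly, then uses sort -c to confirm the result. The function names sort_psl and psl_is_sorted are hypothetical.

```shell
TAB=$(printf '\t')

# Sort a PSL file by target name (field 14), then numerically by target
# start (field 16). LC_ALL=C pins the collation order across machines.
sort_psl() {
    LC_ALL=C sort -t "$TAB" -k14,14 -k16,16n "$1"
}

# Succeed only if the file is already in the order sort_psl would produce.
psl_is_sorted() {
    LC_ALL=C sort -c -t "$TAB" -k14,14 -k16,16n "$1" 2>/dev/null
}
```

For example, `sort_psl X_tropicalis_Nov_2009_all_est.psl > sorted.psl && psl_is_sorted sorted.psl`. A file sorted under a different locale may not pass this check, because the order of sequence names can differ.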

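The last command (tabix) creates a tabix index (.tbi) file that IGB knows to look for when it encounters files with extension .gz. The -s and -b options name the columns holding the sequence name and the start position, and -0 tells tabix that the positions are zero-based.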
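
Because tabix writes the index next to the data file with .tbi appended to its name, a directory listing is enough to see which compressed files have an index. The sketch below is one way to summarize that. The function name list_index_status is hypothetical, and a "plain" result is not necessarily an error, since files meant to be loaded whole do not need an index.

```shell
# Hypothetical helper: for each .gz file in a directory, report whether a
# tabix index (<file>.gz.tbi) sits next to it.
list_index_status() {
    for f in "$1"/*.gz; do
        [ -e "$f" ] || continue      # no .gz files: the glob stays literal
        if [ -f "$f.tbi" ]; then
            echo "indexed $(basename "$f")"
        else
            echo "plain $(basename "$f")"
        fi
    done
}
```

Running `list_index_status .` in the genome version directory prints one line per compressed file.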

Next, I added each of the data files to the IGB QuickLoad Subversion repository: the sorted, bgzip-compressed EST data file, its index file, a gzip-compressed file for the RefGene data set, and a gzip-compressed file for the mRNA data set. Since the RefGene and mRNA files will both be loaded as soon as the user visits the genome version, I did not bother to make tabix'd versions of those files.

Deploying the files on QuickLoad

I've created a number of data files, and so my next step will be to try opening them in IGB. I also want to test whether IGB will be able to open and display the genome sequence.

So my next step is to create an annots.xml file IGB will use to get a listing of the annotations available for this genome as well as styling information, e.g., the background colors to use for gene annotations, whether to load all the annotations immediately, and so on.

Creating new annots.xml file for Xenopus genome and annotations

Being a bit lazy, I usually just copy and paste another annots.xml file from another part of the repository when setting up a new genome. I use svn for this, however:

Panel

$ svn cp ../V_vinifera_Mar_2010/annots.xml .
A         annots.xml

I then opened a simple text editor (like TextEdit) and edited the file, like so:


Recently, we added the capability to specify various styles for annotation files delivered via QuickLoad. Most of these (e.g., foreground and background) are self-explanatory, but a few are not so obvious as they pertain to specialized aspects of how IGB presents data.

The "max_depth" parameter refers to the maximum number of overlapping annotations that can appear in a stack within a track. The "name_size" parameter specifies the font size for the track labels that appear on the left-hand side of the main display menu. The "url" parameter specifies where IGB links to when the user clicks the info button (blue "i" icon) next to a data set. In this case, we link back to the main UCSC Web page, since this is where the data came from originally. The "name" parameter indicates the file name IGB should load, and the "description" parameter specifies the tooltip text that will appear when the user hovers the mouse over the data set in the Data Access Panel. The "title" specifies the name of the data set as it will appear to the user.
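
To make the parameters above concrete, here is a minimal sketch of what a single entry might look like, written out from the shell. It is not the file used in this tutorial. The element names (files, file), the color notation, and every attribute value are assumptions for illustration; only the attribute names and the EST file name come from the text above. The function name write_annots_sketch is hypothetical.

```shell
# Hypothetical sketch: write a one-entry annots.xml to the given path.
# Element names and all attribute values are illustrative assumptions.
write_annots_sketch() {
    cat > "$1" <<'EOF'
<files>
  <file name="X_tropicalis_Nov_2009_all_est.psl.gz"
        title="ESTs"
        description="Placeholder tooltip text for the EST data set"
        url="http://genome.ucsc.edu"
        foreground="000000"
        background="FFFFFF"
        max_depth="10"
        name_size="12"/>
</files>
EOF
}
```

Calling `write_annots_sketch /tmp/annots_sketch.xml` produces a file to compare against a working annots.xml from another genome directory before editing the real one.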

Testing new genome version directory in IGB

Finally, I added my local QuickLoad site to IGB (using the Configure link in the Data Access panel) and loaded up the new genome.

...

My last step was to log into the main IGB QuickLoad site and run "svn up" to deploy the new genome and all its associated data files on the main site to ensure that all IGB users can now access the data.

Conclusion

All in all, the entire process, not counting breaks for lunch and meetings, took about four hours, including the time it took to write this tutorial.

Visualization of the newly added genome

Here is what the genome will look like for someone visiting this genome version for the first time: