Adding a new genome - X. tropicalis Nov 2009

Introduction

An IGB user contacted us with a request to add a new genome to the IGB QuickLoad system - the Xenopus tropicalis genome assembly dated November, 2009, also called JGI 4.2/xenTro3. (See: http://genome.ucsc.edu/cgi-bin/hgGateway?hgsid=262072157&clade=vertebrate&org=X.+tropicalis&db=0).

Because this genome is supported at the UCSC Genome Browser, adding it to the system should be fairly straightforward. All we need to do is download the data, format it, and move it onto the main IGB QuickLoad site currently hosted at UNC Charlotte, in the Bioinformatics and Genomics Department's server room.

Methods

Step one is to download the sequence data.

The UCSC Web site provides a page that lists the different organisms supported in their system. I followed the link to the X. tropicalis genome and clicked the link labeled Full data set, which took me to the "bigZips" page here: http://hgdownload.cse.ucsc.edu/goldenPath/xenTro3/bigZips/.

Fortunately, it appears that UCSC already provides the fasta sequence data in 2bit format, saving me a step.

First, I added a new directory for this genome to my checked-out copy of the IGB QuickLoad subversion repository, like so:

$ svn mkdir X_tropicalis_Nov_2009
A X_tropicalis_Nov_2009
$ svn ci X_tropicalis_Nov_2009 -m "Adding new genome for frog; this is the same genome as JGI 4.2/xenTro3"
Adding X_tropicalis_Nov_2009

Committed revision 437.

Next, I changed into the new directory and then downloaded the 2bit file using wget in the usual way:

wget http://hgdownload.cse.ucsc.edu/goldenPath/xenTro3/bigZips/xenTro3.2bit

Once the file downloaded, I changed the name and checked the file size:

$ mv xenTro3.2bit X_tropicalis_Nov_2009.2bit
$ ls -lh *.2bit
-rw-r--r-- 1 pi staff 373M Sep 9 2011 X_tropicalis_Nov_2009.2bit

I changed the name because IGB expects the genome sequence file name to match the genome version name. That is, when a user requests the sequence to be loaded into IGB, IGB will look for a file named X_tropicalis_Nov_2009.2bit and then use that file to retrieve the sequence data.
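The naming rule above is easy to get wrong when setting up several genomes, so here is a minimal sketch of a check for it. The function name check_2bit_name is hypothetical (it is not part of IGB or the QuickLoad tooling), and it assumes the genome directory is named after the genome version, as in this walkthrough.

```shell
# Hypothetical helper (not part of IGB): verify that a genome directory
# contains a 2bit file named after the directory itself, e.g.
# X_tropicalis_Nov_2009/X_tropicalis_Nov_2009.2bit
check_2bit_name() {
    dir=${1%/}                    # strip one trailing slash, if present
    genome=$(basename "$dir")     # assumed: directory name = genome version name
    if [ -f "$dir/$genome.2bit" ]; then
        echo "OK: $dir/$genome.2bit"
    else
        echo "MISSING: expected $dir/$genome.2bit" >&2
        return 1
    fi
}
```

From the directory above the genome directory, this could be run as check_2bit_name X_tropicalis_Nov_2009. Pass the directory by name rather than as ".", since the check derives the genome name from the path.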

I checked the file size to determine whether it would be reasonable to add the file to the repository. I've done this for all the plant genomes we support, mainly as a convenience for mirroring the QuickLoad site in different locations. If the sequence file is version controlled, I can easily check it out to a new location, along with the associated notes in my commit messages. However, if the file were prohibitively large (> 1000 megabytes, for example), I might use another strategy to deploy it. But this size (373M) isn't too bad, so I'll add it to the QuickLoad subversion repository in the usual way:

$ svn add X_tropicalis_Nov_2009.2bit
A (bin) X_tropicalis_Nov_2009.2bit
$ svn ci X_tropicalis_Nov_2009.2bit -m "Downloaded from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/xenTro3/bigZips/xenTro3.2bit) today."
Adding (bin) X_tropicalis_Nov_2009.2bit

Transmitting file data .
Committed revision 438.
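The size check I did by eye before this commit could be scripted roughly as follows. This is only a sketch: the function name small_enough_for_svn is made up for illustration, and the default limit of 1000 megabytes is just the example threshold mentioned above, not a rule enforced by subversion or IGB.

```shell
# Hypothetical helper: succeed (exit status 0) if FILE is no larger than
# LIMIT_MB megabytes; LIMIT_MB defaults to 1000, the example threshold.
# Returns 1 if the file is too large and 2 if it does not exist.
small_enough_for_svn() {
    file=$1
    limit_mb=${2:-1000}
    [ -f "$file" ] || return 2
    bytes=$(wc -c < "$file")
    bytes=$((bytes))                         # normalize any padding from wc
    limit_bytes=$((limit_mb * 1024 * 1024))
    [ "$bytes" -le "$limit_bytes" ]
}

# Example use:
#   if small_enough_for_svn X_tropicalis_Nov_2009.2bit; then
#       svn add X_tropicalis_Nov_2009.2bit
#   fi
```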

Next, I used twoBitInfo to make a "genome.txt" file reporting the names of the assembled chromosomes and contigs and their sizes:

$ twoBitInfo X_tropicalis_Nov_2009.2bit genome.txt

This creates the genome.txt file IGB needs to display contig and chromosome names and their sizes in the Current Genome tab.

Curious about how complete the assembly is, I counted the contigs:

$ wc -l genome.txt
19550 genome.txt

Yikes! That's a lot. The genome is probably not close to being finished. But I know from experience with other less complete genomes that IGB should be able to handle such a large number of contigs. Users who want to view the largest ones can sort on size in the Current Genome table, which displays the contigs and their sizes.
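To look at the same information from the command line, something like the following sketch could be used. It relies on genome.txt being tab-separated with the sequence name in the first column and the size in the second, which is the layout twoBitInfo writes; the function name summarize_genome is hypothetical.

```shell
# Hypothetical helper: print the number of sequences and the total number of
# bases in a twoBitInfo-style file (name<TAB>size), followed by the TOP
# largest sequences (default 10).
summarize_genome() {
    file=$1
    top=${2:-10}
    awk -F '\t' '{ n++; total += $2 }
        END { printf "sequences\t%d\ntotal_bases\t%.0f\n", n, total }' "$file"
    # Sort numerically on the size column, largest first.
    sort -t "$(printf '\t')" -k2,2nr "$file" | head -n "$top"
}

# Example use:
#   summarize_genome genome.txt 5
```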

So I added the file to the repository and committed it like so:

$ svn ci genome.txt -m "Output from twoBitInfo on current version of X_tropicalis_Nov_2009.2bit"
Adding genome.txt
Transmitting file data .
Committed revision 439.

My plan (currently) is to get the RefSeq genes track from UCSC and deploy it on our QuickLoad site. I'll provide meta-data about the data set using the annots.xml file. (More on this later.)

Next, I used the UCSC Table Browser to get the RefSeq genes for this species. Here are the settings I used for this:

Unfortunately, there don't appear to be a lot of RefSeq gene annotations available for this species:

$ gunzip -c X_tropicalis_Nov_2009_refGene.bed.gz | wc -l
9797

Probably it would be a good idea to look for another data set that might provide a more complete view of the Xenopus expressed gene repertoire.

Using the Table Browser, I explored the available data sets for Xenopus. To do this, I chose a table and then clicked the button "describe table schema," which took me to a page reporting the number of annotations available in the selected table.

It looks like the table all_mrna may be the most complete; it contains slightly more than 20,000 rows. So, probably users will want to see this data, as well as the RefGene track. I'll download this data set, add it to the repository, and add it to the annots.xml file.

I also checked the ESTs data set. There are around 1.5 million ESTs for Xenopus. I'll download that data set as well, but set it up so that users can access it on a region-by-region basis by sorting and indexing using the tabix utility. I'll also need to massage the format a bit to get it to match the PSL (blat output) format.

To process the EST data set, I used the following command to drop the header line and strip off the first column (bin):

gunzip -c X_tropicalis_Nov_2009_all_est.gz | grep -v bin | cut -f2- > X_tropicalis_Nov_2009_all_est.psl
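One caveat about the command above: grep -v bin removes every line that contains the string "bin" anywhere, so an EST row whose name happened to include it would be dropped along with the header. The sketch below is a more conservative variant that removes only lines starting with "#", then sorts by target name and start so the result can be indexed. The function name table_to_sorted_psl is hypothetical, and the bgzip and tabix lines are shown as comments because they assume htslib is installed and that a tabix index on the PSL target columns (14, 16 and 17) is what the deployment needs.

```shell
# Hypothetical helper: read a UCSC table dump on stdin (header line starting
# with "#", leading "bin" column) and write headerless PSL rows sorted by
# target name (column 14) and target start (column 16).
table_to_sorted_psl() {
    tab=$(printf '\t')
    grep -v '^#' | cut -f2- | sort -t "$tab" -k14,14 -k16,16n
}

# Assumed usage, followed by compression and indexing (untested here):
#   gunzip -c X_tropicalis_Nov_2009_all_est.gz | table_to_sorted_psl \
#       > X_tropicalis_Nov_2009_all_est.psl
#   bgzip X_tropicalis_Nov_2009_all_est.psl
#   tabix -s 14 -b 16 -e 17 -0 X_tropicalis_Nov_2009_all_est.psl.gz
```

The -0 flag tells tabix that the coordinates are zero-based and half-open, which matches how PSL stores tStart and tEnd.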

I've created a number of data files, and so my next step will be to try opening them in IGB. I also want to test whether IGB will be able to open and display the genome sequence.

So my next step is to create an annots.xml file IGB will use to get a listing of the annotations available for this genome as well as styling information, e.g., the background colors to use for gene annotations, whether to load all the annotations immediately, and so on.

Being a bit lazy, I usually just copy and paste another annots.xml file from another part of the repository when setting up a new genome. I use svn for this, however:

$ svn cp ../V_vinifera_Mar_2010/annots.xml .
A annots.xml

I then open a simple text editor (like TextEdit) and modify the file, like so:

To be continued.....
