UNDER CONSTRUCTION

Introduction

UCSC genome bioinformatics supports mammalian, insect, fish, avian, and some fungal genomes.

IGBQuickLoad contains genome directories with sequence and annotation data for some (not all) genome assemblies supported at UCSC.

The following describes how to add a new genome version to the IGB QuickLoad repository and update IGB synonyms.txt.

Command-line utilities you'll need

  • faToTwoBit from UCSC
  • twoBitInfo from UCSC
  • UNIX wget (not installed by default on Mac but available on most other UNIX systems)
  • UNIX sort (should be pre-installed on any UNIX system, including Mac)

Compiled UCSC software tools are available from http://hgdownload.cse.ucsc.edu/admin/exe/.

Put these compiled programs in a directory in your PATH. Make sure they are executable on your system.

If you're doing this on a Mac desktop or laptop computer, create a directory called "bin" in your home directory and save all compiled binaries there. Edit your .bash_profile file to include a line like the following to ensure that the shell can find the programs.

export PATH=.:$HOME/bin:$PATH

For example, the following sequence of commands downloads the software, moves it to a directory named "bin" in the home directory, and then makes it executable using the chmod command.

$ wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/twoBitInfo
$ mv twoBitInfo ~/bin
$ chmod a+x ~/bin/twoBitInfo
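Before starting, it may help to confirm that the shell can find all four utilities. The following is a minimal sketch, not part of the original procedure; check_tools is a hypothetical helper name, and it relies only on the standard command -v shell builtin.

```shell
# check_tools is a hypothetical helper (not a UCSC or IGB tool).
# It reports which of the given commands cannot be found on the PATH.
# Returns 0 if all are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Check the four utilities listed above.
check_tools faToTwoBit twoBitInfo wget sort ||
    echo "Install the missing tools before continuing."
```

If a tool you downloaded is reported as missing, check that ~/bin is in your PATH and that the file is executable.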

Step-by-step guide to adding a new UCSC genome to IGBQuickLoad

Get the QuickLoad data repo

Check out or update a copy of IGB QuickLoad data and source code directories.

Open a terminal (UNIX shell) and change into the directory where you want your checked-out copy of the genomes repository to reside.

For example, you might do this:

$ cd /mydata

Use svn to get a copy of the repo:

$ svn co https://svn.transvar.org/repos/genomes/trunk/pub/quickload

If you already have a copy, just update using svn up.

$ svn up

This will update everything in the current working directory and all the directories beneath it.

Open a Web browser and find the genome you would like to add in the Table Browser at UCSC

Go to http://genome.ucsc.edu/cgi-bin/hgTables

Use the genome version menu to determine the month and year of the genome release you want.

Make note of the genome version synonyms UCSC is using. These usually appear in parentheses next to the month and year of the release. They will need to be included in IGB's curated list of genome version synonyms to ensure compatibility with Galaxy, UCSC DAS, or other external resources.

For example, the assembly menu for zebrafish reads Jul. 2010 (Zv9/danRer7). The terms in parentheses are genome version synonyms for this assembly. The one on the right (danRer7) is what UCSC calls the "database" for this assembly and is used as the identifier of the genome in the UCSC DAS1 data sources. The term on the left (Zv9) is usually another commonly used term, sometimes assigned by the sequencing consortium that generated the assembly or the original sequence. Sometimes, however, this term is not unique. For example, some genome versions are reported with the term "Broad," which is an organization, not an assembly.

Name the genome for month, year, and species

Choose an IGB genome assembly version identifier to represent the UCSC genome.

IGB uses genus, species, strain (optionally), and release month and year to identify genome assembly versions for a species, individual, strain, or cultivar whose genome was sequenced.

IGB genome assembly version identifiers look like:

G_species_strain_Mmm_YYYY

where

  • G is the first letter (upper-case) of the genus name
  • species is the species name (lower-case)
  • strain is cultivar, strain, or individual whose genome was sequenced (this is optional and not usually needed for UCSC-managed genomes)
  • Mmm is the three-letter English abbreviation of the month of the release (first letter is upper-case)
  • YYYY is the year of the release

Examples:

  • P_troglodytes_Oct_2010 genome assembly for chimp released Oct 2010
  • Z_mays_B73_Mar_2010 genome assembly for maize plant cultivar B73 released March 2010
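The naming pattern above can be checked mechanically before you create a directory. This is a sketch only: is_igb_genome_id is a hypothetical helper, and the characters allowed in the strain part (letters and digits) are an assumption, since the text does not specify them.

```shell
# is_igb_genome_id is a hypothetical helper (not part of IGB).
# Returns 0 if the argument looks like G_species_Mmm_YYYY or
# G_species_strain_Mmm_YYYY, and 1 otherwise.
# Assumption: the optional strain part contains only letters and digits.
is_igb_genome_id() {
    printf '%s\n' "$1" | grep -Eq \
      '^[A-Z]_[a-z]+(_[A-Za-z0-9]+)?_(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)_[0-9]{4}$'
}

# Try the two examples above, plus a UCSC database name that should not match.
for id in P_troglodytes_Oct_2010 Z_mays_B73_Mar_2010 danRer7; do
    if is_igb_genome_id "$id"; then
        echo "ok:      $id"
    else
        echo "not ok:  $id"
    fi
done
```

A check like this catches slips such as a five-digit year or a lower-case month abbreviation.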

Use svn mkdir to create a new genome version directory for the genomes repo

Use svn mkdir to create a new genome directory.

$ svn mkdir G_species_Mmm_YYYY

DO NOT use the UNIX mkdir command to create the directory and then use svn add to add it to the repo later. If you do this and are not careful, you could end up adding a lot more to the repo than you intended.

Change directories into the newly created genome version directory

$ cd G_species_Mmm_YYYY

Download the sequence data

Go to UCSC Genome Bioinformatics and click Downloads > Genome Data. Click the link for your species and then click the link labeled Full data set.

This will take you to a directory where you can download files. Typically, the address of the directory is

http://hgdownload.soe.ucsc.edu/goldenPath/UCSC name/bigZips/

For example, UCSC's danRer7 genome is in http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/.
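If you already know the UCSC database name, you can build the directory address in the shell instead of browsing to it. This sketch simply fills the name into the typical address shown above; bigzips_url is a hypothetical helper, and not every genome is guaranteed to follow this layout.

```shell
# bigzips_url is a hypothetical helper. It prints the typical bigZips
# directory address for a UCSC database name such as danRer7.
bigzips_url() {
    printf 'http://hgdownload.soe.ucsc.edu/goldenPath/%s/bigZips/\n' "$1"
}

bigzips_url danRer7
# prints http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/
```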

Is the 2bit file available?

If yes, download it using wget.

Right-click the link in your browser and select "Copy Link Location."

Return to your terminal (UNIX shell), type wget, and paste the URL into the shell.

For example,

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/danRer7.2bit

Rename the file to G_species_Mmm_YYYY.2bit.

The prefix (file name part) of the 2bit file for a genome should be the genome version identifier, and the suffix (file extension) should be 2bit.

For example,

$ mv danRer7.2bit D_rerio_Jul_2010.2bit

If not, download the sequence data in fasta format.

For older genomes, UCSC provides sequence data in a so-called "bigZip" file which contains each assembled sequence (typically one per physical chromosome) in a separate fasta file.

For example, as of this writing, the dm3 (fruit fly) genome is provided in this format.

Right-click the link in your browser and select "Copy Link Location."

Return to your terminal (UNIX shell), type wget, and paste the URL into the shell.

For example,

$ wget http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz

Unpack the file using tar
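The original page does not show the tar command, so the following is a sketch under assumptions: the archive is a gzipped tar file such as chromFa.tar.gz, and unpack_chrom_fa is a hypothetical helper name. It unpacks the archive into the current directory and lists the fasta files it finds.

```shell
# unpack_chrom_fa is a hypothetical helper.
# x = extract, z = decompress with gzip, f = read from the named file.
unpack_chrom_fa() {
    tar xzf "$1" || return 1
    # List fasta files at any depth below the current directory.
    find . -name '*.fa' -print
}

# Run only if the archive from the wget step is present.
if [ -f chromFa.tar.gz ]; then
    unpack_chrom_fa chromFa.tar.gz
fi
```

If the listing shows the .fa files inside a subdirectory rather than in the current directory, adjust the paths you pass to faToTwoBit accordingly.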

Create 2bit file using faToTwoBit

Convert the fa files to 2bit format using faToTwoBit. This program (from UCSC) reads one or more fasta files and converts them to a single 2bit file. To see how to run it, type the name of the program. If you run it without any arguments, it prints a usage message.

$ faToTwoBit
faToTwoBit - Convert DNA from fasta to 2bit format
usage:
   faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
options:
   -noMask       - Ignore lower-case masking in fa file.
   -stripVersion - Strip off version number after . for genbank accessions.
   -ignoreDups   - only convert first sequence if there are duplicates

The prefix (file name part) of the 2bit file you create for this genome version should be the genome version identifier, and the suffix (file extension) should be "2bit."

$ faToTwoBit *.fa G_species_Mmm_YYYY.2bit

Delete .fa files.

$ rm *.fa

You can always re-create the .fa files using twoBitToFa, another UCSC tool.

The 2bit file is smaller than the .fa files combined. It also supports random access, allowing IGB to support partial loading of sequence from the IGBQuickLoad site into IGB.

Make genome.txt file

Use twoBitInfo to create a genome.txt file for the genome. This file lists sequences and their sizes for the genome.

$ twoBitInfo G_species_Mmm_YYYY.2bit genome.txt
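It can be worth sanity-checking genome.txt before going further. twoBitInfo writes one line per sequence, with the sequence name and its size separated by a tab. The sketch below checks that layout and prints a total; check_genome_txt is a hypothetical helper built on standard awk.

```shell
# check_genome_txt is a hypothetical helper.
# It expects two tab-separated columns per line: name, then size in bases.
# It prints one line per malformed row and exits non-zero if any are found;
# otherwise it prints the sequence count and total size.
check_genome_txt() {
    awk -F '\t' '
        NF != 2 || $2 !~ /^[0-9]+$/ {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
            next
        }
        { total += $2 }
        END {
            if (bad) exit 1
            # %.0f avoids integer overflow in some awk versions
            # for genomes larger than about 2 billion bases.
            printf "%d sequences, %.0f bases total\n", NR, total
        }
    ' "$1"
}

# Run only if genome.txt exists in the current directory.
if [ -f genome.txt ]; then
    check_genome_txt genome.txt
fi
```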

Sort the genome.txt file by sequence size

The order of sequences listed in the genome.txt is how they will appear in the IGB Current Sequence tab when users visit the genome version in IGB.

Check the genome.txt file. Are the chromosomes listed in a reasonable order?

If you created the 2bit file from fasta files, then they will probably be listed in alphabetical order.

Depending on the state of the genome assembly, i.e., how close to complete it is, it's usually much better to list the chromosomes by size, with larger sequences appearing first in the list.

To ensure that the chromosomes are listed with largest ones first, sort the file

$ sort -k2,2nr genome.txt > tmp
$ mv tmp genome.txt
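To see what the sort options do, here is a small demonstration on made-up data. The sequence names and sizes are hypothetical, and the files are written to a temporary directory so your real genome.txt is not touched.

```shell
# Work in a scratch directory; the data below are invented for illustration.
demo_dir=$(mktemp -d)
printf 'chr1\t300\nchr2\t500\nchrM\t20\n' > "$demo_dir/genome.txt"

# -k2,2 sorts on column 2 only; n compares numerically; r reverses the
# order so the largest sequence comes first.
sort -k2,2nr "$demo_dir/genome.txt" > "$demo_dir/sorted.txt"
cat "$demo_dir/sorted.txt"
# Order of names in the output: chr2, chr1, chrM

# sort -c checks the ordering without writing output; it exits non-zero
# if the file is not already sorted by the given key.
sort -c -k2,2nr "$demo_dir/sorted.txt" && echo "sorted largest first"
```

Running the same sort -c check on your real genome.txt after the mv step confirms that the file ended up in the intended order.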

Add genome.txt to the repo

Add the genome.txt file to the repo

$ svn add genome.txt