...

For example, the following sequence of commands downloads the software, moves it to a directory named "bin" in the home directory, and then makes it executable using the chmod command.

Code Block
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/twoBitInfo
$ mv twoBitInfo ~/bin
$ chmod a+x ~/bin/twoBitInfo
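
The commands above only let you run twoBitInfo by name if ~/bin is on your PATH. The following is a minimal sketch of one way to check that, assuming a POSIX-style shell; the function name dir_on_path is hypothetical and not part of the QuickLoad scripts.

```shell
# Hypothetical helper: succeed if the given directory is listed in PATH.
dir_on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Add ~/bin for the current session only if it is missing.
if ! dir_on_path "$HOME/bin"; then
    PATH="$HOME/bin:$PATH"
    export PATH
fi
```

To make the change permanent, the same lines could go in your shell startup file.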

Step-by-step guide to adding a new UCSC genome to IGBQuickLoad

The following instructions explain how to

  • set up your local copy of the public QuickLoad data repository
  • set up your environment to run QuickLoad scripts
  • get annotation files and sequence files from UCSC Genome Bioinformatics
  • convert files to random access, indexed file formats that enable partial data loading in IGB
  • update meta-data files IGB requires to update its interface and allow users to access the newly added genome
  • update HEADER.html and other files describing the new genome
  • commit your new files and updates to the repo
  • submit a Jira ticket requesting that the main site and mirror sites be updated

If you have questions, don't hesitate to ask Ann.

Get the QuickLoad data repo

...

To set up a src directory for checked-out code:

Code Block
cd 
mkdir src
cd src

Then, use svn to get a copy of the QuickLoad source code from the repo and save it to a directory named quickload_src:

...

Use svn mkdir to create a new genome directory. The name of the new directory should be identical to the IGB QuickLoad genome version name.

Code Block
$ svn mkdir G_species_Mmm_YYYY
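
Because the directory name must match the genome version name exactly, a quick format check before running svn mkdir can catch typos. This is a sketch only: the pattern is inferred from the examples on this page (such as D_rerio_Jul_2010) and is an assumption, not an official IGB rule, and the function name is hypothetical.

```shell
# Hypothetical check: does the name look like G_species_Mmm_YYYY?
# Assumed pattern: genus initial, species, three-letter month, four-digit year.
is_genome_version_name() {
    printf '%s\n' "$1" | grep -Eq '^[A-Z]_[a-z]+_[A-Z][a-z]{2}_[0-9]{4}$'
}

if is_genome_version_name "D_rerio_Jul_2010"; then
    echo "name matches the assumed pattern"
fi
```

A name that fails the check is not necessarily wrong, but it is worth a second look before committing.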

...

Change directories into the newly created genome version directory:

Code Block
$ cd G_species_Mmm_YYYY

Download the sequence data

...

Is the 2bit file available?

For most of the more recent genomes, UCSC is using the 2bit format to distribute sequence data. However, some older versions may not make this available.

If yes, download it using wget.

...

Return to your terminal UNIX shell, type wget, and paste the URL into the shell.

For example,

Code Block
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/danRer7.2bit

...

The prefix (file name part) of the 2bit file for a genome should be the genome version identifier, and the suffix (file extension) should be 2bit.

For example,

Code Block
$ mv danRer7.2bit D_rerio_Jul_2010.2bit

...

Return to your terminal UNIX shell, type wget, and paste the URL into the shell.

For example,

Code Block
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz

...

Use tar with options xvf to uncompress the file while also extracting its contents. An ".fa" file for each chromosome will appear when completed.

Code Block
$ tar xvf chromFa.tar.gz

Note

Once you've created the 2bit file for the genome assembly, you'll delete the .fa and the .gz files.

Create 2bit file using faToTwoBit

Install faToTwoBit. The following example downloads the macOS version:

Code Block
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/faToTwoBit
$ mv faToTwoBit ~/bin
$ chmod a+x ~/bin/faToTwoBit
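
The example above downloads from the macOSX.i386 directory. On other platforms the directory part of the URL differs. The sketch below shows one way to pick it from uname output; the function name is hypothetical, and the directory names in it are assumptions that should be confirmed by browsing http://hgdownload.cse.ucsc.edu/admin/exe/ before downloading.

```shell
# Hypothetical helper: map `uname -s` and `uname -m` values to a UCSC
# executables directory name. The directory names are assumptions;
# confirm them on the download server first.
ucsc_exe_dir() {
    case "$1/$2" in
        Darwin/x86_64) echo "macOSX.x86_64" ;;
        Darwin/arm64)  echo "macOSX.arm64" ;;
        Linux/x86_64)  echo "linux.x86_64" ;;
        *) echo "unsupported platform: $1 $2" >&2; return 1 ;;
    esac
}

# Print the URL that wget would be given on this machine.
if dir=$(ucsc_exe_dir "$(uname -s)" "$(uname -m)"); then
    echo "http://hgdownload.cse.ucsc.edu/admin/exe/$dir/faToTwoBit"
fi
```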

Convert the fa files to twoBit

faToTwoBit will read one or more fasta files and convert them to a single 2bit file. To understand how to run it, type the name of the program. If you run it without any arguments, it will print a usage message.

Code Block
$ faToTwoBit
faToTwoBit - Convert DNA from fasta to 2bit format
usage:
   faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
options:
   -noMask       - Ignore lower-case masking in fa file.
   -stripVersion - Strip off version number after . for genbank accessions.
   -ignoreDups   - only convert first sequence if there are duplicates

The prefix (file name part) of the 2bit file you create for this genome version should be the genome version identifier, and the suffix (file extension) should be "2bit".

Code Block
$ faToTwoBit *.fa G_species_Mmm_YYYY.2bit

Delete the .fa files and the chromFa.tar.gz file:

Code Block
$ rm *.fa
$ rm chromFa.tar.gz

Note

You can always re-create the .fa files using twoBitToFa, another UCSC tool.

Note

The 2bit file is typically smaller than the compressed chromFa.tar.gz file it replaces. Unlike the .tar.gz file, it supports random access, which lets IGB load partial sequence from the IGBQuickLoad site into the coordinates track.
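
Before deleting the .fa and .tar.gz files, it may be worth confirming that the 2bit file was actually written. The following is a minimal sketch, assuming you run it from the genome directory; the function name is hypothetical, and the check only confirms the file is non-empty, not that it is a valid 2bit file.

```shell
# Hypothetical helper: remove the FASTA inputs only if the 2bit file
# named in $1 exists and is non-empty.
remove_fasta_inputs() {
    if [ -s "$1" ]; then
        rm -f -- *.fa chromFa.tar.gz
    else
        echo "refusing to delete: $1 is missing or empty" >&2
        return 1
    fi
}
```

For example, `remove_fasta_inputs G_species_Mmm_YYYY.2bit` would take the place of the two rm commands.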

Make genome.txt file

Use twoBitInfo to create a genome.txt file for the genome. This file lists sequences and their sizes for the genome.

Code Block
$ twoBitInfo G_species_Mmm_YYYY.2bit genome.txt

Note

You can use twoBitInfo to make BED files marking the location of N's in the genome or calculate the amount of non-N sequence in an assembly.
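
Since the sizes file is a plain two-column listing, totals can be computed with standard tools. The sketch below sums the second column; it assumes a tab-separated name and size on each line, and the function name is hypothetical. To total only non-N sequence, you would first write the sizes file using the twoBitInfo option that excludes N's (listed as -noNs in the usage message of versions I have seen; check yours).

```shell
# Hypothetical helper: sum the second column of a two-column,
# tab-separated sizes file such as genome.txt.
total_bases() {
    # %.0f avoids integer overflow in awk versions with 32-bit %d.
    awk -F '\t' '{ total += $2 } END { printf "%.0f\n", total }' "$1"
}
```

For example, `total_bases genome.txt` prints the combined length of all sequences in the file.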

Sort the genome.txt file by sequence size

...

To ensure that the largest chromosomes are listed first, sort the file by size in descending order:

Code Block
$ sort -k2,2nr genome.txt > tmp
$ mv tmp genome.txt
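
To see what the sort key does, here is the same command run on a small stand-in file in a temporary directory. The sequence names and sizes are made up for illustration.

```shell
# Stand-in data; a real genome.txt comes from twoBitInfo.
workdir=$(mktemp -d)
printf 'chrM\t16\nchr2\t500\nchr1\t1000\n' > "$workdir/genome.txt"

# -k2,2 restricts the key to the second field; n compares numerically;
# r reverses the order so the largest size comes first.
sort -k2,2nr "$workdir/genome.txt" > "$workdir/tmp"
mv "$workdir/tmp" "$workdir/genome.txt"

# Lists chr1, then chr2, then chrM.
cat "$workdir/genome.txt"
```

Writing to a temporary file and then renaming it matters: redirecting sort's output straight onto its own input file would truncate the file before sort reads it.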

Add genome.txt to the repo

...

Warning

Note that this file is tab-separated, so when you edit the file, be sure to use a tab character to separate the genome version and genome title fields.
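
Editors often replace tabs with spaces silently, so it can help to check the file after editing. The following is a minimal sketch, assuming every line should have exactly two tab-separated fields (genome version and genome title); the function name is hypothetical, and the file name you pass is whichever file you just edited.

```shell
# Hypothetical check: print each line that does not have exactly two
# tab-separated fields, and return non-zero if any such line is found.
check_two_tab_fields() {
    awk -F '\t' 'NF != 2 { printf "line %d: %s\n", NR, $0; bad = 1 }
                 END { exit bad + 0 }' "$1"
}
```

A line whose fields are separated by spaces is reported, because awk sees it as a single field.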

Test the new genome

...

Testing is absolutely critical, as it is easy to make an error along the way. Plan to spend at least as much time testing as you spent building the site.

Test under the released version of IGB

Download the latest release of IGB from http://www.bioviz.org.

...

Configure data sources under Data Sources tab

...