...
For example, the following sequence of commands downloads the software, moves it to a directory named "bin" in the home directory, and then makes it executable using the chmod command.
Code Block |
---|
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/twoBitInfo
$ mv twoBitInfo ~/bin
$ chmod a+x ~/bin/twoBitInfo
|
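For the shell to find tools placed in ~/bin, that directory has to exist and be listed in PATH. The following is a minimal sketch of such a check. The helper name path_contains is made up for illustration and is not part of any UCSC or IGB tooling.

```shell
# Hypothetical helper: succeed if directory $1 is listed in PATH.
path_contains() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Create ~/bin if it does not exist yet (no error if it already does).
# Without this, "mv twoBitInfo ~/bin" would rename the file to "bin".
mkdir -p "$HOME/bin"

if path_contains "$HOME/bin"; then
  echo "$HOME/bin is on PATH"
else
  echo "$HOME/bin is not on PATH; add it in your shell startup file"
fi
```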
Step-by-step guide to adding a new UCSC genome to IGBQuickLoad
The following instructions explain how to
- set up your local copy of the public QuickLoad data repository
- set up your environment to run QuickLoad scripts
- get annotation files and sequence files from UCSC Genome Bioinformatics
- convert files to random access, indexed file formats that enable partial data loading in IGB
- update meta-data files IGB requires to update its interface and allow users to access the newly added genome
- update HEADER.html and other files describing the new genome
- commit your new files and updates to the repo
- submit a Jira ticket requesting that the main site and mirror sites be updated
If you have questions, don't hesitate to ask Ann.
Get the QuickLoad data repo
...
To set up a src directory for checked-out code:
Code Block |
---|
cd
mkdir src
cd src
|
Then, use svn to get a copy of the QuickLoad source code from the repo and save it to a directory named quickload_src:
...
Use svn mkdir to create a new genome directory. The name of the new directory should be identical to the IGB QuickLoad genome version name.
Code Block |
---|
$ svn mkdir G_species_Mmm_YYYY
|
...
Change directories into the newly created genome version directory
Code Block |
---|
$ cd G_species_Mmm_YYYY
|
Download the sequence data
...
Is the 2bit file available?
For most of the more recent genomes, UCSC uses the 2bit format to distribute sequence data. However, for some older genome versions, a 2bit file may not be available.
If yes, download it using wget.
...
Return to your terminal UNIX shell, type wget, and paste the URL into the shell.
For example,
Code Block |
---|
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/bigZips/danRer7.2bit
|
...
The prefix (file name part) of the 2bit file for a genome should be the genome version identifier and suffix (file extension) should be 2bit.
For example,
Code Block |
---|
$ mv danRer7.2bit D_rerio_Jul_2010.2bit
|
...
Return to your terminal UNIX shell, type wget, and paste the URL into the shell.
For example,
Code Block |
---|
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/dm3/bigZips/chromFa.tar.gz
|
...
Use tar with options xvf to uncompress the file while also extracting its contents. An ".fa" file for each chromosome will appear when completed.
Code Block |
---|
$ tar xvf chromFa.tar.gz
|
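Before extracting, it can help to list what an archive contains. The sketch below builds a tiny stand-in archive from empty placeholder files so that it runs without a download, then lists and extracts it. The chromosome file names are invented. The z option is given explicitly here; recent GNU tar and bsdtar also detect gzip compression on their own when reading.

```shell
# Build a small stand-in archive in a scratch directory.
tar_dir=$(mktemp -d)
cd "$tar_dir" || exit 1
: > chr1.fa
: > chr2.fa
tar czf chromFa.tar.gz chr1.fa chr2.fa
rm chr1.fa chr2.fa

# t lists the archive members without extracting them.
listing=$(tar tzf chromFa.tar.gz)
echo "$listing"

# x extracts, z gunzips, v prints each name, f names the archive file.
tar xzvf chromFa.tar.gz
```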
Note |
---|
Once you've created the 2bit file for the genome assembly, you'll delete the .fa and the .gz files. |
Create 2bit file using faToTwoBit
Install faToTwoBit. In the following example, the macOS version is downloaded:
Code Block |
---|
$ wget http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.i386/faToTwoBit
$ mv faToTwoBit ~/bin
$ chmod a+x ~/bin/faToTwoBit
|
Convert the fa files to twoBit
faToTwoBit will read one or more fasta files and convert them to a single 2bit file. To understand how to run it, type the name of the program. If you run it without any arguments, it will print a usage message.
Code Block |
---|
$ faToTwoBit
faToTwoBit - Convert DNA from fasta to 2bit format
usage:
faToTwoBit in.fa [in2.fa in3.fa ...] out.2bit
options:
-noMask - Ignore lower-case masking in fa file.
-stripVersion - Strip off version number after . for genbank accessions.
-ignoreDups - only convert first sequence if there are duplicates
|
The prefix (file name part) of the 2bit file you create for this genome version should be the genome version identifier and suffix (file extension) should be "2bit."
Code Block |
---|
$ faToTwoBit *.fa G_species_Mmm_YYYY.2bit
|
Delete .fa files and the chromFa.tar.gz file
Code Block |
---|
$ rm *.fa
$ rm chromFa.tar.gz
|
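Because deleting the FASTA files is only safe once the conversion has succeeded, it may be worth guarding the rm step. The sketch below uses a scratch directory with placeholder files so that it can be run anywhere. The file names are illustrative; in practice the guarded rm would be run in the genome directory.

```shell
# Work in a scratch directory so nothing real is deleted.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Placeholder files standing in for real data (illustrative names only).
printf 'placeholder\n' > G_species_Mmm_YYYY.2bit
: > chr1.fa
: > chr2.fa
: > chromFa.tar.gz

# Remove the inputs only if the 2bit file exists and is not empty.
if [ -s G_species_Mmm_YYYY.2bit ]; then
  rm -- *.fa chromFa.tar.gz
else
  echo "2bit file missing or empty; keeping .fa files" >&2
fi

ls
```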
Note |
---|
You can always re-create the .fa files using twoBitToFa, another UCSC tool. |
Note |
---|
The 2bit file is typically smaller than the compressed chromFa.tar.gz file it replaced. Unlike the .tar.gz file, it supports random access, which allows IGB to load partial sequence from the IGBQuickLoad site into the coordinates track. |
Make genome.txt file
Use twoBitInfo to create a genome.txt file for the genome. This file lists sequences and their sizes for the genome.
Code Block |
---|
$ twoBitInfo G_species_Mmm_YYYY.2bit genome.txt
|
Note |
---|
You can use twoBitInfo to make BED files marking the location of N's in the genome or calculate the amount of non-N sequence in an assembly. |
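The genome.txt file written by twoBitInfo has one line per sequence: the sequence name, a tab, and the length in bases. As a rough sanity check, the lengths can be summed with awk. The sketch below uses made-up sequence names and sizes in a scratch file rather than real output.

```shell
# Scratch file with invented names and sizes (not real assembly data).
genome_file=$(mktemp)
printf 'chr1\t1000\nchr2\t2500\nchrM\t16\n' > "$genome_file"

# Count sequences and sum the second column (sequence length).
summary=$(awk -F '\t' '{ total += $2 } END { printf "%d sequences, %d bases", NR, total }' "$genome_file")
echo "$summary"
```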
Sort the genome.txt file by sequence size
...
To ensure that the largest chromosomes are listed first, sort the file by sequence size in descending order:
Code Block |
---|
$ sort -k2,2nr genome.txt > tmp
$ mv tmp genome.txt
|
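To illustrate what the sort does, the sketch below applies the same sort options to a small scratch file. The sequence names and sizes are invented for the example. The key -k2,2 restricts sorting to the second field, n compares numerically, and r reverses the order so that the largest sequences come first.

```shell
# Scratch copy with invented names and sizes (not real assembly data).
demo_dir=$(mktemp -d)
printf 'chrM\t16\nchr2\t2500\nchr1\t1000\n' > "$demo_dir/genome.txt"

# Sort on field 2 only, numerically, in descending order.
sort -k2,2nr "$demo_dir/genome.txt" > "$demo_dir/tmp"
mv "$demo_dir/tmp" "$demo_dir/genome.txt"

cat "$demo_dir/genome.txt"
```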
Add genome.txt to the repo
...
Warning |
---|
Note that this file is tab-separated, so when you edit the file, be sure to use a tab character to separate the genome version and genome title fields. |
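One way to catch a space typed where a tab belongs is to count tab-separated fields with awk. The following sketch assumes two fields per line (genome version and genome title), as the warning above describes. The scratch file and the titles in it are placeholders.

```shell
# Scratch file: line 1 uses a tab, line 2 mistakenly uses spaces.
check_file=$(mktemp)
printf 'G_species_Mmm_YYYY\tGenus species (Mmm YYYY)\n' > "$check_file"
printf 'D_rerio_Jul_2010 Danio rerio (Jul 2010)\n' >> "$check_file"

# Collect the line number of every line that lacks exactly two tab-separated fields.
bad_lines=$(awk -F '\t' 'NF != 2 { print NR }' "$check_file")

if [ -n "$bad_lines" ]; then
  echo "Lines without exactly two tab-separated fields: $bad_lines"
fi
```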
Test the new genome
...
Testing is absolutely critical because it is easy to make an error along the way. Plan to spend at least as much time testing as you spent building the site.
Test under the released version of IGB
Download the latest release of IGB from http://www.bioviz.org
...
.
Configure data sources under Data Sources tab
...