IGB Terminology

Genomes and chromosomes

All data viewed in IGB, regardless of its source, is organized into distinct genomes and chromosomes.

In IGB, a chromosome refers to any single sequence. Often this will correspond to the sequence of a physical chromosome. At other times it may represent an assembled contig, a BAC, or any other DNA sequence. All chromosomes in IGB are assumed to be DNA, rather than RNA sequences.

A genome version refers to a group of chromosome sequences that you or another group assembled and made available.

For example, NCBI releases 35 and 36 of the human genome are considered to be two separate genomes. Each one contains multiple chromosome sequences, including the expected chromosomes 1 to 22, X, and Y. Other sequences, such as "chr22_random", are also considered distinct chromosomes for the purposes of display in IGB.

Each sequence in IGB is identified by its genome and chromosome names, which must therefore be distinct. No two chromosomes within the same genome version can have the same name.
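
The uniqueness rule can be pictured as a lookup keyed by the pair of genome version and chromosome name. The following is a minimal Python sketch of that idea, not IGB's actual implementation; the SequenceRegistry class and the sequence lengths are hypothetical and exist only for illustration.

```python
class SequenceRegistry:
    """Hypothetical registry keyed by (genome version, chromosome name)."""

    def __init__(self):
        self._lengths = {}

    def add(self, genome, chromosome, length):
        key = (genome, chromosome)
        # A chromosome name may appear only once within a genome version.
        if key in self._lengths:
            raise ValueError(f"{chromosome!r} already exists in {genome!r}")
        self._lengths[key] = length

    def length(self, genome, chromosome):
        return self._lengths[(genome, chromosome)]


registry = SequenceRegistry()
# Placeholder lengths, not real chromosome sizes.
registry.add("H_sapiens_May_2004", "chr1", 1000)
# The same chromosome name in a different genome version is a different sequence.
registry.add("H_sapiens_Feb_2009", "chr1", 1200)
```

Adding "chr1" to "H_sapiens_May_2004" a second time raises a ValueError in this sketch, which mirrors the rule that no two chromosomes in one genome version share a name.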

Naming a genome

If you are building a genome for display in IGB, we recommend you give it an IGB-friendly name that combines the genus and species with the month and year of release. The name should follow the pattern G_species_mon_yyyy, where G is the first letter of the genus, species is the species name, mon is the three-letter English abbreviation for the month the genome was released, and yyyy is the four-digit year of the release. For example:

A_thaliana_Jun_2009
A_mellifera_Jan_2005
H_sapiens_Feb_2009

Using this scheme will ensure that IGB displays the latest genome first in the genome menu under the Data Access tab.
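
If you generate genome names in a script, the pattern is easy to build and check programmatically. The Python sketch below is one way to do that. The function names are hypothetical, and the sort shown is only an illustration of why a parseable release date is useful; this guide does not describe how IGB itself orders the menu.

```python
import re
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# G_species_mon_yyyy: one capital letter, species, month abbreviation, four-digit year.
NAME_PATTERN = re.compile(
    r"^([A-Z])_([a-z]+)_(" + "|".join(MONTHS) + r")_(\d{4})$"
)


def make_genome_name(genus, species, release):
    """Build a G_species_mon_yyyy name from genus, species, and release date."""
    month = MONTHS[release.month - 1]
    return f"{genus[0].upper()}_{species.lower()}_{month}_{release.year:04d}"


def parse_genome_name(name):
    """Return (genus_initial, species, year, month_number), or None if the name does not fit."""
    m = NAME_PATTERN.match(name)
    if m is None:
        return None
    initial, species, mon, year = m.groups()
    return initial, species, int(year), MONTHS.index(mon) + 1


def newest_first(names):
    """Sort names by release date, most recent first (every name must fit the pattern)."""
    return sorted(names, key=lambda n: parse_genome_name(n)[2:], reverse=True)


# Only the month and year are used; the day is a placeholder.
name = make_genome_name("Arabidopsis", "thaliana", date(2009, 6, 1))
```

Here name is "A_thaliana_Jun_2009", matching the first example above, and parse_genome_name returns None for a name such as "hg17" that does not follow the scheme.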

Adding a common name for a species

When users open the pull-down menu to choose a species to view in IGB, a short message indicating the common name of the species appears. If you are adding a new species, contact the IGB developers and ask to have your common name added to the species.txt file under version control at sourceforge.net. This is a tab-delimited file that lists all the species that IGB supports, including common names for many of them.

Synonyms

Unfortunately, different groups tend to refer to the same genome or chromosome by different names. For example, NCBI human genome build 35 is also known as hg17 and ensembl1834, as well as H_sapiens_May_2004. When IGB is able to recognize that two names refer to the same genome or chromosome, it will merge the data. Otherwise it will keep the two data sets distinct. Currently, IGB uses a simple table of synonyms to store these associations. You can create your own set of synonyms that will extend this set if needed.
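
Conceptually, a synonym table maps every known name to one shared identity, so that data labeled with any of the names ends up in the same place. The Python sketch below illustrates that idea using the names from the example above. It is not IGB's code: the SynonymTable class is hypothetical, and the choice of which name acts as the shared key is arbitrary.

```python
class SynonymTable:
    """Hypothetical synonym lookup: every name maps to one shared key."""

    def __init__(self):
        self._canonical = {}

    def add_group(self, names):
        """Register names that all refer to the same genome or chromosome."""
        # Keys of groups that already contain one of these names.
        known = {self._canonical[n] for n in names if n in self._canonical}
        canonical = min(known) if known else names[0]
        # Fold any previously separate groups into the chosen key.
        for existing, key in list(self._canonical.items()):
            if key in known:
                self._canonical[existing] = canonical
        for n in names:
            self._canonical[n] = canonical

    def resolve(self, name):
        """Return the shared key for a name; unknown names stay as they are."""
        return self._canonical.get(name, name)


table = SynonymTable()
table.add_group(["H_sapiens_May_2004", "hg17", "ensembl1834"])
# Extending the built-in set with a user-defined synonym (made-up name).
table.add_group(["hg17", "my_lab_build35"])
```

After these calls, resolve returns "H_sapiens_May_2004" for "hg17", "ensembl1834", and "my_lab_build35", so data under any of those names would be merged. A name that is not in the table resolves to itself and therefore stays distinct.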

Annotations, sequences, graphs, and alignments

IGB can work with four distinct types of data: annotations, alignments (typically from Illumina sequencing experiments), graphs, and genomic sequences. Some features of the program make sense only with some of these types of data.

Annotations indicate the known or suspected locations of genomic landmark features, such as genes, exons, promoter regions, pseudogenes, and so forth. Alignments of EST sequences, GeneChip probe sequences, and other sequences onto chromosomes are also sometimes treated as annotations, particularly when they don't include the sequence of the aligned entity. Annotation data can be loaded from files, QuickLoad, and DAS servers.

Sequences are sets of DNA residues comprising a chromosome. Sequences can be loaded from files, QuickLoad, and DAS servers. It is a good idea to load sequence data only for small regions of the genome at a time.

Graphs indicate scores or other numeric values as a function of genomic position. Graphs are generally displayed as some form of plot (x,y-plot, bar plot, etc.). The results from tiling arrays are generally represented as graphs. There are two types of graph data: point-based graphs, in which numerical values are associated with individual (single) base positions, and interval graphs, which capture values associated with ranges of genomic positions.
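
The difference between the two graph types shows up in how a value is looked up for a given base position. The Python sketch below models both kinds under stated assumptions: the class names are hypothetical, the scores are made up, and intervals are treated as half-open (start included, end excluded), which is a convention chosen for this example rather than something this guide specifies.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass
class PointGraph:
    """One value per single base position; positions must be sorted."""
    positions: list
    values: list

    def value_at(self, pos):
        i = bisect_left(self.positions, pos)
        if i < len(self.positions) and self.positions[i] == pos:
            return self.values[i]
        return None  # no score recorded at exactly this base


@dataclass
class IntervalGraph:
    """One value per range; intervals must be sorted and non-overlapping."""
    starts: list
    ends: list  # exclusive ends (assumption for this sketch)
    values: list

    def value_at(self, pos):
        # Find the last interval that starts at or before pos.
        i = bisect_right(self.starts, pos) - 1
        if i >= 0 and pos < self.ends[i]:
            return self.values[i]
        return None  # pos falls in a gap between intervals


points = PointGraph([100, 125, 150], [0.5, 1.2, 0.9])
intervals = IntervalGraph([100, 200], [150, 260], [3.0, 7.5])
```

In this sketch, points.value_at(126) returns None because a point-based graph has a value only at the listed bases, while intervals.value_at(126) returns 3.0 because the position lies inside the range that starts at 100.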

Alignments represent how sequences (such as short reads from an RNA-Seq experiment) align onto the reference genomic sequence. At low zoom, they look like regular annotations, but with marks representing mismatches whenever these data are available. At high zoom, they show the sequence of the aligned read and sometimes indicate scores and the degree of agreement with the reference sequence. These are typically loaded from BAM (binary alignment) files.
