Sunday, March 18, 2012

Genome Coordinate Conventions

Denoting a contiguous region in a reference genome seems straightforward enough. However, I’ve seen many a bioinformatician get tripped up by the different conventions and formats used on web sites, file formats and genomic tools. Here are some things to keep in mind.

Chromosome

Taking the human genome as our example, chromosome names may or may not use a prefix so chromosome 12 may be “chr12”, “ch12” or simply “12”. Also, the mitochondrial chromosome may be “MT” or simply “M” and the sex chromosomes might be denoted as “X” and “Y” or “23” and “24”.

Strand

Typical conventions for denoting the plus vs minus strand orientation of a sequence include using “1” and “-1” or “+” and “-“.  It is worth noting that for most of the genomic resources I use, genomic positions are almost always given relative to the plus strand even if the feature is said to be on the minus strand.  For example, BLAT-ing a minus strand sequence against the human genome at UCSC will return a search result with "-" strand but the start and stop positions will be given relative to the plus strand.  Minus strand positions would start counting from 0 at the opposite end of the chromosome (q-arm).

Position

Several conventions exist which differ in (1) what number they start with when numbering bases, (2) whether they number the bases themselves or the spaces between bases and (3) whether the interval is considered "open" or "closed".  The notion of "closed" and "open" intervals is a mathematical concept which in this context implies either that the start and stop of the interval should be included (a closed interval) or they should not be included (an open interval).  Sometimes you will see square brackets used to denote closed intervals (e.g. [12,20]) and parentheses used to denote open intervals (e.g. (12,20)).

Below are examples of three of the more common conventions. In each case I show the coordinates of an ATG subsequence (in red) and a cut site (marked by a red triangle).

Base-counted, one-start (a.k.a. one-based, fully-closed)

ATG location: 7-9 or [7,9]
Cut site: 11^12 or (11,12)
Interval length = stop - start + 1
Notes:
  • This is by far the most common convention used by the major genome browsers, tools like BLAST and BLAT, etc.
  • Using a base-counted system is problematic for describing features that occur between bases such as insertions or enzyme cut sites.  To deal with this, some conventions replace the "-" usually used to separate the start and stop position with an alternative notation such as "^".  Alternatively, one could use the parentheses notation. 
  • Base-counted systems can short-hand a one base interval with just the start (or stop) location.  This is useful for denoting the location of SNPs, for example.


Space-counted, zero-start

ATG location: 6-9
Cut site: 11-11
Interval length = stop - start


Notes:
  • Less common, this convention is attractive because of its natural way of denoting features between bases such as insertions, etc.  Length calculations are also simple.  


Zero-based, half-open

ATG location: 6-9 or [6,9)
Cut site: 11-11 or [11,11)
Interval length = stop - start


Notes:
  • In this convention, the start base is included in the interval but not the stop base.
  • This convention is used in data formats (especially at UCSC) such as BED.
  • Although conceptually different, space-counted, zero-start and zero-based, half-open give the same start and stop coordinates for intervals.
  • Python programmers will find this convention familiar as Python indexes arrays in the same manner.


Conclusions

As you can see, the differences between conventions are subtle, which makes it very easy to make undetected errors. In my experience, programmers prefer to use one of the zero-start conventions since most computer languages use zero-based indexing where as biologists are more fond of the one-start conventions. This commonly leads to a juggling of conventions where software using a zero-start convention is switched to one-start for displaying results to the user. The UCSC Genome Browser is a good example of a site with straddling conventions between the user interface and the underlying data tables and it must cause some confusion because they've devoted a FAQ entry to the issue.

My personal favorite convention is the space-counted, zero-start convention. All intervals require a start and stop location (even for single base features like SNPs), which makes up for in consistency what it might lack in conciseness. Denoting a location between two bases, such as for an insertion or enzyme cut site, is conceptually clear and does not require syntatic trickery like the base-counted methods. My understanding is that one of the many differences between the public and Celera human genome sequencing efforts was the public effort used base-counted, one-start while Celera used space-counted, zero-start although I don't have a reference to confirm this.

I haven't touched on the coordinate conventions used in the human variant world which give positions relative to cDNA. The HGVS variant nomenclature is a good example of this.  Among the interesting features of this convention are starting counting (at 1) with the A of the ATG initiation codon, using negative coordinates and skipping the 0 base altogether. Good times.

Until our evil bioinformatics overlords descend upon us and force us to standardize to one system, I've started a cheat sheet to help me remember who's using what convention. Please have a look, correct me where I'm wrong and point out what I'm missing.

References

  • A great blog post from the Bergman lab dealing with genome coordinate conventions with an emphasis on transposable elements annotation.
  • UCSC's description of the zero-based, half-open convention.

3 comments:

Anonymous said...

In the half-open system, does the half that is open / closed change based on the strand?

Anonymous said...

In the half-closed / half-open system, does the end that is open depend on the strand?

mr.AA said...

Hey Brian, great question! So, regardless of the coordinate convention being used, a genome interval that's reported relative to the + strand would almost always be expressed as a different interval relative to the - strand, since you would start counting bases (or gaps) from different ends of the sequence. However, in practice most genomic data set coordinates are given relative to the + strand even when the feature is reported to be on the - strand. For example, the start of a gene on human chr2 is almost always given relative to its distance to the p-arm of the chromosome, even if the gene is on the - strand.