Saturday, November 20, 2010

Identity Crisis: How to Pick Good Identifiers

What makes a good identifier? It seems like a simple question, but making the wrong decision can create mistakes that you'll live with for a long time. Unlike other tasks in bioinformatics, choosing how to identify reagents, samples, microarrays, etc. is one area where you'll have lots of opinions and lots of "help". For example, in the lab I've often had scientists want to cram sample names, day of the week, batch number, favorite flavor of ice cream, etc. all into the name used to identify a tube or plate. The only limit to their imagination was label real estate and font size.

When you get into a situation of identifier overload, it's good to take a step back and re-evaluate. What's the primary function of an identifier? By its very definition, an identifier should be a unique and unambiguous way to identify something (in math speak, this is called a 1-to-1 mapping). When I give you an identifier you should know exactly what I'm referring to - that's the unambiguity. On the flip side, I shouldn't have a whole bunch of identifiers referring to the same thing - that's the uniqueness.

It's not hard to find identifiers in genomics that fail to live up to these properties. Entrez gene identifiers are simply integers, which makes these identifiers completely ambiguous without context. Genes also often go by several names, which breaks the uniqueness and is why you should stick to official HUGO nomenclature. Although GenBank sequence identifiers include a version number (the part of the name after the decimal point), many people neglect it and the identifier becomes ambiguous. Mutations have historically been identified in a format such as E7V. Again, without context it's impossible to know which gene sequence is being referenced, or sometimes even whether the mutation denotes a nucleotide or an amino acid change. The Human Genome Variation Society is trying to replace this last system with a new nomenclature. It couldn't come too soon.

How about in your own lab? The most common issue with homegrown identifiers is the unintended breaking of the uniqueness or the unambiguity rule due to identifier overloading. "Identifier overloading" is when an identifier is layered with added information beyond what's needed for unique and unambiguous identification. There's almost always a good reason - "I really want the label to contain the date that it's run" or "using the sample name as the name for my microarray will be convenient". This all sounds good, but inevitably an unforeseen circumstance leads to a breakdown in the system. What if an array needs to be run again? What if a label was generated on one date but not run until the next day?

As a concrete example, I once worked with a group that used an internal, automatically generated database primary key as an identifier for reagents in the lab. This worked great until a software upgrade forced a dump and re-population of the database. The internal database identifier no longer matched the labels pasted on tubes in the lab. The identifier's two functions, reagent name and internal database key, were at odds.
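
To make that failure mode concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is not the system from the story: the table layout, the "RG" prefix, the reagent names and the helper names build_db and internal_key are all made up for illustration, and it assumes a dump that does not preserve the old key values. It shows the internal key shifting after a reload while a separately stored label identifier keeps pointing at the same reagent.

```python
import sqlite3


def build_db(rows):
    """Create a fresh in-memory database and load (label_id, name) rows.

    Hypothetical schema: `id` is the internal key the database hands out,
    `label_id` is the identifier printed on the tube.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reagent ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " label_id TEXT NOT NULL UNIQUE,"
        " name TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO reagent (label_id, name) VALUES (?, ?)", rows
    )
    return conn


def internal_key(conn, label_id):
    """Return the internal key currently assigned to a label identifier."""
    row = conn.execute(
        "SELECT id FROM reagent WHERE label_id = ?", (label_id,)
    ).fetchone()
    return row[0]


# Original database: three reagents, then one is retired, leaving a gap.
old = build_db([
    ("RG1", "Taq polymerase"),
    ("RG2", "dNTP mix"),
    ("RG3", "loading dye"),
])
old.execute("DELETE FROM reagent WHERE label_id = 'RG2'")

# "Upgrade": dump the surviving rows and re-populate a new database.
# Assumption: the dump carries label_id and name but not the old keys.
dump = old.execute(
    "SELECT label_id, name FROM reagent ORDER BY id"
).fetchall()
new = build_db(dump)

old_key = internal_key(old, "RG3")  # key before the reload
new_key = internal_key(new, "RG3")  # key after the reload, now different
name_after = new.execute(
    "SELECT name FROM reagent WHERE label_id = 'RG3'"
).fetchone()[0]
```

In this sketch a tube labeled with the old internal key for the loading dye would no longer match anything useful after the reload, while a tube labeled "RG3" still resolves to the same reagent.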

So, how do you come up with a good identifier? As a bioinformatician, I know that having unique, unambiguous identifiers is of the utmost importance and the best way to achieve this is through a simple, incremental naming system with no additional encumbrances. As a pragmatist, I know that an identifier that is this anonymous requires the user to do some sort of look-up to know anything useful, and until augmented reality becomes mainstream, this will be a problem.

In general, I try to stick to the following rules:
  1. Begin with a short prefix (2-5 letters). The sole purpose of the prefix is to clearly distinguish what type of object is being identified. Examples in the wild include "rs" (SNPs in dbSNP), "ENST" (transcripts at Ensembl), etc.
  2. Follow with a simple, incremental integer. Do not pad with 0's to make the identifiers a constant width. It may look pretty, but people are bad at counting 0's and you've capped the number of identifiers at your disposal. Do not get into a habit of trying to always start a new day's experiment at a thousand or similar. You're starting to overload your identifier.
  3. If you must use versioning and won't settle for just taking the next available ID, do it just once using a simple decimal point followed by an integer. Decide to do this from the start even if it's not obvious you need it.
  4. If your user demands more, make a deal with them. Stick with identifiers as above, but where space, time and convenience allow, let the user come up with a small amount of information that will make their lives easier. Include this when labels are printed, identifiers are shown on web pages, etc. The key here is the user's data doesn't have to be unique or even well thought out. It's your simple ID that's important.
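
The four rules above can be sketched in a few lines of Python. This is only one possible reading of them, not a reference implementation: the "TUBE" prefix, the class and function names, the " | " separator on labels and the choice to make the version optional are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from itertools import count
from typing import Optional

# Rule 1: 2-5 letter prefix. Rule 2: plain integer, no zero padding.
# Rule 3: at most one ".version" suffix, also a plain integer.
ID_PATTERN = re.compile(r"([A-Za-z]{2,5})([1-9][0-9]*)(?:\.([1-9][0-9]*))?")


@dataclass(frozen=True)
class Identifier:
    prefix: str
    number: int
    version: Optional[int] = None  # assumption: version may be omitted

    def __str__(self) -> str:
        base = f"{self.prefix}{self.number}"
        return base if self.version is None else f"{base}.{self.version}"


def parse_identifier(text: str) -> Identifier:
    """Parse text such as 'TUBE12' or 'TUBE12.3'; reject anything else."""
    match = ID_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid identifier: {text!r}")
    prefix, number, version = match.groups()
    return Identifier(prefix, int(number), int(version) if version else None)


class IdentifierFactory:
    """Hand out the next available integer for one type of object."""

    def __init__(self, prefix: str, start: int = 1) -> None:
        if not re.fullmatch(r"[A-Za-z]{2,5}", prefix):
            raise ValueError("prefix must be 2-5 letters")
        self.prefix = prefix
        self._counter = count(start)

    def next_id(self) -> Identifier:
        return Identifier(self.prefix, next(self._counter))


def format_label(identifier: Identifier, note: str = "") -> str:
    """Rule 4: the user's free-text note rides along but is not the ID."""
    return f"{identifier} | {note}" if note else str(identifier)


tubes = IdentifierFactory("TUBE")   # hypothetical prefix for tubes
first = tubes.next_id()             # TUBE1
second = tubes.next_id()            # TUBE2
label = format_label(second, "liver, batch B")
```

The note passed to format_label is never parsed or checked for uniqueness, which is the point of the deal in rule 4: only the part before the separator has to be right.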
Have your own rules? Would love to hear them!