Sunday, December 12, 2010

Excel Hacks: Calculating GC%

As a quick hack, you can calculate the GC% of a DNA sequence in Excel without breaking out the macros. Here’s how you do it assuming your sequence is in Excel cell A1.

= (1 - LEN(SUBSTITUTE(SUBSTITUTE(A1,"G",""),"C",""))/LEN(A1)) * 100

Although the formula is a bit daunting, the principle is straightforward. First, use the SUBSTITUTE command to get a version of your sequence with all G’s and C’s removed. Determine the length of this new sequence using the LEN command and compare this to the length of the original sequence. This gives you the fraction of sequence that is NOT a G or a C. To get the GC%, subtract this fraction from 1 and multiply by 100% to go from a fraction to a percent.

This will probably satisfy most folks, but it’s not perfect. For an accurate calculation, we need to take into account ambiguous IUPAC base codes. For example, a "N" should be treated as 0.5 of a G-C base. A "D" should count as 0.33. Although it’s possible to extend the hack above to handle these cases, it might be a good time to break out Perl or Python.

Sunday, December 5, 2010

Excel Hacks: Plate Well to Number Conversions

Here’s a simple but useful transformation for going from a well “address” on a plate (e.g. “B5”) to a numbered location. For a 96 well plate, the numbered location will be a number from 1 to 96. Typically, this numbering goes first from left to right across the 12 columns and then top to bottom down the 8 rows. So, A1 would map to 1, B1 would map to well 13 and H12 would map to 96.

Let's assume the plate address you want to convert is in the Excel cell A1. The conversion is done in Excel as shown below:

= (CODE(LEFT(A1,1))-65)*12 + RIGHT(A1,LEN(A1)-1)

The formula assumes the well format is one character denoting the row followed by one or more integers which denote the column. The left half of the formula before the '+' converts the character into an integer using the CODE function. Since "A" has a code of 65, we subtract this off before multiplying by the row length. The portion of the formula after the '+' simply extracts the integer portion of the well address and adds it to get the well number. The formula works with padded (A01)and unpadded (A1) representations.

To go back from a numbered address to the letter-number format, use the following:

= CHAR(INT((A1-1)/12)+65) & (MOD(A1-1, 12)+1)

The portion of the formula left of the '&' handles the row calculation. The CHAR function performs the opposite duty of CODE, taking a number and converting it back to a character. The portion to the right of the '&' calculates the integer (column) portion of the well address.

It's quite easy to do 384-well or other formats. Just replace the red 12 with the correct number of columns. In the case of a 384-well format, you'd replace it with 24.

It is less common to see plate wells converted into numbers from top to bottom and then left to right, although it does happen. In this case, B1 maps to well 2 and H2 to 16 for a 96 well plate.

The conversion from plate well to number in this case is:

= (CODE(LEFT(A1,1))-65) + 1 + (RIGHT(A1,LEN(A1)-1)-1)*8

And to go back:

= CHAR(MOD((A1-1),8)+65) & (INT((A1-1)/8)+1)

Note that instead of using the number of columns (12), we're now using the number of rows (8). It's left as an exercise to the student to tweak the formula for 384-well formatted plates ;)



Saturday, November 20, 2010

Identity Crisis: How to Pick Good Identifiers

What makes a good identifier? It seems like a simple question, but making the wrong decision can create mistakes that you'll live with for a long time. Unlike other tasks in bioinformatics, choosing how to identify reagents, samples, microarrays, etc is one area where you'll have lots of opinions and lots of "help". For example, in the lab I've often had scientists want to cram sample names, day of the week, batch number, favorite flavor of ice cream, etc all into the name used to identify a tube or plate. The only limit to their imagination was label real estate and font size.

When you get into a situation of identifier overload, it's good to take a step back and re-evaluate. What's the primary function of an identifier? By its very definition, an identifier should be a unique and unambiguous way to identify something (in math speak, this is called a 1-to-1 mapping). When I give you an identifier you should know exactly what I'm referring to - that's the unambiguity. On the flip side, I shouldn't have a whole bunch of identifiers referring to the same thing - that's the uniqueness.

It's not hard to find identifiers in genomics that fail to live up to these properties. Entrez gene identifiers are simply integers, which makes these identifiers completely ambiguous without context. Genes also often go by several names which breaks the uniqueness and is why you should stick to official HUGO nomenclature. Although Genbank sequence identifiers include a version number (the part of the name after the decimal point), many people neglect it and the identifier becomes ambiguous. Mutations have historically been identified in a format such as E7V. Again, without context it's impossible to know which gene sequence is being referenced or sometimes even whether the mutations are denoting a nucleotide or amino acid change. The Human Genome Variation Society is trying to replace this last system with a new nomenclature. It couldn't come too soon.

How about in your own lab? The most common issue with home grown identifiers is the unintended breaking of the uniqueness or the unambiguity rule due to identifier overloading. "Identifier overloading" is when an identifier is layered with added information beyond what's needed for unique and unambiguous identification. There's almost always a good reason - "I really want the label to contain the date that it's run" or "using the sample name as the name for my microarray will be convenient". This all sounds good, but inevitably an unforeseen circumstance leads to a break down in the system. What if an array needs to be run again? What if a label was generated on one date but not run until the next day?

As a concrete example, I once worked with a group that used an internal, automatically generated database primary key as an identifier for reagents in the lab. This worked great until a software upgrade forced a dump and re-population of the database. The internal database identifier no longer matched the labels pasted on tubes in the lab. The identifiers' double function of reagent name and internal database key were at odds.

So, how do you come up with a good identifier? As a bioinformatician, I know that having unique, unambiguous identifiers is of the utmost importance and the best way to achieve this is through a simple, incremental naming system with no additional encumbrances. As a pragmatist, I know that an identifier that is this anonymous requires the user to do some sort of look-up to know anything useful, and until augmented reality becomes mainstream, this will be a problem.

In general, I try to stick to the follow rules:
  1. Begin with a short prefix (2-5 letters). The sole purpose of the prefix is to clearly distinguish what type of object is being identified. Examples in the wild include "rs" (SNPs in dbSNP), "ENST" (transcript at Ensembl), etc
  2. Follow with a simple, incremental integer. Do not pad with 0's to make the identifiers a constant width. It may look pretty, but people are bad at counting 0's and you've capped the number of identifiers at your disposal. Do not get into a habit of trying to always start a new day's experiment at a thousand or similar. You're starting to overload your identifier.
  3. If you must use versioning and won't settle for just taking the next available ID, do it just once using a simple decimal point followed by an integer. Decide to do this from the start even if it's not obvious you need it.
  4. If your user demands more, make a deal with them. Stick with identifiers as above, but where space, time and convenience allow, let the user to come up with a small amount of information that will make their lives easier. Include this when labels are printed, identifiers are shown on web pages, etc. The key here is the user's data doesn't have to be unique or even well thought out. It's your simple ID that's important.
Have your own rules? Would love to hear them!