Thursday, September 22, 2011

DIY Bioinformatics: A Whole New Galaxy

This blog post was first published by my company on June 16, 2011

Inspired by Will’s recent book review of Biopunk and the crowdsourced analysis of the European E. coli outbreak, I thought I would take another look at DIY (Do It Yourself) biology in this week’s post. Unlike some, I have no interest in trying to run molecular biology experiments out of my kitchen. As anyone who has had the misfortune of trying my cooking would tell you, if there is a way to make a PCR reaction dry and tasteless I'm sure I'd find it.
DIY bioinformatics I find more intriguing. I'm not a practitioner, as I'm too busy with PSTDIFY (Pay Somebody To Do It For You) bioinformatics, but I like the vision of the lone amateur scientist, sitting amongst a pile of empty pizza boxes and Red Bull cans, finding unknown biological treasure with just a laptop, curiosity and some serendipity. This vision is not so unlikely. Large biological data sets are readily available, including thousands of microarray experiments, genotypes and even full genomes. Someone modestly adept at programming, or at a package like R, can interrogate, correlate and mine this data - and indeed this is happening all the time.
What about the true amateur, however, who lacks even programming skills? Can the Excel Warrior or my web-savvy grandma participate in their own DIY bioinformatics adventure? That’s what I set out to discover this week.
As a test, I went back to a favorite paper of mine by Majewski and Ott (Genome Research, 2004). What I like about the paper is the number of insights made simply through careful mining of genomic databases. For example, even with inherently noisy data sets like dbSNP and the annotated human genome, the authors were able to clearly see the extent and importance (for splice regulation) of sites near exon-intron boundaries simply by comparing the overall frequency of SNPs discovered at these positions with the frequency at other sites.
This figure (F2) from the paper shows the low SNP frequency in the immediate 5' and 3' positions of the intron where it meets the neighboring exons. My test was to see if I could reproduce at least a part of this analysis using only free public tools and no programming. I chose the web-based analysis tool Galaxy, as it seemed to have a lot of the functionality I would need and I wasn’t very familiar with it - making me a better stand-in for the Red Bull-intoxicated amateur scientist. After some time poking around, I settled on these steps in Galaxy:
  1. Get introns from chromosome 12 via UCSC’s Table Browser (I just did chromosome 12 to keep my data sets manageable for this example).
  2. Get all SNPs from chromosome 12.
  3. Join the introns and SNPs, producing a table of only those SNPs that fall within an intron.
  4. Calculate the position of each SNP relative to the 5’ end of its intron.
  5. Count the number of SNPs found at each position relative to the 5’ end.
  6. Sort the results by position (probably not necessary).
  7. Limit the results to just positions within 50 bp of the exon-intron boundary.
  8. Plot the SNP frequency vs SNP location.
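For readers who do program, steps 3 through 7 above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Galaxy workflow itself. The function name `snp_counts_near_5prime`, the BED-style 0-based half-open coordinates, and the strand handling are all assumptions made for the example; the Galaxy tools may treat coordinates and strand differently.

```python
from bisect import bisect_left
from collections import Counter


def snp_counts_near_5prime(introns, snp_positions, window=50):
    """Count SNPs at each offset from the 5' end of an intron.

    Assumptions (hypothetical, for illustration only):
      - introns: iterable of (start, end, strand) on one chromosome,
        0-based half-open coordinates as in BED files
      - snp_positions: iterable of 0-based SNP positions on that chromosome
    Returns a Counter mapping offset -> SNP count, where offset 1 is the
    first intronic base at the 5' end.
    """
    snps = sorted(snp_positions)
    counts = Counter()
    for start, end, strand in introns:
        # Step 3: keep only SNPs inside this intron (the "join").
        lo = bisect_left(snps, start)
        hi = bisect_left(snps, end)
        for pos in snps[lo:hi]:
            # Step 4: offset from the 5' end; on the minus strand the
            # 5' end of the intron is at the high-coordinate side.
            offset = pos - start + 1 if strand == "+" else end - pos
            # Steps 5 and 7: count, keeping only offsets within the window.
            if offset <= window:
                counts[offset] += 1
    return counts
```

Calling `sorted(counts.items())` on the result gives the position-sorted table of step 6, ready for plotting. Like the Galaxy version, this sketch counts a SNP once per overlapping intron and does not normalize for intron length.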
For more detail, you can see my workflow here. And this was my final result:
This corresponds pretty well to the left portion of Majewski and Ott's own intron plot and was finished before cracking my second can of Red Bull. Score one for DIY!
Before I quit my day (and night) job to make room for the waves of empowered amateurs, it's worth pointing out a few minor details. First, this rudimentary analysis glosses over many important details (such as normalizing for intron lengths), and any publication-ready workflow would be much more complex. Second, like all tools of this type, Galaxy walks a fine line to balance functionality and usability. It took me quite a bit of exploring to find the right functions, and many of these functions probably only made sense to me because I already knew a lot about programming, databases, genomics, etc. Third, a web-based tool like Galaxy can hardly match the power, speed and flexibility a programmer has when analyzing data. And finally, although I am empowered by Galaxy to do the steps, the know-how of what questions to ask and the science to understand the observations I make come from many years of experience - unfortunately there's no shortcut around that.
With that said, Galaxy has some very nice features and is a powerful addition to the DIYer's toolbox. Stock up on your Red Bull now.

Tuesday, September 6, 2011

Excel Hacks: Calculating Oligo Temperature with the Wallace Rule

Calculating oligonucleotide melting temperatures in Excel using the Wallace-Itakura formula, Tm = 2*(number of A's and T's) + 4*(number of G's and C's), with the result in °C, is a piece of cake. I rarely use it myself, as I typically use nearest neighbor with SantaLucia thermodynamic parameters for oligo design, but on occasion it does come up. Assuming your oligo sequence is in A1, this formula will calculate the Wallace Tm:
=2*(LEN(A1)-LEN(SUBSTITUTE(SUBSTITUTE(A1,"A",""),"T","")))+4*(LEN(A1)-LEN(SUBSTITUTE(SUBSTITUTE(A1,"G",""),"C","")))
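The formula counts the number of A's and T's by making a new oligo sequence with the A's and T's removed. It then compares the length of the original oligo to the new one - the difference is the number of A's and T's. Counting G's and C's is done in the same manner. Any character other than A, C, G or T will be ignored in the calculation. Note that SUBSTITUTE is case-sensitive, so lowercase bases are ignored as well; replace A1 with UPPER(A1) in the formula if your sequences may be in lowercase.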
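For comparison, here is a minimal Python sketch of the same Wallace rule calculation. The function name `wallace_tm` is made up for this example. One deliberate difference from the Excel formula: the sketch uppercases the sequence first, so lowercase bases are counted rather than ignored.

```python
def wallace_tm(oligo: str) -> int:
    """Wallace rule melting temperature in degrees C: 2*(A+T) + 4*(G+C).

    Characters other than A, C, G or T (for example N) are ignored.
    Unlike the Excel formula, the sequence is uppercased first, so
    lowercase bases are counted too.
    """
    seq = oligo.upper()
    at_count = seq.count("A") + seq.count("T")
    gc_count = seq.count("G") + seq.count("C")
    return 2 * at_count + 4 * gc_count
```

For example, `wallace_tm("ACGT")` counts two A/T bases and two G/C bases, giving 2*2 + 4*2 = 12.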