Monday, February 7, 2011

False positives

I tend to get worried when people (especially undergrads) have this idealized notion that whatever technology or methodology has been invented is TRIED, TRUE and 100% ACCURATE.

This was one of the things I had to grapple with during my undergrad days, when I first realized that scientific papers can be deeply flawed and controversial. Those were the days when you thought publications were gospel truth and held the answers to your assignments or projects. The public likes to believe scientists sit at the top of the academic food chain, generating the ideas that push human society forward in science and technology. That is MOSTLY true.

Now that I am deep within the realm of academic pursuits, it saddens me to see how readily people take things at face value (even more frightening is that I do it too). For example, genome-wide association studies (GWAS) are NOT exactly genome-wide. They genotype SELECTED variants that span the genome, biased towards certain regions. And because GWAS relies heavily on probabilistic and statistical underpinnings, it demands large sample sizes, often on the order of thousands; anything smaller runs the risk of generating false positives. Even then, GWAS cannot detect everything. In fact, it doesn't. For starters, it only accounts for common variants in COMMON diseases. Diseases driven by rarer variants, or variants with lower penetrance, are not detectable by GWAS, simply because it is statistically not possible to do so at an acceptable level of accuracy with insufficient sample size (funding) and coverage of the genome. In any case, GWAS has, IMO, failed on many occasions because of the large number of false positives and non-reproducible results it has generated.
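To see why false positives are baked into the design, here is a toy simulation (not real GWAS methodology; the SNP count and thresholds are illustrative assumptions) of testing half a million variants when NONE of them is truly associated with the disease:

```python
import random

random.seed(42)

N_SNPS = 500_000  # illustrative size of a genotyping chip
Z_CRIT = 1.96     # two-sided z threshold corresponding to p < 0.05

# Null GWAS: no SNP is truly associated, so each test statistic
# is pure noise drawn from a standard normal distribution.
false_positives = sum(
    1 for _ in range(N_SNPS) if abs(random.gauss(0, 1)) > Z_CRIT
)

print(f"{false_positives} of {N_SNPS} null SNPs reach nominal p < 0.05")
```

Roughly 5% of the tests (tens of thousands of SNPs) look "significant" by chance alone, which is exactly why the field has had to adopt stringent genome-wide significance thresholds and demand replication cohorts.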

If we want truly "genome-wide" studies, we need sequencing. But even sequencing has its caveats. Sequencing technologies are moving fast towards very accurate, high-throughput reads obtained by sequencing whole molecules in parallel. Currently, however, most mature technologies work by finding a consensus signal from an amplification of the query genome and then aligning it to a reference genome, which is itself really a consensus of several biased genomes. There have been many cases where amplification produces ambiguous sequence calls (because two bases give equally strong signals), and where the reference genome contains alleles that are actually rare in the general population (i.e. not typical of a "normal"/consensus genome), owing to heterogeneity amongst humans.
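The ambiguous-signal problem can be sketched with a toy consensus caller (a deliberate simplification, not any real base-calling pipeline; the function name and pileup format are made up for illustration). Each position holds the bases observed across reads, and a tie between the top two bases, as at a heterozygous site, cannot be resolved:

```python
from collections import Counter

def consensus(pileup):
    """Call one consensus base per position by majority vote.

    pileup: list of strings, one per position, each string holding the
    bases observed across reads at that position (a toy stand-in for
    the amplified signal a sequencer averages over). Returns 'N' when
    the top two bases tie, i.e. the equal-signal ambiguity case.
    """
    calls = []
    for bases in pileup:
        counts = Counter(bases).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            calls.append("N")  # equal signals -> ambiguous call
        else:
            calls.append(counts[0][0])  # clear majority base
    return "".join(calls)

# Position 3 is heterozygous: half the reads say A, half say G,
# so the consensus cannot pick either one.
print(consensus(["AAAA", "CCCC", "AAGG", "TTTT"]))  # prints "ACNT"
```

Real pipelines use base-quality scores and diploid genotype likelihoods rather than raw majority votes, but the core point stands: a consensus flattens real variation, whether in the reads or in the reference itself.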

What I want to emphasize here is not the fallibility of all the science that has been done, but rather the idea that while science is typically based on STATISTICALLY sound observations, there are caveats everywhere. I still believe science is working for us. But we definitely have to be cognizant of the pitfalls and the underlying assumptions we are making in order to make better decisions on scientific issues. Only then will we be able to weigh the pros and cons and address the matter at hand most effectively.
