也跑 (alsoRun): Large-data medical research, the sense of loss among overseas Chinese, and religious testimony ...

2014-11-10 05:29 AM | Posted by: 星光 | Views: 1168 | Comments: 8 | Original author: alsoRun


昨夜雨 has posted a statistical story on estimating the number of rats in NYC. I am following with this one, also at the request of 666.

Ok, statistics is about analyzing data when the individuals in your study population vary substantially. If you are interested in people, you know they are so alike and at the same time so different. People differ in height, weight, taste and political views. In medical research, patients' responses to a treatment can be very different. So if a new drug can extend lung cancer patients' lives, it does not have to extend everybody's, just enough of them that the average outcome, for example the one-year survival rate, is higher.

The FDA's basic criterion for new drug approval is that the drug company demonstrate, via clinical trials, that the new drug can, on average, extend life by a meaningful amount of time. Other end points can also be used, but survival is a more objective one. Of course, adverse effects and economic burden also need to be considered. But because of the population variability, a demonstrated extension of life in 500 patients may be due to the fact that your sample consists of 500 "good patients" who are more responsive to the drug than most other people in the target population. The statistical tools to combat this issue are twofold. One is randomization, in which patients are allocated to different treatment groups by computer randomization to free the allocation from human bias. The other is to require a large enough sample size. Specifically, the FDA requires that the number of patients be large enough that the probability (chance) of finding an effect at least as large as the one observed is less than 5% if the drug has, on average, no effect in the entire target population.
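To see why sample size matters for the 5% requirement, here is a minimal sketch. The survival counts are made up for illustration, and the one-sided two-proportion z-test is an assumption on my part, since the text above does not say which test is used. The function name `one_sided_p_value` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(survivors_control, survivors_treated, n_per_arm):
    """Chance of seeing a survival difference at least this large
    if the drug truly has no effect (normal approximation)."""
    p_control = survivors_control / n_per_arm
    p_treated = survivors_treated / n_per_arm
    # Under "no effect" both arms share one survival rate, so pool them.
    pooled = (survivors_control + survivors_treated) / (2 * n_per_arm)
    std_err = sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z = (p_treated - p_control) / std_err
    return 1 - NormalDist().cdf(z)

# Same observed one-year survival rates (40% vs 50%), different trial sizes.
# These counts are hypothetical.
p_small = one_sided_p_value(20, 25, 50)      # 50 patients per arm
p_large = one_sided_p_value(100, 125, 250)   # 250 patients per arm

print(f"50 per arm:  p = {p_small:.3f}")
print(f"250 per arm: p = {p_large:.3f}")
```

With the same 10-point difference in survival rate, the small trial does not get under the 5% bar while the larger one does, because chance alone can easily produce such a gap among only 50 patients per arm.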

This so-called 5% false positive rate requirement could potentially be exploited by a drug company to seek approval of useless drugs: a company could just run clinical trials on 100 useless drugs, and on average about 5 of them would turn out to be false positives. But a clinical trial is hugely expensive, and a drug company usually only invests in a drug with real potential. So this has not been an issue.
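The arithmetic behind the "100 useless drugs" scenario can be checked with a small simulation. This is a sketch under one assumption: the 100 trials are independent, so each useless drug passes the 5% bar by chance on its own. The function name `count_false_positives` is hypothetical.

```python
import random

ALPHA = 0.05      # false positive rate per trial, as in the text
N_DRUGS = 100     # number of useless drugs put through trials

def count_false_positives(n_drugs, alpha, rng):
    """Count how many useless drugs pass purely by chance.

    When a drug has no effect, its p-value is uniform on [0, 1],
    so it falls below alpha with probability alpha.
    """
    return sum(rng.random() < alpha for _ in range(n_drugs))

rng = random.Random(0)
runs = [count_false_positives(N_DRUGS, ALPHA, rng) for _ in range(10_000)]
average_false_positives = sum(runs) / len(runs)

# Exact chance that at least one of the 100 useless drugs passes.
prob_at_least_one = 1 - (1 - ALPHA) ** N_DRUGS

print(f"average false positives per 100 drugs: {average_false_positives:.2f}")
print(f"chance of at least one false positive: {prob_at_least_one:.3f}")
```

The average comes out near 5, and the chance that at least one useless drug passes is 1 - 0.95^100, which is about 99.4%.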

Starting in the 1990s, however, large-data medical research came onto the scene. One example is genetic research. Microarray technology made it possible to compare the expression levels of thousands of genes between cancer and normal populations. So it is like conducting thousands of clinical trials simultaneously. There was an explosion of false claims of scientific discoveries in research journals.

People finally came to their senses, and here is a simple way to explain it. Let x_i be the measured difference between the cancer and normal groups for gene i. We then have x_i, i=1,...,n, where n is the number of genes. Suppose that a larger x_i means more difference between the cancer and normal groups, and a large enough x_i is taken as evidence that there is an underlying difference in gene i between the two groups. Now you test one thousand genes and pick the maximum x_i among x_1,...,x_n. This max(x_i) is generally much larger than a randomly chosen x_i, even when no gene has any real difference. So the old standard for judging a single x_i can no longer be used to judge the size of max(x_i). In summary, you cannot treat the maximum of x_i, i=1,...,n, as if you had done one experiment and observed max(x_i).
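A quick simulation makes the gap between a single x_i and max(x_i) concrete. This is only a sketch: it assumes every gene has no real difference and that each x_i is an independent standard normal value, which the text above does not specify. The cutoff is the one-sided 5% threshold for a single test.

```python
import random
from statistics import NormalDist

N_GENES = 1000          # genes measured per experiment
N_EXPERIMENTS = 500     # repeated experiments, all with no real effect
CUTOFF = NormalDist().inv_cdf(0.95)   # 5% cutoff for one x_i, about 1.645

rng = random.Random(1)
single_hits = 0
max_hits = 0
max_values = []
for _ in range(N_EXPERIMENTS):
    # Assumed model: pure noise, x_i ~ N(0, 1) for every gene.
    x = [rng.gauss(0.0, 1.0) for _ in range(N_GENES)]
    single_hits += x[0] > CUTOFF     # judge one pre-chosen gene
    biggest = max(x)
    max_hits += biggest > CUTOFF     # judge the best-looking gene
    max_values.append(biggest)

single_rate = single_hits / N_EXPERIMENTS
max_rate = max_hits / N_EXPERIMENTS
typical_max = sum(max_values) / len(max_values)

print(f"one pre-chosen gene passes the cutoff: {single_rate:.1%}")
print(f"the maximum passes the cutoff:         {max_rate:.1%}")
print(f"average size of max(x_i):              {typical_max:.2f}")
```

A gene chosen in advance passes the cutoff roughly 5% of the time, while the maximum over 1000 genes passes it in essentially every experiment, even though nothing real is going on.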

This can also explain the sense of loss among overseas Chinese (海外华人的失落感). Many overseas Chinese visit their homeland and meet a few very successful old friends or classmates. They cannot help but wonder: what if I had not come to the USA? The fallacy in this thinking is to take the extreme value as the average behavior. Suppose you had 50 classmates and each person's success level is x_i. The few successful classmates are the max(x_i); they are not just a typical x_i. In general, you should not expect that you would have been that special classmate who happened to be the max(x_i).

Another lesson learned in the process is to demand replication of findings obtained from large-scale data. Nowadays, both Science and Nature require replication of findings in an independent set of subjects for genome-wide association studies. To appreciate this, suppose 100 people drove along a very risky road on a dark rainy day. 99 fell off the cliff and one survived. The survivor will testify that God favored him. But the truth of the matter could also be that God in fact favored a random one out of the 100 (everybody had the same 1/100 probability). How do you know which is true? Ask the driver to drive that road in those conditions again, to replicate.
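Why replication helps can be sketched in a few lines. The assumptions here are mine: every finding is a pure false lead, the discovery and replication samples are independent, and both are judged at the same 5% level. The number of findings is arbitrary.

```python
import random

ALPHA = 0.05            # false positive rate in each sample
N_FINDINGS = 100_000    # hypothetical findings with no real effect

rng = random.Random(2)
passed_first = 0
passed_both = 0
for _ in range(N_FINDINGS):
    first = rng.random() < ALPHA     # looks significant in the discovery sample
    second = rng.random() < ALPHA    # looks significant in an independent sample
    passed_first += first
    passed_both += first and second

first_rate = passed_first / N_FINDINGS
both_rate = passed_both / N_FINDINGS

print(f"pass discovery only:            {first_rate:.2%}")
print(f"pass discovery and replication: {both_rate:.2%}")
```

A false lead gets through one sample about 5% of the time but through both only about 0.05 x 0.05 = 0.25% of the time. The driver story works the same way: if survival was pure 1/100 luck, surviving a second drive by luck has probability 1/100 again, or 1/10,000 for both drives together.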
