天涯小站 2.0


也跑: Large data medical research, the sense of loss among overseas Chinese, and religious testimony ...

2014-11-10 05:29 AM | Posted by: 星光 | Views: 1168 | Comments: 8 | Original author: alsoRun

 
昨夜雨 has posted a statistical story about estimating the number of rats in NYC. I am following with this one, also at the request of 666.

OK, statistics is about analyzing data when the individuals in your study population vary substantially. If you are interested in people, you know they are so alike and yet so different. People differ in height, weight, taste and political views. In medical research, patients' responses to a treatment can be very different. So if a new drug can extend lung cancer patients' lives, it does not have to extend everybody's, just the majority of them, so that the average one-year survival rate is higher, for example.

The FDA's basic criterion for new drug approval is that the drug company demonstrate, via clinical trials, that the new drug can, on average, extend life by a meaningful amount of time. Other end points can also be used, but survival is a more objective one. Of course, adverse effects and economic burden also need to be considered. But because of the population variability, a demonstrated extension of life in 500 patients may be due to the fact that your sample consists of 500 "good patients" who are more responsive to the drug than most other people in the target population. The statistical tools to combat this issue are twofold. One is randomization, in which patients are allocated to different treatment groups by computer randomization to free the allocation from human bias. The other is to require a large enough sample size. Specifically, the FDA requires the number of patients to be large enough that the probability (chance) of finding an effect of the observed size would be less than 5% if the drug had, on average, no effect in the entire target population.
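To make the randomization step concrete, here is a minimal sketch of computer allocation. The patient IDs (1 to 500), the two-arm split and the `randomize` helper are assumptions chosen for illustration, not a description of any actual trial procedure.

```python
import random

def randomize(patient_ids, seed=None):
    """Randomly split patients into two arms (hypothetical helper).

    With an odd number of patients, the extra one lands in the control arm.
    """
    rng = random.Random(seed)
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)              # the computer, not a person, decides the order
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# 500 hypothetical patients, identified only by number.
arms = randomize(range(1, 501), seed=2014)
print(len(arms["treatment"]), len(arms["control"]))
```

Because the shuffle ignores everything about the patients, nobody can steer the "good patients" into the treatment arm, on purpose or by accident.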


This so-called 5% false positive rate requirement could potentially be used by a drug company to seek approval of useless drugs: the company could just run clinical trials on 100 useless drugs, and about 5 of them would be expected to turn out as false positives. But a clinical trial is hugely expensive, and a drug company usually invests only in a drug with real potential. So this has not been an issue.
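The "5 out of 100" arithmetic can be checked with a small simulation. This is only a sketch under simplifying assumptions that are not in the post: outcomes are standard normal in both arms, the variance is treated as known, and the test is a one-sided z-test at the 5% level. Every drug here is useless by construction.

```python
import math
import random

def one_sided_p_value(treated, control):
    """Upper-tail p-value for 'treated mean > control mean' (z-test, unit variance assumed)."""
    n_t, n_c = len(treated), len(control)
    diff = sum(treated) / n_t - sum(control) / n_c
    z = diff / math.sqrt(1.0 / n_t + 1.0 / n_c)
    return 0.5 * math.erfc(z / math.sqrt(2))   # equals 1 - Phi(z)

def count_false_positives(n_drugs=100, n_per_arm=250, alpha=0.05, seed=0):
    """Run one trial per useless drug and count how many 'pass' by chance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_drugs):
        # Both arms come from the same distribution: the drug does nothing.
        treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        if one_sided_p_value(treated, control) < alpha:
            hits += 1
    return hits

false_positives = count_false_positives()
print(f"{false_positives} of 100 useless drugs passed the 5% test")
```

The exact count changes with the seed, but over many runs it averages about 5 per 100 trials, whatever the sample size per arm.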


Starting in the 1990s, however, large data medical research came onto the scene. One example is genetic research. Microarray technology made it possible to compare the expression levels of thousands of genes between cancer and normal populations. So it is like conducting thousands of clinical trials simultaneously. There was an explosion of false claims of scientific discoveries in research journals.


People finally came to their senses, and here is a simple way to explain it. Let x_i be the measured difference between the cancer and normal groups for gene i. We then have x_i, i=1,...,n, where n is the number of genes. Suppose that a larger x_i means a bigger difference between the cancer and normal groups, and that a large enough x_i is taken as evidence of an underlying difference in gene i between the two groups. Now you measure one thousand genes and pick the maximum among x_1,...,x_n. This max(x_i) is generally much larger than a randomly chosen x_i. So the old standard for judging a single x_i can no longer be used to judge the size of max(x_i). In summary, you cannot treat the maximum of x_i, i=1,...,n, as if you had done one experiment and observed max(x_i).
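A quick simulation shows how far max(x_i) sits from a typical x_i. As an assumption made purely for illustration, each x_i is drawn as an independent standard normal, meaning no gene differs between the groups at all, and the "old standard" is taken to be the one-sided 5% cutoff for a single x_i.

```python
import random
import statistics

def simulate_maxima(n_genes=1000, n_repeats=200, seed=1):
    """For each repeat, draw n_genes null values x_i and keep only the largest."""
    rng = random.Random(seed)
    return [max(rng.gauss(0, 1) for _ in range(n_genes))
            for _ in range(n_repeats)]

maxima = simulate_maxima()

# Cutoff that a single null x_i exceeds only 5% of the time (about 1.645).
single_threshold = statistics.NormalDist().inv_cdf(0.95)

# Share of experiments whose maximum clears the single-gene cutoff.
share_above = sum(m > single_threshold for m in maxima) / len(maxima)

print(f"single-gene 5% cutoff: {single_threshold:.3f}")
print(f"average max over 1000 null genes: {statistics.mean(maxima):.2f}")
print(f"share of experiments where the max beats the cutoff: {share_above:.2f}")
```

Under these assumptions the largest of 1000 null values is typically above 3, so it clears a cutoff designed for one measurement essentially every time, even though nothing real is going on.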


This can also explain the sense of loss among overseas Chinese. Many overseas Chinese visit their homeland and meet a few very successful old friends or classmates. They cannot help but wonder: what if I had not come to the USA? The fallacy in this thinking is to take the extreme value as the average behavior. Suppose you had 50 classmates and each person's success level is x_i. The few successful classmates are the max(x_i); they are not just a typical x_i. In general, you should not expect to have been that special classmate who happened to be the max(x_i).


Another lesson learned in the process is to demand replication of findings obtained from large scale data. Nowadays, both Science and Nature require replication of findings in an independent set of subjects for genome-wide association studies. To appreciate this, suppose 100 people drove along a very risky road on a dark rainy day. 99 fell off the cliff and one survived. The survivor will testify that God favors him. But the truth of the matter could also be that God in fact favored a random one out of the 100 (everybody had the same 1/100 probability). How do you know which is true? Ask the driver to drive that road in those conditions again, to replicate.
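The value of replication can be sketched numerically under the "pure luck" hypothesis. As an illustration only, the number of drivers is scaled up from 100 to 100,000 so the counts are stable, and each drive is assumed to be survived independently with probability 1/100.

```python
import random

def replicate(n_drivers=100_000, p_survive=0.01, seed=3):
    """Send everyone down the road once, then send only the survivors again."""
    rng = random.Random(seed)
    first = sum(rng.random() < p_survive for _ in range(n_drivers))
    # Under pure luck, a first-round survivor is no safer the second time.
    second = sum(rng.random() < p_survive for _ in range(first))
    return first, second

first, second = replicate()
print(f"survived once: {first}, survived again: {second}")
```

If survival is luck, only about 1 in 100 of the first-round survivors makes it through the second drive, so a chance finding rarely survives replication. A driver who really is favored would survive again.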


   



 

Latest comments

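2014-11-12 07:59 AM
Haha, 也老 is giving us a statistics lesson.

I wrote a few words too. They ran a bit long, though, so I posted them as a separate blog entry. 也老, please come over and answer my questions.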
2014-11-11 11:01 AM
"This can also explain the sense of loss among overseas Chinese": the reverse can be said too. Overseas Chinese are not a random subsample of all Chinese, so the fact that they are successful here does not prove that the USA is better. You must compare against an equivalent population in China (e.g., those who could have come but did not).
2014-11-11 10:59 AM
Not clear: "Suppose that a larger x_i means more difference between the cancer and normal group and a large enough x_i is taken as evidence that there is underlying difference in gene i between cancer and normal group". Are you talking about the frequency of a gene in a population?
2014-11-11 01:01 AM
5. The problem is that some chunks of DNA have a higher x_i even among normal people in the population, and that may, in some cases, lead to false positives.

I'm guessing on 5, and I'm not sure about 3 and 4 either. At this point I'm no longer following you.
2014-11-11 12:52 AM
3. In genetic analysis, a chunk of DNA is treated like a drug. So researchers try to find the chunk that is most strongly correlated with a certain cancer.
4. Here, x_i is the measure of variation in the DNA chunk between the patients and the normal subjects. A larger x_i is considered to correspond to a higher correlation between the DNA chunk and the disease.
2014-11-11 12:26 AM
First of all, thank you for writing this piece. I love it.
I'm not sure I understand everything, so I'll just try to summarize the key points and/or my interpretation to see where I may have gotten it wrong.
1. For drug trials, a theoretical 5% chance of being wrong about the drug's effectiveness is acceptable to the FDA.
2. If one chooses the sample size to be just large enough to meet the requirements, 5 out of 100 different useless drugs may test false positive, and thus be eligible for marketing to the public.
2014-11-10 10:10 PM
The last paragraph is very cool.
2014-11-10 09:50 PM
I read it but did not understand it.
