Biostatistics: A Framework for Critique

The following is based on a lecture given at the AACR (American Association for Cancer Research) conference in 2011 by Donna Neuberg of the Dana-Farber Cancer Institute. It was presented in a session entitled “Biostatistics, a Framework for Critique”. Neuberg’s speciality is helping oncologists design their clinical trials and evaluate the results. The purpose of her presentation to advocates was to give us the tools with which to question trial designs and results.

Neuberg recommended reading three books to gain a simple introduction to statistics and genetics: “The Cartoon Guide to Statistics”, “The Cartoon Guide to Genetics”, and “How to Lie with Statistics”.

Neuberg opened with the classic story, from “The Cartoon Guide to Genetics,” entitled “The Case of Jacob’s Flock”. In the Bible story, Jacob agrees to tend the goats of his father-in-law, Laban, provided that he gets to keep all the speckled goat offspring while Laban keeps all the black ones. Laban’s original flock is all black, so he thinks this is a good way to get Jacob to watch his flock for free. But Jacob has a plan. He removes the branches from a tree and strips off the bark, making them all appear white. He puts them in the watering hole where the goats drink, and, lo and behold, one-quarter of the newborn goats are speckled! In this way, Jacob, innocent of any notion of genetics, accumulates a herd to support himself and his family, and he leaves town. It turns out, of course, that the goats were not purebred black, but had one gene for black, which is dominant, and one for speckled, which is recessive, so three-quarters of the offspring looked black and one-quarter speckled. These were “heterozy-goats[1]”?!

Neuberg’s intention was to illustrate the statistical paradigm that underlies medical research as well as the tools for defensive statistics, the latter enabling an advocate to approach a researcher, no matter how important he or she is, and say: “That doesn’t look right to me, please explain it”.

Study design statistics cover: significance level, power, sample size, detectable difference, survival curves, and p-values.

The first thing to do when summarizing data is to calculate the mean, which is the average of all data points. Next, the standard deviation measures the spread of the data, from high values to low values, based on the distance of each value from the mean. So if a population has heights that average 5’6” (the mean) and the standard deviation is 6 inches, most people will have a height within one standard deviation of the mean, that is, in the 5 foot to 6 foot range. Heights outside that range are less common but far from impossible, as only about 68% of a normally distributed population lies within one standard deviation of the mean. Truly unusual values are those more than two standard deviations from the mean, since about 95% of the population lies within that wider range.
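The 68% figure can be checked directly. The following is a minimal sketch, assuming heights are exactly normally distributed with the mean and standard deviation given above; the variable names are illustrative and not from the lecture.

```python
from statistics import NormalDist

# Heights from the text: mean 5'6" (66 inches), standard deviation 6 inches.
# Assumption: heights follow an exact normal distribution.
heights = NormalDist(mu=66, sigma=6)

# Share of the population between 5'0" (60 in) and 6'0" (72 in),
# i.e. within one standard deviation of the mean.
within_one_sd = heights.cdf(72) - heights.cdf(60)

# Share within two standard deviations (54 in to 78 in).
within_two_sd = heights.cdf(78) - heights.cdf(54)

print(f"within one SD:  {within_one_sd:.3f}")
print(f"within two SDs: {within_two_sd:.3f}")
```

Real height data are only approximately normal, so the percentages in an actual population will differ somewhat.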

Given a large data set and a normal distribution (the bell curve), the mean and standard deviation are valid ways of presenting the data. However, if the data set is small, say a 10-15 patient trial, then the data must be examined more carefully, as outliers (data points that are far away from the mean) can skew the results and create an asymmetric distribution with a large standard deviation. In these cases, one must be very careful in interpreting the results. Where there are outliers, the median (middle value) is a better choice than the mean. Generally speaking, for roughly symmetric data the mean and median should be very close in value; if they are not, the distribution is skewed or contains outliers, and the data deserve a closer look.

Neuberg gave an example of a very small Phase II trial for an experimental therapy in which the progression-free survival (PFS) for ten patients was, in months: 1, 1, 2, 3, 3, 3, 3, 4, 4, 48. The mean PFS is 7.2 months, but, as is evident, 90% of the patients had a PFS less than the mean. The median PFS is 3 months. The outlying patient with a PFS of 48 months skewed the results. Neuberg asked the question: what PFS should a patient taking this therapy expect? The lesson learned here is that in a small trial, always take a look at the individual data points.
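The arithmetic in this example is easy to reproduce. The sketch below uses the ten PFS values quoted above; the variable names are illustrative, and the final line (the mean with the 48-month patient left out) is an extra calculation added here to show how much one value matters, not something from the lecture.

```python
from statistics import mean, median

# Progression-free survival in months for the ten patients in the example.
pfs_months = [1, 1, 2, 3, 3, 3, 3, 4, 4, 48]

pfs_mean = mean(pfs_months)
pfs_median = median(pfs_months)

# How many patients did worse than the "average" patient?
below_mean = sum(1 for m in pfs_months if m < pfs_mean)

# Mean of the remaining nine patients once the 48-month outlier is set aside.
mean_without_outlier = mean([m for m in pfs_months if m != 48])

print(f"mean PFS:   {pfs_mean:.1f} months")
print(f"median PFS: {pfs_median:.1f} months")
print(f"patients below the mean: {below_mean} of {len(pfs_months)}")
print(f"mean without the outlier: {mean_without_outlier:.1f} months")
```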

This brings us to the p-value, which is a way of assessing whether an observed difference between groups could plausibly be due to chance. The p-value is the probability that, if the groups truly did not differ, you would find a difference between them at least as large as the one you observed. For example, say you are running a trial of drug A, the standard therapy, and drug B, the new therapy that you want to evaluate for its efficacy. The null hypothesis states that the two drugs have equal efficacy. This is a cancer trial, so the parameter that you use to measure the difference between the two drugs is progression-free survival (PFS). Let’s say that you observe in the trial that the PFS for drug A is 14 months and for drug B is 18 months, and the p-value (calculated by a computer program) is .035, or 3.5%. In other words, if the two drugs really had equal efficacy, the probability of seeing a difference at least as large as 18 months versus 14 months would be only .035. The commonly accepted threshold, called the level of significance, is .05. Since .035 falls below it, the difference is statistically significant: we reject the null hypothesis and declare drug B superior to drug A, because a difference this large would be too unlikely to arise by chance alone.

Action                                  Null True                               Null False
Accept null: New same as standard       Correct action                          Wrong action: Type II Error (10%–20%)
Reject null: New better than standard   Wrong action: Type I Error (5%–10%)     Correct action: Power (at least 80%)

This table makes no sense without some background. When researchers run an experiment and want to make a comparison between two variables, they make observations about their data while trying to minimize errors of interpretation. But it is impossible to eliminate all error, and each kind of error has consequences. Let’s take a different example to try to explain the above table.

Let’s say that a person is accused of committing a crime. In the U.S. system of justice, the accused is presumed innocent until proven guilty. In other words, the null hypothesis states that he is not guilty. If you accept the null hypothesis and the null is true, then you have made the correct decision. If the null hypothesis is false, i.e., the accused is really guilty, but you accepted it as being true, you have made a Type II Error, or a false negative. If you reject the null hypothesis and it is false, you have again made the correct decision. On the other hand, if you reject the null hypothesis and it is true, you have made a Type I Error, or false positive.

Other things being equal, setting your parameters, e.g., the level of significance, to reduce one type of error increases the other. In the criminal trial example, most people would rather commit a Type II Error (releasing an accused who is guilty) than a Type I Error (convicting an innocent man). The probability of a Type I Error is denoted by α and that of a Type II Error by β, with power = 1 – β (power is the probability of rejecting the null hypothesis when it is false). Generally, statisticians want to have a power of 80-90%. One can also reduce both Type I and Type II Errors by increasing the sample size; in a drug trial, that means increasing the number of patients on the trial.
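The link between sample size and power can be sketched with a coin-tossing test. This is an illustration only, not a calculation from the lecture: the biased coin's 70% chance of heads (BIASED_P) and the function names are assumptions chosen for the example. The null hypothesis is that the coin is fair, and the Type I Error rate α is held at or below .05 while the number of tosses grows.

```python
from math import comb

def upper_tail(n, k, p):
    """P(at least k heads in n tosses) when each toss lands heads with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_heads(n, alpha=0.05):
    """Smallest head count k such that a fair coin reaches k or more
    heads with probability at most alpha (the rejection threshold)."""
    for k in range(n + 2):
        if upper_tail(n, k, 0.5) <= alpha:
            return k

# Hypothetical alternative: the coin really lands heads 70% of the time.
BIASED_P = 0.7

results = {}
for n in (20, 100):
    k = critical_heads(n)
    alpha_actual = upper_tail(n, k, 0.5)   # Type I Error rate actually achieved
    power = upper_tail(n, k, BIASED_P)     # 1 - beta against the biased coin
    results[n] = (k, alpha_actual, power)
    print(f"n={n}: reject fair coin at {k}+ heads, "
          f"alpha={alpha_actual:.3f}, power={power:.3f}")
```

With α held fixed, the larger sample leaves far less room for a Type II Error, which is the sense in which enrolling more patients buys power.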

You may ask why the errors decrease if you increase your test size. We’ll use the analogy of calculating probabilities in a coin toss to see how the probability of getting extreme values changes when your test size increases. The probability of getting heads in a single toss of a fair coin is 50%, so if we’re testing a coin, out of 10 tosses we would expect 5 heads and 5 tails, or at times 6 and 4 or even 7 and 3. But what about more extreme results? The probability of getting 8 or more heads out of 10 tosses is .055 (this comes from the formula for the binomial distribution). However, if we increase the sample to 20 tosses, the probability of getting the same proportion, 16 or more heads out of 20 tosses, is only .006. So if our null hypothesis states that our test coin is no different from a standard coin, then, given a sufficient number of tosses, the coin would have to be pretty special for us to classify it as different.
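The two probabilities quoted for the coin can be reproduced with the binomial formula. A minimal sketch, assuming a fair coin; the function name prob_at_least is illustrative.

```python
from math import comb

def prob_at_least(heads, tosses):
    """Probability of getting at least `heads` heads in `tosses` tosses
    of a fair coin: count the favourable outcomes and divide by 2**tosses."""
    favourable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favourable / 2**tosses

p_8_of_10 = prob_at_least(8, 10)
p_16_of_20 = prob_at_least(16, 20)

print(f"P(8 or more heads in 10 tosses)  = {p_8_of_10:.3f}")
print(f"P(16 or more heads in 20 tosses) = {p_16_of_20:.3f}")
```

Both results are 80% heads or better, yet doubling the number of tosses makes that outcome roughly nine times less likely under a fair coin.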

If the study size is too large and the p-value is less than .05, then the results are statistically significant but may not be clinically relevant. Neuberg gave an example in which the standard therapy has a 20% response rate. If you enroll 3500 patients in a single-arm trial, you will increase the precision of the test to the point that an increase in the response rate to 22% will be statistically significant. In this case, the statistics are fine but the clinical result is not meaningful, and she said that studies of this type that attempt to prove the efficacy of a new drug over the standard therapy are unethical.
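A rough check of this example is sketched below. The lecture summary does not say which test was used, so the one-sided normal-approximation test of a single proportion here is an assumption, as are the function name and the 100-patient comparison trial.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p(observed_rate, null_rate, n):
    """One-sided p-value for a single proportion using the normal
    approximation (an assumed test, not necessarily the one in the lecture)."""
    standard_error = sqrt(null_rate * (1 - null_rate) / n)
    z = (observed_rate - null_rate) / standard_error
    return 1 - NormalDist().cdf(z)

# 22% observed response rate against a 20% standard-therapy rate.
p_large = one_sided_p(0.22, 0.20, 3500)  # the 3500-patient trial in the text
p_small = one_sided_p(0.22, 0.20, 100)   # hypothetical 100-patient trial

print(f"n=3500: p = {p_large:.4f}")
print(f"n=100:  p = {p_small:.4f}")
```

The same two-point improvement falls well below .05 with 3500 patients and nowhere near it with 100, which is why a very large trial can make a clinically trivial difference look significant.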

The importance of statistical analysis is illustrated by a case in which two statisticians from M.D. Anderson Cancer Center reviewed the published papers from a Duke study and found statistical errors. The study purported to show that a certain company’s biomarkers could predict response in lung cancer patients to chemotherapy based on the patient’s genetic makeup. When the statisticians reported their results, Duke stopped the trials, the lead researcher resigned from Duke, the papers were retracted, and the families of the patients on the trials hired lawyers. See July 7, 2011 “New York Times” article by Gina Kolata, at

[1] A plant or animal has two genes for a trait, one from each parent. If the two genes code for the same form of the trait, for example both for black hair, then the animal is homozygous for that trait. If the two genes code for different forms, one for black and one for speckled, then the animal is heterozygous for that trait.