I'm currently reading this book, but find my head swimming… I keep wondering about all the misinterpreted stats I've used in the past, and realising that I never had any concept of the Power of a test (though I guess I was always aware that more samples tended to give more accurate results). The power of any hypothesis test is the probability that it will yield a statistically significant outcome (defined in this example as p < 0.05). In 100 flips, a fair coin will show between 40 and 60 heads in 95% of trials, so for an unfair coin the power is the probability of a result outside this 40-60 range (there's a quick simulation of this after the list below). The power is affected by three factors:
1. The size of the bias you're looking for. A huge bias is much easier to detect than a tiny one.
2. The sample size. By collecting more data (more coin flips), you can more easily detect small biases.
3. Measurement error. It's easy to count coin flips, but many experiments deal with values that are harder to measure, such as medical studies investigating symptoms of fatigue or depression.
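Just to convince myself, here's a quick simulation of the coin example in Python (my own sketch, not from the book; it assumes 100 flips per trial and uses the 40-60 band quoted above):

```python
import random

def power_estimate(p_heads, n_flips=100, n_trials=10_000):
    """Estimate the test's power: the fraction of trials whose head count
    falls outside the 40-60 band a fair coin stays inside ~95% of the time."""
    significant = 0
    for _ in range(n_trials):
        heads = sum(random.random() < p_heads for _ in range(n_flips))
        if heads < 40 or heads > 60:  # a "statistically significant" result
            significant += 1
    return significant / n_trials

print(power_estimate(0.70))  # big bias: power close to 1, nearly always detected
print(power_estimate(0.55))  # small bias: power around 0.13, usually missed
```

Raising n_flips (and recalculating the fair-coin band to match) shows factor 2 at work: the same 55% coin that usually slips past 100 flips gets caught most of the time with 1,000.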
Though with all this fine tuning, I'm reminded of what Dick Jackson, my Agricultural Botany lecturer, once said to us: "If you need to use statistics to see if something works, then the effect in the field is unlikely to make much difference". Combine this with the "fact" that when experimental findings are applied to general farm practice, they only deliver about 60% of the original outcomes. (Can't remember if it was 60% exactly… but something like that.)
He gives some good tips. For example: Ensure that your statistical analysis really answers your research question. Additional measurements that are highly dependent on previous data do not prove that your results generalize to a wider population; they merely increase your certainty about the specific sample you studied.
One thing that really made an impact on me was the diagram on p40, where he shows an example in which ten drugs out of 100 actually work. In his experiment he gets statistically significant results for 13, but 5 of them are false positives. So the chance of any one of his "working" drugs being truly effective is just 8 in 13, or 62%, and the false discovery rate is 38%.
Each square in the grid represents one drug. In reality, only the 10 drugs in the top row work. Because most trials can't perfectly detect every good medication, he assumes his tests have a statistical power of 0.8, though in practice most studies have much lower power.
So of the 10 good drugs, he'll correctly detect around 8 of them.
Because his p value threshold is 0.05, he has a 5% chance of falsely concluding that an ineffective drug works. Since 90 of his tested drugs are ineffective, this means he'll conclude that about 5 of them have significant effects.
He performs his experiments and concludes there are 13 "working" drugs: 8 good drugs and 5 false positives. The chance of any given "working" drug being truly effective is therefore 8 in 13, just 62%! In statistical terms, his false discovery rate (the fraction of statistically significant results that are really false positives) is 38%. To my chagrin, I probably would have been quite happy to conclude that the 13 I'd discovered were all working drugs!
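The arithmetic is simple enough to check yourself. Here it is as a few lines of Python (my own recreation of the book's numbers, not Reinhart's code):

```python
# Numbers from the p40 example in the book.
n_drugs     = 100
n_effective = 10      # drugs that actually work
power       = 0.8     # chance a real effect is detected
alpha       = 0.05    # p-value threshold

true_positives  = n_effective * power               # 10 * 0.8 = 8 genuine hits
false_positives = (n_drugs - n_effective) * alpha   # 90 * 0.05 = 4.5, "about 5"

significant = true_positives + false_positives      # roughly 13 "working" drugs
print(true_positives / significant)                 # ~0.64 chance a hit is real
print(false_positives / significant)                # ~0.36 false discovery rate
```

(The book rounds the 4.5 expected false positives up to 5, which is where the 8-in-13, 62%, and 38% figures come from.)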
So when someone cites a low p value to say their study is probably right, remember that the true probability of error is almost certainly higher.
Anyway, there's lots of stuff like this, along with a number of interesting observations, such as:
If I test 20 jelly bean flavors that do not cause acne at all and look for a correlation at p < 0.05 significance, I have a 64% chance of getting at least one false positive result. If I test 45 flavors, the chance of at least one false positive is as high as 90%.
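Those figures drop straight out of the multiplication rule for independent tests: the chance of no false positives in m tests is (1 - 0.05)^m, so the chance of at least one is 1 minus that. A two-line check (my own, and it assumes the tests are independent):

```python
alpha = 0.05
for m in (20, 45):
    # P(at least one false positive) = 1 - P(no false positives in m tests)
    print(m, round(1 - (1 - alpha) ** m, 2))   # 20 -> 0.64, 45 -> 0.9
```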
Don't just torture the data until it confesses. Have a specific statistical hypothesis in mind before you begin your analysis.
There are lots of issues with published experimental data. Ideally, the steps in a published analysis would be reproducible: fully automated, with the computer source code available for inspection as a definitive record of the work. Errors would be easy to spot and correct, and any scientist could download the dataset and code and produce exactly the same results. Even better, the code would be combined with a description of its purpose. Statistical software has been advancing to make this possible. But data "decays": one study of 516 articles published between 1991 and 2011 found that the probability of the data being available fell over time, and for papers more than 20 years old, fewer than half of the datasets could still be obtained. The Dryad Digital Repository partners with scientific journals to allow authors to deposit data during the article submission process, and encourages authors to cite data they have relied on. Dryad promises to convert files to new formats as older formats become obsolete, preventing data from fading into obscurity as programs lose the ability to read it.
Another issue is that many scientists omit results. The review board filings listed outcomes that would be measured by each study: side-effect rates, patient-reported symptoms, and so on. Statistically significant changes in these outcomes were usually reported in the published papers, but statistically insignificant results were omitted, as though the researchers had never measured them. A similar review of 12 antidepressants found that, of the studies submitted to the United States Food and Drug Administration during the approval process, the vast majority of negative results were never published or, less frequently, were published in a way that emphasized secondary outcomes.
It is possible to test for publication and outcome reporting bias, and the test has been used to discover worrisome bias in the publication of neurological studies involving animal experimentation. Animal testing is ethically justified on the basis of its benefits to the progress of science and medicine, but evidence of strong outcome reporting bias suggests that many animals have been used in studies that went unpublished, adding nothing to the scientific record.
He has some interesting material on how students learn best. Lectures do not suit how students learn. Students have preconceptions about basic physics from their everyday experience; for example, everyone "knows" that something pushed will eventually come to a stop, because every object in the real world does so. But we teach Newton's first law, in which an object in motion stays in motion unless acted upon by an outside force, and expect students to immediately replace their preconception with the new understanding that objects stop only because of frictional forces. Interviews of physics students have revealed numerous surprising misconceptions developed during introductory courses, many not anticipated by instructors. If lectures do not force students to confront and correct their misconceptions, we will have to use a method that does.

A leading example is peer instruction. Students are assigned readings or videos before class, and class time is spent reviewing the basic concepts and answering conceptual questions. Forced to choose an answer and discuss why they believe it is true before the instructor reveals the correct answer, students immediately see when their misconceptions do not match reality, and instructors spot problems before they grow. Peer instruction has been successfully implemented in many physics courses. Surveys using the Force Concept Inventory found that students typically double or triple their learning gains in a peer instruction course, filling in 50% to 75% of the gaps in their knowledge revealed at the beginning of the semester. And despite the focus on conceptual understanding, students in peer instruction courses perform just as well as, or better than, their lectured peers on quantitative and mathematical questions.
When I bought this book I thought it would be an update of Darrell Huff's book, "How to Lie with Statistics". But as Reinhart says, "How to Lie with Statistics" didn't focus on statistics in the academic sense of the term; it would perhaps have been better titled How to Lie with Charts, Plots, and Misleading Numbers. Even so, the book was widely adopted in college courses. And I must confess that I have forever been grateful to it: I still look for gaps in the scale, log transformations, percentages quoted instead of concrete numbers, and so on.
Anyway, the book is really interesting to me in that it demonstrates that many (maybe most) scientists don't know enough stats to correctly design experiments and use their data. And it has certainly made me realise that I am much weaker in Statistics than I had thought (despite scoring a High Distinction in my final year of Stats and using lots of stats in my work). I have recommended this book to a friend who is a lecturer in Statistics, and he's thinking of using some of it in his lectures. Five stars from me.