What a nerdy debate about p-values shows about science — and how to fix it

The case for, and against, redefining “statistical significance.”

There’s a huge debate going on in social science right now. The question is simple, and strikes near the heart of all research: What counts as solid evidence?

The answer matters because many disciplines are currently in the midst of a “replication crisis” where even textbook studies aren’t holding up against rigorous retesting. The list includes: ego depletion, the idea that willpower is a finite resource; the facial feedback hypothesis, which suggested if we activate muscles used in smiling, we become happier; and many, many more.

Scientists are now figuring out how to right the ship, to ensure scientific studies published today won’t be laughed at in a few years.

One of the thorniest issues on this question is statistical significance. It’s one of the most influential metrics to determine whether a result is published in a scientific journal.

Most casual readers of scientific research know that for results to be declared “statistically significant,” they need to pass a simple statistical test. The outcome of that test is a number called a p-value. And if your p-value is less than .05 — bingo! — you’ve got yourself a statistically significant result.

Now a group of 72 prominent statisticians, psychologists, economists, biomedical researchers, and others want to disrupt the status quo. A forthcoming paper in the journal Nature Human Behaviour argues that results should only be deemed “statistically significant” if they pass a higher threshold.

“We propose a change to P < 0.005,” the authors write. “This simple step would immediately improve the reproducibility of scientific research in many fields.”

This may sound nerdy, but it’s important. If the change is accepted, the hope is that fewer false positives will corrupt the scientific literature. It’s become too easy — using shady techniques known as p-hacking, and outcome switching — to find some publishable result that reaches the .05 significance level.

Don’t be mistaken: This proposal won’t solve all the problems in science. “I see it as a dam to contain the flood until we make sure we have the more permanent fixes,” says John Ioannidis, a Stanford professor and co-author of the proposal. He calls it a “quick fix.” Though not everyone agrees it’s the best course of action.

At best, the proposal is an easy change to implement to protect academic literature from faulty findings. At worst, it’s a patronizing decree that avoids addressing the real problem at the heart of science’s woes.

There is a lot to unpack and understand here. So we’re going to take it slow.

What is a p-value?

Mick Wiggins / Getty Creative Images

Even the simplest definitions of p-values tend to get complicated. So bear with me as I break it down.

When researchers calculate a p-value, they’re putting to the test what’s known as the null hypothesis. First thing to know: This is not a test of the question the experimenter most desperately wants to answer.

Let’s say the experimenter really wants to know if eating one bar of chocolate a day leads to weight loss. To test that, they assign 50 participants to eat one bar of chocolate a day. Another 50 are commanded to abstain from the delicious stuff. Both groups are weighed before the experiment, and then after, and their average weight change is compared.

The null hypothesis is the devil’s advocate argument. It states: There is no difference in the weight loss of the chocolate eaters versus the chocolate abstainers.

Rejecting the null is a major hurdle scientists need to clear to prove their theory. If the null stands, it means they haven’t eliminated a major alternative explanation for their results. And what is science if not a process of narrowing down explanations?

So how do they rule out the null? They calculate some statistics.

The researcher basically asks: How ridiculous would it be to believe the null hypothesis is the true answer, given the results we’re seeing?

Rejecting the null is kind of like the “innocent until proven guilty” principle in court cases, Regina Nuzzo, a mathematics professor at Gallaudet University, explains. In court, you start off with the assumption that the defendant is innocent. Then you start looking at the evidence: the bloody knife with his fingerprints on it, his history of violence, eyewitness accounts. As the evidence mounts, that presumption of innocence starts to look naive. At a certain point, jurors get the feeling, beyond a reasonable doubt, that the defendant is not innocent.

Null hypothesis testing follows a similar logic: If there are huge and consistent weight differences between the chocolate eaters and chocolate abstainers, the null hypothesis — that there are no weight differences — starts to look silly. And you can reject it.
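This logic can be made concrete with a short simulation. The sketch below runs a simple permutation test on invented weight-change numbers for the two hypothetical chocolate groups: it shuffles the group labels thousands of times to see how often chance alone produces a gap as large as the observed one. All the data here are made up for illustration.

```python
import random
import statistics

random.seed(42)

# Invented weight changes (kg) for illustration only
chocolate = [-1.2, -0.4, -2.1, 0.3, -1.8, -0.9, -1.5, 0.1, -2.4, -0.7]
control = [0.2, -0.5, 0.8, -0.1, 0.6, -0.3, 0.4, 0.9, -0.2, 0.5]

observed = statistics.mean(chocolate) - statistics.mean(control)

# Permutation test: under the null hypothesis, the group labels are arbitrary,
# so reshuffling them shows what differences chance alone produces.
pooled = chocolate + control
n = len(chocolate)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
    if abs(diff) >= abs(observed):  # two-sided: a gap this big either way
        extreme += 1

p_value = extreme / trials
print(f"observed difference: {observed:.2f} kg, p = {p_value:.4f}")
```

A tiny p-value here means the label shuffling almost never matches the observed gap, which is exactly when the null ("no difference") starts to look silly.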

You might be thinking: Isn’t this a pretty roundabout way to prove an experiment worked?

You are correct!

Rejecting the null hypothesis is indirect evidence of an experimental hypothesis. It says nothing about whether your scientific conclusion is correct.

Sure, the chocolate eaters may lose some weight. But is it because of the chocolate? Maybe. Or maybe they felt extra guilty eating candy every day, and they knew they were going to be weighed by strangers wearing lab coats (weird!), so they skimped on other meals.

Rejecting the null doesn’t tell you anything about the mechanism by which chocolate causes weight loss. It doesn’t tell you if the experiment is well designed, or well controlled for, or if the results have been cherry-picked.

It just helps you understand how rare the results are.

But — and this is a tricky, tricky point — it’s not how rare the results of your experiment are. It’s how rare the results would be in the world where the null hypothesis is true. That is, it’s how rare the results would be if nothing in your experiment worked, and the difference in weight was due to random chance alone.

Here’s where the p-value comes in: The p-value quantifies this rareness. It tells you how often you’d see the numerical results of an experiment — or even more extreme results — if the null hypothesis is true and there’s no difference between the groups.

If the p-value is very small, it means the numbers would rarely (but not never!) occur by chance alone. And so, when the p is small, researchers start to think the null hypothesis looks improbable. And they take a leap to conclude “their [experimental] data are pretty unlikely to be due to random chance,” Nuzzo explains.

And here’s another tricky point: Researchers can never completely rule out the null (just like jurors are not firsthand witnesses to a crime). So scientists instead pick a threshold at which they feel confident enough to reject the null. That’s now set at less than .05.

Ideally, a p of .05 means if you ran the experiment 100 times — again, assuming the null hypothesis is true — you’d see these same numbers (or more extreme results) five times.
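You can check that frequency directly by simulation. The sketch below (pure Python, with an invented setup) runs thousands of fake experiments in a world where the null is true, meaning both groups are drawn from the same population, computes a two-sided p-value for each using a normal approximation, and counts how often p falls below .05.

```python
import math
import random

random.seed(0)

def p_value(group_a, group_b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

experiments = 4000
false_positives = 0
for _ in range(experiments):
    # Null world: both groups come from the exact same distribution
    a = [random.gauss(0, 1) for _ in range(50)]
    b = [random.gauss(0, 1) for _ in range(50)]
    if p_value(a, b) < 0.05:
        false_positives += 1

print(f"p < .05 in {100 * false_positives / experiments:.1f}% of null experiments")
```

The normal approximation is slightly liberal at this sample size, so the rate comes out a touch above 5 percent, but the point stands: in a null world, p < .05 happens about one time in 20.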

And one last, super-thorny concept that almost everyone gets wrong: A p<.05 does not mean there’s less than a 5 percent chance your experimental results are due to random chance. It does not mean there’s only a 5 percent chance you’ve landed on a false positive. Nope. Not at all.

Again: A p of .05 means there’s a less than 5 percent chance that in the world where the null hypothesis is true, the results you’re seeing would be due to random chance. This sounds nitpicky, but it’s critical. It’s the misunderstanding that leads people to be unduly confident in p-values. The false-positive rate for experiments at p=.05 can be much, much higher than 5 percent.

Okay. Still with me? It’s okay if you need to take a break. Grab a soda. Catch up with Mom. She’s wondering why you haven’t called in a while. Tell her about your summer plans.

Because now we’re going to dive into…

The case against p<.05

erhui1979 / Getty Creative Images

“Generally, p-values should not be used to make conclusions, but rather to identify possibilities — like a sniff test,” Rebecca Goldin, the director for Stats.org and a math professor at George Mason University, explains in an email.

And for a long while, a sniff of p that’s less than .05 smelled pretty good. But over the past several years, researchers and statisticians have realized that a p<.05 is not as strong of evidence as they once thought.

And to be sure, evidence for this is abundant.

Here’s the most obvious, easy-to-understand piece of evidence: Many papers that have used the .05 significance threshold have not replicated with more methodologically rigorous designs.

A famous 2015 paper in Science attempted to replicate 100 findings published in a prominent psychological journal. Only 39 percent passed. Other disciplines have fared somewhat better. A similar replication effort in economic papers found 60 percent of findings replicated. There’s a reproducibility “crisis” in biomedicine too, but it hasn’t been so specifically quantified.

The 2015 Science paper on psych studies offered some clues to which papers were more likely to replicate. Studies that yielded highly significant results (less than p=.01) are more likely to reproduce than those that are just barely significant at the .05 level.

“Reporting effects that really aren’t there undermine the credibility of science,” says Valen Johnson, a co-author of the Nature Human Behaviour proposal who heads the statistics department at Texas A&M. “It’s important that science adopt these higher standards, before they claim they have made a discovery.”

Elsewhere, researchers find evidence of an “epidemic” of statistical significance. “Practically everything that you read in a published paper has a nominally statistically significant result,” says Ioannidis. “The large majority of these p-values of less than .05 do not correspond to some true effect.”

For a long while, scientists thought p<.05 represented something rare. New work in statistics shows that it’s not.

In a 2013 PNAS paper, Johnson used more advanced statistical techniques to test the assumption researchers commonly make: that a p of .05 means there’s a 5 percent chance the null hypothesis is true. His analysis revealed that it didn’t. “In fact, there’s a 25 percent to 30 percent chance the null hypothesis is true when the p-value is .05,” Johnson said.

Remember: The p-value is supposed to assure researchers that their results are rare. Twenty-five percent is not rare.
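A back-of-the-envelope Bayesian calculation shows how the false-positive share of “significant” findings can climb that high. The sketch below is illustrative only: the 10 percent prior and 50 percent power are assumed numbers for the sake of the example, not figures from Johnson’s analysis.

```python
def false_discovery_rate(prior_true, power, alpha):
    """Of all results reaching p < alpha, what fraction are false positives?

    prior_true: fraction of tested hypotheses that are actually true (assumed)
    power:      probability a real effect yields p < alpha (assumed)
    alpha:      significance threshold
    """
    true_hits = prior_true * power          # real effects that get detected
    false_hits = (1 - prior_true) * alpha   # nulls crossing the line by chance
    return false_hits / (true_hits + false_hits)

# Illustrative assumptions: 10% of tested hypotheses true, 50% power
print(f"{false_discovery_rate(0.10, 0.50, 0.05):.0%} of 'significant' results false at p < .05")
print(f"{false_discovery_rate(0.10, 0.50, 0.005):.0%} false at the stricter p < .005")
```

Under these assumptions, nearly half of nominally significant results are false positives at the .05 threshold, versus under 10 percent at .005, which is the intuition behind the proposal.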

For another way to think about all this, let’s flip the question around: What if, instead of assuming the null hypothesis is true, we assume an experimental hypothesis is true?

Scientists and statisticians have shown that if an experimental hypothesis is true, it should actually be somewhat uncommon for studies to keep churning out p-values of around .05. More often, when an effect is real, the p-value should come in lower.

Psychology PhD student Kristoffer Magnusson has designed a pretty cool interactive calculator that estimates the probability of obtaining a range of p-values for any given true difference between groups. I used it to create the following scenario.

Let’s say there’s a study where the actual difference between two groups is equal to half a standard deviation. (Yes, this is a nerdy way of putting it. But think of it like this: It means 69 percent of those in the experimental group show results higher than the mean of the control group. Researchers call this a “medium-sized” effect.) And let’s say there are 50 people each in the experimental group and the control group.

In this scenario, you should only be able to obtain a p-value between .03 and .05 around 7.62 percent of the time.

If you ran this experiment over and over and over again, you’d actually expect to see a lot more p-values with a much lower number. That’s what the following chart shows. The x-axis shows specific p-values, and the y-axis shows how frequently you’d find them when repeating this experiment. Look how many p-values you’d find below .001.

(And from this chart you’ll see: Yes, you can obtain a p-value of greater than .05 while your experimental hypothesis is true. It just shouldn’t happen as often. In this case, around 9.84 percent of all p-values should fall between .05 and .1.)
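Numbers in this ballpark can be reproduced with a short simulation (a rough sketch, not Magnusson’s actual calculator): draw two groups of 50 whose true means differ by half a standard deviation, compute a two-sided p-value with a normal approximation, and tally where the p-values land.

```python
import math
import random

random.seed(1)

def two_sided_p(z):
    """Two-sided p-value from a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

sims = 10_000
n = 50
effect = 0.5  # true group difference, in standard deviations ("medium" effect)
p_values = []
for _ in range(sims):
    a = [random.gauss(effect, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n + var_b / n)
    p_values.append(two_sided_p(z))

frac_03_05 = sum(0.03 <= p < 0.05 for p in p_values) / sims
frac_below_001 = sum(p < 0.001 for p in p_values) / sims
print(f"p in [.03, .05): {frac_03_05:.1%}")
print(f"p below .001:    {frac_below_001:.1%}")
```

The exact fractions wobble with the random seed, but the shape is robust: when a medium-sized effect is real, borderline p-values near .05 are the exception and very small ones are the rule.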

A change in the definition of statistical significance could nudge researchers into adopting more rigorous methods

The biggest change the paper is advocating for is rhetorical: Results that currently meet the .05 level will be called “suggestive,” and those that reach the stricter standard of .005 will be called statistically significant.

“Journals can still publish weak (and of course null) results just like they always could,” says Simine Vazire, a personality psychologist who edits Social Psychological and Personality Science (though she is not speaking on behalf of the journal). The language tweak will hopefully trickle down to press releases and news reports, which might avoid buzzwords such as “breakthroughs.”

The change, Vazire says, “should make it so that authors need stronger results before they can make strong claims. That’s all.”

Historians of science are always quick to point out that Ronald Fisher, the UK statistician who invented the p-value, never intended it to be the final word on scientific evidence. To Fisher, “statistical significance” merely meant the hypothesis was worthy of a follow-up investigation. “In a way, we’re proposing to return to his original vision of what statistical significance means,” says Daniel Benjamin, a behavioral economist at the University of Southern California and the lead author of the proposal.

If labs do want to publish “statistically significant” results, it’s going to be much harder.

Most concretely, it means labs will need to increase the number of participants in their studies by around 70 percent. “The change essentially requires six times stronger evidence,” Benjamin says.
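The 70 percent figure follows from a standard power calculation: for a two-sided test at fixed power, the required sample size scales with the square of the sum of two normal quantiles. The sketch below computes the ratio between the two thresholds using only the standard library; the 80 percent power level is a conventional assumption on my part, not one stated in the quote.

```python
from statistics import NormalDist

def sample_size_ratio(alpha_new, alpha_old, power=0.80):
    """How much bigger must n be at a stricter two-sided alpha, same power?"""
    z = NormalDist().inv_cdf
    z_power = z(power)
    # Required n is proportional to (z_{alpha/2} + z_{power})^2
    needed = lambda alpha: (z(1 - alpha / 2) + z_power) ** 2
    return needed(alpha_new) / needed(alpha_old)

ratio = sample_size_ratio(0.005, 0.05)
print(f"sample size must grow by about {100 * (ratio - 1):.0f}%")
```

At 80 percent power, moving from .05 to .005 requires roughly 1.7 times as many participants, i.e. about 70 percent more.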

The increased burden of proof — the proposal authors hope — would nudge labs into adopting other practices science reformers have been calling for, such as sharing data with other labs to reach a consensus conclusion and thinking more long-term about their work. Perhaps their first experiment doesn’t reach this new threshold. But a second experiment might. The higher threshold encourages labs to reproduce their own work before submitting it for publication.

The case against p<.005

erhui1979 / Getty Creative Images

The proposal has critics. One of them is Daniel Lakens, a psychologist at Eindhoven University of Technology in the Netherlands, who is currently organizing a rebuttal paper with dozens of authors.

Mainly, he says the significance proposal might work to stifle scientific progress.

“A good metaphor is driving a car and setting a maximum speed,” Lakens says. “You can set the maximum speed in your country to 20 miles an hour, and no one is going to get killed. You hit someone, they won’t die. So that’s pretty good, right? But we don’t do this. We set the maximum speed a little higher, because then we actually get somewhere a little bit quicker. … The same is for science.”

Ideally, Lakens says, the level of statistical significance needed to prove a hypothesis depends on how outlandish the hypothesis is.

Yes, you’d want a very low p-value in a study that claims mental telepathy is possible. But do you need such an extreme standard when testing a well-worn idea? The high bar could impede young PhDs with low budgets from testing out their ideas.

Again, a p-value of .05 doesn’t necessarily mean the experiment will be a false positive. A good researcher would know how to follow up and suss out the truth.

Another critique of the proposal: It keeps scientific communities fixated on p-values, which, as discussed in the sections above, don’t really tell you much about the merits of a hypothesis.

There are better, more nuanced approaches to evaluating science.

Such as:

  • Concentrating on effect sizes (how big a difference does an intervention make, and is it practically meaningful?)
  • Confidence intervals (what’s the range of doubt built into any given answer?)
  • Whether a result is a novel study or a replication (put more weight on a theory many labs have looked into)
  • Whether a study’s design was preregistered (so that authors can’t manipulate their results post-test), and whether the underlying data is freely accessible (so anyone can check the math)
  • There are also new, advanced statistical techniques — like Bayesian analysis — that, in some ways, more directly evaluate a study’s outcome
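For the first two items on that list, here is a minimal sketch of what reporting an effect size and a confidence interval looks like, using invented data for the hypothetical chocolate experiment and a normal approximation from the standard library.

```python
import math
from statistics import NormalDist, mean, stdev

# Invented weight changes (kg) for illustration only
chocolate = [-1.1, -0.3, -2.0, 0.2, -1.7, -0.8, -1.4, 0.0, -2.2, -0.6]
control = [0.1, -0.4, 0.7, -0.2, 0.5, -0.3, 0.3, 0.8, -0.1, 0.4]

diff = mean(chocolate) - mean(control)
s_c, s_k = stdev(chocolate), stdev(control)

# Effect size (Cohen's d): the difference in pooled standard deviation units
pooled_sd = math.sqrt((s_c**2 + s_k**2) / 2)
cohens_d = diff / pooled_sd

# 95% confidence interval for the difference (normal approximation)
se = math.sqrt(s_c**2 / len(chocolate) + s_k**2 / len(control))
z = NormalDist().inv_cdf(0.975)
ci_low, ci_high = diff - z * se, diff + z * se

print(f"difference: {diff:.2f} kg, Cohen's d = {cohens_d:.2f}")
print(f"95% CI for the difference: [{ci_low:.2f}, {ci_high:.2f}] kg")
```

A reader immediately sees both how big the effect is and how much uncertainty surrounds it, which a bare “p < .05” hides.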

Ioannidis admits that “statistical significance [alone] doesn’t convey much about the meaning, the importance, the clinical value, utility [of research].”

Ideally, he says, scientists would retrain themselves not to rely on null-hypothesis testing. But we don’t live in the ideal world. In the real world, p-values are a quick and easy tool any scientist can use to run their tests. And in our real world, p-values still carry a lot of weight in determining what gets published.

With the proposal, “you don’t need to train all these millions of people in heavy statistics,” Ioannidis says. “And it would work. It would help.”

Redefining statistical significance is not an ideal solution to the problem of replication. It’s a solution that nudges people to adopt the ideal solution.

Though no one I spoke to said it directly, I wouldn’t be surprised if some scientists find that a bit patronizing. Why couldn’t they learn advanced statistics? Or come to appreciate more nuanced ways of evaluating results?

The real problem isn’t with statistical significance; it’s with the culture of science

There’s a critique of the proposal the authors whom I spoke to agree completely with: Changing the definition of statistical significance doesn’t address the real problem. And the real problem is the culture of science.

In 2016, Vox sent out a survey to more than 200 scientists, asking, “If you could change one thing about how science works today, what would it be and why?” One of the clear themes in the responses: The institutions of science need to get better at rewarding failure.

One young scientist told us: “I feel torn between asking questions that I know will lead to statistical significance and asking questions that matter.”

The biggest problem in science isn’t statistical significance. It’s the culture. She felt torn because young scientists need publications to get jobs. Under the status quo, in order to get publications, you need statistically significant results. Statistical significance alone didn’t lead to the replication crisis. The institutions of science incentivized the behaviors that allowed it to fester.

Keep in mind, this is all just a proposal, something to spark debate. To my knowledge, journals are not rushing to change their editorial standards overnight.

This will continue to be debated.

But if it turns out that it’s still hard to publish “suggestive” results, and if it’s still difficult to secure grant money off them, then the institutions of science will not have learned their lesson. Yes, a lot of this is just tweaking the language of how we talk about science. But we have to make the words “suggestive” and “null” matter.

“‘Failures,’ on average, are more valuable than positive studies,” Ioannidis says.

Scientific institutions and journals know this. They don’t always act like they do.

Source: https://www.vox.com/science-and-health/2017/7/31/16021654/p-values-statistical-significance-redefine-0005


Scientists have invented a mind-reading machine

North America/United States/21.06.2016/Source: http://www.vox.com/

Brian Resnick

We can take someone’s memory, and we can pull it out of their brain.

Neuroscientist Brice Kuhl recently told me something surprising. “We can take someone’s memory, which is usually something internal and private, and we can pull it out of their brain,” said Kuhl, who is at the University of Oregon.

That sounds a lot like ... mind reading. So I had to ask: Can you read minds?

“People use different definitions of mind reading, but certainly, it’s getting closer,” he said.

Kuhl and his colleague Hongmi Lee recently published a paper in The Journal of Neuroscience with a conclusion straight out of science fiction: Using an MRI machine, some machine-learning software, and a few unfortunate human guinea pigs, Kuhl and Lee created images directly from memories.

Very, very close to mind reading.

First, Kuhl and Lee loaded the participants (23 in total) into an MRI machine. The MRI’s magnets can detect subtle changes in blood flow. And in the brain, blood flow equals neural activity.

Once the machine was on, the participants began viewing images of hundreds of faces.

The first phase of the test is a training exercise, not for the participant but for an artificial intelligence program hooked up to the MRI, reading the data in real time.

That AI program receives two sets of information. One is the participants’ brain activity patterns. The other is a mathematical description of each face the participant is viewing. (Kuhl and Lee came up with 300 numbers for different physical aspects of a face, like skin color or emotional expression. Each photo is then assigned a code describing its attributes.)

What the AI program does is draw connections: How well do those bursts of brain activity correlate with those numbers?

As the AI program accumulates this information, it grows smarter, or at least better at matching brain activity to faces.
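In machine-learning terms, this training phase amounts to fitting a decoder that maps brain-activity patterns to face-code numbers. The sketch below is a toy stand-in for that idea, not Kuhl and Lee’s actual model: it learns a linear map from tiny made-up “voxel” vectors to made-up face-attribute codes by gradient descent, then checks that its predictions improve with training.

```python
import random

random.seed(7)

VOXELS, ATTRS, FACES = 6, 3, 40

# Toy stand-in data: each face has a hidden 3-number attribute code, and the
# "brain activity" is an invented linear function of that code.
mixing = [[random.uniform(-1, 1) for _ in range(ATTRS)] for _ in range(VOXELS)]
codes = [[random.uniform(-1, 1) for _ in range(ATTRS)] for _ in range(FACES)]
brains = [[sum(mixing[v][a] * code[a] for a in range(ATTRS)) for v in range(VOXELS)]
          for code in codes]

# Decoder: predicted_code[a] = sum_v W[a][v] * brain[v], fit by gradient descent
W = [[0.0] * VOXELS for _ in range(ATTRS)]

def mean_squared_error():
    total = 0.0
    for brain, code in zip(brains, codes):
        for a in range(ATTRS):
            pred = sum(W[a][v] * brain[v] for v in range(VOXELS))
            total += (pred - code[a]) ** 2
    return total / FACES

initial = mean_squared_error()
lr = 0.01
for _ in range(1000):  # many passes over the training faces
    for brain, code in zip(brains, codes):
        for a in range(ATTRS):
            err = sum(W[a][v] * brain[v] for v in range(VOXELS)) - code[a]
            for v in range(VOXELS):
                W[a][v] -= lr * err * brain[v]

final = mean_squared_error()
print(f"decoding error before training: {initial:.3f}, after: {final:.3f}")
```

With enough training pairs, a decoder like this can then be pointed at activity from a brand-new face, which is what the second phase of the experiment tests.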

The second phase of the test is where things get weird.

The participants, still in the MRI, are shown photos of brand-new faces. The computer program can’t see these faces. But it can make some guesses.

Here are a few examples. The top row shows the original faces. The next two rows show guesses based on activity in two different brain regions. ANG is the angular gyrus, which activates when we remember something vividly. OTC is the occipitotemporal cortex, which responds to visual inputs. The first five columns represent the most accurate reconstructions. The two on the right are examples of the least accurate.

I know what you’re thinking: These guesses are terrible!

True, they won’t be used for police sketches anytime soon.

But even these blurry images contain useful information.

Kuhl and Lee showed these reconstructed images to a separate group of online respondents and asked simple questions like: “Is it male or female?” “Is this person happy or sad?” and “Is their skin color light or dark?” To a degree greater than chance, the answers checked out. These basic details of the faces can be gleaned from mind reading.

Scientists have actually done this type of mind reading before. (One cool example: In the past, a technique was used to reconstruct basic details of movie clips from brain activity.)

Okay, but could the program reconstruct a face purely from a memory?

In another trial, Kuhl and Lee showed the participants two faces, then asked them to hold one of the faces in their memory. The faces were then taken off the screen. While the participants held them in their thoughts, the MRI scanned their brains. And then the computer tried to recreate their thoughts in a picture.

The computer’s guesses got worse. Much worse.

For this trial, only the data from the ANG yielded meaningful results.

But it didn’t fail completely. “We compared the reconstruction of the two images, and we asked [the computer], just in terms of pixel values: Does the reconstructed image look more similar to the one they were told to remember than the other?” Kuhl explains. About 54 percent of the time, the computer said it was closer to the target. Not a total breakthrough, but an intriguing start that needs more work and more participants.

But what’s the point? Why design a mind-reading program?

The ultimate goal of the science isn’t to read minds. It’s to better understand how the brain works.

Usually, neuroscientists use brain scans to observe which brain structures “light up” when a person engages in a particular mental task. But there’s only so much information to deduce from the scans alone. The fact that a part of the brain is active doesn’t tell researchers much about what specific job it’s doing.

The regions Kuhl and Lee targeted in the MRI have long been known to be involved in vivid memories. “Is that region representing the details of what you saw, or is it just [lighting up] because you were relying on memory?” he says.

Kuhl and Lee’s results are evidence it’s the former. The AI was able to make the connection between the faces and the brain activity. If those brain regions weren’t involved in the visual details, it couldn’t have made those connections.

A few more burning questions about mind reading

That’s all well and good. But I was still stuck on the wild mind-reading aspect of the experiment: How much better can the machine get at reconstructing faces? Could we reconstruct a face perfectly?

“I don’t want to put a cap on it,” Kuhl told me. “We can do better.”

The reason is simple: It’s a pure computer-processing problem. If only participants could spend more time in the MRI training the AI, the AI would grow “smarter,” and the reconstructed images would become more recognizable.

“We’d really like to have someone in the scanner viewing 10,000, 20,000 faces,” he says. “You’d have to be there for a couple of days.” So it’s not entirely feasible, or ethical. It’s hard to bring the same subjects back for later sessions. One, it’s very expensive to operate an MRI machine. And two, it’s very difficult to angle a participant’s head in exactly the same position as in a previous trial. Prediction works best when the test is done in one very long session. (Kuhl himself said he wouldn’t volunteer for such a long test.)

Is it within the realm of possibility to read a person’s mind without their permission?

“You need someone to play ball,” he said. “You can’t extract someone’s memory if they aren’t recalling it, and people most of the time are in control of their memories.”

Oh, good.

What about this: Can you use this technique to see what someone is dreaming?

“One fMRI paper tried to decode the content of dreams,” Kuhl said. “It could decode, with whatever percent accuracy, that someone was dreaming about a family member or something like that.” (That paper wasn’t reconstructing faces, but it broadly predicted the categories of objects and people participants were dreaming about.)

Whoa. This is all so cool.

“The things we’re doing now — if you asked people 20 years ago, when fMRI was getting started, about this kind of thing, they would have thought it was crazy,” he said.

Source: http://www.vox.com/2016/6/20/11905500/scientists-invent-mind-reading-machine

Image: http://comps.canstockphoto.com/can-stock-photo_csp18063383.jpg
