
Re: [AUDITORY] Registered reports



Hi, I’m brand new to this list and just jumping right into this very interesting conversation, so I apologize if anything I say has already been covered in previous contributions on this topic.  My colleague, Erick, makes many compelling and valuable points. I do think, however, that a few other facets of the “replication crisis”/“false positives” phenomenon need to be considered in order to round out the picture.  None of this is to negate or argue against the points Erick raised—on the contrary, I think they are all very helpful, and specifically I think testing the same subject multiple times in multiple conditions is almost always preferable to designs that conflate subject groups with conditions—I just want to note that these approaches have certain limits to their applicability and that there are additional factors at play that cannot be addressed with these methods alone.

 

  1. It is certainly more convenient when the questions researchers are interested in asking lend themselves nicely to mathematical formulations that are relatively straightforward to assess, but not every question or problem is equally well-suited to this approach.  By forcing a hypothesis into a specific quantitative framework for evaluation just because results are easier to interpret, one runs the risk of missing the forest for the trees, or of simply shifting the problem around rather than solving it.  That is, it may be easier to interpret what the results tell you about the hypothesis you formulated, but that is of limited use if, in the process, your formulated hypothesis has become too far removed from your original question of interest.  Again, I’m not arguing to avoid all strict quantitative approaches, just issuing a note of caution on their appropriate use.

  2. Even if we assume for the moment that problems and questions can be stated with mathematical precision, I think it is also important to move away from the entire idea that there is anything magical or special or even particularly reasonable about p < .05 as a criterion for “significance”.  Frankly, I think the entire concept of “hypothesis testing” in the threshold-based manner it is commonly conceived of today is quite problematic, no matter what criterion is used, and is virtually never the best possible way of mathematically evaluating an idea about how the world works.  There are a number of reasons for this:

    1. The p-value in classical frequentist statistics is a perfect example, in my opinion, of the aforementioned practice of formulating hypotheses for mathematical convenience, but in such a way that they no longer actually correspond to the original question.  No scientist sets out saying “I am interested in seeing if there is a 5% or lower probability of observing this data set I am about to collect under the assumption of the null hypothesis that there is no difference in population means between these two conditions.” That is completely removed from any genuine research question, and does not effectively model much of anything someone might actually ask.  It’s just convenient to use because there are recipe-book approaches for obtaining it.

    2. Even if, for the sake of argument, we make the most favorable assumptions—namely, that there is no p-hacking, no publication bias, appropriate corrections for multiple tests, etc.—the idea that only 5% of results would be Type I errors if everyone used the standard alpha = .05 and beta = .2 is misleading.  It is misleading because the question that is actually of interest is: what portion of positive findings are Type I errors?  (Null findings by definition cannot be Type I errors, so they do not pose this problem for us.)  We will never know the answer to that question exactly, but broadly we can say: definitely much higher than 5%, and likely a majority of all positive findings.  The precise answer depends on what portion of all tested hypotheses are “true” – which for multiple reasons is unknowable, but it is almost certainly several times smaller than the portion that are false, and that is what produces this phenomenon.  If that portion were 1% (.01), then alpha = .05 and beta = .2 (i.e., power = .8) would correspond to about 86% of “significant” results being false positives:
           
                  PPV = power*truerate/(power*truerate + alpha*(1 – truerate)) = 0.8*0.01/(0.8*0.01 + 0.05*0.99) ≈ 0.139  →  1 – PPV = 1 – 0.139 = 0.861 = 86.1%.
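
To make that arithmetic easy to play with, here is a minimal Python sketch (my own illustration; the list of assumed base rates is hypothetical) that computes the share of “significant” results that are false positives for a range of assumed true-hypothesis rates:

alpha = 0.05   # Type I error rate
power = 0.80   # 1 - beta
for true_rate in (0.01, 0.05, 0.10, 0.25, 0.50):
    # PPV = P(hypothesis is true | result is "significant")
    ppv = power * true_rate / (power * true_rate + alpha * (1 - true_rate))
    print(f"true rate {true_rate:.2f}: PPV = {ppv:.3f}, "
          f"false-positive share of significant results = {1 - ppv:.1%}")

With a true rate of .01 this prints a false-positive share of about 86%, matching the calculation above; even at a true rate of .25 it is still roughly 16%.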

    3. The entire concept of using “zero mean difference” as a common null hypothesis is fairly nonsensical, given that for any continuous distribution, the probability of any single exact value is zero, because the integral over a range of zero width is zero.  That is, there is essentially no chance that the true difference in population means is exactly zero in any given case.  That is why the p-value, of necessity, gets defined as “assuming the real difference is zero, p is the chance of seeing data at least as extreme as what we measured.”  So, first, there is almost always a bit of unintentional legerdemain at this point, in that this formulation is either explicitly or implicitly understood to be equivalent to “p is the chance that the real difference is zero” – but that’s not true and does not logically follow from the actual definition of the p-value!  This reinforces my previous point about p being poor at modeling the real question.  Second, because any given exact value is infinitesimally unlikely, it seems reasonable to build some concept of uncertainty into our models and interpretations, including into our “default” or “null” hypothesis.  Third, why would we choose zero as the default in cases where both prior knowledge and the data suggest the most likely value to be somewhere other than zero?  Why not say that the maximum likelihood is centered around the measured value and then express the uncertainty relative to that?
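
On that last point, here is a minimal sketch (Python, with entirely made-up paired data) of simply reporting the measured difference together with an uncertainty interval centered on it, rather than a p-value computed against an exact-zero null:

import numpy as np
from scipy import stats

# Hypothetical per-subject differences between two conditions (invented numbers)
rng = np.random.default_rng(0)
diff = rng.normal(loc=1.5, scale=3.0, size=20)

mean = diff.mean()
sem = stats.sem(diff)
# 95% interval centered on the measured value, not on zero
low, high = stats.t.interval(0.95, diff.size - 1, loc=mean, scale=sem)
print(f"estimated difference = {mean:.2f}, 95% interval [{low:.2f}, {high:.2f}]")

This is only a sketch of the reporting style; a fuller treatment would also bring prior knowledge into the interval, as discussed above.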

  3. Because of many of the points raised above, I would argue that a crucial component of any solution here would need to be a more rigorously theory-driven approach to experimental designs and statistical analyses.  That is, start by attempting to analyze the problem using an existing theory—for instance: model what you think the data should look like based on what you consider to be the best existing theory (including building in uncertainty and conditionality as appropriate), then compare it to the measured results, noting in what ways the existing theory holds up and in what ways it fails to explain the observations.  Then use the shortcomings in these predictions to sharpen and refine the theory, update the models to reflect that, test against new data, and so on.  As I see it, this approach has many advantages over the “null hypothesis” testing approach.  For one, it places each new aspect of the theory in the context of the broader theoretical framework of understanding rather than “disembodying” the results to be assessed in isolation.  Furthermore, it assesses how well a theory (or multiple competing theories) predicts reality, rather than starting with the assumption that nothing affects anything else and then requiring overwhelming evidence from your individual sample (divorced from all previous studies and accumulated theoretical knowledge) in order to try to prove that it does.  As it stands, our quantitative methods often wind up disconnected from the way we actually think through the science.
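
As a toy illustration of that workflow (Python; the “theory,” the stimulus levels, and the data are all invented for the example), generate predictions from the current best model, compare them to the observations, and look at where the misfit is concentrated rather than asking for a yes/no significance verdict:

import numpy as np

# Hypothetical observed thresholds at five stimulus levels (invented numbers)
levels   = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
observed = np.array([10.5, 13.0, 16.5, 21.5, 28.0])

def best_current_theory(x):
    # Stand-in for the best existing quantitative theory: a simple linear rule
    return 8.0 + 3.0 * x

predicted = best_current_theory(levels)
residuals = observed - predicted
rmse = np.sqrt(np.mean(residuals ** 2))

print("residuals by level:", np.round(residuals, 2))  # where does the theory fail?
print(f"overall RMSE = {rmse:.2f}")
# Systematic structure in the residuals is what points to how the theory
# should be refined before the next round of data collection.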

  4. It seems to me that a limitation of the “high n through repeated measures with a small number of subjects” method is that it may be more prone to overfitting (i.e., lack of generalizability) than making more uncertain predictions with larger and more varied groups of subjects.  That is, even with many measurements, a small number of subjects risks under-representing the true spread in the size/direction of effects across individuals and/or failing to account for differential effects based on subject characteristics not represented in the sample.  This may be an acceptable/desirable tradeoff in many cases, especially if there is a good theoretical basis for generalizing from the results, but, following from everything I said above, if the only reason for doing it is to get your standard deviations down, so as to narrow the confidence intervals, so the p-values come out nicer, I think that may not always be the best approach.
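
A minimal simulation sketch of that risk (Python; the population parameters are hypothetical): if the true effect varies a lot across individuals, a handful of subjects can misrepresent the population even when each subject is measured essentially without error:

import numpy as np

rng = np.random.default_rng(1)
pop_mean, pop_sd = 2.0, 4.0   # assumed between-subject distribution of the true effect
n_sims = 2000

for n_subjects in (4, 8, 32):
    wrong_sign = 0
    for _ in range(n_sims):
        # Each subject's true effect; within-subject error is treated as negligible
        # because of the many repeated measurements per subject
        effects = rng.normal(pop_mean, pop_sd, size=n_subjects)
        if effects.mean() < 0:
            wrong_sign += 1
    print(f"n = {n_subjects:2d} subjects: the sample mean effect has the wrong sign "
          f"in {wrong_sign / n_sims:.1%} of simulated studies")

With these made-up numbers, the wrong-sign rate falls from roughly 16% with 4 subjects to well under 1% with 32, regardless of how precisely each individual is measured.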

 

Finally, totally separate from all that, I just want to say that as far as RR specifically is concerned, I support it insofar as it is tied to the idea of publications being pre-approved/commissioned based on study design and methodology rather than being chosen for publication based on the results.  I think it would be an attractive option for many and would have no serious downsides so long as the traditional avenues for submission remained open in parallel.  I also really agree with the point Erick made about it being an avenue for discussion and consideration of methods before sinking a bunch of resources into a particular project.

 

Thanks for the stimulating conversation!

Brandon

 

Brandon Madsen, AuD, CCC-A

Research Audiologist

 


 

National Center for Rehabilitative Auditory Research (NCRAR)
VA Portland Health Care System

3710 SW US Veterans Hospital Road/P5

Portland, OR 97239


brandon.madsen@xxxxxx
Tel. 503.220.8262, x55873

 

 

From: AUDITORY - Research in Auditory Perception [mailto:AUDITORY@xxxxxxxxxxxxxxx] On Behalf Of Frederick Gallun
Sent: Tuesday, June 12, 2018 11:46 AM
To: AUDITORY@xxxxxxxxxxxxxxx
Subject: [EXTERNAL] Re: [AUDITORY] Registered reports

 

I will add a comment on Les’ point about the unfamiliarity of replication crises and failures to publish null results in some areas of hearing science. This is relevant to the registered reports question because it is actually very important to note that psychophysics is not in a replication crisis, and when a model prediction fails in a psychophysical laboratory, everyone is still interested in knowing about it. What, then, is the difference between psychophysics and other areas of psychology, other than what is being studied?

 

A compelling answer is made quite well by a recent paper (Smith, P.L. & Little, D.R. Psychon Bull Rev (2018) https://doi.org/10.3758/s13423-018-1451-8) on the power of small-n repeated measures designs. The authors argue that the replication crisis is not going to be solved by overpowering all of our experiments, as some have proposed. Instead, we should look to the methods of psychophysics in which the individual participant is the replication unit, theories are quantitative and make mathematical predictions, and the hypothesis testing is thus on much firmer ground. 

 

So, what makes psychophysics so useful as a model, and why don’t we see failures of replication weakening our theories of auditory perception? Smith and Little might say that it is because 1) we work hard to find and use measurement instruments that appear to be monotonically related to the psychological entity that we are trying to understand (e.g., intensity perception or binaural sensitivity), 2) we spend a lot of time coming up with theories that can be formulated mathematically, so that the hypothesis to be tested takes the form of a mathematical prediction, and 3) these model predictions are directly expressed at the individual level. The last piece is extremely important, because it gives a level of control over error variance that is nearly impossible at the level of group effects. The Smith and Little article is not particularly surprising to those of us used to controlling variance by repeatedly testing our participants until they are well-practiced at the task and only then introducing variations in the tasks or stimuli that we expect to produce specific effects at the level of the individual participant.

 

This approach is not common in the areas of psychology suffering from the replication crisis. Consequently, the common suggestion has been to increase the number of participants rather than question the wisdom of using large-n designs with ordinal hypotheses based on theories that cannot be described mathematically and measurement instruments designed more for convenience than for a monotonic relationship to the putative psychological entity to be tested. As Smith and Little argue, this is an opportunity to change the field of scientific psychology in a very positive way, and the path is to focus on increasing sample size at the participant level, through repeated testing across multiple theoretically connected conditions, rather than at the group level. As a psychophysicist who works with clinical populations (and an Editor and Reviewer of many clinical research manuscripts), I find this question very relevant, because those who work with patients are much more likely to come from a background of large-n designs, where experimental rigor is associated with assigning each participant to a single condition and comparing groups. In that case, it is obviously important to have as many participants in each group as possible and to make each participant as similar to the others as possible. This often leads to enormous expenditures of time and effort in recruiting according to very strict inclusion criteria. For practical reasons, either the inclusion criteria or the sample size becomes an almost impossible barrier to achieving the designed experiment. The result is that, unless both money and time are in great supply, the study ends up underpowered.

 

From this perspective, I see the registered report as a useful way to have the discussion about the most powerful methods before large amounts of time and resources have been devoted to the study, and I would encourage those with expertise in controlling error variance and experience in developing robust tools to do their best to bring this knowledge to the other areas of the field in as constructive a manner as possible. I would hope that the registered report could be a vehicle for this discussion.

 

Erick Gallun

 

Frederick (Erick) Gallun, PhD

Research Investigator, VA RR&D National Center for Rehabilitative Auditory Research 
Associate Professor, Oregon Health & Science University

Editor in Chief - Hearing, Journal of Speech, Language, and Hearing Research

 

On Tue, Jun 12, 2018 at 6:16 AM Les Bernstein <lbernstein@xxxxxxxx> wrote:

I agree with Ken and Roger.  It's neither clear that the current system falls short nor that RRs would, effectively, solve any such problem.  To the degree there is a problem, I fail to see how making RRs VOLUNTARY would serve as an effective remedy or, voluntary or not, serve to increase "standards of publication."  If people wish to have the option, that sounds benign enough, save for the extra work required of reviewers.

As suggested by Matt, I tried to think of the "wasted hours spent by investigators who repeat the failed methods of their peers and predecessors, only because the outcomes of failed experiments were never published."  Across the span of my career, for me and for those with whom I've worked, I can't identify that such wasted hours have been spent. As Ken notes, well-formed, well-motivated experiments employing sound methods should be (and are) published.

Likewise, re Matt's comments, I cannot recall substantial instances of scientists "who cling to theories based on initial publications of work that later fails replication, but where those failed replications never get published."  Au contraire.  I can think of quite a few cases in which essential replication failed, those findings were published, and the field was advanced.  I don't believe that many of us are clinging to invalid theories simply because failed replications have gone unpublished.  Theories gain status via converging evidence.

It seems to me that what some are arguing for would, essentially, be an auditory version of The Journal of Negative Results (
https://en.wikipedia.org/wiki/Journal_of_Negative_Results_in_Biomedicine).

Still, if some investigators wish to have the RR option and journals are willing to offer it, then, by all means, have at it.  The proof of the pudding will be in the tasting.



Les

 

On 6/9/2018 5:13 AM, Roger Watt wrote:

3 points:

 

1. The issue of RR is tied up with the logic of null hypothesis testing. There are only two outcomes for null hypothesis testing: (i) a tentative conclusion that the null hypothesis should be regarded as inconsistent with the data and (ii) no conclusion about the null hypothesis can be reached from the data. Neither outcome refers to the alternative hypothesis, which is never tested. A nice idea in the literature is the counter-null. If I have a sample of 42 and an effect size of 0.2 (r-family), then my result is not significant: it is not inconsistent with a population effect size of 0. It is equally not inconsistent with the counter-null, a population effect size of ~0.4. It is less inconsistent with all population effect sizes in between the null and the counter-null. (NHST forces all these double negatives).
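
For concreteness, a minimal Python sketch of that example (my own illustration; it uses the standard t-test of a correlation against zero and Rosenthal & Rubin’s counternull formula for r-family effect sizes, r_cn = 2r / sqrt(1 + 3r^2)):

import numpy as np
from scipy import stats

r, n = 0.2, 42

# Test of the observed r against a population effect size of zero
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(t, n - 2)

# Counternull: the non-zero effect size the data support exactly as well as they support zero
r_counternull = 2 * r / np.sqrt(1 + 3 * r ** 2)

print(f"t({n - 2}) = {t:.2f}, two-tailed p = {p:.2f}")   # about 1.29, p about 0.20: not significant
print(f"counternull effect size = {r_counternull:.2f}")  # about 0.38, i.e. the ~0.4 mentioned above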

 

2. The current system of publish when p<0.05 is easy to game, hence all the so-called questionable practices. Any new system, like RR, will in due course become easy to game. By a long shot, the easiest (invalid) way to get an inflated effect size and an inappropriately small p is to test more participants than needed and keep only the “best” ones. RR will not prevent that.

 

3. NHST assumes random sampling, which no-one achieves. The forms of sampling we use in reality are all potentially subject to non-independence among participants, which leads to Type I error rates (false positives) well above 5%.
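
A rough simulation sketch of that point (Python; the cluster structure and variances are invented, not a model of any particular sampling scheme): when participants within a cluster share variance and an ordinary t-test treats them all as independent, a true null is rejected far more often than the nominal 5%:

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, n_clusters, per_cluster = 2000, 6, 10
cluster_sd, subject_sd = 1.0, 1.0   # shared within-cluster variance vs. individual variance

false_positives = 0
for _ in range(n_sims):
    # True null: both groups are sampled from identical populations,
    # but participants within a cluster are correlated
    group_a = (rng.normal(0, cluster_sd, n_clusters)[:, None]
               + rng.normal(0, subject_sd, (n_clusters, per_cluster))).ravel()
    group_b = (rng.normal(0, cluster_sd, n_clusters)[:, None]
               + rng.normal(0, subject_sd, (n_clusters, per_cluster))).ravel()
    # Ordinary t-test that (wrongly) treats all 60 participants per group as independent
    if stats.ttest_ind(group_a, group_b).pvalue < 0.05:
        false_positives += 1

print(f"empirical Type I error rate: {false_positives / n_sims:.1%} (nominal 5%)")

With these made-up parameters the empirical rate comes out several times higher than 5%; the exact figure is not the point, only that ignoring the dependence inflates it.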

 

None of this is to argue against RR, just to observe that it doesn’t resolve many of the current problems. Any claim that it does is, in itself, a kind of Type I error, and Type I errors are very difficult to eradicate once accepted.

 

Roger Watt

Professor of Psychology

University of Stirling

 

From: AUDITORY - Research in Auditory Perception [mailto:AUDITORY@xxxxxxxxxxxxxxx] On Behalf Of Ken Grant
Sent: 09 June 2018 06:19
To: AUDITORY@xxxxxxxxxxxxxxx
Subject: Re: Registered reports

 

Why aren’t these “failed” experiments published? And what’s the definition of a failed experiment, anyway?

 

I think that if the scientific question is well formed and well motivated AND the methods sound and appropriate for addressing the question, then whatever the result may be, this seems like a good experiment and one that should be published. 

Sent from my iPhone

Ken W. Grant, PhD

Chief, Scientific and Clinical Studies

National Military Audiology and Speech-Pathology Center (NMASC)

Walter Reed National Military Medical Center

Bethesda, MD 20889

Office:  301-319-7043

Cell:  301-919-2957

 

 

 


On Jun 9, 2018, at 12:48 AM, Matthew Winn <mwinn2@xxxxxx> wrote:

The view that RRs will stifle progress is both true and false. While the increased load of advance registration and rigidity in methods would, as Les points out, become burdensome for most of our basic work, there is another side to this. This is not a matter of morals (hiding a bad result, or fabricating a good result) or of how to do our experiments. It’s a matter of the standards of *publication*, which you will notice was the scope of Tim’s original call to action. In general, we only ever read about experiments that came out well (and not the ones that didn’t). If there is a solution to that problem, then we should consider it, or at least acknowledge that some solution might be needed. This is partly the culture of scientific journals, and partly the culture of the institutions that employ us. There's no need to question anybody's integrity in order to appreciate some benefit of RRs.

Think for a moment about the amount of wasted hours spent by investigators who repeat the failed methods of their peers and predecessors, only because the outcomes of failed experiments were never published. Or those of us who cling to theories based on initial publications of work that later fails replication, but where those failed replications never get published. THIS stifles progress as well. If results were to be reported whether or not they come out as planned, we’d have a much more complete picture of the evidence for and against the ideas. Julia's story also resonates with me; we've all reviewed papers where we've thought "if only the authors had sought input before running this labor-intensive study, the data would be so much more valuable."

The arguments against RRs in this thread appear in my mind to be arguments against *compulsory* RRs for *all* papers in *all* journals, which takes the discussion off course. I have not heard such radical calls. If you don’t want to do a RR, then don’t do it. But perhaps we can appreciate the goals of RR and see how those goals might be realized with practices that suit our own fields of work.

Matt

 

--------------------------------------------------------------

Matthew Winn, Au.D., Ph.D.
Assistant Professor
Dept. of Speech & Hearing Sciences
University of Washington

 



 

--
Leslie R. Bernstein, Ph.D. | Professor
Depts. of Neuroscience and Surgery (Otolaryngology)| UConn School of Medicine
263 Farmington Avenue, Farmington, CT 06030-3401
Office: 860.679.4622 | Fax: 860.679.2495


--

---------------------------------------------
Frederick (Erick) Gallun, PhD
Research Investigator, VA RR&D National Center for Rehabilitative Auditory Research
Associate Professor, Oregon Health & Science University
http://www.ncrar.research.va.gov/AboutUs/Staff/Gallun.asp