Sunday, March 11, 2012

Replication issues and a solution

Most social psychologists were unaware of the journal PLoS One until John Bargh brought it to their attention a week ago.  PLoS One published an article by Doyen et al. that failed to replicate Bargh's famous priming study, in which seeing words associated with the concept 'elderly' led people to subsequently walk more slowly.  Doyen et al. provided evidence to suggest that rather than primes affecting subjects themselves, experimenter expectancies could have altered how experimenters interacted with subjects and recorded their walking speed.  By this account it was a self-fulfilling prophecy on the part of the experimenters, not a true priming effect in the subjects.  Bargh wrote a blog post at Psychology Today that was pretty hostile to the Doyen paper as well as to PLoS One.  Some of the points Bargh made probably weren't such a good idea, especially arguing that PLoS One, at the forefront of the Open Access movement, is a journal that indiscriminately accepts all articles to make money (this is pretty much the direct opposite of the truth - and by the way, they just rejected a paper I'm a co-author on!).  As a result, the rhetoric itself is now getting a fair amount of attention in blogs (here and here).  That said, Bargh is probably right about the larger issues at stake.

In essence, Doyen et al. turned their initial failure to replicate into an opportunity.  By introducing an expectancy effect (with some experimenters expecting subjects to speed up and others expecting them to slow down) they were able to manipulate subject walking speed.  This is not surprising - my first advisor in graduate school, Robert Rosenthal, ran studies like this with rats back in the 1960s.  The rats of experimenters who expected them to perform better in a maze did perform better (objectively better).  The critical questions are (a) do expectancy effects explain the previously observed elderly-slow effects and (b) are priming-induced automatic behavior effects real.  Let's take these in turn and then I will suggest a solution for the larger issue of replication, particularly within social psychology, where the issue is especially salient due to the recent Stapel faked data debacle.

Do expectancy effects explain the elderly-slow effects?
Possibly, but my hunch is that they don't.  The point of expectancy effects is that you can take almost any effect, null or real, and move it around through subject or experimenter expectations.  There was a big difference between the Bargh and Doyen studies.  Doyen et al. went out of their way to explicitly induce expectations in their experimenters (who in a sense were now really the subjects).  In contrast, Bargh et al. went out of their way to avoid giving their experimenters any sense of the priming conditions or expected effects.  Is there a sliver of possibility that experimenter expectancies were involved in the Bargh study?  It is incredibly difficult to rule this out 100%, but I think Bargh did a better job than most of ruling it out.  I think skeptics of scientific findings should refrain from saying 'Because there's a 0.01% chance of an effect being due to a different mechanism than the one proposed we should assume the finding can't be trusted'.  There's a proportionality that's needed but often lacking in such discussions.  Most results and accounts of such results are not 100% airtight, but this does not mean they are 0% correct either.

Years ago, I tried to replicate the elderly-slow effect and failed.  I don't pretend to know what that means.  I also tried to replicate the mere exposure effect multiple times before it worked, but once I figured it out it continued to replicate reliably (though my twist on it never worked, hence no publications on the effort).  There are many more possible reasons for failure than for success (think about the runner who wins vs. loses a race).  Even with my own failure to replicate I still believe in the effect.  Despite what some have said, there really are countless conceptual replications in which priming someone with the concept of elderly changed their behavior in some way to become more like the stereotype of elderly behavior (see reviews here and here).  There's even nice data showing that the amount of time a subject has spent with the elderly in their lifetime modulates the effect (here).  That could not be explained in terms of experimenter expectancies without really tortured logic.

Priming-induced automatic behavior
At a certain level it does not matter whether the exact primes Bargh used produce a change in walking speed over the exact distance he measured.  People say 'We need to replicate this exactly.  Conceptual replications aren't good enough'.  Who cares about this specific manipulation?  Are we about to start using it as an intervention to treat patients?  Nope.  What we care about is whether priming-induced automatic behavior in general is a real phenomenon.  Does priming a concept linguistically cause us to act as if we embody the concept within ourselves?  The answer to this question is a resounding yes.  This was a shocking finding when Bargh first discovered it in 1996.  We had long known that priming a concept can lead us to interpret another person's ambiguous behavior in terms of the prime.  However, ever since movie theaters in the 1950s failed in their attempts to use subliminal priming to get us to buy more Coca-Cola, we have assumed that primes couldn't change our behavior.  Since Bargh's initial findings, hundreds of studies focusing on a wide range of behaviors, stereotypes, and contexts have shown this general class of phenomena to be real.  It's also been extended to show that goals and motivation can be primed too.  I don't think it's fully agreed upon why these effects occur, but I think the existence proof is complete.

A way forward for social psychology
Social psychologists are notorious for producing counterintuitive findings that are enchanting to many, but headscratch-inducing to others.  It's also true that pure replications do not get published often enough.  No top journal will publish pure replications ('what's its novel contribution?') and no one can make a name for themselves running pure replications.  Additionally, because there are both good and bad reasons for a failure to replicate, it's even harder to get failures to replicate published.

Here's my solution.  Each year social psychologists nominate 10 findings (the number is arbitrary) from the previous year or two that they would like to see replicated.  These would be findings that if solidified would be extremely significant to the field (but if false, should be done away with quickly).  Perhaps this would happen over the summer.  Then all first and second year graduate students in social psychology would be required to choose one of these to replicate as closely as possible (or perhaps with a gradient from exact to conceptual replication).  This would be part of their training - learning how to run studies well.  They could add other conditions of interest, but the main goal would be to institutionalize the replication of the most important recent findings.  Everyone would get a first author publication in the to-be-created online Journal of Psychology Replications (I'm making that up).    Once a study was in this replication pool for a year, an initial meta-analysis would be written up of all the replications and then this would be updated after more studies come in during the second year.

This would be a useful exercise for new graduate students.  They would each get a publication from this.  And the field would know pretty definitively within a year or two which effects replicate and which don't.  There would be no file drawer effects because all studies would be included regardless of the p-values obtained.
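
To make the meta-analysis step a little more concrete, here is a minimal sketch of how a pool of replications could be combined. It assumes each replication reports a standardized mean difference (Cohen's d) and its two group sizes, and it uses a simple fixed-effect, inverse-variance model with the usual large-sample approximation for the variance of d. The function names and the numbers are made up for illustration; a real analysis would more likely use a random-effects model and an established meta-analysis package.

```python
import math

def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for two independent groups."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

def pool_fixed_effect(studies):
    """Inverse-variance (fixed-effect) pooling.

    studies: list of (d, n1, n2) tuples, one per replication.
    Returns (pooled_d, standard_error, (ci_low, ci_high)) with a 95% interval.
    """
    weights = [1.0 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    total_weight = sum(weights)
    pooled = sum(w * d for w, (d, _, _) in zip(weights, studies)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return pooled, se, (pooled - 1.96 * se, pooled + 1.96 * se)

# Hypothetical replication pool: every study goes in, whatever its p-value.
replications = [
    (0.45, 15, 15),   # small study, sizeable effect
    (0.02, 60, 60),   # larger study, near-zero effect
    (0.10, 40, 40),
    (-0.05, 25, 25),
]

pooled, se, (low, high) = pool_fixed_effect(replications)
print(f"pooled d = {pooled:.3f}, SE = {se:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Because larger studies get more weight, one small study with a striking result cannot dominate the estimate, and because every replication is entered, the pooled value is not inflated by selective reporting.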

Follow me on twitter: @social_brains


  1. Matthew,
    I really like your ideas about how to address the replication problem (especially having such replications run and authored by 1st and 2nd year grad students as part of their training). You should propose the idea of such a journal to the publication board at SPSP (and possibly SESP).
    Todd Nelson

  2. I think this would work well in the states where a graduate might be studying for 6 years, but in the UK here we get a somewhat more intensive 3 years: giving up time for unrelated PhD study really does need to be something you take an interest in rather than as a training exercise. Here it might suit as a penultimate undergraduate year project in preparation for the final-year dissertation, a practice run at it if you like.

    I'd love to see a Journal of Replicated Psychology, what a fantastic idea! I'd personally love to be involved in that

    Chris NH Street

  3. Nice post. I agree with you, apart from your first sentence.

    "Most social psychologists were unaware of the journal PLoS One until John Bargh brought it to their attention a week ago".

    Are you serious? Anyone who calls themself a "Social Psychologist" and doesn't even know what PLoS is, is kidding themselves.

    How can anyone claim to have studied the length and breadth of Psychology and yet never have encountered PLoS? How could you fail to notice that its contents are some of the only papers in the field which are published in a high impact journal which doesn't require the headache inducing rigmarole of pissing about jumping through recursive Athens, shibboleth, and third party portals before giving up and VPN'ing in to university in the vain hope they have a sub, all just to get over the sodding paywall? When I see the letters "PLoS" in a citation I want to read, I breathe a sigh of relief.

    Take that Psychnet review you cited for instance. Good luck getting in to that from the vast majority of institutions in the world.

    "Bargh is probably right about the larger issues at stake".

    A search of PLoS returns over 10,000 expertly reviewed "social psychology" papers that *unlike the status quo* anyone in the whole world can read instantly. In this case, PLoS IS the larger issue at stake.

    1. Thanks for the comments (I'll take 99% agreement anytime). In the first sentence, I didn't mean to suggest that social psychologists shouldn't care about PLoS One, only that it is not a central journal for them. If you go to the PLoS One site and search for "social psychology" you get fewer than 300 responses, but only 41 of these are in the category of psychology whereas 200 or so are in the category of neuroscience. In other words, most PLoS One articles that use the term social psychology are actually social neuroscience papers. BTW, in contrast, the term "cognitive neuroscience" gets about 1500 hits. Here's why social psychologists rarely publish at PLoS One:

      Social psychologists virtually never pay a penny during the publication process. Our mainstays include Psychological Science, Journal of Personality and Social Psychology, Personality and Social Psychology Bulletin, and Journal of Experimental Social Psychology. None of these charge for submissions and social psychologists virtually never publish color figures. Consequently, for social psychologists, a $1300 fee to publish is onerous and a bit shocking. As a social neuroscientist I pay more than $1300 to publish papers all the time, but for social psychologists there's no reason to go to PLoS unless you can't get into the other journals that social psychologists typically publish in for free. I don't think it's accidental that the Doyen paper, published at PLoS, had first and last authors who are not social psychologists.

      Anyway, my opening was by no means an indictment of PLoS One but simply commenting on how Bargh brought this to the attention of social psychologists in a way that it wouldn't have reached them on its own. I'm pretty sure that's still true.

  4. Dear Matt,
    You mention that "Years ago, I tried to replicate the elderly-slow effect and failed". Would you be willing to post a notice of that on ? As you write, replicability has become a contentious issue, and we need as much information about attempts as we can get. If you are reluctant to post a notice to PsychFileDrawer, we (the creators of the site) would love to know the reasons why - we are trying to find ways to overcome people's reluctance!!

  5. I'm surprised that even people who fail to replicate this effect "still believe in the effect". If failure to replicate this effect (a pretty common experience, it seems) is not enough for you to doubt it, then what do you think science should do to provide convincing support for the non-existence of an effect? It almost looks like it is never possible to correct for a type 1 error (irrespective of whether the Bargh study is one or not).

  6. Good post.

    However - this is not a criticism of you, Matthew, but a general point - I'm not sure I'm comfortable with the outbreak of failed replications of Bargh that seem to have appeared in the form of blog posts and comments, in the past few days.

    I'm all in favor of blog posts as part of science, but I'm also in favor of open data, and saying (essentially) "We did an experiment, and we found the following result: no effect" is about as closed as could be.

    PsychFileDrawer would be a much better approach I feel.

    Like I said this is not a comment aimed at you as such, it applies equally to everyone who's done this.

  7. Personally I was less surprised that PLoS ONE was not well known to social psychologists - after all, PLoS ONE doesn't even have a proper psychology section, despite publishing psychology papers - than that the whole concept of paid-for Open Access seemed to be alien. (Disclaimer - I have published in PLoS ONE, although, incidentally, I didn't pay anything to do so.) With regard to not publishing there unless you couldn't get into the other social psych. journals, one would have to ask why the Doyen paper wouldn't get into one of those? Bargh's critique was, frankly, rather limp, as I detailed on the original blog.

    In the original response I put on Psychology Today (in hindsight I'd moderate my tone a bit, but I'd cheekily suggest I was primed to respond...robustly), I detailed a number of points where I felt his account differed from that in the paper, and from a close reading of his experiments 2a and 2b, I'd still say the possibility for experimenter bias is there. It's remote, given that there seemed to be little opportunity for the confederate to discover which condition the participants were in, but still there, and Doyen et al. do the right thing in completely ruling it out. An issue like that *cannot* be ruled out by "conceptual" replication, and whether the concept is real and whether a specific demonstration of it is real are two separate things.

    Anyway, to get to a more general point, this is perhaps a good time to think of what you were saying the other day about Type I and Type II errors, and how statistical thresholds create either/or distinctions that can be misleading. Rather than thinking (a) Bargh got the effect right, therefore a null is wrong, or (b) Doyen et al. found a null, therefore the significant effect is wrong, perhaps what we should be saying is (c) if the effect is real, our best estimate of the size of the effect is somewhere between these two estimates. If the effect itself is useful, then we ought to be finding the best way/paradigm to quantify it and elicit it in order to be able to use it to tell us something we really want to know.

    Like Neuroskeptic, I'm a little wary of how many failed replications seem to have cropped up: we know pretty much nothing about those replications, and proper reporting of those experiments is necessary. But you can see why people would be wary of reporting them when the response tends toward mud-slinging.

    I like the general idea of getting people to run replications as part of their research training, although as others have pointed out it may be difficult to fit that into relatively short programmes, and I get the feeling it would be a logistical nightmare.

  8. I'm probably missing something, but should your initial string of failures to replicate not persuade you just as your later success in replicating did? Did you just have pretty high priors about the effect at the outset, and then what did actually doing the experiment (instead of just thinking about it gedanken-style) achieve? Was there a gross mistake in the initial attempt, or was it a small difference in the setup? The latter would naively seem to be interesting.

    (Random visitor! Just happened upon this post as a result of following this brouhaha)

  9. Matthew,
    Here's a link to another discussion of the controversy at a blog at Discover magazine....

  10. Hi Matt! (and Todd!),

    Great post. I love at least in principle the idea of both a Journal of (Social?) Psychology Replications (maybe we start small and local?), and the idea of some sort of consortium getting together to require it of early grads and then having a publication outlet.

    No one doubts the existence of priming as a phenomenon. But you wrote: "Does priming a concept linguistic cause us to act as if we embody the concept within ourselves? The answer to this question is a resounding yes."

    I think the answer to this question is a resounding "Maybe, kinda, sorta, occasionally, if one rigs the study 'just so'" (see Bargh's critique, much of which is of the genre, "They did not do it 'just so' so of course they couldn't find it!").

    If one cannot replicate such a fundamental study in a simple and direct manner -- and, apparently, Matt, you are not the only one to report an inability to do so -- then it would seem that one (not all, just one) of the pillars of that conclusion has been shown to have feet of clay. Or, at least, the finding is *nowhere near as powerful or general as it is cracked up to be.*

    This brouhaha reminds me of the extent to which so much foundational stuff in social cognition has feet of clay, either because the interpretations are wrong or distorted, or because the findings have proven irreplicable. I have a book coming out on exactly these issues, with respect to self-fulfilling prophecies, biases, accuracy, and stereotypes (go here for more info:

    Here are a few snippets:
    -- Hastorf & Cantril's (1954) Princeton/Dartmouth football study provided vastly more evidence of objectivity and agreement than of subjectivity and disagreement. Vastly.

    -- Exclude 5 bizarre outliers (kids with 100 point IQ gains in one year!!??) and the ENTIRE self-fulfilling prophecy effect in Rosenthal & Jacobson (1968) disappears.

    -- Darley & Gross's (1983) "stereotypes as hypothesis confirming biases" study has been cited over 700 times. The two direct attempts at replication failed (Baron et al, 1995), and have been cited fewer than 30 times. The broader literature shows *the exact opposite* empirical pattern to D&G.

    -- Snyder & Swann (1978) is STILL routinely cited to show that people "seek to confirm" their social hypotheses. The study *required* people to choose from biased questions that would require people to either confirm or disconfirm their hypotheses. Every subsequent study that has given people the chance to ask diagnostic questions has shown that, overwhelmingly, people ask diagnostic questions, and almost never ask confirmatory ones.

    -- Stereotype accuracy -- the correspondence of people's beliefs about groups with what those groups are actually like -- is one of the largest relationships in all of social psychology and routinely is ignored. Claims about stereotype inaccuracy still abound everywhere.

    Feet of clay.

    I leave this post with a quote from my book:
    I am sure that nothing in this book will change the mind of the many true believers in the power of stereotypes and expectancy-based biases. For the rest of you, though, I have one simple request. Don’t believe me. Do with the accumulated social science data exactly what Fiske & Neuberg (1990) say you should do, when judging a person. Just pay attention to the data. Not just your favorite data. All of the data. And if it is not possible to pay attention to all of the data (sometimes, there is just too much, or it requires too much professional expertise, etc.), at least avoid the pitfall of focusing your attention on the data that you want to be true. Instead, work hard to get the full, big picture.


    Lee Jussim

  11. Hi,
    I just wanted to take a minute to tell you that you have a great site! Keep up the good work.

    Field Failure Relay

  12. I know it's been many years since this was under active discussion here, but I just couldn't let this go by.

    First, I think it great that a field of inquiry that strives to assert itself as just as valid as any other has finally come to realise the importance of conducting systematic protocols to ascertain the reproducibility of field studies, especially those deemed fundamental to core tenets of the discipline. That it took so many decades for this realisation to arrive is, frankly, awkward, especially amidst an environment where basically every other discipline had already long committed to such efforts of self-validation. But still.

    It then comes as an unpleasant surprise that everyone here apparently thinks it a great idea to discharge such an important enterprise upon graduate students. That is, in your core you still regard reproducibility studies as basically bureaucratic paper-pushing, at best an opportunity for grads to flounder about in the lab, bumping and feeling their way against the walls while already accumulating some valuable bibliographic records of their own.

    Well, turns out, it is precisely for this important task that you should reserve your most experienced and talented researchers. Who else if not they can afford the risk of spending valuable research hours on studies that, for the most part, will go completely unnoticed, and, on the rare (hopefully) occasion when something untoward actually surfaces, can muster the stamina required to confront the predictable and understandable resistance of those whose long-held beliefs have just been thrown into doubt? Not to mention that one can hardly expect graduate students to possess the self-confidence that will allow them to conduct these replication experiments with the required indifference (towards the actual result obtained) that is, after all, the single best guarantee of experimental integrity; on the contrary, it is obvious that the students, and their advisors alike, will internally regard these experiments not as serving to validate some theory, but rather as a contest where they must show their competence in arriving (supposedly independently) at a foregone conclusion.

    That professional researchers such as yourselves, whose very field of study involves the way people's expectations undermine their performance, can't see the trap you're setting up for yourselves, is baffling. Ridiculous, even. Imagine biomed researchers entrusting mere students to design and conduct review trials of pharmaceuticals, not as a mere academic exercise, but when actual evidence exists that some well-established drug may have had its benefits overhyped by its maker.

    Then you wonder why professionals in fields of actual scientific research look down upon you.