Uncanny academic valley: Brian Wansink as proto-chatbot
Back in 2017 I posted this exchange that psychologist Tim Smits had had in 2012 with former food researcher and NPR star Brian “Pizzagate” Wansink.
I happened to come across this post yesterday pretty much at random (I was writing something else that linked to my post on storytelling as predictive model checking, and that post in turn linked to the previous and following posts on the blog).
Both these old posts looked (and, indeed, were) interesting. One of the advantages of having all this old material is that I can reread it and be reminded of some story or bit of reasoning that had, over the years, receded beneath my threshold of consciousness.
Rereading Smits’s decade-old exchange with Wansink from the perspective of modern chatbots was an uncanny experience.
I’ll share the exchange, but first some context.
You know how we keep talking about how chatbots are excellent bullshitters? They can emit a string of words that on the surface seems reasonable, and they can even associate those strings of words with what looks like logical reasoning, as long as you don't probe too carefully. That's just what Wansink did!
In his email, the now-disgraced food researcher performed a kind of Turing test, simulating scientific inquiry in the same way that so many published papers arrange words and sentences and images to imitate real research, convincingly enough to fool journal editors and reviewers, or at least to exhaust them into submission.
I could tell at the time that Wansink was bullshitting (and Smits could tell this too; that's why he emailed it to me); in hindsight this form of response just seems so familiar, even to the point where, after Wansink is called on the absurdity of his response, he quickly pivots to a new line of b.s. As with a chatbot, once we let go of the idea that it understands or wants to understand what's going on, we can step back and be amazed by the impressive fluidity of the text, in some ways more impressive given that it is pure association with no underlying meaning.
OK, here’s the story from Smits:
In 2011, the Cornell researchers published an article (Zampollo, Kniffin, Wansink & Shimizu, 2011) on how children's preferences for food are differentially affected by how the foods are presented on a plate, compared to adults. . . . some of the findings were incomprehensible from the article . . . I [Smits] wrote a polite email asking for some specific information about the statistics. This was the response I got.
Dear Tim, Thank you for being interested in our paper. Actually there are several errors in the results section and Table 1. What we did was two step chi-square tests for each sample (children and adults), so we did not do chi-square tests to compare children and adults. As indicated in the section of statistical analysis, we believe doing so is more conclusive to argue, for example, that children significantly prefer six colors whereas adults significantly prefer three colors (rather than that children and adults significantly differ in their preferred number of color). Thus, for each sample, we first compared the actual number of choices versus the equal distribution across possible number of choices. For the first hypothesis, say #1=0, #2=0, #3=1, #4=0, #5=2, #6=20 (n=23), then we did a chi-square test (df=5) to compare those numbers with 3.83 — this verified the distribution is not equal. Then, we did second chi-square test (df=1) to compare 20 and 0.6 (the average of other choices), which should yield 18.3. However, as you might already notice, some of values in the text and the table are not correct — according to my summary notes, the first 3 results for children should be: 18.3 (rather than 40.4), 16.1 (rather than 23.0), 9.3 (rather than 26.88). Also, the p-value for .94 (for disorganized presentation) should not be significant apparently. I am sorry about this confusion — but I hope this clarify your question.
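To spell out the two-step procedure being described there, here is a quick sketch in Python that reproduces the 18.3 figure from the hypothetical counts in the email. The scipy calls and variable names are mine; this is just my reading of what the email says was done:

```python
# Sketch of the two-step procedure as described in the email.
# The counts and the 18.3 figure come from the email; the code is my reading of it.
from scipy import stats

counts = [0, 0, 1, 0, 2, 20]   # choices #1..#6, n = 23
n = sum(counts)

# Step 1: chi-square test (df = 5) against an equal split of n/6 = 3.83 per cell,
# checking that the distribution of choices is not uniform.
step1 = stats.chisquare(counts)          # expected counts default to the mean, n/6
print(step1.statistic, step1.pvalue)

# Step 2: chi-square test (df = 1) comparing the top cell (20) with the average
# of the other cells (0.6) -- the test that "should yield 18.3."
top = max(counts)
rest_avg = (n - top) / (len(counts) - 1)
step2 = stats.chisquare([top, rest_avg])
print(step2.statistic)                   # approximately 18.3
```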
Well, that was interesting. Just one email, and immediately a bunch of corrections followed. Too bad the answer was nonsensical. So I [Smits] wrote back to them (bold added now):
When reading the paper, I did understand the first step of the chi-square tests. I was puzzled by the second step, and to be honest, I still am a bit. The test you performed in that second step boils down to a binomial test, examining the difference between the observed number of counts in the most preferred cell and the H0 expected number of counts. Though this is informative, it does not really tell you something about how significant the preferences were. For instance, if you would have the following hypothetical cell counts [0 ; 0 ; 11; 0; 0 ; 12], cell 6 would still be preferred the most, but a similar binomial test on cell 3 would also be strongly significant. In my opinion, I thus believe that the tests do not match their given interpretations in the article. From a mathematical point of view, your tests on how much preferred a certain type of plate is raise the alpha level to .5 instead of .05. What you do test on the .05 level is just the deviation in the observed cell count from the hypothesized count in that particular cell, but this is not really interesting.
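To see Smits's point concretely: in his hypothetical, cell 6 is (barely) the most preferred, yet the same style of single-cell test applied to cell 3 also comes out significant, so the test cannot be what tells you that one option is "significantly preferred." Here is a small sketch, again with scipy calls and labels of my own choosing:

```python
# Smits's hypothetical counts: cell 6 is (barely) the most preferred,
# but a single-cell test on cell 3 is also significant.
from scipy import stats

counts = [0, 0, 11, 0, 0, 12]   # n = 23
n = sum(counts)

# Exact binomial test for cell 3 against the null of an even 1/6 split.
print(stats.binomtest(counts[2], n, p=1/6).pvalue)   # far below .05

# The step-2-style chi-square (one cell vs. the average of the rest),
# applied to cell 3, is also "significant" at the .05 level.
rest_avg = (n - counts[2]) / 5
res = stats.chisquare([counts[2], rest_avg])
print(res.statistic, res.pvalue)                     # roughly 5.5 and .02
```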
Then, this remarkable response came. . . . they agree with the “shoddy statistics” . . . Moreover, they immediately confess to having published this before.
I carefully read your comments and I think I have to agree with you regarding the problem in the second-step analysis. I employed this two-step approach because I employed similar analyses before (Shimizu & Pelham, 2008, BASP). But It is very clear that our approach is not appropriate test for several cases like the hypothetical case you suggested. Fortunately, such case did not happen so often (only case happened in for round position picture for adults). But more importantly, I have to acknowledge that raising the p-value to .5 in this analysis has to be taken seriously. Thus, like you suggested, I think comparing kids counts and adults counts (for preferred vs rest of cells) in 2×2 should be better idea. I will try to see if they are still significant as soon as I have time to do.
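For completeness, the 2×2 comparison that the reply gestures at (preferred vs. the rest, children vs. adults) would look something like this. The children's split of 20 out of 23 comes from the earlier email; the adults' numbers below are invented purely to show the shape of the test:

```python
# Sketch of the suggested 2x2 comparison: preferred vs. rest, children vs. adults.
# The children's 20-of-23 split is from the earlier email; the adults' row is
# made up here just to illustrate the test, not taken from the paper.
from scipy import stats

table = [[20, 3],    # children: preferred option, all other options
         [12, 11]]   # adults: hypothetical counts, for illustration only

chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)
# For cells this small an exact test may be preferable:
print(stats.fisher_exact(table))
```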
So, back in 2012, someone sent Wansink a devastating criticism; he or someone in his lab responded in a cordial and polite way, gave tons of thanks, and then for the next several years they did essentially nothing. As Smits put it, “Their own press releases and outreach about that study did not show a single effort of self-correction. You can still find some of that material on their website. Similarly, despite the recent turmoil, I have seen them just continue their online communication efforts.”
At the time, I saw this as a story about Wansink and his Food and Brand Lab, about Cornell University’s continuing support of that organization, and, more broadly, about the academic and news-media environment that incentivized such behavior.
And, yes, that’s all still a concern. But, looking back, what strikes me most is the chatbot-like nature of Wansink’s replies.
Just as the development of flying machines gave us insight into the ways that birds fly, so does the development of chatbots give us insight into human reasoning, including bullshitting, which is some associative form of reasoning, not quite the same as logical/rational/conscious/attentive reasoning but of interest for its own sake.
And this isn’t just about Wansink, or even just about junk science; it’s also about how we have implicitly been trained to respond to criticism, which is to supply smooth words that imply a logical train of thought without necessarily being coherent.