Having gotten snowclone.pl working again, it's time to put it to work. A few weeks ago, the webcomic xkcd ran a strip consisting of a chart of the number of Google hits for variations on the pattern died in a[n] X accident. I'm resisting the urge to explain the joke, because that's never a good idea, but it's worth noticing that this looks kind of like a snowclone (though it's not, about which more below). You're in my territory now, xkcd!
I had two questions about the strip. First, were the numbers reported in the chart actual Google hit counts, or were they invented by the creator of xkcd to make the joke work? Second, what other fillers occur in the pattern died in a[n] X accident?
To answer the first, I ran my script on four patterns:
- died in a X accident
- died in a X accident, excluding pages with car
- died in an X accident
- died in an X accident, excluding pages with auto and automobile
This caught several dozen variants. I also searched by hand on a few terms in xkcd's chart that slipped through the cracks. After collating results and removing the duplicates, I was able to compare the real numbers with xkcd's. First, here are the numbers from the strip:
Next, here are the numbers I found for the same variations (leaving out the ones with knitting and blogging, about which more below):
Notice that the first seven variants listed in xkcd's chart appear in the same order in the data I collected. What's more, the relative sizes of the counts are also a very close match. This is very unlikely to have happened by chance, which implies the numbers in the strip are real Google hit counts, probably from a couple of months back (the Web is always growing). So it looks like xkcd is legit! That's a relief. I'd hate to think Randall Munroe was basing his humor on fabricated data.
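To put a rough number on "very unlikely", here is a minimal sketch under a simplifying assumption that is mine rather than anything measured: if the seven counts were invented with no relation to the real ones, every possible ordering of the seven variants would be equally likely. The function name is purely illustrative.

```python
from math import factorial


def chance_of_matching_order(n: int) -> float:
    """Probability that n distinct items fall into one specific order,
    assuming all n! orderings are equally likely (a simplifying assumption)."""
    return 1 / factorial(n)


n_variants = 7  # the first seven variants in the strip's chart
print(f"1 in {factorial(n_variants)}")
print(f"probability = {chance_of_matching_order(n_variants):.6f}")
```

Under that assumption the odds of a seven-way match in order are 1 in 5040, and that is before taking into account how closely the relative sizes of the counts agree.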
I mentioned above that I left out the variants with knitting and blogging. To see why, take a look at the 30 most common variants of died in a[n] X accident:
Not surprisingly, and as mentioned in the note that pops up if you hover over the strip, car, automobile, and auto are far and away the most common fillers for that phrase. What is surprising is that the two least-common variants in the original chart, those with knitting and blogging, are now in the top 30 on the web as a whole. The explanation is obvious, I think: xkcd is widely read, and people have been either mentioning this strip or simply repeating the two funniest variants. It's impressive that in just over three weeks the punchline variant, blogging, has leapfrogged over all other variants except for the automotive ones. (I suspect the later count for skydiving was inflated somewhat, too.) It's like a backwards web version of the Heisenberg Uncertainty Principle: you can't make a funny comment about how seldom something appears on the web without causing that very thing to suddenly appear on the web, as long as you get enough people's attention.
I collected all this data with a script I developed to investigate snowclones, but the pattern in question here, died in a[n] X accident, isn't really a snowclone in the strictest sense. It's true that it has a similar form, a phrasal template with various fillers, but it's not a repurposing of an existing phrase or idiom intended to be recognized as a reference to the original. Instead, it's simply a common turn of phrase that one might expect to occur in reports of fatal accidents. In fact, rather than being about a snowclone, this xkcd strip is actually an implicit example of another Geoff Pullum favorite: an argument by linguification. The idea behind the strip is that you can gauge how dangerous some activities are, not directly by looking at statistics about deaths due to that activity, but indirectly by looking at statistics about language referring to such deaths.
I guess I've gone ahead and explained the joke after all. Sigh. Let me leave you, then, with a bit of cruel wisdom my brother once imparted to me after I committed this same sin some years ago: "Oh, yeah, it's funnier now that you've explained it."
If you do requests, how about a snowclone study of the phrase "* dollar word"?
Posted by: Steve | February 08, 2008 at 03:55 PM
I remember seeing blog posts the same day that xkcd strip came out that commented on the fact that "died in a blogging accident" was now a lot more frequent. I've also witnessed this firsthand, when I commented about a User Friendly strip that mentioned a made-up inventor, "Ernst Dinklefwat". At the time of my post it was a Googlewhackblatt, and I got an insane amount of traffic from Google that day from people searching for dinklefwat. Apparently that User Friendly strip didn't generate the same amount of interest as the xkcd one, because there are only 472 results for dinklefwat now. A very interesting phenomenon.
Posted by: Jason Adams | February 08, 2008 at 05:12 PM
The day xkcd published that cartoon I searched for all the variants. xkcd's numbers were right on or within a few digits. A slight difference is likely unless we happen to be using the same Google datacenter. One week later I checked again and the numbers were already skewed.
Posted by: Anton | February 08, 2008 at 05:36 PM
"* dollar word"
I ran the script on "X dollar word" and "a[n] X dollar word", but it found very few variants, because "ten", "two", and "five" seem to swamp all the others. Doing the first few integers by hand, I found the following, in the format [COUNT VARIANT]:
598 one
1280 two
377 three
273 four
2360 five
53 six
34 seven
8 eight
159 nine
5160 ten
3 eleven
8 twelve
7 fifteen
1 seventeen
601 twenty
1 twenty-four
66 twenty-five
5 thirty
1 thirty-two
10 forty
291 fifty
1 sixty
1 ninety
273 one/a hundred
6 1
171 2
130 3
28 4
325 5
5 6
6 7
2 8
9 9
411 10
2 11
9 12
1 13
2 15
1 16
1 18
95 20
1 24
10 25
1 27
4 30
3 40
86 50
3 60
76 100
If you plot these data, you'll find that the charts are pretty much the same for numerals and number words. In particular, the top three in both cases are ten/10, five/5, and two/2.
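For anyone who wants to check the top three without plotting, here is a small sketch using the counts listed above. The dictionary layout, the `top` helper, and the choice to file "one/a hundred" under 100 are just one way to set it up, not how the original script works.

```python
# Hit counts for "X dollar word", keyed by the number's value.
# Copied from the list above; "one/a hundred" is filed under 100.
word_counts = {
    1: 598, 2: 1280, 3: 377, 4: 273, 5: 2360, 6: 53, 7: 34, 8: 8,
    9: 159, 10: 5160, 11: 3, 12: 8, 15: 7, 17: 1, 20: 601, 24: 1,
    25: 66, 30: 5, 32: 1, 40: 10, 50: 291, 60: 1, 90: 1, 100: 273,
}
numeral_counts = {
    1: 6, 2: 171, 3: 130, 4: 28, 5: 325, 6: 5, 7: 6, 8: 2, 9: 9,
    10: 411, 11: 2, 12: 9, 13: 1, 15: 2, 16: 1, 18: 1, 20: 95,
    24: 1, 25: 10, 27: 1, 30: 4, 40: 3, 50: 86, 60: 3, 100: 76,
}


def top(counts: dict[int, int], k: int = 3) -> list[int]:
    """Return the k keys with the highest counts, highest first."""
    return sorted(counts, key=counts.get, reverse=True)[:k]


print("number words:", top(word_counts))
print("numerals:    ", top(numeral_counts))
```

Both calls return 10, 5, and 2, in that order.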
Posted by: The Tensor | February 08, 2008 at 10:04 PM
"If they say summat funny, pretend like you didn't hear. Then pretend like you didn't understand. Because nothing's funny if it's repeated and explained." Andy Dalziel in one of Reginald Hill's great novels (Advancement of Learning, I think, but I'm not sure)
Posted by: The Ridger | February 09, 2008 at 11:46 AM
It's mildly interesting to compare this to "died in a {tragic/freak} * accident", where the numbers are unpolluted; you get a much better sense of what's dangerous, or at least what's dangerous in a tragic way. This chart shows tragic accidents (excluding "car", at 13,300), and this one shows freak accidents. Motorcycle accidents can apparently be tragic, but only expectedly so, whereas boating and hunting accidents can be freak as well as tragic.
Well, it's interesting at four in the morning, anyway.
Posted by: Lance | February 10, 2008 at 01:51 AM
Thanks for the x dollar word info ...
Posted by: Steve | February 11, 2008 at 03:44 PM
xkcd rocks!
Found your blog via the Discover option on Google Reader. Looking forward to more entries.
よろしくお願いします
Posted by: James | February 13, 2008 at 02:14 AM
I'm appalled that "bizarre gardening" has slipped out of the Top Thirty. Something must be done.
Posted by: mollymooly | February 15, 2008 at 01:49 PM
On the topic of using statistics about language about stuff to find out information about that stuff: I was once told that an investigation of a corpus of English revealed that the most common mentions of screwdriver usage were related to killing, harming, and threatening people.
Posted by: Russell | February 28, 2008 at 04:44 PM
i really agree
Posted by: reza | July 28, 2008 at 06:58 AM