Sunday November 12, 2006

snowclone.pl

One of the most popular sports in the linguistiblogosphere (along with decrying coverage of language issues in the mainstream press) is proposing snowclones and then examining their usage with search engines like Google.  I've taken part several times; see here and here for examples.  A typical post goes something like this:

  1. Hey, I think I've discovered a snowclone!
  2. If I search for it on Google I find the following variants...
  3. Of those, here are the few that occur most often...

In writing the most recent of my snowclone posts, it occurred to me that this process could easily be automated—so I automated it.

I had originally investigated the "slouching toward(s) X" snowclone back in May, including finding variants by vgrepping through the Google results and counting the number of hits each received.  I meant to write the post immediately, but school got busy and I set the data aside.  When I finally got around to writing the post in late October, it occurred to me that five months is an eternity in web time and the counts had very likely changed.  I was damned if I was going to redo all that Google searching manually, so I wrote a Perl program to do it for me.

Here's how it works.  First, the user gives it a pattern like "Eskimos have X words for snow".  The program searches through the first 1000 Google hits for "Eskimos have * words for snow", extracting from the excerpts all distinct variants on that pattern that appear.  It then performs another search on each variant, keeping track of the number of hits each receives.  It then sorts the list by the number of hits and prints it out.  Result: instant blog post!  (And, in case you're interested, the top ten variants are: 50 different, hundreds of, n, 52, many, 40, dozens of, a hundred, 50, and 100.)
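
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in snowclone.pl: the helper names (pattern_to_query_and_regex, extract_variants, rank_variants) are hypothetical, the excerpts are hard-coded instead of fetched, and the hit counts are invented placeholders rather than real search results.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "Eskimos have X words for snow" into a quoted
# wildcard query and a regex that captures whatever fills the X slot.
sub pattern_to_query_and_regex {
    my ($pattern) = @_;
    my ($before, $after) = split /\bX\b/, $pattern, 2;
    my $query = qq{"$before*$after"};
    # Assumption: the slot may hold word characters, spaces and commas only.
    my $regex = qr/\Q$before\E([\w ,]+?)\Q$after\E/i;
    return ($query, $regex);
}

# Collect the distinct (lowercased) slot fillers found in the excerpts.
sub extract_variants {
    my ($regex, @excerpts) = @_;
    my %seen;
    for my $excerpt (@excerpts) {
        while ($excerpt =~ /$regex/g) {
            $seen{lc $1} = 1;
        }
    }
    return sort keys %seen;
}

# Order variants by hit count, highest first; ties are broken alphabetically.
sub rank_variants {
    my ($count_for, @variants) = @_;
    return sort { $count_for->($b) <=> $count_for->($a) || $a cmp $b } @variants;
}

my ($query, $regex) = pattern_to_query_and_regex('Eskimos have X words for snow');

# Stand-ins for search-result excerpts (made up for this sketch).
my @excerpts = (
    'They say Eskimos have 50 different words for snow, but do they?',
    'Everyone knows Eskimos have hundreds of words for snow.',
    'eskimos have 50 different words for snow and one for sand',
);

# Invented counts; a real run would issue one search per variant.
my %fake_counts = ('50 different' => 900, 'hundreds of' => 700);

my @ranked = rank_variants(
    sub { $fake_counts{ $_[0] } // 0 },
    extract_variants($regex, @excerpts),
);

print "$query\n";
print "$fake_counts{$_}\t$_\n" for @ranked;
```

Passing the counting step in as a code reference keeps the sketch independent of any particular search engine, which is also what would let the finding and counting engines differ.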

I've attempted to make the program portable and customizable. To run it, you need only Perl and the URL-fetching program wget installed, so it should run on a wide variety of platforms.  (I've been running it under the Cygwin environment on Windows, for example.)  A few simple settings in the code make it easy to vary things like the number of results searched for snowclone variants, the minimum number of hits a variant must receive in order to be printed out, and the search engines used for finding variants and counting them (which need not be the same).

In trying out the various options, I've found that using Google to collect variants seems to produce the widest variety of results. However, as various folks over on Language Log have discussed (start here), the values Google reports as the number of hits for some searches are a little fishy.  For example, when I run my program on the well-known snowclone "all your X are belong to us", the third most common variant is "all your audioscrobbler are belong to us".  This is because Audioscrobbler includes that phrase, automatically generated, on a large number of pages.  It looks like Google counts each of these as a separate hit, but some other search engines, including Yahoo and Windows Live, seem not to.  By default, the program uses Google for both finding and counting variants (for consistency), but I've made it possible to change that if you want more reliable counts.  See the comments in the code for more details. (BTW, if your first reaction to the code is "Jeez, this guy writes ugly Perl", well, fine, but allow me to point out that it has the not inconsiderable virtues of (a) working and (b) being free.)

For your amusement, here are some sample results: the top ten variants, using the default settings, for a few well-known snowclones:

  • I for one welcome our new X overlords
    democratic, bush obsessed, insect, dalek, intel, robot, google, soviet, alien, and cetacean
  • X is a verb
    apple juice, love, god, luv, parenting, faith, davezilla, seeing, verb, and google
  • all your X are belong to us
    base, blogs, audioscrobbler, snakes, bass, bias, skyscrapers, iraq, athens, and typos
  • have X will travel
    gun, love, laptop, sword, space suit, spacesuit, camera, joystick, plutonium, and koala
  • one man's X is another man's Y
    trash/treasure, terrorist/freedom fighter, religion/belly laugh, meat/poison, junk/treasure, bookshelf/library, ceiling/floor, conspiracy/business plan, constant/variable, and theology/belly laugh
  • I'm not X but I play one on TV
    a doctor, an actor, a lawyer, a homo, a leftist, russian, a mixer, a judge, a tuba player, and a reporter

So, without further ado, I present to you snowclone.pl.  Use it in good health (wear, keep, enjoy, drive, ...).  Let a thousand snowclones bloom (flowers, reactors, choices, filters, ...).

You get the idea.

[Update:  At Mark Liberman's suggestion, I've made a couple of tweaks to the program.  First, I reversed the order of the output from <variant, count> to <count, variant>, which he says is "more or less the default for textual histograms".  Second, I added a couple of examples of how to call the program to the documentation in the code.  Finally, I replaced the old version (1.0) with the updated version (1.01) at the link above.]

[Update:  Version 1.03:  Added a 30-second delay to avoid triggering the CAPTCHA Google has added to prevent rapid, repeated wildcard searches.  For more details, see here.]
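
For anyone adapting the idea, a delay between requests might be sketched like this. It is not the code from version 1.03: fetch_with_delay is a hypothetical helper, the fetch routine is a stand-in, and the pause is passed in as a code reference so the example can run without actually sleeping.

```perl
use strict;
use warnings;

# Hypothetical helper: call $fetch->($query) for each query, pausing
# $delay seconds between consecutive requests (not before the first).
sub fetch_with_delay {
    my ($fetch, $delay, $pause, @queries) = @_;
    my @results;
    for my $i (0 .. $#queries) {
        $pause->($delay) if $i > 0;
        push @results, $fetch->($queries[$i]);
    }
    return @results;
}

my @pauses;    # records the requested pauses instead of really sleeping

my @pages = fetch_with_delay(
    sub { "results for $_[0]" },     # stand-in for the real fetch
    30,                              # seconds between requests
    sub { push @pauses, $_[0] },     # in real use: sub { sleep $_[0] }
    '"have gun will travel"',
    '"have laptop will travel"',
    '"have sword will travel"',
);

print scalar(@pauses), " pauses of $pauses[0] seconds each\n";
```

At 30 seconds per request, a run that counts a few hundred variants takes hours rather than minutes, which is the price of staying under the limit.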

I am The Tensor, and I approve this post.
04:33 PM in Linguistics

Comments

Nice!

Allow me to suggest that you rewrite the script using Google's search API. At least in theory, Google does not take screen-scraping of their search results lightly. (Though I know of no concrete cases of such scripts having been blocked.)

Posted by: Arnt Richard Johansen at Nov 13, 2006 2:56:23 AM

If it used Google's search API, wouldn't any person who wants to use it have to sign up for a Google account and license key? Sounds inconvenient. I'm hoping that Google either won't notice the tiny amount of traffic generated by people using this script (a few hundred searches in a few minutes must be a tiny, tiny fraction of their server load), or if they do, they'll chalk it up to public service instead of cracking down—you know, "Do No Evil"? It's for Science, after all.

Posted by: The Tensor at Nov 13, 2006 7:39:36 AM

Yesterday, I was talking to a colleague, and said, "Science turns money into knowledge. Engineering turns knowledge into money." Then I thought, "that sounds like a snowclone," so I downloaded snowclone.pl, ran it on "X turns money into Y Z turns Y into money" and found nothing. That struck me as a rather surprising result.

Posted by: Pete Bleackley at Nov 23, 2006 5:39:49 AM

...I downloaded snowclone.pl, ran it on "X turns money into Y Z turns Y into money" and found nothing. That struck me as a rather surprising result.

I tried searching on "turns money into" and "into money", and turned up variants like the following:

"Research turns money into knowledge;. Innovation turns knowledge into money."

"Research turns money into knowledge... innovation turns knowledge into money"

"R&D turns money into ideas, innovation turns ideas back into money"

The problem is that these contain periods, ampersands, and semicolons, but snowclone.pl will only find variants that contain no punctuation (except commas). This snowclone is therefore something of a pathological case for the program: it contains two clauses, so it is very likely to contain sentential punctuation the program can't handle, and a lot of the variants also include quotation marks or ampersands that are similarly excluded.

It's possible to address some of this stuff by tweaking the regular expressions, but in ways that may break the program for other, garden-variety snowclones. Feel free to experiment!
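
As a rough illustration of that trade-off, here is a minimal sketch. The two character classes are assumptions made for the example, not the actual regular expressions in snowclone.pl, and slot_filler is a hypothetical helper.

```perl
use strict;
use warnings;

# Assumed character classes, for illustration only.
my %slot = (
    strict  => qr/[\w ,]+?/,       # word characters, spaces, commas
    relaxed => qr/[\w ,.;&]+?/,    # also periods, semicolons, ampersands
);

# Return the text found between $before and $after, or undef if none.
sub slot_filler {
    my ($slot, $before, $after, $text) = @_;
    return $text =~ /\Q$before\E($slot)\Q$after\E/i ? $1 : undef;
}

my @cases = (
    # A comma-only variant: both classes find it.
    ['turns money into ', ' into money',
     'R&D turns money into ideas, innovation turns ideas back into money'],
    # A variant with periods: only the relaxed class finds it.
    ['turns money into ', ' into money',
     'Research turns money into knowledge... innovation turns knowledge into money'],
    # An unrelated text: the relaxed class now matches across a sentence break.
    ['have ', ' will travel',
     'I have no idea. He says she will travel.'],
);

for my $case (@cases) {
    my ($before, $after, $text) = @$case;
    for my $name (qw(strict relaxed)) {
        my $found = slot_filler($slot{$name}, $before, $after, $text);
        printf "%-8s %s\n", $name, defined $found ? "[$found]" : '(no match)';
    }
}
```

The last case shows the cost of loosening the class: once periods are allowed, "have X will travel" starts matching text that spans two sentences.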

Posted by: The Tensor at Nov 25, 2006 5:34:15 AM

You might be interested to know that in 2004 I wrote a tool that does this :)

Examples:
http://blogoscoped.com/archive/2004_06_06_index.html

Tool:
http://www.findforward.com/?q=have+*+will+travel&t=wildcards

Posted by: Philipp Lenssen at Dec 23, 2007 2:37:14 AM

Replacement google_count. It (1) copes with changes to Google's phrasing, (2) avoids counting when Google rewrites queries for "typos" that aren't, and (3) asks for 10 results instead of 1, because searching for only 1 result occasionally returns spurious estimates (compare searching for "olthar is best pony" for 1 versus 10 results).

sub google_count {
  my $pat = shift;
  #
  # ask for 10 results rather than 1: we only want the count, but
  # single-result queries occasionally return spurious estimates
  my $res = google_search($pat, 10, 0);
  #
  # check if Google has unhelpfully turned our quoted search unquoted
  if ($res =~ /No results found for <b>/) {
    return 0;
  }
  #
  # did Google try to autocorrect?
  if ($res =~ /Search instead for/) {
    return 0;
  }
  #
  # extract the count; report 0 if the phrasing is not recognized,
  # rather than returning a stale or undefined $1
  if ($res =~ /About\s+([\d,]+)\s+results/) {
    return $1;
  }
  return 0;
}

Posted by: Alan at Dec 3, 2011 1:27:35 PM