Wednesday February 6, 2008

Throttling snowclone.pl

As I occasionally do, I recently ran snowclone.pl on a phrasal template to see what sorts of fillers could be found out on the 'net.   To my surprise, I got back zero results, which didn't seem right.  Some investigation revealed that Google has introduced a CAPTCHA to prevent automated queries with wildcards (the * operator), which is just how snowclone.pl works.  Oops!
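
For readers unfamiliar with wildcard queries, here is a rough Python sketch (not the Perl that snowclone.pl is written in) of how a phrasal template could be turned into a quoted wildcard search URL. The function name, the use of a standalone "X" as the slot marker, and the bare `q` parameter are all assumptions for illustration; the script's real internals aren't shown in this post.

```python
from urllib.parse import urlencode

# Base search URL, assumed for illustration.
SEARCH_URL = "http://www.google.com/search"


def wildcard_query_url(template: str, slot: str = "X") -> str:
    """Build a quoted-phrase search URL with each slot token replaced by *.

    Hypothetical helper: treating a standalone "X" as the open slot is an
    assumption, not necessarily how snowclone.pl parses its argument.
    """
    words = ["*" if word == slot else word for word in template.split()]
    phrase = '"' + " ".join(words) + '"'  # quotes request an exact phrase
    return SEARCH_URL + "?" + urlencode({"q": phrase})


if __name__ == "__main__":
    print(wildcard_query_url("don't trust anyone over X"))
```

It is the `*` inside the quoted phrase that marks the query as a wildcard search, which is the kind of query the CAPTCHA targets.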

This put me in a slightly tricky position. On the one hand, I was tempted to circumvent the CAPTCHA somehow (by using an anonymizing proxy, for example) to get the script working again. On the other hand, I don't want to appear (or have people using my script appear) to be some sort of black hat. It's perfectly reasonable for Google to want to prevent software from using wildcard searches to harvest email addresses from the web for spamming, for example. It's also possible I'll apply for a job at Google some day, and it would be nice not to have been engaged in an effort to outwit Google's security measures. (I imagine that would make for an uncomfortable interview.)

To get snowclone.pl working again within Google's new restrictions, I added a 30-second delay between wildcard searches. Every time the script is run, there are ten such searches, retrieving the first 1000 results for the phrasal pattern in question. For each variant found, the script then makes a non-wildcard search on that variant. It doesn't appear that rapidly repeating non-wildcard searches triggers the CAPTCHA, but to avoid hammering the search servers I threw in an additional one-second delay between those searches. These delays seem to be enough to avoid triggering the CAPTCHA without making the script unusably slow. Running snowclone.pl now takes at least five minutes (ten wildcard searches at 30 seconds each), plus a little over one second for each variant found. There can be as many as a thousand variants, but more typically there are about a hundred, so a typical run now takes around seven minutes. The new version, 1.03, can be downloaded here.
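
The throttling and the run-time arithmetic can be sketched roughly as follows. This is a Python illustration, not the actual Perl in snowclone.pl: the function names and the injectable `sleep` argument are made up for the example, and it assumes a pause after every search (including the last), which is what the "ten searches at 30 seconds each" estimate implies.

```python
import time
from typing import Callable, Iterable, List

WILDCARD_DELAY_S = 30  # pause per wildcard search (figure from the post)
EXACT_DELAY_S = 1      # pause per non-wildcard search (figure from the post)
WILDCARD_SEARCHES = 10  # wildcard searches per run (figure from the post)


def run_throttled(
    queries: Iterable[str],
    search: Callable[[str], str],
    delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run each query through `search`, pausing `delay_s` seconds after each."""
    results = []
    for query in queries:
        results.append(search(query))
        sleep(delay_s)  # throttle before the next request
    return results


def estimated_delay_seconds(variant_count: int) -> int:
    """Lower bound on run time: throttle pauses only, ignoring network time."""
    return WILDCARD_SEARCHES * WILDCARD_DELAY_S + variant_count * EXACT_DELAY_S


if __name__ == "__main__":
    for count in (100, 1000):
        minutes = estimated_delay_seconds(count) / 60
        print(f"{count} variants: at least {minutes:.1f} minutes")
```

With about a hundred variants the pauses alone add up to 400 seconds, just under seven minutes; in the worst case of a thousand variants they come to 1300 seconds, or a bit under 22 minutes.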

So snowclone.pl now works again, hopefully in a way that won't upset the nice (and non-evil!) folks over at Google.  It's for science, right?

I am The Tensor, and I approve this post.
03:31 PM in Linguistics

Comments

Thank you Tensor!

Posted by: Kivi at Feb 6, 2008 7:47:27 PM

Ahh, no wonder I kept getting 0 results for my queries a few weeks ago. I'd better retry those.

Posted by: Erin at Feb 7, 2008 9:28:25 AM

Thanks for updating it. I occasionally play with it, and it was a shame that it stopped working. (Today's query: "Don't trust anyone over X", which really isn't a snowclone, since only ages really work as the last word, but I was curious how much the phrase has mutated over the years.)

Posted by: Alan De Smet at Apr 21, 2008 8:07:30 PM

I ran into some unexpected behavior:

./snowclone.pl "Don't trust anyone over X"

1850000 don't trust anyone over thirty 00
1710000 don't trust anyone over 30 since most teachers are
1380000 don't trust anyone over 30 bp
653000 don't trust anyone over thirty sea scouts
428000 don't trust anyone over 30 by chris holm monday dec
254000 don't trust anyone over 30 08
252000 don't trust anyone over 30 03
214000 don't trust anyone over 30 93
214000 don't trust anyone over thirty future events london walker museum williams college new york www
146000 don't trust anyone over 70 the baby boomers may be growing up but as consumers they are still young at heart

Some of those seem odd. Checking one by hand:

http://www.google.com/search?hl=en&safe=off&client=firefox&rls=org.mozilla%3Aen-US%3Aunofficial&hs=lro&q=%22don%27t+trust+anyone+over+thirty+future+events+london+walker+museum+williams+college+new+york+www%22&btnG=Search

Results 1 - 10 of about 214,000 for "don't trust anyone over thirty future events london walker museum williams college new york www". (0.17 seconds)

No results found for "don't trust anyone over thirty future events london walker museum williams college new york www".


Results for don't trust anyone over thirty future events london walker museum williams college new york www (without quotes):

So it looks like the script found some phrases that don't actually exist. Google automatically removed the quotes in an attempt to be helpful, and snowclone.pl counted them. I believe there are two undesirable behaviors:

1. Why did snowclone.pl even think that was a phrase?

2. snowclone.pl should count it as 0 if Google reruns the search with the quotes removed.

Thanks again for writing snowclone.pl!

Posted by: Alan De Smet at Apr 21, 2008 8:54:13 PM

Hmm, I think I see what's happening. If I search on "don't trust anyone over thirty future events" (seven words), it returns the single web page with that sequence of words. (Whoops, make that two web pages--Google just indexed this page while I was typing. :) If I search for more than seven words, Google removes the quotes, adds an error message saying "no results found", and then unhelpfully switches to a non-quoted search with many more hits. It's not a hard limit at seven words, though--"the way to a man's heart is through his stomach" doesn't have the quotes removed. I guess I'll have to enhance the script to detect the "no results found" message.

Are you using the latest version (1.04), by the way? I silently fixed a bug some time ago where it was ignoring all the variants found in results 1-900 (i.e. only 901-1000 were counted). I really should go update the original post.

Posted by: The Tensor at Apr 21, 2008 9:28:37 PM

OK, I've uploaded a new version (1.05) with a fix for the "no results found" problem. The script now detects that the search has been expanded for us by removing the quotes and returns zero instead.
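
For illustration, a check of this kind could look something like the Python sketch below. It is not the code in snowclone.pl (which is Perl and isn't shown here). The function name is hypothetical, the page is assumed to have been reduced to plain text already, and the marker strings are taken from the result page quoted in the comment above; allowing the count to appear without "about" is an extra assumption.

```python
import re

# Message shown when the quoted phrase has no exact matches and the
# search is rerun without quotes (wording from the comment above).
NO_RESULTS_MARKER = "No results found for"

# Matches e.g. "Results 1 - 10 of about 214,000 for"; the optional
# "about" is an assumption for small, exact counts.
COUNT_RE = re.compile(r"of (?:about )?([\d,]+) for")


def phrase_hit_count(page_text: str) -> int:
    """Return the reported hit count, or 0 if the quotes were dropped."""
    if NO_RESULTS_MARKER in page_text:
        # The count on this page belongs to the unquoted fallback search.
        return 0
    match = COUNT_RE.search(page_text)
    if match is None:
        return 0  # no recognizable count on the page
    return int(match.group(1).replace(",", ""))


if __name__ == "__main__":
    fallback_page = (
        'Results 1 - 10 of about 214,000 for "some long phrase". '
        'No results found for "some long phrase".'
    )
    print(phrase_hit_count(fallback_page))
```

The important part is the order of the checks: the "no results found" test has to come before the count is read, because the fallback page still reports a large count for the unquoted search.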

Posted by: The Tensor at Apr 22, 2008 3:24:12 AM
