Tuesday February 14, 2006

Hugo SIPs

Some time after Amazon added the "Search Inside!" feature, they also began displaying a list of "SIPs" and "CAPs" for most searchable books.  CAPs are Capitalized Phrases that occur frequently in the book.  They usually include things like character and place names.  SIPs (Statistically Improbable Phrases) are more interesting.  They're characteristic phrases that occur more often in the book in question than in all the other searchable books.  To show you what SIPs look like, I've gone through the (searchable) novels that have won the Hugo Award and collected their SIPs.  It's interesting to see how the SIP algorithm sometimes successfully distills the flavor of the language of a novel (and sometimes not).

Here are the novels and their SIPs in chronological order.  Note that in some early years they didn't give a Hugo for Best Novel, and that there have been a couple of ties.  My comments are in [brackets].

The Demolished Man, Alfred Bester (1953)

not searchable

[Damn!  I was hoping to find out if any phrases from Tenser, said the Tensor showed up.]

They'd Rather Be Right, Mark Clifton and Frank Riley (1955)

not searchable

[Not only not searchable, but thoroughly out of print—since 1981, looks like.  I've read it, and I don't remember a thing about it.]

Double Star, Robert A. Heinlein (1956)

pseudo limbs, two gravities, grand network, adoption ceremony

The Big Time, Fritz Leiber (1958)

control divan, bomb chest, little commandant, bronze chest, his tentacles, peace message

A Case of Conscience, James Blish (1959)

jungle suit, equatorial sea, fusion bombs

Starship Troopers, Robert A. Heinlein (1960)

hand flamer, assistant section leader, retrieval boat, cap troopers, bulkhead thirty, shines the name, powered suit, powered armor, combat drop, your platoon, platoon sergeant, third lieutenant, administrative punishment, drop room, sack time

[This is the first home run—it's almost a summary of the novel.]

A Canticle for Leibowitz, Walter M. Miller, Jr. (1961)

not searchable

Stranger in a Strange Land, Robert A. Heinlein (1962)

stereo tank, bounce tube, babble box, water brothers, posing show, his time sense, speak rightly, naughty picture, eternally saved, water ceremony, tattooed lady

[Another good one.  Heinlein sure had a way with words—er, make that phrases.]

The Man in the High Castle, Philip K. Dick (1963)

silver triangle, yarrow stalks, wicker hamper

Way Station, Clifford Simak (1964)

not searchable

The Wanderer, Fritz Leiber (1965)

not searchable

Dune, Frank Herbert (1966)

his stillsuit, colonel bashar, gom jabbar, factory crawler, ducal signet, poison snooper, weirding way, dew collectors, diamond tattoo, little makers, water flagon, other mote, message cylinder, spice liquor, shield belt, terrible purpose, palm lock, prison planet, death commandos, desert power, total blue, shield generator, nose plugs

[A bumper crop!  Herbert's futurespeak really was memorable.]

...And Call Me Conrad (a.k.a This Immortal), Roger Zelazny (1966)

searchable, but no SIPs

[What's up with this, I wonder?  No phrase was statistically improbable?]

The Moon is a Harsh Mistress, Robert A. Heinlein (1967)

lock thirteen, catapult head, executive cell, ballistic radars, old catapult, new catapult, ejection end, new chum, grain barges, gentleman member, laser drills, tonne for tonne, other warrens, parking orbit, escape speed, grain shipments

Lord of Light, Roger Zelazny (1968)

purple grove, thunder chariot, demons pierced, phantom cats, death gaze, eastern continent, fire elementals

Stand on Zanzibar, John Brunner (1969)

not searchable

The Left Hand of Darkness, Ursula K. Le Guin (1970)

white weather

[This is odd.  Why only one SIP?  I guess Le Guin tends to coin single words rather than phrases, like mindspeech, Ekumen, and ansible (though this last one occurs in a SIP for Speaker for the Dead, below).]

Ringworld, Larry Niven (1971)

shadow square wire, stepping discs, sonic fold, hyperdrive shunt, intercom image, cziltang brone, second quantum hyperdrive, puppeteer fleet, crash balloons, transfer booths, sleeping plates, galactic axis, flashlight laser, electromagnetic cannon, shadow squares, puppeteer worlds, fusion motors, ring foundation, lifting motors, crash couch, infinity horizon, zap gun, floating building, scope screen, fusion drives

[Ah, Larry Niven.  If I had to start all over again, I might name this blog Cziltang Brone.]

To Your Scattered Bodies Go, Philip Jose Farmer (1972)

not searchable

The Gods Themselves, Isaac Asimov (1973)

not searchable

Rendezvous With Rama, Arthur C. Clarke (1974)

not searchable

The Dispossessed, Ursula K. Le Guin (1975)

four decads, temporal physics, knobby one, physics office, orange blanket, bed platform

[Again, Le Guin doesn't seem to have many statistically improbable phrases.  It would be nice if Amazon also generated a list of SIWs (words), though I imagine that might get out of hand...]

The Forever War, Joe Haldeman (1976)

collapsar field, collapsar jumps, nova bombs, pressor field, general freak, portal planets, gigawatt laser, stasis field, logistic computer, five gees, fighting suits, one gee, image converter, laser fire

[Here's something annoying: in the original version of the novel, Haldeman referred repeatedly to a bevawatt laser.  It turns out he was confused: beva- isn't one of the metric prefixes, it's from BeV 'billion electron volts', a measure of energy rather than power.  I don't care, I like the sound of bevawatt; it sounds BIG.  Unfortunately, he's changed every occurrence to gigawatt laser in a later edition.  Booo!]

Where Late the Sweet Birds Sang, Kate Wilhelm (1977)

searchable, but no SIPs

Gateway, Frederik Pohl (1978)

prayer fans, food mines, ship handling

Dreamsnake, Vonda McIntyre (1979)

not searchable

[Also out of print.]

The Fountains of Paradise, Arthur C. Clarke (1980)

forty thousand kilometers, butterfly nut, hundred klicks, synchronous orbit

[Forty thousand kilometers and synchronous orbit is almost a spoiler.]

The Snow Queen, Joan D. Vinge (1981)

trefoil tattoo, sibyl mind, killing mers, been offworld

Downbelow Station, C. J. Cherryh (1982)

not searchable

Foundation's Edge, Isaac Asimov (1983)

not searchable

Startide Rising, David Brin (1984)

not searchable

Neuromancer, William Gibson (1985)

toxin sacs, new pancreas, shark thing, leather jeans

[I'm surprised there aren't more SIPs in Neuromancer.  I guess a lot of Gibson's coinages were capitalized—Panther Moderns, Villa Straylight—but what about jack in or cyberspace cowboy?]

Ender's Game, Orson Scott Card (1986)

bugger ships, bugger fleet, bugger wars, beat the buggers, flash suits, null gravity, toon leaders, simulator field, green green brown, frozen soldier, launch group

[Hmm, bugger, bugger, bugger, and buggers.  I wonder what Orson has on his mind?]

Speaker for the Dead, Orson Scott Card (1987)

ansible connection, hive queen, other piggies, dead piggies, genetic molecules, speakers for the dead, third life, metal eyes

The Uplift War, David Brin (1988)

not searchable

Cyteen, C. J. Cherryh (1989)

young sera, two azi, endocrine learning, old azi, wing supervisor, your maman, cold lab, time sera, entertainment tapes, when maman, tape structures, where maman, security flag, psych tests, your majority

Hyperion, Dan Simmons (1990)

not searchable

[Too bad, I'll bet this would have been interesting.  Brilliant book; didn't like the sequel nearly as much.  Did the later books get better?]

The Vor Game, Lois McMaster Bujold (1991)

not searchable

Barrayar, Lois McMaster Bujold (1992)

not searchable

A Fire Upon the Deep, Vernor Vinge (1993 tie)

coldsleep boxes, radio cloaks, her dataset, his fronds, voder voice, drive spines, cargo shell, flying house, command deck, refugee ship, scarred one, inner keep, alien member, other hull, single pack, zero gee, most packs

Doomsday Book, Connie Willis (1993 tie)

not searchable

Green Mars, Kim Stanley Robinson (1994)

not searchable

Mirror Dance, Lois McMaster Bujold (1995)

not searchable

The Diamond Age, Neal Stephenson (1996)

not searchable

Blue Mars, Kim Stanley Robinson (1997)

not searchable

Forever Peace, Joe Haldeman (1998)

forever peace, joe haldeman, being jacked, been jacked, memory modification, old platoon, your platoon

[Heh, the SIP calculator has included the header on each page ("joe haldeman") in the statistics.]

To Say Nothing of the Dog, Connie Willis (1999)

not searchable

A Deepness in the Sky, Vernor Vinge (2000)

play twine, ubiquitous law enforcement, booze parlor, taxi locks, fleet library, main torch, old cobber, game helmet, joint command post, eating hands, wire gun, electric jets, zero gee, three hundred seconds, heavy lifters, one gee, missile fields, trading culture

Harry Potter and the Goblet of Fire, J. K. Rowling (2001)

not searchable

American Gods, Neil Gaiman (2002)

little brown cat, buffalo man, pale suit, fat kid

Hominids, Robert Sawyer (2003)

not searchable

Paladin of Souls, Lois McMaster Bujold (2004)

dowager royina, castle warder, parley officer, spiritual conductor, courier girl, courier station, demon light, fifth god, her inner vision, demon magic, baggage mules, red stallion, sewing woman, her stirrups, grinned briefly, court mourning, her demon

Jonathan Strange & Mr Norrell, Susanna Clarke (2005)

new manservant, madhouse attendants, fairy roads, practical magician, nameless slave, fairy servants, school for magicians, little surprized, four hundred guineas, little stone figure, two magicians, packhorse bridge, new magicians, moss oak, chalk road, much surprized, magical history, country servants, century magician, her ladyship, magic done, second shall, great surprize, fairy spirits, other magicians

SIPs are interesting to me, in that computational linguistics, statistical processing kind of way, but I wonder how useful they are for non-lingeeks.  I guess the idea is to allow people to search for phrases they remember from books, even if those phrases aren't the title or the author's name.  They haven't made the feature very prominent, though.  If you search for Villa Straylight, for example, you get no hits, but if you then click on the "additional results" link, Neuromancer is the first result.

Anyway, I'm back to writing my abstract for the LSA Summer Meeting which, coincidentally, involves searching through a dictionary looking for statistically improbable correlations between headwords and their definitions.

I am The Tensor, and I approve this post.
10:47 PM in Linguistics, Science Fiction

Listed below are links to weblogs that reference Hugo SIPs:

» Statistically Improbable Phrases in SF from SF Signal
Tenser blog has an interesting post about Amazon's new search feature which looks for Statistically Improbable Phrases (SIPs), defined by Amazon as "the most distinctive phrases in the text of books in the Search Inside! program". The interesting thing...

Tracked on Feb 16, 2006 11:02:42 AM

Comments

Oh, Tensor, I'm sure you know perfectly well what "bugger" means in the context of Ender's Game.

[re: Hyperion] Brilliant book; didn't like the sequel nearly as much. Did the later books get better?

In my opinion, not really, though there were some interesting ideas. I liked the extra-fast starship drive that necessarily killed its passengers, requiring them to be resurrected afterward.

Posted by: Tim May at Feb 15, 2006 9:30:10 AM

Harry Potter got a Hugo? Jesus.

Posted by: language hat at Feb 16, 2006 8:23:12 AM

I think what's striking about Dune's phrases is how non-futurespeaky many of them are. Ducal signet? Water flagon? But the whole aristocratic-retro thing is a big part of what made the book so great.

Oh, and Language Hat, come on man, don't be a Pottuh Hatuh!

Posted by: includedmiddle at Feb 16, 2006 12:12:43 PM

I'd like to see a search for statistically probable phrases, to find the most cliche book ;-)

Posted by: Steve at Feb 17, 2006 6:36:40 PM

They'd Rather Be Right was also published under the title The Forever Machine. That version is also O.P., but more recently and more cheaply. Under any title, it's probably the worst novel to win the Hugo.

Posted by: scott at Feb 17, 2006 8:56:28 PM

...And Call Me Conrad (a.k.a This Immortal), Roger Zelazny (1966)

searchable, but no SIPs

[What's up with this, I wonder?  No phrase was statistically improbable?]

Who can forget the famous SIP "rabbit-venom"?

And re-reading it a bit finds plenty more.

"kallikanzaroi blood" "influential galactojournalist" "delicately webbed" "mutant fungus" "spiderbat glided" "galactic culture" "pseudotelepathic wish-fulfillment" "piezo-electric radar mesentery"
"meta-cyanide" "supernumary legs" "boxed shajadpa"

And so on and so forth. It's worth a re-read, I think, just to track down Zelazny's playful use of scientifictional language.

Posted by: Owlmirror at Feb 19, 2006 12:34:40 PM

Damn. I left out some closing tags and broke the comments. Sorry about that.

Posted by: Owlmirror at Feb 19, 2006 12:38:22 PM

Oh, and one final comment: I don't think Amazon is listing all SIPs in all of the above. They're listing all of the SIPs found by those books preferred by Amazon SIP-finders, which in turn depends on the SIP-finders' reading tastes. Or perhaps the book is chosen for SIP-finding based on the book's Amazon rank?

Which I suspect is the real reason why there are so many from Heinlein and Herbert, and so few from Le Guin and Zelazny.

Amazon sales ranks for some of the above (although it varies by edition):

  • Dune: #1,599
  • Moon is a Harsh Mistress: #4,378
  • Starship Troopers: #5,516
  • Stranger in a Strange Land: #13,092
  • The Dispossessed: #24,358
  • Ringworld: #56,152
  • Left Hand of Darkness: #76,978
  • This Immortal: #273,700

Hmm.

Posted by: Owlmirror at Feb 19, 2006 1:06:41 PM

Oh, and one final comment: I don't think Amazon is listing all SIPs in all of the above. They're listing all of the SIPs found by those books preferred by Amazon SIP-finders, which in turn depends on the SIP-finders' reading tastes.

I'd be surprised if SIP-finding was done by humans. It's easy to see how it could be done purely automatically:

  1. Calculate the bigram frequencies for each book
  2. Use those to calculate the overall bigram frequencies
  3. The SIPs for each book are the bigrams with the highest ratio of book-frequency to overall-frequency
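
The three steps above can be sketched in Python roughly as follows. This is only an illustration of the ratio idea, not Amazon's actual method, which isn't public. The function names (bigrams, find_sips), the crude regex tokenizer, and the top_n and min_count parameters are all assumptions made for the example; min_count simply drops bigrams that occur only once in a book, since a single stray occurrence would otherwise score very high.

```python
import re
from collections import Counter


def bigrams(text):
    """Lowercase the text, split it into words, and return adjacent word pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def find_sips(books, top_n=5, min_count=2):
    """Rank each book's bigrams by book-frequency / overall-frequency.

    books maps a title to its full text. top_n and min_count are
    hypothetical tuning knobs, not anything Amazon has documented.
    """
    # Step 1: bigram counts for each book.
    per_book = {title: Counter(bigrams(text)) for title, text in books.items()}

    # Step 2: overall bigram counts across every book.
    overall = Counter()
    for counts in per_book.values():
        overall.update(counts)
    overall_total = sum(overall.values())

    # Step 3: score each bigram by the ratio of its relative frequency
    # in the book to its relative frequency in the whole collection.
    sips = {}
    for title, counts in per_book.items():
        book_total = sum(counts.values())
        scored = []
        for pair, n in counts.items():
            if n < min_count:
                continue
            book_freq = n / book_total
            overall_freq = overall[pair] / overall_total
            scored.append((book_freq / overall_freq, n, " ".join(pair)))
        # Highest ratio first; ties broken by count, then alphabetically.
        scored.sort(key=lambda t: (-t[0], -t[1], t[2]))
        sips[title] = [phrase for _, _, phrase in scored[:top_n]]
    return sips
```

Calling find_sips({"Title A": text_a, "Title B": text_b}) returns a dictionary mapping each title to its highest-scoring phrases. On a real corpus you would probably also want to filter out pairs made up mostly of function words.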

How they deal with trigrams is more of a question (do you treat them separately or somehow mixed in with the bigrams?), but I doubt they're doing any of it by hand. All that should matter is whether the full text of the book is available for processing. It's possible that a human later prunes the SIP lists in case there are any real oddballs, but the main work must be done by software.

Or perhaps the book is chosen for SIP-finding based on the book's Amazon rank?

This is probably closer to the mark. A book can only have its SIPs calculated if the full text is available, and that's probably only true for books that have had recent editions (new or reissued), which ought to be strongly correlated with sales.

Posted by: The Tensor at Feb 19, 2006 6:34:51 PM

A book can only have its SIPs calculated if the full text is available, and that's probably only true for books that have had recent editions (new or reissued), which ought to be strongly correlated with sales.

True, but This Immortal does have the full text available.

I searched on some of the phrases I posted above, and realized that most of them, while unusual, weren't "SIPs" as defined by Amazon, since they don't show up more than once in a search inside. However, since I know that "pseudotelepathic wish-fulfillment" does indeed occur more than once, I wondered what was wrong with that as a SIP. Well, some investigation shows that the OCR is very imperfect, and "wish-fulfillment" does not show up properly to make the phrase stand out.

So I suspect that human intervention is actually required in order to clean up OCR goofs. Once that has been done, SIP finding can work properly. And it seems most likely that works are chosen for OCR cleanup by sales rank.

Which is mostly what you said, but with crappy OCR being an additional factor, in addition to the book being scanned in the first place.

Posted by: Owlmirror at Feb 19, 2006 8:44:40 PM

Just out of curiosity -- what dictionary are you using as the fodder for your LSA abstract?

Posted by: Erin at Feb 20, 2006 2:01:29 PM

It's the 1913 edition of Webster's, which is one of the files available here:

ftp://ftp.dict.org/pub/dict/

Posted by: The Tensor at Feb 20, 2006 5:06:34 PM

Brilliant book; didn't like the sequel nearly as much. Did the later books get better?

Not really. I can't say I actively regretted reading them, but probably not far from it.

Posted by: FS at Apr 25, 2007 8:57:24 AM
