Friday June 23, 2006
This I Believe #12
...that automatic language identification systems, which attempt to determine what language a sample of unknown text is written in, needlessly endanger the sanity and survival of the human race. In particular, such systems fail to take into account that some knowledge has been wisely hidden away from human eyes for millennia, and that such dangerous knowledge is generally preserved in one of a handful of ancient, disturbingly alien languages.
Consider a scenario. For your Outer Classics dissertation, you're researching certain, shall we say, difficult-to-obtain texts of questionable provenance. As you flip through a moldering copy of the Pnakotic Manuscripts, a single sheet of parchment, made of oddly familiar-feeling leather, drops out onto your desk. You pick the sheet up and are just able to make out queer, unfamiliar, somehow disquieting characters written in some kind of dark brown ink.
Your curiosity piqued, you fire up your computer and head to your favorite language identification site. You painstakingly enter the characters (which is only possible thanks to the hubris of those rash madmen at the Unicode Consortium), then, with a faint tremor in your hand, you click the button labeled "Submit", unaware of just how terrifyingly appropriate that label is.
What should the language identification system return? Given the commitment of computational linguists to science (bah!) and truth (naive fools!), you're likely to receive an accurate answer: Atlantean, Old Lomarian, or even, heaven help you, Yithese. Who knows where this first hint will lead you and what ancient secrets, long lost in the mists of an inhuman prehistory, you might uncover?
It would be better, instead, if the creators of language identification systems designed them to return a stern warning in such a case:
Some knowledge is too terrifying for merely mortal minds to safely contain. You may have stumbled upon such knowledge. For the sake of humanity, burn the original, scatter its ashes, forget that you ever saw it, and retreat, ignorant, into the safety of a new dark age!
We'd all of us be safer if such practices were universally adopted. At least you would be safe, gentle reader, for I fear my fate is sealed. Had I but known the horror that awaited me in that accursed tome, wisely concealed deep in the bowels of the linguistics library...
XRCE's identifier thinks "Ia! Ia! Cthulhu Fthagn! Ph'nglui mglw'nfah Cthulhu R'lyeh wgah'nagl fhtagn!" is Irish. That strikes me as a safer answer, really—a red herring is less of a goad to curiosity than a dire warning is.
Many language guessing systems have been compromised, or perhaps “protected” is a better term, to prevent this kind of accidental acquisition of inappropriate knowledge.
For those guessing systems that have not been hacked, or rather “protected”, there is an unnameable artform that is practiced by those who must safeguard such knowledge: writing in such a way that the language is consistently identified incorrectly.
For example, see a certain letter to the editor at SpecGram (search for “Anot Lanywassufte”). The text is clearly not English, but pretty much any language identifier will tell you that it strongly identifies as English. It's a difficult task to pull off, but excellent examples are things of meta-poetic beauty. The form is vaguely akin to the triolet, but taken to a meta-meta-level.
By the way, I can’t reveal what the text of the letter says, and I recommend against reading it aloud unless you are standing in a properly constructed pentagram of protection.
"XRCE's identifier thinks "Ia! Ia! Cthulhu Fthagn! Ph'nglui mglw'nfah Cthulhu R'lyeh wgah'nagl fhtagn!" is Irish."
That can be easily explained. About six years ago, I translated S. Albert Kivinen's piece of parodic Cthulhu fan fiction, Keskiyön Mato Ikaalisissa ("The Midnight Worm in Ikaalinen"), from Finnish into Irish and made it accessible on the Internet (as Péist an Mheán Oíche). They probably used that story as input material for their identifier.
Posted by: Panu at Jan 22, 2008 5:19:03 AM
XRCE's identifier thinks "Ia! Ia! Cthulhu Fthagn! Ph'nglui mglw'nfah Cthulhu R'lyeh wgah'nagl fhtagn!" is Irish.
Because you can't even spell Iä, that is. May you be eaten next to last.
Posted by: David Marjanović at Sep 5, 2008 4:04:38 PM