March 18th, 2013
…where I should go…
“Directions” image by flickr user Peat Bakke used by permission.
After a several-year postponement while I established and led Harvard’s nascent library skunkworks, the Office for Scholarly Communication, my request for a sabbatical leave has been approved for the 2013–14 academic year. I hope to work on a variety of projects related to library and publishing issues, computational-linguistics–related projects, and teaching preparation.
My plans are to remain in the Boston area during the year but with perhaps several short external residencies at other institutions who might be interested in having me around for a couple of weeks. If your institution might be such a place — willing to provide an office, a place to crash, maybe a plane ticket — please let me know.
[This is a transient post.]
January 3rd, 2013
…our little tiff in the late 18th century…
“NYC – Metropolitan Museum of Art: Washington Crossing the Delaware” image by flickr user wallyg. Used by permission.
I’m shortly off to give a talk at the annual meeting of the Linguistic Society of America (on why open access is better for scholarly societies, which I’ll be blogging about soon), but in the meantime, a linguistically related post about punctuation.
Careful readers of this blog (are there any careful readers of this blog? are there any readers at all?) will note that I generally eschew the peculiarly American convention of moving punctuation within a closing quotation mark. Examples from The Occasional Pamphlet abound: here, here, here, here, here, here, here, and here. And that’s just from 2012. It’s surprising how often this punctuation convention comes into play.
Instead, I use the convention that only the stuff being quoted is put within the quotation marks. This is sometimes called the “British” convention, despite the fact that other nationalities use it as well, presumably to emphasize the American/British dualism extant from our little tiff in the late 18th century. I use the “British” convention because the “American” convention is, in technical terms, stupid.
The story goes that punctuation appearing within the quotation mark is more aesthetically pleasing than punctuation outside the quotation mark. But even if that were true, clarity trumps beauty. Moving the punctuation means that when you see a quoted string with some final punctuation, you don’t know if that punctuation is or is not intended to be part of the thing being quoted; it is systematically ambiguous.
Apparently, my view is highly controversial. For example, when working with MIT Press on my book on the Turing test, my copy editor (who, by the way, was wonderful, and amazingly patient) moved all my punctuation around to satisfy the American convention. I moved them all back. She moved them again. We got into a long discussion of the matter; it seems she had never confronted an author who felt strongly about punctuation before. (I presume she had never copy-edited Geoff Pullum, from whom more later.) As a compromise, we left the punctuation the way I liked it—mostly—but she made me add the following prefatory editorial note:
Throughout the text, the American convention of moving punctuation within closing quotation marks (whether or not the punctuation is part of what is being referred to) is dropped in favor of the more logical and consistent convention of placing only the quoted material within the marks.
I would now go on to explain why the “British” convention is better than the “stupid” convention, except that Geoff Pullum has done so much better a job, far better than I ever could. Here is an excerpt from his essay “Punctuation and human freedom” published in Natural Language and Linguistic Theory and reproduced in his book The Great Eskimo Vocabulary Hoax. I recommend the entire essay to you.
I want you to first consider the string ‘the string’ and the string ‘the string.’, noting that it takes ten keystrokes to type the string in the first set of quotes, and eleven to type the string in the second pair. Imagine you wanted to quote me on the latter point. You might want to say (1).
(1) Pullum notes that it takes eleven keystrokes to type the string ‘the string.’
No problem there; (1) is true (and grammatical if we add a final period). But now suppose you want to say this:
(2) Pullum notes that it takes ten keystrokes to type the string ‘the string’.
You won’t be able to publish it. Your copy-editor will change it before the first proof stage to (3), which is false (though regarded by copy-editors as grammatical):
(3) Pullum notes that it takes ten keystrokes to type the string ‘the string.’
Why? Because the copy-editor will insist that when a sentence ends with a quotation, the closing quotation mark must follow the punctuation mark.
I say this must stop. Linguists have a duty to the public to use their expertise in arguing for changes to the fabric of society when its interests are threatened. And we have such a situation here.
What say we all switch over to the logical quotation punctuation approach and save the fabric of society, shall we?
October 16th, 2012
Karen Spärck Jones, 1935–2007
In honor of Ada Lovelace Day 2012, I write about the only female winner of the Lovelace Medal awarded by the British Computer Society for “individuals who have made an outstanding contribution to the understanding or advancement of Computing”. Karen Spärck Jones was the 2007 winner of the medal, awarded shortly before her death. She also happened to be a leader in my own field of computational linguistics, a past president of the Association for Computational Linguistics. Because we shared a research field, I had the honor of knowing Karen and the pleasure of meeting her on many occasions at ACL meetings.
One of her most notable contributions to the field of information retrieval was the idea of inverse document frequency. Well before search engines were a “thing”, Karen was among the leaders in figuring out how such systems should work. Already in the 1960s there had arisen the idea of keyword searching within sets of documents, and the notion that the more “hits” a document receives, the higher ranked it should be. Karen noted in her seminal 1972 paper “A statistical interpretation of term specificity and its application in retrieval” that not all hits should be weighted equally. For terms that are broadly distributed throughout the corpus, their occurrence in a particular document is less telling than occurrence of terms that occur in few documents. She proposed weighting each term by its “inverse document frequency” (IDF), which she defined as log(N/(n + 1)) where N is the number of documents and n the number of documents containing the keyword under consideration. When the keyword occurs in all documents, IDF approaches 0 for large N, but as the keyword occurs in fewer and fewer documents (making it a more specific and presumably more important keyword), IDF rises. The two notions of weighting (frequency of occurrence of the keyword together with its specificity as measured by inverse document frequency) are combined multiplicatively in the by now standard tf*idf metric; tf*idf or its successors underlie essentially all information retrieval systems in use today.
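For the computationally inclined, here is a minimal sketch of the weighting just described, using the formula as stated above, log(N/(n + 1)). The toy corpus, the function names, and the choice of natural logarithm are my own assumptions for illustration; they are not taken from the 1972 paper.

```python
import math
from collections import Counter


def idf_from_counts(N, n):
    """IDF as stated in the text: log(N / (n + 1)).

    N is the number of documents, n the number containing the term.
    The natural log is an assumption; another base only rescales the weights.
    """
    return math.log(N / (n + 1))


def idf(term, docs):
    """IDF of a term over a corpus given as a list of token lists."""
    n = sum(1 for doc in docs if term in doc)
    return idf_from_counts(len(docs), n)


def tf_idf(term, doc, docs):
    """Term frequency in one document times the term's IDF in the corpus."""
    tf = Counter(doc)[term]
    return tf * idf(term, docs)


# Hypothetical four-document corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the turing test".split(),
    "the golem of prague".split(),
]

if __name__ == "__main__":
    for term in ["the", "cat", "golem"]:
        print(term, round(idf(term, docs), 3))
    # A term found in every document of a large corpus has IDF close to 0.
    print(idf_from_counts(10**6, 10**6))
```

In this toy corpus the rarer the term, the higher its weight: “golem” (one document) outranks “cat” (two), which outranks “the” (all four). With this particular formula a term that occurs in every document gets a weight slightly below zero, which shrinks toward zero as the corpus grows.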
In Karen’s interview for the Lovelace Medal, she opined that “Computing is too important to be left to men.” Ada Lovelace would have agreed.
June 16th, 2012
Image of the statue of the Golem of Prague at the entrance to the Jewish Quarter of Prague by flickr user D_P_R. Used by permission (CC-BY 2.0).
Alan Turing, the patron saint of computer science, was born 100 years ago this week (June 23). I’ll be attending the Turing Centenary Conference at University of Cambridge this week, and am honored to be giving an invited talk on “The Utility of the Turing Test”. The Turing Test was Alan Turing’s proposal for an appropriate criterion to attribute intelligence (that is, capacity for thinking) to a machine: you verify through blinded interactions that the machine has verbal behavior indistinguishable from that of a person.
In preparation for the talk, I’ve been looking at the early history of the premise behind the Turing Test, that language plays a special role in distinguishing thinking from nonthinking beings. I had thought it was an Enlightenment idea, that until the technological advances of the 16th and 17th centuries, especially clockwork mechanisms, the whole question of thinking machines would never have received substantive discussion. As I wrote earlier,
Clockwork automata provided a foundation on which one could imagine a living machine, perhaps even a thinking one. In the midst of the seventeenth-century explosion in mechanical engineering, the issue of the mechanical nature of life and thought is found in the philosophy of Descartes; the existence of sophisticated automata made credible Descartes’s doctrine of the bête-machine (beast-machine), that animals were machines. His argument for the doctrine incorporated the first indistinguishability test between human and machine, the first Turing test, so to speak.
Some, however, trace the idea further back, to the Talmud. Uniformly, the evidence offered for Talmudic discussion of the Turing Test is a single quote from Sanhedrin 65b.
Rava said: If the righteous wished, they could create a world, for it is written, “Your iniquities have been a barrier between you and your God.” For Rava created a man and sent him to R. Zeira. The Rabbi spoke to him but he did not answer. Then he said: “You are [coming] from the pietists: Return to your dust.”
Rava creates a Golem, an artificial man, but Rabbi Zeira recognizes it as nonhuman by its lack of language and returns it to the dust from which it was created.
This story certainly describes the use of language to unmask an artificial human. But is it a Turing Test precursor?
It depends on what one thinks are the defining aspects of the Turing Test. I take the central point of the Turing Test to be a criterion for attributing intelligence. The title of Turing’s seminal Mind article is “Computing Machinery and Intelligence”, wherein he addresses the question “Can machines think?”. Crucially, the question is whether the “test” being administered by Rabbi Zeira is testing the Golem for thinking, or for something else.
There is no question that verbal behavior can be used to test for many things that are irrelevant to the issues of the Turing Test. We can go much earlier than the Mishnah to find examples. Consider Judges 12:5–6 (King James Version):
5 And the Gileadites took the passages of Jordan before the Ephraimites: and it was so, that when those Ephraimites which were escaped said, Let me go over; that the men of Gilead said unto him, Art thou an Ephraimite? If he said, Nay;
6 Then said they unto him, Say now Shibboleth: and he said Sibboleth: for he could not frame to pronounce it right. Then they took him, and slew him at the passages of Jordan: and there fell at that time of the Ephraimites forty and two thousand.
The Gileadites use verbal indistinguishability (of the pronunciation of the original shibboleth) to unmask the Ephraimites. But they aren’t executing a Turing Test. They aren’t testing for thinking but rather for membership in a warring group.
What is Rabbi Zeira testing for? I’m no Talmudic scholar, so I defer to the experts. My understanding is that the Golem’s lack of language indicated not its own deficiency per se, but the deficiency of its creators. The Golem is imperfect in not using language, a sure sign that it was created by pietistic kabbalists who themselves are without sufficient purity.
Talmudic scholars note that the deficiency the Golem exhibits is intrinsically tied to the method by which the Golem is created: language. The kabbalistic incantations that ostensibly vivify the Golem were generated by mathematical combinations of the letters of the Hebrew alphabet. Contemporaneous understanding of the Golem’s lack of speech was connected to this completely formal method of kabbalistic letter magic: “The silent Golem is, prima facie, a foil to the recitations involved in the process of his creation.” (Idel, 1990, pages 264–5) The imperfection demonstrated by the Golem’s lack of language is not its inability to think, but its inability to wield the powers of language manifest in Torah, in prayer, in the creative power of the kabbalist incantations that gave rise to the Golem itself.
Only much later does interpretation start connecting language use in the Golem to soul, that is, to an internal flaw: “However, in the medieval period, the absence of speech is related to what was conceived then to be the highest human faculty: reason according to some writers, or the highest spirit, Neshamah, according to others.” (Idel, 1990, page 266, emphasis added)
By the 17th century, the time was ripe for consideration of whether nonhumans had a rational soul, and how one could tell. Descartes’s observations on the special role of language then serve as the true precursor to the Turing Test. Unlike the sole Talmudic reference, Descartes discusses the connection between language and thinking in detail and in several places — the Discourse on the Method, the Letter to the Marquess of Newcastle — and his followers — Cordemoy, La Mettrie — pick up on it as well. By Turing’s time, it is a natural notion, and one that Turing operationalizes for the first time in his Test.
The test of the Golem in the Sanhedrin story differs from the Turing Test in several ways. There is no discussion that the quality of language use was important (merely its existence), no mention of indistinguishability of language use (but Descartes didn’t either), and certainly no consideration of Turing’s idea of blinded controls. But the real point is that at heart the Golem test was not originally a test for the intelligence of the Golem at all, but of the purity of its creators.
Idel, Moshe. 1990. Golem: Jewish magical and mystical traditions on the artificial anthropoid. Albany, N.Y.: State University of New York Press.
January 4th, 2012
“…time to switch…”
A very old light switch (2008) by RayBanBro66 via flickr. Used by permission (CC by-nc-nd)
The journal Research in Learning Technology has switched its approach from closed to open access as of New Year’s 2012. Congratulations to the Association for Learning Technology (ALT) and its Central Executive Committee for this farsighted move.
This isn’t the first journal to make the switch. The Open Access Directory lists about 130 of them. In my own research field, the Association for Computational Linguistics (ACL) converted its flagship journal Computational Linguistics to OA as of 2009, and has just announced a new open-access journal Transactions of the Association for Computational Linguistics. Each such transition is a reminder of the direction in which journal publishing ought to head.
The ALT has done lots of things right in this change. They’ve chosen the ideal licensing regime for papers, the Creative Commons Attribution (CC-BY) license. They’ve jettisoned one of the largest commercial subscription journal publishers, and gone with a small but dedicated professional open-access publisher, Co-Action Publishing. They’ve opened access to the journal retrospectively, so that the entire archive, back to 1993, is available from the publisher’s web site.
Here’s hoping that other scholarly societies are inspired by the examples of the ALT and ACL, and join the many hundreds of scholarly societies that publish their journals open access. It’s time to switch.
July 17th, 2011
I used to use as my standard example of why translation is hard — and why fully automatic high-quality translation (FAHQT) is unlikely in our lifetimes however old we are — the translation of the first word of the first sentence of the first book of Proust’s Remembrance of Things Past. The example isn’t mine. Brown et al. cite a 1988 New York Times article about the then-new translation by Richard Howard. Howard chose to translate the first word of the work, longtemps, as time and again (rather than, for example, the phrase for a long time as in the standard Moncrieff translation) so that the first word time would resonate with the temporal aspect of the last word of the last volume, temps, some 3000 pages later. How’s that for context?
I now have a new example, from the Lorin Stein translation of Grégoire Bouillier’s The Mystery Guest. Stein adds a translator’s note to the front matter: “For reasons the reader will understand, I have refrained from translating the expression ‘C’est le bouquet.’ It means, more or less, ‘That takes the cake.’” That phrase occurs on page 14 in the edition I’m reading.
The fascinating thing is that the reader does understand, fully and completely, why the translator chose this route. But the reason is, more or less, because of a sentence that occurs on page 83, a sentence that shares no words with the idiom in question. True, the protagonist perseverates on this latter sentence for the rest of the novella, but still, I challenge anyone to explain, in less than totally abstract terms, as far from the words actually used as you can imagine, the reasoning, perfectly clear to any reader, behind the translator’s crucial decision.
Language is ineffable.