Karen Spärck Jones, 1935-2007

In honor of Ada Lovelace Day 2012, I write about the only female winner of the Lovelace Medal awarded by the British Computer Society for “individuals who have made an outstanding contribution to the understanding or advancement of Computing”. Karen Spärck Jones was the 2007 winner of the medal, awarded shortly before her death. She also happened to be a leader in my own field of computational linguistics, a past president of the Association for Computational Linguistics. Because we shared a research field, I had the honor of knowing Karen and the pleasure of meeting her on many occasions at ACL meetings.

One of her most notable contributions to the field of information retrieval was the idea of inverse document frequency. Well before search engines were a “thing”, Karen was among the leaders in figuring out how such systems should work. Already in the 1960s there had arisen the idea of keyword searching within sets of documents, and the notion that the more “hits” a document receives, the higher ranked it should be. Karen noted in her seminal 1972 paper “A statistical interpretation of term specificity and its application in retrieval” that not all hits should be weighted equally. For terms that are broadly distributed throughout the corpus, their occurrence in a particular document is less telling than the occurrence of terms that appear in few documents. She proposed weighting each term by its “inverse document frequency” (IDF), which she defined as log(N/(n + 1)) where N is the number of documents and n the number of documents containing the keyword under consideration. When the keyword occurs in all documents, IDF approaches 0 for large N, but as the keyword occurs in fewer and fewer documents (making it a more specific and presumably more important keyword), IDF rises. The two notions of weighting (frequency of occurrence of the keyword together with its specificity as measured by inverse document frequency) are combined multiplicatively in the now-standard tf*idf metric; tf*idf or its successors underlie essentially all information retrieval systems in use today.
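To make the weighting concrete, here is a minimal sketch in Python that uses the formula exactly as stated above, log(N/(n + 1)), and multiplies it by raw term counts. The function names, the whitespace tokenizer, and the toy corpus are my own assumptions for illustration; they are not taken from the 1972 paper, and real retrieval systems use more refined variants.

```python
import math
from collections import Counter


def idf(num_docs, docs_with_term):
    """Inverse document frequency as given in the text: log(N / (n + 1))."""
    return math.log(num_docs / (docs_with_term + 1))


def tf_idf_scores(documents, query_terms):
    """Score each document by summing tf * idf over the query terms.

    Tokenization is a naive lowercase whitespace split (an assumption
    made only to keep the sketch short).
    """
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)

    # n: the number of documents that contain each term at least once
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)  # tf: raw count of each term
        score = sum(term_counts[t] * idf(num_docs, doc_freq[t])
                    for t in query_terms)
        scores.append(score)
    return scores


# Hypothetical toy corpus: "the" occurs everywhere, "retrieval" only once.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the retrieval system ranked the documents",
    "the weather is fine",
]
scores = tf_idf_scores(corpus, ["the", "retrieval"])
best = max(range(len(corpus)), key=lambda i: scores[i])
print(corpus[best])
```

In this toy corpus the ubiquitous term “the” contributes almost nothing (under this particular formula its weight is even slightly negative when N is small), while the single occurrence of “retrieval” is enough to rank the third document first.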

In Karen’s interview for the Lovelace Medal, she opined that “Computing is too important to be left to men.” Ada Lovelace would have agreed.

...time to switch...
A very old light switch (2008) by RayBanBro66 via flickr. Used by permission (CC by-nc-nd)

The journal Research in Learning Technology has switched its approach from closed to open access as of New Year’s 2012. Congratulations to the Association for Learning Technology (ALT) and its Central Executive Committee for this farsighted move.

This isn’t the first journal to make the switch. The Open Access Directory lists about 130 of them. In my own research field, the Association for Computational Linguistics (ACL) converted its flagship journal Computational Linguistics to OA as of 2009, and has just announced a new open-access journal Transactions of the Association for Computational Linguistics. Each such transition is a reminder of the trajectory that journal publishing ought to follow.

The ALT has done lots of things right in this change. They’ve chosen the ideal licensing regime for papers, the Creative Commons Attribution (CC-BY) license. They’ve jettisoned one of the largest commercial subscription journal publishers, and gone with a small but dedicated professional open-access publisher, Co-Action Publishing. They’ve opened access to the journal retrospectively, so that the entire archive, back to 1993, is available from the publisher’s web site.

Here’s hoping that other scholarly societies are inspired by the examples of the ALT and ACL, and join the many hundreds of scholarly societies that publish their journals open access. It’s time to switch.

Gregoire Bouillier
I used to use as my standard example of why translation is hard — and why fully automatic high-quality translation (FAHQT) is unlikely in our lifetimes however old we are — the translation of the first word of the first sentence of the first book of Proust’s Remembrance of Things Past. The example isn’t mine. Brown et al. cite a 1988 New York Times article about the then-new translation by Richard Howard. Howard chose to translate the first word of the work, longtemps, as time and again (rather than, for example, the phrase for a long time as in the standard Moncrieff translation) so that the first word time would resonate with the temporal aspect of the last word of the last volume, temps, some 3000 pages later. How’s that for context?

I now have a new example, from the Lorin Stein translation of Grégoire Bouillier’s The Mystery Guest. Stein adds a translator’s note to the front matter: “For reasons the reader will understand, I have refrained from translating the expression ‘C’est le bouquet.’ It means, more or less, ‘That takes the cake.’” That phrase occurs on page 14 in the edition I’m reading.

The fascinating thing is that the reader does understand, fully and completely, why the translator chose this route. But the reason is, more or less, a sentence that occurs on page 83, a sentence that shares no words with the idiom in question. True, the protagonist perseverates on this latter sentence for the rest of the novella, but still, I challenge anyone to explain, in less than totally abstract terms, as far from the words actually used as you can imagine, the reasoning, perfectly clear to any reader, behind the translator’s crucial decision.

Language is ineffable.