Ethan Zuckerman has a delightful post recounting an apparently delightful talk by Oxford University Press lexicographer Erin McKean at the Pop!Tech 2006 conference. (Hat tip to John Palfrey.) The convergence of technology and lexicography is an exciting place. Ethan talks a little about dictionary mash-ups and peer production of dictionaries. McKean has spoken elsewhere about using search engines to track the evolution of language.
But what caught my eye (and John’s) is the discussion in Ethan’s post about the need for dictionary-makers to scan billions of words digitally in order to analyze linguistic trends effectively:
This scanning shouldn’t be threatening to publishers. “I don’t care about your plot, or your ideas – I just want to analyze your use of the language.” It should be considered fair use… “but this is America – anyone can sue anyone for anything.” And just the threat of a lawsuit is enough to prevent lexicographers from analyzing some texts.
She begs us to make changes to the copyright pages of our books so that lexicographers have the explicit right to analyze them. (I’ll be putting the idea in front of Larry Lessig, to see if this can be yet another selling point for Creative Commons.)
Indeed, it sounds like fair use. We are back to the same question raised by the litigation over the Google Library Project. Does scanning — the kind of crawling over text as raw data that now fuels the internet and increasingly the whole world — violate copyright? In the Google fight, the authors and publishers who sued argue that (1) the very act of scanning is an infringement and that (2) the indexing Google proposes to undertake, for commercial purposes, is not a fair use that immunizes the infringement. I have my own opinions on that dispute, but I recognize that it is a close call about which reasonable minds could differ.
Scanning the books for lexicographic analysis seems like a superior test case. I suppose the dictionary is a commercial venture too, but the type of use is more clearly transformative and it is less likely to intrude on any market available to the rightsholders. The fair use issue tilts more clearly toward the user.
That allows a clearer view of what seems to me the more fundamental question. It may be time to revisit whether scanning, even though it is literally a form of reproduction, should be considered more like machine-assisted reading, at least when the purpose is the kind of data-crunching involved in either indexing (as in Google) or analyzing linguistic patterns. These are projects formerly completed by laborious analog effort. When done in that fashion, neither a library card catalog nor a dictionary represents even a close call on copyright infringement; both are perfectly legal. Is it sensible that automating that process, which is the only way to deal with the modern explosion of information, should convert such innocent acts into infringements? If the fair use doctrine as currently constructed is not the place to work all this out (as I discussed here), then should we change copyright law, either through Congress or the courts, to permit such scanning explicitly?
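To make the contrast concrete, here is a minimal sketch of what the lexicographer's kind of "machine-assisted reading" looks like in practice: a word-frequency tally that treats a text purely as linguistic raw data. Nothing expressive survives the process, no plot, no sentence order, no ideas, only counts of how often each word form appears. (The sample sentence is invented for illustration; it is not drawn from any real corpus.)

```python
from collections import Counter
import re

def word_frequencies(text):
    """Tally word usage in a text, treating it purely as linguistic data.

    The output is just a bag of word counts: it preserves how the
    language is used, but none of the work's expressive content.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Illustrative input; any scanned page would serve the same way.
sample = "The cat sat on the mat and the dog sat by the door."
freq = word_frequencies(sample)
print(freq["the"], freq["sat"])
```

A dictionary-maker aggregating counts like these across billions of words could document shifts in usage without ever reproducing anything a reader would recognize as the underlying book.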
Ethan’s suggestion that authors could use Creative Commons to allow such scanning is a good one (although wouldn’t the lexicographers be concerned about a skewed sample of language use dominated by Info/Law geeks?). But on a broader scale, this is yet more evidence of the need for reform of the underlying law.