Via BoingBoing comes news of another big crop of United States Court of Appeals decisions being scanned and made publicly available by public.resource.org. They are scanning the entire Federal Reporter (First Series), which includes late-19th and early-20th century United States case law. Enormous PDFs (and even more enormous TIFFs) of the scanned volumes are available here. A few of the volumes have been OCR’ed, with the rest to follow. The OCR text is much in need of cleanup, as this illustrative page shows. But public.resource.org has lots of experience at this, as you can see from what they’ve already accomplished with their scans of F.2d and F.3d.
The move from paper to digital is a daunting challenge, and one we should all be pleased to see farsighted organizations tackling. It’s a simple enough problem: Knowledge stored in books is available only to those with access to the physical medium in which the information is contained. Knowledge stored online is available anywhere. Books are pricey, but bits are free, or nearly so. The staggering quantity of paper documents that don’t exist at all in digital form (including, as of last year, 99.94% of the textual holdings in the Library of Congress) is an impediment to learning. Projects that do something, anything, to bring formerly print-only information into the digital age surely deserve applause for that reason alone.
So it feels a little churlish to point out one of the shortcomings of public.resource.org’s Federal Reporter scans. The scanning process produces graphic files (enormous graphic files, at that) which can be OCR’ed and turned into error-filled text, which can in turn be cleaned up and turned into error-free text. But what’s missing at the end of that process is one of the strongest comparative advantages of information sharing in the digital world: hypertext.
Paging through public.resource.org’s Federal Reporter database, I looked up an old opinion that I happened to have worked on while I was clerking many years ago. All the text seemed to be intact, but all the formatting was gone: no indented block quotes, no emphasis, and section headings run together haphazardly with the surrounding body text. That is no big deal, of course (you’d lose that much or more just in converting to MS Word from WordPerfect, and the court’s reasoning is not affected), but it’s still a little sad to see that structural and contextual data lost in the OCR process. What really got me, though, was the absence of hyperlinks from the court’s opinion to any of the other authorities cited therein, authorities that were, in many cases, contemporaneously digitized and put online by public.resource.org itself.
My favorite feature on Wikipedia is the “What Links Here” tool — there it is, if you haven’t noticed, in the little Toolbox block along the left-hand side of the page. “What Links Here” reveals the often surprising contextual linkages among Wikipedia articles. Back-tracing the interrelationships among concepts on Wikipedia via “What Links Here” is frequently as illuminating and entertaining as the articles themselves.
The usefulness of such functionality for legal source materials ought to be self-evident. It should be a simple matter to see which cases have been linked to by which other cases; which cases have construed which provisions of which statutes; which cases have been reasonably faithfully followed by later courts and which have been scandalously misinterpreted; and so forth. But it is not a simple matter, because in moving from “flat” text on paper to “flat” text online, we’ve lost the contextual depth that the medium of hypertext provides when well used.
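To make the idea concrete, here is a rough sketch of how a “What Links Here” index for case law could be built from flat opinion text. It is only an illustration under some loud assumptions: it recognizes nothing but Federal Reporter citations of the form “123 F. 456”, “123 F.2d 456” or “123 F.3d 456”, it assumes each opinion is already keyed by its own citation, and the function names (`extract_citations`, `build_backlinks`) and the sample opinions are made up for the example. It says nothing about how public.resource.org actually processes its scans, and real OCR output would need far more forgiving matching.

```python
import re
from collections import defaultdict

# Assumption: only Federal Reporter citations are recognized, e.g.
# "123 F. 456", "123 F.2d 456", "123 F. 2d 456", "123 F.3d 456".
CITATION_RE = re.compile(r"\b(\d+)\s+F\.\s?(2d|3d)?\s+(\d+)\b")


def normalize(volume, series, page):
    """Render a citation in one canonical spelling, e.g. '123 F.2d 456'."""
    return f"{volume} F.{series or ''} {page}"


def extract_citations(text):
    """Return the set of normalized Federal Reporter citations found in text."""
    return {normalize(*m.groups()) for m in CITATION_RE.finditer(text)}


def build_backlinks(opinions):
    """Map each cited case to the set of cases that cite it.

    `opinions` is a dict of {citation of the opinion: opinion text}.
    """
    backlinks = defaultdict(set)
    for citing_case, text in opinions.items():
        for cited_case in extract_citations(text):
            if cited_case != citing_case:  # ignore self-references
                backlinks[cited_case].add(citing_case)
    return backlinks


# Hypothetical sample data, invented purely for illustration.
opinions = {
    "10 F. 100": "Plaintiff sued.",
    "20 F. 200": "We follow 10 F. 100 and distinguish 15 F.2d 50.",
    "30 F.2d 300": "See 10 F. 100; see also 20 F. 200.",
}

backlinks = build_backlinks(opinions)
print(sorted(backlinks["10 F. 100"]))  # ['20 F. 200', '30 F.2d 300']
```

The same citation matches that feed the index could also be wrapped in hyperlinks pointing forward to the cited opinions, so one pass over the cleaned-up text would yield both the outbound links and the “what cites this case” view.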