A Big Day for Open Access, But More Work Remains

Via BoingBoing comes news of another big crop of United States Court of Appeals decisions being scanned and made publicly available by public.resource.org. They are scanning the entire Federal Reporter (First Series), which includes late-19th and early-20th century United States case law. Enormous PDFs (and even more enormous TIFFs) of the scanned volumes are available here. A few of the volumes have been OCR’ed, with the rest to follow. The OCR text is much in need of cleanup, as this illustrative page shows. But public.resource.org has lots of experience at this, as you can see from what they’ve already accomplished with their scans of F.2d and F.3d.

The move from paper to digital is a daunting challenge, and one we should all be pleased to see farsighted organizations tackling. It’s a simple enough problem: Knowledge stored in books is available only to those with access to the physical medium in which the information is contained. Knowledge stored online is available anywhere. Books are pricey, but bits are free, or nearly so. The staggering quantity of paper documents that don’t exist at all in digital form (including, as of last year, 99.94% of the textual holdings in the Library of Congress) is an impediment to learning. Projects that do something, anything, to bring formerly print-only information into the digital age surely deserve applause for that reason alone.

So it feels a little churlish to point out one of the shortcomings of public.resource.org’s Federal Reporter scans. The scanning process produces graphic files (enormous graphic files, at that) which can be OCR’ed and turned into error-filled text, which can in turn be cleaned up and turned into error-free text. But what’s missing at the end of that process is one of the strongest comparative advantages of information sharing in the digital world: hypertext.

Paging through public.resource.org’s Federal Reporter database, I looked up an old opinion that I happened to have worked on while I was clerking many years ago. All the text seemed to be intact, but all the formatting was gone: no indented block quotes, no emphasis, and section headings run together haphazardly with the surrounding body text. No big deal, of course (you’d lose that much or more just in converting to MS Word from WordPerfect, and the court’s reasoning is not affected), but it’s still a little sad to see that structural and contextual data lost in the OCR process. What really got me, though, was the absence of hyperlinks from the court’s opinion to any of the other authorities cited therein, authorities that were, in many cases, contemporaneously digitized and put online by public.resource.org itself.

My favorite feature on Wikipedia is the “What Links Here” tool — there it is, if you haven’t noticed, in the little Toolbox block along the left-hand side of the page. “What Links Here” reveals the often surprising contextual linkages among Wikipedia articles. Back-tracing the interrelationships among concepts on Wikipedia via “What Links Here” is frequently as illuminating and entertaining as the articles themselves.

The usefulness of such functionality for legal source materials ought to be self-evident. It should be a simple matter to see which cases have been linked to by which other cases; which cases have construed which provisions of which statutes; which cases have been reasonably faithfully followed by later courts and which have been scandalously misinterpreted; and so forth. But it isn’t a simple matter, because in moving from “flat” text on paper to “flat” text online, we’ve lost the contextual depth that the medium of hypertext provides when well used.
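To make the idea concrete, here is a minimal sketch of how a “What Links Here” index for case law might be built from nothing more than cleaned-up OCR text. Everything in it is an assumption made for illustration: the function names (normalize, build_backlinks), the simplified citation pattern, and the sample opinions are all hypothetical, and nothing here describes how public.resource.org or anyone else actually processes these files. Real citations come in far more shapes than this pattern handles.

```python
import re
from collections import defaultdict

# Simplified, illustrative pattern for Federal Reporter citations such as
# "50 F. 10", "10 F.2d 20", or "7 F.3d 89". Real-world citation formats
# (parallel cites, pin cites, "F. Supp.", etc.) are not handled here.
CITATION_RE = re.compile(r"\b(\d+)\s+F\.\s?(2d|3d)?\s+(\d+)\b")


def normalize(match):
    """Turn a regex match into one canonical citation string."""
    volume, series, page = match.groups()
    reporter = f"F.{series}" if series else "F."
    return f"{volume} {reporter} {page}"


def build_backlinks(opinions):
    """Map each cited case to the set of cases that cite it.

    `opinions` is a dict of {citation of the opinion: plain text of the opinion}.
    """
    backlinks = defaultdict(set)
    for citing, text in opinions.items():
        for match in CITATION_RE.finditer(text):
            cited = normalize(match)
            if cited != citing:  # ignore an opinion's reference to itself
                backlinks[cited].add(citing)
    return backlinks


# Hypothetical sample data, invented for this example.
opinions = {
    "100 F. 200": "We follow the rule stated in 50 F. 10 and 75 F. 300.",
    "10 F.2d 20": "See 50 F. 10; compare 100 F. 200.",
}

backlinks = build_backlinks(opinions)
print(sorted(backlinks["50 F. 10"]))
```

The same index answers the “which cases cite this one?” question for every opinion at once, which is the part that flat text, on paper or online, cannot do by itself. Judging whether a later court followed or misread an earlier one would still take a human reader.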

4 Responses to “A Big Day for Open Access, But More Work Remains”

  1. It is too bad that they don’t have hyperlinking – I think that is one of the most valuable things about researching on Westlaw or Lexis. At least they are getting stuff online, though!

    On another topic, I’d be really interested in anyone on this blog’s take on the “Spokeo” site that I just encountered today – if you go by their PR, it’s a great way to know everything your friends are doing online. If you go by my 5-10 minute experience with it, it’s a creepy stalker tool, an invasion of privacy, and unprecedented in the way it aggregates online content by any one person.

    I wrote about it in more detail here: http://foresthouse.livejournal.com/448242.html

    (Note: I’m not trying to plug my journal or anything – I am really, REALLY concerned about this, and since info privacy law on the internet isn’t my expertise, I think it would be very helpful if someone here who is more informed could blog about this from a legal perspective.)

  2. AltLaw (and others) are working on the hyperlinking and formatting part of the equation. Think of it as division of labor. :)

  3. Good point, Luis — that is, after all, one of the major points of Everything Is Miscellaneous, isn’t it? All the meta-info beyond the bare text can, and maybe should, be crowdsourced. Looking at AltLaw’s copy of the same case I referenced above, I see that they do indeed have hyperlinks to the S. Ct. and court of appeals opinions, although not to the U.S. Code (which is freely available online in multiple locations) or the cited provisions of the legislative history (which often, rather scandalously in my view, isn’t). Now can AltLaw just whip up a “What Links Here” button? Then we could really begin doing some actual research. :-)

  4. [...] These drawbacks aren’t universally shared. For an example of a government web site that gets it (mostly) right, have a look at the Supreme Court. The quill-pen set may not allow cameras in the courtroom, but they’ve created an online presence that other public agencies should envy, with downloadable transcripts dating back eight years and complete volumes of the U.S. Reports since 1991. There’s a strong bias in favor of the contemporary, as with most government-operated sites; it has typically fallen to outside groups like public.resource.org and AltLaw to bring older judicial records into the digital age. [...]