Public records, one JPEG at a time?

To its credit, the U.S. government has placed a tremendous quantity of legal information online. You can look up any patent ever issued at the USPTO’s web site and see either the full text (since 1976) or a scanned image (since 1790) of the issued patent. Pending legislation can be downloaded from THOMAS, which also tracks bills’ status. More legislative and executive information is available at the GPO Access site.

Many government web sites, however, share some common design flaws. First, they require you to have a fairly precise idea of what you are looking for. Text searching of pre-1976 patents can’t be done; you must know the precise patent number to retrieve the images. To retrieve legislation from GPO Access’s Statutes at Large collection, you must supply either the public law number or a precise citation by volume and page number. Searching is limited and barely functional, and one of the defining modalities of accessing information on the internet—browsing—isn’t possible. In effect, you must do your research elsewhere, and then return to the government site merely to retrieve what you were looking for in the first place.

Second, many of these sites make offline access to content difficult or impossible. The USPTO will, for a small fee, e-mail you the page scans of a patent so you can browse them at the pool, but many sites offer nothing comparable. As net access becomes ever more ubiquitous and universal, perhaps this objection will recede in importance, but at the moment, at least, it seems needlessly old-fashioned.

Third, in many instances (pre-1976 patents among them), all you get is scanned images, one per page. The technology certainly exists to run on-the-fly OCR that extracts all the text from each scanned page of a multipage document (see, e.g., this site), but if you want even a rudimentary text version of most government documents, you’ll have to produce it yourself. And on the internet, content that doesn’t exist in text-searchable form can’t be indexed, which means that for many purposes it may as well not be online at all.
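
By way of illustration, extracting the text of a single scanned page takes only a couple of commands with freely available tools (a sketch only: the filename is made up, it assumes ImageMagick and Tesseract are installed, and older Tesseract builds want uncompressed TIFF input):

    # Hypothetical filename; convert the scan to TIFF for Tesseract,
    # then extract its text.
    convert page-001.jpg page-001.tif
    tesseract page-001.tif page-001   # writes page-001.txt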

These drawbacks aren’t universally shared. For an example of a government web site that gets it (mostly) right, have a look at the Supreme Court. The quill-pen set may not allow cameras in the courtroom, but they’ve created an online presence that other public agencies should envy, with downloadable transcripts dating back eight years and complete volumes of the U.S. Reports since 1991. There’s a strong bias in favor of the contemporary, as with most government-operated sites; it has typically fallen to outside groups like public.resource.org and AltLaw to bring older judicial records into the digital age.

To try to imagine what it might look like if a government-information site were modeled on the Supreme Court’s example, I created Early United States Statutes, a repository that currently contains the first twenty volumes of the United States Statutes at Large. The content came from the Library of Congress, whose A Century of Lawmaking for a New Nation project includes page scans of a wealth of historical information. On the Library’s site, the early Statutes at Large can be viewed one image at a time, but there is no way to download complete volumes and no text version of any page. I wrote two very short shell scripts to download the scans of each volume from the Library of Congress’s web site, clean up the images, and stitch them together into a single large file (offered in two formats, the familiar PDF and the newer DjVu). En route, the scans were passed through Google’s Tesseract OCR software, and the resulting text has been made available for download as well.
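
In outline, the scripts do something like the following (a simplified sketch rather than the scripts themselves; the URL, page count, and filenames are placeholders, and it assumes wget, ImageMagick, the DjVuLibre tools, and Tesseract are installed):

    #!/bin/sh
    # Simplified sketch of the volume-building pipeline; the URL and
    # filenames below are placeholders, not the Library's real paths.
    for n in $(seq -w 1 800); do
      # 1. Fetch each page scan of the volume.
      wget -O "page-$n.jpg" "http://example.gov/statutes/vol01/$n.jpg"
      # 2. Clean up the scan and convert it to TIFF for OCR.
      convert "page-$n.jpg" -despeckle -deskew 40% "page-$n.tif"
      # 3. Extract the raw text of the page (writes page-$n.txt).
      tesseract "page-$n.tif" "page-$n"
      # 4. Encode the page for the DjVu edition.
      c44 "page-$n.jpg" "page-$n.djvu"
    done
    # 5. Stitch the pages into single-file editions of the volume.
    convert page-*.tif vol01.pdf        # PDF edition
    djvm -c vol01.djvu page-*.djvu      # DjVu edition
    cat page-*.txt > vol01.txt          # raw OCR text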

In substance, the content of Early United States Statutes is the same as that offered by the Library of Congress, but in form it is (I hope) more useful, since it enables offline browsing and full-text searching. The OCR output is pretty raw, and I have made no attempt to clean it up, although avenues for that sort of work exist; the raw text could serve as an initial input to future crowdsourced projects to foster open access.

Now, I’m clearly way out of my depth here; I’m a law professor with no training in archiving, managing, or retrieving information, and barely enough facility with information technologies to create the rudimentary scripts that produced the content now online at Early United States Statutes. Those drawbacks aside, though, the site offers public information in a way that is, by some criteria at least, better than what the Library of Congress has offered, and I’d like to see many more such efforts spring up. In the future, it will probably take someone with feet planted in both the library and technology communities—a John Palfrey type, perhaps?—to get open-access projects like this off the ground and keep them moving. Whoever first solves that problem and brings historical records online in a way that is browsable and fully searchable will have done a great service to knowledge.

2 Responses to “Public records, one JPEG at a time?”

  1. That’s quite a nice effort there; the content looks good!

    Personally, I find that the real problem here is not a lack of available data; it’s the lack of a uniform method of accessing that data. Having so many different ad hoc ways to access this data (mostly raw HTTP and web scraping) simply isn’t sufficient. THOMAS has done a better job than most in providing a standardized means of access, but it’s all “one-off”: the code I write to access data via THOMAS is nothing like what I have to write to get S.Ct. cases, or to get patents.

    I have thought of numerous intriguing things I could do if I were able to access this wealth of information programmatically. But before we can do any of these neat things, we need to be able to mine the data. Before we can do real mining, we need to be able to build a searchable repository (all real search efficiency is driven by indexed data, ultimately depending on a single, if distributed, search index). And before we can do all that, we need a uniform API to access this diverse data.

    With this API acting as a wrapper around all of these data sources, the world of imaginable research becomes your oyster.
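
    To illustrate (with entirely made-up endpoints), a uniform API would let the same client code reach every collection through a single pattern:

        # Entirely hypothetical endpoints: the point is one access
        # pattern for bills, cases, and patents alike.
        curl "http://api.example.gov/v1/bills/110/hr1424?format=xml"
        curl "http://api.example.gov/v1/scotus/550us544?format=xml"
        curl "http://api.example.gov/v1/patents/7000000?format=xml"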

    However, developing this API is no small task, and as much of a fan of open-source software as I am, I have little faith that the community development process could produce such a complex API efficiently.

  2. JPEGs are not ideal because, being images, they are more difficult to print. Documents scanned this way are sometimes a nightmare to read on screen, because they are slow to load, and printing them can mean printing a gray, blurry background (depending on the scan quality).