To its credit, the U.S. government has placed a tremendous quantity of legal information online. You can look up any patent ever issued at the USPTO’s web site and see either the full text (since 1976) or a scanned image (since 1790) of the issued patent. Pending legislation can be downloaded from THOMAS, which also tracks the status of pending bills. More legislative and executive information is available at the GPO Access site.
Many government web sites, however, share some common design flaws. First, they require you to have a fairly precise idea what you are looking for. Text searching of pre-1976 patents can’t be done; you must know a precise patent number to retrieve the images. To retrieve current legislation from GPO Access’s Statutes at Large collection, you must supply either the public law number or a precise citation by volume and page number. Searching is limited and barely functional, and browsing, one of the defining modalities of accessing information on the internet, isn’t possible. In effect, you must do your research elsewhere, and then come get what you wanted to find from the government in the first place.
Second, many of these sites make offline access to content difficult or impossible. The USPTO will, for a small fee, e-mail you the page scans of a patent so you can browse them at the pool, but many sites offer nothing comparable. As net access becomes ever more ubiquitous and universal, perhaps this objection will recede in importance, but at the moment, at least, it seems needlessly old-fashioned.
Third, in many instances (pre-1976 patents among them), all you get is scanned images, one per page. The technology certainly exists to run OCR on the fly and extract all the text from each scanned page of a multipage document (see, e.g., this site), but if you want even a rudimentary text version of most government documents, you’ll have to produce it yourself. On the internet, content that doesn’t exist in text-searchable form can’t be indexed, which means that for many purposes, it may as well not be online at all.
These drawbacks aren’t universally shared. For an example of a government web site that gets it (mostly) right, have a look at the Supreme Court. The quill-pen set may not allow cameras in the courtroom, but they’ve created an online presence that other public agencies should envy, with downloadable transcripts dating back eight years and complete volumes of the U.S. Reports since 1991. There’s a strong bias in favor of the contemporary, as with most government-operated sites; it has typically fallen to outside groups like public.resource.org and AltLaw to bring older judicial records into the digital age.
To try to imagine what it might look like if a government-information site were modeled on the Supreme Court’s example, I created Early United States Statutes, a repository that currently contains the first twenty volumes of the United States Statutes at Large. The content came from the Library of Congress, whose A Century of Lawmaking for a New Nation project includes page scans of a wealth of historical information. On the Library’s site, the early Statutes at Large can be viewed one image at a time, but there is no way to download complete volumes and no OCR text for any page. I wrote two very short shell scripts to download the scans of each volume from the Library of Congress’s web site, clean up the images, and stitch them all together into a single large file (which is offered in two different formats, the familiar PDF and the newer DjVu). En route, the scans were passed through the Google-sponsored Tesseract OCR software, and the results have been made available for download as well.
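For readers curious what a pipeline of this kind involves, here is a rough sketch in Python. It is not the pair of shell scripts described above, only an illustration under stated assumptions: the URL pattern, the file names, and the function `plan_volume` are all hypothetical, and the particular clean-up options are my guess at a reasonable starting point. The sketch only builds the list of commands for one volume (fetch each page with `curl`, straighten and despeckle it with ImageMagick's `convert`, extract text with `tesseract`, then combine the cleaned pages into one PDF) and does not run anything. DjVu output is left out.

```python
import shlex
from pathlib import Path

# Hypothetical URL pattern: the real Library of Congress paths are not
# given in this post, so example.org stands in for them.
SCAN_URL = "https://example.org/scans/vol{volume:03d}/page{page:04d}.tif"


def plan_volume(volume, pages, workdir="work"):
    """Return the commands for one volume as argument lists, without running them."""
    out = Path(workdir) / f"vol{volume:03d}"
    commands = []
    cleaned = []
    for page in range(1, pages + 1):
        raw = out / f"raw{page:04d}.tif"
        clean = out / f"clean{page:04d}.tif"
        text_base = out / f"page{page:04d}"  # tesseract appends ".txt" itself
        url = SCAN_URL.format(volume=volume, page=page)
        # 1. Download the page scan.
        commands.append(["curl", "-sS", "-o", str(raw), url])
        # 2. Clean it up (assumed settings: straighten the page, remove specks).
        commands.append(["convert", str(raw), "-deskew", "40%", "-despeckle", str(clean)])
        # 3. OCR the cleaned page to a plain-text file.
        commands.append(["tesseract", str(clean), str(text_base)])
        cleaned.append(str(clean))
    # 4. Stitch every cleaned page into a single PDF for the volume.
    commands.append(["convert", *cleaned, str(out / "volume.pdf")])
    return commands


if __name__ == "__main__":
    # Print the plan for a two-page test volume so it can be inspected first.
    for cmd in plan_volume(1, 2):
        print(shlex.join(cmd))
```

Printing the plan rather than executing it makes it easy to check the commands on a couple of pages before committing to a download of several hundred; each list could then be handed to `subprocess.run` once the work directory exists.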
In substance, the content of Early United States Statutes is the same as that offered by the Library of Congress, but in form, it is (I hope) more useful, since it enables offline browsing and searching of the text. The OCR text output is pretty raw, and I have made no attempt to clean it up, although avenues exist for doing work of that sort, and the raw OCR text could be useful as an initial input to future projects to foster open access through crowdsourcing.
Now, I’m clearly way out of my depth here; I’m a law professor with no training in archiving, managing, or retrieving information, and barely enough facility with information technologies to create the rudimentary scripts that produced the content now online at Early United States Statutes. Those drawbacks aside, though, the site does offer public information in a way that is, at least by some criteria, better than what the Library of Congress has offered, and I’d like to see many more such efforts spring up. In the future, it will probably take someone with feet planted in both the library and technology communities (a John Palfrey type, perhaps?) to get open-access projects like this off the ground and keep them moving. Whoever solves that problem first, though, and brings historical records online in a way that is browseable and fully searchable, will have done a great service to knowledge.