Open Access Law: Two Cheers for Northwestern

Via Larry Solum’s Legal Theory Blog comes word of an important announcement from the editors of the Northwestern University Law Review. The editors have been paying close attention to the open-access debate (see here for Bill’s terrific compilation of links to many of the most interesting recent posts), and after giving the matter careful thought, are putting themselves squarely on the side of the good guys. From their announcement:

Starting with the fourth issue of our ninety-ninth volume and moving forward, all of our content has been, and will continue to be, available as a PDF download through our past issues tab. As a result, anyone will be able to find Northwestern University Law Review content using an internet search engine, and download it for free. Furthermore, we will maintain a fully permissive policy regarding authors who wish to post drafts of their forthcoming articles to SSRN, Bepress or other locations on the web. That’s the easy part.

The hard part is that we are currently sitting on a mountain of information which is not readily convertible to PDF format — nearly 100 years of scholarship published solely in print in the Law Review. We are committed to making this information freely available as well. However, the technical and financial challenges that accompany scanning the mountain of material that was published before PDFs existed make this a project that will be ongoing, and contingent on donated funding.

This is really wonderful news, particularly the part about bringing the Review's older printed material into the modern era of digital permanence. But I still think that Larry Solum's "three cheers for the Northwestern University Law Review" is too generous by a factor of one cheer.

Issuing current scholarship in PDF format makes a certain amount of logistical sense. What was a relative rarity (although not at all unheard of) when I went to law school is now commonplace: spurred by economic considerations, more and more journals have taken the digital typesetting process in-house. Articles are edited in digital format (chiefly Microsoft Word, it seems) and laid out according to the journal’s own in-house templates, yielding a final PDF copy that can then be transmitted electronically to be printed and bound. Because the journal is going to produce a PDF copy anyway for its own purposes, it takes little extra effort to post the PDF online, and voilà, you’ve gone open-access.

But in circumstances where a journal wouldn't be producing PDFs for independent in-house reasons anyway (such as when digitizing one's back catalog), why standardize on PDF? PDF is fine, but it's still a proprietary format, and one that makes sense chiefly when one needs to preserve the exact layout and formatting of a page to be printed. Why do that with the older issues? Why not simply convert them to HTML/XHTML/XML instead?

Posting journal articles online in a truly open format like HTML is a way of bringing those articles, even older articles, into the contemporary debate in a way that PDF can’t hope to match. To take only two examples:

  • If an HTML-ized article has been properly tagged, other web sites can hyperlink directly into the portion of the article they're citing (see the sketch after this list). I can write an article with a link that takes the reader directly into the portion of the 100-page original piece that I'm most interested in. It's needlessly clumsy to give readers a link to a PDF-format article and then tell them to manually locate page 478 within it to find the part being cited. This is one of the limitations of the printed page that the Web potentially frees us from, if we simply take advantage of it.
  • Cutting and pasting text from a PDF for purposes of quotation is a dicey proposition. It's altogether impossible if the PDF has been generated as a series of scanned images rather than as OCR'ed text; and even when there is text within the PDF that can be copied and pasted, the original's formatting may come along for the ride. The quoter is then left removing surplus line breaks, deleting surplus hyphens, and generally "fighting" the page formatting of the original piece in circumstances where it's no longer relevant.
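
To make the first point concrete, here is a minimal sketch of the kind of markup involved (the tags and URL are hypothetical, my own invention rather than any journal's actual practice). Each section or original page of the HTML-ized article gets a named anchor, and any outside site can then link straight to it:

    <!-- in the HTML-ized article -->
    <h2><a name="part-III">III. The Argument</a></h2>

    <!-- on a citing site: a link that lands the reader on Part III -->
    <a href="http://www.example.edu/lawreview/vol99/article.html#part-III">
      the portion of the article I'm most interested in</a>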

Even where the original page formatting is important for some reason—such as for the purpose of enabling pinpoint cites—that alone isn’t a sufficient reason to use PDF, it seems to me. For example, FindLaw posts Supreme Court opinions in HTML format, with the original page breaks maintained intact through small intralineal notations in a different color. Unobtrusive, but sufficient to permit citation (and hyperlinking!) directly to the pertinent page of a decision.
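
Markers like that are trivial to implement in HTML. A sketch (my own hypothetical markup, not FindLaw's actual source) might run:

    <p>The court concluded that the statute
    <a name="478"></a><small><font color="green">[*478]</font></small>
    did not reach the defendant's conduct, because ...</p>

A citing page can then link to the article's URL followed by #478 and drop the reader on the exact printed page being cited.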

None of this is meant to detract from the value of what Northwestern is doing, and it would be wonderful to see other journals follow suit. Many of the big-name law reviews have decades' (and in some cases, more than a century's) worth of important pieces in their stacks that aren't accessible online in any form (even in the big-name commercial research databases, which go back only so far). Getting all those resources online in a freely accessible format would be a tremendous gain. Those who strive towards the goal of open access, though, would do well to consider that the greatest strides will come from choosing the most open formats available.

10 Responses to “Open Access Law: Two Cheers for Northwestern”

  1. Prof. Armstrong, my guess for the reasoning behind the PDF format is simply that it is easy to take scanned images and put them into "pages" inside a PDF document. Adobe software has built-in "OCR" such that you can select and copy text from an image that is embedded in the document, and that selected section of the image is converted to text on the fly by Adobe. Why is this an advantage over just using a high-speed scanner and quality OCR software? The problems all stem from the limitations of OCR and the type of formatting used in most journals. OCR is not 100% accurate by any stretch of the imagination, and the text produced by scans of the journal would be full of misread characters and the like. Consequently, the journal would have to review that material for all the gobbledygook that OCR software will produce at times. In addition, they would then have to figure out some convenient way to mark up the scanned text and identify which material is footnotes and which is body text; OCR also has problems with subscript/superscript… The technical hurdles to getting all of that old data into XML plus stylesheets, or HTML, are fairly staggering… Even the big players in online journals (e.g., Hein Online) scan in as images embedded in PDF.

    I would love to see these types of projects go forward in a truly open format, but the ability to do it may be well out of the reach of an individual journal. Perhaps the best way would be to get someone like Google involved?

  2. Just a clarification on my comment above, and a minor disagreement with your second bullet point: images stored in a PDF can be copied and pasted from, though maybe it's just the full version of Acrobat that I use. It "seems" like it is doing OCR on the fly on PDFs I have that contain images. Granted, it's not so good for copy and paste (line breaks are foobared, as are subscript/superscript, as are many case citations: missing dots, volume numbers not translated properly, etc.).

    There are problems with both methods. It's really technically difficult to OCR large volumes of journal text, keep information such as footnotes, and subsequently mark the text up into XML or HTML with CSS. If you scan as images, you run into the copy-and-paste limitations and all of the other problems you mentioned in your post.

    I am hoping that the decision to go with PDF was simply a balancing act: better to get the material out there than to attempt to overcome the much larger technical hurdles involved in getting the data into a really open format.

  3. For what it is worth, Tim, I believe PDF meets the Massachusetts standards for openness. I haven't looked into the legal details myself, but there are probably a lot more PDF implementations in the wild than .odf implementations, so Adobe is doing something right.

  4. Oh, and Tim: AbiWord (I believe both the Windows and Linux versions) will happily open and edit PDF files. Thank Dom Lachowicz, my partner in crime from JP and Urs' class.

  5. AbiWord will indeed open and edit PDF docs on Linux (use it at work all the time). Just don’t get fancy with the formatting ;)

  6. Duke Law School journals should get the three cheers; their current and archived issues (from at least 1997) include articles in both HTML and PDF. Check it out at http://www.law.duke.edu/journals/.

  7. I like what Duke is doing. I don't care for their use of frames, but that's a personal peeve and has nothing to do with opening access to their work. It looks like they've made good use of NAME tags in their HTML, which allows the pinpoint linking from outside that is missing when journals post only PDFs. Now if Duke's journals would only follow Northwestern's lead and get their back catalogs online (academic history did not begin in the late 1990s, after all), they would truly warrant that third cheer.

    To Chris and Luis: good points all around, and I don’t really mean to run down PDF per se. After all, in my Microsoft lock-in post, I applauded a few journals’ decisions to accept PDF-format submissions on the grounds that PDF can be created and edited on any platform. I use PDFs on a daily basis, although I use LaTeX rather than AbiWord (like David Carradine in Kill Bill 2, I’m all about the old school). :-)

    But to acknowledge that PDF is a good tool for some purposes isn't to say that it's the best tool for all purposes. A great deal depends on how smart the creator of the PDF version of an old journal article was: whether they used OCR rather than just scanning images of the page as a bitmap graphic, whether they inserted internal PDF hyperlinks from the page numbers in the table of contents to the corresponding page of the article itself, and on and on. (Chris, on the one hand I'm intrigued by your suggestion that the full version of Acrobat may be able to do OCR "on the fly" to allow cutting and pasting text even where a PDF has been created from a scanned bitmap image. There are a bunch of one-page bitmap scans available at Texas's Current Copyright Literature site where you can put Acrobat to the test. But on the other hand, a document that can be cut-and-pasted only using the full version of a piece of proprietary software that's not available for all platforms doesn't really count as "open," does it?) As long as human effort is going to be necessary anyway when digitizing old articles, though, why not output to HTML rather than PDF? The advantages are many and the costs are few.

    [edited to add:  Maybe something like this would work as a distributed project.  In other words, perhaps the model we should be following isn't Northwestern (please send money so we can digitize our back catalog) or Google (we are rich and can afford to digitize whatever we want), but Wikipedia, or better still, Project Gutenberg.  Northwestern (or any other journal, hint hint, people) could post scans of its old issues page by page, together with the machine-generated OCR output, which will be full of typos and the like, and then delegate to the community the job of cleaning up the text.  I'd happily spend some of my research assistant funding to get students working on projects like this over the summer, even... :-)]

  8. /me runs scared from teh LaTeX. I haven't touched that since long, long ago, in the undergrad/grad CS degree timeframe (circa 1993). Maybe the tools for putting together LaTeX are now better than when I had to script it all by hand… There really is no beating LaTeX for scientific text, but man, what a pain.

    Here’s the way I would approach this problem:
    1) First develop an XML schema for journals, something which can capture the full panoply of datatypes that need to be represented in a journal (footnotes, illustrations, paragraphs, headings, subheadings, etc.).
    2) Develop a tool (or find an existing one) that allows me to tag the OCR'd text with the appropriate XML schema tags.
    3) Develop a stylesheet (or set of stylesheets) to display the data online. (A rough sketch of steps 1 and 3 follows below.)
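
    For what it's worth, here is a rough sketch of what steps 1 and 3 might look like. All of the element names are hypothetical, just to illustrate the idea: a fragment of a tagged article, followed by an XSLT template that renders the page-break markers as named anchors.

        <!-- hypothetical instance of a journal-article schema -->
        <article volume="99" issue="4" firstpage="475">
          <heading>I. Introduction</heading>
          <p>Legal scholarship has long been locked in
            print.<fnref ref="1"/> <pagebreak n="476"/>The web
            changes that calculus.</p>
          <footnote id="1">See the open-access debate, passim.</footnote>
        </article>

        <!-- XSLT template (step 3): turn page breaks into anchors -->
        <xsl:template match="pagebreak">
          <a name="{@n}"><small>[*<xsl:value-of select="@n"/>]</small></a>
        </xsl:template>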

    A community process for actually tagging the data (given the right tagging tools to ensure uniformity of formatting) could work quite well actually.

    The upside to the Google ("we are so rich we can afford to digitize anything") approach is that presumably Google could actually afford, and would have the expertise, to develop either custom OCR software (which they likely already have for their book-scanning project) or custom post-processing software to take a lot of the manual tagging work out of the process.

    If we ever do manage to get the IP law journal off the ground at UC, hopefully we can consider such things from the outset…

  9. Correction to previous comment:
    Acrobat does NOT do automatic OCR. It does something like an OCR "lite". I was looking through the PDF format spec, and what is being done is that there is a text

  10. Doh, watch what you click… cont.
    What is being done is that the PDF format allows page-by-page images, with associated OCR text, to be stored in the PDF document. So, if you have a scanner that knows how to create these PDFs, the scanner creates a page with the image and associates a text block with it. Acrobat then takes the selection area and finds the appropriate text in the text block.
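
    For the curious: the usual trick is PDF's invisible text rendering mode. The page's content stream paints the scanned image first, then draws the OCR text on top of it in render mode 3, which makes no visible marks, so a selection hits the hidden text layer while the eye sees only the image. A simplified, hypothetical content stream (the image name /Im1 and coordinates are made up for illustration):

        q                      % save graphics state
        612 0 0 792 0 0 cm     % scale the image to a letter-size page
        /Im1 Do                % paint the scanned page image (an XObject)
        Q                      % restore graphics state
        BT                     % begin text object
        3 Tr                   % text rendering mode 3: invisible
        /F1 10 Tf              % any font, roughly sized to match the scan
        72 700 Td              % position over the corresponding image line
        (OCR text for this line) Tj   % the selectable, hidden text
        ET                     % end text object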