There have been some interesting conversations lately regarding the OAI-ORE specification, especially its use of Atom as a serialization format. I have great respect for the folks behind OAI-ORE, and their success with OAI-PMH is well-deserved. But the world has changed quite a lot since OAI-PMH was released and filled a pressing need: we had the RSS wars (RSS 2.0 won), we had the REST vs. WS-* wars (REST won), then some very smart folks came along to basically moot the RSS wars with a well-specified but simple format called Atom. Based on REST principles, and doing syndication one better by also specifying a “write back” protocol, Atom seems to embody what’s best about web architecture.
Not only is it a great specification, with all kinds of applicability outside of its blog-oriented upbringing, but the process of creating the spec was, and continues to be, open. I cannot begin to express how useful and enlightening it has been to participate in ongoing discussions on Atom mailing lists, and to peruse the “back issues” to gain clarity on design decisions that guide my implementation. I have little doubt that Atom, a more widely supported specification than OAI-PMH, could supplant OAI-PMH, and I think that would be a good thing. Why? Because we could begin to see a dismantling of the silos that hold “library” or “archive” material on the web and keep it distinct from all of the other great stuff on the web. The distinctions are completely arbitrary and artificial anyway, since machines can do all of the understanding they need to do with a URI and a MIME type. By the way, when I say Atom is widely supported, I mean really widely supported.
I think the ORE specification is taking a more “designed by a panel of experts” approach than Atom. That’s fine, but I am not sure you can design a good specification these days, and expect it to be used, unless you bring the potential implementors into the discussion. I want to see a widely-applicable specification for describing aggregations of web-accessible resources, but if this is a DSpace/Fedora/ePrints only solution, I’ll bow out of the conversation gracefully. Of course we must be careful about our aims: we already have a pretty darn good means to describe aggregations of web-accessible resources and it is called HTML. So what is the value of ORE?
A lot, really, and it is not so different from other efforts (such as XLink) to “link” resources on the web in a way that allows us to make assertions about that link or about a collection of links (a “named graph”). But where is the proper abstraction? The “hypermedia as the engine of application state” REST principle seems to suggest that the resource itself should carry its own semantics along with it, the possible state changes (“links”) providing the best context for deriving meaning. Of course PageRank taught us that incoming links are also valuable for deriving meaning. So where are we left? What sort of named graphs do we want to create, and towards what end? What is the added value? Here is where very specific use cases, well thought out and described, would be useful (an absolute necessity, really). I did see mention of one project aimed at moving journals from JSTOR into DSpace (exactly the sort of “use case” that is helpful to ponder), but that’s it. Now the conversation rages about Atom/not-Atom, isomorphic/non-isomorphic serializations, etc., but with no “reality checks” against possible implementations. It’s all very abstract, and although it may result in some highly scholarly journal articles, I fear that those same articles will end up stuck in whatever institutional repository holds them, awaiting the arrival of the Semantic Web to be linked and discoverable to the rest of the world, while students continue to cite Wikipedia, since “that’s what a Google search turned up.” Count me as a fan of Wikipedia; I would just like to see peer-reviewed scholarship as easily and readily discoverable. And I don’t just mean SEO: I want to see new resources created, “named graphs” if you will, that give me new contexts in which to discover new resources. My favorite “named graphs” these days are the front page of the New York Times (web edition) and the programming page on Reddit.
Aggregations par excellence, and quite valuable, if much more ephemeral than what the OAI folks had in mind. But I’ll guarantee you that a historian 50 years from now will love to be able to track the day-to-day (or hour-to-hour, or minute-to-minute) changes to that front page. Do we have a way to do that reliably? Sure we do. My favorite currently is wget. Combine it with a nice version control system like git, and you are good to go. Preservable aggregations, and no (new) spec necessary.
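The wget-plus-git workflow above is simple enough to sketch. Here is a minimal, hypothetical version in Python (the function names and directory layout are my own; it just assembles the wget and git command lines, which you would hand to subprocess.run on a cron schedule):

```python
def wget_command(url, dest_dir):
    """Build the wget invocation for one snapshot of a page.

    --page-requisites pulls in the images/stylesheets the page needs;
    --convert-links rewrites links so the snapshot browses locally.
    """
    return [
        "wget",
        "--page-requisites",
        "--convert-links",
        "--directory-prefix", dest_dir,
        url,
    ]


def git_commit_commands(repo_dir, message):
    """Build the git commands that record the snapshot as one commit."""
    return [
        ["git", "-C", repo_dir, "add", "-A"],
        ["git", "-C", repo_dir, "commit", "-m", message],
    ]


# Typical use (not run here): for each scheduled snapshot,
#   subprocess.run(wget_command(url, repo_dir), check=True)
#   for cmd in git_commit_commands(repo_dir, "hourly snapshot"):
#       subprocess.run(cmd, check=True)
# and git's history becomes the minute-to-minute record of the page.
```

Each commit timestamps one state of the front page, so `git log` and `git diff` give the historian exactly the change-by-change view described above.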
As a little experiment, I wrote a Python script that creates an aggregation of a web resource (I chose an online journal article that has been used in some of the OAI-ORE examples). It simply uses wget to grab the article from the web, with the recursion level set to 1, so I also get the resources (style sheet, images, PDF copy, journal front page) that the article itself links to. I then create an Atom resource map that is nothing more than a listing of all of those “aggregated resources,” with a timestamp (atom:updated) on each indicating when it was aggregated. Pretty handy in some ways, and I could imagine real usefulness for such a tool. If nothing else, it provides a one-stop machine-readable manifest for the directory of files that wget created for me.
see the code here: aggregator.py.html
see the resulting resource map here: aggregation_one.atom
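In the same spirit, here is a minimal sketch of the resource-map step on its own, assuming wget has already left you a flat list of harvested file paths (the function name and base-URL handling are my own, not from aggregator.py):

```python
import os
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize with Atom as the default namespace


def build_resource_map(base_url, paths, title="Aggregation"):
    """Return an Atom feed listing each harvested file as an entry.

    atom:updated on each entry records when the file was aggregated,
    matching the resource-map idea described above.
    """
    now = datetime.now(timezone.utc).isoformat()
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = title
    ET.SubElement(feed, f"{{{ATOM}}}id").text = base_url
    ET.SubElement(feed, f"{{{ATOM}}}updated").text = now
    for path in paths:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}title").text = os.path.basename(path)
        link = ET.SubElement(entry, f"{{{ATOM}}}link")
        link.set("rel", "alternate")
        link.set("href", base_url.rstrip("/") + "/" + path.lstrip("/"))
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = now
    return ET.tostring(feed, encoding="unicode")
```

Feed the function the directory listing wget produced and you get the one-stop machine-readable manifest: one atom:entry per aggregated resource, each carrying its aggregation timestamp.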
Now a thought experiment: what if this were the starting point of an effort to really express aggregations in Atom? First, I think we’d need to define what sorts of content types might be aggregated (web site, journal article, CV, blog, encyclopedia article, etc.) and spell out some best practices for modelling those in Atom, keeping in mind that you are limited to things like atom:title, atom:subtitle, atom:category, atom:source, etc. for that modelling. That minimalism is quite useful, actually, and keeps us from the temptation to repeat semantics already established in the items themselves. atom:category is the real gem here: it is the key to making Atom “work” in all kinds of scenarios.
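To make the atom:category idea concrete: an entry can declare its content type with a term drawn from an agreed scheme. A small sketch (the scheme URI and term vocabulary here are invented placeholders, not anything a best-practices effort has actually defined):

```python
from xml.etree import ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

# Hypothetical scheme URI; a real guideline would pin this down.
CONTENT_TYPE_SCHEME = "http://example.org/terms/content-type"


def categorized_entry(title, content_type):
    """Return an atom:entry tagged with its content type via atom:category."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    cat = ET.SubElement(entry, f"{{{ATOM}}}category")
    cat.set("term", content_type)        # e.g. "journal-article", "cv", "blog"
    cat.set("scheme", CONTENT_TYPE_SCHEME)
    return ET.tostring(entry, encoding="unicode")
```

Because atom:category pairs a term with a scheme, different communities can layer their own vocabularies onto plain Atom without inventing new elements, which is exactly what makes it the gem for this kind of modelling.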
Atom is quite good for the uses for which it was intended, and for many uses outside of that (I note particularly Google’s GData services). But it is also almost perfectly suited to the use cases for which the Open Archives Initiative was originally chartered. What is needed is guidance: essentially, an implementor’s guide to Atom for these use cases. Do we need another spec for that? Perhaps not. But why not channel that library/archive expertise into Atom implementation guidelines, and begin working with the Atom community to create standardized practices and extensions? Currently, I need to rely on the troubled Media RSS extensions (troubled, but widely used and supported) to describe media assets attached to an Atom entry. I would much prefer a richer and more carefully designed extension to atom:link or atom:content, or even a better “Media RSS” extension; that’s where the help is needed. And it involves not just designing something and throwing it out there, but the hard work of educating, listening, consensus building, etc. Not, I suspect, the most natural approach for the library/archives world. Real leadership (an immersion in what’s going on outside the library walls, a willingness to educate folks about what libraries and archives are all about, the humility to recognize when others have beaten us at our own game) is what is needed now. It’s actually an incredible opportunity.