OAI-ORE & Atom

There have been some interesting conversations lately regarding the OAI-ORE specification, especially its use of Atom as a serialization format. I have great respect for the folks behind OAI-ORE, and their success with OAI-PMH is well-deserved. But the world has changed quite a lot since OAI-PMH was released and filled a pressing need: we had the RSS wars (RSS 2.0 won), we had the REST vs. WS-* wars (REST won), then some very smart folks came along to basically moot the RSS wars with a well-specified but simple format called Atom. Based on REST principles and doing syndication one better by also specifying a “write back” protocol, Atom seems to embody what’s best about web architecture. Not only is it a great specification, with all kinds of applicability outside of its blog-oriented upbringing, but the process of creating the spec was, and continues to be, open. I cannot begin to express how useful and enlightening it has been to participate in ongoing discussions on Atom mailing lists, and to peruse the “back issues” to gain clarity on design decisions that guide my implementation. I have little doubt that Atom, a more widely supported specification than OAI-PMH, could supplant OAI-PMH, and I think that would be a good thing. Why? Because we could begin to see a dismantling of the silos that hold “library” or “archive” material on the web and keep it distinct from all of the other great stuff on the web. The distinctions are completely arbitrary and artificial anyway, since machines can do all of the understanding they need to do with a URI and MIME type. By the way, when I say Atom is widely supported, I mean really widely supported.

I think the ORE specification is taking a more “designed by a panel of experts” approach than Atom.  That’s fine, but I am not sure you can design a good specification these days, and expect it to be used, unless you bring the potential implementors into the discussion.  I want to see a widely-applicable specification for describing aggregations of web-accessible resources, but if this is a DSpace/Fedora/ePrints only solution, I’ll bow out of the conversation gracefully.   Of course we must be careful about our aims: we already have a pretty darn good means to describe aggregations of web-accessible resources and it is called HTML.  So what is the value of ORE?

A lot, really, and it is not so different from other efforts (such as XLink) to “link” resources on the web in a way that allows us to make assertions about that link or about a collection of links (a “named graph”). But where is the proper abstraction? The “hypermedia as the engine of state” REST principle seems to suggest that the resource itself should carry its own semantics along with it, the possible state changes (“links”) providing the best context for deriving meaning. Of course PageRank taught us that incoming links are also valuable for deriving meaning. So where are we left? What sort of named graphs do we want to create, and towards what end: what is the added value? Here is where very specific use cases, well thought out and described, would be useful (an absolute necessity, really). I DID see mention of one project aimed at moving journals from JSTOR into DSpace (exactly the sort of “use case” that is helpful to ponder), but that’s it. Now the conversation rages about Atom/not-Atom, isomorphic/non-isomorphic serializations, etc., but with no “reality checks” against possible implementations. It’s all very abstract, and although it may result in some highly scholarly journal articles, I fear that those same articles will end up stuck in whatever institutional repository holds them, awaiting the arrival of the Semantic Web to be linked and discoverable to the rest of the world, while students continue to cite Wikipedia, since “that’s what a Google search turned up.” Count me as a fan of Wikipedia; I would just like to see peer-reviewed scholarship as easily/readily discoverable. And I don’t just mean SEO: I want to see new resources created (“named graphs” if you will) that give me new contexts in which to discover new resources. My favorite “named graphs” these days are the front page of the New York Times (web edition) and the programming page on Reddit. Aggregations par excellence and quite valuable, if much more ephemeral than what the OAI folks had in mind. But I’ll guarantee you that a historian 50 years from now will love to be able to track the day-to-day (or hour-to-hour, or minute-to-minute) changes to that front page. Do we have a way to do that reliably? Sure we do. My favorite currently is wget. Combine it with a nice version control system like git and you are good to go. Preservable aggregations and no (new) spec necessary.

As a little experiment, I wrote a Python script that would create an aggregation of a web resource (I chose an online journal article that has been used in some of the OAI-ORE examples). It simply uses wget to grab the article from the web, with the recursion level set to ‘1’, so I would also get the resources (style sheet, images, PDF copy, journal front page) that the article itself linked to. I then create an Atom resource map that is nothing more than a listing of all of those “aggregated resources” with a time stamp (atom:updated) on each, indicating when it was aggregated. Pretty handy in some ways, and I could imagine real usefulness for such a tool. If nothing else, it provides a one-stop machine-readable manifest for the directory of files that wget created for me.

see the code here: aggregator.py.html

see the resulting resource map here:  aggregation_one.atom
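For readers who can't reach the linked script, here is a minimal sketch of the same idea. It is not the original aggregator.py: the function names, the example base URL, and the choice of one shared time stamp for every entry are my own assumptions, and the feed omits elements (atom:author, for one) that a fully valid Atom feed would carry.

```python
import os
import subprocess
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://www.w3.org/2005/Atom"


def atom(name):
    """Return a namespace-qualified Atom tag name for ElementTree."""
    return f"{{{ATOM_NS}}}{name}"


def fetch(url, dest_dir):
    """Grab url plus the resources it links to, one level deep (needs wget installed)."""
    # wget exits non-zero if any linked resource fails, so the exit code is not checked.
    subprocess.run(
        ["wget", "--recursive", "--level=1", f"--directory-prefix={dest_dir}", url]
    )


def build_resource_map(root_dir, base_url, title, aggregated_at=None):
    """List every file under root_dir as an atom:entry stamped with the aggregation time.

    aggregated_at is assumed to be a UTC datetime; it defaults to the current time.
    """
    stamp = (aggregated_at or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    ET.register_namespace("", ATOM_NS)

    feed = ET.Element(atom("feed"))
    ET.SubElement(feed, atom("title")).text = title
    ET.SubElement(feed, atom("id")).text = base_url
    ET.SubElement(feed, atom("updated")).text = stamp

    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # walk subdirectories in a stable order
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), root_dir)
            href = base_url + rel.replace(os.sep, "/")
            entry = ET.SubElement(feed, atom("entry"))
            ET.SubElement(entry, atom("title")).text = name
            ET.SubElement(entry, atom("id")).text = href
            ET.SubElement(entry, atom("updated")).text = stamp
            ET.SubElement(entry, atom("link"), href=href)

    return ET.tostring(feed, encoding="unicode")
```

Calling `fetch(article_url, "aggregation_one")` and then `build_resource_map("aggregation_one", "http://example.org/aggregation_one/", "Aggregation one")` would give a manifest along the lines described above; both arguments here are placeholders.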

Now a thought experiment: what if this was the starting point of an effort to really express aggregations in Atom? First, I think we’d need to define what sorts of content types might be aggregated (web site, journal article, CV, blog, encyclopedia article, etc.) and spell out some best practices for modelling those in Atom, keeping in mind that you are limited to things like atom:title, atom:subtitle, atom:category, atom:source, etc. for that modelling. That’s quite useful in its minimalism, actually, and keeps us from the temptation to repeat semantics already established in the items themselves. Atom:category is the real gem here: this is the key to making Atom “work” in all kinds of scenarios.
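To make the atom:category point a bit more concrete, here is a small sketch of an entry that declares its content type through a category. The scheme URI and the term values are hypothetical, invented for illustration; no such vocabulary has been agreed on.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Hypothetical scheme URI naming a shared vocabulary of content types.
CONTENT_TYPE_SCHEME = "http://example.org/schemes/content-type"


def categorized_entry(entry_id, title, updated, content_type):
    """Build an atom:entry whose content type is carried by an atom:category."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    # term says what kind of thing this is; scheme says which vocabulary term comes from.
    ET.SubElement(
        entry,
        f"{{{ATOM_NS}}}category",
        term=content_type,
        scheme=CONTENT_TYPE_SCHEME,
    )
    return entry
```

A client that knows the scheme can then pick out, say, every “journal-article” entry in a feed without needing any element beyond what Atom already defines.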

Atom is quite good for the uses for which it was intended, and for many uses outside of that (I note particularly Google’s GData services). But it is also almost perfectly suited to the use cases for which the Open Archives Initiative was originally chartered. What is needed is guidance: essentially, an implementor’s guide to Atom for these use cases. Do we need another spec for that? Perhaps not. But why not channel that library/archive expertise into Atom implementation guidelines, and begin working with the Atom community to create standardized practices and extensions? Currently, I need to rely on the troubled Media RSS extensions (troubled, but widely used and supported) to describe media assets attached to an Atom entry. I would much prefer a richer and more carefully designed extension to atom:link or atom:content or even a better “Media RSS” extension; that’s where the help is needed. And it involves not just designing something and throwing it out there, but the hard work of educating, listening, consensus building, etc. Not, I suspect, the most natural approach for the library/archives world. Real leadership (an immersion in what’s going on outside the library walls, a willingness to educate folks about what libraries and archives are all about, the humility to recognize when others have beaten us at our own game) is what is needed now. It’s actually an incredible opportunity.

REST Presentation

Here’s the PDF slide set of a talk I gave to the IT community at UT Austin on June 25, 2008.
Real World REST with Atom/AtomPub

t-shirt design

Roy Fielding’s latest post discusses the differences between software implementations, architectures, and architectural styles. Having spent the last few months rewriting DASe to adhere more closely to the REST architectural style, I have come to know the occasionally dizzying challenge of making good design decisions based on a set of principles operating at about 2 or 3 levels of abstraction above the nitty-gritty implementation that I am putting down in code. I had thought that DASe was fairly RESTful to begin with in its original design, but important RESTful constraints were NOT being followed, in fact. It has been an interesting adventure, to say the least, but I am as convinced as ever that the ultimate benefits will be worth it, both for DASe and for my own understanding of distributed information systems.

The architecture of DASe relies heavily on the Atom & AtomPub specifications, which I think is a pretty good base, since those specs are shot through with an understanding of and desire to capitalize on the principles of REST. When REST is discussed it is as likely as not in terms of HTTP and half again as likely to be in terms of Atom/AtomPub. So there’s much to be learned from some very intelligent folks on the subject. And frankly, when a design decision needs to be made, it can be difficult to bring abstract concepts down to the concrete and thus “what does Google do?” or “what does Amazon do?” is often a pretty good start. Same with Atom. Why not let the Atom spec design decisions flow back up into my design? So my database tables tend to have an “updated” column, and an “updated_by” column that map nicely to atom:updated and atom:author. And my primary domain classes generally have an “asAtom” method. It’s not so different, I think, from Mark Nottingham’s recent post about allowing the four principal HTTP verbs to inform his data model. Although ostensibly about getting “beyond” methods, it strikes me as simply “convention-over-configuration” that says “OK, tight coupling with HTTP methods is fine” so we can move on to a level of abstraction up. Design is ALWAYS a balance, and I was completely mistaken when I assumed REST would be a cookbook of answers for the design challenges of web application architecture and design. But it does, I think, provide a framework for considering those decisions and an awareness of possible trade-offs/benefits lurking when we choose one path over another.
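The column-to-element mapping described above might look something like the following. This is a sketch in Python rather than DASe’s own code; the `Item` class, its field names, and the `as_atom` method are hypothetical stand-ins for a domain class with an “updated” column, an “updated_by” column, and an “asAtom” method.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

ATOM_NS = "http://www.w3.org/2005/Atom"


@dataclass
class Item:
    """Hypothetical domain object mirroring one database row."""
    uri: str
    title: str
    updated: str      # "updated" column, stored as an RFC 3339 timestamp
    updated_by: str   # "updated_by" column, the last editor's name

    def as_atom(self):
        """Serialize this item as an atom:entry element."""
        entry = ET.Element(f"{{{ATOM_NS}}}entry")
        ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = self.uri
        ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = self.title
        # "updated" maps straight onto atom:updated ...
        ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = self.updated
        # ... and "updated_by" onto atom:author/atom:name.
        author = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
        ET.SubElement(author, f"{{{ATOM_NS}}}name").text = self.updated_by
        return entry
```

The appeal is that the serialization needs no translation layer: the row already carries exactly what the entry requires.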

The Roy Fielding post drew a comment/response from Sam Ruby, and that drew a response from Roy stating “…but hypertext as the engine of hypermedia state is also about late binding of application alternatives that guide the client through whatever it is that we are trying to provide as a service.” I must have read it somewhere before (perhaps in the dissertation?), but I often use that phrase “late binding” in describing REST. It really is key: the state of a representation need not be set until I choose to interact with it. And in my interaction with it, that state is bound (and thus usable & interesting to me) but there is no contract regarding that state. It can continue to grow and change and everything it links to can grow and change in its own time. I (a resource owner) am NOT bound by others’ interactions with the resource. Well, as a librarian, that is a compelling/revolutionary/subversive/threatening (not to mention terrifying) proposition! But it certainly does seem to offer incredible opportunity and captures much of the promise and messiness of information flow in the real world.

DASe & Metadata

Early in the development of the DASe project we decided/realized that the ONLY way we would be able to quickly and efficiently deal with all of the various digital collections we hoped to incorporate would be to NOT enforce any kind of metadata scheme on anyone, but rather simply let folks describe their “stuff” any way they wish. Not to mention, since many of these were legacy collections set up in a FileMaker or Access database or even an Excel spreadsheet, there was often already a schema in place and folks (rightly) didn’t want to change. Note that we are talking about faculty members and department administrators who have lots better things to do than figure out how to use Dublin Core to describe the images that they have already been using for years in their classes, research, and publications.

We (Liberal Arts Instructional Technology Services at The University of Texas at Austin) had an interest in “rationalizing” this hodge-podge of data & metadata towards two ends: one, we wanted folks to be able to share their collections easily if they wished, and two, we wanted a means by which we could easily repurpose the digital assets in all sorts of ways: podcasts, websites, specialized search interfaces, etc. So we went with what is essentially key-value pairs: collection managers create “attributes” (e.g., title, description, person depicted, time period, etc.) that best describe their assets and we provide an interface that allows them to add metadata to any item by filling in a value for any/all attributes that apply. Well, turns out this works REALLY well. We currently have 88 collections, comprising over 300,000 items (images, audio, video, documents, etc.) and the system holds over 4 million pieces of metadata (i.e. the “values” table has over 4 million rows). Searching is fast, adding new collections is easy, and application maintenance (including backing up collections as XML documents) is painless.
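For anyone curious what the key-value approach looks like in practice, here is a minimal sketch using SQLite from Python. It is not the DASe schema: the table and column names (`attributes`, `item_values`) and the helper functions are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical two-table layout: attributes are defined per collection,
# and every piece of metadata is one row pairing an item with an attribute value.
SCHEMA = """
CREATE TABLE attributes (
    id INTEGER PRIMARY KEY,
    collection TEXT NOT NULL,
    name TEXT NOT NULL,
    UNIQUE (collection, name)
);
CREATE TABLE item_values (
    item_id TEXT NOT NULL,
    attribute_id INTEGER NOT NULL REFERENCES attributes(id),
    value TEXT NOT NULL
);
"""


def open_store():
    """Create an in-memory store with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_metadata(conn, collection, item_id, name, value):
    """Attach one value to an item, creating the attribute on first use."""
    conn.execute(
        "INSERT OR IGNORE INTO attributes (collection, name) VALUES (?, ?)",
        (collection, name),
    )
    (attribute_id,) = conn.execute(
        "SELECT id FROM attributes WHERE collection = ? AND name = ?",
        (collection, name),
    ).fetchone()
    conn.execute(
        "INSERT INTO item_values (item_id, attribute_id, value) VALUES (?, ?, ?)",
        (item_id, attribute_id, value),
    )


def search(conn, text):
    """Return ids of items that have any value containing text."""
    rows = conn.execute(
        "SELECT DISTINCT item_id FROM item_values WHERE value LIKE ? ORDER BY item_id",
        (f"%{text}%",),
    )
    return [row[0] for row in rows]
```

Because no collection’s attributes are baked into the schema, adding a collection with its own vocabulary is just more rows, and a search never needs to know which attributes exist.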

The current version of DASe runs on PHP4 with a PostgreSQL back end. The next rev, which is a significant retooling of the current architecture and code base, will be PHP5 and will be able to use PostgreSQL, MySQL, SQLite, or XML files as a back end. How that all works, where Atom, REST, RDF and more fit in, problems encountered along the way, as well as solutions settled on (tentative and otherwise) will be some of the topics explored in future posts.

Sam Ruby has a post that is particularly on target for answering (or at least exploring) a couple of questions I posed to the atom-syntax group (see Threads: “Why Use Atom?” and “Atom Inside web application architecture”). His was a follow-up to Yaron Goland’s entertaining Revenge of Babble post. The comments are enlightening, both in the ways folks try to answer/address the problem AND in the ways that no one really DOES, at least definitively (or so it seems to me). The closest is Sam Ruby:

Yaron Goland: “What are the criteria that identify a problem that ATOM is a good solution for? And even more importantly what are some key flags to look for that identify a problem that ATOM is probably a bad idea for?”

Sam Ruby: “My take: Atom is good for circumstances where data can be organized into “chunks” that you can identify by title, location, who made the last change, when that change was made, and the data itself is either textual or a brief textual summary can be obtained/synthesized. When is it bad? While the above list suggests several pieces of data that should be present, these requirements don’t tend to be equally weighted. The most fundamental pieces of information in my experience are ones that allow a client to answer the following two questions: have I seen this information before, and did it change? Where can I find it and did it meaningfully change are next.
(…) But my experience is that being able to answer “did this resource change?” in both machine and human terms is essential for proper use of HTTP and Atom respectively.”

The question I need to answer is: “what is the cost associated with Atom, and what are the benefits?” DASe is built in PHP5, and the XML tools available (XMLReader, XMLWriter, SimpleXML, etc.) are REALLY easy to use when dealing with XML that’s not too complex. In all cases, the XML I am passing around is very simple: lists of collections, lists of items, lists of media files. Seems to me there IS some cost to using Atom internally in the application since it inevitably makes those very simple XML structures more complex. I’m beginning work on a generic DASe Atom class that will help simplify the process of serializing and unserializing PHP objects and object arrays into Atom entries and/or feeds, to see what that gives us. Two things I hope to discover. One: does Atom help me answer that question “did this resource change?” and/or help me “bake” that into the representation of that data? Two: obviously, Atom transforms any/all data endpoints into service endpoints. What’s gained there? Perhaps a lot, in fact, since I can “subscribe” to internal processes in the application (e.g., error logging) as a monitoring/maintenance tool.
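As a rough illustration of how Atom can answer “have I seen this before, and did it change?”, here is a sketch of a client that remembers atom:id and atom:updated for each entry. It is written in Python for brevity rather than PHP5, and the function name and the plain dict used as the “seen” store are assumptions of mine, not part of DASe.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def classify_entries(feed_xml, seen):
    """Sort a feed's entries into new, changed, and unchanged.

    seen maps atom:id to the atom:updated value recorded on the previous
    pass; it is updated in place so the next call compares against this one.
    """
    result = {"new": [], "changed": [], "unchanged": []}
    root = ET.fromstring(feed_xml)
    for entry in root.findall(f"{{{ATOM_NS}}}entry"):
        entry_id = entry.findtext(f"{{{ATOM_NS}}}id")
        updated = entry.findtext(f"{{{ATOM_NS}}}updated")
        if entry_id not in seen:
            result["new"].append(entry_id)        # never seen this id before
        elif seen[entry_id] != updated:
            result["changed"].append(entry_id)    # same id, different time stamp
        else:
            result["unchanged"].append(entry_id)
        seen[entry_id] = updated
    return result
```

The two required elements do all the work here, which is roughly the point Sam Ruby makes above: identity and change are baked into every entry, so a subscriber (human or machine) gets them for free.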

a beginning
