Layers

I’d like to drill down on one element of the arguments I made in my last two posts. I alluded to it, and it’s an idea that’s ever-present in my thinking. It’s the idea of layering or separation of concerns. I suspect it is the most common reason that I get myself into trouble explaining my ideas — while I assume it is understood that I’m talking about layered solutions, I don’t always make that point explicitly clear.

The idea (and it has some very important implications) is that a solution/technology/standard, etc. needs to address a limited, well-defined problem at the proper level of abstraction. I’m a huge fan of the Unix philosophy: “do one thing and do it well,” which captures the idea in its simplest form quite nicely. Tied into the idea of “levels of abstraction,” it is a paradigm/pattern that turns up all over the place in computer science: machine code is abstracted into assembly language, which is abstracted into something like “C,” which is abstracted into Perl/Python/PHP, etc. Each “layer” need only follow the interface provided by its underlying layer and provide a reasonably useful interface for the layer above. It’s a very powerful concept — indeed the Internet itself is built on layers such as this.

But this idea of layered solutions is in no way limited to the “plumbing” of the systems we use. I would contend that all of the work we do as programmers, information technologists, librarians is exactly this: we use the tools and resources that we have available (the underlying layer) to provide a useful abstraction for those customers/clients (the layer above) who will use our systems/collections/guidance, etc. Everywhere I look in my work, whether it be with code, applications, co-worker or faculty interactions, I see this pattern popping up again and again.

I’m actually mixing a few different ideas here — the layers of abstraction pattern and the separation of concerns pattern are related, but not exactly the same thing. No big deal though, since they go together quite well. I’ll add a third idea to the mix, since it’s an important element of my thinking. It’s called Gall’s Law (I actually have it printed out and tacked up next to my desk) and says:

“A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.”

We fall into a trap in academia of trying to solve problems vertically (hmm, the image of a silo pops into mind) rather than horizontally. We see a problem, are smart enough to fashion a solution, and thus solve it. But we run into trouble, because we come upon a related problem and wind up starting the process all over again, fashioning a new solution for this specific problem. Then our users come asking for help solving problems we hadn’t exactly thought of, and the problems compound — we’re thrashing, constantly on the treadmill of solving today’s problem. We typically make two big mistakes. One is not following Gall’s Law (and indeed the Unix philosophy). We have not properly decomposed our problem — a working solution to a given problem might very well be a set of small solutions nicely tied together. Two, we have not adequately abstracted our problem. We simply do not recognize the problem as one that has already been solved in some other arena because superficially, they look different (e.g., not recognizing that the problem Amazon S3 tries to solve is remarkably like the problem we try to solve with our Institutional Repositories).

This also flows into the idea of “services,” which we are simply not seeing enough of in libraries and I think it is a key piece of our survival. My “Aha!” moment came a number of years back when I discovered Jon Udell’s Library Lookup project. It was the first time I saw the OPAC as simply a service that I can interact with, either with the dedicated interface or with a scripted bookmarklet. It didn’t matter to the catalog which I used. It simply filled its role, performed the search and returned a result. Jon Udell had unlocked the power of REST — a theory, derived from HTTP itself and described by the guy who wrote the HTTP spec, that elegantly laid out a set of rules (the REST “constraints”) which, when followed, would result in robust, evolvable, scalable systems. I no longer felt bound to one user experience for web applications I was building. I could begin thinking about an application as a piece of a larger, loosely coupled system. “Do one thing and do it well” in this case was indeed freeing.
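
Jon Udell’s bookmarklet does this in JavaScript in the browser; here is a rough sketch of the same idea in a few lines of Python. The catalog URL and query parameters are made up (every OPAC spells them a little differently), and the “no matches” string check is the crudest possible screen-scrape, but the point is the same: the OPAC is just a service answering HTTP requests, whatever client happens to ask.

```python
# A sketch of "the OPAC as a service": the same catalog search that
# backs the web form can be driven by a script.  The base URL and the
# query parameters below are hypothetical.
import urllib.parse
import urllib.request

OPAC_SEARCH = "https://catalog.example.edu/search"  # hypothetical OPAC

def lookup_isbn(isbn: str) -> bool:
    """Ask the catalog whether it holds an item with this ISBN."""
    query = urllib.parse.urlencode({"searchtype": "isbn", "searcharg": isbn})
    with urllib.request.urlopen(f"{OPAC_SEARCH}?{query}") as resp:
        page = resp.read().decode("utf-8", errors="replace")
    # The catalog neither knows nor cares that a script, not a browser,
    # asked the question -- it just performs the search and answers.
    return "No matches found" not in page

if __name__ == "__main__":
    print(lookup_isbn("9780596529260"))
```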

But this brought up a whole new slew of problems: how do I know the piece I build, properly abstracted and decomposed, will be useful to the other pieces of the system? If I’m building all of the pieces, no problem, but that sort of defeats the purpose, no?  I want it to be useful to some other piece that someone else has built.

This is where my interest in standards and specifications began in earnest. What had seemed utterly mundane and boring to me previously, suddenly seemed really useful and (I hate to admit ;-) ) exciting. People get together, compare notes, argue, sweat over tiny details, and end up defining a contract that allows two separate services to interoperate. Maybe not seamlessly and perhaps not without the occasional glitch, but ultimately it works and the results can be astonishing (cf. THE WEB).

Let me be clear about one thing: this work of separating concerns, finding the proper level of abstraction, creating useful and usable service contracts is a hugely difficult undertaking. And it’s work librarians should be and need to be involved in. No one knows our problems like we do, and while I refuse to believe that our problems are unique, who better than us to recognize that our problem X is exactly the same problem addressed by Y in some totally different realm? No one is going to decompose our problems or hit upon appropriate abstractions for us — that is OUR work to do.

I can offer a bit of empirical evidence to back up my assertion that it is a useful endeavor. Taking Gall’s words to heart, we built the DASe project as drop-dead simple as possible, made everything a service and evolved out from there, with modules and clients that can be quickly implemented, designed to do one thing well (the good ones stick around, becoming part of the codebase). I’m not here to promote DASe (that’s another topic — my preference would be to see the ideas we have incubated reimplemented in an application that actually has a good solid community already — DSpace 2.0 anyone? :-) ), but rather to push the idea that these basic, solid principles upon which the Web itself is based have real value and offer huge return on investment.

And so to answer incisive criticism of my original post: my suggestions are indeed of a technical nature, but that is exactly the layer I am addressing. It’s not vertical solutions I am after (“here is a problem…here’s how you fix it”), but rather horizontal: if we get the tooling layer right, we will have the opportunity to build, on top of that layer, more robust solutions — solutions which will get down into the nitty-gritty social and cultural problems that can seem at times to be quite intractable. My opinion is that they ARE intractable if they are attempted atop an unstable underlying layer. Let’s allow the folks who daily work in the incredibly diverse and challenging social realm to do so without the obstacles presented by tools that are not doing their job effectively.

Take Two

I’ve had a few responses to my last post that take issue with my seemingly one-size-fits-all, here’s-a-silver-bullet proposal, including Magical thinking in data curation from Dorothea Salo (whose previous blog post I had cited in mine).  That was not my intended message, but I suspect I left enough unconnected dots, mushed a few different ideas together, and failed to define some terms such that I left the wrong impression.

I’ll also confess that my post was not really about data curation per se (at least in the sense that Dorothea means it), but rather the tools we use to interact with data. I do think that better tools will make the hard work of data curation easier, or will at least (in many cases) push the complexity into a more manageable space.  I’ll also note that what I am proposing is in no way a “novel” approach — in fact, it’s based completely on decade-or-more-old standards and is quite common.  Examples of what I am proposing are all over the place, but we simply don’t see them often enough in academia or the library.

Here is an attempt to bullet-point a few conclusions/take-aways:

  • There’s no silver bullet.  If someone suggests there is, tell them they are wrong or write a blog post telling them as much ;-) .
  • We’re all using the Internet to share data.  It’s a good place to be doing so.  The Web, in particular (by this I specifically mean HTTP-based technologies) is excellent for this.  Email’s pretty good, too, as is FTP, etc. But the Web rules.
  • The Web is based on some basic principles that are not nearly well-enough understood.  A better understanding, especially by the folks who build the web applications and write the specifications we use, is crucial. Many of the tools we use in the libraries/repositories are poorly attuned to the core principles of the web.  Seeing those systems evolve is in everyone’s interest.
  • These principles are actually quite elegant and powerful.  A better understanding of these principles by librarians, information technologists, administrators is also quite important, and (I’ll contend) something we should strive for.
  • All data has structure.  Much of that structure can be captured by the tools we all use to create data.  It’s critically important that we advocate for and use tools that do so and that make that data portable, i.e., available for reuse by other applications.  When the tool does not or cannot capture the structure of the data automatically (and I’m really talking about metadata here), make it as easy as possible for the user to add that metadata at the point of creation.  (There’s a small sketch of this just after this list.)
  • Human-created metadata is exponentially more difficult/expensive to create than metadata that can be captured automatically.  When a human has to create a piece of metadata, make sure it does not have to be re-created by someone else later. That’s a huge waste of time.
  • The tasks we are engaged in in academia and the library (esp. when it comes to managing data) are not special. At all. The extent to which the library/repository sees its mission as unique (i.e., demands that it “solves its own problems”) is the extent to which it is doomed to extinction (and the sooner the better).
  • To put it another way — “we are all librarians/data curators now.”  There are great strides being made.  We ought not miss out due to some notion of the “specialness” of The Library.
  • Try to see analogs to our work in unlikely places.  Look at Twitter, Google, Amazon, Facebook, but look beneath the obvious use cases and think about the implications — can using the library (for some cases) be as easy as using Amazon or Google?  Could the rich ecosystem of client applications we see forming around Twitter form around our OPAC?   Can our systems be offered to client app developers as a platform, the way Facebook’s is?  Following the basic web principles makes this much more possible.
  • Why is our Institutional Repository not more like Amazon S3?  Would something like SimpleDB (Amazon’s key-value store) be a better way to capture information about our resources?
  • Why authorities are not available as web resources is mind-boggling to me.  Oh wait, they are!  Ed Summers et al. at the Library of Congress are at the forefront of this whole approach I am talking about.  Keep an eye on their work for ideas/inspiration.
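
To make the “capture it automatically” point a bit more concrete, here is a small sketch (assuming the Pillow imaging library and a hypothetical vacation.jpg) of pulling out the metadata a camera has already written into a photo, with no human effort required.

```python
# A minimal illustration of capturing the structure the tool already
# provides: a camera writes EXIF metadata into every photo, and a few
# lines of code can lift it out for reuse.  Assumes the Pillow library
# is installed; the filename is hypothetical.
from PIL import Image, ExifTags

def camera_metadata(path: str) -> dict:
    """Return the EXIF tags a camera embedded in an image file."""
    exif = Image.open(path).getexif()
    return {ExifTags.TAGS.get(tag_id, tag_id): value
            for tag_id, value in exif.items()}

if __name__ == "__main__":
    for name, value in camera_metadata("vacation.jpg").items():
        print(f"{name}: {value}")
```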

There’s lots more to say, but I’ll leave that for another post….

I read with great interest both Peter Brantley’s Reality dreams (for Libraries) and Dorothea Salo’s Top-down or bottom-up? as both address the increasingly obvious need for data systems support in higher ed. The issue, which in practice comes in many shades and hues, is that our increasingly digital and connected world offers challenges that are not being met and opportunities that are not being exploited. Centralized data curation, the formation of “communities of practice,” individualized faculty support — these are a few directions that institutions might look to help matters, but I cannot help but think that a lasting solution demands a more fundamental foundation that simply does not yet exist. To put it more bluntly (cliché and all), we don’t yet have our “killer app” for data management.

Dorothea wisely asks if a single tool could possibly fit the bill. Well, no, but it’s likely not a single tool we are after. While it might have seemed in the early 1990s that the Mosaic web browser was the Internet’s killer app, it was actually HTML and HTTP that allowed browsers to be created and the Web to explode in popularity and usefulness. Likewise, our solution will likely be a set of protocols, formats, and practices all of which will enable the creation of end-user applications that can “hide the plumbing.” Indeed, training our users to practice better data “hygiene” will be a fruitless task unless they have applications that force them (by way of the path of least resistance) to do so. It’s not a stretch — every day we send emails, post to blogs, upload pictures to Facebook, etc. without a thought that our data must be properly structured to achieve our aims.

Here are a few essential characteristics of our killer app:

  • The inherent structure of the data must be captured and maintained from the moment of its creation through its entire lifecycle.
  • Separable concerns need to be separated (data, presentation, access control, etc. each have their own “layer”).
  • Reuse, repurposing, and “remixing” must be first-class operations. In fact, the line between use and reuse should simply be erased (i.e., reuse IS use and vice-versa).
  • The feedback loop with the user should be as tight as possible. I.e., I can immediately visualize a whole range of possible uses of the data I have created.

These are not pie-in-the-sky demands: lots of applications already do a decent job at this (almost any web application fits the bill, since the medium practically demands it). But the tools I see faculty regularly using (Excel, FileMaker, Microsoft Word, the desktop computer filesystem) do NOT. That these are seen as anything but disastrous for the creation of structured data is surprising to me.

Which leads me to my next point: I don’t think there is such a thing (anymore) as data that “does not need to be structured” or “data that will never need to be shared.” If there is one point of data hygiene that we do really need to get across to all of our non-technical folks it is that everything you create will be reused and will be shared no matter how unlikely that might seem right now. If not by someone else, then by you. Systems that don’t distinguish use from reuse or originator access from secondary/shared access are exactly what is called for (n.b. I’m not suggesting that authorization/access controls go by the wayside, but rather that they happen in another layer).

Too often our digital systems perpetuate distinctions that, while logical in a pre-digital world, are actively harmful in the digital realm. Consider, for example, a digital photograph taken by a faculty member on a family vacation. The camera automatically attaches useful metadata: time and date, location (if the camera is geo-enabled), camera model and settings, etc. The data and metadata are in no way tied to a specific use, but will be useful no matter how that photo is used/reused. I’ve seen plenty of cases where that vacation photo is as likely to appear on a holiday greeting card as it is to be used in a classroom lecture, as part of a research data set, or as an addendum to a published paper. As things stand, our faculty member would turn to different systems for the various use cases (e.g., Flickr or Picasa for personal uses, PowerPoint for classroom use, DSpace for “institutional” use). While I’m not suggesting that Picasa be expected to serve the purposes of an institutional repository or DSpace that of a classroom presentation tool, part of me thinks “well, why not?”

A more practical expectation would be that our systems interoperate more seamlessly than they do, and that moving an item (data plus metadata) from one to the other is a trivial, end-user task. As I mentioned above, we need protocols, practices, and formats to allow this sort of interoperation. I think that for the most part we have all of the pieces that we need — we simply lack applications that use them. For protocols, I think that HTTP and related Open Web standards (AtomPub, OpenID, OAuth, OpenSearch, etc.) offer a fairly complete toolset. Too often, our systems either don’t interoperate, or offer services that are simply proprietary, application-specific protocols on top of HTTP (e.g., SOAP-based RPC-style web services), which misses the whole point of the Web: HTTP is not just a transport protocol, but IS the application protocol itself. The growing awareness of and interest in REST-based systems is simply that: using the Web (specifically HTTP) exactly as it was intended. Thus REST-based architectures and the design principles they promote offer the “practices” part of the equation.
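
To illustrate the “HTTP is the application protocol” point: in a RESTful setup the item is a resource with its own URI and the standard HTTP methods carry the meaning, rather than tunneling everything through POSTs to a single RPC endpoint. A sketch, with a hypothetical repository URI, assuming the Python requests library:

```python
# A sketch of using HTTP itself as the application protocol: the item
# has a URI, and GET/PUT/DELETE on that URI are the operations.
# The URI is hypothetical; assumes the requests library is installed.
import requests

ITEM = "https://repository.example.edu/collections/42/items/7"

# Read the item's current representation.
current = requests.get(ITEM, headers={"Accept": "application/json"}).json()

# Update it by PUTting a revised representation back to the same URI.
current["title"] = "Field notes, revised"
requests.put(ITEM, json=current).raise_for_status()

# Remove it when it is no longer wanted.
requests.delete(ITEM).raise_for_status()
```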

As an example of a REST-based system (or standard), you cannot do much better than the Atom Publishing Protocol. While it may not have taken the world by storm the way its creators had hoped, it still has loads of potential as “plumbing” — perhaps not something that end-users would be aware of, but hugely useful for application developers. And were one tempted to try some other approach, or “just use HTTP,” I am pretty sure they’d end up developing something quite like Atom/AtomPub anyway. In either case, there is no hiding or abstracting away HTTP — it’s in full view in any RESTful system on the web and is in all cases (this being the Web we are talking about) simply unavoidable.
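
For a taste of what that plumbing looks like in practice, here is a minimal sketch of an AtomPub “create” interaction: POST an Atom entry to a collection URI and the server answers 201 Created with the new member’s URI in the Location header. The collection URI and author name are placeholders; assumes the Python requests library.

```python
# A sketch of AtomPub member creation: POST an Atom entry to the
# collection URI, then read the new member's URI from Location.
# The collection URI and author are hypothetical placeholders.
import requests

COLLECTION = "https://repository.example.edu/collections/essays"

entry = """<?xml version="1.0" encoding="utf-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>A note on layered systems</title>
  <author><name>Jane Librarian</name></author>
  <content type="text">Do one thing and do it well.</content>
</entry>"""

resp = requests.post(
    COLLECTION,
    data=entry.encode("utf-8"),
    headers={"Content-Type": "application/atom+xml;type=entry"},
)
resp.raise_for_status()                       # expect 201 Created
print("New member at:", resp.headers["Location"])
```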

The next bit to tackle is the format: what data formats allow the sort of interoperability we are seeking? Certainly as the format behind AtomPub, Atom Syndication Format would be an obvious choice. But there are others: JSON, HTML, XHTML, RDF/XML, RDFa, etc. I tend to favor Atom and JSON, having found both quite suited to the sort of tasks I have in mind. RDF is widely viewed as the basis of the “Web of Data,” but its lack of a containment model makes it ill-suited for the sort of RESTful interactions that are a critical component of the interoperability I’m envisioning. What RDF does offer, which is key, is the ability to “say anything about anything.” I’d contend, though, that RDF is not the only way to do that (or indeed even the best way, in many cases). Atom itself offers some RDFish extension points that I quite like, and efforts like OAI-ORE/Atom do so as well. The point is that if a content publisher has some “stuff” they want to describe, they should be able to do so in whatever way they wish, and this original “description” should stay with the item in full fidelity, even if it is mapped to standardized description/metadata schemes if/when necessary. I could say more, especially about OAI-ORE/Atom, since it so closely captures the sort of aggregating and re-aggregating that we see in academic data work. And as an Atom-based format (with an emphasis on the entry rather than the feed), it has a “write-back” story (AtomPub) built in.
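
As a sketch of those extension points: an Atom entry can carry the publisher’s original description as foreign markup right alongside the required Atom elements, so it travels with the item at full fidelity. The extension namespace and element below are invented purely for illustration.

```python
# A sketch of carrying a collection-specific description inside an Atom
# entry as foreign markup.  The "fw" namespace and element are made up.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
EXT = "http://example.edu/ns/fieldwork"   # hypothetical extension namespace
ET.register_namespace("", ATOM)
ET.register_namespace("fw", EXT)

entry = ET.Element(f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}id").text = "http://example.edu/items/survey-017"
ET.SubElement(entry, f"{{{ATOM}}}title").text = "Survey photo 017"
ET.SubElement(entry, f"{{{ATOM}}}updated").text = "2008-11-20T15:04:05Z"
# The original, collection-specific description rides along as an extension:
ET.SubElement(entry, f"{{{EXT}}}siteName").text = "Trench B, northwest corner"

print(ET.tostring(entry, encoding="unicode"))
```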

I believe things are moving in the direction I describe, but by and large NOT (sadly) in academia and (more sadly) not within the library community. AtomPub is built into products by IBM, Microsoft, Google and others. In fact, Google’s suite of GData-based web applications (GData itself is based on Atom/AtomPub) comes quite close to what I describe. At UT Austin, we’ve been making very wide use of our DASe platform, which is basically a reference implementation of just the sort of application/framework I have described above. A faculty member with “stuff” they need to manage, archive, repurpose, share (images, audio, video, pdfs, data sets, etc.) can add it to a DASe “collection” and thus have it maintained as interoperable, structured data and enjoy the benefits of a rich web interface and suite of web-based RESTful APIs for further development. We have over one hundred such collections, and while most live behind UT authentication, some have seen the light of day on the open web as rich media-based web resources, including eSkeletons and Onda Latina among others.

My conclusion after a few years of working with DASe is that, yes, this is definitely the way to go — RESTful architectures built on standard protocols and formats offer HUGE benefits from the start and are “engineered for serendipity” such that new uses for existing data are regularly hit upon. I’d also note that it requires buy-in from all who wish to play — the hard work of developing specifications and standards, and understanding and following those specifications are the “cost of admission.” Likewise, a willingness to move away from (or build better, more RESTful interfaces around) legacy systems that don’t play well is crucial. This means a shared understanding among technologists, application developers, managers, administrators, and users must be promoted. It’s no easy task, in my experience — pushback can be fairly dramatic, especially when investment (as resources, mindshare, pride, etc.) in our legacy systems is significant. Our work as librarians, repository managers, information technologists is NOT, though, as much a matter of educating users as it is educating ourselves, looking “outside the walls,” and beginning the difficult conversations that start moving us in the right direction.

Here are the slides from my DLF Forum 2008 presentation titled “The Role of Atom/AtomPub in Digital Archive Services at The University of Texas at Austin.”

Today

Today was a very good day.

OAI-ORE & Atom

There have been some interesting conversations lately regarding the OAI-ORE specification, especially its use of Atom as a serialization format.  I have great respect for the folks behind OAI-ORE, and their success with OAI-PMH is well-deserved.  But the world has changed quite a lot since OAI-PMH was released and filled a pressing need: we had the RSS wars (RSS 2.0 won), we had the REST vs. WS-* wars (REST won), then some very smart folks came along to basically moot the RSS wars with a well-specified but simple format called Atom.  Based on REST principles and doing syndication one better by also specifying a “write back” protocol, Atom seems to embody what’s best about web architecture.  Not only is it a great specification, with all kinds of applicability outside of its blog-oriented upbringing, but the process of creating the spec was, and continues to be, open.  I cannot begin to express how useful and enlightening it has been to participate in ongoing discussions on Atom mailing lists, and to peruse the “back issues” to gain clarity on design decisions that guide my implementation.  I have little doubt that Atom, a more widely supported specification than OAI-PMH, could supplant OAI-PMH, and I think that would be a good thing.  Why?  Because we could begin to see a dismantling of the silos that hold “library” or “archive” material on the web and keep it distinct from all of the other great stuff on the web.  The distinctions are completely arbitrary and artificial anyway, since machines can do all of the understanding they need to do with a URI and mime type.  By the way, when I say Atom is widely supported, I mean really widely supported.

I think the ORE specification is taking a more “designed by a panel of experts” approach than Atom.  That’s fine, but I am not sure you can design a good specification these days, and expect it to be used, unless you bring the potential implementors into the discussion.  I want to see a widely-applicable specification for describing aggregations of web-accessible resources, but if this is a DSpace/Fedora/ePrints only solution, I’ll bow out of the conversation gracefully.   Of course we must be careful about our aims: we already have a pretty darn good means to describe aggregations of web-accessible resources and it is called HTML.  So what is the value of ORE?

A lot, really, and it is not so different from other efforts (such as XLink) to “link” resources on the web in a way that allows us to make assertions about that link or about a collection of links (a “named graph”).  But where is the proper abstraction?  The “hypermedia as the engine of application state” REST principle seems to suggest that the resource itself should carry its own semantics along with it, the possible state changes (“links”) providing the best context for deriving meaning.  Of course PageRank taught us that incoming links are also valuable for deriving meaning.  So where are we left? What sort of named graphs do we want to create, and towards what end — what is the added value?  Here is where very specific use cases, well thought out and described, would be useful (an absolute necessity, really).  I DID see mention of one project aimed at moving journals from JSTOR into DSpace (exactly the sort of “use case” that is helpful to ponder), but that’s it.  Now the conversation rages about Atom/not-Atom, isomorphic/non-isomorphic serializations, etc., but with no “reality checks” against possible implementations.  It’s all very abstract, and although it may result in some highly scholarly journal articles, I fear that those same articles will end up stuck in whatever institutional repository holds them, awaiting the arrival of the Semantic Web to be linked and discoverable to the rest of the world, while students continue to cite Wikipedia, since “that’s what a Google search turned up.”  Count me as a fan of Wikipedia — I would just like to see peer-reviewed scholarship as easily/readily discoverable.  And I don’t just mean SEO — I want to see new resources created — “named graphs” if you will — that give me new contexts in which to discover new resources.  My favorite “named graphs” these days are the front page of the New York Times (web edition) and the programming page on Reddit.   Aggregations par excellence and quite valuable, if much more ephemeral than what the OAI folks had in mind.  But I’ll guarantee you that a historian 50 years from now will love to be able to track the day-to-day (or hour-to-hour, or minute-to-minute) changes to that front page.  Do we have a way to do that reliably?  Sure we do. My favorite currently is wget.  Combine it with a nice version control system like git, and you are good to go.  Preservable aggregations and no (new) spec necessary.
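
For what it’s worth, the wget-plus-git recipe can be a dozen lines of Python around subprocess: fetch the page (and its requisites) into a directory that is already a git repository, then commit, and each run becomes one snapshot in the history. The URL and directory here are examples only, not a tool I’m shipping.

```python
# A sketch of snapshotting a page over time with wget + git.
# -p: fetch page requisites (css, images); -k: rewrite links for local
# viewing; -nH: no host directory; -P: write into the archive directory.
import datetime
import subprocess

URL = "https://www.nytimes.com/"
SNAPSHOT_DIR = "frontpage-archive"   # assumed to be an existing git repo

def snapshot(url: str, workdir: str) -> None:
    subprocess.run(["wget", "-p", "-k", "-nH", "-P", workdir, url], check=True)
    stamp = datetime.datetime.utcnow().isoformat()
    subprocess.run(["git", "-C", workdir, "add", "-A"], check=True)
    # no check here: if nothing changed, "git commit" exits nonzero
    subprocess.run(["git", "-C", workdir, "commit", "-m", f"snapshot {stamp}"])

if __name__ == "__main__":
    snapshot(URL, SNAPSHOT_DIR)
```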

As a little experiment, I wrote a Python script that would create an aggregation of a web resource (I chose an online journal article that has been used in some of the OAI-ORE examples).  It simply uses wget to grab the article from the web, with the recursive level set to 1, so I would also get the resources (style sheet, images, pdf copy, journal front page) that the article itself linked to.  I then create an Atom resource map that is nothing more than a listing of all of those “aggregated resources” with a time stamp (atom:updated) on each — indicating when it was aggregated.  Pretty handy in some ways, and I could imagine real usefulness to such a tool.  If nothing else, it provides a one-stop machine readable manifest for the directory of files that wget created for me.
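
This is not the actual aggregator.py (that is linked just below), but the resource-map half of the idea looks roughly like this sketch: walk the directory wget produced and emit an Atom feed listing each aggregated resource, with atom:updated recording when it was gathered. The directory name and base URI are placeholders.

```python
# A rough sketch of building an Atom resource map from a directory of
# files fetched by wget.  Directory and base URI are hypothetical.
import datetime
import os
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

def resource_map(fetched_dir: str, base_uri: str) -> str:
    """Build an Atom feed listing every file in the fetched directory."""
    now = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}id").text = base_uri
    ET.SubElement(feed, f"{{{ATOM}}}title").text = f"Aggregation of {base_uri}"
    ET.SubElement(feed, f"{{{ATOM}}}updated").text = now
    for root, _dirs, files in os.walk(fetched_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), fetched_dir)
            href = base_uri.rstrip("/") + "/" + rel.replace(os.sep, "/")
            entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
            ET.SubElement(entry, f"{{{ATOM}}}id").text = href
            ET.SubElement(entry, f"{{{ATOM}}}title").text = rel
            ET.SubElement(entry, f"{{{ATOM}}}link", rel="alternate", href=href)
            # atom:updated here records when the resource was aggregated
            ET.SubElement(entry, f"{{{ATOM}}}updated").text = now
    return ET.tostring(feed, encoding="unicode")

if __name__ == "__main__":
    print(resource_map("fetched", "https://journal.example.org/articles/123"))
```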

see the code here: aggregator.py.html

see the resulting resource map here:  aggregation_one.atom

Now a thought experiment: what if this were the starting point of an effort to really express aggregations in Atom?  First, I think we’d need to define what sorts of content types might be aggregated (web site, journal article, cv, blog, encyclopedia article, etc.) and spell out some best practices for modelling those in Atom.  Keeping in mind that you are limited to things like atom:title, atom:subtitle, atom:category, atom:source, etc. for that modelling.  That’s quite useful in its minimalism, actually, and keeps us from the temptation to repeat semantics already established in the items themselves.  atom:category is the real gem here: this is the key to making Atom “work” in all kinds of scenarios.
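
To make the atom:category point a bit more concrete, here is one possible way a small agreed set of scheme/term pairs could type the things being aggregated. The scheme URI and terms are made up for illustration, not a published vocabulary.

```python
# A sketch of typing aggregated resources with atom:category.
# The scheme URI and the terms below are invented, not a real vocabulary.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
SCHEME = "http://example.edu/ns/aggregation-types"   # hypothetical scheme

CONTENT_TYPES = {
    "journal-article": "A journal article and its constituent parts",
    "web-site": "A harvested web site snapshot",
    "cv": "A curriculum vitae",
}

def typed_entry(title: str, term: str) -> ET.Element:
    """Build an Atom entry whose atom:category says what kind of thing it is."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}category",
                  scheme=SCHEME, term=term, label=CONTENT_TYPES[term])
    return entry

print(ET.tostring(typed_entry("Vol. 3, no. 2, article 7", "journal-article"),
                  encoding="unicode"))
```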

Atom is quite good for the uses for which it was intended, and for many uses outside of that (I note particularly Google’s GData services).  But it is also almost perfectly suited to the use cases for which the Open Archives Initiative was originally chartered.  What is needed is guidance — essentially, an implementor’s guide to Atom for these use cases.  Do we need another spec for that?  Perhaps not.  But why not channel that library/archive expertise into Atom implementation guidelines?   And begin working with the Atom community to create standardized practices and extensions.  Currently, I need to rely on the troubled Media RSS extensions (troubled, but widely used and supported) to describe media assets attached to an Atom entry.  I would much prefer a richer and more carefully designed extension to atom:link or atom:content or even a better “Media RSS” extension — that’s where the help is needed.  And it involves not just designing something and throwing it out there, but the hard work of educating, listening, consensus building, etc. Not, I suspect, the most natural approach for the library/archives world.  Real leadership — an immersion in what’s going on outside the library walls,  a willingness to educate folks about what libraries and archives are all about,  the humility to recognize when others have beaten us at our own game — that’s what is needed now.  It’s actually an incredible opportunity.

REST Presentation

Here’s the PDF slide set of a talk I gave the IT community at UT Austin on June 25, 2008.
Real World REST with Atom/AtomPub

t-shirt design

Roy Fielding’s latest post discusses the differences between software implementations, architectures, and architectural styles. Having spent the last few months rewriting DASe to adhere more closely to the REST architectural style, I have come to know the occasionally dizzying challenge of making good design decisions based on a set of principles operating at about 2 or 3 levels of abstraction above the nitty-gritty implementation that I am putting down in code. I had thought that DASe was fairly RESTful to begin with in its original design, but important RESTful constraints were NOT being followed, in fact. It has been an interesting adventure, to say the least, but I am as convinced as ever that the ultimate benefits will be worth it, both for DASe and for my own understanding of distributed information systems.

The architecture of DASe relies heavily on the Atom & AtomPub specifications, which I think is a pretty good base, since those specs are shot through with an understanding of and desire to capitalize on the principles of REST.  When REST is discussed it is as likely as not in terms of HTTP and half again as likely to be in terms of Atom/AtomPub.  So there’s much to be learned from some very intelligent folks on the subject.  And frankly, when a design decision needs to be made, it can be difficult to bring abstract concepts down to the concrete and thus “what does Google do?” or “what does Amazon do?” is often a pretty good start. Same with Atom.  Why not let the Atom spec design decisions flow back up into my design?  So my database tables tend to have an “updated” column, and an “updated_by” column that map nicely to atom:updated and atom:author.   And my primary domain classes generally have an “asAtom” method.  It’s not so different, I think, from Mark Nottingham’s recent post about allowing the four principal HTTP verbs to inform his data model.  Although ostensibly about getting “beyond” methods, it strikes me as simply “convention-over-configuration” that says “OK — tight coupling with HTTP methods is fine” so we can move up a level of abstraction.  Design is ALWAYS a balance, and I was completely mistaken when I assumed REST would be a cookbook of answers for the design challenges of web application architecture and design.  But it does, I think, provide a framework for considering those decisions and an awareness of possible trade-offs/benefits lurking when we choose one path over another.
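
DASe itself is written in PHP, so this is just an illustrative Python sketch of the convention described above: the “updated” and “updated_by” columns map straight onto atom:updated and atom:author, and every domain object knows how to render itself as an Atom entry. Field and class names are mine, not the actual DASe code.

```python
# A sketch of letting Atom's design decisions flow back into the domain
# model: columns map onto Atom elements and each object can emit an entry.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

class Item:
    def __init__(self, uri, title, updated, updated_by):
        self.uri = uri                  # the item's canonical URI
        self.title = title
        self.updated = updated          # maps to atom:updated
        self.updated_by = updated_by    # maps to atom:author/atom:name

    def as_atom(self) -> ET.Element:
        entry = ET.Element(f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = self.uri
        ET.SubElement(entry, f"{{{ATOM}}}title").text = self.title
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = self.updated
        author = ET.SubElement(entry, f"{{{ATOM}}}author")
        ET.SubElement(author, f"{{{ATOM}}}name").text = self.updated_by
        return entry

item = Item("http://example.edu/items/7", "Field notes",
            "2008-10-01T12:00:00Z", "Jane Librarian")
print(ET.tostring(item.as_atom(), encoding="unicode"))
```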

The Roy Fielding post drew a comment/response from Sam Ruby, and that, in turn, a response from Roy stating “…but hypertext as the engine of hypermedia state is also about late binding of application alternatives that guide the client through whatever it is that we are trying to provide as a service.”  I must have read it somewhere before (perhaps in the dissertation?), but I often use that phrase “late binding” in describing REST.  It really is key — that the state of a representation need not be set until I choose to interact with it.  And in my interaction with it, that state is bound (and thus usable & interesting to me) but there is no contract regarding that state.  It can continue to grow and change and everything it links to can grow and change in their own time.  I (a resource owner) am NOT bound by others’ interactions with the resource.  Well, as a librarian, that is a compelling/revolutionary/subversive/threatening (not to mention terrifying) proposition! But it certainly does seem to offer incredible opportunity and captures much of the promise and messiness of information flow in the real world.

DASe & Metadata

Early in the development of the DASe project we decided/realized that the ONLY way we would be able to quickly and efficiently deal with all of the various digital collections we hoped to incorporate would be to NOT enforce any kind of metadata scheme on anyone, but rather simply let folks describe their “stuff” any way they wish. Not to mention, since many of these were legacy collections set up in a FileMaker or Access database or even an Excel spreadsheet, there was often already a schema in place and folks (rightly) didn’t want to change. Note that we are talking about faculty members and department administrators who have lots better things to do than figure out how to use Dublin Core to describe the images that they have already been using for years in their classes, research, and publications.

We (Liberal Arts Instructional Technology Services at The University of Texas at Austin) had an interest in “rationalizing” this hodge-podge of data & metadata towards two ends: one, we wanted folks to be able to share their collections easily if they wished, and two, we wanted a means by which we could easily repurpose the digital assets in all sorts of ways: podcasts, websites, specialized search interfaces, etc. So we went with what is essentially key-value pairs: collection managers create “attributes” (e.g., title, description, person depicted, time period, etc.) that best describe their assets and we provide an interface that allows them to add metadata to any item by filling in a value for any/all attributes that apply. Well, turns out this works REALLY well. We currently have 88 collections, comprising over 300,000 items (images, audio, video, documents, etc.) and the system holds over 4 million pieces of metadata (i.e. the “values” table has over 4 million rows). Searching is fast, adding new collections is easy, and application maintenance (including backing up collections as XML documents) is painless.
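
For illustration only (this is not the actual DASe schema, and I’m using SQLite here purely for brevity), the attribute/value shape looks roughly like this: collection managers define attributes, and every piece of metadata is one row in a values table.

```python
# A sketch of an attribute/value metadata store.  Table and column
# names are illustrative, not the real DASe schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attributes (
    id         INTEGER PRIMARY KEY,
    collection TEXT NOT NULL,
    name       TEXT NOT NULL          -- e.g. 'title', 'person depicted'
);
CREATE TABLE item_values (
    item_id      INTEGER NOT NULL,
    attribute_id INTEGER NOT NULL REFERENCES attributes(id),
    value_text   TEXT NOT NULL
);
""")

conn.execute("INSERT INTO attributes (collection, name) VALUES ('art-history', 'title')")
conn.execute("INSERT INTO attributes (collection, name) VALUES ('art-history', 'time period')")
conn.execute("INSERT INTO item_values VALUES (1, 1, 'View of Toledo')")
conn.execute("INSERT INTO item_values VALUES (1, 2, 'c. 1600')")

# Searching is a join across the two tables:
rows = conn.execute("""
    SELECT a.name, v.value_text
    FROM item_values v JOIN attributes a ON v.attribute_id = a.id
    WHERE v.item_id = 1
""").fetchall()
print(rows)
```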

The current version of DASe runs on PHP4 with a PostgreSQL back end. The next rev, which is a significant retooling of the current architecture and code base, will be PHP5 and will be able to use PostgreSQL, MySQL, SQLite, or XML files as a backend. How that all works, where Atom, REST, RDF and more fit in, problems encountered along the way, as well as solutions settled on (tentative and otherwise) will be some of the topics explored in future posts.