I read with great interest both Peter Brantley’s Reality dreams (for Libraries) and Dorothea Salo’s Top-down or bottom-up?, as both address the increasingly obvious need for data systems support in higher ed. The issue, which in practice comes in many shades and hues, is that our increasingly digital and connected world offers challenges that are not being met and opportunities that are not being exploited. Centralized data curation, the formation of “communities of practice,” individualized faculty support — these are a few directions in which institutions might look to help matters, but I cannot help but think that a lasting solution demands a more fundamental foundation that simply does not yet exist. To put it more bluntly (cliché and all), we don’t yet have our “killer app” for data management.
Dorothea wisely asks if a single tool could possibly fit the bill. Well, no, but it’s likely not a single tool we are after. While it might have seemed in the early 1990s that the Mosaic web browser was the Internet’s killer app, it was actually HTML and HTTP that allowed browsers to be created and the Web to explode in popularity and usefulness. Likewise, our solution will likely be a set of protocols, formats, and practices, all of which will enable the creation of end-user applications that can “hide the plumbing.” Indeed, training our users to practice better data “hygiene” will be a fruitless task unless they have applications that lead them (by way of the path of least resistance) to do so. It’s not a stretch — every day we send emails, post to blogs, upload pictures to Facebook, etc. without a thought that our data must be properly structured to achieve our aims.
Here are a few essential characteristics of our killer app:
- The inherent structure of the data must be captured and maintained from the moment of its creation through its entire lifecycle.
- Separable concerns need to be separated (data, presentation, access control, etc. each have their own “layer”).
- Reuse, repurposing, and “remixing” must be first-class operations. In fact, the line between use and reuse should simply be erased (i.e., reuse IS use and vice versa).
- The feedback loop with the user should be as tight as possible: I should be able to immediately visualize a whole range of possible uses of the data I have created.
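To make the second and third points concrete, here is a minimal sketch in Python. Everything in it (the `Item` class, the `render` function, the field names) is invented for illustration, not any real library; the point is only that data, descriptive metadata, and access control sit in separate layers, and that a new presentation of an item is just another use, not a copy-and-mutate “reuse”:

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A piece of content with its separable concerns kept in separate layers."""
    data: bytes                                    # the content itself
    metadata: dict = field(default_factory=dict)   # captured at creation, carried forever
    access: dict = field(default_factory=dict)     # authorization lives in its own layer


def render(item: Item, presentation: str) -> str:
    """Presentation is applied at use time and never baked into the item,
    so every 'reuse' (a new presentation) is simply another use."""
    if presentation == "caption":
        return item.metadata.get("title", "untitled")
    if presentation == "citation":
        return f'{item.metadata.get("creator", "unknown")}, "{item.metadata.get("title", "untitled")}"'
    raise ValueError(f"unknown presentation: {presentation}")


photo = Item(
    data=b"...jpeg bytes...",
    metadata={"title": "Beach at dusk", "creator": "A. Faculty", "taken": "2009-07-04"},
    access={"owner": "afaculty", "public": False},
)

print(render(photo, "caption"))     # -> Beach at dusk
print(render(photo, "citation"))    # -> A. Faculty, "Beach at dusk"
```

Because the layers never mix, the same `photo` can feed a greeting card, a lecture slide, or a repository deposit without the item itself changing.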
These are not pie-in-the-sky demands: lots of applications already do a decent job at this (almost any web application fits the bill, since the medium practically demands it). But the tools I see faculty regularly using (Excel, FileMaker, Microsoft Word, the desktop computer filesystem) do NOT. That these are seen as anything but disastrous for the creation of structured data is surprising to me.
Which leads me to my next point: I don’t think there is such a thing (anymore) as data that “does not need to be structured” or “data that will never need to be shared.” If there is one point of data hygiene that we really do need to get across to all of our non-technical folks, it is that everything you create will be reused and shared, no matter how unlikely that might seem right now. If not by someone else, then by you. Systems that don’t distinguish use from reuse or originator access from secondary/shared access are exactly what is called for (n.b. I’m not suggesting that authorization/access controls go by the wayside, but rather that they happen in another layer).
Too often our digital systems perpetuate distinctions that, while logical in a pre-digital world, are actively harmful in the digital realm. Consider, for example, a digital photograph taken by a faculty member on a family vacation. The camera automatically attaches useful metadata: time and date, location (if the camera is geo-enabled), camera model and settings, etc. The data and metadata are in no way tied to a specific use, but will be useful no matter how that photo is used/reused. I’ve seen plenty of cases where that vacation photo is as likely to appear on a holiday greeting card as it is to be used in a classroom lecture, as part of a research data set, or as an addendum to a published paper. As things stand, our faculty member would turn to different systems for the various use cases (e.g., Flickr or Picasa for personal uses, PowerPoint for classroom use, DSpace for “institutional” use). While I’m not suggesting that Picasa be expected to serve the purposes of an institutional repository or DSpace that of a classroom presentation tool, part of me thinks “well, why not?”
A more practical expectation would be that our systems interoperate more seamlessly than they do, and that moving an item (data plus metadata) from one to the other is a trivial, end-user task. As I mentioned above, we need protocols, practices, and formats to allow this sort of interoperation. I think that for the most part we have all of the pieces that we need — we simply lack applications that use them. For protocols, I think that HTTP and related Open Web standards (AtomPub, OpenID, OAuth, OpenSearch, etc.) offer a fairly complete toolset. Too often, our systems either don’t interoperate, or offer services that are simply proprietary, application-specific protocols on top of HTTP (e.g., SOAP-based RPC-style web services), which misses the whole point of the Web: HTTP is not just a transport protocol, but IS the application protocol itself. The growing awareness of and interest in REST-based systems is simply that: using the Web (specifically HTTP) exactly as it was intended. Thus REST-based architectures and the design principles they promote offer the “practices” part of the equation.
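Here is a toy illustration of what “HTTP IS the application protocol” means in practice. This is a hypothetical in-memory sketch, not any real server framework: the only operations are HTTP’s own verbs applied to URIs, so any client that already speaks HTTP speaks the entire application — no proprietary verbs layered on top:

```python
# A minimal RESTful resource store: the uniform interface (GET/PUT/DELETE on
# URIs) *is* the entire API; no application-specific verbs layered on top.
resources = {}


def handle(method, uri, body=None):
    """Dispatch an HTTP-style request; return (status_code, representation)."""
    if method == "GET":
        return (200, resources[uri]) if uri in resources else (404, None)
    if method == "PUT":                  # create or replace at a caller-chosen URI
        created = uri not in resources
        resources[uri] = body or {}
        return (201 if created else 200), resources[uri]
    if method == "DELETE":
        return (204, None) if resources.pop(uri, None) is not None else (404, None)
    return 405, None                     # anything else: method not allowed


status, _ = handle("PUT", "/items/42", {"title": "Vacation photo"})
print(status)                            # -> 201
print(handle("GET", "/items/42"))        # -> (200, {'title': 'Vacation photo'})
```

A SOAP-style service would instead tunnel calls like `getItemById` through a single POST endpoint; the RESTful version needs no such vocabulary, which is exactly why generic tools can interoperate with it.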
As an example of a REST-based system (or standard) you cannot do much better than the Atom Publishing Protocol. While it may not have taken the world by storm the way its creators had hoped, it still has loads of potential as “plumbing” — perhaps not something that end-users would be aware of, but hugely useful for application developers. And were one tempted to try some other approach, or to “just use HTTP,” I am pretty sure they’d end up developing something quite like Atom/AtomPub anyway. In either case, there is no hiding or abstracting away HTTP — it’s in full view in any RESTful system on the web and is in all cases (this being the Web we are talking about) simply unavoidable.
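For a feel of that plumbing, the sketch below builds a minimal Atom entry with Python’s standard library. It stops short of the network step — in a real AtomPub deposit the client would POST this document to a collection URI with `Content-Type: application/atom+xml` and expect `201 Created` plus a `Location` header for the new member resource. The UUID is the example identifier from the Atom spec, not a real item:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace


def make_entry(title, entry_id, updated, summary):
    """Build a minimal Atom entry: the document an AtomPub client would POST
    to a collection URI to create a new member resource."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_id
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM}}}summary").text = summary
    return ET.tostring(entry, encoding="unicode")


xml = make_entry(
    "Vacation photo",
    "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a",
    "2009-08-01T12:00:00Z",
    "Beach at dusk, taken 2009-07-04",
)
print(xml)
```

Nothing here is hidden behind an SDK: the entry is plain XML and the deposit is a plain HTTP POST, which is why any HTTP-aware tool can participate.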
The next bit to tackle is the format: what data formats allow the sort of interoperability we are seeking? Certainly, as the format behind AtomPub, the Atom Syndication Format would be an obvious choice. But there are others: JSON, HTML, XHTML, RDF/XML, RDFa, etc. I tend to favor Atom and JSON, having found both quite suited to the sort of tasks I have in mind. RDF is widely viewed as the basis of the “Web of Data,” but its lack of a containment model makes it ill-suited for the sort of RESTful interactions that are a critical component of the interoperability I’m envisioning. What RDF does offer, which is key, is the ability to “say anything about anything.” I’d contend, though, that RDF is not the only way to do that (or indeed even the best way, in many cases). Atom itself offers some RDFish extension points that I quite like, and efforts like OAI-ORE/Atom do so as well. The point is that if a content publisher has some “stuff” they want to describe, they should be able to do so in whatever way they wish, and this original “description” should stay with the item in full fidelity, even if it is mapped to standardized description/metadata schemes if/when necessary. I could say more, especially about OAI-ORE/Atom, since it so closely captures the sort of aggregating and re-aggregating that we see in academic data work. And as an Atom-based format (with an emphasis on the entry rather than the feed), it has a “write-back” story (AtomPub) built in.
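One way to picture “the original description stays with the item in full fidelity”: keep whatever the publisher said, verbatim, and derive standardized fields from it as a mapping rather than a replacement. A hypothetical sketch in JSON terms — the publisher’s field names and the mapping here are invented for illustration, with only the `dc:` targets gesturing at Dublin Core:

```python
import json


def to_standard(item):
    """Map a publisher's free-form description onto a standardized scheme
    while carrying the original description along verbatim."""
    original = item["description"]
    mapping = {"photo_title": "dc:title", "shot_by": "dc:creator"}
    standardized = {
        std: original[src] for src, std in mapping.items() if src in original
    }
    return {
        "id": item["id"],
        "standardized": standardized,   # lossy, interoperable view
        "original": original,           # full-fidelity source of truth
    }


item = {
    "id": "photo-123",
    "description": {
        "photo_title": "Beach at dusk",
        "shot_by": "A. Faculty",
        "film_stock": "Velvia 50",      # a detail no standard scheme covers
    },
}
record = to_standard(item)
print(json.dumps(record, indent=2))
```

The mapping can be recomputed (or improved) at any time because nothing was thrown away: `film_stock` survives even though the standardized view has no slot for it.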
I believe things are moving in the direction I describe, but by and large NOT (sadly) in academia and (more sadly) not within the library community. AtomPub is built into products by IBM, Microsoft, Google and others. In fact, Google’s suite of GData-based web applications (GData itself is based on Atom/AtomPub) comes quite close to what I describe. At UT Austin, we’ve been making very wide use of our DASe platform, which is basically a reference implementation of just the sort of application/framework I have described above. A faculty member with “stuff” they need to manage, archive, repurpose, share (images, audio, video, pdfs, data sets, etc.) can add it to a DASe “collection” and thus have it maintained as interoperable, structured data and enjoy the benefits of a rich web interface and suite of web-based RESTful APIs for further development. We have over one hundred such collections, and while most live behind UT authentication, some have seen the light of day on the open web as rich media-based web resources, including eSkeletons and Onda Latina among others.
My conclusion after a few years of working with DASe is that yes, this is definitely the way to go — RESTful architectures built on standard protocols and formats offer HUGE benefits from the start and are “engineered for serendipity,” such that new uses for existing data are regularly hit upon. I’d also note that it requires buy-in from all who wish to play — the hard work of developing specifications and standards, and of understanding and following those specifications, is the “cost of admission.” Likewise, a willingness to move away from (or build better, more RESTful interfaces around) legacy systems that don’t play well is crucial. This means a shared understanding among technologists, application developers, managers, administrators, and users must be promoted. It’s no easy task, in my experience — pushback can be fairly dramatic, especially when investment (as resources, mindshare, pride, etc.) in our legacy systems is significant. Our work as librarians, repository managers, and information technologists is NOT, though, as much a matter of educating users as it is educating ourselves, looking “outside the walls,” and beginning the difficult conversations that start moving us in the right direction.