Librarianship


Layers

I’d like to drill down on one element of the arguments I made in my last two posts. I alluded to it, and it’s an idea that’s ever-present in my thinking. It’s the idea of layering, or separation of concerns. I suspect it is the most common reason that I get myself into trouble explaining my ideas — while I assume it is understood that I’m talking about layered solutions, I don’t always make that point explicitly clear.

The idea (and it has some very important implications) is that a solution/technology/standard, etc. needs to address a limited, well-defined problem at the proper level of abstraction. I’m a huge fan of the Unix philosophy: “do one thing and do it well,” which captures the idea in its simplest form quite nicely. Tied into the idea of “levels of abstraction,” it is a paradigm/pattern that turns up all over the place in computer science: machine code is abstracted into assembly language, which is abstracted into something like “C,” which is abstracted into Perl/Python/PHP, etc. Each “layer” need only follow the interface provided by its underlying layer and provide a reasonably useful interface for the layer above. It’s a very powerful concept — indeed the Internet itself is built on layers such as this.
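
To make the pattern a bit more concrete, here is a minimal sketch in Python (the class names are purely illustrative, not from any real system): the upper layer talks only to the interface the lower layer provides, so the underlying implementation can be swapped without the layer above ever noticing.

```python
class KeyValueStore:
    """Lower layer: a tiny in-memory key-value store."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class Catalog:
    """Upper layer: speaks only to the put/get interface beneath it."""
    def __init__(self, store):
        self.store = store

    def add_record(self, record_id, title):
        self.store.put(record_id, {"title": title})

    def lookup(self, record_id):
        return self.store.get(record_id)


# The upper layer never peeks below the put/get interface; swap the
# in-memory store for S3, a database, etc., and Catalog keeps working.
catalog = Catalog(KeyValueStore())
catalog.add_record("b123", "A Pattern Language")
print(catalog.lookup("b123"))
```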

But this idea of layered solutions is in no way limited to the “plumbing” of the systems we use. I would contend that all of the work we do as programmers, information technologists, librarians is exactly this: we use the tools and resources that we have available (the underlying layer) to provide a useful abstraction for those customers/clients (the layer above) who will use our systems/collections/guidance, etc. Everywhere I look in my work, whether it be with code, applications, co-worker or faculty interactions, I see this pattern popping up again and again.

I’m actually mixing a few different ideas here — the layers of abstraction pattern and the separation of concerns pattern are related, but not exactly the same thing. No big deal though, since they go together quite well. I’ll add a third idea to the mix, since it’s an important element of my thinking. It’s called Gall’s Law (I actually have it printed out and tacked up next to my desk) and says:

“A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.”

We fall into a trap in academia of trying to solve problems vertically (hmm, the image of a silo pops into mind) rather than horizontally. We see a problem, are smart enough to fashion a solution, and so we solve it. But we run into trouble, because we come upon a related problem and wind up starting the process all over again, fashioning a new solution for this specific problem. Then our users come asking for help solving problems we hadn’t exactly thought of, and the problems compound — we’re thrashing, constantly on the treadmill of solving today’s problem. We typically make two big mistakes. One is not following Gall’s Law (and indeed the Unix philosophy). We have not properly decomposed our problem — a working solution to a given problem might very well be a set of small solutions nicely tied together. Two, we have not adequately abstracted our problem. We simply do not recognize the problem as one that has already been solved in some other arena because, superficially, they look different (e.g., not recognizing that the problem Amazon S3 tries to solve is remarkably like the problem we try to solve with our Institutional Repositories).

This also flows into the idea of “services,” which we are simply not seeing enough of in libraries, and I think it is a key piece of our survival. My “Aha!” moment came a number of years back when I discovered Jon Udell’s Library Lookup project. It was the first time I saw the OPAC as simply a service that I could interact with, either through the dedicated interface or with a scripted bookmarklet. It didn’t matter to the catalog which I used. It simply filled its role, performed the search, and returned a result. Jon Udell had unlocked the power of REST — a theory derived from HTTP itself and described by Roy Fielding, one of the authors of the HTTP spec, which elegantly lays out a set of rules (the REST “constraints”) that, when followed, result in robust, evolvable, scalable systems. I no longer felt bound to one user experience for web applications I was building. I could begin thinking about an application as a piece of a larger, loosely coupled system. “Do one thing and do it well” in this case was indeed freeing.
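
A sketch of the idea (not Udell’s actual bookmarklet, and with a made-up endpoint and query parameter, since every OPAC vendor’s search URL looks a little different): build a URL, let HTTP do the rest, and the catalog answers the same way no matter who is asking.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

OPAC_BASE = "https://catalog.example.edu/search"   # hypothetical endpoint

def lookup_isbn(isbn):
    """Ask the catalog the same question a bookmarklet or a person would."""
    url = OPAC_BASE + "?" + urlencode({"isbn": isbn})
    with urlopen(url) as response:
        return response.read()   # the catalog neither knows nor cares who asked

# results = lookup_isbn("9780000000000")   # any ISBN will do
```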

But this brought up a whole new slew of problems: how do I know the piece I build, properly abstracted and decomposed, will be useful to the other pieces of the system? If I’m building all of the pieces, no problem, but that sort of defeats the purpose, no?  I want it to be useful to some other piece that someone else has built.

This is where my interest in standards and specifications began in earnest. What had seemed utterly mundane and boring to me previously suddenly seemed really useful and (I hate to admit ;-) ) exciting. People get together, compare notes, argue, sweat over tiny details, and end up defining a contract that allows two separate services to interoperate. Maybe not seamlessly, and perhaps not without the occasional glitch, but ultimately it works and the results can be astonishing (cf. THE WEB).

Let me be clear about one thing: this work of separating concerns, finding the proper level of abstraction, and creating useful and usable service contracts is a hugely difficult undertaking. And it’s work librarians should be and need to be involved in. No one knows our problems like we do, and while I refuse to believe that our problems are unique, who better than us to recognize that our problem X is exactly the same problem addressed by Y in some totally different realm? No one is going to decompose our problems or hit upon appropriate abstractions for us — that is OUR work to do.

I can offer a bit of empirical evidence to back up my assertion that it is a useful endeavor. Taking Gall’s words to heart, we built the DASe project as drop-dead simple as possible, made everything a service and evolved out from there, with modules and clients that can be quickly implemented, designed to do one thing well (the good ones stick around, becoming part of the codebase). I’m not here to promote DASe (that’s another topic — my preference would be to see the ideas we have incubated reimplemented in an application that actually has a good solid community already — DSpace 2.0 anyone? :-) ), but rather to push the idea that these basic, solid principles upon which the Web itself is based have real value and offer huge return on investment.

And so, to answer the incisive criticism of my original post: my suggestions are indeed of a technical nature, but that is exactly the layer I am addressing. It’s not vertical solutions I am after (“here is a problem…here’s how you fix it”), but rather horizontal: if we get the tooling layer right, we will have the opportunity to build, on top of that layer, more robust solutions — solutions which will get down into the nitty-gritty social and cultural problems that can seem at times to be quite intractable. My opinion is that they ARE intractable if they are attempted atop an unstable underlying layer. Let’s allow the folks who daily work in the incredibly diverse and challenging social realm to do so without the obstacles presented by tools that are not doing their job effectively.

Take Two

I’ve had a few responses to my last post that take issue with my seemingly one-size-fits-all, here’s-a-silver-bullet proposal, including Magical thinking in data curation from Dorothea Salo (whose previous blog post I had cited in mine).  That was not my intended message, but I suspect I left enough unconnected dots, mushed a few different ideas together, and failed to define some terms such that I left the wrong impression.

I’ll also confess that my post was not really about data curation per se (at least in the sense that Dorothea means it), but rather the tools we use to interact with data. I do think that better tools will make the hard work of data curation easier, or will at least (in many cases) push the complexity into a more manageable space.  I’ll also note that what I am proposing is in no way a “novel” approach — in fact, it’s based completely on decade-or-more-old standards and is quite common.  Examples of what I am proposing are all over the place, but we simply don’t see them often enough in academia or the library.

Here is an attempt to bullet-point a few conclusions/take-aways:

  • There’s no silver bullet.  If someone suggests there is, tell them they are wrong or write a blog post telling them as much ;-) .
  • We’re all using the Internet to share data.  It’s a good place to be doing so.  The Web, in particular (by this I specifically mean HTTP-based technologies) is excellent for this.  Email’s pretty good, too, as is FTP, etc. But the Web rules.
  • The Web is based on some basic principles that are not nearly well-enough understood.  A better understanding, especially by the folks who build the web applications and write the specifications we use, is crucial. Many of the tools we use in the libraries/repositories are poorly attuned to the core principles of the web.  Seeing those systems evolve is in everyone’s interest.
  • These principles are actually quite elegant and powerful.  A better understanding of these principles by librarians, information technologists, administrators is also quite important, and (I’ll contend) something we should strive for.
  • All data has structure.  Much of that structure can be captured by the tools we all use to create data.  It’s critically important that we advocate for and use tools that do so and that make that data portable, i.e., available for reuse by other applications.  When the tool does not or cannot capture the structure of the data automatically (and I’m really talking about metadata here), make it as easy as possible for the user to add that metadata at the point of creation.
  • Human-created metadata is far more difficult/expensive to create than metadata that can be captured automatically.  When a human has to create a piece of metadata, make sure it does not have to be re-created by someone else later. That’s a huge waste of time.
  • The tasks we are engaged in in academia and the library (esp. when it comes to managing data) are not special. At all. The extent to which the library/repository sees its mission as unique (i.e., demands that it “solves its own problems”) is the extent to which it is doomed to extinction (and the sooner the better).
  • To put it another way — “we are all librarians/data curators now.”  There are great strides being made.  We ought not miss out due to some notion of the “specialness” of The Library.
  • Try to see analogs to our work in unlikely places.  Look at Twitter, Google, Amazon, Facebook, but look beneath the obvious use cases and think about the implications — can using the library (for some cases) be as easy as using Amazon or Google?  Could the rich ecosystem of client applications we see forming around Twitter form around our OPAC?   Can our systems be offered to client app developers as a platform, the way Facebook is?  Following the basic web principles makes this much more possible.
  • Why is our Institutional Repository not more like Amazon S3?  Would something like SimpleDB (Amazon’s key-value store) be a better way to capture information about our resources?
  • Why authorities are not available as web resources is mind-boggling to me.  Oh wait, they are!  Ed Summers et al. at the Library of Congress are at the forefront of this whole approach I am talking about.  Keep an eye on their work for ideas/inspiration (see the sketch just after this list).
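
Here is a rough sketch of what I mean by authorities as plain web resources: fetch the record at its URI like anything else on the web. The subject-heading identifier and the “.json” suffix below are illustrative; check id.loc.gov itself for the URIs and serializations it actually offers.

```python
import json
from urllib.request import urlopen

# Illustrative identifier; id.loc.gov documents the real ones.
AUTHORITY_URI = "https://id.loc.gov/authorities/subjects/sh00000000"

def fetch_authority(uri):
    """Retrieve a machine-readable form of an authority record."""
    with urlopen(uri + ".json") as response:
        return json.loads(response.read())

# record = fetch_authority(AUTHORITY_URI)
```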

There’s lots more to say, but I’ll leave that for another post….

I read with great interest both Peter Brantley’s Reality dreams (for Libraries) and Dorothea Salo’s Top-down or bottom-up? as both address the increasingly obvious need for data systems support in higher ed. The issue, which in practice comes in many shades and hues, is that our increasingly digital and connected world offers challenges that are not being met and opportunities that are not being exploited. Centralized data curation, the formation of “communities of practice,” individualized faculty support — these are a few directions that institutions might look to help matters, but I cannot help but think that a lasting solution demands a more fundamental foundation that simply does not yet exist. To put it more bluntly (cliché-and-all) we don’t yet have our “killer app” for data management.

Dorothea wisely asks if a single tool could possibly fit the bill. Well, no, but it’s likely not a single tool we are after. While it might have seemed in the early 1990s that the Mosaic web browser was the Internet’s killer app, it was actually HTML and HTTP that allowed browsers to be created and the Web to explode in popularity and usefulness. Likewise, our solution will likely be a set of protocols, formats, and practices, all of which will enable the creation of end-user applications that can “hide the plumbing.” Indeed, training our users to practice better data “hygiene” will be a fruitless task unless they have applications that force them (by way of the path of least resistance) to do so. It’s not a stretch — every day we send emails, post to blogs, upload pictures to Facebook, etc. without a thought that our data must be properly structured to achieve our aims.

Here are a few essential characteristics of our killer app:

  • The inherent structure of the data must be captured and maintained from the moment of its creation through its entire lifecycle.
  • Separable concerns need to be separated (data, presentation, access control, etc. each have their own “layer”); a sketch of this separation follows this list.
  • Reuse, repurposing, and “remixing” must be first-class operations. In fact, the line between use and reuse should simply be erased (i.e., reuse IS use and vice-versa).
  • The feedback loop with the user should be as tight as possible. I.e., I can immediately visualize a whole range of possible uses of the data I have created.
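
As a sketch of the separation I have in mind (every name here is illustrative): the item’s data and metadata live in their own layer, and any number of presentations get built on top of the very same item.

```python
item = {
    "data": "vacation-photo-1234.jpg",
    "metadata": {
        "created": "2009-07-04T10:15:00Z",
        "creator": "jdoe",
        "title": "Beach at sunrise",
    },
}

def as_html(item):
    """One presentation layer: a web-page snippet."""
    m = item["metadata"]
    return '<figure><img src="{}"/><figcaption>{}</figcaption></figure>'.format(
        item["data"], m["title"])

def as_citation(item):
    """Another presentation layer: a plain-text citation."""
    m = item["metadata"]
    return '{}. "{}" ({}).'.format(m["creator"], m["title"], m["created"][:10])

# The same item serves both uses without modification; reuse IS use.
print(as_html(item))
print(as_citation(item))
```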

These are not pie-in-the-sky demands: lots of applications already do a decent job at this (almost any web application fits the bill, since the medium practically demands it). But the tools I see faculty regularly using (Excel, FileMaker, Microsoft Word, the desktop computer filesystem) do NOT. That these are seen as anything but disastrous for the creation of structured data is surprising to me.

Which leads me to my next point: I don’t think there is such a thing (anymore) as data that “does not need to be structured” or “data that will never need to be shared.” If there is one point of data hygiene that we really do need to get across to all of our non-technical folks, it is that everything you create will be reused and will be shared, no matter how unlikely that might seem right now. If not by someone else, then by you. Systems that don’t distinguish use from reuse or originator access from secondary/shared access are exactly what is called for (n.b. I’m not suggesting that authorization/access controls go by the wayside, but rather that they happen in another layer).

Too often our digital systems perpetuate distinctions that, while logical in a pre-digital world, are actively harmful in the digital realm. Consider, for example, a digital photograph taken by a faculty member on a family vacation. The camera automatically attaches useful metadata: time and date, location (if the camera is geo-enabled), camera model and settings, etc. The data and metadata are in no way tied to a specific use, but will be useful no matter how that photo is used/reused. I’ve seen plenty of cases where that vacation photo is as likely to appear on a holiday greeting card as it is to be used in a classroom lecture, as part of a research data set, or as an addendum to a published paper. As things stand, our faculty member would turn to different systems for the various use cases (e.g., Flickr or Picasa for personal uses, PowerPoint for classroom use, DSpace for “institutional” use). While I’m not suggesting that Picasa be expected to serve the purposes of an institutional repository or DSpace that of a classroom presentation tool, part of me thinks “well, why not?”
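
A tiny example of what the camera has already done for us (this assumes the Pillow imaging library, and the file name is hypothetical): the metadata travels with the file, independent of any particular use.

```python
from PIL import ExifTags, Image   # assumes the Pillow library is installed

def camera_metadata(path):
    """Return the EXIF tags the camera embedded, keyed by readable names."""
    exif = Image.open(path).getexif()   # tag-id -> value mapping
    return {ExifTags.TAGS.get(tag_id, tag_id): value
            for tag_id, value in exif.items()}

# meta = camera_metadata("vacation-photo-1234.jpg")   # hypothetical file
# meta might include DateTime, Make, Model, and (if geo-enabled) GPS info.
```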

A more practical expectation would be that our systems interoperate more seamlessly than they do, and that moving an item (data plus metadata) from one to the other is a trivial, end-user task. As I mentioned above, we need protocols, practices, and formats to allow this sort of interoperation. I think that for the most part we have all of the pieces that we need — we simply lack applications that use them. For protocols, I think that HTTP and related Open Web standards (AtomPub, OpenID, OAuth, OpenSearch, etc.) offer a fairly complete toolset. Too often, our systems either don’t interoperate, or offer services that are simply proprietary, application-specific protocols on top of HTTP (e.g., SOAP-based RPC-style web services), which misses the whole point of the Web: HTTP is not just a transport protocol, but IS the application protocol itself. The growing awareness of and interest in REST-based systems is simply that: using the Web (specifically HTTP) exactly as it was intended. Thus REST-based architectures and the design principles they promote offer the “practices” part of the equation.
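
Here is what “HTTP as the application protocol” looks like in a sketch: the verbs, headers, and status codes do the work, with nothing tunneled on top. The resource URI is hypothetical.

```python
import json
from urllib.request import Request, urlopen

ITEM_URI = "https://repository.example.edu/items/42"   # hypothetical resource

def get_item(uri):
    """GET a machine-readable representation via content negotiation."""
    req = Request(uri, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())

def update_item(uri, representation):
    """PUT a revised representation back to the very same URI."""
    body = json.dumps(representation).encode("utf-8")
    req = Request(uri, data=body, method="PUT",
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return resp.status   # 200 or 204 means the resource was updated
```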

As an example of a REST-based system (or standard) you cannot do much better than the Atom Publishing Protocol. While it may not have taken the world by storm the way its creators had hoped, it still has loads of potential as “plumbing” — perhaps not something that end-users would be aware of, but hugely useful for application developers. And were one tempted to try some other approach, or to “just use HTTP,” I am pretty sure they’d end up developing something quite like Atom/AtomPub anyway. In either case, there is no hiding or abstracting away HTTP — it’s in full view in any RESTful system on the web and is in all cases (this being the Web we are talking about) simply unavoidable.
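
For the curious, the basic AtomPub interaction really is about this simple (the collection URI is hypothetical, and a real server would typically assign its own id and updated values): POST an Atom entry to a collection, and the server hands back the new member’s URI in the Location header.

```python
from urllib.request import Request, urlopen

COLLECTION_URI = "https://repository.example.edu/collection"   # hypothetical

ENTRY = """<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Field notes, July 2009</title>
  <id>urn:uuid:00000000-0000-0000-0000-000000000000</id>
  <updated>2009-07-15T12:00:00Z</updated>
  <author><name>J. Doe</name></author>
  <content type="text">Plain-text notes for the example entry.</content>
</entry>"""

def create_member(collection_uri, entry_xml):
    """POST an entry; per AtomPub, success is a 201 plus a Location header."""
    req = Request(collection_uri, data=entry_xml.encode("utf-8"), method="POST",
                  headers={"Content-Type": "application/atom+xml;type=entry"})
    with urlopen(req) as resp:
        return resp.status, resp.headers.get("Location")   # URI of the new member
```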

The next bit to tackle is the format: what data formats allow the sort of interoperability we are seeking? Certainly, as the format behind AtomPub, the Atom Syndication Format would be an obvious choice. But there are others: JSON, HTML, XHTML, RDF/XML, RDFa, etc. I tend to favor Atom and JSON, having found both quite suited to the sort of tasks I have in mind. RDF is widely viewed as the basis of the “Web of Data,” but its lack of a containment model makes it ill-suited for the sort of RESTful interactions that are a critical component of the interoperability I’m envisioning. What RDF does offer, which is key, is the ability to “say anything about anything.” I’d contend, though, that RDF is not the only way to do that (or indeed even the best way, in many cases). Atom itself offers some RDFish extension points that I quite like, and efforts like OAI-ORE/Atom do so as well. The point is that if a content publisher has some “stuff” they want to describe, they should be able to do so in whatever way they wish, and this original “description” should stay with the item in full fidelity, even if it is mapped to standardized description/metadata schemes if/when necessary. I could say more, especially about OAI-ORE/Atom, since it so closely captures the sort of aggregating and re-aggregating that we see in academic data work. And as an Atom-based format (with an emphasis on the entry rather than the feed), it has a “write-back” story (AtomPub) built in.
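
As a sketch of the point about the original description staying with the item (the titles and values below are illustrative), here is an Atom entry carrying the publisher’s own description as foreign markup, Dublin Core terms in this case, right alongside the standard Atom elements:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
DC = "http://purl.org/dc/terms/"          # Dublin Core terms namespace

entry = ET.Element("{%s}entry" % ATOM)
ET.SubElement(entry, "{%s}title" % ATOM).text = "Onda Latina interview (example)"
ET.SubElement(entry, "{%s}id" % ATOM).text = "urn:uuid:00000000-0000-0000-0000-000000000001"
ET.SubElement(entry, "{%s}updated" % ATOM).text = "2009-07-15T12:00:00Z"

# Foreign-namespace extension elements: the publisher's own description
# travels with the entry, in full fidelity.
ET.SubElement(entry, "{%s}spatial" % DC).text = "Austin, Texas"
ET.SubElement(entry, "{%s}language" % DC).text = "es"

print(ET.tostring(entry, encoding="unicode"))
```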

I believe things are moving in the direction I describe, but by and large NOT (sadly) in academia and (more sadly) not within the library community. AtomPub is built into products by IBM, Microsoft, Google and others. In fact, Google’s suite of GData-based web applications (GData itself is based on Atom/AtomPub) comes quite close to what I describe. At UT Austin, we’ve been making very wide use of our DASe platform, which is basically a reference implementation of just the sort of application/framework I have described above. A faculty member with “stuff” they need to manage, archive, repurpose, share (images, audio, video, pdfs, data sets, etc.) can add it to a DASe “collection” and thus have it maintained as interoperable, structured data and enjoy the benefits of a rich web interface and suite of web-based RESTful APIs for further development. We have over one hundred such collections, and while most live behind UT authentication, some have seen the light of day on the open web as rich media-based web resources, including eSkeletons and Onda Latina among others.

My conclusion after a few years of working with DASe is that yes, this is definitely the way to go — RESTful architectures built on standard protocols and formats offer HUGE benefits from the start and are “engineered for serendipity,” such that new uses for existing data are regularly hit upon. I’d also note that it requires buy-in from all who wish to play — the hard work of developing specifications and standards, and of understanding and following those specifications, is the “cost of admission.” Likewise, a willingness to move away from (or build better, more RESTful interfaces around) legacy systems that don’t play well is crucial. This means a shared understanding among technologists, application developers, managers, administrators, and users must be promoted. It’s no easy task, in my experience — pushback can be fairly dramatic, especially when investment (as resources, mindshare, pride, etc.) in our legacy systems is significant. Our work as librarians, repository managers, information technologists is NOT, though, as much a matter of educating users as it is educating ourselves, looking “outside the walls,” and beginning the difficult conversations that start moving us in the right direction.