DASe


Layers

I’d like to drill down on one element of the arguments I made in my last two posts. I alluded to it, and it’s an idea that’s ever-present in my thinking: the idea of layering, or separation of concerns. I suspect it is the most common reason I get myself into trouble explaining my ideas — while I assume it is understood that I’m talking about layered solutions, I don’t always make that point explicit.

The idea (and it has some very important implications) is that a solution/technology/standard, etc. needs to address a limited, well-defined problem at the proper level of abstraction. I’m a huge fan of the Unix philosophy: “do one thing and do it well,” which captures the idea in its simplest form quite nicely. Tied into the idea of “levels of abstraction,” it is a paradigm/pattern that turns up all over the place in computer science: machine code is abstracted into assembly language, which is abstracted into something like “C,” which is abstracted into Perl/Python/PHP, etc. Each “layer” need only follow the interface provided by its underlying layer and provide a reasonably useful interface for the layer above. It’s a very powerful concept — indeed, the Internet itself is built on layers such as this.

But this idea of layered solutions is in no way limited to the “plumbing” of the systems we use. I would contend that all of the work we do as programmers, information technologists, and librarians is exactly this: we use the tools and resources that we have available (the underlying layer) to provide a useful abstraction for the customers/clients (the layer above) who will use our systems/collections/guidance, etc. Everywhere I look in my work, whether it involves code, applications, or co-worker and faculty interactions, I see this pattern popping up again and again.

I’m actually mixing a few different ideas here — the layers of abstraction pattern and the separation of concerns pattern are related, but not exactly the same thing. No big deal though, since they go together quite well. I’ll add a third idea to the mix, since it’s an important element of my thinking. It’s called Gall’s Law (I actually have it printed out and tacked up next to my desk) and says:

“A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a working simple system.”

We fall into a trap in academia of trying to solve problems vertically (hmm, the image of a silo pops into mind) rather than horizontally. We see a problem, are smart enough to fashion a solution, and so we solve it. But we run into trouble, because we come upon a related problem and wind up starting the process all over again, fashioning a new solution for this specific problem. Then our users come asking for help solving problems we hadn’t exactly thought of, and the problems compound — we’re thrashing, constantly on the treadmill of solving today’s problem. We typically make two big mistakes. One is not following Gall’s Law (and indeed the Unix philosophy). We have not properly decomposed our problem — a working solution to a given problem might very well be a set of small solutions nicely tied together. Two, we have not adequately abstracted our problem. We simply do not recognize the problem as one that has already been solved in some other arena because, superficially, they look different (e.g., not recognizing that the problem Amazon S3 tries to solve is remarkably like the problem we try to solve with our Institutional Repositories).

This also flows into the idea of “services,” which we are simply not seeing enough of in libraries, and I think it is a key piece of our survival. My “Aha!” moment came a number of years back when I discovered Jon Udell’s Library Lookup project. It was the first time I saw the OPAC as simply a service that I can interact with, either through the dedicated interface or with a scripted bookmarklet. It didn’t matter to the catalog which I used. It simply filled its role, performed the search, and returned a result. Jon Udell had unlocked the power of REST — an architectural style, derived from the design of HTTP itself and described by Roy Fielding (one of the principal authors of the HTTP spec), that elegantly laid out a set of rules (the REST “constraints”) that, when followed, would result in robust, evolvable, scalable systems. I no longer felt bound to one user experience for web applications I was building. I could begin thinking about an application as a piece of a larger, loosely coupled system. “Do one thing and do it well” in this case was indeed freeing.
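
To make the “service” idea concrete, here is a minimal PHP sketch. The catalog URL pattern and the result-page check are hypothetical stand-ins, not any particular OPAC’s actual interface; the point is simply that the same ISBN lookup a Library Lookup bookmarklet triggers from the browser can be made by any script, because the catalog is just answering HTTP requests.

    <?php
    // Hypothetical: treat the OPAC as a plain HTTP service. The URL
    // pattern is invented -- substitute your catalog's real ISBN search
    // URL (the same one a Library Lookup bookmarklet would send the
    // browser to).
    function opac_lookup_isbn($isbn)
    {
        $url = 'https://catalog.example.edu/search?type=isbn&q=' . urlencode($isbn);
        $html = file_get_contents($url);   // the same GET a browser would make
        if ($html === false) {
            return null;                   // catalog unreachable
        }
        // Crude screen-scrape check: did the result page report a match?
        return (stripos($html, 'No matches found') === false) ? $url : null;
    }

    $hit = opac_lookup_isbn('0596529260');
    echo $hit ? "In the catalog: $hit\n" : "Not found (or catalog offline)\n";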

But this brought up a whole new slew of problems: how do I know the piece I build, properly abstracted and decomposed, will be useful to the other pieces of the system? If I’m building all of the pieces, no problem, but that sort of defeats the purpose, no?  I want it to be useful to some other piece that someone else has built.

This is where my interest in standards and specifications began in earnest. What had seemed utterly mundane and boring to me previously suddenly seemed really useful and (I hate to admit ;-) ) exciting. People get together, compare notes, argue, sweat over tiny details, and end up defining a contract that allows two separate services to interoperate. Maybe not seamlessly, and perhaps not without the occasional glitch, but ultimately it works and the results can be astonishing (cf. THE WEB).

Let me be clear about one thing: this work of separating concerns, finding the proper level of abstraction, and creating useful and usable service contracts is a hugely difficult undertaking. And it’s work librarians should be and need to be involved in. No one knows our problems like we do, and while I refuse to believe that our problems are unique, who better than us to recognize that our problem X is exactly the same problem addressed by Y in some totally different realm? No one is going to decompose our problems or hit upon appropriate abstractions for us — that is OUR work to do.

I can offer a bit of empirical evidence to back up my assertion that it is a useful endeavor. Taking Gall’s words to heart, we built the DASe project as drop-dead simple as possible, made everything a service and evolved out from there, with modules and clients that can be quickly implemented, designed to do one thing well (the good ones stick around, becoming part of the codebase). I’m not here to promote DASe (that’s another topic — my preference would be to see the ideas we have incubated reimplemented in an application that actually has a good solid community already — DSpace 2.0 anyone? :-) ), but rather to push the idea that these basic, solid principles upon which the Web itself is based have real value and offer huge return on investment.

And so, to answer the incisive criticism of my original post: my suggestions are indeed of a technical nature, but that is exactly the layer I am addressing. It’s not vertical solutions I am after (“here is a problem… here’s how you fix it”), but rather horizontal: if we get the tooling layer right, we will have the opportunity to build, on top of that layer, more robust solutions — solutions that will get down into the nitty-gritty social and cultural problems that can seem at times to be quite intractable. My opinion is that they ARE intractable if they are attempted atop an unstable underlying layer. Let’s allow the folks who daily work in the incredibly diverse and challenging social realm to do so without the obstacles presented by tools that are not doing their job effectively.

Take Two

I’ve had a few responses to my last post that take issue with my seemingly one-size-fits-all, here’s-a-silver-bullet proposal, including Magical thinking in data curation from Dorothea Salo (whose previous blog post I had cited in mine).  That was not my intended message, but I suspect I left enough unconnected dots, mushed a few different ideas together, and failed to define some terms such that I left the wrong impression.

I’ll also confess that my post was not really about data curation per se (at least in the sense that Dorothea means it), but rather the tools we use to interact with data. I do think that better tools will make the hard work of data curation easier, or will at least (in many cases) push the complexity into a more manageable space.  I’ll also note that what I am proposing is in no way a “novel” approach — in fact, it’s based completely on decade-or-more-old standards and is quite common.  Examples of what I am proposing are all over the place, but we simply don’t see them often enough in academia or the library.

Here is an attempt to bullet-point a few conclusions/take-aways:

  • There’s no silver bullet.  If someone suggests there is, tell them they are wrong or write a blog post telling them as much ;-) .
  • We’re all using the Internet to share data.  It’s a good place to be doing so.  The Web, in particular (by this I specifically mean HTTP-based technologies) is excellent for this.  Email’s pretty good, too, as is FTP, etc. But the Web rules.
  • The Web is based on some basic principles that are not nearly well-enough understood.  A better understanding, especially by the folks who build the web applications and write the specifications we use, is crucial. Many of the tools we use in libraries/repositories are poorly attuned to the core principles of the web.  Seeing those systems evolve is in everyone’s interest.
  • These principles are actually quite elegant and powerful.  A better understanding of these principles by librarians, information technologists, administrators is also quite important, and (I’ll contend) something we should strive for.
  • All data has structure.  Much of that structure can be captured by the tools we all use to create data.  It’s critically important that we advocate for and use tools that do so and that make that data portable, i.e., available for reuse by other applications.  When the tool does not or cannot capture the structure of the data automatically (and I’m really talking about metadata here), make it as easy as possible for the user to add that metadata at the point of creation (see the sketch after this list).
  • Human-created metadata is far more difficult and expensive to produce than metadata that can be captured automatically.  When a human has to create a piece of metadata, make sure it does not have to be re-created by someone else later. That’s a huge waste of time.
  • The tasks we are engaged in within academia and the library (esp. when it comes to managing data) are not special. At all. The extent to which the library/repository sees its mission as unique (i.e., demands that it “solves its own problems”) is the extent to which it is doomed to extinction (and the sooner the better).
  • To put it another way — “we are all librarians/data curators now.”  There are great strides being made.  We ought not miss out due to some notion of the “specialness” of The Library.
  • Try to see analogs to our work in unlikely places.  Look at Twitter, Google, Amazon, Facebook, but look beneath the obvious use cases and think about the implications — can using the library (for some cases) be as easy as using Amazon or Google?  Could the rich ecosystem of client applications we see forming around Twitter form around our OPAC?  Could our systems be offered as a platform to client-app developers, the way Facebook’s is?  Following basic web principles makes this much more possible.
  • Why is our Institutional Repository not more like Amazon S3?  Would something like SimpleDB (Amazon’s key-value store) be a better way to capture information about our resources?
  • Why authorities are not available as web resources is mind-boggling to me.  Oh wait, they are!  Ed Summers et al. at the Library of Congress are at the forefront of this whole approach I am talking about.  Keep an eye on their work for ideas/inspiration.
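
As a sketch of the “capture structure at the point of creation” bullet above, here is a bit of hypothetical PHP for an ingest step that harvests the metadata a camera has already written into a photo, rather than asking a human to retype it later. The field names and the surrounding workflow are assumptions, not any particular system’s code.

    <?php
    // Hypothetical ingest step: harvest the metadata the camera already
    // captured, so no human has to re-create it by hand later.
    function harvest_photo_metadata($path)
    {
        $exif = @exif_read_data($path);   // PHP's exif extension
        if ($exif === false) {
            return array();               // not a JPEG/TIFF, or no EXIF present
        }
        // Keep it as simple key-value pairs -- no scheme imposed up front.
        return array_filter(array(
            'created'  => isset($exif['DateTimeOriginal']) ? $exif['DateTimeOriginal'] : null,
            'camera'   => isset($exif['Model']) ? $exif['Model'] : null,
            'exposure' => isset($exif['ExposureTime']) ? $exif['ExposureTime'] : null,
        ));
    }

    print_r(harvest_photo_metadata('/path/to/vacation-photo.jpg'));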

There’s lots more to say, but I’ll leave that for another post….

I read with great interest both Peter Brantley’s Reality dreams (for Libraries) and Dorothea Salo’s Top-down or bottom-up?, as both address the increasingly obvious need for data systems support in higher ed. The issue, which in practice comes in many shades and hues, is that our increasingly digital and connected world offers challenges that are not being met and opportunities that are not being exploited. Centralized data curation, the formation of “communities of practice,” individualized faculty support — these are a few directions institutions might look to help matters, but I cannot help but think that a lasting solution demands a more fundamental foundation that simply does not yet exist. To put it more bluntly (cliché and all): we don’t yet have our “killer app” for data management.

Dorothea wisely asks if a single tool could possibly fit the bill. Well, no, but it’s likely not a single tool we are after. While it might have seemed in the early 1990s that the Mosaic web browser was the Internet’s killer app, it was actually HTML and HTTP that allowed browsers to be created and the Web to explode in popularity and usefulness. Likewise, our solution will likely be a set of protocols, formats, and practices, all of which will enable the creation of end-user applications that can “hide the plumbing.” Indeed, training our users to practice better data “hygiene” will be a fruitless task unless they have applications that force them (by way of the path of least resistance) to do so. It’s not a stretch — every day we send emails, post to blogs, upload pictures to Facebook, etc. without a thought that our data must be properly structured to achieve our aims.

Here are a few essential characteristics of our killer app:

  • The inherent structure of the data must be captured and maintained from the moment of its creation through its entire lifecycle.
  • Separable concerns need to be separated (data, presentation, access control, etc. each have their own “layer”).
  • Reuse, repurposing, and “remixing” must be first-class operations. In fact, the line between use and reuse should simply be erased (i.e., reuse IS use and vice-versa).
  • The feedback loop with the user should be as tight as possible. I.e., I can immediately visualize a whole range of possible uses of the data I have created.

These are not pie-in-the-sky demands: lots of applications already do a decent job at this (almost any web application fits the bill, since the medium practically demands it). But the tools I see faculty regularly using (Excel, FileMaker, Microsoft Word, the desktop computer filesystem) do NOT. That these are seen as anything but disastrous for the creation of structured data is surprising to me.

Which leads me to my next point: I don’t think there is such a thing (anymore) as data that “does not need to be structured” or “data that will never need to be shared.” If there is one point of data hygiene that we do really need to get across to all of our non-technical folks it is that everything you create will be reused and will be shared no matter how unlikely that might seem right now. If not by someone else, then by you. Systems that don’t distinguish use from reuse or originator access from secondary/shared access are exactly what is called for (n.b. I’m not suggesting that authorization/access controls go by the wayside, but rather that they happen in another layer).

Too often our digital systems perpetuate distinctions that, while logical in a pre-digital world, are actively harmful in the digital realm. Consider, for example, a digital photograph taken by a faculty member on a family vacation. The camera automatically attaches useful metadata: time and date, location (if the camera is geo-enabled), camera model and settings, etc. The data and metadata are in no way tied to a specific use, but will be useful no matter how that photo is used/reused. I’ve seen plenty of cases where that vacation photo is as likely to appear on a holiday greeting card as it is to be used in a classroom lecture, as part of a research data set, or as an addendum to a published paper. As things stand, our faculty member would turn to different systems for the various use cases (e.g., Flickr or Picasa for personal uses, PowerPoint for classroom use, DSpace for “institutional” use). While I’m not suggesting that Picasa be expected to serve the purposes of an institutional repository or DSpace that of a classroom presentation tool, part of me thinks “well, why not?”

A more practical expectation would be that our systems interoperate more seamlessly than they do, and that moving an item (data plus metadata) from one to the other is a trivial, end-user task. As I mentioned above, we need protocols, practices, and formats to allow this sort of interoperation. I think that for the most part we have all of the pieces that we need — we simply lack applications that use them. For protocols, I think that HTTP and related Open Web standards (AtomPub, OpenID, OAuth, OpenSearch, etc.) offer a fairly complete toolset. Too often, our systems either don’t interoperate, or offer services that are simply proprietary, application-specific protocols on top of HTTP (e.g., SOAP-based RPC-style web services), which misses the whole point of the Web: HTTP is not just a transport protocol, but IS the application protocol itself. The growing awareness of and interest in REST-based systems is simply that: using the Web (specifically HTTP) exactly as it was intended. Thus REST-based architectures and the design principles they promote offer the “practices” part of the equation.

As an example of a REST-based system (or standard) you cannot do much better than the Atom Publishing Protocol. While it may not have taken the world by storm the way its creators had hoped, it still has loads of potential as “plumbing” — perhaps not something that end users would be aware of, but hugely useful for application developers. And were one tempted to try some other approach, or to “just use HTTP,” I am pretty sure they’d end up developing something quite like Atom/AtomPub anyway. In either case, there is no hiding or abstracting away HTTP — it’s in full view in any RESTful system on the web and is in all cases (this being the Web we are talking about) simply unavoidable.
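
For a taste of AtomPub as plumbing, here is a minimal PHP sketch of the protocol’s create operation: POST an Atom entry to a collection URI and read the new member’s address from the Location header of the 201 response. The collection URL and entry content are invented for the example; the HTTP mechanics follow the AtomPub spec (RFC 5023).

    <?php
    // Hypothetical AtomPub "create": POST an entry to a collection URI;
    // per RFC 5023 a 201 response carries the new member's URI in the
    // Location header. The collection URL below is made up.
    $entry = '<?xml version="1.0" encoding="utf-8"?>
    <entry xmlns="http://www.w3.org/2005/Atom">
      <title>Vacation photo 42</title>
      <updated>2009-06-01T12:00:00Z</updated>
      <author><name>A. Faculty Member</name></author>
      <content type="text">A photo from the family vacation.</content>
    </entry>';

    $ch = curl_init('https://repository.example.edu/collections/photos');
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $entry,
        CURLOPT_HTTPHEADER     => array('Content-Type: application/atom+xml;type=entry'),
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HEADER         => true,   // keep headers so we can read Location
    ));
    $response = curl_exec($ch);
    $status   = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status == 201 && preg_match('/^Location:\s*(\S+)/mi', $response, $m)) {
        echo "Created: {$m[1]}\n";        // URI of the new collection member
    }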

The next bit to tackle is the format: what data formats allow the sort of interoperability we are seeking? Certainly, as the format behind AtomPub, the Atom Syndication Format would be an obvious choice. But there are others: JSON, HTML, XHTML, RDF/XML, RDFa, etc. I tend to favor Atom and JSON, having found both quite suited to the sort of tasks I have in mind. RDF is widely viewed as the basis of the “Web of Data,” but its lack of a containment model makes it ill-suited for the sort of RESTful interactions that are a critical component of the interoperability I’m envisioning. What RDF does offer, which is key, is the ability to “say anything about anything.” I’d contend, though, that RDF is not the only way to do that (or indeed even the best way, in many cases). Atom itself offers some RDFish extension points that I quite like, and efforts like OAI-ORE/Atom do so as well. The point is that if a content publisher has some “stuff” they want to describe, they should be able to do so in whatever way they wish, and this original “description” should stay with the item in full fidelity, even if it is mapped to standardized description/metadata schemes if/when necessary. I could say more, especially about OAI-ORE/Atom, since it so closely captures the sort of aggregating and re-aggregating that we see in academic data work. And as an Atom-based format (with an emphasis on the entry rather than the feed), it has a “write-back” story (AtomPub) built in.
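
To illustrate the extension-point idea, here is a rough PHP sketch of an Atom entry that carries a collection’s original description along as foreign-namespace elements. The namespace URI and element names are invented for the example; they stand in for whatever attributes a content publisher has already been using.

    <?php
    // Sketch: an Atom entry carrying a publisher's original description
    // as foreign-namespace extension elements (the namespace is invented).
    $w = new XMLWriter();
    $w->openMemory();
    $w->setIndent(true);
    $w->startElementNs(null, 'entry', 'http://www.w3.org/2005/Atom');

    $w->writeElement('id', 'urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a');
    $w->writeElement('title', 'Mission church, San Antonio');
    $w->writeElement('updated', '2009-06-01T12:00:00Z');

    // The collection's own attribute/value pairs travel with the entry,
    // untouched, whether or not they are ever mapped to a standard scheme.
    $ns = 'http://example.edu/dase/terms#';   // hypothetical namespace
    $w->writeElementNs('d', 'person_depicted', $ns, 'unknown');
    $w->writeElementNs('d', 'time_period', $ns, 'late 18th century');

    $w->endElement();   // entry
    echo $w->outputMemory();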

I believe things are moving in the direction I describe, but by and large NOT (sadly) in academia and (more sadly) not within the library community. AtomPub is built into products by IBM, Microsoft, Google and others. In fact, Google’s suite of GData-based web applications (GData itself is based on Atom/AtomPub) comes quite close to what I describe. At UT Austin, we’ve been making very wide use of our DASe platform, which is basically a reference implementation of just the sort of application/framework I have described above. A faculty member with “stuff” they need to manage, archive, repurpose, share (images, audio, video, pdfs, data sets, etc.) can add it to a DASe “collection” and thus have it maintained as interoperable, structured data and enjoy the benefits of a rich web interface and suite of web-based RESTful APIs for further development. We have over one hundred such collections, and while most live behind UT authentication, some have seen the light of day on the open web as rich media-based web resources, including eSkeletons and Onda Latina among others.

My conclusion after a few years of working with DASe is that yes, this is definitely the way to go — RESTful architectures built on standard protocols and formats offer HUGE benefits from the start and are “engineered for serendipity,” such that new uses for existing data are regularly hit upon. I’d also note that it requires buy-in from all who wish to play — the hard work of developing specifications and standards, and of understanding and following those specifications, is the “cost of admission.” Likewise, a willingness to move away from (or build better, more RESTful interfaces around) legacy systems that don’t play well is crucial. This means a shared understanding among technologists, application developers, managers, administrators, and users must be promoted. It’s no easy task, in my experience — pushback can be fairly dramatic, especially when investment (as resources, mindshare, pride, etc.) in our legacy systems is significant. Our work as librarians, repository managers, and information technologists is NOT, though, as much a matter of educating users as it is educating ourselves, looking “outside the walls,” and beginning the difficult conversations that start moving us in the right direction.

REST Presentation

Here’s the PDF slide set of a talk I gave to the IT community at UT Austin on June 25, 2008:
Real World REST with Atom/AtomPub

Roy Fielding’s latest post discusses the differences between software implementations, architectures, and architectural styles. Having spent the last few months rewriting DASe to adhere more closely to the REST architectural style, I have come to know the occasionally dizzying challenge of making good design decisions based on a set of principles operating two or three levels of abstraction above the nitty-gritty implementation that I am putting down in code. I had thought that DASe was fairly RESTful in its original design, but in fact some important REST constraints were not being followed. It has been an interesting adventure, to say the least, but I am as convinced as ever that the ultimate benefits will be worth it, both for DASe and for my own understanding of distributed information systems.

The architecture of DASe relies heavily on the Atom and AtomPub specifications, which I think is a pretty good base, since those specs are shot through with an understanding of and desire to capitalize on the principles of REST.  When REST is discussed, it is as likely as not in terms of HTTP, and half again as likely to be in terms of Atom/AtomPub.  So there’s much to be learned from some very intelligent folks on the subject.  And frankly, when a design decision needs to be made, it can be difficult to bring abstract concepts down to the concrete, and thus “what does Google do?” or “what does Amazon do?” is often a pretty good start. Same with Atom.  Why not let the Atom spec’s design decisions flow back up into my design?  So my database tables tend to have an “updated” column and an “updated_by” column that map nicely to atom:updated and atom:author.  And my primary domain classes generally have an “asAtom” method.  It’s not so different, I think, from Mark Nottingham’s recent post about allowing the four principal HTTP verbs to inform his data model.  Although ostensibly about getting “beyond” methods, it strikes me as simply “convention over configuration” that says “OK — tight coupling with HTTP methods is fine,” so we can move up a level of abstraction.  Design is ALWAYS a balance, and I was completely mistaken when I assumed REST would be a cookbook of answers for the design challenges of web application architecture and design.  But it does, I think, provide a framework for considering those decisions and an awareness of the possible trade-offs and benefits lurking when we choose one path over another.
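
A simplified sketch of what that looks like in practice (the class and columns here are illustrative stand-ins, not the actual DASe code): a domain object whose “updated” and “updated_by” columns map directly onto atom:updated and atom:author, exposed through an asAtom method.

    <?php
    // Illustrative stand-in for a DASe-style domain class (not the real
    // code): database columns map straight onto Atom elements.
    class Item
    {
        public $id;          // -> atom:id
        public $title;       // -> atom:title
        public $updated;     // "updated" column    -> atom:updated
        public $updated_by;  // "updated_by" column -> atom:author/name

        public function __construct($id, $title, $updated, $updated_by)
        {
            $this->id = $id;
            $this->title = $title;
            $this->updated = $updated;
            $this->updated_by = $updated_by;
        }

        public function asAtom()
        {
            $e = new SimpleXMLElement('<entry xmlns="http://www.w3.org/2005/Atom"/>');
            $e->addChild('id', 'tag:example.edu,2008:item/' . $this->id);
            $e->addChild('title', htmlspecialchars($this->title));
            $e->addChild('updated', $this->updated);
            $e->addChild('author')->addChild('name', $this->updated_by);
            return $e->asXML();
        }
    }

    $item = new Item(42, 'Mission church, San Antonio', '2008-06-25T15:04:05Z', 'a_librarian');
    echo $item->asAtom();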

The Roy Fielding post drew a comment/response from Sam Ruby, and that, in turn, a response from Roy stating “…but hypertext as the engine of hypermedia state is also about late binding of application alternatives that guide the client through whatever it is that we are trying to provide as a service.”  I must have read it somewhere before (perhaps in the dissertation?), but I often use that phrase “late binding” in describing REST.  It really is key — the state of a representation need not be set until I choose to interact with it.  And in my interaction with it, that state is bound (and thus usable & interesting to me), but there is no contract regarding that state.  It can continue to grow and change, and everything it links to can grow and change in its own time.  I (a resource owner) am NOT bound by others’ interactions with the resource.  Well, as a librarian, that is a compelling/revolutionary/subversive/threatening (not to mention terrifying) proposition! But it certainly does seem to offer incredible opportunity, and it captures much of the promise and messiness of information flow in the real world.

DASe & Metadata

Early in the development of the DASe project we decided/realized that the ONLY way we would be able to quickly and efficiently deal with all of the various digital collections we hoped to incorporate would be to NOT enforce any kind of metadata scheme on anyone, but rather simply let folks describe their “stuff” any way they wish. Not to mention, since many of these were legacy collections set up in a FileMaker or Access database or even an Excel spreadsheet, there was often already a schema in place and folks (rightly) didn’t want to change. Note that we are talking about faculty members and department administrators who have far better things to do than figure out how to use Dublin Core to describe the images that they have already been using for years in their classes, research, and publications.

We (Liberal Arts Instructional Technology Services at The University of Texas at Austin) had an interest in “rationalizing” this hodge-podge of data and metadata toward two ends: one, we wanted folks to be able to share their collections easily if they wished, and two, we wanted a means by which we could easily repurpose the digital assets in all sorts of ways: podcasts, websites, specialized search interfaces, etc. So we went with what are essentially key-value pairs: collection managers create “attributes” (e.g., title, description, person depicted, time period, etc.) that best describe their assets, and we provide an interface that allows them to add metadata to any item by filling in a value for any/all attributes that apply. Well, it turns out this works REALLY well. We currently have 88 collections, comprising over 300,000 items (images, audio, video, documents, etc.), and the system holds over 4 million pieces of metadata (i.e., the “values” table has over 4 million rows). Searching is fast, adding new collections is easy, and application maintenance (including backing up collections as XML documents) is painless.
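
Here is a rough PHP sketch of that key-value shape (table and column names are illustrative, not the actual DASe schema): a collection manager defines whatever attributes fit the material, and every piece of metadata is just one row tying an item to an attribute and a value.

    <?php
    // Illustrative key-value metadata layout (not the actual DASe tables):
    //   attributes(id, collection_id, name)          e.g. "title", "time_period"
    //   item_values(id, item_id, attribute_id, value_text)
    $db = new PDO('sqlite::memory:');
    $db->exec('CREATE TABLE attributes (id INTEGER PRIMARY KEY, collection_id INT, name TEXT)');
    $db->exec('CREATE TABLE item_values (id INTEGER PRIMARY KEY, item_id INT, attribute_id INT, value_text TEXT)');

    // A collection manager defines whatever attributes fit their material:
    $db->exec("INSERT INTO attributes (collection_id, name) VALUES (1, 'title'), (1, 'person_depicted')");

    // Describing an item = filling in values for any attributes that apply:
    $ins = $db->prepare('INSERT INTO item_values (item_id, attribute_id, value_text) VALUES (?, ?, ?)');
    $ins->execute(array(7, 1, 'Mission church, San Antonio'));
    $ins->execute(array(7, 2, 'unknown'));

    // Search cuts across collections and schemes alike:
    $q = $db->prepare('SELECT item_id FROM item_values WHERE value_text LIKE ?');
    $q->execute(array('%San Antonio%'));
    print_r($q->fetchAll(PDO::FETCH_COLUMN));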

The current version of DASe runs on PHP4 with a PostgreSQL back end. The next rev, which is a significant retooling of the current architecture and code base, will be PHP5 and will be able to use PostgreSQL, MySQL, SQLite, or XML files as a back end. How that all works, where Atom, REST, RDF, and more fit in, problems encountered along the way, as well as solutions settled on (tentative and otherwise) will be some of the topics explored in future posts.

Sam Ruby has a post that is particularly on target for answering (or at least exploring) a couple of questions I posed to the atom-syntax group (see Threads: “Why Use Atom?” and “Atom Inside web application architecture”). His was a follow-up to Yaron Goland’s entertaining Revenge of Babble post. The comments are enlightening, both in the ways folks try to answer/address the problem AND in the ways that no one really DOES, at least definitively (or so it seems to me). The closest is Sam Ruby:

Yaron Goland: “What are the criteria that identify a problem that ATOM is a good solution for? And even more importantly what are some key flags to look for that identify a problem that ATOM is probably a bad idea for?”

Sam Ruby: “My take: Atom is good for circumstances where data can be organized into “chunks” that you can identify by title, location, who made the last change, when that change was made, and the data itself is either textual or a brief textual summary can be obtained/synthesized. When is it bad? While the above list suggests several pieces of data that should be present, these requirements don’t tend to be equally weighted. The most fundamental pieces of information in my experience are ones that allow a client to answer the following two questions: have I seen this information before, and did it change? Where can I find it and did it meaningfully change are next.
(…) But my experience is that being able to answer “did this resource change?” in both machine and human terms is essential for proper use of HTTP and Atom respectively.”

The question I need to answer is: “What is the cost associated with Atom, and what are the benefits?” DASe is built in PHP5, and the XML tools available (XMLReader, XMLWriter, SimpleXML, etc.) are REALLY easy to use when dealing with XML that’s not too complex. In all cases, the XML I am passing around is very simple: lists of collections, lists of items, lists of media files. It seems to me there IS some cost to using Atom internally in the application, since it inevitably makes those very simple XML structures more complex. I’m beginning work on a generic DASe Atom class that will help simplify the process of serializing and unserializing PHP objects and object arrays into Atom entries and/or feeds, and we’ll see what that gives us. Two things I hope to discover: one, does Atom help me answer that question “did this resource change?” and/or help me “bake” that into the representation of the data; and two, since Atom obviously transforms any/all data endpoints into service endpoints, what’s gained there? Perhaps a lot, in fact, since I can “subscribe” to internal processes in the application (e.g., error logging) as a monitoring/maintenance tool.
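
To make the “did this resource change?” question concrete, here is a hypothetical PHP sketch of a monitoring client polling an internal Atom feed (an error log, say) with a conditional GET: HTTP answers the question in machine terms via a 304 response, and atom:updated answers it in human terms. The feed URL is invented.

    <?php
    // Hypothetical monitoring client: poll an internal Atom feed (e.g. an
    // error log exposed as a service endpoint) and only do work when
    // something has actually changed.
    $feedUrl  = 'https://dase.example.edu/internal/error-log/atom';  // invented
    $lastEtag = null;   // in practice, persist this between runs

    $ch = curl_init($feedUrl);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => $lastEtag ? array('If-None-Match: ' . $lastEtag) : array(),
    ));
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($status == 304) {
        echo "Nothing new.\n";                 // HTTP answers "did it change?"
    } elseif ($status == 200) {
        $feed = new SimpleXMLElement($body);
        foreach ($feed->entry as $entry) {
            // atom:updated answers the same question in human terms
            echo $entry->updated, '  ', $entry->title, "\n";
        }
    }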