- New Harvard Law School Library Project Manager Positions for Innovative Projects w/the Berkman Center for Internet & Society
The HLS Library, in conjunction with the Berkman Center for Internet & Society, is pleased to announce two new project manager positions. The newly posted Library Technology (LT) Project Manager will be responsible for the design, implementation, and management of special and ongoing projects across the HLS Library. The recently posted Academic Technology (AT) Project Manager will be based in the Teaching, Learning and Curriculum team at the HLS Library and will coordinate and promote new technologies and digital solutions for legal classrooms. Both project managers will take part in Berkman Center Geek Hours, learn the latest project management practices and tools, participate in brainstorming sessions, and give feedback to Berkman developers working to create a new generation of academic digital tools.
“With the library collaborating on projects with the Berkman Center, we have opportunities for breakthrough innovation we might not be able to reach separately. The Library Technology Project Manager will work with Berkman developers and project managers to move from ideas to execution and back again in a virtuous feedback loop,” said Professor Jonathan Zittrain.
The new LT project manager will be responsible for three major projects currently in development. The first, Perma.cc, a pilot project developed by the Harvard Library Innovation Lab, is a web archive solution designed to combat the problem of “link rot” by creating permanent links for legal and scholarly citations. The new position will also guide activities on the Free the Law project, which aims to digitize and make available the library’s collection of public source federal and state case law. Finally, the LT project manager will collaborate with the HLS academic technology team and Berkman staff to further develop H2O, a suite of open source, online classroom tools that allow professors to freely develop, remix, and share elements of online textbooks, casebooks, and modules under a Creative Commons Attribution-NonCommercial-ShareAlike license.
We are casting a wide net for this position. You can find more details here.
About the HLS Library & the Library Innovation Lab: The library is housed in the neoclassical Langdell Hall, the largest building on campus, and is the largest academic law library in the world. The mission of the library is to support the research and curricular needs of its faculty and students by providing a superb collection of legal materials and by offering the highest possible level of service. The library explores and develops new and innovative technologies for the discovery and delivery of information to the world. The Harvard Library Innovation Lab is a group within the HLS Library that implements in software ideas about how libraries can be ever more valuable.
About the Berkman Center: The Berkman Center for Internet & Society at Harvard University is a research program founded to explore cyberspace, share in its study, and help pioneer its development. Founded in 1997, through a generous gift from Jack N. and Lillian R. Berkman, the Center is home to an ever-growing community of faculty, fellows, staff, and affiliates working on projects that span the broad range of intersections between cyberspace, technology, and society. More information can be found at http://cyber.law.harvard.edu.
- Humanizing the Web
I wrote this in April of 2008 for The Times, and don’t think I ever posted it here –
Humanizing the Web
The Web’s design reflects the open ethos of its early users: it has no central managers, no main menu, and no investment in content – indeed, no business plan whatsoever. Instead, its framers assumed that people and institutions could put their own material online, and the Web would grow to whatever size it might. Users could surf from one site to the next, following links that Web site authors saw fit to place on their pages.
Then the first search engines sprang up. They sent digital robots crawling from one link to the next, copying everything they found, hoping to index the entire Web in one place by obsessively following every path from several starting points. The engines worked: soon one could search the Web not only by following links, but also by entering a search term and finding all the pages containing that term.
This shortcut offered by search engines rankled some webmasters. They wanted to be found only the old-fashioned way. Even though they’d chosen to put their data on the Web for all to see, they felt far more exposed once any words within – including proper names – could turn up their pages as search results.
But these webmasters didn’t turn to lawyers. They didn’t assert some right to be excluded from the robots’ indexing, even though they might well have had one. Instead, an informal discussion in 1994 produced an obscure standard called “robots.txt.” With it, a webmaster could place very basic requests in a corner of his or her Web site to tell visiting robots which pages to ignore. The standard worked: every major search engine, no matter how data-hungry, will pass over pages marked by webmasters as not to be searched. (Try visiting your favorite web site while adding /robots.txt to its name, such as http://www.harvard.edu/robots.txt, and you’ll see what I mean.) The standard solved most of a problem before judges or legislators had to be called in, by deploying one of the fundamental units of a civilised society: an unelaborated request to be let alone. Robots.txt has become a crucial part of the scaffolding that keeps the Web functioning socially. It creates semi-private spaces where before the choice was between entirely private or entirely public. With the norm established, it might even have attained the force of law, as the Web has expanded and not every search robot is feeling so civilised.
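For instance, a minimal robots.txt file might look like this (the directory name and crawler name below are purely illustrative, not drawn from any real site):

```
# Ask every crawler to skip a particular directory,
# and ask one misbehaving crawler to stay out entirely.
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
```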
The nerdy concordat of robots.txt offers an important lesson at a time when we ourselves are feeling exposed and vulnerable on the Web. Everyone has heard horror stories of invasions of privacy and cyber-bullying: photos taken out of context, embarrassing videos posted online, and a mob mentality where commenters can be vicious about what they see. In 2002, a Canadian boy filmed himself swinging a golf ball retriever as if he were a Jedi knight. He did it for his own amusement, and for a while the tape lay forgotten. Some of his friends saw it, and without his permission, though perhaps without malice, placed it online. Within two weeks the video had been reposted in many places and viewed millions of times. Spin-off videos were produced, adding soundtracks and extra graphics. Much of this creative activity was no doubt in fun – but the boy was mortified. He walked around his school to chants of “Star Wars Kid!” He was diagnosed with depression. To this day he deeply regrets the airing of the film.
The first instinct in cases like these is to look to the law. Indeed, the boy sued, and a settlement was reached with the families of the boys who initially put the video online. More generally, stories like these cause many to rue the Web and the collateral damage its openness can cause, and perhaps to retreat as much as possible – placing as little about themselves online as they can, hoping to remain a grain of sand on the digital beach, in comfortable anonymity.
The law may have a role here, but its cumbersome machinery should be a last resort. Instead, we should recognize that the Web’s technology for moving around bits of data has far outpaced its ability to move social information. A person of good will encountering a video of the Star Wars Kid might happily forward a link to it to friends – not realizing that the subject of the video is wounded by it, and thus not having an opportunity to consider the ethical implications of the act.
As tons of personal data flood the Web, we can design ways to imbue it with social cues. What is today a disembodied photo on a Google image search could be tagged by its photographer and subjects to indicate just how far they’d like it to spread – and to request that major uses of it be cleared first, providing a way to reach them for consultation without their having to divulge their actual identities. That photo then becomes anchored to the people who made it and are in it, and those encountering it online can see it as the social artefact that it is – rather than just a funny image. I believe that if the Star Wars Kid could have tagged his video as private at the time he made it – just in case it should escape its videocassette home – his friends might not have posted it. Even if they did post it, others might not have rushed to help it go viral.
Take the Star Wars Kid Test yourself: Imagine someone forwarded you a link to the video and you thought it was funny. You’re about to share it with friends. But then you see that it’s been marked by the Kid himself as private, and a desperate plea is attached to please not fan the flames, along with an explanation of what happened to get it online in the first place. Would you forward the link?
I’m hoping not. And if enough people – not everyone, of course, just enough – decided to respect the Kid’s wishes, the video might never have reached the critical mass that took it viral. We know there are bad apples online; there are people who won’t respect reasonable requests made nicely. But it’s the vast middle, the rest of us, who transform run-of-the-mill privacy violations online into the truly awful phenomena that they can become. I don’t blame us – yet – because there’s no way to easily convey those requests and cues that make civilisation breathe. If we build those technologies and embed them in the Web, we’ll then be able to understand ourselves as facing true moral choices when we remix a video or redistribute information that might be all in fun – or might be personally devastating.
Hints of the possibility of a humanized Web abound. There is an extensive entry about Star Wars Kid within Wikipedia, the online encyclopedia that anyone can edit. The Kid’s name is not found in the article. On his article’s corresponding discussion page, a debate has raged about whether or not he should be identified. “He’s already been very prominently featured in a USA Today story, as well as being mentioned in many, many places on the Web,” writes one Wikipedian. Another says: “I read that his parents requested his last name to be kept confidential in future reproductions. Therefore, I think it would be wise if we deleted the … name, not in the fear that Wiki’ll be sued otherwise, just out of courtesy to the poor kid.”
Courtesy prevailed. Online masses 1, USA Today 0.
As I entered this in, in November of ’13, I checked the Wiki page again: and Star Wars Kid is now named in the first sentence of the article. As best I can tell from the article’s talk and history pages it was added in June of ’13, with the justification that “… as per talk page discussion, he has explicitly associated himself with the viral video through various high profile media outlets.” I can’t tell if that puts the score back to zero for Wikipedia, or if genuinely changing circumstances simply point to the encyclopedia’s flexibility and timeliness. I think it’s more the latter; see if you think the reasoning is persuasive — or at least attentive to the earlier opposite consensus.
And an interesting related post, from someone who had her own picture go unpleasantly viral, describes what she did next.
- Joining Team Archive: Perma and the Ongoing Effort to Preserve the Web
The accessibility and flexibility of the Internet is a double-edged sword. A distributed web makes it easy to publish content and link to it, but it also means that this content is by no means permanent: any given server or page can disappear or change at any time. (For example, the U.S. federal government was partially shut down in late 2013, with thousands of formerly stable Web pages at .gov destinations no longer available.) When this happens, links that previously led users to those resources take them instead to error messages, unrelated content, or snarky commentaries on the transience of online content: this phenomenon is called linkrot.
As noted previously, the Harvard Law School Library, in conjunction with over 30 partnering libraries and non-profits, has developed Perma.cc to mitigate the impact of linkrot on scholarly citations. Perma is an archive with a relatively narrow scope. Rather than undertake the daunting task of archiving the entire Web – something that the Internet Archive, a Perma partner, has been doing amazingly well – Perma is designed to take particular instances of particular web pages at the specific request of an author and place them in the hands of a community of libraries for safe-keeping.
The battle against linkrot is not one that Perma fights alone. There are other organizations and initiatives attacking this problem from angles that differ both in scope and technical implementation.
The Internet Archive has developed an unparalleled tool for preserving online content called the Wayback Machine. The crawler-based archive attempts to document every page on the Web by routinely saving the content hosted at all the URLs its crawler can find. If a Web surfer wants to see what the Google homepage looked like in 2002, she can go to the Wayback Machine, and it will show her all cached versions of the Google homepage saved in 2002 — or any other year — by the Internet Archive crawler.
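For those who would rather ask the same question programmatically, the Internet Archive also offers a simple availability endpoint. A minimal Python sketch follows; the URL and date are arbitrary examples, and the response shape is an assumption based on the archive's public availability API:

```python
import json
import urllib.request

# Ask the Wayback Machine for the snapshot of google.com closest to early 2002.
# (Endpoint and response fields per the Internet Archive's availability API;
# the target URL and timestamp here are arbitrary examples.)
query = "https://archive.org/wayback/available?url=google.com&timestamp=20020101"
with urllib.request.urlopen(query) as response:
    data = json.load(response)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print("Archived copy:", closest["url"])
else:
    print("No snapshot found.")
```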
A slightly narrower approach to large-scale archiving has been implemented by SiteStory. Instead of crawling the web, or caching instances at the request of a user, SiteStory monitors the servers hosting particular content, and makes a record of that content every time information is requested from the server. While this method won’t cache websites that are not being visited, it can create detailed archives of frequently visited sites that capture almost every change made as the sites evolve.
Atop these archives operate a number of protocols and interfaces that assist users in accessing archived content in a targeted fashion. While these services do not, as Perma does, display exactly what a particular author wanted a reader to see, they enable users to pinpoint the content they are looking for across archives. One example of such a tool is Memento, a framework (with a Chrome plug-in) that adds a temporal dimension to the HTTP protocol. Designed to make old versions of web pages more accessible, Memento enables a user’s browser to time travel through many different archives, including the Internet Archive’s Wayback Machine, SiteStory, and Archive.is. Because of the user-friendly power of the Memento tool, the Perma team is seeking to make Perma Memento-compatible in a future release.
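Under the hood, Memento works through ordinary HTTP: a client asks a "TimeGate" for a page along with an Accept-Datetime header, and the TimeGate redirects it to the archived snapshot closest to that moment. A rough Python sketch follows; the aggregator address and the date are only examples, and the exact behavior is an assumption based on the Memento framework (RFC 7089):

```python
import urllib.request

# Request the version of example.com closest to a chosen date by sending
# an Accept-Datetime header to a Memento TimeGate (RFC 7089).
target = "http://example.com/"
timegate = "http://timetravel.mementoweb.org/timegate/" + target

request = urllib.request.Request(
    timegate,
    headers={"Accept-Datetime": "Thu, 01 Jan 2009 12:00:00 GMT"},
)
with urllib.request.urlopen(request) as response:
    # The TimeGate redirects to the closest archived snapshot (a "Memento").
    print("Resolved to:", response.geturl())
    print("Memento-Datetime:", response.headers.get("Memento-Datetime"))
```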
Narrowing the scope of an archival mission even further, services such as WAIL, Archive.is, and WebCite are designed to allow users to store a cache of a specific instance of a website — either on their own computers or with the service itself. WAIL serves as an interface for archiving tools, such as Heritrix and Apache Tomcat, giving users an easy way to use these technologies to create their own local repositories. WebCite and Archive.is also preserve user-directed copies of online content, but they host it themselves as part of centralized archives.
Perma enriches this community of archivists in several ways. One unprecedented feature is Perma’s institutional nature. Libraries are in the forever business. When a library promises to save something, it means it, and some of the libraries behind Perma go back hundreds of years. Libraries have ventured into digital archiving with tools such as Digital Object Identifiers (DOIs), but these are predominantly for uniquely identifying institutionally published materials, rather than archiving them in a particular place or manner. This blog post, for example, won’t readily have a DOI, but one could make a Perma link to it!
By bringing the institutional power and promise of libraries to bear on the transient content of the Internet, Perma aspires to create a hybrid of these two essential entities—and in doing so, to capture the best of both worlds. Perma is as accessible as the Internet, and meant to be as permanent as the libraries that stand behind it.
To fully realize the accessibility and durability to which Perma aspires, the Perma team is constructing an archive that spans a network of mirrored servers distributed throughout the consortium of partnering libraries. We aim to construct a network through which an independent copy of the cached content behind every Perma link will be stored on each of these servers. The federated nature of the archive will ensure that if the servers at any one institution go down, the content will still be accessible from the other mirroring partners. With each additional mirroring library, the probability that the Perma archive will remain accessible increases.
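A rough way to see why each additional mirror matters: if each mirror were to fail independently with the same probability, the chance that every copy is unreachable at once shrinks exponentially with the number of mirrors. The numbers in this sketch are purely illustrative, and real failures are of course not perfectly independent:

```python
# Probability that every mirror is down at once, assuming each mirror
# fails independently with probability p (both numbers are illustrative).
p = 0.01          # chance a single library's servers are unreachable
for mirrors in (1, 2, 5, 10):
    print(mirrors, "mirrors -> archive unreachable with probability", p ** mirrors)
```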
The projects mentioned above represent just some of the individuals and organizations that have stepped up to help preserve online content. Given the herculean task of archiving the Internet, no one method or institution can singlehandedly serve as a silver bullet against online transience. However, the past two decades have seen the burgeoning of a multi-dimensional, collective effort to archive the Internet. With some collaboration and interoperability, these services can help systematically preserve the ephemeral, yet increasingly important, space the Internet has become.
—by Shailin Thomas, Jonathan Zittrain, and Ben Sobel
- Perma: Scoping and addressing the problem of “link rot”
Kendra Albert, Larry Lessig and I are finishing up a study of link rot, available at http://papers.ssrn.com/abstract=2329161. Link rot is the phenomenon by which material we link to on the distributed Web vanishes or changes beyond recognition over time. (Wiki discusses link rot here.) This is a particular problem for academic scholarship, which is increasingly linking out to the Web rather than to more formal, library-curated sources. That kind of linking makes clear sense, but having materials easily accessible right until they vanish means that academic work (and government documents, such as judicial opinions) can end up with sources that can’t be checked or followed up on by readers.
We found that half of the links in all Supreme Court opinions no longer work. And more than 70% of the links in journals such as the Harvard Law Review (in that case measured from 1999 to 2012) currently don’t work. As time passes, the number of non-working links increases.
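The study's methodology is more careful than a simple status check, since a link can return a page whose content has silently changed, which no status code reveals. But as a naive illustration of flagging links that no longer resolve at all, one might write something like this (the URLs are placeholders):

```python
import urllib.request
import urllib.error

# Naive check: a link is flagged if the request errors out or returns a
# non-200 status. Real link-rot studies also look for changed content
# ("reference rot"), which this sketch cannot detect.
urls = ["http://example.com/", "http://example.com/no-such-page"]
for url in urls:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            status = response.status
    except (urllib.error.URLError, OSError):
        status = None
    print(url, "OK" if status == 200 else "possibly rotten")
```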
Our work builds on other great link rot studies such as that by Raizel Liebler and June Liebert in the Yale Journal of Law and Technology, available here (PDF).
In response, the Harvard Library Innovation Lab has pioneered a project to unite libraries so that link rot can be mitigated. We are joined by about thirty law libraries around the world to start Perma.cc, which will allow those libraries, at the direction of authors and journal editors, to store permanent caches of otherwise ephemeral links. Libraries are the ideal partners for this task: they think on a long timescale; they take user trust and service seriously; and they are non-commercial. You can see more about the system at perma.cc. The amazing Internet Archive has lent its archiving engine to the effort, and Instapaper has generously provided an alternative path to parse Web pages to be saved. CloudFlare has kindly ensured that the system at Perma.cc can scale with use.
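For journals and libraries that want to automate archiving, Perma also exposes an HTTP API. The Python sketch below is a hedged illustration only: the endpoint, the api_key parameter, and the response field names are assumptions drawn from Perma's developer documentation, so check there before relying on them.

```python
import json
import urllib.request

# Hypothetical sketch: ask Perma to archive a URL via its HTTP API.
# The endpoint, api_key parameter, and response fields are assumptions
# based on Perma's developer documentation and may differ in practice.
API_KEY = "your-api-key-here"          # placeholder
endpoint = "https://api.perma.cc/v1/archives/?api_key=" + API_KEY

payload = json.dumps({"url": "http://example.com/"}).encode("utf-8")
request = urllib.request.Request(
    endpoint,
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    record = json.load(response)

# A successful response includes a GUID that forms the permanent link.
print("Perma link: https://perma.cc/" + record["guid"])
```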
We’re grateful to these many institutions and people who have come together to help make the Web work for the ages — the only way this can work is as a peer effort.
Perma’s founding partners are:
- Pence Law Library, Washington College of Law, American University
- Law Library at Boston College, Boston College of Law
- Pappas Law Library, Boston University School of Law
- Biddle Law Library, University of Pennsylvania Law School
- Charleston School of Law Library
- Arthur W. Diamond Law Library, Columbia Law School
- Digital Public Library of America
- J. Michael Goodson Law Library, Duke University School of Law
- Florida State Law Research Center, Florida State University College of Law
- The Leo T. Kissam Memorial Library, Fordham University School of Law
- Georgetown Law Library, Georgetown Law
- Internet Archive
- Harvard Law School Library
- Ruth Lilly Law Library, Robert H. McKinney School of Law, Indiana University
- Louis L. Biro Law Library, The John Marshall Law School
- Louisiana State University Law Library, LSU Law Center
- Thurgood Marshall Law Library, Francis King Carey School of Law, University of Maryland
- Melbourne Law School Law Library
- Bodleian Law Library, Bodleian Libraries, University of Oxford
- Harnish Law Library, Pepperdine University School of Law
- The Fred Parks Law Library, South Texas College of Law
- Robert Crown Law Library, Stanford Law School
- Hugh & Hazel Darling Law Library, UCLA School of Law
- Grisham Law Library, University of Mississippi School of Law
- Wiener-Rogers Law Library, UNLV William S. Boyd School of Law
- Tarlton Law Library, Jamail Center for Legal Research, The University of Texas School of Law
- Lillian Goldman Law Library, Yale Law School
NYT story here. And the Perma link to this very page can be found at http://perma.cc/0WNvsHVwhT5. (How’s that for recursive?)
- The generativity of programming languages: Why “open source” is about expressive power
[I feature this thoughtful contribution from Leonid Grinberg, who's been working with me this summer at the Berkman Center.]
In his famous dystopian novel Nineteen Eighty-Four, George Orwell conceived “Newspeak,” a language specifically constructed to make it impossible to express any thoughts that are contrary to the interests of the state. One can think of this as a totalitarian application of the strong Sapir-Whorf Hypothesis, which posits that language can determine thought: “Every concept that can ever be needed, will be expressed by exactly one word, with its meaning rigidly defined and all its subsidiary meanings rubbed out and forgotten.” (Nineteen Eighty-Four, chapter 5)
Natural languages are nothing like Newspeak—not just because they have synonyms and other redundancies, but more importantly because they have the capacity to express concepts that are not “built into” them. For example, similes allow speakers to express unfamiliar concepts in different terms, so that e.g. if someone doesn’t understand the word “twinkle” one can still say that a twinkling star is “like a diamond in the sky,” which is not a definition but may at least aid in understanding. Indeed, this property is precisely what makes translation possible. In an essay called “This Word Cannot be Translated,” the Russian writer Sergei Dovlatov describes a particular Russian word, “хамство,” for which he says Vladimir Nabokov could not come up with an English equivalent when teaching Russian literature at Cornell. (See http://www.sergeidovlatov.com/books/etoneper.html, in Russian)
But of course, translation works without being perfect, because even the identical words of one language inevitably mean different things to different people. Even if there’s not a one-to-one mapping, one can get a pretty good approximation by using related words (e.g. for “хамство” it’s something between “boorishness,” “rudeness,” and “bullying”). Pei-Ying Lin, a student at the Royal College of Art in London, did just that: he made an infographic of “21 emotions for which there are no English words.” He shows emotions for which there are words in other languages as vertices on a graph connected to each other as well as to familiar English words. Thus, one can infer a meaning that might not perfectly capture the idea, but comes pretty close. And of course, the infographic can be readily encoded back into an English sentence, and thus we can in effect create new “English words.” This is the essence of a highly generative technology.
In thinking about this, we might be tempted to see if the same sort of reasoning holds not just for natural language, but also for programming languages. Here, we hit a bit of an interesting twist. One of the most fundamental theories in computer science, the Church-Turing thesis, says that all sufficiently powerful programming languages are interchangeable in the sense that any program in one language can be translated into an equivalent program in another. The notion of “equivalence” here, however, is pretty narrow. Two programs are “equivalent” if they produce identical outputs in response to identical inputs, but that’s the only thing that has to be the same. For example, the translated program might run much slower than the original, or use a lot more memory. Moreover, it might be much harder for a fellow programmer to understand. In fact, in order to actually perform the functional translation, one might first have to simulate the entire computer, right up from the logic on the circuit board. It would be the equivalent of “translating” from Russian into English by using Russian to describe the neural pathways and signals within the brain that are processed when English is spoken. At the end, with enough time, a Russian speaker would be able to speak and understand English, but only in the most technical sense.
When thinking about the generativity of a platform, we have to take into account not only whether it’s theoretically possible to create something new, but also how easy it is to do in practice. It takes remarkably little to make a programming language that’s theoretically as powerful as any other, but that doesn’t mean it’s actually easy to express anything in it. Thus, the choice of programming language has a much broader implication than the Church-Turing thesis might suggest. Performance issues aside, insofar as a computer program is an act of communication between two programmers, the choice of computer language has an effect on what a program is actually “saying” to someone reading the code and contemplating adapting it to a new, generative purpose.
As a concrete example, consider the following simple logic problem, adapted from Gerry Sussman’s famous textbook, Structure and Interpretation of Computer Programs:
Alice, Bob, and Claire live on different floors of an apartment house that contains only three floors. Bob does not live on the top floor, Alice does not live next to Claire, and Claire does not live below Bob. Where does everyone live?
Here are two potential programs in a made-up programming language that might solve this problem:
If you parse them carefully, you will be able to see that these programs do the exact same thing. The only difference is readability, or more fundamentally, coherence. The first version is basically a direct translation of the English problem statement—it literally just rewrites the constraints in a slightly more formal syntax. The second version actually takes the reader through the search process and lets her follow step by step as it finds the combination of values that satisfy all the constraints. That level of detail makes the program considerably less readable. The only way to really understand what it’s doing is to follow it step-by-step, whereas one can quickly scan through the first program and understand what it’s saying.
As it turns out, almost no mainstream programming language can easily express a program that looks like the first version. The mechanisms that are necessary to make it work—the goings-on “behind the scenes” that power the forbid() commands—simply aren’t easily expressed in most mainstream languages. On the other hand, the second, less readable program can be translated to any number of commonly used languages almost verbatim. Pick any one you like—Java, C++, Python—and the only changes that would need to be made for a program like this would be relatively minor syntactical variations.
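To give a flavor of that second, explicit-search style in one of those mainstream languages, here is a rough Python approximation; the structure and variable names are mine, for illustration, not a copy of the original listing:

```python
# Explicit-search version of the floor puzzle: try every assignment of
# floors and check the constraints by hand. (This is an illustrative
# translation into Python, not the original made-up-language listing.)
def solve():
    floors = (1, 2, 3)
    for alice in floors:
        for bob in floors:
            for claire in floors:
                if len({alice, bob, claire}) != 3:   # all on different floors
                    continue
                if bob == 3:                         # Bob not on the top floor
                    continue
                if abs(alice - claire) == 1:         # Alice not next to Claire
                    continue
                if claire < bob:                     # Claire not below Bob
                    continue
                return {"Alice": alice, "Bob": bob, "Claire": claire}

print(solve())   # {'Alice': 1, 'Bob': 2, 'Claire': 3}
```

Even in this compact form, a reader has to trace the loops and the skipped cases to recover the constraints, which is exactly the readability cost described above.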
This matters for open source licenses, which turn on the availability of a program’s “source code.” But in a case like this, what is the “source code”? If a software vendor wrote code in Scheme and ran Bigloo (a compiler that translates Scheme into C) on it to produce C code, would its program be “open source” if the vendor provided just the C code? One could argue that it shouldn’t be considered open source because the C code that Bigloo generates is so complex as to be effectively unreadable. But there’s no particular reason Bigloo couldn’t be optimized to generate readable C code. And if it were, would the C code be good enough to be considered the “source code”?
It’s tempting to avoid this mess by simply defining the source code as the “original code” that the programmers write in, but that starts us down a really slippery slope. Modern programming environments frequently generate code—sometimes large amounts of it—from much more high-level descriptions provided to them by the programmers. (Here is one extreme example of this.) Some languages, like the educational language Scratch, forgo traditional textual code altogether, and instead rely entirely on visual representations of code that look an awful lot like the flowcharts software engineers sometimes write on boards when planning out a software project. If there were a tool that translated those flowcharts to Scratch code, which was then translated to C, which was then compiled to executable code, would we have to limit the definition of “source code” to only include the flowcharts? Or is there something about C—the fact that it can’t easily express the sort of semantic connections that a visual representation can—that makes it “low level” enough to not be source code, even though plenty of software (much of it open source) is written by hand in C?
To make matters more confusing, C is almost never directly translated to executable code. Instead, C compilers usually translate it to a language called “assembly code,” which then gets translated to the executable code. Assembly code is more human readable than machine code, but it’s still extremely low-level—so much so that except for some very limited applications, almost no one programs in it by hand. But that wasn’t always so—a few decades ago, assembly code was the high-level language that people wrote in, and compilers (called “assemblers”) would then translate it to executable code.
These days, no open source project could get away with calling the compiler-generated assembly code “source code.” What’s remarkable about that is that even though C has a lot of features that assembly code doesn’t, the two are actually cognitively very close. There are very few things that are expressible in C that are not expressible in assembly code—the features of C that assembly code doesn’t have make C (significantly) more convenient to use than assembly code, but they don’t really make C more expressive. Conversely, Scheme really does have features that C can’t express, one of which is precisely the difference captured by the two programs above. So in a sense, compiling from Scheme to C is a bigger jump than compiling from C to assembly code.
As languages with features of the sort that Scheme has become more common, open source licenses will have to adapt. The ultimate goal of open source is to promote generativity, so a more nuanced approach that focuses on how easy it is for a “typical programmer” to modify the code will likely be needed. The aim is to prevent programs from being distributed only in the computer equivalent of Newspeak, where no new ideas can come from the code. As long as most people understand the code well enough to be able to change it and add new features, it may as well be considered the “source code.” Conversely, of course, languages that have fallen so far out of use as to be no longer understandable by most programmers may at some point stop being meaningful “source code,” especially if they continue to be used as steps along a compilation path, like assembly is now.
The point of open licenses is to make their platforms more generative. For programming languages, a component of generativity is expressive power. Open source licenses ought to define “source code” in a way that reflects that. Otherwise, their meaning and effectiveness will be severely—and increasingly—limited.