August 12th, 2013 | by shailinthomas | Published in Future of the Internet
[I feature this thoughtful contribution from Leonid Grinberg, who's been working with me this summer at the Berkman Center.]
In his famous dystopian novel Nineteen Eighty-Four, George Orwell conceived “Newspeak,” a language specifically constructed to make it impossible to express any thoughts that are contrary to the interests of the state. One can think of this as a totalitarian application of the strong Sapir-Whorf Hypothesis, which posits that language can determine thought: “Every concept that can ever be needed, will be expressed by exactly one word, with its meaning rigidly defined and all its subsidiary meanings rubbed out and forgotten.” (Nineteen Eighty-Four, chapter 5)
Natural languages are nothing like Newspeak—not just because they have synonyms and other redundancies, but more importantly because they have the capacity to express concepts that are not “built into” them. For example, similes allow speakers to express unfamiliar concepts in different terms, so that e.g. if someone doesn’t understand the word “twinkle” one can still say that a twinkling star is “like a diamond in the sky,” which is not a definition but may at least aid in understanding. Indeed, this property is precisely what makes translation possible. In an essay called “This Word Cannot be Translated,” the Russian writer Sergei Dovlatov describes a particular Russian word, “хамство,” for which he says Vladimir Nabokov could not come up with an English equivalent when teaching Russian literature at Cornell. (See http://www.sergeidovlatov.com/books/etoneper.html, in Russian)
But of course, translation does not require perfection, because even the identical words of a single language inevitably mean different things to different people. Even if there's no one-to-one mapping, one can get a pretty good approximation by using related words (e.g., "хамство" is something between "boorishness," "rudeness," and "bullying"). Pei-Ying Lin, a student at the Royal College of Art in London, did just that: she made an infographic of "21 emotions for which there are no English words." It shows emotions for which there are words in other languages as vertices on a graph, connected to each other as well as to familiar English words. From the graph, one can infer a meaning that might not perfectly capture the idea but comes pretty close. And of course, the infographic can be readily encoded back into an English sentence, and thus we can in effect create new "English words." This is the essence of a highly generative technology.
In thinking about this, we might be tempted to see if the same sort of reasoning holds not just for natural language, but also for programming languages. Here, we hit a bit of an interesting twist. One of the most fundamental theories in computer science, the Church-Turing thesis, says that all sufficiently powerful programming languages are interchangeable in the sense that any program in one language can be translated into an equivalent program in another. The notion of “equivalence” here, however, is pretty narrow. Two programs are “equivalent” if they produce identical outputs in response to identical inputs, but that’s the only thing that has to be the same. For example, the translated program might run much slower than the original, or use a lot more memory. Moreover, it might be much harder for a fellow programmer to understand. In fact, in order to actually perform the functional translation, one might first have to simulate the entire computer, right up from the logic on the circuit board. It would be the equivalent of “translating” from Russian into English by using Russian to describe the neural pathways and signals within the brain that are processed when English is spoken. At the end, with enough time, a Russian speaker would be able to speak and understand English, but only in the most technical sense.
When thinking about the generativity of a platform, we have to take into account not only whether it’s theoretically possible to create something new, but also how easy it is to do in practice. It takes remarkably little to make a programming language that’s theoretically as powerful as any other, but that doesn’t mean it’s actually easy to express anything in it. Thus, the choice of programming language has a much broader implication than the Church-Turing thesis might suggest. Performance issues aside, insofar as a computer program is an act of communication between two programmers, the choice of computer language has an effect on what a program is actually “saying” to someone reading the code and contemplating adapting it to a new, generative purpose.
As a concrete example, consider the following simple logic problem, adapted from Harold Abelson and Gerald Jay Sussman's famous textbook, Structure and Interpretation of Computer Programs:
Alice, Bob, and Claire live on different floors of an apartment house that contains only three floors. Bob does not live on the top floor, Alice does not live next to Claire, and Claire does not live below Bob. Where does everyone live?
Here are two potential programs in a made-up programming language that might solve this problem:
alice = one_of(1, 2, 3)
bob = one_of(1, 2, 3)
claire = one_of(1, 2, 3)
forbid(alice = bob or bob = claire or alice = claire)
forbid(bob = 3)
forbid(alice - claire = 1 or claire - alice = 1)
forbid(claire < bob)
display(alice, bob, claire)
try values for alice between 1 and 3:
    try values for bob between 1 and 3:
        try values for claire between 1 and 3:
            if alice = bob or bob = claire or alice = claire:
                try the next value for claire
            if bob = 3:
                try the next value for claire
            if alice - claire = 1 or claire - alice = 1:
                try the next value for claire
            if claire < bob:
                try the next value for claire
            else:
                display(alice, bob, claire)
                exit
            if claire = 3:
                reset claire back to 1
                try the next value for bob
        if bob = 3:
            reset bob back to 1
            try the next value for alice
If you parse them carefully, you will be able to see that these programs do the exact same thing. The only difference is readability, or more fundamentally, coherence. The first version is basically a direct translation of the English problem statement—it literally just rewrites the constraints in a slightly more formal syntax. The second version actually takes the reader through the search process and lets her follow step by step as it finds the combination of values that satisfies all the constraints. That level of detail makes the program considerably less readable. The only way to really understand what it's doing is to follow it step-by-step, whereas one can quickly scan through the first program and understand what it's saying.
The first version uses two mysterious commands (one_of() and forbid()) that somehow control the values as they change to match the constraints. The interesting thing about these commands is that they are non-linear. In particular, after the first one_of() line, we don't really know what value the variable alice has. It's some magical, non-determined placeholder that, along with the values of bob and claire, ultimately gets whittled down by the forbid() commands until just one set of values remains.
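To get a feel for what such machinery might involve, here is a minimal generate-and-test sketch in Python. The names solve and constraints, and the encoding of each forbid() as a predicate, are my own illustrative choices rather than anything from the original pseudocode. The declarative surface survives, but underneath it is plain exhaustive search; the amb operator described in Structure and Interpretation of Computer Programs works along roughly these lines, though with backtracking rather than brute-force enumeration:

```python
from itertools import product

def solve(variables, constraints):
    """Enumerate every assignment of values to variables and
    keep only those that no constraint forbids."""
    names = list(variables)
    for values in product(*variables.values()):
        assignment = dict(zip(names, values))
        if not any(forbidden(assignment) for forbidden in constraints):
            yield assignment

# Each forbid(...) from the first program becomes a predicate that
# returns True for assignments that must be ruled out.
constraints = [
    lambda a: len({a["alice"], a["bob"], a["claire"]}) < 3,  # different floors
    lambda a: a["bob"] == 3,                                 # forbid(bob = 3)
    lambda a: abs(a["alice"] - a["claire"]) == 1,            # not adjacent
    lambda a: a["claire"] < a["bob"],                        # forbid(claire < bob)
]

floors = (1, 2, 3)
solutions = list(solve({"alice": floors, "bob": floors, "claire": floors},
                       constraints))
print(solutions)  # [{'alice': 1, 'bob': 2, 'claire': 3}]
```

Note that the calling code reads almost like the first program, while all the searching is hidden inside solve().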
As it turns out, almost no mainstream programming language can easily express a program that looks like the first version. The mechanisms that are necessary to make it work—the goings-on "behind the scenes" that power the one_of() and forbid() commands—simply aren't easily expressed in most mainstream languages. On the other hand, the second, less readable program can be translated into any number of commonly used languages almost verbatim. Pick any one you like—Java, C++, Python—and the only changes that would need to be made for a program like this would be relatively minor syntactic variations.
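For instance, here is a near-verbatim Python sketch of the second program (wrapping it in a hypothetical function solve so that "exit" can become a return). The explicit bookkeeping at the bottom of the pseudocode, resetting claire and bob back to 1, is exactly what Python's for loops do automatically:

```python
def solve():
    # "try values for alice between 1 and 3", and so on
    for alice in (1, 2, 3):
        for bob in (1, 2, 3):
            for claire in (1, 2, 3):
                if alice == bob or bob == claire or alice == claire:
                    continue  # "try the next value for claire"
                if bob == 3:
                    continue
                if alice - claire == 1 or claire - alice == 1:
                    continue
                if claire < bob:
                    continue
                return alice, bob, claire  # "display(...)" and "exit"

print(solve())  # (1, 2, 3): Alice on 1, Bob on 2, Claire on 3
```

The translation is almost mechanical, which is precisely the point: this is the style of program most languages are built to express.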
One area where this can have interesting implications is open source licenses. In the most basic sense of the word, for software to be "open source," its vendor must make accessible the source code that was used to generate the executable program. But the separation between "source code" and "executable code" is a lot less clean than it used to be. The dichotomy between the two makes sense for languages like C++ and Java, where a program called a "compiler" takes the source code and generates human-unreadable, executable code, which can then work on its own without the original. But people are increasingly using languages that don't neatly fit into this workflow. A compiler for a language resembling that of the first program may very well first translate the code into something that looks like the second, and then translate that result into executable code. Bigloo, for example, is a compiler for the Scheme programming language, which can express programs like the first one above. Bigloo works by first translating Scheme programs into equivalent programs written in C, which can then be easily compiled into executable code.
But in this case, what’s the “source code”? If a software vendor wrote code in Scheme and ran Bigloo on it to produce C code, would its program be “open source” if the vendor provided just the C code? One could argue that it shouldn’t be considered open source because the C code that Bigloo generates is so complex as to be effectively unreadable. But there’s no particular reason Bigloo couldn’t be optimized to generate readable C code. And if it were, would the C code be good enough to be considered the “source code”?
It's tempting to avoid this mess by simply defining the source code as the "original code" that the programmers write in, but that starts us down a really slippery slope. Modern programming environments frequently generate code—sometimes large amounts of it—from much more high-level descriptions provided to them by the programmers. Some languages, like the educational language Scratch, forgo traditional textual code altogether, and instead rely entirely on a visual representation of code that looks an awful lot like the flowcharts software engineers sometimes draw on boards when planning out a software project. If there were a tool that translated those flowcharts to Scratch code, which was then translated to C, which was then compiled to executable code, would we have to limit the definition of "source code" to only the flowcharts? Or is there something about C—the fact that it can't easily express the sort of semantic connections that a visual representation can—that makes it "low level" enough not to be source code, even though plenty of software (much of it open source) is written by hand in C?
To make matters more confusing, C is almost never directly translated to executable code. Instead, C compilers usually translate it to a language called “assembly code,” which then gets translated to the executable code. Assembly code is more human readable than machine code, but it’s still extremely low-level—so much so that except for some very limited applications, almost no one programs in it by hand. But that wasn’t always so—a few decades ago, assembly code was the high-level language that people wrote in, and compilers (called “assemblers”) would then translate it to executable code.
These days, no open source project could get away with calling the compiler-generated assembly code “source code.” What’s remarkable about that is that even though C has a lot of features that assembly code doesn’t, the two are actually cognitively very close. There are very few things that are expressible in C that are not expressible in assembly code—the features of C that assembly code doesn’t have make C (significantly) more convenient to use than assembly code, but they don’t really make C more expressive. Conversely, Scheme really does have features that C can’t express, one of which is precisely the difference captured by the two programs above. So in a sense, compiling from Scheme to C is a bigger jump than compiling from C to assembly code.
As languages with features like Scheme's become more common, open source licenses will have to adapt. The ultimate goal of open source is to promote generativity, so a more nuanced approach that focuses on how easy it is for a "typical programmer" to modify the code will likely be needed. The point is to prevent programs from being distributed only in the computer equivalent of Newspeak, where no new ideas can come from the code. As long as most people understand the code well enough to be able to change it and add new features, it may as well be considered the "source code." Conversely, of course, languages that have fallen so far out of use as not to be understandable by most programmers may at some point stop being meaningful "source code," especially if they continue to be used as steps along a compilation path, as assembly is now.
The point of open licenses is to make their platforms more generative. For programming languages, a component of generativity is expressive power. Those who draft open source licenses ought to think about defining "source code" to reflect that. Otherwise, the licenses' meaning and effectiveness will be severely—and increasingly—limited.
Confusingly, the term "generative" has a different—indeed, approximately opposite—meaning in linguistics than the one it carries here. A generative grammar is one in which a finite set of strict rules can be used to produce all valid sentences in a language. Newspeak, to fulfill its purpose, would need to be generative in that sense, or else one would be able to produce new words through the process described above. But we're talking about generative technologies, which allow contributions that are not expected or built in at conception—the opposite of a generative grammar.
Folks who enjoy thinking about strong AI might find this argument reminiscent of the "Chinese Room" thought experiment by John Searle. The idea is that if there were a computer that could converse convincingly in Chinese, a human who did not understand Chinese could nevertheless simulate the program's instructions and effectively carry on a conversation in Chinese without really "understanding" the language.
 Free software licenses are stronger than open source licenses, so everything in this paper applies to them as well. However, for the purposes of this article, we only care about the definition of “source code,” so we can stick with “open source” to include other non-free software as well.
 As an example, C lets programmers specify a “type” for their variables—e.g. you can specify that a variable can only have integral values. Integers typically take up four bytes of memory, which means that creating an array of five integers requires allocating 20 bytes of memory. A C compiler would do this calculation automatically, whereas in assembly code the programmer would need to manage it herself. Is the C way more convenient? Sure. But is it expressing something assembly code fundamentally can’t? Not really.
–by Leonid Grinberg, MIT ’14