I’m generally a big fan of peer review. I think it plays an important role in the improvement and “chromatography” of the scholarly literature. But sometimes. Sometimes.

The Boyer-Moore MJRTY algorithm allows efficient determination of which shape (triangle, circle, square) is in the majority without counting each shape.

This past week I was reading Robert Boyer and J Strother Moore’s paper on computing the majority element of a multiset, which presents a very clever, simple algorithm for this fundamental problem and a description of a mechanical proof of its correctness. The authors aptly consider the work a “minor landmark in the development of formal verification and automated reasoning”.
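The paper expresses the algorithm recursively and verifies a Fortran implementation; the now-standard iterative rendering of the same pairing-and-cancellation idea fits in a few lines. A minimal Python sketch (not the authors' own formulation):

```python
def majority(xs):
    """Boyer-Moore MJRTY: one linear pass to pick a candidate,
    a second to confirm it, using only a candidate and a counter."""
    candidate, count = None, 0
    for x in xs:
        if count == 0:
            candidate, count = x, 1      # adopt a new candidate
        elif x == candidate:
            count += 1                   # a vote for the candidate
        else:
            count -= 1                   # an opposing vote cancels one out
    # Verification pass: the candidate is the majority element only
    # if a majority element actually exists.
    if candidate is not None and sum(x == candidate for x in xs) > len(xs) // 2:
        return candidate
    return None

print(majority(["A", "A", "C", "C", "B", "B", "C", "C", "C"]))  # C
print(majority(["A", "B"]))  # None (no majority)
```

The cleverness is in the cancellation argument: any element occurring more than n/2 times survives pairwise cancellation against all other elements combined, so only one candidate ever needs to be tracked.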

Below is the postscript to that paper, in its entirety, which describes the history of the paper including how and why it was “repeatedly rejected for publication”. (It was eventually published as a chapter in a 1991 festschrift for Woody Bledsoe, ten years after it was written, and is now also available from Moore’s website.)

In this paper we have described a linear time majority vote algorithm and discussed the mechanically checked correctness proof of a Fortran implementation of it. This work has a rather convoluted history which we would here like to clarify.

The algorithm described here was invented in 1980 while we worked at SRI International. A colleague at SRI, working on fault tolerance, was trying to specify some algorithms using the logic supported by “Boyer-Moore Theorem Prover.” He asked us for an elegant definition within that logic of the notion of the majority element of a list. Our answer to this challenge was the recursive expression of the algorithm described here.

In late 1980, we wrote a Fortran version of the algorithm and proved it correct mechanically. In February, 1981, we wrote this paper, describing that work. In our minds the paper was noteworthy because it simultaneously announced an interesting new algorithm and offered a mechanically checked correctness proof. We submitted the paper for publication.

In 1981 we moved to the University of Texas. Jay Misra, a colleague at UT, heard our presentation of the algorithm to an NSF site-visit team. According to Misra (private communication, 1990): “I wondered how to generalize [the algorithm] to detect elements that occur more than n/k times, for all k, k ≥ 2. I developed algorithm 2 [given in Section 3 of [9]] which is directly inspired by your algorithm. Also, I showed that this algorithm is optimal [Section 5, op. cit.]. On a visit to Cornell, I showed all this to David Gries; he was inspired enough to contribute algorithm 1 [Section 2, op. cit.].” In 1982, Misra and Gries published their work [9], citing our technical report appropriately as “submitted for publication.”

However, our paper was repeatedly rejected for publication, largely because of its emphasis on Fortran and mechanical verification. A rewritten version emphasizing the algorithm itself was rejected on the grounds that the work was superseded by the paper of Misra and Gries!

When we were invited to contribute to the Bledsoe festschrift we decided to use the opportunity to put our original paper into the literature. We still think of this as a minor landmark in the development of formal verification and automated reasoning: here for the first time a new algorithm is presented along with its mechanically checked correctness proof—eleven years after the work.

I have to think the world would have been better off if Boyer and Moore had just posted the paper to the web in 1981 and been done with it. Unfortunately, the web hadn’t been developed yet.

Stamps to mark “restricted data” (modified from “atomic stamps 1” by flickr user donovanbeeson, used by permission under CC by-nc-sa)

Ten years ago today was the largest terrorist action in United States history, an event that highlighted the importance of intelligence, and its reliance on information classification and control, for the defense of the country. This anniversary precipitated Peter Suber’s important message, which starts from the fact that access to knowledge is not always a good. He addresses the question of whether open access to the scholarly literature might make information too freely available to actors who do not have the best interests of the United States (or your country here) at heart. Do we really want everyone on earth to have information about public-key cryptosystems or exothermic chemical reactions? Should our foreign business competitors freely reap the fruits of research that American taxpayers funded? He says,

You might think that no one would seriously argue that using prices to restrict access to knowledge would contribute to a country’s national and economic security. But a vice president of the Association of American Publishers made that argument in 2006. He “rejected the idea that the government should mandate that taxpayer financed research should be open to the public, saying he could not see how it was in the national interest. ‘Remember — you’re talking about free online access to the world,’ he said. ‘You are talking about making our competitive research available to foreign governments and corporations.’”

Suber’s response is that “If we’re willing to restrict knowledge for good people in order to restrict knowledge for bad people, at least when the risks of harm are sufficiently high, then we already have a classification system to do this.” (He provides a more detailed response in an earlier newsletter.) He is exactly right. Placing a $30 paywall in front of everyone to read an article in order to keep terrorists from having access to it is both ineffective (relying on al Qaeda’s coffers to drop below the $30 point is not a counterterrorism strategy) and overreaching (since a side effect is to disenfranchise the overwhelming majority of human beings who are not enemies of the state). Instead, research that the country deems too dangerous to distribute should be, and is, classified, and therefore kept from both open access and toll access journals.

This argument against open access, that it might inadvertently abet competitors of the state, is an instance of a more general worry about open distribution being too broad. Another instance is the “corporate free-riding” argument. It is argued that moving to an open-access framework for journals would be a windfall to corporations (the canonical example is big pharma) who would no longer have to subscribe to journals to gain the benefit of their knowledge and would thus be free-riding. To which the natural response would be “and what exactly is wrong with that?” Scientists do research to benefit society, and corporate use of the fruits of the research is one of those benefits. Indeed, making research results freely available is a much fairer system, since it allows businesses both large and small to avail themselves of the results. Why should only businesses with deep pockets be able to take advantage of research, much of which is funded by the government?

But shouldn’t companies pay their fair share for these results? Who could argue with that? To assume that the subscription fees that companies pay constitute their fair share for research requires several implicit assumptions that bear examination.

Assumption 1: Corporate subscriptions are a nontrivial sum. Do corporate subscriptions constitute a significant fraction of journal revenues? Unfortunately, there are to my knowledge no reliable data on the degree to which corporate subscriptions contribute to revenue. Estimates range from 0% (certainly the case in most fields of research outside the life sciences and technology) to 15-17% to 25% (a figure that has appeared informally and been challenged in favor of a 5-10% figure). (Thanks to Peter Suber for help in finding these references.) None of these estimates were backed up in any way. Without any well-founded figures, it doesn’t seem reasonable to be worrying about the issue. The onus is on those proposing corporate free-riding as a major problem to provide some kind of transparently supportable figures.

Assumption 2: Corporations would pay less under open access. The argument assumes that in an open-access world, journal revenues from corporations would drop, because they would save money on subscriptions but would not be supporting publication of articles through publication fees. That is, corporate researchers “read more than they write.” Of course, corporate researchers publish in the scholarly literature as well (as I did for the first part of my career when I was a researcher at SRI International), and thus would be contributing to the financial support of the publishing ecology. Here again, I know of no data on the percentage of articles with corporate authors and how that compares to the percentage of revenue from corporate subscriptions.

Assumption 3: Corporations shouldn’t be paying less than they now are, perhaps for reasons of justice, or perhaps on the more mercenary basis of financial reality. It is presumed that if corporations are not paying subscription fees (and, again by assumption, publication fees) then academia will have to pick up the slack through commensurately higher publication fees, so the total expenditure by academia will be higher. This is taken to be a bad thing, but the reason for that is not clear. Why is it assumed that the “right” apportionment of fees between academia and business is whatever we happen to have at the moment, resulting as it does from historical happenstance based on differential subscription rates and corporate and university budget decisions? Free-riding in the objectionable sense is to get something without paying when one ought to pay. But the latter condition doesn’t apply to the open-access scholarly literature any more than it applies to broadcast television.

Assumption 4: Corporations only support research through subscription fees. However, corporations also provide support for funded research through the corporate taxes that they pay to the government, which funds the research. And this mode of payment has the advantage that it covers all parts of the research process, not just the small percentage that constitutes the publishing of the final results. Corporate taxes constitute some 10% of total US tax revenue according to the IRS, so we can impute corporate underwriting of US-government funded research at that same 10% level. (In fact, since many non-corporate taxes, like FICA taxes, are earmarked for particular programs that don’t fund research, the imputed percentage should perhaps be even higher.) The subscription fees companies pay are above and beyond that. Is the corporate 10% not already a fair share? Might it even be too much?

If we collectively thought that the amount corporations are paying is insufficient, then the right response would be to increase the corporate taxes accordingly, so that all corporations contribute to the underwriting of scientific research that they all would be benefitting from. Let’s take a look at some numbers. The revenue from the 2.5 million US corporations paying corporate tax for 2009 (the last year for which data are available) was about $225 billion. The NSF budget for 2009 was $5.4 billion. So, for instance, a 50% increase in the NSF budget would require increasing corporate tax revenues by a little over 1%, that is, from a 35% corporate tax rate (say) to something like 35.4%. I’m not advocating an increase in corporate taxes for this purpose. First, I’m in no way convinced that corporations aren’t already supporting research sufficiently. Second, there are many other effects of corporate taxes that may militate against raising them. Instead, the point is that it is naive to pick out a single revenue source, subscription fees, as the sum total of corporate support of research.
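The back-of-the-envelope arithmetic above is easy to check. A quick sketch using only the figures cited in the paragraph (2009 corporate tax receipts and NSF budget):

```python
# Figures cited above, in dollars (2009 data).
corporate_tax_revenue = 225e9   # total US corporate tax receipts
nsf_budget = 5.4e9              # NSF budget

# Cost of a 50% increase in the NSF budget, as a share of
# corporate tax revenue.
increase = 0.5 * nsf_budget
pct_of_corporate_revenue = 100 * increase / corporate_tax_revenue
print(round(pct_of_corporate_revenue, 1))  # 1.2 -- "a little over 1%"

# Scaling a nominal 35% corporate tax rate up by that same proportion:
new_rate = 35 * (1 + increase / corporate_tax_revenue)
print(round(new_rate, 1))  # 35.4
```

The $2.7 billion needed is indeed about 1.2% of the $225 billion in corporate tax receipts, which would lift a nominal 35% rate to roughly 35.4%.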

Assumption 5: Subscription fees actually pay for research, or some pertinent aspect of research. But those fees do not devolve to the researchers or cover any aspect of the research process except for the publication aspect, and publishing constitutes only a small part of the costs of doing research. To avoid disingenuousness, shouldn’t anyone worrying about whether corporations are doing their fair share in underwriting that aspect be worrying about whether they are doing their fair share in underwriting the other aspects as well? Of course, corporations arguably are underwriting other aspects — through internal research groups, grants to universities and research labs, and their corporate taxes (the 10% discussed above). And in an open-access world, they would be covering the publication aspect as well, namely publication fees, through those same streams.

In summary, maintaining the subscription revenue model for reasons of distribution control — whether for purposes of state defense or corporate free-riding — is a misconstruction.

Cover of the first issue of the Philosophical Transactions of the Royal Society, dated March 6, 1665. Available from JSTOR’s Early Journal Content collection.

JSTOR, the non-profit online journal distributor, announced yesterday that they would be making pre-1923 US articles and pre-1870 non-US articles available for free in a program they call “Early Journal Content”. The chosen dates are not random of course; they guarantee that the articles have fallen out of copyright, so such distribution does not run into rights issues. Nonetheless, that doesn’t mean that JSTOR could take this action unilaterally. JSTOR is further bound by agreements with the publishers who provided the journals for scanning, which may have precluded them contractually from distributing even public domain materials that were derived from the provided originals. Thus such a program presumably requires cooperation of the journal publishers. In addition, JSTOR requires goodwill from publishers for all of its activities, so unilateral action could have been problematic for its long-run viability. (Such considerations may even in part underlie JSTOR’s not including all public domain material in the opened collection.)

Arranging for the necessary permissions — whether legal or pro forma — takes time, and JSTOR claims that work towards the opening of these materials started “about a year ago”, that is, prior to the recent notorious illicit download program that I have posted about previously. Predictably, the Twittersphere is full of speculation about whether the actions by Aaron Swartz affected the Early Journal Content program:

@grimmelm: JSTOR makes pre-1923 journals freely available http://about.jstor.org/participate-jstor/individuals/early-journal-content Would this have happened earlier or later w/o @aaronsw?

@mecredis: JSTOR makes all their public domain content available for free: http://about.jstor.org/news-events/news/jstor%E2%80%93free-access-early-journal-content I think this means @aaronsw wins.

@maxkaiser: Breaking: @JSTOR to provide free #openaccess to pre-1923 content in US & pre-1870 elsewhere – @aaronsw case had impact: http://about.jstor.org/news-events/news/jstor%E2%80%93free-access-early-journal-content

@JoshRosenau: JSTOR “working on releasing pre-1923 content before [@aaronsw released lotsa their PDFs], inaccurate to say these events had no impact.”

@mariabustillos: Stuff that in yr. pipe and smoke it, JSTOR haters!! http://bit.ly/qtrxdV Also: how now, @aaronsw?

So, did Aaron Swartz’s efforts affect the existence of JSTOR’s new program or its timing? As to the former, it seems clear that with or without his actions, JSTOR was already on track to provide open access to out-of-copyright materials. As to the latter, JSTOR says that

[I]t would be inaccurate to say that these events have had no impact on our planning. We considered whether to delay or accelerate this action, largely out of concern that people might draw incorrect conclusions about our motivations. In the end, we decided to press ahead with our plans to make the Early Journal Content available, which we believe is in the best interest of our library and publisher partners, and students, scholars, and researchers everywhere.

On its face, the statement implies that JSTOR acted essentially without change, but we’ll never know if Swartz’s efforts sped up or slowed down the release.

What the Early Journal Content program does show is JSTOR’s interest in providing broader access to the scholarly literature, a goal they share with open-access advocates, and even with Aaron Swartz. I hope and expect that JSTOR will continue to push, and even more aggressively, towards broader access to its collection. The scholarly community will be watching.