June 14th, 2012
Pity the poor, beleaguered “Impact Factor™” (IF), a secret mathematistic formula originally intended to serve as a proxy for journal quality. No one seems to like it much. The manifold problems with IF have been rehearsed to death:
- The calculation (sketched below) isn’t a proper average.
- The calculation is statistically inappropriate.
- The calculation ignores most of the citation data.
- The calculated values aren’t reproducible.
- Citation rates, and hence Impact Factors, vary considerably across fields, making cross-discipline comparison meaningless.
- Citation rates vary across languages. Ditto.
- IF varies over time, and varies differentially for different types of journals.
- IF is manipulable by publishers.
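For the record, the calculation behind all of these complaints is, in its standard two-year form,

$$\mathrm{IF}_y = \frac{C_y(y-1) + C_y(y-2)}{N_{y-1} + N_{y-2}},$$

where $C_y(x)$ is the number of citations received in year $y$ by items the journal published in year $x$, and $N_x$ is the number of “citable items” the journal published in year $x$. (This is a sketch of the usual textbook definition; Thomson Reuters’ exact accounting of what counts as a citable item is murkier, which is part of the reproducibility complaint above.)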
The study by the International Mathematical Union is especially trenchant on these matters, as is Björn Brembs’ take. I don’t want to pile on here, just to look at some new data showing that IF has been getting even more problematic over time.
One of the most egregious uses of IF is in promotion and tenure discussions. It’s been understood for a long time that the Impact Factor, given its manifest problems as a method for ranking journals, is completely inappropriate for ranking articles. As the European Association of Science Editors has said:
Therefore the European Association of Science Editors recommends that journal impact factors are used only – and cautiously – for measuring and comparing the influence of entire journals, but not for the assessment of single papers, and certainly not for the assessment of researchers or research programmes either directly or as a surrogate.
Even Thomson Reuters says:
The impact factor should be used with informed peer review. In the case of academic evaluation for tenure it is sometimes inappropriate to use the impact of the source journal to estimate the expected frequency of a recently published article.
“Sometimes inappropriate.” Snort.
|…the money chart…|
Check out the money chart from the recent paper “The weakening relationship between the Impact Factor and papers’ citations in the digital age” by George A. Lozano, Vincent Larivière, and Yves Gingras.
They address the issue of whether the most highly cited papers tend to appear in the highest Impact Factor journals, and how that has changed over time. One of their analyses looks at the papers that fall in the top 5% by number of citations in the two years following publication, and asks what percentage of these do not appear in the top 5% of journals as ranked by Impact Factor. If Impact Factor were a perfect reflection of the future citation rate of the articles in a journal, this number would be zero.
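To make the analysis concrete, here is a minimal sketch of the computation in Python, with hypothetical data structures of my own; the paper itself works from Web of Science records.

```python
# Sketch of the Lozano et al. analysis. `cites` maps paper -> (journal,
# citations in the two years after publication); `impact_factor` maps
# journal -> Impact Factor. Both are hypothetical stand-ins.

def top_fraction(items, score, fraction=0.05):
    """Return the top `fraction` of `items`, ranked by `score`."""
    ranked = sorted(items, key=score, reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]

def pct_top_papers_outside_top_journals(cites, impact_factor):
    top_papers = top_fraction(cites, score=lambda p: cites[p][1])
    top_journals = set(top_fraction(impact_factor, score=impact_factor.get))
    outside = [p for p in top_papers if cites[p][0] not in top_journals]
    return 100.0 * len(outside) / len(top_papers)
```

If Impact Factor perfectly predicted article-level citations, this function would return zero; the paper’s point is just how far from zero the real numbers are.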
As it turns out, the percentage has been extremely high over the years. The majority of top papers fall into this group, indicating that restricting attention to the top Impact Factor journals doesn’t come close to covering the best papers. This by itself is not too surprising, though it doesn’t bode well for IF.
More interesting is the trajectory of the numbers. At one point, roughly up through World War II, the numbers were in the 70s and 80s. Three quarters of the top-cited papers were not in the top IF journals. After the war, a steady consolidation of journal brands, along with the invention of the formal Impact Factor in the 60s and its increased use, led to a steady decline in the percentage of top articles in non-top journals. Basically, a journal’s imprimatur — and its IF along with it — became a better and better indicator of the quality of the articles it published. (Better, but still not particularly good.)
This process ended around 1990. As electronic distribution of individual articles took over for distribution of articles bundled within printed journal issues, it became less important which journal an article appeared in. Articles more and more lived and died by their own inherent quality rather than by the quality signal inherited from their publishing journal. The pattern in the graph is striking.
The important ramification is that the Impact Factor of a journal is an increasingly poor metric of quality, especially at the top end. And it is likely to get even worse. Electronic distribution of individual articles is only increasing, and as the Impact Factor signal decreases, there is less motivation to publish the best work in high IF journals, compounding the change.
Meanwhile, computer and network technology has brought us to the point where we can develop and use metrics that serve as proxies for quality at the individual article level. We don’t need to rely on journal-level metrics to evaluate articles.
Given all this, promotion and tenure committees should proscribe consideration of journal-level metrics — including Impact Factor — in their deliberations. Instead, if they must use metrics, they should use article-level metrics only, or better yet, read the articles themselves.
June 11th, 2012
Harvard made a big splash recently when my colleagues on the Faculty Advisory Council to the Harvard Library distributed a Memorandum on Journal Pricing. One of the main problems with the memo, however, is that the recommendations it makes are relatively imprecise. It exhorts faculty to work with journals and scholarly societies on their publishing and pricing practices, but provides no concrete advice on exactly what to request. What is good behavior?
I just met with a colleague here at Harvard who raised the issue with me directly. He’s a member of the editorial board of a journal and wants to do the right thing and make sure that the journal’s policies make them a good actor. But he doesn’t want to (and shouldn’t be expected to) learn all the ins and outs of the scholarly publishing business, the legalisms of publication agreements and copyright, and the interactions of all that with the policies of the particular journal. He’s not alone; there are many others in the same boat. Maybe you are too. Is there some pithy request you can make of the journal that encapsulates good publishing practices?
(I’m assuming here that converting the journal to open access is off the table. That would of course be preferable, but it’s unlikely to get you very far, especially given that the most plausible revenue model for open-access journal publishing, namely publication fees, is not yet well supported by the scholarly publishing ecology.)
There are two kinds of practices that the Harvard memo moots: it explicitly mentions the pricing practices of journals, and implicitly raises author-rights issues in its recommendations. Scholars who participate in journals (editors, editorial board members, reviewers) may want to discuss both kinds of practices with their publishers. I have recommendations for both.
Here’s my candidate recommendation for ensuring a subscription journal has good rights practice. You (and, ideally, your fellow editorial board members) hand the publisher a copy of the Science Commons Delayed Access (SCDA) addendum. (Here’s a sample.) You request that they adjust their standard article publication agreement so as to make the addendum redundant. This request has several nice effects.
- It’s simple, concrete, and unambiguous.
- It describes the desired result in terms of functionality — what the publication agreement should achieve — not how it should be worded.
- It guarantees that the journal exhibits best practices for a subscription journal. Any journal that can satisfy the criterion that the SCDA addendum is redundant:
- Lets authors retain rights to use and reuse the article in further research and scholarly activities,
- Allows green OA self-archiving without embargo,
- Allows compliance with any funder policies (such as the NIH Public Access Policy),
- Allows compliance with employer policies (such as university open-access policies) without having to get a waiver, and
- Allows distribution of the publisher’s version of the article after a short embargo period.
- It applies to journals of all types. (Just because the addendum comes from Science Commons doesn’t mean it’s not appropriate for non-science journals.)
- It doesn’t require the journal to give up exclusivity to its published content (though it makes that content available with a moving six-month wall).
The most controversial aspect of an SCDA-compliant agreement from the publisher’s point of view is likely the ability to distribute the publisher’s version of the article after a six-month embargo. I wouldn’t be wed to that six-month figure. This provision would be the first thing to negotiate, by increasing the embargo length — to one year, two years, even five years. But sticking to some finite embargo period for distributing the publisher’s version is a good idea, if only to establish the precedent. Once the journal is committed to allowing distribution of the publisher’s version after some period, the particular embargo length might be reduced over time.
The previous proposal does a good job, to my mind, of encapsulating a criterion of best publication-agreement practice, but it doesn’t address the important issue of pricing practice. Indeed, with respect to pricing, it’s tricky to define good value. Looking at the brute price of a journal is useless: journals publish wildly different numbers of articles per year, from single digits to four digits, so variations of three orders of magnitude in price per journal are to be expected. Price per article and price per page are more plausible metrics of value, but even there, because journals differ in the quality of the articles they tend to publish, and hence in their likely utility to readers, variation in these metrics should be expected as well. For this reason, some analyses of value look to citation rate as a proxy for quality, leading to a calculation of price per citation.
Another problem is getting appropriate measures of the numerator in these metrics. When calculating price per article or per page or per citation, what price should one use? Institutional list price is a good start. List price for individual subscriptions is more or less irrelevant, given that individual subscriptions account for a small fraction of revenue. But publishers, especially major commercial publishers with large journal portfolios, practice bundling and price discrimination that make it hard to get a sense of the actual price that libraries pay. On the other hand, list price is certainly an upper bound on the actual price, so not an unreasonable proxy.
Finally, any of these metrics may vary systematically across research fields, so the metrics ought to be relativized within a field.
Ted and Carl Bergstrom have collected just this kind of data for a large range of journals at their journalprices.com site, calculating price per article and price per citation along with a composite index calculated as the geometric mean of the two. To handle the problem of field differences, they provide a relative price index (RPI) that compares the composite to the median for non-profit journals within the field, and propose that a journal be considered “good value” if RPI is less than 1.25, “medium value” if its RPI is less than 2, and “bad value” otherwise.
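In code, the Bergstroms’ indices amount to something like the following sketch, reconstructed from the description above (journalprices.com handles the actual data collection and field assignments):

```python
from math import sqrt
from statistics import median

def composite_index(price, articles, citations):
    """Geometric mean of price per article and price per citation."""
    return sqrt((price / articles) * (price / citations))

def relative_price_index(journal_composite, field_nonprofit_composites):
    """A journal's composite index relative to the median non-profit
    journal in its field."""
    return journal_composite / median(field_nonprofit_composites)

def value_category(rpi):
    if rpi < 1.25:
        return "good value"
    elif rpi < 2:
        return "medium value"
    else:
        return "bad value"
```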
As a good first cut at a simple message to a journal publisher, then, you could request that the price of the journal be reduced to bring its RPI below 1.25 (that is, good value), or at least below 2 (medium value). Since lots of journals run in the black with composite price indexes below the non-profit median, that is, with RPI below 1, an RPI of 2 should be achievable for any efficient publisher. (My colleague’s journal, the one that precipitated this post, has an RPI of 2.85. Plenty of room for improvement.)
In summary, you ask the journal to change its publication agreement to be SCDA-compliant and its price to bring its RPI below 2. That’s specific, pragmatic, and actionable. If the journal won’t comply, you at least know where it stands. And if you don’t like the answers you’re getting, you can work to find a new publisher willing to play ball, or at the least, stop lending your free labor to the current one.
May 29th, 2012
|John Tenniel, c. 1864. Study for illustration to Alice’s adventures in wonderland. Harcourt Amory collection of Lewis Carroll, Houghton Library, Harvard University.|
We’ve just completed spring semester during which I taught a system design course jointly in Engineering Sciences and Computer Science. The aim of ES96/CS96 is to help the students learn about the process of solving complex, real-world problems — applying engineering and computational design skills — by undertaking an extended, focused effort directed toward an open-ended problem defined by an interested “client”.
The students work independently as a self-directed team. The instructional staff provides coaching, but the students do all of the organization and carrying out of the work, from fact-finding to design to development to presentation of their findings.
This term the problem to be addressed concerned the Harvard Library’s exceptional special collections, vast holdings of rare books, archives, manuscripts, personal documents, and other materials that the library stewards. Harvard’s special collections are unique and invaluable, but are useful only insofar as potential users of the material can find and gain access to them. Despite herculean efforts of an outstanding staff of archivists, the scope of the collections means that large portions are not catalogued, or catalogued in insufficient detail, making materials essentially unavailable for research. And this problem is growing as the cataloging backlog mounts. The students were asked to address core questions about this valuable resource: What accounts for this problem at its core? Can tools from computer science and technology help address the problems? Can they even qualitatively improve the utility of the special collections?
The students’ recommendations centered on the design, development, and prototyping of an “archivist’s workstation” and the unconventional “flipped” collections processing that the workstation enables. Their process involves exhaustive but lightweight digitization of a collection as a precursor to highly efficient metadata acquisition on top of the digitized images, rather than the conventional approach of digitizing selectively only after all processing of the collection is performed. The “digitize first” approach means that documents need only be touched once, with all of the sorting, arrangement, and metadata application performed virtually, using optimized user interfaces that the students designed for these purposes. The output is a dynamic finding aid with images of all documents, complete with search and faceted browsing of the collection, to supplement the static finding aid of traditional archival processing. The students estimate that processing in this way would be faster than current methods, while delivering a superior result. Their demo video (below) gives a nice overview of the idea.
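To give a flavor of what “digitize first” buys you, here is a toy model of my own devising (not the students’ actual code): metadata lives on top of the images and can be applied and faceted incrementally, with undescribed documents still visible in the meantime.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class DocumentImage:
    """A lightweight scan captured up front, so the physical document
    need be touched only once."""
    image_path: str
    metadata: dict = field(default_factory=dict)  # applied later, virtually

def facets(collection, key):
    """Group a digitized collection by one metadata key, for faceted
    browsing in a dynamic finding aid."""
    groups = defaultdict(list)
    for doc in collection:
        groups[doc.metadata.get(key, "unprocessed")].append(doc)
    return groups

docs = [DocumentImage("box1/f3/d042.tif", {"series": "Correspondence"}),
        DocumentImage("box1/f3/d043.tif")]  # not yet described
by_series = facets(docs, "series")
```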
The deliverables for the course are now available at the course web site, including the final report and a videotape of their final presentation before dozens of Harvard archivists, librarians, and other members of the community.
I hope others find the ideas that the students developed as provocative and exciting as I do. I’m continuing to work with some of them over the summer and perhaps beyond, so comments are greatly appreciated.
May 21st, 2012
I just sent the email below to my friends and family. Feel free to send a similar letter to yours.
You know me. I don’t send around chain letters, much less start them. So you know that if I’m sending you an email and asking you to tell your friends, it must be important.
This is important.
As taxpayers, we deserve access to the research that we fund. It’s in everyone’s interest: citizens, researchers, government, everyone. I’ve been working on this issue for years. I recently testified before a House committee about it.
Now we have an opportunity to tell the White House that they need to take action. There is a petition at the White House petition site calling for “President Obama to act now to implement open access policies for all federal agencies that fund scientific research.” If we get 25,000 signatures by June 19, 2012, the petition will be placed in the Executive Office of the President for a policy response.
Please sign the petition. I did. I was signatory number 442. Only 24,558 more to go.
Signing the petition is easy. You register at the White House web site verifying your email address, and then click a button. It’ll take five minutes tops. (If you’re already registered, you’re down to ten seconds.)
Please sign the petition, and then tell those of your friends and family who might be interested to do so as well. You can inform people by tweeting them this URL <http://bit.ly/MAbTHG> or posting on your Facebook page or sending them an email or forwarding them this one. If you want, you can point them to a copy of this email that I’ve put up on the web at <http://hvrd.me/J8EmyD>.
Since I’ve just requested that you send other people this email (and that they do so as well), I want to make sure that there’s a chain letter disclaimer here: Do not merely spam every email address you can find. Please forward only to those people who you know well enough that it will be appreciated. Do not forward this email after June 19, 2012. The petition drive will be over by then. By all means before forwarding the email check the White House web link showing the petition at whitehouse.gov to verify that this isn’t a hoax. Feel free to modify this letter when you forward it, but please don’t drop the substance of this disclaimer paragraph.
You can find out more about the petition from the wonderful people at Access2Research who initiated it, and you can read more about my own views on open access to the scholarly literature at my blog, the Occasional Pamphlet.
Thank you for your help.
April 27th, 2012
|photo by flickr user Iguana Joe, used by permission (CC-by-nc)|
Earlier this week, the Harvard Library announced its new open metadata policy, which was approved by the Library Board earlier this year, along with an initial two metadata releases. The policy is straightforward:
The Harvard Library provides open access to library metadata, subject to legal and privacy factors. In particular, the Library makes available its own catalog metadata under appropriate broad use licenses. The Library Board is responsible for interpreting this policy, resolving disputes concerning its interpretation and application, and modifying it as necessary.
The first releases under the policy include the metadata in the DASH repository. Though this metadata has been available through open APIs since early in the repository’s history, the open metadata policy makes explicit the open licensing terms under which the data is provided.
The release of a huge percentage of the Harvard Library’s bibliographic metadata for its holdings is likely to have a much bigger impact. We’ve provided 12 million records — the vast majority of Harvard’s bibliographic data — describing Harvard’s library holdings in MARC format under a CC0 license that requests adherence to a set of community norms that I think are quite reasonable, primarily calling for attribution to Harvard and our major partners in the release, OCLC and the Library of Congress.
OCLC in particular has praised the effort, saying it “furthers [Harvard's] mandate from their Library Board and Faculty to make as much of their metadata as possible available through open access in order to support learning and research, to disseminate knowledge and to foster innovation and aligns with the very public and established commitment that Harvard has made to open access for scholarly communication. I’m pleased to say that they worked with OCLC as they thought about the terms under which the release would be made.” We’ve gotten nice coverage from the New York Times, Library Journal, and Boing Boing as well.
Many people have asked what we expect people to do with the data. Personally, I have no idea, and that’s the point. I’ve seen over and over that when data is made openly available with the fewest impediments — legal and technical — people are incredibly creative about finding innovative uses for the data that we never could have predicted. Already, we’re seeing people picking up the data, exploring it, and building on it.
- The Digital Public Library of America is making the data available through an API that provides it in a much friendlier form than the raw MARC record dump that Harvard is making available.
- Within hours of release, Benjamin Bergstein had already set up his own search interface to the Harvard data using the DPLA API.
- Carlos Bueno has developed and open-sourced code to parse the Harvard Library Bibliographic Dataset’s “wonky” MARC21 format.
- Alf Eaton has documented his own efforts to work with the bibliographic dataset, providing instructions for downloading and extracting the records and putting up all of the code he developed to massage and render the data. He outlines his plans for further extensions as well.
(I’m sure I’ve missed some of the ways people are using the data. Let me know if you’ve heard of others, and I’ll update this list.)
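If you’d like to poke at the bibliographic dataset yourself, a few lines of Python with the pymarc library are enough to get started. This is just a sketch; the file name below is a placeholder for one of the downloaded dump files, and pymarc’s API details vary a bit across versions.

```python
from pymarc import MARCReader  # pip install pymarc

# "records.mrc" stands in for one of the downloaded dump files.
with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        for title in record.get_fields("245"):  # MARC 245: title statement
            print(title.format_field())         # printable form of the field
```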
As I’ve said before, “This data serves to link things together in ways that are difficult to predict. The more information you release, the more you see people doing innovative things.” These examples are the first evidence of that potential.
John Palfrey, who was really the instigator of the open metadata project, has been especially interested in getting other institutions to make their own collection metadata publicly available, and the DPLA stands ready to help. They’re running a wiki with instructions on how to add your own institution’s metadata to the DPLA service.
It’s hard to list all the people who make initiatives like this possible, since there are so many, but I’d like to mention a few major participants (in addition to John): Jonathan Hulbert, Tracey Robinson, David Weinberger, and Robin Wendler. Thanks to them and the many others that have helped in various ways.
March 30th, 2012
|“Majesty of Law”: statue in front of the Rayburn House Office Building in Washington, D.C. Photo by flickr user NCinDC, used by permission (CC-by-nd)|
Here is my written testimony filed in association with my appearance yesterday at the hearing on “Federally Funded Research: Examining Public Access and Scholarly Publication Interests” before the Subcommittee on Investigations and Oversight of the House Committee on Science, Space and Technology. My thanks to Chairman Broun, ranking member Tonko, and the committee for allowing me the opportunity to speak with them today.
March 8th, 2012
|“Have scientists lost interest again?”|
The “Cost of Knowledge” boycott of Elsevier is in its seventh week. The boycott was precipitated by various practices of the journal publisher, most recently its support for the Research Works Act, a bill that would roll back the NIH public access policy and prevent similar policies by other federal funding agencies.
Early on, several hundred researchers a day were signing the pledge not to submit to, edit for, or review for Elsevier journals, but recently the rate had settled down to about a hundred per day. On February 11, I started tracking the daily totals by scraping the site with a simple scraper I set up at ScraperWiki. The results are graphed below, showing the raw count of signatories with the blue line (left axis) and the number added since the previous day with the green bars (right axis).
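The scraper needn’t be anything fancy. Something along these lines (a sketch, not my actual ScraperWiki code; the URL and the regex for extracting the count are assumptions about the page) logs one total per day, and the daily additions are then just successive differences:

```python
import datetime, re, urllib.request

SITE = "http://thecostofknowledge.com/"  # assumed location of the count

def signatory_count():
    html = urllib.request.urlopen(SITE).read().decode("utf-8", "replace")
    # Placeholder pattern: adjust to however the site renders its total.
    match = re.search(r"([\d,]+)\s+(?:researchers|signatures)", html)
    return int(match.group(1).replace(",", ""))

# One line per day; day-over-day differences give the green bars.
with open("signatures.csv", "a") as log:
    log.write(f"{datetime.date.today()},{signatory_count()}\n")
```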
As you can see from the chart, there seems to be a slight drop in activity around weekends, and Sunday February 26 and Monday February 27 were clearly the slowest days since I’ve been keeping records, and likely since the effort started. On the 27th (red arrow), Elsevier issued its quasi-recantation of support for RWA. (“While we continue to oppose government mandates in this area, Elsevier is withdrawing support for the Research Works Act itself. We hope this will address some of the concerns expressed….”)
The day after Elsevier’s announcement saw a bit of a bump back to previous levels. Was this an instance of the Streisand effect or was the 26-27 dip an aberration? It’s hard to tell. However, since the 27th, it seems clear that the number of pledges is down considerably. It could well be that Elsevier’s tactical approach has worked and it has stanched the spate of boycott pledges, despite the fact that the community was generally unimpressed with Elsevier’s statement, as Peter Suber has cataloged. Alternatively, the current rate of new pledges may just reflect the natural reductions that had been happening over the last few weeks.
|“Note the surges…”|
[Update 4/20/2012: Now that a few more weeks have passed, here’s an updated figure of the boycott growth. Note the surges around March 18 and April 10. As near as I can make out, these were the result of widely disseminated coverage in Slashdot and the Guardian, respectively. These surges show that the boycott hasn’t played itself out yet, and that continued discussion of the boycott is likely to lead to a continued steady rise in the number of signatures.
At the current rate, I expect the number of signatories to hit 10,000 around April 27 or so.]
[Update 4/24/2012: Well, my guess was wrong. A big bump of activity in the last few days meant that the boycott broke 10,000 signatures on April 23. I’m not sure who to blame for the renewed interest in the last couple of days. Anyone have any conjectures?]
March 6th, 2012
|“You seem to believe in fairies.” Photo of the Cottingley Fairies, 1917, by Elsie Wright via Wikipedia.|
Aficionados of open access should know about the Journal of Machine Learning Research (JMLR), an open-access journal in my own research field of artificial intelligence, a subfield of computer science concerned with the computational implementation and understanding of behaviors that in humans are considered intelligent. The journal became the topic of some dispute in a conversation that took place a few months ago in the comment stream of the Scholarly Kitchen blog between computer science professor Yann LeCun and scholarly journal publisher Kent Anderson, with LeCun stating that “The best publications in my field are not only open access, but completely free to the readers and to the authors.” He used JMLR as the exemplar. Anderson expressed incredulity:
I’m not entirely clear how JMLR is supported, but there is financial and infrastructure support going on, most likely from MIT. The servers are not “marginal cost = 0” — as a computer scientist, you surely understand the 20-25% annual maintenance costs for computer systems (upgrades, repairs, expansion, updates). MIT is probably footing the bill for this. The journal has a 27% acceptance rate, so there is definitely a selection process going on. There is an EIC, a managing editor, and a production editor, all likely paid positions. There is a Webmaster. I think your understanding of JMLR’s financing is only slightly worse than mine — I don’t understand how it’s financed, but I know it’s financed somehow. You seem to believe in fairies.
Since I have some pretty substantial knowledge of JMLR and how it works, I thought I’d comment on the facts of the matter.
February 25th, 2012
|“…the interpersonal processes that a student goes through…” Harvard students (2008) by E>mar via flickr. Used by permission (CC by-nc-nd)|
Is the pot calling the kettle black? Oh sure, journal prices are going up, but so is tuition. How can universities complain about journal price hyperinflation if tuition is hyperinflating too? Why can’t universities use that income stream to pay for the rising journal costs?
There are several problems with this argument, above and beyond the obvious one that two wrongs don’t make a right.
First, tuition fees aren’t the bulk of a university’s revenue stream. So even if it were true that tuition is hyperinflating at the pace of journal prices, that wouldn’t mean that university revenues were keeping pace with journal prices.
Second, a journal is a monopolistic good. If its price hyperinflates, buyers can’t go elsewhere for a substitute; it’s pay or do without. But a college education can be arranged for at thousands of institutions. Students and their families can and do shop around for the best bang for the buck. (Just do a search for “best college values” for the evidence.) In economists’ parlance, colleges are economic substitutes. So even if it were true that tuition at a given college is hyperinflating at the pace of journal prices, individual students can adjust accordingly. As the College Board says in their report on “Trends in College Pricing 2011”:
Neither changes in average published prices nor changes in average net prices necessarily describe the circumstances facing individual students. There is considerable variation in prices across sectors and across states and regions as well as among institutions within these categories. College students in the United States have a wide variety of educational institutions from which to choose, and these come with many different price tags.
Third, a journal article is a pure information good. What you buy is the content. Pure information goods include things like novels and music CDs. They tend to have high fixed costs and low marginal costs, leading to large economies of scale. But a college education is not a pure information good. Sure, you are paying in part to acquire some particular knowledge, say, by listening to a lecture. But far more important are the interpersonal processes that a student participates in: interacting with faculty, other instructional staff, librarians, other students, in their dormitories, labs, libraries, and classrooms, and so forth. It is through the person-to-person hands-on interactions that a college education develops knowledge, skills, and character.
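The arithmetic behind the economies-of-scale point is worth making explicit (an illustrative formula of my own, not from any of the sources here). If producing the first copy of a good costs $F$ and each additional copy costs $m$, the average cost over $N$ copies is

$$AC(N) = \frac{F}{N} + m,$$

which falls toward $m$ as $N$ grows. For a pure information good, $m$ is near zero, so average cost plummets with scale.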
This aspect of college education has high marginal costs. One would not expect it to exhibit the economies of scale of a pure information good. So even if it were true that tuition is hyperinflating at the pace of journal prices, that would not take the journals off the hook; they should be able to operate with much higher economies of scale than a college by virtue of the type of good they are.
Which makes it all the more surprising that the claims about college tuition hyperinflating at the rate of journals are, as it turns out, just plain false.
Let’s look at what the average Harvard College student pays for his or her education.
January 4th, 2012
|“…time to switch…” A very old light switch (2008) by RayBanBro66 via flickr. Used by permission (CC by-nc-nd)|
The journal Research in Learning Technology has switched its approach from closed to open access as of New Year’s 2012. Congratulations to the Association for Learning Technology (ALT) and its Central Executive Committee for this farsighted move.
This isn’t the first journal to make the switch. The Open Access Directory lists about 130 of them. In my own research field, the Association for Computational Linguistics (ACL) converted its flagship journal Computational Linguistics to OA as of 2009, and has just announced a new open-access journal, Transactions of the Association for Computational Linguistics. Each such transition is a reminder of the trajectory that journal publishing ought to follow.
The ALT has done lots of things right in this change. They’ve chosen the ideal licensing regime for papers, the Creative Commons Attribution (CC-BY) license. They’ve jettisoned one of the largest commercial subscription journal publishers, and gone with a small but dedicated professional open-access publisher, Co-Action Publishing. They’ve opened access to the journal retrospectively, so that the entire archive, back to 1993, is available from the publisher’s web site.
Here’s hoping that other scholarly societies are inspired by the examples of the ALT and ACL, and join the many hundreds of scholarly societies that publish their journals open access. It’s time to switch.