In the popular conception, open-access journals generate revenue by charging publication fees. The popular conception turns out to be false. Various studies have explored the extent to which OA journals charge publication fees. The results have been counterintuitive to many, indicating that far fewer OA journals charge publication fees than one might have thought. You can verify this yourself using some software I provide in this post.
The first study of what we’ll call the “publication-fee percentage”, by Kaufman and Wills, showed that fewer than half of the OA journals they looked at charge publication fees. The figure for publication-fee percentage they report is about 47%. (For convenience, we put all publication-fee percentages in boldface in this post.) Following on from this, Suber and Sutton provided a figure of 16.7% for scholarly society journals charging publication fees.
Bill Hooker came up with a clever way of calculating the publication-fee percentage: he took advantage of the publication-fee metadata hidden in the “for authors” journal listings at the Directory of Open Access Journals (DOAJ) to calculate the figure as of December 2007. Here are his totals:
Charges                   534 (18%)
No charges               1980 (67%)
Information missing       453 (15%)
Total (excl. hybrids)    2967
Depending on the disposition of the “information missing” cases, Hooker’s study indicates that 18-33% of OA journals charge fees.
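The range follows directly from Hooker’s totals: the lower bound assumes that none of the “information missing” journals charge fees, and the upper bound assumes that all of them do. A minimal sketch of that arithmetic in Python 3 follows; the function name `fee_range` is introduced here for illustration and is not part of Hooker’s study.

```python
def fee_range(charges, no_charges, missing):
    """Return (low, high) bounds on the fraction of journals charging fees.

    low assumes no "information missing" journal charges a fee;
    high assumes every one of them does.
    """
    total = charges + no_charges + missing
    return charges / total, (charges + missing) / total


# Hooker's December 2007 totals, hybrids excluded
low, high = fee_range(charges=534, no_charges=1980, missing=453)
print("%.0f%% to %.0f%%" % (100 * low, 100 * high))  # prints "18% to 33%"
```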
Hooker performed his study using a combination of automated and manual methods. In particular, he apparently used manual effort to eliminate the hybrid journal listings. But it isn’t difficult to write software to perform the entire analysis automatically, so that anyone can replicate the results. Unfortunately, the OAI-PMH feed that DOAJ kindly provides doesn’t include the crucial information of whether journals charge fees and whether they are pure or hybrid OA journals, so I, like Hooker, resorted to screen-scraping. The method is effective, if inelegant.
Here are the results computed by my software, as of May 26, 2009:
Charges                   951 (23.14%)
No charges               2889 (70.29%)
Information missing       270 (6.57%)
Hybrid                   1519 (26.99%)
Total                    5629
The numbers are consistent with those of Hooker’s study some 16 months earlier. You’ll see that the total number of full OA journals is up from 2967 to 4110, and the share with missing information has more than halved, from 15% to about 7%. The reduction in those with missing information seems to have gone more to those with fees than those without: the percentage charging fees is up some 5 percentage points, and the percentage not charging fees only about 3. Again, depending on the “information missing” cases, the range of fee-charging journals is 23-30%. Assuming that the remaining missing-information cases are distributed like those that were resolved since Hooker’s study, the figure would be about 27%. That leaves 73% of OA journals, the overwhelming bulk, charging no fees.
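One way to reconstruct the 27% estimate is to split the remaining “information missing” share between fee-charging and no-fee journals in the same proportion as the gains each category made between the two studies. The Python 3 sketch below is a reading of that calculation under this assumption, using the raw counts from the two tables; the helper names `shares` and `projected_fee_share` are hypothetical.

```python
def shares(counts):
    """Convert raw counts to percentages of their own total."""
    total = sum(counts.values())
    return {key: 100.0 * value / total for key, value in counts.items()}


def projected_fee_share(old, new):
    """Estimate the fee-charging percentage once missing cases are resolved.

    Assumes the still-missing cases split between "charges" and
    "no_charges" in the same ratio as the gains between the two studies.
    """
    old_pct, new_pct = shares(old), shares(new)
    gain_fee = new_pct["charges"] - old_pct["charges"]
    gain_free = new_pct["no_charges"] - old_pct["no_charges"]
    fee_fraction = gain_fee / (gain_fee + gain_free)
    return new_pct["charges"] + fee_fraction * new_pct["missing"]


# Full OA journals only (hybrids excluded)
hooker_2007 = {"charges": 534, "no_charges": 1980, "missing": 453}
may_2009 = {"charges": 951, "no_charges": 2889, "missing": 270}

estimate = projected_fee_share(hooker_2007, may_2009)
print("Projected fee-charging share: %.1f%%" % estimate)
```

With these counts the projection comes out at about 27%, between the 23% and 30% bounds.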
Anyone interested in replicating the results should feel free to use the simple Python script below (written for Python 2), provided without warranty.
#!/usr/bin/python
'''
Calculate the percentage of open access journals with different
publication fee policies using data from the Directory of Open
Access Journals (doaj.org)

Stuart M. Shieber
March 26, 2009
'''

from urllib import urlretrieve
import os
import re
from collections import defaultdict

feecount = defaultdict(int)
hybridcount = 0
journalcount = 0

def processpage(file):
    '''Process a file of article listings from the DOAJ "Authors"
    listing of articles, which includes publication fee information
    to extract journal entries and update running counts'''
    global hybridcount, journalcount, feecount

    # Get the contents of the file
    f = open(file, 'r')
    contents = f.read()
    f.close()

    # Clean up the file by removing some header stuff
    pat = re.compile("^.*End Result.*<p /><br />", re.DOTALL)
    contents = re.sub(pat, "", contents)
    # Get rid of newlines to make pattern matching easier
    contents = re.sub('\n', '|||', contents)
    # Place each article entry on a separate line by keying off of the
    # serendipitous use of "passMe" at the start of each entry
    contents = re.sub('passMe', '\npassMe', contents)

    # Match each article record, getting title, hybrid status, fee
    # info
    pat = re.compile("passMe[^>]*>([^<]*)</a>.*class=info>([^<]*)</span>.*Publication fee.*>(.*)</font>")
    for match in pat.finditer(contents):
        journalcount += 1
        title = match.group(1)
        accesstype = match.group(2)
        feetype = match.group(3)
        # Print an entry for a csv file
        print "\"%s\", \"%s\", \"%s\"" % (title, accesstype, feetype)
        # Bump counts
        if accesstype == 'Open Access':
            feecount[feetype] += 1
        else:
            hybridcount += 1

### Download all of the pages at DOAJ, caching locally, and process
### each one
for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    for page in range(1,8):
        # Generate source and destination locations
        url = "http://www.doaj.org/doaj?func=byTitle&p=%d&hybrid=1&query=%s" % (page, letter)
        local = "/tmp/%s%d" % (letter, page)
        # Pull over the page if not cached
        if not os.path.exists(local):
            print "retrieving " + url
            urlretrieve(url, local)
        # and process it
        processpage(local)

### Print a table of results
for fee in feecount.keys():
    print "%-20s : %5d (%5.4f)" % (fee, feecount[fee], feecount[fee]/float(journalcount-hybridcount))
print "%-20s : %5d (%5.4f)" % ('Hybrid', hybridcount, hybridcount/float(journalcount))
print "%-20s : %5d" % ('TOTAL', journalcount)
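For readers who want to try the record-matching step without downloading anything, here is a small Python 3 sketch that applies the same regular expression to a made-up fragment. The markup in `SAMPLE`, including the fee values “Yes” and “No” and the access type “Hybrid”, is hypothetical: it was written to fit the script’s pattern and is not copied from DOAJ’s pages. The function name `tally` is likewise an illustration. For simplicity the sample already has one record per line, so the newline juggling from the script is not needed.

```python
import re
from collections import Counter

# Same record pattern as in the script above
RECORD = re.compile(
    r"passMe[^>]*>([^<]*)</a>.*class=info>([^<]*)</span>"
    r".*Publication fee.*>(.*)</font>"
)

# HYPOTHETICAL markup, one journal record per line
SAMPLE = "\n".join([
    '<a href="doaj?func=passMe&id=1">Journal A</a> '
    '<span class=info>Open Access</span> '
    'Publication fee: <font color="red">Yes</font>',
    '<a href="doaj?func=passMe&id=2">Journal B</a> '
    '<span class=info>Open Access</span> '
    'Publication fee: <font color="green">No</font>',
    '<a href="doaj?func=passMe&id=3">Journal C</a> '
    '<span class=info>Open Access</span> '
    'Publication fee: <font color="green">No</font>',
    '<a href="doaj?func=passMe&id=4">Journal D</a> '
    '<span class=info>Hybrid</span> '
    'Publication fee: <font color="red">Yes</font>',
])


def tally(html):
    """Count fee policies for full OA journals and count hybrids separately."""
    fees = Counter()
    hybrids = 0
    for line in html.splitlines():
        match = RECORD.search(line)
        if match is None:
            continue  # not a journal record
        title, accesstype, feetype = match.groups()
        if accesstype == "Open Access":
            fees[feetype] += 1
        else:
            hybrids += 1
    return fees, hybrids


fees, hybrids = tally(SAMPLE)
full_oa = sum(fees.values())
for fee, count in fees.items():
    print("%-20s : %5d (%5.4f)" % (fee, count, count / full_oa))
print("%-20s : %5d" % ("Hybrid", hybrids))
```

As in the script, the fractions for fee policies are taken over full OA journals only, with hybrids counted separately.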