## What percentage of open-access journals charge publication fees?

### May 29th, 2009

In the popular conception, open-access journals generate revenue by charging publication fees. The popular conception turns out to be false. Various studies have explored the extent to which OA journals charge publication fees. The results have been counterintuitive to many, indicating that far fewer OA journals charge publication fees than one might have thought. You can verify this yourself using some software I provide in this post.

The first study of what we’ll call the “publication-fee percentage”, by Kaufman and Wills, showed that fewer than half of the OA journals they looked at charge publication fees; the figure they report is about **47%**. (For convenience, we put all publication-fee percentages in boldface in this post.) Following on from this, Suber and Sutton reported that **16.7%** of scholarly society journals charge publication fees.

Bill Hooker came up with a clever way of calculating the publication-fee percentage: taking advantage of the publication fee metadata hidden in the “for authors” journal listings at the Directory of Open Access Journals. He calculated the figure as of December 2007. Here are his totals:

| Fee status | Journals | Share |
| --- | --- | --- |
| Charges | 534 | **18%** |
| No charges | 1980 | 67% |
| Information missing | 453 | 15% |
| Total (excl. hybrids) | 2967 | |

Depending on the disposition of the “information missing” cases, Hooker’s study indicates that **18-33%** of OA journals charge fees.
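The range follows directly from the counts in the table above. As a minimal sketch (Python 3; the function name `fee_share_bounds` is purely illustrative and not part of Hooker’s study), the lower bound treats every “information missing” journal as charging no fee, and the upper bound treats every one as charging a fee:

```python
def fee_share_bounds(charges, no_charges, missing):
    """Return (low, high) bounds on the share of journals charging fees.

    low assumes none of the journals with missing information charge a fee;
    high assumes all of them do.
    """
    total = charges + no_charges + missing
    low = charges / total
    high = (charges + missing) / total
    return low, high


# Hooker's December 2007 counts, taken from the table above
low, high = fee_share_bounds(charges=534, no_charges=1980, missing=453)
print(f"{low:.1%} to {high:.1%}")  # prints "18.0% to 33.3%"
```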

Hooker performed his study using a combination of automated and manual methods. In particular, he apparently used manual effort to eliminate the hybrid journal listings. But it isn’t difficult to write software to perform the entire analysis automatically, which allows anyone to replicate the results him- or herself. Unfortunately, the OAI-PMH feed that DOAJ kindly provides doesn’t include the crucial information of whether journals charge fees and whether they are pure or hybrid OA journals, so I, like Hooker, resorted to screen-scraping. The method is effective, if inelegant.

Here are the results computed by my software, as of May 26, 2009:

| Fee status | Journals | Share |
| --- | --- | --- |
| Charges | 951 | **23.14%** |
| No charges | 2889 | 70.29% |
| Information missing | 270 | 6.57% |
| Hybrid | 1519 | 26.99% |
| Total | 5629 | |

The numbers are consistent with those of Hooker’s study some 16 months earlier. The total number of full OA journals is up from 2967 to 4110 (the 5629 listings less the 1519 hybrids), and the share with missing information has more than halved, from 15% to about 7%. The reduction in missing information seems to have gone more to journals with fees than to those without: the percentage charging fees is up some 5 percentage points, and the percentage not charging fees is up only about 3. Again, depending on the “information missing” cases, the range of fee-charging journals is **23-30%**. Assuming that the missing-information cases are similar in distribution to those that were resolved since Hooker’s study, the figure would be about **27%**. That leaves 73% of OA journals, the overwhelming bulk, charging no fees.
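The post does not spell out how the figure of about 27% is reached. One plausible reading, sketched below in Python 3, is to split the remaining “information missing” share between fee-charging and free journals in the same proportion as the gains each category made between the two studies. The dictionary and function names are illustrative assumptions; only the counts come from the two tables above.

```python
# Counts of full OA journals (hybrids excluded) from the two tables above.
december_2007 = {"charges": 534, "no_charges": 1980, "missing": 453}
may_2009 = {"charges": 951, "no_charges": 2889, "missing": 270}


def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def projected_fee_share(old_counts, new_counts):
    """Estimate the fee-charging share once missing cases are resolved.

    Assumption: the remaining missing cases resolve in the same ratio as
    the percentage-point gains of the two known categories between studies.
    """
    old, new = shares(old_counts), shares(new_counts)
    gain_fee = new["charges"] - old["charges"]
    gain_free = new["no_charges"] - old["no_charges"]
    fee_fraction = gain_fee / (gain_fee + gain_free)
    return new["charges"] + fee_fraction * new["missing"]


estimate = projected_fee_share(december_2007, may_2009)
print(f"{estimate:.1%}")
```

Under this reading, roughly three fifths of the resolved cases went to the fee-charging category, which puts the estimate at about 27%, in line with the figure quoted above.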

Anyone interested in replicating the results should feel free to use the simple Python script below, provided without warranty.

    #!/usr/bin/python
    # Note: written for Python 2 (print statements, urllib.urlretrieve)

    '''
    Calculate the percentage of open access journals with different
    publication fee policies using data from the Directory of Open Access
    Journals (doaj.org)

    Stuart M. Shieber
    March 26, 2009
    '''

    from urllib import urlretrieve
    import os
    import re
    from collections import defaultdict

    feecount = defaultdict(int)
    hybridcount = 0
    journalcount = 0

    def processpage(file):
        '''Process a file of article listings from the DOAJ "Authors"
        listing of articles, which includes publication fee information to
        extract journal entries and update running counts'''

        global hybridcount, journalcount, feecount

        # Get the contents of the file
        f = open(file, 'r')
        contents = f.read()
        f.close()

        # Clean up the file by removing some header stuff
        pat = re.compile("^.*End Result.*<p /><br />", re.DOTALL)
        contents = re.sub(pat, "", contents)
        # Get rid of newlines to make pattern matching easier
        contents = re.sub('\n', '|||', contents)
        # Place each article entry on a separate line by keying off of the
        # serendipitous use of "passMe" at the start of each entry
        contents = re.sub('passMe', '\npassMe', contents)

        # Match each article record, getting title, hybrid status, fee
        # info
        pat = re.compile("passMe[^>]*>([^<]*)</a>.*class=info>([^<]*)</span>.*Publication fee.*>(.*)</font>")
        for match in pat.finditer(contents):
            journalcount += 1
            title = match.group(1)
            accesstype = match.group(2)
            feetype = match.group(3)
            # Print an entry for a csv file
            print "\"%s\", \"%s\", \"%s\"" % (title, accesstype, feetype)
            # Bump counts
            if accesstype == 'Open Access':
                feecount[feetype] += 1
            else:
                hybridcount += 1

    ### Download all of the pages at DOAJ, caching locally, and process
    ### each one
    for letter in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
        for page in range(1,8):
            # Generate source and destination locations
            url = "http://www.doaj.org/doaj?func=byTitle&p=%d&hybrid=1&query=%s" % (page, letter)
            local = "/tmp/%s%d" % (letter, page)
            # Pull over the page if not cached
            if not os.path.exists(local):
                print "retrieving " + url
                urlretrieve(url, local)
            # and process it
            processpage(local)

    ### Print a table of results
    ### (fee shares are fractions of full OA journals; the hybrid share
    ### is a fraction of all listings)
    for fee in feecount.keys():
        print "%-20s : %5d (%5.4f)" % (fee, feecount[fee],
                                       feecount[fee]/float(journalcount-hybridcount))
    print "%-20s : %5d (%5.4f)" % ('Hybrid', hybridcount, hybridcount/float(journalcount))
    print "%-20s : %5d" % ('TOTAL', journalcount)

### 4 Responses to “What percentage of open-access journals charge publication fees?”

1. bill Says:

> it isn’t difficult to write software to perform the entire analysis automatically

Not for you, it’s not! :-) I think I managed to write a regular expression to take out the hybrid entries in TextPad, but that’s about the extent of my programming skills.

So thank you for doing this right — the way I’d have liked to do it in the first place! — and for making the code freely available.
