Posted by: bluesyemre | February 4, 2019

Linking to freely available articles – how various databases and citation indexes use #unpaywall data by #AaronTay


So it has come to pass: the big 3 discovery and citation indexes in our industry, Web of Science, Scopus and Dimensions, now provide widespread native support for users to find open access versions of papers, as of July 2018. This is a remarkable achievement. Besides the 3 major citation indexes mentioned, Unpaywall data is also used in Europe PMC, ScienceOpen, Lens, link resolvers like 360link (though the data usually isn’t preloaded but is called dynamically on the fly like in the earlier examples) and more.

Discovery services like Primo, Summon and EDS have added OA filters and indicators, and have added or will add big chunks of content via large-scale OA aggregators like CORE and/or BASE.

Add the explosion in tools like Kopernio, OAbutton, Anywhere Access, Lean Library Browser, Google Scholar button and more that help users find papers (both subscribed and paid), and there is no doubt that the discovery of open access papers is now mainstream in library-related tools.

To think that just 3.5 years ago, in July 2015, in my surprisingly popular “5 things Google Scholar does better than your discovery service”, I lamented the irony of how library-based discovery tools were horrible at detecting OA content compared to Google.

Things have certainly changed since then and, as I noted in July 2017, OA has become too big to ignore in discovery.

But what are the implications in an environment where finding freely available items is the norm with our tools?

Firstly, just because a service or index like Dimensions, Scopus, Web of Science, Europe PMC etc. is licensing Unpaywall data doesn’t mean it will be using it in the same way. As you will see later, some of them are selectively exposing only a subset of the links (e.g. Scopus is showing only Gold OA, while Web of Science shows Green as well, as long as it is an author accepted manuscript or published version), while others (e.g. Dimensions) are using the full set.

Secondly, what is the right policy for links? Should we demand that such services and indexes link to everything that can be found freely available? Or set a tool like Lean Library ezbrowser to only show author accepted manuscript or published versions? Also, given the difficulties of version detection, can we be sure this is working?

Thirdly, should such OA detection functionality reside at the database vendor site, e.g. Scopus, or should it be controlled by the library, typically via our link resolver, which will offer subscribed versions first?

Lastly, given our tools are now exposing students to more versions of papers, how should we shift the way we do information literacy?

A note about terminology

There is some confusion over what to call different versions of papers. For example, what exactly counts as a “preprint”? A “postprint”? These terms no longer make sense when journals are born digital. Terms like “working paper” add to the confusion as disciplines have different practices.

What standards are there to resolve this?

The NISO JAV (Journal Article Version) standard recommends the following:

1. Author’s Original (AO)
2. Submitted Manuscript Under Review (SMUR)
3. Accepted Manuscript (AM)
4. Proof (P)
5. Version of Record (VoR)
6. Corrected Version of Record (CVoR)
7. Enhanced Version of Record (EVoR)

I’m not sure if it ever caught on. The term Version of Record (VoR), referring to the final version that you see on publisher platforms, seems to be commonly used; the others not so much.

The other standard that seems important is the DRIVER Guidelines v2.0 version types (also used in the Unpaywall API). It gives:

1. draft
2. submitted version
3. accepted version
4. published version
5. updated version

I prefer this version, because it’s simpler and a bit more descriptive though I believe it lacks the equivalent of NISO’s “proof” and “EVoR”. Granted I guess these versions are less commonly encountered unless you are a publisher.

The truth is while terms like “Preprint” and “Postprint” have become hopelessly confusing, I suspect they are still influential due to tradition and they are still used in sites like Sherpa Romeo. In fact, there is a whole class of up and coming repositories that carry the label “Preprint servers”.

For the purposes of this post, I’m going to use the terms

1. Preprint to refer to all versions up to just before the accepted version (also known as the author accepted manuscript).

2. Accepted version or alternatively author accepted manuscript (AAM) – the version just after peer review but before the published version

3. Version of record (VoR) – published version

Users will encounter more versions of papers. Need for Information Literacy?

A few days ago, Lisa Hinchliffe asked on Twitter how librarians were discussing preprints with users. The answer from another librarian and from me was similar: we didn’t think many librarians were talking much about preprints, particularly to first years.

I suspect the reason is that many librarians focus on recommending traditional databases like the Ebsco or Proquest ones, which point to versions of record, so the issue never comes up.

Of course, I’m not naive, and I know our users, whether faculty or undergraduates, use Google and Google Scholar and do encounter versions that are not versions of record. In fact, I suspect there’s a high likelihood they treat such a version just like a VoR and cite it as per normal, unless it occurs to them to ask someone.

But until recently, a librarian was able to maintain the polite fiction that our users would only see VoRs, because for the most part, if they were using our databases, they would rarely encounter other versions.

In the past year or two, keeping up this fiction has become harder and harder. As Lisa points out in the tweet above, even our subscribed databases are starting to point to non-VoRs.

Users and libraries are also increasingly beginning to support the use of tools like Kopernio and Lean Library Browser (e.g. Stanford is encouraging use of this), which not only bring users to subscribed resources but further increase the chances of users running into OA versions that are not versions of record.

Of course, one interesting question is what open access versions do these tools point to? If they pointed only to Gold OA (including hybrid) copies, everything would be as per normal. It’s unclear to me what these tools surface by default and whether there is flexibility given to libraries to change this default.

Will this drive more traffic to our institutional repository?

One can imagine that even within the library there might be differences of opinion on what such tools should point to. For example, an Institutional Repository (IR) Manager might rejoice at how the current trend towards pointing to OA could help drive traffic to IRs.

This may be true, but it depends firstly on how good OA finding/linking tools like Unpaywall are at surfacing Green OA, particularly in institutional repositories, and secondly and more importantly on how the licensees use the data. For example, they may only add links to Gold OA material or only to version of record copies.

Even if they point to earlier versions like author accepted manuscripts or even preprints, such tools may prioritize versions located at subject repositories like PMC, as these are bigger, better known and hence more likely to be customized for (in terms of metadata extraction) than individual institutional repositories.

What can Unpaywall detect?

Let’s focus on Unpaywall data, which is now probably the most widely used source of data for detecting OA material, though many tools further enhance it with their own sources.

Unpaywall data is of course at the article level and, among other things, allows you to tell where the OA article is hosted (host_type = Publisher or Repository) and which version of the paper it is (version = submittedVersion, acceptedVersion or publishedVersion). It can also tell you the license (Creative Commons, publisher license or implied OA).

Sources that license Unpaywall data can choose to use the data to provide all or only some of the links known to Unpaywall. This data can also be used to show the facets and filters relating to open access.
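
To make the idea of "all or only some of the links" concrete, here is a minimal Python sketch of three possible exposure policies. It is only an illustration under stated assumptions, not how any vendor actually implements this. The sample record and its URLs are made up, the function names are hypothetical, and "publisher hosted only" is used as a rough stand-in for "Gold OA only". The field names (oa_locations, host_type, version, license, url) follow the Unpaywall record format, where the host_type values appear in lowercase.

```python
# A made-up record shaped like Unpaywall data (illustrative values only).
SAMPLE_RECORD = {
    "doi": "10.1234/example",
    "oa_locations": [
        {"host_type": "publisher", "version": "publishedVersion",
         "license": "cc-by", "url": "https://publisher.example/article"},
        {"host_type": "repository", "version": "acceptedVersion",
         "license": None, "url": "https://repo.example/aam"},
        {"host_type": "repository", "version": "submittedVersion",
         "license": None, "url": "https://preprints.example/v1"},
    ],
}


def all_links(record):
    """Hypothetical policy: expose every location known for the record."""
    return list(record["oa_locations"])


def publisher_hosted_only(record):
    """Hypothetical policy: expose only publisher-hosted copies.

    Assumption: this is only a rough approximation of "Gold OA only".
    """
    return [loc for loc in record["oa_locations"]
            if loc["host_type"] == "publisher"]


def accepted_or_published(record):
    """Hypothetical policy: expose any host, but drop submitted versions."""
    allowed = {"acceptedVersion", "publishedVersion"}
    return [loc for loc in record["oa_locations"]
            if loc["version"] in allowed]


if __name__ == "__main__":
    for policy in (all_links, publisher_hosted_only, accepted_or_published):
        urls = [loc["url"] for loc in policy(SAMPLE_RECORD)]
        print(policy.__name__, urls)
```

The same record yields three, one or two links depending on the policy, which is why two services licensing identical data can show users quite different things.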

How different licensees of Unpaywall data expose open access papers

How Dimensions uses Unpaywall data

Dimensions, I believe, uses the whole Unpaywall data set to show all links known to Unpaywall. In terms of filters, they provide filtering by host type, i.e. whether an article is publisher hosted or repository hosted. In other words, you get all the OA versions, whether preprints, accepted versions or versions of record, at least as of July 2018.
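
A host type filter of this kind can be sketched in a few lines of Python. This is a guess at the general shape of such a facet, not a description of how Dimensions builds theirs. The records are made up and the function names are hypothetical; only the oa_locations and host_type field names come from the Unpaywall record format.

```python
from collections import Counter

# Made-up records shaped like Unpaywall data (illustrative values only).
RECORDS = [
    {"doi": "10.1234/a", "oa_locations": [
        {"host_type": "publisher", "version": "publishedVersion"},
        {"host_type": "repository", "version": "acceptedVersion"},
    ]},
    {"doi": "10.1234/b", "oa_locations": [
        {"host_type": "repository", "version": "submittedVersion"},
    ]},
    {"doi": "10.1234/c", "oa_locations": []},  # no OA copy known
]


def host_type_facet(records):
    """Count how many records have at least one copy per host type."""
    counts = Counter()
    for record in records:
        # Use a set so a record with two repository copies counts once.
        host_types = {loc["host_type"] for loc in record.get("oa_locations", [])}
        counts.update(host_types)
    return dict(counts)


def filter_by_host_type(records, host_type):
    """Keep records with at least one copy hosted at the given host type."""
    return [r for r in records
            if any(loc["host_type"] == host_type
                   for loc in r.get("oa_locations", []))]


if __name__ == "__main__":
    print(host_type_facet(RECORDS))
    print([r["doi"] for r in filter_by_host_type(RECORDS, "repository")])
```

Note that a filter like this says nothing about version: a "repository hosted" result may be a preprint, an accepted version or a version of record.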
