Persistent Identifiers and Open Citations: Basic Building Blocks of the Scholarly Web

45 Minutes

Learning Objectives


Digital Object Identifiers (DOIs) are unique names assigned to information resources (including research papers and datasets) that are represented in some way on the Internet. DOIs represent an established international information standard, ISO 26324:2012, and many publishers, data centers, and other information providers rely on this standard to assign unique identifiers to works under their care. The DOI assigned to a given research object distinguishes it from other works, including other versions of the same intellectual material. Examples of research-related information resources assigned DOIs include journal articles, curated datasets, theses, conference papers, pre-prints, technical reports, and books.

DOIs are actionable on the Internet: when put in URL form ({DOI}), these strings automatically redirect to an online landing page that offers information about the research object. Often, but not always, the landing page contains a link to the resource itself. Where the resource is not online, the landing page indicates where to find it. For example, a DOI assigned to a physical resource such as a print book, a museum specimen, or a scientific sample will specify the repository or collection where that item resides.
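As a small illustration of the URL form, the sketch below turns a bare DOI into a resolvable link. It assumes the public resolver at https://doi.org/; the function name doi_to_url is hypothetical, and the sample DOI 10.1000/182 (the DOI Handbook) is used only as an example.

```shell
# doi_to_url DOI: hypothetical helper that prints the URL form of a DOI.
# Assumption: the public resolver at https://doi.org/ is used.
doi_to_url() {
  doi="${1#doi:}"                 # drop an optional leading "doi:" label
  printf 'https://doi.org/%s\n' "$doi"
}

doi_to_url "10.1000/182"
doi_to_url "doi:10.1000/182"
```

Both calls print the same URL, which can be pasted into a browser or handed to curl.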

The DOI is NOT by itself a seal of quality. Yet information resources that are assigned DOIs tend to be of enduring value; are likely to be used and cited by others; and are maintained by a publisher or other information provider who is committed to curating and preserving the resource over time. Because an organization is committed to curating the resource over time, the DOI is considered persistent. Some say that the DOI is basically a promise to always supply information about the information resource associated with the identifier.

A DOI is considered a persistent identifier because it reliably resolves to a human- and machine-readable landing page representing the information resource. But the DOI itself is not a guarantee that the dataset or paper is available via Open Access: rather, the DOI resolves to an openly accessible landing page that may (or may not) permit linking through to the desired object. Whether or not a user can access the full content depends on various circumstances unrelated to the DOI system: the format of the resource; restrictions and conditions governing access; authentication requirements; software compatibility; etc.

Nonetheless, the metadata associated with the DOI is rich enough to provide useful data for researchers. DOI metadata can tell a lot about what has been published, who has published it, what works it cites, and under what conditions the work was created and distributed (with what funding, whether the work is available open access, whether the work has been updated since publication, and more).

In this session we’ll explore the anatomy of a DOI, how it is generated, and how to retrieve the rich metadata associated with a given research object and its DOI.

Finally, we will add to our command line repertoire by practicing a few new tools to help us acquire, examine, and use scholarly metadata available on the Web.

Exercise 1a. Practice using curl to interact with a World Wide Web site and retrieve a document from that site to a file on your desktop. Then display the file on your terminal.

First, check that you are on your desktop. Type

$ pwd

This will print your location to the terminal. In most cases you can move to the desktop by typing

$ cd ~/Desktop

Now we’ll use curl to grab web content. Note that you can copy and paste these commands into your terminal window to save typing.

$ curl -o think.html
$ head think.html

What is the format of the retrieved content? How would we view this content?

Exercise 1b. Convert the file into a clean format for reading or printing using pandoc

$ pandoc -o think.docx think.html

Note that the output file must end in .docx, not .doc: pandoc chooses the output format from the file extension and cannot write the older .doc format. Launch LibreOffice: how does the document look now? Will a printout of this document look different from one printed directly from the website?

TIP: Feel free to send yourself a copy of this useful handout on how to assess whether a journal is reputable or not

Exercise 2a. Practice using curl to retrieve data from the DOI database, CrossRef, and save to a file on your desktop. Then display the file on your terminal.

$ curl -o shen.json
$ head shen.json

What is the format of the retrieved content?

Exercise 2b. Use either Atom (with pretty-json) or jq to pretty print the file for easier human reading.


Open your .json file in Atom. Click Packages/Pretty JSON/Prettify. Now you can save the formatted file as shen_pretty.json.



$ jq . shen.json > shen_pretty.json
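Beyond pretty printing, jq can also pull single fields out of a record. The sketch below runs on a tiny made-up record, not the real shen.json: the file name sample.json and all of its values are invented for illustration, and the field layout (a top-level "message" object with "DOI", "title", and "publisher") is an assumption about what a CrossRef-style record looks like.

```shell
# Write a small, made-up record to experiment with (hypothetical values).
cat > sample.json <<'EOF'
{"status": "ok",
 "message": {"DOI": "10.1234/example",
             "title": ["An Example Title"],
             "publisher": "Example Press"}}
EOF

# -r prints raw strings without the surrounding JSON quotes.
jq -r '.message.title[0]' sample.json     # first entry of the title array
jq -r '.message.publisher' sample.json    # a plain string field
jq -r '.message | keys[]' sample.json     # list the field names available
```

Listing the keys first is a handy way to discover which fields a record offers before writing more specific queries.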

Now that you can read the file more easily, you should be able to answer the following questions:

Exercise 3a. You just need the citation, not the entire metadata record for this research object. Use content negotiation with the CrossRef database to get just the citation for this item, in APA style. View the result on your screen.

$ curl -LH "Accept:text/x-bibliography; style=apa" -o shen.txt

The -H option provides headers and the -L option tells curl to follow redirects. See what you get when you leave off the -L.
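To make the role of the two options concrete, here is a minimal sketch that wraps the request in a reusable function. It assumes content negotiation is done against the public resolver at https://doi.org/, which may differ from the address used in the exercise; the function name get_citation and the sample DOI are hypothetical choices for illustration.

```shell
# get_citation DOI [STYLE]: hypothetical helper that requests a formatted citation.
# Assumption: the resolver at https://doi.org/ honors the Accept header.
get_citation() {
  # -sS : hide the progress meter but still show errors
  # -L  : follow the redirect from the resolver to the metadata service
  # -H  : send an Accept header naming the format we want back
  curl -sS -L \
       -H "Accept: text/x-bibliography; style=${2:-apa}" \
       "https://doi.org/$1"
}

# Example (needs network access):
# get_citation 10.1000/182 apa
```

The citation style defaults to apa when the second argument is left off.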

Exercise 3b. It turns out that some other systems where you want to submit this citation data only take the open citation format, bibtex. Perform content negotiation with the CrossRef database again, but this time require the citation in bibtex format. Save the output to a file on your desktop for later reuse.

(For example, the ORCiD researcher profile system and certain funding agencies’ submission systems accept bibtex citations.)

$ curl -LH "Accept:application/x-bibtex" -o shen.bib

$ head shen.bib

Exercise 3c. Retrieve and save bibtex citations for three more research papers so you can complete your publication list for your project. Here are the DOIs for each paper:

Challenge question: how could you use a single command line tool to quickly combine these citations into one file representing your publication list?


$ cat file1.bib file2.bib file3.bib > publist.bib

Exercise 4. The final step in your Author Carpentry DOI pipeline is to make a ‘publication list’ of your works based on their DOIs. Can you apply the steps learned in Exercises 2+3 to accomplish this task?

HINT: You need to use content negotiation with the DOI database to retrieve citations for each DOI you have. Then combine the individual citations into a single file.
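One possible way to automate the hint is sketched below; it is not the only solution. It assumes the public resolver at https://doi.org/ answers bibtex requests, and the helper names fetch_bibtex and build_publist, as well as the file names dois.txt and publist.bib, are hypothetical.

```shell
# fetch_bibtex DOI: hypothetical helper that requests one bibtex record.
# Assumption: https://doi.org/ honors the application/x-bibtex Accept header.
fetch_bibtex() {
  curl -sS -L -H "Accept: application/x-bibtex" "https://doi.org/$1"
}

# build_publist DOI_FILE OUT_FILE: hypothetical helper.
# DOI_FILE holds one DOI per line; OUT_FILE receives all records.
build_publist() {
  : > "$2"                                   # start with an empty output file
  while IFS= read -r doi || [ -n "$doi" ]; do
    [ -n "$doi" ] || continue                # skip blank lines
    fetch_bibtex "$doi" >> "$2"              # append this record
    printf '\n' >> "$2"                      # keep records on separate lines
  done < "$1"
}

# Example (needs network access):
# build_publist dois.txt publist.bib
```

Keeping the DOIs in a plain text file means the list can be rebuilt at any time by adding a line and running the function again.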

Once you have a bibtex file with your DOIs and citations, you are ready to set up your ORCiD account and connect your author ID with your DOIs.

Anatomy of a DOI

The International DOI System is the overall infrastructure by which Digital Object Identifiers are assigned, registered, resolved, and associated with valuable metadata including citation, availability of full text, funder information, licensing information, and more. The following components of the DOI System together make it work:

The metadata associated with the DOI is often rich enough to provide useful data for a researcher. We’ll look at this data throughout the rest of the lesson.

Exercise 5. Find out who owns a DOI prefix

DOI prefixes can be assigned by different Registration Agencies to different users. DataCite has an API that tells you the Registration Agency for a given prefix.

$ curl ""
$ curl ""

If the Registration Agency is CrossRef, you can use the CrossRef API to get more information about the member assigned to the prefix.

$ curl ""
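To look up a prefix you first need to separate it from the rest of the DOI. A DOI has the form prefix/suffix, so the prefix is everything before the first slash. The sketch below does the split with plain shell parameter expansion; the sample DOI 10.1000/182 is used only for illustration.

```shell
doi="10.1000/182"

prefix="${doi%%/*}"   # remove the first slash and everything after it
suffix="${doi#*/}"    # remove everything up to and including the first slash

printf 'prefix: %s\n' "$prefix"
printf 'suffix: %s\n' "$suffix"
```

Because only the first slash is treated as the separator, a suffix that itself contains slashes stays in one piece.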
