Tech Blog

Using the Linked Data Integration API to Enhance Discovery

At the University of Wisconsin-Madison, we have been using the Alma Linked Data integration API to enhance an experimental version of our local discovery system with info cards about some of the identities found in bibliographic records. See, for example, Gertrude Stein on Picasso. For this record, the catalog attempts to retrieve brief biographical information about Pablo Picasso and Gertrude Stein.

This API returns a description of a bibliographic record from our Alma catalog in the lightweight JSON Linked Data (JSON-LD) format. Because the format is JSON, the API responses are ready-made for web programming. Because it is Linked Data, the information contains not just strings of text but also identifiers in the form of URIs, which we can resolve to find more information that is not in our catalog records.

The JSON-LD API produces data like the following:

"creator":[
  {
    "@id":"http://id.loc.gov/authorities/names/n78086005",
    "label":"Picasso, Pablo, 1881-1973.",
    "sameAs":"https://open-na.hosted.exlibrisgroup.com/resolver/wikidata/lc/n78086005"
  },
  {
    "@id":"http://id.loc.gov/authorities/names/n79006977",
    "label":"Stein, Gertrude, 1874-1946.",
    "sameAs":"https://open-na.hosted.exlibrisgroup.com/resolver/wikidata/lc/n79006977"
  }
]

The key pieces of information in this excerpt of the JSON data are the URIs for the Library of Congress Name Authority File (LCNAF) entries. We will use them to resolve the URIs for these identities in the Virtual International Authority File (VIAF), which serves as an identity hub that points out to many other useful sources of information.
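As a rough illustration of this first step, the sketch below pulls the LCNAF URIs out of a record shaped like the excerpt above, using only Ruby's standard library. The helper name lcnaf_creator_uris and the wrapping of the excerpt in a top-level object are assumptions made for the example; the full API response contains many more fields.

```ruby
require "json"

# Excerpt shaped like the "creator" array shown above, wrapped in an object
# so that it parses as a standalone JSON document (the wrapper is an assumption).
record_json = <<~JSON
  {
    "creator": [
      {
        "@id": "http://id.loc.gov/authorities/names/n78086005",
        "label": "Picasso, Pablo, 1881-1973."
      },
      {
        "@id": "http://id.loc.gov/authorities/names/n79006977",
        "label": "Stein, Gertrude, 1874-1946."
      }
    ]
  }
JSON

LCNAF_PREFIX = "http://id.loc.gov/authorities/names/"

# Hypothetical helper: return the LCNAF URIs of all creators in a parsed record.
# A lone creator object is tolerated as well as an array; whether the API ever
# returns a lone object is an assumption, not something the excerpt shows.
def lcnaf_creator_uris(record)
  creators = record["creator"]
  creators = [creators] if creators.is_a?(Hash)
  Array(creators)
    .map { |creator| creator["@id"] }
    .compact
    .select { |id| id.start_with?(LCNAF_PREFIX) }
end

uris = lcnaf_creator_uris(JSON.parse(record_json))
puts uris
```

Filtering on the LCNAF prefix keeps the later VIAF lookup from being attempted for creators that carry some other kind of identifier.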

Resolving the VIAF URI

In a Linked Data context, ideally one would simply follow links and crawl from one data description to another. Unfortunately, the data we want to retrieve is at VIAF, and the data available from the Library of Congress is a dead end of sorts. The data representations of the authority records at id.loc.gov only provide data points internal to their own data set. They do not link to other entities on the Web that describe the same people. As a result, we must do a little work to figure out what the VIAF URI is for Picasso and Stein in this example.

Fortunately, the LCNAF control number is easily parsed off the end of the URI. With this control number, we can use the VIAF API to translate an LCCN into its corresponding VIAF URI. To do so, we can point an RDF library, or any other HTTP client capable of content negotiation, at the VIAF API. For example, when setting the HTTP Accept header to ask for the MIME type application/rdf+xml against the API URL:
http://www.viaf.org/viaf/lccn/n78086005

the VIAF API will resolve the URI we are looking for and send a series of HTTP redirect responses. Try running the following Linux/Unix curl command and trace the output:

$ curl -vLH "Accept: application/rdf+xml" http://www.viaf.org/viaf/lccn/n78086005

The -L flag in the curl command indicates that all subsequent redirects should be followed. VIAF will respond directly to the first request with a redirect response (HTTP 301) to:

http://viaf.org/viaf/15873

This new HTTP location happens to be the identity/real world object (RWO) URI in VIAF. Based on OCLC’s Linked Data URI design patterns, VIAF responds with one more redirect response to a document representation:
http://viaf.org/viaf/15873/

(Notice the trailing slash!)
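The same lookup can be sketched in Ruby with the standard library alone. The helper names (lccn_from_uri, viaf_lookup_url, fetch_rdf) and the redirect limit of 5 are assumptions made for this illustration, not part of any VIAF client; the lookup URL pattern is the one shown above.

```ruby
require "net/http"
require "uri"

VIAF_LCCN_BASE = "http://www.viaf.org/viaf/lccn/"

# The control number is the last path segment of the LCNAF URI.
def lccn_from_uri(lcnaf_uri)
  URI.parse(lcnaf_uri).path.split("/").last
end

# Build the VIAF lookup URL for an LCNAF URI.
def viaf_lookup_url(lcnaf_uri)
  VIAF_LCCN_BASE + lccn_from_uri(lcnaf_uri)
end

# Request RDF/XML and follow redirects by hand, much as curl -L does.
# Returns the body of the first successful response.
def fetch_rdf(url, limit = 5)
  raise "too many redirects" if limit.zero?

  uri = URI.parse(url)
  request = Net::HTTP::Get.new(uri)
  request["Accept"] = "application/rdf+xml"

  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end

  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # Location may be relative, so resolve it against the current URL.
    fetch_rdf(URI.join(url, response["location"]).to_s, limit - 1)
  else
    raise "unexpected response: #{response.code}"
  end
end

puts viaf_lookup_url("http://id.loc.gov/authorities/names/n78086005")
# Network call, left commented out so the sketch runs offline:
# rdf_xml = fetch_rdf(viaf_lookup_url("http://id.loc.gov/authorities/names/n78086005"))
```

Following the redirects explicitly, rather than leaving them to a higher-level client, makes it easy to log each hop and see the identity URI and the document URI go by.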

The final HTTP request to this document responds with RDF/XML data for the identity in question. So now we have our hands on some data. At this point, though, we need to determine which data description in the RDF graph returned by VIAF corresponds to the entity from the Alma bibliographic record. The VIAF data we are holding onto now has descriptions for:

  1. The data document itself, insofar as it is a document.
  2. The RWO for the identity in question: this is the entity we want!
  3. Every authority file entry that contributed data about the identity in question.

Identifying the useful entity is simply a matter of matching it against the LCNAF URI that was returned from Alma. Expressed in RDF/Turtle, the relevant portion of this entire graph is represented by the following data excerpt:

@prefix schema: <http://schema.org/> .

<http://viaf.org/viaf/15873> a schema:Person ;
  schema:sameAs <http://id.loc.gov/authorities/names/n78086005> .

This means that resolving the VIAF URI is simply a matter of querying the data graph:

require "rdf"

# `graph` is assumed to hold the RDF graph parsed from the VIAF response.
sameas_uri  = RDF::URI.new("http://schema.org/sameAs")
creator_uri = RDF::URI.new("http://id.loc.gov/authorities/names/n78086005")
graph.query(predicate: sameas_uri, object: creator_uri)

The query should produce a single triple whose subject is the VIAF URI representing the Real World Object identity for Picasso.
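To make the pattern match concrete without depending on an RDF library, the sketch below models the graph as plain [subject, predicate, object] arrays. This is a simplification for illustration only, not how an RDF library stores data. The foaf:Document triple and the helper name viaf_entity_for are assumptions added to stand in for the document description mentioned above.

```ruby
RDF_TYPE       = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA_SAME_AS = "http://schema.org/sameAs"

# A tiny stand-in for the VIAF graph: [subject, predicate, object] arrays.
triples = [
  # Assumed description of the data document itself (note the trailing slash).
  ["http://viaf.org/viaf/15873/", RDF_TYPE, "http://xmlns.com/foaf/0.1/Document"],
  # The real world object and its link back to the LCNAF entry.
  ["http://viaf.org/viaf/15873", RDF_TYPE, "http://schema.org/Person"],
  ["http://viaf.org/viaf/15873", SCHEMA_SAME_AS, "http://id.loc.gov/authorities/names/n78086005"]
]

# Hypothetical helper: find the subject that is schema:sameAs the LCNAF URI.
# Returns nil when no triple matches.
def viaf_entity_for(triples, lcnaf_uri)
  match = triples.find do |_subject, predicate, object|
    predicate == SCHEMA_SAME_AS && object == lcnaf_uri
  end
  match && match.first
end

viaf_uri = viaf_entity_for(triples, "http://id.loc.gov/authorities/names/n78086005")
puts viaf_uri
```

Matching on both predicate and object is what separates the entity we want from the document description and from the contributing authority file entries.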

Discovering Other Entities for Picasso in the VIAF Graph

At this point in our data processing, we are just a short hop away from finding new information about our identities that is not included in the MARC-based catalog data. VIAF data about a person asserts that it describes the same identity as other sources on the Web. In fact, we have already encountered one of these assertions when we resolved the VIAF entity itself. In addition to the assertion above, a VIAF entity might make the same kind of assertion about multiple other entities on the Web:

@prefix schema: <http://schema.org/> .

<http://viaf.org/viaf/15873> a schema:Person ;
  schema:sameAs <http://id.loc.gov/authorities/names/n78086005>,
    <http://vocab.getty.edu/ulan/500009666-agent>,
    <http://www.wikidata.org/entity/Q5593>,
    <http://dbpedia.org/resource/Pablo_Picasso> .

In this RDF excerpt, the VIAF entity asserts that it is the same as the thing described by the Library of Congress, Getty Research Institute, Wikidata and DBpedia. Knowing the URIs for each of those entities on the Web will now provide the data points for querying their respective Linked Open Data repositories using SPARQL.
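One way to organize these links before querying is to index them by the source they point to. The sketch below does this by host name; the SOURCES table, its symbol names, and the helper index_by_source are assumptions made for illustration, and a real implementation may recognize sources differently.

```ruby
require "uri"

# The schema:sameAs targets from the excerpt above.
same_as = [
  "http://id.loc.gov/authorities/names/n78086005",
  "http://vocab.getty.edu/ulan/500009666-agent",
  "http://www.wikidata.org/entity/Q5593",
  "http://dbpedia.org/resource/Pablo_Picasso"
]

# Assumed mapping from URI host to a short source name.
SOURCES = {
  "id.loc.gov"       => :lcnaf,
  "vocab.getty.edu"  => :getty,
  "www.wikidata.org" => :wikidata,
  "dbpedia.org"      => :dbpedia
}.freeze

# Hypothetical helper: index the sameAs URIs by source, skipping unknown hosts.
def index_by_source(uris)
  uris.each_with_object({}) do |uri, index|
    source = SOURCES[URI.parse(uri).host]
    index[source] = uri if source
  end
end

entities = index_by_source(same_as)
puts entities[:dbpedia]
```

With the URIs keyed this way, the code that builds a SPARQL query for a given repository can simply ask for the entity it needs and do nothing when VIAF has no link to that source.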

Query for Description Beyond the MARC Record

SPARQL is a query language for retrieving information from RDF data sets. Its syntax resembles that of the Structured Query Language (SQL) used in relational database management systems. Switching to the author Michael Pollan as an example, we can construct a simple query against DBpedia to retrieve a list of all of the films in which he appeared:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?film ?filmName ?filmAbstract
WHERE {
  ?film dbo:starring <http://dbpedia.org/resource/Michael_Pollan> .
  ?film rdfs:label ?filmName .
  ?film dbo:abstract ?filmAbstract .
  FILTER(langMatches(lang(?filmName), "en"))
  FILTER(langMatches(lang(?filmAbstract), "en"))
}

You can see the query results at the DBpedia SPARQL Explorer.

In our library catalog implementation, we set the HTTP Accept header to the MIME type application/sparql-results+json to indicate that we want a parseable JSON response. The result is that we now have new data about the authors or subjects of a book from the library catalog. From this we can now enhance the catalog display.
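A minimal sketch of this step, using only Ruby's standard library, might look like the following. The helper names sparql_request and binding_rows are hypothetical, the endpoint URL https://dbpedia.org/sparql is an assumption, and the sample response body contains made-up placeholder values in the standard SPARQL JSON results shape rather than real DBpedia output.

```ruby
require "json"
require "net/http"
require "uri"

SPARQL_JSON = "application/sparql-results+json"

# Build a GET request that carries the query and asks for JSON results.
def sparql_request(endpoint, query)
  uri = URI.parse(endpoint)
  uri.query = URI.encode_www_form(query: query)
  request = Net::HTTP::Get.new(uri)
  request["Accept"] = SPARQL_JSON
  request
end

# Flatten SPARQL JSON results into one hash per row: variable name => value.
def binding_rows(body)
  JSON.parse(body).fetch("results").fetch("bindings").map do |solution|
    solution.transform_values { |term| term["value"] }
  end
end

# Placeholder response body (not real DBpedia data).
sample_body = <<~JSON
  {
    "head": { "vars": ["film", "filmName"] },
    "results": {
      "bindings": [
        {
          "film": { "type": "uri", "value": "http://example.org/film/1" },
          "filmName": { "type": "literal", "xml:lang": "en", "value": "Example Film" }
        }
      ]
    }
  }
JSON

request = sparql_request("https://dbpedia.org/sparql", "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
rows = binding_rows(sample_body)
puts rows.first["filmName"]
```

Reducing each result row to a plain hash keeps the display code free of SPARQL-specific structure; the term type and language tag are dropped here, which is a simplification that a fuller implementation might not want.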

A fully featured implementation of the general process outlined here can be found in the BibCard Ruby gem available at GitHub.com. You can also learn more about the design of this code library from a presentation archived at the IGELU/ELUNA show and tell series.
