Scheming
We have been getting to work building our index for the menus data from the New York Public Library’s What’s on the Menu? project. We have traveled a long way from the idea that what the menus data set needs in order to be most useful to researchers is for someone to “clean it up.” As we discussed in our last essay, our plan instead is to create a data set of our own—an “index”—that links to and provides more information about the NYPL data.
We’ve had a provisional sense of what this index will be and how it will work for a while but, now that we’re actually building it, a few questions have arisen that we thought might be worth discussing publicly. The issues of most interest to us at this moment involve: what linked data techniques we should use to relate our data to the NYPL data, how to use identifiers from the NYPL data to make explicit links between our data and theirs, and how the linked data we’re creating might later be used or queried. This is very much work in progress so please let us know what you think—you can comment on any paragraph.
Index
Our index is a set of hierarchically-organized concepts about the domain of food. The first version of this index took the form of lists, posted to a wiki which we could collaboratively edit. For the next version, we wanted the index data to be more machine-processable. The Simple Knowledge Organization System (SKOS) standard maintained by the World Wide Web Consortium (W3C) is designed for just the use case we have imagined. SKOS supports web publication of “the basic structure and content of concept schemes such as thesauri, classification schemes, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary.”
SKOS models concepts, their names, and their relationships to each other. With this vocabulary we were able to take the notes from our wiki and (with the help of Ed Summers) express the structure we wanted using the elements from SKOS. As a way of sketching out how this would look, we decided to serialize our data as JSON-LD. (We don’t expect to be writing out the serialization of our data by hand, but to work through some examples, this format seemed readable enough to be useful):
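As an illustration only, and not the serialization we actually produced, here is a minimal Python sketch of how a small SKOS hierarchy could be written out as JSON-LD. The "Food" and "Vegetables" URIs and the choice of skos:prefLabel and skos:broader as the only properties are assumptions made for the example.

```python
import json

# Base URI taken from the Carrot example in this essay; the other
# concept URIs built from it are hypothetical.
BASE = "http://www.publicfare.org/def/food/"


def concept(name, broader=None):
    """Return a JSON-LD node describing one SKOS concept."""
    node = {
        "@id": BASE + name,
        "@type": "skos:Concept",
        "skos:prefLabel": {"@value": name, "@language": "en"},
    }
    if broader is not None:
        # skos:broader points from the narrower concept to its parent.
        node["skos:broader"] = {"@id": BASE + broader}
    return node


document = {
    "@context": {"skos": "http://www.w3.org/2004/02/skos/core#"},
    "@graph": [
        concept("Food"),
        concept("Vegetables", broader="Food"),
        concept("Carrot", broader="Vegetables"),
    ],
}

print(json.dumps(document, indent=2))
```

Each concept carries a label and, except for the top concept, a pointer to its parent, which is enough to express the "top down" hierarchy described in this section.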
While modeling a domain as large and complex as “food” might be (is definitely) a Sisyphean task, this does not negate the value of modeling as a data curation activity. For our project, we are beginning with just one corner of the food domain: a corner that maps to “dining out” food practices, primarily American, primarily from the twentieth century. We are purposely developing this model from the “top down” (starting with “Food,” then broad categories of food like “Meat,” “Vegetables,” etc.) because such a domain model provides a level of abstraction that we think people will be able to use to more easily see comparisons, trend lines, relative importance, and certain types of outliers. We are developing this conceptual structure based not on specific terms contained in the NYPL data but on our own judgement about what would effectively serve potential users. We hope our model of the domain will also provide the groundwork for researchers to subset and refine the data for their own purposes.
By linking a domain model, even a simplistic one, to the dataset from NYPL we can enable uses beyond what is currently possible. The NYPL dataset aggregates transcriptions of representations printed on historic menus. At bottom this data is a set of letters presented in order for a customer to say, “I want this” (or a banquet attendee to know “we will be served that”). Inevitably this produces so much variation that the only effective way to subset the data is by date of the menu (which was carefully recorded when the object was collected by Buttolph or other librarians). The NYPL dataset only indirectly provides information about foodstuffs; based on these texts, we believe that the food was present (at least waiting in the kitchen pantry).
The practical challenge of adding a domain model to data which we did not create meant figuring out how to articulate the relationships between our SKOS concepts and NYPL’s data.
Explicitness
Specifically, we spent time trying to figure out the “proper” linked data way to accomplish this (and thought more than once that we had solutions when we didn’t). We had to reason our way, statement by statement, through the ways that our set was linked to NYPL’s in order to find what we wanted to say (and also what we didn’t). Our linked data graph will include assertions like these (first in English):
- A thing identified by a URI from the NYPL menus domain (for example, <api.menus.nypl.org/dishes/dishes/8371>) is some data.
- We have defined a concept (carrot as foodstuff) and given that concept a URI (for example, <www.publicfare.org/def/Carrot>).
- We assert that the subject of the data identified in the first statement is the concept defined in the second.
Importantly, this set of statements says only that the thing from NYPL is a little chunk of data: a series of key/value pairs. The identifier is minted based on the uniqueness of the string found in the key called “name.” We don’t believe the thing identified by the URI has any other ontological status; while the NYPL application calls these “dishes,” it does not treat them as things. If we believed that this data modeled dishes-as-things, you would be able to tell the difference between being served a dish of “carrots” and a dish of “Carrots.” These so-called “dishes” are (changing) aggregations of the appearances of certain strings in the NYPL’s application database based on the activity of volunteer transcribers.
In our formal RDF model, we use a couple of additional statements to capture our understanding of this situation in an explicit way:
@prefix dc: <http://purl.org/dc/terms/> .
@prefix dcmitype: <http://purl.org/dc/dcmitype/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<http://api.menus.nypl.org/dishes/dishes/8371> a dcmitype:Dataset ;
    dc:subject <http://www.publicfare.org/def/food/Carrot> .

<http://www.publicfare.org/def/food/Carrot> a skos:Concept ;
    foaf:focus <http://dbpedia.org/resource/Carrot> .
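Since we do not expect to write these statements by hand, a rough Python sketch of how the dish-to-concept links might be generated is given here. The function names are hypothetical, and the output is written as N-Triples rather than Turtle only because full URIs need no prefix handling; nothing here is our actual tooling.

```python
# Full predicate and class URIs, spelled out so no prefixes are needed.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC_SUBJECT = "http://purl.org/dc/terms/subject"
DCMI_DATASET = "http://purl.org/dc/dcmitype/Dataset"


def link_statements(dish_uri, concept_uris):
    """Build (subject, predicate, object) triples for one NYPL record.

    The record is typed as a chunk of data (dcmitype:Dataset) and given
    one dc:subject statement per concept it is indexed under.
    """
    triples = [(dish_uri, RDF_TYPE, DCMI_DATASET)]
    for concept_uri in concept_uris:
        triples.append((dish_uri, DC_SUBJECT, concept_uri))
    return triples


def to_ntriples(triples):
    """Serialize triples whose terms are all URIs as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


statements = link_statements(
    "http://api.menus.nypl.org/dishes/dishes/8371",
    ["http://www.publicfare.org/def/food/Carrot"],
)
print(to_ntriples(statements))
```

A record indexed under several concepts would simply pass a longer list, producing one dc:subject statement for each.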
Pointing
While the discussion above suggests that we knew what URI to choose to link to, this was yet another decision we had to make. Given how the NYPL data can be accessed, we saw three options:
- Use the URI for the web page for the dish. Pros: accessible; takes people back to the source, including to images of the menus; maintained by NYPL. Cons: an odd landing point that doesn’t have clearly structured data; doesn’t reveal as much data as easily as the API.
- Use the URI that will return the JSON document for the dish from the NYPL API. Pros: nice structured data, including multiple points into the content and information about it; maintained by NYPL. Cons: needs a key to access, so not as user friendly.
- Republish the data from the API within our own JSON documents. Pros: includes all the data of the NYPL API, but is accessible. Cons: means we have to store and maintain that much more data.
Initially we hesitated to link to the URI for the JSON because the data about this URI could not be directly accessed by other users if they did not have a key for the NYPL API. However, this does not seem to deter other creators of linked data and is the most precise way to refer to what we want to talk about. Further, we realized that we could use predicates from the Europeana data model to have it both ways. We will point to both the JSON document from the NYPL API and the NYPL dish web page, and these two entry points are valuable to other researchers because they express different information.
Queries
We set out to build a small application that would allow us to aggregate dishes that would have the same value for “name” if only differences in spelling, word order, and pluralization were normalized. However, now that we are beginning to index the data with concepts, indexing seems a more promising strategy than the (more time-intensive) project of aggregating and choosing normalized forms for similar strings. At this point, we’d rather scholars (including ourselves) be able to query our linked data for dishes that have “carrot” and “soup” as subject than to find matches for the specific string “carrot soup” in the data. This would allow them to get “carrot soup,” “chilled carrot vichyssoise,” and “soup of carrot and leek,” which is to say, direct experience of the diversity of the data.
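To make the contrast concrete, here is a small Python sketch comparing a string match with a subject-based query. The record identifiers, the "buttered carrots" entry, and the concept assignments are all invented for illustration, and a real query would run against the linked data rather than in-memory dictionaries.

```python
# Hypothetical records: identifier -> transcribed "name" string.
NAMES = {
    "dish/1": "carrot soup",
    "dish/2": "chilled carrot vichyssoise",
    "dish/3": "soup of carrot and leek",
    "dish/4": "buttered carrots",
}

# Hypothetical index: identifier -> set of subject concepts.
SUBJECTS = {
    "dish/1": {"Carrot", "Soup"},
    "dish/2": {"Carrot", "Soup"},
    "dish/3": {"Carrot", "Leek", "Soup"},
    "dish/4": {"Carrot"},
}


def dishes_with_string(names, text):
    """Return identifiers whose name contains the exact string."""
    return sorted(d for d, name in names.items() if text in name)


def dishes_with_subjects(index, wanted):
    """Return identifiers indexed under every one of the wanted concepts."""
    wanted = set(wanted)
    return sorted(d for d, subjects in index.items() if wanted <= subjects)


by_string = dishes_with_string(NAMES, "carrot soup")
by_subject = dishes_with_subjects(SUBJECTS, ["Carrot", "Soup"])

print("string match: ", [NAMES[d] for d in by_string])
print("subject match:", [NAMES[d] for d in by_subject])
```

In this toy data the string match finds only the record literally named “carrot soup,” while the subject query returns all three soups without any of the name strings having been normalized.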
We are both attending the Digital Library Federation Forum this week and look forward to talking with our colleagues about data and decisions.