Data Dictionary

Overview

As part of the What’s on the Menu? project site, the New York Public Library (NYPL) explains that “all data generated through What’s on the Menu? is available in two ways”—through an application programming interface (with an access key provided upon request by the library) or a downloadable package of “spreadsheet exports.” Since these downloadable files are the most frictionless way to obtain the NYPL data, we will focus on describing them. The goal of this document is to supply additional documentation about the data files that isn’t available elsewhere.

The data files are available from: http://menus.nypl.org/data

Clicking on the link labeled “latest data export in CSV format ([DATE])” should automatically start a download in most browsers. It may be necessary to right click on a link and select an option to “save” or “download” files.

The download consists of a single compressed (zipped) tar file. To use the data, you must unzip and untar this file. On most systems, double-clicking the downloaded file will do this. The result is a folder containing four CSV files, which should be named “Dish.csv”, “Menu.csv”, “MenuItem.csv”, and “MenuPage.csv”.
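
If double-clicking does not work, the archive can also be unpacked with a short script. The following is a minimal sketch using Python's standard tarfile module; the function name and the paths passed to it are placeholders of ours, since the name of the downloaded file changes with each export.

```python
import tarfile
from pathlib import Path


def extract_export(archive_path, dest_dir):
    """Unpack a compressed tar archive and return the CSV file names found inside."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression type on its own.
    with tarfile.open(archive_path, mode="r:*") as archive:
        # Newer Python releases offer an extraction filter that rejects unsafe
        # members (for example, absolute paths); use it when it is available.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    # The CSV files may sit inside a subfolder, so search recursively.
    return sorted(path.name for path in dest.rglob("*.csv"))
```

Calling extract_export with the path of the downloaded file and a target folder should return the names of the four CSV files described above.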

The files can be opened in any plain-text editor or in common spreadsheet applications (Microsoft Excel, Google Sheets, etc.), though some of them (MenuItem.csv, for example) are very large.
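
When a file is too large to open comfortably in a spreadsheet application, one option is to read it row by row in a script. The sketch below counts the data rows of a CSV file without loading the whole file into memory. It assumes the files are UTF-8 encoded, which we have not verified for every export; the function name is our own.

```python
import csv


def count_rows(csv_path):
    """Count the data rows in a CSV file without loading it all into memory."""
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row of column labels
        return sum(1 for _ in reader)
```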

Data Files, Database Tables, and Data Models

Each spreadsheet file corresponds to a table from the relational database that underlies the What’s on the Menu? web application. The files can be linked together by matching the shared identifiers that appear in multiple tables. (We say more about these links below.)

Each file (representing a database table) corresponds to a data “model”—an abstraction used to group and organize data values. These data models were designed by the creators of the What’s on the Menu? web application and are meant to correspond to objects in the real world domain. In some cases, these correspondences are straightforward: “Dish.csv” records data stored in a database table organized according to an abstract model of a dish of food. A dish in this case has a “name” and other relevant attributes. For some of the data files, the domain object modeled needs further explanation.

Dish.csv
Information about all the dishes from all the menus transcribed by the project. In this data file, a dish is represented by a row of values, and the columns identify attributes of the dish. One of these attributes is a unique identifier. However, the identity of a dish appears to be based on the exact form of the string labeled "name." Thus, dishes with variant orthographic forms of their names (e.g. “half chicken”, “Half Chicken” and “chicken [half]”) are treated as separate entries with different identifiers.
Menu.csv
Information about the menus as physical objects, including historical information about their origins, uses, and formats.
MenuItem.csv
This is the largest data file. A "MenuItem" represents a single instance of a dish appearing somewhere on a menu page image. “MenuItem.csv” is useful as a mapping between multiple other data files/tables.
MenuPage.csv
Information about individual pages of the menus represented in "Menu.csv". Pages are modeled here as digital images produced as a result of digitization by the NYPL. Menus often have multiple pages.
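
Because variant spellings of a dish name receive separate identifiers in Dish.csv, it can help to group entries under a rough comparison key. The sketch below is one crude heuristic of our own, not something the What’s on the Menu? application does: it lowercases the name, drops punctuation, and sorts the words. The dish ids in the usage note are made up for illustration.

```python
import re
from collections import defaultdict


def name_key(name):
    """Reduce a dish name to a crude comparison key."""
    # \w+ keeps letters and digits (including accented letters) and drops punctuation.
    words = re.findall(r"\w+", name.lower())
    # Sorting the words makes the key independent of word order.
    return " ".join(sorted(words))


def group_variants(dishes):
    """Group (id, name) pairs whose names share the same comparison key."""
    groups = defaultdict(list)
    for dish_id, name in dishes:
        groups[name_key(name)].append(dish_id)
    return dict(groups)
```

With this key, “half chicken”, “Half Chicken” and “chicken [half]” all map to the same group. The heuristic will also merge names that differ only in word order, which may or may not be desirable for a given analysis.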

Conventions

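In the tables below we provide the following information about each column of each downloadable data file: the column label, a gloss, the data type, whether values are missing, how the data was generated, and a description.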

We note the presence of missing data points in the “Missing values?” column. However, some strings in the data also signal missing information (e.g. “unknown” or “?”) without formally appearing as null values.
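
One way to handle such placeholder strings is to treat them as missing when reading the data. The sketch below is an illustration under our own assumptions: the placeholder set contains only the two examples named above plus the empty string, and it would need to be extended after inspecting the actual values.

```python
# Strings treated as missing. "unknown" and "?" are the examples given in the
# text; this set is an assumption and is probably incomplete.
PLACEHOLDERS = {"", "unknown", "?"}


def is_missing(value):
    """Return True for empty cells and for placeholder strings such as "?"."""
    if value is None:
        return True
    return value.strip().lower() in PLACEHOLDERS


def clean_row(row):
    """Return a copy of a CSV row (a dict) with missing values set to None."""
    return {column: (None if is_missing(value) else value)
            for column, value in row.items()}
```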

The data in these files comes from several different sources. We reflect this in the “generated by” category of the tables below. Some of the data is supplied by “volunteer transcribers,” by which we mean people who have participated in the project through the What’s On the Menu? site. Some of the information is generated by “web application.” This means that some of this information was automatically created as the database supporting the application was constructed and populated (e.g., various ids); some is created as the web application runs (e.g. timestamps as data values are updated). Finally, a lot of information is generated from “NYPL metadata.” This metadata comes from many places and reflects the long history of the project and the many parts of New York Public Library involved in it. Much of the data “supplied by NYPL metadata” in the menu spreadsheet is from the catalog cards made by Frank E. Buttolph in the early twentieth century.

Detailed Data Dictionary

This information is based on our analysis of the publicly available data files. We welcome corrections and additions. Send us an email: curatingmenus [at] gmail.com.

Dish.csv

| Column label | Gloss | Data type | Missing values? | Generated by | Description |
|---|---|---|---|---|---|
| id | identifier for a dish | id | no | web application | corresponds to dish_id in MenuItem.csv |
| name | name of dish | string | no | volunteer transcribers | This value matches what the transcriber typed. Sometimes the dish name matches exactly what was printed on the original menu; however, transcribers had various punctuation and capitalization practices, and sometimes relied on contextual information provided by the layout or other items on the menu. |
| description | n/a | n/a | yes | n/a | contains no data |
| menus_appeared | total count of menus on which dish with this id appears | integer | no | web application | |
| times_appeared | total count of appearances of the dish with this id across all menus | integer | no | web application | |
| first_appeared | earliest year of a menu on which a dish with this id appears | date (YYYY) | no | web application, based on NYPL metadata for menus | |
| last_appeared | latest year of a menu on which a dish with this id appears | date (YYYY) | no | web application, based on NYPL metadata for menus | |
| lowest_price | lowest price associated with a dish with a given id | float | yes | volunteer transcribers | Some menus are in other currencies than dollars; also transcribers did not always make distinctions between dollar amounts and cent amounts leading to errors in the data. |
| highest_price | highest price associated with a dish with a given id | float | yes | volunteer transcribers | |

Menu.csv

| Column label | Gloss | Data type | Missing values? | Generated by | Description |
|---|---|---|---|---|---|
| id | identifier for menu | id | no | web application | corresponds to menu_id in MenuPage.csv |
| name | n/a | | yes | n/a | contains no data |
| sponsor | who sponsored the meal (organizations, people, name of restaurant) | string | yes | NYPL metadata | |
| event | category (lunch, annual dinner) | string | yes | NYPL metadata | Information in this category varies widely. |
| venue | type of place (commercial, social, professional) | string | yes | NYPL metadata | Information in this category varies widely. |
| place | where the meal took place (often a geographic location) | string | yes | NYPL metadata | These vary widely (street addresses, cities, names of restaurants, names of ships or trains). NYPL has been crowdsourcing more precise geolocations for menus but this data is not available in these files. |
| physical_description | dimension and material description of the menu | string | yes | NYPL metadata | |
| occasion | occasion of the meal (holidays, anniversaries, daily) | string | yes | NYPL metadata | This field likely comes from Buttolph’s original organization of the menu collection. |
| notes | notes by librarians about the original material | string | yes | NYPL metadata | |
| call_number | call number of the menu | string | yes | NYPL metadata | |
| keywords | n/a | | yes | n/a | contains no data |
| language | n/a | | yes | n/a | contains no data |
| date | date of the menu | date (YYYY-MM-DD) | yes | NYPL metadata | contains no data |
| location | organization or business who produced the menu | string | no | NYPL metadata | |
| location_type | n/a | | yes | n/a | contains no data |
| currency | system of money the menu uses (dollars, etc.) | string | yes | NYPL metadata | |
| currency_symbol | symbol for the currency ($, etc.) | string | yes | NYPL metadata | |
| status | completeness of the menu transcription (transcribed, under review, etc.) | string | no | web application | |
| page_count | how many pages the menu has | integer | no | web application | |
| dish_count | how many dishes the menu has | integer | no | web application | |

MenuItem.csv

| Column label | Gloss | Data type | Missing values? | Generated by | Description |
|---|---|---|---|---|---|
| id | identifier for the menu item | id | no | web application | |
| menu_page_id | id of the page the menu item is on | id | no | web application | corresponds to MenuPage.csv id |
| price | first price of menu item | float | yes | volunteer transcribers | |
| high_price | if the item has more than one price on a single menu, the highest price | float | yes | volunteer transcribers | If there are more than two values for price, the web application instructs volunteers to enter the lowest and highest prices rather than all values. |
| dish_id | id of the dish | id | yes | web application | corresponds to Dish.csv id |
| created_at | date/time of first transcription | datetime UTC (YYYY-MM-DD HH:MM:SS UTC) | no | web application | |
| updated_at | date/time of the last edit to the value | datetime UTC (YYYY-MM-DD HH:MM:SS UTC) | no | web application | Usually, the updated time would be the time of the review. |
| xpos | horizontal coordinate on the page for the upper left point of the menu item | float | no | web application | This is where the green arrow on the What’s On The Menu? site sits to show people what to transcribe. |
| ypos | vertical coordinate on the page for the upper left point of the menu item | float | no | web application | This is where the green arrow on the What’s On The Menu? site sits to show people what to transcribe. |
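
The created_at and updated_at values follow the pattern YYYY-MM-DD HH:MM:SS UTC. As a minimal sketch, assuming every value follows that pattern exactly, such a string can be turned into a timezone-aware datetime with Python's standard library. The function name is our own.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a 'YYYY-MM-DD HH:MM:SS UTC' string into a timezone-aware datetime."""
    # The trailing " UTC" is matched literally, then recorded as the time zone.
    parsed = datetime.strptime(value.strip(), "%Y-%m-%d %H:%M:%S UTC")
    return parsed.replace(tzinfo=timezone.utc)
```

A value that deviates from the pattern raises a ValueError, which makes irregular rows easy to spot.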

MenuPage.csv

| Column label | Gloss | Data type | Missing values? | Generated by | Description |
|---|---|---|---|---|---|
| id | identifier for menu page | id | no | web application | |
| menu_id | identifier for menu | id | no | web application | corresponds to Menu.csv id |
| page_number | number representing sequence of page in the menu | integer | yes | web application | |
| image_id | identifier for the page image | id | no | web application | |
| full_height | height of the page image in pixels | integer | yes | web application | |
| full_width | width of the page image in pixels | integer | yes | web application | |
| uuid | universally unique identifier for the highest resolution version of the image | id | no | web application | |
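
The identifier correspondences documented above (dish_id to Dish.csv id, menu_page_id to MenuPage.csv id, and menu_id to Menu.csv id) are what allow the files to be linked. The following is a minimal sketch of such a link using only Python's standard library. The function names are our own, identifiers are compared as strings, UTF-8 encoding is assumed, and Dish.csv and MenuPage.csv are loaded into memory in full, which may not suit every machine.

```python
import csv
from pathlib import Path


def load_by_id(csv_path):
    """Read a CSV file into a dict keyed by its 'id' column."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row["id"]: row for row in csv.DictReader(handle)}


def iter_dish_appearances(data_dir):
    """Yield (dish name, menu id, page number) for each menu item linked to a dish."""
    data_dir = Path(data_dir)
    dishes = load_by_id(data_dir / "Dish.csv")
    pages = load_by_id(data_dir / "MenuPage.csv")
    # MenuItem.csv is the largest file, so it is streamed row by row.
    with open(data_dir / "MenuItem.csv", newline="", encoding="utf-8") as handle:
        for item in csv.DictReader(handle):
            dish = dishes.get(item["dish_id"])
            page = pages.get(item["menu_page_id"])
            if dish is None or page is None:
                continue  # dish_id can be missing, so skip unlinked items
            yield dish["name"], page["menu_id"], page["page_number"]
```

The menu id returned for each appearance can then be looked up in Menu.csv in the same way to retrieve, for example, the sponsor or date of the menu.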