Data Dictionary
Overview
As part of the What’s on the Menu? project site, the New York Public Library (NYPL) explains that “all data generated through What’s on the Menu? is available in two ways”—through an application programming interface (with an access key provided upon request by the library) or a downloadable package of “spreadsheet exports.” Since these downloadable files are the most frictionless way to obtain the NYPL data, we will focus on describing them. The goal of this document is to supply additional documentation about the data files that isn’t available elsewhere.
The data files are available from: http://menus.nypl.org/data
Clicking on the link labeled “latest data export in CSV format ([DATE])” should automatically start a download in most browsers. If it does not, right-click the link and select the option to “save” or “download” the file.
The download will consist of a single compressed (gzipped) tar file. To use the data, it will be necessary to unzip and untar this file. On most systems, double-clicking the downloaded file is enough to do this. The result will be a folder containing four CSV files. These files should be named: “Dish.csv”, “Menu.csv”, “MenuItem.csv”, and “MenuPage.csv”.
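If double-clicking does not work, or if the extraction needs to be scripted, the archive can also be unpacked programmatically. The following is a minimal sketch using Python's standard `tarfile` module. The function name `extract_csv_files` and the archive file name in the usage note are hypothetical, and the sketch assumes the archive is gzip-compressed as described above.

```python
import shutil
import tarfile
from pathlib import Path


def extract_csv_files(archive_path, dest_dir):
    """Copy only the .csv members of a gzipped tar archive into dest_dir.

    Members are written under their base file name, so any folder
    structure inside the archive is flattened.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            name = Path(member.name).name
            if not member.isfile() or not name.lower().endswith(".csv"):
                continue
            # Stream the member to disk instead of reading it all at once,
            # since some of the files are very large.
            with tar.extractfile(member) as source, open(dest / name, "wb") as target:
                shutil.copyfileobj(source, target)
            extracted.append(name)
    return sorted(extracted)
```

For example, `extract_csv_files("menus_export.tgz", "menus_data")` would be expected to return the four file names listed above, assuming the downloaded archive was saved under that (made-up) name.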
The files can be opened in any plain-text editor or in common spreadsheet applications (Microsoft Excel, Google Sheets, etc.) though some of them—MenuItem.csv, for example—are very large files.
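Because some of the files are very large, it can be more practical to read them row by row in a script than to open them in a spreadsheet application. The sketch below uses Python's standard `csv` module. The helper names `iter_rows` and `count_rows` are hypothetical, and the UTF-8 encoding is an assumption that should be checked against the actual export.

```python
import csv


def iter_rows(path, encoding="utf-8"):
    """Yield one dict per data row, keyed by the column labels in the header."""
    # newline="" lets the csv module handle line endings inside quoted fields.
    with open(path, newline="", encoding=encoding) as handle:
        yield from csv.DictReader(handle)


def count_rows(path):
    """Count data rows without holding the whole file in memory."""
    return sum(1 for _ in iter_rows(path))
```

Calling `count_rows("MenuItem.csv")` streams through the file once, so memory use stays small regardless of file size.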
Data Files, Database Tables, and Data Models
Each spreadsheet file corresponds to a table from the relational database that underlies the What’s on the Menu? web application. The files can be linked together by matching the shared identifiers that appear in multiple tables. (We will say more about these links below.)
Each file (representing a database table) corresponds to a data “model”—an abstraction used to group and organize data values. These data models were designed by the creators of the What’s on the Menu? web application and are meant to correspond to objects in the real world domain. In some cases, these correspondences are straightforward: “Dish.csv” records data stored in a database table organized according to an abstract model of a dish of food. A dish in this case has a “name” and other relevant attributes. For some of the data files, the domain object modeled needs further explanation.
- Dish.csv
- Information about all the dishes from all the menus transcribed by the project. In this data file, each row represents a dish and each column records one of its attributes. One of these attributes is an identifier for the dish. However, the identity of a dish appears to be based on the exact form of the string labeled "name." Thus, dishes whose names are variant orthographic forms of one another (e.g. “half chicken”, “Half Chicken”, and “chicken [half]”) are treated as separate entries with different identifiers.
- Menu.csv
- Information about the menus as physical objects, including historical information about their origins, uses, and formats.
- MenuItem.csv
- This is the largest data file. A "MenuItem" represents a single instance of a dish appearing somewhere on a menu page image. “MenuItem.csv” is useful as a mapping between multiple other data files/tables.
- MenuPage.csv
- Information about individual pages of the menus represented in "Menu.csv". Pages are modeled here as digital images produced as a result of digitization by the NYPL. Menus often have multiple pages.
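Since Dish.csv treats variant spellings of a name as separate dishes, an analysis may need to regroup them. The following is a minimal sketch of one possible normalization (casefolding, dropping punctuation, collapsing whitespace). It is not part of the NYPL data or application: the function names and the sample ids are hypothetical. Note that this simple approach merges “half chicken” and “Half Chicken” but still keeps a reordered form such as “chicken [half]” separate.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Casefold, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", name.casefold())
    return " ".join(text.split())


def group_variants(dishes):
    """Map each normalized name to the list of dish ids that share it.

    `dishes` is an iterable of (id, name) pairs, e.g. taken from Dish.csv.
    """
    groups = defaultdict(list)
    for dish_id, name in dishes:
        groups[normalize_name(name)].append(dish_id)
    return dict(groups)


# Made-up ids, using the example names from the text above.
sample = [(1, "half chicken"), (2, "Half Chicken"), (3, "chicken [half]")]
groups = group_variants(sample)
```

Whether two normalized names really refer to the same dish remains a judgment call, so groups produced this way are best treated as candidates for review rather than as confirmed duplicates.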
Conventions
In the tables below we provide information about each downloadable data file including:
- a gloss of the column labels,
- the datatype of values that appear in each column,
- an indication of whether columns have missing data (in the form of empty values),
- the source of values in that column,
- plus any additional notes.
We note the presence of missing data points in the column “missing values.” However, there are also strings in the present data that point to missing information (e.g. “unknown” or “?”), which do not formally appear as null values.
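When counting missing data, it may therefore help to treat such placeholder strings the same way as empty values. The sketch below shows one way to do this. The names `PLACEHOLDERS`, `is_missing`, and `missing_share` are hypothetical, and the placeholder set contains only the two examples mentioned above; the actual files may contain other markers.

```python
# Placeholder strings named in the text; extend after inspecting the data.
PLACEHOLDERS = {"unknown", "?"}


def is_missing(value):
    """Return True for None, empty strings, and known placeholder strings."""
    if value is None:
        return True
    text = value.strip().casefold()
    return text == "" or text in PLACEHOLDERS


def missing_share(rows, column):
    """Fraction of rows whose value in `column` counts as missing."""
    rows = list(rows)
    if not rows:
        return 0.0
    return sum(is_missing(row.get(column)) for row in rows) / len(rows)
```

Here `rows` is any iterable of dicts keyed by column label, such as the rows produced by `csv.DictReader`.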
The data in these files comes from several different sources. We reflect this in the “generated by” category of the tables below. Some of the data is supplied by “volunteer transcribers,” by which we mean people who have participated in the project through the What’s on the Menu? site. Some of the information is generated by the “web application.” Part of this information was created automatically as the database supporting the application was constructed and populated (e.g., various ids); part is created as the web application runs (e.g., timestamps recorded as data values are updated). Finally, much of the information is generated from “NYPL metadata.” This metadata comes from many places and reflects the long history of the project and the many parts of the New York Public Library involved in it. Much of the data “supplied by NYPL metadata” in the menu spreadsheet is from the catalog cards made by Frank E. Buttolph in the early twentieth century.
Detailed Data Dictionary
This information is based on our analysis of the publicly available data files. We welcome corrections and additions. Send us an email: curatingmenus [at] gmail.com.
Dish.csv
Column label | Gloss | Data type | Missing values? | Generated by | Description |
---|---|---|---|---|---|
id | identifier for a dish | id | no | web application | corresponds to dish_id in MenuItem.csv |
name | name of dish | string | no | volunteer transcribers | This value matches what the transcriber typed. Sometimes the dish name matches exactly what was printed on the original menu; however, transcribers had various punctuation and capitalization practices, and sometimes relied on contextual information provided by the layout or other items on the menu. |
description | n/a | n/a | yes | n/a | contains no data |
menus_appeared | total count of menus on which dish with this id appears | integer | no | web application | |
times_appeared | total count of appearances of the dish with this id across all menus | integer | no | web application | |
first_appeared | earliest year of a menu on which a dish with this id appears | date (YYYY) | no | web application, based on NYPL metadata for menus | |
last_appeared | latest year of a menu on which a dish with this id appears | date (YYYY) | no | web application, based on NYPL metadata for menus | |
lowest_price | lowest price associated with a dish with a given id | float | yes | volunteer transcribers | Some menus use currencies other than dollars. In addition, transcribers did not always distinguish between dollar amounts and cent amounts, which leads to errors in the data. |
highest_price | highest price associated with a dish with a given id | float | yes | volunteer transcribers | |
Menu.csv
Column label | Gloss | Data type | Missing values? | Generated by | Description |
---|---|---|---|---|---|
id | identifier for menu | id | no | web application | corresponds to menu_id in MenuPage.csv |
name | n/a | n/a | yes | n/a | contains no data |
sponsor | who sponsored the meal (organizations, people, name of restaurant) | string | yes | NYPL metadata | |
event | category (lunch, annual dinner) | string | yes | NYPL metadata | Information in this category varies widely. |
venue | type of place (commercial, social, professional) | string | yes | NYPL metadata | Information in this category varies widely. |
place | where the meal took place (often a geographic location) | string | yes | NYPL metadata | These vary widely (street addresses, cities, names of restaurants, names of ships or trains). NYPL has been crowdsourcing more precise geolocations for menus, but this data is not available in these files. |
physical_description | dimension and material description of the menu | string | yes | NYPL metadata | |
occasion | occasion of the meal (holidays, anniversaries, daily) | string | yes | NYPL metadata | This field likely comes from Buttolph’s original organization of the menu collection. |
notes | notes by librarians about the original material | string | yes | NYPL metadata | |
call_number | call number of the menu | string | yes | NYPL metadata | |
keywords | n/a | n/a | yes | n/a | contains no data |
language | n/a | n/a | yes | n/a | contains no data |
date | date of the menu | date (YYYY-MM-DD) | yes | NYPL metadata | |
location | organization or business who produced the menu | string | no | NYPL metadata | |
location_type | n/a | n/a | yes | n/a | contains no data |
currency | system of money the menu uses (dollars, etc.) | string | yes | NYPL metadata | |
currency_symbol | symbol for the currency ($, etc.) | string | yes | NYPL metadata | |
status | completeness of the menu transcription (transcribed, under review, etc.) | string | no | web application | |
page_count | how many pages the menu has | integer | no | web application | |
dish_count | how many dishes the menu has | integer | no | web application | |
MenuItem.csv
Column label | Gloss | Data type | Missing values? | Generated by | Description |
---|---|---|---|---|---|
id | identifier for the menu item | id | no | web application | |
menu_page_id | id of the page the menu item is on | id | no | web application | corresponds to MenuPage.csv id |
price | first price of menu item | float | yes | volunteer transcribers | |
high_price | if the item has more than one price on a single menu, the highest price | float | yes | volunteer transcribers | If there are more than two values for price, the web application instructs volunteers to enter the lowest and highest prices rather than all values. |
dish_id | id of the dish | id | yes | web application | corresponds to Dish.csv id |
created_at | date/time of first transcription | datetime UTC (YYYY-MM-DD HH:MM:SS UTC) | no | web application | |
updated_at | date/time of the last edit to the value | datetime UTC (YYYY-MM-DD HH:MM:SS UTC) | no | web application | Usually, the updated time would be the time of the review. |
xpos | horizontal coordinate of the upper-left point of the menu item on the page | float | no | web application | This is where the green arrow on the What’s on the Menu? site sits to show people what to transcribe. |
ypos | vertical coordinate of the upper-left point of the menu item on the page | float | no | web application | This is where the green arrow on the What’s on the Menu? site sits to show people what to transcribe. |
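
The created_at and updated_at values follow the pattern YYYY-MM-DD HH:MM:SS UTC given in the table, so they can be converted to timezone-aware datetimes before comparison. The sketch below uses Python's standard `datetime` module. The helper names and the sample timestamps are hypothetical and are not taken from the actual data.

```python
from datetime import datetime, timezone

# Pattern from the table: YYYY-MM-DD HH:MM:SS followed by the literal "UTC".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S UTC"


def parse_timestamp(value):
    """Parse a created_at/updated_at string into an aware UTC datetime."""
    parsed = datetime.strptime(value.strip(), TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def was_edited(row):
    """Return True if the row was updated after it was first transcribed."""
    return parse_timestamp(row["updated_at"]) > parse_timestamp(row["created_at"])


# Made-up example row for illustration only.
example = {
    "created_at": "2011-03-28 15:00:44 UTC",
    "updated_at": "2011-04-19 04:33:15 UTC",
}
edited = was_edited(example)
```

Rows whose two timestamps are equal would be reported as not edited under this sketch.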
MenuPage.csv
Column label | Gloss | Data type | Missing values? | Generated by | Description |
---|---|---|---|---|---|
id | identifier for menu page | id | no | web application | |
menu_id | identifier for menu | id | no | web application | corresponds to Menu.csv id |
page_number | number representing sequence of page in the menu | integer | yes | web application | |
image_id | identifier for the page image | id | no | web application | |
full_height | height of the page image in pixels | integer | yes | web application | |
full_width | width of the page image in pixels | integer | yes | web application | |
uuid | universally unique identifier for the highest resolution version of the image | id | no | web application | |
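
The identifiers documented in the tables above allow the four files to be linked: MenuItem.csv menu_page_id matches MenuPage.csv id, MenuPage.csv menu_id matches Menu.csv id, and MenuItem.csv dish_id matches Dish.csv id. The following is a minimal sketch of that chain using only Python's standard library. The function names and the tiny sample rows are made up for illustration; real use would read the rows from the exported files instead.

```python
import csv
import io


def index_by_id(rows):
    """Build a lookup from the 'id' column to the full row."""
    return {row["id"]: row for row in rows}


def join_menu_items(items, pages, menus, dishes):
    """Yield one flat record per menu item, following the shared identifiers."""
    pages_by_id = index_by_id(pages)
    menus_by_id = index_by_id(menus)
    dishes_by_id = index_by_id(dishes)
    for item in items:
        page = pages_by_id.get(item["menu_page_id"])
        menu = menus_by_id.get(page["menu_id"]) if page else None
        # dish_id can be empty, in which case no dish is attached.
        dish = dishes_by_id.get(item["dish_id"])
        yield {
            "item_id": item["id"],
            "dish_name": dish["name"] if dish else None,
            "page_number": page["page_number"] if page else None,
            "menu_id": menu["id"] if menu else None,
            "sponsor": menu["sponsor"] if menu else None,
        }


def read(text):
    """Parse CSV text into a list of row dicts (stand-in for reading a file)."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample rows using the documented column labels.
dishes = read("id,name\n10,Half Chicken\n")
menus = read("id,sponsor\n100,Example Hotel\n")
pages = read("id,menu_id,page_number\n50,100,1\n")
items = read("id,menu_page_id,dish_id\n1,50,10\n2,50,\n")
records = list(join_menu_items(items, pages, menus, dishes))
```

All values stay as strings here, which is how `csv.DictReader` returns them; converting ids or prices to numbers is left to the analysis. The second sample item has an empty dish_id, so its `dish_name` is `None` while its page and menu are still resolved.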