Bulk Data Key Concepts
JSON Lines File Format
IMDb's data set is provided in the JSON Lines file format. The files are UTF-8 encoded text files in which each line is a valid JSON document. Each JSON document, one per line, relates to a single entity, uniquely identified by an IMDb ID. A JSON schema is also provided that documents the format used for each JSON document within the file.
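As a rough illustration of reading this format, the following Python sketch parses a file one line at a time. The function name iter_records and the keys "id" and "name" are placeholders chosen for this example, not names taken from the published schema.

```python
import io
import json
from typing import Iterator, TextIO


def iter_records(lines: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON document per non-blank line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate a blank line, e.g. at the end of the file
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number} is not valid JSON: {exc}") from exc


# Hypothetical two-record sample with placeholder keys and values.
# A real file would be opened with open(path, encoding="utf-8")
# and passed to iter_records in the same way.
sample = io.StringIO(
    '{"id": "tt0000000", "name": "Example Title"}\n'
    '{"id": "nm0000000", "name": "Example Person"}\n'
)
records = list(iter_records(sample))
```

Reading line by line keeps memory use flat for large files, because only one document is parsed at a time.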
Every published revision of IMDb's data set contains data file(s), documentation for that data, and a schema which validates that data. Each of these is associated with a version number, which can be found at the end of their filenames.
At any time we may change the format of new data set revisions and their accompanying schema, but previously published data set revisions will remain unchanged. If data from a new revision of the data set is not compatible with the previous schema (i.e. a breaking change) then we will increment the version number for the data files, schema, and documentation. In this case we will publish both formats of the data set for some period of time before we stop publishing the older one. The data set format and schema may change without incrementing the schema version number if the change is compatible with previous revisions (i.e. a non-breaking change).
The following are examples of non-breaking changes to the schema:
- Adding a new key anywhere in the structure.
- Removing an optional key.
- Changing a key from optional to required.
- Changing the validation rules for a specific key such that all values still validate against previous validation rules.
The following are examples of breaking changes to the schema:
- Changing a key from required to optional.
- Removing a required key.
- Changing the validation rules for a specific key such that newly published values may exist that do not validate against previous validation rules.
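The distinction above turns on one question: does every newly published document still validate against the previous schema? The Python sketch below reduces a schema to its set of required keys to show why making a key required is non-breaking while making it optional is breaking. The helper satisfies and all key names are assumptions made for illustration only; real validation would use the published JSON schema.

```python
from typing import Iterable


def satisfies(record: dict, required: Iterable[str]) -> bool:
    """Return True if every required key is present in the record."""
    return set(required).issubset(record)


# Hypothetical previous schema: "id" and "name" are required, "year" is optional.
OLD_REQUIRED = {"id", "name"}

# New revision makes "year" required: every new record carries it,
# which the previous schema already allowed.
record_year_required = {"id": "x1", "name": "Example", "year": 2001}

# New revision makes "name" optional: a new record may omit it,
# which the previous schema rejects.
record_name_optional = {"id": "x2"}

still_valid = satisfies(record_year_required, OLD_REQUIRED)
now_invalid = not satisfies(record_name_optional, OLD_REQUIRED)
```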
Data Structure Conventions
There are some conventions you should be aware of when using IMDb's data set:
- There are no null values in the data set. If we do not have a value for a particular key we omit publishing that key. Keys which are required by the schema will never have a null value.
- There are no empty objects in the data set. If an object would have contained no keys we omit publishing that object.
- There are no empty arrays in the data set. If an array value would have contained no items we omit publishing the corresponding key.
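One practical consequence of these conventions is that a consumer checks for the presence of a key rather than for null or empty values. A minimal Python sketch, assuming hypothetical keys "genres", "release", and "year" that are not taken from the published schema:

```python
from typing import Any, Optional


def get_path(record: dict, *keys: str) -> Optional[Any]:
    """Walk nested keys, returning None as soon as one is omitted."""
    current: Any = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical record in which the optional keys were omitted.
record = {"id": "x1", "name": "Example"}

# An omitted array key means "no items", so fall back to an empty list.
genres = record.get("genres", [])

# An omitted object means none of its keys are available either.
year = get_path(record, "release", "year")
```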
Data Consistency Model
IMDb’s data is constantly being expanded and updated, and it can take seconds or minutes for a change to propagate throughout the entire catalog. This means that any published snapshot of IMDb’s data may contain temporary inconsistencies. For example, an actor’s filmography may list a title before the corresponding credit has propagated to that title’s credits. Each individual inconsistency will be resolved in the next published revision of the data set.
Linking to IMDb
IMDb's data contains URLs that you can use to link back to the IMDb website in any experience you build for your users. Your license may require you to attach a "refmarker" to the end of the URL. The "refmarker" is a special sequence of characters that we use to identify the source of our traffic. Add the "refmarker" to the URL by appending ?ref_=xx_xxx_x to the URL, where xx_xxx_x is replaced by the code we have provided to you. A full URL could look something like
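Appending the "refmarker" can be sketched in Python as follows. The function name add_refmarker and the sample URL are placeholders for illustration. Replacing an existing ref_ parameter and preserving other query parameters are assumptions that go beyond the simple append described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_refmarker(url: str, refmarker: str) -> str:
    """Return the URL with a ref_ query parameter set to the given code."""
    parts = urlsplit(url)
    # Keep any existing query parameters except a previous ref_ value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "ref_"
    ]
    query.append(("ref_", refmarker))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL and placeholder code; use a URL from the data set
# and the code provided with your license.
linked = add_refmarker("https://www.imdb.com/title/tt0000000/", "xx_xxx_x")
```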