IMDb uses unique identifiers for each of the entities referenced in IMDb data. For example, we have "Name IDs" identifying name entities (people) and "Title IDs" identifying title entities (movies, series, episodes and video games). IMDb's identifiers always take the form of two letters, which signify the type of entity being identified, followed by a sequence of at least seven digits that uniquely identifies a specific entity of that type. For example:
- tt0050083 is the unique identifier for the movie "12 Angry Men (1957)", where
  - tt signifies that it's a title entity and
  - 0050083 uniquely indicates "12 Angry Men (1957)".
- nm0000020 is the unique identifier for the actor "Henry Fonda", where
  - nm signifies that it's a name entity and
  - 0000020 uniquely indicates "Henry Fonda".
Within the data set, each entry relates to a single IMDb identifier.
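As a minimal sketch of the identifier format described above, the following Python helper splits an identifier into its two-letter type prefix and its numeric part. The names `parse_imdb_id` and `ImdbId` are hypothetical, and the restriction to lowercase letters is an assumption based on the `tt` and `nm` examples.

```python
import re
from typing import NamedTuple

# Two letters followed by at least seven digits, as described above.
# Assumption: the letters are lowercase, as in the "tt" and "nm" examples.
IMDB_ID_PATTERN = re.compile(r"([a-z]{2})([0-9]{7,})")


class ImdbId(NamedTuple):
    prefix: str  # entity type, e.g. "tt" (title) or "nm" (name)
    number: str  # kept as a string so leading zeros are preserved


def parse_imdb_id(value: str) -> ImdbId:
    """Split an identifier into its type prefix and numeric part."""
    match = IMDB_ID_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError(f"not an IMDb identifier: {value!r}")
    return ImdbId(prefix=match.group(1), number=match.group(2))
```

The numeric part is kept as a string because converting it to an integer would drop the leading zeros that are part of the identifier.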
IMDb's data set is provided in JSON Lines file format. The files are UTF-8 encoded text files, where each line in the file is a valid JSON document. Each JSON document, one per line, relates to a single entity, uniquely identified by an IMDb ID. We also provide a JSON schema that documents the format that is used for each JSON document within the file.
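A JSON Lines file of this kind can be read one line at a time without loading the whole file into memory. The sketch below uses only the Python standard library; the function name `iter_entities` is hypothetical, and it assumes that each line decodes to a JSON object.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one decoded JSON document per non-empty line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate a blank line, e.g. at the end of the file
        try:
            yield json.loads(line)
        except json.JSONDecodeError as error:
            raise ValueError(f"line {line_number} is not valid JSON") from error
```

To use it with a file, open the file as UTF-8 text, for example `open(path, encoding="utf-8")`, and pass the file object to `iter_entities`. Here `path` stands for wherever you have stored the data file.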
Every published revision of IMDb's data set contains data file(s), documentation for that data, and a schema which validates that data. Each of these is associated with a correlated version number, which can be found at the end of their filenames.
At any time we may change the format of new data set revisions and their accompanying schema, but previously published data set revisions will remain unchanged. If data from a new revision of the data set is not compatible with the previous schema (i.e. a breaking change) then we will increment the version number for the data files, schema, and documentation. In this case we will publish both formats of the data set for some period of time before we stop publishing the older one. The data set format and schema may change without incrementing the schema version number if the change is compatible with previous revisions (i.e. a non-breaking change).
The following are examples of non-breaking changes to the schema:
- Adding a new key anywhere in the structure.
- Removing an optional key.
- Changing a key from optional to required.
- Changing the validation rules for a specific key such that all values still validate against previous validation rules.
The following are examples of breaking changes to the schema:
- Changing a key from required to optional.
- Removing a required key.
- Changing the validation rules for a specific key such that newly published values may exist that do not validate against previous validation rules.
There are some conventions you should be aware of when using IMDb's data set:
- There are no null values in the data set. If we do not have a value for a particular key we omit publishing that key. Keys which are required by the schema will never have a null value.
- There are no empty objects in the data set. If an object would have contained no keys we omit publishing that object.
- There are no empty arrays in the data set. If an array value would have contained no items we omit publishing the corresponding key.
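Because missing values are expressed by omitting keys rather than by nulls, empty objects or empty arrays, code that reads the data should treat an absent key as "no value". The sketch below shows one way to do that for nested keys. The helper name `get_path` is hypothetical, and any key names you pass to it must come from the published schema.

```python
from collections.abc import Mapping
from typing import Any


def get_path(document: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested objects key by key, returning `default` if any key is absent."""
    current = document
    for key in keys:
        # An omitted key (or an omitted parent object) means "no value".
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current
```

Passing `default=[]` for a key that holds an array lets the calling code loop over the result without checking whether the key was published. Reading keys this way also keeps a consumer working when an optional key is removed in a later revision.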
We constantly update IMDb's data set, adding more data and improving the quality of the data we have. While there is only ever one entry per IMDb ID, we sometimes find that we have duplicate IMDb IDs for an entity within our system. For example, we may learn that two people we have identified separately are actually the same person. When this happens, we maintain the data associated with both identifiers in the data set, duplicating the data. This allows you to continue using any matching you have between IMDb identifiers and other identifiers. To identify when this is the case, we include a remappedTo field on one of the copies, which gives you the new preferred identifier for that entity.
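The following sketch shows one way to resolve an identifier to its preferred identifier using the remappedTo field. It rests on several assumptions that the schema should confirm: that you have built an in-memory index of entries keyed by IMDb ID, that remappedTo holds a plain identifier string at the top level of the entry, and that a remapped entry might itself point at another remapped entry. The function name `resolve_preferred_id` is hypothetical.

```python
from typing import Any, Mapping


def resolve_preferred_id(
    entities: Mapping[str, Mapping[str, Any]], imdb_id: str
) -> str:
    """Follow remappedTo links until reaching an entry without one."""
    seen = set()
    current = imdb_id
    while current in entities and "remappedTo" in entities[current]:
        if current in seen:
            # Defensive check only; nothing in the data set description
            # suggests that cycles occur.
            raise ValueError(f"remapping cycle detected at {current}")
        seen.add(current)
        current = entities[current]["remappedTo"]
    return current
```

An identifier that is missing from the index, or whose entry has no remappedTo key, is returned unchanged.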
Sometimes we delete entities from the data set. The most prominent example of this is the deletion of titles that have been canceled during development and will therefore never be released. When we delete an entity it is no longer included in the data set. The identifier associated with it is never reused for a different entity.
IMDb’s data is constantly being expanded and updated, and it can take seconds or minutes for a change to propagate throughout the entire catalog. This means that the snapshot of IMDb’s data published may contain temporary inconsistencies. For example, it is possible that we report an actor appearing in a title in their filmography, but it has not yet propagated to that title’s credits. Each individual inconsistency will be resolved in the next published revision of the data set.
IMDb's data contains URLs that you can use to link back to the IMDb website in any experience you build for your users. Your license may require you to attach a "refmarker" to the end of the URL. The "refmarker" is a special sequence of characters that we use to identify the source of our traffic. Add the "refmarker" to the URL by appending ?ref_=xx_xxx_x to the URL, where xx_xxx_x is replaced by the code we have provided to you. A full URL could look something like
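Appending the refmarker can be automated. The sketch below uses Python's standard `urllib.parse` module; the function name `add_refmarker` is hypothetical. Its handling of URLs that already carry a query string or an existing `ref_` parameter is an assumption that goes beyond the instructions above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_refmarker(url: str, refmarker: str) -> str:
    """Return `url` with a single ref_ query parameter set to `refmarker`."""
    parts = urlsplit(url)
    # Keep any existing query parameters except a previous ref_ value
    # (assumption: at most one refmarker should be present).
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "ref_"
    ]
    query.append(("ref_", refmarker))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

Call it with a URL taken from the data set and the refmarker code provided to you.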