Querying Data in Amazon Athena Tables

What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. For more information, and for getting started with Athena, read the user guide.

Getting Started

You first need to create a database in Athena. This process is documented in the user guide

When you have a database, you're ready to create a table that's based on the dataset. You need to upload your dataset to Amazon S3. When you specify a location for your table you should use a trailing slash for your folder or bucket. Do not use file names or glob characters.

Use: s3://S3-BUCKET/S3-KEY/

Don't use: s3://S3-BUCKET s3://S3-BUCKET/S3-KEY/* s3://S3-BUCKET/S3-KEY/DATASET.JSONL.GZ

Athena will query all objects in the specified location so it is important that only one dataset is found at that path.

To create a table use the create table DDL statement found at the end of this document. Remember to set the location to the location of your dataset. If you do not need to query all of the columns in the table you can remove them from the create table DDL statement.

Now that you have a table created in Athena based on the data in Amazon S3, you can run queries on the table and see the results in Athena.

We have provided some example queries for common use cases.

What Are the Highest-Rated Movies On IMDb?

IMDb user ratings can be found in the title essential dataset as part of the imdbRating structure.

select
    tc.titleId,
    tc.originalTitle,
    tc.imdbRating.rating,
    tc.imdbRating.numberOfVotes
from
    title_essential_v1 as tc
where
    tc.remappedTo is null and tc.titleType = 'movie'
order by
    tc.imdbRating.rating desc,
    tc.imdbRating.numberOfVotes desc

Running this query might return the following results:

titleid	originaltitle	rating	numberofvotes
tt0050083	12 Angry Men	8.9	616319
tt0110413	Léon: The Professional	8.5	953632
tt0064116	Once Upon a Time in the West	8.5	278746

What Are the Title Texts for the Titles That a Person Is Known For?

The title IDs that a person is known for can be found in the name essential dataset as part of the knownFor array. To query this array it is necessary to flatten it into multiple rows using CROSS JOIN in conjunction with the UNNEST operator. To include the original title text it is necessary to JOIN the title essential dataset.

select
    nc.nameId,
    nc.name,
    u_knownFor.titleId,
    tc.originalTitle
from
    name_essential_v1 as nc
cross join
    unnest(nc.knownFor) with ordinality as t(u_knownFor, ordinal)
join
    title_essential_v1 as tc
        on u_knownFor.titleId = tc.titleId
where
    nc.remappedTo is null
order by
    nc.nameId, ordinal

Running this query might return the following results:

nameid	name	titleid	originaltitle
nm0000020	Henry Fonda	tt0050083	12 Angry Men
nm0000020	Henry Fonda	tt0082846	On Golden Pond
nm0000020	Henry Fonda	tt0032551	The Grapes of Wrath

What Are the Names of the Principal Cast for a Title?

The name IDs for the principal cast for a title can be found in the title essential dataset as part of the principalCastMembers array. To query this array it is necessary to flatten it into multiple rows using CROSS JOIN in conjunction with the UNNEST operator. To include the name it is necessary to JOIN the name essential dataset.

select
    tc.titleId,
    tc.originalTitle,
    nc.name
from
    title_essential_v1 as tc
cross join
    unnest(tc.principalCastMembers) with ordinality as t(u_pcm, ordinal)
join
    name_essential_v1 as nc
        on u_pcm.nameId = nc.nameId
where
    tc.remappedTo is null
order by
    tc.titleId, ordinal

Running this query on the sample dataset returns the following results:

titleid	originaltitle	nameid	name
tt0050083	12 Angry Men	nm0000020	Henry Fonda
tt0050083	12 Angry Men	nm0002011	Lee J. Cobb
tt0050083	12 Angry Men	nm0000842	Martin Balsam

What Awards was a Title Nominated for, and who were the Award Nominees?

select
    tc.titleId,
    tc.originalTitle,
    tc_awards.awardNominationId,
    tc_awards.awardName,
    nc.nameId,
    nc.name
from
    name_essential_v1 as nc
cross join
    unnest(nc.knownFor) with ordinality as t(u_knownFor, ordinal)
join
    title_essential_v1 as tc
        on u_knownFor.titleId = tc.titleId
cross join
    unnest(tc.awards) as t(tc_awards)
cross join
    unnest(nc.awards) as t(nc_awards)
where
    tc_awards.awardNominationid = nc_awards.awardNominationId

Running this query might return the following results:

titleId	originalTitle	awardNominationId	awardName	nameId	name
tt0032551	The Grapes of Wrath	an0052939	Oscar	nm0002034	Jane Darwell
tt0032551	The Grapes of Wrath	an0829169	NBR award	nm0002034	Jane Darwell
tt0032551	The Grapes of Wrath	an0052926	Oscar	nm0000020	Henry Fonda

What Are the Title Texts for Episodes of a Series?

The title IDs for episodes that are part of a series can be found in the title essential dataset as part of the episodeTitleIds array. To query this array it is necessary to flatten it into multiple rows using CROSS JOIN in conjunction with the UNNEST operator. To include the title text it is necessary to JOIN the title essential dataset.

select
    tc_series.titleId,
    tc_series.originalTitle,
    tc_episode.titleId,
    tc_episode.originalTitle,
    tc_episode.episodeInfo.seasonNumber,
    tc_episode.episodeInfo.episodeNumber
from
    title_essential_v1 as tc_series
cross join
    unnest(tc_series.seriesInfo.episodeTitleIds) as t(u_eti)
join
    title_essential_v1 as tc_episode
        on u_eti = tc_episode.titleId
where
    tc_series.remappedTo is null
order by
    tc_series.titleId,
    tc_episode.episodeInfo.seasonNumber,
    tc_episode.episodeInfo.episodeNumber

Running this query might return the following results:

titleid	originaltitle	titleid	originaltitle	seasonnumber	episodenumber
tt5491994	Planet Earth II	tt6142646	Islands	1	1
tt5491994	Planet Earth II	tt6209126	Mountains	1	2
tt5491994	Planet Earth II	tt6209130	Jungles	1	3

What Is the Title Text of the Series for an Episode Title?

The title ID for a series that an episode is part of can be found in the title essential dataset as part of the episodeInfo structure. To include the title text it is necessary to JOIN the title essential dataset.

select
    tc_episode.titleId,
    tc_episode.originalTitle,
    tc_series.titleId,
    tc_series.originalTitle,
    tc_episode.episodeInfo.seasonNumber,
    tc_episode.episodeInfo.episodeNumber
from
    title_essential_v1 as tc_episode
join
    title_essential_v1 as tc_series
        on tc_episode.episodeInfo.seriesTitleId = tc_series.titleId
where
    tc_episode.remappedTo is null
order by
    tc_episode.titleId

Running this query might return the following results:

titleid	originaltitle	titleid	originaltitle	seasonnumber	episodenumber
tt6142646	Islands	tt5491994	Planet Earth II	1	1
tt6209126	Mountains	tt5491994	Planet Earth II	1	2
tt6209130	Jungles	tt5491994	Planet Earth II	1	3

What are the most popular titles on IMDb?

This can be obtained by running the below query.

select  
titleId,  
date,  
rank  
from  
geo_title_meter_v1  
where  
country = 'US'  
order by  
rank

Running this query might return the following results:

titleid	Date	rank
tt0001167	7/14/2024	1
tt8458664	7/14/2024	2
tt1428821	7/14/2024	3

Who are the most popular names on IMDb?

This can be obtained by running the below query:

select  
nameId,  
date,  
rank  
from  
geo_star_meter_v1  
where  
country = 'US'  
order by  
rank

Running this query might return the following results:

nameId	Date	rank
nm0001167	7/14/2024	1
nm8458664	7/14/2024	2
nm1428821	7/14/2024	3

Creating Tables DDL

Explore sample DDL commands to create tables for IMDb name and essential data.

Querying Data in Amazon Athena Tables

What is Amazon Athena?

Getting Started

What Are the Highest-Rated Movies On IMDb?

What Are the Title Texts for the Titles That a Person Is Known For?

What Are the Names of the Principal Cast for a Title?

What Awards was a Title Nominated for, and who were the Award Nominees?

What Are the Title Texts for Episodes of a Series?

What Is the Title Text of the Series for an Episode Title?

What are the most popular titles on IMDb?

Who are the most popular names on IMDb?

Creating Tables DDL

Get started