- Documentation
- Key Concepts
- Release Notes
Querying Data in Amazon Athena Tables
What is Amazon Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. For more information, and for getting started with Athena, read the user guide.
Getting Started
You first need to create a database in Athena. This process is documented in the user guide
When you have a database, you're ready to create a table that's based on the dataset. You need to upload your dataset to Amazon S3. When you specify a location for your table you should use a trailing slash for your folder or bucket. Do not use file names or glob characters.
Use: s3://S3-BUCKET/S3-KEY/
Don't use: s3://S3-BUCKET s3://S3-BUCKET/S3-KEY/* s3://S3-BUCKET/S3-KEY/DATASET.JSONL.GZ
Athena will query all objects in the specified location so it is important that only one dataset is found at that path.
To create a table use the create table DDL statement found at the end of this document. Remember to set the location to the location of your dataset. If you do not need to query all of the columns in the table you can remove them from the create table DDL statement.
Now that you have a table created in Athena based on the data in Amazon S3, you can run queries on the table and see the results in Athena.
We have provided some example queries for common use cases.
What Are the Highest-Rated Movies On IMDb?
IMDb user ratings can be found in the title essential dataset as part of the imdbRating
structure.
select
tc.titleId,
tc.originalTitle,
tc.imdbRating.rating,
tc.imdbRating.numberOfVotes
from
title_essential_v1 as tc
where
tc.remappedTo is null and tc.titleType = 'movie'
order by
tc.imdbRating.rating desc,
tc.imdbRating.numberOfVotes desc
Running this query might return the following results:
titleid | originaltitle | rating | numberofvotes |
---|---|---|---|
tt0050083 | 12 Angry Men | 8.9 | 616319 |
tt0110413 | Léon: The Professional | 8.5 | 953632 |
tt0064116 | Once Upon a Time in the West | 8.5 | 278746 |
What Are the Title Texts for the Titles That a Person Is Known For?
The title IDs that a person is known for can be found in the name essential dataset as part of the knownFor
array. To query this array it is necessary to flatten it into multiple rows using CROSS JOIN
in conjunction with the UNNEST
operator. To include the original title text it is necessary to JOIN
the title essential dataset.
select
nc.nameId,
nc.name,
u_knownFor.titleId,
tc.originalTitle
from
name_essential_v1 as nc
cross join
unnest(nc.knownFor) with ordinality as t(u_knownFor, ordinal)
join
title_essential_v1 as tc
on u_knownFor.titleId = tc.titleId
where
nc.remappedTo is null
order by
nc.nameId, ordinal
Running this query might return the following results:
nameid | name | titleid | originaltitle |
---|---|---|---|
nm0000020 | Henry Fonda | tt0050083 | 12 Angry Men |
nm0000020 | Henry Fonda | tt0082846 | On Golden Pond |
nm0000020 | Henry Fonda | tt0032551 | The Grapes of Wrath |
What Are the Names of the Principal Cast for a Title?
The name IDs for the principal cast for a title can be found in the title essential dataset as part of the principalCastMembers
array. To query this array it is necessary to flatten it into multiple rows using CROSS JOIN
in conjunction with the UNNEST
operator. To include the name it is necessary to JOIN
the name essential dataset.
select
tc.titleId,
tc.originalTitle,
nc.name
from
title_essential_v1 as tc
cross join
unnest(tc.principalCastMembers) with ordinality as t(u_pcm, ordinal)
join
name_essential_v1 as nc
on u_pcm.nameId = nc.nameId
where
tc.remappedTo is null
order by
tc.titleId, ordinal
Running this query on the sample dataset returns the following results:
titleid | originaltitle | nameid | name |
---|---|---|---|
tt0050083 | 12 Angry Men | nm0000020 | Henry Fonda |
tt0050083 | 12 Angry Men | nm0002011 | Lee J. Cobb |
tt0050083 | 12 Angry Men | nm0000842 | Martin Balsam |
What Awards was a Title Nominated for, and who were the Award Nominees?
select
tc.titleId,
tc.originalTitle,
tc_awards.awardNominationId,
tc_awards.awardName,
nc.nameId,
nc.name
from
name_essential_v1 as nc
cross join
unnest(nc.knownFor) with ordinality as t(u_knownFor, ordinal)
join
title_essential_v1 as tc
on u_knownFor.titleId = tc.titleId
cross join
unnest(tc.awards) as t(tc_awards)
cross join
unnest(nc.awards) as t(nc_awards)
where
tc_awards.awardNominationid = nc_awards.awardNominationId
Running this query might return the following results:
titleId | originalTitle | awardNominationId | awardName | nameId | name |
---|---|---|---|---|---|
tt0032551 | The Grapes of Wrath | an0052939 | Oscar | nm0002034 | Jane Darwell |
tt0032551 | The Grapes of Wrath | an0829169 | NBR award | nm0002034 | Jane Darwell |
tt0032551 | The Grapes of Wrath | an0052926 | Oscar | nm0000020 | Henry Fonda |
What Are the Title Texts for Episodes of a Series?
The title IDs for episodes that are part of a series can be found in the title essential dataset as part of the episodeTitleIds
array. To query this array it is necessary to flatten it into multiple rows using CROSS JOIN
in conjunction with the UNNEST
operator. To include the title text it is necessary to JOIN
the title essential dataset.
select
tc_series.titleId,
tc_series.originalTitle,
tc_episode.titleId,
tc_episode.originalTitle,
tc_episode.episodeInfo.seasonNumber,
tc_episode.episodeInfo.episodeNumber
from
title_essential_v1 as tc_series
cross join
unnest(tc_series.seriesInfo.episodeTitleIds) as t(u_eti)
join
title_essential_v1 as tc_episode
on u_eti = tc_episode.titleId
where
tc_series.remappedTo is null
order by
tc_series.titleId,
tc_episode.episodeInfo.seasonNumber,
tc_episode.episodeInfo.episodeNumber
Running this query might return the following results:
titleid | originaltitle | titleid | originaltitle | seasonnumber | episodenumber |
---|---|---|---|---|---|
tt5491994 | Planet Earth II | tt6142646 | Islands | 1 | 1 |
tt5491994 | Planet Earth II | tt6209126 | Mountains | 1 | 2 |
tt5491994 | Planet Earth II | tt6209130 | Jungles | 1 | 3 |
What Is the Title Text of the Series for an Episode Title?
The title ID for a series that an episode is part of can be found in the title essential dataset as part of the episodeInfo
structure. To include the title text it is necessary to JOIN
the title essential dataset.
select
tc_episode.titleId,
tc_episode.originalTitle,
tc_series.titleId,
tc_series.originalTitle,
tc_episode.episodeInfo.seasonNumber,
tc_episode.episodeInfo.episodeNumber
from
title_essential_v1 as tc_episode
join
title_essential_v1 as tc_series
on tc_episode.episodeInfo.seriesTitleId = tc_series.titleId
where
tc_episode.remappedTo is null
order by
tc_episode.titleId
Running this query might return the following results:
titleid | originaltitle | titleid | originaltitle | seasonnumber | episodenumber |
---|---|---|---|---|---|
tt6142646 | Islands | tt5491994 | Planet Earth II | 1 | 1 |
tt6209126 | Mountains | tt5491994 | Planet Earth II | 1 | 2 |
tt6209130 | Jungles | tt5491994 | Planet Earth II | 1 | 3 |
Explore sample DDL commands to create tables for IMDb name and essential data.