TDM 20100: Project 11 — SQL

Motivation: Now we will apply our SQL skills by studying movies and TV shows.

Context: The Internet Movie DataBase www.imdb.com provides data tables here: datasets.imdbws.com which we have stored in a database for you here: /anvil/projects/tdm/data/movies_and_tv/imdb2024.db

Scope: There are 7 tables to get familiar with: akas, basics, crew, episode, name, principals, ratings

Learning Objectives:
  • We will learn how to use SQL to analyze several aspects of movies and TV shows

Make sure to read about, and use the template found here, and the important information about project submissions here.

Dataset(s)

This project will use the following dataset:

  • /anvil/projects/tdm/data/movies_and_tv/imdb2024.db (Internet Movie DataBase (IMDB))

Our page in The Examples Book about SQL (in general) is given here: the-examples-book.com/tools/SQL/

Questions

Using the seminar kernel, if you run this line in a cell by itself:

%sql sqlite:////anvil/projects/tdm/data/movies_and_tv/imdb2024.db

then you will have the movies and TV database loaded.

If your kernel dies, then you need to re-run the line above. You also need to re-run this line at the start of any new Jupyter Lab session.

The tables in this database are:

akas
basics
crew
episode
name
principals
ratings

Question 1 (2 pts)

  1. From the basics table, display the entry for Friends. (The title constant for Friends is tt0108778. Notice that this corresponds to the IMDB webpage for Friends: www.imdb.com/title/tt0108778 from IMDB.)

  2. Find all of the entries of the principals table that correspond to people in Friends.

  3. Use the episode table to discover how many episodes occurred during each season of Friends. For each season, print the season number and the number of episodes in that season.

Notice that the episode table has two columns of title constants: one of the episode title constant (which, in the table, is tconst) and the other is the main show title constant (which, in the table, is parentTconst).

Deliverables
  • From the basics table, display the entry for Friends.

  • Find all of the entries of the principals table that correspond to people in Friends.

  • Use the episode table to discover how many episodes occurred during each season of Friends. For each season, print the season number and the number of episodes in that season.

Question 2 (2 pts)

Join the ratings and the basics table, to find the 13 titles that each have more than 2 million ratings. For each such title, output these values: tconst, averageRating, numVotes, primaryTitle, startYear, runtimeMinutes, and genres

Deliverables
  • For each of the 13 titles that each have more than 2 million ratings, output these values: tconst, averageRating, numVotes, primaryTitle, startYear, runtimeMinutes, and genres

Question 3 (2 pts)

Using the startYear values from the basics table, find the total number of entries in each startYear.

Deliverables
  • For each startYear value from the basics table, print the startYear and the total number of entries in corresponding to that startYear.

Question 4 (2 pts)

  1. From the name column, find the nconst value for Emma Watson. (Notice that there are several entries with this name, but probably only one of them is the one that you want to analyze.)

  2. How many entries in the principals table correspond to Emma Watson (using only the correct value of nconst that you found in part a)?

Deliverables
  • From the name column, find the nconst value for Emma Watson. (Although several values appear, just find the 1 value that is correct for her.)

  • How many entries in the principals table correspond to Emma Watson?

Question 5 (2 pts)

Join the basics and the ratings table to find the 3 entries that have startYear = 2024 and numVotes > 100000 and averageRating > 8. (Print all of the entries from both tables, for these 3 entries.)

Deliverables
  • Join the basics and the ratings table to find the 3 entries that have startYear = 2024 and numVotes > 100000 and averageRating > 8. (Print all of the entries from both tables, for these 3 entries.)

Submitting your Work

We see that the SQL skills that we learned for the Lahman baseball database are directly applicable to analyzing the movies and TV database too! It is a good feeling to be able to apply what we have learned in a new setting!

Items to submit
  • firstname-lastname-project11.ipynb

You must double check your .ipynb after submitting it in gradescope. A very common mistake is to assume that your .ipynb file has been rendered properly and contains your code, comments (in markdown or with hashtags), and code output, even though it may not. Please take the time to double check your work. See the instructions on how to double check your submission.

You will not receive full credit if your .ipynb file submitted in Gradescope does not show all of the information you expect it to, including the output for each question result (i.e., the results of running your code), and also comments about your work on each question. Please ask a TA if you need help with this. Please do not wait until Friday afternoon or evening to complete and submit your work.