Trends in the Oscars

For my final project for my Data & Web Technologies for Data Analysis (STA 141B) class, I analyzed trends between nominees/winners of the Oscars, the films and the actors.

I’ve included the code-free markdown versions of the Jupyter notebooks below.


STA 141B Final Report

Trends in the Academy Awards (Oscars)

This report code is divided up into several Jupyter notebooks:

I. Introduction introduction.ipynb
II. Data Collection, Cleaning and Storage data.ipynb
III. Movie Analysis and Statistics movieanalysis.ipynb
IV. Actor Analysis and Statistics actoranalysis.ipynb
V. Unused Code unused.ipynb

I. Introduction

Contents

  1. Topic
  2. Questions and Goals
  3. Process

Topic

The 91st Academy Awards took place just last month, perfectly coinciding with the beginning of our project. We decided to look at the Oscars in more detail, and we quickly came up with our topic: Trends in Movies and Movie Awards.

Movies are a representation of culture and society, and are deeply connected with many social issues that we face. Many of these issues, like systemic sexual assault, racial bias and net neutrality have prompted widespread social movements and created lasting impacts. While we won’t be solving any problems in this report, we can take a look at the data and gain a better understanding of some of these issues.

Guiding Questions

Our group decided on two guiding questions for the project:

  1. Are certain movie characteristics correlated to Oscar nominations/wins?
  2. Are certain actor demographics correlated to Oscar nominations/wins?

Process

Our scope was slightly broad and we ran into a variety of different problems while working on this project. Data collection, for example, was much more difficult than we had anticipated and took the vast majority of our time. Additionally, there are many different factors that could play into win correlations, and future explorations could go into more depth on those factors.

II. Data Collection, Cleaning and Storage

Contents

  1. Academy Award Nominees and Winners
  2. Film Production and Basic Information
    2.1. Introduction
    2.2. TMDB Data
    2.3. OMDB Data
    2.4. The Numbers Data
    2.5. Combining Data
  3. Demographics of Actors
    3.1. NNDB Data
    3.2. Digression on Methodology
  4. Summary and Further Exploration

1. Academy Award Nominees and Winners

The first type of data we needed to collect was all the nominees and winners of the Oscars. We tried several sources but ultimately ended up using the Official Academy Awards Database and Wikipedia. Additional sources, methods, reasoning and code can be found in the Unused Code Notebook unused.ipynb.

This was one of the most time-consuming aspects of this project because the data was inconsistent and because we needed to spend time understanding the task. We figured out which categories we wanted to use, how category names have changed over the years, and what entities each category tracked.

These were some of the data sources we tried working with but eventually decided against. For these sources, we discovered that fixing/cleaning the issues would take more work than retrieving the data from elsewhere.

  1. Kaggle
    This dataset contained all nominees and winners for all categories from 1927-2015. It contained data we did not want, including special, honorary and discontinued categories. It did not contain some of the data that we wanted, like current category names and data from 2016-2019. It also had several issues: category names differed from official names, and there were typos. The dealbreaker for this dataset was the errors in the name/film columns, where data would sometimes be in the wrong column. The only way to correct this was a manual review, which was not feasible given the size of the data.

  2. Datahub
    This dataset contained all nominees and winners for all categories from 1927-2017, and was the best compiled dataset we could find. Again, there were problems with inconsistent naming and a lack of 2018/2019 data. The dealbreaker with this dataset was that it only had one column for the winning entity, which seemed to be chosen arbitrarily. For example, in the acting categories, there was no data on what movie the actor won an award for.

  3. Wikipedia
    The data in Wikipedia was nearly complete and was the most up to date, but had several issues as well. There was no indication of what type of entity the winner was, film names were not always included, and the tables had merged cells that made scraping the information a challenge. Additionally, each page stored the nominees/winners in differently formatted tables. Despite this, Wikipedia was a great source for manually verifying changes in category names over time.

We discovered that there was no public dataset available with the data that we wanted apart from the official website. The time we spent trying to work with other sources helped us understand how we wanted our data to be organized, so that our final web scrape gave us the efficient and clear data we needed. Because of this, we didn’t need to do any further cleaning on our gathered data.

We downloaded HTML pages of the returned results from the official database to use for scraping; these are the HTML pages in data/. We did this because there did not appear to be a way to query the database directly using requests or other methods we knew, and because processing downloaded data was quicker.

We chose the 23 categories that are currently in use, excluding special achievement, honorary, discontinued and other special categories. For each nominee and winner, we collected:

  • year : Year(s) the ceremony took place in. From 1927-1933 each ceremony covered a two-year span; from then on, one year. Because these were inconsistent, we used show numbers in our analysis.

  • show : Show number of the award ceremony.

  • original_category : Original name of the category.

  • winner : 0 for nominee, 1 for winner.

  • film : Name of the associated film.

  • entity : Name of the winning entity associated with the film. These were inconsistent. For example, in the Best Picture category, the winning entity was changed from the production studio to individual producers in the 1950’s.

  • note : Additional information.
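The extraction step for these fields can be sketched with BeautifulSoup. The markup below is a hypothetical stand-in for a downloaded results page in data/ (the real Academy database pages use different tags and class names):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one result on a downloaded page;
# the actual tag and class names on the official database pages differ.
sample_html = """
<div class="result">
  <span class="year">1994 (67th)</span>
  <span class="category">ACTOR IN A LEADING ROLE</span>
  <span class="film">Forrest Gump</span>
  <span class="entity">Tom Hanks</span>
  <span class="winner">*</span>
</div>
"""

def parse_results(html):
    """Extract one row per nominee with the fields listed above."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for div in soup.select("div.result"):
        year_text = div.select_one("span.year").get_text(strip=True)
        year, show = year_text.split(" ")  # e.g. "1994", "(67th)"
        rows.append({
            "year": year,
            "show": int(show.strip("()thstndrd")),  # "(67th)" -> 67
            "original_category": div.select_one("span.category").get_text(strip=True),
            "film": div.select_one("span.film").get_text(strip=True),
            "entity": div.select_one("span.entity").get_text(strip=True),
            "winner": 1 if div.select_one("span.winner") else 0,
        })
    return rows

rows = parse_results(sample_html)
```

The same loop runs once per saved page, appending every result `div` it finds into one table.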

image1

2. Film Production and Basic Information

2.1. Introduction

The second type of data we collected was basic information for all movies. Additional sources, methods, reasoning and code can be found in the Unused Code Notebook unused.ipynb.

We started off by collecting information for Best Picture nominees and winners using the IMDbPy library. We soon discovered that this package ran slowly, taking roughly 3 seconds per movie. This was alright for our preliminary data gathering but would be too slow to use for additional categories. The multiple-hour wait and the lack of support for caching led us to look elsewhere.

We considered web scraping IMDb, but quickly found that its search feature lacked the functionality we wanted. For example, when we searched by title and type of media (TV, movie, documentary, etc) we couldn’t specify the year. When we searched by title and year, the title needed to be an exact match or no results would appear. Any scraping we did would require too much manual verification and resulted in more errors than we were comfortable with, so we looked at other methods.

Our final method uses a combination of the TMDB and OMDb databases, as well as The Numbers. We found that the TMDB search API was the most reliable way to retrieve movie IDs by name, and that OMDb provided us with the best movie information given the movie ID. OMDb gave us all the information available on IMDb (genres, votes, runtime, plot, cast, etc), but also included Rotten Tomatoes and Metacritic ratings. A web scrape of The Numbers gave us budget and box office figures.

First we compiled a list of all distinct movies from every Oscar category. We found that there were 4706 distinct movies.

2.2. TMDB Data

Next we wrote a function to retrieve TMDB movie IDs using the movie names. TMDB limited us to 40 requests per 10 seconds, and it took an average of 2 requests to get each ID (searching multiple years). There were 4706 unique movies and we waited 0.5s between each movie to stay within the rate limit, so it took roughly 40 minutes to retrieve all the TMDB IDs.
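A minimal sketch of this lookup, assuming TMDB's v3 search endpoint (the API key is a placeholder, and the throttle math mirrors the 40-requests-per-10-seconds limit described above):

```python
import time
import requests

TMDB_SEARCH_URL = "https://api.themoviedb.org/3/search/movie"

def throttle_delay(max_requests, window_s, requests_per_item):
    """Seconds to sleep between items so we stay under the rate limit."""
    return window_s / max_requests * requests_per_item

def tmdb_movie_id(title, year, api_key):
    """Return the first TMDB ID matching the title/year, or None."""
    resp = requests.get(TMDB_SEARCH_URL, params={
        "api_key": api_key, "query": title, "year": year,
    })
    results = resp.json().get("results", [])
    return results[0]["id"] if results else None

# ~2 requests per movie at 40 requests / 10 s -> 0.5 s between movies;
# over 4706 movies that is roughly 40 minutes of waiting.
delay = throttle_delay(40, 10, 2)
```

In the real run, each call to `tmdb_movie_id` was followed by `time.sleep(delay)`.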

After we ran the function, there were 51 cases where no movie was found. This was much better than our IMDb or IMDbPy methods, and represented a success rate of 1-(51/4706) = 0.989, or 98.9%.

We manually looked up the correct name for each of these cases and retrieved their TMDB IDs. We found that these cases consisted primarily of movies with abbreviations, foreign films and short films/documentaries.

Now that we had all the TMDB IDs for each movie, we could perform a TMDB request to get the IMDb ID using the TMDB ID.

It took 1 request to get each IMDb ID. At 40 requests per 10 seconds, we waited 0.25s between each movie, so it took roughly 20 minutes to get them all.
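This second lookup can be sketched against TMDB's external_ids endpoint, which maps a TMDB ID to its IMDb ID (again, the API key is a placeholder):

```python
import requests

API_BASE = "https://api.themoviedb.org/3"

def external_ids_url(tmdb_id):
    """URL of the endpoint that maps a TMDB ID to external IDs (incl. IMDb)."""
    return f"{API_BASE}/movie/{tmdb_id}/external_ids"

def imdb_id_from_tmdb(tmdb_id, api_key):
    """One request per movie; at 40 requests / 10 s we slept 0.25 s each."""
    resp = requests.get(external_ids_url(tmdb_id), params={"api_key": api_key})
    return resp.json().get("imdb_id")  # e.g. an ID of the form "tt0123456"
```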

2.3. OMDB Data

Now that we had all the IMDb IDs for each movie, we could use OMDb to get the data that we wanted for each movie. We paid for an API key so that we weren’t rate-restricted.

2.4. The Numbers Data

Here we scraped links to films on The Numbers using the film name. Then we scraped the budget and box office figures from The Numbers using the retrieved link.
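The scrape itself depends on The Numbers' page markup, but the figure-cleaning step is simple to sketch: the scraped budget and box office values arrive as dollar strings and need converting to integers (the helper name is ours):

```python
def parse_money(text):
    """Convert a scraped figure like '$55,000,000' to an integer of dollars."""
    if not text:
        return None
    cleaned = text.strip().lstrip("$").replace(",", "")
    return int(cleaned) if cleaned.isdigit() else None
```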

2.5. Combining Data

After we retrieved all the data we needed, we combined it into a dataframe and exported it as an Excel file for future imports.
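The combining step can be sketched with pandas, using toy rows in place of the real scraped tables (the frames are assumed to share the film name as a key; the file name is illustrative):

```python
import pandas as pd

# Toy stand-ins for the OMDb and The Numbers tables, keyed on film name.
omdbdf = pd.DataFrame({"film": ["Green Book", "Roma"],
                       "imdbRating": [8.2, 7.7]})
thenumbersdf = pd.DataFrame({"film": ["Green Book", "Roma"],
                             "budget": [23_000_000, 15_000_000]})

# Merge on the shared key and export for later notebooks to import.
moviedata = omdbdf.merge(thenumbersdf, on="film", how="left")
# moviedata.to_excel("data/moviedata.xlsx", index=False)  # requires openpyxl
```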

image2

Through the above dataframe, we can see we have collected the following data from OMDB:

  • omdbdata

    • Title, Year, Rated, Released, Runtime, Genre, Director, Writer, Actors, Plot, Language, Country, Awards, Poster, Metascore, imdbRating, imdbVotes, Rotten Tomatoes, imdbID, Type, DVD, BoxOffice, Production, Website

We can also see that we have collected the following data from The Numbers:

  • thenumbers

    • budget: Budget of the film

    • domestic: Domestic box office figures

    • international: International box office figures

    • worldwide: Worldwide box office figures

    • opening: Opening weekend box office figures

3. Demographics of Actors

3.1. NNDB Data

The third type of data we collected was personal information of actors in all acting categories (Best Actor, Best Supporting Actor, Best Actress, Best Supporting Actress). While there were a variety of sources for actors’ personal information, NNDB was the only source we could find that included ethnicity.

We did find one other potential data source, http://ethnicelebs.com/, which provided ancestry and ethnic descent along with sources. This would be a good source for further exploration as it goes more in-depth than we are looking for. Other sources we looked at before choosing NNDB included Google Knowledge Graph, WolframAlpha and Wikipedia.

We scraped NNDB for each actor and collected:

  • birthplace: City where the actor was born

  • born: Birthdate

  • children: Number of children

  • died: Date of death (if applicable)

  • gender

  • highschool

  • orientation: Sexual orientation

  • race: NNDB categorized as Asian/Indian, AmericanAborigine, White, Hispanic, Black, Multiracial, MiddleEastern, Asian

  • university: List of universities attended

  • link: URL to NNDB page

  • name

We retrieved the NNDB page given the actor’s name.

There were 1728 actors in total, and NNDB was missing 29 of them, which represented a 1-(29/1728) = 0.983 = 98.3% hit rate (found actors). We were able to manually input data for the missing actors by finding them on Wikipedia.

Then we scraped the actors’ information once we had the link to their page.

After we gathered all the information, we stored it as sheets in an Excel file for manual corrections and for future imports.
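Writing one sheet per acting category can be sketched with pandas' ExcelWriter (toy data and sheet names stand in for the real tables; a real run would write a file under data/ instead of an in-memory buffer):

```python
import io
import pandas as pd

# One frame per acting category (toy rows standing in for the scraped data).
sheets = {
    "best_actor": pd.DataFrame({"name": ["Rami Malek"], "gender": ["Male"]}),
    "best_actress": pd.DataFrame({"name": ["Olivia Colman"], "gender": ["Female"]}),
}

buffer = io.BytesIO()  # a real run would target e.g. data/actordata.xlsx
with pd.ExcelWriter(buffer) as writer:
    for sheet_name, frame in sheets.items():
        frame.to_excel(writer, sheet_name=sheet_name, index=False)

# Reading it back returns a dict of frames keyed by sheet name.
buffer.seek(0)
reloaded = pd.read_excel(buffer, sheet_name=None)
```

Passing `sheet_name=None` to `read_excel` is what later notebooks rely on to re-import every category at once.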

image3

3.2. Digression on Methodology

The information and categorizations provided by our sources are not 100% accurate, but they are the best data sources we could find for our situation. As a result, discretion is advised.

We reviewed dozens of articles and several studies, and emailed journalists, regarding methodology for gathering ethnic and other demographic trends in the Oscars and Hollywood in general. From what we found, the data for every analysis related to ethnicity was gathered manually.

This USC Annenberg study on demographics is one prominent source, and utilized a group of "71 students … recruited in the Fall of 2013" to manually evaluate 26,225 speaking characters over 6 years of movies for demographics. The students "underwent a rigorous classroom based training" that included "series of diagnostics designed to test their unitizing and variable coding ability" to ensure reliability. Similar methodologies were used in this UCLA study, this TIME labs analysis and other studies as well.

Like in the UN Ethnocultural characteristics data, we recognize that "specific ethnic and/or national groups … are dependent upon individual circumstances". Because of this, "basic criteria used should be clearly explained so that the meaning of the classification will be readily apparent".

We didn’t have the resources to perform manual categorization, so NNDB was our next best option. We utilized perceived/apparent ethnicity/race for our categorizations, similar to the US Census, but maintained a degree of flexibility.

For example:
Rami Malek, a first-generation American with Egyptian immigrant parents who self-identifies as culturally Egyptian, would be categorized as ‘White – North African’ in the US Census and ‘White’ in our dataset.

Additional reading:
https://genderedinnovations.stanford.edu/terms/race.html
https://unstats.un.org/unsd/demographic/sconcerns/popchar/popcharmethods.htm
https://en.wikipedia.org/wiki/Race_and_ethnicity_in_the_United_States

4. Summary and Further Exploration

In this section we compiled data on Oscar nominees and winners, movie information and actor information. These files can be found in the /data folder as oscardata, moviedata and actordata. In the next sections we’ll analyze this data and look for any trends.

If we wanted to continue with further exploration in this section, areas of focus could include getting ‘oscar buzz’ data using google ngram and gathering lists of all ‘eligible releases’ for each year.

II. Movie Analysis and Statistics

In this section, we address our first guiding question, "Are certain movie characteristics correlated to Oscar nominations/wins?"

Contents

  1. Setup and Data Processing
  2. Timeline of Category Names
  3. Preliminary Explorations
  4. OMDB Plots
  5. The Numbers Plots
  6. Further Exploration

1. Setup and Data Processing

First we import oscar data and movie data.

The dataframe moviedata contains all data for all movies that have ever been nominated in any category. To get moviedata for individual categories, we merge with oscardata.

We get omdbdf and thenumbersdf from moviedata. We then merge each with oscardata to get bestpicture_omdb and bestpicture_thenumbers.
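That merge can be sketched with pandas, using toy rows in place of the imported tables (column names follow the fields described in the data section):

```python
import pandas as pd

# Toy stand-ins: nominee rows from oscardata and movie rows from omdbdf.
oscardata = pd.DataFrame({
    "film": ["Green Book", "Roma", "The Favourite"],
    "original_category": ["Best Picture"] * 3,
    "winner": [1, 0, 0],
})
omdbdf = pd.DataFrame({"film": ["Green Book", "Roma", "The Favourite"],
                       "imdbRating": [8.2, 7.7, 7.5]})

# Restrict movie data to one category by merging with the nominee table.
bestpicture = oscardata[oscardata["original_category"] == "Best Picture"]
bestpicture_omdb = bestpicture.merge(omdbdf, on="film", how="left")
```

The same pattern yields bestpicture_thenumbers from thenumbersdf.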

image4

image5

2. Timeline of Category Names

While we were gathering data for Oscar winners/nominees, we needed to understand what deprecated categories corresponded to current categories, and how the names of the categories have changed. The plots below organize those categorizations into a comprehensive timeline for category names.

These are helpful for preliminary analysis and help us get a better understanding of the data we’re working with.

image6

image7

3. Preliminary Exploration

Here, we’ve included more graphs we made for exploring the data before we moved on.

In this first graph, we see that movie nominees tend to cluster around an IMDb Rating of 8.0 and around runtimes of 140 minutes.

image8

Next, we created word clouds for the plot summaries of all nominees and winners.

image

image

In these word plots of nominees and winners, we notice several things. Life, love, family and friend are prevalent in both. The plot summaries of best picture winners tend to feature ‘love’ more heavily.

In general, we see that these words mostly have to do with life in general and family, and we can assume that the majority of the best picture nominees and winners have to do with these topics.
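The word frequencies that drive a word cloud can be sketched with a plain counter over the plot summaries (the stop-word list here is abbreviated and illustrative):

```python
import re
from collections import Counter

# Abbreviated, illustrative stop-word list; a real run would use a fuller set.
STOPWORDS = {"a", "an", "the", "of", "his", "her", "and", "to", "in", "who"}

def plot_word_counts(summaries):
    """Count non-stopword words across a list of plot summaries."""
    words = []
    for summary in summaries:
        words += [w for w in re.findall(r"[a-z']+", summary.lower())
                  if w not in STOPWORDS]
    return Counter(words)

counts = plot_word_counts([
    "A man reflects on his life, love and family.",
    "Two friends discover the meaning of family and love.",
])
```

A word-cloud library then just maps these counts to font sizes.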

image

A quick analysis of the most frequent genres confirms our word cloud assumptions and tells us that the best picture award favors dramas, biographies and romance.

4. OMDB Plots

In this section, we do more in-depth explorations on movie data and take a look at correlations between our data and being a category winner. We explore the effects that release month, IMDb rating, IMDb votes, and runtime have on winning and how the data has changed over time.

image

The plot above tells us what month movie nominees and winners were released in for every year of the best picture category. The larger the blue dot, the more nominees were released in that month. At first glance, the plot appears to have randomly distributed data points. However, we can see that there are several trends in the data.

Best picture nominees in the past 10 years tended to be released around November, December and January. We see that the distribution of movie nominations tends to change and cluster in certain months for extended periods of time. December has remained a popular month since around show 30, and February gained popularity around show 50-70.

Although there are several clusters for nominees, the winners (red dots) tend to be randomly distributed, roughly following trends in release months of nominees. Ultimately there is no correlation between release month and wins in the best picture category.

Next, we plot IMDb ratings and votes.

image

This graph shows that best picture nominees and winners tend to hover around a rating of 7.5. Again, from the graph we can see that IMDb rating does not tend to affect win rate.

image

We begin to see a distinction between winning and being nominated in this plot. IMDb votes are community-submitted, so the general popularity of a movie is directly correlated with the number of votes it received. In the plot, winners for a given year typically had more votes than the nominees.

However, this trend appears to be changing, as winners in the past decade tend not to have as many IMDb votes as nominees. It’s impossible to determine from this data what’s causing the trends seen in this graph, but we have several ideas on why they are happening. The buzz surrounding an Oscar win could lead more people to watch, and rate, the movie. Additionally, IMDb votes may take time to build up, such that more recent movies will follow the trend going forward.

image

In this plot, we see that runtime does not affect nominations or wins, and that movies tend to cluster around the two-hour mark.

5. The Numbers Plots

Next, we created plots for financial information about the movies. This included budget, domestic and worldwide box office.

image

image

image

Looking at these for best picture nominees, we see that there is little to no correlation between winning and worldwide box office or budget. However, the domestic box office plot shows that winners have tended to out-perform their nominee counterparts over roughly the past 15 years. Again, there’s no way to tell, but this could be because winning the award leads to greater viewership afterwards.

6. Further Exploration

Given more time, we would like to continue looking for trends in the data, including Rotten Tomatoes ratings, deeper looks into plot summaries, and more categories.

IV. Actor Analysis and Statistics

In this section, we address our second guiding question, "Are certain actor demographics correlated to Oscar nominations/wins?"

Contents

  1. Setup and Data Processing
  2. Actor Ethnicity
  3. Actor Sexual Orientation
  4. Actor Age
  5. Actor Birthplaces
  6. Future Exploration

1. Setup and Data Processing

First, we import actor data from the sheets in actordata.xlsx. We also combine actor data from all acting categories into one dataframe, allactors_data.
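Stacking the per-category sheets into allactors_data can be sketched with pd.concat, using toy frames in place of the imported sheets (sheet names are illustrative):

```python
import pandas as pd

# Toy per-category frames standing in for the sheets of actordata.xlsx.
best_actor = pd.DataFrame({"name": ["Rami Malek"], "race": ["White"]})
best_actress = pd.DataFrame({"name": ["Olivia Colman"], "race": ["White"]})

# Tag each row with its category, then stack everything into one frame.
frames = {"best_actor": best_actor, "best_actress": best_actress}
allactors_data = pd.concat(
    [df.assign(category=name) for name, df in frames.items()],
    ignore_index=True,
)
```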

2. Actor Ethnicity

Here, we’ll explore trends in actor ethnicities for nominees and winners of acting categories. In the code below, we calculate a total number of actors that belong to each ethnic category. In total, there were 1728 actors. Next, we create the data for plotting ethnicity trends over time.
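That tally can be sketched with pandas' value_counts over the NNDB race field described earlier (toy rows stand in for the real 1728-actor frame):

```python
import pandas as pd

# Toy stand-in for allactors_data, with the NNDB 'race' field.
allactors_data = pd.DataFrame({
    "race": ["White", "White", "Black", "Asian", "White", "Hispanic"],
})

# Total number of nominated actors per ethnic category.
ethnicity_totals = allactors_data["race"].value_counts()
```

Grouping the same column by show number yields the per-ceremony series behind the plots below.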

image

image

image

The above graphs provide a look at the ethnicities of nominees and winners for each award show. We can see that nominees and winners have been predominantly white, but that ethnic representation has been increasing in recent years.

This was widely covered in the news in 2015 and 2016, when for two years in a row “all 20 actors nominated in the lead and supporting acting categories were white”. The Academy has since taken “dramatic steps to alter the makeup of our membership”, and in 2018 invited 928 new members to diversify its previously “94 percent white, 76 percent male” voting group.

Four years after the initial movement gained momentum, it is still too early to see if the Academy’s steps have had any effects.

There are a variety of other facts we can see from the plots. For example, no Asian actor has won an acting award in over 30 years, and all actors nominated from 1949-1954 were white.

3. Actor Sexual Orientation

Here, we take a look at sexual orientations of actors in all acting roles. The information comes primarily from self-identification, and many actors were unable to be categorized.

image

Below, we gather the data and plot sexual orientations over time.

image

image

4. Actor Age

Here, we plot the ages of nominated actors and see if age is correlated to chances of winning.
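Computing an actor's age from the NNDB birthdate and the ceremony year can be sketched as follows (the late-February ceremony date is an assumption; exact ceremony dates vary by year):

```python
from datetime import date

def age_at_ceremony(born, ceremony_year):
    """Actor's age at an (assumed) late-February ceremony in ceremony_year."""
    ceremony = date(ceremony_year, 2, 24)
    # Subtract one if the birthday hasn't happened yet that year.
    return ceremony.year - born.year - (
        (ceremony.month, ceremony.day) < (born.month, born.day)
    )

# e.g. an actor born 1981-05-12, at the 2019 ceremony:
age = age_at_ceremony(date(1981, 5, 12), 2019)
```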

image

We see that ages have been increasing slightly over time, hovering around the 40-year mark. However, there is no correlation between age and winning. We can see that there are several very young nominees in the sub-10 year range, and several nominees in their mid-80s.

5. Actor Birthplaces

A basic visualization of actor birthplaces can be found at https://plot.ly/~sanchit2407/0/usa-data/#/. We can see that although actors were born in a variety of places, there are several hotspots, including Los Angeles and New York City.

6. Future Exploration

In future explorations, we could do more in-depth analyses on actor birthplaces, education level and familial ties.