In-Depth Movie Analysis

Nick K.
7 min read · Sep 18, 2020
Drawing by Karl Addison

Recently, I joined Flatiron’s Data Science Full-Time bootcamp with aspirations of becoming a data scientist. The program has been outstanding so far: the instructors are top-notch, my peers are extremely bright and motivating, and the course work is challenging in the best way possible.

As we concluded our first module, the skills and knowledge we had gained allowed us to work on our first complete data science project. The objective was fairly open-ended: we were given multiple data sets and tasked with extracting useful, actionable insights for a studio entering the movie-making space.

The datasets contained over 5,000 titles and spanned more than a century, with films released between 1915 and 2020.

My initial thought was to convert all of the dollar amounts, including production budgets and gross revenue numbers, to a standard value. Given that $1 in 1915 is equivalent to nearly $26 in 2020 terms, this would allow for a more accurate comparison between films from different years.

Before I could begin converting to a standard dollar value, I first had to clean the columns with dollar amounts to remove characters such as “$” and “,”. This turns the column values, previously stored as strings, into a much more workable integer format.

raw dataframe prior to cleaning/conversion

The function I used built a list of digit characters and dropped every character not contained in that list. Passing the columns of dollar strings through the function returned them in integer format.
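A minimal sketch of that kind of cleaning function is below. It is not the original code: the name `dollars_to_int` and the sample value are illustrative assumptions.

```python
# Characters to keep; everything else ("$", ",", spaces, ...) is dropped.
DIGITS = set("0123456789")


def dollars_to_int(value: str) -> int:
    """Convert a dollar string such as "$425,000,000" to an integer."""
    kept = "".join(ch for ch in value if ch in DIGITS)
    # int("") raises ValueError, so a string with no digits fails loudly
    # instead of silently becoming 0.
    return int(kept)


cleaned = dollars_to_int("$425,000,000")
print(cleaned)
```

With pandas, a function like this could be applied to a whole column, for example `df[col].apply(dollars_to_int)`, where `df` and `col` stand in for the actual dataframe and column name.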

After successfully cleaning the columns, the next step in accounting for inflation was to find a table with inflation values online for the years 1915–2020. It turns out USinflationcalculator.com had the perfect table for this purpose.

credit: USinflationcalculator.com

Using the Python package Beautiful Soup, I was able to scrape this website and turn the HTML table into a pandas dataframe.
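The general pattern for turning an HTML table into a dataframe can be sketched as follows. To keep the example self-contained, it parses a small inline table rather than the live page; the markup, column names, and numbers are made-up placeholders, not the site's real structure or data.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Placeholder markup with made-up values, standing in for the downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Year</th><th>Rate</th></tr>
  <tr><td>1915</td><td>1.5</td></tr>
  <tr><td>1916</td><td>2.5</td></tr>
</table>
"""


def html_table_to_dataframe(html: str) -> pd.DataFrame:
    """Parse the first <table> in the HTML into a numeric dataframe."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table").find_all("tr")
    # The first row holds the header cells; the remaining rows hold data cells.
    header = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    records = [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in rows[1:]
    ]
    frame = pd.DataFrame(records, columns=header)
    # Every cell is scraped as text, so convert each column to numbers.
    return frame.apply(pd.to_numeric)


inflation = html_table_to_dataframe(SAMPLE_HTML)
print(inflation)
```

For a real page, the HTML string would first have to be downloaded, and the row and cell selection would need adjusting to match that page's actual markup.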

By creating this dataframe, I was able to find a mean annual inflation rate of 3.22% for the years 1915–2020. Next, I created a dictionary with years as keys and the mean inflation as values. The next step was to apply this dictionary to the main dataframe.

The dictionary started at index 0 for 1915 and worked forward, with 1916 being 1, 1917 being 2, 1918 being 3, and so on. For each year, the dollar values were multiplied by an exponential decay factor in which the year’s index was the exponent.
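One way to express this adjustment is sketched below, assuming the decay factor is 1 / (1 + rate) raised to the year's index. The names `year_index` and `to_base_year_dollars` are illustrative, not taken from the original code.

```python
BASE_YEAR = 1915
MEAN_INFLATION = 0.0322  # mean annual rate reported above (3.22%)

# year -> years elapsed since the base year (1915 -> 0, 1916 -> 1, ...)
year_index = {year: year - BASE_YEAR for year in range(BASE_YEAR, 2021)}


def to_base_year_dollars(amount: float, year: int) -> float:
    """Deflate an amount from `year` dollars to 1915 dollars."""
    n = year_index[year]
    # Dividing by (1 + rate)**n is the same as multiplying by the
    # decay factor (1 + rate)**(-n).
    return amount / (1 + MEAN_INFLATION) ** n


print(to_base_year_dollars(1_000_000, 2020))
```

Using a single mean rate for every year is a simplification: compounding the actual year-by-year rates would give somewhat different values.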

Now that the dollar values in the dataframe have been standardized by converting them to 1915 dollar values, it’s time to dig further into the analysis.

The first question in my analysis was:

“Is there a correlation between the revenue of a movie and its budget?”

By creating a scatterplot of the two variables, I found a weak linear relationship between revenue and production budget. The Pearson correlation coefficient (r) between the two variables was 0.48, which indicates a positive but weak-to-moderate correlation.
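For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up toy numbers, which are not values from the movie dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy budget and revenue figures, for illustration only.
budgets = [10, 20, 30, 40]
revenues = [12, 35, 28, 60]
r = pearson_r(budgets, revenues)
print(round(r, 2))
```

In pandas, `Series.corr` uses the Pearson method by default, so calling it on the budget and revenue columns should give the same kind of figure.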

My next question was,

“Do movies typically generate more revenue domestically or in foreign markets?”

I plotted total revenue against each of these two markets, expecting that the market with the stronger relationship to total revenue would be the one contributing more to it.

Interestingly enough, foreign gross had the stronger linear relationship with revenue. I later discovered that domestic gross made up about 62% of a movie’s revenue on average, compared with 38% for foreign gross. As the larger portion of revenue, domestic gross likely had greater variation, which would lead to a weaker linear correlation with revenue.

Question 3: “Does a movie’s release month affect its performance?”

At first I sorted the movies by month and calculated the total revenue generated for each month.

This led to the conclusion that December, followed by June and May, was the most profitable month in which to release a movie. However, after investigating the number of releases in each month, I found a strong bias towards December, which had more than twice as many releases as January, the month with the fewest.

So I decided a better metric to gauge the best month for movie releases was to take a mean revenue of the movies within each month. After calculating an average revenue, I found that the top 3 performing months for movie releases were:

  1. May
  2. June
  3. July
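The difference between ranking months by total revenue and by mean revenue can be sketched with pandas as follows. The column names `release_date` and `worldwide_gross` and all of the figures are made up for illustration.

```python
import pandas as pd

# Toy data: December has the most releases, May has one large release.
films = pd.DataFrame({
    "release_date": ["2019-12-01", "2019-12-15", "2019-12-20",
                     "2019-05-03", "2019-01-11"],
    "worldwide_gross": [100, 50, 30, 150, 20],
})
films["release_month"] = pd.to_datetime(films["release_date"]).dt.month_name()

# One row per month: summed revenue, mean revenue, and number of releases.
by_month = films.groupby("release_month")["worldwide_gross"].agg(
    total="sum", mean="mean", releases="count"
)

print(by_month.sort_values("mean", ascending=False))
```

In this toy data, December leads on total revenue only because it has three releases, while May leads on mean revenue per film.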

Next, I decided to examine

“Which years had the highest-grossing films?”

to see if a certain movie-making style or an era of actors had contributed to a more profitable period in cinema.

I found that movie-making really began to accelerate in the 1970s and peaked around 2015, which saw the release of nearly 350 films.

When I calculated total gross revenue by year, the large number of films released in these later years again skewed the totals towards the second half of the graph.

Again, I determined a more effective metric might be to take the mean of gross revenue for each year, to see which years were most profitable on average.

What I discovered was very odd: the most profitable years, when accounting for the number of films released, appeared to be in the earlier half of the century.

This was due in large part to the low number of films released (some of the years saw the release of only 1 or 2 films).
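One way to guard against this small-sample effect is to keep the film count next to each yearly mean and only compare years with enough releases. The sketch below uses made-up figures, and the `MIN_FILMS` threshold is an arbitrary assumption, not a value used in the original analysis.

```python
import pandas as pd

MIN_FILMS = 3  # hypothetical cut-off for a "reliable" yearly average

# Toy data: one hit in 1939 versus one hit and three weak films in 2015.
films = pd.DataFrame({
    "release_year": [1939, 2015, 2015, 2015, 2015],
    "worldwide_gross": [400, 900, 50, 30, 20],
})

# Mean gross per year alongside the number of films behind that mean.
by_year = films.groupby("release_year")["worldwide_gross"].agg(
    mean_gross="mean", n_films="count"
)

# Drop years whose average rests on too few films.
reliable = by_year[by_year["n_films"] >= MIN_FILMS]

print(by_year)
print(reliable)
```

Here 1939 has the highest mean gross, but that mean rests on a single film, so it drops out once the threshold is applied.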

This era saw the release of many films that remain popular today, such as:

  - Snow White and the Seven Dwarfs
  - The Wizard of Oz
  - Pinocchio
  - Bambi
  - Gone With The Wind
  - Cinderella
  - Peter Pan
  - Alice in Wonderland

Later years, by contrast, may have had only a few box office hits among hundreds of “dud” films, lowering their averages.

My final question was,

“What is the trend in production budgets, and are films that cost more to produce generating more revenue?”

When examining the total production budgets by year, I also found higher totals in the later years, which produced many more films on an absolute basis.

Adjusting for the number of movies produced by year gave a better overview of how production budgets have trended over time.

I found that movie budgets have generally trended higher but, for the most part, peaked in the late 1990s.

Hypothesizing why this may be led me to ask,

Are actors/actresses taking lower salaries due to increased competition for highly sought-after roles?

Is CGI replacing the need for expensive sets and special effects teams?

Is the art of filmmaking becoming decentralized, now that even film students can make high-quality films relatively inexpensively using cheaper cameras and software?

Summarizing my thoughts

Overall, I really enjoyed doing this project. It combined many of the skills I have been learning, such as coding, data cleaning, and the creation of visualizations, into a very tangible and realistic application. It allowed me to approach the data creatively and tested the limits of my fledgling coding abilities. I look forward to continuing to learn and grow through the Flatiron bootcamp as I continue my data science journey.

Until next time,

Nick Kolowich
