When I decided to enroll in the Flatiron School and learn data science, I found myself in a bit of a career identity crisis. I’ve been living and breathing journalism since college. Was I leaving that behind?
My main goal as a Flatiron student is to use the technical skills I’m learning to open up career options that weren’t available to me before.
In truth, I don’t know if I’m leaving journalism behind or if I’ll be able to have a job where I’m both a data scientist and a journalist. The most important thing to me is that I’m able to use the aspects of both fields that I’m most interested in.
It turns out there’s significant overlap between the two. For both journalists and data scientists, it’s important to be able to communicate complicated or niche concepts to audiences who are less familiar with them. Journalists immerse themselves in worlds that are unfamiliar to their audiences in order to tell stories in familiar terms. Similarly, data scientists immerse themselves in data and code in order to draw insights, but they also have to break down their methods and insights for non-technical stakeholders.
I got to put this into practice in my first project for the Flatiron School. The (hypothetical) premise is that Microsoft wants to launch a movie studio and is consulting a data scientist on what kinds of movies are most successful.
Getting to this point wasn’t a walk in the park. In a matter of weeks, I went from nearly zero coding knowledge to being able to tackle this project with SQL, Python, and libraries like NumPy, Pandas, Matplotlib, and Seaborn. I also learned how to scrape websites, make API calls, and maintain a Git repository.
After immersing myself in that world, and after using those skills to draw insights for an imaginary version of Microsoft, I also had to present my methods and findings to a non-technical stakeholder (rather, a data science instructor pretending to be a non-technical stakeholder). This is where I was hoping my journalism background would be useful.
With that, below is the non-technical version of my analysis for (fake) Microsoft. This is the first of five projects I’ll do at Flatiron, so I still have a lot to learn. I may even look back at this project after I graduate and disagree with my own methods.
You can also check out the GitHub repository for my project here. It contains a Jupyter Notebook with all the Python code I used to do this analysis. It also has the slides to the non-technical presentation.
Microsoft’s Foray Into the Movie World
Flatiron School Data Science: Project 1
Author: Zaid Shoorbajee

Image source: Pixabay
Overview
This project analyzes data about the ratings and popularity of movies to make recommendations to Microsoft, which intends to launch its own movie studio. As a newcomer to the scene, Microsoft has asked for recommendations on what types of movies perform well among audiences. I have available to me movie datasets from Box Office Mojo, IMDb, Rotten Tomatoes, The Movie Database, and The Numbers. I derive my conclusions mainly from the IMDb datasets, which contain information about movies from 2010 to 2019, including genres, average user rating, and the number of users who voted on each movie. As a result of the analysis, I was able to distill 10 well-performing genres for Microsoft to focus on, as well as make recommendations about how much it should focus on making 1) comedies and 2) animated movies.
Business Problem
Measuring success: A first instinct might be to narrow down the attributes of movies that have the highest return on investment at the box office. However, in the streaming age, that might not be the best measure of success. Popular movies are increasingly being released directly to streaming services, and the COVID-19 pandemic has dissuaded many people from going to theaters. A better measure of success would be the number of people who will actually watch the movie. Whether Microsoft plans to sell its movies to distributors like Netflix or spin up its own streaming service to host the films, it needs to determine what kinds of movies are going to attract the most viewers.
I use the number of votes a movie has received on IMDb as an analogue for the number of viewers. The votes may be negative or positive, but we can infer that a vote means someone actually watched the film. Using this metric, I attempt to answer these questions:
- Which 10 genres tend to perform best?
- How much should Microsoft focus on making comedies?
- How much should Microsoft focus on making animated movies?

Image source: Pixabay
Data Understanding
IMDb is one of the most popular websites for basic facts about movies and TV shows, as well as user reviews. It claims to have nearly 600,000 movies listed and is ranked 75th in global internet engagement.
The data I’ve been provided is housed in a SQL file, from which I primarily use two tables:
- movie_basics: Contains information about each movie’s name, release year, runtime, and genres.
- movie_ratings: Contains a weighted average of all the individual user ratings and the number of votes a movie has received.
More information here.
The two tables have a shared column movie_id, which is a unique identifier for each movie. I grouped the movies by genre to see each genre’s average number of votes.
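Here is a minimal sketch of that join-and-group step, written in plain Python with the built-in sqlite3 module rather than the Pandas code in my notebook. The table names and movie_id come from the dataset described above; the other column names (primary_title, genres, numvotes, averagerating), the assumption that genres is a comma-separated string, and every row of data are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Toy in-memory stand-in for the real SQL file. All rows are invented,
# and the column names other than movie_id are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie_basics (movie_id TEXT PRIMARY KEY, primary_title TEXT, genres TEXT);
    CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
    INSERT INTO movie_basics VALUES
        ('m1', 'Movie A', 'Adventure,Fantasy'),
        ('m2', 'Movie B', 'Comedy'),
        ('m3', 'Movie C', 'Adventure,Comedy');
    INSERT INTO movie_ratings VALUES
        ('m1', 7.5, 900),
        ('m2', 6.0, 100),
        ('m3', 6.5, 300);
""")

# Join the two tables on their shared movie_id column.
rows = conn.execute("""
    SELECT b.genres, r.numvotes
    FROM movie_basics AS b
    JOIN movie_ratings AS r USING (movie_id)
""").fetchall()

# A movie with several genres counts once toward each of them.
votes_by_genre = defaultdict(list)
for genres, numvotes in rows:
    for genre in genres.split(","):
        votes_by_genre[genre].append(numvotes)

# Average number of votes per genre, highest first.
avg_votes = {g: sum(v) / len(v) for g, v in votes_by_genre.items()}
ranked = sorted(avg_votes.items(), key=lambda kv: kv[1], reverse=True)

for genre, avg in ranked:
    print(f"{genre}: {avg:.0f}")
```

Counting a multi-genre movie toward each of its genres is one possible design choice; a different choice, such as using only the first listed genre, could change the ranking.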
Measuring Success
I use number of votes as an indicator of a movie’s success. In the streaming age, this is arguably a better indicator of a movie’s popularity than return on investment at the box office.
I also found that number of votes and average rating are positively correlated. Thus, in choosing number of votes as our measure of success, we are reassured that it’s generally associated with a higher movie rating.
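For readers curious what "positively correlated" means in practice, below is a rough sketch of a Pearson correlation coefficient computed by hand. It is not the exact method from my notebook, and the vote counts and ratings are invented. A result above 0 means the two measures tend to rise together.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented sample: number of votes and average rating for five movies.
numvotes = [50, 400, 1200, 8000, 25000]
averagerating = [5.1, 5.8, 6.0, 6.9, 7.4]

r = pearson(numvotes, averagerating)
print(f"correlation: {r:.2f}")
```

A positive coefficient only says the two tend to move in the same direction. It does not say that more votes cause higher ratings.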
Results:
The top 10 movie genres in terms of average number of votes on IMDb are:
- Adventure
- Fantasy
- Sci-Fi
- Animation
- Mystery
- Western
- Action
- Crime
- Biography
- Romance

Of the top 10% best-performing movies, 2,281 out of 7,304 (31.22%) are comedies.

Of the top 10% best-performing movies, 287 out of 7,304 (3.93%) are animated.
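The two percentages above come from the same kind of calculation: keep the most-voted slice of movies, then count how many carry a given genre. Here is a minimal sketch of that idea in plain Python. The helper names (top_fraction, genre_share), the field names, and the 20 toy movies are all hypothetical, and the way the cutoff is rounded is an assumption rather than the notebook's exact rule.

```python
def top_fraction(movies, fraction=0.10):
    """Return the most-voted `fraction` of movies (always at least one)."""
    ranked = sorted(movies, key=lambda m: m["numvotes"], reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]

def genre_share(movies, genre):
    """Percentage of movies whose comma-separated genres include `genre`."""
    hits = sum(1 for m in movies if genre in m["genres"].split(","))
    return 100 * hits / len(movies)

# Invented data: 20 movies with steadily increasing vote counts.
# Even-numbered movies are comedies, odd-numbered ones are dramas.
movies = [
    {
        "movie_id": f"m{i}",
        "genres": "Comedy,Romance" if i % 2 == 0 else "Drama",
        "numvotes": i * 100,
    }
    for i in range(1, 21)
]

top = top_fraction(movies)            # top 10% of 20 movies is 2 movies
share = genre_share(top, "Comedy")    # share of comedies in that slice

print(f"{len(top)} movies in the top tier, {share:.2f}% comedies")
```

Passing a different fraction, such as 0.05 or 0.25, shows how sensitive the shares are to where the cutoff is drawn.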

Recommendations
In this analysis I attempted to determine the most successful movie genres as well as what proportions of movies are comedies or animated. I arrived at three recommendations for what kinds of movies Microsoft should make:
- Microsoft should focus its efforts on movies with some combination of these genres:
  - Adventure
  - Fantasy
  - Sci-Fi
  - Animation
  - Mystery
  - Western
  - Action
  - Crime
  - Biography
  - Romance
- Microsoft should focus about a third of its efforts on comedy movies.
- Microsoft should focus about 4 percent of its efforts on animated movies.
Further exploration
As I mentioned before, I came to these conclusions after dropping 90% of the data I had available to me. I justified this by saying that those were niche movies that barely a thousand or so people on the internet knew of, and that those weren’t the types of movies I’d want to analyze when making recommendations to a multibillion-dollar company like Microsoft.
I could further explore this by answering the same questions for the bottom 90% as I did for the top 10%. Knowing the differences between the tiers might lead to other helpful insights.
Having said that, the division between the two tiers is arbitrary; I could have looked at the top 5% or 25%. Another way to go about this would be to merge in a new dataset that told us which movie studio is behind each movie. I could then separate the data into movies made by established, big-budget studios and movies that are not. Looking at the differences among movies in those two tiers might give us different results and possibly lead to different recommendations.