Data analysis of the Netflix shows - part 1

Date: 30/11/2020
Time: 09:30-11:30

The dataset

We have a dataset containing the TV shows and movies available on Netflix as of 2019 (the dataset is under a CC0 Public Domain license and was downloaded using the Kaggle API).
The dataset is in CSV format and is available in the GitHub repository, inside the directory dedicated to this lesson: netflix_titles.csv.

Contents

The CSV file contains the following columns:

  • show_id: Unique ID of every movie or TV show
  • type: Type of the show, either "Movie" or "TV Show"
  • title: Title of the show
  • director: Director of the show
  • cast: Actors involved in the show
  • country: Country where the show was produced
  • date_added: Date the show was added to Netflix
  • release_year: Actual release year of the show
  • rating: TV rating of the show (used for evaluating the content and reporting the suitability of television programs for children, teenagers, or adults)
  • duration: Total duration, in minutes or in number of seasons

Project initialization (see also the GitHub repository)

Initialize your project from PyCharm following these instructions:

  • create a directory and name it netflix_analysis
  • create a Python script inside netflix_analysis/ and name it main.py
  • create a directory inside netflix_analysis and name it data
  • download (if you have not done it yet) the netflix_titles.csv dataset and put it inside the netflix_analysis/data directory.
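
For those who prefer to set up the layout from a script rather than from the PyCharm interface, the steps above can be sketched with pathlib. This is only a minimal sketch: the helper name init_project and its base_dir parameter are hypothetical, and the dataset still has to be downloaded and copied into data/ by hand.

```python
from pathlib import Path


def init_project(base_dir="."):
    """Create netflix_analysis/, netflix_analysis/data/ and an empty main.py.

    init_project and base_dir are hypothetical names used for illustration.
    """
    project = Path(base_dir) / "netflix_analysis"
    data_dir = project / "data"
    # parents=True also creates netflix_analysis/; exist_ok avoids errors on re-runs
    data_dir.mkdir(parents=True, exist_ok=True)
    # touch() creates main.py if it is missing and does not alter existing content
    (project / "main.py").touch(exist_ok=True)
    return project


if __name__ == "__main__":
    print(init_project())
```

Running the script twice is harmless, since both calls tolerate directories and files that already exist.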

Read the dataset

Open the netflix_analysis/main.py file and define the following functions:

a) Define a function named open_csv() which takes the path of a CSV file, f_path, as its argument. The function must open and read the CSV file and return a matrix (a list of lists) representing the contents of the CSV, excluding its first row (the header).

For example:
If the CSV file is:
name,age,nationality
Marco,22,italian
James,27,english
Ilaria,20,italian
Calling open_csv({PATH-TO-THE-CSV-FILE}) must return (the csv module reads every value as a string, so the ages are not converted to numbers):
[
    ["Marco","22","italian"],
    ["James","27","english"],
    ["Ilaria","20","italian"]
```
]
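
To experiment with the header-skipping step without touching the file system, the example above can be reproduced on an in-memory string. This is a minimal sketch, not the reference solution: the names sample_csv and read_rows are hypothetical, and io.StringIO stands in for the opened file.

```python
import csv
import io

# The sample CSV from the example, held in memory instead of in a file
sample_csv = (
    "name,age,nationality\n"
    "Marco,22,italian\n"
    "James,27,english\n"
    "Ilaria,20,italian\n"
)


def read_rows(csv_file):
    """Return the rows of an open CSV file as a list of lists, without the header."""
    reader = csv.reader(csv_file, delimiter=",")
    # Consume the header row; the default None avoids an error on an empty file
    next(reader, None)
    return [row for row in reader]


rows = read_rows(io.StringIO(sample_csv))
print(rows)
# [['Marco', '22', 'italian'], ['James', '27', 'english'], ['Ilaria', '20', 'italian']]
```

Every field comes back as a string; if a numeric age is needed, it has to be converted explicitly, for instance with int(row[1]).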

b) Modify the previous function open_csv() so that it returns a list of dictionaries (instead of a matrix).

Hint:
The class csv.DictReader({YOUR_FILE}) reads each row of the CSV file as a dictionary whose keys are taken from the header row
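
The behaviour described in the hint can be checked on a small in-memory CSV. The sketch below is illustrative only: sample_csv and records are hypothetical names, and io.StringIO replaces the file object that open() would provide.

```python
import csv
import io

# A small CSV held in memory; the first line is the header
sample_csv = "name,age,nationality\nMarco,22,italian\nJames,27,english\n"

reader = csv.DictReader(io.StringIO(sample_csv))
# The keys of every row dictionary come from the header row
print(reader.fieldnames)  # ['name', 'age', 'nationality']

# dict(row) gives plain dictionaries on every Python 3 version
records = [dict(row) for row in reader]
print(records[0]["name"])  # Marco
print(records[1])  # {'name': 'James', 'age': '27', 'nationality': 'english'}
```

The header row is consumed automatically, so no call to next() is required, and all values are still strings.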

For example:
If the CSV file is:
name,age,nationality
Marco,22,italian
James,27,english
Ilaria,20,italian
The function returns (all values are strings):
[
    {"name":"Marco","age":"22","nationality":"italian"},
    {"name":"James","age":"27","nationality":"english"},
    {"name":"Ilaria","age":"20","nationality":"italian"}
]

Solution:
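import csv

def open_csv(f_path):
    result = []
    # newline='' is recommended by the csv module documentation;
    # encoding='utf-8' assumes the dataset file is UTF-8 encoded
    with open(f_path, mode='r', newline='', encoding='utf-8') as csv_file:
        ## a) matrix version: skip the header, then collect each row as a list
        #csv_reader = csv.reader(csv_file, delimiter=',')
        #next(csv_reader)
        ## b) dictionary version: the header row provides the keys
        csv_reader = csv.DictReader(csv_file)
        for row in csv_reader:
            result.append(row)
    return result

netflix_data = open_csv("data/netflix_titles.csv")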

Descriptive statistics

a) Define a function netflix_types() which returns a tuple of two elements. Both elements of the tuple are lists of Netflix show IDs: the first contains only the shows whose type is "Movie", while the second contains only the shows whose type is "TV Show".

Solution:
def netflix_types():
    l_movies = []
    l_tvshow = []
    for row in netflix_data:
        # route each show id to the list matching its type
        if row["type"] == "Movie":
            l_movies.append(row["show_id"])
        elif row["type"] == "TV Show":
            l_tvshow.append(row["show_id"])
    return (l_movies, l_tvshow)
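
The solution reads the global netflix_data, which makes it hard to try out without the real file. A possible variant, sketched here under that assumption, takes the rows as a parameter so it can be checked on a few hand-written records. The function name split_by_type and the sample rows are hypothetical.

```python
def split_by_type(rows):
    """Return (movie_ids, tv_show_ids) for a list of row dictionaries."""
    movies, tv_shows = [], []
    for row in rows:
        if row["type"] == "Movie":
            movies.append(row["show_id"])
        elif row["type"] == "TV Show":
            tv_shows.append(row["show_id"])
    return movies, tv_shows


# Hand-written rows (made-up ids) with only the two columns the function needs
sample_rows = [
    {"show_id": "1", "type": "Movie"},
    {"show_id": "2", "type": "TV Show"},
    {"show_id": "3", "type": "Movie"},
]

movies, tv_shows = split_by_type(sample_rows)
print(movies)    # ['1', '3']
print(tv_shows)  # ['2']
```

Calling split_by_type(netflix_data) would then give the same result as netflix_types().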

b) Define a function netflix_countries() which returns a dictionary whose keys are all the different countries of the Netflix shows. The value of each country (key) is the list of the IDs of all the shows related to it. Note: some shows have more than one country; in this case the countries are separated by commas, e.g. "United States, India, South Korea".
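
The note about multiple countries comes down to splitting one string into several names. The sketch below shows one way to do it, assuming (this is not verified against the dataset here) that some rows may have an empty country field or irregular spacing around the commas. The helper name split_countries is hypothetical.

```python
def split_countries(country_field):
    """Split a comma-separated country field into a list of clean names."""
    names = []
    for part in country_field.split(","):
        name = part.strip()  # drop the spaces left around each name
        if name:             # skip empty pieces, e.g. from an empty field
            names.append(name)
    return names


print(split_countries("United States, India, South Korea"))
# ['United States', 'India', 'South Korea']
print(split_countries("Italy"))  # ['Italy']
print(split_countries(""))       # []
```

Splitting on "," and stripping each piece is slightly more tolerant than splitting on ", ", which would keep an empty string as a country name when the field is empty.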

Solution:
def netflix_countries():
    res_dict = dict()
    for row in netflix_data:
        for country_value in row["country"].split(", "):
            if country_value not in res_dict:
                res_dict[country_value] = []
            res_dict[country_value].append(row["show_id"])
    return res_dict
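
The "create the list if the key is missing" pattern used in the solution can also be written with collections.defaultdict from the standard library. This is only an alternative sketch: group_ids_by_country and the sample rows are hypothetical, and the function takes the rows as a parameter instead of reading netflix_data.

```python
from collections import defaultdict


def group_ids_by_country(rows):
    """Map each country to the list of ids of the shows related to it."""
    by_country = defaultdict(list)  # missing keys start as an empty list
    for row in rows:
        for country in row["country"].split(", "):
            by_country[country].append(row["show_id"])
    # Convert back to a plain dict so that missing keys raise KeyError again
    return dict(by_country)


# Hand-written rows (made-up ids) with only the columns the function needs
sample_rows = [
    {"show_id": "1", "country": "Italy"},
    {"show_id": "2", "country": "United States, Italy"},
]

grouped = group_ids_by_country(sample_rows)
print(grouped["Italy"])          # ['1', '2']
print(grouped["United States"])  # ['2']
```

Both versions build the same dictionary; the defaultdict one only removes the explicit membership test.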

c) Using the previously defined functions, print the following information:

  • The titles of all the Netflix shows produced in Italy.
  • The number of Italian movies and the number of Italian TV shows.
  • True/False, depending on whether there is at least one movie produced in "Finland".

Solution:
# Titles of all the shows produced in Italy
all_countries = netflix_countries()
italian_shows = all_countries["Italy"]
for it_show_id in italian_shows:
    for show_row in netflix_data:
        if show_row["show_id"] == it_show_id:
            print(show_row["title"])
            break


# Number of Italian movies and number of Italian TV shows
show_types = netflix_types()
count_movies = 0
count_tvshows = 0
for it_show_id in italian_shows:
    if it_show_id in show_types[0]:
        count_movies += 1
    if it_show_id in show_types[1]:
        count_tvshows += 1
print(count_movies, count_tvshows)


# Is there at least one movie produced in Finland?
found = False
if "Finland" in all_countries:
    for show_id in all_countries["Finland"]:
        if show_id in show_types[0]:
            found = True
            break  # one match is enough
print(found)
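
The solution scans netflix_data once per Italian show and searches plain lists for every id, which becomes slow as the data grows. A possible speed-up, sketched here on hand-written data, is to build a lookup dictionary and a set first. The helper names titles_for_ids and count_in, and all sample values, are hypothetical.

```python
def titles_for_ids(rows, show_ids):
    """Return the titles of the given show ids using one id -> row lookup table."""
    by_id = {row["show_id"]: row for row in rows}
    return [by_id[show_id]["title"] for show_id in show_ids]


def count_in(show_ids, id_list):
    """Count how many of show_ids appear in id_list (turned into a set for fast lookup)."""
    id_set = set(id_list)
    return sum(1 for show_id in show_ids if show_id in id_set)


# Made-up rows and ids standing in for netflix_data and the function outputs
sample_rows = [
    {"show_id": "1", "title": "A", "type": "Movie"},
    {"show_id": "2", "title": "B", "type": "TV Show"},
    {"show_id": "3", "title": "C", "type": "Movie"},
]
italian_ids = ["1", "2"]  # stands in for netflix_countries()["Italy"]
movie_ids = ["1", "3"]    # stands in for netflix_types()[0]
tv_ids = ["2"]            # stands in for netflix_types()[1]

print(titles_for_ids(sample_rows, italian_ids))  # ['A', 'B']
print(count_in(italian_ids, movie_ids), count_in(italian_ids, tv_ids))  # 1 1
print(count_in(["3"], movie_ids) > 0)  # True
```

The last line mirrors the "Finland" check: a count greater than zero means that at least one of the given ids belongs to a movie.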