Data analysis of the Netflix shows - part 1

Date: 30/11/2020

Time: 09:30-11:30

The dataset

We have a dataset containing the TV shows and movies available on Netflix as of 2019 (the dataset is under a CC0 Public Domain license and was downloaded using the Kaggle API).
The dataset is in CSV format and is available in the GitHub repository, inside the directory dedicated to this lesson: netflix_titles.csv.

The CSV file contains the following columns:

show_id: Unique ID for every movie or TV show
type: Type of the show - "Movie" or "TV Show"
title: Title of the show
director: Director of the movie
cast: Actors involved in the show
country: Country where the show was produced
date_added: Date the show was added to Netflix
release_year: Actual release year of the show
rating: TV rating of the show (used for evaluating the content and reporting the suitability of television programs for children, teenagers, or adults)
duration: Total duration, in minutes (movies) or number of seasons (TV shows)

Project initialization (see also the GitHub repository)

Initialize your project from PyCharm following these instructions:

create a directory and name it netflix_analysis
create a python script inside netflix_analysis/ and name it main.py
create a directory inside netflix_analysis and name it data
download (if you have not done it yet) the netflix_titles.csv dataset and put it inside the netflix_analysis/data directory.
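
The layout described above can be double-checked with a short script. This is only a sketch: the helper name check_layout is hypothetical (it is not part of the lesson), and it assumes the directory and file names listed in the steps.

```python
from pathlib import Path


def check_layout(project_dir):
    """Return the expected project files that are missing under project_dir.

    check_layout is a hypothetical helper, not part of the lesson material.
    """
    root = Path(project_dir)
    expected = [
        root / "main.py",
        root / "data" / "netflix_titles.csv",
    ]
    # Report paths relative to the project root, with forward slashes.
    return [p.relative_to(root).as_posix() for p in expected if not p.exists()]


missing = check_layout("netflix_analysis")
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Project layout looks complete.")
```

An empty result means that both main.py and data/netflix_titles.csv are in place.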

Read the dataset

Open the netflix_analysis/main.py file and define the following functions:

a) Define a function named open_csv() which takes the path of a CSV file as its parameter f_path. The function must open and read the CSV file and return a matrix (a list of lists) representing the CSV contents, without the first row of the CSV (the header).

For example:
If the CSV file is:

name,age,nationality
Marco,22,italian
James,27,english
Ilaria,20,italian

Calling open_csv({PATH-TO-THE-CSV-FILE}) must return (every value is a string, since the csv module does not convert types):

[
    ["Marco","22","italian"],
    ["James","27","english"],
    ["Ilaria","20","italian"]
]
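
To experiment with the behaviour of csv.reader before touching the real dataset, the example CSV can be fed from memory. This is a minimal sketch, assuming an io.StringIO object as a stand-in for the file; the variable names (sample, rows, typed_rows) are illustrative only.

```python
import csv
import io

# In-memory stand-in for the example CSV file (same contents as the example).
sample = io.StringIO(
    "name,age,nationality\n"
    "Marco,22,italian\n"
    "James,27,english\n"
    "Ilaria,20,italian\n"
)

reader = csv.reader(sample, delimiter=",")
header = next(reader)  # consume the header row so it is not part of the result
rows = list(reader)    # every remaining row is a list of strings

# Optional step: convert the age column to int if numbers are needed.
typed_rows = [[name, int(age), nationality] for name, age, nationality in rows]

print(header)
print(rows)
print(typed_rows)
```

csv.reader yields every field as a string, so numeric columns such as age need an explicit int() conversion when numbers are wanted.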

b) Modify the previous function open_csv() so that it returns a list of dictionaries (instead of a matrix).

Hint:
Iterating over csv.DictReader({YOUR_FILE}) yields one dictionary per row, using the values of the header row as keys

For example:
If the CSV file is:

name,age,nationality
Marco,22,italian
James,27,english
Ilaria,20,italian

The function returns (again, every value is a string):

[
    {"name":"Marco","age":"22","nationality":"italian"},
    {"name":"James","age":"27","nationality":"english"},
    {"name":"Ilaria","age":"20","nationality":"italian"}
]
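
The hint can be tried out on the same example data. The sketch below again assumes an in-memory io.StringIO object in place of the file; records and typed_records are illustrative names, not part of the exercise.

```python
import csv
import io

# In-memory stand-in for the example CSV file.
sample = io.StringIO(
    "name,age,nationality\n"
    "Marco,22,italian\n"
    "James,27,english\n"
    "Ilaria,20,italian\n"
)

# DictReader uses the header row as keys, so there is no header to skip.
records = list(csv.DictReader(sample))

# Optional step: build copies of the records with age converted to int.
typed_records = [{**record, "age": int(record["age"])} for record in records]

print(records[0]["name"], records[0]["age"])
print(typed_records[0])
```

Accessing a field by name (records[0]["name"]) is the main advantage over the matrix version, where the column position has to be remembered.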

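Solution:

import csv

def open_csv(f_path):
    result = []
    # newline='' is what the csv module expects when opening a file;
    # encoding='utf-8' assumes the dataset is UTF-8 encoded
    with open(f_path, mode='r', newline='', encoding='utf-8') as csv_file:
        ## a: plain reader, skipping the header row
        #csv_reader = csv.reader(csv_file, delimiter=',')
        #next(csv_reader)
        ## b: each row becomes a dictionary keyed by the header names
        csv_reader = csv.DictReader(csv_file)
        for row in csv_reader:
            result.append(row)
    return result

netflix_data = open_csv("data/netflix_titles.csv")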

Descriptive statistics

a) Define a function netflix_types() which returns a tuple of two elements. Both elements of the tuple are lists of Netflix show IDs: the first list contains only the shows whose type is "Movie", the second only those whose type is "TV Show".

Solution:

def netflix_types():
    l_movies = []
    l_tvshow = []
    # netflix_data is the module-level list returned by open_csv()
    for row in netflix_data:
        if row["type"] == "Movie":
            l_movies.append(row["show_id"])
        elif row["type"] == "TV Show":
            l_tvshow.append(row["show_id"])
    return (l_movies, l_tvshow)
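
The same logic can be tried without loading the dataset by passing the rows in as a parameter. This is a sketch of a variant, not the lesson's solution: split_by_type and the sample rows (with made-up IDs) are hypothetical.

```python
def split_by_type(rows):
    """Return (movie_ids, tv_show_ids) for a list of row dictionaries.

    Hypothetical variant of netflix_types() that takes the rows as an
    argument instead of reading a module-level variable.
    """
    movies, tv_shows = [], []
    for row in rows:
        if row["type"] == "Movie":
            movies.append(row["show_id"])
        elif row["type"] == "TV Show":
            tv_shows.append(row["show_id"])
    return movies, tv_shows


# Made-up rows, only for illustration.
sample_rows = [
    {"show_id": "1", "type": "Movie"},
    {"show_id": "2", "type": "TV Show"},
    {"show_id": "3", "type": "Movie"},
]

movies, tv_shows = split_by_type(sample_rows)
print(movies, tv_shows)
```

Taking the rows as a parameter makes the function easy to test on a handful of hand-written rows; the global-variable version above remains the one the exercise asks for.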

b) Define a function netflix_countries() which returns a dictionary whose keys are all the different countries of the Netflix shows. The value for each country (key) is the list of the IDs of all the shows produced there. Note: some shows have more than one country; in this case the countries are separated by commas, e.g. "United States, India, South Korea".

Solution:

def netflix_countries():
    res_dict = dict()
    for row in netflix_data:
        # a show may list several comma-separated countries
        for country_value in row["country"].split(","):
            country_value = country_value.strip()
            if not country_value:
                continue  # skip shows with no country listed
            if country_value not in res_dict:
                res_dict[country_value] = []
            res_dict[country_value].append(row["show_id"])
    return res_dict
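
The grouping step can also be written with collections.defaultdict, which removes the explicit "key not yet present" check. The sketch below is one possible variant under stated assumptions: group_by_country and the sample rows are hypothetical, and empty country values are skipped.

```python
from collections import defaultdict


def group_by_country(rows):
    """Map each country to the list of show IDs produced there.

    Hypothetical variant of netflix_countries() that takes the rows as
    an argument.
    """
    by_country = defaultdict(list)
    for row in rows:
        for country in row["country"].split(","):
            country = country.strip()
            if country:  # ignore empty values (no country listed)
                by_country[country].append(row["show_id"])
    return dict(by_country)


# Made-up rows, only for illustration.
sample_rows = [
    {"show_id": "1", "country": "United States, India, South Korea"},
    {"show_id": "2", "country": "Italy"},
    {"show_id": "3", "country": "India"},
    {"show_id": "4", "country": ""},
]

countries = group_by_country(sample_rows)
print(countries)
```

Converting back to a plain dict before returning keeps lookups such as countries["Norway"] raising KeyError instead of silently creating an empty list.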

c) Using the previously defined functions, print the following information:

The titles of all the Netflix shows produced in Italy.
The number of Italian movies and the number of Italian TV shows.
True/False depending on whether there is at least one movie produced in "Finland".

Solution:

all_countries = netflix_countries()
show_types = netflix_types()

# 1) titles of all the shows produced in Italy
italian_shows = all_countries["Italy"]
for it_show_id in italian_shows:
    for show_row in netflix_data:
        if show_row["show_id"] == it_show_id:
            print(show_row["title"])
            break

# 2) number of Italian movies and number of Italian TV shows
count_movies = 0
count_tvshows = 0
for it_show_id in italian_shows:
    if it_show_id in show_types[0]:
        count_movies += 1
    if it_show_id in show_types[1]:
        count_tvshows += 1
print(count_movies, count_tvshows)

# 3) is there at least one movie produced in Finland?
found = False
if "Finland" in all_countries:
    l_show_ids = all_countries["Finland"]
    for show_id in l_show_ids:
        if show_id in show_types[0]:
            found = True
            break  # one movie is enough
print(found)
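
The solution above scans lists repeatedly, which is fine for this dataset but grows slow on larger ones. As a possible alternative, the three questions can be answered with a dictionary index and sets. This is only a sketch: the helper names and the sample data (made-up IDs and titles) are hypothetical.

```python
def titles_for(show_ids, rows):
    """Return the titles of the given show IDs, using a show_id -> title index."""
    title_by_id = {row["show_id"]: row["title"] for row in rows}
    return [title_by_id[i] for i in show_ids if i in title_by_id]


def count_by_type(show_ids, movie_ids, tv_ids):
    """Return (number of movies, number of TV shows) among show_ids."""
    movie_set, tv_set = set(movie_ids), set(tv_ids)
    n_movies = sum(1 for i in show_ids if i in movie_set)
    n_tv = sum(1 for i in show_ids if i in tv_set)
    return n_movies, n_tv


def has_movie(country, by_country, movie_ids):
    """True if at least one show of the country is a movie."""
    movie_set = set(movie_ids)
    return any(i in movie_set for i in by_country.get(country, []))


# Made-up data, only for illustration.
rows = [
    {"show_id": "s1", "title": "Film A", "type": "Movie"},
    {"show_id": "s2", "title": "Series B", "type": "TV Show"},
    {"show_id": "s3", "title": "Film C", "type": "Movie"},
]
by_country = {"Italy": ["s1", "s2"], "Finland": ["s2"]}
movie_ids, tv_ids = ["s1", "s3"], ["s2"]

italian_titles = titles_for(by_country["Italy"], rows)
italian_counts = count_by_type(by_country["Italy"], movie_ids, tv_ids)
finland_has_movie = has_movie("Finland", by_country, movie_ids)
print(italian_titles, italian_counts, finland_has_movie)
```

Membership tests on a set take roughly constant time, while "in" on a list scans the whole list; by_country.get(country, []) also covers countries that do not appear in the data at all.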
