Data analysis on Netflix shows

The dataset

The dataset netflix_titles.csv contains a list of the tv shows and movies available on Netflix as of 2019 (the dataset is under a CC0 Public Domain license and had been downloaded using kaggle APIs).

Contents

The CSV file contains the following columns:

  • show_id: Unique ID for every movie/tv-show
  • type: Type of the show - "Movie" or "TV Show"
  • title: Title of the show
  • director: Director of the movie
  • cast: Actors involved in the show
  • country: Country where the show was produced
  • date_added: Date it was added on Netflix
  • release_year: Actual Release year of the show
  • rating: TV Rating of the show (used for evaluating the content and reporting the suitability of television programs for children, teenagers, or adults)
  • duration: Total Duration - in minutes or number of seasons
  • listed_in: The category(s) of the show (e.g., Comedies, Independent Movies, Romantic Movies, etc.)
  • description: A description of the show

Project initialization

Init your project as follow:

  • create a directory and name it netflix_analysis
  • create a python script inside netflix_analysis/ and name it main.py
  • create a python script inside netflix_analysis/ and name it util.py
  • create a directory inside netflix_analysis/ and name it data
  • download (if you have not done it yet) the netflix_titles.csv dataset and put it inside the netflix_analysis/data directory.

1st Exercise – read the dataset

Open the netflix_analysis/util.py file and define the following functions:

1.a) csv_to_matrix() which takes the PATH of a CSV file f_path and returns a matrix (list of lists) representing the CSV contents without including the first row of the CSV (the header).

def csv_to_matrix(f_path):
    result = []
    with open(f_path, mode='r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        next(csv_reader)
        for row in csv_reader:
            result.append(row)
    return result

1.b) csv_to_list_of_dict() which takes the PATH of a CSV file f_path and returns a list of dictionaries which represents the CSV contents, such that each element of the list (i.e. a dictionary) represents a row of the CSV. The pairs key-value of the dictionary represents respectively the name of the column and the corresponding value in the cell.
Example:[ {"name": "James", "age": 25}, {"name":"David", "age": 28} ]
Hint: the function csv.DictReader(csv_file) reads a CSV file and returns a DictReader object, each element of the new object (i.e., a row) is represented as dictionary

def csv_to_list_of_dict(f_path):
    result = []
    with open(f_path, mode='r') as csv_file:
        csv_reader = csv.DictReader(csv_file)
        for row in csv_reader:
            result.append(row)
    return result

2nd Exercise - statistics on Italian shows

Open the netflix_analysis/main.py file and write a portion of code which prints the following informations:

  • a) The titles of all the Netflix shows produced in Italy.
  • b) The total number of Italian movies and the total number of Italian tv shows.
  • c) The number of Italian shows for each different year of their release

import csv
from util import *
netflix_data = csv_to_list_of_dict("netflix_analysis/data/netflix_titles.csv")
# a ----
italian_shows = []
for row in netflix_data:
    if "Italy" in row["country"].split(", "):
        italian_shows.append(row)
for ita_show in italian_shows:
    print(ita_show["title"])
# b ----
count_movies = 0
count_tvshows = 0
for ita_show in italian_shows:
    show_type = ita_show["type"]
    if show_type == "Movie":
        count_movies += 1
    elif show_type == "TV Show":
        count_tvshows += 1
print(count_movies, count_tvshows)
# c ----
ita_shows_by_year = dict()
for ita_show in italian_shows:
    release_year = ita_show["release_year"]
    if release_year not in ita_shows_by_year:
        ita_shows_by_year[release_year] = 0
    ita_shows_by_year[release_year] += 1

3rd Exercise - general statistics

Open the netflix_analysis/main.py file and write a portion of code which:

  • a) Detects the director with the higher number of shows and prints his name
  • b) Creates an index that contains all the countries, and for each country the number of shows classified according to the genere of the show (e.g., Comedies, Documentaries, etc)

from util import *
netflix_data = csv_to_list_of_dict("netflix_analysis/data/netflix_titles.csv")
# a ----
directors_index = dict()
max_val = 0
max_name = None
for row in netflix_data:
    for director in row["director"].split(", "):
        # detect the empty values
        if director == "":
            continue
        if director not in directors_index:
            directors_index[director] = 0
        directors_index[director] += 1
        # check if the new value is higher than max
        new_val = directors_index[director]
        if new_val > max_val:
            max_name, max_val = director, new_val
print(max_name, max_val)
# b ----
country_cat_index = dict()
for row in netflix_data:
    for country in row["country"].split(", "):
        # detect the empty values
        if country == "":
            continue
        # The value is OK, so ...
        if country not in country_cat_index:
            country_cat_index[country] = dict()
        for cat in row["listed_in"].split(", "):
            # detect the empty values
            if cat == "":
                continue
            # The value is OK, so ...
            if cat not in country_cat_index[country]:
                country_cat_index[country][cat] = 0
            country_cat_index[country][cat] += 1