Data analysis of the Netflix shows - part 2

Date: 04/12/2020
Time: 12:30-14:30

Text analysis

Natural Language Processing (NLP) is about developing applications and services that can understand human languages. Nowadays millions of gigabytes of data are generated every day by blogs, social websites, and web pages, and analysing this data with NLP techniques can be very beneficial. Possible NLP applications include speech recognition/translation, semantic analysis of words, syntactical/grammatical analysis of sentences and paragraphs, and others. Text analysis is a common and central task in NLP. We will focus on some basic text analysis operations and how they are performed using Python.

Note:
If you are interested in further applications of NLP methods in Python (beyond the materials of this lesson), I suggest you read the following article: https://medium.com/towards-artificial-intelligence/text-mining-in-python-steps-and-examples-78b3f8fd913b

The NLTK library

The Natural Language Toolkit (NLTK) is a very popular library for natural language processing in Python, and it has a large and active community behind it. We need to install it:

pip3 install nltk

To check that the NLTK library has been correctly installed, we import it in our Python script:

import nltk

If everything looks fine, we will install/download the popular NLTK packages, i.e. the essential ones used for working with the basic NLTK operations. Write the following line in your Python script (after importing nltk):

nltk.download("popular")

Note:
In case running the above commands returns the error message "CERTIFICATE_VERIFY_FAILED", write only the following lines in your script and run it again.
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download("popular")
If you still get errors, please follow these instructions: https://www.xspdf.com/help/50406704.html

Tokenization and stopwords removal

The first thing we want to do is tokenization and stopwords removal.

Tokenization is the process of splitting a string into smaller pieces (tokens), such as words. The resulting tokens are then passed on to some other form of processing.

For example:

If we want to tokenize the sentence
"Hi I am Marco and I am a DHDK student"

Our list of tokens would be:
"Hi", "I", "am", "Marco", "and", "I", "am", "a", "DHDK", "student"

Stopwords are words that may not carry any valuable information, like articles (e.g. "the"), conjunctions (e.g. "and"), or prepositions (e.g. "with"). Stopwords removal is the process of filtering them out of a list of tokens.

For example:

If we want to remove the stop words from the list of tokens
"Hi", "I", "am", "Marco", "and", "I", "am", "a", "DHDK", "student"

Then our new list could be:
"Hi", "Marco", "DHDK", "student"

Toward a text analysis of the Netflix shows dataset (see also the GitHub repository)

Let's see an example in Python first:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "I want to tokenize and remove all the stopwords from this sentence"

# Tokenize the given sentence
word_tokens = word_tokenize(example_sent)

# Filter the stopwords
stop_words = set(stopwords.words('english'))
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(filtered_sentence)
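
If everything is installed correctly, the script should print something like ['I', 'want', 'tokenize', 'remove', 'stopwords', 'sentence']. The capitalised "I" survives because the NLTK stopword list is lowercase; a common refinement (not required for the exercises below) is to compare the lowercased token:

# Same filtering as above, but case-insensitive
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print(filtered_sentence)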

a) Define a function netflix_titles_tokens() which returns a collection of all the different tokens/words in the Netflix shows titles. The collection should not include stopwords.

Mark the box to see the solution
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def netflix_titles_tokens():
    stop_words = set(stopwords.words('english'))
    result = set()
    # netflix_data is assumed to be the collection of shows loaded from the dataset (see part 1)
    for show in netflix_data:
        # Tokenize the title and keep only the tokens that are not stopwords
        word_tokens = word_tokenize(show["title"])
        filtered_sentence = []
        for w in word_tokens:
            if w not in stop_words:
                filtered_sentence.append(w)
        result.update(filtered_sentence)
    return result

print(netflix_titles_tokens())

b) Define a function netflix_director_names() which returns a collection of all the tokens/words included in the director field of the Netflix shows dataset. Each token must be accompanied by a number representing its frequency in the dataset.

Hint:
Import FreqDist with from nltk.probability import FreqDist. Calling FreqDist({LIST_OF_TOKENS}) returns a dictionary-like object in which each key is a token and its value is the number of times that token appears in the given list. To inspect the results you can use {FREQ_DIST_DICT}.most_common({N}), which returns the {N} most frequent tokens in {FREQ_DIST_DICT} together with their counts.

For example:
FreqDist(["hi", "bye", "ciao", "bye", "hi"]) returns a frequency distribution equivalent to {"hi": 2, "bye": 2, "ciao": 1}
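
To try it out quickly (a minimal sketch, assuming NLTK is installed as above):

from nltk.probability import FreqDist

freq = FreqDist(["hi", "bye", "ciao", "bye", "hi"])
print(freq["hi"])           # 2: how many times "hi" appears
print(freq.most_common(2))  # the 2 most frequent tokens with their counts, e.g. [('hi', 2), ('bye', 2)]
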
Mark the box to see the solution
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

def netflix_director_names():
    result = []
    for show in netflix_data:
        # Tokenize the director field and collect all the tokens
        word_tokens = word_tokenize(show["director"])
        result += word_tokens
    return FreqDist(result)

counts = netflix_director_names()
print(counts.most_common(10))  # the 10 most frequent tokens in the director field