Development - Advanced, exercise 3
Text
In information retrieval, the term frequency–inverse document frequency (or tf-idf) is a numerical statistic that is intended to reflect how important a word is to a document in a corpus. It is based on two functions:
- the term frequency, tf(t, d), which counts the number of times a term t occurs in document d;
- the inverse document frequency, idf(t, d_list), which measures whether a term t is common or rare across all the documents in a list d_list, calculated as the logarithm of the ratio of the total number of documents in the list to the number of documents that contain the term t.
 
Thus, the tf-idf of a term t in a document d included in a collection of documents d_list is simply the product of its term frequency and its inverse document frequency.
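The definitions above can be written compactly as follows. This is only a sketch of the notation: the symbols D (standing for d_list) and w_1 ... w_n (the words of d) are introduced here for illustration, and the natural logarithm is assumed because that is what Python's math.log computes with one argument. The \text command assumes the amsmath package.

```latex
\[
\mathrm{tf}(t, d) = \bigl|\{\, i : w_i = t \,\}\bigr|,
\qquad d = w_1\, w_2 \cdots w_n
\]
\[
\mathrm{idf}(t, D) = \ln \frac{|D|}{\bigl|\{\, d' \in D : t \in d' \,\}\bigr|}
\]
\[
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
\]
```

For instance, if a term occurs twice in d and appears in only one of four documents, its tf-idf is 2 · ln(4 / 1) ≈ 2.77.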
Write the Python function tfidf(t, d, d_list), which takes as input a string t representing a term, a string d representing a document, and a list of strings d_list representing a collection of documents that also includes d, and which returns the tf-idf of the term t in d with respect to d_list. As a simplification, all the input strings consist only of lowercase English alphabetic characters with no punctuation, and the words of a document are separated by spaces. The logarithm function is available in Python in the module math (from math import log) and has the following signature: def log(n). For example, log(3) calculates the natural logarithm of the number 3.
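As a quick check of how the logarithm call behaves, here is a minimal sketch. The numbers 7 and 2 are illustrative only (seven documents, two of which contain a term) and are not part of the exercise statement.

```python
from math import e, log

# With a single argument, math.log returns the natural logarithm (base e),
# so the logarithm of e itself is (approximately) 1.
natural = log(e)
print(natural)

# Illustrative idf-style ratio: 7 documents in total, 2 of them contain the term.
ratio = 7 / 2
idf_value = round(log(ratio), 2)
print(idf_value)  # 1.25

# math.log also accepts an optional second argument that selects the base.
base_two = log(8, 2)
print(base_two)
```

The exercise only needs the one-argument form, so the base never has to be spelled out in the solution.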
Solution
from math import log
# Test case for the function
def test_tfidf(t, d, d_list, expected):
    result = tfidf(t, d, d_list)
    # Compare against the expected value rounded to two decimal places
    return expected == round(result, 2)
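# Code of the function
def tfidf(t, d, d_list):
    # tf-idf is the product of the term frequency and the inverse document frequency
    return tf(t, d) * idf(t, d_list)

def tf(t, d):
    # Count how many whitespace-separated words of d are exactly t
    r = 0
    for term in d.split():
        if t == term:
            r += 1
    return r

def idf(t, d_list):
    # Count the documents that contain t as a whole word
    d_with_t = 0
    for d in d_list:
        if t in d.split():
            d_with_t += 1
    # Natural logarithm of (total documents / documents containing t);
    # raises ZeroDivisionError if no document in d_list contains t
    return log(len(d_list) / d_with_t)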
# Tests
d1 = "snow in my shoe abandoned sparrow's nest"
d2 = "whitecaps on the bay a broken signboard banging in the April wind"
d3 = "lily out of the water out of itself bass picking bugs off the moon"
d4 = "an aging willow its image unsteady in the flowing stream"
d5 = "just friends he watches my gauze dress blowing on the line"
d6 = "little spider will you outlive me"
d7 = "meteor shower a gentle wave wets our sandals"
d_list = [d1, d2, d3, d4, d5, d6, d7]
print(test_tfidf("a", d2, d_list, 1.25))
print(test_tfidf("out", d1, d_list, 0.0))
print(test_tfidf("out", d3, d_list, 3.89))
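The idf function in the solution divides by the number of documents containing the term, so it fails with a ZeroDivisionError when the term occurs in no document of the list. One possible guard is sketched below. The name idf_safe and the convention of returning 0.0 for an absent term are assumptions made for illustration; they are not part of the original solution or of the exercise statement.

```python
from math import log

def idf_safe(t, d_list):
    """Hypothetical variant of idf that tolerates a term absent from every document."""
    # Count the documents whose whitespace-separated words include t.
    d_with_t = sum(1 for d in d_list if t in d.split())
    if d_with_t == 0:
        # Assumed convention: report 0.0 instead of dividing by zero.
        return 0.0
    return log(len(d_list) / d_with_t)

docs = [
    "little spider will you outlive me",
    "meteor shower a gentle wave wets our sandals",
]
print(idf_safe("spider", docs))  # natural logarithm of 2 / 1
print(idf_safe("comet", docs))   # 0.0 under the assumed convention
```

Since the term frequency of an absent term is 0 in any case, this convention keeps the resulting tf-idf at 0 rather than raising an error.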
Additional material
The runnable Python file is available online.