Creating a Term Frequency-Inverse Document Frequency table with scikit-learn

By 0x7df, Wed 07 April 2021, modified Wed 07 April 2021, in category Misc

Given a collection of text documents (a corpus) to be processed or analysed in some way, a useful first step is to create a term frequency-inverse document frequency (TF-IDF) table.

The term frequency for a given term and a given document - \(\mathrm{tf}(t,d)\) - is simply the number of times the term appears in the document. Taken over all documents in the corpus and all terms in the vocabulary (i.e. the total set of unique words used in the corpus), these counts form a matrix.

The document frequency for a given term - \(\mathrm{df}(t)\) - is the number of documents in the corpus that contain the term. Taken over all terms in the vocabulary, these counts form a vector.

TF-IDF for a given term in a given document is the term frequency multiplied by the inverse document frequency, which in its simplest form is just the term frequency divided by the document frequency:

$$ \mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \times \mathrm{idf}(t) = \frac{\mathrm{tf}(t,d)}{\mathrm{df}(t)} $$

We can use this to identify which terms characterise a particular document. For example, if a particular term appears very frequently in a document, we might want to infer something from that, either about the term or the document. However, if the term is, for some reason, used frequently in all the documents in the corpus, then it no longer really tells us anything about the particular document. So the term frequency alone can be misleading. Therefore, we normalise term frequency by document frequency; the resulting score is higher for terms which are more frequent in that particular document than in the corpus as a whole.
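
To make these definitions concrete, here is a minimal sketch that computes the term frequencies, document frequencies and the simple tf/df score by hand, for a made-up three-document corpus (the documents are invented purely for illustration; the scikit-learn implementation used below refines the formula slightly):

from collections import Counter

toy_corpus = ["the cat sat", "the cat ran", "the dog ran home"]
toy_docs = [doc.split() for doc in toy_corpus]
toy_vocab = sorted({term for doc in toy_docs for term in doc})

# Term frequency: a matrix over all documents and all terms
toy_tf = [[Counter(doc)[term] for term in toy_vocab] for doc in toy_docs]

# Document frequency: a vector over all terms
toy_df = [sum(1 for doc in toy_docs if term in doc) for term in toy_vocab]

# Simple TF-IDF as defined above: term frequency divided by document frequency
toy_tfidf = [[toy_tf[i][j] / toy_df[j] for j in range(len(toy_vocab))]
             for i in range(len(toy_docs))]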

Reading a corpus

Consider, as an example corpus, a set of tweets. Each tweet is a document. We use here the tweets from @0x7df, which have been downloaded in JSON format into a file tweet.js.

The following Python code:

import json

# Read tweets into a corpus
with open('tweet.js', 'r') as file_pointer:
    json_data = json.load(file_pointer)
corpus = [item['tweet']['full_text'] for item in json_data]

reads the data file into a data structure (json_data), and then extracts the full text of each tweet into a list of strings, with each string being a document - this forms our corpus. (A file containing your own tweets can be downloaded from twitter.com, or you can replace the code above with anything that results in a list of strings.)
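
One possible wrinkle, depending on when the archive was downloaded: tweet.js is sometimes shipped as JavaScript, beginning with an assignment such as window.YTD.tweet.part0 = rather than as pure JSON. If json.load() complains, a rough workaround (assuming the JSON array is everything from the first opening bracket onwards) is:

import json

# Fallback sketch: strip a leading "window.YTD.tweet.part0 = " style prefix,
# if present, before parsing the remainder as JSON. This also works for a
# pure-JSON file, since the array itself starts with '['.
with open('tweet.js', 'r') as file_pointer:
    raw = file_pointer.read()
json_data = json.loads(raw[raw.index('['):])
corpus = [item['tweet']['full_text'] for item in json_data]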

TF-IDF in scikit-learn

We can use scikit-learn to create the TF-IDF table with built-in functionality, so it takes just a few lines of code. First install it, if necessary (we use a Conda environment here):

conda create -n tf-idf
conda activate tf-idf
conda install -c conda-forge scikit-learn

Now we can work with the corpus in Python 3:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
term_document_matrix = vectorizer.fit_transform(corpus)

The scikit-learn CountVectorizer object performs the first step towards creating the TF-IDF table, by generating a term-document matrix, which is just a matrix of token counts for the corpus.
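
To see what this produces, here is the same step on a tiny made-up corpus (the output shown is what the default tokeniser - lower-casing, tokens of two or more characters - gives; the real tweets are used in the rest of the post):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the cat sat on the mat", "the dog sat"]
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy)
print(toy_vectorizer.get_feature_names())
# ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(toy_matrix.toarray())
# [[1 0 1 1 1 2]
#  [0 1 0 0 1 1]]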

Back with the tweet corpus, the term counts can be displayed in an intuitive array format:

print(term_document_matrix.toarray())

the size of which we can get with:

import numpy as np
print(np.shape(term_document_matrix.toarray()))

which in this case is 394 (the number of documents) by 2143 (the number of terms).

The list of terms - the vocabulary of the corpus - corresponds to the columns of this matrix. E.g.:

vocabulary = vectorizer.get_feature_names()
print(len(vocabulary))
print(vocabulary[600:610])

gives output:

2143
['estimating', 'etc', 'even', 'evenly', 'ever', 'every', 'everybody', 'everyone', 'everywhere', 'evolution']
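
(In more recent scikit-learn releases, get_feature_names has been replaced by get_feature_names_out.) The mapping in the opposite direction - from a term to its column index - is available as the vectorizer's vocabulary_ attribute; for example (the index itself depends on the corpus):

# Column index of the term 'evolution' in the term-document matrix
print(vectorizer.vocabulary_['evolution'])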

Out of interest, let's find a term that is used multiple times in a single document; we pick the first entry in the term-document matrix where the count exceeds 4:

::: python
import numpy as np
counts = term_document_matrix.toarray()
d = np.where(counts > 4)[0][0]  # index of the document
t = np.where(counts > 4)[1][0]  # index of the term

In our case:

print(vocabulary[t], "\n", corpus[d])

gives:

data
Bayes Theorem. Given some new data:

posterior=likelihood*prior/marginal

posterior = prob'y of hypothesis after new data
prior = prob'y of hypothesis before new data
likelihood = prob'y of observing the new data if hypothesis is true
marginal = prob'y of observing the new data

We can count how many documents in the corpus the term data appears in - its document frequency:

print(len(term_document_matrix.toarray()[:,t].nonzero()[0]))
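
The same count can be obtained for every term at once by counting, for each column, the number of documents with a non-zero entry - a small sketch that will also be handy for checking the IDF values below:

# Document frequency of every term in the vocabulary
document_frequency = np.count_nonzero(term_document_matrix.toarray(), axis=0)
print(document_frequency[t])  # same value as above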

The TF-IDF of the corpus is calculated from the term-document matrix using scikit-learn's TfidfTransformer class:

from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tf_idf = transformer.fit_transform(term_document_matrix)
print(tf_idf)

giving:

(0, 1997)   0.7069693709594577
(0, 1724)   0.6613659134533925
(0, 834)    0.1936081602656838
(0, 337)    0.15907645119779876
(1, 2016)   0.35055205958116886
(1, 1765)   0.26739527029671417
(1, 1561)   0.21530466145224245
(1, 1520)   0.11733710329987213
(1, 1510)   0.3092301427145016
(1, 1247)   0.29426239833024026
(1, 1230)   0.2850583708895842
(1, 1221)   0.37472383140608634
(1, 1220)   0.3000404615682601
(1, 875)    0.37472383140608634
(1, 565)    0.33340191453941903
(2, 1971)   0.2974636234916694
(2, 1885)   0.26466142066763854
(2, 1836)   0.25410147000438943
:   :
(392, 519)  0.31753609195303606
(392, 337)  0.08364212868421203
(392, 87)   0.29763794190054826
(393, 1800) 0.2926392069550637
(393, 1788) 0.2926392069550637
(393, 1595) 0.2926392069550637
(393, 1517) 0.2926392069550637
(393, 1389) 0.2926392069550637
(393, 1152) 0.24998032295630565
(393, 997)  0.2177101229630202
(393, 988)  0.2926392069550637
(393, 960)  0.2926392069550637
(393, 902)  0.2603690069617782
(393, 834)  0.08014115010850688
(393, 639)  0.2926392069550637
(393, 381)  0.2926392069550637
(393, 337)  0.06584727491174341
(393, 179)  0.2020453021681305

The tuple in the first column gives the location in the TF-IDF matrix (the first element is the document index and the second is the index of the term). The second column is the TF-IDF score.
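
As a quick illustration of the "characteristic terms" idea from the start of the post, something like the following picks out the highest-scoring term in each document (a sketch, using the vocabulary list obtained earlier):

# Term with the largest TF-IDF score in each document
dense_tf_idf = tf_idf.toarray()
top_terms = [vocabulary[i] for i in dense_tf_idf.argmax(axis=1)]
print(top_terms[:10])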

Note that the TF-IDF score returned by this TfidfTransformer class is not simply the term frequency divided by the document frequency, as suggested above; instead:

$$ \mathrm{idf}(t) = \log \left[ \frac{n + 1}{\mathrm{df}(t) + 1} \right] + 1 $$

is used to calculate the inverse document frequency from the document frequency \(\mathrm{df}(t)\) (where \(n\) is the number of documents). This scales the inverse document frequency so that a term which appears in every document in the corpus has a value of 1, whilst avoiding divide-by-zeros.

The \(\mathrm{idf}(t)\) values are stored in transformer.idf_.
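
As a sanity check, we can recompute the IDF values from the document frequencies using the formula above (this assumes TfidfTransformer's default smooth_idf=True setting) and compare them with transformer.idf_:

n_documents = term_document_matrix.shape[0]
document_frequency = np.count_nonzero(term_document_matrix.toarray(), axis=0)
idf = np.log((n_documents + 1) / (document_frequency + 1)) + 1
print(np.allclose(idf, transformer.idf_))  # True if the formula matches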

Finally, note that the TF-IDF values for each document are normalised so each document has unit Euclidean norm. Each document is a vector in \(m\)-dimensional space, where \(m\) is the number of terms in the vocabulary, and the TF-IDF score of term \(i\) is the component of the vector in the \(i^{\mathrm{th}}\) dimension. For a general vector \(v\):

$$ v_{\mathrm{norm}} = \frac{v}{\sqrt{v_1^2 + v_2^2 + ... + v_m^2}} $$

so for a vector in a three-dimensional space for example (equivalent to a document in a corpus with only three terms in the vocabulary):

$$ (x, y, z)_{\mathrm{norm}} = \frac{(x, y, z)}{\sqrt{x^2 + y^2 + z^2}} $$

where \(x\), \(y\) and \(z\) are the raw TF-IDF scores for the three terms (in the given document).
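
This is easy to verify: the Euclidean norm of every row (document) of the TF-IDF matrix should be 1 (barring any document that contains no terms at all):

# Check that each document's TF-IDF vector has unit length
row_norms = np.linalg.norm(tf_idf.toarray(), axis=1)
print(np.allclose(row_norms, 1.0))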
