Text mining for dummies: a quick text mining example

This notebook shows the common steps of any text mining project.

The goal is not to give an exhaustive overview of text mining, but to kick-start your thinking and give you ideas for further enhancements.

Step 1: Data
For teaching purposes, we start with a very small data set of 6 reviews.
Such data often comes from scraping review websites, which are a good source because each review pairs raw text with a numeric rating.

scores = [5, 5, 5, 1, 1, 1]

reviews = ["I loved our hotel", "Staff was awesome", "Awesome staff and loved the hotel",
           "I hated our hotel", "Staff was rude", "Rude staff and hated the hotel"]

Step 2: Data preparation
Real data will usually need more cleaning than this example does, e.g. with regular expressions or Python string operations.
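
As an illustration of the kind of cleaning meant here, the sketch below combines string operations and regular expressions. The helper name clean_review and the raw review text are invented for this example, and the rules are only one possible choice: keeping just the letters a-z also drops digits and accented characters, which may not be what you want on real data.

```python
import re

def clean_review(text):
    """Hypothetical helper: lowercase, drop HTML tags and punctuation, squeeze whitespace."""
    text = text.lower()                     # string operation: normalise case
    text = re.sub(r"<[^>]+>", " ", text)    # regex: remove HTML tags left over from scraping
    text = re.sub(r"[^a-z\s]", " ", text)   # regex: keep only the letters a-z and whitespace
    return " ".join(text.split())           # string operation: collapse repeated whitespace

# Invented example of a scraped review (not part of the data set above).
raw = "<p>The staff was AWESOME!!!   10/10, would stay again :)</p>"
print(clean_review(raw))  # the staff was awesome would stay again
```

The six reviews used in this notebook are already clean, so this step is skipped below.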

The real challenge of text mining is converting text to numerical data. This is often done in two steps:

Stemming / lemmatizing: reducing every word to its 'base form', so that variants of the same word are counted together

Vectorizing: applying an algorithm based on word counts that turns each text into numbers (more advanced)

In this example I use a LancasterStemmer and a CountVectorizer, which are well-known and easy-to-use methods.

Step 2a: LancasterStemmer to bring words back to their base form

from nltk.stem import LancasterStemmer

my_stemmer = LancasterStemmer()
# Split each review on whitespace and stem every word (the stemmer also lowercases).
stemmed = [[my_stemmer.stem(word) for word in review.split()] for review in reviews]
stemmed

[['i', 'lov', 'our', 'hotel'],
 ['staff', 'was', 'awesom'],
 ['awesom', 'staff', 'and', 'lov', 'the', 'hotel'],
 ['i', 'hat', 'our', 'hotel'],
 ['staff', 'was', 'rud'],
 ['rud', 'staff', 'and', 'hat', 'the', 'hotel']]

# Join the stems back into one string per review, the input CountVectorizer expects.
stemmed_concat = [' '.join(review) for review in stemmed]
stemmed_concat

['i lov our hotel',
 'staff was awesom',
 'awesom staff and lov the hotel',
 'i hat our hotel',
 'staff was rud',
 'rud staff and hat the hotel']

Step 2b: CountVectorizer to apply Bag of Words (basically a word count) for vectorizing, i.e. converting text data into numerical data

from sklearn.feature_extraction.text import CountVectorizer

my_bow = CountVectorizer()

# Learn the vocabulary and count each word per review; the result is a sparse matrix.
bow = my_bow.fit_transform(stemmed_concat)
print(bow.todense())

[[0 0 0 1 1 1 0 0 0 0]
 [0 1 0 0 0 0 0 1 0 1]
 [1 1 0 1 1 0 0 1 1 0]
 [0 0 1 1 0 1 0 0 0 0]
 [0 0 0 0 0 0 1 1 0 1]
 [1 0 1 1 0 0 1 1 1 0]]
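
To make the matrix above easier to read, here is a minimal pure-Python sketch of the same word count. It is not scikit-learn's implementation; it only mimics two default CountVectorizer behaviours: tokens must have at least two word characters (which is why the single letter 'i' gets no column), and the columns follow the alphabetically sorted vocabulary. The names TOKEN, tokenize and count_vector are made up for this sketch.

```python
import re
from collections import Counter

# Stemmed reviews, copied from the output of Step 2a.
stemmed_concat = ['i lov our hotel',
                  'staff was awesom',
                  'awesom staff and lov the hotel',
                  'i hat our hotel',
                  'staff was rud',
                  'rud staff and hat the hotel']

# Assumption: same idea as CountVectorizer's default token pattern,
# i.e. keep only tokens made of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Every distinct token, sorted alphabetically; the position gives the column index.
vocabulary = sorted({tok for doc in stemmed_concat for tok in tokenize(doc)})

def count_vector(text):
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

matrix = [count_vector(doc) for doc in stemmed_concat]

print(vocabulary)
for row in matrix:
    print(row)
```

The printed vocabulary is ['and', 'awesom', 'hat', 'hotel', 'lov', 'our', 'rud', 'staff', 'the', 'was'], so the first row [0, 0, 0, 1, 1, 1, 0, 0, 0, 0] reads as one 'hotel', one 'lov' and one 'our'.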

Step 3: Machine Learning
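Since the text has been converted to numeric data, you can now use any method that works on regular numeric data. Here a linear regression is fitted and, purely for illustration, used to predict the same six reviews it was trained on.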

from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
# Fit on the bag-of-words counts, then predict the same six reviews back.
linreg.fit(bow, scores)
predicted = linreg.predict(bow)

import pandas as pd
pd.DataFrame({'review': reviews, 'original score': scores, 'predicted scores': predicted})

                              review  original score  predicted scores
0                  I loved our hotel               5          4.333333
1                  Staff was awesome               5          4.333333
2  Awesome staff and loved the hotel               5          5.666667
3                  I hated our hotel               1          1.666667
4                     Staff was rude               1          1.666667
5     Rude staff and hated the hotel               1          0.333333
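
The table above only scores reviews the model has already seen. As a possible next step, the sketch below rebuilds the same three steps and applies them to two made-up reviews that are not in the training data. The helper stem_text and the names vectorizer, model and new_reviews are invented for this sketch; the key point is that new text must go through the same stemming and through transform (not fit_transform), so that it is counted against the vocabulary learned during fitting.

```python
from nltk.stem import LancasterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

scores = [5, 5, 5, 1, 1, 1]
reviews = ["I loved our hotel", "Staff was awesome", "Awesome staff and loved the hotel",
           "I hated our hotel", "Staff was rude", "Rude staff and hated the hotel"]

stemmer = LancasterStemmer()

def stem_text(text):
    """Hypothetical helper: stem every whitespace-separated word and re-join."""
    return " ".join(stemmer.stem(word) for word in text.split())

# Steps 2 and 3 from above: stem, count words, fit the regression.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([stem_text(r) for r in reviews])
model = LinearRegression().fit(X_train, scores)

# Made-up reviews that were not part of the training data.
new_reviews = ["Awesome hotel", "Rude staff"]

# transform reuses the vocabulary learned above; words never seen
# during fitting are simply ignored.
X_new = vectorizer.transform([stem_text(r) for r in new_reviews])
new_predictions = model.predict(X_new)

for review, score in zip(new_reviews, new_predictions):
    print(f"{review!r}: {score:.2f}")
```

With only six training reviews this is a toy, but the positive review should receive a clearly higher predicted score than the negative one. Note also that a linear regression is not restricted to the 1 to 5 scale, as the values 5.666667 and 0.333333 in the table show.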