Text mining for dummies: a quick text mining example

This notebook shows the common steps of any text mining project.

The goal is not to give an exhaustive overview of text mining, but to kick-start your thinking and suggest ideas for further enhancements.


Step 1: Data
For teaching purposes, we start with a very small data set of 6 reviews.
In practice, such data is often scraped from review websites, which are good sources because each review pairs raw text with a numeric rating.

In [0]:
scores = [5, 5, 5, 1, 1, 1]
In [0]:
reviews = ["I loved our hotel", "Staff was awesome", "Awesome staff and loved the hotel",
           "I hated our hotel", "Staff was rude", "Rude staff and hated the hotel"]

Step 2: Data preparation
Real-world data usually needs more cleaning than this example does, e.g. with regular expressions or Python string operations.
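As a rough illustration of that kind of cleaning, the sketch below combines a string operation with two regular expressions. The function name `clean_review` and the sample text are made up for this example, and the rules assume plain ASCII English text; real projects usually need rules tailored to their own data.

```python
import re

def clean_review(text: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    text = text.lower()                       # plain string operation
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace anything that is not a letter, digit or whitespace
    text = re.sub(r"\s+", " ", text)          # squeeze runs of whitespace into a single space
    return text.strip()

# Hypothetical raw review, as it might come out of a scraper
raw = "  Staff was AWESOME!!!  Loved the hotel :-) "
print(clean_review(raw))
# staff was awesome loved the hotel
```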

The real challenge of text mining is converting text to numerical data. This is often done in two steps:

  • Stemming / Lemmatizing: bringing all words back to their 'base form' so that different forms of the same word are counted together

  • Vectorizing: applying an algorithm based on word counts to turn the text into numbers (more advanced)

    In this example I use a LancasterStemmer and a CountVectorizer, which are well-known and easy-to-use methods.
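To make 'base form' concrete, here is a deliberately naive suffix stripper. It is not the Lancaster algorithm, only a toy with two hand-picked rules (the name `naive_stem` and the rules are invented for this sketch), but it shows how "loved" and "hated" can be reduced to a shared stem before counting.

```python
def naive_stem(word: str) -> str:
    """Toy suffix stripper: NOT the Lancaster algorithm, just an illustration."""
    word = word.lower()
    for suffix in ("ed", "e"):
        stem = word[: -len(suffix)]
        # Only strip when at least three characters remain, so "the" stays "the"
        if word.endswith(suffix) and len(stem) >= 3:
            return stem
    return word

words = ["loved", "Awesome", "hated", "rude", "the", "hotel"]
print([naive_stem(w) for w in words])
# ['lov', 'awesom', 'hat', 'rud', 'the', 'hotel']
```

A real stemmer applies many more rules and handles far more word forms, which is why the notebook relies on a library implementation.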

    Step 2a: LancasterStemmer to bring words back to their base form

    In [0]:
    from nltk.stem import LancasterStemmer
    my_stemmer = LancasterStemmer()
    # Split each review on whitespace and stem every word (the output is lowercased as well)
    stemmed = [[my_stemmer.stem(word) for word in review.split()] for review in reviews]
    stemmed
    
    Out[0]:
    [['i', 'lov', 'our', 'hotel'],
     ['staff', 'was', 'awesom'],
     ['awesom', 'staff', 'and', 'lov', 'the', 'hotel'],
     ['i', 'hat', 'our', 'hotel'],
     ['staff', 'was', 'rud'],
     ['rud', 'staff', 'and', 'hat', 'the', 'hotel']]
    In [0]:
    # Re-join the stems into one string per review, the input format CountVectorizer expects
    stemmed_concat = [' '.join(review) for review in stemmed]
    stemmed_concat
    
    Out[0]:
    ['i lov our hotel',
     'staff was awesom',
     'awesom staff and lov the hotel',
     'i hat our hotel',
     'staff was rud',
     'rud staff and hat the hotel']

    Step 2b: CountVectorizer to apply a Bag of Words model (basically a word count) for vectorizing, i.e. converting text data into numerical data

    In [0]:
    from sklearn.feature_extraction.text import CountVectorizer
    
    In [0]:
    my_bow = CountVectorizer()
    
    In [0]:
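    # Learn the vocabulary and count each word per review; the result is a sparse matrix
    bow = my_bow.fit_transform(stemmed_concat)
    # Convert to a dense matrix only for display: one row per review, one column per word
    print(bow.todense())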
    
    [[0 0 0 1 1 1 0 0 0 0]
     [0 1 0 0 0 0 0 1 0 1]
     [1 1 0 1 1 0 0 1 1 0]
     [0 0 1 1 0 1 0 0 0 0]
     [0 0 0 0 0 0 1 1 0 1]
     [1 0 1 1 0 0 1 1 1 0]]
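To see where each column of this matrix comes from, the word count can be approximated in plain Python. This is a sketch of the idea, not scikit-learn's implementation: it assumes the default tokenization, which keeps only tokens of two or more word characters (so 'i' gets no column), and an alphabetically sorted vocabulary. The variable names are illustrative.

```python
import re
from collections import Counter

# Stemmed reviews copied from Step 2a
stemmed_concat = ['i lov our hotel', 'staff was awesom', 'awesom staff and lov the hotel',
                  'i hat our hotel', 'staff was rud', 'rud staff and hat the hotel']

# Tokens of two or more word characters, mirroring CountVectorizer's default pattern
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in stemmed_concat]

# The vocabulary is the sorted set of all tokens; its order defines the columns
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per review, one column per vocabulary word, each cell a word count
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Read this way, the first row says that review 0 contains 'hotel', 'lov' and 'our' once each, which are columns 3, 4 and 5 of the sorted vocabulary.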
    

    Step 3: Machine Learning
    Since the text has been converted to numeric data, you can now apply any method that works on regular numeric data!

    In [0]:
    from sklearn.linear_model import LinearRegression
    
    In [0]:
    linreg = LinearRegression()
    # Fit on the bag-of-words matrix, then predict on the same six reviews
    linreg.fit(bow, scores)
    predicted = linreg.predict(bow)
    
    In [0]:
    import pandas as pd
    pd.DataFrame({'review': reviews, 'original score': scores, 'predicted scores': predicted})
    
    Out[0]:
                                  review  original score  predicted scores
    0                  I loved our hotel               5          4.333333
    1                  Staff was awesome               5          4.333333
    2  Awesome staff and loved the hotel               5          5.666667
    3                  I hated our hotel               1          1.666667
    4                     Staff was rude               1          1.666667
    5     Rude staff and hated the hotel               1          0.333333
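To illustrate that the bag-of-words rows really are ordinary numeric data, here is a different, swapped-in method: a hand-written 1-nearest-neighbour lookup instead of the linear regression used above. The function names and the two new reviews are hypothetical; the matrix and scores are copied from the earlier cells, with columns in alphabetical vocabulary order (and, awesom, hat, hotel, lov, our, rud, staff, the, was).

```python
# Bag-of-words rows and scores copied from the cells above
bow_rows = [
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0, 1, 1, 1, 0],
]
scores = [5, 5, 5, 1, 1, 1]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(vector):
    """Return the score of the training review closest to `vector`."""
    nearest = min(range(len(bow_rows)),
                  key=lambda i: squared_distance(bow_rows[i], vector))
    return scores[nearest]

# Hypothetical new review "loved hotel" -> stems "lov hotel" -> columns 3 and 4
print(predict_1nn([0, 0, 0, 1, 1, 0, 0, 0, 0, 0]))  # 5
# Hypothetical new review "rude staff" -> stems "rud staff" -> columns 6 and 7
print(predict_1nn([0, 0, 0, 0, 0, 0, 1, 1, 0, 0]))  # 1
```

Note that a new review has to go through the same stemming and the same vocabulary as the training data before it can be compared with these rows.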