The goal is not to give an exhaustive overview of text mining, but to kick-start your thinking and give you ideas for further enhancements.
Step 1: Data
For teaching purposes, we start with a very small data set of six reviews.
In practice, data often comes from scraping review websites, because they conveniently provide both raw text and a numeric score.
scores = [5, 5, 5, 1, 1, 1]
reviews = ["I loved our hotel", "Staff was awesome", "Awesome staff and loved the hotel",
"I hated our hotel", "Staff was rude", "Rude staff and hated the hotel"]
Step 2: Data preparation
Real data will usually need more cleaning than in this example, e.g. with regular expressions or Python string operations.
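As a minimal sketch of what such cleaning could look like (it is not needed for this toy data set, so the rest of the tutorial keeps working on reviews directly):
import re
# lowercase and replace everything that is not a letter or whitespace with a space
cleaned = [re.sub(r'[^a-z\s]', ' ', review.lower()).strip() for review in reviews]
cleaned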
The real challenge of text mining is converting text to numerical data. This is often done in two steps:
Step 2a: stemming with LancasterStemmer to bring words back to their base form
from nltk.stem import LancasterStemmer
my_stemmer = LancasterStemmer()
# stem every word of every review
stemmed = [[my_stemmer.stem(word) for word in review.split()] for review in reviews]
stemmed
# glue the stems back together into one string per review, as CountVectorizer expects
stemmed_concat = [' '.join(review) for review in stemmed]
stemmed_concat
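To get a feeling for what the stemmer actually does, you can call it on individual words (a quick check, not part of the pipeline; the exact stems depend on the Lancaster rules):
# different inflections of the same word typically collapse to one stem
print(my_stemmer.stem('loved'), my_stemmer.stem('loving'), my_stemmer.stem('hated'))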
Step 2b: CountVectorizer to apply bag-of-words (basically a word count) for vectorizing, i.e. converting the text into numerical data
from sklearn.feature_extraction.text import CountVectorizer
my_bow = CountVectorizer()
# learn the vocabulary and count how often each stem appears in each review
bow = my_bow.fit_transform(stemmed_concat)
print(bow.todense())
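Each column of this matrix counts one stem from the learned vocabulary. To see which column belongs to which stem, you can ask the fitted vectorizer (get_feature_names_out exists in scikit-learn 1.0 and later; older versions use get_feature_names):
print(my_bow.get_feature_names_out())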
Step 3: Machine Learning
Since the text has been converted to numeric data, you can use any method that you would use on regular data!
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
# fit the regression on the bag-of-words counts, using the scores as the target
linreg.fit(bow, scores)
predicted = linreg.predict(bow)
import pandas as pd
pd.DataFrame({'review': reviews, 'original score': scores, 'predicted score': predicted})
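To score a review the model has never seen, push it through the same pipeline: stem it, vectorize it with the already fitted CountVectorizer, and predict. A minimal sketch with a made-up review:
new_review = "Awesome hotel and loved the staff"
new_stemmed = ' '.join(my_stemmer.stem(word) for word in new_review.split())
# transform (not fit_transform): reuse the vocabulary learned from the training reviews
new_bow = my_bow.transform([new_stemmed])
linreg.predict(new_bow)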