The goal of this notebook is to show the main steps in building a Machine Learning model.
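The example is by no means perfect, but it should help you think about model development and model evaluation.
The steps covered in this notebook are: data preparation, the train / validation / test split, building and tuning three models with GridSearchCV, choosing the best model on the validation data, and a final evaluation on the test data.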
import pandas as pd
from sklearn.datasets import fetch_kddcup99
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
import seaborn
Step 1: Data Preparation
Importing data on cybersecurity and cyberattacks: the data has many variables describing network connections, plus one variable that indicates whether a connection is dangerous (a cyberattack) or not. The goal is to develop a model that is able to 'learn' how to distinguish between a normal connection and a dangerous connection. We don't know yet how: our machine learning model will have to learn this with our help.
#Importing example data on cybersecurity
X, y = fetch_kddcup99(return_X_y = True)
Below we see a quick print of the data. The first columns give us information about the type of connection and the other columns are numeric information about the connection (see kddcup99 for more detailed info).
#We have a quick look at the data
pd.DataFrame(X).head()
In a real-life Machine Learning case, we would need to inspect this data and find out which variables we should use and which variables we should definitely ignore. But this is a quick example, so we take the data as-is and see what score we can obtain in 10 minutes. Therefore, we delete the first four columns, which include the type of connection, and we treat every type of connection in the same way.
#Delete first columns
X = pd.DataFrame(X).iloc[:,4:]
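After dropping the first columns, it can be worth checking the column types, since values stored as generic Python objects can be cast to numbers explicitly. The sketch below is only an illustration on made-up rows: the `toy` DataFrame and its values are hypothetical, and we assume (without having checked it here) that the remaining columns of the real data are all numeric.

```python
import pandas as pd

# Hypothetical toy rows: four leading columns, then numeric values stored as objects
toy = pd.DataFrame(
    [[0, b'tcp', b'http', b'SF', 181, 5450],
     [0, b'udp', b'private', b'SF', 105, 146]],
    dtype=object,
)

# Drop the first four columns, then cast the remaining ones to float
toy_numeric = toy.iloc[:, 4:].astype(float)

print(toy_numeric.dtypes)
```

On the real data, `X.dtypes` gives the same kind of overview before deciding whether a cast is needed.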
We also inspect the dependent variable and we notice that it is not in the numeric 1 vs 0 format that we want. Let's change that.
#print y and see it's not what we want
y[:5]
# y new : reclassify in 0 vs 1 (normal is 0, attack is 1)
y_new = []
for i in y:
    if i == b'normal.':
        y_new.append(0)
    else:
        y_new.append(1)
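The same recoding can also be written without an explicit loop. This is a minimal sketch, assuming the labels are a NumPy array of byte strings like the one printed above; `y_sample` is made-up illustration data, not the real labels.

```python
import numpy as np

# Hypothetical sample of labels in the same byte-string format as y
y_sample = np.array([b'normal.', b'smurf.', b'neptune.', b'normal.'], dtype=object)

# Compare every label to b'normal.' at once: attacks become 1, normal becomes 0
y_sample_new = (y_sample != b'normal.').astype(int)

print(y_sample_new)
```

Both versions give the same 0 vs 1 coding; the vectorized one is mostly a matter of speed and taste on a data set of this size.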
To get a quick idea of the data, we plot the count of attacks and normal connections. Notice that our data is imbalanced: it contains far more attacks than normal connections.
#count plot for number of observations that were an attack vs not an attack
seaborn.countplot(x=y_new)
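If we prefer numbers over a plot, the class balance can be computed directly. A small sketch on a made-up label list (`labels_sample` is hypothetical, it is not the notebook's `y_new`):

```python
import numpy as np

# Hypothetical 0/1 labels, for illustration only
labels_sample = [0, 1, 1, 1, 0, 1]

# counts[0] is the number of normal connections, counts[1] the number of attacks
counts = np.bincount(labels_sample)
share_attacks = counts[1] / counts.sum()

print(counts, share_attacks)
```

Applied to `y_new`, the same two lines would give the exact share of attacks behind the bars of the count plot.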
Step 2 : Train Validation Test Split
We have a lot of observations, so we can afford to split the data in three parts: Train, Validation and Test. In cases where we have less data, we often go for a split in two: Train and Test.
#cut the data in three parts : train, validation and test
X_train, X_test, y_train, y_test = train_test_split(X, y_new, test_size=0.33, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
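Applying test_size=0.33 twice does not give three equal parts: the test set takes about 33% of the data, and the validation set takes 33% of the remaining 67%, so roughly 22%, which leaves about 45% for training. The sketch below checks this on hypothetical toy data (`X_toy` and `y_toy` are made up for the illustration).

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 observations with a dummy 0/1 label
X_toy = [[i] for i in range(1000)]
y_toy = [i % 2 for i in range(1000)]

# Same two-step split as above
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tr, y_tr, test_size=0.33, random_state=42)

# Number of observations in each part
sizes = {'train': len(X_tr), 'validation': len(X_va), 'test': len(X_te)}
print(sizes)
```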
Step 3 : building three different models and applying GridSearchCV for hyperparameter tuning
The three models are : Random Forest, Decision Tree and Logistic Regression.
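The following hyperparameters will be tuned using GridSearchCV: max_features for the Random Forest, max_features and min_samples_split for the Decision Tree, and C for the Logistic Regression.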
Step 3a : Random Forest hyperparameter tuning
#Random Forest
parameters_RF = {'max_features':[0.85, 0.9, 0.95]}
RF = GridSearchCV(RandomForestClassifier(), parameters_RF, cv=5)
RF.fit(X_train, y_train)
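After fitting, a GridSearchCV object remembers which combination worked best during cross-validation. The sketch below shows how to read this out; it runs on synthetic data from make_classification and uses a small forest (n_estimators=10) only to keep it fast, so both the data and the names `X_toy`, `y_toy` and `search` are assumptions for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the kddcup99 data
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Same grid as above, on a deliberately small forest
grid = {'max_features': [0.85, 0.9, 0.95]}
search = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=0), grid, cv=5)
search.fit(X_toy, y_toy)

# Best hyperparameter value and its mean cross-validated score
print(search.best_params_)
print(search.best_score_)
```

Note that with no scoring argument, GridSearchCV ranks the candidates with the classifier's default score (accuracy), not with the ROC AUC that we use on the validation data.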
Step 3a results : applying the Random Forest on the validation data and printing the roc_auc_score
RF_accuracy = roc_auc_score(y_val, RF.predict(X_val))
RF_accuracy
99% is very high !
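One remark on the metric: we pass the hard 0/1 predictions to roc_auc_score, which evaluates the ROC curve at a single threshold. The ROC AUC is usually computed from predicted probabilities instead. The sketch below compares both on synthetic data; the data, the Logistic Regression model and all variable names are assumptions for the illustration, so the numbers say nothing about the kddcup99 results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the kddcup99 data
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_a, y_a)

# ROC AUC from hard 0/1 predictions (what this notebook does)
auc_from_labels = roc_auc_score(y_b, model.predict(X_b))

# ROC AUC from the predicted probability of class 1 (the more common choice)
auc_from_scores = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])

print(auc_from_labels, auc_from_scores)
```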
Step 3b : Decision Tree hyperparameter tuning
# Decision Tree
parameters_DT = {'max_features':[0.85, 0.9, 0.95], 'min_samples_split': [2, 3, 4, 5, 10]}
DT = GridSearchCV(DecisionTreeClassifier(), parameters_DT, cv=5)
DT.fit(X_train, y_train)
Step 3b results : applying the Decision Tree on the validation data and printing the roc_auc_score
DT_accuracy = roc_auc_score(y_val, DT.predict(X_val))
DT_accuracy
Also 99%, also very high !
Step 3c : Logistic Regression hyperparameter tuning
# Logistic Regression
parameters_LR = {'C':[0.01, 0.1, 1.0]}
LR = GridSearchCV(LogisticRegression(), parameters_LR, cv=5)
LR.fit(X_train, y_train)
Step 3c results : applying the Logistic Regression on the validation data and printing the roc_auc_score
LR_accuracy = roc_auc_score(y_val, LR.predict(X_val))
LR_accuracy
98% is also high !
Step 4 : choose the best performing model on the validation data set
# repeating the three ROC AUC scores on the validation data
RF_accuracy, DT_accuracy, LR_accuracy
Selected model will be Random Forest !
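With only three models we can pick the winner by eye, but the choice can also be made in code. A minimal sketch with hypothetical scores (the numbers in `scores` are made up and are not the results of this notebook):

```python
# Hypothetical validation scores, for illustration only
scores = {
    'Random Forest': 0.91,
    'Decision Tree': 0.89,
    'Logistic Regression': 0.87,
}

# Pick the model name with the highest validation score
best_name = max(scores, key=scores.get)

print(best_name)
```

In the notebook itself, the dictionary values would be RF_accuracy, DT_accuracy and LR_accuracy.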
Step 5 : Final Estimate of Accuracy : apply the selected model to the Test Data
# predict on the test set and compute the ROC AUC score
roc_auc_score(y_test, RF.predict(X_test))