Quick Machine Learning: build and test models in 10 minutes

The goal of this notebook is to show the main steps in building a Machine Learning model.

The example is by no means perfect, but it should feed your thinking about model development and model evaluation.

The steps covered in this notebook are:

  • super short data inspection (this is what takes much more time in real cases)
  • splitting the data into train, validation and test sets
  • building three different models with sklearn: random forest, decision tree, logistic regression
  • hyperparameter tuning using GridSearchCV cross-validation

    In [1]:
    import pandas as pd
    from sklearn.datasets import fetch_kddcup99
    
    # the three model classes we will compare
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression
    
    # splitting, hyperparameter tuning and evaluation
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import roc_auc_score
    
    import seaborn
    

    Step 1: Data Preparation
Importing data on cybersecurity and cyberattacks: the data has many variables about connections, plus a target variable that indicates whether a connection is dangerous (a cyberattack) or not. The goal is to develop a model that is able to 'learn' how to distinguish between a normal connection and a dangerous one. We don't know yet how: our machine learning model will have to learn this with our help.

    In [2]:
    #Importing example data on cybersecurity
X, y = fetch_kddcup99(return_X_y=True)
    
    Downloading https://ndownloader.figshare.com/files/5976042
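
    If you want a quick sanity check on the size of the data before printing any rows, something like this works:

    # how many observations and how many columns we are working with
    print(X.shape)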
    

Below we see a quick print of the data. The first columns give us information about the type of connection, and the remaining columns contain numeric information about the connection (see kddcup99 for more detailed info).

    In [3]:
    #We have a quick look at the data
    pd.DataFrame(X).head()
    
    Out[3]:
    0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 36 37 38 39 40
    0 0 b'tcp' b'http' b'SF' 181 5450 0 0 0 0 ... 9 9 1 0 0.11 0 0 0 0 0
    1 0 b'tcp' b'http' b'SF' 239 486 0 0 0 0 ... 19 19 1 0 0.05 0 0 0 0 0
    2 0 b'tcp' b'http' b'SF' 235 1337 0 0 0 0 ... 29 29 1 0 0.03 0 0 0 0 0
    3 0 b'tcp' b'http' b'SF' 219 1337 0 0 0 0 ... 39 39 1 0 0.03 0 0 0 0 0
    4 0 b'tcp' b'http' b'SF' 217 2032 0 0 0 0 ... 49 49 1 0 0.02 0 0 0 0 0

    5 rows × 41 columns

In a real-life machine learning case, we would need to inspect this data and work out which variables to use and which to definitely ignore. But this is a quick example, so we take the data as-is and see what performance we can obtain in 10 minutes. We simply drop the first columns describing the type of connection and treat every type of connection the same way.

    In [4]:
#Drop the first four columns (duration plus the categorical connection-type columns)
X = pd.DataFrame(X).iloc[:, 4:]
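
    Dropping the categorical columns throws information away. An alternative worth trying when you have more than 10 minutes is to one-hot encode them instead; a minimal sketch, starting again from the raw X returned by fetch_kddcup99 (columns 1, 2 and 3 hold the categorical values shown above):

    # keep the categorical columns by one-hot encoding them instead of dropping them
    X_df = pd.DataFrame(X)
    X_encoded = pd.get_dummies(X_df, columns=[1, 2, 3])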
    

    We also inspect the dependent variable and we notice that it is not in the numeric 1 vs 0 format that we want. Let's change that.

    In [5]:
#print the first values of y: byte-string labels, not the 0/1 we want
    y[:5]
    
    Out[5]:
    array([b'normal.', b'normal.', b'normal.', b'normal.', b'normal.'],
          dtype=object)
    In [6]:
# y_new: reclassify as 0 vs 1 (normal is 0, attack is 1)
    y_new = []
    for i in y:
      if i == b'normal.':
        y_new.append(0)
      else:
        y_new.append(1)
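
    The loop works, but the same reclassification can be done in one vectorized line with NumPy; a minimal sketch:

    import numpy as np

    # 0 for normal connections, 1 for everything else (attacks)
    y_new = (np.asarray(y) != b'normal.').astype(int)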
    

To get a quick idea of the data, we plot the count of attacks and normal connections. Notice that our data is skewed towards attacks.

    In [7]:
#count plot of observations that were an attack vs not an attack
#(with newer seaborn versions, prefer the explicit keyword form: seaborn.countplot(x=y_new))
seaborn.countplot(y_new)
    
    Out[7]:
    <matplotlib.axes._subplots.AxesSubplot at 0x1501ad99a58>
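
    If you prefer raw numbers over a plot, a simple value count gives the same information:

    # class balance as raw counts (0 = normal, 1 = attack)
    pd.Series(y_new).value_counts()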

Step 2: Train, Validation and Test Split
We have a lot of observations, so we can afford to split the data in three parts: train, validation and test. In cases where we have less data, we often go for a two-way split: train and test.

  • Train: used for model building and hyperparameter tuning (GridSearchCV)
  • Validation: used for model selection
  • Test: used to obtain an objective final estimate of the out-of-sample performance of our classification model

    In [8]:
#cut the data in three parts: train, validation and test
    X_train, X_test, y_train, y_test = train_test_split(X, y_new, test_size=0.33, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.33, random_state=42)
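
    Because the classes are imbalanced, you may want a stratified split that preserves the attack/normal ratio in every subset. A sketch of the same two-step split with stratification (optional here, since the dataset is large):

    # stratified variant: each subset keeps the same 0/1 proportions
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_new, test_size=0.33, random_state=42, stratify=y_new)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=0.33, random_state=42, stratify=y_train)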
    

Step 3: building three different models and applying GridSearchCV for hyperparameter tuning
The three models are: Random Forest, Decision Tree and Logistic Regression.
The following hyperparameters will be tuned using GridSearchCV:

  • For the Random Forest: 'max_features'
  • For the Decision Tree: 'max_features' and 'min_samples_split'
  • For the Logistic Regression: 'C'

    Step 3a: Random Forest hyperparameter tuning

    In [9]:
    #Random Forest
    parameters_RF = {'max_features':[0.85, 0.9, 0.95]}
    RF = GridSearchCV(RandomForestClassifier(), parameters_RF, cv=5)
    RF.fit(X_train, y_train)
    
C:\Users\jkorstan\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
(the same FutureWarning is repeated for every cross-validation fit)
    
    Out[9]:
    GridSearchCV(cv=5, error_score='raise-deprecating',
                 estimator=RandomForestClassifier(bootstrap=True, class_weight=None,
                                                  criterion='gini', max_depth=None,
                                                  max_features='auto',
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators='warn', n_jobs=None,
                                                  oob_score=False,
                                                  random_state=None, verbose=0,
                                                  warm_start=False),
                 iid='warn', n_jobs=None,
                 param_grid={'max_features': [0.85, 0.9, 0.95]},
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring=None, verbose=0)
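
    After fitting, GridSearchCV exposes which hyperparameter value won and how it scored across the folds; it is worth a quick look before moving on:

    # winning hyperparameter combination and its mean cross-validated score
    print(RF.best_params_)
    print(RF.best_score_)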

Step 3a results: applying the Random Forest to the validation data and printing the roc_auc_score

In [10]:
# ROC AUC of the tuned Random Forest on the validation set
RF_accuracy = roc_auc_score(y_val, RF.predict(X_val))
RF_accuracy
    
    Out[10]:
    0.9997989809923306

An AUC of 99.98% is very high!
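
    One caveat: we feed hard 0/1 predictions into roc_auc_score. ROC AUC is usually computed from predicted probabilities, which yields a more informative score; a minimal sketch of that variant:

    # ROC AUC based on the predicted attack probability instead of hard labels
    roc_auc_score(y_val, RF.predict_proba(X_val)[:, 1])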

Step 3b: Decision Tree hyperparameter tuning

    In [11]:
    # Decision Tree
    parameters_DT = {'max_features':[0.85, 0.9, 0.95], 'min_samples_split': [2, 3, 4, 5, 10]}
    DT = GridSearchCV(DecisionTreeClassifier(), parameters_DT, cv=5)
    DT.fit(X_train, y_train)
    
    Out[11]:
    GridSearchCV(cv=5, error_score='raise-deprecating',
                 estimator=DecisionTreeClassifier(class_weight=None,
                                                  criterion='gini', max_depth=None,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  presort=False, random_state=None,
                                                  splitter='best'),
                 iid='warn', n_jobs=None,
                 param_grid={'max_features': [0.85, 0.9, 0.95],
                             'min_samples_split': [2, 3, 4, 5, 10]},
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring=None, verbose=0)

Step 3b results: applying the Decision Tree to the validation data and printing the roc_auc_score

In [12]:
# ROC AUC of the tuned Decision Tree on the validation set
DT_accuracy = roc_auc_score(y_val, DT.predict(X_val))
DT_accuracy
    
    Out[12]:
    0.9994156220391427

Also above 99%, also very high!

Step 3c: Logistic Regression hyperparameter tuning

    In [13]:
    # Logistic Regression
    parameters_LR = {'C':[0.01, 0.1, 1.0]}
    LR = GridSearchCV(LogisticRegression(), parameters_LR, cv=5)
    LR.fit(X_train, y_train)
    
C:\Users\jkorstan\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
C:\Users\jkorstan\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\svm\base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
(both warnings are repeated for every cross-validation fit)
    
    Out[13]:
    GridSearchCV(cv=5, error_score='raise-deprecating',
                 estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                              fit_intercept=True,
                                              intercept_scaling=1, l1_ratio=None,
                                              max_iter=100, multi_class='warn',
                                              n_jobs=None, penalty='l2',
                                              random_state=None, solver='warn',
                                              tol=0.0001, verbose=0,
                                              warm_start=False),
                 iid='warn', n_jobs=None, param_grid={'C': [0.01, 0.1, 1.0]},
                 pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                 scoring=None, verbose=0)
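
    The ConvergenceWarnings above hint that liblinear struggles with the unscaled features. A common remedy, sketched here but not applied in this quick run, is to scale the inputs and give the solver a larger iteration budget:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # scaling usually helps the solver converge when feature ranges differ wildly
    LR_scaled = GridSearchCV(
        make_pipeline(StandardScaler(),
                      LogisticRegression(solver='lbfgs', max_iter=1000)),
        {'logisticregression__C': [0.01, 0.1, 1.0]}, cv=5)
    LR_scaled.fit(X_train, y_train)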

Step 3c results: applying the Logistic Regression to the validation data and printing the roc_auc_score

In [14]:
# ROC AUC of the tuned Logistic Regression on the validation set
LR_accuracy = roc_auc_score(y_val, LR.predict(X_val))
LR_accuracy
    
    Out[14]:
    0.9374875161517064

About 94%: clearly lower than the tree-based models, but still high!

Step 4: choose the best-performing model on the validation data set

    In [15]:
# recap of the three validation AUC scores
    RF_accuracy, DT_accuracy, LR_accuracy
    
    Out[15]:
    (0.9997989809923306, 0.9994156220391427, 0.9374875161517064)

The selected model is the Random Forest!
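
    With only three candidates, picking the winner by eye is fine; with more models, the same selection can be done programmatically, for example:

    # map each candidate to its validation AUC and keep the best one
    scores = {'RF': RF_accuracy, 'DT': DT_accuracy, 'LR': LR_accuracy}
    best_name = max(scores, key=scores.get)
    best_model = {'RF': RF, 'DT': DT, 'LR': LR}[best_name]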

Step 5: Final Performance Estimate: apply the selected model to the Test Data

    In [16]:
# predict on the test set and compute the ROC AUC
    roc_auc_score(y_test, RF.predict(X_test))
    
    Out[16]:
    0.9995592125652177
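
    A single AUC number hides which kinds of errors the model makes. To finish, a confusion matrix or classification report on the test set shows precision and recall per class; a minimal sketch:

    from sklearn.metrics import classification_report, confusion_matrix

    y_pred = RF.predict(X_test)
    # rows are the true class (0 = normal, 1 = attack), columns the predicted class
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, target_names=['normal', 'attack']))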