Using scikit-learn¶

scikit-learn and pandas are among the most common tools for data science in Python.

In [1]:
import sklearn
import numpy as np
import matplotlib.pyplot as plt

Scikit-learn¶

sklearn has many of the tools needed to set up a data analysis pipeline:

  • preprocessors
  • models
  • model selection

Preprocessor¶

Preprocessors include

  • StandardScaler: shifts and scales the data to have mean 0 and standard deviation 1.
  • Normalizer: rescales each data sample so that its feature vector has unit length
  • MinMaxScaler: shifts and scales the data so it fits in a given interval
  • OneHotEncoder: transforms class labels to a one-hot encoded matrix of 0 or 1 values
  • PolynomialFeatures: Creates polynomial features
  • ...
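
As a quick illustration (a minimal sketch with made-up arrays, not part of the lecture examples), MinMaxScaler and OneHotEncoder could be applied like this:

In [ ]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up numerical features: each column is rescaled to [0, 1]
X_num = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 40.0]])
MinMaxScaler().fit_transform(X_num)

# Made-up class labels: one 0/1 column per distinct class
labels = np.array([['cat'], ['dog'], ['cat']])
OneHotEncoder().fit_transform(labels).toarray()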

Models¶

in sklearn.linear_model:

  • LogisticRegression: the logistic regression classifier discussed in Lecture 2.
  • Ridge: the ridge regression discussed in Lecture 4
  • Perceptron: the perceptron model discussed in Lecture 1

in sklearn.neural_network:

  • MLPClassifier: the multi-layer perceptron 'classic' neural network discussed in Lectures 5 and 6.

in sklearn.neighbors:

  • KNeighborsClassifier: the $k$-nearest-neighbours classifier discussed in Lecture 7.

in sklearn.svm:

  • SVC: the support vector classifier discussed in Lecture 3.

Interface¶

The preprocessors and models in sklearn share common methods:

  • fit(): fits to the data to set the model/preprocessor parameters
  • transform(): transforms the input data and returns the transformed data (preprocessors)
  • fit_transform(): does both operations in one call

Models have common functions:

  • predict(X): makes predictions for new data X
  • score(X, y): returns the model's score on data X with targets y
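
For instance (a minimal sketch with made-up data, not part of the lecture example), a classifier such as LogisticRegression exposes exactly these methods:

In [ ]:
from sklearn.linear_model import LogisticRegression

# Made-up 1D dataset with two classes
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression()
clf.fit(X_toy, y_toy)       # learn the model parameters from the data
clf.predict([[1.5]])        # predict the class of a new sample
clf.score(X_toy, y_toy)     # mean accuracy on the given data and targets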

Fit example with StandardScaler

In [2]:
from sklearn.preprocessing import StandardScaler
stdScaler = StandardScaler()
randomData = np.random.normal(2,3,size=(1000,1) )
stdScaler.fit(randomData)
Out[2]:
StandardScaler()

After the fit, the standard scaler has learned the mean and standard deviation of the dataset:

In [3]:
stdScaler.mean_, stdScaler.scale_
Out[3]:
(array([1.9907856]), array([2.92175702]))

It can now apply the same transformation to unseen data:

In [4]:
stdScaler.transform([
    [2],
    [5],
    [-1]
])
Out[4]:
array([[ 0.00315372],
       [ 1.02993315],
       [-1.02362571]])

Tools¶

in model_selection:

  • learning_curve: can be used to produce learning curves.
  • train_test_split: can be used to split a given dataset into a training and a validation sample.
  • GridSearchCV: can be used to scan over a grid of parameters using cross-validation.
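
For example (a minimal sketch with random made-up data), train_test_split could be used like this:

In [ ]:
from sklearn.model_selection import train_test_split

# Made-up dataset: 100 samples with 2 features and binary targets
X_demo = np.random.random((100, 2))
y_demo = np.random.randint(0, 2, size=100)

# keep 20% of the samples as a validation sample
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)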

Model selection with GridSearchCV¶

We start with the same dataset as in one of the exercises:

In [17]:
def fn(x):
    return 7 - 8*x - 0.5*x**2 + 0.5*x**3
  
n_train = 100
np.random.seed(1122)
xs = np.linspace(0, 5)
rxs = 5 * np.random.random(n_train)
X1D = np.array([rxs]).T
ys1D = fn(rxs) + np.random.normal(size = (n_train) )
In [18]:
plt.plot(xs, fn(xs), 'b--')
plt.plot(rxs, ys1D, 'ok')
plt.xlabel('x');
plt.ylabel('y');
In [19]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

polynomial_features = PolynomialFeatures(degree=8)
X_train = polynomial_features.fit_transform(X1D)

alpha_values = np.logspace(-4, 4, 100)
parameters = {'alpha': alpha_values}
r = Ridge()
Rsearch = GridSearchCV(r, parameters, cv=5)
Rsearch.fit(X_train, ys1D);

Our grid search has trained a ridge regression for each value of $\alpha$ and performed a 5-fold cross-validation, so we have access to an average score and an uncertainty estimate.
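
Besides the per-fold results inspected below, the fitted search also exposes the winning configuration directly (a short sketch using the Rsearch object fitted above):

In [ ]:
Rsearch.best_params_   # the value of alpha with the highest mean cross-validated score
Rsearch.best_score_    # the corresponding mean test score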

In [8]:
Rsearch.cv_results_.keys()
Out[8]:
dict_keys(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 'param_alpha', 'params', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score', 'mean_test_score', 'std_test_score', 'rank_test_score'])

We can now plot the score as a function of $\alpha$:

In [9]:
scores = Rsearch.cv_results_['mean_test_score']
scores_std = Rsearch.cv_results_['std_test_score']
plt.fill_between(alpha_values, scores - scores_std,
                 scores + scores_std, alpha=0.1, color="g")
plt.plot(alpha_values, scores)
plt.xscale('log')
plt.xlabel(r'Regularisation parameter $\alpha$')
plt.ylabel('Average score');

We can access the best model using best_estimator_:

In [10]:
xval = np.arange(0,5.1,0.1).reshape(-1, 1)
pxval = polynomial_features.transform(xval)
ypred = Rsearch.best_estimator_.predict(pxval)

plt.plot(rxs, ys1D,'ok')  
plt.plot(xval, ypred , color='r')

plt.xlabel('x')
plt.ylabel('y');

Pipelines¶

You will have noticed that we had to remember all the steps of the training to make the model's prediction for the preceding plot. This is awkward and error-prone.

We can use Pipeline to create all steps of an analysis in one object.

In [11]:
from sklearn.pipeline import Pipeline

analysis_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=8)), 
    ('ridge', Ridge())
])

This pipeline can be used like a normal model; for example, we can use it in a grid search:

In [12]:
degrees = [5,6,7]
parameters = {
    'ridge__alpha': alpha_values, 
    'poly__degree': degrees
}
Psearch = GridSearchCV(analysis_pipeline, parameters, cv=5)
Psearch.fit(X1D, ys1D);

Notice how parameters of specific steps can be set using the step__parameter naming convention!
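
The same step__parameter naming works outside of a grid search too; for example (a small sketch, not used in the rest of this notebook), parameters of a pipeline step can be set directly:

In [ ]:
# set the polynomial degree and the ridge regularisation strength by step name
analysis_pipeline.set_params(poly__degree=6, ridge__alpha=0.1)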

We can plot the scores for each polynomial order:

In [13]:
for j in range(3):
    scores = Psearch.cv_results_['mean_test_score'][j*100:(j+1)*100]
    scores_std = Psearch.cv_results_['std_test_score'][j*100:(j+1)*100]
    plt.fill_between(alpha_values, scores - scores_std,
                 scores + scores_std, alpha=0.1, label="n={}".format(degrees[j]))
    plt.plot(alpha_values, scores)
plt.xscale('log')
plt.legend()
plt.xlabel(r'Regularisation parameter $\alpha$')
plt.ylabel('Test score');

And plot the best estimator's prediction:

In [14]:
xval = np.arange(0,5.1,0.1).reshape(-1, 1)
ypred = Psearch.best_estimator_.predict(xval)

plt.plot(rxs, ys1D,'ok')  
plt.plot(xval, ypred , color='r')

plt.xlabel('x')
plt.ylabel('y');

Notice how we did not need to explicitly perform all the steps.

We can also use this estimator in other tools such as learning_curve:

In [15]:
from sklearn.model_selection import learning_curve

train_sizes = np.linspace(.1, 1.0, 10)

train_sizes, train_scores, test_scores = learning_curve(
    Psearch.best_estimator_, X1D, ys1D, cv=5, train_sizes=train_sizes)

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
In [16]:
plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                 train_scores_mean + train_scores_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                 test_scores_mean + test_scores_std, alpha=0.1, color="g")

plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
         label="Training score")
plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
         label="Cross-validation score")

plt.ylim(0,1); plt.grid(); plt.legend(loc="best");