{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\nThe eigenfaces example: chaining PCA and SVMs\n=============================================\n\nThe goal of this example is to show how an unsupervised method and a\nsupervised one can be chained for better prediction. It starts with a\ndidactic but lengthy way of doing things, and finishes with the\nidiomatic approach to pipelining in scikit-learn.\n\nHere we'll take a look at a simple facial recognition example. Ideally,\nwe would use a dataset consisting of a subset of the `Labeled Faces in\nthe Wild `__ data that is available\nwith :func:`sklearn.datasets.fetch_lfw_people`. However, this is a\nrelatively large download (~200MB) so we will do the tutorial on a\nsimpler, less rich dataset. Feel free to explore the LFW dataset.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn import datasets\nfaces = datasets.fetch_olivetti_faces()\nfaces.data.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's visualize these faces to see what we're working with\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from matplotlib import pyplot as plt\nfig = plt.figure(figsize=(8, 6))\n# plot several images\nfor i in range(15):\n ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])\n ax.imshow(faces.images[i], cmap=plt.cm.bone)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
".. tip::\n\n Note is that these faces have already been localized and scaled to a\n common size. This is an important preprocessing piece for facial\n recognition, and is a process that can require a large collection of\n training data. This can be done in scikit-learn, but the challenge is\n gathering a sufficient amount of training data for the algorithm to work.\n Fortunately, this piece is common enough that it has been done. One good\n resource is\n `OpenCV `__,\n the *Open Computer Vision Library*.\n\nWe'll perform a Support Vector classification of the images. We'll do a\ntypical train-test split on the images:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\nX_train, X_test, y_train, y_test = train_test_split(faces.data,\n faces.target, random_state=0)\n\nprint(X_train.shape, X_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preprocessing: Principal Component Analysis\n-------------------------------------------\n\n1850 dimensions is a lot for SVM. We can use PCA to reduce these 1850\nfeatures to a manageable size, while maintaining most of the information\nin the dataset.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn import decomposition\npca = decomposition.PCA(n_components=150, whiten=True)\npca.fit(X_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One interesting part of PCA is that it computes the \"mean\" face, which\ncan be interesting to examine:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.imshow(pca.mean_.reshape(faces.images[0].shape),\n cmap=plt.cm.bone)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The principal components measure deviations about this mean along\northogonal axes.\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(pca.components_.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also interesting to visualize these principal components:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fig = plt.figure(figsize=(16, 6))\nfor i in range(30):\n ax = fig.add_subplot(3, 10, i + 1, xticks=[], yticks=[])\n ax.imshow(pca.components_[i].reshape(faces.images[0].shape),\n cmap=plt.cm.bone)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The components (\"eigenfaces\") are ordered by their importance from\ntop-left to bottom-right. We see that the first few components seem to\nprimarily take care of lighting conditions; the remaining components\npull out certain identifying features: the nose, eyes, eyebrows, etc.\n\nWith this projection computed, we can now project our original training\nand test data onto the PCA basis:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X_train_pca = pca.transform(X_train)\nX_test_pca = pca.transform(X_test)\nprint(X_train_pca.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(X_test_pca.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These projected components correspond to factors in a linear combination\nof component images such that the combination approaches the original\nface.\n\nDoing the Learning: Support Vector Machines\n-------------------------------------------\n\nNow we'll perform support-vector-machine classification on this reduced\ndataset:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn import svm\nclf = svm.SVC(C=5., gamma=0.001)\nclf.fit(X_train_pca, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can evaluate how well this classification did. First, we\nmight plot a few of the test-cases with the labels learned from the\ntraining set:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\nfig = plt.figure(figsize=(8, 6))\nfor i in range(15):\n ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])\n ax.imshow(X_test[i].reshape(faces.images[0].shape),\n cmap=plt.cm.bone)\n y_pred = clf.predict(X_test_pca[i, np.newaxis])[0]\n color = ('black' if y_pred == y_test[i] else 'red')\n ax.set_title(y_pred, fontsize='small', color=color)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The classifier is correct on an impressive number of images given the\nsimplicity of its learning model! Using a linear classifier on 150\nfeatures derived from the pixel-level data, the algorithm correctly\nidentifies a large number of the people in the images.\n\nAgain, we can quantify this effectiveness using one of several measures\nfrom :mod:`sklearn.metrics`. First we can do the classification\nreport, which shows the precision, recall and other measures of the\n\"goodness\" of the classification:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn import metrics\ny_pred = clf.predict(X_test_pca)\nprint(metrics.classification_report(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another interesting metric is the *confusion matrix*, which indicates\nhow often any two items are mixed-up. The confusion matrix of a perfect\nclassifier would only have nonzero entries on the diagonal, with zeros\non the off-diagonal:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(metrics.confusion_matrix(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pipelining\n----------\n\nAbove we used PCA as a pre-processing step before applying our support\nvector machine classifier. Plugging the output of one estimator directly\ninto the input of a second estimator is a commonly used pattern; for\nthis reason scikit-learn provides a ``Pipeline`` object which automates\nthis process. The above problem can be re-expressed as a pipeline as\nfollows:\n\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\nclf = Pipeline([('pca', decomposition.PCA(n_components=150, whiten=True)),\n ('svm', svm.LinearSVC(C=1.0))])\n\nclf.fit(X_train, y_train)\n\ny_pred = clf.predict(X_test)\nprint(metrics.confusion_matrix(y_pred, y_test))\nplt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Note on Facial Recognition\n----------------------------\n\nHere we have used PCA \"eigenfaces\" as a pre-processing step for facial\nrecognition. The reason we chose this is because PCA is a\nbroadly-applicable technique, which can be useful for a wide array of\ndata types. Research in the field of facial recognition in particular,\nhowever, has shown that other more specific feature extraction methods\nare can be much more effective.\n\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}