Precision/Recall Trade-off


·       To understand this trade-off, let’s look at how the SGDClassifier makes its classification decisions.

·       For each instance, it computes a score based on a decision function.

·       If that score is greater than a threshold, it assigns the instance to the positive class; otherwise it assigns it to the negative class.

 


Figure 3. In this precision/recall trade-off, images are ranked by their classifier score, and those above the chosen decision threshold are considered positive; the higher the threshold, the lower the recall, but (in general) the higher the precision.

·       Figure 3 shows a few digits positioned from the lowest score on the left to the highest score on the right.

·       Suppose the decision threshold is positioned at the central arrow (between the two 5s): you will find 4 true positives (actual 5s) to the right of that threshold, and 1 false positive (actually a 6).

·       Therefore, with that threshold, the precision is 80% (4 out of 5). But out of 6 actual 5s, the classifier only detects 4, so the recall is 67% (4 out of 6).

·       If you raise the threshold (move it to the arrow on the right), the false positive (the 6) becomes a true negative, thereby increasing the precision (up to 100% in this case), but one true positive becomes a false negative, decreasing recall down to 50%.

·       Conversely, lowering the threshold increases recall and reduces precision. A small worked example is sketched below.
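To make the arithmetic in Figure 3 concrete, here is a minimal sketch (the labels and scores are made up for illustration; only their ordering matters) that reproduces the 80%/67% and 100%/50% figures:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical instances ordered by classifier score; 1 = actual 5, 0 = not a 5.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1])
scores = np.array([-9, -7, -5, -4, -3, -2, -1, 1, 2, 3, 4, 5])

# Central threshold: 4 true positives and 1 false positive lie to its right.
y_pred_central = scores > 0
print(precision_score(y_true, y_pred_central))  # 0.8   (4 out of 5)
print(recall_score(y_true, y_pred_central))     # 0.67  (4 out of 6)

# Raised threshold (the right arrow): the false positive is gone, but so is one 5.
y_pred_high = scores > 2.5
print(precision_score(y_true, y_pred_high))     # 1.0   (3 out of 3)
print(recall_score(y_true, y_pred_high))        # 0.5   (3 out of 6)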

·       Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores that it uses to make predictions.

·       Instead of calling the classifier’s predict() method, you can call its decision_function() method, which returns a score for each instance, and then use any threshold you want to make predictions based on those scores:

y_scores = sgd_clf.decision_function([some_digit])

y_scores

array([2412.53175101])

threshold = 0

y_some_digit_pred = (y_scores > threshold)

y_some_digit_pred

array([ True])

·       The SGDClassifier uses a threshold equal to 0, so the previous code returns the same result as the predict() method (i.e., True).

·       Let’s raise the threshold:

threshold = 8000

y_some_digit_pred = (y_scores > threshold)

y_some_digit_pred

array([False])

·       This confirms that raising the threshold decreases recall: the image actually represents a 5, and the classifier detects it when the threshold is 0, but misses it when the threshold is increased to 8,000.

How do you decide which threshold to use?

·       First, use the cross_val_predict() function to get the scores of all instances in the training set, but this time specify that you want to return decision scores instead of predictions:

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

·       With these scores, use the precision_recall_curve() function to compute precision and recall for all possible thresholds:

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

·       Finally, use Matplotlib to plot precision and recall as functions of the threshold value (Figure 4):

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    [...]  # highlight the threshold and add the legend, axis label, and grid

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)

plt.show()
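One possible way to complete the elided [...] body above (the styling choices are assumptions, not necessarily the original code) is:

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.legend(loc="center right")  # add the legend
    plt.xlabel("Threshold")         # axis label
    plt.grid(True)                  # grid
    # to highlight a chosen threshold (e.g. the ~8,000 used later):
    # plt.axvline(x=8000, color="r", linestyle=":")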

 




Figure 4. Precision and recall versus the decision threshold

·       You may wonder why the precision curve is bumpier than the recall curve in Figure 4.

·       The reason is that precision may sometimes go down when you raise the threshold.

·       To understand why, look back at Figure 3 and notice what happens when you start from the central threshold and move it just one digit to the right: precision goes from 4/5 (80%) down to 3/4 (75%).

·       On the other hand, recall can only go down when the threshold is increased, which explains why its curve looks smooth. The short sketch below demonstrates both behaviors on a toy example.
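Here is a minimal sketch (with made-up labels and scores, not the book’s dataset) showing that the precision values returned by precision_recall_curve() can dip as the threshold rises, while recall only ever decreases:

import numpy as np
from sklearn.metrics import precision_recall_curve

# One positive instance scores just above a negative one, so raising the
# threshold removes a true positive before it removes the false positive.
y_true_toy = np.array([1, 0, 1, 1, 0])
scores_toy = np.array([0.9, 0.8, 0.7, 0.6, 0.2])

prec_toy, rec_toy, thr_toy = precision_recall_curve(y_true_toy, scores_toy)
print(thr_toy)    # [0.6 0.7 0.8 0.9]
print(prec_toy)   # approx. [0.75 0.67 0.5  1.   1.]  -> dips, then recovers
print(rec_toy)    # approx. [1.   0.67 0.33 0.33 0.]  -> never increases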

·       Another way to select a good precision/recall trade-off is to plot precision directly against recall, as shown in Figure 5 (the same threshold as earlier is highlighted).

·       You can see that precision really starts to fall sharply around 80% recall.

·       You will probably want to select a precision/recall trade-off just before that drop, for example at around 60% recall. But of course, the choice depends on your project.


Figure 5. Precision versus recall
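Figure 5 can be produced with a short plotting helper; the following is a minimal sketch (the helper name and styling are assumptions, not necessarily the original code):

import matplotlib.pyplot as plt

def plot_precision_vs_recall(precisions, recalls):
    # precision as a function of recall, over all thresholds
    plt.plot(recalls, precisions, "b-")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

plot_precision_vs_recall(precisions, recalls)
plt.show()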

·       Suppose you decide to aim for 90% precision.

·       You look up the first plot (Figure 4) and find that you need to use a threshold of about 8,000.

·       To be more precise, you can search for the lowest threshold that gives you at least 90% precision. np.argmax() will give you the first index of the maximum value, which in this case means the first True value:

threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]  # ~7816

·       To make predictions (on the training set for now), instead of calling the classifier’s predict() method, you can run this code:

y_train_pred_90 = (y_scores >= threshold_90_precision)

·       Let’s check these predictions’ precision and recall:

precision_score(y_train_5, y_train_pred_90)

0.900038008361839

recall_score(y_train_5, y_train_pred_90)

0.4368197749492714

·       Great, you have a 90% precision classifier!

·       As you can see, it is fairly easy to create a classifier with virtually any precision you want: just set a high enough threshold.

·       But wait, not so fast.

·       A high-precision classifier is not very useful if its recall is too low!

 

Precision and Recall


Scikit-Learn provides several functions to compute classifier metrics, including precision and recall:





from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred) # == 4096 / (4096 + 1522)

0.7290850836596654

recall_score(y_train_5, y_train_pred) # == 4096 / (4096 + 1325)

0.7555801512636044

·       Now your 5-detector does not look as shiny as it did when you looked at its accuracy.

·       When it claims an image represents a 5, it is correct only 72.9% of the time.

·       Moreover, it only detects 75.6% of the 5s.

·       It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers.

·       The F1 score is the harmonic mean of precision and recall:

F1 = 2 / (1/precision + 1/recall) = 2 × precision × recall / (precision + recall) = TP / (TP + (FN + FP)/2)

·       Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values.

·       As a result, the classifier will only get a high F1 score if both recall and precision are high.

·       To compute the F1 score, simply call the f1_score() function:

from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)

     0.7420962043663375
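As a quick sanity check (a sketch reusing the precision and recall values computed above, not code from the book), the harmonic mean of the two scores matches f1_score():

p = precision_score(y_train_5, y_train_pred)  # ~0.7291
r = recall_score(y_train_5, y_train_pred)     # ~0.7556
print(2 * p * r / (p + r))                    # ~0.7421, same as f1_score()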

·       The F1 score favors classifiers that have similar precision and recall.

·       This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall.

·       For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product.

·       Conversely, suppose you train a classifier to detect shoplifters in surveillance images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall; the security guards will get a few false alerts, but almost all shoplifters will get caught.

·       Unfortunately, you can’t have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall trade-off.

 



Confusion Matrix


·       A much better way to evaluate the performance of a classifier is to look at the confusion matrix.

 



·       The general idea is to count the number of times instances of class A are classified as class B.

·       For example, to know the number of times the classifier confused images of 5s with 3s, you would look in the fifth row and third column of the confusion matrix.

·       To compute the confusion matrix, you first need to have a set of predictions so that they can be compared to the actual targets.

·       You could make predictions on the test set, but remember that you want to use the test set only at the very end of your project, once you have a classifier that you are ready to launch.

·       Instead, you can use the cross_val_predict() function:

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

 

·       Just like the cross_val_score() function, cross_val_predict() performs K-fold cross-validation, but instead of returning the evaluation scores, it returns the predictions made on each test fold.

·       This means that you get a clean prediction for each instance in the training set (“clean” meaning that the prediction is made by a model that never saw the data during training).

 

·       Now you are ready to get the confusion matrix using the confusion_matrix() function.

·       Just pass it the target classes (y_train_5) and the predicted classes (y_train_pred):

from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)

array([[53057,  1522],
       [ 1325,  4096]])

 

 

·       Each row in a confusion matrix represents an actual class, while each column represents a predicted class.

·       The first row of this matrix considers non-5 images (the negative class): 53,057 of them were correctly classified as non-5s (they are called true negatives), while the remaining 1,522 were wrongly classified as 5s (false positives).

·       The second row considers the images of 5s (the positive class): 1,325 were wrongly classified as non-5s (false negatives), while the remaining 4,096 were correctly classified as 5s (true positives).

·       A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right):

 

 

y_train_perfect_predictions = y_train_5  # pretend we reached perfection

confusion_matrix(y_train_5, y_train_perfect_predictions)

array([[54579,     0],
       [    0,  5421]])

 

·       The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric.

·       An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier:

Precision = TP / (TP + FP)

·       Here TP is the number of true positives and FP is the number of false positives.
 

·       A trivial way to have perfect precision is to make one single positive prediction and ensure it is correct (precision = 1/1 = 100%).

·       But this would not be very useful, since the classifier would ignore all but one positive instance.

·       So precision is typically used along with another metric named recall, also called sensitivity or the true positive rate (TPR); this is the ratio of positive instances that are correctly detected by the classifier:

Recall = TP / (TP + FN)

·       Here FN is the number of false negatives.

·       The confusion matrix and these four outcomes (TN, FP, FN, TP) are illustrated in Figure 2.



Figure 2. An illustrated confusion matrix shows examples of true negatives (top left), false positives (top right), false negatives (lower left), and true positives (lower right)
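To tie these formulas back to the confusion matrix computed above, here is a small sketch (reusing the y_train_5 and y_train_pred arrays from earlier) that unpacks the four cells and recomputes precision and recall by hand:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Unpack the 2x2 matrix: rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_train_5, y_train_pred).ravel()
print(tn, fp, fn, tp)        # 53057 1522 1325 4096

precision = tp / (tp + fp)   # 4096 / (4096 + 1522) ~ 0.729
recall = tp / (tp + fn)      # 4096 / (4096 + 1325) ~ 0.756

# These match scikit-learn's own metric functions:
print(precision, precision_score(y_train_5, y_train_pred))
print(recall, recall_score(y_train_5, y_train_pred))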

 
