Support Vector Machines
Predicting Date Fruit Varieties with Support Vector Machines
Context
The objective is to leverage advanced machine learning techniques to predict
the variety of date fruits, empowering farmers and agricultural stakeholders to
improve classification accuracy and streamline post-harvest processes. Your
role is to analyze various morphological, colorimetric, and textural attributes
of date fruits to build predictive models that distinguish different varieties
effectively.
Dataset Description
You have been provided with a comprehensive dataset containing morphological and colorimetric features of different varieties of date fruits. The dataset includes the following attributes:
Morphological Attributes:
- AREA: Surface area of the date fruit.
- PERIMETER: Perimeter measurement around the fruit.
- MAJOR_AXIS: Length of the major axis of the date fruit.
- MINOR_AXIS: Length of the minor axis of the date fruit.
- ECCENTRICITY: Ratio describing the shape of the fruit based on the axes.
- EQDIASQ: Equivalent diameter of a circle with the same area as the fruit.
- SOLIDITY: Ratio of the area to the convex hull area.
- CONVEX_AREA: Area of the smallest convex polygon that can contain the fruit.
- EXTENT: Ratio of the area to the bounding box area.
- ASPECT_RATIO: Ratio of the major axis to the minor axis.
- ROUNDNESS: Measure of how circular the fruit is.
- COMPACTNESS: Measure of how compact or dense the fruit is.
Shape Factor Attributes:
- SHAPEFACTOR_1: Ratio of the perimeter squared to 4π times the area.
- SHAPEFACTOR_2: Ratio of 4π times the area to the perimeter squared.
- SHAPEFACTOR_3: Ratio of the major axis to the equivalent diameter.
- SHAPEFACTOR_4: Ratio of the minor axis to the equivalent diameter.
Colorimetric Attributes:
- MeanRR: Mean intensity of the red color channel.
- MeanRG: Mean intensity of the green color channel.
- MeanRB: Mean intensity of the blue color channel.
- StdDevRR: Standard deviation of the red color channel.
- StdDevRG: Standard deviation of the green color channel.
- StdDevRB: Standard deviation of the blue color channel.
- SkewRR: Skewness of the red color channel.
- SkewRG: Skewness of the green color channel.
- SkewRB: Skewness of the blue color channel.
- KurtosisRR: Kurtosis of the red color channel.
- KurtosisRG: Kurtosis of the green color channel.
- KurtosisRB: Kurtosis of the blue color channel.
- EntropyRR: Entropy of the red color channel.
- EntropyRG: Entropy of the green color channel.
- EntropyRB: Entropy of the blue color channel.
Daubechies Wavelet Attributes:
- ALLdaub4RR: Wavelet-transformed feature of the red color channel.
- ALLdaub4RG: Wavelet-transformed feature of the green color channel.
- ALLdaub4RB: Wavelet-transformed feature of the blue color channel.
Target Attribute:
- Class: The variety or class of the date fruit.
- Check this Date Fruit Vlog to see the images of these dates.
Your task is
to utilize Support Vector Machines (SVMs) to predict the "Class" of
each date fruit and identify the most influential features contributing to
accurate classification. This project will help streamline the sorting and
grading processes in the agricultural industry, offering practical insights to
farmers and other stakeholders.
!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/070/671/original/Date_Fruit_Datasets.zip
--2024-07-13 05:51:52-- https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/070/671/original/Date_Fruit_Datasets.zip
Resolving d2beiqkhq929f0.cloudfront.net
(d2beiqkhq929f0.cloudfront.net)... 13.35.37.7, 13.35.37.159, 13.35.37.102, ...
Connecting to d2beiqkhq929f0.cloudfront.net
(d2beiqkhq929f0.cloudfront.net)|13.35.37.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 416324 (407K) [application/zip]
Saving to: ‘Date_Fruit_Datasets.zip’
Date_Fruit_Datasets 100%[===================>] 406.57K  --.-KB/s    in 0.02s
2024-07-13 05:51:52 (18.5 MB/s) - ‘Date_Fruit_Datasets.zip’ saved [416324/416324]
!unzip /content/Date_Fruit_Datasets.zip
Archive:  /content/Date_Fruit_Datasets.zip
   creating: Date_Fruit_Datasets/
  inflating: Date_Fruit_Datasets/Date_Fruit_Datasets.arff
  inflating: Date_Fruit_Datasets/Date_Fruit_Datasets.xlsx
  inflating: Date_Fruit_Datasets/Date_Fruit_Datasets_Citation_Request.txt
import warnings
warnings.filterwarnings("ignore")
import os
import pandas as pd
df = pd.read_excel('/content/Date_Fruit_Datasets/Date_Fruit_Datasets.xlsx')
df.head()
AREA PERIMETER MAJOR_AXIS MINOR_AXIS ECCENTRICITY EQDIASQ SOLIDITY CONVEX_AREA EXTENT ASPECT_RATIO ... KurtosisRR KurtosisRG KurtosisRB EntropyRR EntropyRG EntropyRB ALLdaub4RR ALLdaub4RG ALLdaub4RB Class
0 422163 2378.908 837.8484 645.6693 0.6373 733.1539 0.9947 424428 0.7831 1.2976 ... 3.2370 2.9574 4.2287 -59191263232 -50714214400 -39922372608 58.7255 54.9554 47.8400 BERHI
1 338136 2085.144 723.8198 595.2073 0.5690 656.1464 0.9974 339014 0.7795 1.2161 ... 2.6228 2.6350 3.1704 -34233065472 -37462601728 -31477794816 50.0259 52.8168 47.8315 BERHI
2 526843 2647.394 940.7379 715.3638 0.6494 819.0222 0.9962 528876 0.7657 1.3150 ... 3.7516 3.8611 4.7192 -93948354560 -74738221056 -60311207936 65.4772 59.2860 51.9378 BERHI
3 416063 2351.210 827.9804 645.2988 0.6266 727.8378 0.9948 418255 0.7759 1.2831 ... 5.0401 8.6136 8.2618 -32074307584 -32060925952 -29575010304 43.3900 44.1259 41.1882 BERHI
4 347562 2160.354 763.9877 582.8359 0.6465 665.2291 0.9908 350797 0.7569 1.3108 ... 2.7016 2.9761 4.4146 -39980974080 -35980042240 -25593278464 52.7743 50.9080 42.6666 BERHI
5 rows × 35 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 898 entries, 0 to 897
Data columns (total 35 columns):
# Column Non-Null Count Dtype
---  ------         --------------  -----
 0   AREA           898 non-null    int64
 1   PERIMETER      898 non-null    float64
 2   MAJOR_AXIS     898 non-null    float64
 3   MINOR_AXIS     898 non-null    float64
 4   ECCENTRICITY   898 non-null    float64
 5   EQDIASQ        898 non-null    float64
 6   SOLIDITY       898 non-null    float64
 7   CONVEX_AREA    898 non-null    int64
 8   EXTENT         898 non-null    float64
 9   ASPECT_RATIO   898 non-null    float64
 10  ROUNDNESS      898 non-null    float64
 11  COMPACTNESS    898 non-null    float64
 12  SHAPEFACTOR_1  898 non-null    float64
 13  SHAPEFACTOR_2  898 non-null    float64
 14  SHAPEFACTOR_3  898 non-null    float64
 15  SHAPEFACTOR_4  898 non-null    float64
 16  MeanRR         898 non-null    float64
 17  MeanRG         898 non-null    float64
 18  MeanRB         898 non-null    float64
 19  StdDevRR       898 non-null    float64
 20  StdDevRG       898 non-null    float64
 21  StdDevRB       898 non-null    float64
 22  SkewRR         898 non-null    float64
 23  SkewRG         898 non-null    float64
 24  SkewRB         898 non-null    float64
 25  KurtosisRR     898 non-null    float64
 26  KurtosisRG     898 non-null    float64
 27  KurtosisRB     898 non-null    float64
 28  EntropyRR      898 non-null    int64
 29  EntropyRG      898 non-null    int64
 30  EntropyRB      898 non-null    int64
 31  ALLdaub4RR     898 non-null    float64
 32  ALLdaub4RG     898 non-null    float64
 33  ALLdaub4RB     898 non-null    float64
 34  Class          898 non-null    object
dtypes: float64(29), int64(5), object(1)
memory usage: 245.7+ KB
class_counts = df.Class.value_counts()
classes = list(class_counts.index)
n_classes = len(classes)
print(f"we've got {n_classes} classes\n")
display(class_counts)
we've got 7 classes
Class
DOKOL 204
SAFAVI 199
ROTANA 166
DEGLET 98
SOGAY 94
IRAQI 72
BERHI 65
Name: count, dtype: int64
Q1. Feature Correlation Analysis
Context:
Support Vector Machines (SVM) are sensitive to highly correlated
features, especially when using linear kernels. Such correlations can
negatively affect the model’s generalization ability and potentially lead to
overfitting.
Task:
Calculate the Pearson correlation coefficients among 'PERIMETER',
'MAJOR_AXIS', 'MINOR_AXIS', and 'COMPACTNESS'. These insights will help in
understanding the interdependencies among features that are critical to SVM
performance.
Instructions:
- Calculate Correlation Matrix: Use the dataset to calculate the Pearson correlation matrix for the specified features.
- Visualize Correlation Matrix: Create a heatmap to visualize the correlations between features, which will help identify which pairs are most correlated.
Question:
Based on the heatmap, which pair of features shows the highest
correlation? Discuss the potential impact of this correlation on SVM
classification and suggest how to mitigate negative effects.
Options:
A) 'PERIMETER' and 'MAJOR_AXIS'
B) 'MAJOR_AXIS' and 'MINOR_AXIS'
C) 'MINOR_AXIS' and 'COMPACTNESS'
D) 'PERIMETER' and 'COMPACTNESS'
import seaborn as sns
import matplotlib.pyplot as plt

# Features to include in the correlation matrix
features = ['PERIMETER', 'MAJOR_AXIS', 'MINOR_AXIS', 'COMPACTNESS']

# TODO: Compute the correlation matrix for the selected features
correlation_matrix = _________

plt.figure(figsize=(10, 8))
# TODO: Fill in the appropriate variable to visualize the correlation matrix
sns.heatmap(________, annot=True, cmap='coolwarm', linewidths=2, linecolor='black')
plt.title('Heatmap of Pearson Correlation Among Features')
plt.show()
Final Code:
import seaborn as sns
import matplotlib.pyplot as plt

# Features to include in the correlation matrix
features = ['PERIMETER', 'MAJOR_AXIS', 'MINOR_AXIS', 'COMPACTNESS']

# Compute the correlation matrix for the selected features
correlation_matrix = df[features].corr()

plt.figure(figsize=(10, 8))
# Visualize the correlation matrix as an annotated heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=2, linecolor='black')
plt.title('Heatmap of Pearson Correlation Among Features')
plt.show()
Ans: 'PERIMETER' and 'MAJOR_AXIS'
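One common way to mitigate such a strong correlation is to drop one feature of the pair, or to decorrelate the block (for example with PCA) before fitting the SVM. Below is a minimal sketch of both ideas, assuming df and the features list from above; the 0.9 threshold is only an illustrative choice.

from sklearn.decomposition import PCA

# Option 1: flag one feature from each highly correlated pair for removal
# (0.9 is an arbitrary illustrative threshold)
corr = df[features].corr().abs()
to_drop = [col for i, col in enumerate(features) if any(corr.iloc[i, :i] > 0.9)]
print("Candidate columns to drop:", to_drop)

# Option 2: decorrelate the block with PCA and feed the components to the SVM instead
# (in practice the features would be standardized before PCA)
pca = PCA(n_components=2)
components = pca.fit_transform(df[features])
print("Explained variance ratio:", pca.explained_variance_ratio_)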
Q2. SVM Precision with Feature Scaling
Context:
Support Vector Machine (SVM) is sensitive to the scale of the data,
which can significantly impact its performance. This question aims to explore
the effect of feature scaling on SVM classification accuracy and precision for
each class.
Task:
After applying feature scaling, identify the class with the lowest
precision in an SVM classification task. The SVM model uses a linear kernel.
Instructions:
- Prepare Data:
- Split the dataset into training and test sets using train_test_split with test_size=0.3 and random_state=42.
- Extract features (X) and the target variable (y) from the dataset.
- Feature Scaling:
- Scale the features using StandardScaler.
- Train SVM:
- Train a Support Vector Machine with a linear kernel on the scaled
training data.
- Evaluate the Model:
- Predict the target variable on the scaled test set.
- Generate a classification report to review the precision scores
for each class.
Question:
After training an SVM with a linear kernel on scaled data, which class
in the classification report exhibits the lowest precision?
Options:
(A) SAFAVI
(B) IRAQI
(C) ROTANA
(D) DEGLET
from sklearn.preprocessing import StandardScaler
from sklearn.___ import SVC  # TODO: Import the SVC (Support Vector Classifier) from sklearn.svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Preparing the feature and target variables
X = df.drop('Class', axis=1)
y = df['Class']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initializing the SVM model
svm = SVC(_____='linear')  # TODO: Specify the kernel type for the SVC model

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.____(X_train)  # TODO: Apply the scaling to the training features
X_test_scaled = scaler.____(X_test)  # TODO: Apply the scaling to the test features

# Training the SVM model
svm.fit(X_train_scaled, y_train)
predictions_scaled = svm.____(X_test_scaled)  # TODO: Use the model to predict the test set

print("With Scaling:")
print(classification_report(____, ____))  # TODO: Provide the true and predicted values to generate the classification report
Final Code:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Preparing the feature and target variables
X = df.drop('Class', axis=1)
y = df['Class']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initializing the SVM model with a linear kernel
svm = SVC(kernel='linear')

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit the scaler on the training features and transform them
X_test_scaled = scaler.transform(X_test)  # Transform the test features with the fitted scaler

# Training the SVM model
svm.fit(X_train_scaled, y_train)
predictions_scaled = svm.predict(X_test_scaled)  # Predict on the scaled test set

print("With Scaling:")
print(classification_report(y_test, predictions_scaled))  # True labels vs. predictions
With Scaling:
              precision    recall  f1-score   support

       BERHI       0.89      0.94      0.91        17
      DEGLET       0.73      0.79      0.76        28
       DOKOL       0.97      0.96      0.96        67
       IRAQI       0.87      0.95      0.91        21
      ROTANA       1.00      0.85      0.92        55
      SAFAVI       0.98      0.98      0.98        51
       SOGAY       0.74      0.84      0.79        31

    accuracy                           0.91       270
   macro avg       0.88      0.90      0.89       270
weighted avg       0.91      0.91      0.91       270
Ans: DEGLET
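To make the scaling effect concrete, a minimal sketch (assuming the X_train, X_test, y_train, y_test split from above) trains the same linear SVM on the raw, unscaled features; performance typically degrades, and fitting can be slow because the raw features differ in scale by several orders of magnitude.

# Sketch: the same linear SVM trained on the raw, unscaled features for comparison.
# max_iter is capped only as a safeguard, since unscaled features can make the solver converge very slowly.
svm_unscaled = SVC(kernel='linear', max_iter=10000)
svm_unscaled.fit(X_train, y_train)
predictions_unscaled = svm_unscaled.predict(X_test)

print("Without Scaling:")
print(classification_report(y_test, predictions_unscaled))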
Q3. Optimal Feature Count
Context:
Feature selection is crucial in machine learning to reduce
dimensionality and improve model performance. This exercise involves using
Recursive Feature Elimination (RFE) with a Support Vector Machine (SVM) to
identify the optimal number of features that yield the highest precision.
Task:
Analyze the effect of different numbers of features on the precision of
an SVM model trained with these features. Identify the number of features that
leads to the highest precision.
Instructions:
- Data Preparation:
- Load the dataset and separate it into features (X) and the target
variable (y).
- Standardize the features to improve model training.
- Feature Selection and Model Training:
- Implement Recursive Feature Elimination (RFE) with an SVM
classifier to select varying numbers of features.
- Train an SVM model using these selected features.
- Use StratifiedKFold for cross-validation to ensure the model is robust and
generalizable.
- Evaluation:
- Calculate cross-validated precision for the model with different
numbers of selected features.
- Record and plot precision against the number of features.
- Analysis:
- Identify the number of features for which the precision is
maximized.
Question:
Based on the evaluation, what is the number of features required to
achieve the highest precision with the SVM model?
Options:
(A) 5 features
(B) 8 features
(C) 16 features
(D) 18 features
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Load your data and separate into X (features) and y (target)
X = df.drop('Class', axis=1)
y = df['Class']

# Prepare cross-validation (use StratifiedKFold for classification tasks)
cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

# Lists to store metrics
accuracies = []
precisions = []
f1_scores = []

feature_counts = ______  # TODO: Define the range of feature counts to be selected

for n in feature_counts:
    # Feature scaling
    scaler = StandardScaler()
    X_scaled = scaler.______(X)  # TODO: Apply scaling to the feature set

    # Feature selection
    selector = RFE(SVC(kernel='linear', random_state=42), n_features_to_select=_____, step=1)  # TODO: Specify the number of features to select
    X_selected = _____.____(X_scaled, y)  # TODO: Fit and transform the data with the selector

    # SVM classifier
    svm = ____(kernel='linear', random_state=10)  # TODO: Initialize the SVM model with appropriate parameters

    # Evaluate using cross-validation
    accuracy_scores = []
    precision_scores = []
    f1_scores_list = []
    for train_index, test_index in cv.split(X_selected, y):
        X_train_cv, X_test_cv = X_selected[train_index], X_selected[test_index]
        y_train_cv, y_test_cv = y[train_index], y[test_index]

        svm.fit(______, y_train_cv)  # TODO: Fit the SVM model on the training data
        y_pred = svm.______(X_test_cv)  # TODO: Make predictions on the test data

        accuracy_scores.append(_____(y_test_cv, y_pred))  # TODO: Calculate the accuracy score
        precision_scores.append(_____(y_test_cv, y_pred, average='weighted'))  # TODO: Calculate the precision score
        f1_scores_list.append(_____(y_test_cv, y_pred, average='weighted'))  # TODO: Calculate the F1 score

    accuracies.append(np.mean(accuracy_scores))
    precisions.append(np.mean(precision_scores))
    f1_scores.append(np.mean(f1_scores_list))

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(feature_counts, accuracies, label='Accuracy', marker='o')
plt.plot(feature_counts, precisions, label='Precision', marker='o')
plt.plot(feature_counts, f1_scores, label='F1 Score', marker='o')

# Highlight the point with highest precision
max_precision_index = np.argmax(precisions)
max_precision_feature_count = feature_counts[max_precision_index]
plt.scatter(max_precision_feature_count, precisions[max_precision_index], color='red')
plt.axvline(x=max_precision_feature_count, color='r', linestyle='--', lw=2)
plt.annotate(f'Highest Precision: {precisions[max_precision_index]:.2f} at {max_precision_feature_count} features',
             (max_precision_feature_count, precisions[max_precision_index]),
             textcoords="offset points", xytext=(0, 10), ha='center')

plt.xlabel('Number of Features Selected')
plt.ylabel('Score')
plt.title('Cross-Validated Performance Metrics vs. Number of Features')
plt.legend()
plt.grid(True)
plt.show()
Final Code:
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Load your data and separate into X (features) and y (target)
X = df.drop('Class', axis=1)
y = df['Class']

# Prepare cross-validation (use StratifiedKFold for classification tasks)
cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

# Lists to store metrics
accuracies = []
precisions = []
f1_scores = []

feature_counts = range(1, X.shape[1] + 1)  # Try every possible number of selected features

for n in feature_counts:
    # Feature scaling
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Feature selection
    selector = RFE(SVC(kernel='linear', random_state=42), n_features_to_select=n, step=1)
    X_selected = selector.fit_transform(X_scaled, y)

    # SVM classifier
    svm = SVC(kernel='linear', random_state=10)

    # Evaluate using cross-validation
    accuracy_scores = []
    precision_scores = []
    f1_scores_list = []
    for train_index, test_index in cv.split(X_selected, y):
        X_train_cv, X_test_cv = X_selected[train_index], X_selected[test_index]
        y_train_cv, y_test_cv = y[train_index], y[test_index]

        svm.fit(X_train_cv, y_train_cv)
        y_pred = svm.predict(X_test_cv)

        accuracy_scores.append(accuracy_score(y_test_cv, y_pred))
        precision_scores.append(precision_score(y_test_cv, y_pred, average='weighted'))
        f1_scores_list.append(f1_score(y_test_cv, y_pred, average='weighted'))

    accuracies.append(np.mean(accuracy_scores))
    precisions.append(np.mean(precision_scores))
    f1_scores.append(np.mean(f1_scores_list))

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(feature_counts, accuracies, label='Accuracy', marker='o')
plt.plot(feature_counts, precisions, label='Precision', marker='o')
plt.plot(feature_counts, f1_scores, label='F1 Score', marker='o')

# Highlight the point with highest precision
max_precision_index = np.argmax(precisions)
max_precision_feature_count = feature_counts[max_precision_index]
plt.scatter(max_precision_feature_count, precisions[max_precision_index], color='red')
plt.axvline(x=max_precision_feature_count, color='r', linestyle='--', lw=2)
plt.annotate(f'Highest Precision: {precisions[max_precision_index]:.2f} at {max_precision_feature_count} features',
             (max_precision_feature_count, precisions[max_precision_index]),
             textcoords="offset points", xytext=(0, 10), ha='center')

plt.xlabel('Number of Features Selected')
plt.ylabel('Score')
plt.title('Cross-Validated Performance Metrics vs. Number of Features')
plt.legend()
plt.grid(True)
plt.show()
Ans: 16 features
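To see which columns survive at the best count (16 here), the selector can be refit once at that setting and its boolean support mask read off. A minimal sketch, assuming X, y, and the imports from the code above:

# Sketch: refit RFE once at 16 features and list the columns it keeps.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

selector_16 = RFE(SVC(kernel='linear', random_state=42), n_features_to_select=16, step=1)
selector_16.fit(X_scaled, y)

selected_columns = X.columns[selector_16.support_]  # boolean mask of retained features
print("Features selected by RFE:", list(selected_columns))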
Q4. Optimal Degree for Polynomial
Context:
Support Vector Machines (SVM) with a polynomial kernel are useful for
non-linear data classification. The degree of the polynomial kernel plays a
critical role in the model's ability to capture complex patterns in the data.
This exercise involves determining the optimal polynomial degree for maximum
precision in an SVM model.
Task:
Evaluate the performance of an SVM with a polynomial kernel at various
degrees and identify the degree that results in the highest precision.
Instructions:
- Data Preparation:
- Assume X_train_scaled and y_train are your scaled feature set and target variable,
respectively, prepared for training.
- Model Configuration and Evaluation:
- Set up a polynomial kernel SVM for degrees [2, 3, 4, 5].
- Use cross-validation to evaluate the accuracy, precision, and F1
score for each degree.
- Collect and plot these metrics to visualize how they vary with the
polynomial degree.
- Analysis:
- Identify which polynomial degree achieves the highest precision
score.
Question:
Based on the cross-validated performance metrics, which degree of the
polynomial kernel results in the highest precision?
Options:
(A) Degree 2
(B) Degree 3
(C) Degree 4
(D) Degree 5
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing different degrees of the polynomial kernel
degrees = [2, 3, 4, 5]
accuracies = []
precisions = []
f1_scores = []

for degree in degrees:
    svm_poly = SVC(kernel=____________, ________=degree, gamma='scale', random_state=10)  # TODO: Specify the kernel type and degree parameter

    # Cross-validate the model
    accuracy = ______(svm_poly, X_train_scaled, y_train, cv=cv, scoring='accuracy').mean()  # TODO: Use the appropriate cross-validation function to get the mean accuracy
    precision = cross_val_score(svm_poly, ______, y_train, cv=cv, scoring='precision_weighted').____()  # TODO: Provide the feature set and call the appropriate method to get the mean precision
    f1 = cross_val_score(svm_poly, X_train_scaled, y_train, ____=cv, ______='f1_weighted').mean()  # TODO: Specify the appropriate parameters for cross-validation

    # Store results for plotting
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(degrees, accuracies, label='Accuracy', marker='o')
plt.plot(degrees, precisions, label='Precision', marker='o')
plt.plot(degrees, f1_scores, label='F1 Score', marker='o')
plt.xlabel('Degree of Polynomial Kernel')
plt.ylabel('Score')
plt.title('Performance Metrics for SVM with Polynomial Kernel')
plt.xticks(degrees)  # Set x-ticks to be the degrees for better readability
plt.legend()
plt.grid(True)
plt.show()
Final Code:
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing different degrees of the polynomial kernel
degrees = [2, 3, 4, 5]
accuracies = []
precisions = []
f1_scores = []

for degree in degrees:
    svm_poly = SVC(kernel='poly', degree=degree, gamma='scale', random_state=10)  # Polynomial kernel with the current degree

    # Cross-validate the model
    accuracy = cross_val_score(svm_poly, X_train_scaled, y_train, cv=cv, scoring='accuracy').mean()
    precision = cross_val_score(svm_poly, X_train_scaled, y_train, cv=cv, scoring='precision_weighted').mean()
    f1 = cross_val_score(svm_poly, X_train_scaled, y_train, cv=cv, scoring='f1_weighted').mean()

    # Store results for plotting
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(degrees, accuracies, label='Accuracy', marker='o')
plt.plot(degrees, precisions, label='Precision', marker='o')
plt.plot(degrees, f1_scores, label='F1 Score', marker='o')
plt.xlabel('Degree of Polynomial Kernel')
plt.ylabel('Score')
plt.title('Performance Metrics for SVM with Polynomial Kernel')
plt.xticks(degrees)  # Set x-ticks to be the degrees for better readability
plt.legend()
plt.grid(True)
plt.show()
Ans: Degree 3
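As an alternative to the manual loop, GridSearchCV can search the degree against the same weighted-precision metric in a few lines. A minimal sketch, assuming X_train_scaled, y_train, and cv from the code above:

from sklearn.model_selection import GridSearchCV

# Sketch: grid-search the polynomial degree with weighted precision as the selection metric.
param_grid = {'degree': [2, 3, 4, 5]}
grid = GridSearchCV(SVC(kernel='poly', gamma='scale', random_state=10),
                    param_grid, cv=cv, scoring='precision_weighted')
grid.fit(X_train_scaled, y_train)

print("Best degree:", grid.best_params_['degree'])
print(f"Best cross-validated precision: {grid.best_score_:.3f}")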
Q5. Determining Optimal Gamma
Context:
In Support Vector Machines (SVM) using the radial basis function (RBF)
kernel, the gamma parameter defines how far the influence of a single training
example reaches, with low values meaning 'far' and high values meaning 'close'.
Adjusting gamma can significantly affect the model's ability to generalize.
Task:
Identify the optimal gamma value for an SVM model using the RBF kernel
that results in the highest precision. Gamma values are tested on a logarithmic
scale from 10^-4 to 10^1.
Instructions:
- Setup and Data Preparation:
- Use StratifiedKFold for cross-validation with 5 splits, shuffled data, and a set
random state for reproducibility.
- Standardize the features to improve model performance.
- Prepare a range of gamma values on a logarithmic scale for
testing.
- Model Training and Evaluation:
- Train an SVM model with the RBF kernel at various gamma settings.
- Use cross-validation to evaluate accuracy, precision, and F1 score
for each gamma value.
- Record these scores for analysis.
- Analysis and Visualization:
- Plot accuracy, precision, and F1 score against gamma values.
- Highlight the gamma value that results in the highest precision.
- Use logarithmic scaling for the gamma axis to enhance
visualization.
Question:
Based on the evaluation, in which range does the optimal gamma value for
achieving the highest precision fall?
Options:
(A) 10^-4 to 10^-3
(B) 10^-3 to 10^-2
(C) 10^-1 to 10^0
(D) 10^-2 to 10^-1
# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing a new range of gamma values with more granularity
gamma_values = np.logspace(-4, 1, 20)  # Logarithmic scale from 10^-4 to 10^1

accuracies = []
precisions = []
f1_scores = []

for gamma in gamma_values:
    svm_rbf = SVC(kernel=____, _____=gamma, random_state=10)  # TODO: Specify the correct kernel and parameter name for gamma

    # Cross-validate the model
    accuracy = np.mean(______(_____, X_train_scaled, y_train, cv=cv, scoring='accuracy'))  # TODO: Use cross-validation to compute the mean accuracy
    precision = np.mean(cross_val_score(svm_rbf, ____, y_train, cv=cv, scoring='precision_weighted'))  # TODO: Provide the feature set and compute the mean precision
    f1 = np.mean(cross_val_score(svm_rbf, X_train_scaled, y_train, ___=cv, scoring='f1_weighted'))  # TODO: Compute the mean F1 score using cross-validation

    # Store the scores
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Find the index and value of the highest precision
max_precision_index = np.____(precisions)  # TODO: Find the index of the maximum precision
max_precision_value = precisions[max_precision_index]
max_precision_gamma = gamma_values[max_precision_index]

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(gamma_values, accuracies, label='Accuracy', marker='o')
plt.plot(gamma_values, precisions, label='Precision', marker='o')
plt.plot(gamma_values, f1_scores, label='F1 Score', marker='o')

# Highlighting the point with highest precision
plt.scatter(max_precision_gamma, max_precision_value, color='red')  # Highlight the point
plt.axvline(x=max_precision_gamma, color='r', linestyle='--', lw=2)  # Vertical line
plt.annotate(f'Highest Precision: {max_precision_value:.2f}\nGamma: {max_precision_gamma}',
             (max_precision_gamma, max_precision_value),
             textcoords="offset points", xytext=(0, -50), ha='center')

plt.xlabel('Gamma')
plt.ylabel('Score')
plt.title('SVM Performance Comparison for Different Gamma Values')
plt.xscale('log')  # Using logarithmic scale for gamma values
plt.legend()
plt.grid(True)
plt.show()
Final Code:
# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing a new range of gamma values with more granularity
gamma_values = np.logspace(-4, 1, 20)  # Logarithmic scale from 10^-4 to 10^1

accuracies = []
precisions = []
f1_scores = []

for gamma in gamma_values:
    svm_rbf = SVC(kernel='rbf', gamma=gamma, random_state=10)  # RBF kernel with the current gamma

    # Cross-validate the model
    accuracy = np.mean(cross_val_score(svm_rbf, X_train_scaled, y_train, cv=cv, scoring='accuracy'))
    precision = np.mean(cross_val_score(svm_rbf, X_train_scaled, y_train, cv=cv, scoring='precision_weighted'))
    f1 = np.mean(cross_val_score(svm_rbf, X_train_scaled, y_train, cv=cv, scoring='f1_weighted'))

    # Store the scores
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Find the index and value of the highest precision
max_precision_index = np.argmax(precisions)
max_precision_value = precisions[max_precision_index]
max_precision_gamma = gamma_values[max_precision_index]

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(gamma_values, accuracies, label='Accuracy', marker='o')
plt.plot(gamma_values, precisions, label='Precision', marker='o')
plt.plot(gamma_values, f1_scores, label='F1 Score', marker='o')

# Highlighting the point with highest precision
plt.scatter(max_precision_gamma, max_precision_value, color='red')  # Highlight the point
plt.axvline(x=max_precision_gamma, color='r', linestyle='--', lw=2)  # Vertical line
plt.annotate(f'Highest Precision: {max_precision_value:.2f}\nGamma: {max_precision_gamma}',
             (max_precision_gamma, max_precision_value),
             textcoords="offset points", xytext=(0, -50), ha='center')

plt.xlabel('Gamma')
plt.ylabel('Score')
plt.title('SVM Performance Comparison for Different Gamma Values')
plt.xscale('log')  # Using logarithmic scale for gamma values
plt.legend()
plt.grid(True)
plt.show()
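Because the question asks for a range rather than an exact value, it can help to print the winning gamma and its order of magnitude explicitly. A small sketch reusing max_precision_gamma and max_precision_value from the code above:

# Sketch: report the winning gamma and its order of magnitude to read off the range.
print(f"Best gamma: {max_precision_gamma:.5f}")
print(f"Order of magnitude: 10^{int(np.floor(np.log10(max_precision_gamma)))}")
print(f"Precision at that gamma: {max_precision_value:.3f}")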
Ans: (
Q6. Optimal Regularization Parameter
Context:
The regularization parameter ( C ) in Support Vector Machines (SVM)
controls the trade-off between achieving a low error on the training data and
minimizing the model complexity for better generalization. Determining the
optimal ( C ) value is crucial for model performance, especially when using SVM
for classification tasks.
Task:
Use Recursive Feature Elimination (RFE) to select the top 16 features
for training an SVM classifier. Then, determine the optimal ( C ) value from a
range of possible values to maximize the precision of the classifier.
Instructions:
- Feature Selection:
- Use RFE with SVM to reduce the number of features to the top 16
most significant features.
- Standardize the features before applying RFE.
- Model Training and Validation:
- Train an SVM model with a linear kernel using the selected
features.
- Explore a range of ( C ) values using logarithmic scaling to
determine the best ( C ) for maximizing precision.
- Employ cross-validation using StratifiedKFold with 5 splits to
evaluate model performance metrics such as accuracy, precision, and F1
score.
- Performance Analysis:
- Plot the performance metrics across different ( C ) values.
- Identify and highlight the ( C ) value that results in the highest
precision.
Question:
Based on the evaluation, in which range does the optimal ( C ) value for
achieving the highest precision fall?
Options:
(A) 0 to 1
(B) 1 to 3
(C) 3 to 6
(D) 6 to 10
# Set up the RFE with SVM as the estimator and select the top 16 features
selector = RFE(SVC(_____='linear', random_state=42), ______=16, step=1)  # TODO: Specify the correct parameter names
X_train_rfe = _____.fit_transform(X_train_scaled, y_train)  # TODO: Fit and transform the training data with RFE
X_test_rfe = _____.transform(X_test_scaled)  # TODO: Transform the test data with RFE

# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing different C values
c_values = np.logspace(-3, 2, 20)  # Explore a range of C values on a log scale

accuracies = []
precisions = []
f1_scores = []

for c in c_values:
    svm_linear = ___(kernel='linear', ____=c, random_state=42)  # TODO: Initialize the SVC model with a linear kernel and the C parameter

    # Cross-validate the model using only the selected features by RFE
    accuracy = np.mean(cross_val_score(_____, X_train_rfe, y_train, cv=cv, ____='accuracy'))  # TODO: Use cross_val_score to compute the mean accuracy
    precision = np.mean(cross_val_score(svm_linear, ____, y_train, cv=cv, scoring='precision_weighted'))  # TODO: Compute the mean precision
    f1 = np.___(cross_val_score(svm_linear, X_train_rfe, ____, __=cv, scoring='f1_weighted'))  # TODO: Compute the mean F1 score

    # Store the scores
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Find the index and value of the highest precision
max_precision_index = np.____(precisions)  # TODO: Find the index of the maximum precision
max_precision_value = precisions[max_precision_index]
max_precision_c = c_values[max_precision_index]

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(c_values, accuracies, label='Accuracy', marker='o')
plt.plot(c_values, precisions, label='Precision', marker='o')
plt.plot(c_values, f1_scores, label='F1 Score', marker='o')

# Highlighting the point with highest precision
plt.scatter(max_precision_c, max_precision_value, color='red')  # Highlight the point
plt.axvline(x=max_precision_c, color='r', linestyle='--', lw=2)  # Vertical line
plt.annotate(f'Highest Precision: {max_precision_value:.2f}\nC: {max_precision_c}',
             (max_precision_c, max_precision_value),
             textcoords="offset points", xytext=(0, -50), ha='center')

plt.xlabel('C Value')
plt.ylabel('Score')
plt.title('SVM with Linear Kernel Performance Comparison for Different C Values')
plt.xscale('log')  # Using logarithmic scale for C values
plt.legend()
plt.grid(True)
plt.show()
Final Code:
# Set up the RFE with SVM as the estimator and select the top 16 features
selector = RFE(SVC(kernel='linear', random_state=42), n_features_to_select=16, step=1)
X_train_rfe = selector.fit_transform(X_train_scaled, y_train)  # Fit and transform the training data with RFE
X_test_rfe = selector.transform(X_test_scaled)  # Transform the test data with RFE

# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing different C values
c_values = np.logspace(-3, 2, 20)  # Explore a range of C values on a log scale

accuracies = []
precisions = []
f1_scores = []

for c in c_values:
    svm_linear = SVC(kernel='linear', C=c, random_state=42)  # Linear kernel with the current C

    # Cross-validate the model using only the selected features by RFE
    accuracy = np.mean(cross_val_score(svm_linear, X_train_rfe, y_train, cv=cv, scoring='accuracy'))
    precision = np.mean(cross_val_score(svm_linear, X_train_rfe, y_train, cv=cv, scoring='precision_weighted'))
    f1 = np.mean(cross_val_score(svm_linear, X_train_rfe, y_train, cv=cv, scoring='f1_weighted'))

    # Store the scores
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Find the index and value of the highest precision
max_precision_index = np.argmax(precisions)
max_precision_value = precisions[max_precision_index]
max_precision_c = c_values[max_precision_index]

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(c_values, accuracies, label='Accuracy', marker='o')
plt.plot(c_values, precisions, label='Precision', marker='o')
plt.plot(c_values, f1_scores, label='F1 Score', marker='o')

# Highlighting the point with highest precision
plt.scatter(max_precision_c, max_precision_value, color='red')  # Highlight the point
plt.axvline(x=max_precision_c, color='r', linestyle='--', lw=2)  # Vertical line
plt.annotate(f'Highest Precision: {max_precision_value:.2f}\nC: {max_precision_c}',
             (max_precision_c, max_precision_value),
             textcoords="offset points", xytext=(0, -50), ha='center')

plt.xlabel('C Value')
plt.ylabel('Score')
plt.title('SVM with Linear Kernel Performance Comparison for Different C Values')
plt.xscale('log')  # Using logarithmic scale for C values
plt.legend()
plt.grid(True)
plt.show()
Ans: 1 to 3
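As a sanity check, the best C found by cross-validation can be refit on the RFE-selected training features and scored on the held-out test split. A minimal sketch, assuming X_train_rfe, X_test_rfe, y_train, y_test, and max_precision_c from the code above:

# Sketch: refit with the best C on the RFE-selected training features and score the held-out test split.
svm_best = SVC(kernel='linear', C=max_precision_c, random_state=42)
svm_best.fit(X_train_rfe, y_train)

test_predictions = svm_best.predict(X_test_rfe)
print(classification_report(y_test, test_predictions))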
Q7. Counting Support Vectors
Context:
Support Vector Machine (SVM) is a robust classification technique that
constructs a hyperplane or set of hyperplanes in a high-dimensional space,
which can be used for classification, regression, or other tasks. A critical
component of the SVM classifier is the support vectors, which are the data
points nearest to the hyperplane and influence its position and orientation.
Task:
Use t-Distributed Stochastic Neighbor Embedding (t-SNE) for
dimensionality reduction followed by SVM to classify data. Determine the total
number of support vectors involved in classifying the data.
Instructions:
- Data Preparation and Preprocessing:
- Filter the dataset for two specific classes for simplification.
- Apply standard scaling to the features to normalize data,
enhancing the SVM's performance.
- Dimensionality Reduction:
- Implement t-SNE to reduce feature dimensions to two components,
which aids in visualizing the dataset.
- SVM Training:
- Train an SVM classifier using the t-SNE output. Set the kernel to
linear and regularization parameter ( C ) to 1.0.
- Identify the support vectors from the trained model.
- Visualization:
- Plot the decision boundaries and highlight the support vectors.
- Use contours to depict decision boundaries and margins.
- Analysis:
- Report the total number of support vectors found.
Question:
After training the SVM model on t-SNE reduced features, how many support
vectors are found?
Options:
(A) 10
(B) 15
(C) 4
(D) 20
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# List of features
features = ['AREA', 'PERIMETER', 'MAJOR_AXIS', 'MINOR_AXIS', 'ECCENTRICITY',
            'EQDIASQ', 'SOLIDITY', 'CONVEX_AREA', 'EXTENT', 'ASPECT_RATIO',
            'ROUNDNESS', 'COMPACTNESS', 'SHAPEFACTOR_1', 'SHAPEFACTOR_2',
            'SHAPEFACTOR_3', 'SHAPEFACTOR_4', 'MeanRR', 'MeanRG', 'MeanRB',
            'StdDevRR', 'StdDevRG', 'StdDevRB', 'SkewRR', 'SkewRG', 'SkewRB',
            'KurtosisRR', 'KurtosisRG', 'KurtosisRB', 'EntropyRR', 'EntropyRG',
            'EntropyRB', 'ALLdaub4RR', 'ALLdaub4RG', 'ALLdaub4RB']

# Create a copy of the DataFrame
df_temp = df.copy()

# Select features and filter for two classes
df_temp['Class'] = df_temp['Class'].apply(lambda x: 'SAFAVI' if x == 'SAFAVI' else 'Other')

# Encoding classes
df_temp['Label'] = df_temp['Class'].____({'SAFAVI': 1, 'Other': 0})  # TODO: Use the appropriate method to map class labels to numerical values

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_temp[features])

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# SVM training
svm = ____(kernel='linear', ___=1.0)  # TODO: Initialize the SVC model with the correct parameters
svm.fit(X_tsne, df_temp['Label'])

# Support vectors
# Visit https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html to get the correct attribute for support vectors
support_vectors = svm.______
print(f'---------------------------------------------------------------')
print(f'Total Number of Support Vectors found are {len(support_vectors)}')
print(f'---------------------------------------------------------------')

# Create grid to plot decision boundaries
x_min, x_max = X_tsne[:, 0].min() - 1, X_tsne[:, 0].max() + 1
y_min, y_max = X_tsne[:, 1].min() - 1, X_tsne[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# Plot decision boundary and margins
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])  # TODO: Use the decision function to predict decision boundaries
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 8))  # Define plot size here to ensure it encompasses all the elements
plt.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, colors=['blue', 'grey', 'red'])  # Background color for decision areas
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])  # Decision boundaries and margins

# Plot data points and support vectors
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=df_temp['Label'], cmap=ListedColormap(['#FF0000', '#0000FF']), s=50, edgecolors='k')
plt.scatter(support_vectors[:, 0], support_vectors[:, 1], s=120, facecolors='none', edgecolors='yellow', linewidths=2, label='Support vectors')

# Labels and title
plt.xlabel('t-SNE Feature 1')
plt.ylabel('t-SNE Feature 2')
plt.title('t-SNE Visualization with SVM Decision Boundary for Date Fruit Classification')
plt.legend(handles=[scatter], labels=['Data points'], loc='upper right')
plt.show()
Final Code:
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# List of features
features = ['AREA', 'PERIMETER', 'MAJOR_AXIS', 'MINOR_AXIS', 'ECCENTRICITY',
            'EQDIASQ', 'SOLIDITY', 'CONVEX_AREA', 'EXTENT', 'ASPECT_RATIO',
            'ROUNDNESS', 'COMPACTNESS', 'SHAPEFACTOR_1', 'SHAPEFACTOR_2',
            'SHAPEFACTOR_3', 'SHAPEFACTOR_4', 'MeanRR', 'MeanRG', 'MeanRB',
            'StdDevRR', 'StdDevRG', 'StdDevRB', 'SkewRR', 'SkewRG', 'SkewRB',
            'KurtosisRR', 'KurtosisRG', 'KurtosisRB', 'EntropyRR', 'EntropyRG',
            'EntropyRB', 'ALLdaub4RR', 'ALLdaub4RG', 'ALLdaub4RB']

# Create a copy of the DataFrame
df_temp = df.copy()

# Select features and filter for two classes
df_temp['Class'] = df_temp['Class'].apply(lambda x: 'SAFAVI' if x == 'SAFAVI' else 'Other')

# Encoding classes
df_temp['Label'] = df_temp['Class'].map({'SAFAVI': 1, 'Other': 0})  # Map class labels to numerical values

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_temp[features])

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# SVM training
svm = SVC(kernel='linear', C=1.0)  # Linear kernel with C=1.0
svm.fit(X_tsne, df_temp['Label'])

# Support vectors
# Visit https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html to get the correct attribute for support vectors
support_vectors = svm.support_vectors_
print(f'---------------------------------------------------------------')
print(f'Total Number of Support Vectors found are {len(support_vectors)}')
print(f'---------------------------------------------------------------')

# Create grid to plot decision boundaries
x_min, x_max = X_tsne[:, 0].min() - 1, X_tsne[:, 0].max() + 1
y_min, y_max = X_tsne[:, 1].min() - 1, X_tsne[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# Plot decision boundary and margins
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])  # Signed distance to the decision boundary over the grid
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 8))  # Define plot size here to ensure it encompasses all the elements
plt.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, colors=['blue', 'grey', 'red'])  # Background color for decision areas
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])  # Decision boundaries and margins

# Plot data points and support vectors
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=df_temp['Label'], cmap=ListedColormap(['#FF0000', '#0000FF']), s=50, edgecolors='k')
plt.scatter(support_vectors[:, 0], support_vectors[:, 1], s=120, facecolors='none', edgecolors='yellow', linewidths=2, label='Support vectors')

# Labels and title
plt.xlabel('t-SNE Feature 1')
plt.ylabel('t-SNE Feature 2')
plt.title('t-SNE Visualization with SVM Decision Boundary for Date Fruit Classification')
plt.legend(handles=[scatter], labels=['Data points'], loc='upper right')
plt.show()
Total Number of Support Vectors found are 15
Ans: 15
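The count can also be cross-checked per class: a fitted SVC exposes n_support_ (support vectors per class) and support_ (their indices). A short sketch reusing the svm model trained above:

# Sketch: break the support-vector count down by class.
print("Support vectors per class:", svm.n_support_)   # one count per class, in the order of svm.classes_
print("Total support vectors:", svm.n_support_.sum())
print("Indices of support vectors:", svm.support_)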
Q8. Implement Hinge Loss from Scratch
Context:
Support Vector Machines (SVMs) rely on the hinge loss function to
penalize misclassified points and those that are too close to the decision
boundary. Understanding hinge loss helps explain how SVMs work.
Task:
Write a function to compute hinge loss for a simple binary
classification dataset and use it to determine the loss of a sample classifier.
Instructions:
- Define Hinge Loss Function: Implement a Python
function to compute hinge loss using the following steps:
- Step 1: Calculate the margin for each data point: margin_i = y_i * (w · x_i + b)
- Step 2: Compute the hinge loss for each data point: max(0, 1 - margin_i)
- Evaluate Sample Classifier: Use your function to
compute the hinge loss for a classifier defined by weights and bias.
- Answer the Question: Based on the computed
loss, answer the following question about the nature of hinge loss and its
implications.
Question:
What does the computed hinge loss value indicate about the classifier's
performance?
Statements:
- The hinge loss value ( > 1 ) indicates that most points are
misclassified or too close to the decision boundary.
- The hinge loss value ( < 1 ) suggests that most points are
classified correctly but may still be close to the decision boundary.
- The hinge loss value ( = 0 ) indicates that all points are
classified correctly and far from the decision boundary.
- The hinge loss value ( = 1 ) suggests that all points are
classified correctly.
Options:
A) Statement 1
B) Statement 2
C) Statement 3
D) Statement 4
# Sample Data (Binary classification)
X = np.array([[2, 3], [3, 4], [1, 1], [4, 5], [6, 7]])
y = np.array([1, 1, -1, 1, -1])

data = pd.DataFrame(X, columns=['f1', 'f2'])
data['y'] = y
data.head()

   f1  f2  y
0   2   3  1
1   3   4  1
2   1   1 -1
3   4   5  1
4   6   7 -1
import numpy as np

# Hinge Loss Implementation
def hinge_loss(X, y, weights, bias):
    """
    Compute hinge loss for a simple SVM model.

    Args:
        X (ndarray): Feature matrix.
        y (ndarray): Target vector.
        weights (ndarray): Weight vector.
        bias (float): Bias term.

    Returns:
        float: Hinge loss value.
    """
    margins = y * (np.___(X, ____) + ___)  # TODO: Calculate the margins using a dot product of weights and features, and add the bias
    hinge_loss = np.____(0, 1 - ____)  # TODO: Apply the hinge loss formula: max(0, 1 - margin)
    return np.____(hinge_loss)  # TODO: Return the mean of the hinge loss values

# Example weights and bias
weights = np.array([0.5, 0.5])
bias = 0.0

# Compute Hinge Loss
loss = hinge_loss(X, y, weights, bias)
print(f'Hinge Loss: {loss:.4f}')
Final Code:
# Sample Data (Binary classification)
X = np.array([[2, 3], [3, 4], [1, 1], [4, 5], [6, 7]])
y = np.array([1, 1, -1, 1, -1])

data = pd.DataFrame(X, columns=['f1', 'f2'])
data['y'] = y
data.head()

import numpy as np

# Hinge Loss Implementation
def hinge_loss(X, y, weights, bias):
    """
    Compute hinge loss for a simple SVM model.

    Args:
        X (ndarray): Feature matrix.
        y (ndarray): Target vector.
        weights (ndarray): Weight vector.
        bias (float): Bias term.

    Returns:
        float: Hinge loss value.
    """
    margins = y * (np.dot(X, weights) + bias)  # Margin for each data point
    hinge_loss = np.maximum(0, 1 - margins)    # Per-point hinge loss
    return np.mean(hinge_loss)                 # Average hinge loss over all points

# Example weights and bias
weights = np.array([0.5, 0.5])
bias = 0.0

# Compute Hinge Loss
loss = hinge_loss(X, y, weights, bias)
print(f'Hinge Loss: {loss:.4f}')
Hinge Loss: 1.9000
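As a quick hand check with w = [0.5, 0.5] and b = 0: the margins y_i * (w · x_i + b) for the five points are 2.5, 3.5, -1.0, 4.5, and -6.5, so the per-point losses max(0, 1 - margin) are 0, 0, 2.0, 0, and 7.5, and their mean is 9.5 / 5 = 1.9, matching the printed value. A loss well above 1 reflects the fact that the two negative points are misclassified by this weight vector.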
Ans: Statement 1