Support Vector Machines
Predicting Date Fruit Varieties with Support Vector Machines
Context
The objective is to leverage advanced machine learning techniques to predict
the variety of date fruits, empowering farmers and agricultural stakeholders to
improve classification accuracy and streamline post-harvest processes. Your
role is to analyze various morphological, colorimetric, and textural attributes
of date fruits to build predictive models that distinguish different varieties
effectively.
Dataset Description
You have been provided with a comprehensive dataset containing morphological and colorimetric features of different varieties of date fruits. The dataset includes the following attributes:
Morphological Attributes:
- AREA: Surface area of the date fruit.
- PERIMETER: Perimeter measurement around the fruit.
- MAJOR_AXIS: Length of the major axis of the date fruit.
- MINOR_AXIS: Length of the minor axis of the date fruit.
- ECCENTRICITY: Ratio describing the shape of the fruit based on the axes.
- EQDIASQ: Equivalent diameter of a circle with the same area as the fruit.
- SOLIDITY: Ratio of the area to the convex hull area.
- CONVEX_AREA: Area of the smallest convex polygon that can contain the fruit.
- EXTENT: Ratio of the area to the bounding box area.
- ASPECT_RATIO: Ratio of the major axis to the minor axis.
- ROUNDNESS: Measure of how circular the fruit is.
- COMPACTNESS: Measure of how compact or dense the fruit is.
Shape Factor Attributes:
- SHAPEFACTOR_1: Ratio of the perimeter squared to 4π times the area.
- SHAPEFACTOR_2: Ratio of 4π times the area to the perimeter squared.
- SHAPEFACTOR_3: Ratio of the major axis to the equivalent diameter.
- SHAPEFACTOR_4: Ratio of the minor axis to the equivalent diameter.
Colorimetric Attributes:
- MeanRR: Mean intensity of the red color channel.
- MeanRG: Mean intensity of the green color channel.
- MeanRB: Mean intensity of the blue color channel.
- StdDevRR: Standard deviation of the red color channel.
- StdDevRG: Standard deviation of the green color channel.
- StdDevRB: Standard deviation of the blue color channel.
- SkewRR: Skewness of the red color channel.
- SkewRG: Skewness of the green color channel.
- SkewRB: Skewness of the blue color channel.
- KurtosisRR: Kurtosis of the red color channel.
- KurtosisRG: Kurtosis of the green color channel.
- KurtosisRB: Kurtosis of the blue color channel.
- EntropyRR: Entropy of the red color channel.
- EntropyRG: Entropy of the green color channel.
- EntropyRB: Entropy of the blue color channel.
Daubechies Wavelet Attributes:
- ALLdaub4RR: Wavelet-transformed feature of the red color channel.
- ALLdaub4RG: Wavelet-transformed feature of the green color channel.
- ALLdaub4RB: Wavelet-transformed feature of the blue color channel.
Target Attribute:
- Class: The variety or class of the date fruit.
- Check this Date Fruit Vlog to see the images of these dates.
Your task is
to utilize Support Vector Machines (SVMs) to predict the "Class" of
each date fruit and identify the most influential features contributing to
accurate classification. This project will help streamline the sorting and
grading processes in the agricultural industry, offering practical insights to
farmers and other stakeholders.
!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/070/671/original/Date_Fruit_Datasets.zip
--2024-07-13 05:51:52-- https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/070/671/original/Date_Fruit_Datasets.zip
Resolving d2beiqkhq929f0.cloudfront.net
(d2beiqkhq929f0.cloudfront.net)... 13.35.37.7, 13.35.37.159, 13.35.37.102, ...
Connecting to d2beiqkhq929f0.cloudfront.net
(d2beiqkhq929f0.cloudfront.net)|13.35.37.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 416324 (407K) [application/zip]
Saving to: ‘Date_Fruit_Datasets.zip’
Date_Fruit_Datasets 100%[===================>] 406.57K  --.-KB/s    in 0.02s
2024-07-13 05:51:52 (18.5 MB/s) - ‘Date_Fruit_Datasets.zip’ saved [416324/416324]
!unzip /content/Date_Fruit_Datasets.zip
Archive:  /content/Date_Fruit_Datasets.zip
   creating: Date_Fruit_Datasets/
  inflating: Date_Fruit_Datasets/Date_Fruit_Datasets.arff
  inflating: Date_Fruit_Datasets/Date_Fruit_Datasets.xlsx
  inflating: Date_Fruit_Datasets/Date_Fruit_Datasets_Citation_Request.txt
import warnings
warnings.filterwarnings("ignore")
import os
import pandas as pd
df = pd.read_excel('/content/Date_Fruit_Datasets/Date_Fruit_Datasets.xlsx')
df.head()
AREA PERIMETER MAJOR_AXIS MINOR_AXIS ECCENTRICITY EQDIASQ SOLIDITY CONVEX_AREA EXTENT ASPECT_RATIO ... KurtosisRR KurtosisRG KurtosisRB EntropyRR EntropyRG EntropyRB ALLdaub4RR ALLdaub4RG ALLdaub4RB Class
0 422163 2378.908 837.8484 645.6693 0.6373 733.1539 0.9947 424428 0.7831 1.2976 ... 3.2370 2.9574 4.2287 -59191263232 -50714214400 -39922372608 58.7255 54.9554 47.8400 BERHI
1 338136 2085.144 723.8198 595.2073 0.5690 656.1464 0.9974 339014 0.7795 1.2161 ... 2.6228 2.6350 3.1704 -34233065472 -37462601728 -31477794816 50.0259 52.8168 47.8315 BERHI
2 526843 2647.394 940.7379 715.3638 0.6494 819.0222 0.9962 528876 0.7657 1.3150 ... 3.7516 3.8611 4.7192 -93948354560 -74738221056 -60311207936 65.4772 59.2860 51.9378 BERHI
3 416063 2351.210 827.9804 645.2988 0.6266 727.8378 0.9948 418255 0.7759 1.2831 ... 5.0401 8.6136 8.2618 -32074307584 -32060925952 -29575010304 43.3900 44.1259 41.1882 BERHI
4 347562 2160.354 763.9877 582.8359 0.6465 665.2291 0.9908 350797 0.7569 1.3108 ... 2.7016 2.9761 4.4146 -39980974080 -35980042240 -25593278464 52.7743 50.9080 42.6666 BERHI
5 rows × 35 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 898 entries, 0 to 897
Data columns (total 35 columns):
# Column Non-Null Count Dtype
---  ------         --------------  -----
 0   AREA           898 non-null    int64
 1   PERIMETER      898 non-null    float64
 2   MAJOR_AXIS     898 non-null    float64
 3   MINOR_AXIS     898 non-null    float64
 4   ECCENTRICITY   898 non-null    float64
 5   EQDIASQ        898 non-null    float64
 6   SOLIDITY       898 non-null    float64
 7   CONVEX_AREA    898 non-null    int64
 8   EXTENT         898 non-null    float64
 9   ASPECT_RATIO   898 non-null    float64
 10  ROUNDNESS      898 non-null    float64
 11  COMPACTNESS    898 non-null    float64
 12  SHAPEFACTOR_1  898 non-null    float64
 13  SHAPEFACTOR_2  898 non-null    float64
 14  SHAPEFACTOR_3  898 non-null    float64
 15  SHAPEFACTOR_4  898 non-null    float64
 16  MeanRR         898 non-null    float64
 17  MeanRG         898 non-null    float64
 18  MeanRB         898 non-null    float64
 19  StdDevRR       898 non-null    float64
 20  StdDevRG       898 non-null    float64
 21  StdDevRB       898 non-null    float64
 22  SkewRR         898 non-null    float64
 23  SkewRG         898 non-null    float64
 24  SkewRB         898 non-null    float64
 25  KurtosisRR     898 non-null    float64
 26  KurtosisRG     898 non-null    float64
 27  KurtosisRB     898 non-null    float64
 28  EntropyRR      898 non-null    int64
 29  EntropyRG      898 non-null    int64
 30  EntropyRB      898 non-null    int64
 31  ALLdaub4RR     898 non-null    float64
 32  ALLdaub4RG     898 non-null    float64
 33  ALLdaub4RB     898 non-null    float64
 34  Class          898 non-null    object
dtypes: float64(29), int64(5), object(1)
memory usage: 245.7+ KB
class_counts = df.Class.value_counts()
classes = list(class_counts.index)
n_classes = len(classes)
print(f"we've got {n_classes} classes\n")
display(class_counts)
we've got 7 classes
Class
DOKOL 204
SAFAVI 199
ROTANA 166
DEGLET 98
SOGAY 94
IRAQI 72
BERHI 65
Name: count, dtype: int64
Q1. Feature Correlation Analysis
Context:
Support Vector Machines (SVM) are sensitive to highly correlated
features, especially when using linear kernels. Such correlations can
negatively affect the model’s generalization ability and potentially lead to
overfitting.
Task:
Calculate the Pearson correlation coefficients among 'PERIMETER',
'MAJOR_AXIS', 'MINOR_AXIS', and 'COMPACTNESS'. These insights will help in
understanding the interdependencies among features that are critical to SVM
performance.
Instructions:
- Calculate Correlation Matrix: Use the dataset to calculate the Pearson correlation matrix for the specified features.
- Visualize Correlation Matrix: Create a heatmap to visualize the correlations between features, which will help identify which pairs are most correlated.
Question:
Based on the heatmap, which pair of features shows the highest
correlation? Discuss the potential impact of this correlation on SVM
classification and suggest how to mitigate negative effects.
Options:
A) 'PERIMETER' and 'MAJOR_AXIS'
B) 'MAJOR_AXIS' and 'MINOR_AXIS'
C) 'MINOR_AXIS' and 'COMPACTNESS'
D) 'PERIMETER' and 'COMPACTNESS'
import seaborn as sns
import matplotlib.pyplot as plt

# Features to include in the correlation matrix
features = ['PERIMETER', 'MAJOR_AXIS', 'MINOR_AXIS', 'COMPACTNESS']

# TODO: Compute the correlation matrix for the selected features
correlation_matrix = _________

plt.figure(figsize=(10, 8))
# TODO: Fill in the appropriate variable to visualize the correlation matrix
sns.heatmap(________, annot=True, cmap='coolwarm', linewidths=2, linecolor='black')
plt.title('Heatmap of Pearson Correlation Among Features')
plt.show()
Final Code:
import seaborn as sns
import matplotlib.pyplot as plt

# Features to include in the correlation matrix
features = ['PERIMETER', 'MAJOR_AXIS', 'MINOR_AXIS', 'COMPACTNESS']

# Compute the correlation matrix for the selected features
correlation_matrix = df[features].corr()

plt.figure(figsize=(10, 8))
# Visualize the correlation matrix as an annotated heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=2, linecolor='black')
plt.title('Heatmap of Pearson Correlation Among Features')
plt.show()
Ans: 'PERIMETER' and 'MAJOR_AXIS'
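One common way to mitigate such a strong correlation is to drop one feature of the pair, or to decorrelate the block (for example with PCA) before fitting the SVM. Below is a minimal sketch of both ideas, assuming df and the features list from above; the 0.9 threshold is only an illustrative choice.

from sklearn.decomposition import PCA

# Option 1: flag one feature from each highly correlated pair for removal
# (0.9 is an arbitrary illustrative threshold)
corr = df[features].corr().abs()
to_drop = [col for i, col in enumerate(features) if any(corr.iloc[i, :i] > 0.9)]
print("Candidate columns to drop:", to_drop)

# Option 2: decorrelate the block with PCA and feed the components to the SVM instead
# (in practice the features would be standardized before PCA)
pca = PCA(n_components=2)
components = pca.fit_transform(df[features])
print("Explained variance ratio:", pca.explained_variance_ratio_)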
Q2. SVM Precision with Feature Scaling
Context:
Support Vector Machine (SVM) is sensitive to the scale of the data,
which can significantly impact its performance. This question aims to explore
the effect of feature scaling on SVM classification accuracy and precision for
each class.
Task:
After applying feature scaling, identify the class with the lowest
precision in an SVM classification task. The SVM model uses a linear kernel.
Instructions:
- Prepare Data:
- Split the dataset into training and test sets using train_test_split with test_size=0.3 and random_state=42.
- Extract features (X) and the target variable (y) from the dataset.
- Feature Scaling:
- Scale the features using StandardScaler.
- Train SVM:
- Train a Support Vector Machine with a linear kernel on the scaled
training data.
- Evaluate the Model:
- Predict the target variable on the scaled test set.
- Generate a classification report to review the precision scores
for each class.
Question:
After training an SVM with a linear kernel on scaled data, which class
in the classification report exhibits the lowest precision?
Options:
(A) SAFAVI
(B) IRAQI
(C) ROTANA
(D) DEGLET
from sklearn.preprocessing import StandardScaler
from sklearn.___ import SVC  # TODO: Import the SVC (Support Vector Classifier) from sklearn.svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Preparing the feature and target variables
X = df.drop('Class', axis=1)
y = df['Class']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initializing the SVM model
svm = SVC(_____='linear')  # TODO: Specify the kernel type for the SVC model

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.____(X_train)  # TODO: Apply the scaling to the training features
X_test_scaled = scaler.____(X_test)  # TODO: Apply the scaling to the test features

# Training the SVM model
svm.fit(X_train_scaled, y_train)
predictions_scaled = svm.____(X_test_scaled)  # TODO: Use the model to predict the test set

print("With Scaling:")
print(classification_report(____, ____))  # TODO: Provide the true and predicted values to generate the classification report
Final Code:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Preparing the feature and target variables
X = df.drop('Class', axis=1)
y = df['Class']

# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initializing the SVM model with a linear kernel
svm = SVC(kernel='linear')

# With scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit the scaler on the training features and transform them
X_test_scaled = scaler.transform(X_test)  # Transform the test features with the fitted scaler

# Training the SVM model
svm.fit(X_train_scaled, y_train)
predictions_scaled = svm.predict(X_test_scaled)  # Predict on the scaled test set

print("With Scaling:")
print(classification_report(y_test, predictions_scaled))  # True labels vs. predictions
With Scaling:
              precision    recall  f1-score   support

       BERHI       0.89      0.94      0.91        17
      DEGLET       0.73      0.79      0.76        28
       DOKOL       0.97      0.96      0.96        67
       IRAQI       0.87      0.95      0.91        21
      ROTANA       1.00      0.85      0.92        55
      SAFAVI       0.98      0.98      0.98        51
       SOGAY       0.74      0.84      0.79        31

    accuracy                           0.91       270
   macro avg       0.88      0.90      0.89       270
weighted avg       0.91      0.91      0.91       270
Ans: DEGLET
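To make the scaling effect concrete, a minimal sketch (assuming the X_train, X_test, y_train, y_test split from above) trains the same linear SVM on the raw, unscaled features; performance typically degrades, and fitting can be slow because the raw features differ in scale by several orders of magnitude.

# Sketch: the same linear SVM trained on the raw, unscaled features for comparison.
# max_iter is capped only as a safeguard, since unscaled features can make the solver converge very slowly.
svm_unscaled = SVC(kernel='linear', max_iter=10000)
svm_unscaled.fit(X_train, y_train)
predictions_unscaled = svm_unscaled.predict(X_test)

print("Without Scaling:")
print(classification_report(y_test, predictions_unscaled))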
Q3. Optimal Feature Count
Context:
Feature selection is crucial in machine learning to reduce
dimensionality and improve model performance. This exercise involves using
Recursive Feature Elimination (RFE) with a Support Vector Machine (SVM) to
identify the optimal number of features that yield the highest precision.
Task:
Analyze the effect of different numbers of features on the precision of
an SVM model trained with these features. Identify the number of features that
leads to the highest precision.
Instructions:
- Data Preparation:
- Load the dataset and separate it into features (X) and the target
variable (y).
- Standardize the features to improve model training.
- Feature Selection and Model Training:
- Implement Recursive Feature Elimination (RFE) with an SVM
classifier to select varying numbers of features.
- Train an SVM model using these selected features.
- Use StratifiedKFold for cross-validation to ensure the model is robust and
generalizable.
- Evaluation:
- Calculate cross-validated precision for the model with different
numbers of selected features.
- Record and plot precision against the number of features.
- Analysis:
- Identify the number of features for which the precision is
maximized.
Question:
Based on the evaluation, what is the number of features required to
achieve the highest precision with the SVM model?
Options:
(A) 5 features
(B) 8 features
(C) 16 features
(D) 18 features
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Load your data and separate into X (features) and y (target)
X = df.drop('Class', axis=1)
y = df['Class']

# Prepare cross-validation (use StratifiedKFold for classification tasks)
cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

# Lists to store metrics
accuracies = []
precisions = []
f1_scores = []

feature_counts = ______  # TODO: Define the range of feature counts to be selected

for n in feature_counts:
    # Feature scaling
    scaler = StandardScaler()
    X_scaled = scaler.______(X)  # TODO: Apply scaling to the feature set

    # Feature selection
    selector = RFE(SVC(kernel='linear', random_state=42), n_features_to_select=_____, step=1)  # TODO: Specify the number of features to select
    X_selected = _____.____(X_scaled, y)  # TODO: Fit and transform the data with the selector

    # SVM classifier
    svm = ____(kernel='linear', random_state=10)  # TODO: Initialize the SVM model with appropriate parameters

    # Evaluate using cross-validation
    accuracy_scores = []
    precision_scores = []
    f1_scores_list = []
    for train_index, test_index in cv.split(X_selected, y):
        X_train_cv, X_test_cv = X_selected[train_index], X_selected[test_index]
        y_train_cv, y_test_cv = y[train_index], y[test_index]

        svm.fit(______, y_train_cv)  # TODO: Fit the SVM model on the training data
        y_pred = svm.______(X_test_cv)  # TODO: Make predictions on the test data

        accuracy_scores.append(_____(y_test_cv, y_pred))  # TODO: Calculate the accuracy score
        precision_scores.append(_____(y_test_cv, y_pred, average='weighted'))  # TODO: Calculate the precision score
        f1_scores_list.append(_____(y_test_cv, y_pred, average='weighted'))  # TODO: Calculate the F1 score

    accuracies.append(np.mean(accuracy_scores))
    precisions.append(np.mean(precision_scores))
    f1_scores.append(np.mean(f1_scores_list))

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(feature_counts, accuracies, label='Accuracy', marker='o')
plt.plot(feature_counts, precisions, label='Precision', marker='o')
plt.plot(feature_counts, f1_scores, label='F1 Score', marker='o')

# Highlight the point with highest precision
max_precision_index = np.argmax(precisions)
max_precision_feature_count = feature_counts[max_precision_index]
plt.scatter(max_precision_feature_count, precisions[max_precision_index], color='red')
plt.axvline(x=max_precision_feature_count, color='r', linestyle='--', lw=2)
plt.annotate(f'Highest Precision: {precisions[max_precision_index]:.2f} at {max_precision_feature_count} features',
             (max_precision_feature_count, precisions[max_precision_index]),
             textcoords="offset points", xytext=(0, 10), ha='center')

plt.xlabel('Number of Features Selected')
plt.ylabel('Score')
plt.title('Cross-Validated Performance Metrics vs. Number of Features')
plt.legend()
plt.grid(True)
plt.show()
Final Code:
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Load your data and separate into X (features) and y (target)
X = df.drop('Class', axis=1)
y = df['Class']

# Prepare cross-validation (use StratifiedKFold for classification tasks)
cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)

# Lists to store metrics
accuracies = []
precisions = []
f1_scores = []

feature_counts = range(1, X.shape[1] + 1)  # Try every possible number of selected features

for n in feature_counts:
    # Feature scaling
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Feature selection
    selector = RFE(SVC(kernel='linear', random_state=42), n_features_to_select=n, step=1)
    X_selected = selector.fit_transform(X_scaled, y)

    # SVM classifier
    svm = SVC(kernel='linear', random_state=10)

    # Evaluate using cross-validation
    accuracy_scores = []
    precision_scores = []
    f1_scores_list = []
    for train_index, test_index in cv.split(X_selected, y):
        X_train_cv, X_test_cv = X_selected[train_index], X_selected[test_index]
        y_train_cv, y_test_cv = y[train_index], y[test_index]

        svm.fit(X_train_cv, y_train_cv)
        y_pred = svm.predict(X_test_cv)

        accuracy_scores.append(accuracy_score(y_test_cv, y_pred))
        precision_scores.append(precision_score(y_test_cv, y_pred, average='weighted'))
        f1_scores_list.append(f1_score(y_test_cv, y_pred, average='weighted'))

    accuracies.append(np.mean(accuracy_scores))
    precisions.append(np.mean(precision_scores))
    f1_scores.append(np.mean(f1_scores_list))

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(feature_counts, accuracies, label='Accuracy', marker='o')
plt.plot(feature_counts, precisions, label='Precision', marker='o')
plt.plot(feature_counts, f1_scores, label='F1 Score', marker='o')

# Highlight the point with highest precision
max_precision_index = np.argmax(precisions)
max_precision_feature_count = feature_counts[max_precision_index]
plt.scatter(max_precision_feature_count, precisions[max_precision_index], color='red')
plt.axvline(x=max_precision_feature_count, color='r', linestyle='--', lw=2)
plt.annotate(f'Highest Precision: {precisions[max_precision_index]:.2f} at {max_precision_feature_count} features',
             (max_precision_feature_count, precisions[max_precision_index]),
             textcoords="offset points", xytext=(0, 10), ha='center')

plt.xlabel('Number of Features Selected')
plt.ylabel('Score')
plt.title('Cross-Validated Performance Metrics vs. Number of Features')
plt.legend()
plt.grid(True)
plt.show()
Ans: 16 features
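To see which columns survive at the best count (16 here), the selector can be refit once at that setting and its boolean support mask read off. A minimal sketch, assuming X, y, and the imports from the code above:

# Sketch: refit RFE once at 16 features and list the columns it keeps.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

selector_16 = RFE(SVC(kernel='linear', random_state=42), n_features_to_select=16, step=1)
selector_16.fit(X_scaled, y)

selected_columns = X.columns[selector_16.support_]  # boolean mask of retained features
print("Features selected by RFE:", list(selected_columns))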
Q4. Optimal Degree for Polynomial
Context:
Support Vector Machines (SVM) with a polynomial kernel are useful for
non-linear data classification. The degree of the polynomial kernel plays a
critical role in the model's ability to capture complex patterns in the data.
This exercise involves determining the optimal polynomial degree for maximum
precision in an SVM model.
Task:
Evaluate the performance of an SVM with a polynomial kernel at various
degrees and identify the degree that results in the highest precision.
Instructions:
- Data Preparation:
- Assume X_train_scaled and y_train are your scaled feature set and target variable,
respectively, prepared for training.
- Model Configuration and Evaluation:
- Set up a polynomial kernel SVM for degrees [2, 3, 4, 5].
- Use cross-validation to evaluate the accuracy, precision, and F1
score for each degree.
- Collect and plot these metrics to visualize how they vary with the
polynomial degree.
- Analysis:
- Identify which polynomial degree achieves the highest precision
score.
Question:
Based on the cross-validated performance metrics, which degree of the
polynomial kernel results in the highest precision?
Options:
(A) Degree 2
(B) Degree 3
(C) Degree 4
(D) Degree 5
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing different degrees of the polynomial kernel
degrees = [2, 3, 4, 5]
accuracies = []
precisions = []
f1_scores = []

for degree in degrees:
    svm_poly = SVC(kernel=____________, ________=degree, gamma='scale', random_state=10)  # TODO: Specify the kernel type and degree parameter

    # Cross-validate the model
    accuracy = ______(svm_poly, X_train_scaled, y_train, cv=cv, scoring='accuracy').mean()  # TODO: Use the appropriate cross-validation function to get the mean accuracy
    precision = cross_val_score(svm_poly, ______, y_train, cv=cv, scoring='precision_weighted').____()  # TODO: Provide the feature set and call the appropriate method to get the mean precision
    f1 = cross_val_score(svm_poly, X_train_scaled, y_train, ____=cv, ______='f1_weighted').mean()  # TODO: Specify the appropriate parameters for cross-validation

    # Store results for plotting
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(degrees, accuracies, label='Accuracy', marker='o')
plt.plot(degrees, precisions, label='Precision', marker='o')
plt.plot(degrees, f1_scores, label='F1 Score', marker='o')
plt.xlabel('Degree of Polynomial Kernel')
plt.ylabel('Score')
plt.title('Performance Metrics for SVM with Polynomial Kernel')
plt.xticks(degrees)  # Set x-ticks to be the degrees for better readability
plt.legend()
plt.grid(True)
plt.show()
Final Code:
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, f1_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing different degrees of the polynomial kernel
degrees = [2, 3, 4, 5]
accuracies = []
precisions = []
f1_scores = []

for degree in degrees:
    svm_poly = SVC(kernel='poly', degree=degree, gamma='scale', random_state=10)  # Polynomial kernel with the current degree

    # Cross-validate the model
    accuracy = cross_val_score(svm_poly, X_train_scaled, y_train, cv=cv, scoring='accuracy').mean()
    precision = cross_val_score(svm_poly, X_train_scaled, y_train, cv=cv, scoring='precision_weighted').mean()
    f1 = cross_val_score(svm_poly, X_train_scaled, y_train, cv=cv, scoring='f1_weighted').mean()

    # Store results for plotting
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(degrees, accuracies, label='Accuracy', marker='o')
plt.plot(degrees, precisions, label='Precision', marker='o')
plt.plot(degrees, f1_scores, label='F1 Score', marker='o')
plt.xlabel('Degree of Polynomial Kernel')
plt.ylabel('Score')
plt.title('Performance Metrics for SVM with Polynomial Kernel')
plt.xticks(degrees)  # Set x-ticks to be the degrees for better readability
plt.legend()
plt.grid(True)
plt.show()
Ans: Degree 3
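As an alternative to the manual loop, GridSearchCV can search the degree against the same weighted-precision metric in a few lines. A minimal sketch, assuming X_train_scaled, y_train, and cv from the code above:

from sklearn.model_selection import GridSearchCV

# Sketch: grid-search the polynomial degree with weighted precision as the selection metric.
param_grid = {'degree': [2, 3, 4, 5]}
grid = GridSearchCV(SVC(kernel='poly', gamma='scale', random_state=10),
                    param_grid, cv=cv, scoring='precision_weighted')
grid.fit(X_train_scaled, y_train)

print("Best degree:", grid.best_params_['degree'])
print(f"Best cross-validated precision: {grid.best_score_:.3f}")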
Q5. Determining Optimal Gamma
Context:
In Support Vector Machines (SVM) using the radial basis function (RBF)
kernel, the gamma parameter defines how far the influence of a single training
example reaches, with low values meaning 'far' and high values meaning 'close'.
Adjusting gamma can significantly affect the model's ability to generalize.
Task:
Identify the optimal gamma value for an SVM model using the RBF kernel
that results in the highest precision. Gamma values are tested on a logarithmic
scale from 10^-4 to 10^1.
Instructions:
- Setup and Data Preparation:
- Use StratifiedKFold for cross-validation with 5 splits, shuffled data, and a set
random state for reproducibility.
- Standardize the features to improve model performance.
- Prepare a range of gamma values on a logarithmic scale for
testing.
- Model Training and Evaluation:
- Train an SVM model with the RBF kernel at various gamma settings.
- Use cross-validation to evaluate accuracy, precision, and F1 score
for each gamma value.
- Record these scores for analysis.
- Analysis and Visualization:
- Plot accuracy, precision, and F1 score against gamma values.
- Highlight the gamma value that results in the highest precision.
- Use logarithmic scaling for the gamma axis to enhance
visualization.
Question:
Based on the evaluation, in which range does the optimal gamma value for
achieving the highest precision fall?
Options:
(A) 10^-4 to 10^-3
(B) 10^-3 to 10^-2
(C) 10^-1 to 10^0
(D) 10^-2 to 10^-1
# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing a new range of gamma values with more granularity
gamma_values = np.logspace(-4, 1, 20)  # Logarithmic scale from 10^-4 to 10^1

accuracies = []
precisions = []
f1_scores = []

for gamma in gamma_values:
    svm_rbf = SVC(kernel=____, _____=gamma, random_state=10)  # TODO: Specify the correct kernel and parameter name for gamma

    # Cross-validate the model
    accuracy = np.mean(______(_____, X_train_scaled, y_train, cv=cv, scoring='accuracy'))  # TODO: Use cross-validation to compute the mean accuracy
    precision = np.mean(cross_val_score(svm_rbf, ____, y_train, cv=cv, scoring='precision_weighted'))  # TODO: Provide the feature set and compute the mean precision
    f1 = np.mean(cross_val_score(svm_rbf, X_train_scaled, y_train, ___=cv, scoring='f1_weighted'))  # TODO: Compute the mean F1 score using cross-validation

    # Store the scores
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Find the index and value of the highest precision
max_precision_index = np.____(precisions)  # TODO: Find the index of the maximum precision
max_precision_value = precisions[max_precision_index]
max_precision_gamma = gamma_values[max_precision_index]

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(gamma_values, accuracies, label='Accuracy', marker='o')
plt.plot(gamma_values, precisions, label='Precision', marker='o')
plt.plot(gamma_values, f1_scores, label='F1 Score', marker='o')

# Highlighting the point with highest precision
plt.scatter(max_precision_gamma, max_precision_value, color='red')  # Highlight the point
plt.axvline(x=max_precision_gamma, color='r', linestyle='--', lw=2)  # Vertical line
plt.annotate(f'Highest Precision: {max_precision_value:.2f}\nGamma: {max_precision_gamma}',
             (max_precision_gamma, max_precision_value),
             textcoords="offset points", xytext=(0, -50), ha='center')

plt.xlabel('Gamma')
plt.ylabel('Score')
plt.title('SVM Performance Comparison for Different Gamma Values')
plt.xscale('log')  # Using logarithmic scale for gamma values
plt.legend()
plt.grid(True)
plt.show()
Final Code:
# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing a new range of gamma values with more granularity
gamma_values = np.logspace(-4, 1, 20)  # Logarithmic scale from 10^-4 to 10^1

accuracies = []
precisions = []
f1_scores = []

for gamma in gamma_values:
    svm_rbf = SVC(kernel='rbf', gamma=gamma, random_state=10)  # RBF kernel with the current gamma

    # Cross-validate the model
    accuracy = np.mean(cross_val_score(svm_rbf, X_train_scaled, y_train, cv=cv, scoring='accuracy'))
    precision = np.mean(cross_val_score(svm_rbf, X_train_scaled, y_train, cv=cv, scoring='precision_weighted'))
    f1 = np.mean(cross_val_score(svm_rbf, X_train_scaled, y_train, cv=cv, scoring='f1_weighted'))

    # Store the scores
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Find the index and value of the highest precision
max_precision_index = np.argmax(precisions)
max_precision_value = precisions[max_precision_index]
max_precision_gamma = gamma_values[max_precision_index]

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(gamma_values, accuracies, label='Accuracy', marker='o')
plt.plot(gamma_values, precisions, label='Precision', marker='o')
plt.plot(gamma_values, f1_scores, label='F1 Score', marker='o')

# Highlighting the point with highest precision
plt.scatter(max_precision_gamma, max_precision_value, color='red')  # Highlight the point
plt.axvline(x=max_precision_gamma, color='r', linestyle='--', lw=2)  # Vertical line
plt.annotate(f'Highest Precision: {max_precision_value:.2f}\nGamma: {max_precision_gamma}',
             (max_precision_gamma, max_precision_value),
             textcoords="offset points", xytext=(0, -50), ha='center')

plt.xlabel('Gamma')
plt.ylabel('Score')
plt.title('SVM Performance Comparison for Different Gamma Values')
plt.xscale('log')  # Using logarithmic scale for gamma values
plt.legend()
plt.grid(True)
plt.show()
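Because the question asks for a range rather than an exact value, it can help to print the winning gamma and its order of magnitude explicitly. A small sketch reusing max_precision_gamma and max_precision_value from the code above:

# Sketch: report the winning gamma and its order of magnitude to read off the range.
print(f"Best gamma: {max_precision_gamma:.5f}")
print(f"Order of magnitude: 10^{int(np.floor(np.log10(max_precision_gamma)))}")
print(f"Precision at that gamma: {max_precision_value:.3f}")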
Ans: (
Q6. Optimal Regularization Parameter
Context:
The regularization parameter ( C ) in Support Vector Machines (SVM)
controls the trade-off between achieving a low error on the training data and
minimizing the model complexity for better generalization. Determining the
optimal ( C ) value is crucial for model performance, especially when using SVM
for classification tasks.
Task:
Use Recursive Feature Elimination (RFE) to select the top 16 features
for training an SVM classifier. Then, determine the optimal ( C ) value from a
range of possible values to maximize the precision of the classifier.
Instructions:
- Feature Selection:
- Use RFE with SVM to reduce the number of features to the top 16
most significant features.
- Standardize the features before applying RFE.
- Model Training and Validation:
- Train an SVM model with a linear kernel using the selected
features.
- Explore a range of ( C ) values using logarithmic scaling to
determine the best ( C ) for maximizing precision.
- Employ cross-validation using StratifiedKFold with 5 splits to
evaluate model performance metrics such as accuracy, precision, and F1
score.
- Performance Analysis:
- Plot the performance metrics across different ( C ) values.
- Identify and highlight the ( C ) value that results in the highest
precision.
Question:
Based on the evaluation, in which range does the optimal ( C ) value for
achieving the highest precision fall?
Options:
(A) 0 to 1
(B) 1 to 3
(C) 3 to 6
(D) 6 to 10
# Set up the RFE with SVM as the estimator and select the top 16 features
selector = RFE(SVC(_____='linear', random_state=42), ______=16, step=1)  # TODO: Specify the correct parameter names
X_train_rfe = _____.fit_transform(X_train_scaled, y_train)  # TODO: Fit and transform the training data with RFE
X_test_rfe = _____.transform(X_test_scaled)  # TODO: Transform the test data with RFE

# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing different C values
c_values = np.logspace(-3, 2, 20)  # Explore a range of C values on a log scale

accuracies = []
precisions = []
f1_scores = []

for c in c_values:
    svm_linear = ___(kernel='linear', ____=c, random_state=42)  # TODO: Initialize the SVC model with a linear kernel and the C parameter

    # Cross-validate the model using only the selected features by RFE
    accuracy = np.mean(cross_val_score(_____, X_train_rfe, y_train, cv=cv, ____='accuracy'))  # TODO: Use cross_val_score to compute the mean accuracy
    precision = np.mean(cross_val_score(svm_linear, ____, y_train, cv=cv, scoring='precision_weighted'))  # TODO: Compute the mean precision
    f1 = np.___(cross_val_score(svm_linear, X_train_rfe, ____, __=cv, scoring='f1_weighted'))  # TODO: Compute the mean F1 score

    # Store the scores
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Find the index and value of the highest precision
max_precision_index = np.____(precisions)  # TODO: Find the index of the maximum precision
max_precision_value = precisions[max_precision_index]
max_precision_c = c_values[max_precision_index]

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(c_values, accuracies, label='Accuracy', marker='o')
plt.plot(c_values, precisions, label='Precision', marker='o')
plt.plot(c_values, f1_scores, label='F1 Score', marker='o')

# Highlighting the point with highest precision
plt.scatter(max_precision_c, max_precision_value, color='red')  # Highlight the point
plt.axvline(x=max_precision_c, color='r', linestyle='--', lw=2)  # Vertical line
plt.annotate(f'Highest Precision: {max_precision_value:.2f}\nC: {max_precision_c}',
             (max_precision_c, max_precision_value),
             textcoords="offset points", xytext=(0, -50), ha='center')

plt.xlabel('C Value')
plt.ylabel('Score')
plt.title('SVM with Linear Kernel Performance Comparison for Different C Values')
plt.xscale('log')  # Using logarithmic scale for C values
plt.legend()
plt.grid(True)
plt.show()
Final Code:
# Set up the RFE with SVM as the estimator and select the top 16 features
selector = RFE(SVC(kernel='linear', random_state=42), n_features_to_select=16, step=1)
X_train_rfe = selector.fit_transform(X_train_scaled, y_train)  # Fit and transform the training data with RFE
X_test_rfe = selector.transform(X_test_scaled)  # Transform the test data with RFE

# Prepare cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Testing different C values
c_values = np.logspace(-3, 2, 20)  # Explore a range of C values on a log scale

accuracies = []
precisions = []
f1_scores = []

for c in c_values:
    svm_linear = SVC(kernel='linear', C=c, random_state=42)  # Linear kernel with the current C

    # Cross-validate the model using only the selected features by RFE
    accuracy = np.mean(cross_val_score(svm_linear, X_train_rfe, y_train, cv=cv, scoring='accuracy'))
    precision = np.mean(cross_val_score(svm_linear, X_train_rfe, y_train, cv=cv, scoring='precision_weighted'))
    f1 = np.mean(cross_val_score(svm_linear, X_train_rfe, y_train, cv=cv, scoring='f1_weighted'))

    # Store the scores
    accuracies.append(accuracy)
    precisions.append(precision)
    f1_scores.append(f1)

# Find the index and value of the highest precision
max_precision_index = np.argmax(precisions)
max_precision_value = precisions[max_precision_index]
max_precision_c = c_values[max_precision_index]

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(c_values, accuracies, label='Accuracy', marker='o')
plt.plot(c_values, precisions, label='Precision', marker='o')
plt.plot(c_values, f1_scores, label='F1 Score', marker='o')

# Highlighting the point with highest precision
plt.scatter(max_precision_c, max_precision_value, color='red')  # Highlight the point
plt.axvline(x=max_precision_c, color='r', linestyle='--', lw=2)  # Vertical line
plt.annotate(f'Highest Precision: {max_precision_value:.2f}\nC: {max_precision_c}',
             (max_precision_c, max_precision_value),
             textcoords="offset points", xytext=(0, -50), ha='center')

plt.xlabel('C Value')
plt.ylabel('Score')
plt.title('SVM with Linear Kernel Performance Comparison for Different C Values')
plt.xscale('log')  # Using logarithmic scale for C values
plt.legend()
plt.grid(True)
plt.show()
Ans: 1 to 3
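As a sanity check, the best C found by cross-validation can be refit on the RFE-selected training features and scored on the held-out test split. A minimal sketch, assuming X_train_rfe, X_test_rfe, y_train, y_test, and max_precision_c from the code above:

# Sketch: refit with the best C on the RFE-selected training features and score the held-out test split.
svm_best = SVC(kernel='linear', C=max_precision_c, random_state=42)
svm_best.fit(X_train_rfe, y_train)

test_predictions = svm_best.predict(X_test_rfe)
print(classification_report(y_test, test_predictions))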
Q7. Counting Support Vectors
Context:
Support Vector Machine (SVM) is a robust classification technique that
constructs a hyperplane or set of hyperplanes in a high-dimensional space,
which can be used for classification, regression, or other tasks. A critical
component of the SVM classifier is the support vectors, which are the data
points nearest to the hyperplane and influence its position and orientation.
Task:
Use t-Distributed Stochastic Neighbor Embedding (t-SNE) for
dimensionality reduction followed by SVM to classify data. Determine the total
number of support vectors involved in classifying the data.
Instructions:
- Data Preparation and Preprocessing:
- Filter the dataset for two specific classes for simplification.
- Apply standard scaling to the features to normalize data,
enhancing the SVM's performance.
- Dimensionality Reduction:
- Implement t-SNE to reduce feature dimensions to two components,
which aids in visualizing the dataset.
- SVM Training:
- Train an SVM classifier using the t-SNE output. Set the kernel to
linear and regularization parameter ( C ) to 1.0.
- Identify the support vectors from the trained model.
- Visualization:
- Plot the decision boundaries and highlight the support vectors.
- Use contours to depict decision boundaries and margins.
- Analysis:
- Report the total number of support vectors found.
Question:
After training the SVM model on t-SNE reduced features, how many support
vectors are found?
Options:
(A) 10
(B) 15
(C) 4
(D) 20
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# List of features
features = ['AREA', 'PERIMETER', 'MAJOR_AXIS', 'MINOR_AXIS', 'ECCENTRICITY',
            'EQDIASQ', 'SOLIDITY', 'CONVEX_AREA', 'EXTENT', 'ASPECT_RATIO',
            'ROUNDNESS', 'COMPACTNESS', 'SHAPEFACTOR_1', 'SHAPEFACTOR_2',
            'SHAPEFACTOR_3', 'SHAPEFACTOR_4', 'MeanRR', 'MeanRG', 'MeanRB',
            'StdDevRR', 'StdDevRG', 'StdDevRB', 'SkewRR', 'SkewRG', 'SkewRB',
            'KurtosisRR', 'KurtosisRG', 'KurtosisRB', 'EntropyRR', 'EntropyRG',
            'EntropyRB', 'ALLdaub4RR', 'ALLdaub4RG', 'ALLdaub4RB']

# Create a copy of the DataFrame
df_temp = df.copy()

# Select features and filter for two classes
df_temp['Class'] = df_temp['Class'].apply(lambda x: 'SAFAVI' if x == 'SAFAVI' else 'Other')

# Encoding classes
df_temp['Label'] = df_temp['Class'].____({'SAFAVI': 1, 'Other': 0})  # TODO: Use the appropriate method to map class labels to numerical values

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_temp[features])

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# SVM training
svm = ____(kernel='linear', ___=1.0)  # TODO: Initialize the SVC model with the correct parameters
svm.fit(X_tsne, df_temp['Label'])

# Support vectors
# Visit https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html to get the correct attribute for support vectors
support_vectors = svm.______
print(f'---------------------------------------------------------------')
print(f'Total Number of Support Vectors found are {len(support_vectors)}')
print(f'---------------------------------------------------------------')

# Create grid to plot decision boundaries
x_min, x_max = X_tsne[:, 0].min() - 1, X_tsne[:, 0].max() + 1
y_min, y_max = X_tsne[:, 1].min() - 1, X_tsne[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# Plot decision boundary and margins
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])  # TODO: Use the decision function to predict decision boundaries
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 8))  # Define plot size here to ensure it encompasses all the elements
plt.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, colors=['blue', 'grey', 'red'])  # Background color for decision areas
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])  # Decision boundaries and margins

# Plot data points and support vectors
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=df_temp['Label'], cmap=ListedColormap(['#FF0000', '#0000FF']), s=50, edgecolors='k')
plt.scatter(support_vectors[:, 0], support_vectors[:, 1], s=120, facecolors='none', edgecolors='yellow', linewidths=2, label='Support vectors')

# Labels and title
plt.xlabel('t-SNE Feature 1')
plt.ylabel('t-SNE Feature 2')
plt.title('t-SNE Visualization with SVM Decision Boundary for Date Fruit Classification')
plt.legend(handles=[scatter], labels=['Data points'], loc='upper right')
plt.show()
Final Code:
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# List of features
features = ['AREA', 'PERIMETER', 'MAJOR_AXIS', 'MINOR_AXIS', 'ECCENTRICITY',
            'EQDIASQ', 'SOLIDITY', 'CONVEX_AREA', 'EXTENT', 'ASPECT_RATIO',
            'ROUNDNESS', 'COMPACTNESS', 'SHAPEFACTOR_1', 'SHAPEFACTOR_2',
            'SHAPEFACTOR_3', 'SHAPEFACTOR_4', 'MeanRR', 'MeanRG', 'MeanRB',
            'StdDevRR', 'StdDevRG', 'StdDevRB', 'SkewRR', 'SkewRG', 'SkewRB',
            'KurtosisRR', 'KurtosisRG', 'KurtosisRB', 'EntropyRR', 'EntropyRG',
            'EntropyRB', 'ALLdaub4RR', 'ALLdaub4RG', 'ALLdaub4RB']

# Create a copy of the DataFrame
df_temp = df.copy()

# Select features and filter for two classes
df_temp['Class'] = df_temp['Class'].apply(lambda x: 'SAFAVI' if x == 'SAFAVI' else 'Other')

# Encoding classes
df_temp['Label'] = df_temp['Class'].map({'SAFAVI': 1, 'Other': 0})  # Map class labels to numerical values

# Standard scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_temp[features])

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# SVM training
svm = SVC(kernel='linear', C=1.0)  # Linear kernel with C=1.0
svm.fit(X_tsne, df_temp['Label'])

# Support vectors
# Visit https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html to get the correct attribute for support vectors
support_vectors = svm.support_vectors_
print(f'---------------------------------------------------------------')
print(f'Total Number of Support Vectors found are {len(support_vectors)}')
print(f'---------------------------------------------------------------')

# Create grid to plot decision boundaries
x_min, x_max = X_tsne[:, 0].min() - 1, X_tsne[:, 0].max() + 1
y_min, y_max = X_tsne[:, 1].min() - 1, X_tsne[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# Plot decision boundary and margins
Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])  # Signed distance to the decision boundary over the grid
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 8))  # Define plot size here to ensure it encompasses all the elements
plt.contourf(xx, yy, Z, levels=[-1, 0, 1], alpha=0.2, colors=['blue', 'grey', 'red'])  # Background color for decision areas
plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])  # Decision boundaries and margins

# Plot data points and support vectors
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=df_temp['Label'], cmap=ListedColormap(['#FF0000', '#0000FF']), s=50, edgecolors='k')
plt.scatter(support_vectors[:, 0], support_vectors[:, 1], s=120, facecolors='none', edgecolors='yellow', linewidths=2, label='Support vectors')

# Labels and title
plt.xlabel('t-SNE Feature 1')
plt.ylabel('t-SNE Feature 2')
plt.title('t-SNE Visualization with SVM Decision Boundary for Date Fruit Classification')
plt.legend(handles=[scatter], labels=['Data points'], loc='upper right')
plt.show()
Total Number of Support Vectors found are 15
Ans: 15
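The count can also be cross-checked per class: a fitted SVC exposes n_support_ (support vectors per class) and support_ (their indices). A short sketch reusing the svm model trained above:

# Sketch: break the support-vector count down by class.
print("Support vectors per class:", svm.n_support_)   # one count per class, in the order of svm.classes_
print("Total support vectors:", svm.n_support_.sum())
print("Indices of support vectors:", svm.support_)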
Q8. Implement Hinge Loss from Scratch
Context:
Support Vector Machines (SVMs) rely on the hinge loss function to
penalize misclassified points and those that are too close to the decision
boundary. Understanding hinge loss helps explain how SVMs work.
Task:
Write a function to compute hinge loss for a simple binary
classification dataset and use it to determine the loss of a sample classifier.
Instructions:
- Define Hinge Loss Function: Implement a Python
function to compute hinge loss using the following steps:
- Step 1: Calculate the margin for each data point: margin_i = y_i * (w · x_i + b)
- Step 2: Compute the hinge loss for each data point: max(0, 1 - margin_i)
- Evaluate Sample Classifier: Use your function to
compute the hinge loss for a classifier defined by weights and bias.
- Answer the Question: Based on the computed
loss, answer the following question about the nature of hinge loss and its
implications.
Question:
What does the computed hinge loss value indicate about the classifier's
performance?
Statements:
- The hinge loss value ( > 1 ) indicates that most points are
misclassified or too close to the decision boundary.
- The hinge loss value ( < 1 ) suggests that most points are
classified correctly but may still be close to the decision boundary.
- The hinge loss value ( = 0 ) indicates that all points are
classified correctly and far from the decision boundary.
- The hinge loss value ( = 1 ) suggests that all points are
classified correctly.
Options:
A) Statement 1
B) Statement 2
C) Statement 3
D) Statement 4
# Sample Data (Binary classification)
X = np.array([[2, 3], [3, 4], [1, 1], [4, 5], [6, 7]])
y = np.array([1, 1, -1, 1, -1])

data = pd.DataFrame(X, columns=['f1', 'f2'])
data['y'] = y
data.head()

   f1  f2  y
0   2   3  1
1   3   4  1
2   1   1 -1
3   4   5  1
4   6   7 -1
import numpy as np

# Hinge Loss Implementation
def hinge_loss(X, y, weights, bias):
    """
    Compute hinge loss for a simple SVM model.

    Args:
        X (ndarray): Feature matrix.
        y (ndarray): Target vector.
        weights (ndarray): Weight vector.
        bias (float): Bias term.

    Returns:
        float: Hinge loss value.
    """
    margins = y * (np.___(X, ____) + ___)  # TODO: Calculate the margins using a dot product of weights and features, and add the bias
    hinge_loss = np.____(0, 1 - ____)  # TODO: Apply the hinge loss formula: max(0, 1 - margin)
    return np.____(hinge_loss)  # TODO: Return the mean of the hinge loss values

# Example weights and bias
weights = np.array([0.5, 0.5])
bias = 0.0

# Compute Hinge Loss
loss = hinge_loss(X, y, weights, bias)
print(f'Hinge Loss: {loss:.4f}')
Final Code:
# Sample Data (Binary classification)
X = np.array([[2, 3], [3, 4], [1, 1], [4, 5], [6, 7]])
y = np.array([1, 1, -1, 1, -1])

data = pd.DataFrame(X, columns=['f1', 'f2'])
data['y'] = y
data.head()

import numpy as np

# Hinge Loss Implementation
def hinge_loss(X, y, weights, bias):
    """
    Compute hinge loss for a simple SVM model.

    Args:
        X (ndarray): Feature matrix.
        y (ndarray): Target vector.
        weights (ndarray): Weight vector.
        bias (float): Bias term.

    Returns:
        float: Hinge loss value.
    """
    margins = y * (np.dot(X, weights) + bias)  # Margin for each data point
    hinge_loss = np.maximum(0, 1 - margins)    # Per-point hinge loss
    return np.mean(hinge_loss)                 # Average hinge loss over all points

# Example weights and bias
weights = np.array([0.5, 0.5])
bias = 0.0

# Compute Hinge Loss
loss = hinge_loss(X, y, weights, bias)
print(f'Hinge Loss: {loss:.4f}')
Hinge Loss: 1.9000
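As a quick hand check with w = [0.5, 0.5] and b = 0: the margins y_i * (w · x_i + b) for the five points are 2.5, 3.5, -1.0, 4.5, and -6.5, so the per-point losses max(0, 1 - margin) are 0, 0, 2.0, 0, and 7.5, and their mean is 9.5 / 5 = 1.9, matching the printed value. A loss well above 1 reflects the fact that the two negative points are misclassified by this weight vector.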
Ans: Statement 1