MNIST Dataset Description

MNIST (Modified National Institute of Standards and Technology):

·       The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau.

Each image is labeled with the digit it represents.



Figure 1. Digits from the MNIST dataset

 

·       This set has been studied so much that it is often called the “hello world” of Machine Learning: whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST, and anyone who learns Machine Learning tackles this dataset sooner or later.

 

·       Scikit-Learn provides many helper functions to download popular datasets, and MNIST is one of them.

        The following code fetches the MNIST dataset:

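from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays instead of a pandas DataFrame,
# so that the indexing used later (for example X[0]) works as expected
MNIST = fetch_openml('mnist_784', version=1, as_frame=False)

MNIST.keys()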

        This returns dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url']). The exact keys can vary with the Scikit-Learn version, but they include the following:

✓  A DESCR key describing the dataset

✓  A data key containing an array with one row per instance and one column per feature

✓  A target key containing an array with the labels
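To explore the same dictionary-like structure without downloading anything, here is a minimal sketch that uses Scikit-Learn's small bundled digits dataset (load_digits) as a stand-in for MNIST. The stand-in is an assumption made purely for illustration: it holds 1,797 images of 8 × 8 pixels rather than 70,000 images of 28 × 28 pixels, but it exposes the same DESCR, data, and target keys. The names digits, X_small, and y_small are hypothetical.

```python
from sklearn.datasets import load_digits

# Stand-in for MNIST: a small digits dataset bundled with Scikit-Learn (no download)
digits = load_digits()

# The returned object behaves like a dictionary
print(sorted(digits.keys()))

# Same access pattern as with MNIST: one row per instance, one column per feature
X_small, y_small = digits["data"], digits["target"]
print(X_small.shape)
print(y_small.shape)

# DESCR is a plain string describing the dataset; show only its beginning
print(digits["DESCR"][:200])
```

The access pattern is the same for MNIST; only the number of instances and features differs.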

 

        Let’s look at these arrays:

X, y = MNIST["data"], MNIST["target"]

X.shape

            (70000, 784)

y.shape

            (70000,)

·       There are 70,000 images, and each image has 784 features.

·       This is because each image is 28 × 28 pixels (28 × 28 = 784), and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black).
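The relationship between a 28 × 28 image and its 784 features can be sketched with NumPy. The image below is random data invented for illustration (it is not an MNIST digit), and the variable names are hypothetical. The sketch assumes NumPy's default row-major layout, so the pixel at row r and column c ends up at feature index r × 28 + c.

```python
import numpy as np

# Hypothetical 28 x 28 image with random intensities in the range 0..255
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten the image into a feature vector: 28 * 28 = 784 features
features = image.reshape(-1)
print(features.shape)  # (784,)

# Reshaping back to 28 x 28 recovers the original image exactly
restored = features.reshape(28, 28)
print(np.array_equal(restored, image))  # True

# With row-major order, pixel (row, col) is feature number row * 28 + col
row, col = 3, 5
print(features[row * 28 + col] == image[row, col])  # True
```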



 

·       Let’s take a peek at one digit from the dataset.

·       All you need to do is grab an instance’s feature vector, reshape it to a 28 × 28 array, and display it using Matplotlib’s imshow() function:

import matplotlib as mpl

import matplotlib.pyplot as plt

some_digit = X[0]

some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap="binary")

plt.axis("off")

plt.show()

·       This looks like a 5, and indeed that’s what the label tells us.

 

y[0]

'5'

·       Note that the label is a string.

·       Most ML algorithms expect numbers, so let’s cast y to integer:

import numpy as np

y = y.astype(np.uint8)
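To see why the cast matters, here is a small sketch on a hypothetical array of string labels (made up for illustration, not taken from MNIST). Before the cast, comparing the labels with the integer 5 finds no match; after the cast, the comparison behaves as expected.

```python
import numpy as np

# Hypothetical labels stored as strings, as in the MNIST target array
labels = np.array(["5", "0", "4", "1", "9"], dtype=object)

# The string '5' is not equal to the integer 5
print(labels[0] == 5)  # False

# Cast the labels to small unsigned integers (digits 0..9 fit easily in uint8)
labels_int = labels.astype(np.uint8)
print(labels_int.dtype)  # uint8

# Numeric comparisons now work element by element
is_five = labels_int == 5
print(is_five.tolist())  # [True, False, False, False, False]
```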

·       To give you a feel for the complexity of the classification task, Figure 1 shows a few more images from the MNIST dataset.

·       You should always create a test set and set it aside before inspecting the data closely.

·       The MNIST dataset is actually already split into

✓  a training set (the first 60,000 images), and

✓  a test set (the last 10,000 images):

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
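The same positional split can be sketched on a tiny synthetic dataset, so that it runs without downloading MNIST. The helper split_at and the demo arrays are hypothetical names introduced for illustration; the sizes 70 and 60 simply mirror the 70,000 / 60,000 proportions.

```python
import numpy as np

def split_at(X, y, n_train):
    """Hypothetical helper: the first n_train rows form the training set, the rest the test set."""
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Synthetic stand-in data: 70 instances with 4 features each, labels 0..9
X_demo = np.arange(70 * 4).reshape(70, 4)
y_demo = np.arange(70) % 10

X_tr, X_te, y_tr, y_te = split_at(X_demo, y_demo, 60)
print(X_tr.shape, X_te.shape)  # (60, 4) (10, 4)
print(y_tr.shape, y_te.shape)  # (60,) (10,)
```

Slicing by position like this is only appropriate when the dataset is already ordered into training and test portions, as MNIST is.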

 

·       The training set is already shuffled for us, which is good because this guarantees that all cross-validation folds will be similar.

·       Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row.

·       Shuffling the dataset ensures that this won’t happen.
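For a dataset that is not already shuffled, one common approach is to apply the same random permutation to the features and the labels. The following is a minimal sketch on synthetic data; the arrays and the seed are assumptions chosen for illustration, and this step is not needed for the MNIST training set.

```python
import numpy as np

# Synthetic, ordered data: row i is [2*i, 2*i + 1] and its label is i
X_sorted = np.arange(20).reshape(10, 2)
y_sorted = np.arange(10)

# Draw one random permutation of the row indices (the seed is fixed for reproducibility)
rng = np.random.default_rng(42)
perm = rng.permutation(len(X_sorted))

# Index both arrays with the same permutation so rows and labels stay aligned
X_shuf, y_shuf = X_sorted[perm], y_sorted[perm]

# Every shuffled row still matches its label: first column equals 2 * label
print(np.array_equal(X_shuf[:, 0], 2 * y_shuf))  # True
```

Using a single permutation for both arrays is the important design choice here: shuffling them independently would break the pairing between images and labels.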

 

YouTube link: https://youtu.be/GaVUPdyOSyY


 
