MNIST Dataset Description

MNIST (Modified National Institute of Standards and Technology):

· The MNIST dataset is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau.

Each image is labeled with the digit it represents.

Figure 1. Digits from the MNIST dataset

· This set has been studied so much that it is often called the “hello world” of Machine Learning: whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST, and anyone who learns Machine Learning tackles this dataset sooner or later.

· Scikit-Learn provides many helper functions to download popular datasets, and MNIST is one of them.

• The following code fetches the MNIST dataset:

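from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays; recent Scikit-Learn versions would

# otherwise return a pandas DataFrame, and X[0] below would fail.

MNIST = fetch_openml('mnist_784', version=1, as_frame=False)

MNIST.keys()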

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])

• The returned object has a dictionary-like structure (the exact keys vary with the Scikit-Learn version), including the following:

✓ A DESCR key describing the dataset

✓ A data key containing an array with one row per instance and one column per feature

✓ A target key containing an array with the labels
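
The dictionary-like structure can be explored offline with a smaller dataset. The sketch below uses load_digits, a different dataset of 8 × 8 digit images bundled with Scikit-Learn, purely as a stand-in for MNIST (an assumption made here so that no download is needed). It exposes the same DESCR, data, and target keys.

```python
from sklearn.datasets import load_digits

# load_digits is NOT MNIST: it is a small bundled dataset of 8 x 8 digit
# images, used here only as an offline stand-in with the same structure.
digits = load_digits()

# The returned Bunch behaves like a dict, so its keys can be listed.
keys = list(digits.keys())

# Each entry is reachable by key or, equivalently, as an attribute.
data = digits["data"]        # one row per instance, one column per feature
target = digits.target       # one label per instance
description = digits.DESCR   # plain-text description of the dataset

print(keys)
print(data.shape, target.shape)
```

Key access and attribute access return the same objects, so either style can be used with the MNIST object as well.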

• Let’s look at these arrays:

X, y = MNIST["data"], MNIST["target"]

X.shape

(70000, 784)

y.shape

(70000,)

• There are 70,000 images, and each image has 784 features. This is because each image is 28 × 28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black).
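
The link between the 784 features and the 28 × 28 pixel grid can be sketched with a synthetic array. This example assumes the usual row-major layout, in which the pixel at (row, col) sits at index row * 28 + col of the feature vector. The array is made-up data, not a real MNIST image.

```python
import numpy as np

SIDE = 28  # MNIST images are 28 x 28 pixels

# Synthetic stand-in for one image: all background (0) with one dark pixel.
image = np.zeros((SIDE, SIDE), dtype=np.uint8)
image[3, 5] = 255

# Flattening the grid row by row yields the 784-element feature vector.
features = image.reshape(-1)

# Under row-major order, pixel (row, col) lands at index row * SIDE + col.
flat_index = 3 * SIDE + 5

print(features.shape)        # (784,)
print(features[flat_index])  # 255
```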

· Let’s take a peek at one digit from the dataset.

· All you need to do is grab an instance’s feature vector, reshape it to a 28 × 28 array, and display it using Matplotlib’s imshow() function:

import matplotlib as mpl

import matplotlib.pyplot as plt

some_digit = X[0]

some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap="binary")

plt.axis("off")

plt.show()

· This looks like a 5, and indeed that’s what the label tells us.

y[0]

'5'

· Note that the label is a string.

· Most ML algorithms expect numbers, so let’s cast y to integer:

import numpy as np

y = y.astype(np.uint8)
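
The effect of the cast can be checked on a small hand-made array of string labels (hypothetical values, not read from the dataset). uint8 is wide enough here because the labels only range from 0 to 9.

```python
import numpy as np

# Hypothetical labels stored as strings, like the '5' shown above.
labels = np.array(["5", "0", "4", "1", "9"])

# astype parses each digit string and stores it as an 8-bit unsigned integer.
numeric_labels = labels.astype(np.uint8)

print(numeric_labels.dtype)  # uint8
print(numeric_labels.sum())  # 19
```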

· To give you a feel for the complexity of the classification task, Figure 1 shows a few more images from the MNIST dataset.

· You should always create a test set and set it aside before inspecting the data closely.

· The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
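
The same positional split can be tried on a scaled-down synthetic dataset. The sizes used here (70 rows, split at row 60) are illustrative stand-ins for 70,000 and 60,000.

```python
import numpy as np

# Synthetic stand-in: 70 instances with 4 features each, labels 0-9 repeated.
X = np.arange(70 * 4).reshape(70, 4)
y = np.arange(70) % 10

SPLIT = 60  # plays the role of 60,000 in the MNIST split

# Slicing X and y at the same index keeps every row paired with its label.
X_train, X_test = X[:SPLIT], X[SPLIT:]
y_train, y_test = y[:SPLIT], y[SPLIT:]

print(X_train.shape, X_test.shape)  # (60, 4) (10, 4)
print(y_train.shape, y_test.shape)  # (60,) (10,)
```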

· The training set is already shuffled for us, which is good because this guarantees that all cross-validation folds will be similar.

· Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the dataset ensures that this won’t happen.
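
For a dataset that does not come pre-shuffled, one common approach is to draw a single random permutation and apply it to both the features and the labels. The sketch below uses synthetic, label-sorted data and an arbitrary seed; neither comes from the MNIST example above.

```python
import numpy as np

# Synthetic data sorted by label: the worst case for order-sensitive learners.
y_sorted = np.repeat(np.arange(10), 6)    # 0,0,0,0,0,0,1,1,... (60 labels)
X_sorted = y_sorted.reshape(-1, 1) * 10   # one feature tied to its label

# One permutation, applied to both arrays, keeps rows and labels aligned.
rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
order = rng.permutation(len(y_sorted))
X_shuffled, y_shuffled = X_sorted[order], y_sorted[order]

# Every instance still carries its own label after shuffling.
print(np.array_equal(X_shuffled[:, 0], y_shuffled * 10))  # True
```

Shuffling the two arrays separately would break the pairing between images and labels, which is why a shared index array is used.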

YouTube Link:

https://youtu.be/GaVUPdyOSyY
