MNIST Dataset Description
MNIST (Modified National Institute of Standards and Technology) is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.
Figure 1. Digits from the MNIST dataset
This set has been studied so much that it is often called the “hello world” of Machine Learning: whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST, and anyone who learns Machine Learning tackles this dataset sooner or later.
Scikit-Learn provides many helper functions to download popular datasets.
MNIST is one of them.
The following code fetches the MNIST dataset:
from sklearn.datasets import fetch_openml

# as_frame=False returns NumPy arrays rather than a pandas DataFrame
MNIST = fetch_openml('mnist_784', version=1, as_frame=False)
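The returned object is dictionary-like. Calling MNIST.keys() lists its keys, for example dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url']) (the exact list can vary with the Scikit-Learn version), including the following: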
· A DESCR key describing the dataset
· A data key containing an array with one row per instance and one column per feature
· A target key containing an array with the labels
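To see this key layout without downloading anything, here is a minimal sketch that uses load_digits(), a much smaller digits dataset bundled with Scikit-Learn. It is a stand-in chosen for illustration, not the MNIST data, and the names X_small and y_small are hypothetical.

```python
from sklearn.datasets import load_digits

# load_digits() ships with Scikit-Learn, so no download is needed.
# It is a small stand-in here, NOT the MNIST dataset itself.
digits = load_digits()

print(sorted(digits.keys()))          # includes 'DESCR', 'data' and 'target'

X_small, y_small = digits["data"], digits["target"]
print(X_small.shape)                  # one row per image, one column per pixel
print(y_small.shape)                  # one label per image
```

The same dictionary-style access, such as digits["data"], is what the MNIST object supports.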
Let’s look at these arrays:
X, y = MNIST["data"], MNIST["target"]
There are 70,000 images, and each image has 784 features. This is because each image is 28 × 28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black).
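The relationship between the 784 features and the 28 × 28 pixel grid can be checked on a synthetic row. The sketch below assumes row-major pixel order and uses random values as a hypothetical stand-in for a real MNIST row.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 pixel intensities in 0..255.
rng = np.random.default_rng(0)
fake_digit = rng.integers(0, 256, size=784, dtype=np.uint8)

# 28 * 28 == 784, so the flat vector reshapes into a square image.
fake_image = fake_digit.reshape(28, 28)
print(fake_image.shape)

# With row-major order, pixel (row r, column c) is feature r * 28 + c.
r, c = 3, 5
print(fake_image[r, c] == fake_digit[r * 28 + c])
```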
Let’s take a peek at one digit from the dataset.
All you need to do is grab an instance’s feature vector, reshape
it to a 28 × 28 array, and display it using Matplotlib’s imshow() function:
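import matplotlib as mpl
import matplotlib.pyplot as plt

some_digit = X[0]                               # feature vector of the first image
some_digit_image = some_digit.reshape(28, 28)   # 784 values -> 28 × 28 grid
plt.imshow(some_digit_image, cmap="binary")     # "binary" maps 0 to white, 255 to black
plt.axis("off")
plt.show()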
This looks like a 5, and indeed that’s what the label tells us.
Note that the label is a string.
Most ML algorithms expect numbers, so let’s cast y to integer:
import numpy as np

y = y.astype(np.uint8)
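The effect of this cast can be seen on a handful of made-up labels. The array below is a hypothetical example, not values read from the dataset.

```python
import numpy as np

# Hypothetical labels, stored as strings the way the fetched targets are.
labels = np.array(["5", "0", "4", "1", "9"], dtype=object)

# astype parses each string into an unsigned 8-bit integer (enough for 0..9).
labels_int = labels.astype(np.uint8)
print(labels_int.dtype)

# Numeric comparisons now work, e.g. a boolean mask for "is this a 5?".
is_five = labels_int == 5
print(is_five)
```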
To give you a feel for the complexity of the classification task,
Figure 1 shows a few more images from the MNIST dataset.
You should always create a test set and set it aside before inspecting the data closely.
The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
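The same slicing pattern can be tried on tiny arrays to confirm that no row is lost or duplicated. In this sketch, split_head_tail is a hypothetical helper and the 7-row arrays are stand-ins for X and y.

```python
import numpy as np

def split_head_tail(X, y, n_train):
    """Hypothetical helper: first n_train rows for training, the rest for testing."""
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Tiny stand-in arrays: 7 samples with 2 features each.
X_demo = np.arange(14).reshape(7, 2)
y_demo = np.arange(7)

X_tr, X_te, y_tr, y_te = split_head_tail(X_demo, y_demo, n_train=6)
print(X_tr.shape, X_te.shape)
print(y_tr, y_te)
```

Because the split is purely positional, it only matches the intended training and test sets when the rows are in the order described above.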
The training set is already shuffled for us, which is good because this guarantees that all cross-validation folds will be similar. Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the dataset ensures that this won’t happen.
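To illustrate why ordering matters for cross-validation folds, the sketch below builds hypothetical labels sorted by class, cuts them into contiguous folds, and counts how many distinct digits each fold contains before and after shuffling. The helper classes_per_fold and the sizes used are assumptions made for this example.

```python
import numpy as np

# Hypothetical labels sorted by class: the worst case for contiguous folds.
y_sorted = np.repeat(np.arange(10), 30)      # 30 samples of each digit 0..9

def classes_per_fold(y, n_folds=3):
    """Count the distinct classes in each contiguous fold."""
    return [len(np.unique(fold)) for fold in np.array_split(y, n_folds)]

print(classes_per_fold(y_sorted))            # each fold sees only a few digits

# Shuffle with one permutation; the same index would also reorder X.
rng = np.random.default_rng(42)
shuffled_idx = rng.permutation(len(y_sorted))
y_shuffled = y_sorted[shuffled_idx]
print(classes_per_fold(y_shuffled))          # folds now cover far more digits
```

Reusing a single index array for both the features and the labels keeps each image paired with its label.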