MNIST Dataset Description
MNIST (Modified National Institute of Standards and Technology) is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.
Figure 1. Digits from the MNIST dataset
This set has been studied so much that it is often called the “hello world” of Machine Learning: whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST, and anyone who learns Machine Learning tackles this dataset sooner or later.
Scikit-Learn provides many helper functions to download popular datasets, and MNIST is one of them.
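The following code fetches the MNIST dataset:
from sklearn.datasets import fetch_openml
# as_frame=False returns NumPy arrays; since Scikit-Learn 0.24 the default is a pandas DataFrame
MNIST = fetch_openml('mnist_784', version=1, as_frame=False)
MNIST.keys()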
The call returns a dictionary-like object. Its keys, which vary slightly between Scikit-Learn versions, look like this:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])
They include the following:
· A DESCR key describing the dataset
· A data key containing an array with one row per instance and one column per feature
· A target key containing an array with the labels
Let’s look at these arrays:
X, y = MNIST["data"], MNIST["target"]
X.shape
(70000, 784)
y.shape
(70000,)
There are 70,000 images, and each image has 784 features. This is because each image is 28 × 28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black).
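To make the link between the 784 features and the 28 × 28 pixel grid concrete, here is a minimal sketch. It uses a synthetic image rather than real MNIST data, so the array names and pixel values are illustrative assumptions only.

```python
import numpy as np

# Synthetic stand-in for one image: a 28 x 28 grid of pixel
# intensities in the range 0..255 (not real MNIST data).
image = np.zeros((28, 28), dtype=np.uint8)
image[10:18, 12:16] = 255  # a small block at maximum intensity

# Flattening row by row gives the 784-element feature vector.
features = image.reshape(-1)
print(features.shape)  # (784,)

# The pixel at (row, col) ends up at index row * 28 + col.
row, col = 10, 12
print(features[row * 28 + col] == image[row, col])  # True

# Reshaping the vector restores the original grid.
restored = features.reshape(28, 28)
print(np.array_equal(restored, image))  # True
```

This is the same round trip the dataset relies on: each row of X is one flattened image, and reshaping it to (28, 28) recovers the picture.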
Let’s take a peek at one digit from the dataset. All you need to do is grab an instance’s feature vector, reshape it to a 28 × 28 array, and display it using Matplotlib’s imshow() function:
import matplotlib as mpl
import matplotlib.pyplot as plt
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()
This looks like a 5, and indeed that’s what the label tells us:
y[0]
'5'
Note that the label is a string. Most ML algorithms expect numbers, so let’s cast y to integer:
import numpy as np
y = y.astype(np.uint8)
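The effect of the cast can be checked on a handful of labels. The sketch below uses a small hand-written array (y_str is a hypothetical stand-in, not the real target array) to show the change in dtype.

```python
import numpy as np

# Hypothetical string labels, stored as Python objects.
y_str = np.array(["5", "0", "4", "1", "9"], dtype=object)

# uint8 covers 0..255, which is plenty for the digits 0..9
# and keeps the label array compact.
y_int = y_str.astype(np.uint8)

print(y_int)        # [5 0 4 1 9]
print(y_int.dtype)  # uint8
```

Once the labels are integers, comparisons such as y_int == 5 yield a boolean array, which is handy for building a binary target later on.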
To give you a feel for the complexity of the classification task, Figure 1 shows a few more images from the MNIST dataset.
You should always create a test set and set it aside before inspecting the data closely. The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
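The same slicing pattern can be wrapped in a small helper and tried on toy data. The function name split_at and the demo arrays are assumptions made for illustration; they are not part of Scikit-Learn.

```python
import numpy as np

def split_at(X, y, n_train):
    """Split X and y into a leading training part and a trailing test part."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Tiny synthetic stand-in: 7 instances with 4 features each.
X_demo = np.arange(28).reshape(7, 4)
y_demo = np.arange(7)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_at(X_demo, y_demo, 6)
print(X_train_demo.shape, X_test_demo.shape)  # (6, 4) (1, 4)
print(y_train_demo.shape, y_test_demo.shape)  # (6,) (1,)
```

For MNIST, the equivalent call would be split_at(X, y, 60000). A positional split like this is only appropriate because the dataset already comes with a fixed, pre-shuffled train/test boundary.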
The training set is already shuffled for us, which is good because it helps ensure that all cross-validation folds are similar (for example, that no fold is missing some digits). Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the dataset ensures that this won’t happen.
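If you ever need to shuffle a dataset yourself, the key point is to apply the same permutation to the features and the labels so that they stay aligned. Here is a minimal sketch on synthetic data; the array names and the seed are arbitrary choices for illustration.

```python
import numpy as np

# Synthetic stand-in for an ordered dataset: labels sorted by class,
# and row i of X_sorted holds the values [2*i, 2*i + 1].
X_sorted = np.arange(12).reshape(6, 2)
y_sorted = np.array([0, 0, 1, 1, 2, 2])

# Draw one random ordering of the row indices...
rng = np.random.default_rng(seed=42)
order = rng.permutation(len(X_sorted))

# ...and index both arrays with it, so each row keeps its label.
X_shuffled = X_sorted[order]
y_shuffled = y_sorted[order]

print(order)
print(y_shuffled)
```

Shuffling X and y with two separate calls would break the pairing between images and labels, which is why a single index array is used for both.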
YouTube link: https://youtu.be/GaVUPdyOSyY