MNSIT: Recognizing the numbers (Part 1)

Share this post

After working on binary classification in the kaggle competition with data from the Titanic , how about tackling another facet of machine learning? Multiple classification (Multi-class) through the recognition of figures / digit MNSIT (Modified National Institute of Standards and Technology database) is indeed a must in the initiation to Machine Learning. So let’s get carried away in this new Kaggle competition

NB: The MNIST database for Modified or Mixed National Institute of Standards and Technology , is a database of handwritten numbers. The MNIST base has become a standard test. It brings together 60,000 training images and 10,000 test images, taken from an earlier database, simply called NIST 1 . These are black and white images, normalized centered 28 pixels per side. (Source: Wikipedia )

Retrieve and read the dataset

To do this first register in the Kaggle competition , then retrieve the data in the data tab .

Here is the structure of the training and test files:

As usual of course the labels (label) are not present in the test data set… where would be the stake if not?😉

We have to process images. However, these images are bitmaps, that is to say they are represented via matrices of pixels. Each point / pixel is therefore an entry in the matrix. The value in the matrix defining the intensity of the point (gray level) of the image:

The only catch is that the data we recover is not exactly in this matrix format. in fact each image is flattened on a single line. We will therefore have to retrieve the entire row (without the label of course) and resize the row to a 28 × 28 matrix. Luckily the reshape () function comes to your rescue, but we’ll see that in the next chapter.

This step is not necessary if you do not wish to retouch or rework the images. You could quite simply run a machine learning algorithm directly on the matrix data online:

import pandas as pd
from sklearn.linear_model import SGDClassifier
pd.options.display.max_columns = None

TRAIN = pd.read_csv("./data/train.csv", delimiter=',') #, skiprows=1)
TEST = pd.read_csv("./data/test.csv", delimiter=',') #, skiprows=1)
X_TRAIN = TRAIN.copy()
X_TEST = TEST.copy()
y = TRAIN.label
del X_TRAIN["label"]

sgd = SGDClassifier(random_state=42)
sgd.fit(X_TRAIN, y)
print ("Score Train -->", round(sgd.score(X_TRAIN, y) *100,2), " %")
Score Train --> 85.88 %

Without any adjustments and in a few lines you will have an honorable score of 85% (if you submit it as is to kaggle you will get 84% which is consistent). But of course we can do better, much better.

View images

The following portion of code retrieves a row from the dataset. The row is converted to a 28 × 28 matrix via the reshape () function and and visualized via matplotlib .

# returns the image in digit (28x28)
def getImageMatriceDigit(dataset, rowIndex):
    return dataset.iloc[rowIndex, 0:].values.reshape(28,28)
# returns the image matrix in one row
def getImageLineDigit(dataset, rowIndex):
    return dataset.iloc[rowIndex, 0:]

imgDigitMatrice = getImageMatriceDigit(X_TRAIN, 3)
imgDigit = getImageLineDigit(X_TRAIN, 3)
plt.imshow(imgDigitMatrice, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()

The result is then displayed:

Multi-class classification

In the Titanic project , we were in a binary classification type machine learning project. Indeed we had to determine whether the passengers were either survivors or dead. 2 possibilities only, hence the term binary classification. The multi-class classification extends this principle of classification to several labeling classes. This is exactly the case here because we have to classify the digits on the different possibilities [0..9].

In terms of Machine Learning algorithms, we therefore have two ways of handling this type of problem:

  • Using binary classification algorithms. In this case it will be necessary to apply these algorithms several times by “binarizing” the labels. For example by applying an algorithm which will recognize the 1s, then the 2s, etc. This is what we call a strategy alone against everything (One versus All). Another method will consist of comparing the pairs / tuples to each other (One versus One)
  • Using Multi-class algorithms. The scikit-learn library offers a good number of them:
    • SVC (Support Vector Machine)
    • Random Forest
    • SGD (Stochastic Gradient Descent)
    • K near neighbors (KNeighborsClassifier)
    • etc.

Results analysis

In a previous chapter we started training on data with a Stochastic Gradient Descent (SGD) algorithm. We saw in another article how to analyze the results of a binary classification, but what about a multi-class classification?

Well, we’re just going to use the same analysis tools as the binary classification, but of course a little different.

The most practical tool in my opinion remains the confusion matrix . The difference here is we’ll read it differently.

sgd = SGDClassifier(random_state=42)
cross_val_score (sgd, X_TRAIN, y, cv=5, scoring="accuracy")
y_pred = cross_val_predict(sgd, X_TRAIN, y, cv=5)
mc = confusion_matrix(y, y_pred)
print(mc)

Here is the result :

The latter is very practical and allows you to compare by class (here I remind you [0..9]) the number of erroneous values ​​between the prediction and the value actually observed. As a corollary, a confusion matrix being diagonal means a score of 100%! we will therefore only be interested in the values ​​which are outside the diagonals, because they are those which point to the prediction errors.

A visualization in the form of a heat map is also very useful. A simple call to matplotlib via the matshow () method and voila:

plt.matshow(mc)

First conclusions

In this first article we haven’t really done much yet. Reading the data and launching a first algorithm in order to get a first overview is essential and appears as the first step in order to get an idea of ​​what we are going to implement subsequently to achieve our objective.

In a second article, we will see how to best use this dataset. For example, we could rework the information (for example by extending the dataset) or scale the raster data. And above all, finally, we will test and optimize our machine learning algorithms.

Share this post

Benoit Cayla

In more than 15 years, I have built-up a solid experience around various integration projects (data & applications). I have, indeed, worked in nine different companies and successively adopted the vision of the service provider, the customer and the software editor. This experience, which made me almost omniscient in my field naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in such sectors like insurance and finance. Really passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I can combine my subject matter skills with automation to help my customers to automate complex business processes in a more efficient way. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: datacorner.fr Learning, convincing by the arguments and passing on my knowledge could be my caracteristic triptych.

View all posts by Benoit Cayla →

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Fork me on GitHub