
Dinghow's Personal Blog

Solution to 1_notmnist part2

10/22/2024

1_notmnist is the first assignment of the Udacity Deep Learning course.

The solution .ipynb file is in my GitHub repository.

In part 1, we used several methods to process the data; in this part, we will train our own model.

5. Problem 5

By construction, this dataset might contain a lot of overlapping samples, including training data that’s also contained in the validation and test set! Overlap between training and test can skew the results if you expect to use your model in an environment where there is never an overlap, but are actually ok if you expect to see training samples recur when you use it. Measure how much overlap there is between training, validation and test samples.

Optional questions:

  • What about near duplicates between datasets? (images that are almost identical)
  • Create a sanitized validation and test set, and compare your accuracy on those in subsequent assignments.

In this problem, we first need to find the overlapping samples. To compare two images, hashing is much faster than comparing the image arrays directly, so I chose MD5 to hash each image, then compared the hashes by traversal and recorded the matching indices.

5.1 Find the duplicates

import hashlib
import numpy as np

# Find overlapping images by comparing MD5 hashes, which is much faster
# than comparing the raw image arrays with numpy
def overlap_measure_md5(dataset1, dataset2):
    dataset_md5_1 = np.array([hashlib.md5(img).hexdigest() for img in dataset1])
    dataset_md5_2 = np.array([hashlib.md5(img).hexdigest() for img in dataset2])
    overlap = {}
    for i, md5_1 in enumerate(dataset_md5_1):
        duplicates = np.where(dataset_md5_2 == md5_1)
        if len(duplicates[0]):
            overlap[i] = duplicates[0]  # indices of the duplicates in dataset2
    return overlap
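
A quick usage sketch (the array names train_dataset, valid_dataset and test_dataset are assumptions, carried over from how the data was loaded in part 1):

# Hypothetical usage; the dataset names are assumed from part 1
train_valid_overlap = overlap_measure_md5(train_dataset, valid_dataset)
train_test_overlap = overlap_measure_md5(train_dataset, test_dataset)
print('train/valid overlap:', len(train_valid_overlap))
print('train/test overlap:', len(train_test_overlap))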
[Figure: overlap counts between the training, validation and test sets]

We can see the number of overlapping images between sets.

5.2 Sanitize the validation and test set

I simply deleted the duplicated images from dataset1:

# Remove from dataset1 (and labels1) every image that also appears in dataset2
def overlap_kick_off(dataset1, dataset2, labels1):
    dataset_md5_1 = np.array([hashlib.md5(img).hexdigest() for img in dataset1])
    dataset_md5_2 = np.array([hashlib.md5(img).hexdigest() for img in dataset2])
    overlap = []
    for i, md5_1 in enumerate(dataset_md5_1):
        duplicates = np.where(dataset_md5_2 == md5_1)
        if len(duplicates[0]):
            overlap.append(i)
    print('Delete ' + str(len(overlap)) + ' items')
    # labels1 is 1-D, so axis=None deletes the same entries
    return np.delete(dataset1, overlap, 0), np.delete(labels1, overlap, None)
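
A usage sketch; the _sanitized names match the variables used in the training code below, while train_dataset, train_labels and test_labels are assumed from part 1:

# Hypothetical usage: sanitize the training set against the validation
# and test sets, and the test set against the validation set
train_dataset_sanitized, train_labels_sanitized = overlap_kick_off(
    train_dataset, valid_dataset, train_labels)
train_dataset_sanitized, train_labels_sanitized = overlap_kick_off(
    train_dataset_sanitized, test_dataset, train_labels_sanitized)
test_dataset_sanitized, test_labels_sanitized = overlap_kick_off(
    test_dataset, valid_dataset, test_labels)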

We can see the shape of the sets:

[Figure: shapes of the sanitized training and test sets]

Remember to save the sanitized data.
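
A minimal save sketch using pickle; the file name notMNIST_sanitized.pickle is my assumption:

import pickle

# Hypothetical save step; the file name is an assumption
with open('notMNIST_sanitized.pickle', 'wb') as f:
    save = {
        'train_dataset': train_dataset_sanitized,
        'train_labels': train_labels_sanitized,
        'test_dataset': test_dataset_sanitized,
        'test_labels': test_labels_sanitized,
    }
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)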

6. Problem 6

Let’s get an idea of what an off-the-shelf classifier can give you on this data. It’s always good to check that there is something to learn, and that it’s a problem that is not so trivial that a canned solution solves it.

Train a simple model on this data using 50, 100, 1000 and 5000 training samples. Hint: you can use the LogisticRegression model from sklearn.linear_model.

Optional question: train an off-the-shelf model on all the data!

Here I used Keras instead of sklearn to build the logistic regression model; you can choose whichever tool you like. For reference, a minimal sketch of the sklearn route from the hint is below, followed by my Keras version.
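
# Sketch of the sklearn alternative suggested in the hint; sample_size
# and the sanitized dataset names are assumptions
from sklearn.linear_model import LogisticRegression

sample_size = 5000
X = train_dataset_sanitized[:sample_size].reshape(sample_size, 28 * 28)
y = train_labels_sanitized[:sample_size]
clf = LogisticRegression()
clf.fit(X, y)
n_test = len(test_dataset_sanitized)
print('sklearn accuracy:', clf.score(
    test_dataset_sanitized.reshape(n_test, 28 * 28), test_labels_sanitized))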

# Import keras
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Activation

# Build the model: a single Dense layer with softmax activation
# is exactly multinomial logistic regression
input_dim = 28 * 28
output_dim = 10
model = Sequential()
model.add(Dense(output_dim, input_dim=input_dim, activation='softmax'))
batch_size = 128
epochs = 20

# Flatten each 28x28 image into a 784-dimensional vector
X_train = train_dataset_sanitized.reshape(train_dataset_sanitized.shape[0], input_dim)
X_test = test_dataset_sanitized.reshape(test_dataset_sanitized.shape[0], input_dim)
X_val = valid_dataset.reshape(valid_dataset.shape[0], input_dim)
Y_train = train_labels_sanitized.astype('float32')
Y_test = test_labels_sanitized.astype('float32')
Y_val = valid_labels.astype('float32')

# Convert the labels to one-hot vectors
Y_train = np_utils.to_categorical(Y_train, num_classes=10)
Y_test = np_utils.to_categorical(Y_test, num_classes=10)
Y_val = np_utils.to_categorical(Y_val, num_classes=10)

model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(X_val, Y_val))

We can see the training process:

[Figure: Keras training log over 20 epochs]
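
If you prefer curves to the raw log, here is a small matplotlib sketch; note that older Keras versions store the metric under 'acc', newer ones under 'accuracy':

import matplotlib.pyplot as plt

# Plot the curves recorded by model.fit(); pick the history key that
# matches the installed Keras version
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.plot(history.history[acc_key], label='train accuracy')
plt.plot(history.history['val_' + acc_key], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()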

Then we test our model; the accuracy looks good.

# Evaluate the model
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0], 'Test accuracy:', score[1])

Test score: 0.4222583609672285 Test accuracy: 0.8918808649530804