1_notmnist is the first assignment of the Udacity Deep Learning course. The solution ipynb file is in my GitHub repository.
1. Problem 1
Let’s take a peek at some of the data to make sure it looks sensible. Each exemplar should be an image of a character A through J rendered in a different font. Display a sample of the images that we just downloaded. Hint: you can use the package IPython.display.
To verify the data, I used the random module to select some samples and pyplot to inspect them:
import random
import pickle
import matplotlib.pyplot as plt

def disp_samples(data_folders):
    for folder in data_folders:
        print(folder)
        folder_path = folder + '.pickle'
        with open(folder_path, 'rb') as f:  # pickle files must be opened in binary mode
            sample = pickle.load(f)
        size = sample.shape[0]
        print(size)
        plt.imshow(sample[random.randint(0, size - 1)])  # randint is inclusive on both ends
        plt.show()

disp_samples(train_folders)
disp_samples(test_folders)
Then I ran the functions that had already been written in the notebook.
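For context, here is a minimal sketch of what those provided functions do: read each class folder's PNG files, normalize the pixels, and dump the resulting array to a '<folder>.pickle' file. The name pickle_class_folder is my own for illustration; the notebook's actual helpers differ in details such as error handling and minimum image counts.

# Illustrative sketch (not the notebook's exact code): normalize each
# image to roughly zero mean and pickle the whole class as one array.
import os
import pickle
import numpy as np
from scipy import ndimage  # ndimage.imread existed in the SciPy of that era

image_size = 28      # notMNIST images are 28x28
pixel_depth = 255.0  # maximum pixel value

def pickle_class_folder(folder):
    image_files = os.listdir(folder)
    dataset = np.ndarray((len(image_files), image_size, image_size),
                         dtype=np.float32)
    n = 0
    for image in image_files:
        try:
            # Scale pixel values to the range [-0.5, 0.5]
            data = (ndimage.imread(os.path.join(folder, image)).astype(float)
                    - pixel_depth / 2) / pixel_depth
            dataset[n, :, :] = data
            n += 1
        except IOError:
            pass  # some notMNIST files are unreadable; skip them
    with open(folder + '.pickle', 'wb') as f:
        pickle.dump(dataset[:n, :, :], f, pickle.HIGHEST_PROTOCOL)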
2. Problem 2
Let’s verify that the data still looks good. Displaying a sample of the labels and images from the ndarray. Hint: you can use matplotlib.pyplot.
To verify the data, I used the random module to select some samples from a single class and pyplot to inspect them:
def disp_samples(data_folders):
    folder = random.choice(data_folders)  # pick one class folder at random
    folder_path = folder + '.pickle'
    with open(folder_path, 'rb') as f:
        dataset = pickle.load(f)
    plt.suptitle(folder[-1])              # the last character is the letter
    items = random.sample(list(dataset), 8)
    for i, item in enumerate(items):
        plt.subplot(2, 4, i + 1)
        plt.axis('off')
        plt.imshow(item)
    plt.show()

disp_samples(train_folders)
3. Problem 3
Another check: we expect the data to be balanced across classes. Verify that.
Merge and prune the training data as needed. Depending on your computer setup, you might not be able to fit it all in memory, and you can tune train_size as needed. The labels will be stored into a separate array of integers 0 through 9.
Also create a validation dataset for hyperparameter tuning.
Because some images can't be read, we need to check whether the classes are still balanced after processing them into '.pickle' files, so I printed the size of each class:
# Check the data size of each class
def show_size(data_folders):
    for folder in data_folders:
        folder_path = folder + '.pickle'
        with open(folder_path, 'rb') as f:
            sample = pickle.load(f)
        print('The size of ' + folder + ' is: ' + str(sample.shape[0]))

show_size(train_folders)
show_size(test_folders)
Then I ran the functions that had already been written, without changing them; a sketch of the merge step they perform follows below.
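The merge step amounts to taking an equal slice from every class pickle and stacking the slices into one dataset with integer labels 0 through 9. The sketch below is my approximation under the notebook's conventions (merge_sketch is a hypothetical name; the notebook's own merge function also carves out the validation set):

import pickle
import numpy as np

image_size = 28  # notMNIST images are 28x28

def merge_sketch(pickle_files, size_per_class):
    num_classes = len(pickle_files)
    dataset = np.ndarray((num_classes * size_per_class, image_size, image_size),
                         dtype=np.float32)
    labels = np.ndarray(num_classes * size_per_class, dtype=np.int32)
    for label, pickle_file in enumerate(pickle_files):
        with open(pickle_file, 'rb') as f:
            letter_set = pickle.load(f)
        start, end = label * size_per_class, (label + 1) * size_per_class
        dataset[start:end, :, :] = letter_set[:size_per_class, :, :]
        labels[start:end] = label  # 0 = 'A', ..., 9 = 'J'
    return dataset, labels

After this preprocessing, we need to verify the data again: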
num2letter = {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E',
              5: 'F', 6: 'G', 7: 'H', 8: 'I', 9: 'J'}
test = random.randint(0, len(train_dataset) - 1)  # randint is inclusive, so subtract 1
plt.imshow(train_dataset[test])
plt.show()
print(num2letter[train_labels[test]])
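Eyeballing single samples doesn't prove balance, so one could additionally count the examples per class. This follow-up check is my own addition, not part of the notebook:

# Count examples per class; a balanced dataset has near-equal counts
import numpy as np

for letter, count in zip('ABCDEFGHIJ', np.bincount(train_labels, minlength=10)):
    print(letter, count)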
4. Problem 4
Convince yourself that the data is still good after shuffling!
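The shuffling that this problem refers to boils down to applying one random permutation to the dataset and its labels in unison; a minimal sketch, assuming NumPy arrays as in the notebook:

import numpy as np

def shuffle_in_unison(dataset, labels):
    # One permutation reorders both arrays, so image/label pairs stay aligned
    permutation = np.random.permutation(labels.shape[0])
    return dataset[permutation], labels[permutation]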
To see the condition of the data after shuffling, I randomly picked some training samples together with their labels:
def disp_sample_dataset(dataset, labels):
    items = random.sample(range(len(labels)), 8)
    for i, item in enumerate(items):
        plt.subplot(2, 4, i + 1)
        plt.axis('off')
        plt.title(num2letter[labels[item]])
        plt.imshow(dataset[item])
    plt.show()

disp_sample_dataset(train_dataset, train_labels)
The data looked good, so all that remained was to save it.
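A minimal sketch of that save step, assuming the notebook's variable names for the three splits and its target file notMNIST.pickle:

import pickle

save = {
    'train_dataset': train_dataset, 'train_labels': train_labels,
    'valid_dataset': valid_dataset, 'valid_labels': valid_labels,
    'test_dataset': test_dataset, 'test_labels': test_labels,
}
with open('notMNIST.pickle', 'wb') as f:
    # HIGHEST_PROTOCOL keeps the file compact and fast to load
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)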