Programming Homework Help. Machine learning
As a simplified example of character recognition, we will compare several supervised
learning classifiers, with K-Fold validation, on a larger version of the MNIST digit
recognition dataset. This assignment uses a much larger dataset than the one used for
assignment 1, which should better represent the natural variability in handwritten
8s and 9s.
Download NumberRecognitionBigger.mat from Moodle. Note that the dataset includes data
samples for all handwritten digits 0 to 9, but we will be using only 8 and 9 for this
assignment. You can implement your assignment in either Matlab or Python, with details
to follow:
Coding
Example Matlab and Python functions that can be relied upon are already outlined in
Assignment 1. Assignment 2 may also benefit from the following commands. You are
expected to read documentation on the commands available and try to get them
working, prior to asking for assistance. Please address questions to the course
Python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB as NB
Also strongly consider using:
from sklearn.model_selection import cross_validate
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
and using the random_state argument for either StratifiedShuffleSplit or
StratifiedKFold.
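As a rough illustration of why random_state matters, the sketch below builds a seeded 5-fold stratified split twice and checks that both runs give identical folds. The labels are synthetic stand-ins (60 eights and 40 nines), and make_folds is a hypothetical helper name, not something provided with the assignment.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in labels (assumption): 60 samples of class 8, 40 of class 9.
y = np.array([8] * 60 + [9] * 40)
X = np.zeros((len(y), 1))  # placeholder features; the split is driven by y

def make_folds(X, y, seed=0):
    """Hypothetical helper: list of (train_idx, test_idx) pairs from a seeded split."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return list(skf.split(X, y))

folds_a = make_folds(X, y, seed=0)
folds_b = make_folds(X, y, seed=0)

# Same seed, same folds: every test-index array matches across the two runs.
same = all(np.array_equal(a[1], b[1]) for a, b in zip(folds_a, folds_b))

# Stratification keeps the 8/9 ratio the same in every test fold.
test_counts = [(int(np.sum(y[te] == 8)), int(np.sum(y[te] == 9))) for _, te in folds_a]

print("identical folds:", same)
print("test-fold class counts (8s, 9s):", test_counts)
```

Storing the list of index pairs once and reusing it is one simple way to guarantee that every classifier sees exactly the same training and test samples.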
Question 1: Implement K-Fold cross validation (K=5). Within the validation, you
will train and compare a Linear Discriminant Analysis Classifier, a Quadratic
Discriminant Analysis Classifier, a Bayesian Classifier (Naïve Bayes) and a K-NN
(K=1, K=5 and K=10) classifier. The validation loop will train these models for
predicting 8s and 9s. NOTE: for a fair comparison, K-Fold randomization should
only be performed once, with any selected samples for training applied to the
creation of all classifier types (LDA, QDA, Bayes, KNN) in an identical manner (i.e.
the exact same set of training data will be used to construct each model being
compared to ensure a fair comparison).
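One possible shape for such a validation loop is sketched below. It is only an illustration under stated assumptions: the data are synthetic Gaussian blobs labelled 8 and 9 rather than the course dataset, and make_models and kfold_error_rates are hypothetical helper names. The point it shows is that the folds are generated once and then reused for every classifier.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.naive_bayes import GaussianNB as NB
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): two Gaussian blobs labelled 8 and 9.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)), rng.normal(1.5, 1.0, (100, 5))])
y = np.array([8] * 100 + [9] * 100)

def make_models():
    """Hypothetical helper: fresh, unfitted instances of each classifier."""
    return {
        "LDA": LDA(),
        "QDA": QDA(),
        "NB": NB(),
        "KNN1": KNN(n_neighbors=1),
        "KNN5": KNN(n_neighbors=5),
        "KNN10": KNN(n_neighbors=10),
    }

def kfold_error_rates(X, y, n_splits=5, seed=0):
    """Mean test error per classifier over one shared set of stratified folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    folds = list(skf.split(X, y))  # randomize once, reuse for every model
    errors = {name: [] for name in make_models()}
    for train_idx, test_idx in folds:
        for name, model in make_models().items():
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            errors[name].append(np.mean(pred != y[test_idx]))
    return {name: float(np.mean(errs)) for name, errs in errors.items()}

error_rates = kfold_error_rates(X, y)
for name, err in error_rates.items():
    print(f"{name}: {err:.3f}")
```

Creating new model instances inside each fold avoids carrying fitted state from one fold to the next; the error rates printed here describe the synthetic data only and say nothing about the real digits.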
Provide a K-Fold validated error rate for each of the classifiers. Provide a printout of your
code (Matlab or Python). Answer the following questions:
a) Which classifier performs the best in this task?
b) Why do you think this classifier outperforms the others?
c) How does KNN compare to the results obtained in assignment 1? Why do you
observe this comparative pattern?
It was previously announced on multiple occasions that each student is required to
assemble their own dataset compatible with supervised learning based classification
(i.e. a collection of measurements across many samples/instances/subjects that include
a group of interest distinct from the rest of the samples). If you are happy with your
choice from assignment 1, then re-provide your answer to Assignment 1 Question 2
below. If you want to change your dataset for this assignment, for a future assignment or
for your graduate project, you are free to do so, but you have to update your answer to
Question 2 based on your new dataset choice.
Question 2: (Repeat) Describe the dataset you have collected: total number of
samples, total number of measurements, brief description of the measurements
included, nature of the group of interest and what differentiates it from the other
samples, sample counts for your group of interest and sample count for the group not of
interest. Write a program that analyzes each measurement/feature individually. For each
measurement, compute Cohen’s d statistic (the difference between the average value of
the group of interest and the average value of the group not of interest, divided by the
standard deviation of the joint distribution that includes both groups). Provide a printout
of the 10 leading measurements (d statistic furthest from zero), with their corresponding
d statistic, making it clear what those measurements represent in your dataset (these
are the measurements with the most obvious potential to inform prediction in any given
machine learning algorithm). Provide a printout of this code.
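A minimal sketch of the per-feature computation follows, using the definition given above (difference of group means divided by the standard deviation of all samples combined, which differs from the pooled-standard-deviation form often seen elsewhere). The function names, the tiny example matrix, the feature names and the choice of the sample standard deviation (ddof=1) are all assumptions made for illustration.

```python
import numpy as np

def cohens_d_per_feature(X, is_interest):
    """d per column: (mean of interest group - mean of the rest) / std of all samples."""
    X = np.asarray(X, dtype=float)
    is_interest = np.asarray(is_interest, dtype=bool)
    mean_diff = X[is_interest].mean(axis=0) - X[~is_interest].mean(axis=0)
    joint_std = X.std(axis=0, ddof=1)  # assumption: sample standard deviation
    # Constant features have zero spread; report d = 0 instead of dividing by zero.
    safe_std = np.where(joint_std > 0, joint_std, 1.0)
    return np.where(joint_std > 0, mean_diff / safe_std, 0.0)

def top_features(d, names, k=10):
    """The k features whose d statistic is furthest from zero, largest first."""
    order = np.argsort(-np.abs(d))[:k]
    return [(names[i], float(d[i])) for i in order]

# Tiny made-up example: 4 samples, 3 features, last two rows are the group of interest.
X_demo = np.array([[0, 5, 1],
                   [0, 5, 2],
                   [2, 5, 1],
                   [2, 5, 2]])
group_demo = np.array([False, False, True, True])
names_demo = ["f0", "f1", "f2"]  # hypothetical feature names

d_demo = cohens_d_per_feature(X_demo, group_demo)
for name, d in top_features(d_demo, names_demo, k=10):
    print(f"{name}: d = {d:+.3f}")
```

In a real dataset the names list would hold the actual measurement names, so that the printout makes clear what each leading measurement represents.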
Question 3: Adapt your code from Question 1 to be applied to the dataset that you’ve
organized for yourself. Provide a printout of the error rates for the different classifiers
and your code. Answer the following question: is the best performing classifier from
Question 1 the same in Question 3? Elaborate on those similarities/differences – what
about your dataset may have contributed to the differences/similarities observed?
Deadline: October 24th, 2019.