Computer Science Homework Help
Objective: Compare and analyze several machine learning algorithms across
several data sets using the open-source Weka toolkit. Both supervised and
unsupervised approaches are covered in this assignment.
NOTE: Programming is not required for this work if you use WEKA, so put your
effort into the analysis and report. Be sure to include any references you use for the
report. Setting up WEKA should take less than an hour. Run times for these
algorithms should be very short: each takes seconds for a single run, except for the
neural network, which takes a few minutes. You need to use multiple runs for your
report. (Use the Experimenter feature, as described in the tutorial, to set up
multiple runs in WEKA.)
Part A: Supervised Learning
You are to compare three popular supervised learning algorithms across four
classification data sets.
Decision tree: Algorithm 1 will be a C4.5 decision tree. This algorithm can be
found in Weka under the classify tab using the label trees/J48.
Neural network: Algorithm 2 will be a standard neural network trained using
backpropagation. This algorithm can be found in Weka under the classify tab using the
label functions/MultilayerPerceptron.
K nearest neighbours: Algorithm 3 will be the K nearest neighbours
classification algorithm (it will be reviewed in the Tuesday tutorial; however, you c…).
This algorithm can be found in Weka under the classify tab using the label
lazy/IBk.
Data sets:
Iris classification data set
Contains features regarding iris plants, with the goal of determining which
class of iris the plant is.
There are 150 input vectors, each with 4 attributes, and 3 possible
classifications.
https://archive.ics.uci.edu/ml/datasets/Iris
Wisconsin breast cancer data set
Contains medical features of a tumour, with the goal of determining if the
tumour is malignant or benign.
There are 699 input vectors, each with 9 attributes, and 2 possible
classifications.
https://archive.ics.uci.edu/ml/datasets/Breast+Can…
inal%29
Car evaluation data set
Contains features regarding different vehicles, with the goal of determining
the overall acceptability of the car.
There are 1728 input vectors, each with 6 attributes, and 4 possible
classifications.
https://archive.ics.uci.edu/ml/datasets/Car+Evalua…
Diabetic retinopathy data set
Contains features of medical images, with the goal of determining whether
the image shows signs of diabetic retinopathy or not.
There are 1151 input vectors, each with 20 attributes, and 2 possible
classifications.
https://archive.ics.uci.edu/ml/datasets/Diabetic+R…
ata+Set
For Part A, analyze the performance of each required algorithm on each data set.
What observations can you make regarding the data sets used and the models
trained? Does one approach beat all others on every data set, or do different
approaches work better on different problems? Using your understanding of
the algorithms, try to explain the observations you make. Try modifying the
parameters of the different algorithms. Does changing the parameters from their
default values significantly impact the performance of the algorithm?
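Although no programming is required, a small sketch can make the effect of a parameter concrete. The Python below is a minimal k nearest neighbours classifier written for illustration only; it is not Weka's IBk implementation, and the function name knn_predict and the toy points are hypothetical, not taken from any of the assignment data sets.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort the training points by Euclidean distance to the query, keep the k closest.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two groups plus one noisy "b" point inside the "a" region.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.8, 1.1), "a"),
    ((3.0, 3.0), "b"), ((3.2, 2.9), "b"), ((2.9, 3.3), "b"), ((3.1, 3.1), "b"),
    ((1.1, 1.2), "b"),
]
query = (1.1, 1.15)

if __name__ == "__main__":
    for k in (1, 3, 8):
        print(k, knn_predict(train, query, k))
```

In this toy example a very small k follows the single noisy point, a moderate k recovers the surrounding class, and a k equal to the whole training set simply returns the majority class. Whether similar effects show up on the real data sets is something to check experimentally in Weka.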
Part B: Unsupervised Learning
Implement and analyze the performance of clustering on unsupervised data sets
using various clustering algorithms. For this part you will use the K-means
clustering algorithm. The data sets that you will use are available at the following
link: http://cs.joensuu.fi/sipu/datasets/
Note: Use the S1, S2, S3 and S4 data sets. Feel free to use any additional
data sets from the above link. The data sets will need to be converted into ARFF
files, as explained in the tutorial, for use with Weka.
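If you prefer to script the conversion rather than do it by hand, the following Python is one possible sketch. It assumes each input line holds whitespace-separated numeric values (two per line for the S sets); check this against the files you download. The function names, the attribute names x and y, and the file names in the usage note are illustrative choices, not something prescribed by the tutorial.

```python
from pathlib import Path


def rows_to_arff(lines, relation="s1", attributes=("x", "y")):
    """Convert whitespace-separated numeric rows into ARFF text."""
    out = [f"@relation {relation}", ""]
    out += [f"@attribute {name} numeric" for name in attributes]
    out += ["", "@data"]
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines
            continue
        if len(fields) != len(attributes):
            raise ValueError(f"expected {len(attributes)} values, got: {line!r}")
        out.append(",".join(fields))  # ARFF data rows are comma-separated
    return "\n".join(out) + "\n"


def convert_file(src, dst, relation):
    """Read a plain-text data file and write its ARFF version to `dst`."""
    text = Path(src).read_text()
    Path(dst).write_text(rows_to_arff(text.splitlines(), relation))
```

For example, convert_file("s1.txt", "s1.arff", "s1") would write an ARFF file that Weka can open, assuming s1.txt is the downloaded file in the current directory.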
You are to make observations regarding how k-means clustering works on the
different data sets. How does modifying the number of clusters impact the
within-cluster sum of squared errors? What happens if you use too many or too few
clusters? What sort of impact would you expect from modifying the way clusters
are initialized? What observations can you make comparing the clustering on an
easily separable data set (S1) to one where the optimal clusters are a lot less clear
(S4)?
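To make the quantity in these questions concrete: the within-cluster sum of squared errors is the sum, over all points, of the squared distance from the point to the centroid of its cluster. The Python below is a minimal sketch of standard k-means (Lloyd's algorithm) with random initialization from the data. It is not Weka's SimpleKMeans, so its numbers are not expected to match Weka's output, and the helper make_blobs generates synthetic points purely for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, within-cluster SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k random data points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, sse


def make_blobs(centres, n=30, spread=1.0, seed=1):
    """Synthetic, well-separated 2-D clusters (illustrative data only)."""
    rng = random.Random(seed)
    return [(cx + rng.gauss(0, spread), cy + rng.gauss(0, spread))
            for cx, cy in centres for _ in range(n)]


if __name__ == "__main__":
    pts = make_blobs([(0, 0), (20, 0), (0, 20)])
    for k in range(1, 7):
        print(k, round(kmeans(pts, k)[1], 1))
```

Changing the seed argument changes the initial centroids, which is one simple way to explore the initialization question on small examples before running the real experiments in Weka.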
For bonus marks, extend your analysis by including a self-organizing map
approach (self-organizing maps are not a topic typically covered in 4P76, but they are a
seminar topic that will be presented and are worth knowing). To install the
self-organizing map package, on the Weka home page select the tools tab and click on
package manager. In the package manager, select the SelfOrganizingMap package
and click install. It will now be available under the cluster tab. Compare the
self-organizing map clustering approach to the k-means clustering approach. How
does its clustering procedure differ from that of k-means? What sort of impact
does modifying the lattice width and height have on the algorithm? What
observations can you make when the values are the same, or when one value is
larger than the other?
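As background for these questions, here is a minimal sketch of a generic Kohonen-style self-organizing map on a rectangular lattice. It is written for illustration and is not the implementation in Weka's SelfOrganizingMap package; the linear decay schedules, the Gaussian neighbourhood, and the default parameter values are all assumptions made for this sketch.

```python
import math
import random


def train_som(points, width, height, epochs=20, lr=0.5, seed=0):
    """Train a tiny rectangular SOM; returns {(col, row): weight_vector}."""
    rng = random.Random(seed)
    dim = len(points[0])
    # One weight vector per lattice node, initialized from random data points.
    weights = {(c, r): list(rng.choice(points))
               for c in range(width) for r in range(height)}
    radius0 = max(width, height) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = lr * (1 - frac)                    # learning rate decays linearly
        radius = max(radius0 * (1 - frac), 0.5)   # neighbourhood shrinks over time
        for p in rng.sample(points, len(points)):  # one shuffled pass over the data
            # Best matching unit: the node whose weights are closest to p.
            bmu = min(weights, key=lambda n: math.dist(weights[n], p))
            for node, w in weights.items():
                # Neighbourhood distance is measured on the lattice, not in data space.
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * influence * (p[i] - w[i])
    return weights


def assign(weights, p):
    """Return the lattice node (cluster) that point p maps to."""
    return min(weights, key=lambda n: math.dist(weights[n], p))
```

In this sketch the lattice has width × height nodes, so those two values bound the number of clusters, and each update also pulls the lattice neighbours of the winning node towards the point. That neighbourhood update is the main procedural difference from the k-means update, where only the centroid of the assigned cluster is affected.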
Assignment Requirements and Grading: The results are to be handed in as a
technical paper written in the IEEE format shown to you in the tutorial. Your report
should contain the following headers and sections:
Abstract, Introduction & problem definition
… supervised and unsupervised learning, applicability of the two approaches.
Background
… learning and one on unsupervised learning.
… and equations.
… they are not the focus of this assignment you do not need as much detail.
Results and Discussion
… learning and one for unsupervised learning.
… experimental evidence to support your claims.
… provide further supporting evidence to your claims.
Conclusions
… in your results section.