CS 5590 Foundations of Machine Learning: Assignment 1


This homework is intended to cover programming exercises in the following topics:

  • k-NN, Decision Trees, Model Selection, Naive Bayes classifier

Instructions

  • Please upload your submission on Google Classroom by the deadline mentioned above. Your submission should comprise a single file (PDF/ZIP), named Assign1, with all your solutions.
  • For late submissions, 10% is deducted for each day (including weekends) an assignment is late after it is due. Note that each student begins the course with 7 grace days for late submission of assignments (of which at most 4 can be used for a given submission). Late submissions will automatically draw on your grace-day balance, if you have any left. You can see your balance on the FoML Marks and Grace Days document (soon to be shared).
  • Please use Python for the programming questions.
  • Please read the department plagiarism policy. Do not engage in any form of cheating – strict penalties will be imposed on both givers and takers. Please talk to the instructor or a TA if you have concerns.

Questions

1. k-NN: (8 marks) In k-nearest neighbors (k-NN), classification is achieved by a majority vote among the nearest training points. Given n points, imagine two classes of data of n/2 points each, which overlap to some extent in a 2-dimensional space.

   (a) (1 mark) Describe what happens to the training error (using all available data) when the neighbor size k varies from n to 1.
   (b) (2 marks) Predict and explain with a sketch how the generalization error (e.g., holding out some data for testing) would change as k varies. Explain your reasoning.
   (c) (2 marks) Give two reasons (with sound justification) why k-NN may be undesirable when the input dimension is high.
   (d) (3 marks) Is it possible to build a univariate decision tree (with decisions at each node of the form "is x > a", "is x < b", "is y > c", or "is y < d" for any real constants a, b, c, d) which classifies exactly like a 1-NN using the Euclidean distance measure?
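For part (a), the effect of k on the training error is easy to probe with a small experiment. Below is a minimal pure-Python sketch; the function name and the toy points are illustrative, not part of the assignment:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Two overlapping 2-D classes: a tiny stand-in for the n-point setup above.
X = [(0.0, 0.0), (1.0, 0.2), (0.5, 0.9), (2.0, 2.0), (2.5, 1.8), (1.4, 1.1)]
y = [0, 0, 0, 1, 1, 1]

# With k = 1 the training error is always 0: each training point is its own
# nearest neighbour. With k = n, every point just gets the majority label.
err_k1 = sum(knn_predict(X, y, X[i], 1) != y[i] for i in range(len(X))) / len(X)
print(err_k1)  # 0.0
```

Varying k between 1 and len(X) in this loop shows the training-error trend the question asks you to describe.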
   If so, explain how. If not, explain why not.

2. Bayes Classifier: (6 marks)

   (a) (3 marks) A training set consists of one-dimensional examples from two classes. The training examples from class 1 are {0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25} and from class 2 are {0.9, 0.8, 0.75, 1.0}. Fit a (one-dimensional) Gaussian using Maximum Likelihood to each of these two classes. You can assume that the variance for class 1 is 0.0149, and the variance for class 2 is 0.0092. Also estimate the class probabilities p1 and p2 using Maximum Likelihood. What is the probability that the test point x = 0.6 belongs to class 1?

   (b) (3 marks) You are now going to make a text classifier. To begin with, you attempt to classify documents as either sport or politics. You decide to represent each document as a (row) vector of attributes describing the presence or absence of the following words:

       x = (goal, football, golf, defence, offence, wicket, office, strategy)

       Training data from politics documents and from sport documents is represented below as matrices in which each row holds the 8 attributes of one document:

       xpolitics =
       1 0 1 1 1 0 1 1
       0 0 0 1 0 0 1 1
       1 0 0 1 1 0 1 0
       0 1 0 0 1 1 0 1
       0 0 0 1 1 0 1 1
       0 0 0 1 1 0 0 1

       xsport =
       1 1 0 0 0 0 0 0
       0 0 1 0 0 0 0 0
       1 1 0 1 0 0 0 0
       1 1 0 1 0 0 0 1
       1 1 0 1 1 0 0 0
       0 0 0 1 0 1 0 0

       Using a maximum likelihood naive Bayes classifier, what is the probability that the document x = (1, 0, 0, 1, 1, 1, 1, 0) is about politics?

3. Decision Trees: (16 marks) In this question, you will use the Wine dataset (https://archive.ics.uci.edu/ml/datasets/Wine+Quality), a popular dataset to evaluate classification algorithms. The classification task is to determine, based on various parameters, whether a wine's quality score is at least 7. The dataset has already been preprocessed to convert this into a binary classification problem (scores less than 7 belong to the "zero" class, and scores greater than or equal to 7 belong to the "one" class).
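The quantities in part (a) can be checked numerically. A sketch (the function and variable names are my own) of the ML fit and the class-1 posterior at x = 0.6:

```python
import math

def gauss(x, mu, var):
    """Gaussian density N(x; mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

c1 = [0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25]
c2 = [0.9, 0.8, 0.75, 1.0]

mu1, mu2 = sum(c1) / len(c1), sum(c2) / len(c2)  # ML means: 0.26 and 0.8625
var1, var2 = 0.0149, 0.0092                      # variances given in the question
p1 = len(c1) / (len(c1) + len(c2))               # ML prior for class 1: 10/14
p2 = 1 - p1

x = 0.6
num1 = p1 * gauss(x, mu1, var1)
num2 = p2 * gauss(x, mu2, var2)
posterior1 = num1 / (num1 + num2)  # P(class 1 | x = 0.6), roughly 0.63
print(round(posterior1, 3))
```

This is Bayes' rule with class-conditional Gaussians; the denominator is just the two weighted densities, since there are only two classes.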
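Part (b) can be sanity-checked the same way. Below is a sketch of a maximum-likelihood (no smoothing) Bernoulli naive Bayes over the two matrices above; note what happens because one attribute ("office") never occurs in the sport documents:

```python
politics = [[1,0,1,1,1,0,1,1],
            [0,0,0,1,0,0,1,1],
            [1,0,0,1,1,0,1,0],
            [0,1,0,0,1,1,0,1],
            [0,0,0,1,1,0,1,1],
            [0,0,0,1,1,0,0,1]]
sport = [[1,1,0,0,0,0,0,0],
         [0,0,1,0,0,0,0,0],
         [1,1,0,1,0,0,0,0],
         [1,1,0,1,0,0,0,1],
         [1,1,0,1,1,0,0,0],
         [0,0,0,1,0,1,0,0]]

def likelihood(doc, data):
    """P(doc | class) under Bernoulli naive Bayes with ML estimates
    theta_j = (#docs containing word j) / (#docs), no smoothing."""
    n = len(data)
    p = 1.0
    for j, xj in enumerate(doc):
        theta = sum(row[j] for row in data) / n
        p *= theta if xj == 1 else 1 - theta
    return p

doc = (1, 0, 0, 1, 1, 1, 1, 0)
# Equal class priors (6 documents each), so they cancel in the posterior.
lp, ls = likelihood(doc, politics), likelihood(doc, sport)
post_politics = lp / (lp + ls)
print(post_politics)  # the zero estimate for 'office' in sport kills that term
```

The ML estimate assigns "office" probability 0 in the sport class, so the sport likelihood vanishes for any document containing it; this zero-count behaviour is exactly what Laplace smoothing is meant to fix.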
   Each line describes a wine, using 12 columns: the first 11 describe the wine's characteristics, and the last column is a ground-truth label for the quality of the wine (0/1). You must not use the last column as an input feature when you classify the data.

   (a) (5 marks) Decision Tree Implementation: Implement your own version of a decision tree using binary univariate splits, entropy, and information gain.

   (b) (5 marks) Cross-Validation: Evaluate your decision tree using 10-fold cross-validation. Please see the lecture slides for details. In a nutshell, you will first split the provided data into 10 parts. Then hold out 1 part as the test set and use the remaining 9 parts for training. Train your decision tree on the training set and use the trained tree to classify entries in the test set. Repeat this process for all 10 parts, so that each entry is used in the test set exactly once. To get the final accuracy value, take the average of the 10 folds' accuracies. With a correct implementation of both parts (decision tree and cross-validation), your classification accuracy should be around 0.78 or higher.

   (c) (6 marks) Improvement Strategies: Now, try to improve your decision tree algorithm. Some things you could do include (not exhaustive):

       • Use the Gini index instead of entropy
       • Use multi-way splits (instead of binary splits)
       • Use multivariate splits (instead of univariate splits)
       • Prune the tree after splitting for better generalization

       Report your performance as an outcome of ANY TWO improvement strategies.

Deliverables:

  • Code
  • Brief report (PDF) with: (i) accuracy of your initial implementation; (ii) accuracy of your improved implementation, along with your explanation of why the accuracy improved with this change.
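The core quantities for parts (a) and (b), plus the Gini alternative from part (c), can be sketched in a few lines. This is a study aid under my own naming, not a prescribed implementation; it covers entropy, information gain for a binary univariate split, and fold construction for 10-fold cross-validation:

```python
import math
import random

def entropy(labels):
    """H(S) = -sum_c p_c log2 p_c over the class proportions in S."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def gini(labels):
    """Gini impurity 1 - sum_c p_c^2, a drop-in alternative to entropy."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def info_gain(X, y, feat, thresh):
    """Information gain of the binary univariate split  x[feat] <= thresh."""
    left = [y[i] for i in range(len(X)) if X[i][feat] <= thresh]
    right = [y[i] for i in range(len(X)) if X[i][feat] > thresh]
    if not left or not right:
        return 0.0
    n = len(y)
    return (entropy(y)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def kfold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and cut them into k near-equal folds; each
    fold serves as the test set once, the rest as training data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# A perfectly separating split recovers the full parent entropy (1 bit here).
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
print(info_gain(X, y, 0, 0.5))  # 1.0
```

Growing the tree then amounts to searching features and thresholds for the highest-gain split at each node and recursing until a node is pure (or a stopping rule fires); swapping `entropy` for `gini` inside `info_gain` gives one of the improvement strategies listed above.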