Description

Problem 1: Version Spaces [30 points]

In this problem we will investigate the geometry of version spaces for different hypothesis spaces. For all parts of this question, training instances are vectors in Z^2, i.e. points on a 2D lattice, with labels c ∈ {−1, 1}.

(a) Rectangular hypothesis

Consider a hypothesis of the form (p_TL, p_BR), where p_i = {x_i, y_i | x_i, y_i are integers and 0 ≤ x_i, y_i ≤ 8} define the diagonally opposite corners of a rectangle (Top Left and Bottom Right, respectively). An instance (x, y) is classified positive if it falls inside or on the boundary of this rectangle, and negative otherwise. Note that the definition allows for degenerate cases, where two or four of the corners overlap. In addition, allow p_TL and p_BR to take on a special value ∅. If either p_TL = ∅ or p_BR = ∅, all instances are classified as negative. This hypothesis definition applies to problems 1a to 1d. What is the size of this hypothesis space?

(b) Rectangular hypothesis

A hypothesis h1 is considered more general than hypothesis h2 (and h2 more specific than h1) if h2 implies h1. The most general (most specific) hypothesis h* is a hypothesis such that no other hypothesis is more general (more specific) than h*. Draw the most general hypothesis that satisfies the training data D1 in Figure 1. Draw the most specific hypothesis.

(c) Rectangular hypothesis

What is the size of the version space for the training data D1?

(d) Rectangular hypothesis

In a form of Active Learning, a learner can query the teacher for more data. The goal of the learner is to pick query instances such that the size of the version space is reduced the most (which means that the greatest number of inconsistent hypotheses are pruned after observing the query instance). Consider the following 3 candidates for a query instance: (3,7), (4,3), (4,6).
Compute, for each, the expected size of the version space after observing the training set D1 and the label for that query instance (assuming the two possible labels are equally likely). Among the three query candidates, is there a best choice?

Figure 1: Training set D1 (axes: x, y)

(e) Decision Tree hypothesis

Now consider a hypothesis space formed by all 1-level decision trees. Since all of the attributes for our data are integer-valued, the splitting thresholds in our decision tree are also constrained to integers, and each node in the tree splits on a single attribute, x or y, with a splitting criterion of the form attribute ≥ threshold. Give the size of the version space for a 1-level decision tree and the training set D2. Note that in general multiple trees can represent the same function, and we are interested in counting functions.

(f) Decision Tree hypothesis

Now consider a hypothesis space formed by all decision trees with 3 leaf nodes, with the same splitting criteria as above. What is the size of the version space now? Draw all hypotheses that belong to the version space in the (x, y) space of Figure 2.

Figure 2: Training dataset D2 (axes: x, y)

Problem 2: Regression with kNN [30 points]

In this problem, we will investigate the application of kNN to regression in a setting of partial image completion. You have been hired by the NSA to help them reconstruct corrupted surveillance photos of possible criminals. Due to a glitch in their cameras, some images are partially blank. Your goal will be to provide the “best” completion of the image using kNN.

Included with the assignment is a subset of the Olivetti face dataset. The dataset is a collection of 64×64 grayscale images of people's faces. This dataset was processed into the SVMlight format for this assignment. This file format will be used for all programming assignments in this course.
In this file, each line corresponds to a training instance, providing the label and the feature-value pairs for that instance: <label> <index>:<value> <index>:<value> …. For this problem, each image has been vectorized into this format using a row-major encoding, i.e. features with indices 0–63 correspond to the grayscale intensities of the first row, features with indices 64–127 correspond to the grayscale intensities of the second row, etc. Grayscale intensities are in the range 0.0 to 1.0, where 0.0 is black and 1.0 is white.

You are provided with a set of training images in the file faces.train, and a set of test images in the file faces.test. We will treat the vectorized form of the top half of the image (32 × 64 pixels) as a feature vector x, and the vectorized form of the bottom half of the image, y (also 32 × 64 pixels), as a target (label). Note that, in contrast to the first problem, the target variable is both multidimensional and continuous. Our goal is: given an input x of the top part of someone's face, predict what the lower part y might look like.

Figure 3: Faces – test set

Your instructions for implementation: Complete all five faces in the faces.test file by performing regression with kNN. Use inverse Euclidean distance as a similarity metric (and take the similarity between identical instances to be 1). As an example: if for k = 2 your two nearest neighbors to a test instance x_u have labels y_v and y_w with corresponding distances to the test instance d_uv and d_uw, then the predicted output would be y_u = (1/d_uv) y_v + (1/d_uw) y_w.

Experiment with k ∈ {1, 5, 10, 50}. What do you observe about the sharpness of the completed face when k is increased, and how do you account for this observation? Present “completed” faces corresponding to k = 5 and k = 50 in your submission.

Problem 3: kNN Classification [40 points]

In this problem, we will be performing text categorization with the kNN algorithm.
We will be classifying books from amazon.com based on their previews available online (usually one or two chapters). The most common way to classify text is to first model a document as a bag-of-words. That means that the order of words in the document is ignored, allowing us to treat each unique English word as an independent feature. The number of times a particular word occurs in the document is then the value of the respective feature. Each English word is mapped to a unique id, allowing us to represent a single document in the SVMlight file as a line in this format: <label> <id>:count <id>:count ….

Supplied to you are 2 such files, books.train and books.test, corresponding to the training and test sets respectively. Additionally, you are provided with the id-word mappings in the books.vocab file, and the titles of books in the *.titles files, with a line-to-line correspondence with the .train and .test files. There is a total of (almost) 10,000 books, divided equally between the training and test sets. Each book belongs to exactly one of the 5 categories (genres): 0: Action-and-adventure, 1: Horror, 2: Mystery-thrillers, 3: Romance, 4: Science-fiction, where the indices are the class labels corresponding to the genres and are given as document labels in the SVMlight files. Our goal in this problem will be to classify books according to their genre.

(a) Content-based book recommendation

As a warm-up exercise, consider the problem of recommending books to read based on the books you already like. For each of the 2 books listed below, provide the top 10 recommendations by finding its 10 closest neighbors. Use cosine similarity as a measure of “closeness” between instances. Cosine similarity is given by:

sim(x_u, x_v) = (x_u · x_v) / (||x_u|| ||x_v||)

where x_u, x_v are feature vectors corresponding to two document instances, and || · || indicates the Euclidean length (L2 norm).
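The cosine similarity formula above can be illustrated with a minimal Python sketch over sparse bag-of-words vectors. The function names `parse_svmlight_line` and `cosine_similarity` are hypothetical and not part of the assignment; the parser assumes each line holds only a label followed by id:value pairs (no trailing comments), and returning 0 for an empty vector is an assumption made here, not something the assignment specifies.

```python
import math


def parse_svmlight_line(line):
    """Parse '<label> <id>:<value> ...' into (label, {id: value})."""
    tokens = line.split()
    label = int(tokens[0])
    features = {}
    for token in tokens[1:]:
        idx, value = token.split(":")
        features[int(idx)] = float(value)
    return label, features


def cosine_similarity(xu, xv):
    """Cosine similarity between two sparse vectors stored as dicts."""
    # Iterate over the smaller vector; ids missing from the other one add zero.
    if len(xu) > len(xv):
        xu, xv = xv, xu
    dot = sum(value * xv.get(idx, 0.0) for idx, value in xu.items())
    norm_u = math.sqrt(sum(v * v for v in xu.values()))
    norm_v = math.sqrt(sum(v * v for v in xv.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (norm_u * norm_v)
```

Storing each document as a dict keeps the dot product proportional to the number of distinct words in the shorter document rather than to the vocabulary size, which matters when most counts are zero.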
Suppose your two favorite books are:

Fifty Shades of Grey: Book One of the Fifty Shades Trilogy
Brains: A Zombie Memoir

Report the top 10 recommendations corresponding to each book, using the instances in the books.train file only. Qualitatively, do these recommendations make sense? Note that some books have an amazon product ID number instead of the title. You can look up the book corresponding to the number by searching for the number on amazon.com.

(b) Baseline

Before trying kNN, let's consider a simple baseline for the task of genre classification. For each class (genre; the two terms are interchangeable for this problem), compute a “centroid” feature vector by first normalizing the feature vector for each instance (using the L2 norm), and then summing the normalized feature vectors across all training instances corresponding to that class. To classify a test instance, compute the cosine similarity between the test instance and the 5 “centroids”. Label the test instance with the class of the nearest centroid. Run this baseline on the entire test set books.test. Report the following metrics for this baseline: Accuracy (1 value), Precision for each class (5 values), Recall for each class (5 values). These metrics are defined as follows:

Accuracy = n_{y_predict = y_true} / n
Precision(class c) = n_{y_predict = y_true = c} / n_{y_predict = c}
Recall(class c) = n_{y_predict = y_true = c} / n_{y_true = c}

where n is the size of the test set (number of instances) and the expression n_{y_predict = y_true = c} reads: the number of test instances for which our algorithm correctly predicts class c. Note that when the denominator in the Precision or Recall expression is zero, you should report zero for the corresponding metric. Precision and Recall are valuable for investigating the performance of our classifier on individual classes, whereas Accuracy provides only an aggregate metric.
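As an illustration of the metric definitions above, here is a minimal Python sketch that computes Accuracy and per-class Precision and Recall from two label lists. The function name `evaluate` and its arguments are hypothetical; the sketch assumes a non-empty test set and applies the zero-denominator convention stated above.

```python
def evaluate(y_true, y_pred, classes):
    """Return (accuracy, precision_by_class, recall_by_class)."""
    n = len(y_true)  # assumption: the test set is non-empty
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / n

    precision, recall = {}, {}
    for c in classes:
        # Test instances for which class c is predicted correctly.
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == p == c)
        predicted_c = sum(1 for p in y_pred if p == c)
        true_c = sum(1 for t in y_true if t == c)
        # Report zero when the denominator is zero.
        precision[c] = hits / predicted_c if predicted_c else 0.0
        recall[c] = hits / true_c if true_c else 0.0
    return accuracy, precision, recall
```

For example, `evaluate([0, 0, 1, 1, 2], [0, 1, 1, 1, 0], range(3))` yields an accuracy of 0.6; class 2 is never predicted, so its Precision falls back to zero under the stated convention.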
(c) kNN Implementation

Implement the unweighted kNN algorithm that takes k and the training and testing sets as inputs, and outputs a class label c for every test instance. Use cosine similarity, and run the algorithm for values of k in the set {1, 2, 5, 10, 100, 200, 300, 500, 1000, 2000, 3000, 4000, 5000}. Plot Accuracy as a function of log(k). In your kNN implementation, break ties by choosing the class with the smallest class index.

(d) What value of k yields the highest accuracy on the test set? Report Precision and Recall for that value of k for every class (10 values in total).

(e) What do you observe with respect to Precision and Recall when k = 5000? Does this match your expectations?

(f) Consider a hypothetical instance of an unweighted kNN algorithm with k = n_train. Consider an unbalanced distribution of class labels in the training set: {n_0 = 1010, n_1 = 999, n_2 = 998, n_3 = 997, n_4 = 996}, with n_0 + n_1 + n_2 + n_3 + n_4 = n_train. What Precision and Recall do we expect on a test set of size n_test = 5000 where each of the five classes has an equal number of instances?

(g) Compare the performance of kNN to the baseline in part (b). It turns out that this baseline is quite strong. What does this say about the geometry of our instance space (book data)? Show a qualitative example of an instance space in 2D where we would expect the baseline to perform significantly worse than kNN.
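The voting and tie-breaking rule from part (c) can be sketched in Python as follows. This is only one possible reading, not a reference solution: the function name `knn_predict` is hypothetical, it assumes the similarities between one test instance and every training instance have already been computed, and it resolves equal similarities at the k-th position by training-set order, which the assignment does not specify.

```python
from collections import Counter


def knn_predict(similarities, train_labels, k):
    """Unweighted kNN vote for a single test instance.

    similarities[i] is the similarity between the test instance and
    training instance i; larger means closer.
    """
    # Indices of the k most similar training instances. sorted() is stable,
    # so equal similarities keep their training-set order (an assumption).
    ranked = sorted(
        range(len(train_labels)),
        key=lambda i: similarities[i],
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    best = max(votes.values())
    # Break ties in the vote by choosing the smallest class index.
    return min(label for label, count in votes.items() if count == best)
```

With similarities `[0.9, 0.8, 0.7, 0.1]` and labels `[3, 1, 1, 3]`, k = 1 returns class 3, while k = 2 produces a one-to-one tie that the rule resolves in favor of class 1.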

CS 4780/5780 Machine Learning Assignment 1: Concept Learning and kNN

