
1 Boolean Functions

In this problem, you will be asked to write Boolean functions and linear threshold functions based on given labeled data.

1. [3 points] Table 1 shows several data points (the x’s) along with corresponding labels (y). (That is, each row is an example with a label.) Write down three different Boolean functions, all of which can produce the label y when given the inputs x.

y   x1  x2  x3  x4
0   1   0   0   0
0   1   1   0   0
1   1   0   1   1

Table 1: Original Table

2. [5 points] Next, we expand Table 1 to Table 2 by adding more data points. How many errors will each of your functions from the previous question make on the full data set?

3. [7 points] Write down the linear threshold function for the data in Table 2.

y   x1  x2  x3  x4
0   1   0   0   0
0   1   1   0   0
1   1   0   1   1
0   0   1   0   0
0   0   1   1   0
0   1   1   1   0
1   0   1   1   1

Table 2: Expanded Table

2 Mistake Bound Model of Learning

Consider an instance space consisting of integer points on the two-dimensional plane $(x_1, x_2)$ with $-128 \le x_1, x_2 \le 128$. Let C be a concept class defined on this instance space. Each function $f_r$ in C is defined by an integer radius r (with $1 \le r \le 128$) as follows:

$$f_r(x_1, x_2) = \begin{cases} +1 & \text{if } x_1^2 + x_2^2 \le r^2 \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$

Our goal is to come up with an error-driven algorithm that learns the function f ∈ C that correctly classifies a dataset.

Side notes

1. Recall that a concept class is the set of functions from which the true target function is drawn, and the hypothesis space is the set of functions that the learning algorithm searches over. In this question, both are the same set.

2. Assume that there is no noise. That is, assume that the data is separable using the hypothesis class.

Questions

1. [5 points] Determine |C|, the size of the concept class.

2. [5 points] To design an error-driven learning algorithm, we should first be able to write down what it means to make a mistake. Suppose our current guess for the function is $f_r$, defined as in Equation 1 above.
Say we get an input point $(x_1^t, x_2^t)$ along with its label $y^t$. Write down an expression (an equality or an inequality) in terms of $x_1^t$, $x_2^t$, $y^t$ and r that checks whether the current hypothesis $f_r$ has made a mistake.

3. [10 points] Next, we need to specify how we will update a hypothesis if there is an error. Since $f_r$ is completely defined in terms of r, we only need to update r. How will you update r if there is an error? Consider errors for both positive and negative examples.

4. [20 points] Use the answers from the previous two steps to write a mistake-driven learning algorithm to learn the function. Please write the algorithm concisely in the form of pseudocode. What is the maximum number of mistakes that this algorithm can make on any dataset?

5. (For 6350 students) [15 points total] We have seen the Halving algorithm in class. The Halving algorithm maintains a set of hypotheses consistent with all the examples seen so far and predicts using the most frequent label among this set. Upon making a mistake, the algorithm prunes at least half of this set. In this question, you will design and analyze a Halving algorithm for this particular concept space.

a. [5 points] The set of hypotheses consistent with all examples seen so far can be defined by storing only two integers. How would you do this?

b. [5 points] How would you check if there is an error for an example $(x_1^t, x_2^t)$ that has the label $y^t$?

c. [5 points] Write the full Halving algorithm for this specific concept space. (Do not write the same Halving algorithm we saw in class. You need to tailor it to this problem.) What is its mistake bound?

3 The Perceptron Algorithm and Its Variants

3.1 The Task and Data

Imagine you have access to information about people such as age, gender and level of education. Now, you want to predict whether a person makes over $50K a year or not using these features. We will use the Adult data set from the UCI Machine Learning repository [1].
The original Adult data set has 14 features, among which 6 are continuous and 8 are categorical. To make it easier to use, we will use a pre-processed version (and subset) of the original Adult data set, created by the makers of the popular LIBSVM tool. From the LIBSVM website: “In this data set, the continuous features are discretized into quantiles, and each quantile is represented by a binary feature. Also, a categorical feature with m categories is converted to m binary features.”

Use the training/test files called ‘a1a.train’ and ‘a1a.test’, available on the assignments page of the class website [2]. This data is in the LIBSVM format, where each row is a single training example. The format of each row in the data is:

    <label> <index1>:<value1> <index2>:<value2> …

Here <label> denotes the label for that example. The rest of the elements of the row form a sparse vector denoting the feature vector. For example, if the original feature vector is [0, 0, 1, 2, 0, 3], this would be represented as 3:1 4:2 6:3. That is, only the non-zero entries of the feature vector are stored.

[1] Look for information about the Adult data set at https://archive.ics.uci.edu/ml/datasets/Adult
[2] These are the same as a1a and a1a.t available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

3.2 Algorithms

You will implement two variants of the Perceptron algorithm. Note that each variant has different hyper-parameters, as described below.

• Perceptron: This is the simple version of Perceptron as described in class. An update will be performed on an example (x, y) if $y(w^T x + b) \le 0$.
Hyper-parameters: The learning rate r

Two things bear additional explanation. First, note that in the formulation above, the bias term b is explicitly mentioned. This is because the features in the Adult data do not include a bias feature. Of course, you could choose to add an additional constant feature to each example and not have the explicit extra b during learning. (See the class lectures for more information.)
However, here, we will see the version of Perceptron that explicitly has the bias term.

Second, if w and b are initialized with zero, then the learning rate will have no effect. To see this, recall the Perceptron update:

$$w_{new} \leftarrow w_{old} + r\,y\,x, \qquad b_{new} \leftarrow b_{old} + r\,y.$$

Now, if w and b are initialized with zeroes and a learning rate r is used, then we can show that the final parameters will be equivalent to having a learning rate of 1. The final weight vector and the bias term will simply be scaled by r compared to the unit learning rate case, which does not change any prediction. For this assignment, you should initialize the weight vector and the bias randomly and tune the learning rate parameter. We recommend trying values no larger than one (e.g., 1, 0.1, 0.01, etc.).

• Margin Perceptron: This variant of Perceptron will perform an update on an example (x, y) if $y(w^T x + b) \le \mu$, where µ is an additional positive hyper-parameter, specified by the user. Note that because µ is positive, this algorithm can update the weight vector even when the current weight vector does not make a mistake on the current example.
Hyper-parameters: Learning rate r and the margin µ. We recommend setting the value of µ between 0 and 5.0.

As mentioned in the previous homework, you may use any programming language for your implementation. However, the graders should be able to execute your code on the CADE machines.

3.3 Experiments

1. [Sanity check, 10 points] Run the simple Perceptron algorithm on the data in Table 2 (one pass only) and report the weight vector that the algorithm returns. How many mistakes does it make? You may choose whatever learning rate you like, but we suggest that you informally experiment with a few values before submitting the results.

2. [Online setting, 15 points] Run both the Perceptron algorithm and the margin Perceptron on the Adult data for one pass. Report the number of updates (or equivalently mistakes) made by each algorithm and the accuracy of the final weight vector on both the training and the test set.
Once again, this will require some experimentation with the algorithm hyper-parameters. You will see that the hyper-parameters make a difference, so try out different values. You may even write some code to run the algorithms with different sets of hyper-parameters.

3. [Using online algorithms in the batch setting, 20 points] The third experiment is to evaluate the algorithms in a more realistic setting, where the algorithms perform multiple passes over the training data. This means that there is an additional hyper-parameter: the number of epochs. Run the algorithms for three and five epochs and report the number of updates made, and the accuracies of the final weight vectors on the training and test data. It may be important to shuffle the training data before starting each epoch. Report the results of the above experiments when you do so. Briefly explain your results.

4. (For 6350 Students) [Aggressive Perceptron with Margin, 10 points] This is an extension of the margin Perceptron that performs an aggressive update as follows:

If $y(w_{old}^T x + b_{old}) \le \mu$, then update
(a) $w_{new} \leftarrow w_{old} + \eta\,y\,x$
(b) $b_{new} \leftarrow b_{old} + \eta\,y$

Unlike the standard Perceptron algorithm, here the learning rate η is given by

$$\eta = \frac{\mu - y(w_{old}^T x + b_{old})}{x^T x + 1}$$

As with the margin Perceptron, there is an additional positive parameter µ.

We call this the aggressive update because the update can be derived from the following optimization problem. When we see that $y(w_{old}^T x + b_{old}) \le \mu$, we try to find new values of w and b such that $y(w_{new}^T x + b_{new}) = \mu$ using

$$\min_{w_{new},\, b_{new}} \; \frac{1}{2}\|w_{new} - w_{old}\|^2 + \frac{1}{2}(b_{new} - b_{old})^2 \quad \text{s.t.} \quad y(w_{new}^T x + b_{new}) = \mu.$$

That is, the goal is to find the smallest change in the weights so that the current example is on the right side of the weight vector. By substituting (a) and (b) from above into this optimization problem, we will get a single-variable optimization problem whose solution gives us the η defined above.
You can think of this algorithm as trying to tune the weight vector so that the current example is correctly classified right after the update. Repeat the batch experiments with the aggressive update. You should report two sets of results (one with shuffling and one without).

What To Submit

1. The report should detail your experiments. For each step, explain in no more than a paragraph or so how your implementation works. You may provide the results for the final step as a table or a graph. Describe what you did. Comment on the design choices in your implementation. For your experiments, what algorithm parameters did you use? Try to analyze this and give your observations.

2. Your report should be in the form of a PDF file; LaTeX is recommended.

3. Your code should run on the CADE machines. You should include a shell script, run.sh, that will execute your code in the CADE environment. Your code should produce output similar to what you include in your report. You are responsible for ensuring that the grader can execute the code using only the included script. If you are using an esoteric programming language, you should make sure that its runtime is available on CADE.

4. Put your project code in a single directory; it is best to create a compressed tar/zip file of the code and the script used to run it. Please do not hand in binary files.

5. Please look up the late policy on the course website.
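As an illustration of the LIBSVM row format from Section 3.1 and the update rules from Sections 3.2 and 3.3, the following is a minimal Python sketch, not a reference solution. All function names (parse_libsvm_row, train_epoch, train, accuracy) are hypothetical. It assumes 1-based feature indices, labels already given as +1/-1, a feature dimension supplied by the caller, and a small uniform random initialization whose range is an arbitrary choice.

```python
import random


def parse_libsvm_row(line, dim):
    """Parse '<label> <index>:<value> ...' into (dense feature list, label).

    Assumes 1-based feature indices and a caller-supplied dimension `dim`.
    """
    parts = line.split()
    y = int(float(parts[0]))
    x = [0.0] * dim
    for token in parts[1:]:
        index, value = token.split(":")
        x[int(index) - 1] = float(value)
    return x, y


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def train_epoch(data, w, b, rate=1.0, mu=0.0, aggressive=False):
    """One pass over `data`; returns (w, b, number of updates).

    mu = 0 gives the simple Perceptron, mu > 0 the margin variant.
    With aggressive=True, `rate` is ignored and eta is computed per example
    as (mu - y(w.x + b)) / (x.x + 1).
    """
    updates = 0
    for x, y in data:
        score = y * (dot(w, x) + b)
        if score <= mu:
            eta = (mu - score) / (dot(x, x) + 1.0) if aggressive else rate
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            b = b + eta * y
            updates += 1
    return w, b, updates


def train(data, dim, epochs=1, shuffle=False, seed=0, **kwargs):
    """Run several epochs from a small random start (range is an assumption)."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
    b = rng.uniform(-0.01, 0.01)
    data = list(data)  # copy so shuffling does not reorder the caller's list
    total = 0
    for _ in range(epochs):
        if shuffle:
            rng.shuffle(data)
        w, b, n = train_epoch(data, w, b, **kwargs)
        total += n
    return w, b, total


def accuracy(data, w, b):
    """Fraction of examples with y(w.x + b) > 0."""
    correct = sum(1 for x, y in data if y * (dot(w, x) + b) > 0)
    return correct / len(data)
```

A call such as train(examples, dim, epochs=3, shuffle=True, rate=0.1, mu=1.0) would run the margin variant for three shuffled epochs, where examples is a list of (x, y) pairs produced by parse_libsvm_row. Keeping the update rule in one function with mu and aggressive as switches is only one possible design; separate functions per variant would work equally well.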

CS 5350/6350: Machine Learning Homework 2
