CSE 512 – Homework 1


This homework contains 4 questions. The last question requires programming. The maximum number of points is 100, plus 10 bonus points. Bonus points will not be used to fit the grade curve.

For Questions 3 and 4 (and it is a good idea in general), we will use the following notation. Bold uppercase letters denote matrices (e.g., D) and bold lowercase letters denote column vectors (e.g., d). d_i denotes the i-th column of the matrix D. d_ij denotes the scalar in row j and column i of D; it is also the j-th entry of the column vector d_i. Non-bold letters denote scalar variables. 1_k ∈ R^{k×1} is a column vector of ones, 0_k ∈ R^{k×1} is a column vector of zeros, and I_k ∈ R^{k×k} is the identity matrix. We use a comma to denote horizontal concatenation of two vectors or matrices (e.g., [a, b]), and a semicolon to denote vertical concatenation of two vectors (e.g., [a; b]).

1 Question 1 – Probability (15 points)

Let X1 and X2 be independent, continuous random variables uniformly distributed on [0, 1]. Let X = max(X1, 2X2). Compute:

1. (5 points) E(X).
2. (5 points) Var(X).
3. (5 points) Cov(X, X1).

2 Question 2 – Parameter estimation (30 points)

This question uses a discrete probability distribution known as the Poisson distribution. A discrete random variable X follows a Poisson distribution with parameter λ if

p(X = k | λ) = (λ^k / k!) e^{−λ},   k ∈ {0, 1, 2, …}.

The Poisson distribution is a useful discrete distribution that can be used to model the number of occurrences of something per unit of time. For example, it can be used to model the number of customers arriving at a bank per hour, or the number of calls handled by a telephone operator per minute.

2.1 Question 2.1 – MLE (10 points)

Consider catching the LIRR train from Stony Brook to Penn Station. Of course, the train is always late; assume the amount of delay in minutes is Poisson distributed (i.i.d.) with parameter λ.
Since you like Times Square so much, you have visited NYC several times and recorded the amount of train delay on each trip:

Trip              1  2  3  4  5  6  7
Delay in minutes  4  5  3  5  6  9  3

Let X = (X1, …, Xn) be a random vector where Xi is the number of delay minutes for your i-th trip.

1. (4 points) Give the log-likelihood function of X given λ.
2. (4 points) Compute the MLE for λ in the general case.
3. (2 points) Compute the MLE for λ using the observed X.

2.2 Question 2.2 – MAP (10 points)

Now let's be Bayesian and put a prior over the parameter λ. You talk to your party-goer friends, who tell you about the expected delays that they experience. You plot the distribution of these expected delays, and your extensive experience in statistics tells you that the plot resembles a Gamma distribution pdf. So you believe a good prior distribution for λ may be a Gamma distribution. The Gamma distribution has pdf:

p(λ | α, β) = (β^α / Γ(α)) λ^{α−1} e^{−βλ},   λ > 0.   (1)

Γ(α) is the Gamma function, the normalization factor introduced to obtain a proper probability density function (i.e., one that integrates to 1). Do not worry if you don't know the explicit form of Γ(α); it is not important for this question. Also, if λ ∼ Gamma(α, β), then it has mean α/β, and its mode is (α − 1)/β for α > 1. Assume the prior distribution of λ is Gamma(λ | α, β).

1. (5 points) Compute the posterior distribution over λ. Hint: λ^{Σ_{i=1}^n X_i + α − 1} e^{−λ(n+β)} looks like a Gamma distribution! Is the rest of the expression constant with respect to λ? Working out a messy integral can lead to the answer, but it is unnecessary.
2. (5 points) Derive an analytic expression for the maximum a posteriori (MAP) estimate of λ.
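For Question 2.1, a quick numerical check can complement the closed-form derivation: evaluate the Poisson log-likelihood of the observed delays on a grid of λ values and confirm that the grid maximizer agrees with your MLE formula. A minimal Python sketch (not part of the required submission; the grid and function names are just for illustration):

```python
import math

delays = [4, 5, 3, 5, 6, 9, 3]  # observed delays from the table above

def log_likelihood(lam, xs):
    """Poisson log-likelihood: sum_i [x_i * log(lam) - lam - log(x_i!)]."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

# Evaluate on a grid of lambda values; the grid maximizer should match
# the closed-form MLE you derive.
grid = [l / 10 for l in range(10, 101)]
lam_hat = max(grid, key=lambda l: log_likelihood(l, delays))
print(lam_hat)  # → 5.0
```

Because the log-likelihood is concave in λ, a single grid maximum is enough to sanity-check the closed-form answer.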
2.3 Question 2.3 – Estimator Bias (10 points)

In class, we saw that the maximum likelihood estimator for the variance of a Gaussian distribution,

σ̂²_MLE = (1/N) Σ_{i=1}^N (x_i − μ̂)²,

is biased, and that an unbiased estimator of the variance is:

σ̂²_unbiased = (1/(N−1)) Σ_{i=1}^N (x_i − μ̂)².

For the Gaussian distribution, these estimators give similar results for large enough N, and it is unclear whether one estimator is preferable to the other. In this problem, we will explore an example in which the maximum likelihood estimate is dramatically superior to any unbiased estimator.

Suppose we are not quite interested in estimating λ. We are more interested in estimating a quantity that is a nonlinear function of λ, namely η = e^{−2λ}. Suppose we want to estimate η from a single observation X ∼ Poisson(λ).

1. [4pts] Let η̂ = e^{−2X}. Show that η̂ is the maximum likelihood estimate of η.
2. [3pts] Show that the bias of η̂ is e^{−2λ} − e^{λ(1/e² − 1)}. The following Taylor expansion may be useful: e^x = Σ_{k=0}^∞ x^k / k!.
3. [3pts] It turns out that (−1)^X is the only unbiased estimate of η. Prove that it is indeed unbiased, and briefly explain why this is a bad estimator to use.

3 Question 3 – Regression and MLE [15 points]

Suppose the data {(x_i, y_i) | x_i ∈ …

… λ
end for
end while

Algorithm 2 Efficient Coordinate Descent Algorithm
a_k ← 2 Σ_{i=1}^n x_ik²   % You can compute a without a for loop
while not converged do
    1. Compute the residual vector r (Q1 of 4.1; recalculate r each iteration to avoid numerical drift)
    2. Update b (Q2 of 4.1)
    3. Update r (Q3 of 4.1)
    for k ∈ {1, 2, …, d} do
        4. Calculate c_k (Q4 of 4.1)
        5. Update w_k (see Algorithm 1)
        6. Update the residual r (Q5 of 4.1)
    end for
end while

3. (2 points) Once b is updated, find the update rule for the residual vector r. Again, use no for loop. What is the time complexity of the update?

4. (2 points) Simplify the computation of c_k in Algorithm 1 using the residual vector r. Hint: there should no longer be a sum over j.
Suppose z_k is the number of non-zeros in the k-th row of X. What is the time complexity to compute c_k when r is already computed?

5. (2 points) Suppose we update w_k from w_k^old to w_k^new. Derive an update rule for r. Express the time complexity in terms of z_k.

6. (2 points) Putting this all together, you get the efficient coordinate descent algorithm in Algorithm 2. What is the per-iteration complexity of this algorithm (the cost of each round of the while loop)?

4.2 Implement coordinate descent to solve the Lasso

Now we are ready to implement the coordinate descent algorithm in Algorithm 2. Your code must:

• Include an offset term b that is not regularized.
• Take optional initial conditions for w.
• Be able to handle sparse X. It is OK for your code not to handle dense X. In Python, this means you can always assume the input is a sparse matrix.
• Avoid unnecessary computation (i.e., take advantage of what you learned in Problem 4.1).

You may use any language for your implementation, but we recommend Matlab. Matlab is a very useful language, and you should find that it achieves reasonable performance for this problem. Our implementation takes < 1s for an input matrix X of size 2500 × 10000. Matlab has many toolboxes, but you cannot use any existing Lasso solver.

Before you get started, here are some hints that you may find helpful:

• The only matrix operations required are: multiplying a matrix by a vector, adding vectors, computing the dot product of two vectors, point-wise matrix operations (e.g., raising entries to the power of 2), and computing the sum of the rows or columns of a matrix. Try to use as much vector/matrix computation as possible.
• The most important check is to ensure the objective value is non-increasing with each step.
• To verify that a solution (ŵ, b̂) is correct, you can also compute the value 2X(X^T ŵ + b̂ − y). This is a d-dimensional vector that should take the value −λ sign(w_j) at index j for each w_j that is nonzero.
For the zero indices of ŵ, this vector should take values smaller in magnitude than λ. (This is similar to setting the gradient to zero, though more complicated because the objective function is not differentiable.)

• It is up to you to decide on a suitable stopping condition. A common criterion is to stop when the objective value decreases by no more than some small δ during an iteration. If you need your algorithm to run faster, an easy place to start is to loosen this condition.
• For several problems, you will need to solve the Lasso on the same dataset for many values of λ. This is called a regularization path. One way to do this efficiently is to start at a large λ and then, for each consecutive solution, initialize the algorithm with the previous solution, decreasing λ by a constant ratio until finished.
• The smallest value of λ for which the solution ŵ is entirely zero is given by

λ_max = 2 ‖ X ( y − ȳ 1_n ) ‖_∞,   where ȳ = (1/n) Σ_i y_i.

This is helpful for choosing the first λ in a regularization path.

Finally, here are some pointers toward useful parts of Matlab:

• [I,J,S] = find(X): finds the row indices, column indices, and non-zero values of a sparse matrix X.
• Use sum(A, dim) to compute the sum of the rows or columns of a matrix A.

4.3 Try out your work on synthetic data [12 points]

We will now try out your solver on some synthetic data. A benefit of the Lasso is that if we believe many features are irrelevant for predicting y, the Lasso can be used to enforce a sparse solution, effectively differentiating between the relevant and irrelevant features. Let's see if it actually works.

Suppose that x ∈ R^d, y ∈ R, k < d, and pairs of data (x_i, y_i) are generated independently according to the model

y_i = (w*)^T x_i + ε_i,

where ε_i ∼ N(0, σ²) is Gaussian noise. Note that since k < d, the features k + 1 through d are unnecessary (and potentially even harmful) for predicting y. With this model in mind, you are now tasked with the following:
1. (8 points) Let N = 250, d = 80, k = 10, and σ = 1. Generate some data by drawing each element of X ∈ R^{d×N} from a standard normal distribution N(0, 1). Let b* = 0, and create w* by setting its first k elements to ±10 (choose any sign pattern) and the remaining elements to 0. Finally, generate a Gaussian noise vector ε with variance σ² and form y = X^T w* + b* + ε.

With your synthetic data, solve multiple Lasso problems on a regularization path, starting at λ_max and decreasing λ by a constant ratio of 2 for 10 steps. Compare the sparsity pattern of your Lasso solution ŵ to that of the true model parameters w*. Record values for precision (number of correct nonzeros in ŵ / total number of nonzeros in ŵ) and recall (number of correct nonzeros in ŵ / k). How well are you able to discover the true nonzeros? Comment on how λ affects these results, and include plots of precision and recall versus λ.

2. (4 points) Change σ to 10, regenerate the data, and solve the Lasso problem using a value of λ that worked well when σ = 1. How are precision and recall affected? How might you change λ in order to achieve better precision or recall?

4.4 Predicting goodness points of a wine given its review [16 points + 10 bonus]

We'll now put the Lasso to work on some real data. For this homework, the task is to predict the points a wine should get based on its review text alone. For many Kaggle competitions (and machine learning methods in general), one of the most important requirements for doing well is the ability to discover great features. We can use our Lasso solver for this as follows. First, generate a large number of features from the data, even if many of them are likely unnecessary. Afterward, use the Lasso to reduce the number of features to a more reasonable amount.

The data was scraped from WineEnthusiast during the week of June 15th, 2017, and can be found on the analytics website Kaggle (https://www.kaggle.com/zynicide/wine-reviews).
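When debugging the experiments above, it can help to keep a slow, dense reference implementation of coordinate descent around and compare it with your efficient sparse solver on tiny inputs. Below is a minimal Python sketch, assuming the standard soft-threshold update for w_k (the function name `lasso_cd` and the toy data are just for illustration); it deliberately ignores the sparsity and efficiency requirements of Problems 4.1 and 4.2:

```python
# Dense, unoptimized coordinate descent for the Lasso objective
#   sum_i (w^T x_i + b - y_i)^2 + lam * ||w||_1
# assuming the standard soft-threshold update for w_k.
# For tiny sanity checks only.

def lasso_cd(X, y, lam, sweeps=500):
    """X: list of n sample rows (each of length d); y: list of n targets."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    # a_k = 2 * sum_i x_ik^2 (unchanged between sweeps)
    a = [2.0 * sum(row[k] ** 2 for row in X) for k in range(d)]
    for _ in range(sweeps):
        # unregularized offset: mean residual with b excluded
        b = sum(y[i] - sum(w[j] * X[i][j] for j in range(d)) for i in range(n)) / n
        for k in range(d):
            # c_k = 2 * sum_i x_ik * (y_i - b - sum_{j != k} w_j * x_ij)
            c = 2.0 * sum(
                X[i][k] * (y[i] - b - sum(w[j] * X[i][j] for j in range(d) if j != k))
                for i in range(n)
            )
            # soft-threshold update for w_k
            if c < -lam:
                w[k] = (c + lam) / a[k]
            elif c > lam:
                w[k] = (c - lam) / a[k]
            else:
                w[k] = 0.0
    return w, b

# Tiny exact-fit example: y = 3*x1 + 1, with an irrelevant second feature.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [4.0, 1.0, 4.0, 7.0]
w, b = lasso_cd(X, y, lam=0.1)
```

With a small λ, the solver should recover approximately w ≈ (3, 0) and b ≈ 1, with the irrelevant weight driven exactly to zero; your efficient solver should agree on the same tiny inputs.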
(As a side note, browsing other competitions on the site may also give you some ideas for class projects.)

This dataset provides a variety of features: the points, description (review), variety, country, province, etc. For this homework we will use the points and description features. Points are ratings in the range 1–100; however, the dataset only contains descriptions for wines rated in the range 80–100. We already provide the feature vectors for each review. You can download the data for this homework from https://www.kaggle.com/c/hw1-wine-goodness-cse512-spr18/data.

trainData.txt     Training data matrix for predicting the number of points
trainLabels.txt   Training labels: the number of points for each review
valData.txt       Validation data
valLabels.txt     Validation labels
testData.txt      Test data
featureTypes.txt  List of features used

The data matrices are stored in Matrix Market format, a format for sparse matrices. Each line is a tuple of three elements: instance ID, feature ID, and feature value. Meta-information for each feature is provided in featureTypes.txt. The features are strings of one, two, or three words (n-grams). The feature values correspond roughly to how often each word appears in the review. All rows have also been normalized.

In Matlab, you can use the following code to load the training data and labels:

D = load('trainData.txt');
trX = sparse(D(:,2), D(:,1), D(:,3));
trLb = load('trainLabels.txt');

For this part of the problem, you have the following tasks:

1. (8 points) Solve the Lasso to predict the number of points a wine will receive. Starting at λ_max, run the Lasso on the training set, decreasing λ by a factor of 2, using previous solutions as initial conditions for each problem. Stop when you have considered enough λ's that, based on validation error, you can choose a good solution with confidence (for instance, when validation error begins increasing or stops decreasing significantly).
At each solution, record the root-mean-squared error (RMSE) on the training and validation data. In addition, record the number of nonzeros in each solution. Plot the training and validation RMSE values together against λ. Separately, plot the number of nonzeros as well.

2. (5 points) Find the λ that achieves the best validation performance. Compute a model using this λ, and list the top 10 features with the largest weights and the top 10 features with the lowest weights. Comment on whether the weights make sense intuitively.

3. (3 points) Use your model to predict the points for the reviews in the test data. Save the predicted values in a CSV file predTestLabels.csv (see 5.2 for the format). The order of the predicted values should correspond to the order of the reviews in testData.txt (counting starts from 1). Submit this predTestLabels.csv file on Kaggle and report the RMSE.

4. (10 bonus points) Submit predTestLabels.csv and enter our Kaggle competition for fame. You must use your Stony Brook email to register on Kaggle. We will maintain a leaderboard, and the top three entries at the end of the competition (the due date) will receive 10 bonus points. The ranking is based on RMSE. To prevent exploiting the test data, you are allowed a maximum of 2 submissions per 24 hours. Your submission will be evaluated immediately, and the leaderboard will be updated.

You are allowed to create new features from the existing set of features. You are allowed to perform feature normalization on existing and new features. You can use both the training and validation data to train your model. You can also use Ridge regression instead of the Lasso; if you use Ridge regression, you must implement it yourself. You cannot use any other model types, such as SVMs or boosting.

5 What to submit?

5.1 Blackboard submission

You will need to submit both your code and your answers to the questions on Blackboard. Do not submit the provided data.
Put the answer file and your code in a folder named SUBID FirstName LastName (e.g., 10947XXXX lionel messi). Zip this folder and submit the zip file on Blackboard. Your submission must be a zip file, i.e., SUBID FirstName LastName.zip. The answer file should be named answers.pdf, and it should contain:

1. Answers to Questions 1, 2, 3, and 4.1.
2. Answers to Question 4.3, including the requested plots.
3. Answers to Questions 4.4.1 and 4.4.2, including the requested plots.

5.2 Kaggle submission

For Questions 4.4.3 and 4.4.4, you must submit a CSV file to get the RMSE from the competition site https://www.kaggle.com/c/hw1-wine-goodness-cse512-spr18. A submission file should contain two columns, instanceId and Prediction. The file should contain a header and have the following format:

instanceId, Prediction
1, 82
2, 91
…

A sample submission file is available from the competition site and our handout.

Cheating warnings: Don't cheat. You must do the homework yourself, otherwise you won't learn. You cannot ask or discuss with students from previous years. You cannot look up the solution online. You must use your real name to register on Kaggle. Do not create multiple accounts to bypass the submission limit per 24 hours; doing so will be considered cheating.
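For reference, the submission format described in 5.2 can be produced with a few lines of Python; the `predictions` list below is a placeholder standing in for your model's output on testData.txt:

```python
import csv

# Placeholder predictions; replace with your model's output for testData.txt,
# in the same order as the reviews (counting starts from 1).
predictions = [82, 91, 88]

with open("predTestLabels.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["instanceId", "Prediction"])  # required header
    for i, p in enumerate(predictions, start=1):
        writer.writerow([i, p])
```

The resulting file matches the sample submission's column layout and can be uploaded to the competition site directly.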