CSE 158/258 Homework 3

Tasks (Read prediction)
Since we don’t have access to the test labels, we’ll need to simulate validation/test sets of our own.
So, let’s split the training data (‘train_Interactions.csv.gz’) as follows:
(1) Reviews 1-190,000 for training
(2) Reviews 190,001-200,000 for validation
(3) Upload to Kaggle for testing only when you have a good model on the validation set. This will save you
time (since Kaggle can take several minutes to return results), and prevent you from exceeding your daily
submission limit.
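The split above can be sketched as follows. This is only a minimal sketch: it assumes the gzipped CSV starts with a header row, and the helper names `read_interactions` and `split_train_valid` are illustrative rather than part of any provided starter code.

```python
import csv
import gzip


def read_interactions(path):
    """Read all data rows from a gzipped CSV file as tuples of strings."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        next(reader)  # assumption: the first line is a header row
        return [tuple(row) for row in reader]


def split_train_valid(rows, n_train=190000, n_valid=10000):
    """Use the first n_train rows for training and the next n_valid for validation."""
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    return train, valid
```

Because the split is positional, it is reproducible from run to run, which makes validation scores comparable as you change the model.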
1. Although we have built a validation set, it only consists of positive samples. For this task we also need
examples of user/item pairs that weren’t read. For each entry (user,book) in the validation set, sample a
negative entry by randomly choosing a book that user hasn’t read.[1] Evaluate the performance (accuracy)
of the baseline model on the validation set you have built (1 mark).
2. The existing ‘read prediction’ baseline just returns True if the item in question is ‘popular,’ using a
threshold of the 50th percentile of popularity (totalRead/2). Assuming that the ‘non-read’ test examples
are a random sample of user-book pairs, this threshold may not be the best one. See if you can find a
better threshold and report its performance on your validation set (1 mark).
3. A stronger baseline than the one provided might make use of the Jaccard similarity (or another similarity
metric). Given a pair (u, b) in the validation set, consider all training items b′ that user u has read. For
each, compute the Jaccard similarity between b and b′, i.e., between the set of users (in the training set)
who have read b and the set of users who have read b′. Predict as ‘read’ if the maximum of these Jaccard
similarities exceeds a threshold (you may choose the threshold that works best). Report the performance
on your validation set (1 mark).
4. Improve the above predictor by incorporating both a Jaccard-based threshold and a popularity-based
threshold. Report the performance on your validation set (1 mark).[2]
5. To run our model on the test set, we’ll have to use the file ‘pairs_Read.txt’ to find the reviewerID/itemID
pairs about which we have to make predictions. Using that data, run the above model and upload your
solution to Kaggle. Tell us your Kaggle user name (1 mark). If you’ve already uploaded a better solution
to Kaggle, that’s fine too!
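The pieces of Tasks 1-4 can be sketched on in-memory (user, book) pairs as below. This is a hedged sketch, not the provided baseline: all function names are illustrative, `fraction=0.5` is meant to mirror the totalRead/2 threshold, and combining the two signals with a logical "or" is just one assumed way to satisfy Task 4. The negative sampler also assumes every user has at least one unread book.

```python
import random
from collections import defaultdict


def build_indexes(train):
    """Map each user to the books they read, and each book to its readers."""
    books_per_user, users_per_book = defaultdict(set), defaultdict(set)
    for user, book in train:
        books_per_user[user].add(book)
        users_per_book[book].add(user)
    return books_per_user, users_per_book


def sample_negatives(valid, all_pairs, all_books, rng):
    """For each positive (user, book), pair the user with a random unread book."""
    read_by = defaultdict(set)
    for user, book in all_pairs:
        read_by[user].add(book)
    books = sorted(all_books)
    negatives = []
    for user, _ in valid:
        candidate = rng.choice(books)
        while candidate in read_by[user]:  # assumes an unread book exists
            candidate = rng.choice(books)
        negatives.append((user, candidate))
    return negatives


def popular_books(train, fraction=0.5):
    """Most-read books that together cover `fraction` of all training reads."""
    counts = defaultdict(int)
    for _, book in train:
        counts[book] += 1
    chosen, covered = set(), 0
    for book, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        chosen.add(book)
        covered += count
        if covered > len(train) * fraction:
            break
    return chosen


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_jaccard(user, book, books_per_user, users_per_book):
    """Largest similarity between `book` and any book the user read in training."""
    readers = users_per_book.get(book, set())
    best = 0.0
    for other in books_per_user.get(user, set()):
        if other != book:
            best = max(best, jaccard(readers, users_per_book[other]))
    return best


def predict_read(user, book, popular, books_per_user, users_per_book, threshold):
    """Assumed combination: popular OR similar enough to the user's history."""
    similarity = max_jaccard(user, book, books_per_user, users_per_book)
    return book in popular or similarity > threshold


def accuracy(predict, positives, negatives):
    """Fraction of correct predictions over positive and negative pairs."""
    correct = sum(1 for u, b in positives if predict(u, b))
    correct += sum(1 for u, b in negatives if not predict(u, b))
    return correct / (len(positives) + len(negatives))
```

To tune either threshold, evaluate `accuracy` on the validation positives and sampled negatives for a range of `fraction` and `threshold` values and keep the best pair. Passing a seeded `random.Random` instance as `rng` keeps the sampled negatives fixed between runs.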
(CSE 158 only) Tasks (Category prediction)
For these experiments, you may want to select a smaller dictionary size (i.e., fewer words), or a smaller training
set size, if the experiments are taking too long to run.
[1] This is how I constructed the test set; a good solution should mimic this procedure as closely as possible so that your Kaggle
performance is close to your validation performance.
[2] This could be further improved by treating the two values as features in a classifier; the classifier would then determine the
thresholds for you!
6. Using the review data (‘train_Category.json.gz’), build training/validation sets consisting of 190,000/10,000
reviews. We’ll start by building features to represent common words. Start by removing punctuation
and capitalization, and finding the 1,000 most common words across all reviews (‘review_text’ field) in
the training set. See the ‘text mining’ lectures for code for this process. Report the 10 most common
words, along with their frequencies (1 mark).
7. Build bag-of-words feature vectors by counting the instances of these 1,000 words in each review. Set the
labels (y) to be the ‘genreID’ column for the training instances. You may use these labels directly with
sklearn’s LogisticRegression model, which will automatically perform multiclass classification. Report
performance on your validation set (1 mark).
8. Try to improve upon the performance of the above classifier by using different dictionary sizes, or changing
the regularization constant C passed to the logistic regression model. Report the performance of your
solution, and upload it to Kaggle (1 mark).
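The feature construction in Tasks 6 and 7 can be sketched with the standard library alone. This is an illustrative sketch, not the lecture code: the function names are hypothetical, and "punctuation" is assumed to mean the ASCII characters in `string.punctuation`.

```python
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def tokenize(text):
    """Lowercase the text, drop ASCII punctuation, and split on whitespace."""
    cleaned = "".join(c for c in text.lower() if c not in PUNCTUATION)
    return cleaned.split()


def build_vocabulary(reviews, size=1000):
    """Return the `size` most common words as (word, count) pairs."""
    counts = Counter()
    for text in reviews:
        counts.update(tokenize(text))
    return counts.most_common(size)


def bag_of_words(text, word_index):
    """Count how often each vocabulary word occurs in one review."""
    vector = [0] * len(word_index)
    for word in tokenize(text):
        if word in word_index:
            vector[word_index[word]] += 1
    return vector
```

A typical use would build `word_index = {w: i for i, (w, _) in enumerate(vocabulary)}` from the training reviews only, turn every review into a vector with `bag_of_words`, and pass the resulting lists to sklearn's LogisticRegression together with the ‘genreID’ labels. Shrinking `size` is the simplest way to speed up the experiments, at some possible cost in accuracy.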
(CSE 258 only) Tasks (Rating prediction)
Let’s start by building our training/validation sets much as we did for the first task. This time building a
validation set is more straightforward: you can simply use part of the data for validation, and do not need to
randomly sample non-read users/books.
9. Fit a predictor of the form
rating(user, item) ≃ α + β_user + β_item,
by fitting the mean and the two bias terms as described in the lecture notes. Use a regularization
parameter of λ = 1. Report the MSE on the validation set (1 mark).
10. Report the user and book IDs that have the largest and smallest values of β (1 mark).
11. Find a better value of λ using your validation set. Report the value you chose, its MSE, and upload your
solution to Kaggle by running it on the test data (1 mark).
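One way to fit the bias model in Tasks 9-11 is to alternate closed-form updates of α, the user biases, and the item biases. The sketch below assumes the regularized squared-error objective, with λ penalizing only the β terms. The function names, the fixed iteration count, and the (user, item, rating) tuple layout are illustrative assumptions, so check the update rules against the lecture notes before relying on them.

```python
from collections import defaultdict


def fit_bias_model(ratings, lam=1.0, iterations=50):
    """Fit alpha, beta_user, beta_item by alternating closed-form updates."""
    by_user, by_item = defaultdict(list), defaultdict(list)
    for user, item, r in ratings:
        by_user[user].append((item, r))
        by_item[item].append((user, r))
    alpha = sum(r for _, _, r in ratings) / len(ratings)
    beta_user = {u: 0.0 for u in by_user}
    beta_item = {i: 0.0 for i in by_item}
    for _ in range(iterations):
        # Each update minimizes the objective with the other terms held fixed.
        alpha = sum(r - beta_user[u] - beta_item[i]
                    for u, i, r in ratings) / len(ratings)
        for u, rated in by_user.items():
            residual = sum(r - alpha - beta_item[i] for i, r in rated)
            beta_user[u] = residual / (lam + len(rated))
        for i, rated in by_item.items():
            residual = sum(r - alpha - beta_user[u] for u, r in rated)
            beta_item[i] = residual / (lam + len(rated))
    return alpha, beta_user, beta_item


def predict(model, user, item):
    """Predict a rating; unseen users or items fall back to a zero bias."""
    alpha, beta_user, beta_item = model
    return alpha + beta_user.get(user, 0.0) + beta_item.get(item, 0.0)


def mse(model, ratings):
    """Mean squared error of the model on (user, item, rating) triples."""
    errors = [(predict(model, u, i) - r) ** 2 for u, i, r in ratings]
    return sum(errors) / len(errors)
```

For Task 10, `max(beta_user, key=beta_user.get)` and `min(beta_user, key=beta_user.get)` give the users with the largest and smallest bias, and likewise for `beta_item`. For Task 11, refit with several values of `lam` and compare `mse` on the validation set.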