CS 5590 Foundations of Machine Learning: Assignment 3


This homework is intended to cover programming exercises on the following topics:

• Neural Networks
• Boosting/XGBoost
• Random Forests

Instructions

• Please upload your submission on Google Classroom by the deadline mentioned above. Your submission should comprise a single file (PDF/ZIP), named Assign3, with all your solutions.
• For late submissions, 10% is deducted for each day (including weekends) that an assignment is late. Note that each student begins the course with 7 grace days for late submission of assignments (of which at most 4 can be used for a given submission). Late submissions will automatically use your grace-day balance, if you have any left. You can see your balance on the FoML Marks and Grace Days document.
• Please use Python for the programming questions. You can use the sklearn and imblearn libraries.
• Please read the department plagiarism policy. Do not engage in any form of cheating – strict penalties will be imposed on both givers and takers. Please talk to the instructor or a TA if you have concerns.

Questions: Theory

1. Neural Networks: (2+2=4 marks)

(a) The XOR function (exclusive or) returns true only when exactly one of its two arguments is true; otherwise, it returns false. Show that a two-layer perceptron (a perceptron with one hidden layer) can solve the XOR problem. Submit a figure and a network diagram (with associated weights).

(b) x, y, and z are inputs with values -2, 5, and -4, respectively. You have a neuron q and a neuron f computing:

q = x - y
f = q * z

Show the graphical representation, and compute the gradient of f with respect to x, y, and z.
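Both parts of question 1 can be sanity-checked numerically. The sketch below uses one hand-picked set of weights for (a) (many other choices work equally well) and applies the chain rule for (b):

```python
# (a) One possible two-layer perceptron for XOR: hidden unit h1 acts as an OR
# gate, h2 as an AND gate, and the output fires when h1 is on but h2 is off.
def xor_net(x1, x2):
    h1 = int(x1 + x2 - 0.5 > 0)    # OR:  weights (1, 1), bias -0.5
    h2 = int(x1 + x2 - 1.5 > 0)    # AND: weights (1, 1), bias -1.5
    return int(h1 - h2 - 0.5 > 0)  # output: weights (1, -1), bias -0.5

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))     # prints the XOR truth table: 0, 1, 1, 0

# (b) Computational graph q = x - y, f = q * z with x = -2, y = 5, z = -4
x, y, z = -2.0, 5.0, -4.0
q = x - y                  # forward pass: q = -7
f = q * z                  # forward pass: f = 28
df_dq, df_dz = z, q        # product rule on f = q * z
df_dx = df_dq * 1.0        # dq/dx = 1  -> df/dx = z = -4
df_dy = df_dq * (-1.0)     # dq/dy = -1 -> df/dy = -z = 4
print(df_dx, df_dy, df_dz) # -4.0 4.0 -7.0
```

The hidden units compute OR and AND of the inputs; since XOR(x1, x2) = OR(x1, x2) AND NOT AND(x1, x2), a single output unit on top suffices.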
2. Neural Networks: (4 marks)

The extension of the cross-entropy error function to a multi-class classification problem is given by:

E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(x_n, w)

where K is the number of classes, N is the number of data samples, and t_n is a one-hot vector that designates the expected output for a data sample x_n (note that a one-hot vector has a 1 in the correct class's position and zeroes elsewhere, e.g. t_n = [0, 0, 1, 0, ..., 0] for the ground truth of the 3rd class). The network outputs y_k(x_n, w) = p(t_k = 1 | x_n) are given by the softmax activation function:

y_k(x, w) = \frac{\exp(a_k(x, w))}{\sum_j \exp(a_j(x, w))}

which satisfies 0 <= y_k <= 1 and \sum_k y_k = 1, where the a_k are the pre-softmax activations of the output-layer neurons (also called logits). For a given input, show that the derivative of the above error function with respect to the activation a_k of an output unit using the softmax activation function is given by:

\frac{\partial E}{\partial a_k} = y_k - t_k

3. Ensemble Methods: (2+2=4 marks)

Consider a convex function f(x) = x^2. Show that the average expected sum-of-squares error of the ensemble members,

E_{AV} = \frac{1}{M} \sum_{m=1}^{M} E_x[(y_m(x) - f(x))^2],

and the expected error of the ensemble,

E_{ENS} = E_x[(\frac{1}{M} \sum_{m=1}^{M} y_m(x) - f(x))^2],

satisfy:

E_{ENS} \le E_{AV}

Show further that the above result holds for any error function E(y), not just sum-of-squares, as long as it is convex in y. (Hint: the only tool you may need is Jensen's inequality. Read up about it, and use it!)

Questions: Programming

4. Random Forests: (5 + 2.5 + 2.5 = 10 marks)

(a) Write your own random forest classifier (this should be relatively easy, given that you have written your own decision tree code) and apply it to the Spam dataset [data, information]. Use 30% of the provided data as test data and the remainder for training. Compare your results, in terms of accuracy and time taken, with Scikit-learn's built-in random forest classifier. (Note that you can't use built-in decision tree functions to implement your code.
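Even with that restriction, the structure a random forest needs is small: bootstrap sampling, a random feature subset, majority voting, and the out-of-bag bookkeeping that part (c) below will rely on. A minimal sketch, using a one-level stump purely as a stand-in for your own Assignment 1 tree (and, for brevity, drawing the feature subset once per tree rather than at every split, as a full implementation would):

```python
import numpy as np

rng = np.random.default_rng(0)

class Stump:
    """One-level tree: a stand-in for your own decision tree from Assignment 1."""
    def fit(self, X, y, feat_idx):
        best = -1.0
        for j in feat_idx:                      # only the sampled features
            for t in np.unique(X[:, j]):
                for flip in (False, True):
                    pred = (X[:, j] > t).astype(int)
                    if flip:
                        pred = 1 - pred
                    acc = (pred == y).mean()
                    if acc > best:
                        best, self.j, self.t, self.flip = acc, j, t, flip
        return self

    def predict(self, X):
        pred = (X[:, self.j] > self.t).astype(int)
        return 1 - pred if self.flip else pred

def fit_forest(X, y, n_trees=25, m=2):
    """Bootstrap + random feature subset per tree; returns trees and OOB error."""
    n, d = X.shape
    trees, votes, counts = [], np.zeros((n, 2)), np.zeros(n)
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)             # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # rows left out of the bag
        feats = rng.choice(d, size=m, replace=False)
        tree = Stump().fit(X[idx], y[idx], feats)
        trees.append(tree)
        pred = tree.predict(X[oob])             # each tree votes only on its OOB rows
        votes[oob, 0] += (pred == 0)
        votes[oob, 1] += (pred == 1)
        counts[oob] += 1
    seen = counts > 0
    oob_error = (votes[seen].argmax(axis=1) != y[seen]).mean()
    return trees, oob_error

def predict_forest(trees, X):
    all_votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n)
    return (all_votes.mean(axis=0) > 0.5).astype(int)    # majority vote

# Toy data: the label depends only on the first of four features
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)
trees, oob_error = fit_forest(X, y, n_trees=25, m=2)
acc = (predict_forest(trees, X) == y).mean()
print(f"OOB error: {oob_error:.3f}, train accuracy: {acc:.3f}")
```

The `oob_error` computed here, as a function of m, is exactly the quantity part (c) asks you to plot.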
You can modify your decision tree code from Assignment 1, or write a new one, to implement the random forest. You can, however, use sklearn's built-in train_test_split to divide the data into train and test sets.)

(b) Explore the sensitivity of random forests to the parameter m (the number of features considered for the best split).

(c) Plot the OOB (out-of-bag) error (you have to find out what this is, and read about it!) and the test error against a suitably chosen range of values for m. (Use your implementation of the random forest to perform this analysis.)

Deliverables:

• Code
• Brief report (PDF) with your solutions to the above questions

5. Gradient Boosting: (3 + 5 = 8 marks)

In this question, we will explore the use of pre-processing methods and gradient boosting on the popular Lending Club dataset. You are provided with two files: loan_train.csv and loan_test.csv. The dataset is almost exactly as provided by the original source, and you may have to make the necessary changes to render it suitable for ML algorithms. (If required, you can further divide loan_train.csv into a validation set for model selection.) Your task is to pre-process the data appropriately, and then apply gradient boosting to classify whether a customer should be given a loan or not. The target attribute is in the column loan_status, which has the value "Fully Paid" (to which you can assign +1) and "Charged off" (to which you can assign -1). The other records, with the loan_status value "Current" (in both train and test), are not relevant to this problem. You can follow this link to learn more about the different attributes of the dataset (but please use the provided data; there are several versions of the dataset online). Your tasks are the following:

(a) Pre-process the data as needed to apply the classifier to the training data (you are free to use pandas or other relevant libraries.)
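Returning briefly to the m-sweep in question 4(b) and (c): the assignment requires your own forest for that analysis, but purely to illustrate the shape of the sweep (and the OOB bookkeeping it relies on), here is a sketch using sklearn's built-in forest on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy data standing in for the Spam dataset
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Sweep m (max_features) and record OOB error and test error for each value
oob_err, test_err, m_values = [], [], [1, 2, 4, 8, 16, 20]
for m in m_values:
    rf = RandomForestClassifier(n_estimators=100, max_features=m,
                                oob_score=True, random_state=0).fit(X_tr, y_tr)
    oob_err.append(1.0 - rf.oob_score_)        # OOB error = 1 - OOB accuracy
    test_err.append(1.0 - rf.score(X_te, y_te))

for m, o, t in zip(m_values, oob_err, test_err):
    print(f"m={m:2d}  OOB error={o:.3f}  test error={t:.3f}")
```

Plotting m_values against the two error lists (e.g. with matplotlib) gives the figure part (c) asks for, once your own implementation is swapped in.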
Note that the test data should not be used for pre-processing in any way, but the same pre-processing steps can be applied to the test data. Some steps to consider:

• Check for missing values, and decide how you want to handle them (you can delete the records, or replace a missing value with the mean/median of the attribute – this is a decision you must make; please document your decisions/choices in the final submitted report).
• Check whether you really need all the provided attributes, and choose the necessary ones. (You can employ feature selection methods if you are familiar with them; if not, you can eyeball.)
• Transform categorical data into binary features, and convert any other relevant columns to suitable datatypes.
• Take any other steps that help you perform better.

(b) Apply gradient boosting using sklearn.ensemble.GradientBoostingClassifier to train the model. You will need to import sklearn, sklearn.ensemble, and numpy. Your effort will be focused on predicting whether or not a loan is likely to default.

• Get the best test accuracy you can, and show which hyperparameters led to this accuracy. Report the precision and recall for each of the models you built.
• In particular, study the effect of increasing the number of trees in the classifier.
• Compare your final best performance (accuracy, precision, recall) against a simple decision tree built using information gain. (You can use sklearn's built-in decision tree function for this.)

Deliverables:

• Code
• Brief report (PDF) with your solutions to the above questions
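The pre-processing steps and the boosting experiment for question 5 can be sketched end to end as follows. This is a minimal illustration on synthetic stand-in data: only the loan_status column and its values come from the assignment; every other column name and all the numbers are invented for the example:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(0)
n = 1200

# Synthetic stand-in for the Lending Club data (target depends on int_rate)
int_rate = rng.uniform(5, 30, n)
status = np.where(rng.random(n) < 0.1, "Current",
                  np.where(int_rate + rng.normal(0, 4, n) > 18,
                           "Charged off", "Fully Paid"))
df = pd.DataFrame({
    "loan_status": status,
    "int_rate": int_rate,
    "annual_inc": rng.uniform(2e4, 2e5, n),
    "grade": rng.choice(list("ABCDEFG"), n),
})
df.loc[rng.choice(n, 60, replace=False), "annual_inc"] = np.nan

# (a) Drop records irrelevant to the problem, map the target to +1 / -1
df = df[df["loan_status"] != "Current"].copy()
y = df["loan_status"].map({"Fully Paid": 1, "Charged off": -1})

# Handle missing values (median imputation here; dropping rows is the other option)
df["annual_inc"] = df["annual_inc"].fillna(df["annual_inc"].median())

# One-hot encode categorical attributes
X = pd.get_dummies(df.drop(columns=["loan_status"]), columns=["grade"])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# (b) Gradient boosting; study the effect of the number of trees
for n_est in (50, 200):
    gb = GradientBoostingClassifier(n_estimators=n_est, learning_rate=0.1,
                                    max_depth=3, random_state=0).fit(X_tr, y_tr)
    p = gb.predict(X_te)
    print(n_est, accuracy_score(y_te, p),
          precision_score(y_te, p), recall_score(y_te, p))

# Baseline: a single information-gain (entropy) decision tree
dt = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X_tr, y_tr)
print("tree:", accuracy_score(y_te, dt.predict(X_te)))
```

The key discipline shown here is that the imputation value and the dummy-column layout are derived from the training frame; when working with the real loan_train.csv / loan_test.csv pair, apply those same fitted choices to the test file rather than recomputing them from it.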