Description
Questions:
1. (20 points) Lagrangian Dual for Absolute Loss Regression: Consider the problem of finding optimal linear regression model weights in the RRM framework using the absolute loss and a squared 2-norm
regularizer:
θ∗ = arg min_θ  C Σ_{n=1}^N |y_n − f(x_n, θ)| + ||w||_2^2

where θ = [w, b] and f(x, θ) = wx^T + b. This problem is convex, but not differentiable. However,
the problem can be converted into an alternative linearly constrained form where the objective function is
quadratic in an expanded set of variables [θ, ε], where ε = [ε_1^+, ε_1^−, …, ε_N^+, ε_N^−]:
θ∗, ε∗ = arg min_{θ, ε}  C Σ_{n=1}^N (ε_n^+ + ε_n^−) + ||w||_2^2

s.t.  ∀n:  y_n ≤ f(x_n, θ) + ε_n^+
      ∀n:  y_n ≥ f(x_n, θ) − ε_n^−
      ∀n:  ε_n^+ ≥ 0
      ∀n:  ε_n^− ≥ 0
a. (5 pts) Derive the Lagrangian function for the constrained formulation shown above.
b. (10 pts) Derive the Lagrangian dual for the constrained formulation shown above.
c. (5 pts) Explain the advantages of solving the Lagrangian dual problem instead of the constrained version
of the primal problem.
2. (40 points) Subgradient Descent for SVC: Consider the hinge loss formulation of the support vector
classifier learning problem shown below:
w∗ = arg min_w  C Σ_{n=1}^N max(0, 1 − y_n(wx_n^T + b)) + ||w||_2^2
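One valid subgradient of the objective above can be sketched in numpy, together with a plain subgradient descent loop on synthetic stand-in data (assumptions: X is an N×D array, labels y are in {−1, +1}; the step-size schedule here is just one simple choice, not the assignment's required answer):

```python
import numpy as np

def svc_subgradient(w, b, X, y, C=1.0):
    """One subgradient of C * sum_n max(0, 1 - y_n (x_n . w + b)) + ||w||_2^2."""
    margins = y * (X @ w + b)
    active = (margins < 1.0).astype(float)      # hinge is active where margin < 1
    g_w = -C * (X.T @ (active * y)) + 2.0 * w   # regularizer contributes 2w
    g_b = -C * np.sum(active * y)
    return g_w, g_b

# A few subgradient steps with a diminishing step size on toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=40))  # nearly separable labels
w, b = np.zeros(2), 0.0
for t in range(200):
    g_w, g_b = svc_subgradient(w, b, X, y)
    step = 0.1 / (1 + t)
    w, b = w - step * g_w, b - step * g_b
err = np.mean(np.sign(X @ w + b) != y)
```

Note that the hinge loss is flat where the margin exceeds 1, so `active` zeroes out those points; at the kink (margin exactly 1) any value in [0, 1] would also give a valid subgradient.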
a. (5 pts) Derive an expression for a subgradient of the objective function with respect to w and b.
b. (20 pts) Starting from the provided template (svm.py), implement a Scikit-Learn compatible class for this
model including objective, subgradient, fit, predict, set_model, and get_model functions. Your fit method
must use a subgradient descent procedure. As your answer to this question, describe your approach to
learning in detail (stepsize and convergence rules used, any acceleration techniques, etc.) and submit your
commented code for auto grading as described above.
c. (5 pts) Use your implementation to learn optimal model parameters w_SVC and b_SVC on the provided
training data using C = 1. Report the value of the SVC objective function at the optimal model parameters,
as well as the classification error rate on the training data.
d. (5 pts) Apply Scikit-learn's linear logistic regression implementation sklearn.linear_model.LogisticRegression
to the provided training data using C = 1 to learn optimal model parameters w_LR and b_LR. Report the value
of the logistic regression objective function at w_LR and b_LR, the value of the SVC objective function at w_LR
and b_LR, and the classification error rate on the training data.
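The mechanics of part (d) can be sketched on synthetic stand-in data (the real training set is provided with the assignment); sklearn.linear_model.LogisticRegression exposes the fitted parameters via coef_ and intercept_, which can then be plugged into the SVC objective from question 2:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data; labels in {-1, +1} as in the SVC formulation.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + 0.2 * rng.normal(size=100) > 0, 1, -1)

clf = LogisticRegression(C=1.0).fit(X, y)
w_LR, b_LR = clf.coef_.ravel(), clf.intercept_[0]

# Evaluate the SVC objective of question 2 at (w_LR, b_LR).
C = 1.0
hinge = np.maximum(0.0, 1.0 - y * (X @ w_LR + b_LR))
svc_obj = C * hinge.sum() + w_LR @ w_LR
err = np.mean(clf.predict(X) != y)
```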
e. (5 pts) How do SVC and logistic regression compare on the provided data? Under what circumstances
would you expect SVC to provide lower generalization error on test data (drawn from the same distribution
as the training data) than a logistic regression model?
3. (40 points) Multi-Output Neural Networks: Neural networks are flexible models that can be optimized
with respect to many different losses. In this question, you will implement a custom neural network model
that simultaneously solves a regression problem (localizing an object in an image), and a binary classification
problem (determining the class of the object present in an image) using a composite loss. The prediction
model f(x, θ) will simultaneously produce a class probability f^c(x, θ) ∈ [0, 1] and a localization output
f^l(x, θ) ∈ R × R. The outputs are formally defined as y = [y^c, y^l], where y^c ∈ {0, 1} is the class label for
data case n and y^l = [y^v, y^h] gives the vertical and horizontal locations of the object. The objective function
for this learning problem is shown below, where α is a parameter that trades off between the classification
loss (cross entropy, L_ce) and the localization loss (squared error):

θ∗ = arg min_θ  Σ_{n=1}^N  α L_ce(y_n^c, f^c(x_n, θ)) + (1 − α) ||y_n^l − f^l(x_n, θ)||_2^2

L_ce(y, p) = −y log(p) − (1 − y) log(1 − p)
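The composite loss above can be sketched directly in numpy (assumed shapes: y_c and p are length-N vectors, y_loc and f_loc are N×2 arrays of vertical/horizontal locations; the small eps guarding log(0) is an implementation detail, not part of the assignment's definition):

```python
import numpy as np

def composite_loss(y_c, p, y_loc, f_loc, alpha=0.5):
    """Sum over data cases of alpha * L_ce + (1 - alpha) * squared location error."""
    eps = 1e-12                                            # guard against log(0)
    ce = -(y_c * np.log(p + eps) + (1 - y_c) * np.log(1 - p + eps))
    sq = np.sum((y_loc - f_loc) ** 2, axis=1)              # per-case ||y^l - f^l||_2^2
    return np.sum(alpha * ce + (1 - alpha) * sq)

# Perfect predictions give (near) zero loss.
y_c = np.array([0.0, 1.0])
y_loc = np.array([[10.0, 20.0], [30.0, 40.0]])
loss = composite_loss(y_c, np.array([0.0, 1.0]), y_loc, y_loc)
```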
In this question, you will begin by implementing an architecture for this problem consisting of an input layer,
three hidden layers, and two parallel output layers (one each for localization and classification). The inputs
are 60×60 images, resulting in 3600-dimensional input vectors. All hidden layers will be fully connected and use
ReLU non-linearities. The class probability output layer will use a logistic non-linearity. The localization
output layer will consist of two linear units. The network architecture you will implement is summarized below:

Input layer (3600 units): x
Hidden layer 1 (256 units): h1, with weights w1 (3600×256) and biases b1 (256)
Hidden layer 2 (64 units): h2, with weights w2 (256×64) and biases b2 (64)
Hidden layer 3 (32 units): h3, with weights w3 (64×32) and biases b3 (32)
Class probability output (1 unit): p, with weights w5 (32×1) and biases b5 (1)
Location output (2 units): [y^v, y^h], with weights w4 (32×2) and biases b4 (2)
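A minimal numpy forward pass consistent with the layer sizes above can look as follows (the weights here are random placeholders purely to check shapes; in the assignment they are learned parameters, and the initialization scale is an arbitrary choice):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3600, 256, 64, 32]
Ws = [rng.normal(scale=0.01, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
w5, b5 = rng.normal(scale=0.01, size=(32, 1)), np.zeros(1)   # class head
w4, b4 = rng.normal(scale=0.01, size=(32, 2)), np.zeros(2)   # location head

x = rng.normal(size=(5, 3600))       # batch of 5 flattened 60x60 images
h = x
for W, b in zip(Ws, bs):
    h = relu(h @ W + b)              # three fully connected ReLU layers
p = sigmoid(h @ w5 + b5)             # class probability in (0, 1)
loc = h @ w4 + b4                    # two linear localization outputs
```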
a. (25 pts) Starting from the provided template (nn.py), implement a Scikit-Learn compatible class for the
model shown above including objective, fit, predict, set_model, and get_model functions. You can develop
your implementation using numpy, pytorch (version 0.2.0), or tensorflow (version 1.3). You may use automatic differentiation methods to develop your implementation. As your answer to this question, describe
your approach to learning in detail (parameter initialization, optimization algorithm, stepsize selection and
convergence rules used, acceleration techniques, etc.), and submit your commented code for auto grading as
described above.
b. (5 pts) Using the provided training data set, learn the model using α = 0.5. Report the classification error
rate and the mean squared error (MSE) of the location predictions obtained on both the training and test sets.
c. (10 pts) Use your implementation to perform experiments to assess how changing the value of α affects
the classification and location prediction results on the test set. Describe your experiments and conclusions.
Present at least one graph to support your conclusions.
d. (5 pts) Bonus: The data file bonus_train.npz contains data for an expanded version of the above
problem that includes data from 10 different classes. In this question, you will develop a neural network
architecture for the 10-class version of this problem. This will require changing the binary cross-entropy
loss to the multi-class cross-entropy loss. The goal for this problem is to develop an end-to-end model that
results in the highest possible classification accuracy. You may make any changes to the model architecture.
As your answer to this question, you will provide a description of your model architecture and learning
algorithm, as well as a trained model that will be evaluated on held-out test data. See the instructions in
bonus.py for a description of how to submit your code and pre-trained model. You can use up to one
additional page to describe your solution. Note that your submission must out-perform a reference model
created by the course staff to qualify for bonus points.

