CSC 413/2516  Homework 1

$30.00

Download Details:

  • Name: hw1-ktlzgu.zip
  • Type: zip
  • Size: 5.75 MB

Category:

Description

5/5 - (1 vote)

1 Hard-Coding Networks
Can we use neural networks to tackle coding problems? Yes! In this question, you will build a
neural network to find the k
th smallest number from a list using two different approaches: sorting
and counting (Optional). You will start by constructing a two-layer perceptron “Sort 2” to sort
two numbers and then use it as a building block to perform your favorite sorting algorithm (e.g.,
Bubble Sort, Merge Sort). Finally, you will output the k
th element from the sorted list as the final
answer.
Note: Before doing this problem, you need to have a basic understanding of the key components of
neural networks (e.g., weights, activation functions). The reading on multilayer perceptrons located
at https://uoft-csc413.github.io/2022/assets/readings/L02a.pdf may be useful.
1
https://markus.teach.cs.toronto.edu/2022-01/
2
https://uoft-csc413.github.io/2022/assets/misc/syllabus.pdf
3
https://piazza.com/class/ky8yug7i9ol3b1
1

1.1 Sort two numbers [1pt]
In this problem, you need to find a set of weights and bias for a two-layer perceptron “Sort 2” that
sorts two numbers. The network takes a pair of numbers (x1, x2) as input and output a sorted pair
(y1, y2), where y1 ≤ y2. You may assume the two numbers are distinct and positive for simplicity.
You will use the following architecture:
Please specify the weights and activation functions for your network. Your answer should include:
• Two weight matrices: W(1)
, W(2) ∈ R
2×2
• Two bias vector: b
(1)
, b
(2) ∈ R
2
• Two activation functions: ϕ
(1)(z), ϕ
(2)(z)
You do not need to show your work.
Hints: Sorting two numbers is equivalent to finding the min and max of two numbers.
max(x1, x2) = 1
2
(x1 + x2) + 1
2
|x1 − x2|, min(x1, x2) = 1
2
(x1 + x2) −
1
2
|x1 − x2|
1.2 Perform Sort [1.5pt]
Draw a computation graph to show how to implement a sorting function ˆf : R
4 → R
4 where
ˆf(x1, x2, x3, x4) = (ˆx1, xˆ2, xˆ3, xˆ4) where (ˆx1, xˆ2, xˆ3, xˆ4) is (x1, x2, x3, x4) in sorted order. Let us
assume ˆx1 ≤ xˆ2 ≤ xˆ3 ≤ xˆ4 and x1, x2, x3, x4 are positive and distinct. Implement ˆf using your
favourite sorting algorithms (e.g. Bubble Sort, Merge Sort). Let us denote the “Sort 2” module as
S, please complete the following computation graph. Your answer does not need to give the label
for intermediate nodes, but make sure to index the “Sort 2” module.
Hints: Bubble Sort needs 6 “Sort 2” blocks, while Merge Sort needs 5 “Sort 2” blocks.
2

1.3 Find the k
th smallest number [0pt]
Based on your sorting network, you may want to add a new layer to output your final result (k
th
smallest number). Please give the weight W(3) for this output layer when k = 3.
Hints: W(3) ∈ R
1×4
.
1.4 Counting Network [0pt]
The idea of using a counting network to find the k
th smallest number is to build a neural network
that can determine the rank of each number and output the number with the correct rank. Specifically, the counting network will count how many elements in a list are less than a value of interest.
And you will apply the counting network to all numbers in the given list to determine their rank.
Finally, you will use another layer to output the number with the correct rank.
The counting network has the following architecture, where y is the rank of x1 in a list containing
x1, x2, x3, x4.
Please specify the weights and activation functions for your counting network. Draw a diagram to
show how you will use the counting network and give a set of weights and biases for the final layer
to find the k
th smallest number. In other words, repeat the process of sections 1.1, 1.2, 1.3 using
the counting idea.
Hints: You may find the following two activation functions useful.
1) Hard threshold activation function:
ϕ(z) = I(z ≥ 0) =
1 if z ≥ 0
0 if z < 0
2) Indicator activation function:
ϕ(z) = I(z ∈ [−1, 1]) =
1 if z ∈ [−1, 1]
0 otherwise
2 Backpropagation
This question helps you to understand the underlying mechanism of back-propagation. You need
to have a clear understanding of what happens during the forward pass and backward pass and be
able to reason about the time complexity and space complexity of your neural network. Moreover,
you will learn a commonly used trick to compute the gradient norm efficiently without explicitly
writing down the whole Jacobian matrix.
Note: The reading on backpropagation located at https://uoft-csc413.github.io/2022/assets/
readings/L02b.pdf may be useful for this question.
3

2.1 Automatic Differentiation
Consider a neural network defined with the following procedure:
z1 = W(1)x + b
(1)
h1 = ReLU(z1)
z2 = W(2)x + b
(2)
h2 = σ(z2)
g = h1 ◦ h2
y = W(3)g + W(4)x,
y
′ = softmax(y)
S =
X
N
k=1
I(t = k) log(y

k
)
J = −S
for input x with class label t where ReLU(z) = max(z, 0) denotes the ReLU activation function,
σ(z) = 1
1+e−z denotes the Sigmoid activation function, both applied elementwise, and softmax(y) =
exp(y)
PN
i=1 exp(yi)
. Here, ◦ denotes element-wise multiplication.
2.1.1 Computational Graph [0pt]
Draw the computation graph relating x, t, z1, z2, h1, h2 , g, y, y

, S and J .
2.1.2 Backward Pass [1pt]
Derive the backprop equations for computing x¯ =
∂J
∂x

, one variable at a time, similar to the
vectorized backward pass derived in Lec 2.
Hints: Be careful about the transpose and shape! Assume all vectors (including error vector) are column vector and all Jacobian matrices adopt numerator-layout notation 4
. You can use softmax′
(y)
for the Jacobian matrix of softmax.
2.2 Gradient Norm Computation
Many deep learning algorithms require you to compute the L
2 norm of the gradient of a loss function with respect to the model parameters for every example in a minibatch. Unfortunately, most
differentiation functionality provided by most software frameworks (Tensorflow, PyTorch) does not
support computing gradients for individual samples in a minibatch. Instead, they only give one
gradient per minibatch that aggregates individual gradients for you. A naive way to get the perexample gradient norm is to use a batch size of 1 and repeat the back-propagation N times, where
N is the minibatch size. After that, you can compute the L
2 norm of each gradient vector. As
you can imagine, this approach is very inefficient. It can not exploit the parallelism of minibatch
operations provided by the framework.
4Numerator-layout notation: https://en.wikipedia.org/wiki/Matrix_calculus#Numerator-layout_notation
4

In this question, we will investigate a more efficient way to compute the per-example gradient
norm and reason about its complexity compared to the naive method. For simplicity, let us consider
the following two-layer neural network.
z = W(1)x
h = ReLU(z)
y = W(2)h,
where W(1) =


1 2 1
−2 1 0
1 −2 −1

 and W(2) =


−2 4 1
1 −2 −3
−3 4 6

.
2.2.1 Naive Computation [1pt]
Let us assume the input x =

1 3 1⊤
and the error vector y =
∂J
∂y

=

1 1 1⊤
. In this question, write down the Jacobian matrix (numerical value) ∂J
∂W(1) and ∂J
∂W(2) using back-propagation.
Then, compute the square of Frobenius Norm of the two Jacobian matrices, ∥A∥
2
F
. The square of
Frobenius norm of a matrix A is defined as follows:
∥A∥
2
F =
Xm
i=1
Xn
j=1
|aij |
2 = trace
A
⊤A

Hints: Be careful about the transpose. Show all your work for partial marks.
2.2.2 Efficient Computation [0.5pt]
Notice that weight Jacobian can be expressed as the outer product of the error vector and activation
∂J
∂W(1) = (zx⊤)
⊤ and ∂J
∂W(2) = (yh⊤)
⊤. We can compute the Jacobian norm more efficiently using
the following trick:

∂J
∂W(1) ∥
2
F = trace
∂J
∂W(1)
⊤ ∂J
∂W(1)!
(Definition)
= trace
zx⊤xz

= trace
x
⊤xz
⊤z

(Cyclic Property of Trace)
=

x
⊤x
z
⊤z

(Scalar Multiplication)
= ∥x∥
2
2∥z∥
2
2
Compute the square of the Frobenius Norm of the two Jacobian matrices by plugging the value
into the above trick.
Hints: Verify the solution is the same as naive computation. Show all your work for partial marks.

2.2.3 Complexity Analysis [1.5pt]
Now, let us consider a general neural network with K − 1 hidden layers (K weight matrices). All
input units, output units, and hidden units have a dimension of D. Assume we have N input vectors. How many scalar multiplications T (integer) do we need to compute the per-example gradient
norm using naive and efficient computation, respectively? And, what is the memory cost M (big
O notation)?
For simplicity, you can ignore the activation function and loss function computation. You can
assume the network does not have a bias term. You can also assume there are no in-place operations.
Please fill up the table below.
T (Naive) T (Efficient) M (Naive) M (Efficient)
Forward Pass
Backward Pass
Gradient Norm Computation
Hints: The forward pass computes all the activations and needs memory to store model parameters
and activations. The backward pass computes all the error vectors. Moreover, you also need to
compute the parameter’s gradient in naive computation. During the Gradient Norm Computation,
the naive method needs to square the gradient before aggregation. In contrast, the efficient method
relies on the trick. Thinking about the following questions may be helpful. 1) Do we need to store
all activations in the forward pass? 2) Do we need to store all error vectors in the backward pass?
3) Why standard backward pass is twice more expensive than the forward pass? Don’t forget to
consider K and N in your answer.
2.3 Inner product of Jacobian: JVP and VJP [0pt]
A more general case of computing the gradient norm is to compute the inner product of the Jacobian
matrices computed using two different examples. Let f1, f2 and y1, y2 be the final outputs and layer
outputs of two different examples respectively. The inner product Θ of Jacobian matrices of layer
parameterized by θ is defined as:
Θθ (f1, f2) := ∂f1
∂θ
∂f2
∂θ

=
∂f1
∂y1
∂y1
∂θ
∂y2
∂θ
⊤ ∂f2
∂y2

=
∂f1
∂y1
|{z}
O×Y
∂y1
∂θ
|{z}
Y×P
∂y2
∂θ

| {z }
P×Y
∂f2
∂y2

| {z }
Y×O
,
Where O, Y, P represent the dimension of the final output, layer output, model parameter respectively. How to formulate the above computation using Jacobian Vector Product (JVP) and
Vector Jacobian Product (VJP)? What are the computation cost using the following three ways of
contracting the above equation?
(a) Outside-in: M1M2M3M4 = ((M1M2)(M3M4))
(b) Left-to-right and right-to-left: M1M2M3M4 = (((M1M2)(M3)M4) = (M1(M2(M3M4)))
(c) Inside-out-left and inside-out-right: M1M2M3M4 = ((M1(M2M3))M4) = (M1((M2M3)M4))
6

3 Linear Regression
The reading on linear regression located at https://uoft-csc413.github.io/2022/assets/readings/
L01a.pdf may be useful for this question.
Given n pairs of input data with d features and scalar label (xi
, ti) ∈ R
d × R, we wish to
find a linear model f(x) = wˆ
⊤x with wˆ ∈ R
d
that minimizes the squared error of prediction on
the training samples defined below. This is known as an empirical risk minimizer. For concise
notation, denote the data matrix X ∈ R
n×d and the corresponding label vector t ∈ R
n
. The
training objective is to minimize the following loss:
min

1
n
Xn
i=1
(wˆ
⊤xi − ti)
2 = min

1
n
∥Xwˆ − t|
2
2
.
We assume X is full rank: X⊤X is invertible when n > d, and XX⊤ is invertible otherwise.
Note that when d > n, the problem is underdetermined, i.e. there are less training samples than
parameters to be learned. This is analogous to learning an overparameterized model, which is
common when training of deep neural networks.
3.1 Deriving the Gradient [0pt]
Write down the gradient of the loss w.r.t. the learned parameter vector wˆ .
3.2 Underparameterized Model
3.2.1 [0.5pt]
First consider the underparameterized d < n case. Show that the solution obtained by gradient
descent is wˆ = (X⊤X)
−1X⊤t, assuming training converges. Show your work.
3.2.2 [0.5pt]
Now consider the case of noisy linear regression. The training labels ti = w∗⊤xi + ϵi are generated
by a ground truth linear target function, where the noise term, ϵi
, is generated independently with
zero mean and variance σ
2
. The final training error can be derived as a function of n and ϵ, as:
Error =
1
n
||(X(X⊤X)
−1X⊤ − I)ϵ||2
2
,
Show this is true by substituting your answer from 3.2.1 into 1
n
∥Xwˆ − t|
2
2
. Also, find the expectation of the above training error in terms of n, d and σ.
Hints: you might find the cyclic property 5 of trace useful.
5
https://en.wikipedia.org/wiki/Trace_(linear_algebra)#Cyclic_property
7

3.3 Overparameterized Model
3.3.1 [0pt]
Now consider the overparameterized d > n case. We first illustrate that there exist multiple
empirical risk minimizers. For simplicity we let n = 1 and d = 2. Choose x1 = [1; 1] and t1 = 3,
i.e. the one data point and all possible wˆ lie on a 2D plane. Show that there exists infinitely many
wˆ satisfying wˆ
⊤x1 = y1 on a real line. Write down the equation of the line.
3.3.2 [1pt]
Now, let’s generalize the previous 2D case to the general d > n. Show that gradient descent from
zero initialization i.e. wˆ (0) = 0 finds a unique minimizer if it converges. Show that the solution by
gradient decent is wˆ = X⊤(XX⊤)
−1
t. Show your work.
Hints: You can assume that the gradient is spanned by the rows of X and write wˆ = X⊤a for some
a ∈ R
n
.
3.3.3 [0pt]
Repeat part 3.2.2 for the overparameterized case.
3.3.4 [0.5pt]
Visualize and compare underparameterized with overparameterized polynomial regression: https:
//colab.research.google.com/github/uoft-csc413/2022/blob/master/assets/assignments/
LS_polynomial_regression.ipynb Include your code snippets for the fit_poly function in the
write-up. Does overparameterization (higher degree polynomial) always lead to overfitting, i.e.
larger test error?
3.3.5 [0pt]
Give n1, n2 with n1 ≤ n2, and fixed dimension d for which L2 ≥ L1, i.e. the loss with n2 data
points is greater than loss with n1 data points. Explain the underlying phenomenon. Be sure to
also include the error values L1 and L2 or provide visualization in your solution.
Hints: use your code to experiment with relevant parameters, then vary to find region and report
one such setting.