Neural Network 5

Q1. Backpropagation in MLP

Which of the following options are true with respect to Backpropagation?

Choose the correct answer from below, please note that this question may have multiple correct answers

A.     In backpropagation, we calculate the error contribution of each neuron.

B.     In backpropagation, we calculate the loss gradients with respect to inputs.

C.      In backpropagation, we calculate the loss gradient with respect to weights and biases.

D.     In backpropagation, we update the weights of neurons in each iteration.

Ans: A, C, D

Correct options :

i) In backpropagation, we calculate the error contribution of each neuron.

ii) In backpropagation, we calculate the loss gradient with respect to weights and biases.

iii) In backpropagation, we update the weights of neurons in each iteration.

Explanation :

Only the statement "In backpropagation, we calculate the loss gradients with respect to inputs" is false, since the inputs are fixed data and are not updated. All the other options are true.

Backpropagation calculates the rate at which the loss changes with respect to the weights and biases, and these parameters are then updated in order to minimize the loss function.
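As a minimal sketch (not part of the question), assuming a single linear layer and a squared-error loss, one backpropagation/gradient-descent step computes the loss gradients with respect to the weights and bias and then updates them:

import numpy as np

# Illustrative only: one gradient step for a single linear layer.
np.random.seed(0)
X = np.random.randn(4, 3)              # 4 samples, 3 features
y = np.random.randn(4, 1)              # targets
W = np.random.randn(3, 1) * 0.1        # weights
b = np.zeros((1, 1))                   # bias
lr = 0.01

y_hat = X @ W + b                      # forward pass
err = y_hat - y                        # per-sample error
dW = X.T @ err / len(X)                # d(loss)/dW
db = err.mean(axis=0, keepdims=True)   # d(loss)/db
W -= lr * dW                           # update weights
b -= lr * db                           # update bias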

Q2. Complete the updating code




We want to use the above code snippet for a simple binary classification task: if model() returns 1 for an observation, the observation is classified as '+' (1), otherwise as '-' (0).

model() will return 1 only if the weighted sum is greater than or equal to the threshold thresh. In the above code snippet, the fit function will be used for getting a weight matrix for the classification task.

Complete the updating syntax for weights [?] and threshold [??].


Note: 
The inputs are always positive

Choose the correct answer from below:

A.     w = w + lr * x, thresh = thresh + lr

B.     w = w - lr * x, thresh = thresh + lr

C.      w = w + lr * x, thresh = thresh - lr * x.

D.     w = w - lr * x, thresh = thresh - lr * x.

Ans: A

Correct Answer:

  • w = w + lr * x, thresh = thresh + lr

Explanation

The code snippet is basically a simple implementation of the perceptron, where:

  • we update the weights and the threshold with the same intuition as in SGD, but without the explicit gradient formulation.

If the expected output is "+" and the predicted one is "-", then:

  • we should increase the weights in order to increase the weighted sum i.e. w.x
  • and decrease the threshold (thresh)

If the expected output is "-" and the predicted one is "+", then:

  • we should decrease the weights in order to decrease the weighted sum
  • and increase the threshold (thresh)

The code for model() is as follows:

import numpy as np

def model(x, w, thresh):
    # Predict '+' (1) if the weighted sum reaches the threshold, else '-' (0)
    return 1 if (np.dot(w, x) >= thresh) else 0
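Since the original fit() snippet is shown only as an image, here is a rough, hedged sketch of what such a perceptron-style fit function could look like, following the intuition above (the function signature and variable names are assumptions). The two blanks in the original most likely sit in different branches, which would make the selected answer (w = w + lr * x in the "expected '+', predicted '-'" branch and thresh = thresh + lr in the "expected '-', predicted '+'" branch) consistent with this intuition.

def fit(X, Y, lr=0.1, epochs=10):
    # Rough sketch only -- the question's actual fit() is not reproduced here.
    w = np.zeros(X.shape[1])
    thresh = 0.0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            y_pred = model(x, w, thresh)
            if y == 1 and y_pred == 0:       # expected '+', predicted '-'
                w = w + lr * x               # increase the weighted sum
                thresh = thresh - lr         # lower the threshold
            elif y == 0 and y_pred == 1:     # expected '-', predicted '+'
                w = w - lr * x               # decrease the weighted sum
                thresh = thresh + lr         # raise the threshold
    return w, thresh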

Q3. Weight's value

Consider a neural network as shown in the image below:




The initial values of x1, x2 and x3 are [10, 5, 5]. The true value of the output is 4. If the loss function is mean squared error, then what is the value of w1 after the first epoch?

Consider the initial value of all of w1, w2, w3, w4 and w5 as 0.1 and the learning rate as 0.01.

Choose the correct answer from below:

A.     0.550

B.     0.252

C.      0.111

D.     0.340

Ans: B

Correct option : 0.252

Explanation :

Let o1 be the output of the first neuron in the hidden layer and o2 be the output of the second neuron in the hidden layer.

Now, o1 = F1(z1), where z1 = w1.x1 and (as per the diagram) F1 squares its input,
and o2 = F2(z2), where z2 = w2.x2 + w3.x3 and F2 is the identity.

o1 = w1^2.x1^2 = (0.1).(0.1).(10).(10) = 1
o2 = w2.x2 + w3.x3 = (0.1).(5) + (0.1).(5) = 1

Similarly, the network output is:

y_hat = w4.o1 + w5.o2 = w4.(w1^2.x1^2) + w5.(w2.x2 + w3.x3)

According to the question, the loss function is:

loss = (y - y_hat)^2

Using the chain rule of differentiation:

d(loss)/dw1 = d(loss)/do1 . d(o1)/dw1

Thus,

d(loss)/do1 = d((y - w4.o1 - w5.o2)^2)/do1 = (-2).(w4).(y - w4.o1 - w5.o2)

= (-2).(0.1).(4 - (0.1).(1) - (0.1).(1)) = (-0.2).(4 - 0.2) = (-0.2).(3.8) = -0.76

Similarly,

d(o1)/dw1 = d(w1^2.x1^2)/dw1 = 2.w1.x1.x1 = (2).(0.1).(10).(10) = 20

Finally,

d(loss)/dw1 = (-0.76).(20) = -15.2

Updating the weight:

w1 ← w1 - α.(d(loss)/dw1)

where α is the learning rate

w1 = 0.1 - (0.01).(-15.2) = 0.1 + 0.152 = 0.252
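The hand calculation above can be checked numerically with a short script (assuming, as in the derivation, that the first hidden neuron squares its weighted input and the second one is linear):

x1, x2, x3 = 10.0, 5.0, 5.0
w1 = w2 = w3 = w4 = w5 = 0.1
y_true, lr = 4.0, 0.01

o1 = (w1 * x1) ** 2                      # = 1.0
o2 = w2 * x2 + w3 * x3                   # = 1.0
y_hat = w4 * o1 + w5 * o2                # = 0.2

dloss_do1 = -2 * w4 * (y_true - y_hat)   # = -0.76
do1_dw1 = 2 * w1 * x1 ** 2               # = 20.0
dloss_dw1 = dloss_do1 * do1_dw1          # = -15.2

w1_new = w1 - lr * dloss_dw1
print(round(w1_new, 3))                  # prints 0.252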

Q4. Convergence

Fill in the blank :

In a multi-layered perceptron architecture, gradient descent ______ .

Choose the correct answer from below:

A.     always converges to the global minimum.

B.     doesn't converge to the global minimum.

C.      may or may not converge to the global minimum.

D.     will always converge to the global minimum if the learning rate is appropriate.

Ans: C

Correct option : may or may not converge to the global minimum

Explanation :

Gradient descent may or may not converge to the global minimum, depending on the initial weights and the learning rate. The loss function of a multi-layered perceptron is neither convex nor concave, so it can have multiple local minima, and gradient descent is therefore not guaranteed to converge to the global one.
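To make the idea concrete, here is a small, purely illustrative example (not from the question): on a non-convex one-dimensional loss, gradient descent ends up in different minima depending on the starting point.

# Toy illustration: a non-convex 1-D loss with two minima.
def loss(w):
    return w**4 - 3*w**2 + w

def grad(w):
    return 4*w**3 - 6*w + 1

def gradient_descent(w, lr=0.01, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

print(gradient_descent(-2.0))   # settles near the deeper (global) minimum, w ≈ -1.3
print(gradient_descent(+2.0))   # settles near the shallower local minimum, w ≈ 1.1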

Q5. Calculate the loss

Given the dataset, calculate the loss after completing the code snippet. The blanks are marked [?].

import numpy as np
import pandas as pd

def hypothesis(w,b,x):                           #Section 1
  return 1.0/(1.0 + np.exp(-(w*x + b)))

def error(w,b):                                  #Section 2       
  err=0.0
  for x,y in zip(train,label):
    fx = hypothesis(w,b,x)
    err += 0.5 * (fx-y) ** 2
  return err

def grad_w(w,b,x,y):                              #Section 3
  fx=hypothesis(w,b,x)
  return (fx-y)*fx*(1-fx)*x

def grad_b(w,b,x,y):                              #Section 4
  fx=hypothesis(w,b,x)
  return (fx-y)*fx*(1-fx)

def gradient_descent(train,label,w,b,lr,max_epochs):    #Section 5
  dw=0
  db=0
  for i in range(max_epochs):
    for x,y in zip(train,label):
      dw+=grad_w(w,b,x,y)
      db+=grad_b(w,b,x,y)
    w = w [?] lr*dw         # [?] is here arithmetic sign
    b = b [?] lr*db         # [?] is here arithmetic sign
    print("For Epoch {}, the loss is {}".format(i+1, error(w,b)))
  return w,b

df=pd.read_csv("filepath")
train=df['X']
label=df['Y']
initial_w = 1
initial_b = 1
lr=0.01
max_epochs=50
w,b = gradient_descent(train,label,initial_w,initial_b,lr,max_epochs)

Choose the correct option for the loss and arithmetic signs.


Choose the correct answer from below:

A.     0.016, -, -

B.     0.28, +, +

C.      0.028, -, -

D.     0.050, +, +

Ans: A

Correct option : 0.016, -, -

Explanation :

The parameter update rule in gradient descent is:

w ← w - α.(∂L/∂w)
b ← b - α.(∂L/∂b),
where α is the learning rate.

Thus, the signs in place of [?] are - and -.

The loss after 50 epochs comes out to be around 0.016.
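For completeness, Section 5 with the blanks filled in reads as follows (it reuses grad_w, grad_b and error from the snippet above):

def gradient_descent(train, label, w, b, lr, max_epochs):    #Section 5
  dw = 0
  db = 0
  for i in range(max_epochs):
    for x, y in zip(train, label):
      dw += grad_w(w, b, x, y)
      db += grad_b(w, b, x, y)
    w = w - lr*dw        # move against the gradient, hence '-'
    b = b - lr*db        # same for the bias
    print("For Epoch {}, the loss is {}".format(i+1, error(w, b)))
  return w, b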

Q6. Fully connected neural network

Which, if any, of the given propositions is true about fully-connected neural networks (FCNN)?

Choose the correct answer from below:

A.     In a FCNN, there are connections between neurons of the same layer.

B.     In a FCNN, the most common weight initialization scheme is the zero initialization, because it leads to faster and more robust training.

C.      The neurons of one layer are connected to every neuron of its preceding layer.

D.     None of the options

Ans: C

Correct option : The neurons of one layer are connected to every neuron of its preceding layer.

Explanation :

  • In a FCNN, neurons of one layer are connected to every neuron of the preceding layer, but there are no connections between neurons of the same layer (a minimal sketch follows this list).
  • Zero initialization leads to weight symmetry and undermines training: if all the weights are the same, they all receive the same update in each training round, so no learning can occur.
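As an illustration, here is a minimal sketch of a fully connected (dense) layer in NumPy; the layer sizes are arbitrary assumptions. The weight matrix has one entry per pair of (preceding-layer neuron, current-layer neuron), and random initialization keeps the neurons from staying identical.

import numpy as np

# Minimal sketch of a fully connected layer: every output neuron is
# connected to every neuron of the preceding layer, so the weights form
# a full (n_in x n_out) matrix.
rng = np.random.default_rng(0)
n_in, n_out = 4, 3
W = rng.normal(0, 0.1, size=(n_in, n_out))  # random init breaks symmetry
b = np.zeros(n_out)

x = rng.normal(size=n_in)                   # activations of the preceding layer
h = np.maximum(0, x @ W + b)                # forward pass with ReLU

# With zero initialization every column of W would be identical, so every
# neuron would compute the same output and receive the same gradient.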

Q7. Compare the learnings

The following graph shows the learning speeds versus the number of epochs for four hidden layers, where Hidden layer 1 and Hidden layer 4 are the first and last hidden layers, respectively.





Considering the graph, mark the correct option.

Choose the correct answer from below:

A.     The initial layers learn slower since the weights of the initial layers are always higher

B.     The gradients for initial layers are smaller than those of later ones, causing slow learning in them

C.      The slow learning in initial layers is due to the dead activation of sigmoid in the initial layers

D.     The slow learning in initial layers is due to faster learning at initial stages

Ans: B

Correct option: The gradients for the initial layers are smaller than those of the later ones, causing slow updating of weights and biases in them and hence slow learning.

Explanation:

  • Recall that the backpropagation algorithm involves heavy usage of the chain rule, which means the gradient received by the initial layers is a long product of terms. During backpropagation, if the gradients at the final layers are fairly small, the initial layers receive a very tiny gradient, so their weights get updated very slowly (a short numeric illustration follows this list).
  • One option suggests that the initial layers don't require much updating, which is wrong: the weights are initialized randomly and do require significant updating, so it can't be true that the earlier layers need less updating.
  • One option suggests that the initial layers learn slower because they learn faster in the initial epochs, which is not true.
  • One option says it is due to dead activation of sigmoid in the initial layers, which is not true; that is not what causes the slow learning here.
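Here is a small, purely illustrative calculation of the effect (the pre-activation value 0.5 and the weight value 0.5 are arbitrary assumptions): the sigmoid's local derivative is at most 0.25, so the chain-rule product reaching earlier layers keeps shrinking.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.5                                        # an arbitrary pre-activation
local_grad = sigmoid(z) * (1 - sigmoid(z))     # sigmoid derivative, ~0.235

grad = 1.0
for layer in range(4, 0, -1):                  # backprop from layer 4 down to layer 1
    grad *= local_grad * 0.5                   # 0.5: an assumed weight value
    print("gradient magnitude reaching hidden layer", layer, ":", round(grad, 6))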
