
Deep Learning: UNIT-II: CNN: 3. Pooling Layers

 3. Pooling Layer

Apart from Conv2D, we also used another specific module called MaxPooling2D in our model, one that has no learnable parameters.

Let's try to understand what pooling is and what the significance of a pooling module is.

The pooling layer helps in reducing the height and width of the image. Thus, it helps in reducing the number of parameters and also makes the model invariant to small shifts in the position of features in the input. In other words, it induces "compositionality" in our model architecture.

There are two kinds of pooling:

  1. Max pooling: A filter of size (f, f) slides over the input with a given stride, and we take the maximum value in each region. It is typically used in between the Conv layers.
  2. Average pooling: A filter of size (f, f) slides over the input with a given stride, and we take the average of each region. This type of pooling is generally used just before the fully connected layers for feature extraction!

The above figure illustrates max pooling with a (2,2) filter and a stride of 2.




The above figure illustrates average pooling with a (2,2) filter and a stride of 2.
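As a sketch of what these two operations compute, here is a minimal NumPy implementation (pool2d is an illustrative helper written for this example, not a library function):

```python
import numpy as np

def pool2d(x, f=2, s=2, mode="max"):
    """Slide an (f, f) window over x with stride s and take the max/average."""
    h = (x.shape[0] - f) // s + 1
    w = (x.shape[1] - f) // s + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = x[i * s:i * s + f, j * s:j * s + f]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 8, 3, 0],
              [4, 2, 9, 5]], dtype=float)

print(pool2d(x, mode="max"))      # [[6. 4.] [8. 9.]]
print(pool2d(x, mode="average"))  # [[3.75 2.25] [5.25 4.25]]
```

Note how each 2x2 region of the 4x4 input collapses to a single number, halving both height and width.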

Do pooling layers add any learnable weights to the network?

  • No. Notice that no kernel/filter weight matrix is involved; the operation (max or average) is fixed.
  • Hence no parameters are needed while computing pooling.

What is the output shape after a pooling layer?

The output shape along each spatial dimension is given by:

    output size = floor((n - f) / s) + 1

where n is the input size, f is the pooling filter size and s is the stride size.
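The formula above can be checked with a tiny helper (pooled_size is a hypothetical name used only for this sketch):

```python
def pooled_size(n, f, s):
    """Output length along one dimension after pooling with no padding."""
    return (n - f) // s + 1

print(pooled_size(4, 2, 2))   # 2
print(pooled_size(28, 2, 2))  # 14
print(pooled_size(5, 2, 2))   # 2  (the leftover row/column is dropped)
```

With the common choice f = s = 2, the height and width are roughly halved, which is exactly what the figures above show.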

 

How does Pooling work in Keras?

layers.MaxPooling2D()

layers.AveragePooling2D()

Both these pooling layers have 2 important arguments:

  1. pool_size: The size of the window over which pooling is applied. By default this size is 2. It can be an integer or a tuple of 2 integers. It can also be thought of as a non-learnable filter that either takes the max over the region or averages its values.
  2. strides: This is the same as in the Conv2D layer. The default is None, which means the stride size is the same as the pool size.
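Assuming a TensorFlow/Keras environment, a minimal sketch of both layers and the default stride behaviour (strides=None, i.e. strides equal to pool_size, which halves height and width):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# A toy (batch, height, width, channels) input holding the values 0..15
x = np.arange(16, dtype="float32").reshape(1, 4, 4, 1)

# pool_size=2 with the default strides=None -> stride of 2, so H and W halve
max_pool = layers.MaxPooling2D(pool_size=2)
avg_pool = layers.AveragePooling2D(pool_size=2)

print(max_pool(x).shape)  # (1, 2, 2, 1)
print(avg_pool(x).shape)  # (1, 2, 2, 1)
```

Neither layer reports any trainable weights, consistent with the discussion above.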

 


Deep Learning: UNIT-II CNN -1. Introduction

 1.    Introduction:

What is Computer Vision (CV)?

Simply put, Computer Vision is a field of AI that enables machines to derive meaningful information from images. Some examples:

  1. Assigning classes to an entire image
  2. Detect all the Objects in the Image
  3. Extract Text from the Image
  4. Write one line describing the image (known as Image Captioning)
  5. Generating images


These problems might be easy for humans, as we have been trained on many years of visual data, but what about computers?

Before even answering this question, let's investigate –

What does a Computer actually see?

  • A digital image is a matrix of size (H, W, C), comprising numbers (known as pixel values) typically ranging from 0-255. Here H, W, C denote the Height, Width and Number of Channels of the image.
  • For a grayscale image, the number of channels is 1, while for a colored image it is 3.
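A quick NumPy sketch of what a computer actually "sees" (random pixel values are used only as a stand-in for a real image):

```python
import numpy as np

# A digital image is just an array of pixel values in the range [0, 255]
gray = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)      # 1 channel
color = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)  # R, G, B

print(gray.shape, color.shape)  # (32, 32) (32, 32, 3)
```

To the machine, both are nothing more than grids of integers; all the "meaning" must be extracted from these numbers.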

 

Q. Is CV Easy? We used Google's Vision API to find out what it detects in the above image.
And the result was ......BAD

 

 

Why is CV Hard?

  1. Occlusion
  2. Illumination variability
  3. Pose Variability

Note that the above image of the dog was artificially generated using a Stable Diffusion model, with the following prompt: A dog made of origami. Highly detailed. Photograph. 4k. Colorful.

 

Scope of CV is Enormous!

  • Computer vision is used in industry-leading products like self-driving cars, automated robots, drones, medical image analysis, etc.
  • Apart from this, CV is also used in conjunction with other fields of AI like Speech and NLP.
  • For example, in Speech Recognition we first convert audio chunks into 2D matrices (called mel spectrograms) and treat them like images. In NLP, we use CV techniques for Image Captioning and, very recently, Image Synthesis from prompts.

Why Deep Neural Network for Images?

In order to understand why DNNs work for images, we need to account for how the visual cortex functions!

Our visual cortex is arranged in layers, and as information passes from the eyes to deeper parts of the brain, higher and higher order representations are formed. Since a DNN also works in a similar way, using it for analyzing images makes perfect sense.

The problems with MLP for Image Data:

a. MLP will react differently to an image and its shifted version

b. MLP does not consider spatial relations

c. Includes too many parameters

 

 

a. MLP will react differently to an image and its shifted version

  • Since the MLP flattens the image, it is not position invariant:
  • when we feed an image to an ANN, it is converted into a single feature vector,
  • hence the relationship between neighbouring pixels is lost, and
  • most importantly, so is the separation of the image channels (R-G-B).

Let's take an example:

  • Suppose we have two images of the same dog but at two different positions,
  • one on the upper left and one on the middle right.

  • Now, since the MLP flattens the matrix, the neurons that are most active for the first image will be dormant for the second one,
  • making the MLP think these two images contain completely different objects.
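A minimal NumPy sketch of this shift problem, using a toy 2x2 "object" placed at two different positions in a 4x4 image:

```python
import numpy as np

# The same 2x2 "object" at the upper-left vs. the lower-right of a 4x4 image
img1 = np.zeros((4, 4)); img1[0:2, 0:2] = 1
img2 = np.zeros((4, 4)); img2[2:4, 2:4] = 1

v1, v2 = img1.flatten(), img2.flatten()
# After flattening, the two vectors activate completely disjoint input
# positions, so a fully connected layer sees them as unrelated inputs.
print(np.count_nonzero(v1 * v2))  # 0 -> no overlap at all
```

Even though both images contain the exact same pattern, not a single input position is shared between the two flattened vectors.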

b. MLP does not consider spatial relations

  • Spatial information (e.g. whether a person is standing to the right of the car, or the red car is to the left of the blue bike) gets lost when the image is flattened.
  • Flattening also loses the internal 2D structure of the image.

c. Includes too many parameters

  • Since an MLP is a fully connected model, it requires a weight for every input pixel of the image.
  • Now let's take an example with an image of size (1280 x 720):
    • For an image of this size, the input-layer vector becomes (921600 x 1). If a Dense layer of 128 neurons is used, the number of weights equals 921600 * 128, roughly 118 million.
    • This makes MLP infeasible for large images and may cause overfitting.
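The arithmetic above can be checked directly (a sketch following the numbers in the example):

```python
h, w = 720, 1280           # image height and width (grayscale, 1 channel)
inputs = h * w              # 921,600 input features after flattening
hidden = 128                # width of the first Dense layer

weights = inputs * hidden   # one weight per (input pixel, neuron) pair
biases = hidden             # one bias per neuron
print(weights + biases)     # 117,964,928 parameters in the first layer alone
```

Nearly 118 million parameters before the network has even done anything useful; a convolutional layer with a handful of small filters achieves local feature extraction with a few hundred parameters instead.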

 

Do we even require global connectivity?

  • The global connectivity caused by densely connected neurons leads to many redundant parameters, which makes the MLP overfit.

From all the above discussion, we need:

  • to make the system translation (position) invariant
  • to leverage the spatial correlation between the pixels
  • to focus only on local connectivity

 

What should be the SPECIAL Features of CNN?

From the above discussion, and taking inspiration from our visual cortex system, there are 3 essential properties of image data:

1. LOCALITY: correlation between neighbouring pixels in an image

2. STATIONARITY: similar patterns appearing multiple times in an image

3. COMPOSITIONALITY: extracting higher-level features by pooling lower-level features

 

 

 

 

 

Deep Learning: UNIT-2 CNN

 UNIT II

CNN

1.     Introduction
2.     striding and padding
3.     pooling layers
4.     structure
5.     operations and prediction of CNN with layers
6.     CNN -Case study with MNIST

