1. Introduction: What is Computer Vision (CV)?
Simply put, Computer Vision is a field of AI that enables machines to derive meaningful information from images. Some examples:
- Assigning a class to an entire image (image classification)
- Detecting all the objects in an image (object detection)
- Extracting text from an image
- Writing a one-line description of an image (known as image captioning)
- Generating images
These problems might be easy for humans, since we have been trained on many years of visual data, but what about computers? Before even answering this question, let's investigate: what does a computer actually see?
- A digital image is a matrix of size (H, W, C), comprising numbers (known as pixel values) typically ranging from 0 to 255. Here H, W and C denote the height, width and number of channels of the image.
- A grayscale image has 1 channel, while a color image has 3.
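As a quick sketch of the point above, here are the two matrix shapes built with NumPy (NumPy is our choice here, not something the post prescribes; the pixel values are synthetic):

```python
import numpy as np

# A synthetic 4x4 grayscale "image": a single channel of values in 0-255
gray = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
print(gray.shape)   # (4, 4) -> H x W, one implicit channel

# A synthetic 4x4 color image: three channels (R, G, B)
color = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
print(color.shape)  # (4, 4, 3) -> H x W x C

# Pixel values stay in the 0-255 range of an 8-bit image
print(int(color.min()) >= 0 and int(color.max()) <= 255)
```

Real images loaded from disk (e.g. with PIL or OpenCV) produce the same kinds of arrays, just with larger H and W.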
Q. Is CV easy? We used Google's Vision API to find out what it detects in the above image, and the result was... BAD.
Why is CV hard?
- Occlusion
- Illumination variability
- Pose variability
Note that the above image of a dog was artificially generated using the Stable Diffusion model, with the following prompt: "A dog made of origami. Highly detailed. Photograph. 4k. Colorful."
The Scope of CV is Enormous!
- Computer vision is used in industry-leading products like self-driving cars, automated robots, drones, medical image analysis, etc.
- Apart from this, CV is also used in conjunction with other fields of AI like speech and NLP.
- For example, in speech recognition we first convert audio chunks into 2D matrices (called mel spectrograms) and treat them like images. In NLP, we use CV techniques for image captioning and, very recently, image synthesis from text prompts.
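To make the audio-to-image idea concrete, here is a minimal, dependency-free sketch. It builds a plain magnitude spectrogram with NumPy's FFT, not a true mel spectrogram (the mel step would additionally re-bin frequencies on a perceptual scale, typically via a library like librosa), but it shows how a 1-D signal becomes a 2-D matrix a CNN can treat like an image. The frame length, hop size and test tone are all illustrative choices:

```python
import numpy as np

def spectrogram(audio, frame_len=256, hop=128):
    """Turn a 1-D audio signal into a 2-D time-frequency matrix.

    Slide a window over the signal, take the magnitude of the FFT of
    each frame, and stack the frames as columns of a matrix.
    """
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames, axis=1)  # shape: (freq_bins, time_steps)

# One second of a 440 Hz tone sampled at 8 kHz
t = np.linspace(0, 1, 8000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # a 2-D matrix, ready to be treated like a grayscale image
```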
Why Deep Neural Networks for Images?
To understand why DNNs work, we need to consider how the visual cortex functions. Our visual cortex is arranged in layers, and as information passes from the eyes to deeper parts of the brain, higher and higher order representations are formed. Since a DNN also works in a similar layered way, using it to analyze images makes perfect sense.
The problems with MLPs for image data:
a. An MLP reacts differently to an image and its shifted version
b. An MLP does not consider spatial relations
c. An MLP includes too many parameters
a. An MLP reacts differently to an image and its shifted version
- Since an MLP flattens the image, it is not position invariant.
- Feeding an image to an ANN converts it into a single feature vector, so neighbouring pixels, and most importantly the image channels (R-G-B), are no longer considered together.
Let's take an example:
- Suppose we have two images of the same dog, but at two different positions.
- Since the MLP flattens the matrix, the neurons that are most active for the first image will be dormant for the second one.
- This makes the MLP think the two images contain completely different objects.
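The shift-variance argument above can be demonstrated in a few lines. This is a toy sketch with synthetic 5x5 binary images standing in for the two dog photos, and a random weight vector standing in for a first-layer MLP neuron:

```python
import numpy as np

# Two 5x5 "images" of the same 2x2 blob at different positions
img_a = np.zeros((5, 5)); img_a[1:3, 1:3] = 1.0  # blob near the top-left
img_b = np.zeros((5, 5)); img_b[2:4, 2:4] = 1.0  # the same blob, shifted

# An MLP sees only the flattened vectors, which barely overlap:
vec_a, vec_b = img_a.ravel(), img_b.ravel()
overlap = np.sum(vec_a * vec_b)  # input positions active in BOTH images
print(overlap)                   # only 1 of the 4 blob pixels coincides

# So a first-layer neuron with fixed weights w responds very differently
# to the two images, even though they contain the "same" object:
rng = np.random.default_rng(0)
w = rng.normal(size=25)
print(w @ vec_a, w @ vec_b)  # two different pre-activations
```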
b. An MLP does not consider spatial relations
- Spatial information (e.g. whether a person is standing to the right of a car, or whether the red car is on the left of the blue bike) gets lost when the image is flattened.
- Flattening also loses the internal 2D structure of the image.
c. An MLP includes too many parameters
- Since an MLP is fully connected, it requires a weight from every input pixel of the image to every neuron in the first layer.
- Let's take an example with an image of size (1280 x 720).
- For an image of this size, the input-layer vector becomes (921600 x 1). If a Dense layer of 128 units is used, the number of weights equals 921600 * 128, roughly 118 million.
- This makes MLPs infeasible for large images and can cause overfitting.
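The arithmetic from the bullet above, written out (the bias count is a standard Dense-layer detail the post does not mention):

```python
# Parameter count for a single Dense layer on a flattened 1280x720 image
H, W = 1280, 720
inputs = H * W            # 921,600 input values after flattening
units = 128               # neurons in the first Dense layer

weights = inputs * units  # one weight per (input, neuron) pair
biases = units            # one bias per neuron
params = weights + biases
print(params)             # about 118 million parameters in a single layer
```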
Do we even require global connectivity?
- The global connectivity caused by densely connected neurons leads to many redundant parameters, which makes the MLP overfit.
From all the above discussion, we need:
- to make the system translation (position) invariant,
- to leverage the spatial correlation between pixels, and
- to focus only on local connectivity.
What should be the SPECIAL Features of a CNN?
From the above discussion, and taking inspiration from our visual cortex, there are 3 essential properties of image data:
1. LOCALITY: correlation between neighbouring pixels in an image
2. STATIONARITY: similar patterns appearing multiple times in an image
3. COMPOSITIONALITY: extracting higher-level features by pooling lower-level features
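Locality and stationarity are exactly what a convolutional layer exploits. As a minimal sketch (a naive NumPy loop, not an optimized implementation, applied to a synthetic image with one vertical edge):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in most
    deep learning libraries).

    LOCALITY: each output value depends only on a small, kernel-sized patch.
    STATIONARITY: the SAME kernel weights slide over the whole image,
    so a pattern is detected wherever it appears.
    """
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to an image with one vertical edge
image = np.zeros((6, 6)); image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0]])  # responds to a left-to-right jump
edges = conv2d(image, kernel)
print(edges)  # the response appears at the edge column, all the way down
```

Note the parameter count: this layer has 2 weights regardless of image size, versus the ~118 million of the Dense layer from the previous section.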