1. Introduction: What is Computer Vision (CV)?
Simply put, Computer Vision is a field of AI that enables machines to derive meaningful information from images. Some examples:
- Assigning a class to an entire image (image classification)
- Detecting all the objects in an image (object detection)
- Extracting text from an image
- Writing a one-line description of an image (known as image captioning)
- Generating images
These problems might be easy for humans, since we have been trained on many years of visual data, but what about computers? Before even answering this question, let's investigate: what does a computer actually see?
- A digital image is a matrix of size (H, W, C), comprising numbers (known as pixel values) typically ranging from 0 to 255. Here H, W and C denote the height, width and number of channels of the image.
- A grayscale image has 1 channel, while a color image has 3.
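As a quick sketch of the point above, here are the two matrix shapes built with NumPy (NumPy is our choice here, not something the post prescribes; the pixel values are synthetic):

```python
import numpy as np

# A synthetic 4x4 grayscale "image": a single channel of values in 0-255
gray = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
print(gray.shape)   # (4, 4) -> H x W, one implicit channel

# A synthetic 4x4 color image: three channels (R, G, B)
color = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
print(color.shape)  # (4, 4, 3) -> H x W x C

# Pixel values stay in the 0-255 range of an 8-bit image
print(int(color.min()) >= 0 and int(color.max()) <= 255)
```

Real images loaded from disk (e.g. with PIL or OpenCV) produce the same kinds of arrays, just with larger H and W.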
Q. Is CV easy? We used Google's Vision API to find out what it detects in the above image, and the result was... BAD.
Why is CV hard?
- Occlusion
- Illumination variability
- Pose variability
Note that the above image of a dog was artificially generated using the Stable Diffusion model, with the following prompt: "A dog made of origami. Highly detailed. Photograph. 4k. Colorful."
The Scope of CV is Enormous!
- Computer vision is used in industry-leading products like self-driving cars, automated robots, drones, medical image analysis, etc.
- Apart from this, CV is also used in conjunction with other fields of AI like speech and NLP.
- For example, in speech recognition we first convert audio chunks into 2D matrices (called mel spectrograms) and treat them like images. In NLP, we use CV techniques for image captioning and, very recently, image synthesis from text prompts.
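To make the audio-to-image idea concrete, here is a minimal, dependency-free sketch. It builds a plain magnitude spectrogram with NumPy's FFT, not a true mel spectrogram (the mel step would additionally re-bin frequencies on a perceptual scale, typically via a library like librosa), but it shows how a 1-D signal becomes a 2-D matrix a CNN can treat like an image. The frame length, hop size and test tone are all illustrative choices:

```python
import numpy as np

def spectrogram(audio, frame_len=256, hop=128):
    """Turn a 1-D audio signal into a 2-D time-frequency matrix.

    Slide a window over the signal, take the magnitude of the FFT of
    each frame, and stack the frames as columns of a matrix.
    """
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.stack(frames, axis=1)  # shape: (freq_bins, time_steps)

# One second of a 440 Hz tone sampled at 8 kHz
t = np.linspace(0, 1, 8000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # a 2-D matrix, ready to be treated like a grayscale image
```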
Why Deep Neural Networks for Images?
To understand why DNNs work, we need to consider how the visual cortex functions. Our visual cortex is arranged in layers, and as information passes from the eyes to deeper parts of the brain, higher and higher order representations are formed. Since a DNN also works in a similar layered way, using it to analyze images makes perfect sense.
The problems with MLPs for image data:
a. An MLP reacts differently to an image and its shifted version
b. An MLP does not consider spatial relations
c. An MLP includes too many parameters
a. An MLP reacts differently to an image and its shifted version
- Since an MLP flattens the image, it is not position invariant.
- Feeding an image to an ANN converts it into a single feature vector, so neighbouring pixels, and most importantly the image channels (R-G-B), are no longer considered together.
Let's take an example:
- Suppose we have two images of the same dog, but at two different positions.
- Since the MLP flattens the matrix, the neurons that are most active for the first image will be dormant for the second one.
- This makes the MLP think the two images contain completely different objects.
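The shift-variance argument above can be demonstrated in a few lines. This is a toy sketch with synthetic 5x5 binary images standing in for the two dog photos, and a random weight vector standing in for a first-layer MLP neuron:

```python
import numpy as np

# Two 5x5 "images" of the same 2x2 blob at different positions
img_a = np.zeros((5, 5)); img_a[1:3, 1:3] = 1.0  # blob near the top-left
img_b = np.zeros((5, 5)); img_b[2:4, 2:4] = 1.0  # the same blob, shifted

# An MLP sees only the flattened vectors, which barely overlap:
vec_a, vec_b = img_a.ravel(), img_b.ravel()
overlap = np.sum(vec_a * vec_b)  # input positions active in BOTH images
print(overlap)                   # only 1 of the 4 blob pixels coincides

# So a first-layer neuron with fixed weights w responds very differently
# to the two images, even though they contain the "same" object:
rng = np.random.default_rng(0)
w = rng.normal(size=25)
print(w @ vec_a, w @ vec_b)  # two different pre-activations
```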
b. An MLP does not consider spatial relations
- Spatial information (e.g. whether a person is standing to the right of a car, or whether the red car is on the left of the blue bike) gets lost when the image is flattened.
- Flattening also loses the internal 2D structure of the image.
c. An MLP includes too many parameters
- Since an MLP is fully connected, it requires a weight from every input pixel of the image to every neuron in the first layer.
- Let's take an example with an image of size (1280 x 720).
- For an image of this size, the input-layer vector becomes (921600 x 1). If a Dense layer of 128 units is used, the number of weights equals 921600 * 128, roughly 118 million.
- This makes MLPs infeasible for large images and can cause overfitting.
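The arithmetic from the bullet above, written out (the bias count is a standard Dense-layer detail the post does not mention):

```python
# Parameter count for a single Dense layer on a flattened 1280x720 image
H, W = 1280, 720
inputs = H * W            # 921,600 input values after flattening
units = 128               # neurons in the first Dense layer

weights = inputs * units  # one weight per (input, neuron) pair
biases = units            # one bias per neuron
params = weights + biases
print(params)             # about 118 million parameters in a single layer
```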
Do we even require global connectivity?
- The global connectivity caused by densely connected neurons leads to many redundant parameters, which makes the MLP overfit.
From all the above discussion, we need:
- to make the system translation (position) invariant,
- to leverage the spatial correlation between pixels, and
- to focus only on local connectivity.
What should be the SPECIAL Features of a CNN?
From the above discussion, and taking inspiration from our visual cortex, there are 3 essential properties of image data:
1. LOCALITY: correlation between neighbouring pixels in an image
2. STATIONARITY: similar patterns appearing multiple times in an image
3. COMPOSITIONALITY: extracting higher-level features by pooling lower-level features
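Locality and stationarity are exactly what a convolutional layer exploits. As a minimal sketch (a naive NumPy loop, not an optimized implementation, applied to a synthetic image with one vertical edge):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (strictly, cross-correlation, as in most
    deep learning libraries).

    LOCALITY: each output value depends only on a small, kernel-sized patch.
    STATIONARITY: the SAME kernel weights slide over the whole image,
    so a pattern is detected wherever it appears.
    """
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge detector applied to an image with one vertical edge
image = np.zeros((6, 6)); image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0]])  # responds to a left-to-right jump
edges = conv2d(image, kernel)
print(edges)  # the response appears at the edge column, all the way down
```

Note the parameter count: this layer has 2 weights regardless of image size, versus the ~118 million of the Dense layer from the previous section.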