U-Net Paper Review
Last time, we talked about CNNs; this time, let's look at one place where they are put to work.
The U-Net is a segmentation model developed by Olaf Ronneberger et al. for biomedical image segmentation.
If you look at the architecture of U-Net, it is symmetric and consists of an encoder and a decoder.
The encoder is called the contracting path, as it reduces the height and width of the input image; the decoder is called the expansive path, as it expands the height and width of the tensor produced by the encoder.
The key thing to notice in the U-Net architecture is that the encoder halves the spatial dimensions while doubling the number of feature channels
at each encoder block. By contrast, the decoder doubles the spatial dimensions while halving the number of feature channels.
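To make this pattern concrete, here is a minimal PyTorch sketch of one encoder step (the channel sizes and the "same" padding are my assumptions; the original paper uses unpadded convolutions, which also shrink the feature maps by a few pixels per block):

```python
import torch
import torch.nn as nn

# One encoder step: two 3x3 convs (+ ReLU), then 2x2 max pooling.
x = torch.randn(1, 64, 256, 256)  # (batch, channels, height, width)
down = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2),
)
print(down(x).shape)  # torch.Size([1, 128, 128, 128]) -- channels doubled, spatial halved
```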
First of all, let's talk about the "Encoder" part.
The encoder extracts feature maps from the image so that the network can learn an abstract representation of it.
It consists of two 3x3 convolution layers, each followed by a ReLU activation function. ReLU introduces non-linearity, which lets the network model relationships a purely linear network could not.
The output of the second ReLU is later concatenated with the corresponding decoder layer through a skip connection.
The encoder ends each block with a 2x2 max pooling layer that halves the spatial dimensions. This reduces the total computational cost, since subsequent layers operate on smaller feature maps rather than on the full-resolution input.
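As a rough sketch of one encoder block, continuing with the same imports as above (the class names ConvBlock and EncoderBlock, the channel arguments, and the "same" padding are my own choices, not the paper's):

```python
class ConvBlock(nn.Module):
    """Two 3x3 convolutions, each followed by ReLU (a minimal sketch;
    'same' padding is assumed, unlike the unpadded convs in the paper)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class EncoderBlock(nn.Module):
    """Conv block followed by 2x2 max pooling; the pre-pooling features
    are returned separately so they can serve as the skip connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = ConvBlock(in_ch, out_ch)
        self.pool = nn.MaxPool2d(kernel_size=2)
    def forward(self, x):
        skip = self.conv(x)     # kept for the decoder (skip connection)
        down = self.pool(skip)  # spatial dimensions halved
        return skip, down
```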
Next, the "Decoder" part.
The decoder takes the output of the encoder network (which holds an abstract representation of the input image) and generates a semantic segmentation mask as its output.
At each decoder block, the incoming tensor goes through a 2x2 transpose convolution and is concatenated with the corresponding skip-connection feature map from the encoder block. In this way, the decoder
recovers fine-grained features lost during downsampling and gains additional spatial information about the input image. After that, just like in the encoder block, two 3x3 convolutions are used, each followed by a ReLU activation function.
At the end of the decoder, the output passes through a 1x1 convolution with a sigmoid activation for pixel-wise classification (sigmoid covers the binary case; softmax would be used for multiple classes).
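Continuing the sketch above, a decoder block might look like this (again an assumed layout reusing the hypothetical ConvBlock; the paper's unpadded convolutions would additionally require cropping the skip features before concatenation):

```python
class DecoderBlock(nn.Module):
    """2x2 transpose convolution, concatenation with the encoder's skip
    features, then the usual two 3x3 convolutions with ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = ConvBlock(out_ch * 2, out_ch)  # *2: skip features are concatenated
    def forward(self, x, skip):
        x = self.up(x)                   # double the spatial dimensions
        x = torch.cat([x, skip], dim=1)  # recover fine-grained encoder features
        return self.conv(x)

# Final layer: 1x1 convolution + sigmoid for binary pixel-wise classification.
head = nn.Sequential(nn.Conv2d(64, 1, kernel_size=1), nn.Sigmoid())
```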
I believe it is also important to talk about the skip connections used in the network.
What skip connections do is simply carry features from encoder blocks over to the corresponding decoder blocks; despite this simplicity, the carried-over features help the decoder avoid degradation and give it additional spatial
information that would otherwise be lost to downsampling.
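Note that in U-Net the skip features are concatenated along the channel axis rather than summed element-wise (as in ResNet); a quick illustration with assumed tensor sizes:

```python
import torch

up = torch.randn(1, 256, 64, 64)    # upsampled decoder features
skip = torch.randn(1, 256, 64, 64)  # encoder features at the same resolution
merged = torch.cat([up, skip], dim=1)
print(merged.shape)  # torch.Size([1, 512, 64, 64]) -- channels stack, spatial size unchanged
```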
Lastly, let's talk about the "Bridge" that connects the encoder network and the decoder network.
The bridge "consists of two 3x3 convolutions, where each convolution is followed by a ReLU activation function".
1. Use "Batch Normalization Layer"
- This reduces internal covariate shift and makes the network more stable.
2. Use "Dropout"
- By applying dropout after the ReLU, the network learns different representations of the input images and becomes less dependent on particular nodes.
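Putting the pieces together, here is a compact end-to-end sketch that includes the bridge and the dropout tip (everything here, from the class layout to the channel sizes, is my assumption of a typical implementation; for tip 1, an nn.BatchNorm2d(out_ch) would go between each Conv2d and its ReLU inside ConvBlock):

```python
class UNet(nn.Module):
    """A compact U-Net sketch reusing the hypothetical ConvBlock,
    EncoderBlock, and DecoderBlock from the earlier snippets."""
    def __init__(self, in_ch=1, n_classes=1, dropout=0.5):
        super().__init__()
        self.enc1 = EncoderBlock(in_ch, 64)
        self.enc2 = EncoderBlock(64, 128)
        self.enc3 = EncoderBlock(128, 256)
        self.enc4 = EncoderBlock(256, 512)
        self.bridge = ConvBlock(512, 1024)  # the bridge between the two paths
        self.drop = nn.Dropout2d(dropout)   # tip 2: dropout after the ReLUs
        self.dec4 = DecoderBlock(1024, 512)
        self.dec3 = DecoderBlock(512, 256)
        self.dec2 = DecoderBlock(256, 128)
        self.dec1 = DecoderBlock(128, 64)
        self.head = nn.Conv2d(64, n_classes, kernel_size=1)

    def forward(self, x):
        s1, x = self.enc1(x)
        s2, x = self.enc2(x)
        s3, x = self.enc3(x)
        s4, x = self.enc4(x)
        x = self.drop(self.bridge(x))
        x = self.dec4(x, s4)
        x = self.dec3(x, s3)
        x = self.dec2(x, s2)
        x = self.dec1(x, s1)
        return torch.sigmoid(self.head(x))

model = UNet()
print(model(torch.randn(1, 1, 256, 256)).shape)  # torch.Size([1, 1, 256, 256])
```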
Before we finish, there are some key features you may want to know about U-Net [2].
1. U-Net learns segmentation in an end-to-end setting.
- You input a raw image and get a segmentation map as the output.
2. U-Net is able to precisely localize and distinguish borders.
- Performs classification on every pixel so that the input and output share the same size.
3. U-Net uses very few annotated images.
- Data augmentation with elastic deformations reduces the number of annotated images required for training (a sketch of such a deformation follows this list).
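As an illustration of that last point, here is one common way to implement elastic deformation with NumPy/SciPy (the function name and the alpha/sigma defaults are my assumptions, loosely following the smooth-displacement-field idea the paper describes; the same deformation must be applied to the image and its mask):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, alpha=34.0, sigma=4.0, rng=None):
    """Apply a random elastic deformation to a 2-D image.

    A random displacement field is smoothed with a Gaussian filter
    (sigma controls smoothness) and scaled by alpha (intensity).
    """
    rng = np.random.default_rng() if rng is None else rng
    shape = image.shape
    # Smooth random displacement field for each axis.
    dx = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, shape), sigma) * alpha
    y, x = np.meshgrid(np.arange(shape[0]), np.arange(shape[1]), indexing="ij")
    # Resample the image at the displaced coordinates.
    return map_coordinates(image, [y + dy, x + dx], order=1, mode="reflect")
```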
References:
[1] https://medium.com/analytics-vidhya/what-is-unet-157314c87634
[2] https://www.educative.io/answers/what-is-u-net
[3] https://towardsdatascience.com/understanding-semantic-segmentation-with-unet-6be4f42d4b47