3. DETR model

Object detection and instance segmentation are fundamental tasks in computer vision that play a pivotal role in a myriad of applications, ranging from autonomous driving to medical imaging. Traditional methods often leverage bounding box techniques for object localization followed by per-pixel classification to assign classes to these localized instances. However, these methods often falter when handling overlapping objects of the same class, or in scenarios where the number of objects per image varies.

Classical approaches such as Faster R-CNN, Mask R-CNN, and others, although highly effective, have struggled with these challenges due to their inherently fixed-size output space. They typically predict a fixed number of bounding boxes and classes per image, which may not match the actual number of instances in an image, especially when it varies across images. Furthermore, they may not adequately handle situations where objects of the same class overlap, leading to classification inconsistencies.

Mask R-CNN Overlapping Bounding Boxes Problem | by Buse Yaren Tekin | Towards AI

In this article, we are going to talk about MaskFormer, a method released by Facebook AI Research in 2021 for instance segmentation that transcends these limitations.

Let’s jump right in, guys! Hmm... but first, I owe you a few explanations:

Per-pixel classification refers to assigning a class label to every individual pixel in an image. Every pixel is treated independently, and the model predicts which class that pixel belongs to based on the input features at that pixel’s location. Per-pixel classification can be highly accurate for well-defined objects with clear boundaries. However, it can struggle when the objects of interest have complex shapes, overlap with each other, or sit within a cluttered background. One explanation is that these models tend to view objects in terms of their spatial boundaries first.

Consider an image depicting multiple overlapping cars. Traditional per-pixel models might struggle with such a scenario, as you can see below. Where cars overlap, these models might create a single, merged mask for the entire set of overlapping cars. They could misinterpret the scene as containing one large, oddly shaped car instead of multiple distinct cars.

Per-pixel classification usually produces a single mask for several similar objects.

Examples of models using per-pixel classification/segmentation: FCNs, ASPP, OCNet, SETR, Segmenter, ViT…
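To make the merged-mask problem concrete, here is a minimal sketch of per-pixel classification on a toy strip of six pixels. The scores, the class names, and the `per_pixel_classify` helper are all made up for illustration; real models do the same argmax over a full image.

```python
# Hypothetical toy example: a strip of 6 pixels and 2 classes.
CLASSES = ["background", "car"]

# scores[pixel][class]: made-up per-pixel class scores.
# Assume pixels 1-2 show car A and pixels 3-4 show car B.
scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.8, 0.2],
]


def per_pixel_classify(scores):
    """Give each pixel the index of its highest-scoring class, independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


labels = per_pixel_classify(scores)
print([CLASSES[i] for i in labels])
```

Pixels 1 to 4 all come out as "car", so the label map alone carries no information about where car A ends and car B begins.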

Mask classification (used in MaskFormer), on the other hand, takes a different approach. Instead of classifying each pixel independently, a mask classification model predicts a class-specific mask for each object instance in an image. This mask is essentially a binary image that indicates which pixels belong to the object instance and which don’t. In other words, a single mask represents the entire object, not just individual pixels.

In the previous example, mask classification lets us recognize that there are multiple instances of the “car” class in the image and assign each a unique mask, even where they overlap. Each car is treated as a distinct instance and given its own mask, preserving its identity separate from the other cars.

Examples of models using mask classification/segmentation: Mask R-CNN, DETR, MaX-DeepLab…
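As a rough illustration of what a mask classification output looks like, the sketch below stores one (class label, binary mask) pair per object on a toy strip of six pixels. The data and the helper names `count_instances` and `overlap` are hypothetical and not taken from any library.

```python
# Hypothetical toy example on a strip of 6 pixels.
# Each prediction is one (class label, binary mask) pair.
predictions = [
    ("car", [0, 1, 1, 1, 0, 0]),  # car A
    ("car", [0, 0, 0, 1, 1, 0]),  # car B, sharing pixel 3 with car A
]


def count_instances(predictions, class_name):
    """Count how many predicted masks carry the given class label."""
    return sum(1 for label, _ in predictions if label == class_name)


def overlap(mask_a, mask_b):
    """Return the pixel indices claimed by both masks."""
    return [i for i, (a, b) in enumerate(zip(mask_a, mask_b)) if a and b]


print(count_instances(predictions, "car"))
print(overlap(predictions[0][1], predictions[1][1]))
```

Because every object owns its mask, two cars stay two cars even when their masks share pixels.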

Consider our busy street scenario with overlapping cars. In a traditional mask classification approach, if two cars overlap, it might still be challenging to separate them as distinct entities, even if the result is better than with the per-pixel method. DETR offers an elegant solution to such problems. Instead of generating masks for each car, DETR predicts a fixed-size set of bounding boxes and associated class probabilities. This “set prediction” approach allows DETR to handle complex scenes involving overlapping objects with remarkable efficiency.
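The set prediction idea can be sketched as follows. DETR always emits the same number of slots and marks unused ones with a special “no object” class, which is how a fixed-size output can describe a varying number of objects. The slot values and the `keep_detections` helper below are invented for illustration; this is not DETR’s actual code.

```python
NO_OBJECT = "no object"

# Hypothetical output for N = 4 slots:
# (class probabilities, bounding box as x, y, width, height).
slots = [
    ({"car": 0.90, NO_OBJECT: 0.10}, (0.30, 0.50, 0.20, 0.10)),
    ({"car": 0.05, NO_OBJECT: 0.95}, (0.10, 0.10, 0.05, 0.05)),
    ({"car": 0.80, NO_OBJECT: 0.20}, (0.40, 0.50, 0.20, 0.10)),
    ({"car": 0.10, NO_OBJECT: 0.90}, (0.70, 0.20, 0.10, 0.10)),
]


def keep_detections(slots):
    """Keep only the slots whose most likely class is a real object."""
    detections = []
    for probs, box in slots:
        label = max(probs, key=probs.get)
        if label != NO_OBJECT:
            detections.append((label, probs[label], box))
    return detections


for label, score, box in keep_detections(slots):
    print(label, score, box)
```

Here two of the four slots survive, one per car, and the other two are discarded as “no object”.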

Pretty cool, but where does MaskFormer fit into this picture?

While DETR revolutionizes bounding box prediction, it doesn’t directly provide segmentation masks, a detail crucial in many applications. Here, MaskFormer steps in, extending the robust set prediction mechanism of DETR to create class-specific masks for each detected object. MaskFormer thus builds upon the strengths of DETR and augments it with the ability to generate high-quality segmentation masks. In our car scenario, MaskFormer not only recognizes each car as a separate entity (thanks to DETR’s set prediction mechanism) but also generates a precise mask for each car, accurately capturing their boundaries, even in cases of overlap.

This synergy between DETR and MaskFormer opens a world of possibilities for more accurate and efficient instance segmentation, transcending the limitations of traditional per-pixel and mask classification methods. In the next sections, we will delve deeper into the working of MaskFormer and understand its architecture and advantages.

Here is the architecture of MaskFormer:

Let’s go through this diagram together:

The term “segment” here refers to potential instances of objects in the image that the model is trying to identify and segment.

Usually, the encoder processes the input data and the decoder uses this processed data to generate the output. The inputs to the encoder and decoder are generally sequences, like sentences in a machine translation task. However, in the context of DETR and MaskFormer, the role of the encoder and decoder is somewhat different. The “encoder” in this case is the backbone (a ResNet-50 for MaskFormer), which processes the input image and generates a set of feature maps. These feature maps serve the same purpose as the encoder output in a traditional Transformer, providing a rich, high-level representation of the input data.

4. Class and Mask Prediction: These embeddings Q are then used to predict N class labels and N corresponding mask embeddings (E mask). This is where MaskFormer really shines. Unlike traditional segmentation models that predict class labels for each pixel, MaskFormer predicts class labels for each potential object segment, along with a corresponding mask embedding.

5. Binary Mask Prediction: After obtaining the mask embeddings, MaskFormer produces N binary masks through a dot product between the pixel embeddings (E pixel) and mask embeddings (E mask), followed by a sigmoid activation. This process results in potentially overlapping binary masks for each object instance.

6. Final Prediction (for Semantic Segmentation): Lastly, for tasks like semantic segmentation, MaskFormer can compute the final prediction by combining the N binary masks with their corresponding class predictions. This combination is achieved via a straightforward matrix multiplication, giving us the final segmented and classified image.
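Steps 5 and 6 can be sketched in plain Python on toy values. Everything numeric below is made up, the shapes (N segments, C channels, P flattened pixels, K classes) are notation introduced for this example, and the 0.5 threshold and the final per-pixel argmax are assumptions about how the probabilities are turned into a label map. The “no object” class is assumed to have been dropped from the class probabilities already.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_masks(e_mask, e_pixel):
    """Step 5: dot product of each mask embedding (N x C) with each
    pixel embedding (P x C), then sigmoid. Returns N x P probabilities."""
    return [
        [sigmoid(sum(m * p for m, p in zip(mask_vec, pixel_vec)))
         for pixel_vec in e_pixel]
        for mask_vec in e_mask
    ]


def semantic_map(class_probs, mask_probs):
    """Step 6: multiply class probabilities (N x K) with mask
    probabilities (N x P), then pick the best class for each pixel."""
    n, k, p = len(class_probs), len(class_probs[0]), len(mask_probs[0])
    scores = [
        [sum(class_probs[i][c] * mask_probs[i][px] for i in range(n))
         for px in range(p)]
        for c in range(k)
    ]
    return [max(range(k), key=lambda c: scores[c][px]) for px in range(p)]


# Toy values: N = 2 segments, C = 2 channels, P = 3 pixels, K = 2 classes.
E_mask = [[4.0, -2.0], [-2.0, 4.0]]
E_pixel = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
class_probs = [[0.9, 0.1], [0.2, 0.8]]  # columns: road, car

mask_probs = predict_masks(E_mask, E_pixel)
binary_masks = [[1 if p > 0.5 else 0 for p in row] for row in mask_probs]
labels = semantic_map(class_probs, mask_probs)

print(binary_masks)
print(labels)
```

In this toy case the two binary masks overlap on the middle pixel, while the combined semantic map still assigns exactly one class to every pixel.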

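Here’s a quick reminder:

The difference between semantic and instance segmentation is an important distinction in the field of computer vision. Semantic segmentation assigns a class label to every pixel without telling apart objects of the same class, whereas instance segmentation gives each individual object its own mask.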

Most traditional computer vision models treat semantic and instance segmentation as separate problems and would require different models, loss functions, and training procedures for each.

MaskFormer, however, is designed to handle both tasks in a unified manner, thanks to its mask classification approach, which works by predicting a class label and a binary mask for each object instance in the image. This approach inherently combines aspects of both semantic and instance segmentation.

For the loss function, MaskFormer uses a unified loss function that is designed to handle this mask classification problem. This loss function evaluates the quality of the predicted masks in a way that is consistent with both semantic and instance segmentation tasks.

Therefore, the same MaskFormer model, trained with the same loss function and training procedure, can be applied to both semantic and instance segmentation tasks without any modifications.

In summary, MaskFormer presents a new approach to image segmentation, integrating the strengths of the DETR model and the Transformer architecture. It uses mask-based prediction, enhancing the handling of complex object interactions within images.

Its ability to tackle both semantic and instance segmentation tasks using the same model, loss, and training procedure demonstrates the effectiveness and flexibility of mask classification. The use of a Transformer decoder allows for variable object predictions, tackling challenges with overlapping and nested instances.

MaskFormer’s unified approach is a substantial step forward in image segmentation, opening new possibilities for advancements in computer vision. It sets the stage for further research, aiming to enhance our capability to comprehend and interpret the visual world.
