Paper Picks #1: Zero-Shot Instance Segmentation with AI

For our first edition of Passion Lab’s Paper Picks, the team selected "The Devil is in the Object Boundary: Towards Annotation-free Instance Segmentation using Foundation Models" by Cheng Shi and Sibei Yang, presented at this year’s International Conference on Learning Representations (ICLR).

Link: https://arxiv.org/pdf/2404.11957

What problem does the paper address?

The paper is about detecting objects in images, specifically the case where an image contains several instances of the same kind of object (e.g. oranges on a tree) and you want to detect each one separately. This task is known as instance segmentation.

The ambition is to do this without any annotated data by leveraging a foundation model instead, i.e. in a zero-shot learning scenario. We can all imagine how tedious annotating single grapes in a fruit bowl would be…

Existing systems addressing this task still struggle with detecting the boundaries of objects. They tend either to fuse multiple instances of an object into a single one or to over-segment single objects into several pieces.

How do they solve the problem?

Their work uses CLIP as a foundation model. This is quite an interesting choice, as CLIP was trained on image-level textual descriptions rather than object-level annotations. However, it turns out that the output of a specific layer in the middle of CLIP highlights object-level boundaries. Using this output as a feature representation, they apply a pretty clever clustering mechanism that also makes use of patch-level activation maps from other CLIP layers. The output of the clustering is a set of initial boundary estimates. These are then passed to a second foundation model, Meta’s Segment Anything Model (SAM), to obtain more precise object-level boundaries.
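To make the role of boundaries a bit more concrete, here is a minimal sketch in Python. It is not the paper's clustering algorithm: it swaps in a plain flood fill on a toy grid, and the `split_instances` function, the `foreground` mask and the `boundary` map are all made up for illustration. It only shows why a boundary signal helps to separate touching objects that would otherwise be fused into one.

```python
from collections import deque


def split_instances(foreground, boundary):
    """Group 4-connected foreground cells that are not cut by boundary cells.

    foreground, boundary: equally sized 2D lists of 0/1 values (toy patch grids).
    Returns one bounding box (row_min, col_min, row_max, col_max) per region.
    """
    rows, cols = len(foreground), len(foreground[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or not foreground[r][c] or boundary[r][c]:
                continue
            # Breadth-first flood fill over one connected region.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and foreground[ny][nx] and not boundary[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes


# Toy 4x7 grid: two touching "oranges" that a purely semantic mask would fuse.
foreground = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
# Hypothetical boundary response between the two objects (column 3).
boundary = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
no_boundary = [[0] * 7 for _ in range(4)]

fused = split_instances(foreground, no_boundary)   # one box covering both objects
separated = split_instances(foreground, boundary)  # one box per object
print(fused)
print(separated)
```

Without the boundary map the two objects come out as a single box; with it, each object gets its own. In a pipeline like the one described above, rough estimates of this kind would be what gets handed to the second model for refinement.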

This is only a rough overview of the workflow, and there is a lot more technical detail in the paper. In any case, they outperform all existing zero-shot approaches, which is a great achievement!

Our conclusion

We think that this paper is a perfect example of how foundation models can be used to solve very complex problems. However, as this case shows, the path is not always straightforward and requires a lot of additional thinking, innovation and engineering. We will look into how the same principle could be applied to other domains, e.g. when segmenting sound streams or videos.
