Paper Picks #1: Zero-Shot Instance Segmentation with AI

For the first edition of Passion Lab’s Paper Picks, the team selected "The Devil is in the Object Boundary: Towards Annotation-free Instance Segmentation using Foundation Models" by Cheng Shi and Sibei Yang, presented at this year’s International Conference on Learning Representations (ICLR).


What problem does the paper address?

The paper is about instance segmentation: detecting objects in images in the specific case where there are several instances of the same object (e.g. oranges on a tree) and you want to detect and outline each orange separately.

The ambition is to do this without any annotated data by leveraging a foundation model instead, in other words in a zero-shot learning scenario. We can all imagine how tedious annotating single grapes in a fruit bowl would be…

Existing systems addressing this task still have some issues with detecting the boundaries of objects. They either fuse multiple instances of an object into a single one or tend to over-segment single objects.

How do they solve the problem?

Their work uses CLIP as the foundation model. This is quite an interesting choice, as CLIP was trained on image-level textual descriptions rather than object-level annotations. However, it turns out that the output of a specific layer in the middle of CLIP highlights object-level boundaries. Using this output as a feature representation, they apply a pretty clever clustering mechanism that also makes use of patch-level activation maps from other CLIP layers. The output of the clustering is a set of initial boundary estimates. These are then passed as prompts to a second foundation model, Meta’s Segment Anything Model (SAM), to obtain more precise object-level boundaries.
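
To make the middle of that pipeline a bit more concrete, here is a minimal sketch of the "cluster patch features, then derive prompts" idea. It is not the authors' clustering method: we swap in plain k-means with a farthest-point initialisation, run it on a synthetic 8x8 grid of 4-dimensional vectors standing in for CLIP patch features, and assume that the largest cluster is background. All function names, the synthetic grid and the patch size of 16 pixels are made up for illustration. The resulting boxes are the kind of coarse estimate that could then be handed to a promptable segmenter such as Segment Anything.

```python
import math
import random


def farthest_point_init(points, k):
    """Pick k spread-out points as initial centers (deterministic)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return [list(c) for c in centers]


def kmeans(points, k, iters=10):
    """Plain k-means (a stand-in, NOT the paper's clustering). Returns one label per point."""
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Move every center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def boxes_from_patch_features(grid, k, patch_size):
    """grid[r][c] is the feature vector of the patch at row r, column c.

    Returns one pixel-space box (x0, y0, x1, y1) per non-background cluster.
    Assumption: the cluster with the most patches is background.
    """
    h, w = len(grid), len(grid[0])
    labels = kmeans([grid[r][c] for r in range(h) for c in range(w)], k)
    background = max(range(k), key=labels.count)
    boxes = []
    for j in range(k):
        cells = [(i // w, i % w) for i, lab in enumerate(labels) if lab == j]
        if j == background or not cells:
            continue
        rows = [r for r, _ in cells]
        cols = [c for _, c in cells]
        boxes.append((min(cols) * patch_size, min(rows) * patch_size,
                      (max(cols) + 1) * patch_size, (max(rows) + 1) * patch_size))
    return sorted(boxes)


def demo_grid():
    """Synthetic stand-in for patch features: noisy background plus two 'objects'."""
    rng = random.Random(0)
    grid = [[[rng.gauss(0, 0.01) for _ in range(4)] for _ in range(8)] for _ in range(8)]
    for r in range(1, 3):          # object A: rows 1-2, columns 1-3
        for c in range(1, 4):
            grid[r][c][0] += 5.0
    for r in range(5, 7):          # object B: rows 5-6, columns 4-6
        for c in range(4, 7):
            grid[r][c][1] += 5.0
    return grid


if __name__ == "__main__":
    print(boxes_from_patch_features(demo_grid(), k=3, patch_size=16))
```

On the toy grid this prints `[(16, 16, 64, 48), (64, 80, 112, 112)]`, one box per synthetic object. With real features the number of clusters is not known in advance and instances of the same class look alike, which is exactly where the paper's boundary-aware clustering does the heavy lifting.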

This is only a rough overview of the workflow, and there is a lot more technical detail in the paper. In any case, the authors report that their approach outperforms existing zero-shot approaches, which is a great achievement!

Our conclusion

We think that this paper is a perfect example of how foundation models can be used to solve very complex problems. However, as this case shows, the path is not always straightforward and requires a lot of additional thinking, innovation and engineering. We will look into how the same principle could be applied to other domains, e.g. segmenting audio streams or videos.