Weakly-Supervised Semantic Segmentation
using Motion Cues

People

Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

Abstract

Fully convolutional neural networks (FCNNs) trained on a large number of images with strong pixel-level annotations have become the new state of the art for the semantic segmentation task. While there have been recent attempts to learn FCNNs from image-level weak annotations, they need additional constraints, such as the size of an object, to obtain reasonable performance. To address this issue, we present motion-CNN (M-CNN), a novel FCNN framework which incorporates motion cues and is learned from video-level weak annotations. Our learning scheme to train the network uses motion segments as soft constraints, thereby handling noisy motion information. When trained on weakly-annotated videos, our method outperforms the state-of-the-art approach on the PASCAL VOC 2012 image segmentation benchmark. We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images. Finally, M-CNN substantially outperforms recent approaches in a related task of video co-localization on the YouTube-Objects dataset.
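
To give an intuition for how motion segments can act as soft constraints, below is a minimal, hedged sketch in PyTorch-style Python. It is not the released code and not the exact formulation from the paper; the function names soft_motion_targets and weak_supervision_loss, and the trust weight alpha, are illustrative assumptions. The idea shown: a soft motion-foreground mask and the video-level label are combined into per-pixel soft targets, which are then matched to the network's pixel-wise predictions with a cross-entropy loss.

# Illustrative sketch only (not the authors' released code).
import torch
import torch.nn.functional as F

def soft_motion_targets(motion_fg, video_label, num_classes, alpha=0.8):
    """Build per-pixel soft label distributions from a motion segment.

    motion_fg:   (H, W) tensor in [0, 1], soft foreground probability from an
                 off-the-shelf motion segmentation method.
    video_label: integer class index given as the video-level weak label
                 (assumed to be a foreground class, i.e. > 0).
    alpha:       how strongly the motion segment is trusted (assumed value).
    """
    h, w = motion_fg.shape
    targets = torch.zeros(num_classes, h, w)
    # Inside the motion segment: bias towards the video-level class.
    targets[video_label] = alpha * motion_fg
    # Remaining probability mass goes to the background class (index 0).
    targets[0] = 1.0 - targets[video_label]
    return targets  # (C, H, W), per-pixel distributions summing to 1

def weak_supervision_loss(logits, motion_fg, video_label):
    """Cross-entropy between network predictions and soft motion-derived targets."""
    num_classes = logits.shape[0]                     # logits: (C, H, W)
    targets = soft_motion_targets(motion_fg, video_label, num_classes)
    log_probs = F.log_softmax(logits, dim=0)          # (C, H, W)
    return -(targets * log_probs).sum(dim=0).mean()   # average over pixels

The sketch only illustrates the soft-constraint idea; see the paper for the actual learning scheme used to train M-CNN.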

Paper

ECCV 2016 Paper

BibTeX
@InProceedings{Tokmakov16a,
  author    = "Tokmakov, P. and Alahari, K. and Schmid, C.",
  title     = "Weakly-Supervised Semantic Segmentation using Motion Cues",
  booktitle = "ECCV",
  year      = "2016"
}

Technical report

BibTeX
@Article{Tokmakov16,
  author    = "Tokmakov, P. and Alahari, K. and Schmid, C.",
  title     = "Learning Semantic Segmentation with Weakly-Annotated Videos",
  journal   = "ArXiv e-prints, arXiv:1603.07188",
  year      = "2016"
}

Code

You can find the code and the trained models at this link. All the details are given in the README.txt file.
Please note that the release includes only the single-label inference code and models (relevant for the ILSVRC experiments). The multi-label inference extension will be released later, but in practice the results do not change significantly.

Acknowledgements

This work was supported in part by the ERC advanced grant ALLEGRO, the MSR-Inria joint project, a Google research award and a Facebook gift. We gratefully acknowledge the support of NVIDIA with the donation of GPUs used for this research.

Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.

This page style is taken from Guillaume Seguin.