Fully convolutional neural networks (FCNNs) trained on a large number of images
with strong pixel-level annotations have become the new state of the art for
the semantic segmentation task. While there have been recent attempts to learn
FCNNs from image-level weak annotations, they need additional constraints, such
as the size of an object, to obtain reasonable performance. To address this
issue, we present motion-CNN (M-CNN), a novel FCNN framework which incorporates
motion cues and is learned from video-level weak annotations. Our learning
scheme uses motion segments as soft constraints, which lets it tolerate noisy
motion information. When trained on weakly-annotated videos, our
method outperforms the state-of-the-art EM-Adapt approach on the PASCAL VOC
2012 image segmentation benchmark. We also demonstrate that the performance of
M-CNN learned with 150 weak video annotations is on par with state-of-the-art
weakly-supervised methods trained with thousands of images. Finally, M-CNN
substantially outperforms recent approaches on the related task of video
co-localization on the YouTube-Objects dataset.