Human perception of moving objects draws on both a good grasp of geometric concepts and a broad familiarity with objects. Our ability to detect and segment moving objects is robust to the presence of multiple objects, complex background geometry, motion of the observer, and even camouflage. How humans perceive moving objects so reliably is a longstanding research question in computer vision, one that draws on findings from related fields such as psychology, cognitive science, and physics.

One approach to the problem is to teach a deep network to model all of these effects at once. This contrasts with the strategy of human vision, where cognitive processes and body design are tightly coupled and each is responsible for certain aspects of correctly identifying moving objects. Similarly, in computer vision there is evidence that classical, geometry-based techniques are better suited to the "motion-based" parts of the problem, while deep networks are better suited to modeling appearance.

In this work, we argue that the coupling of camera rotation and camera translation can create complex motion fields that are difficult for a deep network to untangle directly; the classical motion field equations below make this coupling concrete. We present a novel probabilistic model that estimates the camera's rotation from the motion field. We then rectify the flow field to obtain a rotation-compensated motion field for subsequent segmentation. This strategy of first estimating camera motion, and then letting a network learn the remaining parts of the problem, yields improved results on the widely used DAVIS benchmark as well as the recently published motion segmentation dataset MoCA (Moving Camouflaged Animals).
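To make the difficulty concrete, consider the classical instantaneous motion field equations in the style of Longuet-Higgins and Prazdny; the notation here is illustrative (up to sign conventions) and not taken from this paper's own derivation. For a camera with translational velocity $\mathbf{t} = (t_x, t_y, t_z)^\top$ and angular velocity $\boldsymbol{\omega} = (\omega_x, \omega_y, \omega_z)^\top$, the flow at image point $(x, y)$ with focal length $f$ and scene depth $Z(x, y)$ is

\[
u(x,y) = \underbrace{\frac{x\,t_z - f\,t_x}{Z(x,y)}}_{\text{translational}} \,+\, \underbrace{\frac{xy}{f}\,\omega_x - \Big(f + \frac{x^2}{f}\Big)\omega_y + y\,\omega_z}_{\text{rotational}},
\]
\[
v(x,y) = \frac{y\,t_z - f\,t_y}{Z(x,y)} \,+\, \Big(f + \frac{y^2}{f}\Big)\omega_x - \frac{xy}{f}\,\omega_y - x\,\omega_z.
\]

The rotational terms are independent of depth, whereas the translational terms are scaled by the unknown $1/Z(x,y)$, so the two components mix in a way that varies from scene to scene. This also explains why rotation compensation is possible: given an estimate of $\boldsymbol{\omega}$, the rotational component can be computed at every pixel and subtracted from the observed flow, leaving a purely translational field in which independently moving objects are easier to isolate.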