Computer Vision & Deep Learning Symposium

Wednesday-Friday, June 8, 9, 10, 2016

Venue

Grand Amphithéâtre, Inria Grenoble - Rhône-Alpes (Montbonnot/Inovallée site: Directions)

Registration

Online registration for the symposium on the 9th is mandatory, but free of charge. Registration closes June 3rd.
The defenses on June 8th and 10th can be attended without registration.

Contact

Nathalie Gillot and Jakob Verbeek (firstname.lastname@inria.fr)

Program June 8th

15:00 - 17:00 HDR defense, Jakob Verbeek: Machine learning solutions to visual recognition problems. [manuscript] [slides]
Jury: M. Cord, E. Gaussier, E. Learned-Miller, C. Schmid, T. Tuytelaars, A. Zisserman.
Abstract
I will give an overview of my activities since my arrival at INRIA Rhône-Alpes in December 2005, first as a postdoc in the LEAR team and now as a permanent researcher in the THOTH team. My contributions can be grouped along three themes: Fisher-vector image representations, metric learning, and learning visual recognition models from incomplete supervision. I will highlight several contributions in detail, and present perspectives on future research directions.

Program June 9th

9:00 - 9:30 Room open + coffee
9:30 - 10:15   Matthieu Cord (UPMC-Sorbonne Universities, Paris, France)
Deep learning and weak supervision for image classification Abstract [slides]
Deep learning and Convolutional Neural Networks (CNN) are state-of-the-art methods for various visual recognition tasks, e.g. image classification or object detection. To better identify or localize objects, bounding box annotations are often used. These rich annotations quickly become too costly to get, making the development of Weakly Supervised Learning (WSL) models appealing. I discuss several strategies to automatically select relevant image regions from weak annotations (e.g. image-level labels) in deep CNNs. I also introduce our architecture, WELDON, for WEakly supervised Learning of Deep cOnvolutional neural Networks. Our deep learning framework, leveraging recent improvements on the Multiple Instance Learning paradigm, is validated on several recognition tasks.
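For readers unfamiliar with weakly supervised region selection, the sketch below illustrates the general idea under stated assumptions (it is not the WELDON implementation): a shared classifier scores every candidate region, and the image-level prediction aggregates the top- and bottom-scoring regions, in the spirit of the Multiple Instance Learning pooling the abstract alludes to. The linear classifier, the number of regions, and the parameter k are all placeholders.

```python
import numpy as np

def weakly_supervised_image_score(region_features, w, b, k=3):
    """Toy weakly supervised scoring: a shared linear classifier scores every
    image region, and the image-level score aggregates the k highest and k
    lowest region scores (a max+min pooling over instances)."""
    scores = region_features @ w + b          # one score per candidate region
    scores = np.sort(scores)                  # ascending order
    top = scores[-k:].mean()                  # strongest positive evidence
    bottom = scores[:k].mean()                # strongest negative evidence
    return top + bottom                       # image-level prediction

# Toy usage: only an image-level label would be available, no region annotations.
rng = np.random.default_rng(0)
regions = rng.normal(size=(20, 128))          # 20 candidate regions, 128-d features
w, b = rng.normal(size=128), 0.0
print(weakly_supervised_image_score(regions, w, b))
```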
10:15 - 10:30   Coffee
10:30 - 11:15   Tinne Tuytelaars (Katholieke Universiteit Leuven, Belgium)
Lightweight CNN domain adaptation, and dynamic filter networks Abstract [slides]
First, I'll talk about domain adaptation (DA) in a deep learning (CNN) context. While good results have been obtained using "deep DA" (i.e., learning to overcome domain shift in an end-to-end manner), we argue that for many practical applications, more lightweight DA algorithms that can be applied at minimal cost are needed. Furthermore, in contrast to what's commonly believed, we show that domain shifts are not limited to the upper layers of a CNN, but can already manifest themselves in the first few layers (usually considered to be "generic"). Based on these observations, we develop a novel lightweight DA algorithm, building on the idea of filter reconstruction. Next, I'll talk about a new network architecture we proposed recently. While the filters of a traditional convolutional layer stay fixed after training, our Dynamic Filter Network generates its filters dynamically, conditioned on the input. This results in a flexible yet compact model that can be used for a wide range of applications. In particular, I will show results in the context of video prediction.
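A minimal sketch of the dynamic-filtering idea (not the proposed architecture): one function maps the conditioning input to a set of filter weights, which are then applied to that same input, so the effective filter changes per sample instead of staying fixed after training. The filter-generating projection and all shapes below are made up for illustration.

```python
import numpy as np

def generate_filter(conditioning_input, proj):
    """Toy filter-generating step: map a conditioning input (e.g. the previous
    frame) to a 3x3 filter. 'proj' stands in for a learned generator network."""
    k = np.tanh(proj @ conditioning_input.ravel())   # 9 filter values
    return k.reshape(3, 3)

def apply_dynamic_filter(image, filt):
    """Apply the sample-specific filter with a plain valid correlation."""
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i+3, j:j+3] * filt)
    return out

rng = np.random.default_rng(0)
frame = rng.normal(size=(8, 8))
proj = rng.normal(size=(9, frame.size)) * 0.1    # stand-in for a learned generator
dyn_filter = generate_filter(frame, proj)        # filter depends on the input...
filtered = apply_dynamic_filter(frame, dyn_filter)  # ...unlike fixed conv weights
print(filtered.shape)   # (6, 6)
```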
11:15 - 11:30   Coffee
11:30 - 12:15   Erik Learned-Miller (University of Massachusetts, Amherst, USA)
It's moving! A probabilistic model for causal motion segmentation in moving camera videos Abstract [slides]
The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer, and even camouflage. In addition to all of this, the ability to detect motion is nearly instantaneous. While there has been much recent progress in motion segmentation, it still appears we are far from human capabilities. In this work, we derive from first principles a new likelihood function for assessing the probability of an optical flow vector given the 3D motion direction of an object. This likelihood uses a novel combination of the angle and magnitude of the optical flow to maximize the information about the true motions of objects. Using this new likelihood and several innovations in initialization, we develop a motion segmentation algorithm that beats current state-of-the-art methods by a large margin. We compare to five state-of-the-art methods on two established benchmarks, and a third new data set of camouflaged animals, which we introduce to push motion segmentation to the next level.
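The talk's exact likelihood is not reproduced here; as a hedged toy version of the idea, the sketch below scores an observed optical flow vector against a hypothesized image-motion direction, letting the angular agreement count more when the flow magnitude is large (an angle estimated from a near-zero flow is unreliable). The von Mises-style form and the constants are assumptions for illustration only.

```python
import numpy as np

def flow_log_likelihood(flow, motion_dir, kappa=4.0, sigma=1.0):
    """Toy log-likelihood of an optical flow vector given a hypothesized 2D
    motion direction: angular agreement weighted by flow magnitude, so tiny
    flows are weakly informative. Illustrative stand-in, not the talk's model."""
    mag = np.linalg.norm(flow)
    flow_angle = np.arctan2(flow[1], flow[0])
    dir_angle = np.arctan2(motion_dir[1], motion_dir[0])
    return kappa * (mag / (mag + sigma)) * np.cos(flow_angle - dir_angle)

# Which of two candidate motions better explains an observed flow vector?
observed = np.array([1.0, 0.2])
print(flow_log_likelihood(observed, np.array([1.0, 0.0])))   # consistent direction
print(flow_log_likelihood(observed, np.array([-1.0, 0.0])))  # opposite direction
```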
12:15 - 14:00   Lunch (provided for registered participants; a canteen/restaurant is a 1-minute walk away for others)
14:00 - 14:45   Andrew Zisserman (University of Oxford, UK)
3D Shape attributes & Signs in time Abstract [slides]
The talk will cover two topics. First, an investigation of using ConvNets to infer 3D shape attributes, such as planarity, symmetry and occupied space, from a single image. For this we have assembled an annotated dataset of 150K images of over 2000 different sculptures. We show that 3D attributes can be learnt from these images and generalize to images of other (non-sculpture) object classes. This is joint work with Abhinav Gupta and David Fouhey. Second, an investigation of learning to recognise and localise short temporal signals in image time series, where strong supervision is not available for training. For this we have used a large dataset of signed gestures in British Sign Language (BSL) videos with only weak and noisy supervision. We show that suitable image encodings enable ConvNets to learn and localise the sign gestures. This is joint work with Joon Son Chung.
14:45 - 15:00   Coffee
15:00 - 15:45   Jiri Matas (Czech Technical University, Prague)
Beyond Similarity in Image Retrieval Abstract [slides]
Classically, image retrieval has been formulated as an efficient search for similar images. In very large collections, such a search is often of limited interest as it returns near duplicates. Instead, we argue that finding the *most dissimilar* images of the scene depicted in the query is desirable. We will present a simple modification of the standard large-scale image retrieval pipeline that can be used to find dissimilar images of the query scene, in terms of geometry (a very different resolution, viewpoint), photometry (different lighting conditions), or photographers' interest. As an application, we will show the benefits of the rich retrieval output for a 3D reconstruction pipeline. The new retrieval formulations will be demonstrated in real time if the speed of the Internet connection permits.
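A rough sketch of how such a pipeline modification could look, under stated assumptions: retrieve and geometrically verify candidate images of the query scene as usual, then re-rank them by a dissimilarity criterion (here, the scale change implied by the verified transform) instead of by visual similarity. The stand-in callables and the "log_scale" field are hypothetical, not part of any real retrieval library or of the presented method.

```python
def rank_most_dissimilar(query, database, retrieve_candidates, verify_geometry):
    """Retrieve verified images of the query scene, then order them by a
    geometric dissimilarity measure (largest change first)."""
    ranked = []
    for image in retrieve_candidates(query, database):
        transform = verify_geometry(query, image)   # e.g. an affine/homography fit
        if transform is None:                       # not the same scene: drop it
            continue
        scale_change = abs(transform["log_scale"])  # assumed field of the fit
        ranked.append((scale_change, image))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [image for _, image in ranked]

# Toy usage with made-up stand-ins for retrieval and spatial verification.
dummy_db = ["img_a", "img_b", "img_c"]
retrieve = lambda q, db: db
verify = lambda q, im: {"log_scale": {"img_a": 0.1, "img_b": 2.0, "img_c": 0.7}[im]}
print(rank_most_dissimilar("query", dummy_db, retrieve, verify))
# -> ['img_b', 'img_c', 'img_a'] (most extreme zoom/viewpoint change first)
```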
15:45 - 16:00   Coffee
16:00 - 16:45   Patrick Perez (Imaging Science Lab, Technicolor, Rennes)
Reconstruction of Personalized 3D Face Rigs from Monocular Video Abstract [slides]
We present a novel approach for the automatic creation of a personalized high quality 3D face rig of an actor from just monocular video data. Our rig is based on three distinct layers that allow us to model the actor’s facial shape as well as capture his person-specific expression characteristics at high fidelity, ranging from coarse-scale geometry to fine-scale static and transient detail on the scale of folds and wrinkles. At the heart of our approach is a parametric shape prior that encodes the plausible sub-space of facial identity and expression variations. Based on this prior, a coarse-scale reconstruction is obtained by means of a novel variational fitting approach. We represent person-specific idiosyncrasies, which cannot be represented in the restricted shape and expression space, by learning a set of medium-scale corrective shapes. Fine-scale skin details, such as wrinkles, are captured from video via shading-based refinement, and a generative detail formation model is learned. Both the medium and fine-scale detail layers are coupled with the parametric prior by means of a novel sparse linear regression formulation. Once reconstructed, all layers of the face rig can be conveniently controlled by a low number of blendshape expression parameters, as widely used by animation artists. We show captured face rigs and their motions for several actors filmed in different monocular video formats, including legacy footage from YouTube, and demonstrate how they can be used for 3D animation and 2D video editing.
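For readers unfamiliar with layered, blendshape-style face parameterizations like the one the abstract describes, a minimal sketch is below: a coarse shape is produced from identity and expression coefficients over linear bases, and corrective and fine-detail layers are added on top. All bases, dimensions, and names are placeholders, not the model from the talk.

```python
import numpy as np

def layered_face_shape(alpha, beta, identity_basis, expression_basis,
                       mean_shape, corrective=None, fine_detail=None):
    """Toy layered face model: coarse parametric shape (identity + expression
    coefficients over linear bases), plus optional medium-scale corrective and
    fine-scale detail layers. The bases here are random placeholders."""
    shape = mean_shape + identity_basis @ alpha + expression_basis @ beta
    if corrective is not None:
        shape = shape + corrective       # person-specific medium-scale layer
    if fine_detail is not None:
        shape = shape + fine_detail      # wrinkle / transient-detail layer
    return shape

rng = np.random.default_rng(0)
n_vertices = 500                                   # toy mesh size
mean = rng.normal(size=3 * n_vertices)             # flattened (x, y, z) coordinates
B_id = rng.normal(size=(3 * n_vertices, 40))       # 40 identity components
B_expr = rng.normal(size=(3 * n_vertices, 25))     # 25 expression blendshapes
shape = layered_face_shape(rng.normal(size=40), rng.normal(size=25),
                           B_id, B_expr, mean)
print(shape.shape)   # (1500,)
```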

Program June 10th

13:30 - 15:30 PhD defense, Yang Hua: Towards robust visual object tracking: Proposal selection and occlusion reasoning [manuscript]
Jury: K. Alahari, J. Matas, P. Perez, F. Perronnin, D. Ramanan, C. Schmid
Abstract
In this dissertation we address the problem of visual object tracking, wherein the goal is to localize an object and determine its trajectory over time. In particular, we focus on challenging scenarios where the object undergoes significant transformations, becomes occluded, or leaves the field of view. To this end, we propose two robust methods which learn a model for the object of interest and update it, to reflect its changes over time. Our first method addresses the tracking problem in the context of objects undergoing severe geometric transformations, such as rotation or change in scale. We present a novel proposal-selection algorithm, which extends the traditional discriminative tracking-by-detection approach. This method proceeds in two stages: proposal followed by selection. In the proposal stage, we compute a candidate pool that represents the potential locations of the object by robustly estimating the geometric transformations. The best proposal is then selected from this candidate set to localize the object precisely using multiple appearance and motion cues. Second, we consider the problem of model update in visual tracking, i.e., determining when to update the model of the target, which may become occluded or leave the field of view. To address this, we use motion cues to identify the state of the object in a principled way, and update the model only when the object is fully visible. In particular, we utilize long-term trajectories in combination with a graph-cut based technique to estimate the parts of the object that are visible. We have evaluated both our approaches extensively on several tracking benchmarks, notably the recent online tracking benchmark and the visual object tracking challenge datasets. Both our approaches compare favorably to the state of the art and show significant improvement over several other recent trackers. Specifically, our submission to the visual object tracking challenge organized in 2015 won one of its competitions.
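A compressed sketch of a generic proposal-then-selection tracking step, under stated assumptions (this is not the thesis implementation): generate candidate boxes, score each with appearance and motion cues, and keep the best one. The scoring functions, the box format, and the weighting are hypothetical stand-ins.

```python
def track_one_frame(prev_box, frame, proposals, appearance_score, motion_score,
                    weight=0.5):
    """Toy proposal-selection step: among candidate boxes (e.g. obtained by
    applying estimated geometric transformations to the previous box), pick
    the one with the best combined appearance + motion score.
    'appearance_score' and 'motion_score' stand in for learned cues."""
    best_box, best_score = None, float("-inf")
    for box in proposals:
        score = (weight * appearance_score(frame, box)
                 + (1.0 - weight) * motion_score(prev_box, box))
        if score > best_score:
            best_box, best_score = box, score
    return best_box

# Toy usage: two candidate boxes (x, y, w, h) and made-up scoring functions.
candidates = [(10, 10, 50, 50), (12, 11, 50, 50)]
best = track_one_frame((11, 10, 50, 50), frame=None, proposals=candidates,
                       appearance_score=lambda f, b: -abs(b[0] - 12),
                       motion_score=lambda prev, b: -abs(b[0] - prev[0]))
print(best)   # (12, 11, 50, 50)
```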