9:45 | registration + coffee | |
10:00 - 11:00 | Thomas Brox | Bored by Classification ConvNets? End-to-end Learning of other Computer Vision Tasks Abstract [slides] |
Convolutional Networks have suddenly become very popular in computer vision since they ticked off some major challenges of recent years: feature design, transfer learning, object classification. Will the conquest of ConvNets stop here? Most likely not. I will present our latest networks for three very different computer vision tasks: image generation, image segmentation, and optical flow estimation. All three networks can do surprising things although they have a disarmingly simple structure.
| ||
11:00 - 11:15 | coffee | |
11:15 - 12:15 | Jason Corso | Toward the Who and Where of Action Recognition Abstract [slides] |
Action recognition has been hotly studied in computer vision for more than two decades. Recent action recognition systems are adept at classifying web videos in a closed world of action categories. But next-generation cognitive systems will require far more than action classification. Full action recognition requires not only classifying the action, but also localizing it and potentially even finely segmenting its boundaries. It requires focusing not only on human action but also on the actions of other agents in the environment, such as animals or vehicles. In this talk, I will describe our recent work moving toward these more rigorous aspects of action recognition. Our work is the first effort in the computer vision community to jointly consider various types of actors undergoing various actions. We consider seven actor types and eight action types in three action understanding problems: single-label action classification, multi-label action classification, and actor-action joint semantic segmentation. We propose graduated strata of models for these problems and analyze the performance of each on all three tasks. The talk will thoroughly discuss these models, the results, and a new dataset that we released to support these more rigorous action understanding problems. This talk covers work appearing at CVPR 2015 as well as new material.
| ||
12:15 - 14:00 | lunch (for registered participants) | |
14:00 - 15:00 | Marco Baroni | Grounding word representations in the visual world [slides] |
15:00 - 15:30 | coffee | |
15:30 - 16:30 | Andrew Zisserman | Human Pose Estimation in Videos and Spatial Transformers [slides: part 1 part 2] |
| ||
16:30 - 19:30 | Kinovis demo & Posters (with wine + cheese) | Poster presentations |
Relja Arandjelovic, PowerPCA: Dimensionality Reduction for Nearest Neighbour Search
Guilhem Cheron, P-CNN: Pose-based CNN Features for Action Recognition
Minsu Cho, Unsupervised Object Discovery and Localization in the Wild
Bumsub Ham, Robust Image Filtering Using Joint Static and Dynamic Guidance
Yang Hua, Online Object Tracking with Proposal Selection
Vicky Kalogeiton, Analysing domain shift factors between videos and images for object detection
Suha Kwak, Unsupervised Object Discovery and Tracking in Video Collections
Diane Larlus, Fisher Vectors Meet Neural Networks: A Hybrid Classification Architecture
Hongzhou Lin, A Universal Catalyst for First-Order Optimization
Julien Mairal, Convolutional Kernel Networks
Jerome Revaud, EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow
Gregory Rogez, First-Person Pose Recognition Using Egocentric Workspaces
Guillaume Seguin, Multi-instance video segmentation from object tracks
Matthew Trager, Visual hulls and duality
Tuan-Hung Vu, Context-aware CNNs for person detection
Philippe Weinzaepfel, Learning to Detect Motion Boundaries |
9:45 | welcome + coffee | |
10:00 - 11:00 | Cees Snoek | What objects tell about actions Abstract [slides] |
This talk is about automatic classification and localization of human actions in video. Whereas motion is the key ingredient in modern approaches, we assess the benefits of having objects in the video representation. Rather than considering a handful of carefully selected and localized objects, we conduct an empirical study on the benefit of encoding 15,000 object categories for action recognition, using 6 datasets totaling more than 200 hours of video and covering 180 action classes. Our key contributions are: i) the first in-depth study of encoding objects for actions; ii) we show that objects matter for actions, and are often semantically relevant as well; iii) we establish that actions have object preferences: rather than using all objects, selection is advantageous for action recognition; iv) we reveal that object-action relations are generic, which allows these relationships to be transferred from one domain to another; and v) objects, when combined with motion, improve the state-of-the-art for both action classification and localization.
This is joint work with Mihir Jain and Jan van Gemert. | ||
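The abstract above describes combining object evidence with motion for action classification. Below is a minimal illustrative sketch of that general late-fusion idea, assuming per-frame object-classifier scores are pooled into a video-level descriptor and concatenated with a motion descriptor before training a linear classifier per action class. The array sizes, the L2 normalisation, and the use of scikit-learn's LinearSVC are assumptions for illustration, not the speakers' actual pipeline.

```python
# Illustrative sketch only (not the speakers' pipeline): pool per-frame
# object-classifier scores into a video descriptor, fuse with a motion
# descriptor, and train a linear classifier for one action class.
import numpy as np
from sklearn.svm import LinearSVC

def object_descriptor(frame_scores):
    """Average per-frame object scores (n_frames x n_objects) into one vector."""
    return frame_scores.mean(axis=0)

def video_representation(frame_scores, motion_descriptor):
    """Late fusion: concatenate L2-normalised object and motion descriptors."""
    obj = object_descriptor(frame_scores)
    obj = obj / (np.linalg.norm(obj) + 1e-8)
    mot = motion_descriptor / (np.linalg.norm(motion_descriptor) + 1e-8)
    return np.concatenate([obj, mot])

# Toy data: 100 videos, 30 frames each, scores over 1,000 object categories
# (15,000 in the study; reduced here) plus an assumed 4,096-d motion descriptor.
rng = np.random.default_rng(0)
X = np.stack([
    video_representation(rng.random((30, 1000)), rng.random(4096))
    for _ in range(100)
])
y = rng.integers(0, 2, size=100)  # binary label for one action class

clf = LinearSVC(C=1.0).fit(X, y)  # one-vs-rest over all actions in practice
print(clf.score(X, y))
```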
11:00 - 11:15 | coffee | |
11:15 - 12:15 | Patrick Perez | On learned visual embedding Abstract [slides] |
Once described with state-of-the-art techniques, images and image fragments are turned into fixed-size, high-dimensional real-valued vectors that can be used in a number of ways. In particular, they can be compared or analyzed in a meaningful way. Yet it is often beneficial to further encode these descriptors: such a final encoding is learned to obtain speed and/or performance gains. We shall put such a generic mechanism to work on three distinct problems: image search by non-linear similarity; image search and classification based on Euclidean distance; and face track verification. The corresponding encodings are respectively based on kernel PCA, exemplar SVMs, and latent metric learning.
|
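One of the encodings named in the abstract above is based on kernel PCA. The sketch below illustrates only the general idea, not the speaker's specific method: descriptors are embedded into a compact space whose Euclidean geometry reflects a chosen non-linear kernel similarity, so that search reduces to plain nearest-neighbour lookup. The descriptor dimensionality, kernel choice, and code size are assumptions for illustration.

```python
# Illustrative sketch only: kernel-PCA encoding of image descriptors for
# nearest-neighbour search under a non-linear (RBF) similarity.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
database = rng.random((2000, 4096))  # e.g. CNN or Fisher-vector descriptors
queries = rng.random((5, 4096))

# Learn a 128-dimensional embedding under an RBF kernel on the database.
kpca = KernelPCA(n_components=128, kernel="rbf", gamma=1.0 / 4096)
db_codes = kpca.fit_transform(database)
query_codes = kpca.transform(queries)

# Search with plain Euclidean distance in the learned code space.
index = NearestNeighbors(n_neighbors=10, metric="euclidean").fit(db_codes)
distances, neighbours = index.kneighbors(query_codes)
print(neighbours[0])  # indices of the 10 nearest database images to query 0
```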