# Seminars


## Seminars in 2015

The future is weakly supervised
INRIA Rhône-Alpes, F107
Monday, May 11, 16:00

### Abstract:

In this talk I present recent work developed at VISICS on Weakly Supervised (WS) learning. The main idea behind WS learning is to find a connection between a main task, which is fully supervised, and a subordinate task, which is only partially annotated. More specifically, we introduce latent variables and use them to learn the subordinate task as a collateral effect of the fully supervised learning. We apply this approach to three different tasks: (i) classification with WS detection, (ii) detection with WS facial point localization, and (iii) detection with WS pose estimation. In these examples we show that WS learning has a broad field of application, can improve the fully supervised task and, more importantly, effectively learns a subordinate task using only partial annotations.
Projection-free Learning and Optimization
INRIA Rhône-Alpes, F107
Thursday, March 26, 16:30

### Abstract:

Linear optimization is often algorithmically simpler than non-linear convex optimization. Linear optimization over matroid polytopes, matching polytopes and path polytopes are examples of problems for which we have simple and efficient combinatorial algorithms, but whose non-linear convex counterparts are harder and admit significantly less efficient algorithms. This motivates the computational model of convex optimization, including the offline, online and stochastic settings, using a linear optimization oracle. In this computational model we give several new results that improve over the previous state of the art. Our main result is a novel conditional gradient algorithm for smooth and strongly convex optimization over polyhedral sets that performs only a single linear optimization step over the domain on each iteration and enjoys a linear convergence rate. This gives an exponential improvement in convergence rate over previous results. Based on this new conditional gradient algorithm we give the first algorithms for online convex optimization over polyhedral sets that perform only a single linear optimization step over the domain while having optimal regret guarantees, answering an open question of Kalai and Vempala [COLT'03], and Hazan and Kale [ICML'12]. Our online algorithms also imply conditional gradient algorithms for non-smooth and stochastic convex optimization with the same convergence rates as projected (sub)gradient methods.
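As an illustration of the linear-oracle model described in this abstract, here is a minimal sketch of the classical conditional gradient (Frank-Wolfe) method over the probability simplex. It is not the linearly convergent algorithm presented in the talk: the step size `2 / (t + 2)`, the function name `frank_wolfe_simplex`, and the toy quadratic objective with its `target` point are all assumptions chosen for illustration.

```python
def frank_wolfe_simplex(grad, dim, iters=2000):
    """Classical conditional gradient over the probability simplex.

    Each iteration makes exactly one call to a linear optimization oracle.
    Over the simplex, that oracle simply returns the vertex e_i whose
    gradient coordinate is smallest, so no projection is ever needed.
    """
    x = [1.0 / dim] * dim                      # feasible start: barycenter
    for t in range(iters):
        g = grad(x)
        i = min(range(dim), key=lambda j: g[j])  # linear oracle: best vertex
        gamma = 2.0 / (t + 2.0)                # standard open-loop step size
        x = [(1.0 - gamma) * xj for xj in x]   # convex combination keeps x feasible
        x[i] += gamma
    return x


# Toy problem (hypothetical): minimize 0.5 * ||x - target||^2 over the simplex.
target = [0.2, 0.5, 0.3]
x = frank_wolfe_simplex(lambda x: [xj - bj for xj, bj in zip(x, target)], dim=3)
```

Every iterate is a convex combination of simplex vertices, which is why the method is called projection-free. With this step size the objective gap only shrinks at a sublinear O(1/t) rate, which is the baseline the talk's result improves upon.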
Probabilistic low-rank matrix completion on finite alphabets
INRIA Rhône-Alpes, F107
Monday, March 31, 12:00

### Abstract:

The task of reconstructing a matrix given a sample of observed entries is known as the matrix completion problem. It arises in a wide range of problems, including recommender systems, collaborative filtering, dimensionality reduction, image processing, quantum physics or multi-class classification, to name a few. Most works have focused on recovering an unknown real-valued low-rank matrix from randomly sub-sampling its entries. Here, we investigate the case where the observations take a finite number of values, corresponding for example to ratings in recommender systems or labels in multi-class classification. We also consider a general sampling scheme (not necessarily uniform) over the matrix entries. The performance of a nuclear-norm penalized estimator is analyzed theoretically. More precisely, we derive bounds for the Kullback-Leibler divergence between the true and estimated distributions. In practice, we have also proposed an efficient algorithm based on lifted coordinate gradient descent in order to tackle potentially high dimensional settings.
Some recent results
INRIA Rhône-Alpes, F107
Friday, March 06, 15:15

### Abstract:

Recent Advances in Large-Scale Convex Optimization: Algorithms, Complexities, and Applications
INRIA Rhône-Alpes, A104
Monday, January 19, 12:00

### Abstract:

In the modern era of large-scale machine learning and high-dimensional statistics, mixing regularization and kernelization have become increasingly popular and important modeling strategies. However, they often lead to very complex optimization models with extremely large scale and nonsmooth objective functions, which bring new challenges to traditional first-order methods due to the expensive computation or memory cost of proximity operators and even gradients. In this talk, I will discuss some recent algorithmic advances that cope with these challenges by taking advantage of the underlying structures and using randomization techniques. I will present (i) my work on the composite mirror prox algorithm for a broad class of variational inequalities, which covers the composite minimization problem with multiple nonsmooth regularization terms, and (ii) my work on the doubly stochastic gradient descent algorithm for stochastic optimization problems over reproducing kernel Hilbert spaces. These algorithms exhibit optimal convergence rates and make it practical to handle problems with extremely large dimensions and large datasets. Besides their theoretical efficiency, the algorithms have also proven useful in a wide range of interesting applications in machine learning, image processing, and statistical inference.

## Seminars in 2014

Incremental proximal majorization-minimization algorithms for large-scale machine learning
INRIA Rhône-Alpes, F107
Wednesday, November 26, 11:00

### Abstract:

Recently, an efficient first-order optimization algorithm called MISO (Minimization by Incremental Surrogate Optimization) was proposed for incremental unconstrained majorization-minimization, with large-scale machine learning applications. We propose several extensions of MISO, under less stringent assumptions, including a proximal counterpart of MISO, called Prox-MISO, that allows non-smooth regularization to be included in the learning objectives.

This work was performed as part of my Master's internship in the LEAR team, supervised by Julien Mairal and Zaid Harchaoui.
Self-Learning Camera: Autonomous Adaptation of Object Detectors to Unlabeled Video Streams
INRIA Rhône-Alpes, F107
Tuesday, September 23, 11:30

### Abstract:

Learning object detectors requires massive amounts of labeled training samples from the specific data source of interest. This is impractical when dealing with many different sources (e.g., in camera networks), or constantly changing ones such as mobile cameras (e.g., in robotics or driving assistant systems). In this talk, I will describe how to address the problem of self-learning detectors in an autonomous manner, i.e. (i) detectors continuously updating themselves to efficiently adapt to streaming data sources (contrary to transductive algorithms), (ii) without any labeled data strongly related to the target data stream (contrary to self-paced learning), and (iii) without manual intervention to set and update hyper-parameters. To that end, we propose an unsupervised, on-line, and self-tuning learning algorithm to optimize a multi-task learning convex objective. Our method uses confident but laconic oracles (human operators or high-precision but low-recall off-the-shelf generic detectors), and exploits the structure of the problem to jointly learn on-line an ensemble of instance-level trackers, from which we derive an adapted category-level object detector. Our approach is validated on real-world publicly available video object datasets.
Human pose recognition: from third person to first person views
INRIA Rhône-Alpes, A103
Tuesday, September 16, 12:00

### Abstract:

In this seminar, I will present an overview of my PhD work, which focused on the problem of full body human pose recognition, and I will introduce some of our more recent work on egocentric image analysis. In the first part, I will present our hierarchical cascade classifier that simultaneously detects humans and estimates their pose by tackling detection as a multi-class classification problem. In the second part of this seminar, I will show how some properties of projective geometry can be exploited for view-invariant monocular tracking in surveillance-scenes. Then, I will present our latest work on hand pose estimation from egocentric viewpoints. For this problem specification, I will show that depth sensors are particularly informative for extracting near-field interactions of the camera wearer with his/her environment. Our method uses task and viewpoint specific synthetic training exemplars, trained with object interactions, in a discriminative detection framework. I will provide an insightful analysis of the performance of our algorithm on a new real-world annotated dataset of egocentric scenes. Finally, I will analyze the limitations of the current approach and give some ideas for future work.
Fast convergence rates in semi-supervised multi-class learning
Yuri Maximov
INRIA Rhône-Alpes, F107
Thursday, August 28, 12:00

### Abstract:

We propose a multi-class classification generalization error bound for semi-supervised learning. The bound involves the margin distribution of the classifier, a transductive Rademacher complexity, and the empirical adequacy of the majority rule assigning pseudo-labels to unlabeled data within identified clusters with the learned function and the true labels of examples. For a given class of functions, the bound is tight when the data clusters contain, in majority, examples of the same class and when the errors of the learned function are concentrated in low margin regions. The working hypothesis of our study is that data can be separated into dense regions, such that the optimal Bayes classifier assigns the same class label to all unlabeled examples within one region. Following this assumption, we propose a two-stage multi-class semi-supervised algorithm which first uses the majority vote to assign pseudo-labels to the unlabeled training examples found to be in dense regions, and then learns a classifier using both the labeled and the pseudo-labeled examples. With this learning scheme we achieve fast convergence rates, and empirical results on different datasets show the effectiveness of our approach compared to state-of-the-art semi-supervised algorithms.
A new primal-dual splitting algorithm for convex optimization; application as a heuristic for super-resolution
INRIA Rhône-Alpes, F107
Monday, June 2, 14:00

### Abstract:

A new splitting algorithm is proposed to minimize the sum of convex functions, potentially nonsmooth and composed with linear operators. This generic formulation encompasses numerous regularized inverse problems in image processing. The algorithm, whose weak convergence is proved, calls the individual gradient or proximity operators of the functions, without any inner loop or linear system to solve. The classical Douglas-Rachford, forward-backward and Chambolle-Pock algorithms are recovered as particular cases. In the second part of the talk, we address the recovery of a spike train from noisy linear measurements, through a reformulation as a low rank matrix approximation problem. Used as a heuristic for this problem, our algorithm outperforms the state of the art.
Bayesian Error Estimation for Classifier Model Selection
INRIA Rhône-Alpes, A104
Wednesday, May 14, 14:00

### Abstract:

The estimation of classification error is a critical step in classifier design, and closely related to model selection. Typical model selection procedures are either based on estimating the error (e.g., cross-validation, bootstrap, holdout, etc.) or information theoretic principles (e.g., AIC, BIC, MDL). The problem with the former approach is that the traditional counting-based estimators are both computationally expensive and inaccurate. On the other hand, the latter approach optimizes a measure that is not directly connected to the prediction error and often requires a careful selection of hyperparameters.
In this talk we concentrate on the recently proposed Bayesian Error Estimator (BEE), and on its uses for model selection among a family of generalized linear models. More specifically, we will show that the estimator is more accurate than the traditional error estimation approaches when selecting the best model along the regularization path of a LASSO regularized logistic regression model. Moreover, the BEE estimates the error directly from the training set, thus avoiding multiple training stages typical of cross-validation procedures.
As a case study, we will describe the anatomy of our submission to the IEEE MLSP 2013 Bird sound classification competition (https://www.kaggle.com/c/mlsp-2013-birds). The method was essentially a BEE-selected generalized linear model with BoW-like features computed from a sparse dictionary representation obtained with the SPAMS toolbox developed by INRIA.
An algorithm for variable density sampling with block-constrained acquisition
INRIA Rhône-Alpes, F107
Tuesday, April 23, 12:00

### Abstract:

Reducing acquisition time is of fundamental importance in various imaging modalities. The concept of variable density sampling provides an appealing framework to address this issue. It was justified recently from a theoretical point of view in the compressed sensing (CS) literature. Unfortunately, the sampling schemes suggested by current CS theories may not be relevant since they do not take the acquisition constraints into account (for example, continuity of the acquisition trajectory in Magnetic Resonance Imaging - MRI). In this talk, we propose a numerical method to perform variable density sampling with block constraints. Our main contribution is to propose a new way to draw the blocks in order to mimic CS strategies based on isolated measurements. The basic idea is to minimize a tailored dissimilarity measure between a probability distribution defined on the set of isolated measurements and a probability distribution defined on a set of blocks of measurements. This problem turns out to be convex and solvable in high dimension. Our second contribution is to define an efficient minimization algorithm based on Nesterov's accelerated gradient descent in metric spaces. We study carefully the choice of the metrics and of the prox function. We show that the optimal choice may depend on the type of blocks under consideration. Finally, we show that we can obtain better MRI reconstruction results using our sampling schemes than standard strategies such as equiangularly distributed radial lines.
Two approaches for domain adaptation: Unsupervised subspace alignment and majority vote adaptation
INRIA Rhône-Alpes, F107
Tuesday, April 15, 15:00

### Abstract:

Domain adaptation is an important machine learning problem arising when the learning distribution differs from that of the test data. Many classification tasks in computer vision or natural language processing, for example, are affected by this problem. A general trend to deal with this issue is to try to bring the two distributions closer, w.r.t. a divergence measure, while ensuring a good accuracy on the learning sample. In this talk, we present and discuss two possible approaches for this problem. The first one, which takes the form of an algorithmic contribution, proposes to bring the two distributions closer by an unsupervised subspace alignment method. The second one is based on a new domain adaptation framework relying on the PAC-Bayesian theory that aims at learning an adaptive majority vote of classifiers.
Supervised Metric Learning with Generalization Guarantees
INRIA Rhône-Alpes, F107
Tuesday, April 15, 14:00

### Abstract:

Using an appropriate metric is key to the performance of many learning algorithms. For this reason, a lot of effort has gone during the past 10 years into metric learning, the research topic devoted to automatically optimizing distance and similarity functions from data. A large body of work has been devoted to supervised metric learning from feature vectors, in particular Mahalanobis distance learning, which essentially learns a linear projection of the data (in the form of a matrix M) into a new space where some discriminative constraints are satisfied. Beyond the fact that M usually has to be PSD, which is a costly constraint, one main limitation of the current supervised metric learning methods is a substantial lack of theoretical understanding of generalization in metric learning. First, one may be interested in the generalization ability of the metric itself, i.e., its consistency not only on the training sample but also on unseen data coming from the same distribution. Second, one may also be interested in the generalization ability of the learning algorithm that uses the learned metric. In this talk, we make use of the formal framework of good similarities introduced by Balcan et al. to design an algorithm for learning a non PSD metric, which is then used to build a global linear classifier. We show that this approach has uniform stability and derive a generalization bound on the classification error.
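For reference, the Mahalanobis distance mentioned in this abstract is usually written as follows. This is standard notation rather than something taken from the talk, and the factor $L$ is introduced here only for illustration:

$$
d_M(x, x') = \sqrt{(x - x')^\top M \,(x - x')}, \qquad M = L^\top L \succeq 0 \;\Longrightarrow\; d_M(x, x') = \lVert Lx - Lx' \rVert_2 .
$$

The factorization makes explicit why a PSD matrix $M$ corresponds to a linear projection of the data followed by a Euclidean distance.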
Predicting an Object Location using a Global Image Representation
INRIA Rhône-Alpes, A103
Thursday, March 27, 12:00

### Abstract:

We tackle the detection of prominent objects in images as a retrieval task: given a global image descriptor, we find the most similar images in an annotated dataset, and transfer the object bounding boxes. We refer to this approach as data driven detection (DDD), which is an alternative to sliding windows. Previous works have used similar notions but with task-independent similarities and representations, i.e. they were not tailored to the end-goal of localization. This article proposes two contributions: (i) a metric learning algorithm and (ii) a representation of images as object probability maps, both of which are optimized for detection. We show experimentally that these two contributions are crucial to DDD, do not require costly additional operations, and in some cases yield comparable or better results than state-of-the-art detectors despite conceptual simplicity and increased speed. As an application of prominent object detection, we improve fine-grained categorization by pre-cropping images with the proposed approach.
Spatial Information and End-to-End Learning for Visual Recognition
INRIA Rhône-Alpes, F107
Wednesday, March 26, 11:00

### Abstract:

We present our research on visual recognition and machine learning. Two types of visual recognition problems are investigated: action recognition and human body part segmentation. Our objective is to combine spatial information, such as label configuration in feature space or spatial layout of labels, into an end-to-end framework to improve recognition performance.

For human action recognition, we apply the bag-of-words model and reformulate it as a neural network for end-to-end learning. We propose two algorithms to make use of label configuration in feature space to optimize the codebook. One is based on classical error backpropagation. The codewords are adjusted by using gradient descent algorithm. The other is based on cluster reassignments, where the cluster labels are reassigned for all the feature vectors in a Voronoi diagram. As a result, the codebook is learned in a supervised way. We demonstrate the effectiveness of the proposed algorithms on the standard KTH human action dataset.

For human body part segmentation, we treat the segmentation problem as a classification problem, where a classifier acts on each pixel. Two machine learning frameworks are adopted: randomized decision forests and convolutional neural networks. We integrate a priori information on the spatial part layout, in terms of pairs of labels or pairs of pixels, into both frameworks in the training procedure to make the classifier more discriminative, but pixelwise classification is still performed in the testing stage. Three algorithms are proposed: (i) spatial part layout is integrated into the randomized decision forest training procedure; (ii) spatial pre-training is proposed for feature learning in the ConvNets; (iii) spatial learning is proposed in the logistic regression (LR) or multilayer perceptron (MLP) used for classification.
Adaptive Euclidean Maps for Histograms: Generalized Aitchison Embeddings
INRIA Rhône-Alpes, A104
Friday, February 3 2014, 12:00

### Abstract:

Learning distances that are specifically designed to compare histograms in the probability simplex has recently attracted the attention of the machine learning community. Learning such distances is important because most machine learning problems involve bags of features rather than simple vectors. Ample empirical evidence suggests that the Euclidean distance in general and Mahalanobis metric learning in particular may not be suitable to quantify distances between points in the simplex. We propose in this paper a new contribution to address this problem by generalizing a family of embeddings proposed by Aitchison (1982) to map the probability simplex onto a suitable Euclidean space. We provide algorithms to estimate the parameters of such maps by building on previous work on metric learning approaches. The criterion we study is not convex, and we consider alternating optimization schemes as well as accelerated gradient descent approaches. These algorithms lead to representations that outperform alternative approaches to compare histograms in a variety of contexts.
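To make the idea of mapping the simplex into a Euclidean space concrete, here is a small sketch of the classical centered log-ratio (clr) transform associated with Aitchison's work, followed by a plain Euclidean distance in the embedded space. It does not reproduce the generalized, learned maps proposed in the talk; the function names and the pseudo-count `eps` (added so that empty histogram bins do not produce `log(0)`) are assumptions made for this illustration.

```python
import math


def clr(hist, eps=1e-6):
    """Centered log-ratio embedding of a histogram.

    eps is a hypothetical pseudo-count that keeps the logarithm finite
    when a bin is empty; it is not part of the classical definition.
    """
    logs = [math.log(h + eps) for h in hist]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]          # coordinates sum to zero


def aitchison_distance(p, q, eps=1e-6):
    """Euclidean distance between the clr embeddings of two histograms."""
    return math.dist(clr(p, eps), clr(q, eps))


p = [0.2, 0.3, 0.5]
q = [0.1, 0.1, 0.8]
d = aitchison_distance(p, q)
```

Because the embedding works on log-ratios, it compares relative rather than absolute differences between bins, which is one reason such maps are considered for histogram data.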
Co-Occurrence Statistics for Zero-Shot Classification
INRIA Rhône-Alpes, F107
Monday, January 13 2014, 12:00

### Abstract:

In this paper we aim for zero-shot classification, but in contrast to the common setting of multi-class image classification, we focus on multi-label image datasets. The goal is to transfer knowledge from the known labels to the unseen labels. Our method relies on easy-to-obtain co-occurrence statistics of class labels harvested from existing annotations, web-search hit counts or image tags. Our main contribution is to use inter-dependencies that arise naturally between classes for zero-shot classification. We propose various similarity metrics for leveraging these co-occurrences, and show that our zero-shot classifiers can serve as priors for few-shot learning. Experiments on three challenging multi-labelled datasets reveal that our proposed zero-shot methods approach and occasionally outperform supervised SVMs. We conclude that co-occurrence statistics suffice for zero-shot classification. (This talk is based on my current CVPR submission, so this work is as yet unpublished.)
Learning with Asymmetric Information
INRIA Rhône-Alpes, F107
Tuesday, January 7 2014, 11:00

### Abstract:

Many computer vision problems have an asymmetric distribution of information, i.e. less or more information about a problem is available at training time than at test time. In my talk I will discuss our recent work on both situations: 1) the LUPI framework for the case when we have additional data modalities available for the training data, and 2) a label propagation approach for the case when an additional similarity measure is available at test time (both published at ICCV 2013).

## Seminars in 2013

The return of AdaBoost.MH: multi-class Hamming trees
INRIA Rhône-Alpes, F107
Wednesday, October 9 2013, 17:00

### Abstract:

Within the framework of AdaBoost.MH, we propose to train vector-valued decision trees to optimize the multi-class edge without reducing the multi-class problem to K binary one-against-all classifications. The key element of the method is a vector-valued decision stump, factorized into an input-independent vector of length K and a label-independent scalar classifier. At inner tree nodes, the label-dependent vector is discarded and the binary classifier can be used for partitioning the input space into two regions. The algorithm retains the conceptual elegance, power, and computational efficiency of binary AdaBoost. In experiments it is on par with support vector machines and with the best existing multi-class boosting algorithm, AOSOLogitBoost, and it is significantly better than other known implementations of AdaBoost.MH.
High-dimensional change-point detection with sparse alternatives
INRIA Rhône-Alpes, F107
Thursday, September 12 2013, 14:00

### Abstract:

We consider the problem of detecting a change in mean in a sequence of high-dimensional Gaussian vectors. We assume that the change happens only in an unknown subset of the vector components. We propose a testing procedure that is adaptive to the number of non-zero components. Under high-dimensional assumptions we obtain the detection boundary and prove rate optimality of the test.
Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances
INRIA Rhône-Alpes, Grand Amphithéâtre
Wednesday, August 28 2013, 11:00

### Abstract:

Optimal transportation distances are a fundamental family of parameterized distances for histograms. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost is prohibitive whenever the histograms' dimension exceeds a few hundred. We propose in this work a new family of optimal transportation distances that look at transportation problems from a maximum-entropy perspective. We smooth the classical optimal transportation problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn-Knopp's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transportation solvers. We also report improved performance over classical optimal transportation distances on the MNIST benchmark problem.
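The matrix scaling idea in this abstract can be sketched in a few lines. The following is a minimal pure-Python illustration, not the speaker's implementation: the function name, the regularization strength `lam`, the fixed iteration count, and the toy histograms and cost matrix are all assumptions, and the histograms are assumed to have strictly positive entries.

```python
import math


def sinkhorn_distance(r, c, M, lam=10.0, iters=200):
    """Entropy-regularized transport cost between histograms r and c.

    M is the ground cost matrix and lam weights the cost against the
    entropic term (larger lam is closer to classical optimal transport).
    Alternately rescaling rows and columns of K = exp(-lam * M) drives the
    plan diag(u) K diag(v) toward the prescribed marginals r and c.
    """
    n, m = len(r), len(c)
    K = [[math.exp(-lam * M[i][j]) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(plan[i][j] * M[i][j] for i in range(n) for j in range(m))
    return cost, plan


# Toy example (hypothetical): two bins, unit cost for moving mass across bins.
r = [0.3, 0.7]
c = [0.6, 0.4]
M = [[0.0, 1.0], [1.0, 0.0]]
cost, plan = sinkhorn_distance(r, c, M)
```

In this toy case the unregularized optimal transport cost is 0.3 (the mass that must change bins), and the regularized cost lands very close to it. Each iteration uses only matrix-vector products, which is what makes the approach cheap compared with solving a linear program.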
Representation Learning by Archetypal Analysis
Yuansi Chen
INRIA Rhône-Alpes, F107
Thursday, July 18 2013, 12:00

### Abstract:

Archetypal analysis is an unsupervised data analysis technique which was introduced by Cutler and Breiman [9]. It represents multivariate data by a convex combination of data prototypes called archetypes, which are themselves convex combinations of data points. Unlike many other unsupervised learning techniques such as sparse coding or non-negative matrix factorization, archetypes are easy to interpret. In our work, we first introduce an efficient implementation of the archetypal analysis method with recent optimization techniques. Second, we conduct numerical experiments showing that archetypal analysis leads to state-of-the-art results when used for learning the underlying structure of natural patches in image denoising and classification tasks.
The Three R's of Computer Vision: Recognition, Reconstruction and Reorganization
INRIA Rhône-Alpes, F107
Thursday, July 11 2013, 11:00

### Abstract:

Over the last two decades, we have seen remarkable progress in computer vision with demonstration of capabilities such as face detection, handwritten digit recognition, reconstructing three-dimensional models of cities, automated monitoring of activities, segmenting out organs or tissues in biological images, and sensing for control of robots and cars. Yet there are many problems where computers still perform significantly below human perception. For example, in the recent PASCAL benchmark challenge on visual object detection, the average precision for most 3D object categories was under 50%.
I will argue that further progress on the classic problems of computational vision (recognition, reconstruction and reorganization) requires us to study the interaction among these processes. For example, recognition of 3D objects benefits from a preliminary reconstruction of 3D structure, instead of just treating it as a 2D pattern classification problem. Recognition is also reciprocally linked to reorganization, with bottom-up grouping processes generating candidates, which with top-down activations of object and part detectors. In this talk, I will show some of the progress we have made towards the goal of a unified framework for the 3 R's of computer vision. I will also point towards some of the exciting applications we may expect over the next decade as computer vision starts to deliver on even more of its grand promise.
Recent work on patch descriptor selection and exploiting layout in image classification
INRIA Rhône-Alpes, F107
Tuesday, July 9 2013, 15:00

### Abstract:

This talk covers two recent papers.
The first (ICMR'11) investigates the use of photographic style for category-level image classification. Specifically, we exploit the assumption that images within a category share a similar style defined by attributes such as colorfulness, lighting, depth of field, viewpoint and saliency. For these style attributes we create correspondences across images by a generalized spatial pyramid matching scheme. Where the spatial pyramid groups features spatially, we allow more general feature grouping and in this paper we focus on grouping images on photographic style. We evaluate our approach in an object classification task and investigate style differences between professional and amateur photographs. We show that a generalized pyramid with style-based attributes improves performance on the professional Corel and amateur Pascal VOC 2009 image datasets.
In the second (ECCV'12), we start from the observation that local image descriptors are generally designed for describing all possible image patches. Such patches may be subject to complex variations in appearance due to incidental object, scene and recording conditions. Because of this, a single-best descriptor for accurate image representation under all conditions does not exist. Therefore, we propose to automatically select from a pool of descriptors the one that is best suited based on object surface and scene properties. These properties are measured on the fly from a single image patch through a set of attributes. Attributes are input to a classifier which selects the best descriptor. Our experiments on a large dataset of colored object patches show that the proposed selection method outperforms the best single descriptor and a-priori combinations of the descriptor pool.
Rooms: Where are things and where could they be?
INRIA Rhône-Alpes, F107
Monday, July 8 2013, 16:00

### Abstract:

Rooms are interesting, because people live in rooms. Autonomous robots will need to manage in rooms; surveillance programs will need to understand pictures of rooms; and there is much commercial value in being able to manipulate pictures of rooms, for example, to show how a high-value sofa would look in your living room.
I will describe current work on understanding rooms from a single image. Our methods can now estimate a "box" describing a room and block out the major structure of the space in that box. Methods from other groups can identify major furniture items, too.
I will then show how these representations can be used to insert items into the room. Inserted items can be rendered realistically, so they look as though they are participating in the light transfer in the room environment.
These methods allow us to build speculative representations: if there were more furniture in this room, what would it look like, and where would it be? These ideas suggest an ideology of visual representation as an exposition of likely futures (rather than as an account of what is seen). There are important consequences: identifying objects may not be as important as understanding free space, materials, and the potential of objects.
Articulated Pose Estimation using Discriminative Armlet Classifiers
INRIA Rhône-Alpes, F107
Friday, July 5 2013, 14:00

### Abstract:

We propose a novel approach for human pose estimation in real-world cluttered scenes, and focus on the challenging problem of predicting the pose of both arms for each person in the image. For this purpose, we build on the notion of poselets [4] and train highly discriminative classifiers to differentiate among arm configurations, which we call armlets. We propose a rich representation which, in addition to standard HOG features, integrates the information of strong contours, skin color and contextual cues in a principled manner. Unlike existing methods, we evaluate our approach on a large subset of images from the PASCAL VOC detection dataset, where critical visual phenomena, such as occlusion, truncation, multiple instances and clutter are the norm. Our approach outperforms Yang and Ramanan [26], the state-of-the-art technique, with an improvement from 29.0% to 37.5% PCP accuracy on the arm keypoint prediction task, on this new pose estimation dataset.
Event retrieval in large video collections with circulant temporal encoding
INRIA Rhône-Alpes, C108
Friday, June 14 2013, 14:00

### Abstract:

This paper presents an approach for large-scale event retrieval. Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain. Our method outperforms the state of the art both in search quality and query time on two large-scale video benchmarks for copy detection, TRECVID and CCWEB. Finally, we introduce a challenging dataset for event retrieval, EVVE, and report the performance on this dataset.
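
As a rough illustration of comparing videos in the frequency domain, the sketch below scores every circular temporal shift between two sequences of frame descriptors with one FFT per descriptor dimension. It is a minimal sketch under stated assumptions, not the method from the talk: the helper names (`fft`, `ifft`, `shift_scores`) and the toy descriptors are hypothetical, the sequence length must be a power of two, and any regularization or compression step is omitted.

```python
import cmath

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def ifft(X):
    """Inverse FFT via the conjugation identity."""
    n = len(X)
    return [v.conjugate() / n for v in fft([u.conjugate() for u in X])]

def shift_scores(query, base):
    """scores[d] = sum_t <query[t], base[(t + d) % n]> for every circular shift d."""
    n, dims = len(query), len(query[0])
    acc = [0j] * n
    for j in range(dims):
        Q = fft([frame[j] for frame in query])
        B = fft([frame[j] for frame in base])
        for k in range(n):
            # cross-correlation theorem for real signals
            acc[k] += Q[k].conjugate() * B[k]
    return [v.real for v in ifft(acc)]

# Toy data (hypothetical): 8 frames with 2-D descriptors; base is the query delayed by 3 frames.
query = [[1, 0], [0, 1], [2, 1], [0, 0], [1, 3], [0, 2], [1, 1], [3, 0]]
base = query[-3:] + query[:-3]
scores = shift_scores(query, base)
best_shift = max(range(len(scores)), key=scores.__getitem__)
```

The position of the largest score localizes the temporal offset between the two clips; all shifts are scored in O(n log n) per dimension rather than O(n^2).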
Scene Understanding: What more can we do to better understand scenes?
INRIA Rhône-Alpes, F107
Tuesday, June 11 2013, 11:30

### Abstract:

The problem of scene understanding has manifested itself in various forms, including, but not limited to, object recognition, 3D scene recovery, and image segmentation. In this talk I will discuss some of my attempts to address these tasks, starting with our energy based formulation for reasoning about regions, objects, and their attributes such as object class, location, and spatial extent. We define a global energy function, which combines results from sliding window detectors, and low-level pixel-based unary and pairwise relations. I will also briefly describe methods for solving the inference and parameter learning problems efficiently in the context of these optimization problems.

In the second part of the talk I will focus on other related challenges: (i) Video segmentation – Video not only provides rich visual cues such as motion and appearance, but also long-range temporal interactions among objects. We present a method to capture such interactions and to construct a powerful intermediate-level representation for subsequent recognition. (ii) Text recognition in scenes – Scene text provides useful cues, such as geographical location, types of buildings in the scene, and the problem of recognizing it is receiving significant attention. I will describe our framework that exploits bottom-up cues, derived from individual character detections from the image, and top-down constraints, obtained from language statistics, for solving this problem.
Structure-Preserving Object Tracking and Forensic Painting Analysis
INRIA Rhône-Alpes, F107
Monday, June 10 2013, 11:30

### Abstract:

The talk gives an overview of my work in computer vision. Specifically, I will present my work on model-free tracking and on forensic painting analysis. In addition, I will briefly highlight my work on the visualization of high-dimensional data, and on the regularization of learning models.

Model-free tracking. Model-free trackers track arbitrary objects based on a single annotation of the object. Whilst the performance of model-free trackers has recently improved substantially, simultaneously tracking multiple objects with similar appearance remains very hard. We propose a new multi-object model-free tracker (based on tracking-by-detection) that resolves this problem by incorporating spatial constraints between the objects. The spatial constraints are learned along with the object detectors using an online structured SVM algorithm. The experimental evaluation of our structure-preserving object tracker reveals significant performance improvements in both multi-object and single-object tracking.

Painting analysis. High-resolution radiographs of paintings reveal the structure of the canvas on which the painting was made. Due to the way in which canvas is produced, the spacings between canvas threads form a "fingerprint" that can be used to identify canvases that originated from the same bolt of canvas. We present a technique to extract and compare canvas fingerprints from painting radiographs, and we show how our techniques may provide new art-historical insights reuniting Poussin's Bacchanals painted for Cardinal Richelieu.

The talk presents joint work with Lu Zhang (Delft University of Technology) and Robert Erdmann (University of Arizona).
On cutting planes for mixed integer linear programming
Alberto Del Pia
INRIA Rhône-Alpes, F107
Monday, April 29 2013, 12:00

### Abstract:

This talk gives an introduction to a recently established link between the geometry of numbers and mixed integer linear optimization. The main focus is to provide a review of families of lattice-free polyhedra and their use in a disjunctive programming approach. The use of lattice-free polyhedra in the context of deriving and explaining cutting planes for mixed integer programs is not only mathematically interesting, but also leads to some fundamental new discoveries, such as an understanding of the conditions under which cutting-plane algorithms converge finitely. These theoretical results suggest the possibility that cutting planes from special families of lattice-free polyhedra could give rise to numerically efficient novel algorithms.
A unified framework for change point detection and other related problems
INRIA Rhône-Alpes, Grand Amphithéâtre
Friday, April 26 2013, 12:00

### Abstract:

We propose a unified convex-optimization-based framework for problems of detecting a signal of a given shape in Gaussian noise. The framework covers various detection settings including: detection of jumps in curves and their derivatives; detection of a periodic component in Gaussian time series; and signal detection from indirect observations. We present a general detection procedure, analyze its properties and show that it cannot be improved in some specific settings.
Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection
Piotr Koniusz
INRIA Rhône-Alpes, F107
Wednesday, April 17 2013, 11:00

### Abstract:

Bag-of-Words lies at the heart of modern object category recognition systems. After descriptors are extracted from images, they are expressed as vectors representing visual word content, referred to as mid-level features. In this talk we review a number of techniques for generating mid-level features, including two variants of Soft Assignment, Locality-constrained Linear Coding, and Sparse Coding. Moreover, we investigate various pooling methods that aggregate mid-level features into vectors representing images. Average pooling, Max-pooling, and a family of likelihood inspired pooling strategies are scrutinised. We generalise the investigated pooling methods to account for the descriptor interdependence and introduce an intuitive concept of improved pooling. We also propose a coding-related improvement to increase its speed. As the pooling step aggregates only occurrences of visual words represented by coefficients of each mid-level feature vector, we refer to it as First-order Occurrence Pooling. We propose to aggregate over co-occurrences of visual words in mid-level features. A derivation of Second- and Higher-order Occurrence Pooling based on linearisation of the so-called Minor Polynomial Kernel is demonstrated. We evaluate how First-, Second-, and Third-order Occurrence Pooling perform given various coders and pooling operators. For bi- and multi-modal coding with two or more coders, we demonstrate an extension of Second- and Higher-order Occurrence Pooling based on linearisation of the Minor Polynomial Kernel. Lastly, we compare the proposed approaches to other renowned methods (e.g. Fisher Vector Encoding) in the same testbed and attain state-of-the-art results with 69.2% MAP on PascalVOC07, 90.2% accuracy on Flower102, 83.6% accuracy on Caltech101, and 41.2% MAP on ImageCLEF11.
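
To make the difference between pooling over occurrences and pooling over co-occurrences concrete, here is a minimal sketch. It assumes each mid-level feature is a short list of visual-word coefficients and uses a plain average of outer products as a stand-in for second-order pooling; the function names and toy codes are hypothetical, and the kernel linearisation discussed in the talk is not reproduced.

```python
def average_pool(codes):
    """First-order pooling: mean coefficient of each visual word."""
    n = len(codes)
    return [sum(c[k] for c in codes) / n for k in range(len(codes[0]))]

def max_pool(codes):
    """First-order pooling: strongest response of each visual word."""
    return [max(c[k] for c in codes) for k in range(len(codes[0]))]

def second_order_pool(codes):
    """Average of outer products: entry (k, l) measures co-occurrence of words k and l."""
    n, K = len(codes), len(codes[0])
    return [[sum(c[k] * c[l] for c in codes) / n for l in range(K)] for k in range(K)]

# Toy mid-level features (hypothetical): 3 descriptors coded over 3 visual words.
codes = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
avg = average_pool(codes)
mx = max_pool(codes)
cooc = second_order_pool(codes)
```

Here words 0 and 2 never fire on the same descriptor, so `cooc[0][2]` is zero even though both words appear in the image, which is information the first-order vectors cannot express.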
Unsupervised Learning of Invariant Object Representations—A Probabilistic Generative Modeling Approach
INRIA Rhône-Alpes, F107
Tuesday, April 16 2013, 14:00

### Abstract:

A fundamental problem of computer vision is how to learn and infer objects in images robustly. For instance, objects need to be represented in spatially and temporally efficient forms and their representations need to be flexible in order to be used for various learning and inference tasks. Objects need to be inferred invariant w.r.t. varied conditions, e.g., different illumination conditions, changes of viewpoints, etc. Objects need to be learned and inferred in cluttered scenes (with the existence of other objects and a variety of noise). Learning object representations with these desired properties is a long standing goal in computer vision. As a step towards this direction, we study the problem of autonomously learning invariant object representations from visual scenes. It covers three aspects of the above mentioned properties: autonomous (unsupervised) object learning, learning invariant object representations, and modeling occlusive objects in visual scenes. New generative models have been proposed together with efficient algorithms for their parameter optimization. The limitations of previous works have been avoided by using a more principled approach to derive efficient learning algorithms and addressing a novel scheme of modeling occlusion.
Hierarchical analysis of hyperspectral images using binary partition trees
INRIA Rhône-Alpes, F107
Wednesday, April 3 2013, 11:00

### Abstract:

After decades of use of multispectral remote sensing, most of the major space agencies now have new programs to launch hyperspectral sensors, recording the reflectance information of each point on the ground in hundreds of narrow and contiguous spectral bands. The spectral information is instrumental for the accurate analysis of the physical components present in a scene. But every rose has its thorns: most of the traditional signal and image processing algorithms fail when confronted with such high-dimensional data (each pixel is represented by a vector with several hundred dimensions).

In this talk, we focus on the extension to hyperspectral data of a very powerful image processing analysis tool: the Binary Partition Tree (BPT). It provides a generic hierarchical representation of images and consists of the two following steps:
• construction of the tree: one starts from the pixel level and merges pixels/regions progressively until the top of the hierarchy (the whole image is considered as one single region) is reached. To proceed, one needs to define a model to represent the regions (for instance the average spectrum, although this is not a good idea) and one also needs to define a similarity measure between neighbouring regions to decide which ones should be merged first (for instance the Euclidean distance between the models of the two regions, although this is not a good idea either). This step (construction of the tree) is very much related to the data.
• the second step is the pruning of the tree: this is very much related to the considered application. The pruning of the tree leads to one segmentation. The resulting segmentation might not be any of the results obtained during the iterative construction of the tree. This is where this representation outperforms the standard approaches. But one may also perform classification, or object detection (assuming an object of interest will appear somewhere as one node of the tree, the game is to define a suitable criterion, related to the application, to find this node).
Results are presented on various hyperspectral images.
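
The two steps above can be sketched on a toy one-dimensional "image". This is a minimal sketch, not the construction used in the talk: it deliberately uses the region mean as model and the absolute difference of means as similarity (the simple choices the abstract warns against), and the pruning rule based on region size is a hypothetical placeholder for an application-specific criterion.

```python
def build_bpt(pixels):
    """Return (nodes, root_id); each node stores its pixel range, mean and children."""
    nodes = [{"lo": i, "hi": i, "mean": float(v), "children": None}
             for i, v in enumerate(pixels)]
    active = list(range(len(pixels)))   # ids of the current regions, left to right
    while len(active) > 1:
        # pick the most similar pair of neighbouring regions
        j = min(range(len(active) - 1),
                key=lambda k: abs(nodes[active[k]]["mean"] - nodes[active[k + 1]]["mean"]))
        a, b = nodes[active[j]], nodes[active[j + 1]]
        na, nb = a["hi"] - a["lo"] + 1, b["hi"] - b["lo"] + 1
        nodes.append({"lo": a["lo"], "hi": b["hi"],
                      "mean": (na * a["mean"] + nb * b["mean"]) / (na + nb),
                      "children": (active[j], active[j + 1])})
        active[j:j + 2] = [len(nodes) - 1]   # replace the pair by the merged region
    return nodes, active[0]

def prune(nodes, root, max_size):
    """Toy pruning: keep the largest regions that contain at most max_size pixels."""
    node = nodes[root]
    if node["children"] is None or node["hi"] - node["lo"] + 1 <= max_size:
        return [(node["lo"], node["hi"])]
    left, right = node["children"]
    return prune(nodes, left, max_size) + prune(nodes, right, max_size)

nodes, root = build_bpt([1, 1, 2, 9, 9, 8])
segments = prune(nodes, root, max_size=3)
```

A tree built from n pixels always has 2n - 1 nodes, and different pruning criteria applied to the same tree yield different segmentations without rebuilding it.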
Large-scale learning from interaction data
INRIA Rhone-Alpes, F107
Thursday, January 17 2013, 10:30

### Abstract:

In many important applications, we need to make decisions in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. Examples include user content optimization, Internet advertising and health-care policy. In the first part of the talk, I will discuss the problem of evaluation of a new policy (e.g., a user serving policy) given historic data. The key statistical challenge is properly accounting for the fact that the past policy and the proposed policy differ. I will present an accurate technique that solves this without collecting any new data. In the second part of the talk, I will focus on a computational challenge of learning from massive interaction data sets. I will describe a distributed optimization technique that allows solving tera-scale problems in 1 hour (using 1000 machines/cores).

Based on joint work with John Langford, Lihong Li, Alekh Agarwal and Olivier Chapelle.
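
For readers unfamiliar with evaluating a new policy from historic data, the sketch below shows inverse propensity scoring, a standard estimator for this setting. It is an assumption that this illustrates the general idea; the abstract does not say which technique the talk presents. The log format, the `ips_value` helper and the toy data are all hypothetical.

```python
def ips_value(logs, policy):
    """Estimate the average reward of `policy` from logged interactions.

    logs: (context, action, reward, propensity) tuples, where propensity is the
    probability with which the logging policy chose the logged action.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # only interactions where the new policy agrees with the log contribute,
        # reweighted to correct for how often the logging policy chose that action
        if policy(context) == action:
            total += reward / propensity
    return total / len(logs)

# Toy log (hypothetical): the past policy picked action A or B uniformly at random.
logs = [("sports", "A", 1.0, 0.5), ("sports", "B", 0.0, 0.5),
        ("news", "A", 0.0, 0.5), ("news", "B", 1.0, 0.5)]

def new_policy(context):
    return "A" if context == "sports" else "B"

estimate = ips_value(logs, new_policy)
```

The reweighting is what accounts for the mismatch between the past policy and the proposed one; no new data is collected.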

## Seminars in 2012

Towards efficient video representations for action recognition
INRIA Rhone-Alpes, A104
Friday, November 30 2012, 12:00

### Abstract:

In this talk, we first review some popular spatial-temporal features for video, and compare their performance in action recognition. In total, we consider four different feature detectors and six local feature descriptors. We demonstrate that dense sampling at regular positions consistently outperforms all tested space-time interest point detectors in real-world videos.

The second part will introduce our recent video features based on dense trajectories and motion boundary descriptors. Dense trajectories capture the local motion patterns in the video and guarantee a good coverage of the context information. Additionally, motion boundary descriptors are shown to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We will also discuss some drawbacks of the current methods and possible further extensions.
The Role of V4 During Natural Vision
INRIA Rhone-Alpes, F107
Monday, November 26 2012, 11:00

### Abstract:

The functional organization of area V4 in the mammalian ventral visual pathway is far from being well understood. V4 is believed to play an important role in the recognition of shapes and objects and in visual attention, but its complexity makes it hard to analyze. Individual cells in V4 have been shown to exhibit a large diversity of preferences to visual stimuli characteristics, including orientation, curvature, motion, color and texture. Such observations were for a large part obtained from electrophysiological and imaging studies, when a subject (monkey or human) is shown a sequence of artificial stimuli during data acquisition. In our study, we intend to go beyond such an approach and analyze a population of V4 neurons in naturalistic conditions. More precisely, we record responses from V4 neurons to grayscale still natural images, that is, discarding color and motion content. We propose a new computational model for V4 that does not rely on any pre-defined image features but only on invariance and sparse coding principles. Our approach is the first to achieve comparable prediction performance for V4 as for V1 cells on responses to natural images. Our model is also interpretable using sparse principal component analysis. In the neuron population observed and based on our computational model, we discover as our main finding two groups of neurons: those selective to texture versus those selective to contours. This supports the thesis that one primary role of V4 is to extract objects from background in the visual field. Moreover, our study also confirms the diversity of V4 neurons. Among those selective to contours, some of them are selective to orientation, others to acute curvature features.

This is joint work with Yuval Benjamini, Ben Willmore, Michael Oliver, Jack Gallant and Bin Yu. This work was performed at UC Berkeley.
Refresher on neural networks and overview of libraries for deep learning
INRIA Rhone-Alpes, A104
Friday, November 23 2012, 11:30

### Abstract:

Recent results [1] highlighted the excellent performance of deep learning architectures for complex high-level computer vision tasks. This talk aims at providing some basic practical knowledge in order to start playing around with these algorithms.

We will begin with a brief refresher on neural networks and the back-propagation algorithm. We will then provide an overview of two Open Source libraries that can be used to learn deep architectures: Theano [2] (python) and EBLearn [3] (C++).

Chapter 11 (on Neural Networks) from The Elements of Statistical Learning

Presentation: http://lear.inrialpes.fr/people/gaidon/lear_xrce_deep_learning_01.html
Block-Coordinate Frank-Wolfe for Structural SVMs
INRIA Rhone-Alpes, F107
Monday, November 12 2012, 14:00

### Abstract:

We propose a randomized block-coordinate variant of the classic Frank-Wolfe algorithm for convex optimization with block-separable constraints. Despite its lower iteration cost, we show that it achieves the same convergence rate as the full Frank-Wolfe algorithm. We also show that, when applied to the dual structural support vector machine (SVM) objective, this algorithm has the same low iteration complexity as primal stochastic subgradient methods. However, unlike stochastic subgradient methods, the stochastic Frank-Wolfe algorithm allows us to compute the optimal step-size and yields a computable duality gap guarantee. Our experiments indicate that this simple algorithm outperforms competing structural SVM solvers.
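
The following toy sketch illustrates the general shape of a randomized block-coordinate Frank-Wolfe update with an exact line search. It is not the structural SVM solver from the talk: the objective (a separable quadratic), the domain (a product of probability simplices), and the names `bcfw` and `objective` are assumptions chosen so that the linear oracle and the optimal step size have simple closed forms.

```python
import random

def bcfw(targets, iters=2000, seed=0):
    """Minimise 0.5 * sum_i ||x_i - targets[i]||^2 with each block x_i on a simplex."""
    rng = random.Random(seed)
    n = len(targets)
    # start every block at the first vertex of its simplex
    x = [[1.0] + [0.0] * (len(t) - 1) for t in targets]
    for _ in range(iters):
        i = rng.randrange(n)                       # pick one block at random
        grad = [xk - tk for xk, tk in zip(x[i], targets[i])]
        j = min(range(len(grad)), key=grad.__getitem__)
        s = [1.0 if k == j else 0.0 for k in range(len(grad))]   # linear oracle: a vertex
        d = [sk - xk for sk, xk in zip(s, x[i])]
        gap = -sum(g * dk for g, dk in zip(grad, d))             # block duality gap
        dd = sum(dk * dk for dk in d)
        if dd == 0.0 or gap <= 0.0:
            continue                               # this block is already optimal
        step = min(1.0, gap / dd)                  # exact line search for a quadratic
        x[i] = [xk + step * dk for xk, dk in zip(x[i], d)]
    return x

def objective(x, targets):
    return 0.5 * sum((a - b) ** 2 for xi, ti in zip(x, targets) for a, b in zip(xi, ti))

# Toy problem (hypothetical): two blocks whose targets lie inside the simplex.
targets = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
solution = bcfw(targets)
final_value = objective(solution, targets)
```

Each iteration touches a single block and calls the linear oracle once, and the quantity `gap` is available at no extra cost, which is the property that makes a duality-gap stopping criterion possible.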
Using Machine Learning to Predict Protein-Protein and Protein-Ligand Interactions
INRIA Rhone-Alpes, F107
Friday, November 9 2012, 10:30

### Abstract:

Protein-protein and protein-ligand interactions are crucial for many biological processes such as signal transduction, DNA replication, etc. Such interactions are also fundamental in many diseases (e.g. cancers). In this talk, I will describe our recent work on machine learning techniques that predict these interactions.

Due to the difficulties, time and cost of the experimental methods for determining the structures and binding affinities of molecular complexes, efficient computational methods are usually used in this field. However, the accuracy of these computational methods is often rather low due to the crude approximations of the interactions within the complex and also due to insufficient sampling of the configurational space for the molecules that form the complex.

I will describe a new machine learning algorithm that very precisely reconstructs the interactions between the molecules based on the structural information currently available in the databases. These databases contain three-dimensional molecular structures determined by experimental techniques and have been growing very rapidly. In 2012, the PDB (Protein Data Bank) contained about 80,000 protein structures. The CSD (Cambridge Structural Database), a database for small molecules, contained about 500,000 entries at the beginning of 2012. We trained our interaction model with some 60,000 parameters on structures from these databases and verified the results on several standard benchmarks as well as in blind docking prediction competitions. The success rates of our model, according to the benchmarks, rank it among the top-3 methods currently available.
Predicting Binary Features for Attribute-Based and Multi-Label Classification
INRIA Rhone-Alpes, Grand Amphi
Friday, October 26 2012, 15:30

### Abstract:

The prediction of attributes, i.e. semantic properties of objects or scenes, has recently received a lot of attention in the computer vision community. In their simplest form, one can interpret attributes simply as a layer of binary mid-level features that can be computed from the image contents. In my talk I will discuss two recent works in this area: the automatic learning of additional, non-semantic, binary features that augment an existing set of attributes (ECCV 2012), and a method for more efficiently predicting binary outputs in highly connected graphical models, where inference has to be performed by sampling (NIPS 2012).
Multi-step flow fusion: towards accurate and dense correspondences in long video shots
INRIA Rhone-Alpes, F107
Thursday, October 25 2012, 10:00

### Abstract:

The aim of this work is to estimate dense displacement fields over long video shots. Put in sequence, they are useful for representing point trajectories but also for propagating (pulling) information from a reference frame to the rest of the video. Highly elaborate optical flow estimation algorithms are at hand, and they have been applied before to dense point tracking by simple accumulation, however with unavoidable position drift. On the other hand, direct long-term point matching is more robust to such deviations, but it is very sensitive to ambiguous correspondences. Why not combine the benefits of both approaches? Following this idea, we develop a multi-step flow fusion method that optimally generates dense long-term displacement fields by first merging several candidate estimated paths and then filtering the tracks in the spatio-temporal domain. Our approach handles small and large displacements with improved accuracy and is able to recover a trajectory after temporary occlusions. The method is especially useful for video editing applications: we address the problems of graphic element insertion and video volume segmentation, and report a number of quantitative comparisons on ground-truth data with state-of-the-art approaches.
Score-based Bayesian Skill Learning
CANCELLED

### Abstract:

We extend the Bayesian skill rating system of TrueSkill to accommodate score-based match outcomes. TrueSkill has proven to be a very effective algorithm for matchmaking (the process of pairing competitors based on similar skill level) in competitive online gaming. However, for the case of two teams/players, TrueSkill only learns from win, lose, or draw outcomes and cannot use additional match outcome information such as scores. To address this deficiency, we propose novel Bayesian graphical models as extensions of TrueSkill that (1) model players' offence and defence skills separately and (2) model how these offence and defence skills interact to generate score-based match outcomes. We derive efficient (approximate) Bayesian inference methods for inferring latent skills in these new models and evaluate them on three real data sets including Halo 2 XBox Live matches. Empirical evaluations demonstrate that the new score-based models (a) provide more accurate win/loss probability estimates than TrueSkill when training data is limited, (b) provide competitive and often better win/loss classification performance than TrueSkill, and (c) provide reasonable score outcome predictions with an appropriate choice of likelihood, a prediction for which TrueSkill was not designed, but which can be useful in many applications.
Distances and Kernels on Discrete Structures: the generating-function trick
Marco Cuturi
INRIA Rhone-Alpes, F107
Thursday, October 5 2012, 12:00

### Abstract:

Distances and positive definite kernels lie at the core of many machine learning algorithms. When comparing vectors, these two concepts form well-matched pairs that are almost interchangeable: trivial operations such as changing signs, adding renormalization factors, taking logarithms or exponentials are usually sufficient to recover one from the other (e.g. Euclidean distances & Laplace kernels). However, when comparing discrete structures, this harmonious symmetry falls apart. The culprit lies in the introduction of combinatorial optimization to compute distances (e.g. edit distances for strings / time series / trees; minimum cost matching distances for sets of points; transportation distances for histograms etc.). Simple counterexamples show that such considerations -- finding a minimal cost matching or a maximal alignment to compare two objects -- tend to destroy any hope of recovering a positive definite kernel from such distances. We present a review of several results in the recent literature that have overcome this limitation. We provide a unified framework for these approaches by highlighting the fact that they all rely on generating functions to achieve positive definiteness.
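
One way to picture the contrast between an optimization-based distance and a generating-function-style kernel is to run the same dynamic program twice over all alignments of two sequences: once keeping the minimum cost (dynamic time warping), once summing exp(-cost) over every alignment (a global-alignment-style kernel). This is a minimal sketch under stated assumptions: the local cost is the absolute difference of scalars, the function name is hypothetical, and the conditions on the local cost under which the summed quantity is actually positive definite are not checked here.

```python
import math

def dtw_and_ga(a, b):
    """Return (DTW distance, sum over alignments of exp(-cost)) for two scalar sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]   # min over alignments of total cost
    K = [[0.0] * (m + 1) for _ in range(n + 1)]   # sum over alignments of exp(-cost)
    D[0][0], K[0][0] = 0.0, 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            K[i][j] = math.exp(-cost) * (K[i - 1][j] + K[i][j - 1] + K[i - 1][j - 1])
    return D[n][m], K[n][m]

distance, kernel = dtw_and_ga([0.0, 1.0, 2.0], [0.0, 2.0])
```

Because the sum includes the best alignment as one of its terms, `-log(kernel)` never exceeds `distance`; the min keeps one alignment, while the sum keeps a contribution from all of them.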
Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost
Thomas Mensink
INRIA Rhone-Alpes, F107
Monday, October 1 2012, 14:00

### Abstract:

We are interested in large-scale image classification and especially in the setting where images corresponding to new or existing classes are continuously added to the training set. Our goal is to devise classifiers which can incorporate such images and classes on-the-fly at (near) zero cost. We cast this problem into one of learning a metric which is shared across all classes and explore k-nearest neighbor (k-NN) and nearest class mean (NCM) classifiers. We learn metrics on the ImageNet 2010 challenge data set, which contains more than 1.2M training images of 1K classes. Surprisingly, the NCM classifier compares favorably to the more flexible k-NN classifier, and has comparable performance to linear SVMs. We also study the generalization performance, among others by using the learned metric on the ImageNet-10K dataset, and we obtain competitive performance. Finally, we explore zero-shot classification, and show how the zero-shot model can be combined very effectively with small training datasets.
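
A nearest class mean classifier is simple enough to sketch in a few lines. The version below is a minimal sketch that uses plain squared Euclidean distance as a stand-in for the learned metric described in the abstract; the function names, class labels and feature vectors are hypothetical.

```python
def class_means(samples):
    """samples: dict label -> list of feature vectors. Returns dict label -> mean vector."""
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in samples.items()}

def ncm_predict(x, means):
    """Assign x to the class whose mean is closest (squared Euclidean distance)."""
    def dist(mu):
        return sum((a - b) ** 2 for a, b in zip(x, mu))
    return min(means, key=lambda label: dist(means[label]))

means = class_means({"cat": [[0.0, 0.0], [0.0, 2.0]],
                     "dog": [[4.0, 0.0], [6.0, 0.0]]})
# Adding a class only requires computing one more mean; nothing is retrained.
means.update(class_means({"bird": [[0.0, 9.0], [0.0, 11.0]]}))
prediction = ncm_predict([0.0, 8.0], means)
```

With a shared metric, the cost of a new class is one pass over its images to compute the mean, which is what makes the near-zero-cost setting plausible.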
Hyperbolic wavelet transform: a new tool for analyzing anisotropic textures
INRIA Rhone-Alpes, Grand Amphi
Wednesday, October 3 2012, 11:00

### Abstract:

In recent years, there has been a paradigm shift in the size of the datasets statisticians are working with. In the "classical" setting, one worked with datasets consisting of n observations of a vector of size p, and p was much smaller than n. In the "modern" setting of high-dimensional statistics, it is now common to work with datasets where p and n are comparable and quite large (for instance a few hundred). Sometimes p is also much greater than n.

I will discuss work which sheds light on the behavior of commonly used statistical procedures in the "large n, large p" setting, where we study the asymptotic behavior of statistical estimators assuming that p and n both go to infinity while p/n has a finite non-zero limit. Building on this understanding, we can propose alternatives to classical statistical methods which are better able to handle the difficulties inherent in high-dimensional statistics. I will describe some of my work in this direction.

At the heart of a number of these analyses is modern random matrix theory. I will talk about the role played by this theory and its potential limitations for statistical modeling, highlighting the connection with the concentration of measure phenomenon.
Some connections between random matrix theory and high-dimensional statistics
Nourredine El Karoui
INRIA Rhone-Alpes, F107
Thursday, July 19 2012, 11:00

### Abstract:

In recent years, there has been a paradigm shift in the size of the datasets statisticians are working with. In the "classical" setting, one worked with datasets consisting of n observations of a vector of size p, and p was much smaller than n. In the "modern" setting of high-dimensional statistics, it is now common to work with datasets where p and n are comparable and quite large (for instance a few hundred). Sometimes p is also much greater than n.

I will discuss work which sheds light on the behavior of commonly used statistical procedures in the "large n, large p" setting, where we study the asymptotic behavior of statistical estimators assuming that p and n both go to infinity while p/n has a finite non-zero limit. Building on this understanding, we can propose alternatives to classical statistical methods which are better able to handle the difficulties inherent in high-dimensional statistics. I will describe some of my work in this direction.

At the heart of a number of these analyses is modern random matrix theory. I will talk about the role played by this theory and its potential limitations for statistical modeling, highlighting the connection with the concentration of measure phenomenon.
A Few Machine Learning-Friendly Optimization and Algorithmic Properties
INRIA Rhone-Alpes, F107
Thursday, July 4 2012, 15:00

### Abstract:

I will introduce some of the main results of my PhD. First, the so-called "proximal" methods have drawn a lot of attention, lately, for solving non-smooth optimization problems that naturally arise for Machine Learning and Signal Processing, among others. The efficiency of those methods relies on the computation of the proximity operator, which, in a lot of problems, can't be obtained in closed form. In those situations, the proximity operator is approximated through the use of iterative procedures. We will see how some finite-time analysis can lead to unexpected strategies where the precision of the approximations can be chosen so that the global procedure has: a) good theoretical properties of the quality of the solution, b) a minimal computational cost.
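
As background for the proximal methods mentioned above, the sketch below shows the basic proximal gradient iteration on a small lasso problem, where the proximity operator of the l1 norm happens to have a closed form (soft-thresholding). This is only an illustration of the exact-prox baseline; the inexact setting studied in the talk, where the operator is itself approximated iteratively, is not implemented. The function names and the toy problem are hypothetical.

```python
def soft_threshold(v, t):
    """Proximity operator of t * |.| applied to the scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def proximal_gradient(A, b, lam, iters=200):
    """Minimise 0.5 * ||A x - b||^2 + lam * ||x||_1 by proximal gradient steps."""
    rows, cols = len(A), len(A[0])
    # the squared Frobenius norm upper-bounds the Lipschitz constant of the gradient
    L = sum(a * a for row in A for a in row)
    x = [0.0] * cols
    for _ in range(iters):
        residual = [sum(A[i][j] * x[j] for j in range(cols)) - b[i] for i in range(rows)]
        grad = [sum(A[i][j] * residual[i] for i in range(rows)) for j in range(cols)]
        # gradient step on the smooth part, then the prox of the non-smooth part
        x = [soft_threshold(x[j] - grad[j] / L, lam / L) for j in range(cols)]
    return x

# Toy problem (hypothetical): with A equal to the identity the solution is soft_threshold(b, lam).
x = proximal_gradient([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0)
```

When no closed form exists, the call to `soft_threshold` would be replaced by an inner iterative solver, and the question raised in the abstract is how precisely that inner problem needs to be solved.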

Then, we will investigate the use of a non-standard performance measure of interest for (multi-class) machine learning problems, namely the Confusion Matrix. We advocate that in several cases, this quantity could be "minimized", instead of the more standard "risk" that is usually considered in ML problems. Along with this framework, we provide some of its theoretical grounds with generalization bounds, which can be obtained through a generalization of the "stability" analysis, which consists in leveraging algorithmic properties to provide statistical guarantees for the classifiers.
Hypothesis Testing and Bayesian Inference: New Applications of Kernel Methods
INRIA Rhone-Alpes, Grand Amphi
Monday, June 11 2012, 11:00

### Abstract:

In the early days of kernel machines research, the "kernel trick" was considered a useful way of constructing nonlinear learning algorithms from linear ones, by applying the linear algorithms to feature space mappings of the original data. Recently, it has become clear that a potentially more far reaching use of kernels is as a linear way of dealing with higher order statistics, by mapping probabilities to a suitable reproducing kernel Hilbert space (i.e., the feature space is an RKHS).

I will describe how probabilities can be mapped to reproducing kernel Hilbert spaces, and how to compute distances between these mappings. A measure of strength of dependence between two random variables follows naturally from this distance. Applications that make use of kernel probability embeddings include:

* Nonparametric two-sample testing and independence testing in complex (high dimensional) domains. As an application, we find whether text in English is translated from the French, as opposed to being random extracts on the same topic.

* Bayesian inference, in which the prior and likelihood are represented as feature space mappings, and a posterior feature space mapping is obtained. In this case, Bayesian inference can be undertaken even in the absence of a model, by learning the prior and likelihood mappings from samples.
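
The distance between two kernel mean embeddings can be estimated directly from samples. The sketch below computes a biased estimate of the squared distance (often called the maximum mean discrepancy) for one-dimensional samples with a Gaussian kernel. It is a minimal sketch: the bandwidth, the function names and the toy samples are assumptions, and no test threshold or significance level is derived.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of the squared RKHS distance between the two mean embeddings."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)

# Toy samples (hypothetical).
same = mmd_squared([0.0, 0.1, -0.2, 0.3], [0.0, 0.1, -0.2, 0.3])
different = mmd_squared([0.0, 0.1, -0.2, 0.3], [2.0, 2.1, 1.8, 2.3])
```

A two-sample test would compare such a statistic against its distribution under the null hypothesis that both samples come from the same distribution.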
Helping each other to see: Humans and machines
Larry Zitnick
INRIA Rhone-Alpes, Grand Amphi
Tuesday, April 24 2012, 11:00

### Abstract:

Humans and machines see the world differently, each having their own strengths and weaknesses. In this talk, I describe two projects exploring how they may help each other.

Visual object recognition by machines is notoriously difficult. To help in the learning process, humans are typically used to gather large hand-labeled training datasets from which the machines may learn. However, humans may also be used to "debug" the machine's recognition pipeline to learn what aspects are lacking. Specifically, we explore the various stages of part-based person detectors. We perform human studies in which subjects perform the same sub-tasks as their machine counterparts, and accuracies are compared.

The typical human has significant difficulty in drawing everyday objects containing complex structures, such as faces or bikes. When learning to draw, humans must learn to see the world differently. That is, they must not only recognize what they are seeing, but they must perceive the spacing and structural layout of an object. We demonstrate an application in which machines can recognize what a human is drawing and provide visual guidance to the drawer in the form of shadows. The shadows, which may be either used or ignored by the drawer, help the drawer achieve more realistic overall shapes and spacing, while maintaining their own unique drawing style.

Point Process models for multiple object detection
Ahmed Gamal-Eldin
INRIA Rhone-Alpes, F107
TBD

### Abstract:

I will start with a brief introduction to Point Process models in image processing, focusing mainly on remote sensing. I will talk about existing optimization methods, and discuss the main characteristics of a good optimizer for these models. Next, I will present a new optimization algorithm we call "Multiple Births and Cut" (MBC). It combines the recently developed optimization algorithm Multiple Births and Deaths (MBD) with Graph-Cut. I will present three different variants of this algorithm. I will present results on synthetic data to show how the algorithm scales with the problem size. Finally, I will present results on different applications.
Leveraging category-level labels for instance-level image retrieval
Albert Gordo
INRIA Rhone-Alpes, F107
Thursday, March 15 2012, 15:00

### Abstract:

We consider the problem of query-by-example instance-level image retrieval: given a query image of an object or a scene, we want to retrieve within a potentially large dataset other instances of the exact same object or scene. For efficiency reasons, it is common to represent an image by a fixed-length descriptor which is subsequently encoded into a small number of bits. We note that most encoding techniques include an unsupervised dimensionality reduction step. Our goal in this work is to learn a better subspace in a supervised manner. We especially raise the following question: "can category-level labels be used to learn such a subspace?"
To answer this question, we experiment with four learning techniques: a metric learning approach, attributes representation, Canonical Correlation Analysis (CCA) and Joint Subspace and Classifier Learning (JSCL). While the first three approaches have been applied in the past to the image retrieval problem, we believe we are the first to show the usefulness of JSCL in this context.
In our experiments, we use ImageNet as a source of category-level labels and report retrieval results on two standard datasets: INRIA Holidays and the University of Kentucky benchmark. Our experimental study shows that metric learning and attributes do not lead to any significant improvement in retrieval accuracy, as opposed to CCA and JSCL. As an example, we report on Holidays an increase in accuracy from 39.3% to 48.6% with 32-dimensional representations.
Structured Models for Image Labeling
Thomas Mensink
INRIA Rhone-Alpes, F107
Monday, March 12 2012, 11:00

### Abstract:

In this paper we propose structured prediction models for image labeling that explicitly take into account dependencies among image labels. We describe a tree-based structure, where image labels are nodes, and edges encode dependency relations. Our models are more expressive than independent label predictors, such as one vs. rest SVMs and lead to more accurate predictions in the case of fully-automatic image labeling. However, the gain becomes more significant in an interactive scenario where a user provides the value of some of the image labels at test time. Such an interactive scenario offers an interesting trade-off between label accuracy and manual labeling effort. The structured models are used to decide which labels should be set by the user, and transfer the user input to more accurate predictions on other image labels. This is an extended version of my CVPR 2011 paper.
On visual tracking
Patrick Pérez
INRIA Rhone-Alpes, Grand Amphi
Thursday, January 26 2012, 11:00

### Abstract:

Visual motion estimation is a generic task of crucial importance in a variety of video analysis and processing systems. It comes under multiple guises, depending on the extent and the density of the spatial estimation support (from sparse fragments to whole objects and complete scenes) and on the extent of the temporal analysis (from instantaneous velocity estimation to long-term visual tracking). This variety, along with the long history of this branch of computer vision, makes a rapid overview difficult. There are nonetheless several important methodological concepts, pertaining to sequential inference and to visual appearance modeling/matching, that traverse many works in this field, including the most recent ones. With a focus on visual tracking, I will touch upon such tools with the help of a large range of illustrative examples.
CVPR submission
Zeynep Akata & Gokberk Cinbis
INRIA Rhone-Alpes, F107
Thursday, January 19 2012, 12:00

### Abstract:

Zeynep Akata and Gokberk Cinbis will talk about their recent CVPR submissions.
Learning temporal information for action recognition
INRIA Rhone-Alpes, F107
Monday, January 16 2012, 16:00

### Abstract:

Current state-of-the-art models of human actions in realistic videos, e.g. the bag of spatio-temporal visual words, are often based on the aggregation of local features in an orderless fashion. However, actions are by essence temporal phenomena and some actions, like "sitting down" and "getting up", can only be reliably classified if their models incorporate some temporal structure. We present two recent results on incorporating temporal information in state-of-the-art recognition methods. First, we describe a simple action model, called the Actom Sequence Model (ASM), encoding global ordering constraints between temporal parts. We explain how we learn the temporal structure of an action and perform efficient action detection on large video databases. Then, we introduce a new kernel between multivariate time series, called the Difference between Auto-Correlation Operators (DACO) kernel, and demonstrate its applicability to videos. This kernel compares two actions based on their dynamics, represented by the auto-correlation operator in the Reproducing Kernel Hilbert Space (RKHS) associated with a "base" kernel between frames. We show that it leverages useful temporal dependency information, that complements traditional kernels on bag-of-words. Finally, we illustrate the performance of our algorithms on challenging action recognition benchmarks and show improvements w.r.t. the state of the art. Joint work with Zaid Harchaoui and Cordelia Schmid

## Seminars in 2011

Automatic human face 3D modeling from a single view image / Fast low-rank metric learning
Danila Potapov / Dan Oneata
INRIA Rhone-Alpes, F107
Monday, December 19 2011, 12:00

### Abstract:

Automatic human face 3D modeling from a single view image:
Human face 3D modeling from a single image is a very challenging problem. The missing depth information has to be inferred using a generative model with dozens of latent variables. Probabilistic analysis suggests minimization of a huge energy function with respect to parameters of different nature. In this talk, I will give an overview of existing techniques based on the 3D Morphable Model approach. I will present our method that makes use of additional information (facial feature points and contours). I will discuss implementation details and show experimental results. The method is flexible enough to be applied to both natural and sculptured human faces.

Fast low-rank metric learning:
We propose two families of algorithms for reducing the computational cost of the NCA method. First, we consider ideas inspired by sub-sampling methods. We investigate a mini-batch method that forms mini-batches by clustering and a sub-set learning algorithm that is theoretically justified by stochastic optimization arguments. Our experiments demonstrate that these methods offer significant speed-ups while obtaining classification scores similar to those of classical NCA. The second family of algorithms includes variants of approximate methods. We derive these methods by first interpreting NCA as a class-conditional kernel density estimation (CC-KDE) problem. This formulation offers several advantages: (i) it allows us to adapt existing algorithms for fast kernel density estimation (e.g., Gray and Moore, 2003) to the context of NCA and (ii) it offers more flexibility; for example, we develop a compact-support version of NCA that achieves considerable speed-ups when combined with the stochastic learning procedure.
The role of attractiveness for web image search
Bo Geng
INRIA Rhone-Alpes, F107
Thursday, December 08 2011, 16:00

### Abstract:

Existing web image search engines are mainly designed to optimize topical relevance. However, according to our user study, attractiveness is becoming an increasingly important factor for web image search engines to satisfy users' search intentions. Important as it is, web image attractiveness from the search users' perspective has not been sufficiently recognized in either industry or academia. In this paper, we present a definition of web image attractiveness with three levels according to the end users' feedback, including perceptual quality, aesthetic sensitivity and affective tone. Corresponding to each level of the definition, various visual features are investigated for their applicability to attractiveness estimation of web images. To further deal with the unreliability of visual features induced by the large variations of web images, we propose a contextual approach to integrate the visual features with contextual cues mined from image EXIF information and the associated web pages. We explore the role of attractiveness by applying it to various stages of a web image search engine, including the online ranking and the interactive reranking, as well as the offline index selection. Experimental results on three large-scale web image search datasets demonstrate that the incorporation of attractiveness can bring more satisfaction to 80% of the users for ranking/reranking search results and a 30.5% index coverage improvement for index selection, compared to the conventional relevance-based approaches.
Portmanteau vocabularies: multi-cue visual neologism on the cheap
Andrew D. Bagdanov
INRIA Rhone-Alpes, F107
Monday, November 28 2011, 11:00

### Abstract:

The success of the bag-of-words (BOW) model for image classification is highly dependent on the quality of the visual vocabulary used. This talk will consider visual vocabularies used to represent images whose local features are described by both shape and color. In it, I will describe a new approach to feature combination in the BOW model that builds discriminative compound words from primitive cues learned independently from training images. Motivated by the observation that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to directly estimate joint-cue distribution empirically, the statistics of joint visual words are modeled assuming conditional independence of individual features once the class is known. We apply information theoretic vocabulary compression to find discriminative combinations of joint-cues and the resulting vocabulary of visual portmanteaux is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-the-art results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of the approach compared to other, significantly more complex approaches to multi-cue image representation.
Training Random Forests with Ambiguously Labeled Data
Christian Leistner
INRIA Rhone-Alpes, F107
Wednesday, October 19 2011, 15:00

### Abstract:

Although nowadays the number of digital images is exploding, collecting large amounts of labeled data can still be tedious and costly. Additionally, the labels can be noisy or formatted in a way which might not be optimal to exploit by the learning method - consider bounding box annotations in images. This motivates the development and usage of learning algorithms that are able to exploit both small amounts of labeled data and large amounts of unlabeled data, which are usually easy to get. Also, the learning method should allow for a certain amount of flexibility in the labeling. In this talk, I will show how to use Random Forests (RFs) to tackle these challenges. RFs are able to deliver state-of-the-art results in various applications, are fast to train and evaluate, are inherently multi-class, run on parallel architectures and are robust to label noise, which makes them perfect candidates to exploit large amounts of unlabeled or ambiguously labeled samples. In particular, I will present extensions of RFs to semi-supervised and multiple-instance learning as well as to online learning, which is needed in many applications. Finally, I will present a new method that is able to benefit from unlabeled videos, even if the content is unrelated to the given task.
Recent research
INRIA Rhone-Alpes F107
Friday, October 7 2011, 10:00 am

### Abstract:

Internal talk about some recent research
Monocular 3D Pose Estimation
Srimal Jayawardena
INRIA Rhone-Alpes F107
Wednesday, November 2 2011, 11:00 am

### Abstract:

The problem of identifying the 3D pose of a known object from a given 2D image has important applications in Computer Vision. Our proposed method of registering a 3D model of a known object on a given 2D photo of the object has numerous advantages over existing methods. It does not require prior training, knowledge of the camera parameters, explicit point correspondences or matching features between the image and model. Unlike techniques that estimate a partial 3D pose (as in an overhead view of traffic or machine parts on a conveyor belt), our method estimates the complete 3D pose of the object. It works on a single static image from a given view under varying and unknown lighting conditions. For this purpose we derive a novel illumination-invariant distance measure between the 2D photo and projected 3D model, which is then minimised to find the best pose parameters. Results for vehicle pose detection in real photographs are presented.
Manifold Learning by Semidefinite Facial Reduction
Nathan KRISLOCK
INRIA Rhone-Alpes F107
Thursday, June 16 2011, 4:00 pm

### Abstract:

The problem of nonlinear dimensionality reduction is most often formulated as a semidefinite programming (SDP) problem. Currently, only SDP problems of limited size can be solved directly by SDP solvers. To overcome this difficulty, we propose a novel SDP formulation for dimensionality reduction based on semidefinite facial reduction. The key observation is that in manifold learning, the structure of a large chunk of the data can be preserved as a whole, instead of dividing it into very small neighborhoods. This observation leads to a new formulation that significantly reduces the size and the number of constraints of the SDP problem. Our method is a stable, fast, and scalable algorithm for manifold learning, allowing us to solve very large problems. We obtain high quality solutions without the need for post-processing by local gradient descent search methods, as is often required by other large-scale SDP-based methods for manifold learning. This is joint work with Babak Alipanahi and Ali Ghodsi (University of Waterloo, Canada).
First Order Methods for Large-Scale Convex Optimization
INRIA Rhone-Alpes
Friday, May 20 2011, 12:30 am

### Abstract:

We discuss several state-of-the-art first-order methods for minimizing convex objectives over "simple" large-scale feasible sets; unlike polynomial-time Interior Point algorithms, these methods are computationally cheap. We are particularly interested in first-order methods for "well-structured" large-scale nonsmooth convex programs. These methods utilize the problem structure in order to convert the original nonsmooth minimization problem into a saddle point problem with a smooth convex-concave cost function. This reformulation allows the solution process to be accelerated significantly. Our emphasis is on methods which, under favorable circumstances, exhibit a (nearly) dimension-independent convergence rate. We also outline possibilities to further accelerate first-order methods by randomization.
Variational Approximations for Factor Analysis
Guillaume Bouchard
INRIA Rhone-Alpes, F107
Friday, May 20 2011, 11am

### Abstract:

Many statistical techniques, such as the computation of the data likelihood in the presence of nuisance parameters, prediction in the presence of missing data, or the computation of the posterior distribution over parameters, can be simply expressed as integration problems. Variational approaches enable us to transform an intractable integral into an optimization problem. After a brief tutorial on common variational techniques used to solve machine learning problems, we will present recent developments on the use of variational bounds to solve large-scale missing data problems when data are heterogeneous (i.e. when there are both discrete and continuous observations) and heteroscedastic (i.e. when the data variance is not the same for all the observed entities). The final part of the talk will introduce Split Variational Inference, a generic approach to computing large-scale non-Gaussian integrals by splitting them into small pieces that are easier to approximate by unnormalized Gaussian distributions.
Cascaded distinctive features for specific and class object recognition
Jerome Revaud
INRIA Rhone-Alpes, F107
Thursday, March 31 2011, 3pm

### Abstract:

Object recognition in images is a growing field. For several years, the emergence of invariant interest points such as SIFT [Lowe, 2001] has enabled rapid and effective systems for the recognition of instances of specific objects as well as classes of objects (e.g. using the bag-of-words model). However, our experiments on the recognition of specific object instances have shown that under realistic conditions of use (e.g. the presence of various kinds of noise such as blur, poor lighting, low resolution cameras, etc.) progress remains to be made in terms of recall: despite the low rate of false positives, too few actual instances are detected regardless of the system (RANSAC, votes / Hough ...). In this presentation, we first present a contribution to overcome this problem of robustness for the recognition of object instances, then we directly extend this contribution to the detection and localization of classes of objects.

Initially, we developed a method inspired by graph matching to address the problem of fast recognition of instances of specific objects in noisy conditions. This method makes it easy to combine any type of local features (e.g. contours, textures ...) that are less affected by noise than keypoints, while bypassing the normalization problem and without penalizing the detection speed too much. In this approach, the detection system consists of a set of cascades of micro-classifiers trained beforehand. Each micro-classifier is responsible for comparing the test image locally and from a certain point of view (e.g. as contours, or textures ...) to the same area in the model image. The cascades of micro-classifiers can therefore recognize different parts of the model in a robust manner (only the most effective cascades are selected during learning). Finally, a probabilistic model that combines those partial detections infers global detections. Unlike other methods based on a global rigid transformation, our approach is robust to complex deformations such as those due to perspective or the non-rigid ones inherent to the model itself (e.g. a face, a flexible magazine). Our experiments on several datasets have shown the relevance of our approach. It is overall slightly less robust to occlusion than existing approaches, but it performs better in noisy conditions.

In a second step, we developed an approach for detecting classes of objects in the same spirit as the bag-of-visual-words model. For this we use our cascaded micro-classifiers to recognize visual words that are more distinctive than the classical words simply based on visual dictionaries (like [Csurka, 2004] or [Zhang, 2006]). Training is divided into two parts: first, we generate cascades of micro-classifiers for recognizing local parts of the model pictures, and then we use a classifier to model the decision boundary between images of the class and those of the non-class. This classifier bases its decision on a vector counting the outputs of each binary micro-classifier. This vector is extremely sparse, and a simple classifier such as Real-Adaboost manages to produce a system with good performance (this type of classifier is in fact similar to the subgraph membership kernel). In particular, we show that combining classical visual words (from keypoint patches) with our distinctive words results in a significant improvement. The computation time is generally quite low, since the structure of the cascades minimizes the detection time and the classifier is extremely fast to evaluate.
Recent results
Jean Ponce
INRIA Rhone-Alpes, A109
Friday, March 29 2011, 11am

### Abstract:

Informal talk on some recent results
Seam Carving for Image Retargeting
Alex Mansfield
INRIA Rhone-Alpes, F107
Friday, March 25 2011, 4pm

### Abstract:

Seam carving defines an energy over the image, and uses dynamic programming to efficiently optimize for 8-connected paths (seams) through the image pixels that can be removed, shrinking the image by 1 pixel in one dimension. I will introduce and motivate the problem of image retargeting, which seam carving aims to solve, and describe the key solution approaches. I will describe the seam carving algorithm in detail, and show its successes and failures. I will describe extensions to this method, including our recent work, in which we focus on understanding seam carving further as an optimization process and on improving results when user interaction is possible. I will evaluate the success of the field in tackling the problem of image retargeting, and finally give some key insights and hint at the challenges ahead.
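
The dynamic programming step described above can be sketched as follows. This is a minimal illustration, assuming the energy is already given as a small grid of numbers (computing it from image gradients is left out) and that a vertical seam is wanted; the function names are illustrative and not taken from the talk.

```python
def find_vertical_seam(energy):
    """Return, for each row, the column of a minimum-energy 8-connected
    vertical seam, using dynamic programming from top to bottom."""
    rows, cols = len(energy), len(energy[0])
    cost = list(energy[0])   # cumulative cost of the best seam ending in each column
    back = []                # back[r - 1][c]: column chosen in row r - 1 for (r, c)
    for r in range(1, rows):
        new_cost, pointers = [], []
        for c in range(cols):
            # 8-connectivity: a seam pixel may follow the pixel directly
            # above it, or the one above-left or above-right.
            candidates = [cc for cc in (c - 1, c, c + 1) if 0 <= cc < cols]
            best = min(candidates, key=cost.__getitem__)
            new_cost.append(energy[r][c] + cost[best])
            pointers.append(best)
        cost = new_cost
        back.append(pointers)
    # Backtrack from the cheapest column in the last row.
    col = min(range(cols), key=cost.__getitem__)
    seam = [col]
    for r in range(rows - 2, -1, -1):
        col = back[r][col]
        seam.append(col)
    seam.reverse()
    return seam


def remove_seam(grid, seam):
    """Drop one entry per row, shrinking the grid width by one."""
    return [row[:c] + row[c + 1:] for row, c in zip(grid, seam)]


if __name__ == "__main__":
    energy = [[1, 4, 3],
              [5, 2, 6],
              [7, 8, 1]]
    seam = find_vertical_seam(energy)
    print(seam)
    print(remove_seam(energy, seam))
```

Repeating the two steps, with the energy recomputed after each removal, shrinks the image one column at a time; horizontal seams follow by transposing the grid.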
Robust Estimation for an Inverse Problem Arising in Multiview Geometry
Arnak Dalalyan
INRIA Rhone-Alpes, F107
Thursday, March 3rd 2011, 16h00

### Abstract:

We propose a new approach to the problem of robust estimation for some inverse problems arising in multiview geometry. Inspired by recent advances in the statistical theory of recovering sparse vectors, we define our estimator as a Bayesian maximum a posteriori estimator with a multivariate Laplace prior on the vector describing the outliers. This leads to an estimator in which the fidelity to the data is measured by the $L_\infty$-norm while the regularization is done by the $L_1$-norm. The proposed procedure is fairly fast since the outlier removal is done by solving one linear program (LP). An important difference compared to existing algorithms is that for our estimator it is not necessary to specify either the number or the proportion of the outliers; only an upper bound on the maximal measurement error for the inliers should be specified. We present theoretical results assessing the accuracy of our procedure, as well as numerical examples illustrating its efficiency on synthetic and real data. This is joint work with Renaud Keriven.

## Seminars in 2010

Union Support Recovery in Multi-task Learning
INRIA Rhone-Alpes, F107
Monday, November 29th 2010, 16h00

### Abstract:

We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. These theoretical results will be presented together with implications for practitioners. Joint work with John Lafferty and Larry Wasserman.
Learning structured prediction models for interactive image labeling
Thomas Mensink
INRIA Rhone-Alpes, F107
Thursday, November 25th 2010, 14h00

### Abstract:

In this talk I will present my CVPR submission. In the paper we propose structured models for image labeling, which take into account label dependencies. These models are more expressive than independent label predictors, and lead to more accurate predictions. While the improvement is modest for fully-automatic image annotation, the gain is significant in an interactive scenario where a user provides the value of some of the image labels. In this interactive scenario, the structured models are used to decide which labels should be set by the user, and to infer the remaining labels conditioned on the user responses. We also apply our models to attribute-based image classification, where attribute predictions of a test image are mapped to class probabilities by means of a given attribute-class mapping. In this case the structured models are built at the attribute level. We also consider an interactive system where the system asks a user to set some of the attribute values in order to maximally improve class prediction performance. Experimental results on three publicly available benchmark data sets show that in all scenarios structured models lead to more accurate predictions, and leverage user input much more effectively than state-of-the-art independent models. This is joint work with Jakob and Gabriela Csurka (XRCE).
Fast tropical matrix multiplication and applications to message passing
Julian McAuley
INRIA Rhone-Alpes, F107
Tuesday, November 23rd 2010, 14h00

### Abstract:

In discrete pairwise graphical models containing loops, exact inference via message passing amounts to repeatedly computing matrix products. In order to efficiently compute marginals in such models, one could in principle apply any of the well-known subcubic solutions to this problem. However computing MAP states requires solving matrix product in the max-product (or 'tropical') semiring, where the existence of a subcubic solution remains an open question. In this talk, we discuss expected-case subcubic solutions to this problem, and show how they can lead to faster message passing algorithms in a variety of computer vision problems.
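
For reference, the matrix product in the tropical semiring can be written down directly. The sketch below works in the log domain, where max-product becomes max-plus, and uses the naive cubic-time loop; it is a baseline for illustration under those assumptions, not the expected-case subcubic algorithm discussed in the talk, and the function name is hypothetical.

```python
def max_plus_product(a, b):
    """Tropical (max-plus) matrix product: C[i][j] = max_k (A[i][k] + B[k][j]).

    With log-potentials as entries, this is the operation that max-product
    message passing repeats; the triple loop costs O(n * m * p).
    """
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [max(a_row[k] + b[k][j] for k in range(inner)) for j in range(cols)]
        for a_row in a
    ]


if __name__ == "__main__":
    a = [[0, 1],
         [2, 3]]
    b = [[4, 5],
         [6, 7]]
    print(max_plus_product(a, b))
```

Replacing `max` and `+` with `sum` and `*` recovers the ordinary matrix product used for computing marginals, which is where the well-known subcubic algorithms apply.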
Human Action Recognition in Uncontrolled Videos
INRIA Rhone-Alpes, A104
Thursday, November 18th 2010, 14h00

### Abstract:

In this talk, I will present our two recent approaches to human action recognition in uncontrolled videos. The first approach deals with the case where there are not enough training sequences to learn the action classifiers directly from videos. In this case, we show how we can make use of the images collected from the Web to learn representations of actions and use this knowledge to automatically annotate actions in videos. Our approach is unsupervised, in the sense that it requires no human intervention other than the text querying. The benefits are two-fold: first, we show that we can improve retrieval of action images, and second, we can collect a large generic database of action poses, which can then be used in tagging videos. We present experimental evidence that using action images collected from the Web, annotating actions is possible. In the second part of the talk, I will present our approach which uses the scene and object information in the videos together with the pose and motion information to infer human actions. Here, our observation is that human actions can be identified not only by the singular observation of the human body in motion, but also properties of the surrounding scene and the related objects. We propose an approach that integrates multiple feature channels from several entities and formulate the problem in a multiple instance learning (MIL) framework. Our experimental results show that scene and object information can be effectively used to complement person features for human action recognition.
Faster Algorithms for Max-Product Message-Passing
INRIA Rhone-Alpes, F107
Thursday, October 14th 2010, 16h00

### Abstract:

Maximum A Posteriori inference in graphical models is often solved via message-passing algorithms, such as the junction-tree algorithm, or loopy belief-propagation. The exact solution to this problem is well known to be exponential in the size of the model's maximal cliques after it is triangulated, while approximate inference is typically exponential in the size of the model's factors. In this presentation, I'll show recent work from our lab in which we take advantage of the fact that many models have maximal cliques that are larger than their constituent factors, and also of the fact that many factors consist entirely of latent variables (i.e., they do not depend on an observation). This is a common case for several practical models, including many models on grids, trees, ring-structured models and skip-chain models. In such cases, we are able to decrease the exponent of complexity for message-passing for both exact and approximate inference. We illustrate the practical advantages of the improved algorithm in a number of tasks, such as protein design, text and image denoising, optical flow inference, stereo disparity estimation, and graph matching.
Joint work with Julian McAuley.
Set Based Modeling Of Objects And Their Context
INRIA Rhone-Alpes, F107
Friday, October 8th 2010, 14h00

### Abstract:

In computer vision, many image entities can be represented as sets of high-dimensional items. For example, an object in an image can be represented as a set of image patches, where each image patch has a feature vector encoding the local appearance. Training classification models directly on sets of unordered items, where each set can have varying cardinality, can be difficult. In this talk, I will introduce a new boosting-based supervised learning algorithm, called SetBoost, for building set classifiers. In the second part of the talk, I will give details about our novel contextual object detection model that uses SetBoost. In natural images, objects tend to appear in certain arrangements with respect to the other objects (object context) and the scene (scene context). The aim of our proposed model is to improve localization and recognition accuracy of object detection algorithms using object context and scene context. Our approach outperforms existing state-of-the-art methods in challenging object detection benchmark datasets.
Scene and object recognition with lots of categories
INRIA Rhone-Alpes, Grand Amphithéâtre
Monday, September 27th 2010, 16h30
Dense Interest Points
INRIA Rhone-Alpes, Grand Amphithéâtre
Monday, September 27th 2010, 15h30

### Abstract:

Local features or image patches have become a standard tool in computer vision, with numerous application domains. Roughly speaking, two different types of patch-based image representations can be distinguished: interest points, such as corners or blobs, whose position, scale and shape are computed by a feature detector algorithm, and dense sampling, where patches of fixed size and shape are placed on a regular grid (possibly repeated over multiple scales). Interest points focus on 'interesting' locations in the image and include various degrees of viewpoint and illumination invariance, resulting in better repeatability scores. Dense sampling, on the other hand, gives a better coverage of the image, a constant number of features per image area, and simple spatial relations between features. In this paper, we propose a hybrid scheme, which we call dense interest points, where we start from densely sampled patches yet optimize their position and scale parameters locally. We investigate whether, by doing so, it is possible to get the best of both worlds.
Recent advances in structured sparse models
INRIA Rhone-Alpes, F107
Tuesday, September 21st 2010, 16h00

### Abstract:

Sparse linear models have received a lot of attention in statistics, machine learning, computer vision and neuroscience. We consider here extensions of these models applied to various machine learning problems, where the sparsity pattern (set of nonzero coefficients) of the variables is encouraged to be not only sparse, but also structured. Whereas this approach enriches classical sparse models, it raises challenging new optimization problems, and we propose several algorithms for solving them efficiently. We illustrate our method with wavelet denoising, learning tree-structured dictionaries of natural image patches, and background subtraction in videos.
This is joint work with Rodolphe Jenatton, Guillaume Obozinski and Francis Bach. The material of the talk is based on the following publications:
[1] J. Mairal, R. Jenatton, G. Obozinski and F. Bach. Network Flow Algorithms for Structured Sparsity. NIPS, 2010.
[2] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Hierarchical Sparse Coding. arXiv:1009.2139v1.
[3] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods for Sparse Hierarchical Dictionary Learning. ICML, 2010.
Reverse Multi-Label Learning
INRIA Rhone-Alpes, F107
Monday, September 20th 2010, 16h00

### Abstract:

Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several state-of-the-art methods, showing excellent performance in a number of datasets from several different domains, including biology, images, text and music.
Online Learning for Object Tracking
INRIA Rhone-Alpes, F107
Thursday, August 26th 2010, 14h00

### Abstract:

Online learning deals with decision making problems where the model does not have access to the entire data domain and needs to predict and learn as the data appears. In this talk, I will mainly focus on object tracking as an application and show how different online and semi-supervised learning models can be used for this task.
Examples of Positive Definite Kernels on Time Series
INRIA Rhone-Alpes, F107
Wednesday, July 7th 2010, 16h30

### Abstract:

We propose a new family of kernels to handle time series, within the framework of kernel methods which includes popular algorithms such as the support vector machine. These kernels elaborate on the well-known dynamic time warping (DTW) family of distances by considering the same set of elementary operations, namely substitutions and repetitions of tokens, to map a sequence onto another. Associating a given score with each of these operations, DTW algorithms use dynamic programming techniques to compute an optimal sequence of operations with high overall score. In this paper we instead consider the scores spanned by all possible alignments, take a smoothed version of their maximum and derive a kernel out of this formulation. We prove that this kernel is positive definite under favorable conditions and show how it can be tuned effectively for practical applications.
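The contrast between DTW's single best alignment and a score that accounts for all alignments can be sketched with two small dynamic programs. This is only an illustration under stated assumptions, not the speaker's implementation: the Gaussian `local_kernel` on scalar tokens, its `sigma` parameter, and the choice of summing the scores of all alignments as the smoothed maximum are all hypothetical choices made for this example.

```python
import math

def local_kernel(a, b, sigma=1.0):
    # Hypothetical local similarity between two scalar tokens (assumption).
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))

def dtw_cost(x, y):
    # Classic DTW: cost of the single best alignment, using squared differences.
    n, m = len(x), len(y)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = (x[i - 1] - y[j - 1]) ** 2
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def alignment_sum_kernel(x, y, sigma=1.0):
    # Same recursion, but the min over predecessors is replaced by a sum and
    # costs by local similarities, so every alignment contributes its score.
    n, m = len(x), len(y)
    k = [[0.0] * (m + 1) for _ in range(n + 1)]
    k[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k[i][j] = local_kernel(x[i - 1], y[j - 1], sigma) * (
                k[i - 1][j] + k[i][j - 1] + k[i - 1][j - 1]
            )
    return k[n][m]
```

Both functions run in time proportional to the product of the two sequence lengths. The sketch makes no claim about positive definiteness, which the abstract says holds only under favorable conditions.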
Visual Recognition with Humans in the Loop
INRIA Rhone-Alpes, F107
Wednesday, June 2nd 2010, 11h30

### Abstract:

We present an interactive, hybrid human-computer method for object classification. The method applies to classes of problems that are difficult for most people, but are recognizable by people with the appropriate expertise (e.g., animal species or airplane model recognition). The classification method can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. Incorporating user input drives up recognition accuracy to levels that are good enough for practical applications; at the same time, computer vision reduces the amount of human interaction required. The resulting hybrid system is able to handle difficult, large multi-class problems with tightly-related categories. We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate the accuracy and computational properties of different computer vision algorithms and the effects of noisy user responses on a dataset of 200 bird species and on the Animals With Attributes dataset. Our results demonstrate the effectiveness and practicality of the hybrid human-computer classification paradigm.

This work is part of the Visipedia project, in collaboration with Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder and Pietro Perona.
From generic object detection to weakly supervised learning of object classes
INRIA Rhone-Alpes, F107
Monday, May 31st 2010, 17h00

### Abstract:

In the first part of this talk I will present a generic objectness measure, quantifying how likely it is for an image window to contain an object of any class. The measure is trained to distinguish objects with a well-defined boundary in space, such as cows and telephones, from amorphous background elements, such as grass and road. The measure combines several image cues measuring characteristics of objects, such as appearing different from their surroundings and having a closed boundary. In experiments on the PASCAL VOC 07 dataset, we show that objectness outperforms a state-of-the-art saliency measure [Hou CVPR 07]. Moreover, we give an algorithm to employ objectness to greatly reduce the number of windows that class-specific object detectors need to evaluate.

In the second part of the talk I will present a novel technique for weakly supervised learning of object classes, which employs objectness as a focus of attention mechanism. Learning an object class from cluttered training images is very challenging when the location of object instances is unknown. While previous works typically require objects covering a large portion of the images, our technique can cope with extensive clutter as well as large scale and appearance variations between instances. It simultaneously localizes instances in their training images while learning an appearance model specific to the class. We report experiments on the very challenging PASCAL VOC 07 dataset and compare to two existing methods [Chum CVPR 07], [Russell CVPR 06]. Finally, we demonstrate an application by training the fully supervised model of [Felzenszwalb PAMI 2009] from objects localized by our method, evaluate it on the PASCAL VOC 07 test set, and compare its performance to the original model trained from ground-truth bounding-boxes.
Unsupervised video indexing based on audiovisual characterization of persons
INRIA Rhone-Alpes, F107
Friday, May 28th 2010, 16h00

### Abstract:

The characterization of persons within an audiovisual document is one of the challenging problems in current research. Many studies have addressed this problem using only one modality.
From the audio point of view, the characterization of persons is generally known as speaker diarization: it aims to segment the audio stream into speaker turns and then cluster all turns that belong to the same speaker. In other words, its goal is to answer the question "who talks, and when?".
From the video point of view, the characterization of persons is generally known as people detection, tracking and recognition. In other words, it aims to answer the question "who appears, and when?".
A few other studies have addressed the problem of person characterization from a multimodal point of view. However, their applications were generally limited, constrained and supervised.
In the first part of this thesis, we propose an efficient audio indexing system that aims to split the audio channel into homogeneous segments, discard the non-speech segments, and group the segments into clusters, each ideally corresponding to one speaker. This system must operate without a priori knowledge (unsupervised learning) and must be suitable for any kind of data: TV/radio broadcast news, TV/radio debates, movies, etc.
Secondly, we propose an efficient video indexing system that aims to split the video channel into shots, detect and track people in every shot, and group all faces into clusters, each ideally corresponding to one person. This video system must also operate without a priori knowledge and be suitable for any kind of data.
Finally, we propose an efficient audiovisual indexing system that combines the audio and video indexing systems in order to deliver an audiovisual characterization of each person talking and/or appearing in the audiovisual document, and a more robust audio indexing output (respectively video indexing output) obtained with the help of video (respectively audio).
Experiments on broadcast news, debates and movies show the efficiency of each of our proposed systems, and confirm the correlation between audio and video information and the gain obtained by using both media.
Learning Image Retrieval Models using Cross Media Pseudo Relevance Feedback
INRIA Rhone-Alpes, F107
Thursday, May 27th 2010, 16h00

### Abstract:

Undisclosed
Applying bottom-up and top-down color attention for improved bag-of-words based object recognition
INRIA Rhone-Alpes
Tuesday, May 25th 2010, 16h00

### Abstract:

Generally the bag-of-words based image representation follows a bottom-up paradigm. The successive stages of the process (feature detection, feature description, vocabulary construction and image representation) are performed independently of the intended object classes to be detected. In such a framework, combining multiple cues such as shape and color often provides below-expected results.
The two main strategies to combine multiple cues, known as early and late fusion, both suffer from significant drawbacks. In this talk I present a novel method that separates the shape and color cues. Color is then used to construct a top-down category-specific attention map. The color attention map is further deployed to modulate the shape features by taking more features from regions within an image that are likely to contain an object instance. This procedure leads to a category-specific image histogram representation for each category.
Evaluation on several data sets shows that the proposed method outperforms both early and late fusion. Additionally, I will comment on its usage in our submission to the PASCAL VOC 2009 image classification challenge.
Large-scale image categorization and retrieval
Florent Perronnin
INRIA Rhone-Alpes, F107
Wednesday, May 12th 2010, 11:00

### Abstract:

This talk will consist of two parts:

1) In the first part, we will address the challenge of efficiently learning image categorizers on large datasets (e.g. > 100,000 images) with the popular bag-of-visual-words (BOV) framework. In this framework an image is described by a histogram of quantized local vectors (e.g. SIFT) and classification is typically performed using non-linear support vector machines (SVMs). Non-linear SVMs can perform significantly better than their linear counterparts but do not scale well on large datasets. As kernel machines rely on an implicit mapping of the data, it has been proposed to perform an explicit mapping of the data and to directly learn linear classifiers in the new space.
We experimented with three approaches to BOV embedding: 1) kernel PCA (kPCA), 2) a modified kPCA we propose for additive kernels and 3) random projections for shift-invariant kernels. An important conclusion is that simply square-rooting BOV vectors -- which corresponds to an exact mapping for the Bhattacharyya kernel -- already leads to large improvements, often quite close to the best results obtained with additive kernels. Another conclusion is that, although it is possible to go beyond additive kernels, the embedding comes at a much higher cost.

2) In the second part, we will provide an update on our work on Fisher kernels (FK). This is an elegant framework which extends the traditional BOV by going beyond counting. We will show that with several well-motivated modifications over the original framework, we can boost the accuracy of the FK for image categorization tasks.
For instance, on PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. A major advantage is that these results are obtained using only SIFT descriptors and inexpensive linear classifiers. We will also show that, for the task of query-by-example image retrieval, the FK performs very well using as little as a few hundred bits per image, and significantly better than the BOV.
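The square-rooting remark in the first part can be checked numerically. The sketch below assumes non-negative, L1-normalized histograms; the two vectors are made-up values for illustration. It shows that the Bhattacharyya kernel, the sum over bins of sqrt(x_i * y_i), equals a plain dot product once each vector has been square-rooted element-wise, which is why a linear classifier can be trained directly on the mapped vectors.

```python
import math

def bhattacharyya_kernel(x, y):
    # Kernel value computed directly on the histograms.
    return sum(math.sqrt(a * b) for a, b in zip(x, y))

def sqrt_embed(x):
    # Explicit feature map: element-wise square root.
    return [math.sqrt(a) for a in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy L1-normalized histograms (hypothetical values).
x = [0.5, 0.25, 0.25, 0.0]
y = [0.1, 0.6, 0.0, 0.3]

kernel_value = bhattacharyya_kernel(x, y)
linear_value = dot(sqrt_embed(x), sqrt_embed(y))
```

Here `kernel_value` and `linear_value` agree up to floating-point rounding.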
Trecvid and ACM MM video challenges
Matthijs Douze, Adrien Gaidon and Alessandro Prest
INRIA Rhone-Alpes, C207
Monday, May 3rd 2010, 16h00

### Abstract:

We will present the different video tasks in Trecvid 2010 and ACM Multimedia 2010 challenges.
Multimodal semi-supervised learning for image classification
INRIA Rhone-Alpes, F107
Wednesday, March 31st 2010, 11h00

### Abstract:

In image categorization the goal is to decide if an image belongs to a certain category or not. A binary classifier can be learned from manually labeled images; while using more labeled examples improves performance, obtaining the image labels is a time-consuming process.
We are interested in how other sources of information can aid the learning process given a fixed number of labeled images. In particular we consider a scenario where keywords are associated with the training images, e.g. as found on photo sharing websites. The goal is to learn a classifier for images alone, but we will use the keywords associated with labeled and unlabeled images to improve the classifier using semi-supervised learning. We first learn a strong Multiple Kernel Learning (MKL) classifier using both the image content and keywords, and use it to score unlabeled images. We then learn classifiers on visual features only, either support vector machines (SVM) or least-squares regression (LSR), from the MKL output values on both the labeled and unlabeled images.
In our experiments on 58 classes from the PASCAL VOC'07 and MIR Flickr sets, we demonstrate the benefit of our semi-supervised approach over only using the labeled images. We also present results for a scenario where we do not use any manual labeling but directly learn classifiers from the image tags. In this case too, the semi-supervised approach improves classification performance.
Aggregating local descriptors into a compact image representation
INRIA Rhone-Alpes, C207
Wednesday, March 24th 2010, 11h00

### Abstract:

We address the problem of image search on a very large scale, where three constraints have to be considered jointly: the accuracy of the search, its efficiency, and the memory usage of the representation. We first propose a simple yet efficient way of aggregating local image descriptors into a vector of limited dimension, which can be viewed as a simplification of the Fisher kernel representation. We then show how to jointly optimize the dimension reduction and the indexing algorithm, so that it best preserves the quality of vector comparison. The evaluation shows that our approach significantly outperforms the state of the art: the search accuracy is comparable to the bag-of-features approach for an image representation that fits in 20 bytes. Searching a 10 million image dataset takes about 50ms.
Discovering semantic concepts in large collections
INRIA Rhone-Alpes, F107
Tuesday, March 9th 2010, 16h00

### Abstract:

Supervised learning is standard for many tasks, including computer vision tasks such as object recognition or scene categorization. Powerful classifiers can obtain impressive results but require sufficient amounts of annotated training data. Despite their success, supervised methods have important limitations. In particular, annotations are expensive to obtain, prone to error, often biased, and consequently do not easily scale. We propose to move beyond supervised methods and use large collections which are available today at minimal cost and effort.
In the first part, we will see how semi-supervised learning can be used to reduce the need for annotations for the recognition of activities from sensor data.
In the second part, I will present recent work where we focus on two promising research directions for computer vision: unsupervised structure discovery and semi-supervised learning. The first one extracts semantic connections between images while the second one uses few labeled images to predict the label of new images. While focusing on the problem of object categorization, we will look at the following questions: i) How well are current image representations suited for unsupervised structure discovery and what distance measures are most applicable? and ii) How well does semi-supervised learning perform on these representations to automatically label object classes in realistic databases? Answers to these questions will be proposed through an extensive experimental study involving 3 datasets and many state-of-the-art descriptors and semi-supervised algorithms.
Improving web image search results using query-relative classifiers
INRIA Rhone-Alpes, C207
Thursday, February 18th 2010, 14h00

### Abstract:

Image web search using text queries has received considerable attention. However, current approaches require separate training for every new query, and are therefore unsuitable for real-world web search applications. The idea I'll present in this talk is to use generic classifiers that are based on query-relative features which can be used for new queries without additional training. They combine textual features, based on the occurrence of query terms in web pages and image meta-data, and visual histogram representations of images. More precisely, the visual features result from the comparison of overall statistics of visual words and statistics in images highly ranked by textual features. For evaluation purposes we use a new database which includes 71478 images returned by a web search engine for 353 different search queries, along with their meta-data and ground-truth annotations. Using this data set, we compared the image ranking performance of our model with that of the search engine, and with an approach that learns a separate classifier for each query. Our generic models that use query-relative features improve significantly over the raw search engine ranking, and also outperform the query-specific models.
A Corrected Likelihood Approach for the Pile-Up Model with Application to Fluorescence Lifetime Measurements Using Exponential Mixtures
INRIA Rhone-Alpes, F107
Wednesday, January 27th 2010, 11h30

### Abstract:

A fast and efficient estimation method is proposed that compensates for the so-called pile-up effect encountered in fluorescence lifetime measurements. The pile-up effect is due to the fact that only the shortest arrival time of a random number of emitted fluorescence photons can be detected. A likelihood-based estimator is developed that can be computed by an EM-type algorithm. The new estimator is particularly well-suited for fluorescence lifetime measurements, where arrival times are often modeled by a mixture of exponential distributions. The consistency of the estimator is shown and its limit distribution is provided. The method is evaluated on real and synthetic data. Compared to currently used methods in fluorescence, the new estimator should allow the acquisition time to be reduced by an order of magnitude.
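The pile-up effect itself is easy to reproduce in simulation. The sketch below is not the estimator from the talk. It only generates piled-up observations under assumptions made for this example: a Poisson-distributed photon count per excitation (mean `lam`), a single exponential lifetime `tau` rather than a mixture, and arbitrary parameter values.

```python
import math
import random

def sample_poisson(rng, lam):
    # Knuth's multiplication method; adequate for small lam.
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_pile_up(n_excitations, tau, lam, seed=0):
    # For each excitation, draw a random number of photons and keep only
    # the shortest arrival time, as a detector subject to pile-up would.
    rng = random.Random(seed)
    observed = []
    for _ in range(n_excitations):
        n_photons = sample_poisson(rng, lam)
        if n_photons == 0:
            continue  # nothing detected for this excitation
        arrivals = [rng.expovariate(1.0 / tau) for _ in range(n_photons)]
        observed.append(min(arrivals))
    return observed

tau = 2.0
observed = simulate_pile_up(n_excitations=20000, tau=tau, lam=2.0)
naive_lifetime = sum(observed) / len(observed)
```

Because the minimum of several exponential arrival times is shorter on average than a single one, `naive_lifetime` comes out clearly below `tau`, which illustrates the bias that a pile-up-aware estimator has to correct.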
Image Representations for 3D Reconstruction and Recognition
INRIA Rhone-Alpes, F107
Tuesday, January 5th 2010, 11h30

### Abstract:

I will present two different transformations that can be applied to images before further processing. The first transformation is called DAISY, and was originally developed for wide baseline dense reconstruction. DAISY computes dense local descriptors in an efficient way; we then use a simple graph-cut technique to match the images based on these descriptors. The second transformation was developed for fast object detection and reduces the image to local dominant orientations. This yields a compact but discriminative binary representation, which can be parsed using SSE instructions to detect objects in real-time.
Learning Distinguishing Marks for Image Classification
INRIA Rhone-Alpes, F107
Monday, January 4th 2010, 16h

### Abstract:

We tackle here the problem of multi-class image classification from few training examples, where only small parts of the image help discriminate between classes. Such problems arise when classifying images of objects/persons in the wild. In such settings, standard kernel-based classifiers perform well only when combined with strong prior knowledge and efficient discriminative part detectors. We propose here a convex, sparsity-enforcing kernel-based method for this task, introducing a pool-L1 penalty which automatically singles out discriminant "distinguishing marks" to leverage classification performance. We report experimental results on a horses in the wild dataset and on several benchmark datasets.

## Seminars in 2009

Ranking user-annotated images for multiple query terms
INRIA Rhone-Alpes, C207
Friday, September 24th 2009, 14h

### Abstract:

We show how web image search can be improved by taking into account the users who provided different images, and that performance when searching for multiple terms can be increased by learning a new combined model and taking account of images which partially match the query. Search queries are answered by using a mixture of kernel density estimators to rank the visual content of web images from the Flickr website whose noisy tag annotations match the given query terms. Experiments show that requiring agreement between images from different users allows a better model of the visual class to be learnt, and that precision can be increased by rejecting images from 'untrustworthy' users. We focus on search queries for multiple terms, and demonstrate enhanced performance by learning a single model for the overall query, treating images which only satisfy a subset of the search terms as negative training examples.
Combining efficient object localization and image classification
INRIA Rhone-Alpes, J220
Friday, September 21st 2009, 14h

### Abstract:

In this paper we present a combined approach for object localization and classification. Our contribution is twofold: (a) a contextual combination of localization and classification which shows that classification can improve detection and vice versa, and (b) an efficient two-stage sliding window object localization method that combines the efficiency of a linear classifier with the robustness of a sophisticated non-linear one. Experimental results evaluate the parameters of our two-stage sliding window approach and show that our combined object localization and classification methods outperform the state-of-the-art on the PASCAL VOC 2007 and 2008 datasets.
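The two-stage idea in (b) can be sketched generically: a cheap scorer ranks every window, and a costly scorer is applied only to the best few. The function name, the `keep` parameter and the two scoring callables are hypothetical placeholders for this illustration; the abstract does not specify how the shortlist is chosen.

```python
def two_stage_localize(windows, cheap_score, costly_score, keep=3):
    # Stage 1: rank every window with the cheap (e.g. linear) scorer.
    shortlist = sorted(windows, key=cheap_score, reverse=True)[:keep]
    # Stage 2: re-score only the shortlist with the costly (e.g. non-linear) scorer.
    rescored = [(costly_score(w), w) for w in shortlist]
    best_score, best_window = max(rescored, key=lambda pair: pair[0])
    return best_window, best_score
```

With `keep` much smaller than the number of windows, the costly scorer runs only `keep` times per image, at the risk of discarding a good window in the first stage. The sketch expects a non-empty list of windows.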
TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation
INRIA Rhone-Alpes, C208
Friday, September 2nd 2009, 17h

### Abstract:

Image auto-annotation is an important open problem in computer vision. For this task we propose TagProp, a discriminatively trained nearest neighbor model. Tags of test images are predicted using a stochastic, weighted nearest neighbor selection model to exploit labeled training images. Neighbor weights are based on neighbor rank or distance. TagProp allows the integration of metric learning by directly maximizing the log-likelihood of the expected tag predictions in the training set. In this manner, we can optimally combine a collection of image similarity metrics that cover different aspects of image content, such as local shape descriptors, or global color histograms. We also introduce a word specific sigmoidal modulation of the weighted neighbor tag predictions to boost the recall of rare words. We investigate the performance of different variants of our model and compare to existing work. We present experimental results for three challenging data sets. On all three, TagProp makes a marked improvement as compared to the current state-of-the-art.
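A distance-weighted nearest neighbor tag predictor of the kind described above can be sketched as follows. This is a simplified illustration, not the TagProp implementation: the softmax over negative distances, the `scale` parameter and the toy data are assumptions, and both the metric learning and the word-specific sigmoidal modulation are left out.

```python
import math

def neighbor_weights(distances, scale=1.0):
    # Softmax over negative distances: closer neighbors get larger weights,
    # and the weights sum to one.
    closest = min(distances)
    exps = [math.exp(-scale * (d - closest)) for d in distances]
    total = sum(exps)
    return [e / total for e in exps]

def predict_tags(distances, neighbor_tags, vocabulary, scale=1.0):
    # Score of a tag = total weight of the neighbors annotated with it.
    weights = neighbor_weights(distances, scale)
    return {
        tag: sum(w for w, tags in zip(weights, neighbor_tags) if tag in tags)
        for tag in vocabulary
    }

# Toy example with three training neighbors (hypothetical data).
distances = [0.1, 0.5, 2.0]
neighbor_tags = [{"beach", "sea"}, {"sea"}, {"city"}]
scores = predict_tags(distances, neighbor_tags, ["beach", "sea", "city"])
```

In this toy example "sea" scores highest because the two closest neighbors carry it. In a learned model, `scale` (or a full combination of several distances) would be fitted on training data rather than fixed by hand.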
Mining visual actions from movies
INRIA Rhone-Alpes, C208
Friday, July 31st 2009, 12h

### Abstract:

This paper presents an approach for mining visual actions from real-world videos. Given a large number of movies, we want to automatically extract short video sequences corresponding to visual human actions. Firstly, we retrieve actions by mining verbs extracted from the transcripts aligned with the videos. Not all of these samples visually characterize the action and, therefore, we rank these videos by visual consistency. We investigate two unsupervised outlier detection methods: one-class Support Vector Machine (SVM) and densest component estimation of a similarity graph. Alternatively, we show how to use automatic weak supervision provided by a random background class, either by directly applying a binary SVM, or by using an iterative re-training scheme for Support Vector Regression machines (SVR). Experimental results explore actions in 144 episodes of the TV series "Buffy the Vampire Slayer" and show: (a) the applicability of our approach to a large scale set of real-world videos, (b) the importance of visual consistency for ranking videos retrieved from text, (c) the added value of random non-action samples and (d) the ability of our iterative SVR re-training algorithm to handle weak supervision. The quality of the rankings obtained is assessed on manually annotated data for six different action classes.
Evaluation of local spatio-temporal features for action recognition
INRIA Rhone-Alpes, C208
Friday, July 29th 2009, 17h

### Abstract:

Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature and promising recognition results were demonstrated for a number of action classes. The comparison of existing methods, however, is often limited given the different experimental settings used. The purpose of this paper is to evaluate and compare previously proposed space-time features in a common experimental setup. In particular, we consider four different feature detectors and six local feature descriptors and use a standard bag-of-features SVM approach for action recognition. We investigate the performance of these methods on a total of 25 action classes distributed over three datasets with varying difficulty. Among interesting conclusions, we demonstrate that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings. We also demonstrate a consistent ranking for the majority of methods over different datasets and discuss their advantages and limitations.
Evaluation of GIST descriptors for web-scale image search
INRIA Rhone-Alpes, C207
Friday, July 3rd 2009, 14h

### Abstract:

The GIST descriptor has recently received increasing attention in the context of scene recognition. In this paper we evaluate the search accuracy and complexity of the global GIST descriptor for two applications, for which a local description is usually preferred: same location/object recognition and copy detection. We identify the cases in which a global description can reasonably be used. The comparison is performed against a state-of-the-art bag-of-features representation. We propose an indexing strategy for global descriptors that optimizes the trade-off between memory usage and precision. Our scheme provides a reasonable accuracy in some widespread application cases together with very high efficiency: In our experiments, querying an image database of 110 million images takes 0.18 second per image on a single machine. For common copyright attacks, this efficiency is obtained without noticeably sacrificing the search accuracy compared with state-of-the-art approaches.
Supervised (yes, supervised) learning with 0 examples and other methods for obviating those pesky training sets
INRIA Rhone-Alpes, F107
Wednesday, July 1st 2009, 17h

### Abstract:

Sometimes the first examples we have seen of particular objects or patterns come at test time rather than at training time. A simple example is reading a highly stylized font, say, on a store front. Appearance models trained a priori tend to do very poorly in classifying the letters of such new fonts. In this talk, I discuss our recent work in addressing the difficult problem of encountering new types of patterns at test time, especially those that are not well modeled by training data, either labeled or unlabeled. In the first part of the talk, I present ways of constraining the interpretations of patterns that are invariant to their appearance. This sounds paradoxical, but is quite simple. For example, the string 01221221331 is an encoding of a common string where each letter has been substituted with a digit. (Can you guess the string?) We show how such techniques can be used to provide important constraints in difficult problems like scene text recognition. In the second part of the talk, I discuss our work in optical character recognition. I discuss a "font free" OCR system which has never been trained on, or given any information about the specific appearance of any character, and yet can easily read the majority of most documents correctly. I also discuss new work in bootstrapping training sets in OCR problems. In this work, we automatically extract "training sets" from noisy documents so that we can dynamically build document specific models. We call this "Learning on the Fly". Finally, I discuss potential application of such ideas to other problems in computer vision and pattern recognition.
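The appearance-invariant encoding used in the digit-string example can be sketched in a few lines: each distinct character is replaced by the order of its first appearance, so two words with the same repetition structure receive the same code regardless of how their glyphs look. The function names and the tiny lexicon are hypothetical, and the abstract does not say how the talk's system actually applies such constraints.

```python
def pattern_code(word):
    # Map each distinct character to the index of its first appearance.
    first_seen = {}
    code = []
    for ch in word.lower():
        if ch not in first_seen:
            first_seen[ch] = len(first_seen)
        code.append(first_seen[ch])
    return code

def candidates(code, lexicon):
    # Keep only the lexicon words whose repetition pattern matches the code.
    return [w for w in lexicon if pattern_code(w) == code]

# "banana" has the pattern 0 1 2 1 2 1.
example = pattern_code("banana")
matches = candidates(pattern_code("level"), ["level", "hello", "radar", "refer"])
```

The code is returned as a list of integers rather than a digit string so that words with more than ten distinct characters remain unambiguous.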
Automatic Film Editing for Storytelling Using a Computational Model of Film Grammar
INRIA Grenoble (work done at Xtranormal Technology, Montreal, Quebec during a leave of absence from INRIA)
INRIA Rhone-Alpes, F107
Thursday, May 28th 2009, 16h

### Abstract:

This talk presents new tools that I have been developing over the last two years for an application that lets a non-expert user write a story in words and translate it into a short animated movie. I will focus on the important step of editing the shots from many virtual cameras into a "correct" movie, according to the rules of traditional film grammar. I will explain how this editing process can be modeled with a semi-Markov conditional random field and how its parameters can be learned directly from movies. I will show "proof-of-concept" results that were obtained in a simplified setting, and conclude with a discussion of research topics that still need to be addressed in future work.
Kernel-based Methods for Detection
Zaïd Harchaoui
Laboratoire Traitement et Communication de l'Information, CNRS-TELECOM ParisTech
INRIA Rhone-Alpes, F107
Monday, January 26th 2009, 14h

### Abstract:

Kernel methods have enjoyed considerable success in machine learning over the past two decades, especially for tackling supervised learning tasks. We address here the issue of building kernel-based methods for solving unsupervised detection problems. First, we propose a family of kernels for computer vision, based on the soft-matching of common subtree-patterns. Second, we introduce a regularized kernel-based test statistic for testing homogeneity of two samples, for which we established the null distribution and proved the consistency in power in a large-sample setting. Our regularized kernel-based test statistic was successfully applied in a speaker verification task. We also derived a computationally attractive variant of this approach within a sliding-window framework for the temporal segmentation of audio tracks from archives of entertainment TV shows for indexing purposes. Third, we introduce a regularized kernel-based test statistic for change-point analysis, which was successfully applied to the temporal segmentation of Brain-Computer interface acquired signals into segments corresponding to mental tasks. Finally, we proposed two retrospective multiple change-point estimation methods, one without kernels and one with kernels, which we applied successfully to the temporal segmentation of well-log data and pop songs, respectively.
Inferring the relevance of images from eye movements
Teofilo de Campos
Textual & Visual Pattern Analysis group, Xerox XRCE
INRIA Rhone-Alpes, F107
Wednesday, January 21st 2009, 14h

### Abstract:

Query formulation and efficient navigation through data to reach relevant results are undoubtedly major challenges for image or video retrieval. Queries of good quality are typically not available and the search process needs to rely on relevance feedback given by the user, which makes the search process iterative and laborious. A key question then is: Is it possible to replace or complement scarce explicit feedback with implicit feedback (IF)? IF can be inferred from various sensors not specifically designed for the retrieval task. In this talk, I will present preliminary results on inferring the relevance of images based on IF about users' attention, measured using an eye tracking device. We have shown that, in reasonably controlled setups at least, even fairly simple features and classifiers are capable of detecting relevance based on eye movements alone, without using any explicit feedback. This work is one of the outcomes of PinView, an EU FP7 collaborative project. It was done in collaboration with A. Klami, C. Saunders and S. Kaski.
Probabilistic Models of Textual Collections for Information Access
Eric Gaussier
Université Joseph Fourier
INRIA Rhone-Alpes, F107
Wednesday, January 14th 2009, 10h

### Abstract:

Several probabilistic models of text collections have recently been introduced in the text processing community. These models are often defined from a statistical learning perspective. Over the years, however, several empirical findings on how words behave in documents have been reported (from the work of G. Zipf in 1949 to more recent studies). In this presentation, we study the links between probabilistic models of text collections and empirical observations concerning word frequency distributions. In the first part, we will introduce formal characterizations of several empirical observations. We will then review retrieval heuristics and propose an analytical characterization of them which can be used to design IR (Information Retrieval) models. Next, we will review standard probabilistic models in light of our characterizations and introduce new models (based on the beta negative binomial and log-logistic distributions) compatible with empirical observations. We will finally illustrate the behavior of our models on standard text collections.

## Seminars in 2008

High-dimensional estimation of Information-theoretic measures in nonlocal variational methods of computer vision
INRIA Rhone-Alpes, F107
Tuesday, December 16th 2008, 11h

### Abstract:

One variational formulation of image and video processing problems expresses the solution through a minimization of a statistical energy to account for uncertainty in the observations. In return, the energy is expressed as a function of the data considered as random variables. This representation aims at defining models on the image with probability density functions (PDFs). The price of the discriminative power of PDFs built on images is having to deal with PDFs defined over high-dimensional domains, such as nonlocal patch-based representations. To overcome high dimensionality, a standard solution is to assume independence between the different features in order to bring out low-dimensional marginal laws and/or to make some parametric assumptions on the PDFs, thus losing generality. A basic tool of statistics, the k-th nearest neighbor approach, can solve these difficulties by locally adapting to the distribution of the data and treating the channels jointly. We propose a general framework based on statistics to efficiently estimate information-theoretic measures in high dimension. This new framework is dedicated to variational problems, as it efficiently estimates high-dimensional statistical energies, gradients of these energies, and local probabilities; it is also fast, since the implementation runs on GPU. This framework is applied to three variational problems where high dimensionality is important: tracking, optical flow, and segmentation. For the first one, the problem is to determine in successive frames the region which best matches, in terms of a similarity measure, a ROI defined in a reference frame. We define a tracking algorithm based on the Kullback-Leibler divergence that efficiently combines several visual features. We show tracking results with high-dimensional feature vectors containing color information (including pixel-based, gradient-based and patch-based) and spatial layout.
The proposed procedure performs tracking on sequences with various difficulties such as occlusions, variations of illumination or noise. I will also detail the optical flow and segmentation algorithms derived from this framework. Finally, I will give some perspectives and future directions.
Kernel-based systems for online image retrieval
INRIA Rhone-Alpes, F107
Friday, November 28th 2008, 15h

### Abstract:

In this presentation, I will talk about image retrieval systems. The key components of content-based image retrieval (CBIR) techniques are the image representation, including features and similarity, and the search engine, which aims at retrieving data from large databases. For the indexing part, visual dictionaries are traditionally used to encode the image features. I will also present how similarity between image features may be embedded into a kernel function framework. For the retrieval part, I will discuss online learning strategies motivated by machine learning developments such as active learning. I will also talk about recent applications such as the iTOWNS (image-based Time On-line Web Navigation and Search engine) project and the K-videoScan project to illustrate my presentation.
Improving People Search Using Query Expansions: How Friends Help To Find People
INRIA Rhone-Alpes, F107
Friday, October 10th 2008, 12h

### Abstract:

We are interested in finding images of people on the web, and more specifically within large databases of captioned news images. It has recently been shown that visual analysis of the faces in images returned on a text-based query over captions can significantly improve search results. The underlying idea to improve the text-based results is that although this initial result is imperfect, the queried person will appear relatively frequently compared to other people, so we can search for a large group of highly similar faces. The performance of such methods depends strongly on this assumption: for people whose face appears in less than about 40% of the initial text-based result, the performance may be very poor. The contribution of this paper is to improve search results by exploiting faces of other people that co-occur frequently with the queried person. We refer to this process as 'query expansion'. In the face analysis we use the query expansion to provide a query-specific relevant set of 'negative' examples which should be separated from the potentially positive examples in the text-based result set. We apply this idea to a recently-proposed method which filters the initial result set using a Gaussian mixture model, and apply the same idea using a logistic discriminant model. We experimentally evaluate the methods using a set of 23 queries on a database of 15,000 captioned news stories from Yahoo! News. The results show that (i) query expansion improves both methods, (ii) our discriminative models outperform the generative ones, and (iii) our best results surpass the state-of-the-art results by 10% precision on average.
Content-based image retrieval in the large scale: from content to user
INRIA Rhone-Alpes, F107
Wednesday, October 8th 2008, 14h

### Abstract:

Scalability issues are nowadays essential for any multi-media search engine, even for relatively small datasets when using recent computer vision techniques. In this seminar, we will illustrate, through different works, how scalability considerations can be included at several stages of a complete visual indexing and retrieval chain (from content description to search results clustering, via indexing and retrieval issues).
In the first part, we will present two works on local visual feature extraction which aim at reducing space and/or time complexity. The first one concerns new local photometric descriptors based on dissociated dipoles for the retrieval of transformed images or rigid objects. Dissociated dipoles are non-local differential operators which are more stable than purely local standard differential operators. We define specific oriented dissociated dipoles around multi-resolution color Harris points and we form 20-dimensional normalized features, invariant to rotation, affine luminance transformations, negative or flip. The second work describes a new symmetry-oriented interest point detector based on the convergence of gradient orientations. The aim is to reach better visual saliency than current detectors and, as a consequence, to reduce the number of features required for content-based retrieval tasks. In the second part, we will present a new high-dimensional similarity search structure, which improves upon recent theoretical work on multi-probe and query-adaptive LSH. Whereas these methods are based on likelihood criteria that a given bucket contains query results, we define a more reliable a posteriori model taking into account some prior knowledge about the queries and the searched objects. This prior knowledge allows a better quality control of the search and a more accurate selection of the most probable buckets. We show that our a posteriori scheme outperforms other multi-probe LSH while offering a better quality control. Comparisons to the basic LSH technique show that our method allows consistent improvements both in space and time efficiency. The last part of the seminar will present a work on multi-source image search results clustering. The aim is to synthesize the search results obtained from a possibly large set of different search engines, working with heterogeneous data and similarity measures.
The developed technique is based on the Relevant-Set Correlation (RSC) model, that requires no direct knowledge of the nature or representation of the data. Instead, the RSC model relies solely on the existence of an oracle that accepts a query in the form of a reference to a data item, and returns a ranked set of references to items that are most relevant to the query. In the presented work, we describe and compare 3 different fusion strategies extending the original RSC-based clustering algorithm to the case of several oracles.
Research at the Image Understanding and
Pattern Recognition (IUPR) research group
INRIA Rhone-Alpes, F107
Tuesday, June 10th 2008, 11h

### Abstract:

Prof. Thomas Breuel is director of the Image Understanding and Pattern Recognition (IUPR) research group at the Computer Science Department of the University of Kaiserslautern and the German Research Center for Artificial Intelligence (DFKI). The group conducts basic and applied research in pattern recognition, machine learning, image understanding, and artificial intelligence, with practical applications to digital libraries, network security, bioinformatics, historical document analysis, and scientific data analysis.
Crossing textual and visual content
in different application scenarios
INRIA Rhone-Alpes, F107
Thursday, June 5th 2008, 14h

### Slides

Inria only: ppt slides

### Abstract:

I will present two approaches for hybrid text-image information processing that can be straightforwardly generalized to more general multimodal scenarios. Both approaches fall in the trans-media pseudo-relevance feedback category. The first method proposes to use a mixture model of the aggregate components, considering them as a single relevance concept. The second approach, to determine trans-media similarities between a new multimedia document and the objects of some repository, defines these similarities as an aggregation of mono-modal similarities between the elements of the aggregate and the new multimodal object. I further show how one can frame a large variety of problems in order to address them with the proposed techniques: image annotation or captioning, text illustration and multimedia retrieval and clustering. As an example scenario, the travel blog assistant system is used to illustrate some of the experimental results.
Towards a Theory of Cascaded Detectors
INRIA Rhone-Alpes, F107
Wednesday, May 7th 2008, 11h30

### Slides

Inria only: ppt slides

### Abstract:

Cascades of boosted ensembles have become popular in the object detection community following their introduction in the face detector of Viola and Jones. Since then, researchers have sought to improve upon the original approach by exploring alternative boosting methods, feature sets, etc. Nevertheless, key decisions about the most basic aspects of the original cascade classifier, such as how many hypotheses to include in an ensemble and the appropriate balance of detection and false positive rates in the individual stages, have not been studied systematically. Choices which have a significant effect on the cascade's performance are usually made with heuristics or through trial and error.

We propose a novel method for training cascade classifiers, which exploits the shape of the ROC curve for a cascade in ways that have been previously overlooked. We present a new mathematical characterization of the space of possible cascade operating points. The results of our approach are cascade detectors with significantly improved testing speeds in comparison to other automatic training methods. We automatically produce cascades whose detection speeds match those of the best hand-tuned detectors.
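To make the notion of a cascade operating point concrete, here is a minimal sketch of the textbook bookkeeping for a detection cascade. It is not the training method proposed in the talk. It assumes that stage decisions are independent, that almost all scanned windows are background, and that each stage is summarized by a hypothetical triple (detection rate, false positive rate, number of features evaluated); the function name `cascade_operating_point` is likewise illustrative.

```python
def cascade_operating_point(stages):
    """Combine per-stage rates into an overall cascade operating point.

    stages: list of (detection_rate, false_positive_rate, n_features) tuples,
            one per stage, in evaluation order (a hypothetical summary format).
    Returns (overall_detection_rate, overall_false_positive_rate,
             expected_features_per_background_window).
    Assumes stage decisions are independent of each other.
    """
    detection = 1.0
    false_pos = 1.0
    expected_features = 0.0
    reach_prob = 1.0  # probability that a background window reaches this stage
    for d, f, n in stages:
        expected_features += reach_prob * n  # cost is only paid if the stage is reached
        detection *= d   # an object must pass every stage
        false_pos *= f   # a background window must fool every stage
        reach_prob *= f
    return detection, false_pos, expected_features


if __name__ == "__main__":
    # Three stages, each keeping 99% of objects and rejecting half of the background.
    stages = [(0.99, 0.5, 2), (0.99, 0.5, 10), (0.99, 0.5, 25)]
    print(cascade_operating_point(stages))
```

Under these assumptions the trade-off mentioned in the abstract is visible directly: making early stages stricter lowers the expected number of features evaluated per window, but every stage multiplies into the overall detection rate.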
Improving fast nearest neighbour search in large database
for visual recognition.
INRIA Rhone-Alpes, F107
Tuesday April 8th 2008, 16h00

### Abstract:

Local feature detectors and descriptors of local image structures are used in many state-of-the-art vision methods that require local image-to-image correspondences. In this talk I will discuss an approach for linear discriminant projection of high-dimensional image descriptors to reduce the number of dimensions and to improve their matching performance. The method is based on Fisher analysis and global statistics which can be estimated from real or simulated training data. The projected descriptors are more discriminative than the original ones, 3-4 times more memory efficient, and require only a small computational overhead. I will show experimental results in the context of fast search for visual correspondence using different tree data structures and approximate nearest neighbour search. Finally, a recognition system based on a vocabulary forest of local features will be presented. The system is capable of simultaneous categorization and localization of scenes, objects and actions.
Techniques of information extraction and text mining
Benjamin Rozenfeld
INRIA Rhone-Alpes, C208
Thursday April 3rd 2008, 11h00

### Abstract:

The talk will consist of two parts. It will start with a broad overview of text mining, its main goals, tasks, and problems. Several common tasks will be described in some detail, including building and preprocessing of text collections, text categorization, extraction of terms, entities and relations, and document summarization. Known well-performing techniques for solving these problems will be briefly discussed. In the second part, several complete information extraction and text mining systems will be presented in more detail, their strengths and shortcomings demonstrated and contrasted.
Improving People Search Using Query Expansions:
How Friends Help To Find People
Thomas Mensink
INRIA Rhone-Alpes, C207
Friday March 28th 2008, 15h00

### Abstract:

Faces are important to people, so detecting and recognizing faces are important applications for visual pattern recognition methods. Recently these have found their way into consumer products such as digital cameras. In this paper we are interested in finding images of people on the web, and more specifically in large databases of captioned news images.
It has recently been shown that analysis of the faces in images returned on a text-based query over captions can significantly improve search results. The idea underlying this clean-up of text-based results is that the queried person will appear relatively often compared to other people, so we can search for a large group of highly similar faces. The performance of such methods depends strongly on this assumption: for people whose face appears in less than about 40% of the initial result set, performance may be very poor.
The contribution of this paper is to improve search results by exploiting faces of other people that co-occur frequently with the queried person. We refer to this process as 'query expansion'. In the face analysis we use the query expansion to provide a query-specific relevant set of 'negative' examples which should be separated from the potentially positive examples in the initial result set. We apply this idea to a recently-proposed method which filters the initial result set using a Gaussian mixture model. We also consider replacing the Gaussian mixture with a linear discriminant as the basic tool to refine the text-based query results.
We experimentally evaluate the methods using a set of 23 queries on a database of 15,000 captioned news stories from Yahoo! News. The results show that query expansion improves both methods, that our new discriminative method outperforms generative approaches, and that our best results surpass state-of-the-art results by 10% precision on average.
Hierarchical Spectral Latent Variable Models (HSLVM)
for Perceptual Inference
INRIA Rhone-Alpes, F107
Friday March 7th 2008, 15h00

### Abstract:

I will discuss a recently introduced class of non-linear generative models referred to as Spectral Latent Variable Models (SLVM), which combine the advantages of spectral embeddings with those of latent variable models: they (1) provide latent spaces that preserve geometric properties -- either global or local -- of the data distribution; (2) offer low-dimensional spaces with probabilistic, bi-directional mappings to and from the data space; and (3) are probabilistically consistent, i.e. reflect the data distribution, both jointly and marginally, and can be learned with reasonable efficiency. Time permitting, I will discuss the extension of this model to hierarchies (HSLVM) that represent multiple levels of correlation in the data. In this case, training boils down to learning a partially observed directed graphical model with tree dependency and local distributions modeled as SLVMs. In practice, HSLVMs provide competitive priors compared to PCA, GPLVM (Gaussian Process Latent Variable Model) or GTM (Generative Topographic Mapping) when tracking facial expressions or human body motions like walking, running, pantomime or dancing, not only in benchmark datasets like HumanEva, but also in movies like Run Lola Run.
Viewpoint-Independent Object Class Detection
using 3D Feature Maps
INRIA Rhone-Alpes, C207
Friday February 22nd 2008, 16h00

### Abstract:

We present a 3D approach to multi-view object class detection. Most existing approaches recognize object classes for a particular viewpoint or combine classifiers for a few discrete views. We propose instead to build 3D representations of object classes which make it possible to handle viewpoint changes and intra-class variability. Our approach extracts a set of pose and class discriminant features from synthetic 3D object models using a filtering procedure, evaluates their suitability for matching to real image data and represents them by their appearance and 3D position. We term these representations 3D Feature Maps. To recognize an object class in an image, we match the synthetic descriptors to the real ones in a 3D voting scheme. Geometric coherence is reinforced by means of a robust pose estimation which yields a 3D bounding box in addition to the 2D localization. The precision of the 3D pose estimation is evaluated on a set of images of a calibrated scene. The 2D localization is evaluated on the PASCAL 2006 dataset for motorbikes and cars, showing that its performance can compete with state-of-the-art 2D object detectors.
Automatic Face Naming with Caption-based Supervision
INRIA Rhone-Alpes, C207
Monday February 11th 2008, 16h00

### Abstract:

We consider two scenarios of naming people in databases of news photos with captions: (i) finding faces of a single person, and (ii) assigning names to all faces. We combine an initial text-based step, that restricts the name assigned to a face to the set of names appearing in the caption, with a second step that analyzes visual features of faces. By searching for groups of highly similar faces that can be associated with a name, the results of purely text-based search can be greatly improved. We improve a recent graph-based approach, in which nodes correspond to faces and edges connect highly similar faces. We introduce constraints when optimizing the objective function, and propose improvements in the low-level methods used to construct the graphs. Furthermore, we generalize the graph-based approach to face naming in the full data set. In this multi-person naming case the optimization quickly becomes computationally demanding, and we present an important speed-up using graph-flows to compute the optimal name assignments in documents. Generative models have previously been proposed to solve the multi-person naming task. We compare the generative and graph-based methods in both scenarios, and find significantly better performance using the graph-based methods in both cases.
Category level object segmentation
using appearance models and Markov Random Fields
INRIA Rhone-Alpes, C207
Thursday January 31st 2008, 15h00

### Abstract:

Object models based on bag-of-words representations achieve state-of-the-art performance for image classification and object localization tasks. However, as they consider objects as loose collections of local patches, they fail to accurately locate object boundaries and are not able to produce accurate object segmentation. On the other hand, Markov Random Field models used for image segmentation focus on object boundaries but can hardly use the global constraints necessary to deal with object categories whose appearance may vary significantly. Here we propose to combine the advantages of these two approaches. First, a mechanism based on blobs of local regions makes it possible to detect objects using visual word occurrences and produces a rough image segmentation. Second, an MRF component gives clean cuts and enforces label consistency, guided by local image cues (color, texture and edge cues) and by long-distance dependencies. Gibbs sampling is used to infer the model. The proposed method is used to segment object categories with highly varying appearance in the presence of cluttered backgrounds and large viewpoint changes.
Learning realistic human actions from movies
Ivan Laptev and Marcin Marszalek
IRISA Rennes and INRIA Rhone-Alpes
INRIA Rhone-Alpes, C207
Tuesday January 19th 2008, 15h00

### Abstract:

The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems, one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas including local space-time features, space-time pyramids and multi-channel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset by achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and show high tolerance of our method to annotation errors in the training set. We finally apply the method to the learning and classification of challenging action classes in movies and show promising results.

## Seminars in 2007

Scene Segmentation with CRFs Learned
from Partially Labeled Images
INRIA Rhone-Alpes, C207
Friday November 30th 2007, 16h00

### Abstract:

Conditional Random Fields (CRFs) are an effective tool for a variety of different data segmentation and labelling tasks including visual scene interpretation, which seeks to partition images into their constituent semantic-level regions and assign appropriate class labels to each region. For accurate labelling it is important to capture the global context of the image as well as local information. We introduce a CRF based scene labelling model that incorporates both local features and features aggregated over the whole image or large sections of it. Secondly, traditional CRF learning requires fully labelled datasets. Complete labellings are typically costly and troublesome to produce. We introduce an algorithm that allows CRF models to be learned from datasets where a substantial fraction of the nodes are unlabeled. It works by marginalizing out the unknown labels so that the log-likelihood of the known ones can be maximized by gradient ascent. Loopy Belief Propagation is used to approximate the marginals needed for the gradient and log-likelihood calculations and the Bethe free-energy approximation to the log-likelihood is monitored to control the step size. Our experimental results show that incorporating top-down aggregate features significantly improves the segmentations and that effective models can be learned from fragmentary labellings. The resulting methods give scene segmentation results comparable to the state-of-the-art on three different image databases.
Open Source, Distributed and Peer-to-Peer
Information Retrieval
INRIA Rhone-Alpes, Grand Amphi
Monday November 67th 2007, 15h00

### Abstract:

I will review arguments for open source and distributed search/IR, introduce the basic concepts used in distributed IR, and discuss some aspects of P2P IR in this context. The talk will be based mainly on my tutorial given at the 6th European Summer School in Information Retrieval (ESSIR 2007) in Glasgow in August.
Biological Vision and Artificial Vision:
Towards a Convergence?
Centre de Recherche Cerveau et Cognition, Toulouse
SpikeNet Technology SARL, Labège
INRIA Rhone-Alpes, F107
Wednesday November 7th 2007, 14h00

### Abstract:

More than 25 years ago, David Marr proposed that biological vision and machine vision could be part of the same discipline. It must be admitted that this merger has not taken place. Yet there are signs that such a convergence could happen. For a good ten years now, research on biological systems has suggested that certain a priori complex tasks (such as deciding whether an image contains an animal) can be performed so quickly that only essentially feed-forward processing seems likely to be involved. It is moreover probable that this type of judgment is made even before the scene is segmented. Interestingly, in machine vision it is precisely this type of architecture that represents the state of the art. Is it possible that natural selection and machine vision researchers are converging on the same solutions?
Fisher Kernels on Visual Vocabularies
for Image Categorization
Florent Perronin
INRIA Rhone-Alpes, F107
Wednesday October 31st 2007, 14h30

### Abstract:

Within the field of pattern classification, the Fisher kernel is a powerful framework which combines the strengths of generative and discriminative approaches. The idea is to characterize a signal with a gradient vector derived from a generative probability model and to subsequently feed this representation to a discriminative classifier. We propose to apply this framework to image categorization where the input signals are images and where the underlying generative model is a visual vocabulary: a Gaussian mixture model which approximates the distribution of low-level features in images. We show that Fisher kernels can actually be understood as an extension of the popular bag-of-visterms. Our approach demonstrates excellent performance on the VOC 2006 and VOC 2007 databases. It is also very practical: it has low computational needs both at training and test time and vocabularies trained on one set of categories can be applied to another set without any significant loss in performance.
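As an illustration of the "gradient vector derived from a generative probability model", the sketch below computes the gradient of the average log-likelihood of a set of features with respect to the means of a Gaussian mixture. It is a simplified stand-in, not the speaker's implementation: features are one-dimensional scalars, only the mean parameters are differentiated, no Fisher information normalization is applied, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fisher_mean_gradient(samples, weights, means, variances):
    """Gradient of the average GMM log-likelihood w.r.t. each component mean.

    samples: 1-D feature values extracted from one image (illustrative).
    weights, means, variances: parameters of the 1-D Gaussian mixture that
    plays the role of the visual vocabulary.
    Returns one number per mixture component, whatever the number of samples.
    Note: samples extremely far from all components could underflow to a
    zero likelihood; this sketch does not guard against that.
    """
    grad = [0.0] * len(means)
    for x in samples:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        for k, (m, v) in enumerate(zip(means, variances)):
            posterior = joint[k] / total  # soft assignment of x to component k
            grad[k] += posterior * (x - m) / v
    return [g / len(samples) for g in grad]


if __name__ == "__main__":
    features = [0.1, 0.2, 1.9, 2.4, 2.2]
    print(fisher_mean_gradient(features, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]))
```

The output has a fixed length equal to the number of mixture components, so images with different numbers of features map to vectors of the same size. Keeping only the summed soft assignments would give a soft bag-of-words histogram; the extra factor `(x - m) / v` is what this gradient adds on top of those counts.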
Sprite learning and object category recognition
using invariant features
INRIA Rhone-Alpes, F107
Thursday October 25th 2007, 11h00

### Abstract:

This talk will discuss the use of invariant features to learn the appearance of specific objects and to learn to detect and locate instances of object categories. A popular framework for the interpretation of image sequences is the layers or sprite model. Jojic and Frey (2001) provide a generative probabilistic model framework for this task, but their algorithm is slow as it needs to search over discretised transformations for each layer. We show that by using invariant features and clustering their motions we can reduce or eliminate search and thus learn the sprites much faster. The Generative Template of Features (GTF) is a parts-based model for visual object category detection. The GTF consists of a number of parts, and for each part there is a corresponding spatial location distribution and a distribution over 'visual words' (clusters of invariant features). We examine the performance of the GTF model, and discuss the connection of the GTF to Hough-transform-like methods for object localisation.
Enhanced Local Texture Feature Sets
for Face Recognition under Difficult Lighting Conditions
Xiaoyang Tan
INRIA Rhone-Alpes, C208
Thursday October 11th 2007, 17h00

### Abstract:

Recognition in uncontrolled situations is one of the most important bottlenecks for practical face recognition systems. We address this by combining the strengths of robust illumination normalization, local texture based face representations and distance transform based matching metrics. Specifically, we make three main contributions: 1) we present a simple and efficient preprocessing chain that eliminates most of the effects of changing illumination while still preserving the essential appearance details that are needed for recognition; 2) we introduce Local Ternary Patterns (LTP), a generalization of the Local Binary Pattern (LBP) local texture descriptor that is more discriminant and less sensitive to noise in uniform regions; and 3) we show that replacing local histogramming with a local distance transform based similarity metric further improves the performance of LBP/LTP based face recognition. The resulting method gives state-of-the-art performance on several popular datasets chosen to test recognition under difficult illumination conditions: Face Recognition Grand Challenge experiment 4 (versions 1 and 2), Extended Yale-B, and CMU PIE.
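The following is a minimal sketch of how a ternary code can be computed for one 3x3 neighbourhood. The abstract does not spell out the rule, so the details here are assumptions based on the commonly published description of LTP: a neighbour is coded +1 if it is at least `t` above the centre, -1 if it is at least `t` below, and 0 otherwise, and the ternary code is then split into two binary codes. The function name, the neighbour ordering and the sample patch are illustrative.

```python
def ltp_codes(patch, t):
    """Ternary pattern of a 3x3 patch, plus its two binary halves.

    patch: 3x3 nested list of grey levels.
    t: tolerance around the centre value (t = 0 leaves no tolerance zone).
    Returns (ternary, upper, lower): the list of eight ternary digits, and
    the integer codes built from the +1 and the -1 positions respectively.
    """
    center = patch[1][1]
    # Eight neighbours, clockwise from the top-left corner
    # (an arbitrary but fixed ordering chosen for this sketch).
    neighbours = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                  patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    ternary = []
    for value in neighbours:
        if value >= center + t:
            ternary.append(1)
        elif value <= center - t:
            ternary.append(-1)
        else:
            ternary.append(0)  # within the tolerance zone around the centre
    upper = sum(1 << i for i, s in enumerate(ternary) if s == 1)
    lower = sum(1 << i for i, s in enumerate(ternary) if s == -1)
    return ternary, upper, lower


if __name__ == "__main__":
    patch = [[54, 60, 49],
             [50, 52, 47],
             [58, 53, 70]]
    print(ltp_codes(patch, 5))  # ([0, 1, 0, -1, 1, 0, 1, 0], 82, 8)
```

The zero state is what gives the reduced sensitivity to noise in uniform regions: small fluctuations around the centre value fall inside the tolerance zone instead of flipping a bit.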
Using non-expert collaborative work sources
to create ontologies for visual recognition
Pierre Bernard
ENSIMAG, Grenoble
INRIA Rhone-Alpes, C207
Friday October 5th 2007, 16h30

### Abstract:

In the framework of multi-class recognition, we propose to automatically extract inter-class knowledge from non-expert work sources to build visual-centered hierarchies. We demonstrate the quality of these hierarchies, which express visual similarity or contextual links between classes. We describe how to build and train classifiers that take advantage of them to perform object detection. We evaluate our approach on the Pascal VOC'07 dataset, a set of challenging real-world images, showing a significant average gain compared to the standard one-against-rest method.
How to Dispatch Observers to Track an Evolving Boundary
INRIA Rhone-Alpes, F107
Wednesday October 3rd 2007, 11h30

### Abstract:

Some distributed-sensing applications make it necessary to dispatch a limited number of observers (ships, vehicles, or airplanes with cameras; field workers with chemical kits; high-flying balloons with atmospheric sensors) to track the evolving boundary of a large phenomenon such as an oil spill, a fire, a hurricane, air or water pollution, or El Niño. This paper develops a new framework for controlling the movements of the observers to maximize the information gained about the boundary's shape and position. To this end, we represent boundary uncertainty by a particle filter where each particle is a binary indicator function. This makes our dispatch algorithms applicable to arbitrary boundary representations from which indicator functions can be computed, including level sets and polygonal approximations. We demonstrate the benefits of optimal dispatch on both synthetic and real data. These benefits are most apparent when the observers are sparse relative to the boundary size.
Randomized forests for learning the distance
between visual object classes
Josip Krapac
INRIA Rhone-Alpes, C207
Tuesday September 25th 2007, 16h

### Abstract:

I will present work on using a combination of randomized forests and SVMs to learn the distance between object classes from image pairs labeled as "same" or "different". This work is an extension of previous work that used randomized forests to learn the distance between object instances of the same class. It was shown that this learned distance generalizes well to instances of the same class that were never seen before.
In order to handle the increased within-class variability in the case of visual object classes, representative images of the class (focal images) are used. The distance to a class is obtained as a combination of the distances to each of the representative images of the class.
A Large Scale Tracking Problem: Tracking Migrating and Proliferating Cells in Phase-Contrast Microscopy Imagery
INRIA Rhone-Alpes, Grand Amphi
Monday September 13rd 2007, 11h

### Abstract:

In Tissue Engineering, the development of tissue substitutes to restore, maintain, or improve human tissues involves implanting scaffolds (biodegradable extracellular matrices) and seeding and culturing cells with hormones to induce growth of tissue. Computer vision can provide the capability to "engineer individual cells" - precisely and individually tracking a large number of cells in vivo in real time to study and direct the migration and proliferation of tissue cells. The varying density of the cell culture and the complexity of the cell behavior (shape deformation, division/mitosis, close contact and partial occlusion) pose many challenges to tracking techniques. Using our work in collaboration with biomedical engineers, I will present the challenge and excitement of this new application area of motion image analysis.
Character Recognition using Bag of Features: Baseline Results for Latin and Kannada Characters
Teo de Campos
INRIA Rhone-Alpes, C207
Wednesday August 22nd 2007, 15h

### Abstract:

In this talk, I will present the ongoing work that I started at Microsoft Research India with Manik Varma. We targeted character recognition from natural images. An intended application is recognition of text from portable cameras to aid tourists who do not know the local language. We acquired an image data set of Latin and Kannada characters composed of synthesized characters using computer fonts, handwritten characters and natural images obtained from photographs. Our main test sets are from the latter group. The problem was approached using bag of features, and five feature extraction methods were evaluated.
Accurate object detection
with deformable shape models learnt from images
Vittorio Ferrari
INRIA Rhone-Alpes, Amphi F107
Wednesday July 18th 2007, 16h

### Abstract:

In this talk we present an object class detection approach which fully integrates the complementary strengths offered by shape matchers. Like an object detector, it can learn class models directly from images, and localize novel instances in the presence of intra-class variations, clutter, and scale changes. Like a shape matcher, it finds the accurate boundaries of the objects, rather than just their bounding-boxes. This is made possible with a novel technique for learning both the prototypical shape of an object class and a statistical model of how it can deform, given just images of example instances. Once the model is learnt, we localize novel instances in cluttered images by combining a Hough-style voting process with a non-rigid point matcher. Through experimental evaluation, we show how the method can detect objects and localize their boundaries accurately, while needing no segmented training examples (only bounding-boxes).
Local Subspace Classifiers
Hakan Cevikalp
INRIA Rhone-Alpes, C207
Friday July 13th 2007, 16h

### Abstract:

The K-local hyperplane distance nearest neighbor (HKNN) algorithm is a local classification method that builds nonlinear decision surfaces by using locally linear manifolds directly in the original sample space. Although it has been successfully applied in several classification tasks, it is limited to using the Euclidean distance metric, which is a significant limitation in practice. In this paper we reformulate HKNN in terms of subspaces, and propose a variant, the Local Discriminative Common Vector (LDCV) method, that is more suitable for classification tasks where the classes have similar intra-class variations. We then extend both methods to the nonlinear case by using the kernel trick to map the data into a higher-dimensional space, in which the linear manifolds are constructed. This construction allows us to use a wide variety of distance functions for the local classifiers, while computing distances between the query sample and the nonlinear manifolds remains straightforward owing to the linear nature of the manifolds in the mapped space. We tested the proposed methods on several classification tasks, obtaining better results than both Support Vector Machines (SVMs) and their local counterpart SVM-KNN on the USPS and Image segmentation databases, and outperforming the local SVM-KNN on the Caltech and Xerox10 visual recognition databases.
Using shape information for recognition
INRIA Rhone-Alpes, Amphi F107
Wednesday June 27th 2007, 10h

### Abstract:

Shape information is an important cue for recognizing objects and object categories in images. In fact, many categories are characterized primarily by the consistency of their shape, while intra-class texture statistics may not be as informative. This is true even for categories that include a large degree of geometric deformation. Recent work in the community has shown progress in using shape cues for recognition, including learned boundary detectors, and matching and classification using local configurations of contour fragments. In this talk, I will review three recent developments in this area. The first one is an algorithm for category recognition which relies on very simple shape features (oriented points sampled on contour fragments). The algorithm uses an efficient spectral matching technique for both matching and learning. The category models can be learned from semi-supervised data (i.e., images labeled as containing/not containing the object without manual delineation of the object). An added benefit of this approach is that it uses an explicit matching approach between image features and model parts. As a result, it is possible to extend the classification algorithm to an efficient detection algorithm, which includes object localization.
Two other developments will be very briefly described. The first one has to do with using motion information to detect boundaries; the second one addresses the problem of extracting boundaries from a single image by using estimates of the local geometry of the scene (using the results from our earlier work on estimating geometric layout from an image). Both approaches provide information about object boundaries that is useful for recognition.
Accurate Object Localization with Shape Masks
INRIA Rhone-Alpes, Amphi C207
Tuesday June 12th 2007, 16h

### Abstract:

We will discuss an object class localization approach which goes beyond bounding boxes, as it also determines the outline of the object. Unlike most current localization methods, our approach does not require any hypothesis parameter space to be defined. Instead, it directly generates, evaluates and clusters shape masks. Thus, the presented framework produces much richer answers to the object class localization problem. For example, it easily learns and detects possible object viewpoints and articulations, which are often well characterized by the object outline. We evaluate the proposed approach on the challenging natural-scene Graz-02 object classes dataset. The results demonstrate the extended localization capabilities of our method.
A contextual dissimilarity measure
for accurate and efficient image search
INRIA Rhone-Alpes, Amphi C207
Wednesday June 6th 2007, 16h

### Abstract:

In this paper we present two contributions to improve accuracy and speed of an image search system based on bag-of-features: a contextual dissimilarity measure (CDM) and an efficient search structure for visual word vectors.

Our measure (CDM) takes into account the local distribution of the vectors and iteratively estimates distance correcting terms. These terms are subsequently used to update an existing distance, thereby modifying the neighborhood structure. Experimental results on the Nistér-Stewénius dataset show that our approach significantly outperforms the state-of-the-art in terms of accuracy.

Our efficient search structure for visual word vectors is a two-level scheme using inverted files. The first level partitions the image set into clusters of images. At query time, only a subset of clusters of the second level has to be searched. This method allows fast querying in large sets of images. We evaluate the gain in speed and the loss in accuracy on large datasets (up to 500k images).
Learning and Recognizing Visual Object Categories
Without Detecting Features
INRIA Rhone-Alpes, Grand Amphi
Tuesday June 5th 2007, 11h

### Abstract:

Over the past few years there has been substantial progress in the development of systems that can recognize generic categories of objects in images, such as automobiles, bicycles, airplanes, and human faces. Much of this progress can be traced to two underlying technical advances: (i) detectors for locally invariant features of an image, and (ii) the application of techniques from machine learning. Despite recent successes, however, there are some fundamental concerns about methods that rely heavily on feature detection, as local image evidence is often highly ambiguous due to the absence of contextual information.

We are taking a different approach to learning and recognizing visual object categories, in which there is no separate feature detection stage. In our approach, objects are modeled as local image patches with spring-like connections that constrain the spatial relations between patches. Such models are intuitively natural, and their use dates back over 30 years. Until recently such models were largely abandoned due to computational challenges that are addressed by our work. Our approach can be used to learn models from weakly labeled training data, without any specification of the location of objects or their parts. The recognition accuracy for such models is better than when using feature-based techniques with similar forms of spatial constraint.
From objects to actions:
Detection using boosted histogram classifier
INRIA Rhone-Alpes, Amphi F107
Thursday May 31st 2007, 16h

### Abstract:

This talk will address the detection of object and action classes in unconstrained scenes. We first consider object class recognition and localisation in still images. Building upon recent advances in the field, we show how histogram-based descriptors combined with a boosting classifier provide a state-of-the-art object detector. Among the improvements, we introduce a Fisher weak learner for multi-valued histogram features and address training from limited sets of examples. We also address computational aspects and analyse the tradeoff between the speed and the accuracy of the detector. Validation of the method on the VOC05 and VOC06 benchmarks for object recognition shows its superior performance. In particular, the approach outperforms all the methods reported in the VOC05 Challenge for 7 out of 8 detection tasks while using a single set of parameters and providing close to real-time performance.

We next consider recognition and localisation of "atomic" actions in video. We treat such actions similarly to the objects in images and extend the boosted histogram detector to action detection in space-time. Using this approach, we address recognition and localisation of human actions in realistic scenarios with substantial variation in subject appearance, motion, surrounding scenes, viewing angles and spatio-temporal extents. In contrast to previous work that studies action recognition in controlled settings, here we train and test the algorithms on real movies. In particular, we investigate the combination of shape and motion information for action understanding. To this end we introduce "keyframe priming", which combines discriminative models of human appearance and motion in action. Keyframe priming is shown to significantly improve the performance of action detection. We present detection results for the action class "drinking" evaluated on two episodes of the movie "Coffee and Cigarettes" with 36,000 frames in total.
Penalized least squares with nonquadratic penalties
INRIA Rhone-Alpes, Amphi F107
Monday May 28th 2007, 15h

### Slides

Inria only: Slides pdf format

### Abstract:

A popular method for fitting a linear regression model from data measurements is regularization: minimize an objective function which enforces a roughness penalty in addition to coherence with the data. This is the case when formulating penalized least squares for linear regression models. We focus on penalized regression methods involving a variety of nonquadratic penalties, pointing out some basic principles they have in common. We end this talk with an application of such penalties for feature selection in model-based clustering problems.
Learning Visual Similarity Measures for
Comparing Never Seen Objects
INRIA Rhone-Alpes, Amphi C207
Friday May 25th 2007, 16h

### Abstract:

In this paper we propose and evaluate an algorithm that learns a similarity measure for comparing never seen objects. The measure is learned from pairs of training images labeled "same" or "different". This is far less informative than the commonly used individual image labels (e.g. "car model X"), but it is cheaper to obtain. The proposed algorithm learns the characteristic differences between local descriptors sampled from pairs of "same" and "different" images. These differences are vector quantized by an ensemble of extremely randomized binary trees, and the similarity measure is computed from the quantized differences. The extremely randomized trees are fast to learn, robust due to the redundant information they carry, and have been shown to be very good clusterers. Furthermore, the trees efficiently combine different feature types (SIFT and geometry). We evaluate our similarity measure on four very different datasets and consistently outperform the state-of-the-art competitive approaches.
Applying Generic Object Recognition Methods
to Environmental Monitoring and Ecological Science
INRIA Rhone-Alpes, Amphi F107
Wednesday April 25th 2007, 16h

### Abstract:

This talk will describe our work at Oregon State University to develop object recognition methods that can achieve high precision on the task of classifying small arthropods according to Family, Genus, and Species. Arthropods are challenging for computer vision because they have many internal degrees of freedom and because there is high within-class variation due to molting. Our interdisciplinary team combines expertise in computer vision, machine learning, mechanical engineering, and entomology to develop a high-throughput system for classifying stonefly larvae collected from freshwater streams.

We are pursuing the bag-of-SIFT approach based on many ideas from INRIA. Our system begins by applying three region detectors to each image. Two of these detectors (Harris Affine and Kadir) are well-known in computer vision, but the third is a new detector (PCBR) that we developed specifically for natural (non-man-made) objects based on principal curvature computations. Each detected region is re-represented as a SIFT descriptor vector. Next, we construct detector-specific/class-specific visual dictionaries by fitting Gaussian mixture models to the SIFT descriptor vectors. Finally, we re-represent the image as a concatenated histogram where each element counts the number of SIFT vectors mapped to the corresponding dictionary entry. This feature vector is then classified using a bag of logistic model trees.

Our initial system is capable of identifying three taxa of stoneflies with 95% accuracy and four taxa with 82% accuracy. We are currently performing an 8-taxa experiment with 10 additional "distractor" classes. This talk will also describe our current research directions and discuss a new application problem: classification and sorting of soil mesofauna.
Expressive rendering
INRIA Rhone-Alpes, Amphi F107
Wednesday March 14th 2007, 16h

### Abstract:

A part of computer graphics can be viewed as a visual communication tool. Such a point of view implies several goals that we target in ARTIS with expressive rendering. In particular, the user of an expressive rendering tool should be able to produce images that correspond to his own goals.
This involves, in particular, significant work on the notion of *relevance*, which is necessarily application-dependent. The relevance should guide the level of abstraction of the rendered scene to let the user emphasize the most important elements of the input 3D scene. It can also be defined from a levels-of-detail point of view: not only can we adapt the geometry to decrease the computation time, but we can also adapt the rendering style to meet the user's goals. Another research direction for expressive rendering concerns *rendering styles*: in many cases it should be possible to define the constitutive elements of styles, allowing the application of a given rendering style to different scenes, or in the long term the capture of style elements from collections of images.
Finally, since the application of expressive rendering techniques generally amounts to a visual simplification, or abstraction, of the scene, particular care must be taken to make the resulting images consistent over time, for interactive or animated imagery. This leads to various projects targeting the temporal coherence of animated scenes.
ROBIN project
INRIA Rhone-Alpes, C207
Tuesday February 27th 2007, 16h15

### Abstract:

This short talk is about the ROBIN project, funded by the French ministry of defense and the French ministry of research. Its main goal is to produce datasets, ground-truth data, competition rules and evaluation metrics for visual object recognition algorithms that correspond to real operational matters. As the competitions have begun, I will present the various ROBIN competitions, the databases and the submission procedures. More information at http://robin.inrialpes.fr.
Fun with Nearest-Neighbor Quantizers
INRIA Rhone-Alpes, Amphi F107
Tuesday February 6th 2007, 16h

### Abstract:

I will present recent research on using nearest-neighbor vector quantization for estimating intrinsic dimensionality of high-dimensional datasets and for learning informative partitions of labeled data.
In the first part of the talk, I will discuss a technique for intrinsic dimensionality estimation based on the theoretical notion of quantization dimension. This technique works by quantizing the dataset at increasing rates (in practice, we use k-means to learn the quantizer) and by fitting a parametric form to the plot of the empirical quantizer distortion as a function of rate. By using tree-structured quantization, we can simultaneously estimate dimensionality and partition the dataset into subsets having different intrinsic dimensions.
In the second part of the talk, I will discuss an information-theoretic method for learning a nearest-neighbor quantizer from labeled continuous data such that the index of the nearest prototype of a given data point approximates a sufficient statistic for its class label. I will demonstrate applications of this method to learning discriminative visual vocabularies for bag-of-features image classification and to image segmentation.
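As a hedged illustration of the curve-fitting step in the first part of the talk, the sketch below estimates a dimension from quantizer distortions measured at several codebook sizes. It assumes the classical high-rate behaviour in which the root-mean-square distortion of a k-point quantizer scales like C · k^(-1/d) for data of intrinsic dimension d. That parametric form, the function name `estimate_dimension` and its inputs are assumptions made for this example and may differ from the form used in the talk.

```python
import math


def estimate_dimension(codebook_sizes, rms_distortions):
    """Estimate d by fitting log(distortion) = log(C) - log(k) / d.

    codebook_sizes: list of codebook sizes k (for example from k-means).
    rms_distortions: root-mean-square quantization error at each k.
    Both lists are hypothetical inputs assumed for this sketch.
    """
    xs = [math.log(k) for k in codebook_sizes]
    ys = [math.log(e) for e in rms_distortions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least-squares slope of log-distortion against log-size.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # Under the assumed model the slope equals -1 / d.
    return -1.0 / slope
```

In practice the distortions would come from running a quantizer such as k-means at increasing codebook sizes, and the fit is only meaningful over the range of sizes where the assumed scaling holds.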

## Seminars in 2006

Inverse chronological order.

| Title | Presenter | Date and time | Affiliation | Room |
| --- | --- | --- | --- | --- |
| Learning a similarity measure to compare never seen objects | Eric Nowak | 15 December 2006, 16h30 | Lear Project, INRIA Rhone-Alpes | C207, INRIA Rhône-Alpes |
| Human character recognition in TV-style movies | Alexander Klaeser | 6 December 2006, 16h00 | Lear Project, INRIA Rhone-Alpes | C207, INRIA Rhône-Alpes |
| Sensor Synchronization and Localization for Meeting Scene Analysis | David Demirdjian | 17 October 2006, 16h | MIT Artificial Intelligence Laboratory | F107, INRIA Rhône-Alpes |
| Presentation of an appearance model for small targets tracking | Julien Bohné | 11 October 2006, 17h | Lear Project, INRIA Rhone-Alpes | C207, INRIA Rhône-Alpes |
| Contribution au mosaïquage d'images aériennes | Christophe Simler | 25 September 2006, 14h00 | Université de Haute-Alsace, composante Label | C208, INRIA Rhône-Alpes |
| Efficient MAP approximation for dense energy functions | Martial Hebert | 18 July 2006, 14h30 | The Robotics Institute, Carnegie Mellon University | F107, INRIA Rhône-Alpes |
| Blind Vision | Shai Avidan | 17 July 2006, 17h00 | Mitsubishi Electric Research Laboratories | F107, INRIA Rhône-Alpes |
| Latent Mixture Vocabularies for Object Categorization | Diane Larlus | 12 July 2006, 14h00 | LEAR Group | C207, INRIA Rhône-Alpes |
| Statistical models to address the problem of object recognition | Thomas Deselaers | 4 July 2006, 14h00 | Computer Science Department, Aachen University | Grand Amphi, INRIA Rhône-Alpes |
| Conservative Learning and On-line Boosting for Vision | Horst Bischof | 5 June 2006, 14h00 | Institute for Computer Graphics and Vision, TU Graz | Grand Amphi, INRIA Rhône-Alpes |
| Multiple Object Class Detection with a Generative Model | Bernt Schiele | 9 June 2006, 14h30 | Department of Computer Science, Darmstadt University of Technology | F107, INRIA Rhône-Alpes |
| Extremely randomized trees applied to image quantification combined to a visual attention process for object categorization | Frank Moosmann | 15 May 2006, 16h | Lear Project, INRIA Rhône-Alpes | C207, INRIA Rhône-Alpes |
| Brain Computer Interfaces | Vincent Guigue | 14 April 2006, 11h | Lab. LITIS - INSA de Rouen | F107, INRIA Rhône-Alpes |
| Methodes de filtrage pour du suivi dans des sequences d'images - application au suivi de points caracteristiques | Elise Arnaud | 4 April 2006, 16h30 | University of Genoa, Italy, and IRISA Rennes | F107, INRIA Rhône-Alpes |
| Error-resilient source codes and joint source/channel codes | Herve Jegou | 3 April 2006, 16h | IRISA/University of Rennes | A104, INRIA Rhône-Alpes |
| Object Detection in Crowded Scenes | Bastian Leibe | 20 March 2006, 11h00 | Multimodal Interactive Systems group, Darmstadt | F107, INRIA Rhône-Alpes |
| Beyond bag-of-words: recent research developments on visual categorization at XRCE | Florent Perronnin and Gabriela Csurka | 16 March 2006, 16h00 | Xerox Research Centre Europe, Image Processing Group | C207, INRIA Rhône-Alpes |
| Modelling Scenes with Local Descriptors and Latent Aspects | Tinne Tuytelaars | 16 February 2006, 15h00 | K.U.Leuven, VISICS Group | F107, INRIA Rhône-Alpes |
| Le programme TRECVID : Expérimentations en recherche par le contenu dans des bases de documents vidéos | Georges Quenot | 9 February 2006, 14h00 | CLIPS-IMAG | F107, INRIA Rhône-Alpes |
| Geometric Context from a Single Image | Derek Hoiem | 6 February 2006, 15h00 | Robotics Institute of Carnegie Mellon University | F107, INRIA Rhône-Alpes |
| Computer vision using local binary patterns | Matti Pietikäinen | 12 January 2006, 14h30 | Information Processing Laboratory, University of Oulu, Finland | F107, INRIA Rhône-Alpes |
| Evaluation de détecteurs et de descripteurs de points d'intérêt sur des images infrarouges | Julien Bohné | 11 January 2006, 16h | Lear Project, INRIA Rhone-Alpes | C207, INRIA Rhône-Alpes |

## Details of 2006 seminars

### Learning a similarity measure to compare never seen objects

Presenter: Eric Nowak
 15 December, at 16h30 C207, INRIA Rhône-Alpes
Affiliation: Lear Project, INRIA Rhone-Alpes

Abstract:
We propose a similarity measure that predicts how similar two images of never seen objects are, given a training set of similar and different object pairs. This similarity measure is used for visual identification from *one image*. It does not model any a priori deformation, nor does it expect a linear or quadratic transformation of the input space to be relevant; instead, it clusters local image representations and weights these clusters for the same/different prediction. An ensemble of extremely randomized decision trees is used as the clusterer. These trees are particularly well suited to the clustering since they are very fast to learn and they produce redundant information, which brings robustness. We evaluate our similarity measure on three datasets and outperform state-of-the-art competitive methods.

### Human character recognition in TV-style movies

Presenter: Alexander Klaeser
 6 December, at 16h00 C207, INRIA Rhône-Alpes
Affiliation: Lear Project, INRIA Rhone-Alpes

Abstract:
This master's thesis describes a supervised approach to the detection and identification of humans in TV-style video sequences. In still images and video sequences, humans appear in different poses and views, fully visible and partly occluded, with varying distances to the camera, at different places, under different illumination conditions, etc. This diversity in appearance makes human detection and identification a particularly challenging problem. A solution to this problem is of interest for a wide range of applications such as video surveillance and content-based image and video processing. In order to detect humans in views ranging from full to close-up view and in the presence of clutter and occlusion, they are modeled by an assembly of several upper body parts. For each body part, a detector is trained based on a Support Vector Machine and on densely sampled, SIFT-like feature points in a detection window. For a more robust human detection, localized body parts are assembled using a learned model for geometric relations based on Gaussians. For a flexible human identification, the outward appearance of humans is captured and learned using the Bag-of-Features approach and non-linear Support Vector Machines. Probabilistic votes for each body part are combined to improve classification results. The combined votes yield an identification accuracy of about 80% in our experiments on episodes of the TV series "Buffy the Vampire Slayer". The Bag-of-Features approach has been used in previous work mainly for object classification tasks. Our results show that this approach can also be applied to the identification of humans in video sequences. Despite the difficulty of the given problem, the overall results are good and encourage future work in this direction.

### Sensor Synchronization and Localization for Meeting Scene Analysis

Presenter: David Demirdjian
 17 October, at 16h00 F107, INRIA Rhône-Alpes
Affiliation: MIT Artificial Intelligence Laboratory

Abstract:
In this talk we tackle the problems of automatically i) synchronizing audio-visual streams and ii) localizing a set of cameras in a meeting analysis setting. More exactly, we consider a conference meeting setup where each participant wears a close-talking microphone and is recorded by a personal video camera. The multiple audio and video streams are recorded in an unsynchronized manner and the location and orientation of the cameras are unknown. We propose here some techniques for automatically estimating the time discrepancy between all audio and video streams and recovering the location and orientation of the cameras. First we show how the mutual information between the estimated motion energy of the lips and the audio energy can be used to recover the time discrepancy between the video and audio streams corresponding to the same participant. Then we show how the same technique can be used to synchronize the audio-visual streams corresponding to different participants. Finally we describe a probabilistic Bayesian framework for estimating the location and orientation of a set of cameras. We show how the head direction of the users can be used as a constraint by exploiting gaze patterns in multiparty conversational settings. In order to evaluate the performance of our algorithms, we show some synchronization and calibration results on real meetings.
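To make the synchronization idea concrete, here is a minimal sketch of recovering a time offset by searching for the lag that maximizes the mutual information between two energy signals. The histogram-based estimator, the exhaustive lag search, the bin count and the names `mutual_information` and `best_lag` are all assumptions made for this example; the abstract does not specify how the mutual information is estimated or optimized.

```python
import math
from collections import Counter


def mutual_information(xs, ys, bins=4):
    """Histogram estimate of the mutual information (in nats)."""
    def quantize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0   # avoid a zero bin width
        return [min(int((v - lo) / width), bins - 1) for v in values]

    qx, qy = quantize(xs), quantize(ys)
    n = len(qx)
    joint = Counter(zip(qx, qy))
    px, py = Counter(qx), Counter(qy)
    # Sum of p(a, b) * log(p(a, b) / (p(a) * p(b))) over observed cells.
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def best_lag(video_energy, audio_energy, max_lag):
    """Return the audio delay (in samples) with the highest estimated MI."""
    def score(lag):
        if lag >= 0:
            v = video_energy[:len(video_energy) - lag]
            a = audio_energy[lag:]
        else:
            v = video_energy[-lag:]
            a = audio_energy[:len(audio_energy) + lag]
        m = min(len(v), len(a))
        return mutual_information(v[:m], a[:m])

    return max(range(-max_lag, max_lag + 1), key=score)
```

A positive result means the audio stream lags behind the video stream by that many samples. Histogram estimates of mutual information are biased for short signals, so the bin count and signal length matter in practice.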

### Presentation of an appearance model for small targets tracking

Presenter: Julien Bohné
 11 October, at 17h00 C207, INRIA Rhône-Alpes
Affiliation: Lear Project, INRIA Rhone-Alpes

Abstract:
Our method combines a statistical appearance model of the target with an accurate model of the background in its neighborhood. The two models are updated during the image sequence to adapt to appearance changes. We take particular care that the algorithm provides a good estimate of its confidence in the position estimates.

### Contribution au mosaïquage d'images aériennes

Presenter: Christophe Simler
 25 September, at 14h00 C208, INRIA Rhône-Alpes
Affiliation: Université de Haute-Alsace, composante Label

Abstract:
This talk, entitled "Contribution au mosaïquage d'images aériennes" (a contribution to the mosaicking of aerial images), presents the work of a PhD thesis. We describe our experimental setup and the characteristics of the image sequences it produces. We then review the state of the art in mosaicking techniques and study the algorithms in depth. In the last part we discuss our contributions: the design of a descriptor vector invariant to rotations about the optical axis for matching specific points, the implementation of a subpixel registration technique for the correspondences, and the design of a method for compensating the accumulation of errors in a mosaic.

### Efficient MAP approximation for dense energy functions

Presenter: Martial Hebert
 18 July, at 14h30 F107, INRIA Rhône-Alpes
Affiliation: The Robotics Institute, Carnegie Mellon University

Abstract:
We present an efficient method for maximizing energy functions with first and second order potentials, suitable for MAP labeling estimation problems that arise in undirected graphical models. Our approach is to relax the integer constraints on the solution in two steps. First we efficiently obtain the relaxed global optimum following a procedure similar to the iterative power method for finding the principal eigenvector of a matrix. Next, we map the relaxed optimum onto a simplex and show that the new energy obtained has a certain optimal bound. Starting from this energy we follow an efficient coordinate ascent procedure that is guaranteed to increase the energy at every step and converge to a solution that obeys the initial integral constraints. We also present a sufficient condition for ascent procedures that guarantees the increase in energy at every step.
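For readers unfamiliar with the power method mentioned in the first step, the following is a plain power iteration for a small dense matrix. It is a generic textbook sketch, not the relaxation procedure of the talk; the function name `power_iteration`, the uniform starting vector and the fixed iteration count are assumptions made for this example, and the matrix is assumed to be symmetric with a dominant positive eigenvalue.

```python
import math


def power_iteration(matrix, iterations=100):
    """Approximate the dominant eigenvalue and eigenvector of a matrix.

    matrix is a square nested list; it is assumed to be symmetric, with a
    dominant eigenvalue that the uniform start vector is not orthogonal to.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Multiply by the matrix, then renormalize to unit length.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient of the unit vector v estimates the eigenvalue.
    mv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * mv[i] for i in range(n))
    return eigenvalue, v
```

Each iteration costs one matrix-vector product, which is what makes this family of methods attractive when the matrix of pairwise potentials is large.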

### Blind Vision

Presenter: Shai Avidan
 17 July, at 17h00 F107, INRIA Rhône-Alpes
Affiliation: Mitsubishi Electric Research Laboratories

Abstract:
We have developed a general framework for secure image and video analysis that allows a client to have his data analyzed by a server, privately. For example, the client might submit his images to the server for face detection, without letting the server learn anything about the content of the images. Or, more generally, the client might use a query image to query an image database stored on the server, without revealing the content of the query image to the server. In the last year, we have implemented a secure face detector as a proof-of-concept, presented our work at a scientific conference and extended the method to work with different types of machine learning technologies.

### Latent Mixture Vocabularies for Object Categorization

Presenter: Diane Larlus
 12 July, at 14h00 C207, INRIA Rhône-Alpes
Affiliation: LEAR Group

Abstract:
The visual vocabulary is an intermediate-level representation that has proven very powerful for object categorization problems. It is generally built by vector quantizing a set of local image descriptors, independently of the object model used for categorizing images. We propose here to embed the visual vocabulary creation within the object model construction, making it better suited to object class discrimination. We show experimentally that the proposed model outperforms approaches that do not learn such an adapted visual vocabulary.

### Statistical models to address the problem of object recognition

Presenter: Thomas Deselaers
 4 July, at 14h00 Grand Amphi, INRIA Rhône-Alpes
Affiliation: Computer Science Department, Aachen University

Abstract:
Object recognition in images, that is, deciding whether an object is contained in an image and telling where it is located, is an active field of research. A promising approach to this problem is to model objects as a collection of parts whose relationships can be modeled flexibly.

We present a set of methods following this approach where image patches extracted from certain points in the images are used as features.

Starting from approaches inspired by nearest neighbor classification, we develop various statistical models to address the problem of object recognition. Though most of the models developed are strongly connected, the training method and the representation of the data have a strong impact on the performance of a system. Some of the methods offer interesting insights into the way computers might be able to learn the visual appearance of certain object categories. For example, an object recognition system trained to recognize faces learns that the most discriminative, i.e. most relevant, part is the eyes.
Using the methods presented, very interesting and promising results for different tasks can be achieved.

### Conservative Learning and On-line Boosting for Vision

Presenter: Horst Bischof
 5 June, at 14h00 Grand Amphi, INRIA Rhône-Alpes
Affiliation: Institute for Computer Graphics and Vision, TU Graz

Abstract:
I will present two recently developed visual learning methods:

1. The conservative learning framework makes it possible to learn object detectors with minimal or no supervision by exploiting the redundancy of the video stream of cameras. Conservative learning exploits generative and discriminative learning in a co-training fashion to obtain powerful object detectors. We demonstrate the framework on a surveillance task where we learn person and car detectors in an on-line fashion.

2. One method in the on-line conservative learning framework is a novel on-line AdaBoost feature selection algorithm. Together with efficiently computable features (Haar wavelets, integral orientation histograms, etc.), training the classifier on-line and incrementally as new data arrives has several advantages and opens new application areas for boosting in computer vision. We will demonstrate on-line learning of detection, background modeling and tracking tasks based on on-line boosting; all algorithms are real-time capable. All approaches benefit significantly from the on-line training.

### Multiple Object Class Detection with a Generative Model

Presenter: Bernt Schiele
 9 June, at 14h30 F 107, INRIA Rhône-Alpes
Affiliation: Department of Computer Science, Darmstadt University of Technology

Abstract:
In this talk we propose an approach capable of simultaneous recognition and localization of multiple object classes using a generative model. A novel hierarchical representation makes it possible to represent individual images as well as various object classes in a single, scale and rotation invariant model. The recognition method is based on a codebook representation where appearance clusters built from edge based features are shared among several object classes. A probabilistic model allows for reliable detection of various objects in the same image. The approach is highly efficient due to fast clustering and matching methods capable of dealing with millions of high dimensional features. The system shows excellent performance on several object categories over a wide range of scales, in-plane rotations, background clutter, and partial occlusions. The performance of the proposed multi-object class detection approach is comparable with state of the art approaches dedicated to a single object class recognition problem.

### Extremely randomized trees applied to image quantification combined to a visual attention process for object categorization

Presenter: Frank Moosmann
 15 May, at 16h00 C 207, INRIA Rhône-Alpes
Affiliation: Lear Project

Abstract:
The bag-of-features approach has recently become very popular for image categorization. However, there are several areas where it can be improved. First, the selection of features has so far been done either densely or with detector functions. While the dense approach achieves better results than detector-based approaches, it also has a higher complexity. The second area of possible improvement is the creation of visual codebooks. The standard clustering method, k-means, is not only slow; it also does not create codebooks suited to discriminating between classes. The associated nearest-neighbor routine for assigning clusters is also slow.
We propose improvements in both areas. Extremely Randomized Trees are used to create a codebook efficiently and in a discriminative manner. In addition, a combined bottom-up/top-down process is introduced to bias the random selection of features, so that fewer features are needed to obtain the same or even better results.

### Brain Computer Interfaces

Presenter: Vincent Guigue
 14 April, at 11h00 F 107, INRIA Rhône-Alpes
Affiliation: Lab. LITIS - INSA de Rouen

Abstract:
A lot of research has been carried out to design Brain Computer Interfaces (BCI), especially in the field of supervised classification of non-stationary signals.
EEG signals require particular processing and we propose to tackle those problems according to three approaches: building a denoised compact representation for raw signals, introducing translation invariance in the procedure and dealing with the variability of EEG signals.
In all our approaches we keep two threads: non-parametric tools with kernel machines and a tripolar strategy including the representation of raw signals, the building of similarities between representations and the classification machine.

First, we face the problem of describing the raw signals.
We aim at constructing a denoised and compact representation of the raw signals.
We designed the Kernel Basis Pursuit (KBP) algorithm which combines multiple kernels, sparse regularization and very efficient solving of regression problems.
We add some heuristics to make this method parameter-free, which enables us to deal with large amounts of data.

Then we make the assumption that one difficulty resides in the variable time position of the discriminant patterns.
We develop a translation invariant approach to classify non-stationary signals.
Such a method relies on a graph model of shift-covariant representation (wavelet transform or time-frequency) where all the time information becomes comparative.

Finally, the variability of EEG signals turned out to be the main difficulty in BCI problems.
We show that combining multiple classifiers and variable selection is an efficient strategy for identifying evoked potentials in EEG.

* Key words: L1 regularization, Kernel methods, Multiple kernel, Graph kernel, Translation invariance, Multiple classifiers, Brain Computer Interface.

### Filtering methods for tracking in image sequences: application to feature point tracking

Presenter: Elise Arnaud
 4 April, at 16h30 F 107, INRIA Rhône-Alpes
Affiliation: University of Genoa, Italy, and IRISA Rennes

Abstract:
This study addresses the use of filtering methods (Kalman filtering, sequential Monte Carlo methods) for tracking in image sequences. These algorithms rely on representing the dynamic system as a hidden Markov chain, described by a dynamic law and a data likelihood. To build a general method, we consider a dynamic law estimated from the images. This choice exposes the limitations of the simple hidden Markov chain model, which does not describe the dependence of the system's elements on the images. We first propose an original model of the problem, which makes the images explicit and allows algorithms to be built without prior information. The filters associated with this new representation are derived from the classical filters by conditioning on the sequence. We also show how this new scheme makes it possible to consider simple models for which the optimal importance function is available.

We then turn to the practical validation of the proposed model on a feature point tracking application. The systems implemented are estimated entirely from the sequence. They combine similarity measures with a dynamic model defined from an instantaneous motion estimated by a robust differential method. The resulting algorithms are validated on many real sequences and used in several applications (medical imaging, object recognition).

### Error-resilient source codes and joint source/channel codes

Presenter: Herve Jegou
 3 April, at 16h00 A 104, INRIA Rhône-Alpes
Affiliation: IRISA, Rennes

Abstract:
The talk will be in two distinct parts.

First, two contributions on joint source-channel coding will be presented. The first concerns the decoding of variable-length codes. A technique for aggregating the optimal decoding trellis will be described. It reduces the complexity of Bayesian decoding by an order of magnitude. Its optimality for jointly typical source/channel realizations is motivated by computing the amount of information contained in the termination constraint. The second contribution introduces codes based on rewriting rules and implemented by sequential transducers. A few properties will illustrate the value of this class of codes.

The second part of the talk will address similarity search, and more specifically approximate nearest-neighbor search in high-dimensional spaces. After introducing the problem, we will point out the limitations of a state-of-the-art algorithm, Omedrank, and then present improvements to it. In particular, we will show that significant gains can be obtained by modifying the voting strategy. Finally, we will give some research perspectives on this topic.

### Object Detection in Crowded Scenes

Presenter: Bastian Leibe
 20 March, at 11h00 F 107, INRIA Rhône-Alpes
Affiliation: Multimodal Interactive Systems group, Darmstadt

Abstract:
The detection of object classes in real-world images is a challenging problem which is further complicated by the effects of overlaps and partial occlusions. We present a novel algorithm which addresses this problem by considering object categorization and top-down segmentation as two interleaved processes that closely collaborate towards a common goal. As we will show, the close coupling between those two processes allows our method to accumulate additional evidence about object hypotheses and resolve ambiguities caused by overlaps and partial visibility.

The core part of our approach is a flexible formulation for object shape that can combine the information observed on different training examples in a probabilistic extension of the Generalized Hough Transform. The resulting approach can detect categorical objects in novel images and automatically infer a top-down segmentation from the recognition result. The segmentation is then used to again improve recognition by allowing the system to focus on object pixels and discard misleading influences from the background. Moreover, the information from where in the image a hypothesis draws its support is used in an MDL based verification stage to resolve ambiguities between overlapping hypotheses and factor out the effects of partial occlusion.

As an application, we address the problem of detecting objects such as cars, motorbikes, and pedestrians in real-world street scenes. Qualitative and quantitative results on several challenging data sets confirm that our method is able to reliably detect objects in crowded scenes, even when they overlap and partially occlude each other. In addition, the flexible nature of our approach allows it to operate on very small training sets.

### Beyond bag-of-words: recent research developments on visual categorization at XRCE

Presenters: Florent Perronnin and Gabriela Csurka
 16 March, at 16h00 C 207, INRIA Rhône-Alpes
Affiliation: Xerox Research Centre Europe, Image Processing Group

Abstract:
Generic Visual Categorization (GVC) is the pattern classification problem which consists in assigning one or multiple labels to an image based on its semantic content. Several state-of-the-art GVC systems were inspired by the bag-of-words (BOW) approach to text categorization. In the BOW representation, a text document is encoded as a histogram of the number of occurrences of each word. Similarly, one can characterize an image by a histogram of "visual word" counts. This is sometimes referred to as the bag-of-keypatches or bag-of-visterms. During this talk, we will discuss recent developments at the Xerox Research Centre Europe (XRCE) to improve on such representations.

We first present a novel and practical approach to GVC based on a universal vocabulary, which describes the content of all the considered classes of images, and class vocabularies obtained through the adaptation of the universal vocabulary using class-specific data. An image is characterized by a set of histograms - one per class - where each histogram describes whether the image content is best modeled by the universal vocabulary or the corresponding class vocabulary. It is shown experimentally on three very different databases that this novel representation outperforms those approaches which characterize an image with a single histogram.

In the second part we improve the categorizer by incorporating geometrical information. Based on scale, orientation or closeness of the keypatches we can consider a large number of simple geometrical relationships, each of which can be considered as a simplistic classifier. We select from this multitude of classifiers (several millions in our case) and combine them effectively with the original classifier. An improvement is demonstrated on a challenging 10-class dataset.
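
To make the baseline bag-of-words image representation mentioned in this abstract concrete, the following is a minimal Python sketch of a "visual word" histogram: each local descriptor is assigned to its nearest vocabulary entry and the assignments are counted. It illustrates only the generic BOW baseline, not the XRCE methods; the function names, the toy 2-D descriptors, and the three-word vocabulary are assumptions made for illustration.

```python
def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to descriptor (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda k: sq_dist(descriptor, vocabulary[k]))

def bow_histogram(descriptors, vocabulary):
    """Count how many descriptors are assigned to each visual word."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1
    return hist

# Toy 2-D "descriptors" and a 3-word vocabulary (made-up numbers).
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (0.8, -0.1), (0.1, 0.9)]
histogram = bow_histogram(descriptors, vocabulary)
```

In practice the vocabulary would come from clustering real descriptors, and the histogram would usually be normalized before classification.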

### Modelling Scenes with Local Descriptors and Latent Aspects

Presenter: Tinne Tuytelaars
 16 February, at 15h00 F 107, INRIA Rhône-Alpes
Affiliation: K.U.Leuven, VISICS Group

Abstract:
A new approach to model visual scenes in image collections is presented, based on local invariant features and probabilistic latent space models. We provide answers to the following three open questions: 1) whether the invariant local features are suited for scene (rather than object) classification; 2) whether unsupervised latent space models can be used for feature extraction in the classification task; and 3) whether the latent space formulation can discover visual co-occurrence patterns, motivating novel approaches to image organization and segmentation. Using a 9500-image dataset, our approach is validated on each of these issues. First, we show with extensive experiments on binary and multiclass scene classification tasks that the bag-of-words representation derived from local invariant descriptors consistently outperforms state-of-the-art approaches. Second, we show that Probabilistic Latent Semantic Analysis (PLSA) generates a compact scene representation, discriminative for accurate classification, and significantly more robust when less training data are available. Third, we have exploited the ability of PLSA to automatically extract visually meaningful aspects to propose new algorithms for aspect-based image ranking and context-sensitive image segmentation.

Additionally, I'll discuss some planned future work, exploiting a similar scheme based on latent aspects and local invariant features for the integration of visual and textual data.

### The TRECVID program: experiments in content-based retrieval in video document collections

Presenter: Georges Quénot
 9 February, at 14h00 F 107, INRIA Rhône-Alpes
Affiliation: CLIPS-IMAG

Abstract:
The US National Institute of Standards and Technology (NIST) and DARPA have launched an annual evaluation campaign for content-based retrieval systems operating on video document collections (TRECVID). The systems are evaluated as a whole on a search task that is as realistic as possible. Components or techniques needed by these systems are evaluated independently, such as shot segmentation, story segmentation, concept detection, and camera motion detection. We will describe the general principles of the campaign, the different tasks, and the results obtained, which are representative of the state of the art in the field. We will also present the work carried out in the MRIM team and evaluated within TRECVID.

### Geometric Context from a Single Image

Presenter: Derek Hoiem
 6 February, at 15h00 F 107, INRIA Rhône-Alpes
Affiliation: Robotics Institute of Carnegie Mellon University

Abstract:
Humans have an amazing ability to instantly grasp the overall 3D structure of a scene (ground orientation, relative positions of major landmarks, etc.) even from a single image. This ability is completely missing in most popular recognition algorithms, which pretend that the world is flat and/or view it through a patch-sized peephole. Yet it seems very likely that having a grasp of this "geometric context" of a scene should be of great assistance for many tasks, including recognition, navigation, and novel view synthesis. In this talk, I will describe our first steps toward the goal of estimating a 3D scene context from a single image. We propose to estimate the coarse geometric properties of a scene by learning appearance-based models of geometric classes. Geometric classes describe the 3D orientation of an image region with respect to the camera. We provide a multiple-hypothesis segmentation framework for robustly estimating scene structure from a single image and obtaining confidences for each geometric label. These confidences can then (hopefully) be used to improve the performance of many other applications. We provide a quantitative evaluation of our algorithm on a dataset of challenging outdoor images.
We also demonstrate its usefulness in two applications:
1) improving object detection, and
2) automatic single-view reconstruction ("Automatic Photo Pop-up", SIGGRAPH'05).
Joint work with Alexei Efros and Martial Hebert at CMU.

### Computer vision using local binary patterns

Presenter: Matti Pietikäinen
 12 January, at 14h30 F 107, INRIA Rhône-Alpes
Affiliation: Information Processing Laboratory, University of Oulu, Finland

Abstract:
The local binary pattern (LBP) operator is defined as a gray-scale invariant texture measure, derived from a general definition of texture in a local neighborhood. Through its recent extensions, the LBP operator has been made into a powerful measure of image texture, showing excellent results in many empirical studies. The LBP operator can be seen as a unifying approach to the traditionally divergent statistical and structural models of texture analysis. Perhaps the most important property of the LBP operator in real-world applications is its invariance against monotonic gray level changes. Another equally important property is its computational simplicity, which makes it possible to analyze images in challenging real-time settings. The LBP method has already been used in a large number of applications all over the world. This talk presents an overview of the LBP approach, emphasizing our recent research results. Theoretical foundations of the LBP and examples of applying it to various computer vision problems are presented, including classification of 3D textured surfaces, face recognition, face detection, facial expression recognition, content-based retrieval, modeling the background and detecting moving objects, and recognition of dynamic textures.
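
As a rough illustration of the invariance property stated in this abstract, here is a minimal Python sketch of the basic 3x3 LBP code: each of the 8 neighbours is thresholded against the centre pixel and the resulting bits are packed into one integer. The function name `lbp_3x3`, the clockwise bit ordering, and the toy image are assumptions for illustration; the extensions mentioned in the talk are not covered.

```python
def lbp_3x3(image):
    """Basic 8-neighbour LBP codes for the interior pixels of a 2-D list of gray values."""
    # Neighbour offsets, clockwise from the top-left pixel (ordering is a convention).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the centre.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes

# Toy 3x3 image (made-up gray values) and a monotonically brightened copy.
image = [
    [10, 20, 30],
    [40, 50, 60],
    [70, 80, 90],
]
brighter = [[2 * p + 5 for p in row] for row in image]
codes = lbp_3x3(image)
codes_brighter = lbp_3x3(brighter)
```

Because only the sign of each neighbour-centre comparison matters, `codes` and `codes_brighter` are identical: a strictly increasing gray-level transform leaves every comparison unchanged.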

### Evaluation of interest point detectors and descriptors on infrared images

Presenter: Julien Bohné
 11 January, at 16h00 C 207, INRIA Rhône-Alpes
Affiliation: Lear, INRIA Rhône-Alpes

Abstract:
An evaluation of different interest point detectors and descriptors applied to low-resolution infrared images will be presented. After a brief description of the test method, the results of the different algorithms will be discussed to highlight the advantages and drawbacks of each technique.

# Seminars in 2005

Listed in reverse chronological order.

| Title | Presenter | Date | Affiliation | Room |
|---|---|---|---|---|
| Discriminative Regions for Semi-Supervised Object Class Localization | Caroline Pantofaru | 7 December, 2005 at 16h00 | Vision and Mobile Robotics Lab, Carnegie Mellon University | C207, INRIA Rhône-Alpes |
| Discovering objects and their location in images | Andrew Zisserman | 5 December, 2005 at 16h00 | Department of Engineering Science, University of Oxford | Grand Amphi, INRIA Rhône-Alpes |
| Hyperfeatures - Multilevel Local Coding for Visual Recognition | Ankur Agarwal | 23 November, 2005 at 16h00 | Lear Project, INRIA Rhône-Alpes | C207, INRIA Rhône-Alpes |
| Manifold Learning and Image Segmentation | Jakob Verbeek | 24 August, 2005 at 16h00 | Intelligent Autonomous Systems, University of Amsterdam | C207, INRIA Rhône-Alpes |
| Dynamic Scene Analysis using Non-Parametric Statistics | Yoni Wexler | 30 June, 2005 at 16h00 | Weizmann Institute, Israel | F107, INRIA Rhône-Alpes |
| Infra-red image classification | Eric Nowak | 14 June, 2005 at 14h00 | Lear Project, INRIA Rhône-Alpes | C207, INRIA Rhône-Alpes |
| Object Detection with Line Segment Networks | Vittorio Ferrari | 30 May, 2005 at 14h00 | BIWI - ETHZ, Switzerland | A104, INRIA Rhône-Alpes |
| Creating Efficient Codebooks for Visual Recognition | Frédéric Jurie | 27 April, 2005 at 11h00 | INRIA Rhône-Alpes, Project LEAR | C207, INRIA Rhône-Alpes |
| Feature Detection in Color Images | Joost van de Weijer | 13 April, 2005 at 16h00 | Lear, INRIA Rhône-Alpes | C207, INRIA Rhône-Alpes |
| Semi-Local Parts and Adjacency Relations for Object Recognition | Svetlana Lazebnik | 21 February, 2005 at 16h00 | Beckman Institute (University of Illinois at Urbana-Champaign) | F107, INRIA Rhône-Alpes |
| High Dimensional Discriminant Analysis | Charles Bouveyron | 09 February, 2005 at 16h00 | INRIA Rhône-Alpes - Project LEAR | C207, INRIA Rhône-Alpes |
| Strike a Pose: Tracking People by Finding Stylized Poses | Deva Ramanan | 04 February, 2005 at 14h00 | University of Berkeley, Computer Vision Group | C207, INRIA Rhône-Alpes |
| Fast Image Retrieval using SIFT descriptors | Michaël Sdika | 21 January, 2005 at 14h00 | INRIA Rhône-Alpes, Project LEAR | C207, INRIA Rhône-Alpes |
| Monocular Human Motion Capture with a Mixture of Regressors | Ankur Agarwal | 05 January, 2005 at 16h00 | INRIA Rhône-Alpes, Project LEAR | C207, INRIA Rhône-Alpes |

## Details

### Discriminative Regions for Semi-Supervised Object Class Localization

Presenter: Caroline Pantofaru
 7 December, 2005 at 16h00 C207, INRIA Rhône-Alpes
Affiliation: Vision and Mobile Robotics Lab, Carnegie Mellon University

Abstract:
I will present a method for object class localization using image regions. Image regions are extracted using unsupervised image segmentation, and provide a natural spatial support for detection results. Each region can be classified using both its texture content and the local interest points in and around it. Our framework allows selection of the most discriminative features for a given object class in a semi-supervised manner, where image labels are given but not the pixelwise delineation of training objects. Despite the semi-supervised training, this method allows pixelwise localization where the actual object mask is determined, not simply a bounding box or object centre.

### Discovering objects and their location in images

Presenter: Andrew Zisserman
 5 December, 2005 at 16h00 Grand Amphi, INRIA Rhône-Alpes
Affiliation: Department of Engineering Science, University of Oxford

Abstract:
This is joint work with Josef Sivic, Bryan Russell, Alexei Efros, and William Freeman.
There has been much recent research activity in recognizing object categories (such as cars, faces, motorbikes) in images. Most approaches start by learning a category model from a set of labelled training images for each category. The level of supervision of these training images can vary from segmenting in detail each object instance, through to simply labelling the image as containing that object category.
In this work we explore unsupervised training - we seek to discover the object categories depicted in a set of unlabelled images. We achieve this using a model developed in the statistical text literature: probabilistic Latent Semantic Analysis (pLSA). In text analysis this is used to discover topics in a corpus using the bag-of-words document representation. Here we treat object categories as topics, so that an image containing instances of several categories is modeled as a mixture of topics.
The model is applied to images by using a visual analogue of a word, formed by vector quantizing SIFT-like region descriptors. The topic discovery approach successfully translates to the visual domain: for a small set of objects, we show that both the object categories and their approximate spatial layout are found without supervision. Performance of this unsupervised method is compared to previous supervised approaches, and we show applications to category based retrieval in image databases and films.

### Hyperfeatures - Multilevel Local Coding for Visual Recognition

Presenter: Ankur Agarwal
 23 November, 2005 at 16h00 C207, INRIA Rhône-Alpes
Affiliation: Lear Project, INRIA Rhône-Alpes

Abstract:
Histograms of local appearance descriptors are a popular representation for visual recognition. They are highly discriminant and they have good resistance to local occlusions and to geometric and photometric variations, but they are not able to exploit spatial co-occurrence statistics of features at scales larger than their local input patches. We present a new multilevel visual representation, 'hyperfeatures', that is designed to remedy this. The basis of the work is the familiar notion that to detect object parts, in practice it often suffices to detect co-occurrences of more local object fragments, a process that can be formalized as comparison (vector quantization) of image patches against a codebook of known fragments, followed by local aggregation of the resulting codebook membership vectors to detect co-occurrences. This process converts collections of local image descriptor vectors into slightly less local histogram vectors, that is, higher-level but spatially coarser descriptors. Our central observation is that it can therefore be iterated, and that doing so captures and codes ever larger assemblies of object parts and increasingly abstract or 'semantic' image properties. This repeated nonlinear 'folding' is essentially different from that of hierarchical models such as Convolutional Neural Networks and HMAX, being based on repeated comparison to local prototypes and accumulation of co-occurrence statistics rather than on repeated convolution and rectification. We formulate the hyperfeatures model and study its performance under several different image coding methods including clustering based Vector Quantization, Gaussian Mixtures, and combinations of these with Latent Discriminant Analysis. We find that the resulting high-level features provide improved performance in several object image and texture image classification tasks. Reference: Technical Report RR-5655, INRIA - Aug. 2005

### Manifold Learning and Image Segmentation

Presenter: Jakob Verbeek
 24 August, 2005 at 16h00 C 107, INRIA Rhône-Alpes
Affiliation: Intelligent Autonomous Systems, University of Amsterdam

### Dynamic Scene Analysis using Non-Parametric Statistics

Presenter: Yoni Wexler
 30 June, 2005 at 16h00 F 107, INRIA Rhône-Alpes
Affiliation: Weizmann Institute, Israel

Abstract:
Complex dynamic scenes are very difficult to model. They do not have well-defined geometric or parametric representations. Parametric and geometric methods have therefore been limited in their ability to solve real-world problems in Vision. Yet, texture and dynamic changes over time provide rich statistical information about the scene. This information is usually non-parametric. In this talk I will demonstrate how by taking a non-parametric statistical approach, we are able to solve difficult problems in the field of Computer Vision. In particular, I will demonstrate the power of this approach through several example problems. These include analysis, synthesis and manipulation of complex dynamic video sequences, recovery of Epipolar Geometry, and recovery of general unknown optical distortions without modeling them parametrically.

### Infra-red image classification

Presenter: Eric Nowak
 14 June, 2005 at 14h00 C 207, INRIA Rhône-Alpes
Affiliation: Lear Project, INRIA Rhône-Alpes

Abstract:
I will present my work on the classification of infrared and visible images. This work is still in progress, so I will present thoughts and experimental results on different topics, including:
- dense representation of objects (SIFT based and raw pixel based)
- feature selection
- multiclass feature selection: how to share features efficiently between classes

### Object Detection with Line Segment Networks

Presenter: Vittorio Ferrari
 30 May, 2005 at 14h00 A 104, INRIA Rhône-Alpes
Affiliation: BIWI - ETHZ, Switzerland

Abstract:
We propose a system for object detection in cluttered real images, given only a hand-drawn outline as model. The edges are approximated by polygons, and the resulting line segments are organized into a novel image representation which encodes their interconnections: the Line Segment Network. The object detection problem is formulated as finding paths through the network resembling the model outline, and a computationally efficient detection algorithm is presented. As we demonstrate on several cluttered real images containing two object classes (bottles and swans), our method is capable of robust object detection and allows for considerable shape variation.

### Creating Efficient Codebooks for Visual Recognition

Presenter: Frédéric Jurie
 27 April, 2005 at 11h00 C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, Project LEAR

Abstract:
Visual codebooks built by vector quantizing appearance descriptors of local image patches are an effective means of capturing image statistics for texture analysis and visual classification. The input patches can either densely cover the image ('texton' representation) or be restricted to a sparse set of keypoints ('local features' representation). Methods such as k-means are common choices for codebook construction. Although k-means works well for the relatively homogeneous images typical of texture analysis, we show that it gives suboptimal codebooks when faced with the highly non-uniform statistics of the natural images found in object recognition problems. We describe a ball-deletion based mean shift clusterer that scales well to large datasets, and show that its codebooks significantly outperform k-means ones on several image classification tasks. We also show that dense representations greatly outperform keypoint based ones, and that mutual information based feature selection starting from a dense codebook gives a further improvement in performance.
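For readers unfamiliar with the k-means baseline mentioned in this abstract, here is a minimal sketch of codebook construction by vector quantization. It is not the mean shift clusterer from the talk; the function names (`build_codebook`, `assign`, `bag_of_words`) and all parameter values are illustrative assumptions.

```python
import numpy as np

def assign(descriptors, centers):
    """Index of the nearest codeword (Euclidean distance) for each descriptor."""
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm); returns k codewords of shape (k, d)."""
    rng = np.random.default_rng(seed)
    # Initialise the codewords with k distinct descriptors chosen at random.
    start = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[start].astype(float)
    for _ in range(n_iter):
        labels = assign(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old codeword if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_words(descriptors, centers):
    """Normalised histogram of codeword occurrences for one image."""
    counts = np.bincount(assign(descriptors, centers), minlength=len(centers))
    return counts / counts.sum()
```

Each image is then summarised by the histogram returned by `bag_of_words`, which can be fed to any standard classifier.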

### Feature Detection in Color Images

Presenter: Joost van de Weijer
 13 April, 2005 at 16h00 C 207, INRIA Rhône-Alpes
Affiliation: Lear, INRIA Rhône-Alpes

Abstract:
"Colors are only symbols. Reality is to be found in luminance alone.", Picasso exclaimed in one of his blue years. His message seems to be taken to heart by the computer vision community. In general the first thing to do, when trying to interpret the content of images, when looking for objects, persons, textures, or at a smaller scale for edges, ridges, and corners, is to discard color.
In this talk I will focus on two advantages of using color for computer vision tasks. First, color provides extra photometric information which allows the distinction between various physical causes for color variations in the world, such as changes due to shadows, light source reflections, and object reflectance variations. Secondly, color is an important discriminative property of objects and plays an important role in the attribution of saliency. These two advantages are applied to image features, which results in, among others, photometric-invariant edge and corner detectors and local features focused on color saliency.

### Semi-Local Parts and Adjacency Relations for Object Recognition

Presenter: Svetlana Lazebnik
 21 Feb, 2005 at 1600hrs F 107, INRIA Rhône-Alpes
Affiliation: Beckman Institute (University of Illinois at Urbana-Champaign)

Abstract:
This talk will describe a framework for object recognition based on local scale- and affine-covariant image regions (keypoints) and their spatial relations. In many existing object recognition approaches, individual keypoints play the role of generic object parts. We have developed a more expressive object representation based on composite semi-local parts, defined as geometrically stable configurations of multiple regions that are robust against (limited) viewpoint changes and intra-class variations. Our framework includes a weakly supervised procedure for learning a vocabulary of semi-local parts to represent an object class (i.e., it works on unsegmented, cluttered training images); this procedure can be combined with existing feature selection methods based on likelihood ratio or mutual information. The talk will conclude with a discussion of work in progress, namely, probabilistic models for combining semi-local parts and inter-part adjacency relations.

### High Dimensional Discriminant Analysis

Presenter: Charles Bouveyron
 09 February, 2005 at 16h00 C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes - Project LEAR

Abstract:
We propose a new method for discriminant analysis, called High Dimensional Discriminant Analysis (HDDA). Our approach is based on the assumption that high dimensional data live in different subspaces with low dimensionality. Thus, HDDA reduces the dimension for each class independently and regularizes class conditional covariance matrices in order to adapt the Gaussian framework to high dimensional data. This regularization is achieved by assuming that classes are spherical in their eigenspace. HDDA is applied to recognize objects in natural images and its performance is compared to that of classical classification methods.

### Strike a Pose: Tracking People by Finding Stylized Poses

Presenter: Deva Ramanan
 04 February, 2005 at 1400hrs C 207, INRIA Rhône-Alpes
Affiliation: University of California, Berkeley, Computer Vision Group

Abstract:
An important, open vision problem is to automatically describe what people are doing in a sequence of video. This problem is difficult for several reasons. First, one needs to determine how many people (if any) are in each frame and estimate their configurations (where they are and what their arms and legs are doing). But finding people and localizing their limbs is hard because people (a) wear a variety of different clothes, (b) appear in a variety of poses and (c) tend to partially occlude themselves and each other. Secondly, one must sew together estimated configuration reports from across frames into a motion path; this is tricky because people can move fast and unpredictably. Finally, one must describe what each person is doing; this problem is poorly understood, not least because there is no natural or canonical set of categories into which to classify activities.
In this talk I will discuss our progress on this problem. We develop a tracker that works in two stages; it first (a) builds a model of appearance of each person in a video and then (b) tracks by detecting those models in each frame ("tracking by model-building and detection"). We then marry our tracker with a motion synthesis engine that works by re-assembling pre-recorded motion clips. The synthesis engine generates new motions that are human-like and close to the image measurements reported by the tracker. By using labeled motion clips, our synthesizer also generates activity labels for each image frame ("analysis by synthesis"). We have extensively tested our system, running it on hundreds of thousands of frames of unscripted indoor and outdoor activity, a feature-length film, and legacy sports footage.

### Fast Image Retrieval using SIFT descriptors

Presenter: Michaël Sdika
 21 January, 2005 at 1400hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, LeaR

Abstract:
I will present the basis and the techniques used for fast image retrieval using SIFT descriptors in the team's demo and my contribution to the lava library. More precisely, I will talk about:
1) the new implementation of the SIFT descriptor,
2) the new angle estimator,
3) an indexing method using dimensionality reduction and kd-tree,
4) and D. Lowe's Hough transform, used to add geometric constraints on matches.

I will conclude by giving some ideas on what can be done to improve the retrieval process.

### Monocular Human Motion Capture with a Mixture of Regressors

Presenter: Ankur Agarwal
 05 January, 2005 at 1600hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, LeaR

Abstract:
We address 3D human motion capture from monocular images, taking a learning based approach to construct a probabilistic pose estimation model from a set of labelled human silhouettes. To compensate for ambiguities in the pose reconstruction problem, our model explicitly calculates several possible pose hypotheses. It uses locality on a manifold in the input space and connectivity in the output space to identify regions of multi-valuedness in the mapping from silhouette to 3D pose. This information is used to fit a mixture of regressors on the input manifold, giving us a global model capable of predicting the possible poses with corresponding probabilities. These are then used in a dynamical-model based tracker that automatically detects tracking failures and re-initializes in a probabilistically correct manner. The system is trained on optical sensor based motion capture data, using the corresponding real human silhouettes supplemented with silhouettes synthesized artificially from several different models for improved robustness to inter-person variations. Static pose estimation is illustrated on a variety of silhouettes. The robustness of the method is demonstrated by tracking on a real image sequence requiring multiple automatic re-initializations.

# Seminars in 2004

## Titles

Color Constancy from local invariant regions
 Tijmen Moerland 25 November, 2004 at 1600hrs INRIA Rhône-Alpes, Project LEAR C207, INRIA Rhône-Alpes
Summary of Summer school on Machine Learning
 Ankur Agarwal 04 November, 2004 at 1600hrs INRIA Rhône-Alpes, Project LEAR C207, INRIA Rhône-Alpes
Summary of International Workshop on Object Recognition
 Frédéric Jurie 28 October, 2004 at 1600hrs INRIA Rhône-Alpes, Project LEAR C207, INRIA Rhône-Alpes
Detecting Keypoints with Stable Position, Orientation and Scale under Illumination Changes
 Bill Triggs 17 June, 2004 at 1700hrs INRIA Rhône-Alpes, Project LEAR C207, INRIA Rhône-Alpes
Title Unknown
 Michel Dhome 28 April, 2004 at 14h30 LASMEA, Université Blaise Pascal F 107, INRIA Rhône-Alpes
Learning 3D Human Pose from Silhouettes
 Ankur Agarwal 24 March, 2004 at 1530hrs INRIA Rhône-Alpes, Project LEAR C207, INRIA Rhône-Alpes
Bandelettes et représentation géométrique des images
 Erwan Le Pennec 03 March, 2004 at 11h00 CMAP, Ecole Polytechnique F 107, INRIA Rhône-Alpes
Reading of: New Algorithms for Efficient High-Dimensional Nonparametric Classification
 Salil Jain and Peter Carbonetto 19 February, 2004 at 1600hrs INRIA Rhône-Alpes, Project LEAR C207, INRIA Rhône-Alpes
Kernel Fisher discriminant for texture segmentation
 Jianguo Zhang 05 February, 2004 at 1700hrs INRIA Rhône-Alpes C 207, INRIA Rhône-Alpes
Improving KD Trees. L-infinity distance for Triangulation
 Richard Hartley 21 January, 2004 at 1600hrs The Australian National University Grand Amphi, INRIA Rhône-Alpes
Human detection based on a probabilistic assembly of robust part detectors
 Krystian Mikolajczyk 15 January, 2004 at 1600hrs Robotics Research Group, University of Oxford F 107, INRIA Rhône-Alpes

## Abstracts

### Color Constancy from local invariant regions

Presenter: Tijmen Moerland
 25 November, 2004 at 1600hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, LeaR

Abstract: This master's thesis investigates methods for combining the research fields of color constancy and invariant region matching. Color constancy aims at removing the influence of illumination from images so that the 'true' surface color of objects can be seen. The color constancy algorithm used in this thesis operates on two images and aims at approximating the joint color change, the 'color flow'. This makes object colors invariant to illumination changes. Other invariances, such as invariance to rotation and scaling of images and to the appearance, disappearance and movement of objects, are achieved using DoG keypoint detection and SIFT matching. Robust color flow estimation based on normalized support regions makes color constancy viewpoint independent, which is the main contribution of this work. Furthermore, the color flow algorithm is improved by operating in hue-saturation space, which makes it robust to shadows and highlights.

### Summary of Summer school in Machine Learning

Presenter: Ankur Agarwal
 04 November, 2004 at 1600hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, Project LEAR

Abstract:

### Summary of International Workshop on Object Recognition

Presenter: Frédéric Jurie
 28 October, 2004 at 1600hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, Project LEAR

Abstract:

### Detecting Keypoints with Stable Position, Orientation and Scale under Illumination Changes

Presenter: Bill Triggs
 17 June, 2004 at 1700hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, Project LEAR

Abstract:
Local feature approaches to vision geometry and object recognition are based on selecting and matching sparse sets of visually salient image points, known as 'keypoints' or 'points of interest'. Their performance depends critically on the accuracy and reliability with which corresponding keypoints can be found in subsequent images. Among the many existing keypoint selection criteria, the popular Förstner-Harris approach explicitly targets geometric stability, defining keypoints to be points that have locally maximal self-matching precision under translational least squares template matching. However, many applications require stability in orientation and scale as well as in position. Detecting translational keypoints and verifying orientation/scale behaviour post hoc is suboptimal, and can be misleading when different motion variables interact. We give a more principled formulation, based on extending the Förstner-Harris approach to general motion models and robust template matching. We also incorporate a simple local appearance model to ensure good resistance to the most common illumination variations. We illustrate the resulting methods and quantify their performance on test images.

### Title Unknown

Presenter: Michel Dhome
 28 April, 2004 at 1430hrs F 107, INRIA Rhône-Alpes
Affiliation: LASMEA, Université Blaise Pascal

Abstract:
Michel Dhome (LASMEA, Clermont-Ferrand) will present his recent work on real-time scene reconstruction using a moving camera: a car is driven manually through a city-like environment. The scene is then automatically reconstructed, later allowing a car to drive autonomously along the learned trajectory.

### Learning 3D Human Pose from Silhouettes

Presenter: Ankur Agarwal
 24 March, 2004 at 1530hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, Project LEAR

Abstract:
I will describe a sparse Bayesian regression method for recovering 3D human body motion from single images and monocular video sequences. The method requires neither an explicit body model nor prior labelling of body parts in the image. Instead, it recovers pose by direct nonlinear regression against shape descriptor vectors extracted automatically from image silhouettes. For robustness against local silhouette segmentation errors, silhouette shape is encoded by histogram-of-shape-contexts descriptors. Different regressors are evaluated for the main regression, and a Relevance Vector Machine (RVM) regressor is used to provide a sparse regressor without compromising performance. The regression scheme is also extended into a tracking framework by combining a learned autoregressive dynamical model with the robust shape descriptors. The methods are demonstrated on a 54-parameter full body pose model, both quantitatively using motion capture based test sequences, and qualitatively on a test video sequence.

### Bandelettes et représentation géométrique des images

Presenter: Erwan Le Pennec
 03 March, 2004 at 1100hrs F 107, INRIA Rhône-Alpes
Affiliation: CMAP, Ecole Polytechnique

Abstract:
The search for efficient signal representations is at the heart of signal processing for applications such as compression, estimation and inverse problems. For images, representation in a wavelet basis is suboptimal because it does not exploit their geometric regularity. Bandelets are constructed for exactly this purpose. After presenting them, we will show that they yield optimal nonlinear approximation results. These properties will be illustrated by applications to compression and denoising.

### Reading of: New Algorithms for Efficient High-Dimensional Nonparametric Classification

Presenter: Salil Jain and Peter Carbonetto
 19 February, 2004 at 1600hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, Project LEAR

Abstract:
The reading group is about non-approximate acceleration of high-dimensional operations, such as classification, using basic properties of ball trees (similar to kd-trees). Salil and Peter will present a short introduction to ball-tree algorithms and summarize the paper, and then discussion will follow.
A local copy can be obtained from: /home/edgar/carbonet/public/liu-moore.ps.gz

### Kernel Fisher discriminant for texture segmentation

Presenter: Jianguo Zhang
 05 February, 2004 at 1700hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes, Project LEAR

Abstract:
Kernel Fisher discriminant (KFD) is a state-of-the-art nonlinear machine learning method with great potential to outperform the linear Fisher discriminant. In this talk, I will present a nonlinear discriminative texture feature extraction method based on KFD for texture classification. It is also shown mathematically that finding the optimal discriminative texture features is equivalent to finding the optimal discriminative projection directions of the input data by KFD. The KFD-based method integrates texture feature extraction, nonlinear dimensionality reduction, and discrimination in a unified framework. Optimized and closed-form solutions are derived separately for the two-class and multi-class texture classification problems. Extensive experimental results show that the proposed method yields excellent performance in texture classification and outperforms other kernel-based texture classification methods.

If time allows, I will also present part of my previous work on MRI tumor segmentation by one-class SVM learning.
The abstract is as follows:
In image segmentation, one challenge is how to deal with the nonlinearity of real data distributions, which often makes segmentation methods require more human interaction and produce unsatisfactory results. In this talk, we formulate this research issue as a one-class learning problem from both theoretical and practical viewpoints, with application to medical image segmentation. To that end, a novel and user-friendly tumor segmentation method is proposed based on the one-class support vector machine (SVM), which can learn the nonlinear distribution of the tumor data without using any prior knowledge about the data distribution. Extensive experimental results obtained from real patients' medical images show that the proposed unsupervised one-class SVM segmentation method outperforms the supervised two-class SVM segmentation method in segmentation accuracy while requiring less human intervention.

### Improving KD Trees. L-infinity distance for Triangulation

Presenter: Richard Hartley
 21 January, 2004 at 1600hrs Grand Amphi, INRIA Rhône-Alpes
Affiliation: The Australian National University

### Human detection based on a probabilistic assembly of robust part detectors

Presenter: Krystian Mikolajczyk
 15 January, 2004 at 1600hrs F 107, INRIA Rhône-Alpes
Affiliation: Robotics Research Group, University of Oxford

Abstract:
I will present a novel method for human detection which can detect pedestrians as well as close-up views of humans in the presence of clutter and occlusion. Humans are modeled as flexible assemblies of parts. The key point of the approach is a robust part detection. The part detectors are based on gradient and Laplacian based local features which efficiently capture the shape information. Using the probabilistic co-occurrence of these features increases their distinctiveness while the robustness remains the same. Learning with AdaBoost combines features with the highest co-occurrence probabilities.
Furthermore, the parts include a larger local context than in previous part-based work [Forsyth'97, Ronfard'02] and they are therefore more distinctive. They are also not global (cf. previous work on pedestrian detectors [Papageorgiou'00]) and they therefore allow for occlusion and the detection of close-up views. The detection results are further improved by computing a probabilistic score for the assembly of parts which takes into account their relative position. The approach is also very efficient as (i) all part detectors use the same initial features, (ii) a coarse-to-fine cascade approach is used for part detection, (iii) an assembly strategy reduces the number of spurious detections and the search space. The results are very promising and outperform existing human detectors.

# Seminars in 2003

## Titles

Transductive Learning for Scene Classification
 Bill Triggs 18 December, 2003 at 1700hrs INRIA Rhône-Alpes - Project LEAR C 208, INRIA Rhône-Alpes
Indices de forme invariants à l'échelle pour la reconnaissance de catégories d'objets
 Frédéric Jurie 04 December, 2003 at 1600hrs INRIA Rhône-Alpes - Project LEAR C 208, INRIA Rhône-Alpes
Unsupervised Statistical Models for General Object Recognition
 Peter Carbonetto 27 November, 2003 at 1530hrs INRIA Rhône-Alpes - Project LEAR C 207, INRIA Rhône-Alpes
Apprentissage Direct de la Matrice Jacobienne Inverse d'une Fonction
 Frédéric Jurie 06 November, 2003 at 1600hrs INRIA Rhône-Alpes - Project LEAR F 107, INRIA Rhône-Alpes
Texture Recognition Using Affine-Invariant Regions
 Svetlana Lazebnik 23 October, 2003 at 1600hrs Beckman Institute (University of Illinois at Urbana-Champaign) F 107, INRIA Rhône-Alpes
Méthodes de réduction de dimensionnalité pour le dépliage du ruban cortical
 Charles Bouveyron 01 October, 2003 at 1600hrs INRIA Rhône-Alpes - Project LEAR C 207, INRIA Rhône-Alpes
Learning Dynamical Models for Tracking Complex Motion
 Ankur Agarwal 18 September, 2003 at 1600hrs INRIA Rhône-Alpes - Project LEAR C 207, INRIA Rhône-Alpes
The Trade-off Between Generative and Discriminative Classifiers
 Guillaume Bouchard 04 September, 2003 at 1600hrs INRIA Rhône-Alpes - Project LEAR C 207, INRIA Rhône-Alpes

## Abstracts

### Transductive Learning for Scene Classification

Presenter: Bill Triggs
 18 December, 2003 at 1700hrs C 208, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes - Project LEAR

### Indices de forme invariants à l'échelle pour la reconnaissance de catégories d'objets

Presenter: Frédéric Jurie
 04 December, 2003 at 1600hrs C 208, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes - Project LEAR

Abstract:
In this talk we introduce a new method for extracting shape interest regions which capture the local structure of the contour image. They are in spirit similar to local interest points extracted from grey-level images, but describe the shape instead of the texture. Our approach detects local shape convexities in scale-space. The detection is based on a robust measure, the entropy of the gradient orientations in the neighborhood of a circle defined by the scale. The detected regions allow for clutter, occlusions as well as spurious detections and are invariant to scale changes and rotations. Experimental results show a very good performance for shape matching and recognition of object categories.
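The robust measure named in this abstract, the entropy of gradient orientations, can be illustrated with a short sketch. The abstract does not give the sampling scheme around the circle or the detection rule, so the version below only computes the entropy for a set of gradient vectors that are assumed to have been sampled already; the function name `orientation_entropy`, the magnitude weighting and the bin count are assumptions.

```python
import numpy as np

def orientation_entropy(gx, gy, n_bins=8):
    """Shannon entropy (bits) of a magnitude-weighted gradient-orientation histogram."""
    gx = np.asarray(gx, dtype=float).ravel()
    gy = np.asarray(gy, dtype=float).ravel()
    magnitude = np.hypot(gx, gy)
    if magnitude.sum() == 0:
        return 0.0  # no gradient energy: treat as zero entropy
    angles = np.arctan2(gy, gx)  # orientations in [-pi, pi]
    hist, _ = np.histogram(angles, bins=n_bins,
                           range=(-np.pi, np.pi), weights=magnitude)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing to the entropy
    return float(-(p * np.log2(p)).sum())
```

Gradients that all point in one direction give an entropy of zero, while orientations spread evenly over all bins give the maximum, `log2(n_bins)`.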

Résumé:
We present a new method for detecting shape-based regions of interest, which captures the local structure of image contours. It is designed in the same spirit as local interest point detectors that operate on grey-level images, but describes shape rather than texture. Our approach describes local convexities of shapes in scale space. The regions are detected robustly despite occlusions, image noise and changes of scale. Experimental results show very good performance for shape matching and recognition of object categories.

### Unsupervised Statistical Models for General Object Recognition

Presenter: Peter Carbonetto
 27 November, 2003 at 1530hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes - Project LEAR

Abstract:
I will present an overview of the work I did for my Master's thesis at the University of British Columbia. I will also touch upon some major issues I uncovered in my work and discuss some future directions for research.
We approach the object recognition problem as the process of attaching meaningful labels to specific regions of an image. Given a set of images and their captions, we segment the images, then learn the proper associations between words and regions. Previous models are limited by the scope of the representation, and performance is constrained by noise from poor initial clusterings of the image features. We propose three improvements that address these issues.

Related papers:
1. Bayesian feature weighting for unsupervised learning, with application to object recognition. P. Carbonetto, N. de Freitas, P. Gustafson and N. Thompson. AI-Stats, 2003.
2. Why can't Jose read? The problem of learning semantic associations in a robot environment. P. Carbonetto and N. de Freitas. HLT Conference Workshop on Learning Word Meaning from Non-Linguistic Data, 2003.
3. A Statistical Model for General Contextual Object Recognition. P. Carbonetto, N. de Freitas and K. Barnard. Submitted to ECCV 2004. (local intranet access -- /home/albireo/carbonet/eccv2004.pdf)

### Apprentissage Direct de la Matrice Jacobienne Inverse d'une Fonction

Presenter: Frédéric Jurie
 6 November, 2003 at 1600hrs F 107, INRIA Rhône-Alpes
Affiliation:
INRIA Rhône-Alpes - Project LEAR
Also Université Blaise Pascal, Project LASMEA

Abstract:
We present a method to estimate the inverse Jacobian matrix of a function without computing the direct Jacobian matrix. This kind of inverse Jacobian matrix proves to perform much better in modeling a relation $\theta = f^{-1}(x)$ (where parameters $\theta$ are to be computed from observations $x$) than the traditional computation of the Moore-Penrose inverse.

Theoretical insight, as well as comparisons in domains such as visual servoing and tracking, will be provided to support this claim.
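One plausible reading of "estimating the inverse Jacobian directly" is a least-squares regression from observation changes back to parameter changes. The sketch below follows that reading under stated assumptions: the observation model `f`, the sampling range and the helper name `learn_inverse_jacobian` are all hypothetical, and the code does not reproduce the comparison with the Moore-Penrose inverse reported in the talk.

```python
import numpy as np

def f(theta):
    """Hypothetical observation model x = f(theta): 2 parameters, 3 observations."""
    t0, t1 = theta
    return np.array([np.sin(t0) + t1, t0 * t1, np.cos(t1) + 0.5 * t0])

def learn_inverse_jacobian(f, theta0, n_samples=200, scale=0.05, seed=0):
    """Fit A such that delta_theta ~= A @ delta_x around theta0, without forming J."""
    rng = np.random.default_rng(seed)
    theta0 = np.asarray(theta0, dtype=float)
    x0 = f(theta0)
    # Random parameter perturbations and the observation changes they cause.
    d_theta = rng.uniform(-scale, scale, size=(n_samples, theta0.size))
    d_x = np.array([f(theta0 + dt) - x0 for dt in d_theta])
    # Least-squares solution of d_x @ A.T ~= d_theta.
    a_t, *_ = np.linalg.lstsq(d_x, d_theta, rcond=None)
    return a_t.T  # shape (n_params, n_observations)
```

Given a new observation `x`, the parameters are then estimated as `theta0 + A @ (x - f(theta0))`, which is only expected to be accurate near `theta0`.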

Résumé:
A method will be presented for estimating the inverse Jacobian matrix of a function without computing the Jacobian matrix itself. In inversion problems (computing model parameters from measurements), this kind of inverse Jacobian matrix has better properties than the Moore-Penrose method.

Some ideas on the theoretical aspects will also be presented, along with comparisons in several vision application domains such as visual servoing and object tracking.

### Texture Recognition Using Affine-Invariant Regions

Presenter: Svetlana Lazebnik
 23 October, 2003 at 1600hrs F 107, INRIA Rhône-Alpes
Affiliation: Beckman Institute (University of Illinois at Urbana-Champaign)

Abstract:
This talk will discuss texture representations using affine-invariant interest points. A model of a texture is constructed from a sparse set of image locations characterized by local appearance and affine shape. For more descriptive power, it is possible to incorporate neighborhood constraints based on co-occurrence statistics. Applications include retrieval, classification, and segmentation of images of textured surfaces under a wide range of transformations, including viewpoint changes and non-rigid deformations.

Related papers:
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce, "Affine-Invariant Local Descriptors and Neighborhood Statistics for Texture Recognition," ICCV 2003.
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce, "A Sparse Texture Representation Using Affine-Invariant Regions," CVPR 2003, vol. II, pp. 319-324.

### Méthodes de réduction de dimensionnalité pour le dépliage du ruban cortical

Presenter: Charles Bouveyron
 01 October, 2003 at 1600hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes - Project LEAR

Presentation slides (pdf)
Related article (pdf)

### Learning Dynamical Models for Tracking Complex Motion

Presenter: Ankur Agarwal
 18 September, 2003 at 1600hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes - Project LEAR

Abstract:
I will address the problem of tracking complex human motions in monocular video sequences. Mainly, I will describe a new approach to modelling the non-linear and time-varying dynamics of generic human motions, using statistical methods to exploit structured motion patterns that exist in typical human activities. The method receives, as input, a set of hand-labelled motion sequences and it learns a piecewise dynamical model based on Gaussian autoregressive processes by automatically constructing connected regions in parameter space that exhibit similar dynamical characteristics. It also automatically partitions the state space into a number of classes corresponding to different motion patterns, making it useful for activity recognition.

### The Trade-off Between Generative and Discriminative Classifiers

Presenter: Guillaume Bouchard
 04 September, 2003 at 1600hrs C 207, INRIA Rhône-Alpes
Affiliation: INRIA Rhône-Alpes - Project LEAR

Abstract:
Given any generative classifier based on an inexact density model, we can define a discriminative counterpart that reduces its asymptotic error rate. We introduce a family of parameter estimation problems that interpolates the two approaches, thus providing a new way to compare them and giving an estimation procedure whose classification performance is well balanced between the bias of generative classifiers and the variance of discriminative ones. We show that an intermediate trade-off between the two strategies is often preferable, both theoretically and in experiments on real data.