Seminars
The future is weakly supervised
INRIA Rhône-Alpes, F107
Monday, May 11, 16:00
Abstract:
In this talk I present recent work developed at VISICS in Weakly
Supervised (WS) learning.
The main idea behind WS learning is to find a connection between a main
task, which is fully supervised, and a subordinate task, which is only
partially annotated. More specifically, we introduce latent variables
and use them to learn the subordinate task as a collateral effect of the
fully supervised learning.
We apply this approach to 3 different tasks:
i) Classification with WS detection
ii) Detection with WS facial point localization
iii) Detection with WS pose estimation
In these examples we show that WS learning is a learning approach that
has a broad field of application, can improve the fully supervised task
and, more importantly, effectively learns a subordinate task using only
partial annotations.
Projection-free Learning and Optimization
INRIA Rhône-Alpes, F107
Thursday, March 26, 4:30 pm
Abstract:
Linear optimization is often algorithmically simpler than
non-linear convex optimization. Linear optimization over matroid
polytopes, matching polytopes and path polytopes are examples of
problems for which we have simple and efficient combinatorial
algorithms, but whose non-linear convex counterparts are harder and
admit significantly less efficient algorithms. This motivates the
computational model of convex optimization, including the offline,
online and stochastic settings, using a linear optimization oracle.
In this computational model we give several new results that improve
over the previous state of the art. Our main result is a novel
conditional gradient algorithm for smooth and strongly convex
optimization over polyhedral sets that performs only a single linear
optimization step over the domain on each iteration and enjoys a
linear convergence rate. This gives an exponential improvement in
convergence rate over previous results.
Based on this new conditional gradient algorithm we give the first
algorithms for online convex optimization over polyhedral sets that
perform only a single linear optimization step over the domain while
having optimal regret guarantees, answering an open question of Kalai
and Vempala [COLT'03], and Hazan and Kale [ICML'12]. Our online
algorithms also imply conditional gradient algorithms for non-smooth
and stochastic convex optimization with the same convergence rates as
projected (sub)gradient methods.
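To make the linear optimization oracle model concrete, here is a minimal sketch of the classical conditional gradient (Frank-Wolfe) method. It is not the linearly convergent algorithm announced in the abstract. The quadratic objective, the probability simplex as the domain, and the function names are all assumptions chosen for illustration.

```python
def linear_oracle_simplex(gradient):
    """Linear optimization oracle: the simplex vertex minimizing <gradient, s>."""
    best = min(range(len(gradient)), key=lambda i: gradient[i])
    return [1.0 if i == best else 0.0 for i in range(len(gradient))]


def objective(x, target):
    """Illustrative smooth objective: 0.5 * ||x - target||^2."""
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, target))


def frank_wolfe(target, num_iters=2000):
    """Minimize the objective over the probability simplex without projections."""
    n = len(target)
    x = [1.0 / n] * n  # feasible starting point (uniform histogram)
    for t in range(num_iters):
        gradient = [xi - bi for xi, bi in zip(x, target)]
        s = linear_oracle_simplex(gradient)  # the only access to the domain
        step = 2.0 / (t + 2.0)               # standard open-loop step size
        # A convex combination of feasible points stays feasible.
        x = [(1.0 - step) * xi + step * si for xi, si in zip(x, s)]
    return x
```

Each iteration calls the oracle exactly once and never projects onto the domain. This classical variant converges only sublinearly, which is the rate the talk claims to improve on for polyhedral sets.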
Probabilistic low-rank matrix completion on finite alphabets
INRIA Rhône-Alpes, F107
Monday, March 31, 12:00
Abstract:
The task of reconstructing a matrix given a sample of observed entries is known as the matrix completion problem. It arises in a wide range of problems, including recommender systems, collaborative filtering, dimensionality reduction, image processing, quantum physics and multi-class classification, to name a few. Most works have focused on recovering an unknown real-valued low-rank matrix from a random sub-sample of its entries. Here, we investigate the case where the observations take a finite number of values, corresponding for example to ratings in recommender systems or labels in multi-class classification. We also consider a general sampling scheme (not necessarily uniform) over the matrix entries. The performance of a nuclear-norm penalized estimator is analyzed theoretically. More precisely, we derive bounds for the Kullback-Leibler divergence between the true and estimated distributions. In practice, we have also proposed an efficient algorithm based on lifted coordinate gradient descent in order to tackle potentially high dimensional settings.
Some recent results
INRIA Rhône-Alpes, F107
Friday, March 06, 15:15
Abstract:
Recent Advances in Large-Scale Convex Optimization: Algorithms, Complexities, and Applications
INRIA Rhône-Alpes, A104
Monday, January 19, 12:00
Abstract:
In the modern era of large-scale machine learning and high-dimensional
statistics, mixing regularization and kernelization have become increasingly
popular and important modeling strategies. However, they often lead to very
complex optimization models with extremely large scale and nonsmooth objective
functions, which bring new challenges to the traditional first-order methods,
due to the expensive computation or memory cost of proximity operators and even
gradients. In this talk, I will discuss some recent algorithmic advances that
cope with these challenges by taking advantage of the underlying structures and
using randomization techniques. I will present (i) my work on the composite
mirror prox algorithm for a broad class of variational inequalities, which
covers the composite minimization problem with multiple nonsmooth
regularization terms, and (ii) my work on the doubly stochastic gradient descent
algorithm for stochastic optimization problems over reproducing kernel Hilbert
spaces. These algorithms exhibit optimal convergence rates and make it
practical to handle problems with extremely large dimensions and large
datasets. Besides their theoretical efficiency, the algorithms have also proven
useful in a wide range of interesting applications in machine learning, image
processing, and statistical inference.
Incremental proximal majorization-minimization algorithms for large-scale machine learning
INRIA Rhône-Alpes, F107
Wednesday, November 26, 11:00
Abstract:
Recently, an efficient first-order optimization algorithm called MISO
(Minimization by Incremental Surrogate Optimization) was proposed for
incremental unconstrained majorization-minimization, with large-scale machine
learning applications. We propose several extensions of MISO, under less
stringent assumptions, including a proximal counterpart of MISO, called
Prox-MISO, that allows non-smooth regularization to be included in the learning
objectives.
This work was performed as part of my Master's internship in the LEAR team, supervised by Julien Mairal and Zaid Harchaoui.
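As background on how a proximal step handles a non-smooth regularizer, the sketch below shows the proximal operator of the l1 norm (soft-thresholding) and a generic proximal gradient step. This is not Prox-MISO itself. The function names, the choice of l1 regularization, and the parameters `step` and `lam` are assumptions made for illustration.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1, applied coordinate-wise."""
    out = []
    for x in v:
        if x > lam:
            out.append(x - lam)
        elif x < -lam:
            out.append(x + lam)
        else:
            out.append(0.0)  # small coordinates are set exactly to zero
    return out


def proximal_gradient_step(x, gradient, step, lam):
    """One step on smooth_loss(x) + lam * ||x||_1.

    Take a gradient step on the smooth part, then apply the proximal
    operator of the non-smooth part scaled by the step size.
    """
    moved = [xi - step * gi for xi, gi in zip(x, gradient)]
    return soft_threshold(moved, step * lam)
```

For example, `soft_threshold([3.0, -0.5, -2.0], 1.0)` shrinks the large coordinates toward zero by 1.0 and zeroes out the middle one.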
Self-Learning Camera: Autonomous Adaptation of Object Detectors to Unlabeled Video Streams
INRIA Rhône-Alpes, F107
Tuesday, September 23, 11:30
Abstract:
Learning object detectors requires massive amounts of labeled training samples
from the specific data source of interest. This is impractical when dealing
with many different sources (e.g., in camera networks), or constantly changing
ones such as mobile cameras (e.g., in robotics or driving assistant systems).
In this talk, I will describe how to address the problem of self-learning
detectors in an autonomous manner, i.e. (i) detectors continuously updating
themselves to efficiently adapt to streaming data sources (contrary to
transductive algorithms), (ii) without any labeled data strongly related to the
target data stream (contrary to self-paced learning), and (iii) without manual
intervention to set and update hyper-parameters. To that end, we propose an
unsupervised, on-line, and self-tuning learning algorithm to optimize a
multi-task learning convex objective. Our method uses confident but laconic
oracles (human operators or high-precision but low-recall off-the-shelf generic
detectors), and exploits the structure of the problem to jointly learn on-line
an ensemble of instance-level trackers, from which we derive an adapted
category-level object detector. Our approach is validated on real-world
publicly available video object datasets.
Human pose recognition: from third person to first person views
INRIA Rhône-Alpes, A103
Tuesday, September 16, 12:00
Abstract:
In this seminar, I will present an overview of my PhD work, which focused on the problem of full body human pose recognition, and I will introduce some of our more recent work on egocentric image analysis. In the first part, I will present our hierarchical cascade classifier that simultaneously detects humans and estimates their pose by tackling detection as a multi-class classification problem. In the second part of this seminar, I will show how some properties of projective geometry can be exploited for view-invariant monocular tracking in surveillance-scenes. Then, I will present our latest work on hand pose estimation from egocentric viewpoints. For this problem specification, I will show that depth sensors are particularly informative for extracting near-field interactions of the camera wearer with his/her environment. Our method uses task and viewpoint specific synthetic training exemplars, trained with object interactions, in a discriminative detection framework. I will provide an insightful analysis of the performance of our algorithm on a new real-world annotated dataset of egocentric scenes. Finally, I will analyze the limitations of the current approach and give some ideas for future work.
Fast convergence rates in semi-supervised multi-class learning
Yuri Maximov
INRIA Rhône-Alpes, F107
Thursday, August 28, 12:00
Abstract:
We propose a multi-class classification generalization error bound for
semi-supervised learning. The bound involves the margin distribution of the
classifier, a transductive Rademacher complexity, and the empirical adequacy of
the majority rule assigning pseudo-labels to unlabeled data within identified
clusters with the learned function and the true labels of examples. For a given
class of functions, the bound is tight when the data clusters contain, in
majority, examples of the same class and the errors of the learned
function are concentrated on low margin regions. The working hypothesis of our
study is that data can be separated into dense regions, such that the optimal
Bayes classifier assigns to all unlabeled examples within one region the same
class label. Following this assumption, we propose a two-stage multi-class
semi-supervised algorithm which first assigns pseudo-labels to the set of
unlabeled training examples that are found to be in dense regions using the
majority vote, and then learns a classifier using both sets of labeled and
pseudo-labeled examples. With this learning scheme we achieve fast convergence
rates and empirical results on different datasets show the effectiveness of our
approach compared to state-of-the-art semi-supervised algorithms.
A new primal-dual splitting algorithm for convex optimization; application as a heuristic for super-resolution
INRIA Rhône-Alpes, F107
Monday, June 2, 14:00
Abstract:
A new splitting algorithm is proposed to minimize the sum of convex functions, potentially nonsmooth and composed with linear operators. This generic formulation encompasses numerous regularized inverse problems in image processing. The algorithm, whose weak convergence is proved, calls the individual gradient or proximity operators of the functions, without any inner loop or linear system to solve. The classical Douglas-Rachford, forward-backward and Chambolle-Pock algorithms are recovered as particular cases. In the second part of the talk, we address the recovery of a spike train from noisy linear measurements, through a reformulation as a low rank matrix approximation problem. Used as a heuristic for this problem, our algorithm outperforms the state of the art.
Bayesian Error Estimation for Classifier Model Selection
INRIA Rhône-Alpes, A104
Wednesday, May 14, 14:00
Abstract:
The estimation of classification error is a critical step in
classifier design, and closely related to model selection.
Typical model selection procedures are either based on estimating
the error (e.g., cross-validation, bootstrap, holdout, etc.) or
information theoretic principles (e.g., AIC, BIC, MDL).
The problem with the former approach is that the traditional
counting-based estimators are both computationally expensive
and inaccurate. On the other hand, the latter approach optimizes
a measure that is not directly connected to the prediction error
and often requires a careful selection of hyperparameters.
In this talk we concentrate on the recently proposed
Bayesian Error Estimator (BEE), and on its uses for
model selection among a family of generalized linear
models. More specifically, we will show that the
estimator is more accurate than the traditional error
estimation approaches when selecting the best model
along the regularization path of a LASSO regularized
logistic regression model. Moreover, the BEE estimates
the error directly from the training set, thus avoiding
multiple training stages typical of cross-validation
procedures.
As a case study, we will describe the anatomy of our submission
into the IEEE MLSP 2013 Bird sound classification competition
(https://www.kaggle.com/c/mlsp-2013-birds). The method was
essentially a BEE-selected generalized linear model with BoW-like
features calculated from a sparse dictionary representation
calculated with the SPAMS toolbox developed by INRIA.
An algorithm for variable density sampling with block-constrained acquisition
INRIA Rhône-Alpes, F107
Tuesday, April 23, 12:00
Abstract:
Reducing acquisition time is of fundamental importance in various imaging
modalities.
The concept of variable density sampling provides an appealing framework to
address this issue. It was justified recently from a theoretical point of view
in the compressed sensing (CS) literature. Unfortunately, the sampling schemes
suggested by current CS theories may not be relevant since they do not take the
acquisition constraints into account (for example, continuity of the acquisition
trajectory in Magnetic Resonance Imaging - MRI).
In this talk, we propose a numerical method to perform variable density sampling
with block constraints. Our main contribution is to propose a new way to draw
the blocks in order to mimic CS strategies based on isolated measurements. The
basic idea is to minimize a tailored dissimilarity measure between a probability
distribution defined on the set of isolated measurements and a probability
distribution defined on a set of blocks of measurements. This problem turns out
to be convex and solvable in high dimension.
Our second contribution is to define an efficient minimization algorithm based
on Nesterov's accelerated gradient descent in metric spaces. We study carefully
the choice of the metrics and of the prox function. We show that the optimal
choice may depend on the type of blocks under consideration. Finally, we show
that we can obtain better MRI reconstruction results using our sampling schemes
than standard strategies such as equiangularly distributed radial lines.
Two approaches for domain adaptation: Unsupervised subspace alignment and majority vote adaptation
INRIA Rhône-Alpes, F107
Tuesday, April 15, 15:00
Abstract:
Domain adaptation is an important machine learning problem arising when
the learning distribution differs from that of the test data. Many
classification tasks in computer vision or natural language processing
for example are affected by this problem. A general trend to deal with
this issue is to try to move the two distributions closer, w.r.t. a
divergence measure, while ensuring a good accuracy on the learning
sample. In this talk, we present and discuss two possible approaches for
this problem. The first one, which takes the form of an algorithmic
contribution, proposes to move the two distributions closer by an
unsupervised subspace alignment method. The second one is based on a new
domain adaptation framework relying on the PAC-Bayesian theory that aims
at learning an adaptive majority vote of classifiers.
Supervised Metric Learning with Generalization Guarantees
INRIA Rhône-Alpes, F107
Tuesday, April 15, 14:00
Abstract:
Using an appropriate metric is key to the performance of many learning
algorithms. For this reason, a lot of effort has gone during the past 10 years
into metric learning, the research topic devoted to automatically optimizing
distance and similarity functions from data. A large body of work has been
devoted to supervised metric learning from feature vectors, in particular
Mahalanobis distance learning, which essentially learns a linear projection of
the data (in the form of a matrix M) into a new space where some discriminative
constraints are satisfied. Beyond the fact that M usually has to be PSD, which
is a costly constraint, one main limitation of the current supervised metric
learning methods is a substantial lack of theoretical understanding of
generalization in metric learning. First, one may be interested in the
generalization ability of the metric itself, i.e., its consistency not
only on the training sample but also on unseen data coming from the same
distribution. Second, one may also be interested in the generalization ability
of the learning algorithm that uses the learned metric. In this talk, we make
use of the formal framework of good similarities introduced by Balcan et al. to
design an algorithm for learning a non-PSD metric, which is then used to build
a global linear classifier. We show that this approach has uniform stability
and derive a generalization bound on the classification error.
Predicting an Object Location using a Global Image Representation
INRIA Rhône-Alpes, A103
Thursday, March 27, 12:00
Abstract:
We tackle the detection of prominent objects in images as a retrieval task:
given a global image descriptor, we find the most similar images in an
annotated dataset, and transfer the object bounding boxes. We refer to this
approach as data driven detection (DDD), which is an alternative to sliding
windows. Previous works have used similar notions but with task-independent
similarities and representations, i.e. they were not tailored to the end-goal
of localization. This article proposes two contributions: (i) a metric learning
algorithm and (ii) a representation of images as object probability maps, which
are both optimized for detection. We show experimentally that these two
contributions are crucial to DDD, do not require costly additional operations,
and in some cases yield comparable or better results than state-of-the-art
detectors despite conceptual simplicity and increased speed. As an application
of prominent object detection, we improve fine-grained categorization by
pre-cropping images with the proposed approach.
Spatial Information and End-to-End Learning for Visual Recognition
INRIA Rhône-Alpes, F107
Wednesday, March 26, 11:00
Abstract:
We present our research on visual recognition and machine learning. Two types
of visual recognition problems are investigated: action recognition and human
body part segmentation. Our objective is to combine spatial information
such as label configuration in feature space, or spatial layout of labels into
an end-to-end framework to improve recognition performance.
For human action recognition, we apply the bag-of-words model and reformulate
it as a neural network for end-to-end learning. We propose two algorithms to
make use of label configuration in feature space to optimize the codebook. One
is based on classical error backpropagation. The codewords are adjusted by
using a gradient descent algorithm. The other is based on cluster reassignments,
where the cluster labels are reassigned for all the feature vectors in a
Voronoi diagram. As a result, the codebook is learned in a supervised way. We
demonstrate the effectiveness of the proposed algorithms on the standard KTH
human action dataset.
For human body part segmentation, we treat the segmentation problem as
a classification problem, where a classifier acts on each pixel. Two machine
learning frameworks are adopted: randomized decision forests and convolutional
neural networks. We integrate a priori information on the spatial part
layout in terms of pairs of labels or pairs of pixels into both frameworks in
the training procedure to make the classifier more discriminative, but
pixelwise classification is still performed in the testing stage. Three
algorithms are proposed:
(i) Spatial part layout is integrated into randomized decision forest training
procedure;
(ii) Spatial pre-training is proposed for the feature learning in the ConvNets;
(iii) Spatial learning is proposed in the logistic regression (LR) or
multilayer perceptron (MLP) for classification.
Adaptive Euclidean Maps for Histograms: Generalized Aitchison Embeddings
INRIA Rhône-Alpes, A104
Friday, February 3 2014, 12:00
Abstract:
Learning distances that are specifically designed to compare histograms in the probability simplex has recently attracted the attention of the machine learning community. Learning such distances is important because most machine learning problems involve bags of features rather than simple vectors. Ample empirical evidence suggests that the Euclidean distance in general and Mahalanobis metric learning in particular may not be suitable to quantify distances between points in the simplex. We propose in this paper a new contribution to address this problem by generalizing a family of embeddings proposed by Aitchison (1982) to map the probability simplex onto a suitable Euclidean space. We provide algorithms to estimate the parameters of such maps by building on previous work on metric learning approaches. The criterion we study is not convex, and we consider alternating optimization schemes as well as accelerated gradient descent approaches. These algorithms lead to representations that outperform alternative approaches to compare histograms in a variety of contexts.
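As a point of reference for the kind of map being generalized, the sketch below implements the centered log-ratio transform, a classical log-ratio map from Aitchison's compositional data analysis, and the Euclidean distance it induces between histograms. The learned, parameterized embeddings of the talk are not reproduced here. The function names are assumptions, and histograms are assumed to be strictly positive.

```python
import math


def clr(p):
    """Centered log-ratio transform of a strictly positive histogram."""
    logs = [math.log(x) for x in p]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log makes the result invariant to rescaling p.
    return [value - mean_log for value in logs]


def aitchison_distance(p, q):
    """Euclidean distance between the clr images of two histograms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(p), clr(q))))
```

Because the map only depends on ratios between bins, `clr([0.2, 0.3, 0.5])` and `clr([2.0, 3.0, 5.0])` agree up to floating-point error. Histograms with zero bins would need smoothing first, which this sketch leaves out.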
Co-Occurrence Statistics for Zero-Shot Classification
INRIA Rhône-Alpes, F107
Monday, January 13 2014, 12:00
Abstract:
In this paper we aim for zero-shot classification, but in contrast to the common setting of multi-class image classification, we focus on multi-label image datasets. The goal is to transfer knowledge from the known labels to the unseen labels. Our method relies on easy-to-obtain co-occurrence statistics of class labels harvested from existing annotations, web-search hit counts or image tags. Our main contribution is to use inter-dependencies that arise naturally between classes for zero-shot classification. We propose various similarity metrics for leveraging these co-occurrences, and show that our zero-shot classifiers can serve as priors for few-shot learning. Experiments on three challenging multi-labelled datasets reveal that our proposed zero-shot methods are approaching and occasionally outperforming supervised SVMs. We conclude that co-occurrence statistics suffice for zero-shot classification. (This talk is based on my current CVPR submission, so this work is as yet unpublished.)
Learning with Asymmetric Information
INRIA Rhône-Alpes, F107
Tuesday, January 7 2014, 11:00
Abstract:
Many computer vision problems have an asymmetric
distribution of information, i.e. less or more information about
a problem is available at training time than at test time. In my
talk I will discuss our recent work on both situations: 1) the
LUPI framework for the case when we have additional data
modalities available for the training data, and 2) a label propagation
approach for the case when an additional similarity measure
is available at test time (both published at ICCV 2013).
The return of AdaBoost.MH: multi-class Hamming trees
INRIA Rhône-Alpes, F107
Wednesday, October 9 2013, 17:00
Abstract:
Within the framework of AdaBoost.MH, we propose to train vector-valued decision
trees to optimize the multi-class edge without reducing the multi-class problem
to K binary one-against-all classifications. The key element of the method is a
vector-valued decision stump, factorized into an input-independent vector of
length K and a label-independent scalar classifier. At inner tree nodes, the
label-dependent vector is discarded and the binary classifier can be used for
partitioning the input space into two regions. The algorithm retains the
conceptual elegance, power, and computational efficiency of binary AdaBoost. In
experiments it is on par with support vector machines and with the best
existing multi-class boosting algorithm AOSOLogitBoost, and it is significantly
better than other known implementations of AdaBoost.MH.
High-dimensional change-point detection with sparse alternatives
INRIA Rhône-Alpes, F107
Thursday, September 12 2013, 14:00
Abstract:
We consider the problem of detecting a change in mean in a sequence of
high-dimensional Gaussian vectors. We assume that the change happens only in an
unknown subset of the vector components. We propose a testing procedure that is
adaptive to the number of non-zero components. Under high-dimensional
assumptions we obtain the detection boundary and prove rate optimality of the
test.
Sinkhorn Distances: Lightspeed Computation of Optimal Transportation Distances
INRIA Rhône-Alpes, Grand Amphithéâtre
Wednesday, August 28 2013, 11:00
Abstract:
Optimal transportation distances are a fundamental family of parameterized distances for histograms. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost is prohibitive whenever the histograms' dimension exceeds a few hundred. We propose in this work a new family of optimal transportation distances that look at transportation problems from a maximum-entropy perspective. We smooth the classical optimal transportation problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn-Knopp's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transportation solvers. We also report improved performance over classical optimal transportation distances on the MNIST benchmark problem.
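The matrix scaling iteration mentioned above can be sketched in a few lines. This is a minimal pure-Python illustration assuming strictly positive histograms `r` and `c` and a small cost matrix. The function name, the parameter `reg` (the regularization strength) and the fixed iteration count are assumptions, not the speaker's implementation.

```python
import math


def sinkhorn(r, c, cost, reg=10.0, num_iters=500):
    """Entropy-regularized transport cost between histograms r and c.

    Larger reg means weaker smoothing, so the result is closer to the
    classical optimal transportation cost.
    """
    n, m = len(r), len(c)
    # Element-wise kernel exp(-reg * cost).
    K = [[math.exp(-reg * cost[i][j]) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(num_iters):
        # Alternately rescale rows to match r and columns to match c.
        u = [r[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [c[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    dist = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return dist, plan
```

Each iteration costs only matrix-vector products, which is where the speed advantage over linear programming solvers comes from. A practical implementation would use vectorized linear algebra and a convergence test instead of a fixed number of iterations.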
Representation Learning by Archetypal Analysis
Yuansi Chen
INRIA Rhône-Alpes, F107
Thursday, July 18 2013, 12:00
Abstract:
Archetypal analysis is an unsupervised data analysis technique which was
introduced by Cutler and Breiman [9]. It represents multivariate data by a convex
combination of data prototypes called archetypes, which are themselves convex combinations of
data points. Unlike many other unsupervised learning techniques such as sparse coding
or non-negative matrix factorization, archetypes are easy to interpret. In our work,
we first introduce an efficient implementation of the archetypal analysis method with
recent optimization techniques. Second, we conduct numerical experiments showing
that archetypal analysis leads to state-of-the-art results when used for learning the
underlying structure of natural patches in image denoising and classification tasks.
The Three R's of Computer Vision: Recognition, Reconstruction and Reorganization
INRIA Rhône-Alpes, F107
Thursday, July 11 2013, 11:00
Abstract:
Over the last two decades, we have seen remarkable progress in computer vision with demonstration of capabilities such as face detection, handwritten digit recognition, reconstructing three-dimensional models of cities, automated monitoring of activities, segmenting out organs or tissues in biological images, and sensing for control of robots and cars. Yet there are many problems where computers still perform significantly below human perception. For example, in the recent PASCAL benchmark challenge on visual object detection, the average precision for most 3D object categories was under 50%.
I will argue that further progress on the classic problems of computational vision (recognition, reconstruction and reorganization) requires us to study the interaction among these processes. For example, recognition of 3D objects benefits from a preliminary reconstruction of 3D structure, instead of just treating it as a 2D pattern classification problem. Recognition is also reciprocally linked to reorganization, with bottom-up grouping processes generating candidates, which are combined with top-down activations of object and part detectors. In this talk, I will show some of the progress we have made towards the goal of a unified framework for the 3 R's of computer vision. I will also point towards some of the exciting applications we may expect over the next decade as computer vision starts to deliver on even more of its grand promise.
Recent work on patch descriptor selection and exploiting layout in image classification
INRIA Rhône-Alpes, F107
Tuesday, July 9 2013, 15:00
Abstract:
This talk covers two recent papers.
The first (ICMR'11) investigates the use of photographic style for category-level image classification. Specifically, we exploit the assumption that images within a category share a similar style defined by attributes such as colorfulness, lighting, depth of field, viewpoint and saliency. For these style attributes we create correspondences across images by a generalized spatial pyramid matching scheme. Where the spatial pyramid groups features spatially, we allow more general feature grouping and in this paper we focus on grouping images on photographic style. We evaluate our approach in an object classification task and investigate style differences between professional and amateur photographs. We show that a generalized pyramid with style-based attributes improves performance on the professional Corel and amateur Pascal VOC 2009 image datasets.
In the second (ECCV'12), we start from the observation that local image descriptors are generally designed for describing all possible image patches. Such patches may be subject to complex variations in appearance due to incidental object, scene and recording conditions. Because of this, a single best descriptor for accurate image representation under all conditions does not exist. Therefore, we propose to automatically select from a pool of descriptors the one that is most suitable based on object surface and scene properties. These properties are measured on the fly from a single image patch through a set of attributes. Attributes are input to a classifier which selects the best descriptor. Our experiments on a large dataset of colored object patches show that the proposed selection method outperforms the best single descriptor and a-priori combinations of the descriptor pool.
Rooms: Where are things and where could they be?
INRIA Rhône-Alpes, F107
Monday, July 8 2013, 16:00
Abstract:
Rooms are interesting, because people live in rooms. Autonomous robots
will need to manage in rooms; surveillance programs will need to
understand pictures of rooms; and there is much commercial value in being
able to manipulate pictures of rooms, for example, to show how a high-value
sofa would look in your living room.
I will describe current work on understanding rooms from a single image.
Our methods can now estimate a "box" describing a room and block out the
major structure of the space in that box. Methods from other groups can
identify major furniture items, too.
I will then show how these representations can be used to insert items
into the room. Inserted items can be rendered realistically, so they look
as though they are participating in the light transfer in the room environment.
These methods allow us to build speculative representations: if there were
more furniture in this room, what would it look like, and where would it be?
These ideas suggest an ideology of visual representation as an exposition
of likely futures (rather than as an account of what is seen). There are
important consequences: identifying objects may not be as important
as understanding free space, materials, and the potential of objects.
Articulated Pose Estimation using Discriminative Armlet Classifiers
INRIA Rhône-Alpes, F107
Friday, July 5 2013, 14:00
Abstract:
We propose a novel approach for human pose estimation in real-world cluttered scenes, and focus on the challenging problem of predicting the pose of both arms for each person in the image. For this purpose, we build on the notion of poselets [4] and train highly discriminative classifiers to differentiate among arm configurations, which we call armlets. We propose a rich representation which, in addition to standard HOG features, integrates the information of strong contours, skin color and contextual cues in a principled manner. Unlike existing methods, we evaluate our approach on a large subset of images from the PASCAL VOC detection dataset, where critical visual phenomena, such as occlusion, truncation, multiple instances and clutter are the norm. Our approach outperforms Yang and Ramanan [26], the state-of-the-art technique, with an improvement from 29.0% to 37.5% PCP accuracy on the arm keypoint prediction task, on this new pose estimation dataset.
link to paper
Event retrieval in large video collections with circulant temporal encoding
INRIA Rhône-Alpes, C108
Friday, June 14 2013, 14:00
Abstract:
This paper presents an approach for large-scale event retrieval. Given a video clip of a specific event, e.g., the wedding of Prince William and Kate Middleton, the goal is to retrieve other videos representing the same event from a dataset of over 100k videos. Our approach encodes the frame descriptors of a video to jointly represent their appearance and temporal order. It exploits the properties of circulant matrices to compare the videos in the frequency domain. This offers a significant gain in complexity and accurately localizes the matching parts of videos. Furthermore, we extend product quantization to complex vectors in order to compress our descriptors, and to compare them in the compressed domain. Our method outperforms the state of the art both in search quality and query time on two large-scale video benchmarks for copy detection, TRECVID and CCWEB. Finally, we introduce a challenging dataset for event retrieval, EVVE, and report the performance on this dataset.
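As a rough illustration of comparing two sequences in the frequency domain (not the authors' implementation), the sketch below computes a circular cross-correlation through a discrete Fourier transform and reads the temporal offset from its peak. It assumes one scalar value per frame instead of a full frame descriptor, and uses a naive DFT where a real system would use an FFT; the function names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circular_cross_correlation(query, database):
    """score[d] = sum_t query[t] * database[(t + d) mod n], via the frequency domain."""
    fq = dft(query)
    fb = dft(database)
    # For real-valued inputs, correlation becomes conj(F(query)) * F(database).
    product = [a.conjugate() * b for a, b in zip(fq, fb)]
    return [v.real for v in dft(product, inverse=True)]

def best_offset(query, database):
    """Temporal shift at which the two sequences match best."""
    scores = circular_cross_correlation(query, database)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy data (assumption): the database sequence is the query delayed by 3 frames.
query = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
database = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0]
scores = circular_cross_correlation(query, database)
offset = best_offset(query, database)
```

All offsets are scored with one forward transform per sequence and one inverse transform, which is where the complexity gain over comparing every shift explicitly comes from.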
Scene Understanding: What more can we do to better understand scenes?
INRIA Rhône-Alpes, F107
Tuesday, June 11 2013, 11:30
Abstract:
The problem of scene understanding has manifested itself in various forms,
including, but not limited to, object recognition, 3D scene recovery, and image
segmentation. In this talk I will discuss some of my attempts to address these
tasks, starting with our energy based formulation for reasoning about regions,
objects, and their attributes such as object class, location, and spatial
extent. We define a global energy function, which combines results from sliding
window detectors, and low-level pixel-based unary and pairwise relations. I
will also briefly describe methods for solving the inference and parameter
learning problems efficiently in the context of these optimization problems.
In the second part of the talk I will focus on other related challenges: (i)
Video segmentation – Video not only provides rich visual cues such as motion
and appearance, but also long-range temporal interactions among objects. We
present a method to capture such interactions and to construct a powerful
intermediate-level representation for subsequent recognition. (ii) Text
recognition in scenes – Scene text provides useful cues, such as the
geographical location and the types of buildings in the scene, and the problem
of recognizing it is receiving significant attention. I will describe our
framework that exploits
bottom-up cues, derived from individual character detections from the image,
and top-down constraints, obtained from language statistics, for solving this
problem.
Structure-Preserving Object Tracking and Forensic Painting Analysis
INRIA Rhône-Alpes, F107
Monday, June 10 2013, 11:30
Abstract:
The talk gives an overview of my work in computer vision. Specifically, I will
present my work on model-free tracking and on forensic painting analysis. In
addition, I will briefly highlight my work on the visualization of
high-dimensional data, and on the regularization of learning models.
Model-free tracking. Model-free trackers track arbitrary objects based on a
single annotation of the object. Whilst the performance of model-free trackers
has recently improved substantially, simultaneously tracking multiple objects
with similar appearance remains very hard. We propose a new multi-object
model-free tracker (based on tracking-by-detection) that resolves this problem
by incorporating spatial constraints between the objects. The spatial
constraints are learned along with the object detectors using an online
structured SVM algorithm. The experimental evaluation of our
structure-preserving object tracker reveals significant performance
improvements in both multi-object and single-object tracking.
Painting analysis. High-resolution radiographs of paintings reveal the
structure of the canvas on which the painting was made. Due to the way in which
canvas is produced, the spacings between canvas threads form a "fingerprint"
that can be used to identify canvases that originated from the same bolt of
canvas. We present a technique to extract and compare canvas fingerprints from
painting radiographs, and we show how our techniques may provide new
art-historical insights reuniting Poussin's Bacchanals painted for Cardinal
Richelieu.
The talk presents joint work with Lu Zhang (Delft University of Technology) and
Robert Erdmann (University of Arizona).
On cutting planes for mixed integer linear programming
Alberto Del Pia
INRIA Rhône-Alpes, F107
Monday, April 29 2013, 12:00
Abstract:
This talk gives an introduction to a recently established link between the geometry of numbers and mixed integer linear optimization. The main focus is to provide a review of families of lattice-free polyhedra and their use in a disjunctive programming approach. The use of lattice-free polyhedra in the context of deriving and explaining cutting planes for mixed integer programs is not only mathematically interesting, but also leads to some fundamental new discoveries, such as an understanding of the conditions under which cutting plane algorithms converge finitely. These theoretical results suggest the possibility that cutting planes from special families of lattice-free polyhedra could give rise to numerically efficient novel algorithms.
A unified framework for change point detection and other related problems
INRIA Rhône-Alpes, Grand Amphithéâtre
Friday, April 26 2013, 12:00
Abstract:
We propose a unified convex-optimization-based framework for problems of
detecting a signal of a given shape in Gaussian noise. The framework covers
various detection settings including: detection of jumps in curves and their
derivatives; detection of a periodic component in Gaussian time series; and
signal detection from indirect observations. We present a general detection
procedure, analyze its properties and show that it cannot be improved in some
specific settings.
Comparison of mid-level feature coding approaches and pooling strategies in visual concept detection
Piotr Koniusz
INRIA Rhône-Alpes, F107
Wednesday, April 17 2013, 11:00
Abstract:
Bag-of-Words lies at the heart of modern object category recognition systems.
After descriptors are extracted from images, they are expressed as vectors
representing visual word content, referred to as mid-level features. In this
talk we review a number of techniques for generating mid-level features,
including two variants of Soft Assignment, Locality-constrained Linear Coding,
and Sparse Coding. Moreover, we investigate various pooling methods that
aggregate mid-level features into vectors representing images. Average pooling,
Max-pooling, and a family of likelihood inspired pooling strategies are
scrutinised. We generalise the investigated pooling methods to account for the
descriptor interdependence and introduce an intuitive concept of improved
pooling. We also propose a coding-related improvement to increase its speed. As
the pooling step aggregates only occurrences of visual words represented by
coefficients of each mid-level feature vector, we refer to it as First-order
Occurrence Pooling. We propose to aggregate over co-occurrences of visual words
in mid-level features. A derivation of Second- and Higher-order Occurrence
Pooling based on linearisation of so-called Minor Polynomial Kernel is
demonstrated. We evaluate how First-, Second-, and Third-order Occurrence
Pooling performs given various coders and pooling operators. For bi- and
multi-modal coding with two or more coders, we demonstrate an extension of
Second- and Higher-order Occurrence Pooling based on linearisation of Minor
Polynomial Kernel. Lastly, we compare the proposed approaches to other renowned
methods (e.g. Fisher Vector Encoding) in the same testbed and attain
state-of-the-art results with 69.2% MAP on PascalVOC07, 90.2% accuracy on
Flower102, 83.6% accuracy on Caltech101, and 41.2% MAP on ImageCLEF11.
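To make the difference between pooling operators concrete, here is a minimal sketch, assuming each descriptor has already been coded into a small vector of visual-word coefficients. The second-order variant shown is simply an averaged outer product that captures co-occurrences of visual words; it is a simplification for illustration, not the derivation from the Minor Polynomial Kernel presented in the talk.

```python
def average_pooling(codes):
    """First-order pooling: mean coefficient of each visual word."""
    n = len(codes)
    return [sum(column) / n for column in zip(*codes)]

def max_pooling(codes):
    """First-order pooling: strongest response of each visual word."""
    return [max(column) for column in zip(*codes)]

def second_order_pooling(codes):
    """Average of the outer products c c^T over all mid-level features c."""
    n = len(codes)
    k = len(codes[0])
    return [[sum(c[i] * c[j] for c in codes) / n for j in range(k)]
            for i in range(k)]

# Toy mid-level features (assumption): 3 descriptors coded over 3 visual words.
codes = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]
avg = average_pooling(codes)
mx = max_pooling(codes)
second = second_order_pooling(codes)
```

The off-diagonal entries of `second` are non-zero only for visual words that are active in the same descriptor, which is the co-occurrence information that first-order pooling discards.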
Unsupervised Learning of Invariant Object Representations—A Probabilistic Generative Modeling Approach
INRIA Rhône-Alpes, F107
Tuesday, April 16 2013, 14:00
Abstract:
A fundamental problem of computer vision is how to learn and infer objects in
images robustly. For instance, objects need to be represented in spatially and
temporally efficient forms and their representations need to be flexible in
order to be used for various learning and inference tasks. Objects need to be
inferred invariant w.r.t. varied conditions, e.g., different illumination
conditions, changes of viewpoints, etc. Objects need to be learned and inferred
in cluttered scenes (with the existence of other objects and a variety of
noise). Learning object representations with these desired properties is a
long-standing goal in computer vision. As a step in this direction, we study
the problem of autonomously learning invariant object representations from
visual scenes. It covers three aspects of the above mentioned properties:
autonomous (unsupervised) object learning, learning invariant object
representations, and modeling occlusive objects in visual scenes. New
generative models have been proposed together with efficient algorithms for
their parameter optimization. The limitations of previous works have been
avoided by using a more principled approach to derive efficient learning
algorithms and addressing a novel scheme of modeling occlusion.
Hierarchical analysis of hyperspectral images using binary partition trees
INRIA Rhône-Alpes, F107
Wednesday, April 3 2013, 11:00
Abstract:
After decades of use of multispectral remote sensing, most of the
major space agencies now have new programs to launch hyperspectral
sensors, recording the reflectance information of each point on the
ground in hundreds of narrow and contiguous spectral bands. The
spectral information is instrumental for the accurate analysis of the
physical components present in a scene. But every rose has its
thorns: most traditional signal and image processing
algorithms fail when confronted with such high-dimensional data (each
pixel is represented by a vector with several hundred
dimensions).
In this talk, we focus on the extension to hyperspectral data of a
very powerful image processing analysis tool: the Binary Partition
Tree (BPT). It provides a generic hierarchical representation of
images and consists of the two following steps:
- construction of the tree: one starts from the pixel level and merges
pixels/regions progressively until the top of the hierarchy (the
whole image considered as one single region) is reached. To
proceed, one needs to define a model to represent the regions (for
instance the average spectrum, although this is not a good idea) and one
also needs to define a similarity measure between neighbouring regions
to decide which ones should be merged first (for instance the
Euclidean distance between the models of the two regions, although this
is not a good idea either). This step (construction of the tree) is very
much related to the data.
- pruning of the tree: this step is very much
related to the considered application. The pruning of the tree leads
to one segmentation. The resulting segmentation might not be any of
the results obtained during the iterative construction of the tree.
This is where this representation outperforms the standard approaches.
But one may also perform classification, or object detection (assuming
an object of interest will appear somewhere as one node of the tree,
the game is to define a suitable criterion, related to the
application, to find this node).
Results are presented on various hyperspectral images.
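The construction step can be sketched on a toy example. The code below assumes a single row of pixels with one scalar value each, uses the region mean as the region model and the absolute difference of means as the similarity measure. These are exactly the simple choices the abstract warns against for real hyperspectral data; they are used here only to keep the sketch short, and the function name is hypothetical.

```python
def build_bpt(values):
    """Greedy bottom-up merging of adjacent regions on a 1D row of pixels.

    Returns the merge history as (left_id, right_id, new_id) triples; the last
    new_id is the root node covering the whole row.
    """
    # Each region is (node_id, sum_of_values, pixel_count); leaves get ids 0..n-1.
    regions = [(i, float(v), 1) for i, v in enumerate(values)]
    next_id = len(values)
    merges = []
    while len(regions) > 1:
        # Find the pair of neighbouring regions whose mean values are closest.
        best = min(
            range(len(regions) - 1),
            key=lambda i: abs(regions[i][1] / regions[i][2]
                              - regions[i + 1][1] / regions[i + 1][2]),
        )
        left, right = regions[best], regions[best + 1]
        merged = (next_id, left[1] + right[1], left[2] + right[2])
        merges.append((left[0], right[0], next_id))
        regions[best:best + 2] = [merged]
        next_id += 1
    return merges

# Toy row of pixels (assumption): two dark pixels followed by two bright ones.
merges = build_bpt([1.0, 1.1, 5.0, 5.2])
```

With n pixels the tree has n - 1 merges and 2n - 1 nodes; pruning then amounts to choosing a set of nodes from this history according to an application-specific criterion.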
Large-scale learning from interaction data
INRIA Rhône-Alpes, F107
Thursday, January 17 2013, 10:30
Abstract:
In many important applications, we need to make decisions in environments where
the reward is only partially observed, but can be modeled as a function of an
action and an observed context. Examples include user content optimization,
Internet advertising and health-care policy. In the first part of the talk, I
will discuss the problem of evaluation of a new policy (e.g., a user serving
policy) given historic data. The key statistical challenge is properly
accounting for the fact that the past policy and the proposed policy differ. I
will present an accurate technique that solves this without collecting any new
data. In the second part of the talk, I will focus on a computational challenge
of learning from massive interaction data sets. I will describe a distributed
optimization technique that allows solving tera-scale problems in 1 hour (using
1000 machines/cores).
Based on joint work with John Langford, Lihong Li, Alekh Agarwal and Olivier
Chapelle.
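One standard way to account for the mismatch between the logging policy and a new policy is inverse propensity scoring. The sketch below shows that generic estimator under the assumption that the probability with which each logged action was chosen is known; it is not necessarily the technique presented in the talk, and the data layout and names are hypothetical.

```python
def ips_value(logged, target_policy):
    """Estimate the average reward of target_policy from logged interactions.

    logged: list of (context, action, reward, logging_probability) tuples
    target_policy(context, action): probability that the new policy picks action
    """
    total = 0.0
    for context, action, reward, logging_prob in logged:
        # Re-weight each logged reward by how much more (or less) likely the
        # new policy is to take the logged action than the old policy was.
        weight = target_policy(context, action) / logging_prob
        total += weight * reward
    return total / len(logged)

# Toy log (assumption): a uniform random policy over two actions "a" and "b".
logged = [
    ("user1", "a", 1.0, 0.5),
    ("user1", "b", 0.0, 0.5),
    ("user2", "a", 0.0, 0.5),
    ("user2", "b", 1.0, 0.5),
]

def always_a(context, action):
    """New policy under evaluation: always choose action "a"."""
    return 1.0 if action == "a" else 0.0

estimate = ips_value(logged, always_a)
```

No new data is collected: only the rewards of logged actions that the new policy would also have taken contribute, scaled up to compensate for how rarely they were logged.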
Towards efficient video representations for action recognition
INRIA Rhône-Alpes, A104
Friday, November 30 2012, 12:00
Abstract:
In this talk, we first review some popular spatial-temporal features for video,
and compare their performance in action recognition. In total, we consider four
different feature detectors and six local feature descriptors. We demonstrate
that dense sampling at regular positions consistently outperforms all tested
space-time interest point detectors in real-world videos.
The second part will introduce our recent video features based on dense
trajectories and motion boundary descriptors. Dense trajectories capture the
local motion patterns in the video and guarantee a good coverage of the context
information. Additionally, motion boundary descriptors are shown to consistently
outperform other state-of-the-art descriptors, in particular on real-world
videos that contain a significant amount of camera motion. We will also discuss
some drawbacks of the current methods and possible further extensions.
The Role of V4 During Natural Vision
INRIA Rhône-Alpes, F107
Monday, November 26 2012, 11:00
Abstract:
The functional organization of area V4 in the mammalian ventral visual pathway
is far from being well understood. V4 is believed to play an important role in
the recognition of shapes and objects and in visual attention, but its
complexity makes it hard to analyze. Individual cells in V4 have been shown to
exhibit a large diversity of preferences to visual stimuli characteristics,
including orientation, curvature, motion, color and texture. Such observations
were for a large part obtained from electrophysiological and imaging studies,
when a subject (monkey or human) is shown a sequence of artificial stimuli
during data acquisition. In our study, we intend to go beyond such an approach
and analyze a population of V4 neurons in naturalistic conditions. More
precisely, we record responses from V4 neurons to grayscale still natural
images---that is, discarding color and motion content. We propose a new
computational model for V4 that does not rely on any pre-defined image features
but only on invariance and sparse coding principles. Our approach is the first
to achieve comparable prediction performance for V4 as for V1 cells on
responses to natural images. Our model is also interpretable using sparse
principal component analysis. In the neuron population observed and based on
our computational model, we discover as our main finding two groups of neurons:
those selective to texture versus those selective to contours. This supports
the thesis that one primary role of V4 is to extract objects from background in
the visual field. Moreover, our study also confirms the diversity of V4
neurons. Among those selective to contours, some of them are selective to
orientation, others to acute curvature features.
This is joint work with Yuval Benjamini, Ben Willmore, Michael Oliver, Jack
Gallant and Bin Yu. This work was performed at UC Berkeley.
Refresher on neural networks and overview of libraries for deep learning
INRIA Rhône-Alpes, A104
Friday, November 23 2012, 11:30
Abstract:
Recent results [1] highlighted the excellent performance of deep learning
architectures for complex high-level computer vision tasks. This talk aims
at providing some basic practical knowledge in order to start playing
around with these algorithms.
We will begin with a brief refresher on neural networks and the
back-propagation algorithm. We will then provide an overview of two
Open Source libraries that can be used to learn deep architectures:
Theano [2] (python) and EBLearn [3] (C++).
Reading material:
Chapter 11 (on Neural Networks) from
The Elements of Statistical Learning
Presentation:
http://lear.inrialpes.fr/people/gaidon/lear_xrce_deep_learning_01.html
Block-Coordinate Frank-Wolfe for Structural SVMs
INRIA Rhône-Alpes, F107
Monday, November 12 2012, 14:00
Abstract:
We propose a randomized block-coordinate variant of the classic Frank-Wolfe
algorithm for convex optimization with block-separable constraints. Despite its
lower iteration cost, we show that it achieves the same convergence rate as the
full Frank-Wolfe algorithm. We also show that, when applied to the dual
structural support vector machine (SVM) objective, this algorithm has the same low
iteration complexity as primal stochastic subgradient methods. However, unlike
stochastic subgradient methods, the stochastic Frank-Wolfe algorithm allows us to
compute the optimal step-size and yields a computable duality gap guarantee. Our
experiments indicate that this simple algorithm outperforms competing structural
SVM solvers.
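The block-coordinate idea can be illustrated on a much simpler problem than the structural SVM dual. The sketch below assumes a toy quadratic objective whose constraint set is a product of probability simplices (one simplex per block); each iteration updates a single randomly chosen block, solves the linear subproblem by picking a simplex vertex, and uses the exact line-search step that is available in closed form for this quadratic. The problem, names and parameters are assumptions made for illustration.

```python
import random

def objective(x, targets):
    """f(x) = 0.5 * sum_i ||x_i - targets[i]||^2."""
    return 0.5 * sum((xk - bk) ** 2
                     for xi, bi in zip(x, targets)
                     for xk, bk in zip(xi, bi))

def bcfw(targets, iterations=5000, seed=0):
    """Minimise the objective with every block x_i constrained to the
    probability simplex, updating one random block per iteration."""
    rng = random.Random(seed)
    n = len(targets)
    # Start every block at the first vertex of its simplex.
    x = [[1.0] + [0.0] * (len(b) - 1) for b in targets]
    for _ in range(iterations):
        i = rng.randrange(n)  # pick one block uniformly at random
        grad = [xk - bk for xk, bk in zip(x[i], targets[i])]
        # Linear minimisation over a simplex: the vertex with the smallest gradient entry.
        j = min(range(len(grad)), key=grad.__getitem__)
        s = [1.0 if k == j else 0.0 for k in range(len(grad))]
        d = [sk - xk for sk, xk in zip(s, x[i])]       # Frank-Wolfe direction
        gap = -sum(g * dk for g, dk in zip(grad, d))   # block duality gap, >= 0
        norm_sq = sum(dk * dk for dk in d)
        if norm_sq == 0.0:
            continue  # the block already sits on the selected vertex
        gamma = min(1.0, max(0.0, gap / norm_sq))      # exact line search for this quadratic
        x[i] = [xk + gamma * dk for xk, dk in zip(x[i], d)]
    return x

# Toy targets (assumption): both lie inside their simplex, so the optimum is f = 0.
targets = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3]]
solution = bcfw(targets)
final_value = objective(solution, targets)
```

Each update is a convex combination of the current block and a simplex vertex, so the iterate stays feasible without any projection step.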
Using Machine Learning to Predict Protein-Protein and Protein-Ligand Interactions
INRIA Rhône-Alpes, F107
Friday, November 9 2012, 10:30
Abstract:
Protein-protein and protein-ligand interactions are crucial for many biological
processes such as signal transduction, DNA replication, etc. Such interactions
are also fundamental in many diseases (e.g. cancers). In this talk, I will
describe our recent work on machine learning techniques that predict these
interactions.
Due to the difficulties, time and cost of the experimental methods for
determining the structures and binding affinities of molecular complexes,
efficient computational methods are usually used in this field. However, the
accuracy of these computational methods is often rather low due to the crude
approximations of the interactions within the complex and also due to
insufficient sampling of the configurational space for the molecules that form
the complex.
I will describe a new machine learning algorithm that very precisely
reconstructs the interactions between the molecules based on the structural
information currently available in the databases. These databases contain
three-dimensional molecular structures determined by experimental techniques
and have been growing very rapidly. In 2012, the PDB (Protein Data Bank)
contained about 80,000 protein structures. The CSD (Cambridge Structural
Database), a database for small molecules, contained about 500,000 entries
at the beginning of 2012. We trained our interaction model with some 60,000
parameters on structures from these databases and verified the results on
several standard benchmarks as well as in blind docking prediction competitions.
The success rates of our model, according to the benchmarks, rank it among the
top-3 methods currently available.
Predicting Binary Features for Attribute-Based and Multi-Label Classification
INRIA Rhône-Alpes, Grand Amphi
Friday, October 26 2012, 15:30
Abstract:
The prediction of attributes, i.e. semantic properties of objects or scenes,
has recently received a lot of attention in the computer vision community. In
their simplest form, one can interpret attributes simply as a layer of binary
mid-level features that can be computed from the image contents. In my talk I
will discuss two recent works in this area: the automatic learning of
additional, non-semantic, binary features that augment an existing set of
attributes (ECCV 2012), and a method for more efficiently predicting binary
outputs in highly connected graphical models, where inference has to be performed
by sampling (NIPS 2012).
Multi-step flow fusion: towards accurate and dense correspondences in long video shots
INRIA Rhône-Alpes, F107
Thursday, October 25 2012, 10:00
Abstract:
The aim of this work is to estimate dense displacement fields over long video
shots. Put in sequence, they are useful for representing point trajectories but
also for propagating (pulling) information from a reference frame to the rest
of the video. Highly elaborate optical flow estimation algorithms are at
hand, and they have been applied before to dense point tracking by simple
accumulation, albeit with unavoidable position drift. On the other hand,
direct long-term point matching is more robust to such deviations, but it is
very sensitive to ambiguous correspondences. Why not combine the benefits of
both approaches? Following this idea, we develop a multi-step flow fusion
method that optimally generates dense long-term displacement fields by first
merging several candidate estimated paths and then filtering the tracks in the
spatio-temporal domain. Our approach handles small and large
displacements with improved accuracy and is able to recover a trajectory
after temporary occlusions. With video editing applications in mind,
we address graphic element insertion and video volume
segmentation, and report quantitative comparisons with state-of-the-art
approaches on ground-truth data.
Score-based Bayesian Skill Learning
CANCELLED
Abstract:
We extend the Bayesian skill rating system of TrueSkill to accommodate
score-based match outcomes. TrueSkill has proven to be a very
effective algorithm for matchmaking --- the process of pairing
competitors based on similar skill-level --- in competitive online
gaming. However, for the case of two teams/players, TrueSkill only
learns from win, lose, or draw outcomes and cannot use additional
match outcome information such as scores. To address this deficiency,
we propose novel Bayesian graphical models as extensions of TrueSkill
that (1) model each player's offence and defence skills separately and (2)
model how these offence and defence skills interact to generate
score-based match outcomes. We derive efficient (approximate)
Bayesian inference methods for inferring latent skills in these new
models and evaluate them on three real data sets including Halo 2 XBox
Live matches. Empirical evaluations demonstrate that the new
score-based models (a) provide more accurate win/loss probability
estimates than TrueSkill when training data is limited,
(b) provide competitive and often better win/loss classification
performance than TrueSkill, and (c) provide reasonable score outcome predictions with an
appropriate choice of likelihood --- prediction for which TrueSkill was not
designed, but which can be useful in many applications.
Distances and Kernels on Discrete Structures: the generating-function trick
Marco Cuturi
INRIA Rhône-Alpes, F107
Thursday, October 5 2012, 12:00
Abstract:
Distances and positive definite kernels lie at the core of many
machine learning algorithms. When comparing vectors, these two
concepts form well-matched pairs that are almost interchangeable:
trivial operations such as changing signs, adding renormalization
factors, taking logarithms or exponentials are usually sufficient to
recover one from the other (e.g. Euclidean distances & Laplace
kernels). However, when comparing discrete structures, this harmonious
symmetry falls apart. The culprit lies in the introduction of
combinatorial optimization to compute distances (e.g. edit distances
for strings / time series / trees; minimum cost matching distances for
sets of points; transportation distances for histograms etc.). Simple
counterexamples show that such considerations -- finding a minimal
cost matching or a maximal alignment to compare two objects -- tend to
destroy any hope of recovering a positive definite kernel from such
distances. We present a review of several results in the recent
literature that have overcome this limitation. We provide a unified
framework for these approaches by highlighting the fact that they all
rely on generating functions to achieve positive definiteness.
Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost
Thomas Mensink
INRIA Rhône-Alpes, F107
Monday, October 1 2012, 14:00
Abstract:
We are interested in large-scale image classification and especially
in the setting where images corresponding to new or existing classes are
continuously added to the training set. Our goal is to devise classifiers which
can incorporate such images and classes on-the-fly at (near) zero cost. We cast
this problem into one of learning a metric which is shared across all classes
and explore k-nearest neighbor (k-NN) and nearest class mean (NCM) classifiers.
We learn metrics on the ImageNet 2010 challenge data set, which contains more
than 1.2M training images of 1K classes. Surprisingly, the NCM classifier
compares favorably to the more flexible k-NN classifier, and has comparable
performance to linear SVMs. We also study the generalization performance, among
others by using the learned metric on the ImageNet-10K dataset, and we obtain
competitive performance. Finally, we explore zero-shot classification, and show
how the zero-shot model can be combined very effectively with small training
datasets.
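The reason a nearest class mean classifier can absorb new images and classes at near-zero cost is visible in a few lines. The sketch below keeps one running sum per class and assigns a sample to the closest class mean; it assumes a plain Euclidean distance in place of the learned metric described in the abstract, and the class and method names are hypothetical.

```python
import math

class NearestClassMean:
    """Assign a sample to the class whose mean feature vector is closest."""

    def __init__(self):
        self.sums = {}    # label -> running sum of feature vectors
        self.counts = {}  # label -> number of training examples seen

    def add(self, label, features):
        """Fold one training example into its class mean.

        A previously unseen label simply creates a new entry; nothing is retrained.
        """
        if label not in self.sums:
            self.sums[label] = [0.0] * len(features)
            self.counts[label] = 0
        self.sums[label] = [s + f for s, f in zip(self.sums[label], features)]
        self.counts[label] += 1

    def predict(self, features):
        def distance(label):
            mean = [s / self.counts[label] for s in self.sums[label]]
            return math.dist(mean, features)
        return min(self.sums, key=distance)

# Toy 2D features (assumption) for two initial classes.
ncm = NearestClassMean()
ncm.add("cat", [0.0, 0.0])
ncm.add("cat", [0.0, 1.0])
ncm.add("dog", [4.0, 4.0])
ncm.add("dog", [5.0, 4.0])
before = ncm.predict([1.0, 1.0])

# A new class is added on the fly by storing a single extra mean.
ncm.add("bird", [10.0, 0.0])
after = ncm.predict([9.0, 1.0])
```

With a learned metric shared across all classes, only the distance computation would change; adding a class would still cost one mean.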
Hyperbolic wavelet transform: a new tool for analyzing anisotropic textures
INRIA Rhône-Alpes, Grand Amphi
Wednesday, October 3 2012, 11:00
Some connections between random matrix theory and high-dimensional statistics
Nourredine El Karoui
INRIA Rhône-Alpes, F107
Thursday, July 19 2012, 11:00
Abstract:
In recent years, there has been a paradigm shift in the size of the
datasets statisticians are working with.
In the "classical" setting, one worked with datasets consisting of n
observations of a vector of size p, and p was much smaller than n.
In the "modern" setting of high-dimensional statistics, it is now
common to work with datasets where p and n are comparable and quite
large (for instance a few hundred). Sometimes p is also much greater
than n.
I will discuss work which sheds light on the behavior of commonly used
statistical procedures in the "large n, large p" setting, where we
study the asymptotic behavior of statistical estimators assuming that
p and n both go to infinity while p/n has a finite non-zero limit.
Building on this understanding, we can propose alternatives to
classical statistical methods which are better able to handle the
difficulties inherent in high-dimensional statistics. I will describe
some of my work in this direction.
At the heart of a number of these analyses is modern random matrix
theory. I will talk about the role played by this theory and its
potential limitations for statistical modeling, highlighting the
connection with the concentration of measure phenomenon.
A Few Machine Learning-Friendly Optimization and Algorithmic Properties
INRIA Rhône-Alpes, F107
Thursday, July 4 2012, 15:00
Abstract:
I will introduce some of the main results of my PhD. First, the
so-called "proximal" methods have drawn a lot of attention, lately,
for solving non-smooth optimization problems that naturally arise for
Machine Learning and Signal Processing, among others.
The efficiency of those methods relies on the computation of the
proximity operator, which, in a lot of problems, can't be obtained in
closed form. In those situations, the proximity operator is
approximated through the use of iterative procedures.
We will see how some finite-time analysis can lead to unexpected
strategies where the precision of the approximations can be chosen so
that the global procedure has: a) good theoretical properties of the
quality of the solution, b) a minimal computational cost.
Then, we will investigate the use of a non-standard performance
measure of interest for (multi-class) machine learning problems,
namely the Confusion Matrix. We advocate that in several cases, this
quantity could be "minimized", instead of the more standard "risk"
that is usually considered in ML problems. Along with this
framework, we provide some of its theoretical grounds with
generalization bounds that can be obtained through a generalization
of the "stability" analysis, which consists in leveraging algorithmic
properties to provide statistical guarantees for the classifiers.
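For contrast with the cases discussed in the first part of this abstract, where the proximity operator has no closed form, the sketch below shows a well-known case where it does: the proximity operator of the scaled absolute value, often called soft-thresholding. It is included only as a reference point and is not taken from the talk; the function names are hypothetical.

```python
def soft_threshold(v, lam):
    """Proximity operator of lam * |.| at the scalar v:
    argmin_u 0.5 * (u - v)**2 + lam * abs(u)."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def prox_l1(vector, lam):
    """The L1 norm is separable, so its proximity operator acts coordinate-wise."""
    return [soft_threshold(v, lam) for v in vector]

shrunk = prox_l1([3.0, -0.5, -3.0], 1.0)
```

When no such formula exists, the operator has to be approximated by an inner iterative procedure, which is the setting where the precision of the approximation becomes a design choice.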
Hypothesis Testing and Bayesian Inference: New Applications of Kernel Methods
INRIA Rhône-Alpes, Grand Amphi
Monday, June 11 2012, 11:00
Abstract:
In the early days of kernel machines research, the "kernel trick" was
considered a useful way of constructing nonlinear learning algorithms
from linear ones, by applying the linear algorithms to feature space
mappings of the original data. Recently, it has become clear that a
potentially more far reaching use of kernels is as a linear way of
dealing with higher order statistics, by mapping probabilities to a
suitable reproducing kernel Hilbert space (i.e., the feature space is
an RKHS).
I will describe how probabilities can be mapped to reproducing kernel
Hilbert spaces, and how to compute distances between these mappings.
A measure of strength of dependence between two random variables
follows naturally from this distance. Applications that make use of
kernel probability embeddings include:
* Nonparametric two-sample testing and independence testing in complex
(high dimensional) domains. As an application, we find whether text in
English is translated from the French, as opposed to being random
extracts on the same topic.
* Bayesian inference, in which the prior and likelihood are
represented as feature space mappings, and a posterior feature space
mapping is obtained. In this case, Bayesian inference can be
undertaken even in the absence of a model, by learning the prior and
likelihood mappings from samples.
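The distance between two kernel mean embeddings can be estimated directly from samples. The sketch below computes a biased estimate of its square, commonly called the maximum mean discrepancy, assuming one-dimensional samples and a Gaussian kernel with a hand-picked bandwidth; it illustrates the quantity only and is not a full two-sample test, which would also need a significance threshold.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of the squared RKHS distance between the mean
    embeddings of the distributions that produced xs and ys."""
    def mean_kernel(a, b):
        return sum(kernel(u, v) for u in a for v in b) / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

# Toy samples (assumption): one sample compared with itself and with a shifted copy.
sample = [0.0, 0.5, 1.0]
shifted = [5.0, 5.5, 6.0]
same_distance = mmd_squared(sample, sample)
shifted_distance = mmd_squared(sample, shifted)
```

The estimate is zero when the two samples coincide and grows as the samples separate, which is what makes it usable as a test statistic.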
Helping each other to see: Humans and machines
Larry Zitnick
INRIA Rhône-Alpes, Grand Amphi
Tuesday, April 24 2012, 11:00
Abstract:
Humans and machines see the world differently, each having their own strengths and weaknesses. In this talk, I describe two projects exploring how they may help each other.
Visual object recognition by machines is notoriously difficult. To help in the learning process, humans are typically used to gather large hand-labeled training datasets from which the machines may learn. However, humans may also be used to "debug" the machine's recognition pipeline to learn what aspects are lacking. Specifically, we explore the various stages of part-based person detectors. We perform human studies in which subjects perform the same sub-tasks as their machine counterparts, and accuracies are compared.
The typical human has significant difficulty in drawing everyday objects containing complex structures, such as faces or bikes. When learning to draw, humans must learn to see the world differently. That is, they must not only recognize what they are seeing, but they must perceive the spacing and structural layout of an object. We demonstrate an application in which machines can recognize what a human is drawing and provide visual guidance to the drawer in the form of shadows. The shadows, which may be either used or ignored by the drawer, help the drawer achieve more realistic overall shapes and spacing, while maintaining their own unique drawing style.
Point Process models for multiple object detection
Ahmed Gamal-Eldin
INRIA Rhone-Alpes, F107
TBD
Abstract:
I will start with a brief introduction to Point Process models in image processing, focusing mainly on remote sensing. I will talk about existing optimization methods, and discuss the main characteristics of a good optimizer for these models. Next, I will present a new optimization algorithm we call "Multiple Births and Cut" (MBC). It combines the recently developed optimization algorithm Multiple Births and Deaths (MBD) and the Graph-Cut. I will present three different variants of this algorithm. I will present results on synthetic data to show how the algorithm scales with the problem size. Finally I will present results on different applications.
Leveraging category-level labels for instance-level image retrieval
Albert Gordo
INRIA Rhone-Alpes, F107
Thursday, March 15 2012, 15:00
Abstract:
We consider the problem of query-by-example instance-level image retrieval: given a query image of an object or a scene, we want to retrieve within a potentially large dataset other instances of the exact same object or scene.
For efficiency reasons, it is common to represent an image by a fixed-length descriptor which is subsequently encoded into a small number of bits.
We note that most encoding techniques include an unsupervised dimensionality reduction step.
Our goal in this work is to learn a better subspace in a supervised manner. We especially raise the following question:
"can category-level labels be used to learn such a subspace?"
To answer this question, we experiment with four learning techniques: a metric learning approach, attributes representation, Canonical Correlation Analysis (CCA) and Joint Subspace and Classifier Learning (JSCL).
While the first three approaches have been applied in the past to the image retrieval problem, we believe we are the first to show the usefulness of JSCL in this context.
In our experiments, we use ImageNet as a source of category-level labels and report retrieval results on two standard datasets:
INRIA Holidays and the University of Kentucky benchmark.
Our experimental study shows that metric learning and attributes do not lead to any significant improvement in retrieval accuracy, as opposed to CCA and JSCL.
As an example, we report on Holidays an increase in accuracy from 39.3% to 48.6% with 32-dimensional representations.
Structured Models for Image Labeling
Thomas Mensink
INRIA Rhone-Alpes, F107
Monday, March 12 2012, 11:00
Abstract:
In this paper we propose structured prediction models for image labeling that
explicitly take into account dependencies among image labels. We describe a
tree-based structure, where image labels are nodes, and edges encode dependency
relations. Our models are more expressive than independent label predictors,
such as one vs. rest SVMs and lead to more accurate predictions in the case of
fully-automatic image labeling.
However, the gain becomes more significant in an interactive scenario where a
user provides the value of some of the image labels at test time.
Such an interactive scenario offers an interesting trade-off between label
accuracy and manual labeling effort.
The structured models are used to decide which labels should be set by the user,
and transfer the user input to more accurate predictions on other image labels.
This is an extended version of my CVPR 2011 paper.
On visual tracking
Patrick Pérez
INRIA Rhone-Alpes, Grand Amphi
Thursday, January 26 2012, 11:00
Abstract:
Visual motion estimation is a generic task of crucial importance in a variety of video analysis and processing systems. It comes under multiple guises, depending on the extent and the density of the spatial estimation support (from sparse fragments to whole objects and complete scenes) and on the extent of the temporal analysis (from instantaneous velocity estimation to long-term visual tracking). This variety, along with the long history of this branch of computer vision, makes its rapid overview difficult. There are nonetheless several important methodological concepts, pertaining to sequential inference and to visual appearance modeling/matching, that traverse many works in this field, including most recent ones. With a focus on visual tracking, I will touch upon such tools with the help of a large range of illustrative examples.
CVPR submission
Zeynep Akata & Gokberk Cinbis
INRIA Rhone-Alpes, F107
Thursday, January 19 2012, 12:00
Abstract:
Zeynep Akata and Gokberk Cinbis will talk about their recent CVPR submissions.
Learning temporal information for action recognition
Adrien Gaidon
INRIA Rhone-Alpes, F107
Monday, January 16 2012, 16:00
Abstract:
Current state-of-the-art models of human actions in realistic videos,
e.g. the bag of spatio-temporal visual words, are often based on the
aggregation of local features in an orderless fashion. However, actions
are by essence temporal phenomena and some actions, like "sitting down"
and "getting up", can only be reliably classified if their models
incorporate some temporal structure.
We present two recent results on incorporating temporal information in
state-of-the-art recognition methods.
First, we describe a simple action model, called the Actom Sequence
Model (ASM), encoding global ordering constraints between temporal
parts. We explain how we learn the temporal structure of an action and
perform efficient action detection on large video databases.
Then, we introduce a new kernel between multivariate time series, called
the Difference between Auto-Correlation Operators (DACO) kernel, and
demonstrate its applicability to videos. This kernel compares two
actions based on their dynamics, represented by the auto-correlation
operator in the Reproducing Kernel Hilbert Space (RKHS) associated with
a "base" kernel between frames. We show that it leverages useful
temporal dependency information, that complements traditional kernels on
bag-of-words.
Finally, we illustrate the performance of our algorithms on challenging
action recognition benchmarks and show improvements w.r.t. the state of
the art.
Joint work with Zaid Harchaoui and Cordelia Schmid
Automatic human face 3D modeling from a single view image / Fast low-rank metric learning
Danila Potapov / Dan Oneata
INRIA Rhone-Alpes, F107
Monday, December 19 2011, 12:00
Abstract:
Automatic human face 3D modeling from a single view image:
Human face 3D modeling from a single image is a very challenging problem. The missing depth information has to be inferred using a generative model with dozens of latent variables. Probabilistic analysis suggests minimization of a huge energy function with respect to parameters of different nature. In this talk, I will give an overview of existing techniques based on the 3D Morphable Model approach. I will present our method that makes use of additional information (facial feature points and contours). I will discuss implementation details and show experimental results. The method is flexible enough to be applied to both natural and sculptured human faces.
Fast low-rank metric learning:
We propose two families of algorithms for reducing the computational cost of the NCA method. First, we consider ideas inspired by the sub-sampling methods. We investigate a mini-batch method that forms mini-batches by clustering and a sub-set learning algorithm that is theoretically justified by stochastic optimization arguments. Our experiments demonstrate that these methods offer significant speed-up gains while obtaining classification scores similar to the classical NCA. The second family of algorithms includes variants of approximate methods. We derive these methods by first interpreting NCA as a class-conditional kernel density estimation (CC-KDE) problem. This formulation offers several advantages: (i) it allows us to adapt existing algorithms for fast kernel-density estimation (e.g., Gray and Moore, 2003) into the context of NCA and (ii) it offers more flexibility; for example, we develop a compact-support version of the NCA method that achieves considerable speed-ups when combined with the stochastic learning procedure.
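To make the cost that these speed-ups target more concrete, here is a minimal sketch of the standard NCA objective (the expected number of correctly classified points under a stochastic nearest-neighbour rule). It is not the speakers' implementation: the function name nca_objective and the list-of-lists data layout are assumptions made for illustration. The double loop over all pairs of points is the quadratic cost that mini-batching or fast kernel-density estimation would aim to reduce.

```python
import math

def nca_objective(A, X, y):
    """Standard NCA objective for a linear map A (hypothetical helper).

    A: projection matrix as a list of rows, X: list of points, y: labels.
    Returns sum_i p_i, where p_i is the probability that point i picks a
    same-class neighbour under a softmax over squared projected distances.
    """
    # Project every point: z_i = A x_i
    Z = [[sum(a * x for a, x in zip(row, xi)) for row in A] for xi in X]
    n = len(Z)
    total = 0.0
    for i in range(n):
        # Unnormalised kernel weights to every other point (a point never picks itself)
        w = [0.0 if j == i else
             math.exp(-sum((zi - zj) ** 2 for zi, zj in zip(Z[i], Z[j])))
             for j in range(n)]
        denom = sum(w)
        if denom > 0.0:
            # Class-conditional mass: weights of same-class points over all weights
            total += sum(wj for wj, yj in zip(w, y) if yj == y[i]) / denom
    return total
```

With two well-separated clusters and the identity map, the objective approaches the number of points when labels agree with the clusters, and approaches zero when they do not.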
The role of attractiveness for web image search
Bo Geng
INRIA Rhone-Alpes, F107
Thursday, December 08 2011, 16:00
Abstract:
Existing web image search engines are mainly designed to optimize topical relevance. However, according to our user study, attractiveness is becoming a more and more important factor for web image search engines to satisfy users' search intentions. Important as it can be, web image attractiveness from the search users' perspective has not been sufficiently recognized in either industry or academia. In this paper, we present a definition of web image attractiveness with three levels according to the end users' feedback, including perceptual quality, aesthetic sensitivity and affective tone. Corresponding to each level of the definition, various visual features are investigated on their applicability to attractiveness estimation of web images. To further deal with the unreliability of visual features induced by the large variations of web images, we propose a contextual approach to integrate the visual features with contextual cues mined from image EXIF information and the associated web pages. We explore the role of attractiveness by applying it to various stages of a web image search engine, including the online ranking and the interactive reranking, as well as the offline index selection. Experimental results on three large-scale web image search datasets demonstrate that the incorporation of attractiveness can bring more satisfaction to 80% of the users for ranking/reranking search results and 30.5% index coverage improvement for index selection, compared to the conventional relevance based approaches.
Portmanteau vocabularies: multi-cue visual neologism on the cheap
Andrew D. Bagdanov
INRIA Rhone-Alpes, F107
Monday, November 28 2011, 11:00
Abstract:
The success of the bag-of-words (BOW) model for image classification
is highly dependent on the quality of the visual vocabulary used. This
talk will consider visual vocabularies used to represent images whose
local features are described by both shape and color. In it, I will
describe a new approach to feature combination in the BOW model that
builds discriminative compound words from primitive cues learned
independently from training images. Motivated by the observation that
modeling joint-cue distributions independently is more statistically
robust for typical classification problems than attempting to directly
estimate joint-cue distribution empirically, the statistics of joint
visual words are modeled assuming conditional independence of
individual features once the class is known. We apply information
theoretic vocabulary compression to find discriminative combinations
of joint-cues and the resulting vocabulary of visual portmanteaux is
compact, has the cue binding property, and supports individual
weighting of cues in the final image representation. State-of-the-art
results on both the Oxford Flower-102 and Caltech-UCSD Bird-200
datasets demonstrate the effectiveness of the approach compared to
other, significantly more complex approaches to multi-cue image
representation.
Training Random Forests with Ambiguously Labeled Data
Christian Leistner
INRIA Rhone-Alpes, F107
Wednesday, October 19 2011, 15:00
Abstract:
Although nowadays the number of digital images is exploding,
collecting large amounts of labeled data can still be tedious and
costly. Additionally, the labels can be noisy or formatted in a way
which might not be optimal to exploit by the learning method - consider
bounding box annotations in images. This motivates the development and
usage of learning algorithms that are able to exploit both small amounts
of labeled data and large amounts of unlabeled data, which are usually
easy to get. Also, the learning method should allow for a certain amount
of flexibility in the labeling.
In this talk, I will show how to use Random Forests (RFs) to tackle
these challenges. RFs are able to deliver state-of-the-art results in
various applications, are fast to train and evaluate, are inherently
multi-class, run on parallel architectures and are robust to label
noise, which makes them perfect candidates to exploit large amounts of
unlabeled or ambiguously labeled samples.
In particular, I will present extensions of RFs to semi-supervised and
multiple-instance learning as well as to online learning, which is
needed in many applications. Finally, I will present a new method that
is able to benefit from unlabeled videos, even if the content is
unrelated to the given task.
Recent research
INRIA Rhone-Alpes F107
Friday, October 7 2011, 10:00 am
Abstract:
Internal talk about some recent research
Monocular 3D Pose Estimation
Srimal Jayawardena
INRIA Rhone-Alpes F107
Wednesday, November 2 2011, 11:00 am
Abstract:
The problem of identifying the 3D pose of a known object from a given
2D image has important applications in Computer Vision. Our proposed
method of registering a 3D model of a known object on a given 2D photo
of the object has numerous advantages over existing methods. It does
not require prior training, knowledge of the camera parameters,
explicit point correspondences or matching features between the image
and model. Unlike techniques that estimate a partial 3D pose (as in an
overhead view of traffic or machine parts on a conveyor belt), our
method estimates the complete 3D pose of the object. It works on a
single static image from a given view under varying and unknown
lighting conditions. For this purpose we derive a novel
illumination-invariant distance measure between the 2D photo and
projected 3D model, which is then minimised to find the best pose
parameters. Results for vehicle pose detection in real photographs are
presented.
Manifold Learning by Semidefinite Facial Reduction
Nathan KRISLOCK
INRIA Rhone-Alpes F107
Thursday, June 16 2011, 4:00 pm
Abstract:
The problem of nonlinear dimensionality reduction is most often
formulated as a semidefinite programming (SDP) problem. Currently, only
SDP problems of limited size can be solved directly using existing SDP
solvers. To overcome this difficulty, we propose a novel SDP
formulation for dimensionality reduction based on semidefinite facial
reduction. The key observation is that in manifold learning, the
structure of a large chunk of the data can be preserved as a whole,
instead of dividing it into very small neighborhoods. This observation
leads to a new formulation that significantly reduces the size and the
number of constraints of the SDP problem. Our method is a stable,
fast, and scalable algorithm for manifold learning, allowing us to
solve very large problems. We obtain high quality solutions without
the need for post-processing by local gradient descent search methods,
as is often required by other large-scale SDP-based methods for
manifold learning.
This is joint work with Babak Alipanahi and Ali Ghodsi (University of
Waterloo, Canada).
First Order Methods for Large-Scale Convex Optimization
INRIA Rhone-Alpes
Friday, May 20 2011, 12:30 am
Abstract:
We discuss several state-of-the-art first-order methods for minimizing convex objectives over "simple" large-scale feasible sets; these methods are computationally cheap, as opposed to the polynomial-time Interior Point algorithms. We are particularly interested in first-order methods for "well-structured" large-scale nonsmooth convex programs. These methods utilize the problem structure in order to convert the original nonsmooth minimization problem into a saddle point problem with a smooth convex-concave cost function. This reformulation allows the solution process to be accelerated significantly. Our emphasis is on methods which, under favorable circumstances, exhibit a (nearly) dimension-independent convergence rate. We also outline possibilities to further accelerate first-order methods by randomization.
Variational Approximations for Factor Analysis
Guillaume Bouchard
INRIA Rhone-Alpes, F107
Friday, May 20 2011, 11am
Abstract:
Many statistical techniques, such as the computation of the data likelihood in the presence of nuisance parameters, the prediction in the presence of missing data, or the computation of the posterior distribution over parameters can be simply expressed as integration problems. Variational approaches enable us to transform an intractable integral into an optimization problem. After a brief tutorial on common variational techniques used to solve machine learning problems, we will present recent developments on the use of variational bounds to solve large scale missing data problems when data are heterogeneous (i.e. when there are both discrete and continuous observations) and heteroscedastic (i.e. when the data variance is not the same for all the observed entities). The final part of the talk will introduce Split Variational Inference, a generic approach to computing large scale non-Gaussian integrals by splitting them into small pieces that are easier to approximate by unnormalized Gaussian distributions.
Cascaded distinctive features for specific and class object recognition
Jerome Revaud
INRIA Rhone-Alpes, F107
Thursday, March 31 2011, 3pm
Abstract:
Object recognition in images is a growing field. Over the past several years,
the emergence of invariant interest points such as SIFT [Lowe, 2001] has
enabled rapid and effective systems for the recognition of instances of
specific objects as well as classes of objects (e.g. using the
bag-of-words model). However, our experiments on the recognition of
specific object instances have shown that under realistic conditions of
use (e.g. the presence of various noises such as blur, poor lighting,
low resolution cameras, etc.) progress remains to be made in terms of
recall: despite the low rate of false positives, too few actual
instances are detected regardless of the system (RANSAC, votes / Hough
...). In this presentation, we first present a contribution to overcome
this problem of robustness for the recognition of object instances, then
we directly extend this contribution to the detection and localization
of classes of objects.
Initially, we have developed a method inspired by graph matching to
address the problem of fast recognition of instances of specific objects
in noisy conditions. This method makes it easy to combine any type of
local feature (e.g. contours, textures ...) less affected by noise than
keypoints, while bypassing the normalization problem and without
penalizing too much the detection speed. In this approach, the detection
system consists of a set of cascades of micro-classifiers trained
beforehand. Each micro-classifier is responsible for comparing the test
image locally and from a certain point of view (e.g. as contours, or
textures ...) to the same area in the model image. The cascades of
micro-classifiers can therefore recognize different parts of the model
in a robust manner (only the most effective cascades are selected during
learning). Finally, a probabilistic model that combines those partial
detections infers global detections. Unlike other methods based on a
global rigid transformation, our approach is robust to complex
deformations such as those due to perspective or the non-rigid ones
inherent to the model itself (e.g. a face, a flexible magazine).
Our experiments on several datasets have shown the relevance of our
approach. It is overall slightly less robust to occlusion than existing
approaches, but it produces better performance in noisy conditions.
In a second step, we have developed an approach for detecting classes of
objects in the same spirit as the bag-of-visual-words model. For this we
use our cascaded micro-classifiers to recognize visual words more
distinctive than the classical words simply based on visual dictionaries
(like [Csurka, 2004] or [Zhang, 2006]). Training is divided into two
parts: First, we generate cascades of micro-classifiers for recognizing
local parts of the model pictures and then in a second step, we use a
classifier to model the decision boundary between images of the class and
those outside the class. This classifier bases its decision on a vector
counting the outputs of each binary micro-classifier. This vector is
extremely sparse and a simple classifier such as Real-Adaboost manages
to produce a system with good performance (this type of classifier is
similar in fact to the subgraph membership kernel). In particular, we
show that the association of classical visual words (from keypoints
patches) and our distinctive words results in a significant
improvement. The computation time is generally quite low, given the
structure of the cascades, which minimizes the detection time, and the form
of the classifier, which is extremely fast to evaluate.
Recent results
Jean Ponce
INRIA Rhone-Alpes, A109
Friday, March 29 2011, 11am
Abstract:
Informal talk on some recent results
Seam Carving for Image Retargeting
Alex Mansfield
INRIA Rhone-Alpes, F107
Friday, March 25 2011, 4pm
Abstract:
Seam carving defines an energy over the image, and uses dynamic
programming to efficiently optimize for 8-connected paths (seams)
through the image pixels that can be removed, shrinking the image by 1
pixel in one dimension. I will introduce and motivate the problem of
image retargeting, which seam carving aims to solve, and describe the
key solution approaches. I will describe the seam carving algorithm in
detail, and show its successes and failures. I will describe
extensions to this method, including our recent work, in which we
focus on understanding seam carving further as an optimization process
and on improving results when user interaction is possible. I will
evaluate the success of the field in tackling the problem of image
retargeting, and finally give some key insights and hint at the
challenges ahead.
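The dynamic program mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the speaker's code: the helper names gradient_energy, find_vertical_seam and remove_vertical_seam are hypothetical, images are plain lists of rows, and the gradient-magnitude energy is only one common choice of energy.

```python
def gradient_energy(gray):
    """Energy as the sum of absolute horizontal and vertical differences (assumed choice)."""
    h, w = len(gray), len(gray[0])
    energy = []
    for r in range(h):
        row = []
        for c in range(w):
            dx = gray[r][min(c + 1, w - 1)] - gray[r][max(c - 1, 0)]
            dy = gray[min(r + 1, h - 1)][c] - gray[max(r - 1, 0)][c]
            row.append(abs(dx) + abs(dy))
        energy.append(row)
    return energy

def find_vertical_seam(energy):
    """Return one column index per row for a minimum-energy 8-connected vertical seam."""
    h, w = len(energy), len(energy[0])
    # cost[r][c] = minimal total energy of a seam ending at pixel (r, c)
    cost = [[float(e) for e in energy[0]]]
    for r in range(1, h):
        prev = cost[-1]
        cost.append([energy[r][c] + min(prev[max(c - 1, 0):min(c + 2, w)])
                     for c in range(w)])
    # Backtrack from the cheapest pixel in the last row
    c = min(range(w), key=cost[-1].__getitem__)
    seam = [c]
    for r in range(h - 2, -1, -1):
        lo, hi = max(c - 1, 0), min(c + 2, w)
        c = min(range(lo, hi), key=cost[r].__getitem__)
        seam.append(c)
    seam.reverse()
    return seam

def remove_vertical_seam(image, seam):
    """Delete one pixel per row, shrinking the image width by 1."""
    return [row[:c] + row[c + 1:] for row, c in zip(image, seam)]
```

Repeating find_vertical_seam and remove_vertical_seam on a recomputed energy map shrinks the image one column at a time; horizontal seams can be handled by transposing the image.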
Robust Estimation for an Inverse Problem Arising in Multiview Geometry
Arnak Dalalyan
INRIA Rhone-Alpes, F107
Thursday, March 3rd 2011, 16h00
Abstract:
We propose a new approach to the problem
of robust estimation for some inverse problems arising
in multiview geometry. Inspired by recent advances
in the statistical theory of recovering sparse vectors,
we define our estimator as a Bayesian maximum a posteriori
with multivariate Laplace prior on the vector
describing the outliers. This leads to an estimator in
which the fidelity to the data is measured by the $L_\infty$-norm
while the regularization is done by the $L_1$-norm.
The proposed procedure is fairly fast since the outlier
removal is done by solving one linear program (LP). An
important difference compared to existing algorithms is
that for our estimator it is not necessary to specify either
the number or the proportion of the outliers; only
an upper bound on the maximal measurement error for
the inliers should be specified. We present theoretical
results assessing the accuracy of our procedure, as well
as numerical examples illustrating its efficiency on
synthetic and real data. This is joint work with Renaud
Keriven.
Union Support Recovery in Multi-task Learning
Mladen Kolar
INRIA Rhone-Alpes, F107
Monday, November 29th 2010, 16h00
Abstract:
We sharply characterize the performance of different penalization
schemes for the problem of selecting the relevant variables in the
multi-task setting. Previous work focuses on the regression problem
where conditions on the design matrix complicate the analysis. A
clearer and simpler picture emerges by studying the Normal means
model. This model, often used in the field of statistics, is a
simplified model that provides a laboratory for studying complex
procedures. These theoretical results will be presented together with
implications for practitioners.
With John Lafferty and Larry Wasserman.
Learning structured prediction models for interactive image labeling
INRIA Rhone-Alpes, F107
Thursday, November 25th 2010, 14h00
Abstract:
In this talk I will present my CVPR submission.
In the paper we propose structured models for image labeling, which take into account label dependencies. These models are more expressive than independent label predictors, and lead to more accurate predictions.
While the improvement is modest for fully-automatic image annotation, the gain is significant in an interactive scenario where a user provides the value of some of the image labels. In this interactive scenario, the structured models are used to decide which labels should be set by the user, and to infer the remaining labels conditioned on the user responses.
We also apply our models to attribute-based image classification, where attribute predictions of a test image are mapped to class probabilities by means of a given attribute-class mapping. In this case the structured models are built at the attribute level. We also consider an interactive system where the system asks a user to set some of the attribute values in order to maximally improve class prediction performance.
Experimental results on three publicly available benchmark data sets show that in all scenarios structured models lead to more accurate predictions, and leverage user input much more effectively than state-of-the-art independent models.
This is joint work with Jakob and Gabriela Csurka (XRCE).
Fast tropical matrix multiplication and applications to message passing
INRIA Rhone-Alpes, F107
Tuesday, November 23rd 2010, 14h00
Abstract:
In discrete pairwise graphical models containing loops, exact inference via message passing amounts to repeatedly computing matrix products. In order to efficiently compute marginals in such models, one could in principle apply any of the well-known subcubic solutions to this problem. However, computing MAP states requires computing matrix products in the max-product (or 'tropical') semiring, where the existence of a subcubic solution remains an open question. In this talk, we discuss expected-case subcubic solutions to this problem, and show how they can lead to faster message passing algorithms in a variety of computer vision problems.
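For reference, the matrix product in question can be written down directly. The sketch below is the naive cubic baseline, not the expected-case subcubic algorithm discussed in the talk; it works in the log domain, where the max-product semiring becomes max-plus, and the name max_plus_matmul is an assumption made for illustration.

```python
def max_plus_matmul(A, B):
    """Naive O(n^3) tropical product: C[i][j] = max over k of A[i][k] + B[k][j].

    A and B hold log-potentials, so adding entries corresponds to multiplying
    potentials and the max selects the best intermediate state k.
    """
    inner = len(B)
    cols = len(B[0])
    return [[max(A[i][k] + B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(len(A))]
```

Replacing max with a sum and + with a product recovers the ordinary matrix product used for computing marginals, which is the case where the well-known subcubic algorithms apply.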
Human Action Recognition in Uncontrolled Videos
INRIA Rhone-Alpes, A104
Thursday, November 18th 2010, 14h00
Abstract:
In this talk, I will present our two recent approaches to human action recognition in uncontrolled videos. The first approach deals with the case where there are not enough training sequences to learn the action classifiers directly from videos. In this case, we show how we can make use of the images collected from the Web to learn representations of actions and use this knowledge to automatically annotate actions in videos. Our approach is unsupervised, in the sense that it requires no human intervention other than the text querying. The benefits are two-fold: first, we show that we can improve retrieval of action images, and second, we can collect a large generic database of action poses, which can then be used in tagging videos. We present experimental evidence that using action images collected from the Web, annotating actions is possible.
In the second part of the talk, I will present our approach which uses the scene and object information in the videos together with the pose and motion information to infer human actions. Here, our observation is that human actions can be identified not only by the singular observation of the human body in motion, but also by properties of the surrounding scene and the related objects. We propose an approach that integrates multiple feature channels from several entities and formulate the problem in a multiple instance learning (MIL) framework. Our experimental results show that scene and object information can be effectively used to complement person features for human action recognition.
Faster Algorithms for Max-Product Message-Passing
INRIA Rhone-Alpes, F107
Thursday, October 14th 2010, 16h00
Abstract:
Maximum A Posteriori inference in graphical models is often solved via message-passing algorithms, such as the junction-tree algorithm, or loopy belief-propagation. The exact solution to this problem is well known to be exponential in the size of the model's maximal cliques after it is triangulated, while approximate inference is typically exponential in the size of the model's factors. In this presentation, I'll show recent work from our lab in which we take advantage of the fact that many models have maximal cliques that are larger than their constituent factors, and also of the fact that many factors consist entirely of latent variables (i.e., they do not depend on an observation). This is a common case for several practical models, including many models on grids, trees, ring-structured models and skip-chain models. In such cases, we are able to decrease the exponent of complexity for message-passing for both exact and approximate inference. We illustrate the practical advantages of the improved algorithm in a number of tasks, such as protein design, text and image denoising, optical flow inference, stereo disparity estimation, and graph matching.
Joint work with Julian McAuley.
Set Based Modeling Of Objects And Their Context
INRIA Rhone-Alpes, F107
Friday, October 8th 2010, 14h00
Abstract:
In computer vision, many image entities can be represented as sets of high-dimensional items. For example, an object in an image can be represented as a set of image patches, where each image patch has a feature vector encoding the local appearance. Training classification models directly on sets of unordered items, where each set can have varying cardinality, can be difficult. In this talk, I will introduce a new boosting-based supervised learning algorithm, called SetBoost, for building set classifiers.
In the second part of the talk, I will give details about our novel contextual object detection model that uses SetBoost. In natural images, objects tend to appear in certain arrangements with respect to the other objects (object context) and the scene (scene context). The aim of our proposed model is to improve localization and recognition accuracy of object detection algorithms using object context and scene context. Our approach outperforms existing state-of-the-art methods in challenging object detection benchmark datasets.
Scene and object recognition with lots of categories
INRIA Rhone-Alpes, Grand Amphithéâtre
Monday, September 27th 2010, 16h30
Dense Interest Points
INRIA Rhone-Alpes, Grand Amphithéâtre
Monday, September 27th 2010, 15h30
Abstract:
Local features or image patches have become a standard tool in computer vision, with numerous application
domains. Roughly speaking, two different types of patch-based image representations can be distinguished: interest
points, such as corners or blobs, whose position, scale and shape are computed by a feature detector algorithm, and
dense sampling, where patches of fixed size and shape are placed on a regular grid (possibly repeated over multiple
scales). Interest points focus on 'interesting' locations in the image and include various degrees of viewpoint and
illumination invariance, resulting in better repeatability scores. Dense sampling, on the other hand, gives a better
coverage of the image, a constant number of features per image area, and simple spatial relations between features.
In this paper, we propose a hybrid scheme, which we call dense interest points, where we start from densely sampled
patches yet optimize their position and scale parameters locally. We investigate whether, by doing so, it is possible
to get the best of both worlds.
Recent advances in structured sparse models
INRIA Rhone-Alpes, F107
Tuesday, September 21st 2010, 16h00
Abstract:
Sparse linear models have received a lot of attention in statistics,
machine learning, computer vision and neuroscience. We consider here
extensions of these models applied to various machine learning problems,
where the variables are not only encouraged to be sparse, but their
sparsity pattern (set of nonzero coefficients) is also structured.
Whereas this approach enriches classical sparse models, it raises
challenging new optimization problems, and we propose several algorithms
for solving them efficiently. We illustrate our method with wavelet
denoising, learning tree-structured dictionaries of natural image
patches, and background subtraction in videos.
This is a joint work with Rodolphe Jenatton, Guillaume Obozinski and
Francis Bach. The material of the talk is based on the following
publications:
[1] J. Mairal, R. Jenatton, G. Obozinski and F. Bach. Network Flow
Algorithms for Structured Sparsity. NIPS, 2010.
[2] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods
for Hierarchical Sparse Coding. arXiv:1009.2139v1.
[3] R. Jenatton, J. Mairal, G. Obozinski and F. Bach. Proximal Methods
for Sparse Hierarchical Dictionary Learning. ICML, 2010.
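As a rough illustration of what a structured penalty does, the sketch below implements the proximal operator of the simplest case: a sum of Euclidean norms over non-overlapping groups (group soft-thresholding). This is a deliberately simplified stand-in. The overlapping and hierarchical groups treated in the publications above require more elaborate algorithms, and the function name and toy values are assumptions.

```python
import math


def prox_group_lasso(w, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2 for non-overlapping groups."""
    out = list(w)
    for g in groups:
        norm = math.sqrt(sum(w[i] ** 2 for i in g))
        # Whole groups are shrunk together, and set to zero when their norm is below lam.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in g:
            out[i] = scale * w[i]
    return out


w = [3.0, 4.0, 0.1, 0.1]
print(prox_group_lasso(w, groups=[[0, 1], [2, 3]], lam=1.0))
```

In this toy run the second group is removed as a block, which is the kind of structured sparsity pattern that a plain coefficient-wise penalty cannot express.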
Reverse Multi-Label Learning
INRIA Rhone-Alpes, F107
Monday, September 20th 2010, 16h00
Abstract:
Multi-label classification is the task of predicting potentially
multiple labels for a given instance. This is common in several
applications such as image annotation, document classification and gene
function prediction. In this paper we present a formulation for this
problem based on reverse prediction: we predict sets of instances
given the labels. By viewing the problem from this perspective, the most
popular quality measures for assessing the performance of multi-label
classification admit relaxations that can be efficiently optimised. We
optimise these relaxations with standard algorithms and compare our
results with several state-of-the-art methods, showing excellent
performance in a number of datasets from several different domains,
including biology, images, text and music.
Online Learning for Object Tracking
INRIA Rhone-Alpes, F107
Thursday, August 26th 2010, 14h00
Abstract:
Online learning deals with decision making problems where the model does
not have access to the entire data domain and needs to predict and learn
as the data appears. In this talk, I will mainly focus on object
tracking as an application and show how different online and
semi-supervised learning models can be used for this task.
Examples of Positive Definite Kernels on Time Series
INRIA Rhone-Alpes, F107
Wednesday, July 7th 2010, 16h30
Abstract:
We propose a new family of kernels to handle time series, within the framework of kernel methods, which includes popular algorithms such as the support vector machine. These kernels elaborate on the well-known dynamic time warping (DTW) family of distances by considering the same set of elementary operations, namely substitutions and repetitions of tokens, to map one sequence onto another. Associating a given score with each of these operations, DTW algorithms use dynamic programming techniques to compute an optimal sequence of operations with high overall score. In this paper we instead consider the scores spanned by all possible alignments, take a smoothed version of their maximum, and derive a kernel from this formulation. We prove that this kernel is positive definite under favorable conditions and show how it can be tuned effectively for practical applications.
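The difference between the best alignment and a smoothed maximum over all alignments can be sketched with two dynamic programs that share the same recursion. The local score (a negative squared difference scaled by a hypothetical `sigma`) and the function names are assumptions made for illustration; the sketch does not check the conditions under which the resulting kernel is positive definite.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def alignment_scores(x, y, sigma=1.0):
    """Return (best alignment score, log-sum-exp of the scores of all alignments)."""
    n, m = len(x), len(y)
    hard = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    soft = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    hard[0][0] = soft[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = -((x[i - 1] - y[j - 1]) ** 2) / (2 * sigma ** 2)
            previous_hard = [hard[i - 1][j], hard[i][j - 1], hard[i - 1][j - 1]]
            previous_soft = [soft[i - 1][j], soft[i][j - 1], soft[i - 1][j - 1]]
            hard[i][j] = score + max(previous_hard)         # DTW-style maximum
            soft[i][j] = score + _logsumexp(previous_soft)  # smoothed maximum
    return hard[n][m], soft[n][m]


best, smoothed = alignment_scores([0.0, 1.0, 2.0], [0.0, 1.0, 1.0, 2.0])
print(best, smoothed)
```

The smoothed value is never smaller than the best alignment score, since it accumulates the contribution of every alignment rather than only the optimal one.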
Visual Recognition with Humans in the Loop
INRIA Rhone-Alpes, F107
Wednesday, June 2nd 2010, 11h30
Abstract:
We present an interactive, hybrid human-computer
method for object classification. The method applies to classes of
problems that are difficult for most people, but are recognizable by
people with the appropriate expertise (e.g., animal species or airplane
model recognition). The classification method can be seen as a visual
version of the 20 questions game, where questions based on simple visual
attributes are posed interactively. The goal is to identify the true
class while minimizing the number of questions asked, using the visual
content of the image. Incorporating user input drives up recognition
accuracy to levels that are good enough for practical applications; at
the same time, computer vision reduces the amount of human interaction
required. The resulting hybrid system is able to handle difficult, large
multi-class problems with tightly-related categories. We introduce a
general framework for incorporating almost any off-the-shelf multi-class
object recognition algorithm into the visual 20 questions game, and
provide methodologies to account for imperfect user responses and
unreliable computer vision algorithms. We evaluate the accuracy and
computational properties of different computer vision algorithms and the
effects of noisy user responses on a dataset of 200 bird species and on
the Animals With Attributes dataset. Our results demonstrate the
effectiveness and practicality of the hybrid human-computer
classification paradigm.
This work is part of the Visipedia project, in collaboration with Steve
Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder
and Pietro Perona.
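The interplay between computer vision and user answers in such a visual 20 questions game can be sketched as a Bayesian update over classes, with the next question chosen to minimize the expected remaining uncertainty. This is only a sketch under assumptions: the class prior stands in for the output of a vision algorithm, answers are yes/no, and the fixed `noise` rate for wrong answers is hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def update(prior, answers_yes, answer, noise=0.1):
    """Posterior over classes after a yes/no answer that is wrong with probability `noise`."""
    post = [p * ((1 - noise) if yes == answer else noise)
            for p, yes in zip(prior, answers_yes)]
    total = sum(post)
    return [p / total for p in post]


def best_question(prior, questions, noise=0.1):
    """Index of the question with the lowest expected posterior entropy."""
    def expected_entropy(answers_yes):
        value = 0.0
        for answer in (True, False):
            p_answer = sum(p * ((1 - noise) if yes == answer else noise)
                           for p, yes in zip(prior, answers_yes))
            if p_answer > 0:
                value += p_answer * entropy(update(prior, answers_yes, answer, noise))
        return value
    return min(range(len(questions)), key=lambda k: expected_entropy(questions[k]))


# Four classes; the prior plays the role of the computer vision output.
prior = [0.4, 0.4, 0.1, 0.1]
# questions[k][c] is True when the correct answer to question k is "yes" for class c.
questions = [[True, True, False, False],
             [True, False, True, False]]
print(best_question(prior, questions))
```

In this toy setting the second question is preferred because it separates the two classes that the prior cannot tell apart, which is how a confident vision prior reduces the number of questions needed.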
From generic object detection to weakly supervised
learning of object classes
INRIA Rhone-Alpes, F107
Monday, May 31st 2010, 17h00
Abstract:
In the first part of this talk I will present a generic objectness
measure, quantifying how likely it is for an image window to contain an
object of any class. The measure is trained to distinguish objects with a
well-defined boundary in space, such as cows and telephones, from
amorphous background elements, such as grass and road. The measure
combines several image cues measuring characteristics of objects, such as
appearing different from their surroundings and having a closed boundary.
In experiments on the PASCAL VOC 07 dataset, we show that objectness
outperforms a state-of-the-art saliency measure [Hou CVPR 07]. Moreover,
we give an algorithm to employ objectness to greatly reduce the number of
windows that class-specific object detectors need to evaluate.
In the second part of the talk I will present a novel technique for
weakly supervised learning of object classes, which employs objectness as
a focus of attention mechanism. Learning an object class from cluttered
training images is very challenging when the location of object instances
is unknown. While previous works typically require objects covering a
large portion of the images, our technique can cope with extensive
clutter as well as large scale and appearance variations between
instances. It simultaneously localizes instances in their training images
while learning an appearance model specific to the class. We report
experiments on the very challenging PASCAL VOC 07 dataset and compare to
two existing methods [Chum CVPR 07], [Russell CVPR 06]. Finally, we
demonstrate an application by training the fully supervised model of
[Felzenszwalb PAMI 2009] from objects localized by our method, evaluate
it on the PASCAL VOC 07 test set, and compare its performance to the
original model trained from ground-truth bounding-boxes.
Unsupervised video indexing based on audiovisual
characterization of persons
INRIA Rhone-Alpes, F107
Friday, May 28th 2010, 16h00
Abstract:
The characterization of persons within an audiovisual document is a
challenging problem in current research. Most studies have addressed
this problem using only one modality.
From the audio point of view, the characterization of persons is generally
known as speaker diarization: it aims to segment the audio stream into
speaker turns and then cluster all turns that belong to the same
speaker. In other words, its goal is to answer the question "who talks,
and when?".
From the video point of view, the characterization of persons is generally
known as people detection, tracking and recognition. In other words, it
aims to answer the question "who appears, and when?".
A few other studies have addressed the problem of person
characterization from a multimodal point of view. However, their
applications were generally limited, constrained and supervised.
At the beginning of this thesis, we propose an efficient audio indexing
system that aims to split the audio channel into homogeneous segments,
discard the non-speech segments, and group the segments into clusters,
each ideally corresponding to one speaker. This system must operate
without a priori knowledge (unsupervised learning) and must be suitable
for any kind of data: TV/radio broadcast news, TV/radio debates, movies,
etc.
Secondly, we propose an efficient video indexing system that aims to split
the video channel into shots, detect and track people in every shot, and
group all faces into clusters, each ideally corresponding to one
person. This video system must also operate without a priori knowledge
and should be suitable for any kind of data.
Finally, we propose an efficient audiovisual indexing system that aims to
combine audio and video indexing systems in order to deliver an
audiovisual characterization of each person talking and/or appearing in
the audiovisual document, and a more robust audio indexing output
(respectively video indexing output) obtained with the help of video
(respectively audio).
Experiments done on broadcast news, debates and movies show the efficiency
of each of our proposed systems, and confirm the correlation between audio
and video information, and the gain ensured by using both media.
Learning Image Retrieval Models using Cross Media Pseudo Relevance Feedback
INRIA Rhone-Alpes, F107
Thursday, May 27th 2010, 16h00
Abstract:
Undisclosed
Applying bottom-up and top-down color
attention for improved bag-of-words based object recognition
INRIA Rhone-Alpes
Tuesday, May 25th 2010, 16h00
Abstract:
Generally, the bag-of-words based image representation follows a bottom-up paradigm. The successive stages of the process (feature detection, feature description, vocabulary construction and image representation) are performed independently of the object classes to be detected. In such a framework, combining multiple cues such as shape and color often yields below-expected results.
The two main strategies for combining multiple cues, known as early and late fusion, both suffer from significant drawbacks. In this talk I present a novel method that separates the shape and color cues. Color is then used to construct a top-down, category-specific attention map. The color attention map is further deployed to modulate the shape features by taking more features from regions within an image that are likely to contain an object instance. This procedure leads to a category-specific image histogram representation for each category.
Evaluation on several data sets shows that the proposed method outperforms both early- and late fusion. Additionally, I will comment on its usage in our submission to the VOC PASCAL 2009 image classification challenge.
Large-scale image categorization and retrieval
Florent Perronnin
INRIA Rhone-Alpes, F107
Wednesday, May 12th 2010, 11:00
Abstract:
This talk will consist of two parts:
1) In the first part, we will address the challenge of efficiently
learning image categorizers on large datasets (e.g. > 100,000 images)
with the popular bag-of-visual-words (BOV) framework. In this framework
an image is described by a histogram of quantized local vectors (e.g.
SIFT) and classification is typically performed using non-linear support
vector machines (SVMs). Non-linear SVMs can perform significantly better
than their linear counterparts but do not scale well on large datasets.
As kernel machines rely on an implicit mapping of the data, it has been
proposed to perform an explicit mapping of the data and to learn linear
classifiers directly in the new space.
We experimented with three approaches to BOV embedding: 1) kernel PCA
(kPCA), 2) a modified kPCA we propose for additive kernels and 3) random
projections for shift-invariant kernels. An important conclusion is that
simply square-rooting BOV vectors -- which corresponds to an exact
mapping for the Bhattacharyya kernel -- already leads to large
improvements, often quite close to the best results obtained with
additive kernels. Another conclusion is that, although it is possible to
go beyond additive kernels, the embedding comes at a much higher
cost.
2) In the second part, we will provide an update on our work on Fisher
kernels (FK). This is an elegant framework which extends the traditional
BOV by going beyond counting. We will show that with several
well-motivated modifications over the original framework, we can boost
the accuracy of the FK for image categorization tasks.
For instance, on PASCAL VOC 2007 we increase the Average Precision (AP)
from 47.9% to 58.3%. A major advantage is that these results are obtained
using only SIFT descriptors and low-cost linear classifiers. We will also
show that, for the task of query-by-example image retrieval, the FK
performs very well using as few as a few hundred bits per image, and
significantly better than the BOV.
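The remark in the first part about square-rooting BOV vectors can be checked numerically: for L1-normalized histograms, the dot product of the element-wise square roots equals the Bhattacharyya kernel, so a linear classifier on the mapped vectors works with that kernel exactly. The sketch below uses made-up histograms and hypothetical function names.

```python
import math


def l1_normalize(hist):
    total = sum(hist)
    return [v / total for v in hist]


def sqrt_map(hist):
    """Explicit feature map: element-wise square root of the L1-normalized histogram."""
    return [math.sqrt(v) for v in l1_normalize(hist)]


def bhattacharyya(p, q):
    """Bhattacharyya kernel between two histograms."""
    p, q = l1_normalize(p), l1_normalize(q)
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Two toy visual-word histograms (made-up counts).
h1 = [4, 0, 1, 3]
h2 = [2, 2, 2, 2]
print(dot(sqrt_map(h1), sqrt_map(h2)), bhattacharyya(h1, h2))
```

The two printed values agree up to rounding, which is why this embedding has no approximation error, in contrast to the other embeddings discussed in the abstract.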
Trecvid and ACM MM video challenges
Matthijs Douze, Adrien Gaidon and Alessandro Prest
INRIA Rhone-Alpes, C207
Monday, May 3rd 2010, 16h00
Abstract:
We will present the different video tasks in Trecvid 2010 and
ACM Multimedia 2010 challenges.
Multimodal semi-supervised learning for image classification
INRIA Rhone-Alpes, F107
Wednesday, March 31st 2010, 11h00
Abstract:
In image categorization the goal is to decide if an image belongs to a
certain category or not. A binary classifier can be learned from manually
labeled images; while using more labeled examples improves performance,
obtaining the image labels is a time consuming process.
We are interested in how other sources of information can aid the
learning process given a fixed amount of labeled images. In particular
we consider a scenario where keywords are associated with the training
images, e.g. as found on photo sharing websites. The goal is to learn a
classifier for images alone, but we will use the keywords associated
with labeled and unlabeled images to improve the classifier using
semi-supervised learning. We first learn a strong Multiple Kernel
Learning (MKL) classifier using both the image content and keywords, and
use it to score unlabeled images. We then learn classifiers on visual
features only, either support vector machines (SVM) or least-squares
regression (LSR), from the MKL output values on both the labeled and
unlabeled images.
In our experiments on 58 classes from the PASCAL VOC'07 and MIR Flickr
sets, we demonstrate the benefit of our semi-supervised approach over
only using the labeled images. We also present results for a scenario
where we do not use any manual labeling but directly learn classifiers
from the image tags. Also in this case using the semi-supervised
approach improves classification performance.
Aggregating local descriptors into a compact image representation
INRIA Rhone-Alpes, C207
Wednesday, March 24th 2010, 11h00
Abstract:
We address the problem of image search on a very large scale, where three constraints have to be considered jointly: the accuracy of the search, its efficiency, and the memory usage of the representation. We first propose a simple yet efficient way of aggregating local image descriptors into a vector of limited dimension, which can be viewed as a simplification of the Fisher kernel representation.
We then show how to jointly optimize the dimension reduction and the indexing algorithm, so that it best preserves the quality of vector comparison. The evaluation shows that our approach significantly outperforms the state of the art: the search accuracy is comparable to the bag-of-features approach for an image representation that fits in 20 bytes. Searching a 10 million image dataset takes about 50ms.
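One common way to aggregate local descriptors into a single fixed-length vector is to accumulate, for each visual word, the residuals of the descriptors assigned to it, and then L2-normalize the concatenation (known in the literature as VLAD). Whether this matches the exact formulation of the talk is an assumption; the sketch below uses toy 2-D descriptors and omits the dimension reduction and indexing steps entirely.

```python
import math


def aggregate(descriptors, centroids):
    """Sum descriptor-minus-centroid residuals per nearest centroid, then L2-normalize."""
    k, d = len(centroids), len(centroids[0])
    acc = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        nearest = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
        for i in range(d):
            acc[nearest][i] += x[i] - centroids[nearest][i]
    flat = [v for row in acc for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


# Toy example: three 2-D local descriptors and a two-word vocabulary (made-up values).
centroids = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [1.2, 1.0], [0.9, 1.0]]
print(aggregate(descriptors, centroids))
```

The output length is the number of centroids times the descriptor dimension, regardless of how many local descriptors the image contains, which is what makes the representation easy to compress and index.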
Discovering semantic concepts in large collections
INRIA Rhone-Alpes, F107
Tuesday, March 9th 2010, 16h00
Abstract:
Supervised learning is standard for many tasks, including computer vision
tasks such as object recognition or scene categorization.
Powerful classifiers can obtain impressive results but require sufficient
amounts of annotated training data. Despite their success, supervised
methods have important limitations. In particular, annotations are
expensive to obtain, prone to error, often biased, and consequently do not
easily scale. We propose to move beyond supervised methods and use large
collections which are available today at minimal cost and effort.
In the first part, we will see how semi-supervised learning can be used to
reduce the need for annotations in the recognition of activities from
sensor data.
In the second part, I will present recent work focusing on two promising
research directions for computer vision: unsupervised structure discovery
and semi-supervised learning. The first extracts semantic connections
between images, while the second uses a few labeled images to predict the
labels of new images. While focusing on the problem of object
categorization, we will look at the following questions: i) How well are
current image representations suited for unsupervised structure discovery,
and what distance measures are most applicable? and ii) How well does
semi-supervised learning perform on these representations to automatically
label object classes in realistic databases? Answers to these questions
will be proposed through an in-depth experimental study involving 3
datasets and many state-of-the-art descriptors and semi-supervised
algorithms.
Improving web image search results using query-relative classifiers
INRIA Rhone-Alpes, C207
Thursday, February 18th 2010, 14h00
Abstract:
Image web search using text queries has received considerable attention.
However, current approaches require separate training for every new
query, and are therefore unsuitable for real-world web search
applications.
The idea I'll present in this talk is to use generic classifiers that are
based on query-relative features which can be used for new queries
without additional training.
They combine textual features, based on the occurrence of query terms in
web pages and image meta-data, and visual histogram representations of
images.
More precisely, the visual features result from the comparison of overall
statistics of visual words and statistics in images highly ranked by
textual features.
For evaluation purposes we use a new database which includes 71478 images
returned by a web search engine for 353 different search queries, along
with their meta-data and ground-truth annotations.
Using this data set, we compared the image ranking performance of our
model with that of the search engine, and with an approach that learns a
separate classifier for each query.
Our generic models that use query-relative features improve significantly
over the raw search engine ranking, and also outperform the
query-specific models.
A Corrected Likelihood Approach for the Pile-Up
Model with Application
to Fluorescence Lifetime Measurements Using Exponential Mixtures
INRIA Rhone-Alpes, F107
Wednesday, January 27th 2010, 11h30
Abstract:
A fast and efficient estimation method is proposed that compensates for the
so-called pile-up effect encountered in fluorescence lifetime
measurements. The pile-up effect is due to the fact that only the
shortest arrival time of a random number of emitted fluorescence photons
can be detected. A likelihood-based estimator is developed that can be
computed by an EM-type algorithm. The new estimator is particularly
well-suited for fluorescence lifetime measurements, where arrival times
are often modeled by a mixture of exponential distributions. The
consistency of the estimator is shown and its limit distribution is
provided. The method is evaluated on real and synthetic data. Compared
to currently used methods in fluorescence, the new estimator should
allow a reduction of the acquisition time by an order of magnitude.
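The bias caused by the pile-up effect can be seen in a small simulation: when several photons are emitted and only the earliest arrival is recorded, the observed times are shorter on average than the true lifetime. The sketch below assumes a Poisson number of photons per excitation and a single exponential lifetime; both are modeling assumptions made for illustration, and the corrected likelihood estimator itself is not implemented.

```python
import math
import random


def sample_poisson(rng, rate):
    """Knuth's method for a Poisson draw (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def simulate_pileup(lifetime, photon_rate, n_detections, seed=0):
    """Record only the earliest arrival among a random number of emitted photons."""
    rng = random.Random(seed)
    observed = []
    while len(observed) < n_detections:
        n_photons = sample_poisson(rng, photon_rate)
        if n_photons == 0:
            continue  # nothing is detected for this excitation
        arrivals = [rng.expovariate(1.0 / lifetime) for _ in range(n_photons)]
        observed.append(min(arrivals))
    return observed


times = simulate_pileup(lifetime=1.0, photon_rate=2.0, n_detections=5000)
print(sum(times) / len(times))  # noticeably below the true lifetime of 1.0
```

Averaging the recorded times therefore underestimates the lifetime unless the photon rate is kept very low, which is what makes acquisition slow without a correction.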
Image Representations for 3D Reconstruction and Recognition
INRIA Rhone-Alpes, F107
Tuesday, January 5th 2010, 11h30
Abstract:
I will present two different transformations that can be applied to
images before further processing. The first transformation is called
DAISY, and was originally developed for wide baseline dense
reconstruction. DAISY computes dense local descriptors in an efficient
way; we then use a simple graph-cut technique to match the images based
on these descriptors. The second transformation was developed for
fast object detection and reduces the image to local dominant
orientations. This yields a compact but discriminative binary
representation, which can be parsed using SSE instructions to detect
objects in real-time.
Learning Distinguishing Marks for Image
Classification
INRIA Rhone-Alpes, F107
Monday, January 4th 2010, 16h
Abstract:
We tackle here the problem of multi-class image classification from few
training examples, where only small parts of the image help
discriminating between classes.
Such problems arise when classifying images of objects/persons in the
wild. In such settings, standard kernel-based classifiers perform well
only when combined with strong prior knowledge and efficient
discriminative part detectors. We propose here a convex sparsity-enforcing
kernel-based method for this task,
introducing a pool-L1 penalty which automatically singles out
discriminant "distinguishing marks" to leverage classification
performance.
We report experimental results on a horses in the wild dataset and on
several benchmark datasets.
Ranking user-annotated images for multiple query terms
INRIA Rhone-Alpes, C207
Friday, September 24th 2009, 14h
Abstract:
We show how web image search can be improved by taking into account the
users who provided different images, and that performance when searching
for multiple terms can be increased by learning a new combined model and
taking account of images which partially match the query. Search
queries are answered by using a mixture of kernel density estimators to
rank the visual content of web images from the Flickr website whose
noisy tag annotations match the given query terms. Experiments show
that requiring agreement between images from different users allows a
better model of the visual class to be learnt, and that precision can be
increased by rejecting images from `untrustworthy' users. We focus on
search queries for multiple terms, and demonstrate enhanced performance
by learning a single model for the overall query, treating images which
only satisfy a subset of the search terms as negative training examples.
Combining efficient object localization and image classification
INRIA Rhone-Alpes, J220
Friday, September 21st 2009, 14h
Abstract:
In this paper we present a combined approach for object localization
and classification. Our contribution is twofold. (a) A contextual
combination of localization and classification which shows that classification
can improve detection and vice versa. (b) An efficient two stage sliding
window object localization method that combines the efficiency of a linear
classifier with the robustness of a sophisticated non-linear one.
Experimental results evaluate the parameters of our two stage sliding
window approach and show that our combined object localization and
classification methods outperform the state-of-the-art on the PASCAL
VOC 2007 and 2008 datasets.
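The two-stage idea in (b) can be sketched generically: a cheap scorer ranks every window, and only a short list is passed on to the expensive scorer. Both scoring functions below are toy stand-ins, and the `keep` parameter and window layout are hypothetical; this is not the authors' implementation.

```python
def two_stage_detect(windows, cheap_score, expensive_score, keep):
    """Rank all windows with a cheap scorer, then rescore only the best `keep`."""
    shortlist = sorted(windows, key=cheap_score, reverse=True)[:keep]
    rescored = [(expensive_score(w), w) for w in shortlist]
    return sorted(rescored, key=lambda pair: pair[0], reverse=True)


# Toy usage: windows are (x, y, width, height) boxes.
windows = [(x, y, 32, 32) for x in range(0, 96, 16) for y in range(0, 96, 16)]
cheap = lambda w: -abs(w[0] - 48) - abs(w[1] - 32)            # stand-in for a linear classifier
expensive = lambda w: -((w[0] - 44) ** 2 + (w[1] - 30) ** 2)  # stand-in for a non-linear one
best_score, best_window = two_stage_detect(windows, cheap, expensive, keep=5)[0]
print(best_window, best_score)
```

The expensive scorer is evaluated `keep` times instead of once per window, so `keep` controls the trade-off between speed and the risk of discarding a good window in the first stage.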
TagProp: Discriminative Metric Learning
in Nearest Neighbor Models for Image Auto-Annotation
INRIA Rhone-Alpes, C208
Friday, September 2nd 2009, 17h
Abstract:
Image auto-annotation is an important open problem in computer vision.
For this task we propose TagProp, a discriminatively trained nearest
neighbor model. Tags of test images are predicted using a stochastic,
weighted nearest neighbor selection model to exploit labeled training
images. Neighbor weights are based on neighbor rank or distance. TagProp
allows the integration of metric learning by directly maximizing the
log-likelihood of the expected tag predictions in the training set. In
this manner, we can optimally combine a collection of image similarity
metrics that cover different aspects of image content, such as local
shape descriptors, or global color histograms. We also introduce a word
specific sigmoidal modulation of the weighted neighbor tag predictions
to boost the recall of rare words. We investigate the performance of
different variants of our model and compare to existing work. We present
experimental results for three challenging data sets. On all three,
TagProp makes a marked improvement as compared to the current
state-of-the-art.
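The weighted nearest neighbor prediction described above can be sketched as follows: each neighbor receives a weight that decays with its distance, and the predicted tag probability is the weighted vote of the neighbors. The exponential weighting, the `scale` parameter and the small constant `eps` are assumptions chosen for illustration; the metric learning and the word-specific sigmoidal modulation are left out.

```python
import math


def neighbor_weights(distances, scale=1.0):
    """Distance-based weights that sum to one (closer neighbors weigh more)."""
    scores = [math.exp(-scale * d) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


def predict_tag(distances, neighbor_has_tag, eps=1e-5, scale=1.0):
    """Probability that a test image has a tag, as a weighted vote of its neighbors."""
    weights = neighbor_weights(distances, scale)
    return sum(w * ((1 - eps) if has else eps)
               for w, has in zip(weights, neighbor_has_tag))


# Three training neighbors of a test image (made-up distances and tag presence).
print(predict_tag([0.1, 2.0, 3.0], [True, False, False]))
```

In a trained model, `scale` (or a combination of several distances) would be set by maximizing the log-likelihood of the training tags rather than fixed by hand as it is here.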
Mining visual actions from movies
INRIA Rhone-Alpes, C208
Friday, July 31st 2009, 12h
Abstract:
This paper presents an approach for mining visual actions from
real-world videos. Given a large number of movies, we want to
automatically extract short video sequences corresponding to visual
human actions. Firstly, we retrieve actions by mining verbs extracted
from the transcripts aligned with the videos. Not all of these samples
visually characterize the action and, therefore, we rank these videos by
visual consistency. We investigate two unsupervised outlier detection
methods: one-class Support Vector Machine (SVM) and densest component
estimation of a similarity graph. Alternatively, we show how to use
automatic weak supervision provided by a random background class, either
by directly applying a binary SVM, or by using an iterative re-training
scheme for Support Vector Regression machines (SVR). Experimental
results explore actions in 144 episodes of the TV series ``Buffy the
Vampire Slayer'' and show: (a) the applicability of our approach to a
large scale set of real-world videos, (b) the importance of visual
consistency for ranking videos retrieved from text, (c) the added value
of random non-action samples and (d) the ability of our iterative SVR
re-training algorithm to handle weak supervision. The quality of the
rankings obtained is assessed on manually annotated data for six
different action classes.
Evaluation of local spatio-temporal features for action recognition
INRIA Rhone-Alpes, C208
Friday, July 29th 2009, 17h
Abstract:
Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature and promising recognition results were demonstrated for a number of action classes. The comparison of existing methods, however, is often limited
given the different experimental settings used. The purpose of this paper is to evaluate and compare previously proposed space-time features in a common experimental setup. In particular, we consider four different feature detectors and six local feature descriptors and use a standard bag-of-features SVM approach for action recognition. We investigate the performance of these methods on a total of 25 action classes distributed over three datasets with varying difficulty. Among interesting conclusions, we demonstrate that regular sampling of space-time features consistently outperforms all tested space-time
interest point detectors for human actions in realistic settings. We also demonstrate a consistent ranking for the majority of methods over different datasets and discuss their advantages and limitations.
Evaluation of GIST descriptors for web-scale image search
INRIA Rhone-Alpes, C207
Friday, July 3rd 2009, 14h
Abstract:
The GIST descriptor has recently received increasing attention in the
context of scene recognition. In this paper we evaluate the search
accuracy and complexity of the global GIST descriptor for two
applications, for which a local description is usually preferred: same
location/object recognition and copy detection. We identify the cases
in which a global description can reasonably be used.
The comparison is performed against a state-of-the-art bag-of-features
representation. We propose an indexing strategy for global descriptors that optimizes the trade-off between memory usage and precision. Our
scheme provides a reasonable accuracy in some widespread application
cases together with very high efficiency: In our experiments, querying
an image database of 110 million images takes 0.18 second per image on
a single machine. For common copyright attacks, this efficiency is
obtained without noticeably sacrificing the search accuracy compared
with state-of-the-art approaches.
Supervised (yes, supervised) learning with 0 examples
and other methods for obviating those pesky training sets
INRIA Rhone-Alpes, F107
Wednesday, July 1st 2009, 17h
Abstract:
Sometimes the first examples we have seen of particular objects or
patterns come at test time rather than at training time. A simple
example is reading a highly stylized font, say, on a store front.
Appearance models trained a priori tend to do very poorly in classifying
the letters of such new fonts.
In this talk, I discuss our recent work in addressing the difficult
problem of encountering new types of patterns at test time, especially
those that are not well modeled by training data, either labeled or
unlabeled. In the first part of the talk, I present ways of
constraining the interpretations of patterns that are invariant to their
appearance. This sounds paradoxical, but is quite simple. For example,
the string 01221221331 is an encoding of a common string where each
letter has been substituted with a digit. (Can you guess the string?)
We show how such techniques can be used to provide important constraints
in difficult problems like scene text recognition.
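Such an appearance-invariant encoding is easy to compute: each letter is replaced by the index of its first occurrence, so two strings share a code exactly when they have the same repetition pattern. The sketch below, with hypothetical function names and a made-up vocabulary, shows how the code can then be used to filter candidate words.

```python
def pattern(word):
    """Replace each letter by the index of its first appearance in the word.

    Single digits are unambiguous only for words with at most ten distinct letters.
    """
    codes = {}
    for ch in word:
        codes.setdefault(ch, str(len(codes)))
    return "".join(codes[ch] for ch in word)


def candidates(code, vocabulary):
    """Words from the vocabulary whose repetition pattern matches `code`."""
    return [w for w in vocabulary if pattern(w) == code]


print(pattern("banana"))  # 012121
print(candidates("012121", ["banana", "cabana", "papaya"]))
```

The code says nothing about what the letters look like, yet long codes are often matched by very few dictionary words, which is what makes them useful as a constraint.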
In the second part of the talk, I discuss our work in optical character
recognition. I discuss a "font free" OCR system which has never been
trained on, or given any information about the specific appearance of
any character, and yet can easily read the majority of most documents
correctly. I also discuss new work in bootstrapping training sets in OCR
problems. In this work, we automatically extract "training sets" from
noisy documents so that we can dynamically build document specific
models. We call this "Learning on the Fly". Finally, I discuss potential
application of such ideas to other problems in computer vision and
pattern recognition.
Automatic Film Editing for Storytelling Using a Computational Model of Film Grammar
INRIA Grenoble (work
done at Xtranormal Technology,
Montreal, Quebec during a leave of absence from INRIA)
INRIA Rhone-Alpes, F107
Thursday, May 28th 2009, 16h
Abstract:
This talk presents new tools that I have been developing in
the last two years for an application that lets a non-expert user
write a story in words and translate it into a short animated movie. I
will focus on the important step of editing the shots from many
virtual cameras into a "correct" movie, according to the rules of
traditional film grammar. I will explain how this editing process
can be modeled with a semi-Markov conditional random field and how its
parameters can be learned directly from movies. I will show
"proof-of-concept" results that were obtained in a simplified setting,
and conclude with a discussion of research topics that still need to
be addressed in future work.
Kernel-based Methods for Detection
Zaïd Harchaoui
Laboratoire Traitement et Communication de
l'Information, CNRS-TELECOM ParisTech
INRIA Rhone-Alpes, F107
Monday, January 26th 2009, 14h
Abstract:
Kernel methods have enjoyed considerable success
in machine learning over the past two decades, especially for tackling
supervised learning tasks. We address here the issue of building
kernel-based methods for solving unsupervised detection problems. First,
we propose a family of kernels for computer vision, based on the
soft-matching of common subtree-patterns. Second, we introduce a
regularized kernel-based test statistic for testing homogeneity of two
samples, for which we established the null distribution and proved the
consistency in power in a large-sample setting. Our regularized
kernel-based test statistic was successfully applied in a speaker
verification task. We also derived a computationally attractive variant
of this approach within a sliding-window framework for the temporal
segmentation of audio tracks from archives of entertainment TV-shows for
indexation purpose. Third, we introduce a regularized kernel-based test
statistic for change-point analysis, which was successfully applied to
the temporal segmentation of signals acquired from a Brain-Computer interface
into segments corresponding to mental tasks. Finally, we proposed two
retrospective multiple change-point estimation methods, one without
kernels and one with kernels, which we successfully applied to the
temporal segmentation of well-log data and pop songs, respectively.
Inferring the relevance of images from eye movements
Teofilo de Campos
Textual & Visual Pattern Analysis group, Xerox XRCE
INRIA Rhone-Alpes, F107
Wednesday, January 21st 2009, 14h
Abstract:
Query formulation and efficient navigation through data to
reach relevant results are undoubtedly major challenges for
image or video retrieval. Queries of good quality are typically
not available and the search process needs to rely on
relevance feedback given by the user, which makes the search
process iterative and laborious.
A key question then is: Is it possible to replace or complement
scarce explicit feedback with implicit feedback (IF)?
IF can be inferred from various sensors not specifically designed
for the retrieval task.
In this talk, I will present preliminary results on inferring the
relevance of images based on IF about users' attention,
measured using an eye tracking device.
We have shown that, in reasonably controlled setups at least,
already fairly simple features and classifiers are capable of
detecting the relevance based on eye movements alone,
without using any explicit feedback.
This work is one of the outcomes of PinView, an EU FP7
collaborative project. It was done in collaboration with A. Klami,
C. Saunders and S. Kaski.
Probabilistic Models of Textual Collections for Information Access
Eric Gaussier
Université Joseph Fourier
INRIA Rhone-Alpes, F107
Wednesday, January 14th 2009, 10h
Abstract:
Several probabilistic models of text collections have recently been introduced in the text processing community. These models are often defined from a statistical learning perspective. Over the years, however, several empirical findings on how words behave in
documents have been reported (from the work of G. Zipf in 1949 to more recent studies). In this presentation, we study the links between probabilistic models of text collections and empirical observations concerning word frequency distributions. In the first part, we will introduce formal characterizations of several empirical observations. We will then review retrieval heuristics and propose an analytical characterization of them which can be used to design IR (Information Retrieval) models. We will then review standard probabilistic models in light of our characterizations and introduce new models (based on the beta negative binomial and log-logistic distributions) compatible with empirical observations. We will finally illustrate the behavior of our models on standard text collections.
High-dimensional estimation of Information-theoretic measures in nonlocal variational methods of computer vision
INRIA Rhone-Alpes, F107
Tuesday, December 16th 2008, 11h
Abstract:
One variational formulation of image and video processing problems
expresses the solution through a minimization of a statistical energy to
account for uncertainty in the observations. In return, the energy is
expressed as a function of the data considered as random variables. This
representation aims at defining models on the image with probability
density functions (PDFs). The price of the discriminative power of PDFs
built on images is having to deal with PDFs defined over high-dimensional
domains, such as nonlocal patch-based representations. To overcome high
dimensionality, a standard solution is to assume independence between the
different features in order to bring out low-dimensional marginal laws
and/or to make some parametric assumptions on the PDFs, thus losing
generality. At the foundation of statistics, the k-th nearest neighbor
can solve these difficulties by locally adapting to the distribution of
the data and treating the channels jointly. We propose a general
framework based on statistics to efficiently estimate
information-theoretic measures in high dimension. This new framework is
dedicated to variational problems: it efficiently estimates
high-dimensional statistical energies, gradients of these energies, and
local probabilities, and it is fast since the implementation runs
on GPU. This framework is applied to three variational problems
where high dimensionality is important: tracking, optical flow, and
segmentation. For the first one, the problem is to determine in
successive frames the region which best matches, in terms of a similarity
measure, a ROI defined in a reference frame. We define a tracking
algorithm based on the Kullback-Leibler divergence combining efficiently
several visual features. We show tracking results with high-dimensional
feature vectors containing color information (including pixel-based,
gradient-based and patch-based) and spatial layout. The proposed
procedure performs tracking on sequences with various difficulties such
as occlusions, variations of illumination or noise. I will also detail
the optical flow and segmentation algorithms derived from this framework.
Finally, I will give some perspectives and future directions.
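As a rough illustration of the k-th nearest neighbor estimation mentioned in the abstract above, the sketch below implements one standard k-NN estimator of the Kullback-Leibler divergence between two samples. It is not taken from the talk: the function names, the brute-force neighbor search and the helper `gaussian_sample` are assumptions made for this example, and the GPU implementation described by the speaker is not reproduced.

```python
import math
import random


def kth_nn_distance(x, points, k):
    """Euclidean distance from x to its k-th nearest neighbour in points."""
    dists = sorted(math.dist(x, p) for p in points)
    return dists[k - 1]


def knn_kl_divergence(xs, ys, k=1):
    """k-NN estimate of KL(P || Q) from samples xs ~ P and ys ~ Q.

    Uses the estimator (d / n) * sum_i log(nu_i / rho_i) + log(m / (n - 1)),
    where rho_i is the distance from xs[i] to its k-th neighbour among the
    other xs, and nu_i is the distance to its k-th neighbour among ys.
    Brute-force search, so the cost is quadratic in the sample size.
    """
    n, m, d = len(xs), len(ys), len(xs[0])
    total = 0.0
    for i, x in enumerate(xs):
        others = xs[:i] + xs[i + 1:]          # leave the point itself out
        rho = kth_nn_distance(x, others, k)
        nu = kth_nn_distance(x, ys, k)
        total += math.log(nu / rho)
    return d * total / n + math.log(m / (n - 1))


def gaussian_sample(n, mean, seed):
    """Hypothetical helper: n one-dimensional unit-variance Gaussian points."""
    rng = random.Random(seed)
    return [(rng.gauss(mean, 1.0),) for _ in range(n)]


if __name__ == "__main__":
    p = gaussian_sample(300, 0.0, seed=1)
    q_same = gaussian_sample(300, 0.0, seed=2)
    q_shifted = gaussian_sample(300, 3.0, seed=3)
    print("same distribution:   ", knn_kl_divergence(p, q_same))
    print("shifted distribution:", knn_kl_divergence(p, q_shifted))
```

Because the estimator only needs neighbour distances, it works on the joint feature vector directly and needs neither independence nor parametric assumptions, which is the property the abstract emphasizes.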
Kernel-based systems for online image retrieval
INRIA Rhone-Alpes, F107
Friday, November 28th 2008, 15h
Abstract:
In this presentation, I will talk about image retrieval systems. The
key components of Content-based image retrieval (CBIR) techniques are
image representation including features and similarity, and the search
engine aiming at retrieving data from large databases.
For the indexing part, visual dictionaries are traditionally used to
encode the image features. I also present how similarity between image
features may be embedded into a kernel function framework. For the
retrieval part, I discuss online learning strategies motivated
by Machine-Learning developments such as Active Learning.
I will also talk about recent applications like iTOWNS (image-based
Time On-line Web Navigation and Search engine) project or K-videoScan
project to illustrate my presentation.
Improving People Search Using Query Expansions: How Friends Help To
Find People
INRIA Rhone-Alpes, F107
Friday, October 10th 2008, 12h
Abstract:
We are interested in finding images of people on the web, and more
specifically within large databases of captioned news images. It has
recently been shown that visual analysis of the faces in images returned
on a text-based query over captions can significantly improve search
results. The idea underlying this improvement is that, although the
initial text-based result is imperfect, the queried person appears
relatively frequently compared to other people, so we can
search for a large group of highly similar faces. The performance of
such methods depends strongly on this assumption: for people whose face
appears in less than about 40% of the initial text-based result, the
performance may be very poor. The contribution of this paper is to
improve search results by exploiting faces of other people that co-occur
frequently with the queried person. We refer to this process as `query
expansion'. In the face analysis we use the query expansion to provide a
query-specific relevant set of `negative' examples which should be
separated from the potentially positive examples in the text-based
result set. We apply this idea to a recently-proposed method which
filters the initial result set using a Gaussian mixture model, and apply
the same idea using a logistic discriminant model. We experimentally
evaluate the methods using a set of 23 queries on a database of 15,000
captioned news stories from Yahoo! News. The results show that (i) query
expansion improves both methods, (ii) our discriminative models
outperform the generative ones, and (iii) our best results surpass the
state-of-the-art results by 10% precision on average.
Content-based image retrieval in the large scale: from content to user
INRIA Rhone-Alpes, F107
Wednesday, October 8th 2008, 14h
Abstract:
Scalability issues are nowadays essential for any multi-media search
engine, even for relatively small datasets when using recent computer
vision techniques. In this seminar, we will illustrate through different
works, how scalability considerations can be included at several stages
of a complete visual indexing and retrieval chain (from content
description to search results clustering, via indexing and retrieval
issues).
In the first part, we will present two works on local visual feature
extraction which aim at reducing space and/or time complexity. The first
one concerns new local photometric descriptors based on dissociated
dipoles for the retrieval of transformed images or rigid objects. Dissociated
dipoles are nonlocal differential operators which are more stable than
purely local standard differential operators. We define specific
oriented dissociated dipoles around multi-resolution color Harris points
and we form 20-dimensional normalized features, invariant to rotation,
affine luminance transformations, image negation and flipping. The second work
describes a new symmetry-oriented interest point detector based on
the convergence of gradient orientations. The aim is to reach better visual
saliency than current detectors and, as a consequence, to reduce the
number of features required for content-based retrieval tasks.
In the second part, we will present a new high-dimensional similarity
search structure, which improves upon recent theoretical work on
multi-probe and query-adaptive LSH. Whereas these methods are based on
likelihood criteria that a given bucket contains query results, we
define a more reliable a posteriori model taking into account some prior
knowledge about the queries and the searched objects. This prior knowledge allows
a better quality control of the search and a more accurate selection of
the most probable buckets. We show that our a posteriori scheme
outperforms other multi-probe LSH while offering a better quality
control. Comparisons to the basic LSH technique show that our method
allows consistent improvements both in space and time efficiency.
The last part of the seminar will present a work on multi-source image
search results clustering. The aim is to synthesize the search results
obtained from a possibly large set of different search engines, working
with heterogeneous data and similarity measures. The developed technique
is based on the Relevant-Set Correlation (RSC) model, that requires no
direct knowledge of the nature or representation of the data. Instead,
the RSC model relies solely on the existence of an oracle that accepts a
query in the form of a reference to a data item, and returns a ranked
set of references to items that are most relevant to the query. In the
presented work, we describe and compare 3 different fusion strategies
extending the original RSC-based clustering algorithm to the case of
several oracles.
Research at the Image Understanding and
Pattern Recognition
(IUPR) research group
INRIA Rhone-Alpes, F107
Tuesday, June 10th 2008, 11h
Abstract:
Prof. Thomas Breuel is director of the Image Understanding and Pattern
Recognition (IUPR) research group at the Computer Science Department of
the University of Kaiserslautern and the German Research Center for
Artificial Intelligence (DFKI).
The group conducts basic and applied research in pattern recognition,
machine learning, image understanding, and artificial intelligence, with
practical applications to digital libraries, network security,
bioinformatics, historical document analysis, and scientific data analysis.
Crossing textual and visual content
in different application scenarios
INRIA Rhone-Alpes, F107
Thursday, June 5th 2008, 14h
Slides
Abstract:
I will present two approaches for hybrid text-image
information processing that can be straightforwardly generalized to
more general multimodal scenarios. Both approaches fall in the trans-media
pseudo-relevance feedback category. The first method proposes to use a
mixture model of the aggregate components, considering them as a single
relevance concept. The second approach, to determine trans-media
similarities between a new multimedia document and the objects of some
repository, defines these similarities as an aggregation of mono-modal
similarities between the elements of the aggregate and the new
multimodal object.
I further show how one can frame a large variety of problems in
order to address them with the proposed techniques: image annotation or
captioning, text illustration and multimedia retrieval and clustering.
As an example scenario, the travel blog assistant system is used to
illustrate some of the experimental results.
Towards a Theory of Cascaded Detectors
INRIA Rhone-Alpes, F107
Wednesday, May 7th 2008, 11h30
Slides
Abstract:
Cascades of boosted ensembles have become popular in the object detection
community following their introduction in the face detector of Viola and
Jones. Since then, researchers have sought to improve upon the original
approach by exploring alternative boosting methods, feature sets, etc.
Nevertheless, key decisions about the most basic aspects of the original
cascade classifier, such as how many hypotheses to include in an ensemble
and the appropriate balance of detection and false positive rates in the
individual stages, have not been studied systematically. Choices which
have a significant effect on the cascade's performance are usually made
with heuristics or through trial and error.
We propose a novel method for training cascade classifiers, which exploits
the shape of the ROC curve for a cascade in ways that have been previously
overlooked. We present a new mathematical characterization of the space of
possible cascade operating points. The results of our approach are cascade
detectors with significantly improved testing speeds in comparison to
other automatic training methods. We automatically produce cascades whose
detection speeds match those of the best hand-tuned detectors.
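To make the trade-off described in the abstract above concrete, here is a minimal sketch of how a cascade's operating point follows from its stages, assuming each stage's detection and false positive rates are measured on the windows that survived the previous stages (so the overall rates are products). The function names and the toy stages are hypothetical and are not the training method presented in the talk.

```python
from math import prod


def cascade_rates(stage_rates):
    """Overall (detection, false positive) rates of a cascade.

    stage_rates is a list of (detection_rate, false_positive_rate) pairs,
    each conditional on a window having passed all earlier stages.
    """
    detection = prod(d for d, _ in stage_rates)
    false_positive = prod(f for _, f in stage_rates)
    return detection, false_positive


def cascade_classify(window, stages):
    """Run a window through the cascade, rejecting at the first failure.

    stages is a list of (score_fn, threshold) pairs. Returns the decision
    and the number of stages evaluated, which drives detection speed.
    """
    for n_evaluated, (score_fn, threshold) in enumerate(stages, start=1):
        if score_fn(window) < threshold:
            return False, n_evaluated
    return True, len(stages)


if __name__ == "__main__":
    # Ten identical stages, each keeping 99% of objects and 50% of clutter.
    print(cascade_rates([(0.99, 0.5)] * 10))
    # Toy stages scoring a numeric "window" directly.
    toy_stages = [(lambda w: w, 1), (lambda w: w, 5)]
    print(cascade_classify(3, toy_stages))
```

Since most windows are rejected early, the average number of stages evaluated per window, rather than the total cascade length, determines the testing speed.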
Improving fast nearest neighbour search in large databases
for visual recognition
INRIA Rhone-Alpes, F107
Tuesday April 8th 2008, 16h00
Abstract:
Local feature detectors and descriptors of local image structures are
used in many state-of-the-art vision methods that require local
image-to-image correspondences.
In this talk I will discuss an approach for linear discriminant
projection of high dimensional image descriptors to reduce the number of
dimensions and to improve their matching performance. The method is
based on Fisher analysis and global statistics which can be estimated
from real or simulated training data.
The projected descriptors are more discriminative than the original
ones, 3-4 times more memory efficient, and require only a small
computational overhead. I will show experimental results in the context
of fast search for visual correspondence using different tree data
structures and approximate nearest neighbour search.
Finally, a recognition system based on a vocabulary forest of local
features will be presented. The system is capable of simultaneous
categorization and localization of scenes, objects and actions.
Techniques of information extraction and text mining
Benjamin Rozenfeld
INRIA Rhone-Alpes, C208
Thursday April 3rd 2008, 11h00
Slides
Abstract:
The talk will consist of two parts. It will start with a broad
overview of text mining, its main goals, tasks, and problems. Several
common tasks will be described in some detail, including building and
preprocessing of text collections, text categorization, extraction of
terms, entities and relations, and document summarization. Known
well-performing techniques for solving these problems will be briefly
discussed. In the second part, several complete information extraction
and text mining systems will be presented in more detail, their
strengths and shortcomings demonstrated and contrasted.
Improving People Search Using Query Expansions:
How Friends Help To Find People
Thomas Mensink
INRIA Rhone-Alpes, C207
Friday March 28th 2008, 15h00
Abstract:
Faces are important to people, so detecting and recognizing faces are
important applications for visual pattern recognition methods.
Recently these have found their way into consumer products such as
digital cameras.
In this paper we are interested in finding images of people on the web,
and more specifically in large databases of captioned news images.
It has recently been shown that analysis of the faces in images returned
on a text-based query over captions can significantly improve search
results.
The idea underlying this clean-up of text-based results is that the
queried person will appear relatively often compared to other people, so
we can search for a large group of highly similar faces. The performance
of such methods depends strongly on this assumption: for people whose
face appears in less than about 40% of the initial result set,
performance may be very poor.
The contribution of this paper is to improve search results by
exploiting faces of other people that co-occur frequently with the
queried person.
We refer to this process as `query expansion'. In the face analysis we
use the query expansion to provide a query-specific relevant set of
`negative' examples which should be separated from the potentially
positive examples in the initial result set. We apply this idea to a
recently-proposed method which filters the initial result set using a
Gaussian mixture model. We also consider replacing the Gaussian mixture
with a linear discriminant as basic tool to refine the text-based query
results.
We experimentally evaluate the methods using a set of 23 queries on a
database of 15,000 captioned news stories from Yahoo! News.
The results show that query expansion improves both methods, that our
new discriminative method outperforms generative approaches, and that our
best results surpass state-of-the-art results by 10% precision on average.
Hierarchical Spectral Latent Variable Models (HSLVM)
for Perceptual Inference
INRIA Rhone-Alpes, F107
Friday March 7th 2008, 15h00
Abstract:
I will discuss a recently introduced class of non-linear generative
models referred to as Spectral Latent Variable Models (SLVM), that combine
the advantages of spectral embeddings with those of latent variable
models: they (1) provide latent spaces that preserve geometric properties
-- either global or local -- of the data distribution; (2) offer
low-dimensional spaces with probabilistic, bi-directional mappings to
and from the data space; and (3) are probabilistically consistent, i.e.
reflect the data distribution, both jointly and marginally, and can be
learned with reasonable efficiency. Time allowing, I will discuss the
extension of this model to hierarchies (HSLVM) that represent multiple
levels of correlation in the data. In this case, training boils down
to learning a partially observed directed graphical model with tree
dependency and local distributions modeled as SLVMs. In practice,
HSLVM provide competitive priors compared to PCA, GPLVM (Gaussian
Process Latent Variable Model) or GTM (Generative Topographic Mapping)
when tracking facial expressions or human body motions like walking,
running, pantomime or dancing, not only in benchmark datasets like
HumanEva, but also in movies like Run Lola Run.
Viewpoint-Independent Object Class Detection
using 3D Feature Maps
INRIA Rhone-Alpes, C207
Friday February 22nd 2008, 16h00
Abstract:
We present a 3D approach to multi-view object class detection. Most
existing approaches recognize object classes for a particular viewpoint or
combine classifiers for a few discrete views. We propose instead to build
3D representations of object classes which make it possible to handle viewpoint
changes and intra-class variability. Our approach extracts a set of pose
and class discriminant features from synthetic 3D object models using a
filtering procedure, evaluates their suitability for matching to real
image data and represents them by their appearance and 3D position. We term these
representations 3D Feature Maps. For recognizing an object class in an
image we match the synthetic descriptors to the real ones in a 3D voting
scheme. Geometric coherence is reinforced by means of a robust pose
estimation which yields a 3D bounding box in addition to the 2D
localization. The precision of the 3D pose estimation is evaluated on a
set of images of a calibrated scene. The 2D localization is evaluated on
the PASCAL 2006 dataset for motorbikes and cars, showing that its
performance can compete with state-of-the-art 2D object detectors.
Automatic Face Naming with Caption-based Supervision
INRIA Rhone-Alpes, C207
Monday February 11th 2008, 16h00
Abstract:
We consider two scenarios of naming people in databases of news photos
with captions: (i) finding faces of a single person, and (ii) assigning
names to all faces. We combine an initial text-based step, that
restricts the name assigned to a face to the set of names appearing in
the caption, with a second step that analyzes visual features of faces.
By searching for groups of highly similar faces that can be associated
with a name, the results of purely text-based search can be greatly
improved. We improve a recent graph-based approach, in which nodes
correspond to faces and edges connect highly similar faces. We introduce
constraints when optimizing the objective function, and propose
improvements in the low-level methods used to construct the graphs.
Furthermore, we generalize the graph-based approach to face naming in
the full data set. In this multi-person naming case the optimization
quickly becomes computationally demanding, and we present an important
speed-up using graph-flows to compute the optimal name assignments in
documents. Generative models have previously been proposed to solve the
multi-person naming task. We compare the generative and graph-based
methods in both scenarios, and find significantly better performance
using the graph-based methods in both cases.
Category level object segmentation
using appearance models and Markov Random Fields
INRIA Rhone-Alpes, C207
Thursday January 31st 2008, 15h00
Abstract:
Object models based on bag-of-words representations achieve
state-of-the-art performance for image classification and object
localization tasks. However, as they consider objects as loose
collections of local patches they fail to accurately locate object
boundaries and are not able to produce accurate object
segmentation. On the other hand, Markov Random Field models used for
image segmentation focus on object boundaries but can hardly use the
global constraints necessary to deal with object categories whose
appearance may vary significantly. Here we propose
to combine the advantages of these two approaches. First, a
mechanism based on blobs of local regions makes it possible to detect objects
using visual word occurrences and produces a rough image
segmentation. Second, an MRF component gives clean cuts and enforces
label consistency, guided by local image cues (color, texture and edge
cues) and by long-distance dependencies. Gibbs sampling is used to
infer the model. The proposed method is used to segment object
categories with highly varying appearance in the presence of cluttered
backgrounds and large viewpoint changes.
Learning realistic human actions from movies
Ivan Laptev and Marcin Marszalek
IRISA Rennes and INRIA Rhone-Alpes
INRIA Rhone-Alpes, C207
Tuesday January 19th 2008, 15h00
Abstract:
The aim of this paper is to address recognition of natural human actions
in diverse and realistic video settings. This challenging but important
subject has mostly been ignored in the past due to several problems, one
of which is the lack of realistic and annotated video datasets. Our
first contribution is to address this limitation and to investigate the
use of movie scripts for automatic annotation of human actions in
videos. We evaluate alternative methods for action retrieval from
scripts and show benefits of a text-based classifier. Using the
retrieved action samples for visual learning, we next turn to the
problem of action classification in video. We present a new method for
video classification that builds upon and extends several recent ideas
including local space-time features, space-time pyramids and
multi-channel non-linear SVMs. The method is shown to improve
state-of-the-art results on the standard KTH action dataset by achieving
91.8% accuracy. Given the inherent problem of noisy labels in automatic
annotation, we particularly investigate and show high tolerance of our
method to annotation errors in the training set. We finally apply the
method to the learning and classification of challenging action classes
in movies and show promising results.
Scene Segmentation with CRFs Learned
from Partially Labeled Images
INRIA Rhone-Alpes, C207
Friday November 30th 2007, 16h00
Abstract:
Conditional Random Fields (CRFs) are an effective tool for a variety
of different data segmentation and labelling tasks including visual
scene interpretation, which seeks to partition images into their
constituent semantic-level regions and assign appropriate class labels
to each region. For accurate labelling it is important to capture the
global context of the image as well as local information. We introduce
a CRF based scene labelling model that incorporates both local
features and features aggregated over the whole image or large
sections of it. Secondly, traditional CRF learning requires fully
labelled datasets. Complete labellings are typically costly and
troublesome to produce. We introduce an algorithm that allows CRF
models to be learned from datasets where a substantial fraction of the
nodes are unlabeled. It works by marginalizing out the unknown labels
so that the log-likelihood of the known ones can be maximized by
gradient ascent. Loopy Belief Propagation is used to approximate the
marginals needed for the gradient and log-likelihood calculations and
the Bethe free-energy approximation to the log-likelihood is monitored
to control the step size. Our experimental results show that
incorporating top-down aggregate features significantly improves the
segmentations and that effective models can be learned from
fragmentary labellings. The resulting methods give scene segmentation
results comparable to the state-of-the-art on three different image
databases.
Open Source, Distributed and Peer-to-Peer
Information Retrieval
INRIA Rhone-Alpes, Grand Amphi
Monday November 67th 2007, 15h00
Abstract:
I will review arguments for open source and distributed
search/IR, introduce the basic concepts used in distributed IR, and
discuss some aspects of P2P IR in this context. The talk will be based
mainly on my tutorial given at the 6th European Summer School in
Information Retrieval (ESSIR 2007) in Glasgow in August.
Biological Vision and Artificial Vision:
Towards a Convergence?
INRIA Rhone-Alpes, F107
Wednesday November 7th 2007, 14h00
Abstract:
More than 25 years ago, David Marr proposed that biological vision and
machine vision could be part of a single discipline.
It must be acknowledged that this merger has not happened. Yet there
are some signs that this convergence could take place.
For a good ten years, research on biological systems
has suggested that certain a priori complex tasks (such as
deciding whether an image contains an animal) can be performed so
quickly that only essentially feed-forward processing
seems to be involved. It is moreover likely that this kind of
judgment is made even before the scene is segmented.
Interestingly, in machine vision it is precisely this
type of architecture that represents the state of the art. Is it possible that
natural selection and machine vision researchers
are converging on the same solutions?
Fisher Kernels on Visual Vocabularies
for Image Categorization
Florent Perronnin
INRIA Rhone-Alpes, F107
Wednesday October 31st 2007, 14h30
Abstract:
Within the field of pattern classification, the Fisher kernel is a powerful
framework which combines the strengths of generative and discriminative
approaches. The idea is to characterize a signal with a gradient vector
derived from a generative probability model and to subsequently feed this
representation to a discriminative classifier. We propose to apply this
framework to image categorization where the input signals are images and
where the underlying generative model is a visual vocabulary: a Gaussian
mixture model which approximates the distribution of low-level features in
images. We show that Fisher kernels can actually be understood as an
extension of the popular bag-of-visterms. Our approach demonstrates excellent
performance on the VOC 2006 and VOC 2007 databases. It is also very
practical: it has low computational needs both at training and test time and
vocabularies trained on one set of categories can be applied to another set
without any significant loss in performance.
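The "gradient vector derived from a generative probability model" in the abstract above can be sketched for a one-dimensional Gaussian mixture as follows. This is a simplified illustration under stated assumptions, not the speaker's implementation: only gradients with respect to the component means are computed, the Fisher information normalization is omitted, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def fisher_vector_means(features, weights, means, variances):
    """Gradient of the average log-likelihood w.r.t. each component mean.

    For a feature x and component k with posterior responsibility gamma_k(x),
    the derivative of log p(x) w.r.t. mean_k is gamma_k(x) * (x - mean_k) / var_k.
    Assumes every feature has non-zero likelihood under the mixture.
    """
    grad = [0.0] * len(means)
    for x in features:
        comps = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(comps)
        for k, (c, m, v) in enumerate(zip(comps, means, variances)):
            gamma = c / total                  # soft assignment to "visual word" k
            grad[k] += gamma * (x - m) / v
    return [g / len(features) for g in grad]


if __name__ == "__main__":
    # Two-word vocabulary; the features sit slightly right of the first word.
    print(fisher_vector_means([-0.8, -0.9, -0.7],
                              weights=[0.5, 0.5],
                              means=[-1.0, 1.0],
                              variances=[1.0, 1.0]))
```

Summing the responsibilities alone would give a soft bag-of-visual-words histogram; the gradient additionally records how the features deviate from each word, which is one way to read the claim that the representation extends the bag-of-visterms.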
Sprite learning and object category recognition
using invariant features
INRIA Rhone-Alpes, F107
Thursday October 25th 2007, 11h00
Abstract:
This talk will discuss the use of invariant features to learn the
appearance of specific objects and to learn to detect and locate
instances of object categories.
A popular framework for the interpretation of image sequences is the
layers or sprite model. Jojic and Frey (2001) provide a generative
probabilistic model framework for this task, but their algorithm is slow
as it needs to search over discretised transformations for each layer.
We show that by using invariant features and clustering their motions we
can reduce or eliminate search and thus learn the sprites much faster.
The Generative Template of Features (GTF) is a parts-based model for
visual object category detection. The GTF consists of a number of parts,
and for each part there is a corresponding spatial location distribution
and a distribution over 'visual words' (clusters of invariant features).
We examine the performance of the GTF model, and discuss the connection
of the GTF to Hough-transform-like methods for object localisation.
Enhanced Local Texture Feature Sets
for Face
Recognition under Difficult Lighting Conditions
Xiaoyang Tan
INRIA Rhone-Alpes, C208
Thursday October 11th 2007, 17h00
Abstract:
Recognition in uncontrolled situations is one of the most important
bottlenecks for practical face recognition systems. We address this by
combining the strengths of robust illumination normalization, local
texture based face representations and distance transform based
matching metrics. Specifically, we make three main contributions:
1) we present a simple and efficient preprocessing chain that
eliminates most of the effects of changing illumination while still
preserving the essential appearance details that are needed for
recognition; 2) we introduce Local Ternary Patterns (LTP), a
generalization of the Local Binary Pattern (LBP) local texture
descriptor that is more discriminant and less sensitive to noise in
uniform regions; and 3) we show that replacing local histogramming
with a local distance transform based similarity metric further
improves the performance of LBP/LTP based face recognition. The
resulting method gives state-of-the-art performance on several popular
datasets chosen to test recognition under difficult illumination
conditions: Face Recognition Grand Challenge experiment 4 (versions 1 and 2),
Extended Yale-B, and CMU PIE.
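To make the Local Ternary Pattern idea in the abstract above concrete, here is a minimal Python sketch. It codes each of the 8 neighbours of a 3x3 patch as +1, 0 or -1 depending on whether it lies above, within or below a tolerance band of width t around the centre value, then splits the ternary code into two binary (LBP-style) codes. The function names, the neighbour ordering and the sample patch are assumptions made for illustration; they are not taken from the talk.

```python
def ltp_codes(patch, t):
    """Ternary codes (+1, 0, -1) for the 8 neighbours of a 3x3 patch centre."""
    c = patch[1][1]
    # Clockwise neighbour order starting at the top-left pixel
    # (an arbitrary convention chosen for this sketch).
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    codes = []
    for r, col in coords:
        u = patch[r][col]
        if u >= c + t:
            codes.append(1)    # clearly brighter than the centre
        elif u <= c - t:
            codes.append(-1)   # clearly darker than the centre
        else:
            codes.append(0)    # within the tolerance band
    return codes


def split_ltp(codes):
    """Split a ternary code into 'upper' and 'lower' binary pattern integers."""
    upper = sum(1 << i for i, v in enumerate(codes) if v == 1)
    lower = sum(1 << i for i, v in enumerate(codes) if v == -1)
    return upper, lower


# Hypothetical grey-level patch, centre value 50, tolerance t = 5.
patch = [[54, 57, 12],
         [20, 50, 90],
         [49, 45, 52]]
codes = ltp_codes(patch, 5)
upper, lower = split_ltp(codes)
```

With t = 0 the upper pattern reduces to an ordinary LBP code under the same neighbour ordering, which is one way to see LTP as a generalization of LBP. The tolerance band is what makes the code less sensitive to small noise in near-uniform regions.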
Using non-expert collaborative work sources
to create ontologies for visual recognition
Pierre Bernard
ENSIMAG, Grenoble
INRIA Rhone-Alpes, C207
Friday October 5th 2007, 16h30
Abstract:
In the framework of multi-class recognition, we propose to automatically
extract inter-class knowledge from non-expert work
sources to build visual-centered hierarchies. We demonstrate the quality of
these hierarchies expressing visual similarity or
contextual links between classes. We describe how to build and train
classifiers that take advantage of them to perform object detection.
We evaluate our approach on the Pascal VOC'07 dataset, a set of challenging
real-world images, showing a significant average gain
compared to the standard one-against-rest method.
How to Dispatch
Observers to Track an Evolving Boundary
INRIA Rhone-Alpes, F107
Wednesday October 3rd 2007, 11h30
Abstract:
Some distributed-sensing applications make it necessary to dispatch
a limited number of observers (ships, vehicles, or airplanes with
cameras; field workers with chemical kits; high-flying balloons with
atmospheric sensors) to track the evolving boundary of a large
phenomenon such as an oil spill, a fire, a hurricane, air or water
pollution, or El Niño. This paper develops a new
framework for controlling the movements of the observers to maximize
the information gained about the boundary's shape and position. To
this end, we represent boundary uncertainty by a particle filter
where each particle is a binary indicator function. This makes our
dispatch algorithms applicable to arbitrary boundary representations
from which indicator functions can be computed, including level sets
and polygonal approximations. We demonstrate the benefits of optimal
dispatch on both synthetic and real data. These benefits are most
apparent when the observers are sparse relative to the boundary
size.
Randomized forests for learning the distance
between visual
object classes
INRIA Rhone-Alpes, C207
Tuesday September 25th 2007, 16h
Abstract:
I will present work on using a combination of
randomized forests and SVMs to learn the distance between object classes
from image pairs labeled as "same" or "different". This work is an extension
of previous work that used randomized forests to learn the distance
between object instances of the same class. It was shown that this
learned distance generalizes well to instances of the same class
that were never seen before.
To handle the increased within-class variability of
visual object classes, representative images of each class (focal
images) are used. The distance to a class is obtained as a combination of
the distances to each of the representative images of that class.
A Large Scale Tracking Problem: Tracking Migrating and Proliferating Cells in Phase-Contrast Microscopy Imagery
INRIA Rhone-Alpes, Grand Amphi
Monday September 13th 2007, 11h
Abstract:
In Tissue Engineering, the development of tissue substitutes to restore, maintain, or improve human tissues involves implanting scaffolds (biodegradable extracellular matrices) and seeding and culturing cells with hormones to induce tissue growth. Computer vision can provide the capability to "engineer individual cells": precisely and individually tracking a large number of cells in vivo in real time to study and direct the migration and proliferation of tissue cells. The varying density of the cell culture and the complexity of cell behavior (shape deformation, division/mitosis, close contact and partial occlusion) pose many challenges to tracking techniques. Drawing on our work in collaboration with biomedical engineers, I will present the challenge and excitement of this new application area of motion image analysis.
Character Recognition using Bag of Features: Baseline Results for
Latin and Kannada Characters
INRIA Rhone-Alpes, C207
Wednesday August 22nd 2007, 15h
Abstract:
In this talk, I will present the ongoing work that I've started at
Microsoft Research India, with Manik Varma. We targeted character
recognition from natural images. An intended application is
recognition of text from portable cameras to aid tourists who do not
know the local language. We acquired an image data set of Latin and
Kannada characters composed of synthesized characters using computer
fonts, handwritten characters and natural images obtained from
photographs. Our main test sets are from the latter group. The
problem was approached using bag of features and five feature
extraction methods were evaluated.
Accurate object detection
with deformable shape models learnt from images
INRIA Rhone-Alpes, Amphi F107
Wednesday July 18th 2007, 16h
Abstract:
In this talk we present an object class detection approach which fully
integrates the complementary strengths offered by object detectors and shape matchers.
Like an object detector, it can learn class models directly from images,
and localize novel instances in the presence of intra-class variations,
clutter, and scale changes.
Like a shape matcher, it finds the accurate boundaries of the objects,
rather than just their bounding-boxes.
This is made possible with a novel technique for learning both the
prototypical shape of an object class and a statistical model of how it
can deform, given just images of example instances.
Once the model is learnt, we localize novel instances in cluttered
images by combining a Hough-style voting process with a non-rigid point
matcher.
Through experimental evaluation, we show how the method can
detect objects and localize their boundaries accurately, while needing
no segmented training examples (only bounding-boxes).
Local Subspace Classifiers
INRIA Rhone-Alpes, C207
Friday July 13th 2007, 16h
Abstract:
The K-local hyperplane distance nearest neighbor (HKNN) algorithm is a
local classification method that builds nonlinear decision surfaces by
using locally linear manifolds directly in the original sample space.
Although it has been successfully applied in several classification
tasks, it is limited to using the Euclidean distance metric, which is a
significant limitation in practice. In this paper we reformulate
HKNN in terms of subspaces, and propose a variant, the Local
Discriminative Common Vector (LDCV) method, that is more suitable for
classification tasks where the classes have similar intra-class
variations. We then extend both methods to the nonlinear case by using
the kernel trick to map the data into a higher-dimensional space, in
which the linear manifolds are constructed. This construction allows us
to use a wide variety of distance functions for the local classifiers,
while computing distances between the query sample and the nonlinear
manifolds remains straightforward owing to the linear nature of the
manifolds in the mapped space. We tested the proposed methods on several
classification tasks, obtaining better results than both the Support
Vector Machines (SVMs) and their local counterpart SVM-KNN on the USPS
and Image segmentation databases, and outperforming the local SVM-KNN on
the Caltech and Xerox10 visual recognition databases.
Using shape information for recognition
INRIA Rhone-Alpes, Amphi F107
Wednesday June 27th 2007, 10h
Abstract:
Shape information is an important cue for recognizing objects and object
categories in images. In fact, many categories are
characterized primarily by the consistency of their shape while
intra-class texture statistics may not be as informative. This is true
even for categories that include a large degree of geometric deformation.
Recent work in the community has shown progress in using shape cues for
recognition, including learned boundary detectors, matching and
classification using local configuration of contour fragments. In this
talk, I will review three recent developments in this area. The first one
is an algorithm for category recognition which relies on very simple shape
features (oriented points sampled on contour fragments). The algorithm
uses an efficient spectral matching technique for both matching and
learning. The category models can be learned from semi-supervised data
(i.e., images labeled as containing/not containing the object without
manual delineation of the object). An added benefit of this approach is
that it uses an explicit matching approach between image features and
model parts. As a result, it is possible to extend the classification
algorithm to an efficient detection algorithm, which includes object
localization.
Two other developments will be very briefly described. The first one has
to do with using motion information to detect boundaries; the second one
addresses the problem of extracting boundaries from a single image by
using estimates of the local geometry of the scene (using the results from
our earlier work on estimating geometric layout from an image).
Both approaches provide information about object boundaries that is
useful for recognition.
Accurate Object Localization with Shape Masks
INRIA Rhone-Alpes, Amphi C207
Tuesday June 12th 2007, 16h
Abstract:
We will discuss an object class localization approach which goes beyond
bounding boxes, as it also determines the outline of the object. Unlike
most current localization methods, our approach does not require any
hypothesis parameter space to be defined. Instead, it directly
generates, evaluates and clusters shape masks. Thus, the presented
framework produces much richer answers to the object class localization
problem. For example, it easily learns and detects possible object
viewpoints and articulations, which are often well characterized by the
object outline. We evaluate the proposed approach on the challenging
natural-scene Graz-02 object classes dataset. The results demonstrate
the extended localization capabilities of our method.
A contextual dissimilarity measure
for accurate and efficient image search
INRIA Rhone-Alpes, Amphi C207
Wednesday June 6th 2007, 16h
Abstract:
In this paper we present two contributions to improve accuracy and speed
of an image search system based on bag-of-features: a contextual
dissimilarity measure (CDM) and an efficient search structure for visual
word vectors.
Our measure (CDM) takes into account the local distribution of the
vectors and iteratively estimates distance correcting terms. These terms
are subsequently used to update an existing distance, thereby modifying
the neighborhood structure. Experimental results on the
Nistér-Stewénius dataset show that our approach significantly
outperforms the state-of-the-art in terms of accuracy.
Our efficient search structure for visual word vectors is a two-level
scheme using inverted files. The first level partitions the image set
into clusters of images. At query time, only a subset of clusters of the
second level has to be searched. This method allows fast querying in
large sets of images. We evaluate the gain in speed and the loss in
accuracy on large datasets (up to 500k images).
Learning and Recognizing Visual Object Categories
Without Detecting Features
INRIA Rhone-Alpes, Grand Amphi
Tuesday June 5th 2007, 11h
Abstract:
Over the past few years there has been substantial progress in the
development of systems that can recognize generic categories of objects in
images, such as automobiles, bicycles, airplanes, and human faces. Much of
this progress can be traced to two underlying technical advances: (i)
detectors for locally invariant features of an image, and (ii) the
application of techniques from machine learning. Despite recent successes,
however, there are some fundamental concerns about methods that rely heavily
on feature detection, as local image evidence is often highly ambiguous due
to the absence of contextual information.
We are taking a different approach to learning and recognizing visual object
categories, in which there is no separate feature detection stage. In our
approach, objects are modeled as local image patches with spring-like
connections that constrain the spatial relations between patches. Such
models are intuitively natural, and their use dates back over 30 years.
Until recently such models were largely abandoned due to computational
challenges that are addressed by our work. Our approach can be used to learn
models from weakly labeled training data, without any specification of the
location of objects or their parts. The recognition accuracy for such models
is better than when using feature-based techniques with similar forms of
spatial constraint.
From objects to actions:
Detection using boosted histogram classifier
INRIA Rhone-Alpes, Amphi F107
Thursday May 31st 2007, 16h
Abstract:
This talk will address the detection of object and action classes in
unconstrained scenes. We first consider object class recognition and
localisation in still images. Building upon recent advances in the field
we show how histogram-based descriptors combined with a boosting
classifier provide a state-of-the-art object detector. Among the improvements,
we introduce a Fisher weak learner for multi-valued histogram features and
address the training from limited sets of examples. We also address
computational aspects and analyse the tradeoff between the speed and the
accuracy of the detector. Validation of the method on VOC05 and VOC06
benchmarks for object recognition shows its superior performance. In
particular, the approach outperforms all the methods reported in VOC05
Challenge for 7 out of 8 detection tasks while using a single set of
parameters and providing close to real-time performance.
We next consider recognition and localisation of "atomic" actions in
video. We treat such actions similarly to the objects in images and extend
the boosted histogram detector to action detection in space-time. Using
this approach, we address recognition and localisation of human actions in
realistic scenarios with substantial variation in subject appearance,
motion, surrounding scenes, viewing angles and spatio-temporal extents. In
contrast to the previous works that study action recognition in controlled
settings, here we train and test the algorithms on real movies. In
particular, we investigate the combination of shape and motion information for
action understanding. To this end we introduce "keyframe priming", which
combines discriminative models of human appearance and motion in action.
Keyframe priming is shown to significantly improve the performance of
action detection. We present detection results for the action class
"drinking" evaluated on two episodes of the movie "Coffee and
Cigarettes" with 36,000 frames in total.
Penalized least squares with nonquadratic penalties
INRIA Rhone-Alpes, Amphi F107
Monday May 28th 2007, 15h
Abstract:
A popular method for fitting a linear regression model from data
measurements is regularization: minimize an objective function which
enforces a roughness penalty in addition to coherence with the data.
This is the case when formulating penalized least squares for
linear regression models. We focus on penalized
regression methods involving a variety of nonquadratic penalties,
pointing out some basic principles they have in common. We end this
talk with an application of such penalties for feature selection in
model-based clustering problems.
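The abstract above does not list the specific penalties covered in the talk. As a hedged illustration, the sketch below contrasts a quadratic (ridge) penalty with the L1 penalty, a common nonquadratic choice, in the simplest scalar setting: minimizing 0.5*(y - b)^2 plus the penalty over a single coefficient b. The function names and sample values are assumptions made for this example.

```python
def ridge_shrink(y, lam):
    """Minimizer of 0.5*(y - b)**2 + 0.5*lam*b**2 (quadratic penalty)."""
    return y / (1.0 + lam)


def soft_threshold(y, lam):
    """Minimizer of 0.5*(y - b)**2 + lam*abs(b) (L1, nonquadratic penalty)."""
    if y > lam:
        return y - lam
    if y < -lam:
        return y + lam
    return 0.0  # small coefficients are set exactly to zero


# Hypothetical least-squares coefficients before penalization.
raw = [3.0, 0.5, -3.0]
ridge = [ridge_shrink(y, 1.0) for y in raw]
lasso = [soft_threshold(y, 1.0) for y in raw]
```

The quadratic penalty scales every coefficient toward zero without ever reaching it, whereas the L1 penalty zeroes out small coefficients entirely. That property is what makes such penalties usable for feature selection, as mentioned at the end of the abstract.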
Learning Visual Similarity Measures for
Comparing Never Seen Objects
INRIA Rhone-Alpes, Amphi C207
Friday May 25th 2007, 16h
Abstract:
In this paper we propose and evaluate an algorithm that learns a similarity
measure for comparing never seen objects. The measure is learned from pairs of
training images labeled "same" or "different". This is far less informative
than the commonly used individual image labels (e.g. "car model X"), but it
is cheaper to obtain. The proposed algorithm learns the characteristic
differences between local descriptors sampled from pairs of "same" and
"different" images. These differences are vector quantized by an ensemble of
extremely randomized binary trees, and the similarity measure is computed from
the quantized differences. The extremely randomized trees are fast to learn,
robust due to the redundant information they carry, and have been shown to be
very good clusterers. Furthermore, the trees efficiently combine different
feature types (SIFT and geometry). We evaluate our similarity measure on
four very different datasets and consistently outperform state-of-the-art
competing approaches.
Applying Generic Object Recognition Methods
to Environmental Monitoring and Ecological Science
INRIA Rhone-Alpes, Amphi F107
Wednesday April 25th 2007, 16h
Abstract:
This talk will describe our work at Oregon State University to develop
object recognition methods that can achieve high precision on the task
of classifying small arthropods according to Family, Genus, and
Species. Arthropods are challenging for computer vision because they
have many internal degrees of freedom and because there is high
within-class variation due to molting. Our interdisciplinary team
combines expertise in computer vision, machine learning, mechanical
engineering, and entomology to develop a high-throughput system for
classifying stonefly larvae collected from freshwater streams.
We are pursuing the bag-of-SIFT approach based on many ideas from
INRIA. Our system begins by applying three region detectors to each
image. Two of these detectors (Harris Affine and Kadir) are
well-known in computer vision, but the third is a new detector (PCBR)
that we developed specifically for natural (non-man-made) objects
based on principal curvature computations. Each detected region is
re-represented as a SIFT descriptor vector. Next, we construct
detector-specific/class-specific visual dictionaries by fitting
Gaussian mixture models to the SIFT descriptor vectors. Finally, we
re-represent the image as a concatenated histogram where each element
counts the number of SIFT vectors mapped to corresponding dictionary
entry. This feature vector is then classified using a bag of logistic
model trees.
Our initial system is capable of identifying three taxa of stoneflies
with 95% accuracy and four taxa with 82% accuracy. We are currently
performing an 8-taxa experiment with 10 additional "distractor"
classes. This talk will also describe our current research directions
and discuss a new application problem: classification and sorting of
soil mesofauna.
Expressive rendering
INRIA Rhone-Alpes, Amphi F107
Wednesday March 14th 2007, 16h
Abstract:
A part of computer graphics can be viewed as a visual communication
tool. Such a point of view implies several goals that we target in ARTIS
with expressive rendering. In particular, the user of an expressive
rendering tool should be able to produce images that correspond to
their own goals.
This involves, in particular, significant work on the notion of
/relevance/, which is necessarily application-dependent. The relevance
should guide the level of abstraction of the rendered scene to let the
user emphasize the most important elements of the input 3D scene. It can
also be defined from a levels-of-detail point of view: not only can we
adapt the geometry to decrease the computation time, but we can also
adapt the rendering style to meet the user's goals.
Another research direction for expressive rendering concerns /rendering
styles/: in many cases it should be possible to define the constitutive
elements of styles, allowing the application of a given rendering style
to different scenes, or in the long term the capture of style elements
from collections of images.
Finally, since the application of expressive rendering techniques
generally amounts to a visual simplification, or abstraction, of the
scene, particular care must be taken to make the resulting images
consistent over time, for interactive or animated imagery. This leads to
various projects targeting the temporal coherence of animated scenes.
ROBIN project
INRIA Rhone-Alpes, C207
Tuesday February 27th 2007, 16h15
Abstract:
This short talk is about the ROBIN project, funded by the French
Ministry of Defense and the French Ministry of Research. Its main goal
is to produce datasets, ground-truth data, competition rules and
evaluation metrics for visual object recognition algorithms that
correspond to real operational matters.
As the competitions have begun, I will present the various ROBIN
competitions, the databases and the submission procedures. More
information at
http://robin.inrialpes.fr.
Fun with Nearest-Neighbor Quantizers
INRIA Rhone-Alpes, Amphi F107
Tuesday February 6th 2007, 16h
Abstract:
I will present recent research on using nearest-neighbor vector
quantization for estimating intrinsic dimensionality of high-dimensional
datasets and for learning informative partitions of labeled data.
In the first part of the talk, I will discuss a technique for intrinsic
dimensionality estimation based on the theoretical notion of quantization
dimension. This technique works by quantizing the dataset at increasing
rates (in practice, we use k-means to learn the quantizer) and by fitting
a parametric form to the plot of the empirical quantizer distortion
as a function of rate. By using tree-structured quantization, we can
simultaneously estimate dimensionality and partition the dataset into
subsets having different intrinsic dimensions.
In the second part of the talk, I will discuss an information-theoretic
method for learning a nearest-neighbor quantizer from labeled continuous
data such that the index of the nearest prototype of a given data point
approximates a sufficient statistic for its class label. I will
demonstrate applications of this method to learning discriminative
visual vocabularies for bag-of-features image classification and to
image segmentation.
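The first part of the abstract above estimates intrinsic dimension from how quantizer distortion falls as the codebook grows. The sketch below is one simple instance of that idea, not the speaker's method. It assumes the classical high-rate behaviour for squared-error distortion, D(k) roughly proportional to k^(-2/d) for a k-point quantizer on d-dimensional data, and replaces the parametric fit mentioned in the talk with a plain log-log line fit. The toy data, the deterministic k-means initialisation and all function names are hypothetical.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_distortion(points, k, iters=20):
    """Mean squared distance to the nearest of k centres after Lloyd iterations."""
    n = len(points)
    # Deterministic initialisation with evenly spaced samples: an assumption
    # that suits the ordered toy data below, not a general-purpose choice.
    centers = [points[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / n


def estimate_dimension(points, ks=(2, 4, 8, 16)):
    """Fit log D(k) = a + slope * log k and return d = -2 / slope."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(kmeans_distortion(points, k)) for k in ks]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -2.0 / slope


# Toy data: a 1-D line segment embedded in 3-D space.
line = [(i / 512, 2 * i / 512, -i / 512) for i in range(512)]
dim = estimate_dimension(line)
```

On this toy set the estimate comes out close to 1 even though the points live in three dimensions, which is the behaviour the quantization-dimension approach relies on. Real data would call for random restarts and a more careful fit of the distortion curve.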
Inverse chronological order.
Details of 2006 seminars
Learning a similarity measure to compare never seen objects
Presenter: Eric Nowak
15 December, at 16h30
C207, INRIA Rhône-Alpes
Affiliation:
Lear Project, INRIA Rhone-Alpes
Abstract:
We propose a similarity measure between two images that predicts how
similar two images of never seen objects are, given a training set of
similar and different object pairs. This similarity measure is used
for visual identification from *one image*. It does not
model any a priori deformation, nor does it expect a linear or
quadratic transformation of the input space to be relevant; instead, it
clusters local image representations and weights these clusters for the
same/different prediction. An ensemble of extremely randomized
decision trees is used as clusterer. These trees are particularly
adapted to the clustering since they are very fast to learn and they
produce redundant information, which brings robustness. We evaluate
our similarity measure on three datasets and outperform
state-of-the-art competitive methods.
Human character recognition in TV-style movies
Affiliation:
Lear Project, INRIA Rhone-Alpes
Abstract:
This master's thesis describes a supervised approach to the detection and
identification of humans in TV-style video sequences. In still images and video
sequences, humans appear in different poses and views, fully visible or partly
occluded, at varying distances from the camera, in different places, under
different illumination conditions, etc. This diversity in appearance makes
human detection and identification a particularly challenging problem. A
solution to this problem is of interest for a wide range of applications, such
as video surveillance and content-based image and video processing.
In order to detect humans in views ranging from full to close-up
and in the presence of clutter and occlusion, they are modeled as an assembly of
several upper-body parts. For each body part, a detector is trained based on a
Support Vector Machine and on densely sampled, SIFT-like feature points in a
detection window. For more robust human detection, localized body parts are
assembled using a learned model of geometric relations based on Gaussians.
For flexible human identification, the outward appearance of humans is
captured and learned using the Bag-of-Features approach and non-linear Support
Vector Machines. Probabilistic votes for each body part are combined to improve
classification results. The combined votes yield an identification accuracy of
about 80% in our experiments on episodes of the TV series "Buffy the Vampire
Slayer".
The Bag-of-Features approach has been used in previous work mainly for
object classification tasks. Our results show that this approach can also be
applied to the identification of humans in video sequences. Despite the
difficulty of the given problem, the overall results are good and encourage
future work in this direction.
Sensor Synchronization and
Localization for Meeting Scene Analysis
Affiliation:
MIT Artificial Intelligence Laboratory
Abstract:
In this talk we tackle the problems of automatically i) synchronizing
audio-visual streams and ii) localizing a set of cameras in a meeting
analysis setting. More exactly, we consider a conference meeting setup where
each participant wears a close-talking microphone and is recorded by a
personal video camera. The multiple audio and video streams are recorded in
an unsynchronized manner and the location and orientation of the cameras are
unknown. We propose here some techniques for automatically estimating the
time discrepancy between all audio and video streams and recovering the
location and orientation of the cameras.
First we show how the mutual information between the estimated motion energy
of the lips and the audio energy can be used to recover the time discrepancy
between the video and audio streams corresponding to the same participant.
Then we show how the same technique can be used to synchronize the
audio-visual streams corresponding to different participants.
Finally we describe a probabilistic Bayesian framework for estimating the
location and orientation of a set of cameras. We show how the head direction
of the users can be used as a constraint by exploiting gaze patterns in
multiparty conversational settings. In order to evaluate the performance of
our algorithms, we show some synchronization and calibration results on real
meetings.
Presentation of an appearance model for small targets tracking
Presenter: Julien Bohné
11 October, at 17h00
C207, INRIA Rhône-Alpes
Affiliation:
Lear Project, INRIA Rhone-Alpes
Abstract:
Our method combines a statistical appearance model of the target
with an accurate model of the background in its neighborhood. The two
models are updated during the image sequence to adapt to appearance
changes. We pay particular attention to the algorithm's ability to
provide a good estimate of the confidence in its position estimates.
A contribution to aerial image mosaicking
Affiliation:
Université de Haute-Alsace,
composante Label
Abstract:
This talk, entitled "A contribution to aerial image mosaicking",
presents the work of a PhD thesis. We describe our experimental
setup and the characteristics of the image sequences it
produces. We then review the state of the art in mosaicking
techniques and study the algorithms in depth. In the
last part we discuss our contributions: the design of
a descriptor vector invariant to rotations about the
optical axis for matching specific points,
the implementation of a subpixel registration technique for the
correspondences, and the design of a method for compensating
the accumulation of errors in a mosaic.
Efficient MAP approximation for dense energy functions
Affiliation:
The Robotics Institute, Carnegie Mellon University
Abstract:
We present an efficient method for maximizing energy functions with
first and second order potentials, suitable for MAP labeling
estimation problems that arise in undirected graphical models. Our
approach is to relax the integer constraints on the solution in two
steps. First we efficiently obtain the relaxed global optimum
following a procedure similar to the iterative power method for
finding the largest eigenvector of a matrix. Next, we map the relaxed
optimum on a simplex and show that the new energy obtained has a
certain optimal bound. Starting from this energy we follow an
efficient coordinate ascent procedure that is guaranteed to increase
the energy at every step and converge to a solution that obeys the
initial integral constraints. We also present a sufficient condition
for ascent procedures that guarantees the increase in energy at every
step.
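The abstract above only says that the relaxed optimum is obtained by a procedure similar to the iterative power method; it does not give the procedure itself. For orientation, here is a minimal sketch of the classical power method, which finds the principal eigenvector of a symmetric matrix and therefore maximizes the quadratic form x^T A x over unit-norm x. The matrix, iteration count and function name are assumptions made for this example, and the sketch is not the speaker's algorithm.

```python
def power_iteration(A, iters=200):
    """Return (eigenvalue, unit eigenvector) for the dominant eigenpair of A."""
    n = len(A)
    x = [1.0] * n  # any start vector with a component along the top eigenvector
    for _ in range(iters):
        # Multiply by A, then renormalise to unit length.
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Rayleigh quotient x^T A x (x has unit norm) gives the eigenvalue estimate.
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    lam = sum(x[i] * Ax[i] for i in range(n))
    return lam, x


# Hypothetical 2x2 symmetric matrix of pairwise potentials.
A = [[2.0, 1.0],
     [1.0, 3.0]]
lam, vec = power_iteration(A)
```

For this matrix the eigenvalues are (5 ± sqrt(5))/2, so the iteration converges to the larger one, about 3.618. In a labeling problem the continuous vector obtained this way still has to be mapped back to a discrete solution, which is the role of the simplex projection and coordinate ascent steps described in the abstract.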
Blind Vision
Presenter: Shai Avidan
17 July, at 17h00
F107, INRIA Rhône-Alpes
Affiliation:
Mitsubishi Electric Research Laboratories
Abstract:
We have developed a general framework for secure image and video
analysis that allows a client to have his data analyzed by a server,
privately. For example, the client might submit his images to the
server for face detection, without letting the server learn anything
about the content of the images. Or, more generally, the client might
use a query image to query an image database stored on the server,
without revealing the content of the query image to the server. In the
last year, we have implemented a secure face detector as a
proof-of-concept, presented our work at a scientific conference and
extended the method to work with different types of machine learning
technologies.
Latent Mixture Vocabularies for Object Categorization
Presenter: Diane Larlus
12 July, at 14h00
C207, INRIA Rhône-Alpes
Affiliation:
LEAR Group
Abstract:
The visual vocabulary is an intermediate level representation
which has been proven to be very powerful for addressing object
categorization problems. It is generally built by vector quantizing a
set of local image descriptors, independently of the object model used
for categorizing images. We propose here to embed the visual vocabulary
creation within the object model construction, making it better
suited for object class discrimination. We experimentally show that
the proposed model outperforms approaches not learning such an
adapted visual vocabulary.
statistical models to address the problem of object
recognition
Affiliation:
Computer Science Department, Aachen University
Abstract:
Object recognition in images, that is, deciding whether an object is
contained in an image and, if so, where it is located, is an
active field of research. A promising approach to this problem is to
model objects as a collection of parts whose relationships can be
modeled flexibly.
We present a set of methods following this approach where image
patches extracted from certain points in the images are used as
features.
Starting from approaches inspired by nearest neighbor classification
we develop various statistical models to address the problem of object
recognition. Though most of the models developed are strongly
connected, the training method and the representation of the data have
a strong impact on the performance of a system. Some of the methods
offer interesting insights in the way computers might be able to learn
the visual appearance of certain object categories. For example, an
object recognition system trained to recognize faces learns that the
most discriminative, i.e. most relevant, parts are the eyes.
Using the methods presented, very interesting and promising results for
different tasks can be achieved.
Conservative Learning and On-line Boosting for Vision
Presenter: Horst Bischof
5 June, at 14h00
Grand Amphi, INRIA Rhône-Alpes
Affiliation:
Institute for Computer Graphics and Vision, TU Graz
Abstract:
I will present two recently developed visual learning methods:
1. The conservative learning framework allows object detectors to be
learned with minimal or no supervision by exploiting the redundancy
of the video stream of cameras. Conservative learning exploits
generative and discriminative learning in a co-training fashion to
obtain powerful object detectors. We demonstrate the framework on a
surveillance task where we learn person and car detectors in an
on-line fashion.
2. One method in the on-line conservative learning framework is a
novel on-line AdaBoost feature selection algorithm. Together with
efficiently computable features (Haar Wavelets, Integral Orientation
Histograms, etc.), training the classifier on-line and incrementally
as new data arrives has several advantages and opens new application
areas for boosting in computer vision. We will demonstrate on-line
learning of detection, background modeling and tracking tasks based on
on-line boosting; all algorithms are real-time capable. All approaches
benefit significantly from the on-line training.
Multiple Object Class Detection with a Generative Model
Presenter: Bernt Schiele
|
9 June, at 14h30 |
F 107, INRIA Rhône-Alpes |
|
Affiliation:
Department of Computer Science
Darmstadt University of Technology
Abstract:
In this talk we propose an approach capable of simultaneous
recognition and localization of multiple object
classes using a generative model. A novel hierarchical representation
allows individual images as well as various object classes to be
represented in a single, scale- and rotation-invariant
model. The recognition method is based on a codebook
representation where appearance clusters built from edge-based
features are shared among several object classes. A
probabilistic model allows for reliable detection of various
objects in the same image. The approach is highly efficient
due to fast clustering and matching methods capable
of dealing with millions of high-dimensional features. The
system shows excellent performance on several object categories
over a wide range of scales, in-plane rotations, background
clutter, and partial occlusions. The performance of
the proposed multi-object class detection approach is comparable
with state of the art approaches dedicated to a single
object class recognition problem.
Extremely randomized trees applied to image
quantification combined to a visual attention process for
object categorization
Affiliation:
Lear Project
Abstract:
Lately, the bag-of-features approach has become very popular for image
categorization. However, there are several areas where it can be
improved: The selection of features is so far done either densely or
with detector functions. While the dense approach achieves better
results than detector-based approaches, it also has a higher complexity.
The second area of possible improvement is the creation of visual
codebooks. The standard clustering method - k-means - is not only slow,
it also does not create codebooks suited to discriminate between
classes. The associated nearest-neighbor routine to assign clusters is
also slow.
We propose improvements in both areas: Extremely Randomized Trees are
used to create a codebook efficiently and in a discriminative manner.
In addition, a combined bottom-up/top-down process is introduced to bias
the random selection of features, so that fewer features are needed to
obtain the same or even better results.
Brain Computer Interfaces
Affiliation:
Lab. LITIS - INSA de Rouen
Abstract:
A lot of research has been carried out to design Brain Computer
Interfaces (BCI), especially in the field of supervised
classification of non-stationary signals.
EEG signals require particular processing and we propose to tackle
those problems according to three approaches: building a denoised
compact representation for raw signals, introducing translation
invariance in the procedure and dealing with the variability of EEG
signals.
In all our approaches we keep two threads: non-parametric tools with
kernel machines and a tripolar strategy including the representation
of raw signals, the building of similarities between representations
and the classification machine.
First, we face the problem of describing the raw signals.
We aim at constructing a denoised and compact representation of the
raw signals.
We designed the Kernel Basis Pursuit (KBP) algorithm which combines
multiple kernels, sparse regularization and very efficient solving of
regression problems.
We add some heuristics to make this method parameter-free thus
enabling us to deal with large amounts of data.
Then we make the assumption that one difficulty resides in the
variable time position of the discriminant patterns.
We develop a translation invariant approach to classify
non-stationary signals.
Such a method relies on a graph model of shift-covariant
representation (wavelet transform or time-frequency) where all the
time information becomes comparative.
Finally, the variability of EEG signals turned out to be the main
difficulty in BCI problems.
We show that combining multiple classifiers and variable selection is
an efficient strategy to identify evoked potential in EEG.
* Key words: L1 regularization, Kernel methods, Multiple kernel,
Graph kernel, Translation invariance, Multiple classifiers, Brain
Computer Interface.
Filtering methods for tracking in image sequences -
application to feature point tracking
Presenter: Elise Arnaud
|
4 april, at 16h30 |
F 107, INRIA Rhône-Alpes |
|
Affiliation:
University of Genoa, Italy, and
IRISA Rennes
Abstract:
This study deals with the use of filtering methods (Kalman filtering,
sequential Monte Carlo methods) for tracking in image sequences. These
algorithms rely on a representation of the dynamic system as a hidden
Markov chain, described by a dynamic law and a data likelihood. To build
a general method, a dynamic law estimated from the images is considered.
This choice highlights the limitations of the simple hidden Markov chain
model, which does not describe the dependence of the system's elements
on the images. We first propose an original model of the problem. It
makes the images explicit and allows algorithms to be built without
prior information. The filters associated with this new representation
are derived from the classical filters by conditioning on the sequence.
We also show how this new scheme makes it possible to consider simple
models for which the optimal importance function is available.
We then turn to the practical validation of the proposed model on a
feature point tracking application. The systems implemented are
estimated entirely from the sequence. They combine similarity measures
with a dynamic defined from an instantaneous motion estimated by a
robust differential method. The resulting algorithms are validated on
numerous real sequences and used for various applications (medical
imaging, object recognition).
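As a point of reference for the filtering methods mentioned in this abstract, here is a minimal sketch of a textbook scalar Kalman filter. It is not the image-conditioned filter proposed in the talk: the random-walk state model, the function name `kalman_1d`, the noise variances `q` and `r`, and the measurement values are all assumptions chosen for illustration.

```python
def kalman_1d(measurements, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state observed with additive noise."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed to stay put, uncertainty grows by q.
        p = p + q
        # Update: blend prediction and measurement z using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Hypothetical noisy positions of a tracked point along one image axis.
measurements = [5.2, 4.9, 5.1, 5.3, 4.8, 5.0]
estimates = kalman_1d(measurements, x0=measurements[0])
```

Because each update is a convex combination of the previous estimate and the new measurement, the estimates stay within the range of the observed values while smoothing their noise.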
Error-resilient source codes and joint source/channel codes
Presenter: Herve Jegou
|
3 april, at 16h |
A 104, INRIA Rhône-Alpes |
|
Affiliation:
IRISA, Rennes
Abstract:
The talk will be in two distinct parts.
First, two contributions on joint source-channel coding will be
presented. The first concerns the decoding of variable-length codes. A
technique for aggregating the optimal decoding trellis will be
presented. It reduces the complexity of Bayesian decoding by an order of
magnitude. Its optimality for jointly typical source/channel
realizations is motivated by computing the amount of information
contained in the termination constraint. The second contribution is the
introduction of codes based on rewriting rules and implemented by
sequential transducers. A few properties will illustrate the interest of
this class of codes.
The second part of the talk deals with similarity search, and more
specifically with approximate nearest neighbor search in
high-dimensional spaces. After introducing the problem, we will
highlight the limitations of a state-of-the-art algorithm, Omedrank,
before moving on to improvements of this algorithm. We will show in
particular that significant gains can be obtained by modifying the
voting strategy used. Finally, we will give some research perspectives
on this topic.
Object Detection in Crowded Scenes
Presenter: Bastian Leibe
|
20 march, at 11h00 |
F 107, INRIA Rhône-Alpes |
|
Affiliation:
Multimodal Interactive Systems group, Darmstadt
Abstract:
The detection of object classes in real-world images is a challenging
problem which is further complicated by the effects of overlaps and
partial occlusions. We present a novel algorithm which addresses this
problem by considering object categorization and top-down segmentation
as two interleaved processes that closely collaborate towards a common
goal. As we will show, the close coupling between those two processes
allows our method to accumulate additional evidence about object
hypotheses and resolve ambiguities caused by overlaps and partial
visibility.
The core part of our approach is a flexible formulation for object shape
that can combine the information observed on different training examples
in a probabilistic extension of the Generalized Hough Transform. The
resulting approach can detect categorical objects in novel images and
automatically infer a top-down segmentation from the recognition result.
The segmentation is then used to again improve recognition by allowing
the system to focus on object pixels and discard misleading influences
from the background. Moreover, the information from where in the image a
hypothesis draws its support is used in an MDL based verification stage
to resolve ambiguities between overlapping hypotheses and factor out the
effects of partial occlusion.
As an application, we address the problem of detecting objects such as
cars, motorbikes, and pedestrians in real-world street scenes.
Qualitative and quantitative results on several challenging data sets
confirm that our method is able to reliably detect objects in crowded
scenes, even when they overlap and partially occlude each other. In
addition, the flexible nature of our approach allows it to operate on
very small training sets.
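The voting idea behind a Generalized Hough Transform can be sketched as follows. This is only a basic illustration, assuming hypothetical visual words, learned offsets and a coarse voting grid; it does not implement the probabilistic extension, the top-down segmentation, or the MDL verification stage described in the abstract.

```python
from collections import Counter


def hough_votes(features, offsets_by_word, cell=10):
    """Accumulate object-centre votes cast by matched codebook entries."""
    votes = Counter()
    for (x, y, word) in features:
        learned = offsets_by_word.get(word, [])
        for (dx, dy) in learned:
            # Each stored offset points from the feature to a possible object centre.
            cx, cy = x + dx, y + dy
            # Votes are shared equally among the offsets learned for this word.
            votes[(cx // cell, cy // cell)] += 1.0 / len(learned)
    return votes


# Hypothetical training result: offsets from each visual word to the object centre.
offsets_by_word = {"wheel": [(30, -10), (-30, -10)], "roof": [(0, 20)]}
# Hypothetical detections in a test image: (x, y, matched visual word).
features = [(70, 60, "wheel"), (130, 60, "wheel"), (100, 30, "roof")]
votes = hough_votes(features, offsets_by_word)
best_cell, best_score = votes.most_common(1)[0]
```

In this toy case the two wheels and the roof agree on the grid cell around (100, 50), so that cell collects the highest score and becomes the object hypothesis.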
Beyond bag-of-words: recent research developments on visual
categorization at XRCE
Affiliation:
Xerox Research Centre Europe, Image Processing Group
Abstract:
Generic Visual Categorization (GVC) is the pattern classification
problem which consists in assigning one or multiple labels to an image
based on its semantic content. Several state-of-the-art GVC systems were
inspired by the bag-of-words (BOW) approach to text-categorization. In
the BOW representation, a text document is encoded as a histogram of the
number of occurrences of each word. Similarly, one can characterize an
image by a histogram of "visual word" counts. This is sometimes referred
to as the bag-of-keypatches or bag-of-visterms. During this talk, we
will discuss recent developments at the Xerox Research Centre Europe
(XRCE) to improve on such representations.
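A minimal sketch of the visual-word histogram described above, assuming a precomputed codebook and nearest-centroid assignment with squared Euclidean distance. The function name `bow_histogram` and the toy 2-D descriptors are hypothetical and only serve the illustration; this is not the XRCE system.

```python
def bow_histogram(descriptors, codebook):
    """Count how many descriptors fall on each visual word (nearest centroid)."""
    hist = [0] * len(codebook)
    for d in descriptors:
        # Squared Euclidean distance from this descriptor to every codebook entry.
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in codebook]
        hist[dists.index(min(dists))] += 1
    return hist


# Hypothetical toy data: three 2-D "visual words" and five local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.2), (0.9, 1.1), (1.2, 0.8), (4.8, 5.3), (0.0, 0.1)]
histogram = bow_histogram(descriptors, codebook)
```

The resulting histogram has one bin per visual word and can be fed to any standard classifier, exactly as a word-count vector would be in text categorization.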
We first present a novel and practical approach to GVC based on a
universal vocabulary, which describes the content of all the considered
classes of images, and class vocabularies obtained through the
adaptation of the universal vocabulary using class-specific data. An
image is characterized by a set of histograms - one per class - where
each histogram describes whether the image content is best modeled by
the universal vocabulary or the corresponding class vocabulary. It is
shown experimentally on three very different databases that this novel
representation outperforms those approaches which characterize an image
with a single histogram.
In the second part we improve the categorizer by incorporating
geometrical information. Based on scale, orientation or closeness of the
keypatches we can consider a large number of simple geometrical
relationships, each of which can be considered as a simplistic
classifier. We select from this multitude of classifiers (several
millions in our case) and combine them effectively with the original
classifier. An improvement is demonstrated on a challenging 10 class
dataset.
Modelling Scenes with Local Descriptors and Latent Aspects
Affiliation:
K.U.Leuven, VISICS Group
Abstract:
A new approach to model visual scenes in image collections is presented,
based on local invariant features and probabilistic latent space models.
We provide answers to the following three open questions: 1) whether
the invariant local features are suited for scene (rather than object)
classification; 2) whether unsupervised latent space models can be used
for feature extraction in the classification task; and 3) whether the
latent space formulation can discover visual co-occurrence patterns,
motivating novel approaches to image organization and segmentation.
Using a dataset of 9500 images, our approach is validated on each of these
issues. First, we show with extensive experiments on binary and
multiclass scene classification tasks, that the bag-of-words
representation derived from local invariant descriptors, consistently
outperforms state-of-the-art approaches. Second, we show that
Probabilistic Latent Semantic Analysis (PLSA) generates a compact scene
representation, discriminative for accurate classification, and
significantly more robust when less training data are available. Third,
we have exploited the ability of PLSA to automatically extract visually
meaningful aspects, to propose new algorithms for aspect-based image
ranking and context-sensitive image segmentation.
Additionally, I'll discuss some planned future work, exploiting a
similar scheme based on latent aspects and local invariant features for
the integration of visual and textual data.
The TRECVID program: experiments in content-based
retrieval in video document collections
Presenter: Georges Quénot
|
9 february, at 14h00 |
F 107, INRIA Rhône-Alpes |
|
Affiliation:
CLIPS-IMAG
Abstract:
The US National Institute of Standards and Technology (NIST) and
DARPA have launched an annual evaluation campaign for content-based
retrieval systems over video document collections (TRECVID).
Systems are evaluated as a whole on a search task that is as
realistic as possible. Components or techniques needed by these
systems are evaluated independently, such as shot segmentation,
story segmentation, concept detection and camera motion detection.
We will describe the general principles of the campaign, the
different tasks and the results obtained, which are representative
of the state of the art in the field. We will also present the
various work carried out in the MRIM team and evaluated within
TRECVID.
Geometric Context from a Single Image
Presenter: Derek Hoiem
|
6 february, at 15h00 |
F 107, INRIA Rhône-Alpes |
|
Affiliation:
Robotics Institute of Carnegie Mellon University
Abstract:
Humans have an amazing ability to instantly grasp the overall 3D
structure of a scene -- ground orientation, relative positions of
major landmarks, etc -- even from a single image. This ability is
completely missing in most popular recognition algorithms, which
pretend that the world is flat and/or view it through a patch-sized
peephole. Yet it seems very likely that having a grasp of this
"geometric context" of a scene should be of great assistance for many
tasks, including recognition, navigation, and novel view synthesis.
In this talk, I will describe our first steps toward the goal of
estimating a 3D scene context from a single image. We propose to
estimate the coarse geometric properties of a scene by learning
appearance-based models of /geometric/ classes. Geometric classes
describe the 3D orientation of an image region with respect to the
camera. We provide a multiple-hypothesis segmentation framework for
robustly estimating scene structure from a single image and obtaining
confidences for each geometric label. These confidences can then
(hopefully) be used to improve the performance of many other
applications. We provide a quantitative evaluation of our algorithm on
a dataset of challenging outdoor images.
We also demonstrate its usefulness in two applications:
1) improving object detection, and
2) automatic single-view reconstruction
("Automatic Photo Pop-up", SIGGRAPH'05).
Joint work with Alexei Efros and Martial Hebert at CMU.
Computer vision using local binary patterns
Affiliation:
Information Processing Laboratory, University of Oulu, Finland
Abstract:
The local binary pattern (LBP) operator is defined as a gray-scale
invariant texture measure, derived from a general definition of
texture in a local neighborhood. Through its recent extensions, the LBP
operator has been made into a really powerful measure
of image texture, showing excellent results in many empirical studies.
The LBP operator can be seen as a unifying approach to
the traditionally divergent statistical and structural models of texture
analysis. Perhaps the most important property of
the LBP operator in real-world applications is its invariance against
monotonic gray level changes. Another equally important property
is its computational simplicity, which makes it possible to analyze
images in challenging real-time settings.
The LBP method has already been used in a large number of applications
all over the world. This talk presents an
overview of the LBP approach, emphasizing our recent research results.
Theoretical foundations of the LBP and
examples of applying it to various computer vision problems are
presented, including classification of 3D textured surfaces,
face recognition, face detection, facial expression recognition,
content-based retrieval, modeling the background and
detecting moving objects, and recognition of dynamic textures.
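The basic operator and its invariance to monotonic gray level changes can be illustrated with a short sketch. It assumes the simplest 3x3, 8-neighbour form of LBP; the function name `lbp_codes`, the bit ordering and the toy image are choices made for this example, and the recent extensions mentioned in the abstract are not covered.

```python
def lbp_codes(image):
    """Basic 3x3 LBP: one 8-bit code per interior pixel of a 2-D list of gray levels."""
    # Neighbour offsets (row, col), clockwise starting from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(image), len(image[0])
    codes = []
    for r in range(1, rows - 1):
        row_codes = []
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


# Hypothetical 4x4 gray-level image.
image = [
    [10, 20, 30, 40],
    [50, 25, 15, 60],
    [70, 80, 5, 90],
    [35, 45, 55, 65],
]
# A monotonically increasing gray-level change (here 2*g + 7) keeps every comparison.
brighter = [[2 * g + 7 for g in row] for row in image]
same_codes = lbp_codes(image) == lbp_codes(brighter)
```

Since the codes depend only on the sign of neighbour-minus-centre differences, any strictly increasing remapping of gray levels leaves them unchanged, and the per-pixel cost is eight comparisons.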
Evaluation of interest point detectors and descriptors on
infrared images
Presenter: Julien Bohné
|
11 january, at 16h00 |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
Lear, INRIA Rhône-Alpes
Abstract:
An evaluation of different interest point detectors and descriptors
applied to low-resolution infrared images will be presented. After a
brief presentation of the test method, the results of the different
algorithms will be discussed in order to highlight the advantages
and drawbacks of each technique.
Inverse chronological order.
Discriminative Regions for Semi-Supervised Object Class Localization
Affiliation:
Vision and Mobile Robotics Lab, Carnegie Mellon University
Abstract:
I will present a method for object class localization using image regions.
Image regions are extracted using unsupervised image segmentation, and provide a
natural spatial support for detection results. Each region can be classified
using both its texture content and the local interest points in and around
it. Our framework allows selection of the most discriminative features for a
given object class in a semi-supervised manner, where image labels are given but
not the pixelwise delineation of training objects. Despite the semi-supervised
training, this method allows pixelwise localization where the actual object mask
is determined, not simply a bounding box or object centre.
Discovering objects and their location in images
Presenter: Andrew Zisserman
|
5 December, 2005 at 16h00 |
Grand Amphi, INRIA Rhône-Alpes |
|
Affiliation:
Department of Engineering Science, University of Oxford
Abstract:
This is joint work with Josef Sivic, Bryan Russell, Alexei Efros, and
William Freeman.
There has been much recent research activity in recognizing object
categories (such as cars, faces, motorbikes) in images. Most
approaches start by learning a category model from a set of labelled
training images for each category. The level of supervision of these
training images can vary from segmenting in detail each object
instance, through to simply labelling the image as containing that
object category.
In this work we explore unsupervised training - we seek to discover
the object categories depicted in a set of unlabelled images. We
achieve this using a model developed in the statistical text
literature: probabilistic Latent Semantic Analysis (pLSA). In text
analysis this is used to discover topics in a corpus using the
bag-of-words document representation. Here we treat object categories
as topics, so that an image containing instances of several categories
is modeled as a mixture of topics.
The model is applied to images by using a visual analogue of a word,
formed by vector quantizing SIFT-like region descriptors. The topic
discovery approach successfully translates to the visual domain: for a
small set of objects, we show that both the object categories and
their approximate spatial layout are found without supervision.
Performance of this unsupervised method is compared to previous
supervised approaches, and we show applications to category based
retrieval in image databases and films.
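For readers unfamiliar with pLSA, the following is a minimal sketch of the standard EM fitting procedure on a document-by-word count matrix, where "documents" are images and "words" are visual words. The function name `plsa`, the initialisation, the iteration count and the toy counts are assumptions for illustration; this is not the code used in the work presented.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit pLSA by EM on a document-by-word count matrix (list of lists)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(v):
        s = sum(v)
        return [x / s for x in v]

    # Random initialisation of P(w|z) and P(z|d).
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]

    for _ in range(n_iter):
        new_w_z = [[1e-12] * n_words for _ in range(n_topics)]
        new_z_d = [[1e-12] * n_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(w|z) * P(z|d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z] + 1e-12
                                  for z in range(n_topics)])
                # M-step accumulators: expected counts assigned to each topic.
                for z in range(n_topics):
                    new_w_z[z][w] += counts[d][w] * post[z]
                    new_z_d[d][z] += counts[d][w] * post[z]
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


# Hypothetical toy corpus: 4 "images" over 4 visual words, two obvious groups.
counts = [[5, 4, 0, 0], [6, 3, 0, 0], [0, 0, 4, 6], [0, 0, 5, 5]]
p_w_z, p_z_d = plsa(counts, n_topics=2)
```

Each row of `p_z_d` is the topic mixture of one image, which is the quantity used to treat an image containing several object categories as a mixture of topics.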
Hyperfeatures - Multilevel Local Coding for Visual Recognition
Presenter: Ankur Agarwal
|
23 November, 2005 at 16h00 |
C207, INRIA Rhône-Alpes |
|
Affiliation:
Lear Project, INRIA Rhône-Alpes
Abstract:
Histograms of local appearance descriptors are a popular representation for
visual recognition. They are highly discriminant and they have good
resistance to local occlusions and to geometric and photometric variations,
but they are not able to exploit spatial co-occurrence statistics of features
at scales larger than their local input patches. We present a new multilevel
visual representation, `hyperfeatures', that is designed to remedy this. The
basis of the work is the familiar notion that to detect object parts, in
practice it often suffices to detect co-occurrences of more local object
fragments, a process that can be formalized as comparison (vector
quantization) of image patches against a codebook of known fragments,
followed by local aggregation of the resulting codebook membership vectors to
detect co-occurrences. This process converts collections of local image
descriptor vectors into slightly less local histogram vectors, i.e. higher-level
but spatially coarser descriptors. Our central observation is that it can
therefore be iterated, and that doing so captures and codes ever larger
assemblies of object parts and increasingly abstract or `semantic' image
properties. This repeated nonlinear `folding' is essentially different from
that of hierarchical models such as Convolutional Neural Networks and HMAX,
being based on repeated comparison to local prototypes and accumulation of
co-occurrence statistics rather than on repeated convolution and
rectification. We formulate the hyperfeatures model and study its performance
under several different image coding methods including clustering based
Vector Quantization, Gaussian Mixtures, and combinations of these with Latent
Dirichlet Allocation. We find that the resulting high-level features provide
improved performance in several object image and texture image classification
tasks.
Reference: Technical Report RR-5655, INRIA - Aug. 2005
Manifold Learning and Image Segmentation
Presenter: Jakob Verbeek
|
24 August, 2005 at 16h00 |
C 107, INRIA Rhône-Alpes |
|
Affiliation:
Intelligent Autonomous Systems, University of Amsterdam
Dynamic Scene Analysis using Non-Parametric Statistics
Presenter: Yoni Wexler
|
30 June, 2005 at 16h00 |
F 107, INRIA Rhône-Alpes |
|
Affiliation:
Weizmann Institute, Israel
Abstract:
Complex dynamic scenes are very difficult to model. They do not have a
well-defined geometric or parametric representation. Parametric and
geometric methods have therefore been limited in their ability to solve
real-world problems in Vision. Yet, texture and dynamic changes over
time provide rich statistical information about the scene. This
information is usually non-parametric. In this talk I will demonstrate
how by taking a non-parametric statistical approach, we are able to
solve difficult problems in the field of Computer Vision.
In particular, I will demonstrate the power of this approach through
several example problems. These include analysis, synthesis and
manipulation of complex dynamic video sequences, recovery of Epipolar
Geometry, and recovery of general unknown optical distortions without
modeling them parametrically.
Infra-red image classification
Presenter: Eric Nowak
|
14 June, 2005 at 14h00 |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
Lear Project, INRIA Rhône-Alpes
Abstract:
I will present my work on the classification of infra-red and visible images.
This work is still in progress, so I will present thoughts and experimental results on different topics, including:
- dense representation of objects (SIFT based and raw pixel based)
- feature selection
- multiclass feature selection: how to share features efficiently between classes
Object Detection with Line Segment Networks
Affiliation:
BIWI - ETHZ, Switzerland
Abstract:
We propose a system for object detection in cluttered real images, given only a hand-drawn outline as model. The edges are approximated by polygons, and the resulting line segments are organized into a novel image representation which encodes their interconnections: the Line Segment Network. The object detection problem is formulated as finding paths through the network resembling the model outline, and a computationally efficient detection algorithm is presented. As we demonstrate on several cluttered real images containing two object classes (bottles and swans), our method is capable of robust object detection and allows for considerable shape variation.
Creating Efficient Codebooks for Visual Recognition
Presenter: Frédéric Jurie
|
27 April, 2005 at 11h00 |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes, Project LEAR
Abstract:
Visual codebooks built by vector quantizing appearance descriptors of local image patches are an effective means of capturing image statistics for texture analysis and visual classification. The input patches can either densely cover the image (`texton' representation) or be restricted to a sparse set of keypoints (`local features' representation). Methods such as k-means are common choices for codebook construction. Although k-means works well for the relatively homogeneous images typical of texture analysis, we show that it gives suboptimal codebooks when faced with the highly non-uniform statistics of the natural images found in object recognition problems. We describe a ball-deletion based mean shift clusterer that scales well to large datasets, and show that its codebooks significantly outperform k-means ones on several image classification tasks. We also show that dense representations greatly outperform keypoint based ones, and that mutual information based feature selection starting from a dense codebook gives a further improvement in performance.
Feature Detection in Color Images
Affiliation:
Lear, INRIA Rhône-Alpes
Abstract:
"Colors are only symbols. Reality is to be found in luminance alone.", Picasso exclaimed in one of his blue years. His message seems to be taken to heart by the computer vision community. In general the first thing to do, when trying to interpret the content of images, when looking for objects, persons, textures, or at a smaller scale for edges, ridges, and corners, is to discard color.
In this talk I will focus on two advantages of using color for computer vision tasks. First, color provides extra photometric information that makes it possible to distinguish between various physical causes of color variations in the world, such as changes due to shadows, light source reflections, and object reflectance variations. Second, color is an important discriminative property of objects and plays an important role in the attribution of saliency. These two advantages are applied to image features, resulting in, among others, photometric invariant edge and corner detectors and color-saliency focused local features.
Semi-Local Parts and Adjacency Relations for Object Recognition
Affiliation:
Beckman Institute (University of Illinois at Urbana-Champaign)
Abstract:
This talk will describe a framework for object recognition based on
local scale- and affine-covariant image regions (keypoints) and their
spatial relations. In many existing object recognition approaches,
individual keypoints play the role of generic object parts. We have
developed a more expressive object representation based on composite
semi-local parts, defined as geometrically stable configurations of
multiple regions that are robust against (limited) viewpoint changes
and intra-class variations. Our framework includes a procedure for
learning a vocabulary of semi-local parts for representing an object
class that is weakly supervised (i.e., it works on unsegmented,
cluttered training images) and can be combined with existing feature
selection methods based on likelihood ratio or mutual information.
The talk will conclude with a discussion of work in progress, namely,
probabilistic models for combining semi-local parts and inter-part
adjacency relations.
High Dimensional Discriminant Analysis
Affiliation:
INRIA Rhône-Alpes - Project LEAR
Abstract:
We propose a new method for discriminant analysis, called High Dimensional Discriminant Analysis (HDDA). Our approach is based on the assumption that high dimensional data live in different subspaces with low dimensionality. Thus, HDDA reduces the dimension for each class independently and regularizes class conditional covariance matrices in order to adapt the Gaussian framework to high dimensional data. This regularization is achieved by assuming that classes are spherical in their eigenspace. HDDA is applied to recognize objects in natural images and its performance is compared to that of classical classification methods.
Strike a Pose: Tracking People by Finding Stylized Poses
Presenter: Deva Ramanan
|
04 February, 2005 at 1400hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
University of California, Berkeley, Computer Vision Group
Abstract:
An important, open vision problem is to automatically describe what
people are doing in a sequence of video. This problem is difficult for
several reasons. First, one needs to determine how many people (if any)
are in each frame and estimate their configurations (where they are and
what their arms and legs are doing). But finding people and localizing
their limbs is hard because people (a) wear a variety of different
clothes, (b) appear in a variety of poses and (c) tend to partially
occlude themselves and each other. Secondly, one must sew together
estimated configuration reports from across frames into a motion path;
this is tricky because people can move fast and unpredictably. Finally,
one must describe what each person is doing; this problem is poorly
understood, not least because there is no natural or canonical set of
categories into which to classify activities.
In this talk I will discuss our progress on this problem. We develop a
tracker that works in two stages; it first (a) builds a model of
appearance of each person in a video and then (b) tracks by detecting
those models in each frame ("tracking by model-building and detection").
We then marry our tracker with a motion synthesis engine that works by
re-assembling pre-recorded motion clips. The synthesis engine generates
new motions that are human-like and close to the image measurements
reported by the tracker. By using labeled motion clips, our synthesizer
also generates activity labels for each image frame ("analysis by
synthesis"). We have extensively tested our system, running it on
hundreds of thousands of frames of unscripted indoor and outdoor
activity, a feature-length film, and legacy sports footage.
Fast Image Retrieval using SIFT descriptors
Presenter: Michaël Sdika
|
21 January, 2005 at 1400hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes, LeaR
Abstract:
I will present the principles and techniques behind fast image retrieval with SIFT descriptors, as used in the team's demo, along with my contribution to the lava library. More precisely, I will talk about:
1) the new implementation of the SIFT descriptor,
2) the new angle estimator,
3) an indexing method using dimensionality reduction and a kd-tree,
4) and D. Lowe's Hough transform, which adds geometric constraints on matches.
I will conclude with some ideas on how the retrieval process could be improved.
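As an illustration of item 3, the following minimal Python sketch shows exact nearest-neighbour search in a kd-tree. It assumes the descriptors have already been reduced to a few dimensions (for example by a linear projection); the function names build_kdtree and nearest are hypothetical and are not taken from the lava library.

```python
import math
import random

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % len(points[0])            # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                   # split at the median
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return (distance, point) for the exact nearest neighbour of query."""
    if node is None:
        return best
    point, left, right = node
    d = math.dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Only cross the splitting plane if a closer point could lie beyond it.
    if abs(diff) < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best

# Synthetic 4-dimensional "reduced descriptors" for demonstration only.
random.seed(0)
descriptors = [tuple(random.random() for _ in range(4)) for _ in range(200)]
tree = build_kdtree(descriptors)
distance, match = nearest(tree, (0.5, 0.5, 0.5, 0.5))
```

The far branch of each node is visited only when the splitting plane is closer to the query than the best match found so far, which is where the speed-up over a linear scan comes from in low dimensions.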
Monocular Human Motion Capture with a Mixture of Regressors
Presenter: Ankur Agarwal
|
05 January, 2005 at 1600hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes, LEAR
Abstract:
We address 3D human motion capture from monocular images, taking a
learning-based approach to construct a probabilistic pose estimation model from a set of
labelled human silhouettes. To compensate for ambiguities in the pose
reconstruction problem, our model explicitly calculates several possible pose
hypotheses. It uses locality on a manifold in the input space and connectivity
in the output space to identify regions of multi-valuedness in the mapping from
silhouette to 3D pose. This information is used to fit a mixture of regressors
on the input manifold, giving us a global model capable of predicting the
possible poses with corresponding probabilities. These are then used in a
dynamical-model based tracker that automatically detects tracking failures and
re-initializes in a probabilistically correct manner. The system is trained on
optical sensor based motion capture data, using the corresponding real human
silhouettes supplemented with silhouettes synthesized artificially from several
different models for improved robustness to inter-person variations.
Static pose estimation is illustrated on a variety of silhouettes. The
robustness of the method is demonstrated by tracking on a real image sequence
requiring multiple automatic re-initializations.
Color Constancy from local invariant regions
Presenter: Tijmen Moerland
|
25 November, 2004 at 1600hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes, LEAR
Abstract:
This master's thesis investigates methods for combining the research fields of
color constancy and invariant region matching. Color constancy aims at
removing the influence of illumination from images so that the 'true' surface
color of objects can be seen. The color constancy algorithm used in this
thesis operates on two images and aims at approximating the joint color
change, the 'color flow'. This makes object colors invariant to illumination
changes. Other invariances, such as invariance to rotation and scaling of images
and to the appearance, disappearance and movement of objects, are achieved using DoG
keypoint detection and SIFT matching. Robust color flow estimation based on
normalized support regions makes color constancy viewpoint independent, which
is the main contribution of this work. Furthermore, the color flow algorithm
is improved by operating in hue-saturation space, which gives
robustness to shadows and highlights.
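One reason a hue-saturation representation helps with shadows can be seen in a small Python sketch, under the simplifying assumption that a shadow scales all three RGB channels by the same factor. The helper name hue_saturation is hypothetical, and this is not the color flow algorithm of the thesis.

```python
import colorsys

def hue_saturation(rgb):
    """Map an (r, g, b) triple in [0, 1] to its (hue, saturation) pair."""
    h, s, _value = colorsys.rgb_to_hsv(*rgb)
    return h, s

surface = (0.8, 0.4, 0.2)
# Assumed shadow model: the same surface seen at half the intensity.
shadowed = tuple(0.5 * c for c in surface)

hs_lit = hue_saturation(surface)
hs_shadow = hue_saturation(shadowed)
```

Under this model the two (hue, saturation) pairs agree, while the discarded value channel absorbs the intensity change. Real shadows and highlights deviate from pure scaling, so the invariance is only approximate in practice.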
Summary of Summer school in Machine Learning
Presenter: Ankur Agarwal
|
04 November, 2004 at 1600hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes, Project LEAR
Abstract:
For details, click here
Summary of International Workshop on Object Recognition
Presenter: Frédéric Jurie
|
28 October, 2004 at 1600hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes, Project LEAR
Abstract:
For details, click here
Detecting Keypoints with Stable Position, Orientation and Scale under Illumination Changes
Presenter:
Bill Triggs
|
17 June, 2004 at 1700hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes, Project LEAR
Abstract:
Local feature approaches to vision geometry and object recognition
are based on selecting and matching sparse sets of visually salient
image points, known as 'keypoints' or 'points of interest'. Their
performance depends critically on the accuracy and reliability with
which corresponding keypoints can be found in subsequent images.
Among the many existing keypoint selection criteria, the popular
Förstner-Harris approach explicitly targets geometric stability,
defining keypoints to be points that have locally maximal
self-matching precision under translational least squares template
matching. However, many applications require stability in orientation
and scale as well as in position. Detecting translational keypoints
and verifying orientation/scale behaviour post hoc is suboptimal, and
can be misleading when different motion variables interact. We give a
more principled formulation, based on extending the
Förstner-Harris approach to general motion models and robust
template matching. We also incorporate a simple local appearance model
to ensure good resistance to the most common illumination variations.
We illustrate the resulting methods and quantify their performance on
test images.
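For readers unfamiliar with the translational case that the talk generalizes, the standard Förstner-Harris quantities can be written as follows. The window W, the noise variance \sigma^2 and the constant k are notational assumptions introduced here, not symbols from the abstract.

```latex
% Scatter matrix (structure tensor) of the image gradient over a window W:
A = \sum_{x \in W} \nabla I(x)\, \nabla I(x)^{\top}

% Covariance of the translation estimated by least squares template matching,
% for i.i.d. image noise of variance \sigma^2:
\operatorname{Cov}(\hat{t}) \approx \sigma^{2} A^{-1}

% Harris-style saliency score, with an empirical constant k (often 0.04 to 0.06):
R = \det A - k \, (\operatorname{tr} A)^{2}
```

Keypoints are local maxima of such a score, that is, points where A is large in every direction and the translational match is therefore precise. Extending the motion model beyond translation enlarges A to cover the additional parameters such as rotation and scale.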
Title Unknown
Presenter:
Michel Dhome
|
28 April, 2004 at 1430hrs |
F 107, INRIA Rhône-Alpes |
|
Affiliation:
LASMEA, Université Blaise Pascal
Abstract:
Michel Dhome (LASMEA, Clermont-Ferrand) will present his recent work on real-time scene reconstruction using a
moving camera, carried by a car that is driven manually in a city-like environment.
The scene is then reconstructed automatically, which later allows a car to drive
autonomously along the learned trajectory.
Learning 3D Human Pose from Silhouettes
Presenter:
Ankur Agarwal
|
24 March, 2004 at 1530hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes, Project LEAR
Abstract:
I will describe a sparse Bayesian regression method for recovering 3D
human body motion from single images and monocular video sequences. The method
requires neither an explicit body model nor prior labelling of body
parts in the image. Instead, it recovers pose by direct nonlinear regression
against shape descriptor vectors extracted automatically from image
silhouettes. For robustness against local silhouette segmentation errors,
silhouette shape is encoded by histogram-of-shape-contexts descriptors.
Different regressors are evaluated for the main regression, and a Relevance
Vector Machine (RVM) regressor is used to provide a sparse regressor without
compromising performance. The regression scheme is also extended into a
tracking framework by combining a learned autoregressive dynamical model with
the robust shape descriptors. The methods are demonstrated on a 54-parameter
full body pose model, both quantitatively using motion capture based test
sequences, and qualitatively on a test video sequence.
Bandelets and the Geometric Representation of Images
Presenter:
Erwan Le Pennec
|
03 March, 2004 at 1100hrs |
F 107, INRIA Rhône-Alpes |
|
Affiliation:
CMAP, Ecole Polytechnique
Abstract:
The search for efficient signal representations is at the heart of
signal processing, for applications such as compression,
estimation and inverse problems. For images, representation
in a wavelet basis is suboptimal because it does not exploit
their geometric regularity. Bandelets are
built for precisely this purpose. After presenting them, we will show
that they yield optimal nonlinear approximation results.
These properties will be illustrated by applications to compression and
denoising.
Reading of: New Algorithms for Efficient High-Dimensional Nonparametric Classification
Affiliation:
INRIA Rhône-Alpes, Project LEAR
Abstract:
The reading group is about non-approximate acceleration of high-dimensional operations,
such as classification, using basic properties of ball trees (similar to kd-trees).
Salil and Peter will present a short introduction to ball-tree algorithms and
summarize the paper, and then discussion will follow.
A local copy can be obtained from: /home/edgar/carbonet/public/liu-moore.ps.gz
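The basic ball property behind this kind of exact acceleration is the triangle inequality: no point inside a ball can be closer to the query than the distance to the ball's center minus its radius. The Python sketch below applies that bound to a flat list of balls rather than a full tree; the names make_ball and nearest_with_balls are hypothetical and the code is not the algorithm of the paper.

```python
import math

def make_ball(points):
    """Enclose points in a ball: (center, radius, points)."""
    dim = len(points[0])
    center = tuple(sum(p[i] for p in points) / len(points) for i in range(dim))
    radius = max(math.dist(center, p) for p in points)
    return center, radius, points

def nearest_with_balls(balls, query):
    """Exact nearest neighbour; returns (distance, point, balls_visited)."""
    def lower_bound(ball):
        center, radius, _ = ball
        return math.dist(center, query) - radius

    best_d, best_p, visited = math.inf, None, 0
    # Visit the most promising balls first (smallest lower bound).
    for ball in sorted(balls, key=lower_bound):
        if lower_bound(ball) >= best_d:
            break  # every remaining ball has an even larger bound
        visited += 1
        for p in ball[2]:
            d = math.dist(p, query)
            if d < best_d:
                best_d, best_p = d, p
    return best_d, best_p, visited

cluster_a = [(0, 0), (1, 0), (0, 1)]
cluster_b = [(10, 10), (11, 10), (10, 11)]
balls = [make_ball(cluster_a), make_ball(cluster_b)]
best_d, best_p, visited = nearest_with_balls(balls, (0.2, 0.1))
```

Here the second ball is skipped entirely because its lower bound already exceeds the best distance found in the first, yet the answer is still exact. A ball tree applies the same test recursively at every node.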
Kernel Fisher discriminant for texture segmentation
Presenter: Jianguo Zhang
|
05 February, 2004 at 1700hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes, Project LEAR
Abstract:
Kernel Fisher discriminant (KFD) is a state-of-the-art nonlinear machine learning method,
and it has great potential to outperform the linear Fisher discriminant. In this talk, I will present a nonlinear
discriminative texture feature extraction method based on KFD for texture classification.
It is also mathematically shown that finding the optimal discriminative
texture features is equivalent to finding the optimal discriminative
projection directions of the input data by KFD. The KFD-based method
integrates texture feature extraction, nonlinear dimensionality
reduction, and discrimination in a unified framework. Optimized, closed-form solutions are
derived for both the two-class and the multi-class texture classification problems. Extensive
experimental results clearly show that the proposed method yields excellent performance in
texture classification and outperforms other kernel-based texture classification methods.
If time allows, I will also present part of my previous work
on MRI tumor segmentation by one-class SVM learning.
The abstract is as follows:
In image segmentation, one challenge is how to deal with the nonlinearity
of real data distributions, which often means that segmentation
methods need more human interaction and give unsatisfactory
results. In this talk, we formulate this research issue as a one-class
learning problem from both theoretical and practical
viewpoints, with application to medical image segmentation.
To that end, a novel and user-friendly tumor segmentation method is
proposed that uses a one-class support vector machine (SVM),
which can learn the nonlinear distribution of the
tumor data without using any prior knowledge about the data distribution.
Extensive experimental results obtained from real patients' medical images clearly
show that the proposed unsupervised one-class SVM segmentation
method outperforms the supervised two-class SVM segmentation
method in segmentation accuracy, with less human intervention.
Improving KD Trees. L-infinity distance for Triangulation
Presenter: Richard Hartley
|
21 January, 2004 at 1600hrs |
Grand Amphi, INRIA Rhône-Alpes |
|
Affiliation:
The Australian National University
Human detection based on a probabilistic assembly of robust part detectors
Affiliation:
Robotics Research Group, University of Oxford
Abstract:
I will present a novel method for human detection which can detect
pedestrians as well as close-up views of humans in the presence of
clutter and occlusion. Humans are modeled as flexible assemblies of
parts. The key to the approach is robust part detection. The
part detectors are based on gradient and Laplacian based local
features which efficiently capture the shape information.
Using the probabilistic co-occurrence of these features increases
their distinctiveness while the robustness remains the same. Learning
with AdaBoost combines features with the highest co-occurrence
probabilities.
Furthermore, the parts include a larger local context than in previous
part-based work [Forsyth'97,Ronfard02] and they are therefore more
distinctive. They are also not global (cf. previous work on pedestrian
detectors [Papageorgiou'00]) and they therefore allow for
occlusion and the detection of close-up views. The detection results
are further improved by computing a probabilistic score for the
assembly of parts which takes into account their relative
position. The approach is also very efficient as (i) all part
detectors use the same initial features, (ii) a coarse-to-fine cascade
approach is used for part detection, (iii) an assembly strategy
reduces the number of spurious detections and the search space. The
results are very promising and outperform existing human detectors.
Transductive Learning for Scene Classification
Presenter: Bill Triggs
|
18 December, 2003 at 1700hrs |
C 208, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes - Project LEAR
Scale-Invariant Shape Features for Recognition of Object Categories
Presenter: Frédéric Jurie
|
04 December, 2003 at 1600hrs |
C 208, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes - Project LEAR
Abstract:
In this talk we introduce a new method for extracting shape interest
regions which capture the local structure of the contour image.
They are in spirit similar to local interest points extracted from
grey-level images, but describe the shape instead of the texture.
Our approach detects local shape convexities in scale-space.
The detection is based on a robust measure, the entropy of the
gradient orientations in the neighborhood of a circle defined by the
scale. The detected regions allow for clutter, occlusions
as well as spurious detections and are invariant to scale changes and
rotations. Experimental results show a very good performance for shape
matching and recognition of object categories.
Summary:
We present a new method for detecting shape-based interest regions,
which captures the local structure of image contours.
It is designed in the same spirit as local interest point detectors
that work on grey-level images, but it describes shape rather than texture.
Our approach describes local shape convexities in scale-space.
The regions are detected robustly despite occlusions, image noise
and scale changes. Experimental results show
very good performance for shape matching
and recognition of object categories.
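The entropy measure mentioned in the abstract can be illustrated with a short Python sketch. It assumes that the gradient orientations sampled around the circle are already available as angles in radians, and that a contour wrapping around the circle spreads those orientations over all directions; the function name and the choice of 8 histogram bins are hypothetical.

```python
import math

def orientation_entropy(angles, bins=8):
    """Shannon entropy, in bits, of a histogram of orientations (radians)."""
    if not angles:
        raise ValueError("at least one orientation is required")
    counts = [0] * bins
    for a in angles:
        # Wrap the angle into [0, 2*pi) and find its histogram bin.
        idx = int((a % (2 * math.pi)) / (2 * math.pi) * bins) % bins
        counts[idx] += 1
    total = len(angles)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Orientations spread evenly around a circle (a closed, convex contour).
closed_contour = [2 * math.pi * k / 64 + 0.01 for k in range(64)]
# Orientations that all point the same way (a straight edge).
straight_edge = [0.3] * 64

high = orientation_entropy(closed_contour)
low = orientation_entropy(straight_edge)
```

The evenly spread orientations reach the maximum entropy of log2(8) = 3 bits, while the straight edge gives zero, so a threshold or local maximum on this score favours regions enclosed by contour.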
Unsupervised Statistical Models for General Object Recognition
Presenter: Peter Carbonetto
|
27 November, 2003 at 1530hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes - Project LEAR
Abstract:
I will present an overview of the work I did for my Master's thesis at the
University of British Columbia. I will also touch upon some major issues I
uncovered in my work and discuss some future directions for research.
We approach the object recognition problem as the process of attaching
meaningful labels to specific regions of an image. Given a set of images
and their captions, we segment the images, then learn the proper associations
between words and regions. Previous models are limited by the scope of the
representation, and performance is constrained by noise from poor initial
clusterings of the image features. We propose three improvements that address these issues.
Related papers:
1. Bayesian feature weighting for unsupervised learning, with
application to object recognition. P. Carbonetto, N. de Freitas,
P. Gustafson and N. Thompson. AI-Stats, 2003.
PDF
2. Why can't Jose read? The problem of learning semantic associations in
a robot environment. P. Carbonetto and N. de Freitas. HLT Conference
Workshop on Learning Word Meaning from Non-Linguistic Data, 2003.
PDF
3. A Statistical Model for General Contextual Object Recognition.
P. Carbonetto, N. de Freitas and K. Barnard. Submitted to ECCV 2004.
(local intranet access -- /home/albireo/carbonet/eccv2004.pdf)
Direct Learning of the Inverse Jacobian Matrix of a Function
Presenter: Frédéric Jurie
|
6 November, 2003 at 1600hrs |
F 107, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes - Project LEAR
Also
Université Blaise Pascal, Project LASMEA
Abstract:
A method to estimate the inverse Jacobian matrix of a
function, without computing the direct Jacobian matrix, is presented. This kind of inverse
Jacobian matrix proves to perform much better in modeling a relation $\theta
= f^{-1}(x)$ (where parameters $\theta$ are to be computed from observations
$x$) than the traditional computation of the Moore-Penrose inverse.
Theoretical insight, as well as comparisons in domains such as visual
servoing and tracking, will be provided to support this
claim.
Summary:
A method will be presented for estimating the inverse Jacobian matrix
of a function without computing the Jacobian matrix
itself. In inversion problems (computing the parameters of a model
from measurements), this kind of inverse Jacobian matrix has better
properties than the Moore-Penrose method.
Some ideas on the theoretical aspects will also be presented, along with
comparisons in various application domains of vision, such as
visual servoing and object tracking.
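The contrast between the two estimates can be sketched for a locally linear model. The formulation below is an assumption made for illustration (training pairs stacked as matrix columns, ordinary least squares) and is not taken from the abstract.

```latex
% Training pairs (\delta\theta_i, \delta x_i), i = 1..N, stacked as the
% columns of \Theta and X (hypothetical notation).

% Direct approach: fit the inverse map \delta\theta \approx A\,\delta x
A = \arg\min_{A} \| \Theta - A X \|_F^2 = \Theta X^{\top} (X X^{\top})^{-1}

% Indirect approach: fit the forward Jacobian \delta x \approx J\,\delta\theta,
% then take its Moore-Penrose inverse
J = X \Theta^{\top} (\Theta \Theta^{\top})^{-1}, \qquad
J^{+} = (J^{\top} J)^{-1} J^{\top}
```

Both expressions assume the matrices being inverted are non-singular. The direct estimate A minimizes the error in the parameters themselves, whereas J^{+} inverts a fit that minimized error in the observations, so the two generally differ when the relation is noisy or nonlinear.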
Texture Recognition Using Affine-Invariant Regions
Affiliation:
Beckman Institute (University of Illinois at Urbana-Champaign)
Abstract:
This talk will discuss texture representations using affine-invariant interest points.
A model of a texture is constructed from a sparse set of image locations characterized by local
appearance and affine shape. For more descriptive power, it is possible to incorporate
neighborhood constraints based on co-occurrence statistics. Applications include retrieval,
classification, and segmentation of images of textured surfaces under a wide range of
transformations, including viewpoint changes and non-rigid deformations.
Other links:
Related papers
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce,
"Affine-Invariant Local Descriptors and Neighborhood Statistics for
Texture Recognition," ICCV 2003.
Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce,
"A Sparse Texture Representation Using Affine-Invariant Regions,"
CVPR 2003, vol. II, pp. 319-324.
Dimensionality Reduction Methods for Unfolding the Cortical Ribbon
Affiliation:
INRIA Rhône-Alpes - Project LEAR
Other links:
Presentation slides (pdf)
Related article (pdf)
Learning Dynamical Models for Tracking Complex Motion
Presenter: Ankur Agarwal
|
18 September, 2003 at 1600hrs |
C 207, INRIA Rhône-Alpes |
|
Affiliation:
INRIA Rhône-Alpes - Project LEAR
Abstract:
I will address the problem of tracking complex human motions in monocular
video sequences. Mainly, I will describe a new approach to modelling the
non-linear and time-varying dynamics of generic human motions, using
statistical methods to exploit structured motion patterns that exist in
typical human activities. The method receives, as input, a set of
hand-labelled motion sequences and it learns a piecewise dynamical model
based on Gaussian autoregressive processes by automatically constructing
connected regions in parameter space that exhibit similar dynamical
characteristics. It also automatically partitions the state space into a
number of classes corresponding to different motion patterns, making it
useful for activity recognition.
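A heavily simplified Python sketch of the piecewise autoregressive idea follows. It assumes scalar, first-order dynamics and segments that are already labelled, whereas the talk describes regions constructed automatically in parameter space; the function names and the synthetic data are hypothetical.

```python
def fit_ar1(series):
    """Least-squares estimate of a in the model x[t] = a * x[t-1] + noise."""
    pairs = list(zip(series, series[1:]))      # (x[t-1], x[t]) pairs
    num = sum(prev * cur for prev, cur in pairs)
    den = sum(prev * prev for prev, _ in pairs)
    return num / den

def simulate_ar1(a, x0, n):
    """Generate n noise-free samples of x[t] = a * x[t-1], starting at x0."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(a * xs[-1])
    return xs

# Two labelled segments with different dynamics (synthetic data).
segments = {
    "slow_decay": simulate_ar1(0.9, 1.0, 30),
    "fast_decay": simulate_ar1(0.5, 1.0, 30),
}
# One autoregressive model per segment gives a piecewise dynamical model.
models = {label: fit_ar1(xs) for label, xs in segments.items()}
```

Each segment recovers its own coefficient, and a new observation could then be assigned to the segment whose model predicts it best, which is the link to activity recognition.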
The Trade-off Between Generative and Discriminative Classifiers
Affiliation:
INRIA Rhône-Alpes - Project LEAR
Abstract:
Given any generative classifier based on an inexact density model,
we can define a discriminative counterpart that reduces its asymptotic error rate.
We introduce a family of parameter estimation problems that interpolates between
the two approaches, thus providing a new way to compare them and giving an
estimation procedure whose classification performance is well balanced between the bias
of generative classifiers and the variance of discriminative ones. We show that
an intermediate trade-off between the two strategies is often preferable,
both theoretically and in experiments on real data.
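One natural way to write such an interpolating family is shown below. The abstract does not give the formula, so the notation and the exact parameterization by \lambda are assumptions made here for illustration.

```latex
% Joint (generative) and conditional (discriminative) log-likelihoods
% of a model p(x, y; \theta) on training pairs (x_i, y_i):
\mathcal{L}_{\mathrm{gen}}(\theta)  = \sum_i \log p(x_i, y_i; \theta)
\mathcal{L}_{\mathrm{disc}}(\theta) = \sum_i \log p(y_i \mid x_i; \theta)

% Interpolated objective, with \lambda \in [0, 1]:
\mathcal{L}_{\lambda}(\theta)
  = \lambda \, \mathcal{L}_{\mathrm{gen}}(\theta)
  + (1 - \lambda) \, \mathcal{L}_{\mathrm{disc}}(\theta)
  = \mathcal{L}_{\mathrm{disc}}(\theta) + \lambda \sum_i \log p(x_i; \theta)
```

Setting \lambda = 1 recovers the generative estimate and \lambda = 0 the discriminative one. Intermediate values weight the marginal term \log p(x_i; \theta), which acts as a regularizer on the discriminative fit.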