This paper presents an approach for mining visual actions from real-world
videos. Given a large number of movies, we want to automatically extract short
video sequences corresponding to visual human actions. First, we retrieve
actions by mining verbs extracted from the transcripts aligned with the videos.
Not all of these samples visually characterize the action; we therefore
rank the retrieved videos by visual consistency. We investigate two unsupervised
outlier detection methods: one-class Support Vector Machine (SVM) and densest
component estimation of a similarity graph. Alternatively, we show how to use
automatic weak supervision provided by a random background class, either by
directly applying a binary SVM, or by using an iterative re-training scheme for
Support Vector Regression machines (SVR). Experimental results explore actions
in 144 episodes of the TV series "Buffy the Vampire Slayer" and show: (a) the
applicability of our approach to a large-scale set of real-world videos, (b)
the importance of visual consistency for ranking videos retrieved from text,
(c) the added value of random non-action samples, and (d) the ability of our
iterative SVR re-training algorithm to handle weak supervision. The quality of
the rankings obtained is assessed on manually annotated data for six different
action classes.
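
To make the unsupervised ranking step more concrete, below is a minimal sketch of how
retrieved clips could be ranked by visual consistency with a one-class SVM. It assumes each
clip is already summarized by a fixed-length visual descriptor (for instance a
bag-of-visual-words histogram) and uses scikit-learn; the function name, kernel and
parameter values are illustrative choices, not the exact settings used in the paper.

    import numpy as np
    from sklearn.svm import OneClassSVM

    def rank_by_visual_consistency(clip_features, nu=0.2):
        """Rank retrieved clips so that visually consistent ones come first.

        clip_features: (n_clips, n_dims) array of per-clip visual descriptors
        (e.g. bag-of-visual-words histograms -- an illustrative choice).
        """
        X = np.asarray(clip_features, dtype=float)
        # Fit the one-class SVM on all retrieved clips; clips that do not
        # match the dominant visual pattern receive low decision values.
        model = OneClassSVM(kernel="rbf", nu=nu, gamma="scale").fit(X)
        scores = model.decision_function(X)
        # Higher score = more consistent with the majority of the samples.
        return np.argsort(-scores), scores

The returned permutation would place the clips most likely to depict the queried action at
the top of the list.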
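
The alternative graph-based ranking can be sketched with a greedy densest-subgraph
(peeling) heuristic on a pairwise clip-similarity graph. The similarity threshold and the
peeling criterion below are assumptions made for illustration; the exact graph construction
used in the paper may differ.

    import numpy as np

    def densest_component(similarity, threshold=0.5):
        """Greedy peeling estimate of the densest component of a clip graph.

        similarity: symmetric (n, n) matrix of pairwise visual similarities.
        threshold: similarities below this value count as no edge
        (illustrative value).
        """
        A = (np.asarray(similarity, dtype=float) >= threshold).astype(float)
        np.fill_diagonal(A, 0.0)
        alive = list(range(len(A)))
        best_density, best_set = -1.0, list(alive)
        while alive:
            sub = A[np.ix_(alive, alive)]
            density = sub.sum() / (2.0 * len(alive))  # number of edges per node
            if density > best_density:
                best_density, best_set = density, list(alive)
            # Remove the clip least connected to the remaining ones.
            alive.pop(int(np.argmin(sub.sum(axis=1))))
        return best_set

Clips inside the returned densest component would be treated as visually consistent
examples of the action, and the remaining clips as likely outliers.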
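
Finally, the weakly supervised variant with a random background class can be sketched as
follows: text-retrieved clips start as noisy positives (target 1), randomly sampled clips
act as negatives (target 0), and the positive targets are re-estimated from the SVR
predictions at each iteration. The number of iterations, the blending factor and the
parameter values are illustrative assumptions rather than the paper's exact procedure.

    import numpy as np
    from sklearn.svm import SVR

    def iterative_svr_ranking(action_feats, background_feats,
                              n_iters=5, blend=0.5, C=1.0):
        """Weakly supervised ranking of text-retrieved action clips."""
        X = np.vstack([action_feats, background_feats])
        n_pos = len(action_feats)
        # Initial targets: 1 for retrieved clips, 0 for random background clips.
        y = np.concatenate([np.ones(n_pos), np.zeros(len(background_feats))])
        for _ in range(n_iters):
            svr = SVR(kernel="rbf", C=C).fit(X, y)
            pred = np.clip(svr.predict(X[:n_pos]), 0.0, 1.0)
            # Soften the noisy positive labels towards the model's predictions,
            # keeping the background targets fixed at 0.
            y[:n_pos] = blend * y[:n_pos] + (1.0 - blend) * pred
        scores = svr.predict(np.asarray(action_feats))
        return np.argsort(-scores), scores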
Related publication
The paper Mining visual actions from movies,
published at the BMVC 2009 conference, presents our approach and gives
some results on the TV show Buffy the Vampire Slayer.
A poster is available here and a one-page abstract can be viewed
here.