We construct a fully automatic, large-scale system for the visual retrieval of realistic samples of different human action classes from TV series and movies.
We first propose a text-driven approach using the synchronization of transcripts and subtitles.
In practice, this yields a substantial proportion of irrelevant samples. We handle them by ranking the whole retrieved set by visual consistency, using partially incorrect training data.
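As a rough illustration of the text-driven retrieval step, the sketch below matches the speech in a transcript to time-stamped subtitles and carries the time interval over to the scene description that follows the speech. The toy data, the function names (`normalize`, `similarity`, `align`, `retrieve`), the `min_ratio` threshold and the keyword lookup are all assumptions made for this example. They are not taken from the report.

```python
from difflib import SequenceMatcher

# Hypothetical toy data. Subtitles carry timestamps (start, end, text);
# transcript entries carry speech plus an optional scene description.
subtitles = [
    (12.0, 14.5, "I told you not to come here."),
    (15.0, 17.0, "Just open the door, please."),
    (40.0, 42.0, "Sit down, we need to talk."),
]
transcript = [
    {"speech": "I told you not to come here!", "action": None},
    {"speech": "Just open the door. Please.", "action": "He opens the door."},
    {"speech": "Sit down. We need to talk.", "action": "She sits down on the chair."},
]


def normalize(text):
    """Lowercase, drop punctuation, and split into words."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.split()


def similarity(a, b):
    """Word-level similarity in [0, 1] between two lines of speech."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def align(transcript, subtitles, min_ratio=0.6):
    """Give each transcript entry the time span of its best-matching subtitle."""
    aligned = []
    for entry in transcript:
        best = max(subtitles, key=lambda sub: similarity(entry["speech"], sub[2]))
        if similarity(entry["speech"], best[2]) >= min_ratio:
            aligned.append((best[0], best[1], entry["action"]))
    return aligned


def retrieve(aligned, keyword):
    """Return time spans whose scene description mentions the keyword."""
    return [(start, end) for start, end, action in aligned
            if action and keyword in action.lower()]


aligned = align(transcript, subtitles)
print(retrieve(aligned, "open"))
```

A plain substring lookup like this one would also match unrelated words (for example "sit" inside "visit"), and a description near a line of speech does not guarantee that the action is visible in that interval. Both are plausible sources of the irrelevant samples mentioned above.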
Our main contribution is to provide a new generic and fully automatic iterative training scheme for support vector regression that can handle such erroneous supervision to improve the ranking quality.
We validate our approach through experiments on realistic video data, showing that it outperforms existing state-of-the-art unsupervised and supervised methods.
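This summary does not spell out the iterative training scheme, so the following is only a generic sketch of the idea of retraining on noisy labels. It is not the algorithm from the report. A simple centroid-difference scorer stands in for support vector regression, and the 2-D features, the `background` set, the sigmoid relabelling and every function name are assumptions made for illustration.

```python
import math

# Hypothetical 2-D "visual features". The first four retrieved samples
# show the action; the last two are irrelevant but were retrieved anyway.
retrieved = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 0.8),
             (-0.9, -1.0), (-1.1, -0.8)]
# Hypothetical background samples assumed not to show the action.
background = [(-1.0, -1.0), (-0.8, -1.2), (-1.2, -0.9), (0.0, -1.0)]


def weighted_mean(points, weights):
    """Weighted centroid of a list of 2-D points."""
    total = sum(weights)
    return tuple(sum(w * p[d] for p, w in zip(points, weights)) / total
                 for d in range(2))


def iterative_ranking(retrieved, background, n_iter=5):
    """Alternate between fitting a scorer and softly relabelling the samples."""
    # Start by trusting the text-based retrieval: every target is 1.
    targets = [1.0] * len(retrieved)
    neg = weighted_mean(background, [1.0] * len(background))
    scores = [0.0] * len(retrieved)
    for _ in range(n_iter):
        # Stand-in for SVR training: a linear scorer defined by the
        # target-weighted positive centroid and the negative centroid.
        pos = weighted_mean(retrieved, targets)
        direction = tuple(p - n for p, n in zip(pos, neg))
        midpoint = tuple((p + n) / 2 for p, n in zip(pos, neg))
        scores = [sum((x[d] - midpoint[d]) * direction[d] for d in range(2))
                  for x in retrieved]
        # Soft relabelling: inconsistent samples get targets near 0,
        # so they weigh less in the next round.
        targets = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    ranking = sorted(range(len(retrieved)), key=lambda i: -scores[i])
    return ranking, targets


ranking, targets = iterative_ranking(retrieved, background)
print(ranking)
```

In this toy setting the two irrelevant samples end up at the bottom of the ranking, because each round lowers their targets and pulls the positive centroid toward the visually consistent cluster.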
You can download my report here (.pdf, 14 MB).
© 2007-2009 Adrien GAIDON