Recognizing complex activities
Problem
Supervised activity classification in videos
Activity: complex actions characterized by spatio-temporal relations between a variable number of parts
Goal: automatically identify motion components and exploit both their contents and their relations to improve recognition
Proposed approach
- Describe motion content using dense tracklets: short, fixed-duration point trajectories [Wang CVPR'11]
- Hierarchically decompose the set of tracklets of a video using divisive spectral clustering
- Classify with an SVM on tree-structured activity models, using a hierarchical kernel on nested histograms
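To make the first step concrete, here is a minimal sketch of what a tracklet is. It assumes point tracks are already available from some external tracker. The function names, the tracklet length of 5 points, and the displacement-based descriptor are illustrative assumptions, not details stated on this poster.

```python
from math import hypot

def cut_into_tracklets(track, length=5):
    """Split one point track [(x, y), ...] into consecutive tracklets of
    exactly `length` points; a trailing remainder is dropped.
    (`length=5` is an arbitrary illustrative value.)"""
    return [track[i:i + length]
            for i in range(0, len(track) - length + 1, length)]

def shape_descriptor(tracklet):
    """Frame-to-frame displacements, normalized by their total magnitude
    (an assumed descriptor, used here only for illustration)."""
    deltas = [(x1 - x0, y1 - y0)
              for (x0, y0), (x1, y1) in zip(tracklet, tracklet[1:])]
    total = sum(hypot(dx, dy) for dx, dy in deltas)
    if total == 0:  # static point: no motion to describe
        return [(0.0, 0.0)] * len(deltas)
    return [(dx / total, dy / total) for dx, dy in deltas]

# Toy track: a point moving right by one pixel per frame for 11 frames.
track = [(float(t), 0.0) for t in range(11)]
tracklets = cut_into_tracklets(track, length=5)
descriptor = shape_descriptor(tracklets[0])
```

Every tracklet has the same duration, so descriptors of different tracklets have the same dimension and can be compared or quantized directly.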
Extracting motion information
Video data
Dense Tracklets
Camera Motion Compensation
Tracklets on stabilized video
Structuring motion information
Hierarchical motion decomposition
Greedy top-down bi-partitioning of the set of tracklets
Example leaf labels: the leaves alone are not enough (over-segmentation)
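The greedy top-down bi-partitioning can be sketched as a plain recursion that keeps every intermediate node, so coarser groupings remain available when the leaves over-segment. This is a sketch under assumptions: `build_tree`, `median_split`, `min_size` and `max_depth` are hypothetical names and stopping rules, and the toy median split merely stands in for the actual cut criterion.

```python
def build_tree(items, split, min_size=2, max_depth=3, depth=0):
    """Recursively bisect `items` into a binary tree of nested dicts."""
    node = {"items": list(items), "children": []}
    if depth >= max_depth or len(items) < 2 * min_size:
        return node  # too deep or too small to split further
    left, right = split(items)
    if len(left) < min_size or len(right) < min_size:
        return node  # reject degenerate cuts and keep this node as a leaf
    node["children"] = [
        build_tree(left, split, min_size, max_depth, depth + 1),
        build_tree(right, split, min_size, max_depth, depth + 1),
    ]
    return node

def median_split(values):
    """Toy stand-in for the real cut: split 1-D values at the median."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def leaves(node):
    """Collect the item lists stored at the leaves, left to right."""
    if not node["children"]:
        return [node["items"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]

# Eight toy "tracklets", each reduced to a single scalar for illustration.
tree = build_tree([0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3],
                  median_split, min_size=2, max_depth=2)
leaf_groups = leaves(tree)
```

In this toy run the first cut already separates the two natural groups, while the leaves split each group again; the internal nodes keep the coarser grouping.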
Hierarchical spectral divisive clustering
Hierarchically cluster tracklets using:
- The Nyström approximation for spectral embedding [Fowlkes et al. PAMI'04]
- Recursive thresholding along embedding dimensions (relaxation of NCut [Shi and Malik PAMI'00])
- Greedy minimization of a spatio-temporal connectedness cost (handles instabilities and ambiguities)
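As an illustration of thresholding along one embedding dimension, the sketch below sweeps candidate thresholds over a 1-D embedding and keeps the bipartition with the lowest normalized cut. It assumes the affinity matrix `W` and the embedding values are already given (here they are hand-made toy values, not a Nyström output), and it does not model the spatio-temporal connectedness cost. The function names are hypothetical.

```python
def ncut(W, in_a):
    """Normalized cut of the bipartition given by the boolean mask `in_a`,
    for a symmetric affinity matrix W (list of lists)."""
    n = len(W)
    cut = assoc_a = assoc_b = 0.0
    for i in range(n):
        for j in range(n):
            w = W[i][j]
            if in_a[i]:
                assoc_a += w   # total connection from side A to all nodes
            else:
                assoc_b += w   # total connection from side B to all nodes
            if in_a[i] != in_a[j]:
                cut += w
    cut /= 2.0  # each crossing pair was counted twice (i->j and j->i)
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")
    return cut / assoc_a + cut / assoc_b

def best_threshold_split(W, embedding):
    """Try a threshold midway between each pair of consecutive distinct
    embedding values; return (score, threshold, mask) with lowest NCut."""
    values = sorted(set(embedding))
    best = (float("inf"), None, None)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        mask = [v <= t for v in embedding]
        score = ncut(W, mask)
        if score < best[0]:
            best = (score, t, mask)
    return best

# Toy data: two tightly linked pairs, weakly linked to each other.
W = [[0.0, 1.0, 0.1, 0.0],
     [1.0, 0.0, 0.0, 0.1],
     [0.1, 0.0, 0.0, 1.0],
     [0.0, 0.1, 1.0, 0.0]]
embedding = [-0.6, -0.4, 0.4, 0.6]
score, threshold, mask = best_threshold_split(W, embedding)
```

Applying the same sweep recursively to each side, possibly along a different embedding dimension, gives the divisive hierarchy.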
Tree-structured activity models
BOF-Tree activity model
BOF-Tree: nested bag-of-features (BOF) histograms of motion features
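The nesting can be sketched as follows: each leaf counts the quantized features of its own tracklets, and each internal node sums the histograms of its children, since its tracklet set is the union of theirs. The dict-based tree layout, the `words` list (one visual-word index per tracklet) and the function name are assumptions made for illustration.

```python
from collections import Counter

def attach_histograms(node, words):
    """Fill node["hist"] bottom-up and return it.
    A leaf counts the visual words of its own tracklets; an internal node
    sums the histograms of its children."""
    if node["children"]:
        hist = Counter()
        for child in node["children"]:
            hist += attach_histograms(child, words)
    else:
        hist = Counter(words[i] for i in node["tracklets"])
    node["hist"] = hist
    return hist

# Six tracklets, each already quantized to a visual-word index (toy values).
words = [0, 1, 1, 2, 2, 2]
leaf_a = {"tracklets": [0, 1, 2], "children": []}
leaf_b = {"tracklets": [3, 4, 5], "children": []}
root = {"children": [leaf_a, leaf_b]}
attach_histograms(root, words)
```

With this layout the root histogram is the ordinary bag-of-features histogram of the whole video, and deeper nodes describe progressively smaller motion components.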
Kernel on BOF-Trees: approximates all pairwise sub-tree comparisons, using both tree structure and node content
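For intuition about what such a kernel compares, the sketch below computes the exhaustive version: the mean histogram-intersection similarity over all pairs of nodes from two trees. It is a simplified stand-in. The kernel on this poster is an approximation whose exact form is not given here; the choice of histogram intersection, the L1 normalization and all function names are assumptions.

```python
def nodes(tree):
    """Yield every node of a tree stored as nested dicts."""
    yield tree
    for child in tree["children"]:
        yield from nodes(child)

def normalize(hist):
    """L1-normalize a histogram given as {bin: count}."""
    total = sum(hist.values())
    return {k: v / total for k, v in hist.items()} if total else {}

def intersection(h1, h2):
    """Histogram intersection similarity between two normalized histograms."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

def tree_kernel(t1, t2):
    """Mean node-to-node similarity over all pairs of nodes of t1 and t2."""
    n1 = [normalize(n["hist"]) for n in nodes(t1)]
    n2 = [normalize(n["hist"]) for n in nodes(t2)]
    total = sum(intersection(a, b) for a in n1 for b in n2)
    return total / (len(n1) * len(n2))

# Toy trees: t1 has two single-word leaves, t3 is one node with an unseen word.
t1 = {"hist": {0: 1, 1: 1}, "children": [
    {"hist": {0: 1}, "children": []},
    {"hist": {1: 1}, "children": []},
]}
t3 = {"hist": {2: 1}, "children": []}
k11 = tree_kernel(t1, t1)
k13 = tree_kernel(t1, t3)
```

A Gram matrix of such values could then be passed to an SVM that accepts a precomputed kernel. This exhaustive version costs one comparison per node pair, which is what motivates an approximation.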
Experiments
Datasets
High Five [Patron-Perez BMVC'10] (4 human interaction categories, 300 TV-show videos)
Olympic Sports [Niebles ECCV'10] (16 sport activity categories, 783 YouTube videos)
Results
On the High Five dataset
On the Olympic Sports dataset