Recognizing Activities with Cluster-Trees of Tracklets

Adrien Gaidon, Zaid Harchaoui, Cordelia Schmid, BMVC 2012

Recognizing complex activities

Supervised activity classification in videos

Activity: a complex action characterized by spatio-temporal relations between a variable number of parts

Goal: automatically identify motion components and exploit both their contents and their relations to improve recognition

Describe motion content using dense tracklets: short, fixed-duration point trajectories [Wang CVPR'11]
Hierarchically decompose the set of tracklets of a video using divisive spectral clustering
SVM on tree-structured activity models using a hierarchical kernel on nested histograms

Greedy top-down bi-partitioning of the set of tracklets

Example leaf labels: the leaves alone are not enough (oversegmentation)

Hierarchically cluster tracklets using:

The Nyström approximation for spectral embedding [Fowlkes et al. PAMI'04]
Recursive thresholding along embedding dimensions (relaxation of NCut [Shi and Malik PAMI'00])
Greedy minimization of a spatio-temporal connectedness cost (handles instabilities and ambiguities)
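
A minimal sketch of the greedy top-down bi-partitioning, under simplifying assumptions: it computes the second eigenvector exactly by power iteration instead of using the Nyström approximation, thresholds a single embedding dimension by sweeping for the lowest normalized cut, and leaves out the spatio-temporal connectedness cost. The Gaussian affinity and the names `gaussian_affinity`, `second_eigenvector`, `bipartition` and `cluster_tree` are illustrative and not taken from the poster.

```python
import math
import random

def gaussian_affinity(points, sigma=1.0):
    """Dense Gaussian affinities between descriptor vectors (assumed similarity)."""
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(p, q)) / (2.0 * sigma ** 2))
             for q in points] for p in points]

def second_eigenvector(W, n_iter=200, seed=0):
    """Second generalized eigenvector of (W, D), by deflated power iteration."""
    n = len(W)
    d = [sum(row) for row in W]
    total = math.sqrt(sum(d))
    u = [math.sqrt(di) / total for di in d]  # trivial eigenvector of D^-1/2 W D^-1/2
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(n_iter):
        # Iterate with (N + I) / 2 so that all eigenvalues are non-negative.
        v = [0.5 * (v[i] + sum(W[i][j] * v[j] / math.sqrt(d[i] * d[j]) for j in range(n)))
             for i in range(n)]
        dot = sum(ui * vi for ui, vi in zip(u, v))
        v = [vi - dot * ui for ui, vi in zip(u, v)]  # project out the trivial direction
        norm = math.sqrt(sum(vi * vi for vi in v)) or 1.0
        v = [vi / norm for vi in v]
    return [vi / math.sqrt(di) for vi, di in zip(v, d)]

def ncut(W, side):
    """Normalized cut of the bipartition described by the boolean list `side`."""
    n = len(W)
    cut = sum(W[i][j] for i in range(n) for j in range(n) if side[i] and not side[j])
    assoc_a = sum(sum(W[i]) for i in range(n) if side[i])
    assoc_b = sum(sum(W[i]) for i in range(n) if not side[i])
    return cut / assoc_a + cut / assoc_b

def bipartition(W):
    """Threshold the embedding at the value giving the lowest normalized cut."""
    y = second_eigenvector(W)
    best, best_cost = None, float("inf")
    for t in sorted(set(y))[:-1]:  # dropping the maximum keeps both sides non-empty
        side = [yi <= t for yi in y]
        cost = ncut(W, side)
        if cost < best_cost:
            best, best_cost = side, cost
    return best

def cluster_tree(W, indices, max_depth, min_size=2):
    """Recursively split `indices` into a binary tree of nested clusters."""
    node = {"indices": indices, "children": []}
    if max_depth > 0 and len(indices) >= 2 * min_size:
        sub = [[W[i][j] for j in indices] for i in indices]
        side = bipartition(sub)
        if side is not None:
            left = [i for i, s in zip(indices, side) if s]
            right = [i for i, s in zip(indices, side) if not s]
            node["children"] = [cluster_tree(W, part, max_depth - 1, min_size)
                                for part in (left, right)]
    return node

# Toy example: two well-separated groups of 2-D points stand in for tracklets.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
          (3.0, 0.0), (3.2, 0.1), (3.1, 0.3), (3.3, 0.2)]
tree = cluster_tree(gaussian_affinity(points), list(range(len(points))), max_depth=1)
print([child["indices"] for child in tree["children"]])
```

With many tracklets per video the dense affinity matrix becomes expensive, which is where an approximation such as Nyström would replace the exact computation used here.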

BOF-Tree
nested histograms of motion features
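
A small sketch of how nested histograms could be attached to a cluster-tree, assuming each tracklet has already been quantized to a visual-word index. The dictionary layout, the `words` list and the function name `bof_tree` are hypothetical, and raw counts are used where a real system would probably normalize.

```python
def bof_tree(node, words, vocab_size):
    """Attach a bag-of-features histogram to every node of a cluster-tree.

    `node` is a dict with "indices" (tracklet ids) and "children" (sub-nodes);
    `words[i]` is the visual-word index assigned to tracklet i.
    """
    hist = [0] * vocab_size
    for i in node["indices"]:
        hist[words[i]] += 1
    return {"hist": hist,
            "children": [bof_tree(child, words, vocab_size) for child in node["children"]]}

# Toy tree: four tracklets split into two clusters of two.
cluster_root = {"indices": [0, 1, 2, 3],
                "children": [{"indices": [0, 1], "children": []},
                             {"indices": [2, 3], "children": []}]}
words = [0, 1, 1, 2]  # assumed quantization of the four tracklets
tree = bof_tree(cluster_root, words, vocab_size=3)
print(tree["hist"], [child["hist"] for child in tree["children"]])
```

Because each child covers a subset of its parent's tracklets, the children's histograms sum to the parent's, which is what makes the histograms nested.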

Kernel on BOF-Trees
approximation of all pairwise sub-tree comparisons (uses tree structure and node content)
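
One way to combine tree structure and node content is to compare all pairs of parent-child edges between two trees. The sketch below does this with a histogram-intersection base kernel on L1-normalized histograms; both choices are assumptions made for illustration and are not claimed to be the exact kernel used in this work. The names `edges`, `intersection` and `tree_kernel` are hypothetical.

```python
def edges(tree):
    """Yield every (parent, child) pair of nodes in the tree."""
    for child in tree["children"]:
        yield tree, child
        yield from edges(child)

def l1_normalize(hist):
    total = sum(hist)
    return [x / total for x in hist] if total else list(hist)

def intersection(h1, h2):
    """Histogram-intersection kernel on L1-normalized histograms (assumed base kernel)."""
    return sum(min(a, b) for a, b in zip(l1_normalize(h1), l1_normalize(h2)))

def tree_kernel(t1, t2):
    """Mean similarity over all pairs of edges, one edge from each tree.

    Two edges are similar when both their parents and their children are similar.
    """
    e1, e2 = list(edges(t1)), list(edges(t2))
    if not e1 or not e2:
        return 0.0  # a tree without edges has nothing to compare in this sketch
    total = sum(intersection(p1["hist"], p2["hist"]) * intersection(c1["hist"], c2["hist"])
                for p1, c1 in e1 for p2, c2 in e2)
    return total / (len(e1) * len(e2))

# Two toy BOF-Trees over a three-word vocabulary.
tree_a = {"hist": [1, 2, 1], "children": [{"hist": [1, 1, 0], "children": []},
                                          {"hist": [0, 1, 1], "children": []}]}
tree_b = {"hist": [2, 2, 0], "children": [{"hist": [2, 0, 0], "children": []},
                                          {"hist": [0, 2, 0], "children": []}]}
print(tree_kernel(tree_a, tree_a), tree_kernel(tree_a, tree_b))
```

Evaluating such a function on every pair of videos gives a Gram matrix that an SVM accepting precomputed kernels can consume.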

High Five [Patron-Perez BMVC'10] (4 human interaction categories, 300 TV-show videos)

Olympic Sports [Niebles ECCV'10] (16 sport activity categories, 783 YouTube videos)

On the High Five dataset

On the Olympic Sports dataset