Alexander Kläser

Evaluation of local features for action recognition

Together with Heng Wang and Muhammad Muneeb Ullah, we investigated and compared a set of local spatio-temporal feature detectors and descriptors applied to the problem of action recognition/classification. The motivation behind that work was to enable a comparison of popular methods in a common setup. Furthermore, we also wanted to compare different combinations of feature detectors and descriptors.

The detectors used in this work are: Harris3D [Laptev'03], Gabor filters [Dollár'05], Hessian3D [Willems'08], and dense sampling. As descriptors, we employed: HOG/HOF [Laptev'08], Gradient [Dollár'05], Extended SURF [Willems'08], and our HOG3D. Our evaluation setup used a common bag-of-features framework with a non-linear SVM and the chi-square kernel. In total, we carried out experiments on three different datasets: KTH [Schüldt'04], UCF sports [Rodriguez'08], and Hollywood2 [Marszałek'09].
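As a rough illustration of the classification stage described above, the following sketch builds bag-of-features histograms and a chi-square kernel matrix. It assumes that local descriptors have already been quantized to visual-word indices. The function names, the toy vocabulary size, and the choice of the mean pairwise distance as kernel normalizer are assumptions made for this example, not the code used in the evaluation.

```python
import math

def bof_histogram(word_ids, vocab_size):
    """L1-normalized histogram of visual-word occurrences for one video."""
    hist = [0.0] * vocab_size
    for w in word_ids:
        hist[w] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist

def chi2_distance(h1, h2):
    """Chi-square distance 0.5 * sum((a - b)^2 / (a + b)), skipping empty bins."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def chi2_kernel_matrix(hists):
    """K[i][j] = exp(-D(i, j) / A); A is the mean pairwise distance (assumption)."""
    n = len(hists)
    dist = [[chi2_distance(hists[i], hists[j]) for j in range(n)] for i in range(n)]
    pairs = [dist[i][j] for i in range(n) for j in range(i + 1, n)]
    mean_dist = sum(pairs) / len(pairs) if pairs else 1.0
    if mean_dist == 0:
        mean_dist = 1.0  # all histograms identical; avoid division by zero
    return [[math.exp(-d / mean_dist) for d in row] for row in dist]

# Toy data: three "videos", each a list of visual-word indices (hypothetical).
videos = [[0, 0, 1, 2], [0, 1, 1, 2], [3, 3, 3, 2]]
hists = [bof_histogram(v, vocab_size=4) for v in videos]
K = chi2_kernel_matrix(hists)
```

A matrix like `K` could then be handed to any SVM implementation that accepts a precomputed kernel.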

The work was presented at BMVC 2009. In my PhD thesis (chapter 4) and on this page, I show the updated results that we obtained with updated settings of our HOG3D descriptor.

First, some example detections of the different detectors applied to three consecutive frames of the movie "Forrest Gump":

[Figure panels: original image frame; Dollár (Gabor) detector; Harris3D detector; Hessian3D detector]

Here are the results for the different detector/descriptor combinations on the three datasets. The first table shows the results on the KTH dataset (performance as average accuracy over all action classes):

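| Descriptor | Harris3D | Gabor | Hessian | Dense |
|------------|----------|-------|---------|-------|
| HOG3D      | 92.4%    | 91.4% | 88.1%   | 88.5% |
| HOG/HOF    | 91.8%    | 88.7% | 88.7%   | 86.1% |
| HOG        | 80.9%    | 82.3% | 77.7%   | 79.0% |
| HOF        | 92.1%    | 88.2% | 88.6%   | 88.0% |
| Gradient   | -        | 89.1% | -       | -     |
| ESURF      | -        | -     | 81.4%   | -     |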

The following table summarizes the results on the UCF sports dataset (performance as average accuracy over all action classes):

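| Descriptor | Harris3D | Gabor | Hessian | Dense |
|------------|----------|-------|---------|-------|
| HOG3D      | 77.6%    | 85.0% | 78.9%   | 84.8% |
| HOG/HOF    | 78.1%    | 77.7% | 79.3%   | 81.6% |
| HOG        | 71.4%    | 72.7% | 66.0%   | 77.4% |
| HOF        | 75.4%    | 76.7% | 75.3%   | 82.6% |
| Gradient   | -        | 76.6% | -       | -     |
| ESURF      | -        | -     | 77.3%   | -     |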

And last, but not least, results for the Hollywood2 dataset (performance in mean average precision over all action classes):

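| Descriptor | Harris3D | Gabor | Hessian | Dense |
|------------|----------|-------|---------|-------|
| HOG3D      | 44.3%    | 46.1% | 43.5%   | 44.8% |
| HOG/HOF    | 45.2%    | 46.2% | 46.0%   | 47.4% |
| HOG        | 32.8%    | 39.4% | 36.2%   | 39.4% |
| HOF        | 43.3%    | 42.9% | 43.0%   | 45.5% |
| Gradient   | -        | 45.0% | -       | -     |
| ESURF      | -        | -     | 38.2%   | -     |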
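The two performance measures used in the tables can be sketched as follows. This is a minimal illustration with hypothetical labels and scores. In particular, the non-interpolated form of average precision used here is an assumption, since the exact variant prescribed by the Hollywood2 protocol is not stated on this page.

```python
def mean_class_accuracy(y_true, y_pred):
    """Average of per-class accuracies, so every class counts equally."""
    per_class = {}
    for t, p in zip(y_true, y_pred):
        correct, total = per_class.get(t, (0, 0))
        per_class[t] = (correct + (t == p), total + 1)
    return sum(c / n for c, n in per_class.values()) / len(per_class)

def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(scores_per_class, labels_per_class):
    """Mean of the per-class AP values (one binary problem per action class)."""
    aps = [average_precision(s, l) for s, l in zip(scores_per_class, labels_per_class)]
    return sum(aps) / len(aps)

# Hypothetical toy example: "walk" is recognized half of the time, "run" always.
y_true = ["walk", "walk", "run", "run", "run", "run"]
y_pred = ["walk", "run", "run", "run", "run", "run"]
acc = mean_class_accuracy(y_true, y_pred)

# Hypothetical classifier scores for one class; 1 marks a positive sample.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
```

In the toy example, plain accuracy would be 5/6, whereas the class-averaged value is lower because the smaller class is weighted equally.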

Among the main conclusions, we note that dense sampling overall outperforms interest point detectors in realistic video settings, but performs worse on the simple KTH dataset. This indicates both (a) the importance of using realistic experimental video data and (b) the limitations of current interest point detectors. Note, however, that dense sampling also produces a very large number of features (usually 15-20 times more than feature detectors). Such a large feature set is more difficult to handle than the relatively sparse set of interest points. We also note a rather similar performance of the interest point detectors on each dataset. Across datasets, Harris3D performs better on the KTH dataset, while the Gabor detector gives better results on the UCF and Hollywood2 datasets.

Among the tested descriptors, the combination of gradient-based and optical-flow-based descriptors seems to be a good choice. The combination of dense sampling with the HOG/HOF descriptor provides the best results for the most challenging Hollywood2 dataset. On the UCF dataset, the HOG3D descriptor performs best in combination with dense sampling as well as with the Gabor detector. On KTH, both descriptors, HOG3D and HOG/HOF, show comparable results, with HOG3D having a slight edge. This also motivates further investigation of optical-flow-based descriptors.
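The conclusions above can be checked directly against the numbers in the tables. The sketch below copies the reported scores into a dictionary, looks up the best detector/descriptor pair per dataset, and averages each detector over the four descriptors that were evaluated with all detectors. The averaging is only a summary chosen for this example; it is not a measure from the original evaluation.

```python
DETECTORS = ["Harris3D", "Gabor", "Hessian", "Dense"]

# Scores (in %) copied from the tables above; None marks untested combinations.
RESULTS = {
    "KTH": {
        "HOG3D":    [92.4, 91.4, 88.1, 88.5],
        "HOG/HOF":  [91.8, 88.7, 88.7, 86.1],
        "HOG":      [80.9, 82.3, 77.7, 79.0],
        "HOF":      [92.1, 88.2, 88.6, 88.0],
        "Gradient": [None, 89.1, None, None],
        "ESURF":    [None, None, 81.4, None],
    },
    "UCF sports": {
        "HOG3D":    [77.6, 85.0, 78.9, 84.8],
        "HOG/HOF":  [78.1, 77.7, 79.3, 81.6],
        "HOG":      [71.4, 72.7, 66.0, 77.4],
        "HOF":      [75.4, 76.7, 75.3, 82.6],
        "Gradient": [None, 76.6, None, None],
        "ESURF":    [None, None, 77.3, None],
    },
    "Hollywood2": {
        "HOG3D":    [44.3, 46.1, 43.5, 44.8],
        "HOG/HOF":  [45.2, 46.2, 46.0, 47.4],
        "HOG":      [32.8, 39.4, 36.2, 39.4],
        "HOF":      [43.3, 42.9, 43.0, 45.5],
        "Gradient": [None, 45.0, None, None],
        "ESURF":    [None, None, 38.2, None],
    },
}

def best_combination(dataset):
    """Return (descriptor, detector, score) with the highest score for a dataset."""
    candidates = [
        (score, descriptor, detector)
        for descriptor, row in RESULTS[dataset].items()
        for detector, score in zip(DETECTORS, row)
        if score is not None
    ]
    score, descriptor, detector = max(candidates)
    return descriptor, detector, score

def mean_per_detector(dataset):
    """Average each detector over the descriptors tested with all four detectors."""
    full_rows = [row for row in RESULTS[dataset].values() if None not in row]
    return {
        detector: sum(row[i] for row in full_rows) / len(full_rows)
        for i, detector in enumerate(DETECTORS)
    }

for name in RESULTS:
    means = mean_per_detector(name)
    print(name, best_combination(name), "best detector on average:", max(means, key=means.get))
```

With these numbers, dense sampling has the highest average on UCF sports and Hollywood2, while Harris3D has the highest average on KTH.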

by Alexander Kläser 2010