Daily Action Localization in Youtube videos

DALY is intended for research in action recognition. It is suitable to train and test temporal/spatial localization algorithms.

It contains high quality temporal and spatial annotations for 10 actions in 31 hours of videos (3.3M frames), which is an order of magnitude larger than standard action localization datasets.

The dataset was developed by Philippe Weinzaepfel, Xavier Martin and Cordelia Schmid.
It is published by THOTH, a research team part of Inria.
If you use this dataset, please cite the following paper:

@article{weinzaepfel2016towards,
  title={Towards Weakly-Supervised Action Localization},
  author={Weinzaepfel, Philippe and Martin, Xavier and Schmid, Cordelia},
  journal={arXiv preprint arXiv:1605.05197},
  year={2016}
}
Below are some example frames taken from the dataset.

Phoning example Cleaning windows example
Phoning example Taking photo example

As you can see, DALY contains the following spatial annotations:

The action classes are 'applying make-up on lips', 'brushing teeth', 'cleaning floor', 'cleaning windows', 'drinking', 'folding textile', 'ironing', 'phoning', 'playing harmonica', and 'taking photos or videos'.

There are special flags provided for instances that are outside the field of view, that are mirror reflections, and more.