AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions
The AVA dataset densely annotates 80 atomic visual actions, localized in space and time, resulting in 210k action labels, with multiple labels per human occurring frequently.
The main differences from existing video datasets are:
- the definition of atomic visual actions, which avoids collecting data for each and every complex action;
- precise spatio-temporal annotations, with possibly multiple annotations for each human (see the sketch after this list);
- the use of diverse, realistic video material (movies).
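Because every action label is attached to a person box at a given timestamp, one person frequently carries several labels at once. The sketch below is a minimal illustration of how such annotations could be grouped per person box; it assumes a hypothetical CSV layout of `video_id, timestamp, x1, y1, x2, y2, action_id` per row, which is an assumption for illustration rather than the released file format.

```python
import csv
from collections import defaultdict

def load_annotations(path):
    """Group action labels by (video, timestamp, person box).

    NOTE: the column layout assumed here is an example; verify it
    against the actual released annotation files.
    """
    labels = defaultdict(list)
    with open(path, newline="") as f:
        # Each row describes one (person box, action) pair, so a person
        # in a frame can appear in several rows, one per action label.
        for video_id, timestamp, x1, y1, x2, y2, action_id in csv.reader(f):
            box = tuple(float(v) for v in (x1, y1, x2, y2))
            labels[(video_id, float(timestamp), box)].append(int(action_id))
    return labels

if __name__ == "__main__":
    ann = load_annotations("ava_annotations.csv")  # hypothetical filename
    # Count person boxes that carry more than one atomic action label.
    multi = sum(1 for actions in ann.values() if len(actions) > 1)
    print(f"{multi} of {len(ann)} person boxes carry multiple labels")
```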
If you use this dataset, please cite the following paper (https://arxiv.org/abs/1705.08421):
@article{gu2017,
  title={{AVA}: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions},
  author={Chunhui Gu and Chen Sun and Sudheendra Vijayanarasimhan and Caroline Pantofaru and David A. Ross and George Toderici and Yeqing Li and Susanna Ricco and Rahul Sukthankar and Cordelia Schmid and Jitendra Malik},
  journal={arXiv preprint arXiv:1705.08421},
  year={2017}
}

Below are some example frames taken from the dataset.