In Actor and Observer we introduced the Charades-Ego Dataset, which links the
first- and third-person video understanding domains. In this paper we
describe the egocentric aspect of the dataset and present annotations for
Charades-Ego with 68,536 activity instances in 68.8 hours of first- and
third-person video, making it one of the largest and most diverse egocentric
datasets available. Charades-Ego furthermore shares activity classes, scripts,
and methodology with the Charades dataset, which consists of an additional 82.3
hours of third-person video with 66,500 activity instances. Charades-Ego has
temporal annotations and textual descriptions, making it suitable for
egocentric video classification, localization, captioning, and new tasks
utilizing the cross-modal nature of the data.