DALY - howto

How to use DALY

This small tutorial explains the layout of the dataset .pkl file and how to access every element.
Code excerpts presented here run in Python unless specified otherwise.

Jump to section:

Opening the pickle file
Top-level content
Examining a single video
Examining a single annotation instance
Instance flags
Examining a single keyframe
Action bounding boxes
Pose
Objects

Opening the pickle file

DALY annotations are saved as a pickle file.
Pickle is a handy format to serialize Python objects.

import pickle
with open("path/to/daly1.1.0.pkl", "rb") as f:
    daly = pickle.load(f)

Python 3 users: if you encounter errors, try this instead:

import pickle
with open("/path/to/daly1.1.0.pkl", "rb") as f:
    daly = pickle.load(f, encoding='latin1')
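The two variants above can be folded into a single helper. This is only a minimal sketch: load_daly is a hypothetical name, not something shipped with DALY, and it assumes the file was written by Python 2, which is why Python 3 is given the latin1 encoding.

```python
import pickle
import sys


def load_daly(path):
    """Load the DALY annotation pickle under Python 2 or Python 3.

    `load_daly` is a hypothetical helper name, not part of DALY itself.
    """
    # Pickle files are binary, so open them in "rb" mode.
    with open(path, "rb") as f:
        if sys.version_info[0] >= 3:
            # Assumption: the file was pickled by Python 2, so Python 3
            # needs an explicit encoding for its byte strings.
            return pickle.load(f, encoding="latin1")
        return pickle.load(f)
```

Usage would then be daly = load_daly("path/to/daly1.1.0.pkl").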

Top-level content

DALY is now loaded as a Python dictionary, so you can explore its contents with the ".keys()" method.

daly.keys()
## ['splits', 'joints', 'labels', 'annot', 'version', 'objectList']

Let us go over each element:

splits: test video splits, 1 split currently
joints: names of all joints annotated in poses (e.g. head, shoulderLeft)
labels: names of all action classes, such as BrushingTeeth
annot: dictionary that maps a video ID to a set of annotations
version: 'daly1.1.0'
objectList: names of all objects encountered in DALY (e.g. bottle, cloth, phone)
metadata: contains number of frames, fps and duration for every video

To get the list of all videos in the dataset, run the following:

daly['annot'].keys()
## ['K6xXngYnVK8.mp4', 'legp5cXwuHc.mp4', 'OqmmGZS061o.mp4',  ...  , 'ncv3b55czfQ.mp4']

The videos in the dataset are referred to by their YouTube video ID.
You can download the videos using, for example, the "youtube-dl" utility.
We encourage you to get the highest quality possible.
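As a small illustration, the YouTube ID can be recovered from an annotation key by dropping the ".mp4" extension. The helper name youtube_url is hypothetical, and the URL pattern is the usual public watch URL rather than anything specified by DALY.

```python
import os


def youtube_url(video_key):
    """Turn an annotation key such as 'K6xXngYnVK8.mp4' into a watch URL."""
    # Drop the '.mp4' extension to recover the bare YouTube video ID.
    video_id = os.path.splitext(video_key)[0]
    # Assumption: the standard public watch URL pattern.
    return "https://www.youtube.com/watch?v=" + video_id


print(youtube_url("K6xXngYnVK8.mp4"))
## https://www.youtube.com/watch?v=K6xXngYnVK8
```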

Examining a single video

Let's pick a video in the dataset and explore its annotations.

vid = 'PFEJ0EQN-bY.mp4'

daly['annot'][vid].keys()
## ['suggestedClass', 'annot']

daly['annot'][vid]['suggestedClass']
## 'TakingPhotosOrVideos'

daly['annot'][vid]['annot'].keys()
## ['Phoning', 'TakingPhotosOrVideos']

Every video in the dataset has a suggestedClass; this one is suggested as TakingPhotosOrVideos.
This means the video was initially found while searching for "taking photo" instances.
Videos are guaranteed to have at least one instance of their suggested class.

len( daly['annot'][vid]['annot']['Phoning'] )
## 3

len( daly['annot'][vid]['annot']['TakingPhotosOrVideos'] )
## 3

This video has instances of two action classes, Phoning and TakingPhotosOrVideos.
More precisely, it has 3 separate instances of each. Let's take a look at the first "taking photo" instance.
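The per-class counts above can be gathered in one pass. The sketch below runs on a toy dictionary that only mimics the shape of daly['annot'][vid]; count_instances is a hypothetical helper name, and the empty dicts stand in for real instances.

```python
def count_instances(video_annot):
    """Return {class name: number of instances} for one video entry."""
    return {label: len(instances)
            for label, instances in video_annot["annot"].items()}


# Toy stand-in for daly['annot'][vid]; the empty dicts replace real instances.
toy_video = {
    "suggestedClass": "TakingPhotosOrVideos",
    "annot": {
        "Phoning": [{}, {}, {}],
        "TakingPhotosOrVideos": [{}, {}, {}],
    },
}

counts = count_instances(toy_video)
# The suggested class is guaranteed to have at least one instance.
assert counts[toy_video["suggestedClass"]] >= 1
```

On the real data, the same call would be count_instances(daly['annot'][vid]).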

Examining a single annotation instance

daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0].keys()
## ['endTime', 'flags', 'beginTime', 'keyframes']

The 'keyframes' list contains between 1 and 5 keyframes.
Keyframes are sampled uniformly within the duration of the instance and are spatially annotated.
Note that a keyframe never falls exactly on an instance boundary: keyframeTime != beginTime and keyframeTime != endTime.

beginTime and endTime are float values expressed in seconds.

flags contains indicators about the instance at hand; they are explained in detail in the next section.
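To make the timing fields concrete, here is a minimal sketch that computes an instance's duration and checks that its keyframes lie strictly inside it. The toy instance uses made-up times, and both helper names are hypothetical.

```python
def instance_duration(instance):
    """Length of the instance in seconds."""
    return instance["endTime"] - instance["beginTime"]


def keyframes_inside(instance):
    """True if every keyframe time lies strictly between beginTime and endTime."""
    return all(instance["beginTime"] < kf["time"] < instance["endTime"]
               for kf in instance["keyframes"])


# Toy instance with made-up times; real keyframes carry more fields.
toy_instance = {
    "beginTime": 10.0,
    "endTime": 14.0,
    "flags": {},
    "keyframes": [{"time": 11.0}, {"time": 12.0}, {"time": 13.0}],
}

print(instance_duration(toy_instance))
## 4.0
```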

Instance flags

daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['flags']
## {u'isSmall': False, u'isAmbiguous': False, ... }

Here is the breakdown of what you will find within flags:

isShotcut: this temporal instance is the continuation of a previous one, separated by a shot cut
isSmall: the action covers a very small portion of the image at some point
isAmbiguous: it is unclear whether the action genuinely happens (e.g. someone pretending)
isZoom: the action covers the whole image at some point
isReflection: this action instance is a mirror image (flag is applied to the least visible of the two), these instances are spatially annotated with an action bounding box only
isOccluded: at some point the actor is occluded by the environment, or the action is occluded by the actor itself (e.g. we only see the back of the person)
isOutsideFOV: a portion of the action or a portion of the actor is outside the field of view of the camera

Instances flagged isReflection or isAmbiguous should be ignored for evaluation.
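One way to apply this rule is to filter instances before scoring. The sketch below assumes the rule means that any instance carrying either flag is skipped; keep_for_evaluation and IGNORED_FLAGS are hypothetical names, and the toy instances only contain the flags field.

```python
# Flags whose instances are left out of evaluation (see the rule above).
IGNORED_FLAGS = ("isReflection", "isAmbiguous")


def keep_for_evaluation(instance):
    """False if the instance carries a flag that evaluation should ignore."""
    flags = instance["flags"]
    # A missing flag is treated as False (assumption for robustness).
    return not any(flags.get(name, False) for name in IGNORED_FLAGS)


# Toy instances; real ones also have beginTime, endTime and keyframes.
toy_instances = [
    {"flags": {"isSmall": True, "isAmbiguous": False, "isReflection": False}},
    {"flags": {"isSmall": False, "isAmbiguous": True, "isReflection": False}},
    {"flags": {"isSmall": False, "isAmbiguous": False, "isReflection": True}},
]

kept = [inst for inst in toy_instances if keep_for_evaluation(inst)]
print(len(kept))
## 1
```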

Examining a single keyframe

daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['keyframes'][0].keys()
## ['boundingBox', 'objects', 'frameNumber', 'pose', 'time']

'time' is a float value expressed in seconds.
When extracting all frames with ffmpeg, 'frameNumber' is the exact frame that corresponds to the annotation.
ffmpeg starts numbering at 1, and so does 'frameNumber'.
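Because both numberings start at 1, 'frameNumber' can be used directly to build the name of an extracted frame. The sketch below assumes frames were written with a zero-padded pattern such as ffmpeg -i video.mp4 frames/%06d.jpg; that pattern and the helper name keyframe_image_path are illustrative choices, not something DALY prescribes.

```python
import os


def keyframe_image_path(frames_dir, keyframe, pattern="%06d.jpg"):
    """Path of the extracted image that matches a keyframe annotation.

    Assumption: frames were written by ffmpeg with the same `pattern`.
    """
    # frameNumber is 1-based, like ffmpeg's default image numbering,
    # so no offset is applied.
    return os.path.join(frames_dir, pattern % int(keyframe["frameNumber"]))


# Toy keyframe with a made-up frame number.
print(keyframe_image_path("frames", {"frameNumber": 42}))
```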

As for the rest, 'boundingBox', 'objects' and 'pose' each contain spatial annotations.
Let's have a look at each separately.

Action bounding boxes

daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['keyframes'][0]['boundingBox']
## array([[ 0.54000002,  0.26287743,  0.99900001,  0.98223799]], dtype=float32)

The format of bounding boxes is as follows:

array (xmin, ymin, xmax, ymax)

In DALY, all image coordinates are floats between 0 and 1.
[0, 0] is the top left corner of the image, whereas [1, 1] is the bottom right corner.

Likewise, [xmin, ymin] represents the BBox's top left corner, and [xmax, ymax] the BBox's bottom right corner.
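Since coordinates are normalized, they have to be scaled by the image size before drawing or cropping. Here is a minimal sketch with a hypothetical helper, bbox_to_pixels; the image width and height are not stored in the bounding box, so they are assumed to come from the extracted frame.

```python
def bbox_to_pixels(bbox, width, height):
    """Scale a normalized (xmin, ymin, xmax, ymax) box to pixel coordinates."""
    xmin, ymin, xmax, ymax = [float(v) for v in bbox[:4]]
    # x values scale with the image width, y values with the image height.
    return (int(round(xmin * width)), int(round(ymin * height)),
            int(round(xmax * width)), int(round(ymax * height)))


# Toy box on a 640x480 image (made-up values).
print(bbox_to_pixels([0.5, 0.25, 1.0, 0.75], 640, 480))
## (320, 120, 640, 360)
```

With the real data, 'boundingBox' is a 2D array holding a single row, so the row to pass would be keyframe['boundingBox'][0].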

Pose

daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['keyframes'][0]['pose']

## array([[ 0.727,  0.26287743, 0.92699999, 0.64476019,  1., 0. ,  0. ],
       ...
       ...
       ], dtype=float32)

There is one line per pose element.
The lines are ordered in the same fashion as the top-level joints list, and are present even when the joint isn't.

daly['joints']
## ['head', 'shoulderLeft', 'elbowLeft', 'wristLeft', 'shoulderRight', 'elbowRight', 'wristRight']

Therefore the first line represents the head bounding box, the second line the left shoulder, etc.

The following format is used:

[xmin, ymin, xmax, ymax, isVisible, isOccluded, isHallucinate]

isVisible is set when the joint is visible and therefore annotated.

Warning: most pose elements are joints such as 'shoulderRight', which are 2D points; only the 'head' element is a bounding box.
A joint such as shoulderRight is therefore saved with xmin == xmax and ymin == ymax, whereas head is saved as a full bounding box.

isOccluded is only present on the 'head' element.
It indicates the head is visible yet obstructed by another actor.

isHallucinate can be present on all joints.
This flag is set when it is possible to infer the joint's position, despite it not being visible in the image.
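The flag columns can be summarized per joint name. The sketch below reads only the last three entries of each line (isVisible, isOccluded, isHallucinate) and pairs lines with the joints list by position. It assumes a line with neither isVisible nor isHallucinate set means the joint is absent; joint_status is a hypothetical helper, and the toy pose uses placeholder coordinates.

```python
JOINTS = ['head', 'shoulderLeft', 'elbowLeft', 'wristLeft',
          'shoulderRight', 'elbowRight', 'wristRight']


def joint_status(pose, joints=JOINTS):
    """Map each joint name to 'visible', 'hallucinated' or 'absent'."""
    status = {}
    # Lines of `pose` follow the order of the joints list.
    for name, row in zip(joints, pose):
        is_visible, is_hallucinate = row[4], row[6]
        if is_visible:
            status[name] = "visible"
        elif is_hallucinate:
            status[name] = "hallucinated"
        else:
            # Assumption: no flag set means the joint was not annotated.
            status[name] = "absent"
    return status


# Toy pose: coordinates are placeholders, only the flag columns matter here.
toy_pose = [
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # head: visible
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # shoulderLeft: visible
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # elbowLeft: hallucinated
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # wristLeft: absent
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # shoulderRight: visible
    [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # elbowRight: visible
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # wristRight: absent
]

status = joint_status(toy_pose)
print(status["elbowLeft"])
## hallucinated
```

On the real data, the call would be joint_status(keyframe['pose'], daly['joints']).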

Objects

daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['keyframes'][0]['objects']
## array([[  0.57200003,  0.36944938, 0.62400001, 0.56483126, 30., 0. , 0. ]], dtype=float32)

Each line is read separately and represents an object bounding box. The format is as follows:

[xmin, ymin, xmax, ymax, objectID, isOccluded, isHallucinate]

To get the object's name, look up the object ID in the top-level objectList (the ID is stored as a float in the array, so use it as an integer index):

daly['objectList'][30]
## u'smartphone'

isOccluded indicates the object is visible yet obstructed.
isHallucinate is set when it is possible to infer the object's location, despite it not being visible in the image.
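Putting the object format together, each line can be decoded into a name plus its two flags. This is a sketch on toy data: decode_objects is a hypothetical helper, and the three-entry object list below uses a made-up ordering, not the real daly['objectList'].

```python
def decode_objects(objects, object_list):
    """Return one dict per object line: name, box and flags."""
    decoded = []
    for row in objects:
        # The ID is stored as a float, so cast it before indexing.
        object_id = int(row[4])
        decoded.append({
            "name": object_list[object_id],
            "box": tuple(float(v) for v in row[:4]),
            "isOccluded": bool(row[5]),
            "isHallucinate": bool(row[6]),
        })
    return decoded


# Hypothetical ordering, for illustration only.
toy_object_list = ["bottle", "cloth", "phone"]
toy_objects = [[0.5, 0.25, 0.75, 0.5, 2.0, 0.0, 0.0]]

print(decode_objects(toy_objects, toy_object_list)[0]["name"])
## phone
```

On the real data, the call would be decode_objects(keyframe['objects'], daly['objectList']).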
