How to use DALY
This small tutorial explains the layout of the dataset .pkl file and how to access every element. Code excerpts presented here run in Python unless specified otherwise.
Jump to section:
- Opening the pickle file
- Top-level content
- Examining a single video
- Examining a single annotation instance
- Instance flags
- Examining a single keyframe
- Action bounding boxes
- Pose
- Objects
Opening the pickle file
DALY annotations are saved as a pickle file. Pickle is a handy format to serialize Python objects.

    import pickle
    with open("path/to/daly1.1.0.pkl") as f:
        daly = pickle.load(f)

Python 3 users, if you encounter errors, try this:

    import pickle
    with open("/path/to/daly1.1.0.pkl", "rb") as f:
        daly = pickle.load(f, encoding='latin1')
Top-level content
DALY is now loaded as a Python dictionary, so you can explore its contents using the ".keys()" method.

    daly.keys()
    ## ['splits', 'joints', 'labels', 'annot', 'version', 'objectList']

Let us go over each element:
- splits: test video splits (currently 1 split)
- joints: names of all joints annotated in poses (e.g. head, shoulderLeft)
- labels: names of all action classes, such as BrushingTeeth
- annot: dictionary that maps a video ID to a set of annotations
- version: 'daly1.1.0'
- objectList: names of all objects encountered in DALY (e.g. bottle, cloth, phone)
- metadata: number of frames, fps and duration for every video

    daly['annot'].keys()
    ## ['K6xXngYnVK8.mp4', 'legp5cXwuHc.mp4', 'OqmmGZS061o.mp4', ... , 'ncv3b55czfQ.mp4']

The videos in the dataset are referred to by their YouTube video ID.
You can download the videos using, for example, the "youtube-dl" utility.
We encourage you to get the highest quality possible.
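As a rough sketch of how a video key relates to its YouTube page, the helper below strips the file extension and builds a watch URL. The function names are illustrative and not part of DALY, and the code assumes every key is simply the YouTube ID followed by ".mp4", as in the keys listed above.

```python
def youtube_id(video_key):
    """Strip the '.mp4' suffix from a DALY video key (assumed naming convention)."""
    suffix = ".mp4"
    if video_key.endswith(suffix):
        return video_key[:-len(suffix)]
    return video_key


def youtube_url(video_key):
    """Build the usual YouTube watch URL for a DALY video key."""
    return "https://www.youtube.com/watch?v=" + youtube_id(video_key)


print(youtube_url("K6xXngYnVK8.mp4"))
# https://www.youtube.com/watch?v=K6xXngYnVK8
```

The resulting URL can then be handed to whichever download tool you prefer.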
Examining a single video
Let's pick a video in the dataset and explore its annotations.

    vid = 'PFEJ0EQN-bY.mp4'
    daly['annot'][vid].keys()
    ## ['suggestedClass', 'annot']
    daly['annot'][vid]['suggestedClass']
    ## 'TakingPhotosOrVideos'
    daly['annot'][vid]['annot'].keys()
    ## ['Phoning', 'TakingPhotosOrVideos']

Every video in the dataset has a suggestedClass; for this video it is TakingPhotosOrVideos.
It means the video was initially found while looking for "taking photo" instances.
Videos are guaranteed to have at least one instance of their suggested class.

    len(daly['annot'][vid]['annot']['Phoning'])
    ## 3
    len(daly['annot'][vid]['annot']['TakingPhotosOrVideos'])
    ## 3

This video has instances of two action classes, Phoning and TakingPhotosOrVideos.
More precisely, it has 3 separate instances of each. Let's take a look at the first "taking photo" instance.
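The per-class counts above can also be gathered in one pass. This is a minimal sketch: count_instances is an illustrative helper, and demo_video is a made-up stand-in that only mirrors the structure of a daly['annot'][vid] entry.

```python
def count_instances(video_annot):
    """Return {action class: number of instances} for one video entry."""
    return {label: len(instances)
            for label, instances in video_annot["annot"].items()}


# Stand-in entry with the same layout as daly['annot'][vid] (contents are made up).
demo_video = {
    "suggestedClass": "TakingPhotosOrVideos",
    "annot": {
        "Phoning": [{}, {}, {}],
        "TakingPhotosOrVideos": [{}, {}, {}],
    },
}

counts = count_instances(demo_video)
print(counts["Phoning"], counts["TakingPhotosOrVideos"])  # 3 3
```

With the real file loaded, count_instances(daly['annot'][vid]) should give the same kind of dictionary.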
Examining a single annotation instance

    daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0].keys()
    ## ['endTime', 'flags', 'beginTime', 'keyframes']

The 'keyframes' list contains between 1 and 5 keyframes.
Keyframes are sampled uniformly within the duration of the instance, and each keyframe is spatially annotated.
Please note that a keyframe's time never coincides with the instance boundaries: keyframeTime != beginTime and keyframeTime != endTime.
beginTime and endTime are float values expressed in seconds.
flags contains indicators about the instance at hand; they are detailed in the next section.
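To illustrate how the temporal fields fit together, here is a small sketch that computes an instance's duration and lists its keyframe times. The helper names and the demo_instance values are made up for the example; only the key names come from the description above.

```python
def instance_duration(instance):
    """Duration of an annotation instance, in seconds."""
    return instance["endTime"] - instance["beginTime"]


def keyframe_times(instance):
    """Timestamps (in seconds) of the instance's keyframes."""
    return [keyframe["time"] for keyframe in instance["keyframes"]]


# Made-up instance using the documented keys.
demo_instance = {
    "beginTime": 10.0,
    "endTime": 14.0,
    "flags": {},
    "keyframes": [{"time": 11.0}, {"time": 12.0}, {"time": 13.0}],
}

print(instance_duration(demo_instance))  # 4.0

# Keyframes are expected to fall strictly inside the instance.
inside = all(demo_instance["beginTime"] < t < demo_instance["endTime"]
             for t in keyframe_times(demo_instance))
print(inside)  # True
```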
Instance flags

    daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['flags']
    ## {u'isSmall': False, u'isAmbiguous': False, ... }

Here is the breakdown of what you will find within flags:
- isShotcut: this temporal instance is the continuation of a previous one, separated by a shot cut
- isSmall: the action covers a very small portion of the image at some point
- isAmbiguous: it is unclear whether the action genuinely happens (e.g. someone pretending)
- isZoom: the action covers the whole image at some point
- isReflection: this action instance is a mirror image (flag is applied to the least visible of the two), these instances are spatially annotated with an action bounding box only
- isOccluded: at some point the actor is occluded by the environment, or the action is occluded by the actor itself (e.g. we only see the back of the person)
- isOutsideFOV: a portion of the action or a portion of the actor is outside the field of view of the camera
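The flags listed above lend themselves to filtering, for instance to set aside ambiguous or mirrored instances. The sketch below is one possible way to do it: keep_instance and its default rejected_flags are illustrative choices, not a recommended protocol, and the demo instances are made up.

```python
def keep_instance(instance, rejected_flags=("isAmbiguous", "isReflection")):
    """Return True when none of the rejected flags is set on the instance."""
    flags = instance["flags"]
    # Assumption: a flag missing from the dictionary is treated as False.
    return not any(flags.get(name, False) for name in rejected_flags)


# Made-up instances carrying only a 'flags' dictionary.
demo_instances = [
    {"flags": {"isSmall": False, "isAmbiguous": False, "isReflection": False}},
    {"flags": {"isSmall": True, "isAmbiguous": True, "isReflection": False}},
    {"flags": {"isSmall": False, "isAmbiguous": False, "isReflection": True}},
]

kept = [inst for inst in demo_instances if keep_instance(inst)]
print(len(kept))  # 1
```

Passing a different tuple as rejected_flags changes which instances are dropped.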
Examining a single keyframe

    daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['keyframes'][0].keys()
    ## ['boundingBox', 'objects', 'frameNumber', 'pose', 'time']

'time' speaks for itself: it is a float value expressed in seconds.
When extracting all frames with ffmpeg, 'frameNumber' is the exact frame that corresponds to the annotation.
ffmpeg starts numbering at 1, and so does 'frameNumber'.
As for the rest, 'boundingBox', 'objects' and 'pose' each hold one kind of spatial annotation.
Let's have a look at each separately.
Action bounding boxes

    daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['keyframes'][0]['boundingBox']
    ## array([[ 0.54000002, 0.26287743, 0.99900001, 0.98223799]], dtype=float32)

The format of bounding boxes is as follows:

    array (xmin, ymin, xmax, ymax)

In DALY, all image coordinates are floats between 0 and 1.
[0, 0] is the top left corner of the image, whereas [1, 1] is the bottom right corner.
Likewise, [xmin, ymin] represents the BBox's top left corner, and [xmax, ymax] the BBox's bottom right corner.
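Since coordinates are normalized, they have to be scaled by the frame size before drawing or cropping. The following sketch shows one way to do this. box_to_pixels is an illustrative helper, and the 1280x720 frame size and the box values are assumptions chosen for the example; use the actual resolution of your downloaded video.

```python
def box_to_pixels(box, width, height):
    """Scale a normalized (xmin, ymin, xmax, ymax) box to pixel coordinates."""
    xmin, ymin, xmax, ymax = box[:4]
    return (
        int(round(xmin * width)),
        int(round(ymin * height)),
        int(round(xmax * width)),
        int(round(ymax * height)),
    )


# Made-up normalized box and an assumed 1280x720 frame.
demo_box = [0.5, 0.25, 1.0, 0.75]
print(box_to_pixels(demo_box, 1280, 720))  # (640, 180, 1280, 540)
```

Note that 'boundingBox' is a 2D array with one row per box, so with the real data you would pass a single row, for example keyframe['boundingBox'][0].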
Pose

    daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['keyframes'][0]['pose']
    ## array([[ 0.727, 0.26287743, 0.92699999, 0.64476019, 1., 0. , 0. ], ... ... ], dtype=float32)

There is one line per pose element.
The lines are ordered in the same fashion as the top-level joints list, and are present even when the joint isn't.

    daly['joints']
    ## ['head', 'shoulderLeft', 'elbowLeft', 'wristLeft', 'shoulderRight', 'elbowRight', 'wristRight']

Therefore the first line represents the head bounding box, the second line the left shoulder, etc.
The following format is used:

    [xmin, ymin, xmax, ymax, isVisible, isOccluded, isHallucinate]

isVisible is set when the joint is visible and therefore annotated.
Warning: most of the pose elements are joints, such as 'shoulderRight', and a joint is a 2D point. Only the pose element 'head' is a bounding box.
A joint such as shoulderRight is therefore saved with xmin == xmax and ymin == ymax, whereas head is saved as a full bounding box.
isOccluded is only present on the 'head' element.
It indicates the head is visible yet obstructed by another actor.
isHallucinate can be present on all joints.
This flag is set when it is possible to infer the joint's position, despite it not being visible in the image.
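Because pose lines follow the order of the joints list, pairing the two gives named access to the flags. The sketch below reads only the three flag columns (indices 4 to 6). pose_flags is an illustrative helper, the first row of demo_pose is copied from the excerpt above, and the remaining rows are made up.

```python
def pose_flags(pose, joint_names):
    """Map each joint name to its three flags, read from columns 4 to 6."""
    flags = {}
    for name, row in zip(joint_names, pose):
        flags[name] = {
            "isVisible": bool(row[4]),
            "isOccluded": bool(row[5]),
            "isHallucinate": bool(row[6]),
        }
    return flags


joint_names = ["head", "shoulderLeft", "elbowLeft", "wristLeft",
               "shoulderRight", "elbowRight", "wristRight"]

# First row taken from the excerpt above; the six other rows are made up
# and represent joints that are neither visible nor hallucinated.
demo_pose = [[0.727, 0.26287743, 0.92699999, 0.64476019, 1.0, 0.0, 0.0]]
demo_pose += [[0.0] * 7 for _ in range(6)]

flags = pose_flags(demo_pose, joint_names)
visible = [name for name in joint_names if flags[name]["isVisible"]]
print(visible)  # ['head']
```

With the real data, pose_flags(keyframe['pose'], daly['joints']) should work the same way, since rows of a NumPy array can be indexed like lists.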
Objects

    daly['annot'][vid]['annot']['TakingPhotosOrVideos'][0]['keyframes'][0]['objects']
    ## array([[ 0.57200003, 0.36944938, 0.62400001, 0.56483126, 30., 0. , 0. ]], dtype=float32)

Each line is read separately and represents an object bounding box. The format is as follows:

    [xmin, ymin, xmax, ymax, objectID, isOccluded, isHallucinate]

To get the object's name, index the top-level objectList with the object ID:

    daly['objectList'][30]
    ## u'smartphone'

isOccluded indicates the object is visible yet obstructed.
isHallucinate is set when it is possible to infer the object's location, despite it not being visible in the image.
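Putting the object format together, each line can be turned into a small record with a readable name. This is only a sketch: decode_objects is an illustrative helper, and demo_names is a made-up three-entry stand-in for daly['objectList'], so the object ID in the demo row was changed to fit it.

```python
def decode_objects(objects, object_names):
    """Turn each object line into a dictionary with a name, a box and flags."""
    decoded = []
    for row in objects:
        decoded.append({
            # The object ID is stored as a float, so cast it before indexing.
            "name": object_names[int(row[4])],
            "box": tuple(float(value) for value in row[:4]),
            "isOccluded": bool(row[5]),
            "isHallucinate": bool(row[6]),
        })
    return decoded


# Made-up stand-in for daly['objectList']; the real list is longer.
demo_names = ["bottle", "cloth", "phone"]
# Box values from the excerpt above, with the object ID replaced by 2.
demo_objects = [[0.57200003, 0.36944938, 0.62400001, 0.56483126, 2.0, 0.0, 0.0]]

decoded = decode_objects(demo_objects, demo_names)
print(decoded[0]["name"])  # phone
```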