@inproceedings{potapov2014category,
  url = {http://hal.inria.fr/hal-01022967},
  title = {{Category-specific video summarization}},
  author = {Potapov, Danila and Douze, Matthijs and Harchaoui, Zaid and Schmid, Cordelia},
  booktitle = {{ECCV 2014 - European Conference on Computer Vision}},
  year = {2014},
}

Annotation of semantic segments and their importance

Annotation of semantic segments

This task consists of annotating temporal segments in a video. For a given video, annotating temporal segments means finding time-stamps, called "change-points", such that the video chunk between two consecutive time-stamps is "semantically consistent". A video chunk between two consecutive annotated time-stamps is called a "segment". We consider a segment "semantically consistent" (a semantic segment for short) if a human can describe it with a short sentence. The segment should also be delimited so that watching it alone is sufficient for a user to grasp what is going on. For example, the description can be "a group of people marching in the street" for a "Parade" video or "putting one slice of bread onto another" for a "Making a sandwich" video (as in the examples).
The whole video has to be covered by non-intersecting semantic segments without gaps. Annotating semantic segments amounts to specifying the segments' change-points. We require all shot boundaries to be annotated as change-points (a "shot" is a part of the video that is continuous in time and space). Note that change-points do not necessarily correspond to shot boundaries, but all shot boundaries should be change-points. A gradual transition (a non-abrupt shot boundary) gets a single change-point in its middle if it lasts less than 1 second. Otherwise the gradual transition must be treated as a separate segment. The video below shows an example of a short gradual transition. If a shot is long and contains several actions, you must annotate the starting and ending frames of these actions. Often a shot contains a single action, but the main part is shorter than the whole segment. In this case you should also annotate the main part. See the example video below.
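
The coverage and gradual-transition rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the annotation tool's actual code: times are taken to be in seconds, and the names `segments_from_change_points`, `covers_video` and `gradual_transition_change_points` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds; an assumed representation


def segments_from_change_points(change_points: List[float], duration: float) -> List[Segment]:
    """Split [0, duration] into consecutive segments at the given change-points."""
    # Keep only change-points strictly inside the video and drop duplicates.
    inner = sorted({t for t in change_points if 0.0 < t < duration})
    bounds = [0.0] + inner + [duration]
    return list(zip(bounds[:-1], bounds[1:]))


def covers_video(segments: List[Segment], duration: float) -> bool:
    """Return True if the segments tile [0, duration] with no gaps and no overlaps."""
    if not segments:
        return False
    ordered = sorted(segments)
    if ordered[0][0] != 0.0 or ordered[-1][1] != duration:
        return False
    # Each segment must end exactly where the next one starts.
    return all(prev[1] == nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


def gradual_transition_change_points(start: float, end: float) -> List[float]:
    """Change-points for a gradual transition spanning [start, end].

    Shorter than 1 second: one change-point in the middle.
    Otherwise: the transition is its own segment, bounded by two change-points.
    """
    if end - start < 1.0:
        return [(start + end) / 2.0]
    return [start, end]
```

Because segments are derived from a sorted list of change-points, coverage holds by construction; `covers_video` is useful mainly for checking segment lists that come from elsewhere.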
Some actions are repetitive or homogeneous, e.g. running, sewing, etc. In that case you should specify the "minimum duration" of a subsegment that fully represents the whole segment. For example, watching 2-3 seconds of a running person is sufficient to understand what is going on and to describe the segment as "a person is running". We require the "minimum duration" of the segment to be at least 2 seconds. On the other hand, 10 seconds of a running person is too long a segment to concisely represent the sentence "a person is running". Therefore, we give the following durations as a recommendation:
The interface lets you navigate the video in steps of 5 frames. You should specify change-points as accurately as possible. A change-point is inserted just before the frame that you see.

Annotation of importance

For each semantic segment we ask you to annotate its importance. You should answer the question: "Does the segment contain evidence of the given event category?" Please choose one of the answers:
If something is only mentioned in text or speech, then do not report it as important.

Subset | Training | Validation | NULL | Test |
---|---|---|---|---|
MED dataset (our split) | ||||
Total videos | 1338 (520) | 1311 (877) | 9600 | 31820 |
Total duration, hours | 60 (25) | 57 (39) | 408 | 980 |
MED-Summaries (subset) | ||||
Annotated videos | — | 60 (40) | — | 100 |
Total duration, hours | — | 3 (2) | — | 4 |
Annotators per video | — | 1 | — | 2-4 |
Total annotated segments | — | 1680 (1122) | — | 8904 |