Traditional approaches for classifying event videos rely on a manually curated
training dataset. While this paradigm has achieved excellent results on
benchmarks such as the TrecVid multimedia event detection (MED) challenge datasets,
it is restricted by the effort involved in careful annotation. Recent
approaches have attempted to address the need for annotation by automatically
extracting images from the web, or generating queries to retrieve videos. In
the former case, they fail to exploit additional cues provided by video data,
while in the latter, they still require some manual annotation to generate
relevant queries. We take an alternative approach in this paper, leveraging the
synergy between visual video data and the associated textual metadata, to learn
event classifiers without manually annotating any videos. Specifically, we
first collect a video dataset using queries constructed automatically from
textual descriptions of events, prune irrelevant videos using both their text
and video data, and then learn the corresponding event classifiers. We evaluate this
approach in the challenging setting where no manually annotated training set is
available, i.e., EK0 in the TrecVid challenge, and show state-of-the-art
results on the MED 2011 and 2013 datasets.
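
The first two steps of the pipeline described above (automatic query construction and text-based pruning) can be illustrated with a minimal sketch. This is not the method evaluated in the paper: the stopword list, the n-gram query scheme, the Jaccard relevance score, the threshold value, and all function names are assumptions made purely for illustration. Pruning with visual cues and the classifier learning step are omitted.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "or",
             "with", "for", "is", "are"}


def tokenize(text):
    """Lowercase the text and return its alphabetic, non-stopword tokens."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]


def build_queries(event_description, max_len=2):
    """Form candidate search queries from contiguous word n-grams
    (an assumed scheme, not the paper's query construction)."""
    tokens = tokenize(event_description)
    queries = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            query = " ".join(tokens[i:i + n])
            if query not in queries:  # keep first occurrence only
                queries.append(query)
    return queries


def relevance(event_description, metadata):
    """Jaccard overlap between the description and metadata vocabularies."""
    a = set(tokenize(event_description))
    b = set(tokenize(metadata))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prune(event_description, videos, threshold=0.2):
    """Keep videos whose textual metadata overlaps enough with the
    event description; the threshold is an arbitrary assumption."""
    return [v for v in videos
            if relevance(event_description, v["metadata"]) >= threshold]
```

Here each video is assumed to be a dictionary with a `metadata` string (for example, its title and description). Calling `build_queries` on an event description yields the retrieval queries, and `prune` then filters the retrieved videos before any classifier is trained on them.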