Scene text recognition has gained significant attention from the computer
vision community in recent years. Recognizing such text is a challenging
problem, even more so than the recognition of scanned documents. In this work,
we focus on the problem of recognizing text extracted from street images. We
present a framework that exploits both bottom-up and top-down cues. The
bottom-up cues are derived from individual character detections in the image.
We build a Conditional Random Field model on these detections to jointly model
the strength of the detections and the interactions between them. We impose
top-down cues obtained from a lexicon-based prior, i.e. language statistics, on
the model. The optimal word represented by the text image is obtained by
minimizing the energy function corresponding to the random field model.
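
As an illustration of the energy minimization described above, the following is a minimal sketch rather than the implementation used in this work. It assumes that the detections form a left-to-right chain, that the unary costs stand in for detection strength (lower is stronger), that the pairwise costs stand in for a bigram lexicon prior, and that exact inference is done by dynamic programming. The function name, the default pairwise cost, and all numeric values are hypothetical.

```python
def min_energy_word(unary, pairwise, default_pair_cost=1.0):
    """Return the (word, energy) pair that minimizes a chain energy.

    unary:    list of dicts, one per detection window (left to right),
              mapping a candidate character to its detection cost.
    pairwise: dict mapping (previous_char, char) to a lexicon-prior cost;
              pairs that are absent receive default_pair_cost.
    All costs are illustrative assumptions, not values from the paper.
    """
    # best[c] = (lowest energy of any labelling ending in c, that labelling)
    best = {c: (cost, c) for c, cost in unary[0].items()}
    for node in unary[1:]:
        new_best = {}
        for c, u in node.items():
            # Cheapest way to reach character c from any previous label.
            energy, prefix = min(
                (e + pairwise.get((p[-1], c), default_pair_cost), p)
                for e, p in best.values()
            )
            new_best[c] = (energy + u, prefix + c)
        best = new_best
    energy, word = min(best.values())
    return word, energy


# Toy example: three detection windows with two candidates each.
unary = [
    {"c": 0.2, "e": 0.9},
    {"a": 0.5, "o": 0.4},
    {"t": 0.3, "l": 0.6},
]
# Hypothetical bigram costs (lower means more likely under the lexicon).
pairwise = {
    ("c", "a"): 0.1, ("c", "o"): 0.3, ("e", "a"): 0.2,
    ("a", "t"): 0.1, ("o", "t"): 0.4, ("a", "l"): 0.3, ("o", "l"): 0.3,
}

# Bottom-up cues alone pick the strongest detection in each window.
bottom_up_only = "".join(min(node, key=node.get) for node in unary)
word, energy = min_energy_word(unary, pairwise)
```

In this toy setting the detections alone yield "cot", whereas the joint minimum is "cat": the lexicon prior overrides the slightly stronger detection of "o" in the second window.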

We show significant improvements in accuracy on two challenging public
datasets, namely Street View Text (over 15%) and ICDAR 2003 (nearly 10%).