Measuring concept generalization, i.e., the extent to which models trained on a
set of (seen) visual concepts can be leveraged to recognize a new set of
(unseen) concepts, is a popular way of evaluating visual representations,
especially in a self-supervised learning framework. Nonetheless, the choice of
unseen concepts for such an evaluation is usually made arbitrarily and
independently of the seen concepts used to train representations, thus
ignoring any semantic relationships between the two. In this paper, we argue
that the semantic relationships between seen and unseen concepts affect
generalization performance and propose ImageNet-CoG, a novel benchmark on the
ImageNet-21K (IN-21K) dataset that enables measuring concept generalization in
a principled way. Our benchmark leverages expert knowledge from WordNet to
define a sequence of unseen IN-21K concept sets that are increasingly
semantically distant from the ImageNet-1K (IN-1K) subset, a ubiquitous
training set. This allows us to benchmark visual representations
learned on IN-1K out-of-the-box. We conduct a large-scale study encompassing 31
convolutional and transformer-based models and show how different architectures,
levels of supervision, regularization techniques, and the use of web data impact the
concept generalization performance.