Why Geographically Homogeneous Datasets Won't Cut It: Evaluating CIFAR-10 Image Recognition in a Global Context




Das, Sourav




In this thesis, we investigate the following question: do neural networks trained on the contemporary CIFAR-10 dataset generalize to culturally different images set in a foreign locale? We use a combination of Large Language Models (LLMs) and Stable Diffusion to create a synthetic dataset with a foreign context, and show that models trained on CIFAR-10 suffer a significant loss in accuracy on many classes of this synthetic dataset. By contrast, the same models perform exceptionally well when assessed on synthetic datasets that contain no foreign contextual elements. This observation establishes a causal link between the loss of accuracy and the international context in the synthetic data, and it highlights the lack of geographic diversity in CIFAR-10. Looking ahead, we conjecture that datasets may need to be updated to represent global geographic diversity more comprehensively before AI models are deployed globally. Such a change in dataset design could better equip AI models for widespread deployment, including in low- and middle-income countries.
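The comparison described in the abstract can be illustrated as a per-class accuracy gap between a synthetic control set (no foreign context) and a synthetic foreign-context set. The following is a minimal sketch under stated assumptions, not the thesis code: the function names are hypothetical, and the labels and predictions are toy values invented for illustration, not results from the study.

```python
from collections import defaultdict


def per_class_accuracy(labels, predictions):
    """Return {class_name: accuracy} for every class that appears in labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for true_label, predicted in zip(labels, predictions):
        total[true_label] += 1
        if true_label == predicted:
            correct[true_label] += 1
    return {name: correct[name] / total[name] for name in total}


def accuracy_drop(control_acc, foreign_acc):
    """Control accuracy minus foreign-context accuracy, per shared class."""
    return {
        name: control_acc[name] - foreign_acc[name]
        for name in control_acc
        if name in foreign_acc
    }


# Toy data (assumption): the same classifier evaluated on two synthetic sets.
control_labels = ["truck", "truck", "ship", "ship"]
control_preds = ["truck", "truck", "ship", "ship"]
foreign_labels = ["truck", "truck", "ship", "ship"]
foreign_preds = ["truck", "automobile", "ship", "ship"]

control_acc = per_class_accuracy(control_labels, control_preds)
foreign_acc = per_class_accuracy(foreign_labels, foreign_preds)
drop = accuracy_drop(control_acc, foreign_acc)

for name, gap in sorted(drop.items()):
    print(f"{name}: accuracy drop = {gap:.2f}")
```

A large positive gap for a class, alongside high control accuracy, is the pattern the abstract describes for classes affected by the foreign context.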



CIFAR10, DLA, Image Classification, International Context, Social Good, Synthetic Image



Computer Science