COCO Image Retrieval with MindEye2: Challenges and Insights with OpenCLIP bigG Embeddings

16 Apr 2025

Abstract and 1 Introduction

2 MindEye2 and 2.1 Shared-Subject Functional Alignment

2.2 Backbone, Diffusion Prior, & Submodules

2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP

2.5 Model Inference

3 Results and 3.1 fMRI-to-Image Reconstruction

3.2 Image Captioning

3.3 Image/Brain Retrieval and 3.4 Brain Correlation

3.5 Ablations

4 Related Work

5 Conclusion

6 Acknowledgements and References

A Appendix

A.1 Author Contributions

A.2 Additional Dataset Information

A.3 MindEye2 (not pretrained) vs. MindEye1

A.4 Reconstruction Evaluations Across Varying Amounts of Training Data

A.5 Single-Subject Evaluations

A.6 UnCLIP Evaluation

A.7 OpenCLIP BigG to CLIP L Conversion

A.8 COCO Retrieval

A.9 Reconstruction Evaluations: Additional Information

A.10 Pretraining with Less Subjects

A.11 UMAP Dimensionality Reduction

A.12 ROI-Optimized Stimuli

A.13 Human Preference Experiments

A.8 COCO Retrieval

MindEye1 scaled up image retrieval to a pool of billions of candidate images from the LAION-5B dataset (Schuhmann et al., 2022). This was possible because all LAION images had already been converted to CLIP L embeddings and made available for nearest neighbor lookup via the CLIP Retrieval client (Beaumont, 2022). We were not able to use this approach for MindEye2 because it would require converting all images to the 256 × 1664 dimensional OpenCLIP bigG latent space, which was not feasible.

That said, a cursory investigation with the comparatively smaller MS-COCO dataset suggests that retrieval from a pool of images that does not contain the original image may not work as well with OpenCLIP bigG embeddings as with the CLIP L embeddings used in MindEye1. To test retrieval, we used FAISS (Douze et al., 2024) for k-nearest neighbor search through an index of flattened OpenCLIP bigG embeddings of 73,000 MS-COCO images. We found that for incorrect retrievals, the 3 nearest neighbors were usually dissimilar to the original image both semantically and in low-level appearance.

This could be because the latents corresponding to the 256 image patch tokens of OpenCLIP bigG represent a more complex combination of different levels of information. As a result, OpenCLIP bigG embeddings may be less effective for nearest neighbor retrieval in terms of subjective interpretation than the last layer of CLIP ViT-L/14, which is highly semantic but lacks low-level image content. Although MindEye2 showed improved retrieval performance compared to MindEye1 on random subsets of 300 images (Table 4), we suggest that mapping to the last layer of the CLIP ViT-L/14 image space would work better if the intended application is to find semantically related nearest neighbors in a large image pool.
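The retrieval test described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it swaps FAISS for a brute-force cosine search in NumPy so that it runs without extra dependencies, and it uses small random arrays in place of real OpenCLIP bigG latents. The source does not state which FAISS index type or similarity metric was used, so cosine similarity is an assumption, as are the helper names `storage_bytes`, `l2_normalize`, and `knn_search` and the float16 storage size.

```python
import numpy as np

# Shape of one OpenCLIP bigG image embedding, as stated in the text.
N_TOKENS, TOKEN_DIM = 256, 1664
FLAT_DIM = N_TOKENS * TOKEN_DIM  # 425,984 values per flattened image


def storage_bytes(n_images, bytes_per_value=2):
    """Rough size of a flat index. float16 (2 bytes) is an assumption."""
    return n_images * FLAT_DIM * bytes_per_value


def l2_normalize(x):
    """Scale rows to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def knn_search(pool, queries, k=3):
    """Return (similarities, indices) of the k most similar pool rows."""
    sims = l2_normalize(queries) @ l2_normalize(pool).T
    top = np.argsort(-sims, axis=1)[:, :k]  # highest similarity first
    return np.take_along_axis(sims, top, axis=1), top


# Toy stand-ins: 1000 "images" with 4 tokens of 8 dims each,
# instead of real 256 x 1664 latents.
rng = np.random.default_rng(0)
pool_tokens = rng.standard_normal((1000, 4, 8)).astype(np.float32)
pool = pool_tokens.reshape(len(pool_tokens), -1)  # flatten tokens per image

# Simulated brain-predicted embeddings: noisy copies of pool items 10 and 20.
targets = np.array([10, 20])
noise = 0.1 * rng.standard_normal((2, pool.shape[1])).astype(np.float32)
queries = pool[targets] + noise

sims, idx = knn_search(pool, queries, k=3)
top1_correct = idx[:, 0] == targets  # did the true image rank first?
```

Under the float16 assumption, `storage_bytes(73_000)` comes to about 62 GB for the MS-COCO pool, while a hypothetical pool of 5 billion images would need roughly 4.3 PB, which gives a sense of why a LAION-scale bigG index was not feasible. With real embeddings and FAISS installed, an exact inner-product index such as `faiss.IndexFlatIP` performs the same kind of search as `knn_search` here.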

This paper is available on arXiv under the CC BY 4.0 DEED license.

Authors:

(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);

(2) Mihir Tripathy, Medical AI Research Center (MedARC), core contribution;

(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC), core contribution;

(4) Reese Kneeland, University of Minnesota, core contribution;

(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);

(6) Ashutosh Narang, Medical AI Research Center (MedARC);

(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);

(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);

(9) Thomas Naselaris, University of Minnesota;

(10) Kenneth A. Norman, Princeton Neuroscience Institute;

(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).