Reconstruction Evaluations Across Varying Amounts of Training Data: MindEye2

15 Apr 2025

Abstract and 1 Introduction

2 MindEye2 and 2.1 Shared-Subject Functional Alignment

2.2 Backbone, Diffusion Prior, & Submodules

2.3 Image Captioning and 2.4 Fine-tuning Stable Diffusion XL for unCLIP

2.5 Model Inference

3 Results and 3.1 fMRI-to-Image Reconstruction

3.2 Image Captioning

3.3 Image/Brain Retrieval and 3.4 Brain Correlation

3.5 Ablations

4 Related Work

5 Conclusion

6 Acknowledgements and References

A Appendix

A.1 Author Contributions

A.2 Additional Dataset Information

A.3 MindEye2 (not pretrained) vs. MindEye1

A.4 Reconstruction Evaluations Across Varying Amounts of Training Data

A.5 Single-Subject Evaluations

A.6 UnCLIP Evaluation

A.7 OpenCLIP BigG to CLIP L Conversion

A.8 COCO Retrieval

A.9 Reconstruction Evaluations: Additional Information

A.10 Pretraining with Less Subjects

A.11 UMAP Dimensionality Reduction

A.12 ROI-Optimized Stimuli

A.13 Human Preference Experiments

A.4 Reconstruction Evaluations Across Varying Amounts of Training Data

Here, we present a further analysis of how model performance scales with the amount of training data. All results in Figures 6, 7, and 8 are computed for subject 1 only.

Figure 6: Low-level metric performance (y-axis) plotted against the number of fMRI scanning sessions used in the training data (x-axis) for subject 1. All values are normalized to the same y-axis. The bolded line represents the average performance across all metrics.

Figure 7: High-level metric performance (y-axis) plotted against the number of fMRI scanning sessions used in the training data (x-axis) for subject 1. All values are normalized to the same y-axis. The bolded line represents the average performance across all metrics. SwAV and EffNetB scores are inverted in this plot so that higher is better for all metrics.
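The captions for Figures 6 and 7 describe three steps: putting every metric on a shared y-axis, inverting the metrics for which lower is better, and averaging across metrics to obtain the bolded line. The source does not say which normalization was used, so the sketch below assumes simple min-max scaling per metric. The function names, the `metric_a` entry, and all numbers are hypothetical and serve only to illustrate the idea.

```python
# Minimal sketch, NOT the authors' plotting code.
# Assumption: "normalized to the same y-axis" is modelled as min-max scaling.
# All names and numbers below are hypothetical.

# Metrics where a lower raw score is better (named in the Figure 7 caption).
LOWER_IS_BETTER = {"SwAV", "EffNetB"}


def min_max(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat curve has no spread; place it mid-axis.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_metrics(scores):
    """scores maps metric name -> list of values, one per training-set size."""
    normalized = {}
    for name, values in scores.items():
        scaled = min_max(values)
        if name in LOWER_IS_BETTER:
            # Flip the curve so that higher is better for every metric.
            scaled = [1.0 - s for s in scaled]
        normalized[name] = scaled
    return normalized


def average_curve(normalized):
    """Point-wise mean across metrics (the bolded line in the figures)."""
    curves = list(normalized.values())
    n_points = len(curves[0])
    return [sum(c[i] for c in curves) / len(curves) for i in range(n_points)]


# Made-up scores for three training-set sizes, for illustration only.
example = {
    "metric_a": [0.60, 0.80, 0.90],  # hypothetical higher-is-better metric
    "SwAV": [0.50, 0.40, 0.30],      # lower is better, so it gets inverted
}
normalized = normalize_metrics(example)
mean_curve = average_curve(normalized)
```

With this convention every curve rises as performance improves, which is what lets a single averaged line summarize metrics that have different raw scales and directions.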

Figure 8: Brain correlation scores (y-axis) in different brain regions, plotted against the number of fMRI scanning sessions used in the training data (x-axis) for subject 1. The regions are visual cortex (defined by the nsdgeneral mask, bolded); V1, V2, V3, and V4 (collectively called early visual cortex); and higher visual areas (the nsdgeneral voxels that lie outside early visual cortex). All values are normalized to the same y-axis.
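The region definitions in the Figure 8 caption can be illustrated with plain set operations. The sketch below reads the "higher visual areas" region as the nsdgeneral voxels that fall outside V1 through V4, which is an interpretation of the caption rather than something the source spells out. The voxel indices and per-voxel scores are toy values, and `region_mean` is a hypothetical helper.

```python
# Minimal sketch with toy data, NOT the authors' evaluation code.
# Assumption: higher visual areas = nsdgeneral minus early visual cortex.

# Toy voxel indices; real masks come from the dataset's ROI definitions.
nsdgeneral = set(range(10))
early_visual = {"V1": {0, 1}, "V2": {2, 3}, "V3": {4}, "V4": {5}}

# Early visual cortex is the union of V1 through V4.
early_all = set().union(*early_visual.values())

# Higher visual areas: voxels in nsdgeneral but not in early visual cortex.
higher_visual = nsdgeneral - early_all


def region_mean(voxel_scores, voxels):
    """Average a per-voxel score over the voxels of one region."""
    return sum(voxel_scores[v] for v in voxels) / len(voxels)


# Made-up per-voxel correlation scores, for illustration only.
voxel_scores = {v: 0.1 * v for v in nsdgeneral}

regions = {"nsdgeneral": nsdgeneral, **early_visual, "higher_visual": higher_visual}
region_scores = {name: region_mean(voxel_scores, vox) for name, vox in regions.items()}
```

Each entry of `region_scores` corresponds to one point on one curve in a plot like Figure 8; repeating the computation for models trained on different numbers of sessions would give the full curves.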

This paper is available on arXiv under the CC BY 4.0 license.

Authors:

(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);

(2) Mihir Tripathy, Medical AI Research Center (MedARC) (core contribution);

(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC) (core contribution);

(4) Reese Kneeland, University of Minnesota (core contribution);

(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);

(6) Ashutosh Narang, Medical AI Research Center (MedARC);

(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);

(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);

(9) Thomas Naselaris, University of Minnesota;

(10) Kenneth A. Norman, Princeton Neuroscience Institute;

(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).