OpenCLIP BigG to CLIP L Conversion: What You Need to Know

Figure 9: Generating images from their CLIP image embeddings. SDXL unCLIP (middle) outperforms Versatile Diffusion (right) in capturing perceptual details.

Table 9: SDXL unCLIP reconstructions from ground truth OpenCLIP image latents consistently outperform Versatile Diffusion reconstructions from ground truth CLIP image latents.

To convert OpenCLIP bigG image latents to CLIP L image latents, we independently trained a linear model using ground truth images from the COCO 2017 train and validation datasets. This conversion was necessary to use the pretrained GIT image captioning model. The PyTorch code used to train this model is given in Algorithm 1.
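Algorithm 1 is not reproduced in this section. As a rough illustration only, the sketch below shows one way a linear converter between two sets of token latents could be trained in PyTorch. It is not the authors' code: the class name `LinearConverter`, the helper `train_converter`, the choice of Adam with an MSE loss, and the toy tensor shapes are all assumptions made for the example, and random tensors stand in for precomputed OpenCLIP bigG and CLIP L latents.

```python
import torch
from torch import nn


class LinearConverter(nn.Module):
    """Hypothetical linear map from (batch, src_tokens, src_dim) latents
    to (batch, tgt_tokens, tgt_dim) latents, with no nonlinearity."""

    def __init__(self, src_tokens, src_dim, tgt_tokens, tgt_dim):
        super().__init__()
        self.token_map = nn.Linear(src_tokens, tgt_tokens)  # acts on the token axis
        self.dim_map = nn.Linear(src_dim, tgt_dim)          # acts on the embedding axis

    def forward(self, x):
        x = self.token_map(x.transpose(1, 2))   # (B, src_dim, tgt_tokens)
        return self.dim_map(x.transpose(1, 2))  # (B, tgt_tokens, tgt_dim)


def train_converter(src, tgt, epochs=200, lr=1e-2, seed=0):
    """Fit the converter on paired latents with full-batch MSE regression.

    The optimizer, loss, and hyperparameters are assumptions for this sketch.
    """
    torch.manual_seed(seed)
    model = LinearConverter(src.shape[1], src.shape[2], tgt.shape[1], tgt.shape[2])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(src), tgt)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return model, losses


if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy stand-ins: in practice these would be latents extracted from the
    # same COCO images by the two image encoders.
    src_latents = torch.randn(64, 8, 16)
    tgt_latents = torch.randn(64, 9, 12)
    model, losses = train_converter(src_latents, tgt_latents)
    print(model(src_latents).shape, losses[0], losses[-1])
```

In a real setting the paired latents would be computed once from the ground truth images and cached, and training would typically use mini-batches and a held-out split rather than the full-batch loop shown here.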

This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Paul S. Scotti, Stability AI and Medical AI Research Center (MedARC);

(2) Mihir Tripathy, Medical AI Research Center (MedARC), core contribution;

(3) Cesar Kadir Torrico Villanueva, Medical AI Research Center (MedARC), core contribution;

(4) Reese Kneeland, University of Minnesota, core contribution;

(5) Tong Chen, The University of Sydney and Medical AI Research Center (MedARC);

(6) Ashutosh Narang, Medical AI Research Center (MedARC);

(7) Charan Santhirasegaran, Medical AI Research Center (MedARC);

(8) Jonathan Xu, University of Waterloo and Medical AI Research Center (MedARC);

(9) Thomas Naselaris, University of Minnesota;

(10) Kenneth A. Norman, Princeton Neuroscience Institute;

(11) Tanishq Mathew Abraham, Stability AI and Medical AI Research Center (MedARC).
