The paper provides a compelling analysis of why its architecture is superior to traditional dual-encoder (CLIP-style) models.
- Modality Gap: Figure 2 shows the distribution of cosine similarities for matching pairs. In the CLIP-style models (top, middle), there is a clear gap: text-text pairs score high similarity, while image-text pairs score much lower. In jina-embeddings-v4 (bottom), the two distributions largely overlap in the high-similarity region, indicating that the modality gap has been dramatically reduced (a sketch of this measurement follows Figure 2's caption below).
Figure 2: Distribution of cosine similarities for matched pairs. The plots compare OpenAI CLIP (top), jina-clip-v2 (middle), and jina-embeddings-v4 (bottom). The reduced gap between the blue (Image-Text) and pink (Text-Text) distributions in the bottom plot shows the improved cross-modal alignment of jina-embeddings-v4.
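As a rough illustration of how such a comparison can be made, the sketch below contrasts the cosine similarities of matched text-text pairs with matched image-text pairs. The `modality_gap_stats` helper and the random placeholder embeddings are assumptions for illustration, not the paper's evaluation code; any model that returns fixed-size vectors for both modalities could be dropped in.

```python
# Minimal sketch of a Figure 2-style measurement: compare the similarity
# distribution of matched text-text pairs against matched image-text pairs.
# The embedding matrices are placeholders; in practice each row would come
# from the model under evaluation (one row per item, matched by index).
import numpy as np

def cosine_rows(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two (n, d) matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

def modality_gap_stats(text_emb, paraphrase_emb, image_emb):
    """Summarize matched-pair similarities for the two pair types in Figure 2.

    text_emb[i] and paraphrase_emb[i] embed two texts describing the same
    content; image_emb[i] embeds the matching image.
    """
    text_text = cosine_rows(text_emb, paraphrase_emb)   # pink curve in Fig. 2
    image_text = cosine_rows(image_emb, text_emb)       # blue curve in Fig. 2
    return {
        "text_text_mean": float(text_text.mean()),
        "image_text_mean": float(image_text.mean()),
        "gap": float(text_text.mean() - image_text.mean()),
    }

# Random stand-ins for real embeddings, just to show the call shape:
rng = np.random.default_rng(0)
t, p, im = (rng.normal(size=(1000, 128)) for _ in range(3))
print(modality_gap_stats(t, p, im))  # a smaller "gap" means better alignment
```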
- Cross-Modal Alignment: Table 4 quantifies this observation: jina-embeddings-v4 achieves a much higher cross-modal alignment score, defined as the average similarity of matching image-text pairs, than the CLIP-style models (see the sketches after Figure 3's caption below).
- Cone Effect: Figure 3 visualizes the separation between positive (correct) and negative (incorrect) image-text pairs. In CLIP (top), the two distributions sit very close together, a symptom of the "cone effect": all embeddings cluster in a narrow cone of the space, which makes discrimination difficult. In contrast, jina-embeddings-v4 (bottom) shows a wide, clear separation between the positive (blue) and negative (orange) distributions, indicating that it uses the embedding space far more effectively (also sketched below).
Figure 3: Distribution of cosine similarities for positive vs. negative pairs. The plots show OpenAI CLIP (top), jina-clip-v2 (middle), and jina-embeddings-v4 (bottom). The clear separation of peaks for positive (blue) and negative (orange) samples in the bottom plot demonstrates the superior discriminative power of jina-embeddings-v4.
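For the Table 4 metric, the score described above is simply the mean similarity over matched image-text pairs. A minimal sketch, assuming two pre-computed embedding matrices whose i-th rows correspond to the same item (the function name is ours, not the paper's):

```python
# Sketch of the cross-modal alignment score described for Table 4: the average
# cosine similarity over matched (image_i, text_i) pairs. Higher is better.
import numpy as np

def cross_modal_alignment(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(image_emb * text_emb, axis=1)))
```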
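The Figure 3 comparison can be approximated the same way: score matched pairs as positives, shuffle the pairing to obtain negatives, and measure how far apart the two similarity distributions sit. Again a hedged sketch with placeholder names; the paper's own choice of negatives may differ.

```python
# Sketch of a Figure 3-style check: similarity of positive (matched) pairs vs.
# negative (mismatched) pairs. A narrow margin between the two distributions is
# the symptom of the cone effect; a wide margin means the embedding space is
# being used effectively for discrimination.
import numpy as np

def separation_stats(image_emb: np.ndarray, text_emb: np.ndarray) -> dict:
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    pos = np.sum(image_emb * text_emb, axis=1)                      # blue in Fig. 3
    neg = np.sum(image_emb * np.roll(text_emb, 1, axis=0), axis=1)  # orange in Fig. 3

    return {
        "positive_mean": float(pos.mean()),
        "negative_mean": float(neg.mean()),
        "margin": float(pos.mean() - neg.mean()),
    }
```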