Baidu proposes ERNIE-ViL 2.0, a multi-view contrastive learning framework that aims to learn a more robust cross-modal representation by simultaneously building intra-modal and cross-modal correlations between different views

Vision-language pre-training (VLP) models have made significant progress on several cross-modal tasks, such as visual question answering (VQA) and cross-modal retrieval, over the past two years. Most previous efforts are based on cross-modal transformer encoders and focus on designing proxy pre-training tasks (e.g., masked language modeling (MLM) and masked region modeling (MRM)) to learn a joint cross-modal representation. However, the cross-modal attention layers in these encoders fuse visual and textual features at the token level through massive interactions, which leads to high computational costs for real-world systems such as online cross-modal retrieval systems.

To overcome this constraint, recent research based on the dual-encoder architecture uses a computationally efficient framework with light cross-modal interaction and achieves comparable performance on vision-language tasks by training on large-scale image-text pairs. However, these approaches build cross-modal alignment through single-view contrastive learning, because the cross-modal correlation they establish depends on only one view for each modality. The intra-modal correlation they overlook has the potential to improve the single-modal representations and contribute to better cross-modal alignment. Moreover, noisy image-text pairs crawled from the Web often have only weak correlations between their intrinsic visual and textual views, which widens the cross-modal semantic gap.
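To illustrate why a dual-encoder design is cheap at serving time, here is a minimal sketch of retrieval over precomputed embeddings. The function names, the toy three-dimensional vectors, and the use of cosine similarity for ranking are assumptions made for this example and are not taken from the ERNIE-ViL 2.0 code. In a real system the vectors would come from the image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_index(image_embeddings):
    """Normalize every image embedding once, offline (hypothetical helper)."""
    return {image_id: l2_normalize(v) for image_id, v in image_embeddings.items()}


def retrieve(text_embedding, index, top_k=2):
    """Rank indexed images by cosine similarity to a single text embedding."""
    query = l2_normalize(text_embedding)
    scored = [(dot(query, v), image_id) for image_id, v in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [image_id for _, image_id in scored[:top_k]]


# Toy vectors standing in for encoder outputs (assumed values).
image_embeddings = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "cat.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
index = build_index(image_embeddings)

query = [0.8, 0.3, 0.1]  # pretend text-encoder output for a dog caption
ranked = retrieve(query, index, top_k=2)
print(ranked)  # ['dog.jpg', 'cat.jpg']
```

Because the two modalities only meet in the final dot product, the image side can be indexed ahead of time and each query costs one text-encoder pass plus a similarity search, in contrast to running cross-modal attention over every image-text candidate pair.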

The researchers propose ERNIE-ViL 2.0, a multi-view contrastive learning framework for cross-modal retrieval that aims to learn a robust cross-modal representation by modeling cross-modal and intra-modal correlations between different views. Unlike traditional single-view contrastive learning approaches, multi-view contrastive learning learns both intra-modal and cross-modal correlations. In a similar vein, CMC (Contrastive Multiview Coding) uses multi-view contrastive learning for visual representation learning, resulting in a more robust representation. Their approach constructs multiple visual and textual views to enhance representations within and across modalities.


Multi-view contrastive learning vs. single-view contrastive learning: Single-view contrastive learning relies solely on a single cross-modal correlation between one visual view and one textual view. By constructing multiple views, multi-view contrastive learning can learn several kinds of intra-modal and cross-modal correlations.

Specifically, they construct image-image pairs and text-text pairs as intra-modal contrastive view pairs to improve the representation within each modality. In addition to the intrinsic visual and textual views, they construct sequences of object tags as a special textual view to reduce the impact of noisy multimodal data and ease the learning of vision-language alignment. Using the dual-encoder architecture, they train an English model on publicly available datasets totaling 29 million image-text pairs and achieve competitive performance on cross-modal retrieval tasks. They also scale the training data up to 1.5 billion Chinese image-text pairs, yielding significant gains over previous SOTA results on Chinese cross-modal retrieval.
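The idea of contrasting several view pairs can be sketched as a sum of standard InfoNCE losses, one per pair of views. This is a simplified illustration and not the authors' implementation: the view names, the choice of four pairs, the equal weighting of the pairs, and the temperature of 0.07 are all assumptions for the example, and the toy vectors stand in for encoder outputs.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, positives, temperature=0.07):
    """One-directional InfoNCE: anchors[i] should match positives[i];
    the other items in the batch act as negatives."""
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def symmetric_info_nce(x, y, temperature=0.07):
    """Average the loss over both directions (x to y and y to x)."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


def multi_view_loss(views, pairs, temperature=0.07):
    """Unweighted mean of the contrastive losses over the chosen view pairs
    (equal weighting is an assumption of this sketch)."""
    losses = [symmetric_info_nce(views[a], views[b], temperature) for a, b in pairs]
    return sum(losses) / len(losses)


# Toy batch of two samples; each view holds one embedding per sample.
views = {
    "image": [[1.0, 0.0], [0.0, 1.0]],        # original image
    "image_aug": [[0.9, 0.1], [0.1, 0.9]],    # augmented image
    "text": [[0.8, 0.2], [0.2, 0.8]],         # caption
    "text_aug": [[0.75, 0.25], [0.25, 0.75]], # augmented caption
    "tags": [[0.7, 0.3], [0.3, 0.7]],         # object-tag sequence, encoded as text
}
pairs = [
    ("image", "image_aug"),  # intra-modal, visual
    ("text", "text_aug"),    # intra-modal, textual
    ("image", "text"),       # cross-modal
    ("image", "tags"),       # cross-modal, tags as a textual view
]

aligned_loss = multi_view_loss(views, pairs)
print(aligned_loss)
```

Restricting `pairs` to `[("image", "text")]` reduces this to the single-view setting, which makes the difference between the two schemes easy to see: the multi-view version simply adds extra contrastive terms over the same encoders.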

Broadly, they divide their contributions into three categories:

1. They propose the first multi-view contrastive learning framework for cross-modal retrieval, which uses multiple views to produce invariant and robust cross-modal representations.

2. They introduce object tags as special textual views, bridging the semantic gap between image and text and making it easier to learn cross-modal alignment on large-scale noisy data.

3. Using only publicly available noisy datasets, they establish a credible and comparable benchmark for English cross-modal retrieval. Moreover, their model achieves SOTA performance on Chinese cross-modal retrieval after being trained on 1.5 billion Chinese image-text pairs.

Official implementations of many pre-trained models in the ERNIE family, covering language understanding and generation as well as multimodal understanding and generation, are available on GitHub.

This article is written as a research summary by Marktechpost staff based on the research paper 'ERNIE-ViL 2.0: Multi-View Contrastive Learning for Image-Text Pre-training'. All credit for this research goes to the researchers on this project.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate studies in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He enjoys connecting with people and collaborating on interesting projects.