arxiv:2212.03533

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Published on Dec 7, 2022
Abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any task requiring a single-vector representation of texts, such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. In the zero-shot setting, E5 is the first model to outperform the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
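The contrastive training the abstract describes is commonly implemented as an InfoNCE objective over in-batch negatives: each query is pulled toward its paired passage and pushed away from every other passage in the batch. Below is a minimal NumPy sketch of that objective; the temperature, batch size, and embedding dimension are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """InfoNCE loss with in-batch negatives.

    Each query's positive is the passage at the same batch index;
    all other passages in the batch serve as negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = q @ p.T / temperature  # (batch, batch) similarity matrix

    # Log-softmax over each row, cross-entropy against the diagonal
    # (the matching query-passage pairs).
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
batch, dim = 8, 32
queries = rng.normal(size=(batch, dim))
# Positives: noisy copies of the queries; the rest of the batch
# supplies the negatives. Real training would use encoder outputs
# for weakly supervised text pairs instead of random vectors.
passages = queries + 0.1 * rng.normal(size=(batch, dim))
print(info_nce_loss(queries, passages))
```

The loss drops as matched pairs become more similar than mismatched ones, which is what drives the single-vector representations toward being useful for retrieval and clustering.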

