Papers
arxiv:2309.07597

C-Pack: Packaged Resources To Advance General Chinese Embedding

Published on Sep 14, 2023

Abstract

C-Pack advances Chinese text embeddings with a benchmark, a dataset, and optimized models, achieving state-of-the-art performance on English embeddings as well.

AI-generated summary

We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% upon the time of the release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese data. All these resources are made publicly available at https://github.com/FlagOpen/FlagEmbedding.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2309.07597
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 188

Browse 188 models citing this paper

Datasets citing this paper 11

Browse 11 datasets citing this paper

Spaces citing this paper 2,255

Collections including this paper 2