arxiv:2103.06874

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Published on Mar 11, 2021

Abstract

CANINE, a neural encoder that operates directly on character sequences without explicit tokenization or a fixed vocabulary, outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a multilingual question answering benchmark, while using 28% fewer parameters.

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
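The two ideas named in the abstract, vocabulary-free character input and downsampling before a deeper encoder, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the helper names (`text_to_codepoints`, `toy_embed`, `downsample`), the use of mean pooling as the downsampling operation, the embedding size, and the rate of 4.

```python
from typing import List


def text_to_codepoints(text: str) -> List[int]:
    """Map each character to its Unicode codepoint; no vocabulary lookup is involved."""
    return [ord(ch) for ch in text]


def toy_embed(codepoint: int, dim: int = 4) -> List[float]:
    """Hypothetical deterministic embedding; a real model would learn these vectors."""
    return [((codepoint * (d + 1)) % 97) / 97.0 for d in range(dim)]


def downsample(seq: List[List[float]], rate: int) -> List[List[float]]:
    """Mean-pool non-overlapping windows of `rate` positions.

    Mean pooling is an illustrative stand-in for whatever downsampling
    operation the model actually uses; the abstract does not specify it.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled = []
    for start in range(0, len(seq), rate):
        window = seq[start:start + rate]
        dim = len(window[0])
        pooled.append(
            [sum(vec[d] for vec in window) / len(window) for d in range(dim)]
        )
    return pooled


# Character-level input: any Unicode string works, including non-ASCII text.
text = "caf\u00e9 au lait"
codepoints = text_to_codepoints(text)

# One vector per character, then a shorter sequence for the deeper layers.
char_vectors = [toy_embed(cp) for cp in codepoints]
pooled = downsample(char_vectors, rate=4)

print(len(codepoints), len(pooled))
```

The point of the sketch is the length reduction: a stack of expensive layers applied to `pooled` sees a sequence several times shorter than the raw character sequence, which is how a character-level model can keep its compute comparable to a subword model.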

