arxiv:2410.22906

From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes

Published on Oct 30, 2024

Abstract

Phoneme-based training of language models slightly reduces performance on some language understanding tasks but offers analytical and practical benefits.

AI-generated summary

Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
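The conversion pipeline itself is described in the paper rather than on this page, but as a rough illustration of its core step (grapheme-to-phoneme conversion over a corpus), a minimal sketch using the open-source phonemizer package might look like the following. The function name, language setting, and the way transcriptions are joined are illustrative assumptions, not the authors' implementation.

# Minimal sketch (not the authors' code): phonemize a small corpus and
# join the results into one continuous stream. Assumes the open-source
# `phonemizer` package (pip install phonemizer) with the espeak-ng
# backend installed on the system.
from phonemizer import phonemize

def to_phoneme_stream(sentences):
    # Convert each orthographic sentence to an IPA transcription.
    ipa = phonemize(
        sentences,
        language="en-us",            # assumed; the BabyLM data is English
        backend="espeak",
        strip=True,                  # drop trailing separators
        preserve_punctuation=False,  # phonemes only, no punctuation
    )
    # Concatenate sentence-level transcriptions into a single stream;
    # how word and utterance boundaries are marked is a design choice
    # the paper addresses, glossed over here with a plain space.
    return " ".join(ipa)

corpus = ["The cat sat on the mat.", "Language models learn from text."]
print(to_phoneme_stream(corpus))

The same conversion would be applied both to the pre-training corpus and to the evaluation benchmarks, so that the model is trained and tested entirely on phonemic input.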


Get this paper in your agent:

hf papers read 2410.22906

Don't have the latest CLI? Install it with:

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 9
Datasets citing this paper: 2
Spaces citing this paper: 1
Collections including this paper: 2