# LEFSA Models

This repository contains LEFSA model weights for multimodal affective state recognition. LEFSA stands for Label Encoder Fusion Strategy with Averaging; the model performs joint emotion and sentiment recognition from audio, video, and text modalities.
## Files

- `lefsa.pt` – LEFSA checkpoint for joint audio-video-text emotion and sentiment recognition.
- `emoaffectnet.pt` – EmoAffectNet model for visual feature extraction. The original EmoAffectNet repository is available at https://github.com/ElenaRyumina/EMO-AffectNetModel.
## What the Model Predicts
The model has two classification heads.
| Task | Number of classes | Class order |
|---|---|---|
| Emotion recognition | 7 | neutral, happy, sad, anger, surprise, disgust, fear |
| Sentiment recognition | 3 | negative, neutral, positive |
Use this exact class order when converting logits or probabilities to labels.
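As a minimal sketch, decoding the two heads' outputs could look like the following (the function name and plain-list inputs are illustrative assumptions, not the repository's API; only the class orders come from the table above):

```python
EMOTIONS = ["neutral", "happy", "sad", "anger", "surprise", "disgust", "fear"]
SENTIMENTS = ["negative", "neutral", "positive"]

def decode(emotion_logits, sentiment_logits):
    """Map one utterance's raw head outputs (plain lists) to labels via argmax."""
    emo = EMOTIONS[max(range(7), key=emotion_logits.__getitem__)]
    sen = SENTIMENTS[max(range(3), key=sentiment_logits.__getitem__)]
    return emo, sen

# Toy logits for one utterance: argmax picks index 1 ("happy") and 2 ("positive").
emo, sen = decode([0.1, 2.3, 0.5, -0.2, 0.0, -1.0, 0.4], [-0.5, 0.2, 1.7])
# emo == "happy", sen == "positive"
```

The same indices apply whether you decode raw logits or softmaxed probabilities, since argmax is unchanged by softmax.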
## Model Overview
- Acoustic, visual, and linguistic features are downsampled to a common temporal representation.
- The model applies cross-modal transformer blocks to model interactions between modalities.
- A label encoder produces unimodal emotion and sentiment predictions and injects this label-level context back into the fusion module.
- In LEFSA, unimodal predictions are additionally averaged with multimodal predictions to improve robustness.
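The averaging step in the last bullet can be sketched as below. This is a simplified illustration under the assumption of a plain arithmetic mean over probability distributions; the function names and the two-modality example are hypothetical, and the paper is the authority on the exact weighting scheme.

```python
def average_predictions(multimodal, unimodal_list):
    """Elementwise mean over the multimodal distribution and all unimodal ones."""
    all_preds = [multimodal] + unimodal_list
    n = len(all_preds)
    return [sum(p[i] for p in all_preds) / n for i in range(len(multimodal))]

fused = average_predictions(
    [0.6, 0.3, 0.1],                      # multimodal sentiment probabilities
    [[0.5, 0.4, 0.1], [0.7, 0.2, 0.1]],  # e.g. audio and text unimodal predictions
)
```

Because each input is a valid probability distribution, the averaged output also sums to 1, so it can be decoded with the same argmax rule as any single head.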
## Research Corpora
The model family was evaluated in a multilingual and multicorpus setting.
| Corpus | Language / domain | Modalities | Tasks |
|---|---|---|---|
| RAMAS | Russian, dyadic semi-spontaneous interactions | Audio, video, text | Emotion, sentiment |
| MELD | English, scripted TV-series dialogues | Audio, video, text | Emotion, sentiment |
| CMU-MOSEI | English, in-the-wild YouTube monologues | Audio, video, text | Emotion, sentiment |
Emotion labels are mapped to seven classes: neutral, happiness, sadness, anger, surprise, disgust, and fear. Sentiment labels are mapped to three classes: negative, neutral, and positive.
## Feature and Fusion Strategy Comparison

MF denotes macro F1-score; the subscript gives the number of classes. For CMU-MOSEI emotion recognition, mMF is the mean macro F1-score over the binary (present/absent) emotion classes. The Average column is the integral score reported in the paper.
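To make the mMF metric concrete, here is a pure-Python sketch: macro F1 is computed per emotion as a binary present/absent task, then averaged over the emotion classes. The function names and the toy labels are illustrative; the paper's evaluation code is the authority on averaging and tie-breaking details.

```python
def binary_f1(y_true, y_pred, positive):
    """F1 for one class of a binary task, from exact TP/FP/FN counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def macro_f1_binary(y_true, y_pred):
    """Macro F1 of one binary task: mean of F1 for class 1 and class 0."""
    return (binary_f1(y_true, y_pred, 1) + binary_f1(y_true, y_pred, 0)) / 2

def mean_macro_f1(true_per_class, pred_per_class):
    """mMF: average the binary macro F1 over all emotion classes."""
    scores = [macro_f1_binary(t, p) for t, p in zip(true_per_class, pred_per_class)]
    return sum(scores) / len(scores)

# Two toy emotion classes, four utterances each, with present/absent (1/0) labels.
score = mean_macro_f1(
    [[1, 0, 1, 0], [0, 0, 1, 1]],  # ground truth per class
    [[1, 0, 0, 0], [0, 1, 1, 1]],  # predictions per class
)
```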
| ID | Audio | Video | Text | Fusion | RAMAS Emotion MF7 | RAMAS Sentiment MF3 | MELD Emotion MF7 | MELD Sentiment MF3 | CMU-MOSEI Emotion mMF6 | CMU-MOSEI Sentiment MF3 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Wav2Vec2 | EmoAffectNet | JINA | BFS | 60.57 | 65.02 | 38.56 | 65.94 | 62.46 | 62.56 | 59.18 |
| 2 | ExHuBERT | EmoAffectNet | JINA | BFS | 62.35 | 64.02 | 38.40 | 63.93 | 62.10 | 62.32 | 58.85 |
| 3 | Wav2Vec2 | EmoAffectNet | RoBERTa | BFS | 57.64 | 67.14 | 35.48 | 65.03 | 61.58 | 60.19 | 57.84 |
| 4 | ExHuBERT | EmoAffectNet | RoBERTa | BFS | 60.54 | 64.70 | 35.05 | 63.78 | 60.78 | 59.25 | 57.35 |
| 5 | Wav2Vec2 | ResEmoteNet | JINA | BFS | 55.88 | 59.93 | 37.75 | 63.81 | 62.22 | 63.69 | 57.21 |
| 6 | ExHuBERT | ResEmoteNet | JINA | BFS | 57.21 | 61.40 | 38.25 | 64.06 | 61.11 | 60.81 | 57.14 |
| 7 | ExHuBERT | ResEmoteNet | RoBERTa | BFS | 49.62 | 54.44 | 32.56 | 62.15 | 59.88 | 60.47 | 53.19 |
| 8 | Wav2Vec2 | ResEmoteNet | RoBERTa | BFS | 52.29 | 54.87 | 34.08 | 61.65 | 58.79 | 57.42 | 53.18 |
| 9 | Wav2Vec2 | EmoAffectNet | JINA | LEFS | 61.38 | 66.57 | 39.79 | 65.70 | 62.79 | 61.59 | 59.64 |
| 10 | Wav2Vec2 | EmoAffectNet | JINA | LEFSA | 62.52 | 64.96 | 40.09 | 67.02 | 62.30 | 62.00 | 59.81 |
The best configuration (ID 10) combines Wav2Vec2, EmoAffectNet, and JINA features with the Label Encoder Fusion Strategy with Averaging (LEFSA).
## Comparison with Prior Multimodal Approaches

ST denotes single-task recognition and MT denotes multitask recognition; "–" marks metrics not reported for an approach.
| Approach | Corpus | Setup | Emotion A7 | Emotion WF7 | Sentiment A3 | Sentiment WF3 |
|---|---|---|---|---|---|---|
| Ours | RAMAS | MT | 68.99 | 67.79 | 84.11 | 84.02 |
| Zhang et al. | MELD | MT | 41.17 | 41.22 | 67.33 | 67.21 |
| Van et al. | MELD | ST | 66.28 | 65.69 | – | – |
| Hwang et al. | MELD | ST | 66.70 | 65.93 | – | – |
| Tu et al. | MELD | ST | 67.85 | 67.02 | – | – |
| Ours | MELD | MT | 62.30 | 59.79 | 69.20 | 69.02 |

| Approach | Corpus | Setup | Emotion mWA6 | Emotion mWF6 | Sentiment A2 | Sentiment WF2 |
|---|---|---|---|---|---|---|
| Chauhan et al. | CMU-MOSEI | MT | 62.97 | 79.02 | 80.37 | 78.23 |
| Sangwan et al. | CMU-MOSEI | MT | 63.16 | 79.06 | 80.15 | 78.30 |
| Hwang et al. | CMU-MOSEI | ST | – | – | 87.40 | 87.30 |
| Zheng et al. | CMU-MOSEI | ST | – | – | 85.90 | 86.00 |
| Ours | CMU-MOSEI | MT | 64.78 | 79.06 | 84.83 | 84.90 |
## Related Publications
Markitantov M., Ryumina E., Kaya H., Karpov A. Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion // In Proc. Interspeech 2025, pp. 3010β3014. https://doi.org/10.21437/Interspeech.2025-2060
```bibtex
@inproceedings{markitantov25_interspeech,
  title     = {{Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion}},
  author    = {Maxim Markitantov and Elena Ryumina and Heysem Kaya and Alexey Karpov},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {3010--3014},
  doi       = {10.21437/Interspeech.2025-2060},
  issn      = {2958-1796}
}
```
Markitantov M., Ryumina E., Dvoynikova A., Karpov A. Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion // Information Fusion, 2026, vol. 132, article 104207. https://doi.org/10.1016/j.inffus.2026.104207
```bibtex
@article{markitantov2026triplefusion,
  title   = {Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion},
  author  = {Markitantov, Maxim and Ryumina, Elena and Dvoynikova, Anastasia and Karpov, Alexey},
  journal = {Information Fusion},
  volume  = {132},
  pages   = {104207},
  year    = {2026},
  doi     = {10.1016/j.inffus.2026.104207},
  url     = {https://doi.org/10.1016/j.inffus.2026.104207}
}
```