# LEFSA Models

This repository contains LEFSA model weights for multimodal affective state recognition. LEFSA stands for Label Encoder Fusion Strategy with Averaging; the model performs joint emotion and sentiment recognition from audio, video, and text modalities.
## Files

- `lefsa.pt` – LEFSA checkpoint for joint audio-video-text emotion and sentiment recognition.
- `emoaffectnet.pt` – EmoAffectNet model for visual feature extraction. The original EmoAffectNet repository is available at https://github.com/ElenaRyumina/EMO-AffectNetModel.
## What the Model Predicts
The model has two classification heads.
| Task | Number of classes | Class order |
|---|---|---|
| Emotion recognition | 7 | neutral, happy, sad, anger, surprise, disgust, fear |
| Sentiment recognition | 3 | negative, neutral, positive |
Use this exact class order when converting logits or probabilities to labels.
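As a minimal sketch, decoding the two heads' outputs could look like the following (the function name and plain-list inputs are illustrative assumptions, not the repository's API; only the class orders come from the table above):

```python
EMOTIONS = ["neutral", "happy", "sad", "anger", "surprise", "disgust", "fear"]
SENTIMENTS = ["negative", "neutral", "positive"]

def decode(emotion_logits, sentiment_logits):
    """Map one utterance's raw head outputs (plain lists) to labels via argmax."""
    emo = EMOTIONS[max(range(7), key=emotion_logits.__getitem__)]
    sen = SENTIMENTS[max(range(3), key=sentiment_logits.__getitem__)]
    return emo, sen

# Toy logits for one utterance: argmax picks index 1 ("happy") and 2 ("positive").
emo, sen = decode([0.1, 2.3, 0.5, -0.2, 0.0, -1.0, 0.4], [-0.5, 0.2, 1.7])
# emo == "happy", sen == "positive"
```

The same indices apply whether you decode raw logits or softmaxed probabilities, since argmax is unchanged by softmax.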
## Model Overview
- Acoustic, visual, and linguistic features are downsampled to a common temporal representation.
- The model applies cross-modal transformer blocks to model interactions between modalities.
- A label encoder produces unimodal emotion and sentiment predictions and injects this label-level context back into the fusion module.
- In LEFSA, unimodal predictions are additionally averaged with multimodal predictions to improve robustness.
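The averaging step in the last bullet can be sketched as below. This is a simplified illustration under the assumption of a plain arithmetic mean over probability distributions; the function names and the two-modality example are hypothetical, and the paper is the authority on the exact weighting scheme.

```python
def average_predictions(multimodal, unimodal_list):
    """Elementwise mean over the multimodal distribution and all unimodal ones."""
    all_preds = [multimodal] + unimodal_list
    n = len(all_preds)
    return [sum(p[i] for p in all_preds) / n for i in range(len(multimodal))]

fused = average_predictions(
    [0.6, 0.3, 0.1],                      # multimodal sentiment probabilities
    [[0.5, 0.4, 0.1], [0.7, 0.2, 0.1]],  # e.g. audio and text unimodal predictions
)
```

Because each input is a valid probability distribution, the averaged output also sums to 1, so it can be decoded with the same argmax rule as any single head.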
## Research Corpora
The model family was evaluated in a multilingual and multicorpus setting.
| Corpus | Language / domain | Modalities | Tasks |
|---|---|---|---|
| RAMAS | Russian, dyadic semi-spontaneous interactions | Audio, video, text | Emotion, sentiment |
| MELD | English, scripted TV-series dialogues | Audio, video, text | Emotion, sentiment |
| CMU-MOSEI | English, in-the-wild YouTube monologues | Audio, video, text | Emotion, sentiment |
Emotion labels are mapped to seven classes: neutral, happiness, sadness, anger, surprise, disgust, and fear. Sentiment labels are mapped to three classes: negative, neutral, and positive.
## Feature and Fusion Strategy Comparison

MF denotes macro F1-score; the subscript gives the number of classes. For CMU-MOSEI emotion recognition, mMF is the mean macro F1-score over the binary (present/absent) emotion classes. The Average column is the integral score reported in the paper.
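To make the mMF metric concrete, here is a pure-Python sketch: macro F1 is computed per emotion as a binary present/absent task, then averaged over the emotion classes. The function names and the toy labels are illustrative; the paper's evaluation code is the authority on averaging and tie-breaking details.

```python
def binary_f1(y_true, y_pred, positive):
    """F1 for one class of a binary task, from exact TP/FP/FN counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def macro_f1_binary(y_true, y_pred):
    """Macro F1 of one binary task: mean of F1 for class 1 and class 0."""
    return (binary_f1(y_true, y_pred, 1) + binary_f1(y_true, y_pred, 0)) / 2

def mean_macro_f1(true_per_class, pred_per_class):
    """mMF: average the binary macro F1 over all emotion classes."""
    scores = [macro_f1_binary(t, p) for t, p in zip(true_per_class, pred_per_class)]
    return sum(scores) / len(scores)

# Two toy emotion classes, four utterances each, with present/absent (1/0) labels.
score = mean_macro_f1(
    [[1, 0, 1, 0], [0, 0, 1, 1]],  # ground truth per class
    [[1, 0, 0, 0], [0, 1, 1, 1]],  # predictions per class
)
```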
| ID | Audio | Video | Text | Fusion | RAMAS Emotion MF7 | RAMAS Sentiment MF3 | MELD Emotion MF7 | MELD Sentiment MF3 | CMU-MOSEI Emotion mMF6 | CMU-MOSEI Sentiment MF3 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Wav2Vec2 | EmoAffectNet | JINA | BFS | 60.57 | 65.02 | 38.56 | 65.94 | 62.46 | 62.56 | 59.18 |
| 2 | ExHuBERT | EmoAffectNet | JINA | BFS | 62.35 | 64.02 | 38.40 | 63.93 | 62.10 | 62.32 | 58.85 |
| 3 | Wav2Vec2 | EmoAffectNet | RoBERTa | BFS | 57.64 | 67.14 | 35.48 | 65.03 | 61.58 | 60.19 | 57.84 |
| 4 | ExHuBERT | EmoAffectNet | RoBERTa | BFS | 60.54 | 64.70 | 35.05 | 63.78 | 60.78 | 59.25 | 57.35 |
| 5 | Wav2Vec2 | ResEmoteNet | JINA | BFS | 55.88 | 59.93 | 37.75 | 63.81 | 62.22 | 63.69 | 57.21 |
| 6 | ExHuBERT | ResEmoteNet | JINA | BFS | 57.21 | 61.40 | 38.25 | 64.06 | 61.11 | 60.81 | 57.14 |
| 7 | ExHuBERT | ResEmoteNet | RoBERTa | BFS | 49.62 | 54.44 | 32.56 | 62.15 | 59.88 | 60.47 | 53.19 |
| 8 | Wav2Vec2 | ResEmoteNet | RoBERTa | BFS | 52.29 | 54.87 | 34.08 | 61.65 | 58.79 | 57.42 | 53.18 |
| 9 | Wav2Vec2 | EmoAffectNet | JINA | LEFS | 61.38 | 66.57 | 39.79 | 65.70 | 62.79 | 61.59 | 59.64 |
| 10 | Wav2Vec2 | EmoAffectNet | JINA | LEFSA | 62.52 | 64.96 | 40.09 | 67.02 | 62.30 | 62.00 | 59.81 |
The best configuration (ID 10) combines Wav2Vec2, EmoAffectNet, and JINA features with the Label Encoder Fusion Strategy with Averaging (LEFSA).
## Comparison with Prior Multimodal Approaches

ST denotes single-task recognition and MT denotes multitask recognition; "–" marks metrics not reported for an approach.
| Approach | Corpus | Setup | Emotion A7 | Emotion WF7 | Sentiment A3 | Sentiment WF3 |
|---|---|---|---|---|---|---|
| Ours | RAMAS | MT | 68.99 | 67.79 | 84.11 | 84.02 |
| Zhang et al. | MELD | MT | 41.17 | 41.22 | 67.33 | 67.21 |
| Van et al. | MELD | ST | 66.28 | 65.69 | – | – |
| Hwang et al. | MELD | ST | 66.70 | 65.93 | – | – |
| Tu et al. | MELD | ST | 67.85 | 67.02 | – | – |
| Ours | MELD | MT | 62.30 | 59.79 | 69.20 | 69.02 |

| Approach | Corpus | Setup | Emotion mWA6 | Emotion mWF6 | Sentiment A2 | Sentiment WF2 |
|---|---|---|---|---|---|---|
| Chauhan et al. | CMU-MOSEI | MT | 62.97 | 79.02 | 80.37 | 78.23 |
| Sangwan et al. | CMU-MOSEI | MT | 63.16 | 79.06 | 80.15 | 78.30 |
| Hwang et al. | CMU-MOSEI | ST | – | – | 87.40 | 87.30 |
| Zheng et al. | CMU-MOSEI | ST | – | – | 85.90 | 86.00 |
| Ours | CMU-MOSEI | MT | 64.78 | 79.06 | 84.83 | 84.90 |
## Related Publications
Markitantov M., Ryumina E., Kaya H., Karpov A. Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion // In Proc. Interspeech 2025, pp. 3010β3014. https://doi.org/10.21437/Interspeech.2025-2060
```bibtex
@inproceedings{markitantov25_interspeech,
  title     = {{Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion}},
  author    = {Maxim Markitantov and Elena Ryumina and Heysem Kaya and Alexey Karpov},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {3010--3014},
  doi       = {10.21437/Interspeech.2025-2060},
  issn      = {2958-1796}
}
```
Markitantov M., Ryumina E., Dvoynikova A., Karpov A. Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion // Information Fusion, 2026, vol. 132, article 104207. https://doi.org/10.1016/j.inffus.2026.104207
```bibtex
@article{markitantov2026triplefusion,
  title   = {Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion},
  author  = {Markitantov, Maxim and Ryumina, Elena and Dvoynikova, Anastasia and Karpov, Alexey},
  journal = {Information Fusion},
  volume  = {132},
  pages   = {104207},
  year    = {2026},
  doi     = {10.1016/j.inffus.2026.104207},
  url     = {https://doi.org/10.1016/j.inffus.2026.104207}
}
```