LEFSA Models

This repository contains LEFSA model weights for multimodal affective state recognition. LEFSA (Label Encoder Fusion Strategy with Averaging) performs joint emotion and sentiment recognition from the audio, video, and text modalities.

Files

  • lefsa.pt — LEFSA checkpoint for joint audio-video-text emotion and sentiment recognition.
  • emoaffectnet.pt — EmoAffectNet model for visual feature extraction. The original EmoAffectNet repository is available here: https://github.com/ElenaRyumina/EMO-AffectNetModel.
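
Both checkpoints are standard PyTorch .pt files. A minimal loading sketch follows; the state-dict layout shown is an assumption, so inspect the actual keys and adapt to the training code:

import torch

# Load the checkpoint on CPU so no GPU is required.
lefsa_ckpt = torch.load("lefsa.pt", map_location="cpu")

# A .pt file may hold a bare state_dict or a dict wrapping one under a
# "state_dict" key -- this layout is an assumption, so inspect the keys.
if isinstance(lefsa_ckpt, dict):
    state_dict = lefsa_ckpt.get("state_dict", lefsa_ckpt)
    print(list(state_dict)[:5])  # first few parameter names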

What the Model Predicts

The model has two classification heads.

Task                   Number of classes  Class order
Emotion recognition    7                  neutral, happy, sad, anger, surprise, disgust, fear
Sentiment recognition  3                  negative, neutral, positive

Use this exact class order when converting logits or probabilities to labels.
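
A minimal decoding sketch in PyTorch; the tensor names emotion_logits and sentiment_logits are placeholders, not the checkpoint's actual output names:

import torch

# Exact class order used by the two heads (do not reorder).
EMOTIONS = ["neutral", "happy", "sad", "anger", "surprise", "disgust", "fear"]
SENTIMENTS = ["negative", "neutral", "positive"]

def decode(emotion_logits: torch.Tensor, sentiment_logits: torch.Tensor):
    """Map logits of shape (batch, 7) and (batch, 3) to string labels."""
    emo_idx = emotion_logits.argmax(dim=-1)
    sen_idx = sentiment_logits.argmax(dim=-1)
    return (
        [EMOTIONS[i] for i in emo_idx.tolist()],
        [SENTIMENTS[i] for i in sen_idx.tolist()],
    )

# Example with random logits for a batch of two samples.
emotions, sentiments = decode(torch.randn(2, 7), torch.randn(2, 3))
print(emotions, sentiments)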

Model Overview

  • Acoustic, visual, and linguistic features are downsampled to a common temporal representation.
  • The model applies cross-modal transformer blocks to model interactions between modalities.
  • A label encoder produces unimodal emotion and sentiment predictions and injects this label-level context back into the fusion module.
  • In LEFSA, unimodal predictions are additionally averaged with the multimodal predictions to improve robustness (sketched below).
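
A minimal sketch of that averaging step, assuming every branch outputs class probabilities and an unweighted mean is used (the exact weighting in the paper may differ):

import torch

def lefsa_average(multimodal_probs, unimodal_probs):
    """Average the multimodal prediction with the per-modality predictions.

    All tensors have shape (batch, num_classes) and sum to 1 per row.
    The unweighted mean is an assumption made for illustration.
    """
    stacked = torch.stack([multimodal_probs, *unimodal_probs], dim=0)
    return stacked.mean(dim=0)

# Example: fused prediction plus audio/video/text branch predictions (7 emotions).
fused = torch.softmax(torch.randn(2, 7), dim=-1)
branches = [torch.softmax(torch.randn(2, 7), dim=-1) for _ in range(3)]
print(lefsa_average(fused, branches).sum(dim=-1))  # rows still sum to 1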

Research Corpora

The model family was evaluated in a multilingual and multicorpus setting.

Corpus     Language / domain                              Modalities          Tasks
RAMAS      Russian, dyadic semi-spontaneous interactions  Audio, video, text  Emotion, sentiment
MELD       English, scripted TV-series dialogues          Audio, video, text  Emotion, sentiment
CMU-MOSEI  English, in-the-wild YouTube monologues        Audio, video, text  Emotion, sentiment

Emotion labels are mapped to seven classes: neutral, happy, sad, anger, surprise, disgust, and fear. Sentiment labels are mapped to three classes: negative, neutral, and positive.
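
As an illustration, MELD's native emotion labels can be renamed into this scheme; the dictionary below is an example mapping written for this card, not code shipped with the model:

# Illustrative mapping from MELD's native emotion labels to the seven-class
# scheme used here; RAMAS and CMU-MOSEI need analogous corpus-specific maps.
MELD_TO_SEVEN = {
    "neutral": "neutral",
    "joy": "happy",
    "sadness": "sad",
    "anger": "anger",
    "surprise": "surprise",
    "disgust": "disgust",
    "fear": "fear",
}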

Feature and Fusion Strategy Comparison

MF is the macro F1-score; the trailing digit gives the number of classes. For CMU-MOSEI emotion recognition, mMF6 is the mean macro F1-score over the six binary positive/negative emotion classes. The Average column is the mean of the six corpus/task scores and is the integral score reported in the paper.
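
A minimal sketch of the mMF computation with scikit-learn, assuming the six CMU-MOSEI emotions are given as independent binary (present/absent) targets:

import numpy as np
from sklearn.metrics import f1_score

def mean_macro_f1(y_true, y_pred):
    """Mean of the per-emotion macro F1-scores (mMF).

    y_true, y_pred: (num_samples, 6) binary arrays, one column per emotion.
    For each column, macro F1 averages the F1 of the positive and negative
    classes; mMF then averages those scores over the six emotions.
    """
    scores = [
        f1_score(y_true[:, k], y_pred[:, k], average="macro")
        for k in range(y_true.shape[1])
    ]
    return float(np.mean(scores))

# Example with random binary labels for 100 samples.
rng = np.random.default_rng(0)
print(round(mean_macro_f1(rng.integers(0, 2, (100, 6)), rng.integers(0, 2, (100, 6))), 4))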

                                             RAMAS         MELD          CMU-MOSEI
ID  Audio     Video         Text     Fusion  MF7    MF3    MF7    MF3    mMF6   MF3    Average
1   Wav2Vec2  EmoAffectNet  JINA     BFS     60.57  65.02  38.56  65.94  62.46  62.56  59.18
2   ExHuBERT  EmoAffectNet  JINA     BFS     62.35  64.02  38.40  63.93  62.10  62.32  58.85
3   Wav2Vec2  EmoAffectNet  RoBERTa  BFS     57.64  67.14  35.48  65.03  61.58  60.19  57.84
4   ExHuBERT  EmoAffectNet  RoBERTa  BFS     60.54  64.70  35.05  63.78  60.78  59.25  57.35
5   Wav2Vec2  ResEmoteNet   JINA     BFS     55.88  59.93  37.75  63.81  62.22  63.69  57.21
6   ExHuBERT  ResEmoteNet   JINA     BFS     57.21  61.40  38.25  64.06  61.11  60.81  57.14
7   ExHuBERT  ResEmoteNet   RoBERTa  BFS     49.62  54.44  32.56  62.15  59.88  60.47  53.19
8   Wav2Vec2  ResEmoteNet   RoBERTa  BFS     52.29  54.87  34.08  61.65  58.79  57.42  53.18
9   Wav2Vec2  EmoAffectNet  JINA     LEFS    61.38  66.57  39.79  65.70  62.79  61.59  59.64
10  Wav2Vec2  EmoAffectNet  JINA     LEFSA   62.52  64.96  40.09  67.02  62.30  62.00  59.81

The best configuration (ID 10) combines Wav2Vec2 audio, EmoAffectNet video, and JINA text features with the Label Encoder Fusion Strategy with Averaging (LEFSA).

Comparison with Prior Multimodal Approaches

ST means single-task recognition and MT means multitask recognition. A is accuracy and WF is the weighted F1-score, with the trailing digit giving the number of classes; the m prefix (mWA6, mWF6) denotes the mean over the six binary CMU-MOSEI emotion classes.

Approach      Corpus  Setup  Emotion A7  Emotion WF7  Sentiment A3  Sentiment WF3
Ours          RAMAS   MT     68.99       67.79        84.11         84.02
Zhang et al.  MELD    MT     41.17       41.22        67.33         67.21
Van et al.    MELD    ST     66.28       65.69        —             —
Hwang et al.  MELD    ST     66.70       65.93        —             —
Tu et al.     MELD    ST     67.85       67.02        —             —
Ours          MELD    MT     62.30       59.79        69.20         69.02

Approach        Corpus     Setup  Emotion mWA6  Emotion mWF6  Sentiment A2  Sentiment WF2
Chauhan et al.  CMU-MOSEI  MT     62.97         79.02         80.37         78.23
Sangwan et al.  CMU-MOSEI  MT     63.16         79.06         80.15         78.30
Hwang et al.    CMU-MOSEI  ST     —             —             87.40         87.30
Zheng et al.    CMU-MOSEI  ST     —             —             85.90         86.00
Ours            CMU-MOSEI  MT     64.78         79.06         84.83         84.90

Related Publications

Markitantov M., Ryumina E., Kaya H., Karpov A. Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion // In Proc. Interspeech 2025, pp. 3010–3014. https://doi.org/10.21437/Interspeech.2025-2060

@inproceedings{markitantov25_interspeech,
  title     = {{Multi-Modal Multi-Task Affective States Recognition Based on Label Encoder Fusion}},
  author    = {Maxim Markitantov and Elena Ryumina and Heysem Kaya and Alexey Karpov},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {3010--3014},
  doi       = {10.21437/Interspeech.2025-2060},
  issn      = {2958-1796}
}

Markitantov M., Ryumina E., Dvoynikova A., Karpov A. Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion // Information Fusion, 2026, vol. 132, article 104207. https://doi.org/10.1016/j.inffus.2026.104207

@article{markitantov2026triplefusion,
  title   = {Multi-lingual approach for multi-modal emotion and sentiment recognition based on triple fusion},
  author  = {Markitantov, Maxim and Ryumina, Elena and Dvoynikova, Anastasia and Karpov, Alexey},
  journal = {Information Fusion},
  volume  = {132},
  pages   = {104207},
  year    = {2026},
  doi     = {10.1016/j.inffus.2026.104207},
  url     = {https://doi.org/10.1016/j.inffus.2026.104207}
}