Multimodal Emotion Recognition using Vision, Text, and Audio with Transformer Models for Real-Time Video Call Applications

Prince Gupta

Volume 12 Issue 7
July-2025
eISSN: 2349-5162

Impact factor: 7.95 (calculated by Google Scholar)

Published Paper ID: JETIR2507513

Registration ID: 566725

The proliferation of remote communication has created an urgent need for intelligent systems capable of understanding human emotions through multimodal inputs in video conferencing environments. Current emotion recognition systems predominantly rely on single modalities, which limits their effectiveness in capturing the complex nature of human emotional expression. This research proposes a novel Adaptive Cross-Modal Transformer Fusion Network (ACMTFN) that integrates vision, text, and audio modalities using transformer architectures for real-time emotion recognition in video calls. Our approach employs a hierarchical transformer-based architecture with a dedicated encoder for each modality: a fine-tuned Vision Transformer (ViT) for facial expressions, BERT for textual content analysis, and Wav2Vec2 for audio processing. The core innovation is an adaptive cross-modal attention mechanism that dynamically weights inter-modal relationships based on contextual relevance, addressing the critical challenge of modality imbalance in multimodal learning. Comprehensive evaluation on the MELD and CMU-MOSEI benchmark datasets demonstrates superior performance, achieving 76.8% accuracy on MELD and a 79.4% F1-score on CMU-MOSEI, improvements of 4.2% and 3.8%, respectively, over existing state-of-the-art methods. Crucially, the system maintains computational efficiency suitable for real-time applications, with inference times averaging 85 ms per sample, meeting the stringent latency requirements of video conferencing platforms. The proposed ACMTFN contributes to human-AI interaction by providing a practical solution for emotion-aware computing systems, with immediate applications in virtual meetings, mental health monitoring, and educational technology. This work advances the field of affective computing by demonstrating how transformer-based multimodal fusion can effectively bridge the emotional gap in digital communication.
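
The abstract does not spell out how the adaptive cross-modal weighting is computed, so the following is only a minimal sketch of the general idea, not the ACMTFN implementation. It assumes each encoder (ViT, BERT, Wav2Vec2) has already produced a fixed-length embedding, takes the mean of the three embeddings as a stand-in "context", scores each modality by scaled dot product against that context, and fuses the embeddings with softmax weights. The function name `adaptive_fusion`, the mean-based context, and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def adaptive_fusion(embeddings: Dict[str, Vector]) -> Tuple[Dict[str, float], Vector]:
    """Weight each modality by its agreement with a shared context vector.

    Hypothetical illustration: the context is the mean embedding, and the
    relevance score is a scaled dot product, as in standard attention.
    """
    names = list(embeddings)
    dim = len(embeddings[names[0]])

    # Assumed context: element-wise mean over all modality embeddings.
    context = [sum(embeddings[n][i] for n in names) / len(names) for i in range(dim)]

    # Scaled dot-product relevance of each modality to the context.
    scale = math.sqrt(dim)
    scores = [dot(embeddings[n], context) / scale for n in names]
    weights = softmax(scores)

    # Fused representation: weighted sum of the modality embeddings.
    fused = [sum(w * embeddings[n][i] for w, n in zip(weights, names)) for i in range(dim)]
    return dict(zip(names, weights)), fused


if __name__ == "__main__":
    # Toy embeddings: the audio vector is nearly silent, so it should
    # receive the smallest weight under this scoring rule.
    weights, fused = adaptive_fusion({
        "vision": [0.9, 0.1, 0.0, 0.2],
        "text": [0.8, 0.2, 0.1, 0.1],
        "audio": [0.0, 0.0, 0.1, 0.0],
    })
    print(weights)
    print(fused)
```

In this toy setup a weak or uninformative modality is down-weighted rather than dropped, which is one simple way to picture how dynamic weighting could mitigate modality imbalance. A trained model would learn the scoring function rather than use a fixed mean-based context.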

Multimodal emotion recognition, transformer models, cross-modal attention, human-computer interaction, video conferencing, affective computing

"Multimodal Emotion Recognition using Vision, Text, and Audio with Transformer Models for Real-Time Video Call Applications", International Journal of Emerging Technologies and Innovative Research (www.jetir.org), ISSN: 2349-5162, Vol. 12, Issue 7, page no. f110-f118, July-2025, Available: http://www.jetir.org/papers/JETIR2507513.pdf

Page No: f110-f118

Country: Panvel, Raigad, Maharashtra, India

Area: Engineering

ISSN Number: 2349-5162

Publisher: IJ Publication
