Abstract
Speech Emotion Recognition (SER) has emerged as a crucial domain within human-computer interaction (HCI), enabling machines to identify and respond to users' emotional states. Unlike traditional text-based sentiment analysis, SER relies on auditory cues, making it more complex due to the dynamic and nuanced nature of speech. With the proliferation of deep learning, especially architectures such as Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs), significant strides have been made in extracting emotional patterns from raw audio data. This review examines recent advances in SER, focusing on methodologies that bypass extensive feature engineering by operating directly on raw waveform data. We analyze ten state-of-the-art studies that have contributed novel techniques, including attention mechanisms, hybrid CNN-LSTM models, and end-to-end learning paradigms. Each method is evaluated with respect to the dataset used, reported performance metrics (e.g., accuracy and F1-score), and its limitations. A key insight from the review is the increasing reliance on raw audio inputs, which eliminates the dependency on handcrafted features such as Mel-frequency cepstral coefficients (MFCCs) or spectrograms. However, existing approaches still struggle with generalization, dataset imbalance, and speaker variability. The paper also presents a brief overview of a proposed LSTM-based architecture designed to enhance robustness across diverse speech signals. Our findings highlight gaps in current research and suggest directions for future work, with particular emphasis on multilingual datasets and unsupervised learning techniques for SER.
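To make the hybrid CNN-LSTM paradigm surveyed above concrete, the following is a minimal illustrative sketch of such a model operating on raw waveforms, not the architecture of any specific reviewed study or of the proposed method. The class name `RawAudioCNNLSTM`, the layer sizes, and the eight-class output are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class RawAudioCNNLSTM(nn.Module):
    """Illustrative hybrid CNN-LSTM classifier for raw-waveform SER.

    A 1-D convolutional front end learns local acoustic patterns directly
    from the signal (no MFCCs or spectrograms), and an LSTM models the
    temporal dynamics of the resulting feature sequence. All sizes here
    are assumptions for illustration, not values from the reviewed papers.
    """

    def __init__(self, num_emotions: int = 8):
        super().__init__()
        # Convolutional front end: downsample the waveform while learning
        # local filters (replacing handcrafted feature extraction).
        self.conv = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(4),
        )
        # Recurrent back end: capture longer-range emotional dynamics.
        self.lstm = nn.LSTM(input_size=128, hidden_size=128,
                            num_layers=2, batch_first=True)
        self.classifier = nn.Linear(128, num_emotions)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), e.g. a few seconds at 16 kHz
        x = self.conv(waveform.unsqueeze(1))  # (batch, 128, frames)
        x = x.transpose(1, 2)                 # (batch, frames, 128)
        _, (h_n, _) = self.lstm(x)            # final hidden state
        return self.classifier(h_n[-1])       # (batch, num_emotions)


# Example: classify a batch of four 2-second utterances at 16 kHz.
model = RawAudioCNNLSTM(num_emotions=8)
logits = model(torch.randn(4, 32000))
print(logits.shape)  # torch.Size([4, 8])
```

The design mirrors the end-to-end pattern discussed in the review: the network consumes samples rather than precomputed features, so the convolutional filters take over the role that MFCC extraction plays in traditional pipelines.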
Keywords
Speech Emotion Recognition (SER), Raw Audio, LSTM, Deep Learning, CNN, Attention Mechanism, Human-Computer Interaction, Emotion Detection, End-to-End Learning, Neural Networks