Abstract
Depression and related mental health disorders are a serious public health concern worldwide, which makes efficient early detection essential. This study presents a comparative analysis of machine learning approaches for detecting depression and associated mental health disorders from text. Five models were implemented and tested, namely Logistic Regression, Linear SVM, Random Forest, CNN+BiLSTM, and DistilBERT, on a large dataset of 53,043 text samples spanning seven mental health categories (Normal, Depression, Suicidal, Anxiety, Bipolar, Stress, and Personality Disorder). Text was vectorized with TF-IDF using 5,000 features and an n-gram range of (1,2). Initial baseline results showed moderate performance: Logistic Regression 76.00%, Linear SVM 75.00%, Random Forest 74.00%, CNN+BiLSTM 77.00%, and DistilBERT 80.48%. An exhaustive hyperparameter search improved the accuracy of the classical models: Logistic Regression (baseline: 75.00% → optimized: 76.44%) with C=4.28 and an L2 penalty, Linear SVM (75.00% → 76.93%) with C=0.234 and squared-hinge loss, and Random Forest (74.00% → 75.37%) with 500 estimators. DistilBERT achieved the highest accuracy of all models at 80.48%, demonstrating the strength of transformer-based architectures for analyzing mental health text.
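As an illustration of the classical configuration reported above, the following is a minimal sketch that assumes a scikit-learn implementation; the abstract does not name the library. The function name `build_classical_models`, the `random_state` argument, and `max_iter=1000` are hypothetical choices made for the example, and any hyperparameter not stated above is left at the library default.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_classical_models(random_state=0):
    """Return TF-IDF pipelines using the hyperparameters quoted in the abstract.

    Assumption: scikit-learn is the toolkit. Settings not stated in the
    abstract (max_iter, random_state) are illustrative only.
    """

    def tfidf():
        # 5,000 features over unigrams and bigrams, as stated in the abstract.
        return TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

    return {
        # L2 is scikit-learn's default penalty, so it is not passed explicitly.
        "logistic_regression": make_pipeline(
            tfidf(), LogisticRegression(C=4.28, max_iter=1000)
        ),
        "linear_svm": make_pipeline(
            tfidf(), LinearSVC(C=0.234, loss="squared_hinge")
        ),
        "random_forest": make_pipeline(
            tfidf(),
            RandomForestClassifier(n_estimators=500, random_state=random_state),
        ),
    }
```

Each pipeline accepts raw strings, so under these assumptions a call such as `build_classical_models()["linear_svm"].fit(train_texts, train_labels)` would vectorize and train in one step, where `train_texts` and `train_labels` stand for the study's training split.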