Recognising Speech-Based Emotion Using CNN in Deep Learning
Abstract
Building a robust machine learning model for emotion recognition in speech is the project's primary objective. To this end, the model is trained on a diverse dataset covering a spectrum of emotions: neutrality, sadness, anger, disgust, fear, and happiness. Using Convolutional Neural Networks and related Deep Learning techniques, the model analyses complex audio features in order to predict the speaker's emotional state. This choice of architecture allows the model to capture intricate patterns in audio features, offering a nuanced understanding of the emotional content of speech and aiming to surpass traditional methods in recognising and classifying emotions. In the initial stage, the model undergoes comprehensive training on a meticulously curated dataset of tagged speech samples spanning a variety of emotional states. The data preparation stage is among the project's most crucial elements: raw audio is carefully processed to extract salient features, with tasks such as audio segmentation, noise reduction, and feature extraction ensuring that the model receives well-refined inputs. The subsequent stage applies the trained model to real-world scenarios; once equipped to recognise emotions in speech, it can be deployed in practical settings to aid professionals in psychology and speech therapy. In conclusion, the project presents a solution for emotion recognition in speech that combines advanced machine learning techniques with a carefully curated dataset. The model's ability to accurately predict emotional states offers significant utility in psychology and speech therapy, giving professionals a valuable tool for understanding emotional nuances.
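As a rough illustration of the feature-extraction step described above, the sketch below computes MFCC features from a single audio clip. It assumes the librosa library, 40 coefficients averaged over time, and a hypothetical file path; none of these specifics are fixed by the paper.

```python
# Minimal sketch of the feature-extraction step (assumptions: librosa,
# 40 MFCC coefficients averaged over time, illustrative file path).
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    # Load the audio clip at its native sampling rate.
    signal, sr = librosa.load(path, sr=None)
    # Compute MFCCs: array of shape (n_mfcc, n_frames).
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Average over time to obtain a fixed-length feature vector.
    return np.mean(mfcc.T, axis=0)

features = extract_mfcc("path/to/OAF_back_angry.wav")  # hypothetical TESS file name
print(features.shape)  # (40,)
```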
Using the TESS dataset, our method builds a CNN model for audio classification. The objective is to categorise audio samples into six distinct emotional states: fear, anger, disgust, happiness, neutrality, and sadness. MFCCs are extracted as features from the audio data, and the dataset is augmented with several modifications, including noise addition, time stretching, and pitch shifting. The CNN architecture consists of a convolutional layer, a dense layer, and an output layer with softmax activation. Trained and evaluated on the TESS dataset, the model achieves a test accuracy of 93.33%. The main conclusion of this research is that emotions can be accurately classified from audio samples using a CNN model, and that performance improves when MFCC features are combined with data-augmentation methods such as noise addition, time stretching, and pitch shifting. The 93.99% accuracy that was attained shows that the proposed method for audio emotion classification is effective. These findings have implications for emotion-aware applications in affective computing, human-computer interaction, and speech analysis.
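The three augmentation methods named above could be implemented along the following lines; the noise level, stretch rate, and pitch step are illustrative assumptions rather than values reported in the paper.

```python
# Sketch of the three augmentations named in the abstract, applied to a raw
# waveform `signal` (a 1-D NumPy array) sampled at `sr` Hz. Parameter values
# are illustrative assumptions.
import numpy as np
import librosa

def add_noise(signal, noise_factor=0.005):
    # Mix in Gaussian noise scaled by noise_factor.
    return signal + noise_factor * np.random.normal(size=signal.shape)

def time_stretch(signal, rate=0.9):
    # Slow the clip down slightly without changing its pitch.
    return librosa.effects.time_stretch(y=signal, rate=rate)

def pitch_shift(signal, sr, n_steps=2):
    # Shift the pitch up by two semitones while keeping the duration.
    return librosa.effects.pitch_shift(y=signal, sr=sr, n_steps=n_steps)
```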
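A Keras sketch consistent with the architecture described (a convolutional layer, a dense layer, and a softmax output over the six classes) might look as follows; the filter counts, kernel size, and 40-dimensional MFCC input shape are assumptions, not reported hyperparameters.

```python
# Sketch of a CNN with the layers named in the abstract: one convolutional
# layer, one dense layer, and a softmax output over six emotion classes.
# Filter counts, kernel size, and the (40, 1) input shape are assumptions.
from tensorflow.keras import layers, models

def build_model(input_shape=(40, 1), num_classes=6):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(64, kernel_size=5, activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```

Training would then call model.fit on the MFCC feature matrix, reshaped to (num_samples, 40, 1), together with one-hot encoded emotion labels.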