Background:
Twitter, with its 330 million monthly active users, provides a platform for businesses to engage directly with customers. However, the sheer volume of tweets makes it hard to spot negative mentions quickly. Sentiment analysis helps in understanding customer emotions, staying updated on brand mentions, and discovering industry trends.
Objective:
Classify tweets about US airlines into positive, negative, and neutral sentiments, and further categorize reasons for negative sentiments.
Data Description:
Dataset: Twitter data from February 2015, including columns like tweet_id, airline_sentiment, negativereason, airline, retweet_count, text, tweet_created, and user_timezone.
Methodology:
Data Preprocessing:
Removed irrelevant columns.
Cleaned text data by removing HTML tags, numbers, punctuation, and stopwords.
Tokenized and lemmatized the text.
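The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function name clean_tweet, the tiny STOPWORDS set, and the sample tweet are all hypothetical, and lemmatization is left as a pluggable argument because the source does not say which lemmatizer was used.

```python
import html
import re
import string

# Assumption: a tiny illustrative stopword list. The project most likely used
# a fuller list from an NLP library, which the source does not name.
STOPWORDS = {"a", "an", "the", "is", "was", "to", "and", "of", "on",
             "my", "i", "for", "in", "it"}

# Map every punctuation character to a space so words stay separated.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text, lemmatize=lambda token: token):
    """Return a list of cleaned tokens for one tweet.

    `lemmatize` is a hypothetical hook: pass any callable that maps a token
    to its lemma. The default leaves tokens unchanged.
    """
    text = html.unescape(text)             # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags
    text = re.sub(r"\d+", " ", text)       # drop numbers
    text = text.translate(PUNCT_TABLE)     # drop punctuation
    tokens = text.lower().split()          # simple whitespace tokenization
    return [lemmatize(t) for t in tokens if t not in STOPWORDS]


sample = "@united my flight was <b>delayed</b> 3 hours &amp; no updates!"
tokens = clean_tweet(sample)
print(tokens)
```

A real lemmatizer can be plugged in through the lemmatize argument without changing the rest of the function.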
Vectorization:
Applied CountVectorizer and TfidfVectorizer to convert text into numerical features.
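As a small illustration of this step, the sketch below runs both scikit-learn vectorizers on three made-up documents (the docs list is an assumption, not project data). CountVectorizer yields raw term counts, while TfidfVectorizer reweights them and, with default settings, scales each row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Assumption: three toy documents standing in for cleaned tweets.
docs = [
    "flight delayed again",
    "great flight great crew",
    "lost bag no help",
]

# Bag-of-words: each column is a vocabulary term, each cell a raw count.
count_vec = CountVectorizer()
X_count = count_vec.fit_transform(docs)
counts = X_count.toarray()

# TF-IDF: the same vocabulary, but counts are reweighted by how rare a term
# is across documents, then each row is L2-normalized (the default).
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)
row_norms = np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1)).ravel() ** 0.5

print(sorted(count_vec.vocabulary_))   # learned vocabulary terms
print(counts)                          # document-term count matrix
print(row_norms)                       # each TF-IDF row has norm 1
```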
Model Building:
Used a Random Forest classifier for sentiment classification.
Tuned hyperparameters using GridSearchCV.
Evaluated models using cross-validation and confusion matrices.
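One way the model-building steps could fit together is sketched below: a scikit-learn Pipeline that vectorizes text and feeds a Random Forest, tuned with GridSearchCV. The toy tweets, the parameter grid, and cv=2 are assumptions chosen to keep the example small; the source does not state which hyperparameters or fold count were actually used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Assumption: a tiny hand-written dataset, four tweets per class.
texts = [
    "flight delayed three hours no updates",
    "lost my bag and nobody helped",
    "worst customer service ever",
    "cancelled flight and rude staff",
    "what time does check in open",
    "is there wifi on this flight",
    "flying to boston tomorrow",
    "can i change my seat online",
    "great crew and smooth flight",
    "thank you for the quick help",
    "loved the friendly staff",
    "best flight experience thanks",
]
labels = ["negative"] * 4 + ["neutral"] * 4 + ["positive"] * 4

# Vectorizer and classifier live in one pipeline so that each CV fold
# fits the vocabulary on its own training split only.
pipeline = Pipeline([
    ("vec", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Assumption: an illustrative grid; step-name prefixes route each parameter.
param_grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [None, 10],
}

# cv=2 only because the toy data is tiny; real data would allow more folds.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["my flight was delayed"]))
```

Swapping TfidfVectorizer for CountVectorizer in the "vec" step gives the other configuration described above.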
Evaluation:
Assessed model performance based on accuracy, precision, recall, and F1-score.
Generated word clouds for top features to visualize important words.
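The metrics listed above can be computed with scikit-learn as in the following sketch. The y_true and y_pred lists are made-up labels for illustration only and do not reflect the project's results; macro averaging is likewise an assumption, since the source does not say how per-class scores were combined.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Assumption: hand-made labels, not output from the actual model.
y_true = ["negative", "negative", "negative", "neutral", "positive", "positive"]
y_pred = ["negative", "negative", "neutral", "neutral", "positive", "negative"]
class_names = ["negative", "neutral", "positive"]

# Rows are true classes, columns are predicted classes, in class_names order.
cm = confusion_matrix(y_true, y_pred, labels=class_names)

acc = accuracy_score(y_true, y_pred)

# Macro average: compute each metric per class, then take the unweighted mean.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=class_names, average="macro"
)

print(cm)
print(f"accuracy={acc:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With an imbalanced dataset, per-class precision and recall are usually more informative than accuracy alone, because a model that favors the largest class can still score a high accuracy.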
Results:
Model Performance: Achieved ~76% accuracy with both CountVectorizer and TfidfVectorizer.
Key Insights:
The majority of tweets (62.7%) were negative.
Common negative reasons included customer service issues and flight delays.
Word clouds and feature importance analysis provided insights into key factors driving sentiment.
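A feature importance analysis of this kind might look like the sketch below, which ranks vocabulary terms by the Random Forest's feature_importances_ attribute. The toy tweets and the choice of five top terms are assumptions for illustration; the resulting (term, weight) pairs are the sort of input a word cloud could be drawn from.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Assumption: a tiny hand-written dataset, not the airline tweets themselves.
texts = [
    "flight delayed no updates",
    "lost bag rude staff",
    "worst service flight cancelled",
    "delayed again terrible service",
    "great crew smooth flight",
    "thanks for the quick help",
    "friendly staff great service",
    "best flight thanks",
]
labels = ["negative"] * 4 + ["positive"] * 4

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

# vocabulary_ maps term -> column index; invert it to label each column.
index_to_term = {int(idx): term for term, idx in vec.vocabulary_.items()}

# Sort column indices by importance, highest first, and keep the top five.
order = np.argsort(clf.feature_importances_)[::-1]
top_terms = [
    (index_to_term[int(i)], float(clf.feature_importances_[i]))
    for i in order[:5]
]

for term, weight in top_terms:
    print(f"{term}: {weight:.3f}")
```

Impurity-based importances show which terms the forest split on most, not whether a term pushes a tweet toward positive or negative sentiment.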
Conclusion:
NLP techniques and machine learning models can extract useful insights from social media data. Airlines can use these insights to address customer concerns, improve services, and enhance customer satisfaction. Future work could explore more advanced models such as LSTM or BERT to improve on the ~76% accuracy reached here.