Real-Time Sentiment Tracker for Streaming Social Media Data
This project analyzes the sentiment of streaming social media data in real time using Apache Kafka, Apache Spark on AWS EMR, and a Flask-based dashboard. The dashboard updates as processed results arrive, showing how sentiment trends over time.
Features
- Kafka streams social media data (simulated using the Sentiment140 dataset).
- Apache Spark scores the sentiment of each message with TextBlob.
- Processed data is stored in Amazon S3.
- Flask-based dashboard with Plotly graphs displays sentiment trends.
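As a rough illustration of the TextBlob step, the sketch below maps a polarity score to the three labels the dashboard reports. The helper names `label_from_polarity` and `classify` and the default threshold of 0.0 are assumptions made for this example, not code taken from spark_streaming.py.

```python
def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    The threshold is an assumption of this sketch: scores within
    +/- threshold of zero are treated as neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def classify(text: str) -> str:
    """Score `text` with TextBlob and return its sentiment label."""
    # Imported here so label_from_polarity can be used without textblob installed.
    from textblob import TextBlob

    return label_from_polarity(TextBlob(text).sentiment.polarity)
```

Raising `threshold` widens the neutral band, which may help if weakly worded messages are being labeled too eagerly.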
Technologies Used
- Apache Kafka: Data streaming.
- Apache Spark: Real-time processing (via PySpark).
- AWS EMR: Cloud-based Spark cluster.
- Flask: Web application for dashboard.
- Plotly: Data visualization.
- Amazon S3: Data storage.
- Python Libraries: textblob, boto3, kafka-python.
Setup Instructions
1. Prerequisites
- Python 3.8 or above
- AWS account with access to S3 and EMR
- Apache Kafka, either installed locally or provided as a managed service
- Virtual environment setup for Python
- Dataset: https://www.kaggle.com/datasets/kazanova/sentiment140
2. Install Dependencies
- Create a virtual environment:
python3 -m venv venv
- Activate the virtual environment:
On macOS/Linux:
source venv/bin/activate
On Windows:
venv\Scripts\activate
- Install required packages:
pip install -r requirements.txt
3. Set Up Kafka
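- Start ZooKeeper (run from the Kafka installation directory):
bin/zookeeper-server-start.sh config/zookeeper.properties
- In a second terminal, start the Kafka broker:
bin/kafka-server-start.sh config/server.properties
- Create the tweets topic:
bin/kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092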
4. Set Up AWS EMR
- Go to the AWS Management Console and navigate to the EMR service.
- Create a cluster with the following configurations:
- Release version: emr-6.x.x
- Instance type: m5.xlarge
- Ensure SSH access is enabled.
- Launch the cluster.
- Upload and execute the spark_streaming.py script.
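For orientation, a Spark Structured Streaming job of this kind might look roughly like the sketch below. It is not the project's spark_streaming.py: the function names, the choice to treat the whole Kafka message value as the text to score, and the JSON file sink are all assumptions. Running it would also require Spark's Kafka connector (the spark-sql-kafka-0-10 module) and textblob to be available on the cluster.

```python
def kafka_source_options(bootstrap="localhost:9092", topic="tweets"):
    """Options for Spark's Kafka source, keyed by the source's option names."""
    return {
        "kafka.bootstrap.servers": bootstrap,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_stream(output_path, checkpoint_path, **kafka_kwargs):
    """Read messages from Kafka, label each one, and append JSON files to output_path."""
    # Imported here so kafka_source_options can be used without pyspark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    def label(text):
        # Imported inside the UDF so each executor resolves textblob locally.
        from textblob import TextBlob

        polarity = TextBlob(text or "").sentiment.polarity
        if polarity > 0:
            return "positive"
        if polarity < 0:
            return "negative"
        return "neutral"

    spark = SparkSession.builder.appName("SentimentTracker").getOrCreate()
    label_udf = udf(label, StringType())

    # Kafka delivers the payload as bytes in the "value" column.
    messages = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(**kafka_kwargs))
        .load()
        .select(col("value").cast("string").alias("text"))
    )
    labeled = messages.withColumn("sentiment", label_udf(col("text")))

    # The file sink supports append mode and needs a checkpoint location.
    return (
        labeled.writeStream.format("json")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )
```

On EMR, `output_path` and `checkpoint_path` would typically be S3 locations in a bucket you own, and the caller would block on the returned query with `awaitTermination()`.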
5. Run the Kafka Producer
Stream data to Kafka:
python kafka_producer.py
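The sketch below shows one way a producer could replay the Sentiment140 CSV into the tweets topic with kafka-python. It is an illustration, not the contents of kafka_producer.py: the function names, the JSON message shape, and the per-row delay are assumptions. The column order follows the dataset's description (the file has no header row), with `id` used here as a shorter name for the dataset's `ids` column.

```python
import csv
import json
import time

# Column order of the Sentiment140 CSV, which ships without a header row.
COLUMNS = ["target", "id", "date", "flag", "user", "text"]


def row_to_message(row):
    """Turn one CSV row into UTF-8 JSON bytes carrying the tweet id, user, and text."""
    record = dict(zip(COLUMNS, row))
    payload = {"id": record["id"], "user": record["user"], "text": record["text"]}
    return json.dumps(payload).encode("utf-8")


def stream_csv(path, topic="tweets", bootstrap="localhost:9092", delay=0.1):
    """Send each row of the CSV at `path` to Kafka, pausing `delay` seconds between rows."""
    # Imported here so row_to_message can be used without kafka-python installed.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap)
    # The Sentiment140 file is Latin-1 encoded rather than UTF-8.
    with open(path, newline="", encoding="latin-1") as f:
        for row in csv.reader(f):
            producer.send(topic, value=row_to_message(row))
            time.sleep(delay)  # throttle to imitate a live feed
    producer.flush()
    producer.close()
```

Leaving the `target` column out of the message keeps the dataset's own labels from leaking into the stream, so the downstream sentiment step has to score the text itself.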
6. Run the Flask Application
Launch the dashboard:
python app.py
Access the dashboard at http://127.0.0.1:5000.
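A minimal sketch of how the dashboard's bar chart data could be produced is shown below. It is not the project's app.py: the `/api/sentiment` route, the `create_app` factory, and the caller-supplied `load_labels` function are hypothetical, and how results are actually read back from S3 is left out. The figure is built as a plain dictionary in the JSON layout Plotly accepts, so the page can hand it to Plotly in the browser.

```python
from collections import Counter

# Fixed bar order so the chart does not reshuffle between refreshes.
LABELS = ["positive", "negative", "neutral"]


def sentiment_figure(labels):
    """Build a Plotly figure spec (a plain dict) with one bar per sentiment label."""
    counts = Counter(labels)
    return {
        "data": [
            {"type": "bar", "x": LABELS, "y": [counts[name] for name in LABELS]}
        ],
        "layout": {
            "title": {"text": "Sentiment counts"},
            "yaxis": {"title": {"text": "Messages"}},
        },
    }


def create_app(load_labels):
    """Return a Flask app; `load_labels` is a callable returning a list of labels."""
    # Imported here so sentiment_figure can be used without Flask installed.
    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route("/api/sentiment")
    def sentiment():
        # Recompute on every request so the chart reflects the latest results.
        return jsonify(sentiment_figure(load_labels()))

    return app
```

For a quick local check, `create_app(lambda: ["positive", "neutral"]).run(port=5000)` would serve the figure JSON at http://127.0.0.1:5000/api/sentiment.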
File Structure
project/
├── kafka_producer.py      # Streams data to Kafka
├── spark_streaming.py     # Spark job for real-time processing
├── app.py                 # Flask application for visualization
├── templates/
│   └── index.html         # HTML for the dashboard
├── requirements.txt       # Python dependencies
└── README.md              # Project documentation
Example Output
- Dashboard: Displays a bar chart with the counts of positive, negative, and neutral sentiments.
Future Improvements
- Add live integration with social media APIs (e.g., Twitter API).
- Use advanced sentiment analysis models for improved accuracy.
- Implement real-time notifications based on sentiment trends.