
Real-Time Sentiment Tracker for Streaming Social Media Data

This project analyzes the sentiment of streaming social media data in real time using Apache Kafka, Apache Spark on AWS EMR, and a Flask-based dashboard. Processed results are stored in Amazon S3 and visualized as sentiment trends that update as new data arrives.

Features

Data Streaming:

A Kafka producer streams social media data, simulated by replaying a dataset of tweets.

Real-Time Processing:

Apache Spark processes the stream and scores the sentiment of each message with TextBlob.

Data Storage:

Processed data is stored in Amazon S3.

Visualization:

Flask-based dashboard with Plotly graphs displays sentiment trends.
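
The TextBlob step in the feature list above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names and the zero threshold are assumptions, and the only TextBlob call used is the polarity score, which ranges from -1.0 to 1.0.

```python
def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    The threshold is a hypothetical tuning knob: scores within
    [-threshold, threshold] are treated as neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def classify(text: str) -> str:
    """Score one message with TextBlob and return its label."""
    # Imported lazily so label_from_polarity works without textblob installed.
    from textblob import TextBlob

    return label_from_polarity(TextBlob(text).sentiment.polarity)
```

Calling classify("some tweet text") returns one of the three labels. Raising the threshold (for example to 0.1) widens the neutral band, which may help with short, noisy messages.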

Technologies Used

Apache Kafka: Data streaming.
Apache Spark: Real-time processing (via PySpark).
AWS EMR: Cloud-based Spark cluster.
Flask: Web application for dashboard.
Plotly: Data visualization.
Amazon S3: Data storage.
Python Libraries: textblob, boto3, kafka-python.

Setup Instructions

1. Prerequisites

Python 3.8 or above
AWS account with access to S3 and EMR
Apache Kafka, installed locally or provided as a managed service
A Python virtual environment (created in step 2)
Dataset: Sentiment140, https://www.kaggle.com/datasets/kazanova/sentiment140

2. Install Dependencies

Create a virtual environment:

python3 -m venv venv

Activate the virtual environment:

macOS/Linux:

source venv/bin/activate

Windows:

venv\Scripts\activate

Install required packages:

pip install -r requirements.txt

3. Set Up Kafka

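Start Zookeeper (run the commands in this step from the Kafka installation directory, each server in its own terminal):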

bin/zookeeper-server-start.sh config/zookeeper.properties

Start Kafka:

bin/kafka-server-start.sh config/server.properties

Create a Kafka topic:

bin/kafka-topics.sh --create --topic tweets --bootstrap-server localhost:9092

4. Set Up AWS EMR

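Go to the AWS Management Console and navigate to the EMR service.
Create a cluster with the following configuration:

Release version: emr-6.x.x

Applications: Spark and Hadoop.

Instance type: m5.xlarge.

SSH access: enabled (select an EC2 key pair when creating the cluster).

Launch the cluster.
Upload the spark_streaming.py script to the cluster and run it (for example, with spark-submit). The cluster must be able to reach the Kafka broker over the network; a broker listening only on localhost on your own machine is not reachable from EMR.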
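
For orientation, a Spark job of this kind could look roughly like the sketch below. It is not the contents of spark_streaming.py. It assumes that messages arrive as JSON with a "text" field, that Spark's Kafka connector package is available on the cluster, and that label_fn is any Python function mapping text to a label (for example a TextBlob wrapper). The function names and the S3 paths in the comments are placeholders.

```python
def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's Kafka source (Structured Streaming)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def start_stream(spark, label_fn, bootstrap_servers, topic, output_path, checkpoint_path):
    """Read JSON messages from Kafka, label each one, and append Parquet files.

    label_fn runs on the executors, so its dependencies must be installed there.
    """
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    label_udf = F.udf(label_fn, StringType())
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    scored = (
        # Kafka delivers the payload as bytes in the "value" column.
        raw.select(F.col("value").cast("string").alias("json"))
        .select(F.get_json_object("json", "$.text").alias("text"))
        .withColumn("sentiment", label_udf(F.col("text")))
    )
    return (
        scored.writeStream.format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .start()
    )


# Hypothetical usage on the cluster (bucket and host names are placeholders):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("sentiment-tracker").getOrCreate()
#   query = start_stream(spark, my_label_fn, "<broker-host>:9092", "tweets",
#                        "s3://<bucket>/sentiment/", "s3://<bucket>/checkpoints/")
#   query.awaitTermination()
```

The checkpoint location lets Spark resume from the last processed Kafka offsets after a restart, which is why it is kept separate from the output path.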

5. Run the Kafka Producer

Stream data to Kafka:

python kafka_producer.py
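
As a rough idea of what a producer like this does, the sketch below replays the dataset into the tweets topic. It is an illustration, not the contents of kafka_producer.py. It assumes the Sentiment140 CSV layout (six columns, no header row, Latin-1 encoding) and a broker at localhost:9092. The function names, the message fields, and the delay are hypothetical choices.

```python
import csv
import json
import time
from typing import Iterable, Iterator

# Assumed column order of the Sentiment140 CSV, which has no header row.
COLUMNS = ["target", "id", "date", "flag", "user", "text"]


def rows_to_messages(lines: Iterable[str]) -> Iterator[dict]:
    """Turn raw CSV lines into dicts holding the fields sent to Kafka."""
    for row in csv.reader(lines):
        if len(row) != len(COLUMNS):
            continue  # skip malformed lines
        record = dict(zip(COLUMNS, row))
        yield {"id": record["id"], "user": record["user"], "text": record["text"]}


def stream_file(path: str, topic: str = "tweets", delay_s: float = 0.1) -> None:
    """Replay the dataset into Kafka, pausing between messages to mimic a live feed."""
    # Imported lazily; KafkaProducer comes from the kafka-python package.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda value: json.dumps(value).encode("utf-8"),
    )
    with open(path, encoding="latin-1", newline="") as handle:
        for message in rows_to_messages(handle):
            producer.send(topic, message)
            time.sleep(delay_s)
    # Make sure buffered messages are delivered before exiting.
    producer.flush()


# Hypothetical usage; point the path at wherever the downloaded CSV lives:
#   stream_file("training.1600000.processed.noemoticon.csv")
```

Serializing each message as JSON keeps the payload readable by any consumer, including a Spark job that extracts the text field.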

6. Run the Flask Application

Launch the dashboard:

python app.py

Access the dashboard at http://127.0.0.1:5000.

File Structure

project/
├── kafka_producer.py       # Streams data to Kafka
├── spark_streaming.py      # Spark job for real-time processing
├── app.py                  # Flask application for visualization
├── templates/
│   └── index.html          # HTML for the dashboard
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation

Example Output

Dashboard: Displays a bar chart with the counts of positive, negative, and neutral sentiments.
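
The data behind such a bar chart can be sketched as below. This is an assumption about how the counts might be assembled, not the code in app.py: the function name, chart title, and axis title are hypothetical, and the figure is built as a plain dictionary in the JSON shape that Plotly accepts.

```python
from collections import Counter
from typing import Iterable

# Fixed bar order so the chart does not reshuffle as counts change.
LABELS = ["positive", "negative", "neutral"]


def sentiment_bar_figure(labels: Iterable[str]) -> dict:
    """Count sentiment labels and return a Plotly bar figure as a JSON-ready dict."""
    counts = Counter(labels)
    return {
        "data": [
            {"type": "bar", "x": LABELS, "y": [counts[name] for name in LABELS]}
        ],
        "layout": {
            "title": {"text": "Sentiment counts"},
            "yaxis": {"title": {"text": "Messages"}},
        },
    }
```

A Flask route could return this dictionary with jsonify, and the page could pass its data and layout entries to Plotly.newPlot. Labels that never occur are reported as zero, so all three bars are always present.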

Future Improvements

Add live integration with social media APIs (e.g., Twitter API).
Use advanced sentiment analysis models for improved accuracy.
Implement real-time notifications based on sentiment trends.