# Real-Time Sentiment Tracker for Streaming Social Media Data

This project analyzes the sentiment of streaming social media data in real time using Apache Kafka, Apache Spark on AWS EMR, and a Flask-based dashboard. The dashboard visualizes the processed results dynamically to show sentiment trends.

---

## Features
1. **Data Streaming**:
   - Kafka streams social media data (simulated using a dataset).

2. **Real-Time Processing**:
   - Apache Spark processes the stream and scores each message's sentiment with TextBlob.

3. **Data Storage**:
   - Processed data is stored in Amazon S3.

4. **Visualization**:
   - Flask-based dashboard with Plotly graphs displays sentiment trends.
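
The sentiment step can be illustrated with a short sketch. TextBlob reports a polarity score between -1.0 and 1.0; how that score becomes a positive, negative, or neutral label is an assumption here (a symmetric threshold around zero), and the function names are hypothetical rather than taken from `spark_streaming.py`.

```python
def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1.0, 1.0] to a sentiment label.

    The threshold is an assumed tuning knob: scores within
    [-threshold, threshold] count as neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def classify(text: str) -> str:
    """Score a piece of text with TextBlob and return its label."""
    # Imported here so label_from_polarity works without textblob installed.
    from textblob import TextBlob

    return label_from_polarity(TextBlob(text).sentiment.polarity)
```

For example, `label_from_polarity(0.4)` returns `"positive"`, and a score of exactly `0.0` maps to `"neutral"`. Inside a Spark job, a function like `classify` would typically be wrapped in a UDF and applied to the text column.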

---

## Technologies Used
- **Apache Kafka**: Data streaming.
- **Apache Spark**: Real-time processing (via PySpark).
- **AWS EMR**: Cloud-based Spark cluster.
- **Flask**: Web application for dashboard.
- **Plotly**: Data visualization.
- **Amazon S3**: Data storage.
- **Python Libraries**: `textblob`, `boto3`, `kafka-python`.

---

## Setup Instructions

### 1. Prerequisites
- Python 3.8 or above
- AWS account with access to S3 and EMR
- Apache Kafka, installed locally or available as a managed service
- A Python virtual environment
- Dataset: https://www.kaggle.com/datasets/kazanova/sentiment140

### 2. Install Dependencies
1. Create a virtual environment:
   ```bash
   python3 -m venv venv
   ```
2. Activate the virtual environment:
   - macOS/Linux:
     ```bash
     source venv/bin/activate
     ```
   - Windows:
     ```bash
     venv\Scripts\activate
     ```
3. Install required packages:
   ```bash
   pip install -r requirements.txt
   ```

### 3. Set Up Kafka
1. Start ZooKeeper (run from the Kafka installation directory):
   ```bash
   bin/zookeeper-server-start.sh config/zookeeper.properties
   ```
2. Start the Kafka broker in a separate terminal:
   ```bash
   bin/kafka-server-start.sh config/server.properties
   ```
3. Create a Kafka topic:
   ```bash
   bin/kafka-topics.sh --create --topic tweets --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
   ```

### 4. Set Up AWS EMR
1. Go to the AWS Management Console and navigate to the EMR service.
2. Create a cluster with the following configurations:
   - Release version: `emr-6.x.x`
   - Enable Spark and Hadoop.
   - Choose `m5.xlarge` instance types.
   - Ensure SSH access is enabled.
3. Launch the cluster.
4. Upload the `spark_streaming.py` script to the cluster and run it with `spark-submit`.
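
As an alternative to the console, the same cluster settings can be expressed with `boto3`. The sketch below is illustrative only, not part of this repository: the function names, the two core nodes, the default role names `EMR_EC2_DefaultRole` and `EMR_DefaultRole`, and every argument (cluster name, key pair, S3 URIs) are assumptions. The release label stays a parameter so you can pass whichever `emr-6.x.x` version you choose.

```python
def cluster_config(name, release_label, key_name, log_uri):
    """Build keyword arguments for the EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,  # caller supplies an emr-6.x.x release
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Two core nodes is an assumed size, not a project requirement.
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "Ec2KeyName": key_name,  # key pair used for SSH access
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumes the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def spark_step(script_uri):
    """Describe a step that runs a PySpark script via spark-submit."""
    return {
        "Name": "spark_streaming",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri],
        },
    }


def launch(config, step):
    """Create the cluster, queue the step, and return the cluster ID."""
    import boto3  # imported here so the builders above need no AWS setup

    emr = boto3.client("emr")
    cluster_id = emr.run_job_flow(**config)["JobFlowId"]
    emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return cluster_id
```

Keeping the configuration in plain dictionaries makes it easy to review or version before anything is launched. Note that a Kafka broker running on `localhost` is not reachable from the cluster, so the Spark job needs a broker address the EMR nodes can connect to.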

### 5. Run the Kafka Producer
Stream data to Kafka:
```bash
python kafka_producer.py
```
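
For orientation, a producer along these lines might look like the sketch below. It is not the contents of `kafka_producer.py`; the function names, message shape, and delay are assumptions. It relies on the Sentiment140 CSV having no header row and six columns (target, id, date, flag, user, text), encoded as Latin-1.

```python
import csv
import json
import time

# Sentiment140 ships without a header row; these are its six columns.
FIELDS = ["target", "id", "date", "flag", "user", "text"]


def tweet_messages(lines):
    """Yield one JSON-encoded message (bytes) per CSV row."""
    for row in csv.DictReader(lines, fieldnames=FIELDS):
        payload = {"id": row["id"], "user": row["user"], "text": row["text"]}
        yield json.dumps(payload).encode("utf-8")


def stream_file(path, topic="tweets", servers="localhost:9092", delay=0.1):
    """Send each tweet to Kafka, pausing between sends to mimic a live feed."""
    from kafka import KafkaProducer  # provided by the kafka-python package

    producer = KafkaProducer(bootstrap_servers=servers)
    with open(path, encoding="latin-1", newline="") as handle:
        for message in tweet_messages(handle):
            producer.send(topic, message)
            time.sleep(delay)
    producer.flush()  # make sure buffered messages reach the broker
```

Separating row parsing from sending means the message format can be checked without a running broker.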

### 6. Run the Flask Application
Launch the dashboard:
```bash
python app.py
```

Access the dashboard at `http://127.0.0.1:5000`.
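
A minimal sketch of how such a dashboard backend could expose sentiment counts is shown here. The `/api/sentiment` route, the `create_app` factory, and the `load_labels` callback are hypothetical and may not match `app.py`; in the real application the labels would come from the processed data in S3.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def sentiment_counts(labels):
    """Count labels, always returning all three categories."""
    counts = Counter(labels)
    return {label: counts.get(label, 0) for label in LABELS}


def create_app(load_labels):
    """Build a Flask app; load_labels is a callable returning label strings."""
    from flask import Flask, jsonify  # imported here so counting works without Flask

    app = Flask(__name__)

    @app.route("/api/sentiment")
    def sentiment():
        # Recompute on every request so the chart reflects the latest data.
        return jsonify(sentiment_counts(load_labels()))

    return app
```

With a stub data source, `create_app(lambda: ["positive", "neutral"]).run(port=5000)` would serve the counts as JSON on the same port as the dashboard.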

---

## File Structure
```
project/
├── kafka_producer.py       # Streams data to Kafka
├── spark_streaming.py      # Spark job for real-time processing
├── app.py                  # Flask application for visualization
├── templates/
│   └── index.html          # HTML for the dashboard
├── requirements.txt        # Python dependencies
└── README.md               # Project documentation
```

---

## Example Output
- **Dashboard**: Displays a bar chart with the counts of positive, negative, and neutral sentiments.
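
The bar chart can be described as a plain Plotly figure dictionary, which both `plotly.graph_objects.Figure` and plotly.js accept. The sketch below assumes a hypothetical `bar_figure` helper and made-up axis titles; the actual dashboard may build its chart differently.

```python
import json


def bar_figure(counts):
    """Turn a {label: count} mapping into a Plotly bar-chart figure dict."""
    labels = list(counts)
    return {
        "data": [
            {"type": "bar", "x": labels, "y": [counts[label] for label in labels]}
        ],
        "layout": {
            "title": {"text": "Sentiment counts"},
            "xaxis": {"title": {"text": "Sentiment"}},
            "yaxis": {"title": {"text": "Messages"}},
        },
    }


if __name__ == "__main__":
    # Illustrative numbers only, not results from the project.
    figure = bar_figure({"positive": 3, "negative": 1, "neutral": 2})
    print(json.dumps(figure, indent=2))
```

Because the figure is ordinary JSON, a Flask view can pass it to the template and let plotly.js render it in the browser.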

---

## Future Improvements
- Add live integration with social media APIs (e.g., Twitter API).
- Use advanced sentiment analysis models for improved accuracy.
- Implement real-time notifications based on sentiment trends.