📦 Asabeneh / data-analysis-with-python-spring-2025

📄 readme.md · 126 lines
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126### **Week 1: Introduction to Data Analysis & Practicalities**  

- **Objective**: Set up the environment and understand the data lifecycle.  
- **Tools**: Python, Visual Stuido Code, Jupyter Notebook on Anaconda, Google Colab.  
- **Topics**:  
  - What is data analysis? (Descriptive vs. diagnostic vs. predictive).  
  - Installing Python libraries (`pip`, `conda`).  
  - Data types (structured, semi struncture, unstructured data).  
  - Introduction to datasets (CSV, Excel, SQL).  
  - Common File fomats of datasets(.txt, .csv, .tsv, .json. .xml, .xlsx)
  - Ethics in data handling (GDPR, privacy).  
  - **Hands-on**: Loading a dataset and basic exploration.

---

### **Week 2: NumPy Fundamentals**  

- **Objective**: Master array operations for numerical computing.  
- **Tools**: NumPy.  
- **Topics**:  
  - Creating arrays (1D, 2D, 3D).  
  - Array operations (reshaping, slicing, broadcasting).  
  - Mathematical functions (aggregations, linear algebra).  
  - Random sampling (normal, uniform distributions).  
  - **Hands-on**: Solving numerical problems (e.g., Random data generation, matrix multiplication, ).

---

### **Week 3: Data Visualization with Matplotlib**  

- **Objective**: Create static, interactive, and publication-quality plots.  
- **Tools**: Matplotlib, Seaborn (optional).  
- **Topics**:  
  - Line plots, bar charts, histograms, scatter plots.  
  - Customizing plots (labels, legends, colors).  
  - Subplots and multi-panel figures.  
  - Introduction to Seaborn for statistical visuals.  
  - **Hands-on**: Visualizing different datasets

---

### **Week 4: Pandas for Data Manipulation**  

- **Objective**: Clean, transform, and analyze tabular data.  
- **Tools**: Pandas.  
- **Topics**:  
  - DataFrames vs. Series.  
  - Indexing (`loc`, `iloc`), filtering, and grouping.  
  - Handling missing data (`dropna`, `fillna`).  
  - Merging/joining datasets.  
  - **Hands-on**: Cleaning, Exploring, transforming, visualizing and  analysing different datasets

---

### **Week 5: Descriptive Statistics**  

- **Objective**: Summarize and interpret data distributions.  
- **Tools**: Pandas, SciPy.  
- **Topics**:  
  - Measures of central tendency (mean, median, mode).  
  - Measures of spread (variance, standard deviation, IQR).  
  - Skewness, kurtosis, and distributions (normal, binomial).  
  - Correlation and covariance.  
  - **Hands-on**: Exporing and analyzing a dataset.

---

### Exercise

#### **Task 1: Employee Dataset Analysis**  

**Objective**: Use Python and pandas to analyze the [Employee Dataset](https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset) and derive actionable insights.  You can download the [Employee Dataset](https://www.kaggle.com/datasets/tawfikelmetwally/employee-dataset) data from Kaggle, you need to create an account on Kaggle since it requires to  downolad datasets

**Requirements**:  

1. **Data Preparation**:  
   - Clean the dataset (handle missing values, duplicates, data types).  
   - Validate columns like `salary`, `age`, and `management` for consistency.  

2. **Exploratory Analysis**:  
   - Generate summary statistics (mean, median, distributions).  
   - Explore relationships between variables (e.g., `salary` vs. `education`, `management` vs. `environment` satisfaction).  

3. **Visualization**:  
   - Create visualizations (e.g., boxplots for salary distribution by education level, heatmaps for correlation analysis).  
   - Highlight trends (e.g., attrition patterns linked to `management` scores).  

4. **Key Questions**:  
   - Does higher education correlate with salary or job retention?  
   - Are there gender disparities in salary or promotion?  
   - What workplace factors (e.g., `environment`, `colleagues`) most impact employee satisfaction?  

---

#### **Task 2: Cat Breed API to CSV Transformation**  

**Objective**: Fetch data from [The Cat API](https://api.thecatapi.com/v1/breeds) and transform it into a structured `cats.csv` file.  

**Requirements**:  

1. **API Data Extraction**:  
   - Fetch breed data programmatically

2. **Data Transformation**:  
   - Map API fields to CSV headers:  

     ```csv
     ID, Name, Origin, Description, Temperament, Life Span (years), Weight (kg), Image URL
     ```  

   - **Special Cases**:  
     - Combine `temperament` as a comma-separated string (e.g., "Active, Curious").  
     - Convert `weight` from imperial to metric if necessary.  
     - Extract the first image URL from the breed’s `image` object.  

3. **Validation**:  
   - Handle missing fields (e.g., default `Description` to "N/A" if empty).  
   - Ensure numeric columns (`Life Span`, `Weight`) are properly formatted.  

    **Sample CSV Row**:  

    ```csv
    ID, Name, Origin, Description, Temperament, Life Span (years), Weight (kg), Image URL
    abys,Abyssinian,Egypt,"The Abyssinian is easy to care for...","Active, Energetic, Independent",14.5,4,https://cdn2.thecatapi.com/images/0XYvRd7oD.jpg
    ```