Data warehouse implementation for the Australian Road Deaths Database (ARDD) using dlt, SQLMesh, Docker, PostgreSQL, and Metabase
https://github.com/sglkc/ardd-data-warehouse.git
Internship final project: a data warehouse to visualize the Australian Road Deaths Database (ARDD). The project practices task management, data analysis, the ETL (extract, transform, load) process, data architecture, and some CI/CD & DevOps.
For quick access, the data warehouse is currently hosted on a VPS. Please don't abuse it. If you find yourself unable to log in, it means I've changed the credentials and access is no longer available.
Demo Metabase
The project references a data warehouse project from The University of Western Australia:
https://csse-uwa.gitbook.io/data-warehouse-project-1-s1-2025
The dataset for this project is gathered from the Australian Road Deaths Database (ARDD):
https://datahub.roadsafety.gov.au/progress-reporting/monthly-road-deaths
The GeoJSON data for map visualization in Metabase is downloaded from:
https://csse-uwa.gitbook.io/data-warehouse-project-1-s1-2025#data-sources
with 0.05% Visvalingam simplification applied via https://mapshaper.org/ to reduce the file size.
The data warehouse uses the Medallion Architecture, which consists of three layers: bronze (raw ingested data), silver (cleaned data), and gold (analytics-ready models).
The project handles crashes and fatalities data extracted from the ARDD. Following is the data architecture diagram:
A fact constellation, also known as a galaxy schema, is a natural fit for this warehouse. The data processing results in two fact tables and six dimensions, where all facts share the same dimensions, except for dim_victims, which belongs exclusively to fact_fatalities. The following is the data schema for the final gold layer.
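The fact-to-dimension relationships can be sketched as follows. Only fact_fatalities and dim_victims are named above; every other table name here is a hypothetical placeholder, not the real schema:

```python
# Sketch of the galaxy (fact constellation) schema: two fact tables
# share the common dimensions, while dim_victims is referenced only
# by fact_fatalities. All names besides fact_fatalities and
# dim_victims are illustrative placeholders.
shared_dims = {"dim_date", "dim_location", "dim_road", "dim_crash_type", "dim_vehicle"}

schema = {
    "fact_crashes": set(shared_dims),
    "fact_fatalities": shared_dims | {"dim_victims"},
}

# Every fact table joins to all of the shared dimensions...
assert all(shared_dims <= dims for dims in schema.values())
# ...but dim_victims belongs exclusively to fact_fatalities.
assert "dim_victims" not in schema["fact_crashes"]
```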
You may download the repository archive directly using the URL:
https://github.com/sglkc/data-warehouse/archive/refs/heads/master.zip
or clone with git:
git clone https://github.com/sglkc/ardd-data-warehouse.git && cd ardd-data-warehouse
Before running, you must set up environment variables for the project:
cp .env.example .env
The variables are ready for development, but they can be customized.
The project uses uv for Python version and package management; make sure you have it installed.
uv venv
source .venv/bin/activate
uv sync
The project uses Docker for containerization; the services used are:
./compose.sh up
To stop the services, run:
./compose.sh down
To clean up everything, including container data, run:
./compose.sh cleanup
The script should be executable by default; if not, run:
chmod +x ./compose.sh
By default, the credentials for containerized database should be:
Upon running the compose up script, you will have to wait for Metabase to start.
This process should take about a minute; you can follow the logs with:
docker logs ${PROJECT_NAME}_metabase -f
Replace PROJECT_NAME with your environment variable value; by default, it's au_crashes.
On first start-up, you will be required to create an admin account to access Metabase. Follow the process and enter any values. On the 4th step, pick PostgreSQL and enter the database credentials.
If the database runs locally on your machine rather than inside the Docker network, you must replace localhost with host.docker.internal.
Setup PostgreSQL data warehouse in Metabase
If you ran Metabase before executing the data pipeline, you will probably need to refresh the database schema and field values.
To do this, head to: http://localhost:3000/admin/databases/
Refresh data warehouse in Metabase
By default, Metabase only ships with the US and World maps.
To add Australian regions to the map visualization:
Add a map
The project utilizes dlt (data load tool) to extract the file datasets into the bronze layer of the data warehouse.
To load the source datasets into the data warehouse, run the following script with a source name:
./dlt.sh <all|fatalities-jun|fatalities-jul|lga|ra|sa4|dwellings|population>
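Conceptually, the extract step turns each source file into rows of records for the bronze layer. A minimal stdlib-only sketch of that idea (the real pipeline uses dlt, and the column names here are hypothetical, not the actual ARDD schema):

```python
import csv
import io

# Tiny inline sample standing in for an ARDD CSV file; the real
# pipeline downloads the datasets and loads them with dlt. Column
# names are hypothetical.
sample = """crash_id,state,month,year
20231001,NSW,1,2023
20231002,VIC,2,2023
"""

# Each CSV row becomes a dict keyed by its column header, which is
# roughly the shape that gets loaded into a bronze-layer table.
rows = list(csv.DictReader(io.StringIO(sample)))
```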
As for the transform and load (TL) steps, the project uses SQLMesh in the silver and gold layers.
To apply model changes to the data warehouse, run the following script:
./sqlmesh.sh
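The silver layer's job is cleaning and conforming the bronze rows. The project expresses this as SQLMesh SQL models; here is a plain-Python sketch of the same idea, with hypothetical column names:

```python
# Hypothetical bronze rows with the kinds of issues a silver-layer
# model would fix: stray whitespace, inconsistent casing, and a
# missing value. Column names are illustrative only.
bronze_rows = [
    {"state": " nsw ", "fatalities": "2"},
    {"state": "Vic", "fatalities": "1"},
    {"state": "QLD", "fatalities": None},
]

# Normalize state codes and cast counts, dropping unusable rows --
# the project does this in SQL via SQLMesh, not in Python.
silver_rows = [
    {"state": r["state"].strip().upper(), "fatalities": int(r["fatalities"])}
    for r in bronze_rows
    if r["fatalities"] is not None
]
```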
If you encounter Failed to build psycopg2==2.9.10 during uv sync, install the libpq development headers to build it:
sudo apt-get install libpq-dev
There are more publicly available datasets from the Australian Bureau of Statistics that could expand this project into a larger data warehouse:
https://www.abs.gov.au/statistics
The population datasets are particularly helpful for relating territory populations to death counts.
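For example, joining fatality counts with ABS population figures yields a deaths-per-100,000 rate per territory. A sketch with made-up numbers:

```python
# Made-up population and death counts for illustration only; real
# figures would come from the ABS datasets and the fatalities fact table.
population = {"NSW": 8_200_000, "VIC": 6_700_000}
deaths = {"NSW": 339, "VIC": 295}

# Deaths per 100,000 residents, a common road-safety metric.
rate_per_100k = {
    state: round(deaths[state] / population[state] * 100_000, 2)
    for state in population
}
```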
Feel free to contribute to or fork this repository, and to use it to learn!