📦 linhkid / tabgen-bm

๐Ÿ“ ctabgan
๐Ÿ“ ctabganplus
๐Ÿ“ Data
๐Ÿ“ Discrete
๐Ÿ“ distsampl
๐Ÿ“ envs
๐Ÿ“ ganblr
๐Ÿ“ great
๐Ÿ“ Raw
๐Ÿ“ Scripts
๐Ÿ“ src
๐Ÿ“ tabddpm
๐Ÿ“ tabsyn
๐Ÿ“„ .gitignore
๐Ÿ“„ main.py
๐Ÿ“„ README.md
๐Ÿ“„ requirements.txt

# Tabular Data Generation Benchmark

This repository contains code for benchmarking various tabular data generation models.

## Dataset Structure

The repository organizes datasets by size category (a small discovery sketch follows this list):

- Small datasets: `datasets_DM_small/`
- Medium datasets: `datasets_DM_medium/`
- Large datasets: `datasets_DM_big/`
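
The sketch below is only an illustration of how the size categories could be enumerated programmatically. The directory names come from the list above; the assumption that each dataset sits directly inside its size directory is mine and may not match the real layout.

```python
# Minimal sketch: list the datasets available in each size category.
# Directory names are taken from the README; the flat per-file layout is assumed.
from pathlib import Path

SIZE_DIRS = {
    "small":  Path("datasets_DM_small"),
    "medium": Path("datasets_DM_medium"),
    "big":    Path("datasets_DM_big"),
}

for size, directory in SIZE_DIRS.items():
    names = sorted(p.stem for p in directory.glob("*")) if directory.exists() else []
    print(f"{size:6s}: {len(names)} datasets -> {names}")
```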

## Available Models

The following models are implemented (a summary sketch follows the list):

- GANBLR - Generative Adversarial Network with Bayesian Label Representation
- GANBLR++ - GANBLR enhanced with reinforcement learning capabilities
- CTABGAN+ - Conditional Tabular GAN Plus
- TabDDPM - Tabular Denoising Diffusion Probabilistic Model
- TabSyn - Tabular Data Synthesizer
- GREAT - Generation of Realistic Tabular data
- RLIG - Representation Learning with Information Gain (KDB-based)
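
As a quick orientation, the sketch below summarizes which conda environment and training script each model uses, derived from the Manual Workflow sections further down. Only `ganblr`, `tabddpm`, and `rlig` appear as `--model` names in the `main.py` examples; the other keys are assumptions for illustration.

```python
# Summary derived from the Manual Workflow sections below.
# Keys other than "ganblr", "tabddpm", and "rlig" are assumed names, not confirmed CLI values.
MODELS = {
    "ganblr":      {"env": "tabgen-tf",    "train_script": "Scripts/ganblr_train.py"},
    "ganblrplus":  {"env": "tabgen-tf",    "train_script": "Scripts/ganblrplus_train.py"},
    "ctabganplus": {"env": "tabgen-torch", "train_script": "Scripts/ctabganplus_train.py"},
    "tabddpm":     {"env": "tabgen-torch", "train_script": "Scripts/tabddpm_train.py"},
    "tabsyn":      {"env": "tabgen-torch", "train_script": "Scripts/tabsyn_train.py"},
    "great":       {"env": "tabgen-torch", "train_script": "Scripts/great_train.py"},
    "rlig":        {"env": "tabgen-tf",    "train_script": "Scripts/rlig_train.py"},
}
```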

## Setup

- Create the conda environments (this only needs to be done once):
```bash
# For TensorFlow-based models
conda env create -f envs/env_tf.yml

# For PyTorch-based models
conda env create -f envs/env_torch.yml
```

## Using the Benchmark Runner

This repository includes a `main.py` script that automates the entire workflow. It will:

1. Convert ARFF files to CSV
2. Preprocess the data
3. Split datasets into train/test
4. Train selected models
5. Run evaluation

### Examples

```bash
# Create the conda environments (only needs to be done once)
python main.py --create_envs

# Run a single model on a single dataset (for testing)
python main.py --single_run --dataset car --model ganblr --size small

# Run specific models on specific datasets
python main.py --datasets adult magic --models ganblr tabddpm

# Run all models on all datasets
python main.py

# Run with GPU selection
python main.py --gpu 1

# Run only the new RLIG model
python main.py --single_run --dataset adult --model rlig --size medium
```

## Manual Workflow

If you prefer to run the steps manually, follow these instructions:

### 1. Convert ARFF to CSV

```python
# Include this in your script
import re

import pandas as pd
from scipy.io import arff

# Load ARFF file
data, meta = arff.loadarff('adult.arff')
df = pd.DataFrame(data)

# Decode byte strings
df = df.applymap(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)

# Clean: remove \, ", and ' from strings
df = df.applymap(lambda x: re.sub(r'[\\\'\"]', '', x) if isinstance(x, str) else x)

# Helper for encoding pre-discretized interval bins:
# extract the lower bound of an interval string such as "(-inf-25.5]" so bins can be sorted numerically
def extract_lower_bound(interval):
    match = re.match(r"\(?(-?[\d\.inf]+)-", interval)
    if match:
        val = match.group(1)
        return float('-inf') if val == '-inf' else float(val)
    return float('inf')  # fallback if the format doesn't match

def encode_bins_numerically(series):
    unique_bins = series.dropna().unique()
    sorted_bins = sorted(unique_bins, key=extract_lower_bound)
    bin_to_id = {bin_val: i for i, bin_val in enumerate(sorted_bins)}
    return series.map(bin_to_id), bin_to_id

# Identify and encode binned columns (fill in the binned column names for your dataset)
binned_cols = []
for col in binned_cols:
    df[col], mapping = encode_bins_numerically(df[col])
    print(f"\nMapping for {col}:")
    for k, v in mapping.items():
        print(f"{k} -> {v}")

# Save final CSV
df.to_csv("adult.csv", index=False)
```

### 2. Preprocessing

```bash
# Encode discrete values
python Scripts/preprocess_encode_only.py --input Raw/adult.csv --output Discrete/adult_discrete.csv

# Split dataset
python Scripts/split_dataset.py --input_csv Discrete/adult_discrete.csv --output_dir Data/adult --seed 42
```
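
Before training, it can be worth confirming that the encoded split looks sane. The sketch below is an illustration only: it assumes `split_dataset.py` writes `train.csv` and `test.csv` into the `--output_dir` (the exact output file names are an assumption; adjust them to whatever the script actually produces).

```python
# Minimal sanity check on the split output (file names under Data/adult are assumed).
from pathlib import Path
import pandas as pd

data_dir = Path("Data/adult")
train = pd.read_csv(data_dir / "train.csv")  # assumed output name
test = pd.read_csv(data_dir / "test.csv")    # assumed output name

assert list(train.columns) == list(test.columns), "train/test column mismatch"
print(f"train: {train.shape}, test: {test.shape}")
print(train.dtypes.value_counts())  # after encoding, columns should be integer-coded
```
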
### 3. Training Models

#### GANBLR

```bash
# Activate TF environment
conda activate tabgen-tf

# Train model
python Scripts/ganblr_train.py --dataset adult --size_category medium

# Evaluate
python Scripts/tstr_evaluation.py --synthetic_dir Synthetic/adult/ganblr --real_test_dir Data/adult
```

#### GANBLR++

```bash
# Activate TF environment
conda activate tabgen-tf

# Train model
python Scripts/ganblrplus_train.py --dataset adult --size_category medium --k 2 --episodes 5

# Evaluate
python Scripts/tstr_evaluation.py --synthetic_dir Synthetic/adult/ganblrplus --real_test_dir Data/adult
```

#### CTABGAN+

```bash
# Activate PyTorch environment
conda activate tabgen-torch

# Train model
python Scripts/ctabganplus_train.py --dataset_name adult --size_category medium

# Evaluate
python Scripts/tstr_evaluation.py --synthetic_dir Synthetic/adult/ctabgan_plus --real_test_dir Data/adult
```

#### TabDDPM

```bash
# Activate PyTorch environment
conda activate tabgen-torch

# Train model
python Scripts/tabddpm_train.py --dataset adult

# Evaluate
python Scripts/tstr_evaluation.py --synthetic_dir Synthetic/adult/tabddpm --real_test_dir Data/adult
```

#### TabSyn

```bash
# Activate PyTorch environment
conda activate tabgen-torch

# Create NPY files
python Scripts/create_npy.py --dataset adult

# Train VAE
python tabsyn/vae/main.py --dataname adult --gpu 0

# Train diffusion model
python Scripts/tabsyn_train.py --dataset adult

# Evaluate
python Scripts/tstr_evaluation.py --synthetic_dir Synthetic/adult/tabsyn --real_test_dir Data/adult
```

#### GREAT

```bash
# Activate PyTorch environment
conda activate tabgen-torch

# Train model
python Scripts/great_train.py --dataset adult

# Evaluate
python Scripts/tstr_evaluation.py --synthetic_dir Synthetic/adult/great --real_test_dir Data/adult
```

#### RLIG (KDB-based)

```bash
# Activate TF environment
conda activate tabgen-tf

# Train model
python Scripts/rlig_train.py --dataset adult --size_category medium

# Evaluate
python Scripts/tstr_evaluation.py --synthetic_dir Synthetic/adult/rlig --real_test_dir Data/adult
```
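
Once several models have produced synthetic data for a dataset, the per-model evaluation commands above can be looped. The sketch below is a convenience script, not part of the repository: it shells out to `Scripts/tstr_evaluation.py` with the same `--synthetic_dir`/`--real_test_dir` arguments shown above and assumes the synthetic outputs live under `Synthetic/<dataset>/<model>` as in those examples. Run it from an environment where the evaluation script's dependencies are available.

```python
# Convenience sketch (not part of the repo): run TSTR evaluation for every model
# that has produced synthetic data for a given dataset.
import subprocess
from pathlib import Path

dataset = "adult"
synthetic_root = Path("Synthetic") / dataset  # layout assumed from the examples above

for model_dir in sorted(p for p in synthetic_root.iterdir() if p.is_dir()):
    print(f"=== Evaluating {model_dir.name} ===")
    subprocess.run(
        [
            "python", "Scripts/tstr_evaluation.py",
            "--synthetic_dir", str(model_dir),
            "--real_test_dir", f"Data/{dataset}",
        ],
        check=True,
    )
```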