LLM Content Generation Evaluation App

A FastAPI web app to systematically evaluate which LLM (OpenAI, Anthropic, Google) and which prompting strategy produces the best marketing content, using blind human evaluations.

Tech Stack

  • Backend: FastAPI
  • Database: SQLite + SQLAlchemy ORM
  • Frontend: HTML/JS + Tailwind CSS (via CDN)
  • LLM SDKs: openai, anthropic, google-generativeai
  • Visualization: JSON summaries (Chart.js can be added later)

Project Structure

llm-content-eval/
โ”œโ”€โ”€ app/
โ”‚   โ”œโ”€โ”€ __init__.py
โ”‚   โ”œโ”€โ”€ main.py                 # FastAPI app entry point
โ”‚   โ”œโ”€โ”€ models.py               # SQLAlchemy models
โ”‚   โ”œโ”€โ”€ database.py             # DB engine, session, init/load tasks
โ”‚   โ”œโ”€โ”€ schemas.py              # Pydantic schemas
โ”‚   โ”œโ”€โ”€ llm_clients.py          # LLM API wrapper with offline stub
โ”‚   โ”œโ”€โ”€ generation_service.py   # Generation logic
โ”‚   โ”œโ”€โ”€ evaluation_service.py   # Blind evaluation logic
โ”‚   โ”œโ”€โ”€ analysis_service.py     # Aggregations & summaries
โ”‚   โ””โ”€โ”€ routers/
โ”‚       โ”œโ”€โ”€ __init__.py
โ”‚       โ”œโ”€โ”€ experiments.py      # Experiment endpoints
โ”‚       โ”œโ”€โ”€ generations.py      # Generation endpoints
โ”‚       โ”œโ”€โ”€ evaluations.py      # Evaluation endpoints
โ”‚       โ””โ”€โ”€ analysis.py         # Analysis endpoints
โ”œโ”€โ”€ static/
โ”‚   โ”œโ”€โ”€ css/style.css
โ”‚   โ””โ”€โ”€ js/main.js
โ”œโ”€โ”€ templates/
โ”‚   โ”œโ”€โ”€ base.html
โ”‚   โ”œโ”€โ”€ index.html              # Dashboard
โ”‚   โ”œโ”€โ”€ setup.html              # Experiment setup
โ”‚   โ”œโ”€โ”€ generate.html           # Run generations
โ”‚   โ”œโ”€โ”€ evaluate.html           # Blind evaluation
โ”‚   โ””โ”€โ”€ results.html            # Results & analysis
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ tasks.json              # Task definitions
โ”‚   โ””โ”€โ”€ database.db             # SQLite DB (created at runtime)
โ”œโ”€โ”€ config.py                   # Models, pricing, DB URL
โ”œโ”€โ”€ requirements.txt
โ””โ”€โ”€ .env                        # API keys (create locally)

Prerequisites

  • Python 3.10+
  • Optional: virtual environment

Install

pip install -r requirements.txt

Configure Environment

Create a .env in the project root (keys are optional; without them, the app uses stubbed generations for local testing):
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
GOOGLE_API_KEY=your_key_here
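
To quickly confirm which keys the app will see, a small check script can help (a sketch; it assumes python-dotenv is installed, which may not be listed in requirements.txt):

# check_env.py: hypothetical helper, not part of the repo
import os
from dotenv import load_dotenv  # python-dotenv; install separately if missing

load_dotenv()  # read .env from the project root into the environment
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"):
    print(f"{key}: {'set' if os.getenv(key) else 'missing (stub mode)'}")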

Initialize Database and Load Tasks

python -c "from app.database import init_db; init_db()"
python -c "from app.database import load_tasks; load_tasks('data/tasks.json')"
This creates data/database.db and loads task definitions.
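
The same two steps can be wrapped in a one-off script if preferred:

# init_all.py: convenience wrapper for the two commands above
from app.database import init_db, load_tasks

init_db()                      # creates data/database.db and the tables
load_tasks("data/tasks.json")  # loads task definitions from data/tasks.json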

Run the App

uvicorn app.main:app --reload
Open http://localhost:8000
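
The server can also be started from a small Python entry point (a sketch; run.py is not part of the repo):

# run.py: programmatic alternative to the uvicorn command above
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="127.0.0.1", port=8000, reload=True)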

Usage Flow

  • Setup: Visit /setup. Paste two or more writing samples (separated by blank lines) and create an experiment.
  • Generate: Visit /generate. Enter the experiment ID and click Start All. If API keys are set, the real SDKs are used; otherwise stubbed content is generated for flow testing.
  • Evaluate: Visit /evaluate. Enter the experiment ID, click Load Next, and submit blind evaluations. The UI hides model, strategy, and task provenance.
  • Results: Visit /results. Enter the experiment ID and load the summary, by-model, by-strategy, and by-task stats.

API Endpoints (summary)

  • Experiments: POST /api/experiments, GET /api/experiments, GET /api/experiments/{id}, PUT /api/experiments/{id}/status
  • Generations: POST /api/generations/start, GET /api/generations/progress/{experiment_id}, POST /api/generations/single, GET /api/generations/{experiment_id}
  • Evaluations: GET /api/evaluations/next/{experiment_id}, POST /api/evaluations?experiment_id=..., GET /api/evaluations/progress/{experiment_id}, GET /api/evaluations/{experiment_id}
  • Analysis: GET /api/analysis/{experiment_id}/summary, /by-model, /by-strategy, /by-task
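
For reference, here is a rough end-to-end client sketch against these endpoints, using the requests library (not necessarily in requirements.txt); the request and response field names (writing_samples, id, experiment_id) are assumptions and should be checked against app/schemas.py:

# client_sketch.py: illustrative only; adjust field names to match app/schemas.py
import requests

BASE = "http://localhost:8000"

# Create an experiment (payload fields are hypothetical)
resp = requests.post(
    f"{BASE}/api/experiments",
    json={"writing_samples": ["First sample...", "Second sample..."]},
)
resp.raise_for_status()
experiment_id = resp.json()["id"]  # assumes the response includes an "id"

# Start generations and poll progress
requests.post(f"{BASE}/api/generations/start", json={"experiment_id": experiment_id})
print(requests.get(f"{BASE}/api/generations/progress/{experiment_id}").json())

# After evaluations are submitted, pull aggregate results
print(requests.get(f"{BASE}/api/analysis/{experiment_id}/summary").json())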

Notes

  • Offline-friendly: llm_clients.py falls back to stubbed responses if SDKs or keys are unavailable (see the sketch after this list).
  • Cost tracking: Estimated via config.PRICING and token usage where available (best-effort).
  • Randomization: Generations randomize task/combination order and include small delays.
  • Not yet implemented: CSV export and Chart.js visualizations (JSON summaries provided). Retry/backoff can be added for API errors.
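
The offline fallback mentioned above typically follows a pattern along these lines (an illustrative sketch only, not the repo's actual llm_clients.py; the model name is a placeholder):

# Sketch of a key-or-stub fallback for a single provider
import os

def generate_text(prompt: str) -> str:
    stub = f"[stub] Generated marketing copy for: {prompt[:60]}..."
    api_key = os.getenv("OPENAI_API_KEY")
    if not api_key:
        return stub  # no key configured: keep the UI flow working offline
    try:
        from openai import OpenAI  # openai>=1.0 client style
        client = OpenAI(api_key=api_key)
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    except Exception:
        return stub  # SDK missing or call failed: fall back to the stub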

Troubleshooting

  • Ensure data/tasks.json exists before loading tasks.
  • If using SQLite, the app creates data/database.db. Verify the data/ folder is writable.
  • If SDK calls fail or no keys are set, stubbed content is returned to keep the flow working.