https://github.com/PedramNavid/ped-vals-oai.git
A FastAPI web app to systematically evaluate which LLM (OpenAI, Anthropic, Google) and which prompting strategy produces the best marketing content, using blind human evaluations.
LLM SDKs used: `openai`, `anthropic`, `google-generativeai`.

## Project structure

```
llm-content-eval/
├── app/
│   ├── __init__.py
│   ├── main.py                  # FastAPI app entry point
│   ├── models.py                # SQLAlchemy models
│   ├── database.py              # DB engine, session, init/load tasks
│   ├── schemas.py               # Pydantic schemas
│   ├── llm_clients.py           # LLM API wrapper with offline stub
│   ├── generation_service.py    # Generation logic
│   ├── evaluation_service.py    # Blind evaluation logic
│   ├── analysis_service.py      # Aggregations & summaries
│   └── routers/
│       ├── __init__.py
│       ├── experiments.py       # Experiment endpoints
│       ├── generations.py       # Generation endpoints
│       ├── evaluations.py       # Evaluation endpoints
│       └── analysis.py          # Analysis endpoints
├── static/
│   ├── css/style.css
│   └── js/main.js
├── templates/
│   ├── base.html
│   ├── index.html               # Dashboard
│   ├── setup.html               # Experiment setup
│   ├── generate.html            # Run generations
│   ├── evaluate.html            # Blind evaluation
│   └── results.html             # Results & analysis
├── data/
│   ├── tasks.json               # Task definitions
│   └── database.db              # SQLite DB (created at runtime)
├── config.py                    # Models, pricing, DB URL
├── requirements.txt
└── .env                         # API keys (create locally)
```
## Setup

Install dependencies:

```bash
pip install -r requirements.txt
```

Create a `.env` file in the project root (keys are optional; without them, the app uses stubbed generations for local testing):

```
OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here
GOOGLE_API_KEY=your_key_here
```
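For reference, one common pattern for surfacing these keys to the app is sketched below. This is an assumption about how `config.py` might be wired, not the repo's confirmed implementation, and `python-dotenv` may or may not be in `requirements.txt`.

```python
# Sketch only: a typical way config.py could expose the keys
# (an assumption, not the repo's actual code).
import os
from dotenv import load_dotenv  # python-dotenv; an assumed dependency

load_dotenv()  # reads .env from the project root, if present

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")        # None => stubbed generations
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
```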
Initialize the database and load the task definitions:

```bash
python -c "from app.database import init_db; init_db()"
python -c "from app.database import load_tasks; load_tasks('data/tasks.json')"
```

This creates `data/database.db` and loads the task definitions from `data/tasks.json`.
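Roughly what those two helpers are expected to do is sketched below. This is a guess under assumptions (SQLAlchemy `create_all` plus a JSON load), not a copy of `app/database.py`; the `Task` fields shown are hypothetical.

```python
# Hypothetical shape of init_db/load_tasks; see app/database.py for the real code.
import json
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

from app.models import Base, Task  # Task fields below are assumptions

engine = create_engine("sqlite:///data/database.db")  # DB URL lives in config.py
SessionLocal = sessionmaker(bind=engine)

def init_db() -> None:
    # Creates data/database.db and all tables defined on the models
    Base.metadata.create_all(engine)

def load_tasks(path: str) -> None:
    # Read task definitions from JSON and insert them as rows
    with open(path) as f:
        tasks = json.load(f)
    with SessionLocal() as session:
        for t in tasks:
            session.add(Task(name=t["name"], prompt=t["prompt"]))  # assumed fields
        session.commit()
```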
Start the server:

```bash
uvicorn app.main:app --reload
```

Open http://localhost:8000.
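As a quick smoke test once the server is running, you can list experiments over HTTP with only the standard library. The expectation of an empty list on a fresh database is an assumption.

```python
# Smoke test: hit the experiments endpoint on a freshly started server.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/api/experiments") as resp:
    print(json.loads(resp.read()))  # likely [] on a fresh database (assumption)
```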
## Usage

1. **/setup**: Paste 2+ writing samples (separated by blank lines) and create an experiment.
2. **/generate**: Enter the experiment ID and click Start All. If API keys are set, the real SDKs are used; otherwise stubbed content is generated for flow testing.
3. **/evaluate**: Enter the experiment ID, click Load Next, and submit blind evaluations. The UI hides model/strategy/task provenance.
4. **/results**: Enter the experiment ID and load summary, by-model, by-strategy, and by-task stats.

## API Endpoints

- Experiments: `POST /api/experiments`, `GET /api/experiments`, `GET /api/experiments/{id}`, `PUT /api/experiments/{id}/status`
- Generations: `POST /api/generations/start`, `GET /api/generations/progress/{experiment_id}`, `POST /api/generations/single`, `GET /api/generations/{experiment_id}`
- Evaluations: `GET /api/evaluations/next/{experiment_id}`, `POST /api/evaluations?experiment_id=...`, `GET /api/evaluations/progress/{experiment_id}`, `GET /api/evaluations/{experiment_id}`
- Analysis: `GET /api/analysis/{experiment_id}/summary`, `/by-model`, `/by-strategy`, `/by-task`

A hedged request walkthrough using these endpoints appears after the notes below.

## Notes

- `llm_clients.py` falls back to stubbed responses if SDKs/keys are unavailable.
- Costs are estimated from `config.PRICING` and token usage where available (best-effort).
- Make sure `data/tasks.json` exists before loading tasks.
- If the app cannot create `data/database.db`, verify the `data/` folder is writable.
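To make the endpoint list concrete, here is a sketch of driving one experiment end-to-end over HTTP with `requests` (install it separately if it is not in `requirements.txt`). Only the endpoint paths come from this README; every request/response field name below (`name`, `writing_samples`, `id`, `completed`, `total`, `overall_score`, `comments`) is an assumption, so check `app/schemas.py` for the real shapes.

```python
# Hypothetical end-to-end API walkthrough; field names are guesses.
import time
import requests

BASE = "http://localhost:8000"

# 1. Create an experiment (assumed payload: a name plus 2+ writing samples).
exp = requests.post(f"{BASE}/api/experiments", json={
    "name": "demo",
    "writing_samples": ["First sample...", "Second sample..."],  # assumed field
}).json()
exp_id = exp["id"]  # assumed response field

# 2. Kick off all generations and poll progress until done.
requests.post(f"{BASE}/api/generations/start", json={"experiment_id": exp_id})
while True:
    progress = requests.get(f"{BASE}/api/generations/progress/{exp_id}").json()
    print(progress)
    if progress.get("completed") == progress.get("total"):  # assumed fields
        break
    time.sleep(2)

# 3. Fetch the next blind item and submit one evaluation (assumed score fields).
item = requests.get(f"{BASE}/api/evaluations/next/{exp_id}").json()
requests.post(
    f"{BASE}/api/evaluations",
    params={"experiment_id": exp_id},
    json={"generation_id": item["id"], "overall_score": 4, "comments": "solid"},
)

# 4. Pull the aggregated results.
print(requests.get(f"{BASE}/api/analysis/{exp_id}/summary").json())
print(requests.get(f"{BASE}/api/analysis/{exp_id}/by-model").json())
```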
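The note about `llm_clients.py` falling back to stubbed responses describes a common pattern; a minimal sketch of that pattern is below, with hypothetical function names rather than the repo's actual code. The only concrete SDK call shown is the current OpenAI Python SDK's `chat.completions.create`.

```python
# Sketch of the key-or-stub fallback pattern; not the repo's implementation.
import os

def generate(provider: str, model: str, prompt: str) -> str:
    """Return model output, or a deterministic stub when keys/SDKs are missing."""
    if provider == "openai" and os.getenv("OPENAI_API_KEY"):
        try:
            from openai import OpenAI  # real SDK path (openai>=1.0)
            client = OpenAI()
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception:
            pass  # fall through to the stub on any SDK/network failure
    # Offline stub: enough structure to exercise the generation/evaluation flow.
    return f"[stubbed {provider}:{model}] {prompt[:80]}..."
```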