The 64.41 Ceiling: What AlphaEval Actually Measures (and Why Every Agent Eval Hits a Wall) by Khaled Ahmed, PhD
Why production environments shatter lab benchmarks, and how to fix your evals.
Read on Substack

Catching AI Drifts: Program Analysis in Agentic Workflows by Khaled Ahmed, PhD
Using program analysis to verify AI-generated code against OpenSpec requirements
Read on Substack

AI-Driven Development with OpenSpec: A Step-by-Step Walkthrough by Khaled Ahmed, PhD
Building a budget tracker from proposal to archive, one artifact at a time
Read on Substack

Spec-Driven Development: Fixing the AI Coding Pipeline with OpenSpec and Claude Code by Khaled Ahmed, PhD
Stop letting AI guess your architecture. Start orchestrating it with Spec-Driven Development.
Read on Substack

Atomic Claims as an Evaluation Primitive by Khaled Ahmed, PhD
Turning free text into checkable units for LLM evaluation
Read on Substack

Why Holistic LLM Judging Fails by Khaled Ahmed, PhD
Single-pass “LLM-as-a-judge” tends to sample only part of the claim space, overload attention in long contexts, and produce plausible but false critiques.
Read on Substack

Where to Trust LLMs in the Program Analysis Pipeline by Khaled Ahmed, PhD
Reflections from my thesis defense on preserving correctness with program analysis while using models for interpretation.
Read on Substack