The 64.41 Ceiling: What AlphaEval Actually Measures (and Why Every Agent Eval Hits a Wall) by Khaled Ahmed, PhD
Why production environments shatter lab benchmarks, and how to fix your evals.
Read on Substack

Catching AI Drifts: Program Analysis in Agentic Workflows by Khaled Ahmed, PhD
Using program analysis to verify AI-generated code against OpenSpec requirements
Read on Substack

AI-Driven Development with OpenSpec: A Step-by-Step Walkthrough by Khaled Ahmed, PhD
Building a budget tracker from proposal to archive, one artifact at a time
Read on Substack

Spec-Driven Development: Fixing the AI Coding Pipeline with OpenSpec and Claude Code by Khaled Ahmed, PhD
Stop letting AI guess your architecture. Start orchestrating it with Spec-Driven Development.
Read on Substack

Atomic Claims as an Evaluation Primitive by Khaled Ahmed, PhD
Turning free text into checkable units for LLM evaluation
Read on Substack

Why Holistic LLM Judging Fails by Khaled Ahmed, PhD
Single-pass “LLM-as-a-judge” tends to sample only part of the claim space, overload attention in long contexts, and produce plausible but false critiques.
Read on Substack

Where to Trust LLMs in the Program Analysis Pipeline by Khaled Ahmed, PhD
Reflections from my thesis defense on preserving correctness with program analysis while using models for interpretation.
Read on Substack