Blog

Visit and subscribe to the blog here: Semantics & Systems

Spec Driven Development: Fixing the AI Coding Pipeline with OpenSpec and Claude Code by Khaled Ahmed, PhD

Stop letting AI guess your architecture. Start orchestrating it with Spec-Driven Development

Atomic claims as an evaluation primitive by Khaled Ahmed, PhD

Turning free text into checkable units for LLM evaluation

Why Holistic LLM Judging Fails by Khaled Ahmed, PhD

Single-pass “LLM-as-a-judge” tends to sample the claim space rather than cover it, overloads attention in long contexts, and can produce plausible but false critiques.

Where To Trust LLMs in the Program Analysis Pipeline by Khaled Ahmed, PhD

Reflections from my thesis defense on keeping correctness with analysis and using models for interpretation.

adaptive-testing-tools: a small Python library for Adaptive Random Testing by Khaled Ahmed, PhD

From one-off LLM eval scripts to a reusable ART primitive you can drop into any Python test harness.

Testing Tool-Calling LLMs with Adaptive Random Inputs by Khaled Ahmed, PhD

Measuring Tool Call Accuracy to catch brittle agent behavior before it ships

Evaluating LLM prompts using Adaptive Random Testing by Khaled Ahmed

Quickly finding test inputs that reveal "problems" with the prompts

Adaptive Random Testing Introduction by Khaled Ahmed

A step-by-step guide.

Mutation-Based Fault Localization Introduction by Khaled Ahmed

A step-by-step guide.

Why Trust Matters in AI, and Why We Still Don’t Have It by Khaled Ahmed