CodexOpt: Optimize AGENTS.md and SKILL.md for Codex with GEPA-Inspired Feedback

Modern coding agents are getting better fast. But for most teams, one problem remains stubbornly manual: the instructions that shape agent behavior. A repo might have an AGENTS.md. It might have a set of SKILL.md files. Over time, those files grow. They collect repeated rules, contradictory guidance, vague workflows, missing verification steps, and formatting drift.
Teams tweak them constantly, but usually without a reliable way to tell whether any given edit actually helped.
At first, everything feels manageable. The agent follows instructions reasonably well, the repo stays organized, and small tweaks seem easy enough to make by hand. Then the instruction files start to grow. A rule gets repeated. A workflow gets added in one place and contradicted in another. A skill becomes too generic. The agent starts skipping tests, getting too verbose, or formatting responses inconsistently. You fix it by editing the prompt files again, but now you are guessing.
That is the problem CodexOpt is built to solve.
CodexOpt is a CLI for benchmarking and optimizing the instruction assets developers already use with Codex-style workflows. Instead of treating AGENTS.md and SKILL.md files like static documentation, it treats them like measurable parts of the system.
Why This Project Exists
Most teams still edit agent instruction files manually. That works for a while, but it breaks down once those files become part of a real engineering workflow. The challenge is not just writing instructions. The challenge is maintaining them over time.
You need a way to answer practical questions: which rules are redundant or contradictory, whether the last edit helped, and which changes are actually worth keeping. CodexOpt is designed around those questions. It gives developers a repeatable loop for discovering instruction assets, scoring them, generating improved candidates, reviewing diffs, and applying only the changes that are actually worth keeping.
Why Focus on Codex, AGENTS.md, and SKILL.md?
The scope is intentionally narrow. CodexOpt does not try to optimize every prompt format or every agent framework. It focuses on the files that are closest to how developers actually work in a repository:
AGENTS.md: the top-level behavioral contract that defines how the agent operates in your repo
SKILL.md files: reusable, task-specific workflows that handle particular capabilities
That narrow focus is a strength. These files are version-controlled, reviewable, repo-local, and already part of the development workflow. They are the right place to start if you want a practical way to improve agent behavior without building an entire prompt platform from scratch.
That Codex itself is open source also matters here: it makes repo-local instruction assets more important, not less. Teams can shape behavior in a way that stays transparent, inspectable, and close to the codebase.
What Inspired CodexOpt
Two ideas influenced the design of CodexOpt. The first is GEPA, which shows how text artifacts can be optimized using reflection and search. The second is the prompt-learning idea described by Arize in Prompt Learning: Using English Feedback to Optimize LLM Systems, which argues that natural-language feedback can be a stronger optimization signal than a single scalar score.
CodexOpt borrows from both ideas, but in a very repo-native way. From GEPA, it takes the idea that prompts and instructions are not fixed text — they are optimizable system components. From prompt learning, it takes the idea that critique matters. Instead of relying only on length checks or numeric scores, CodexOpt tries to capture feedback such as contradiction, redundancy, missing verification guidance, weak trigger clarity, and poor alignment with the actual repo workflow.
It is not just “prompt cleanup.” It is an attempt to turn instruction maintenance into an engineering workflow.
How CodexOpt Works
CodexOpt gives developers a simple command-line flow. You point it at a repository with an AGENTS.md and one or more skills. It can also take optional evidence, such as a tasks.md file or a short list of recurring issues and review themes. Then it runs through a series of steps:
scan: discovers instruction assets in your repository
benchmark: scores them and generates structured feedback
optimize: generates improved candidates
apply: previews or writes changes safely
report: summarizes the latest runs
This makes the workflow measurable. Instead of endlessly editing a prompt and hoping for the best, you can benchmark the current state, inspect the findings, compare candidates, and keep a record of what changed.
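To make the first step concrete, here is a minimal sketch of what the scan stage might do. This is a hypothetical helper, not CodexOpt's actual implementation: it assumes scan amounts to finding the top-level AGENTS.md plus any nested SKILL.md files.

```python
from pathlib import Path

def discover_instruction_assets(repo_root):
    """Sketch of a scan step: find the repo-local instruction files.
    Hypothetical; CodexOpt's real discovery logic may differ."""
    root = Path(repo_root)
    assets = []
    agents = root / "AGENTS.md"
    if agents.is_file():
        assets.append(agents)          # top-level behavioral contract
    assets.extend(sorted(root.rglob("SKILL.md")))  # task-specific workflows
    return assets
```

The point of the sketch is that scan is cheap and deterministic: the assets it finds are exactly the files the later benchmark and optimize stages operate on.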
What It Evaluates
CodexOpt is not just checking whether a file exists or whether a skill has frontmatter. It looks for the kinds of problems developers actually run into when instruction files drift.
AGENTS.md: the top-level behavioral contract
SKILL.md: task-specific workflows
It can also use optional evidence files to make the evaluation more grounded. A task list can help CodexOpt understand what the repo actually expects from the agent. A short issue or review log can tell it which mistakes keep happening. That does not mean CodexOpt is executing full agent simulations yet. Today, those evidence files shape scoring and feedback rather than running end-to-end task execution. But even that shift is valuable, because it makes instruction quality more repo-specific and less generic.
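Because the feedback is natural-language critique rather than a single number, a benchmark result can be modeled as a score plus a list of structured findings. The shape below is an assumed sketch (the category names are taken from this article, but the actual data model is CodexOpt's own):

```python
from dataclasses import dataclass, field

# Assumed critique categories, taken from the article's description:
# contradiction, redundancy, missing-verification, weak-trigger-clarity,
# workflow-misalignment.
@dataclass
class Finding:
    asset: str      # e.g. "AGENTS.md" or a SKILL.md path
    category: str   # one of the critique categories above
    detail: str     # natural-language feedback, the richer signal

@dataclass
class BenchmarkResult:
    score: float                                   # scalar summary
    findings: list = field(default_factory=list)   # the critique itself

def add_finding(result, asset, category, detail):
    result.findings.append(Finding(asset, category, detail))
```

A scalar score tells you whether things got worse; the findings tell you why, which is what makes the optimize step possible.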
Heuristic Mode and GEPA Mode
CodexOpt currently supports two optimization paths.
Heuristic Engine
The default, fast local engine. Handles whitespace cleanup, duplicate-line removal, and skill frontmatter repair. Deterministic, cheap, and easy to understand. The best place to start.
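As a rough illustration of heuristic-style fixes, the snippet below strips trailing whitespace and drops consecutive duplicate lines. This is an assumed simplification of the behavior described above, not CodexOpt's exact rule set:

```python
def heuristic_cleanup(text):
    """Sketch of deterministic cleanup: strip trailing whitespace
    and remove exact consecutive duplicate lines. Assumed behavior,
    not CodexOpt's actual heuristic engine."""
    cleaned = []
    prev = None
    for line in text.splitlines():
        line = line.rstrip()           # whitespace cleanup
        if line and line == prev:
            continue                   # drop an exact repeat of the previous line
        cleaned.append(line)
        prev = line
    return "\n".join(cleaned)
```

Being this simple is the point: the output is fully predictable, so it is safe to run before reaching for anything model-driven.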
GEPA-backed Optimization
The optional advanced mode. Uses reflection and search to explore stronger candidates. Promising for deeper instruction optimization over time.
Trust note: CodexOpt now reports when a GEPA-requested run falls back to heuristic behavior. If a team asks for GEPA, they should know whether they actually got a GEPA-backed result or a safe fallback. That visibility is important in production workflows.
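That kind of visibility can be sketched as a result record that carries both the requested and the actual engine. The function and field names here are hypothetical, used only to show the idea:

```python
def run_optimize(engine="heuristic", gepa_available=False):
    """Sketch of fallback reporting (hypothetical API): when a
    GEPA run cannot execute, record that fact instead of silently
    returning a heuristic result."""
    requested = engine
    if engine == "gepa" and not gepa_available:
        engine = "heuristic"  # safe fallback path
    return {
        "requested_engine": requested,
        "actual_engine": engine,
        "fell_back": requested != engine,
    }
```

A team reviewing a run can then check `fell_back` instead of guessing which engine produced the candidate.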
Who This Is For
CodexOpt is built for developers who are already maintaining repo-local instruction assets and want a better way to improve them.
If you already find yourself editing prompts because the same mistakes keep happening, CodexOpt is aimed at you.
The Demo Repo
To make the project easy to understand, there is a companion demo repo at codexopt-demo. It contains intentionally messy instruction assets, seeded with the kinds of drift described above, along with a task list describing what the agent is expected to do.
It also includes a tiny Python package with bugs aligned to those tasks, so the repo feels like something a developer might actually work on. This matters because it makes the value of CodexOpt concrete. You are not looking at abstract prompts in a vacuum. You are looking at a repo, its instruction files, its recurring problems, and a tool that tries to improve them in a measured way.
What Is Different About CodexOpt
There are plenty of prompt tools and agent frameworks. CodexOpt is different because it stays local to the repository and close to how developers already work. It is not trying to become a hosted prompt management platform. It is not trying to become a full agent execution framework. It is not trying to replace all prompt engineering workflows.
It is trying to do one thing well: help teams improve the instruction assets that shape coding-agent behavior in source control.
That focus makes it easier to adopt, easier to reason about, and easier to integrate into real development workflows.
Why This Matters for Open Source
As coding agents become part of day-to-day development, instruction files become part of the real software surface area. Open source teams need better tooling around them.
CodexOpt is useful precisely because it treats those files seriously. It makes them inspectable, benchmarkable, reviewable, and safe to apply. That is a much better foundation than endless manual prompt edits with no measurement.
For open-source maintainers, this is a practical way to keep instruction quality from becoming invisible technical debt.
Where the Project Goes Next
The current release is a solid foundation, but there is a clear path forward. Over time, CodexOpt can grow into richer scenario-based evaluation, deeper repo-specific scoring, stronger evidence handling, and more capable GEPA-backed optimization.
But the important part is that the core workflow already exists: benchmark, optimize, review, apply. That alone is a meaningful step forward for teams maintaining AGENTS.md and SKILL.md by hand.
Try It
If you are already maintaining instruction files for Codex in a real repository, CodexOpt gives you a better way to do it — not as prompt guesswork, but as an engineering workflow.