CodexOpt: Optimize AGENTS.md and SKILL.md for Codex with GEPA-Inspired Feedback

Modern coding agents are getting better fast. But for most teams, one problem remains stubbornly manual: the instructions that shape agent behavior. A repo might have an AGENTS.md. It might have a set of SKILL.md files. Over time, those files grow. They collect repeated rules, contradictory guidance, vague workflows, missing verification steps, and formatting drift.
Teams tweak them constantly, but usually without a reliable way to tell whether any given edit actually helped.
At first, everything feels manageable. The agent follows instructions reasonably well, the repo stays organized, and small tweaks seem easy enough to make by hand. Then the instruction files start to grow. A rule gets repeated. A workflow gets added in one place and contradicted in another. A skill becomes too generic. The agent starts skipping tests, getting too verbose, or formatting responses inconsistently. You fix it by editing the prompt files again, but now you are guessing.
That is the problem CodexOpt is built to solve.
CodexOpt is a CLI for benchmarking and optimizing the instruction assets developers already use with Codex-style workflows. Instead of treating AGENTS.md and SKILL.md files like static documentation, it treats them like measurable parts of the system.
Why This Project Exists
Most teams still edit agent instruction files manually. That works for a while, but it breaks down once those files become part of a real engineering workflow. The challenge is not just writing instructions. The challenge is maintaining them over time.
You need a way to answer practical questions: which rules are redundant or contradictory, whether the last edit helped, and which changes are actually worth keeping. CodexOpt is designed around those questions. It gives developers a repeatable loop for discovering instruction assets, scoring them, generating improved candidates, reviewing diffs, and applying only the changes that are actually worth keeping.
Why Focus on Codex, AGENTS.md, and SKILL.md?
The scope is intentionally narrow. CodexOpt does not try to optimize every prompt format or every agent framework. It focuses on the files that are closest to how developers actually work in a repository:
AGENTS.md: the top-level behavioral contract that defines how the agent operates in your repo
SKILL.md files: reusable, task-specific workflows that handle particular capabilities
That narrow focus is a strength. These files are version-controlled, reviewable, repo-local, and already part of the development workflow. They are the right place to start if you want a practical way to improve agent behavior without building an entire prompt platform from scratch.
That Codex itself is open source also matters here: it makes repo-local instruction assets more important, not less. Teams can shape behavior in a way that stays transparent, inspectable, and close to the codebase.
What Inspired CodexOpt
Two ideas influenced the design of CodexOpt. The first is GEPA, which shows how text artifacts can be optimized using reflection and search. The second is the prompt-learning idea described by Arize in Prompt Learning: Using English Feedback to Optimize LLM Systems, which argues that natural-language feedback can be a stronger optimization signal than a single scalar score.
CodexOpt borrows from both ideas, but in a very repo-native way. From GEPA, it takes the idea that prompts and instructions are not fixed text — they are optimizable system components. From prompt learning, it takes the idea that critique matters. Instead of relying only on length checks or numeric scores, CodexOpt tries to capture feedback such as contradiction, redundancy, missing verification guidance, weak trigger clarity, and poor alignment with the actual repo workflow.
It is not just “prompt cleanup.” It is an attempt to turn instruction maintenance into an engineering workflow.
How CodexOpt Works
CodexOpt gives developers a simple command-line flow. You point it at a repository with an AGENTS.md and one or more skills. It can also take optional evidence, such as a tasks.md file or a short list of recurring issues and review themes. Then it runs through a series of steps:
scan: discovers instruction assets in your repository
benchmark: scores them and generates structured feedback
optimize: generates improved candidates
apply: previews or writes changes safely
report: summarizes the latest runs
This makes the workflow measurable. Instead of endlessly editing a prompt and hoping for the best, you can benchmark the current state, inspect the findings, compare candidates, and keep a record of what changed.
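To make the first step concrete, here is a minimal sketch of what the scan stage might do. This is a hypothetical helper, not CodexOpt's actual implementation: it assumes scan amounts to finding the top-level AGENTS.md plus any nested SKILL.md files.

```python
from pathlib import Path

def discover_instruction_assets(repo_root):
    """Sketch of a scan step: find the repo-local instruction files.
    Hypothetical; CodexOpt's real discovery logic may differ."""
    root = Path(repo_root)
    assets = []
    agents = root / "AGENTS.md"
    if agents.is_file():
        assets.append(agents)          # top-level behavioral contract
    assets.extend(sorted(root.rglob("SKILL.md")))  # task-specific workflows
    return assets
```

The point of the sketch is that scan is cheap and deterministic: the assets it finds are exactly the files the later benchmark and optimize stages operate on.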
What It Evaluates
CodexOpt is not just checking whether a file exists or whether a skill has frontmatter. It looks for the kinds of problems developers actually run into when instruction files drift.
AGENTS.md: the top-level behavioral contract
SKILL.md: task-specific workflows
It can also use optional evidence files to make the evaluation more grounded. A task list can help CodexOpt understand what the repo actually expects from the agent. A short issue or review log can tell it which mistakes keep happening. That does not mean CodexOpt is executing full agent simulations yet. Today, those evidence files shape scoring and feedback rather than running end-to-end task execution. But even that shift is valuable, because it makes instruction quality more repo-specific and less generic.
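Because the feedback is natural-language critique rather than a single number, a benchmark result can be modeled as a score plus a list of structured findings. The shape below is an assumed sketch (the category names are taken from this article, but the actual data model is CodexOpt's own):

```python
from dataclasses import dataclass, field

# Assumed critique categories, taken from the article's description:
# contradiction, redundancy, missing-verification, weak-trigger-clarity,
# workflow-misalignment.
@dataclass
class Finding:
    asset: str      # e.g. "AGENTS.md" or a SKILL.md path
    category: str   # one of the critique categories above
    detail: str     # natural-language feedback, the richer signal

@dataclass
class BenchmarkResult:
    score: float                                   # scalar summary
    findings: list = field(default_factory=list)   # the critique itself

def add_finding(result, asset, category, detail):
    result.findings.append(Finding(asset, category, detail))
```

A scalar score tells you whether things got worse; the findings tell you why, which is what makes the optimize step possible.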
Heuristic Mode and GEPA Mode
CodexOpt currently supports two optimization paths.
Heuristic Engine
The default, fast local engine. Handles whitespace cleanup, duplicate-line removal, and skill frontmatter repair. Deterministic, cheap, and easy to understand. The best place to start.
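As a rough illustration of heuristic-style fixes, the snippet below strips trailing whitespace and drops consecutive duplicate lines. This is an assumed simplification of the behavior described above, not CodexOpt's exact rule set:

```python
def heuristic_cleanup(text):
    """Sketch of deterministic cleanup: strip trailing whitespace
    and remove exact consecutive duplicate lines. Assumed behavior,
    not CodexOpt's actual heuristic engine."""
    cleaned = []
    prev = None
    for line in text.splitlines():
        line = line.rstrip()           # whitespace cleanup
        if line and line == prev:
            continue                   # drop an exact repeat of the previous line
        cleaned.append(line)
        prev = line
    return "\n".join(cleaned)
```

Being this simple is the point: the output is fully predictable, so it is safe to run before reaching for anything model-driven.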
GEPA-backed Optimization
The optional advanced mode. Uses reflection and search to explore stronger candidates. Promising for deeper instruction optimization over time.
Trust note: CodexOpt now reports when a GEPA-requested run falls back to heuristic behavior. If a team asks for GEPA, they should know whether they actually got a GEPA-backed result or a safe fallback. That visibility is important in production workflows.
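That kind of visibility can be sketched as a result record that carries both the requested and the actual engine. The function and field names here are hypothetical, used only to show the idea:

```python
def run_optimize(engine="heuristic", gepa_available=False):
    """Sketch of fallback reporting (hypothetical API): when a
    GEPA run cannot execute, record that fact instead of silently
    returning a heuristic result."""
    requested = engine
    if engine == "gepa" and not gepa_available:
        engine = "heuristic"  # safe fallback path
    return {
        "requested_engine": requested,
        "actual_engine": engine,
        "fell_back": requested != engine,
    }
```

A team reviewing a run can then check `fell_back` instead of guessing which engine produced the candidate.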
Who This Is For
CodexOpt is built for developers who are already maintaining repo-local instruction assets and want a better way to improve them.
If you already find yourself editing prompts because the same mistakes keep happening, CodexOpt is aimed at you.
The Demo Repo
To make the project easy to understand, there is a companion demo repo at codexopt-demo. It contains intentionally messy instruction assets, seeded with the kinds of drift described above, along with a task list describing what the agent is expected to do.
It also includes a tiny Python package with bugs aligned to those tasks, so the repo feels like something a developer might actually work on. This matters because it makes the value of CodexOpt concrete. You are not looking at abstract prompts in a vacuum. You are looking at a repo, its instruction files, its recurring problems, and a tool that tries to improve them in a measured way.
What Is Different About CodexOpt
There are plenty of prompt tools and agent frameworks. CodexOpt is different because it stays local to the repository and close to how developers already work. It is not trying to become a hosted prompt management platform. It is not trying to become a full agent execution framework. It is not trying to replace all prompt engineering workflows.
It is trying to do one thing well: help teams improve the instruction assets that shape coding-agent behavior in source control.
That focus makes it easier to adopt, easier to reason about, and easier to integrate into real development workflows.
Why This Matters for Open Source
As coding agents become part of day-to-day development, instruction files become part of the real software surface area. Open source teams need better tooling around them.
CodexOpt is useful precisely because it treats those files seriously. It makes them inspectable, benchmarkable, reviewable, and safe to apply. That is a much better foundation than endless manual prompt edits with no measurement.
For open-source maintainers, this is a practical way to keep instruction quality from becoming invisible technical debt.
Where the Project Goes Next
The current release is a solid foundation, but there is a clear path forward. Over time, CodexOpt can grow into richer scenario-based evaluation, deeper repo-specific scoring, stronger evidence handling, and more capable GEPA-backed optimization.
But the important part is that the core workflow already exists: benchmark, optimize, review, apply. That alone is a meaningful step forward for teams maintaining AGENTS.md and SKILL.md by hand.
Try It
If you are already maintaining instruction files for Codex in a real repository, CodexOpt gives you a better way to do it — not as prompt guesswork, but as an engineering workflow.