<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>AI on Code is cheap, let&#39;s talk</title>
    <link>https://blog.ferstar.org/en/tags/ai/</link>
    <description>Code is cheap, let&#39;s talk</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>Copyright 2026 ferstar</copyright>
    <lastBuildDate>Sat, 09 May 2026 14:38:00 +0800</lastBuildDate>
    <ttl>60</ttl><atom:link href="https://blog.ferstar.org/en/tags/ai/index.xml" rel="self" type="application/rss+xml" /><image>
      <url>https://blog.ferstar.org/site-logo.png</url>
      <title>Code is cheap, let&#39;s talk</title>
      <link>https://blog.ferstar.org/</link>
    </image>
    
    <item>
      <title>Putting Semantic Search into an AI Coding Harness: Notes on Open-Sourcing ace-wrapper</title>
      <link>https://blog.ferstar.org/en/posts/ace-wrapper-semantic-search-ai-coding-harness/</link>
      <pubDate>Sat, 09 May 2026 14:38:00 +0800</pubDate>
      
      <guid isPermaLink="true">https://blog.ferstar.org/en/posts/ace-wrapper-semantic-search-ai-coding-harness/</guid>
      <description>Long AI coding tasks often fail because the agent reads the wrong files; use ace-wrapper to put semantic retrieval into Read -&gt; Search -&gt; Change -&gt; Verify; let agents find candidate files first, then verify evidence to reduce blind edits and wasted context.</description><content:encoded><![CDATA[<blockquote><p>I am not a native English speaker; this article was translated by AI.</p>
</blockquote><p>In the <a href="/en/posts/ai-coding-harness-engineering-workflow/" >previous post</a> about Harness Engineering, I compressed my default AI coding workflow into a few steps:</p>
<ol>
<li>Read</li>
<li>Search</li>
<li>Change</li>
<li>Verify</li>
<li>Record</li>
</ol>
<p>Among these steps, <code>Search</code> is the easiest one to underestimate.</p>
<p>Many agents fail because they read the wrong place first. The user describes a behavior, a bug, or a cross-layer workflow, while the code may not contain a function with the same name. Running <code>rg login</code>, <code>rg upload</code>, or <code>rg session</code> is fast, but it only works when the keyword is already known. If the keyword is unknown, speed just helps the agent drift faster.</p>
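<p>A hypothetical example of that failure mode: the user reports that “login keeps failing,” but the code never uses that word, so the obvious keyword finds nothing:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">rg -n "login" src/         # no matches: this codebase calls it "authenticate"
</span></span><span class="line"><span class="cl">rg -n "authenticate" src/  # the right keyword, but only if you already knew it</span></span></code></pre></div></div>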
<p>So I open-sourced a small layer I have been using recently:</p>
<p><a href="https://github.com/ferstar/ace-wrapper"  target="_blank" rel="noreferrer">ferstar/ace-wrapper</a></p>
<p>It does one narrow thing: wrap Augment Context Engine’s filesystem context search as an <code>ace</code> command, so coding agents can do semantic retrieval from the shell before editing.</p>

<h3 class="relative group">Why This Layer Exists
    <div id="why-this-layer-exists" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#why-this-layer-exists" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>The target is concrete: make the search action part of the harness.</p>
<p>I used to see this path often:</p>
<pre class="not-prose mermaid">
flowchart LR
  A[User describes behavior] --> B[Agent guesses keywords]
  B --> C[Reads nearby files]
  C --> D[Edits plausible code]
  D --> E[Verification fails]
  E --> B
</pre>

<p>The problem with this loop is that, after a failure, the agent often keeps circling the same wrong files. It can already edit code; what it lacks is a better entry point into candidate files.</p>
<p><code>ace-wrapper</code> is meant to patch this part:</p>
<pre class="not-prose mermaid">
flowchart LR
  A[User describes behavior] --> B[ace semantic retrieval]
  B --> C[Candidate files]
  C --> D[Read returned files]
  D --> E[rg / tests confirm evidence]
  E --> F[Small patch]
  F --> G[Verify]
</pre>

<p>The important part is the order: <code>ace</code> only finds candidate files. Conclusions still require reading files, exact search, and tests.</p>

<h3 class="relative group">Usage Is Short
    <div id="usage-is-short" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#usage-is-short" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Install it:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">uv tool install ace-wrapper</span></span></code></pre></div></div>
<p>Install a local development checkout:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">uv tool install /path/to/ace-wrapper</span></span></code></pre></div></div>
<p>Search for a workflow when the exact keyword is unknown:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">timeout 60s ace <span class="s2">"user uploads an unsupported file and should see skipped-file feedback"</span> -w /repo
</span></span><span class="line"><span class="cl">rg -n <span class="s2">"unsupported|skipped|upload|file"</span> /repo</span></span></code></pre></div></div>
<p>The first command answers “which files may be relevant.” The second command confirms “which identifiers, events, copy, or tests actually exist in the code.”</p>
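<p>A sketch of what the confirm step looks like afterwards, assuming ace pointed at a hypothetical src/upload/validate.ts:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sed -n '1,120p' src/upload/validate.ts  # read the candidate before trusting it
</span></span><span class="line"><span class="cl">rg -n "skipped" src/upload/ tests/      # exact search for the copy and its test</span></span></code></pre></div></div>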
<p>I usually put this rule into a project’s <code>AGENTS.md</code>:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Use `timeout 60s ace "<query>" -w <repo-root>` for semantic codebase discovery.
</span></span><span class="line"><span class="cl">Treat `ace` results as candidate files.
</span></span><span class="line"><span class="cl">After it returns results, read the relevant files and use exact search before using them as evidence.</span></span></code></pre></div></div>
<p>These lines work better than “read more context,” because they give the agent a concrete action and a boundary against false conclusions.</p>

<h3 class="relative group">How It Works with rg
    <div id="how-it-works-with-rg" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#how-it-works-with-rg" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p><code>ace</code> and <code>rg</code> work better as consecutive steps.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Scenario</th>
          <th style="text-align: left">Use First</th>
          <th style="text-align: left">Why</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">You know the behavior but not the implementation location</td>
          <td style="text-align: left"><code>ace</code></td>
          <td style="text-align: left">Behavior descriptions can find candidate entry points across files and naming styles</td>
      </tr>
      <tr>
          <td style="text-align: left">You know the function name, event name, or error text</td>
          <td style="text-align: left"><code>rg</code></td>
          <td style="text-align: left">It is exact, complete, and enumerable</td>
      </tr>
      <tr>
          <td style="text-align: left">You need a structural refactor</td>
          <td style="text-align: left"><code>ast-grep</code></td>
          <td style="text-align: left">AST-level matching is needed; textual proximity falls short</td>
      </tr>
      <tr>
          <td style="text-align: left">You need to confirm whether a feature exists</td>
          <td style="text-align: left"><code>ace</code> + read files + <code>rg</code></td>
          <td style="text-align: left">A semantic hit cannot prove the feature exists</td>
      </tr>
  </tbody>
</table>
<p>I intentionally wrote this boundary into the README: ACE returns candidate files, while evidence still has to come from code and tests. That boundary matters.</p>
<p>Semantic retrieval returns “nearby” things. If you ask about a feature that does not exist, it may still find files that look related. If an agent treats “there are results” as “the feature exists,” it starts inventing a story. A conclusion is only defensible after reading an implementation, test, route, config, or call site.</p>
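<p>For the structural-refactor row in the table above, a minimal ast-grep sketch (the pattern and language are placeholders):</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"># match every console.log call at the AST level, regardless of formatting
</span></span><span class="line"><span class="cl">ast-grep --pattern 'console.log($ARG)' --lang ts src/
</span></span><span class="line"><span class="cl"># propose a rewrite to a logger call; ast-grep prints the diff for review
</span></span><span class="line"><span class="cl">ast-grep --pattern 'console.log($ARG)' --rewrite 'logger.debug($ARG)' --lang ts src/</span></span></code></pre></div></div>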

<h3 class="relative group">Where It Fits in Harness Engineering
    <div id="where-it-fits-in-harness-engineering" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#where-it-fits-in-harness-engineering" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p><code>ace-wrapper</code> is small, and I want it to stay that way. It is closer to a small gear in the harness: it turns open-ended code discovery into a repeatable, constrained command.</p>
<p>I now prefer this project rule:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Read -> Search -> Change -> Verify</span></span></code></pre></div></div>
<p>Here, <code>Search</code> means choosing the tool by problem type:</p>
<ul>
<li>Open-ended behavior and cross-layer workflows: use <code>ace</code> first</li>
<li>Exact identifiers, errors, routes, and config keys: use <code>rg</code></li>
<li>Structural replacements: use <code>ast-grep</code></li>
<li>External strategy and industry practice: use web research</li>
<li>Old decisions and repeated lessons: use memory</li>
</ul>
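<p>Applied as commands, the split looks roughly like this (the queries and pattern are hypothetical):</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">timeout 60s ace "how a stop request aborts the generation loop" -w /repo  # open-ended behavior
</span></span><span class="line"><span class="cl">rg -n "stopGeneration|AbortController" /repo/src                          # exact identifiers
</span></span><span class="line"><span class="cl">ast-grep --pattern 'fetch($URL)' --lang ts /repo/src                      # structural match</span></span></code></pre></div></div>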
<p>The useful part of this split is reduced agent randomness. The agent first uses semantic retrieval to narrow the reading surface, then uses deterministic tools to confirm facts, and only then changes code.</p>

<h3 class="relative group">The Prompt Matters Most
    <div id="the-prompt-matters-most" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-prompt-matters-most" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>A good <code>ace</code> query describes behavior and avoids keyword piles:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">timeout 60s ace <span class="s2">"frontend sends requestId to backend and starts a processing job"</span> -w /repo
</span></span><span class="line"><span class="cl">timeout 60s ace <span class="s2">"用户拖入不支持的文件后应该显示跳过文件提示"</span> -w /repo
</span></span><span class="line"><span class="cl">timeout 60s ace <span class="s2">"how provider config is persisted and restored after app restart"</span> -w /repo</span></span></code></pre></div></div>
<p>I try to include four kinds of information:</p>
<ul>
<li>User action: click, drag, upload, stop generation</li>
<li>Runtime boundary: frontend to backend, CLI handler to core service</li>
<li>Expected effect: persist config, abort loop, show skipped-file feedback</li>
<li>Known fields: <code>sessionId</code>, <code>requestId</code>, <code>files</code>, <code>workspace</code></li>
</ul>
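<p>Assembled from those four parts, one query might read (the fields are illustrative):</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">timeout 60s ace <span class="s2">"user clicks stop generation in the frontend, the backend aborts the streaming loop and releases sessionId and requestId"</span> -w /repo</span></span></code></pre></div></div>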
<p>This is much more stable than only searching <code>upload</code> or <code>provider</code>. It lets the retrieval system look for behavior and data flow, and it reminds the agent that this step is still semantic retrieval.</p>

<h3 class="relative group">Why I Open-Sourced It
    <div id="why-i-open-sourced-it" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#why-i-open-sourced-it" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p><code>ace-wrapper</code> has very little code. The core is just <code>FileSystemContext.create(str(workspace))</code> plus <code>context.search(args.query)</code>. I wanted to preserve the workflow constraints around those few lines:</p>
<ol>
<li>If the keyword is unknown, start with semantic retrieval.</li>
<li>Ask one workflow per query.</li>
<li>Treat results as candidate files.</li>
<li>Read the files, then use <code>rg</code> to confirm exact evidence.</li>
<li>Do not conclude without evidence.</li>
</ol>
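<p>Rules 1 and 4 are easy to turn into a shell habit; a sketch with a hypothetical query:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">timeout 60s ace "how provider config is restored on startup" -w "$REPO"  # rule 1: semantic first
</span></span><span class="line"><span class="cl">rg -n "providerConfig|restore" "$REPO/src"                               # rule 4: exact confirmation</span></span></code></pre></div></div>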
<p>Once these rules live in the tool README, the skill definition, and the agent prompt, they are much more likely to stick. Otherwise every session depends on a human reminding the agent again.</p>
<p>The previous post said Harness Engineering means putting an engineering track around AI. <code>ace-wrapper</code> is one small piece of that track: its job is modest, helping the agent read the right place first.</p>
]]></content:encoded>
      
    </item>
    
    <item>
      <title>From Vibe Coding to Harness Engineering: How My AI Coding Workflow Changed</title>
      <link>https://blog.ferstar.org/en/posts/ai-coding-harness-engineering-workflow/</link>
      <pubDate>Sat, 09 May 2026 14:19:00 +0800</pubDate>
      
      <guid isPermaLink="true">https://blog.ferstar.org/en/posts/ai-coding-harness-engineering-workflow/</guid>
      <description>AI coding can generate code but long-running delivery drifts easily; use Harness Engineering to control tasks, context, verification, and recovery; turn AI output into an executable, verifiable, reviewable engineering workflow.</description><content:encoded><![CDATA[<blockquote><p>I am not a native English speaker; this article was translated by AI.</p>
</blockquote><p>This is the written version of an internal team sharing session. The slides are here:</p>
<p><a href="/slides/harness-engineering-ai-coding/" >From Vibe Coding to Harness Engineering</a></p>
<div style="position:relative;width:100%;aspect-ratio:16/9;margin:1.5rem 0 2rem;border:1px solid rgba(127,127,127,.25);overflow:hidden;">
  <iframe src="/slides/harness-engineering-ai-coding/" title="From Vibe Coding to Harness Engineering" style="position:absolute;inset:0;width:100%;height:100%;border:0;" loading="lazy" allowfullscreen></iframe>
</div>
<p>In the previous phase, I cared about one question: can AI take over most coding work?</p>
<p>The answer is now fairly clear. If project context, quality gates, and verification workflows are in place, AI-generated code can enter the engineering workflow reliably. Human time gradually moves from “writing” to “verifying”: requirement breakdown, architecture judgment, context organization, boundary checks, and failure handling.</p>
<p>Recent practice moved one step further. The problem is no longer just “how to write prompts.” The real question is whether the whole workflow can support long-running tasks.</p>

<h3 class="relative group">What Changed
    <div id="what-changed" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#what-changed" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Early Vibe Coding solved the entry problem: explain the requirement clearly, put project rules into <code>AGENTS.md</code> / <code>CLAUDE.md</code>, and let tests, lint, and review catch model output.</p>
<p>That still works, but it is closer to single-task engineering. Once a task runs longer, new problems show up:</p>
<ul>
<li>Context keeps growing until the model loses the important part.</li>
<li>Repeated retries can push the fix further away from the actual problem.</li>
<li>Without external references, strategy turns into guesswork.</li>
<li>After many rounds, it becomes hard to tell which changes should be kept.</li>
<li>User rejection, permission blocks, and empty output need explicit stop semantics.</li>
</ul>
<p>So I now prefer calling this layer <strong>Harness Engineering</strong>. The focus is to put an engineering track around AI so that tasks are executable, results are verifiable, and failures are recoverable.</p>
<pre class="not-prose mermaid">
flowchart LR
  A[Task scope] --> B[Context route]
  B --> C[Agent loop]
  C --> D[Verification gate]
  D --> E[Recovery / memory]
  D -->|failed| F[Patch harness]
  F --> C
</pre>


<h3 class="relative group">The Four Things I Manage First
    <div id="the-four-things-i-manage-first" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#the-four-things-i-manage-first" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>The first thing is task boundaries.</p>
<p>Before a medium-sized task starts, I want at least <code>done when</code>, <code>out of scope</code>, the change surface, and the verification command. This does not need to be a long document. Five lines are often enough. The key is to let the executor know when to stop.</p>
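<p>A minimal sketch of such a spec, for a hypothetical task:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Done when: unsupported uploads show a skipped-file notice, covered by a test
</span></span><span class="line"><span class="cl">Out of scope: upload size limits, retry logic
</span></span><span class="line"><span class="cl">Change surface: upload handler, notification component
</span></span><span class="line"><span class="cl">Verify: pnpm test upload
</span></span><span class="line"><span class="cl">Stop if: the fix requires touching the storage schema</span></span></code></pre></div></div>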
<p>The second thing is context routing.</p>
<p><code>AGENTS.md</code> should not become an encyclopedia. It works better as an index: what the project rules are, where the entry points are, what command verifies the change, what must not be touched, and where the next layer of docs lives. Long context should be read on demand instead of being dumped into the session.</p>
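<p>An index-style AGENTS.md can stay under ten lines; a sketch with hypothetical paths:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Rules: docs/rules.md (naming, error handling)
</span></span><span class="line"><span class="cl">Entry points: src/cli/main.py, src/server/app.py
</span></span><span class="line"><span class="cl">Verify: make lint && make test
</span></span><span class="line"><span class="cl">Do not touch: migrations/, vendored/
</span></span><span class="line"><span class="cl">Deeper docs: docs/architecture.md (read on demand, not by default)</span></span></code></pre></div></div>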
<p>The third thing is the verification loop.</p>
<p>My default order is now:</p>
<ol>
<li>Read: read README, AGENTS, older notes, and key implementation files</li>
<li>Search: use <code>ace</code>, <code>rg</code>, <code>ast-grep</code>, <code>nmem</code>, and Exa to find evidence</li>
<li>Change: apply a small patch and avoid drive-by refactors</li>
<li>Verify: run narrow checks first, then expand by risk</li>
<li>Record: write repeated lessons back into rules, tests, or memory</li>
</ol>
<p>This order looks ordinary, but it prevents many runaway cases. Reading and searching first reduce model guesswork. Verifying narrowly avoids ending up with one large change where nobody can tell which step broke.</p>
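<p>One round of the loop, sketched as commands (the paths and the query are hypothetical):</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">sed -n '1,80p' AGENTS.md                          # Read: the index first
</span></span><span class="line"><span class="cl">timeout 60s ace "how uploads are validated" -w .  # Search: semantic candidates
</span></span><span class="line"><span class="cl">rg -n "validate_upload" src/ tests/               # Search: exact evidence
</span></span><span class="line"><span class="cl">git diff --stat                                   # Change: confirm the patch stays small
</span></span><span class="line"><span class="cl">pytest tests/test_upload.py -q                    # Verify: narrow check before the full suite</span></span></code></pre></div></div>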
<p>The fourth thing is failure handling.</p>
<p>After a failure, I classify it first: stop, retry, patch the harness, or record it.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">When to Use It</th>
          <th style="text-align: left">Handling</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">Stop</td>
          <td style="text-align: left">User rejection, permission block, side effect risk, repeated spinning</td>
          <td style="text-align: left">Break the loop and return control</td>
      </tr>
      <tr>
          <td style="text-align: left">Retry</td>
          <td style="text-align: left">Network jitter, fixable parameter, read failure without side effects</td>
          <td style="text-align: left">Retry in small steps and keep logs</td>
      </tr>
      <tr>
          <td style="text-align: left">Patch</td>
          <td style="text-align: left">Same class of error appears twice</td>
          <td style="text-align: left">Add tests, rules, scripts, or logs</td>
      </tr>
      <tr>
          <td style="text-align: left">Record</td>
          <td style="text-align: left">The case will likely happen again</td>
          <td style="text-align: left">Save trigger conditions, verification commands, and evidence entry points</td>
      </tr>
  </tbody>
</table>
<p>I used to treat many failures as “try again.” Now I am more careful. Retry only the failures that are actually retryable. Stop conditions must stop.</p>
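<p>The stop condition has to be mechanical rather than a feeling. A sketch, with a hypothetical command and error text:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">set -o pipefail
</span></span><span class="line"><span class="cl">for attempt in 1 2 3; do
</span></span><span class="line"><span class="cl">  make test 2>&1 | tee "/tmp/attempt-$attempt.log" && break  # retryable: keep logs, try again
</span></span><span class="line"><span class="cl">  if grep -q "Permission denied" "/tmp/attempt-$attempt.log"; then
</span></span><span class="line"><span class="cl">    echo "stop: permission block, returning control" >&2
</span></span><span class="line"><span class="cl">    exit 1                                                   # stop conditions must stop
</span></span><span class="line"><span class="cl">  fi
</span></span><span class="line"><span class="cl">done</span></span></code></pre></div></div>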

<h3 class="relative group">Where External Research Fits
    <div id="where-external-research-fits" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#where-external-research-fits" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>In this workflow, Exa or similar web search tools have a clearer role.</p>
<p>I usually do not search for broad trends. I search for concrete engineering questions:</p>
<ul>
<li>What timeout should be used?</li>
<li>Should this failure be retried?</li>
<li>How should the default strategy be split?</li>
<li>What boundaries do mainstream tools provide?</li>
<li>What failure samples show up in real issues?</li>
</ul>
<p>I do not copy the external solution directly. External material gives me a reference frame, and the final decision still has to fit the current repo. Useful conclusions should land in specs, project rules, tests, or scripts. Otherwise I will have to search again next time.</p>

<h3 class="relative group">Autoresearch and Ralph Loop
    <div id="autoresearch-and-ralph-loop" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#autoresearch-and-ralph-loop" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>Autoresearch works best for long loops with a clear metric. Give the agent a goal, a guard, and a verification command first. Each round should allow only one rollback-friendly change.</p>
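<p>The track can be written down before the loop starts; a sketch with a hypothetical goal:</p>
<div class="highlight-wrapper"><div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">Goal: p95 request latency under 200 ms
</span></span><span class="line"><span class="cl">Guard: one file per round, no schema changes
</span></span><span class="line"><span class="cl">Verify: make bench && make test
</span></span><span class="line"><span class="cl">Keep rule: keep a round only if the metric improves and tests stay green
</span></span><span class="line"><span class="cl">Rollback: each round is one commit, reverted as a unit</span></span></code></pre></div></div>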
<p>I currently treat Ralph Loop as persistent single-owner execution. The same owner keeps driving the work. PRD and test spec come first, then the agent runs the long task. The point is to preserve context, judgment, and verification clues during long-running work before bringing in more agents.</p>
<p>Both patterns share the same idea: define the track before letting the agent run. The track needs metrics, boundaries, verification, and keep/discard rules.</p>

<h3 class="relative group">Three Steps Worth Copying First
    <div id="three-steps-worth-copying-first" class="anchor"></div>
    
    <span
        class="absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none">
        <a class="text-primary-300 dark:text-neutral-700 !no-underline" href="#three-steps-worth-copying-first" aria-label="Anchor">#</a>
    </span>
    
</h3>
<p>To move this into a team workflow, I would not start with platform work. Three steps are enough, and they can be copied starting tomorrow:</p>
<ol>
<li>Write <code>done when</code> and <code>out of scope</code> for every medium-sized task.</li>
<li>Ask the agent to list files, evidence, and the change surface before allowing edits.</li>
<li>After one failure, patch tests, rules, or scripts before letting the agent continue.</li>
</ol>
<p>Once these three steps are in place, AI coding moves a bit from “it can produce output” toward “it can be shipped.” Autoresearch, Ralph Loop, team workers, and memory become much easier to reason about after that.</p>
]]></content:encoded>
      
    </item>
    
  </channel>
</rss>
