Shipped v0.3.4 of NightCrew. Session token spend down ~21%. Branch name: feat/caveman-prompts.
The premise: the pipeline is machine-to-machine. Plan output gets piped into the implementation prompt; implementation output gets piped into review. No human reads the prose between phases. Every "I'll now analyze the codebase…" is wasted tokens.
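Concretely, the loop has this shape (a minimal sketch, not the actual NightCrew driver; the prompt helper is hypothetical, and claude -p is Claude Code's non-interactive print mode):
# Sketch only: each phase's stdout is the next phase's input. No human
# reads $plan or $impl between phases, so narration there is pure spend.
prompt() { printf '%s\n\n%s' "$(cat "$1")" "$2"; }   # hypothetical helper
plan=$(claude --model opus -p "$(prompt templates/plan-prompt.md "$task")")
impl=$(claude -p "$(prompt templates/system-prompt.md "$plan")")
review=$(claude -p "$(prompt templates/review-prompt.md "$impl")")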
So we tell Claude to grunt instead. Three layers.
Layer 1: terse output directives
Three template files, one new section each. All in PR #8.
templates/system-prompt.md (implementation):
+## Output Discipline
+- Code changes only. No explanations before or after.
+- Do not restate the plan or narrate your thinking.
+- Log decisions to DECISIONS-${TASK_ID}.md. Do not inline them.
+- When tests pass: "Tests pass." and move on. No elaboration.
+- Fragments OK. Terse status updates OK. Code blocks unchanged.
templates/plan-prompt.md (Opus, planning):
+# Output Discipline
+- Structured output only. Follow the Output Format exactly.
+- No preamble ("I'll now analyze..."). Start with the first section heading.
+- One sentence per bullet. No filler words. No hedging.
+- ASCII diagrams: keep them, they're information-dense.
+- Code examples: keep them, they're precise.
templates/review-prompt.md (review):
+# Output Discipline
+- One-line findings only: [file:line] Problem -> Fix
+- No preamble. No summary at the end.
+- Scope check and plan completion: structured output only.
+- Suppress findings below confidence 5. No "everything looks good" commentary.
The highest-leverage line in the whole PR: suppress findings below confidence 5. Reviewers love to fill space. Without that line you get three paragraphs of "the import order in line 14 is fine but you might consider…" when what you wanted was nothing at all.
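For contrast, the only acceptable positive output under that format is a bare finding line (this one invented for illustration):
[lib/retry.sh:58] Unquoted $url in curl call -> quote it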
Layer 2: section-aware plan compression
The plan that Opus writes has sections like ## NOT In Scope and ## Decisions Log — for the human reviewing the PR in the morning. The implementation agent doesn't need them. Strip before injecting.
lib/prompt-builder.sh:105:
compress_plan_for_impl() {
    local plan_file="$1"
    [[ ! -f "$plan_file" ]] && return

    # Only compress plans that match the expected shape; otherwise pass
    # the full plan through untouched.
    local has_headings=0
    grep -qE '^## (Scope Decision|What Already Exists|Architecture Decisions|Implementation Steps|Test Plan|Failure Modes)' "$plan_file" 2>/dev/null && has_headings=1
    if [[ "$has_headings" -eq 0 ]]; then
        log "Warning: plan has no expected section headings, skipping compression"
        cat "$plan_file"
        return
    fi

    # Any "## " heading clears the flag; allowlisted headings re-arm it,
    # so only those sections (heading line included) are printed. The sed
    # squeezes the runs of blank lines left behind by stripped sections.
    awk '
        /^## /{found=0}
        /^## Scope Decision/{found=1}
        /^## What Already Exists/{found=1}
        /^## Architecture Decisions/{found=1}
        /^## Implementation Steps/{found=1}
        /^## Test Plan/{found=1}
        /^## Failure Mode/{found=1}
        found{print}
    ' "$plan_file" | sed '/^[[:space:]]*$/N;/^\n$/D'
}
Allowlist of sections, an awk filter, and a fallback to the full plan if the headings don't match the expected shape. The full plan stays on disk; only the impl injection is compressed.
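The call site, roughly (the plan path and variable names here are illustrative, not the actual prompt-builder wiring):
# Hypothetical wiring: compress only the injected copy. The plan on
# disk is untouched, so the human still gets the full document.
plan_body="$(compress_plan_for_impl "plans/PLAN-${TASK_ID}.md")"
impl_prompt="$(cat templates/system-prompt.md)

${plan_body}"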
One design call that mattered: keep Failure Modes in the impl prompt. First draft stripped it. Outside review caught that the implementation agent uses the failure mode table as its error-handling spec — removing it would silently degrade error coverage to save tokens. Wrong tradeoff. Reverted.
Layer 3: per-phase tracking + benchmark
Six new fields in state/progress.json:
plan_in_tokens, plan_out_tokens
impl_in_tokens, impl_out_tokens
review_in_tokens, review_out_tokens
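For a feel of the shape (field names are the PR's; the nesting and the in/out split are invented, though the per-phase totals match the demo table below):
# Illustrative fixture only: the real progress.json layout may differ.
cat > state/demo-baseline.json <<'EOF'
{
  "add-user-auth": {
    "plan_in_tokens": 4100,   "plan_out_tokens": 2200,
    "impl_in_tokens": 18600,  "impl_out_tokens": 2600,
    "review_in_tokens": 4300, "review_out_tokens": 800
  }
}
EOF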
And a new subcommand: nightcrew benchmark <session-a> <session-b>. ~100 lines of bash + jq.
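The per-phase math is two jq lookups and a subtraction. A sketch against the fixture above (phase_total and both file paths are hypothetical; the real subcommand loops this over every task and phase):
# Sketch: sum a phase's in+out tokens for one task in one session file.
phase_total() {  # usage: phase_total <file> <task> <phase>
  jq --arg t "$2" --arg p "$3" \
     '.[$t][$p + "_in_tokens"] + .[$t][$p + "_out_tokens"]' "$1"
}
base=$(phase_total state/demo-baseline.json add-user-auth impl)
comp=$(phase_total state/demo-compressed.json add-user-auth impl)
# Integer division; the shipped tool presumably rounds.
printf '%d -> %d (%d%%)\n' "$base" "$comp" $(( (comp - base) * 100 / base ))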
Output, baseline vs compressed:
TOKEN BENCHMARK: demo-baseline vs demo-compressed
═══════════════════════════════════════════════════════════
Task | Phase | Baseline | Compressed | Delta | Savings
─────────────────────────────────────────────────────────────────
add-user-auth | plan | 6300 | 5700 | -600 | -10%
add-user-auth | impl | 21200 | 15500 | -5700 | -27%
add-user-auth | review | 5100 | 4600 | -500 | -10%
add-user-auth | TOTAL | 32600 | 25800 | -6800 | -21%
─────────────────────────────────────────────────────────────────
refactor-api-client | TOTAL | 25700 | 20450 | -5250 | -20%
add-rate-limiter | TOTAL | 22100 | 17650 | -4450 | -20%
─────────────────────────────────────────────────────────────────
SESSION TOTAL | | 80400 | 63900 | -16500 | -21%
ESTIMATED COST | | $2.17 | $1.73 | $-0.44 | -21%
═══════════════════════════════════════════════════════════
Implementation phase consistently does best (-27%) — that's where compression layers stack: shorter plan injection, less narration on top.
Honest caveat: numbers are from a synthetic fixture. Task complexity dominates real-world variance, so a controlled benchmark fixture (frozen tasks, frozen repo state) is on TODOS.md as P3. Treat ~20% as directionally correct, not a contract.
When you control both ends of a model conversation, prompt engineering is just compression.
If you're running NightCrew, the diff is in PR #8 — about 200 lines of bash and one awk filter.