AI Strategy

Eating our own dogfood: how the 4D framework built the tool that built this post

We wrote a whitepaper about AI’s polish bias. Then we got caught by it. Here’s the recursive proof: we used the framework to build the blog engine that wrote the post about the framework.

Dorian Cougias May 26, 2026

At a Glance

MoxyWolf wrote a whitepaper about AI’s polish bias – the measurable drop in human critical engagement that happens when AI produces a polished artifact. Two months later, we got caught by the exact failure mode the whitepaper named: 500 emails out, zero replies, every email grammatical and on-brand and forgettable. The fix wasn’t a better prompt. The fix was the 4D framework: Delegation, Description, Discernment, Diligence – four phases with engineered gates between them. Then we built the blog engine that operationalizes that framework. We used the framework to build the engine. The engine wrote the post about the framework. This essay is the recursive proof.

We got caught by the failure mode we wrote the whitepaper about

This month I dug into our outbound email campaign with our marketing director.

500 emails went out. We got zero replies.

The AI had done what it was supposed to. It targeted the right ICP. It surfaced relevant context about each prospect – the recent funding round, the tech stack they used, the role they were hiring for. Every email was grammatical. Every email was on-brand. Every email read like it could have been sent to anyone.

That’s the polish bias. The model produced an output that looked finished, and the looking-finished disarmed the check that would have caught the actual problem.

We wrote a whitepaper about this trap. We named it. We diagrammed the four phases of the framework that fixes it. We sent that whitepaper to clients, prospects, partners, our own team. And we didn’t catch the polish bias in our own send folder for two months.

That’s how good the failure mode is at hiding. The output reads competently. There’s no obvious tell. If you open five of the 500 emails and read them carefully, you find nothing wrong. The mistake isn’t visible at the message level. The mistake is visible only at the campaign level. Zero replies on 500 sends.

When I sat with our marketing director and walked through what had happened, the diagnosis fell out in one sentence. The AI was clearly targeting the right ICP. It was surfacing the right context. But the emails weren’t emotionally wed to the target audience. They were polished, yes – but a polished turd is still a turd.

That phrase is the earned secret of this whole project. It’s the thing we know from direct experience that the conventional discourse about AI marketing still treats as a prompting problem. Better prompts don’t fix this. Better models don’t fix this. The reason is structural. AI is good at the surface layer of polish – grammar, ICP-targeting, on-brand register, prospect-specific opening lines, clean CTAs. AI is not good at the substance layer – whether the first three sentences sound like the inside of the reader’s own head, or like a description of the demographic the reader belongs to.

The surface layer is what the model can do. The substance layer is the part the model can’t fake. And when the surface is clean enough, the human checking the output stops looking for the substance.

That’s the polish bias, named in lived experience, not in a paper.

What we learned about building in the 4Ds

The whitepaper we’d written – Beyond the Prompt – proposed a four-phase discipline for AI-augmented knowledge work. Delegation, Description, Discernment, Diligence. The 4Ds. The argument was that the gap between AI that polishes turds and AI that produces shippable work isn’t a prompting gap. It’s a process gap.

Delegation asks one question: is this the right work to hand to AI in the first place, and what part of it? Not every task warrants AI involvement. Not every task warrants automation as the modality. The whitepaper distinguishes three modalities – automation (the model drafts, the human reviews), augmentation (the model and the human iterate sentence-by-sentence), agency (the model runs end-to-end but holds at a gate). Delegation is where you pick which one, and why. Delegation is also where you pick the angle, name the audience, and surface the earned secret. The earned-secret stall is the most important mechanism in the phase. If the author can’t name what they know from direct experience that most people in the audience don’t, the process deliberately stops. No earned secret, no post. We borrowed that mechanism from a Claude-skill repo I’ll come back to later, but the principle came out of the email campaign: the AI couldn’t surface our 500/0 story because the AI didn’t know it. The author has to bring the earned secret. The framework just refuses to proceed without it.

Description asks: have we told the AI the goal and the constraints precisely enough that it can behave usefully? This is where the eight-question voice interview lives. Trigger, Evidence, Contrarian Take, Authority, Specific Reader, Business Connection, Call to Action, Emotional Core. One question per message. The phase pushes back on vague answers. “Marketing professionals” is not a reader. “Sarah, VP Marketing at a Series-B SaaS company who just got told by her CEO to use AI more” – that’s a reader. Description is also where you pick the narrative structure (default: Sorkin’s Desire-Obstacle-Battle arc), draft the outline with 60-70% question-phrased H2s, write the At-a-Glance block, and pre-load the anti-AI-slop pattern catalog so the model knows what tics to avoid before any prose gets generated.

Discernment asks: did the draft survive a real check, including a 30-day reality check from the world outside our heads? This is the discourse-sweep phase. Platform-targeted queries run against Reddit, X, Hacker News, dev.to, Substack, GitHub, LinkedIn long-form posts, Apple Podcasts via Apify, and scholarly sources via OpenAlex, Semantic Scholar, and arXiv. Each finding is tagged with a verification status. Then a Council deliberation across multiple models synthesizes the raw harvest into themes. Then a bibliography gets built with AI-generated abstracts, and every citation gets verified against the actual source. Then the draft gets written. Then it runs through a two-tier anti-AI-slop pass. First, a deterministic linter catches em-dashes, banned phrases, and structural metrics (burstiness, type-token ratio, paragraph-shape standard deviation). Second, an LLM-based structural scan catches what the linter misses (question-H2 saturation, three-clause-sentence frequency, hedge stacking, symmetric-list bloat). The rewrite gets a second-pass audit because single-pass de-slop misses survivors.

Diligence asks: will a named human put their signature on this before it ships? This is the Release Owner Gate. Five stages: capability (every claim has a source, every URL resolves, no fabricated data in the body), format (frontmatter validates, typographic rules pass), visual (hero image generated and approved), content review (a nonce-bound BLOCKING reviewer scores the post against a 100-point rubric), asset integrity (all referenced files exist on disk, slug ties the post and the hero and the bibliography together). The reviewer must echo a CSPRNG nonce that the gate wrote to disk before the review started. If the nonce doesn’t match, the review is rejected as provenance-failed. If the score is below 90, the gate is BLOCKING and no LinkedIn output gets derived. If the gate passes, a named human signs the changelog. The whitepaper canonical form is Verified — <initials>, <YYYY-MM-DD>. The plugin never auto-signs.

That’s the framework. Four phases. Engineered gates between them. Each phase produces a named artifact. The next phase refuses to run if the prior artifact hasn’t passed. The architecture is the discipline.

What we learned about building in the 4Ds – the part that’s hard to articulate until you’ve worked inside them – is that the gates do most of the work. The framework isn’t four phases of “be more careful.” The framework is four phases of “the next step refuses to proceed until the current step has produced a specific artifact that has passed a specific check.” The gate is the thing. The phases are just where the gates live.

We didn’t want to build another turd polisher

Here’s where the recursion starts.

We had the whitepaper. We had the 4D framework. We had a clear operational claim: the Release Owner Gate is the Monday-installable mechanism that turns AI-augmented writing from a turd polishing pipeline into a discipline. The claim was now load-bearing. We were going to operate by it.

The next question was: how does the team actually do this, day to day, on the blog posts and LinkedIn content we ship?

The lazy answer was to write a prompt template and call it a methodology. We’ve all seen what that produces. A markdown file with “you are an expert content writer” at the top, a checklist of “make it engaging and concise,” and a vague instruction to “use sources.” Three weeks later, the team is back to polishing turds, because the prompt template hasn’t done anything except put nicer clothes on the same broken process.

The right answer was to build the framework into actual software. A plugin. A pipeline. Phase boundaries enforced by code. Gates that won’t pass unless their specific check actually passes. State files that the next phase reads before it’ll do anything. A reviewer subagent that has to echo a nonce or its work gets rejected. A BLOCKING verdict that means BLOCKING.

But here was the trap, the one that made me hesitate for a day before starting: building a blog engine could very easily become the most polished turd we’d ever shipped. A nicely architected pipeline with phase artifacts and gates, producing content that’s grammatically clean and on-brand and forgettable. The architecture would polish the surface harder, and the substance would still be missing, because nothing in the architecture forces the author to bring the substance.

So the question shifted. How do you build a tool that’s intrinsically incapable of polishing turds? How do you build a process that refuses to ship surface-only output?

The answer that emerged is the architecture itself. The earned-secret stall in Phase 1. The voice interview in Phase 2. The verification tags in Phase 3. The named human signature in Phase 4. None of these are anti-slop linters. They’re substance gates. Each one is a structural requirement that the human has to satisfy with lived material, not with the model’s training data. The Phase 1 earned secret can’t come from the model – it has to come from the author. The Phase 2 voice interview answers can’t come from the model – they have to come from the author. The Phase 3 citations can’t be fabricated – they have to resolve to real URLs that the gate fetches and verifies. The Phase 4 signature can’t be auto-applied – it has to be entered by a named human.

The framework defeats turd polishing because it makes turd polishing structurally impossible to complete. You can’t get through Phase 1 without a real story. You can’t get through Phase 2 without a real voice profile. You can’t get through Phase 3 without real sources. You can’t get through Phase 4 without a real human signing. At every phase, the work the AI can do alone is not sufficient to advance.

That’s the design claim. The next question was whether we could build it and prove it.

The Workforce Automation foundation

We didn’t start this project from zero. We started it from a system we’d been building for a year called Workforce Automation.

Workforce Automation is the analytical engine underneath everything MoxyWolf does with AI tools. The one-sentence version: it maps every Claude skill, plugin, MCP, and command to the human work it automates, joined through O*Net occupations and a curated capability lexicon. The headline analytical product is a materialized view called mv_occupation_automation_surface. For any O*Net SOC code, it returns the count of Detailed Work Activities (DWAs) covered and the tools delivering coverage at each automation level: replaces, augments, supports, informs.

The four-layer chain is tools → capabilities → DWAs → occupations. The occupations layer is O*Net 30.2: 1,016 occupations, 41 Generalized Work Activities, 2,087 DWAs, 18,796 tasks, 74,435 tools and technology entries. The DWAs layer is the level of granularity at which “what people do at work” is described in the federal labor taxonomy – granular enough to map to capabilities, structured enough to query across SOCs. The capabilities layer is a curated set of verb-object pairs (currently 40 active, expanding) that name what an AI tool can do, governed by orthogonality rules: describe what the tool does, not how. The tools layer is the catalog: 118,703 tools across eight sources at last count, 87.7% with descriptions, refreshed daily.

The pilot occupation we anchored everything to is SOC 15-1212.00, Information Security Analysts. Pick that SOC and the view returns: 10 of 10 DWAs covered, 116 tools across the four automation levels (6 replaces, 58 augments, 36 supports, 16 informs). That number – 116 tools producing measurable coverage for a single occupation – is the empirical product of about six months of methodology development.

Here’s the methodology, because it’s where the recursive pattern with the blog engine really starts to surface.

To populate tool_capability_map – the edge that turns the catalog into a capability mapping – we considered three approaches. Approach A was manual editorial: read each capability, scan tools by category, hand-pick fits. Honest but slow. Approach B was alias-driven string matching against capability_aliases. Fast but recall-bound. Approach C was LLM-assisted draft plus human review. Velocity plus a hard gate.

We picked C. But the eval methodology is what mattered.

The naive version of “LLM-assisted draft” is: pick a model, run it across all the edges, ship the output. That version is exactly the kind of thing that produces polished turds at scale. The version we shipped had five stages.

Stage 1 is a lexical pre-filter using pg_trgm similarity to build a candidate slate. 40 capabilities times 20,867 tools is 834,680 edges. About 99% noise. The pre-filter narrows it to roughly 6,000 candidate edges across 40 capabilities – manageable for LLM scoring.
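
The shape of that pre-filter can be sketched in a few lines of Python. This is a rough approximation of pg_trgm-style trigram similarity (lowercase, pad each word, compare trigram sets), not the PostgreSQL implementation we ran. The threshold, the function names, and the sample tool descriptions are hypothetical.

```python
import re

def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm: lowercase, split into words, pad each word
    with two leading spaces and one trailing space, take 3-char windows."""
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by total distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def candidate_slate(capability: str, tools: dict[str, str],
                    threshold: float = 0.3) -> list[str]:
    """Keep only tools whose description clears the (hypothetical) threshold."""
    return [name for name, desc in tools.items()
            if similarity(capability, desc) >= threshold]

# The full cross product the pre-filter is cutting down.
FULL_CROSS_PRODUCT = 40 * 20_867  # 834,680 candidate edges

# Toy data, invented for illustration only.
toy_tools = {
    "log-scout": "analyze server logs for anomalies",
    "img-resize": "resize images in bulk",
}
slate = candidate_slate("analyze logs", toy_tools)
```

The point of the design is that the cheap lexical pass throws away the obvious noise so the expensive LLM scoring only sees plausible edges.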

Stage E1 is a Council deliberation on a gold-6 sample. Six capabilities chosen to span all three pilot SOCs and all four automation levels. Five models vote independently: Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Flash, Grok 4.3, Mistral Large. Where four of five agree on both delivers=true/false and the specific automation level, that’s a consensus row in the gold standard. Where the vote splits, that’s a dissent row.

Stage E2 is an Opus arbiter on the dissent set. Opus alone reads each dissented edge and casts a deciding vote. The arbitrated dissent rows fold back into the gold standard. The combined gold gives us 141 positives and 758 negatives across the 6 capabilities.

Stage E3 is the eval gate. Run the candidate production model – Sonnet 4.6 in the original plan – against the same candidate slates the Council scored. Measure precision and recall against the Council gold. The gate was set at ≥ 80% precision and ≥ 70% recall on the gold-6 before scaling to all 40.
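
The gate itself is simple arithmetic. Here is a minimal sketch, assuming gold and predicted positives are sets of (capability, tool) edges. The edge names below are toy data, and only the two thresholds come from our actual gate.

```python
Edge = tuple[str, str]  # (capability, tool)

def precision_recall(gold: set[Edge], predicted: set[Edge]) -> tuple[float, float]:
    """Precision and recall of predicted positive edges against the gold set."""
    true_pos = len(gold & predicted)
    false_pos = len(predicted - gold)
    false_neg = len(gold - predicted)
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    return precision, recall

def passes_gate(precision: float, recall: float,
                min_precision: float = 0.80, min_recall: float = 0.70) -> bool:
    """Both floors must hold before the model scales to all capabilities."""
    return precision >= min_precision and recall >= min_recall

# Toy example: two hits, one false positive, one miss.
gold = {("cap", "t1"), ("cap", "t2"), ("cap", "t3")}
predicted = {("cap", "t1"), ("cap", "t2"), ("cap", "t4")}
p, r = precision_recall(gold, predicted)
```

Because both floors are required, a precision of 0.788 fails the gate no matter what the recall is.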

Sonnet failed the gate. On the first run, precision was 0.788, one and a fraction percentage points below the floor. On the second run against the expanded arbitrated gold, precision dropped to 0.709. We tried tightening the prompt to push precision up. Tightening made precision worse: 0.677, with level-agreement collapsing from 72% to 51% as Sonnet shifted supports calls to augments rather than to false.

The insight that came out of that result is one I’ve reached for a dozen times since. When an LLM hits its precision ceiling on a task, switching prompts doesn’t help. Switching models does. Sonnet had a ceiling around 0.79 precision on this task. No prompt was going to push it higher. The model didn’t have the discrimination needed. We switched production to Opus 4.6 – the same model that had arbitrated the dissent – and the precision problem resolved itself. The cost difference was real (about $180 for the full production run versus about $32 for Sonnet) but the cost of bad data is harder to remediate than the cost of good inference.

The five-stage methodology – lexical pre-filter, Council on gold, arbiter on dissent, eval gate, production model picked by measured performance – became one of MoxyWolf’s load-bearing operating norms. We call it AI-for-bulk, human-for-dissent. The pattern: where signal is strong, trust the AI. Where the signal is contested, force a human (or a more capable model acting as arbiter) to adjudicate.

The other pattern that surfaced is what we now call AI-for-bulk-at-every-layer. The naive version of AI-augmented work uses AI for the first pass and humans for everything after. The mature version uses AI at every layer where the bulk-vs-dissent test applies. We use AI to draft the gold standard via Council. We use AI to arbitrate the dissent via Opus. We use AI to verify citations via parallel Haiku subagents. We use AI to lint prose via deterministic scripts. We use humans where the work is genuinely contested – the load-bearing claim, the angle, the named signature.

That’s the methodology Workforce Automation refined. It’s the methodology we brought to the blog engine.

Then we scanned all of the AI tools

The Workforce Automation catalog isn’t only useful for the headline analytical product. It’s also a corpus. Every plugin in the catalog is a sample of what someone, somewhere, is doing with Claude or AI tooling. If you filter the catalog to a specific domain – say, plugins that build blog posts or LinkedIn content – you have a research dataset of how the field is currently approaching that problem.

So before we wrote a line of code for the blog engine, we did the scan.

The plugin discovery pipeline that feeds the catalog uses awesome-claude-plugins as a nightly discovery feed, ingests 17,826 repos, walks each one’s GitHub tree for SKILL.md and .claude-plugin/marketplace.json manifests, parses out the individual plugins, and writes the results to a discovered_repos queue table. Repos at or above a tunable plugins-count threshold get promoted to first-class sources in the catalog. The rest get walked incrementally and either yield plugin manifests or get marked walked_empty and not re-walked until the repo changes. The framework noise self-filters.
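
The walk-and-classify step looks roughly like this. It is a sketch under stated assumptions: the tree is already fetched as a list of paths, the manifest count stands in for the parsed plugin count, and the threshold value and every status name other than walked_empty are hypothetical.

```python
from pathlib import PurePosixPath

def find_manifests(tree_paths: list[str]) -> list[str]:
    """Return paths that look like plugin manifests: any SKILL.md, or a
    marketplace.json that sits directly inside a .claude-plugin directory."""
    found = []
    for raw in tree_paths:
        parts = PurePosixPath(raw).parts
        is_skill = bool(parts) and parts[-1] == "SKILL.md"
        is_marketplace = parts[-2:] == (".claude-plugin", "marketplace.json")
        if is_skill or is_marketplace:
            found.append(raw)
    return found

def queue_status(tree_paths: list[str], promote_threshold: int = 5) -> str:
    """Classify a walked repo for the discovered_repos queue."""
    manifests = find_manifests(tree_paths)
    if not manifests:
        return "walked_empty"   # not re-walked until the repo changes
    if len(manifests) >= promote_threshold:
        return "promoted"       # hypothetical name for a first-class source
    return "walked"             # hypothetical name for incremental walking

sample_tree = [
    "skills/a/SKILL.md",
    ".claude-plugin/marketplace.json",
    "docs/marketplace.json",  # wrong directory, so not a manifest
]
manifests = find_manifests(sample_tree)
```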

For the blog engine project we exported the slice of the catalog that matched on blog, linkedin, content, writing, marketing, and adjacent terms. That came back as 288 catalog rows reducing to 139 unique English-language repositories after deduplication.

Then I crawled every single one of them.

Not “spot-checked a few.” Not “read the README of the top ten.” Crawled all 139, in waves of four to eight parallel sub-agents, with each sub-agent reading the repo’s SKILL.md and plugin.json and any prose documentation, then producing a structured per-repo report. Ten batch reports went into the build log. A 40-feature taxonomy emerged from the crawl. Every plugin got rated Yes / Partial / No / Unknown on each feature. The taxonomy covered: phase structure, voice mechanisms, anti-AI-slop techniques, source verification patterns, scoring rubrics, hero-image conventions, LinkedIn derivative formats, hook libraries, sign-off protocols, nonce mechanisms.

I’ll tell you what the scan surfaced, because the patterns matter.

Pattern 1 – multi-phase pipelines with engineered gates between phases. Whatever the count (3, 4, 5, 7, 8), every credible plugin in the catalog has the same architecture. Each phase produces a named artifact. The next phase refuses to run unless the prior artifact has passed its gate. The vocabulary varies. draft → plan → review → refine in one repo. Interview → Research → Architecture → Writing in another. idea → research → content → post → publish in a third. Foundation → Thesis → Structure → Research → 30% Outline → Introduction → Drafting → Review in a fourth. RESEARCH → EXPERTS → IDEATION → CREATION → AIO in a fifth. The clothing is different. The discipline is identical. The 4D framework is one specific clothing of a deeper truth: serious content work requires phase boundaries with real gates.

Pattern 2 – interview-first beats outline-first. The single best-stated thesis in the crawl came from a repo called interview-to-blog: “The depth of AI output is directly proportional to the depth of human input.” The default-AI workflow generates an outline, expands it, polishes it. That workflow produces generic content because the AI is drawing entirely from training data. Inverting it – extract unique knowledge, examples, and genuine opinion first – produces voice and originality. Every high-quality plugin in the catalog leads with an interview.

Pattern 3 – voice and anti-AI-slop are mechanisms, not vibes. Aspirational rules (“write naturally”) fail. The plugins that actually produced un-AI-sounding work did it with a deterministic linter script plus a named-pattern catalog plus a second-pass audit. The best detector pairs first-order checks (vocabulary blocklist, regex patterns) with second-order checks (burstiness, paragraph-shape SD, question-H2 ratio, three-clause-sentence frequency). Single-pass de-slop misses survivors. The audit catches them.

Pattern 4 – the earned-secret stall. Only a handful of repos enforced this, but the ones that did were the strongest in the catalog. The mechanism: Phase 1 won’t proceed until the author names what they know from direct experience that most people in the audience don’t. “It can’t be something you read.” If the author can’t surface an earned secret, the process deliberately stalls.

Pattern 5 – verification tags on every claim. A few repos tagged each datum as [V] verified, [S] search-summary-only, or [F] fetch-failed. The writer is forbidden to use [F] data and substitutes a [CITATION NEEDED] placeholder instead. The post ships with a source-verification table at the end so the reviewer only has to vet the [S] rows.
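
A minimal sketch of that tagging contract, assuming each finding is a dict with a claim and a tag. The field names and function names are illustrative and not taken from any specific repo.

```python
VALID_TAGS = {"V", "S", "F"}  # verified, search-summary-only, fetch-failed

def prepare_for_writer(findings: list[dict]) -> list[dict]:
    """Replace fetch-failed data with a placeholder so it cannot leak into prose."""
    prepared = []
    for finding in findings:
        tag = finding["tag"]
        if tag not in VALID_TAGS:
            raise ValueError(f"unknown verification tag: {tag!r}")
        text = "[CITATION NEEDED]" if tag == "F" else finding["claim"]
        prepared.append({"text": text, "tag": tag})
    return prepared

def rows_to_vet(findings: list[dict]) -> list[dict]:
    """The reviewer only has to check search-summary-only rows."""
    return [f for f in findings if f["tag"] == "S"]

# Toy findings, invented for illustration.
findings = [
    {"claim": "claim backed by a fetched source", "tag": "V"},
    {"claim": "claim seen only in a search summary", "tag": "S"},
    {"claim": "claim whose source failed to load", "tag": "F"},
]
writer_view = prepare_for_writer(findings)
review_queue = rows_to_vet(findings)
```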

Pattern 6 – the nonce-bound reviewer. Exactly one repo in the 269 catalog rows implemented the nonce-bound BLOCKING reviewer pattern (the workbook carries 269 rows because in some cases we kept four copy variants per plugin). The pattern: a CSPRNG nonce is written to disk before the review starts, and the reviewer agent must echo it back verbatim in its verdict line. Anything missing or mismatched is rejected as provenance-failed. We adopted this pattern as the load-bearing primitive of the Diligence gate. Its catalog adoption is 0.4% – and it’s the difference between a review that the system trusts and a review that could be a hallucination.
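
The nonce handshake is small enough to sketch. This version assumes the nonce lives in a file named .review-nonce and that the reviewer's verdict line carries it as nonce=<hex>. That verdict-line format is hypothetical, and the snippet is an illustration of the idea, not the adopted repo's code.

```python
import hmac
import re
import secrets
import tempfile
from pathlib import Path

def write_nonce(nonce_file: Path) -> str:
    """Write a fresh CSPRNG nonce to disk before the review starts."""
    nonce = secrets.token_hex(16)
    nonce_file.write_text(nonce, encoding="utf-8")
    return nonce

def verify_verdict(verdict_line: str, nonce_file: Path) -> bool:
    """Accept the verdict only if it echoes the on-disk nonce verbatim."""
    expected = nonce_file.read_text(encoding="utf-8").strip()
    match = re.search(r"nonce=([0-9a-f]+)", verdict_line)
    if match is None:
        return False  # missing nonce: provenance-failed
    return hmac.compare_digest(match.group(1), expected)

with tempfile.TemporaryDirectory() as tmp:
    nonce_file = Path(tmp) / ".review-nonce"
    nonce = write_nonce(nonce_file)
    genuine = verify_verdict(f"BLOCKING: false nonce={nonce}", nonce_file)
    forged = verify_verdict("BLOCKING: false nonce=deadbeef", nonce_file)
    missing = verify_verdict("BLOCKING: false", nonce_file)
```

A reviewer that never ran, or a verdict the orchestrating model made up, has no way to know the value on disk, which is what makes the echo meaningful.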

There were other patterns: the FLOW evidence triple (year anchor in prose + inline citation + URL with retrieval date), the At-a-Glance block (60-90 words of AI-citation-bait at the top of every post), the 60-70% question-H2 ratio for AEO, the JSON-LD @graph with BlogPosting plus Person plus Organization plus FAQPage, the char-210 mobile fold rule for LinkedIn hooks, the 6-formula hook library (Stat-Led, Question, Story, Contrarian, Bold Claim, Pattern Interrupt), the 100-point Release Owner rubric with weighted cells across Content, SEO+AEO, E-E-A-T, Voice match, AI-citation readiness.

What the scan made impossible to ignore was that none of these patterns are individually novel. Every single one of them lives somewhere in the open Claude-plugin field already. The interesting question wasn’t which patterns to invent. The interesting question was which patterns to combine, and where to put the gates.

How the six patterns map onto the 4Ds

Two facts surfaced from the crawl that the pattern list above hides. The first is that the six patterns aren’t independent. Each one resolves to a specific 4D phase, and the phase it resolves to is the phase whose load-bearing question that mechanism is built to answer. The second is that no plugin in the catalog carried all six. Most carried one or two. A few carried three. The catalog had the components. The 4D framework’s contribution was knowing which component went where.

The mapping:

| Pattern | 4D phase | What the mechanism does for that phase | Catalog adoption |
| --- | --- | --- | --- |
| 1 – Multi-phase pipelines with engineered gates | The whole frame | Confirms the 4D shape isn’t a MoxyWolf opinion. Every credible plugin in the catalog has phase-boundaries-with-gates. The vocabulary varies. The architecture doesn’t. | Universal across credible plugins |
| 2 – Interview-first beats outline-first | Phase 2 (Description) | Forces voice and intent extraction in front of the outline, so the AI draws from the author’s specific knowledge rather than from training-data averages. The depth of the output tracks the depth of the input. | 7.8% (21 of 269) |
| 3 – Voice and anti-slop as mechanisms, not vibes | Phase 3 (Discernment) | The two-tier check that asks whether the draft survived a real test. A deterministic linter catches what regex can see. An LLM structural scan catches what regex misses. The second-pass audit catches what the first pass missed. | 15.2% layer-1, 11.5% layer-2 |
| 4 – The earned-secret stall | Phase 1 (Delegation) | Won’t let Phase 1 conclude until the author names what they know from direct experience that most readers don’t. Answers “is this the right work to hand to AI” with the only honest answer: not unless you bring the lived material the AI can’t generate. | 5.2% (14 of 269) |
| 5 – Verification tags on every claim ([V]/[S]/[F]) | Phase 3 (Discernment) | The other half of the “real check” – provenance for every citation, not just prose for every sentence. The tagging taxonomy turns sourcing from vibes into a flagged contract that the writer is forbidden to violate. | 2.2% (6 of 269) |
| 6 – Nonce-bound BLOCKING reviewer | Phase 4 (Diligence) | The Release Owner Gate needs a review the system can trust as authentic. Without CSPRNG-bound provenance, the review’s verdict could be hallucinated and the signature beneath it would be theater. | 0.4% (1 of 269) |

The adoption column is the part of this table that earned the most attention when I built it. The 4D shape itself is universal across credible plugins. Pattern 2 (interview-first) appears in 7.8% of the catalog. Pattern 4 (earned-secret stall) appears in 5.2%. Pattern 6 (nonce-bound reviewer) appears in one repo out of 269. The architectural claim isn’t that the framework invents anything. The architectural claim is that the framework combines a universally adopted shape with rare, load-bearing primitives at the phase boundaries where they do the most work. Most catalog plugins polish turds because they carry one or two patterns and stop there. The 4D framework only works when every phase carries its phase-appropriate mechanism, and most of those mechanisms live in single-digit adoption territory.
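
The adoption percentages are plain division over the 269 catalog rows, and they are easy to re-derive. The counts are the ones quoted in the mapping; the dictionary keys are just labels.

```python
CATALOG_ROWS = 269

# Number of catalog rows carrying each pattern, as quoted in the mapping.
adopters = {
    "interview-first": 21,
    "earned-secret stall": 14,
    "verification tags": 6,
    "nonce-bound reviewer": 1,
}

# Percentage of the catalog, rounded to one decimal place.
adoption_pct = {
    name: round(100 * count / CATALOG_ROWS, 1)
    for name, count in adopters.items()
}

for name, pct in adoption_pct.items():
    print(f"{name}: {pct}%")
```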

Two observations the table makes explicit.

The first is that Pattern 1 isn’t really a pattern. It’s the 4D framework itself, observed in the wild under different names. Every credible plugin in the catalog reaches this shape independently. The convergence isn’t accidental. The discipline names a structural truth about how AI-augmented knowledge work has to be assembled: phase boundaries with named artifacts and real gates between them, because the model alone can’t tell you when the work is good enough to ship. That’s the deeper claim the whitepaper made and the catalog corroborated. 4D isn’t an opinion we want the field to adopt. It’s the architecture the field is already adopting, one repo at a time, under whatever vocabulary fits the author’s mood.

The second is that the other five patterns each anchor to one specific phase, and the phase they anchor to is the phase whose load-bearing question that mechanism is built to answer. Pattern 4 anchors Phase 1 because Phase 1’s question is “is this AI-appropriate work in the first place,” and the answer is structurally no unless the author brings lived material. Pattern 2 anchors Phase 2 because Phase 2’s question is “have we told the model the goal precisely enough,” and the answer flows from interview depth, not from outline cleverness. Patterns 3 and 5 both anchor Phase 3 because Phase 3’s question is “did the draft survive a real check,” and the answer requires both prose-level scrutiny (Pattern 3) and citation-level scrutiny (Pattern 5). Pattern 6 anchors Phase 4 because Phase 4’s question is “will a named human sign,” and the signature only means something if the review beneath it carries authentic provenance.

This is why most plugins in the catalog polish turds. They carry one or two of these mechanisms – usually Patterns 2 or 3 – and miss the rest. A plugin with a beautiful voice interview but no earned-secret stall produces voice-rich content about nothing. A plugin with a deterministic anti-slop linter but no nonce-bound reviewer produces grammatically clean output that nobody put their name on. A plugin with all the discernment-phase mechanisms but no diligence-phase gate ships unverified output the moment the slop linter passes. The mechanisms are necessary but not sufficient on their own. The phase-appropriate placement is what makes the combination work.

The architectural insight wasn’t which patterns to use. It was that all six belonged in the same plugin, each at its phase-appropriate boundary, with the gates between phases enforcing that the work of each phase was actually done before the next phase could start. The whitepaper provided the frame. The catalog provided the components. The architecture’s job was to put each component at the right phase boundary so the framework as a whole could carry the discipline the whitepaper named.

That’s the question the architecture document – release-owner-plugin-architecture-2026-05-25.md – answered.

The architecture that emerged

The 4D framework gave us the phase boundaries. The 139-repo crawl gave us the patterns. The Workforce Automation methodology gave us the AI-for-bulk-human-for-dissent operating norm.

What we assembled out of those three was a plugin specification with the following load-bearing decisions.

The plugin would have one orchestrator skill per phase, four phases total. Each phase would produce a named artifact in a per-piece working directory: 01-delegation.md, 02-description.md, 03-discernment/*, 04-diligence/*. State would be persisted in a state.md file at the piece root, with current_phase, gates_passed: [], audience, thesis, earned secret, and a process log. The next phase would refuse to start unless the prior phase’s gate had passed and the gate timestamp was under 24 hours old. Multi-session work would be resumable by reading the state file.
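
The phase-start check can be sketched as follows. One loud assumption: state.md is taken as already parsed, with gates_passed represented as a mapping from phase name to the time its gate passed. The real file layout and parser are not shown here, and the names are illustrative.

```python
from datetime import datetime, timedelta, timezone

PHASES = ["delegation", "description", "discernment", "diligence"]
MAX_GATE_AGE = timedelta(hours=24)

def can_start(phase: str, gates_passed: dict[str, datetime], now: datetime) -> bool:
    """A phase may start only if the prior phase's gate passed less than 24 hours ago."""
    index = PHASES.index(phase)
    if index == 0:
        return True  # Delegation has no prior gate
    passed_at = gates_passed.get(PHASES[index - 1])
    if passed_at is None:
        return False  # prior gate never passed
    return now - passed_at < MAX_GATE_AGE

now = datetime(2026, 5, 25, 12, 0, tzinfo=timezone.utc)
fresh = can_start("description", {"delegation": now - timedelta(hours=2)}, now)
stale = can_start("description", {"delegation": now - timedelta(hours=25)}, now)
skipped = can_start("discernment", {"delegation": now}, now)
```

The staleness check matters for multi-session work: resuming from the state file does not let a day-old gate silently stand in for a current one.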

Phase 2 would reuse the eight-question MoxyWolf voice interview, but with a critical optimization: if the earned secret from the Delegation phase already pre-filled five or six of the voice slots, the system would carry those forward and only ask the unfilled questions. The carry mechanism saved 60% of the interview time in dry-run testing and was the v0.1.1 patch that landed first.
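
The carry mechanism reduces to a filter over the eight interview slots. A minimal sketch, assuming answers are kept in a dict keyed by slot name. The key spellings and the sample answers are hypothetical.

```python
VOICE_SLOTS = [
    "trigger", "evidence", "contrarian_take", "authority",
    "specific_reader", "business_connection", "call_to_action",
    "emotional_core",
]

def questions_to_ask(prefilled: dict[str, str]) -> list[str]:
    """Return only the slots that the Delegation phase did not already fill."""
    return [slot for slot in VOICE_SLOTS
            if not prefilled.get(slot, "").strip()]

# Placeholder answers carried forward from the earned secret.
carried = {
    "trigger": "the outbound campaign review",
    "evidence": "500 sends, zero replies",
    "contrarian_take": "better prompts do not fix this",
    "authority": "we wrote the whitepaper and still got caught",
    "emotional_core": "   ",  # blank answers do not count as filled
    "specific_reader": "a named reader, not a demographic",
}
remaining = questions_to_ask(carried)
```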

Phase 3 would chain four existing MoxyWolf skills (literature-discovery, citation-verifier, bibtex-builder, content-writer) under a new orchestrator that added the 30-day discourse sweep and the Council deliberation synthesis pass. The discourse sweep would use platform-targeted site: operators against Reddit, X, Hacker News, dev.to, Substack, GitHub, and LinkedIn long-form, with Apify actors for podcasts and the existing research-pipeline integration for scholarly sources. The Council deliberation would prompt: “Given these N findings, the angle, and the outline, which findings move the post from generic to specific? Which contradict each other? Which represent the consensus versus the minority view across platforms?”

Phase 3 would also implement the two-tier anti-AI-slop check. Layer 1 would be a deterministic Python script – prose_lint.py – with about 60 banned words inherited from the MoxyWolf voice profile, regex patterns for banned phrases, em-dash detection with automatic replacement, passive-voice frequency, sentence-length burstiness, type-token ratio, paragraph-length standard deviation. Layer 2 would be an LLM-based structural scan that catches what the linter misses: question-H2 saturation, three-clause-sentence frequency, “Here” paragraph-starter count, hedge stacking, symmetric-list bloat. Findings would write to slop-findings.md. The draft would be rewritten only against findings, and the rewrite would then go through a second-pass audit to catch survivors.
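To make the Layer 1 metrics concrete, here is a rough sketch of a few of them. It is not prose_lint.py: the banned-word list is a three-word placeholder, and burstiness is assumed to mean the standard deviation of sentence length divided by the mean, which may differ from the plugin's definition.

```python
import re
import statistics

BANNED_WORDS = {"delve", "leverage", "robust"}  # placeholder, not the real list


def lint_metrics(text: str) -> dict:
    """Compute a handful of deterministic prose metrics for a draft."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    words = re.findall(r"[a-z']+", text.lower())
    if not lengths or not words:
        raise ValueError("text contains no sentences to measure")
    mean_len = statistics.mean(lengths)
    return {
        "mean_sentence_length": mean_len,
        # Assumed definition: coefficient of variation of sentence length.
        "burstiness": statistics.pstdev(lengths) / mean_len,
        "type_token_ratio": len(set(words)) / len(words),
        "banned_hits": sorted(set(words) & BANNED_WORDS),
        "em_dashes": text.count("\u2014"),
    }
```

Because every value is computed rather than judged, two runs on the same draft return the same numbers, which is what lets the metrics be compared from one post to the next.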

Phase 4 would implement the Release Owner Gate as a five-stage contract: capability, format, visual, content review, asset integrity. Stage 4 (content review) would dispatch a subagent reviewer restricted to read-only tools. The reviewer would have to echo a CSPRNG nonce that the gate had written to .review-nonce before the review started. The reviewer would score the post against a 100-point rubric (Content 30, SEO+AEO 25, E-E-A-T 15, Voice match 15, AI-citation readiness 15), apply auto-fail triggers (any em-dash in body, any contrast-frame Tier-2-Major pattern, any curly quote inside a JSON-LD <script> block), and end the scorecard with BLOCKING: true|false (reason). The iteration cap would be three rounds. If the reviewer cleared at 90 or above and BLOCKING was false, the human Release Owner would be presented with three sign-off questions on the load-bearing claims of the piece and would enter their initials and the date in changelog.md. The plugin would never auto-sign.
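The nonce binding can be sketched with the standard library alone. The `.review-nonce` file name comes from the spec; the function names, the 128-bit nonce length, and the directory layout are assumptions.

```python
import hmac
import secrets
from pathlib import Path


def issue_nonce(piece_dir: Path) -> str:
    """Write a fresh CSPRNG nonce for this review round and return it."""
    nonce = secrets.token_hex(16)  # 128 bits from the OS CSPRNG (assumed length)
    (piece_dir / ".review-nonce").write_text(nonce, encoding="utf-8")
    return nonce


def verdict_is_bound(piece_dir: Path, echoed_nonce: str) -> bool:
    """Accept a verdict only if it echoes the nonce issued for the current round."""
    expected = (piece_dir / ".review-nonce").read_text(encoding="utf-8").strip()
    # Constant-time comparison on bytes; a stale or invented nonce fails here.
    return hmac.compare_digest(expected.encode(), echoed_nonce.strip().encode())
```

Issuing a new nonce at the start of each round overwrites the old one, so a verdict carried over from an earlier round no longer matches.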

On a clean gate, Phase 4 would derive the LinkedIn pair: a long-form article (800-1200 words, full mirror of the blog, more personal and opinion-led, no external links in the body – the canonical blog URL goes in the first comment), and a short teaser (about 1,300 characters, hook landing before character 210, 0 to 3 hashtags at the end after a line break, one earned-secret-anchored line woven into the middle, closing question naming a specific concrete decision the reader would actually be facing). Both would carry a 3-axis scorecard: Thought Leadership /10, Pain on Reader /10, Audience Fit /10.

The hero image would generate from a brand-aligned abstract style spec (geometric, angular, layered shapes, calm modern editorial feel, no text, no logos, no people, no literal objects, matte finish, palette restricted to MoxyWolf’s brand). The image prompt would be preserved as a sibling artifact so the reader can see what the model was instructed to draw. Higgsfield via the Banana model would be the default image generator, with a documented fallback for sessions without an image-gen MCP loaded.

That’s the architecture in compressed form. The full specification ran about 4,000 words and got signed off in one session.

We dogfooded it on itself

Then we built the plugin.

23 files. Seven commands, four skills, four Python scripts, six reference documents. Version 0.1.0 shipped after about a day of work, on top of the prior week’s research and architecture.

The first piece we wrote with it wasn’t a manufactured test piece. The first piece we wrote with it was a post about the polish bias, derived from the Beyond the Prompt whitepaper, with the angle “What does the polish bias mean for SMB founders in 2026?”

The recursion is the point. We used the framework to build the engine. We used the engine to write a post about the framework. If the engine could clear its own Release Owner Gate on a post about the gate, the architecture was load-bearing. If it couldn’t, we’d find out exactly which load-bearing claim broke.

The dry-run produced findings.

Phase 1 surfaced the earned secret cleanly: the 500/0 campaign story, the “polished turd” phrase, the moment with our marketing director when the diagnosis landed. The earned-secret stall did its job – Phase 1 wouldn’t proceed without it.

Phase 2’s voice interview carried five of eight slots from Phase 1 (the v0.1.1 patch worked: a 62.5% efficiency win on the interview step). We only had to answer three new questions. The Sorkin DOB structure picked itself once the Phase 1 story was on the table.

Phase 3 ran the discourse sweep – and surfaced a v0.1.2 bug. The script greps ^## headers in 02-description.md to build the query slate, and it was picking up meta-structural H2s like “Voice interview,” “Structure,” and “Outline” instead of just the content H2s. Sixty-three queries got generated, most of them junk. We logged it as task #33 in the backlog and moved on. The sweep still produced enough verified sources (nine, after the Council deliberation synthesis pass) to draft against.
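One way to keep meta-structural headers out of the query slate is an explicit exclusion set, sketched below. The three excluded names are the ones quoted above; the real fix may take a different approach, and the function name is hypothetical.

```python
import re

# The three meta-structural H2s named in the post; a real list may be longer.
META_H2S = {"voice interview", "structure", "outline"}


def content_h2s(markdown: str) -> list:
    """Return H2 titles from a Markdown document, minus scaffold sections."""
    headers = re.findall(r"^## +(.+?)\s*$", markdown, flags=re.MULTILINE)
    return [h for h in headers if h.lower() not in META_H2S]
```

The pattern requires exactly two `#` characters followed by a space, so H3 lines are not picked up by accident.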

The draft came in at about 1,990 words, 13.09 average sentence length, 0.744 burstiness, 7.2% passive voice. Layer 1 caught 10 em-dashes – auto-fixed via prose_lint.py --fix. Layer 2 caught a different problem: 100% of the H2s in the outline were phrased as questions, well above the 60-70% spec. The audit forced two H2s to be converted to declarative form before the rewrite. The second-pass audit ran on the revision and came back clean.
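The question-H2 check reduces to a ratio. The sketch below treats 70% as the ceiling, following the 60-70% spec quoted above. The six-header outline in the usage note is hypothetical, since the post does not say how many H2s the draft had.

```python
def question_h2_share(h2s: list) -> float:
    """Fraction of H2 titles phrased as questions."""
    if not h2s:
        raise ValueError("no H2s to measure")
    return sum(h.rstrip().endswith("?") for h in h2s) / len(h2s)


def within_spec(h2s: list, ceiling: float = 0.70) -> bool:
    """True when the question share does not exceed the ceiling."""
    return question_h2_share(h2s) <= ceiling
```

For a hypothetical outline of six H2s, all phrased as questions, the share is 1.0 and the check fails. Converting two of them to declaratives brings the share to 4/6, which passes.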

Phase 4 – the Release Owner Gate – is where the architecture got tested hardest.

Round 1: the nonce-bound reviewer scored the post 68 out of 100, BLOCKING true. Twelve findings, four of them in voice match (em-dash count, contrast-frame frequency, JSON-LD curly-quote contamination, missing dateline in prose). The reviewer’s verdict line echoed the nonce verbatim. The writer absorbed the findings and rewrote.

Round 2: 88 out of 100, still BLOCKING true (the threshold is 90). Four partial findings remained: one Tier-2-Major contrast frame survived the rewrite, the title was 47 characters (the spec is 50-60), the excerpt was 167 characters (the spec is 150-160 with a CTA verb), and the Key Takeaways block had a doubled-space en-dash artifact from the prose_lint replacement. Three more v0.1.2 bugs surfaced in this round: the lint script was curling straight quotes inside the JSON-LD <script> block (breaking JSON parsing), the em-dash replacement was double-spacing the surrounding whitespace, and the preflight stage 2 typography check was raising the same JSON-LD false positive. We patched the scripts inline, applied the four surgical content edits the reviewer named, and ran round 3.

Round 3: 94 out of 100, BLOCKING false. Five remaining findings, all Medium or Minor copy edits, none gating publish. Three were metadata-drift in the JSON-LD: the headline still read the round-2 title, the description still carried the round-2 excerpt, the YAML excerpt had overcorrected past the lower bound at 143 characters. Three one-line edits closed all three.

I, as the named Release Owner, was presented with the three load-bearing claims of the post: the SMB structural exposure thesis, the mini-Release-Owner-Gate prescription, and the Frontier Founder series forward commitment. I approved all three. The plugin wrote Verified — DC, 2026-05-26 to changelog.md. Stage 4 cleared.
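The three rounds above follow a simple decision rule, sketched here under the thresholds the post states: a 100-point rubric, a pass mark of 90, and a three-round cap. The function name and the outcome labels are illustrative.

```python
# Rubric weights as listed in the architecture; they sum to 100.
RUBRIC = {
    "Content": 30,
    "SEO+AEO": 25,
    "E-E-A-T": 15,
    "Voice match": 15,
    "AI-citation readiness": 15,
}
PASS_SCORE = 90
MAX_ROUNDS = 3


def gate_outcome(rounds: list) -> str:
    """Decide the next step from a list of (score, blocking) review results."""
    for score, blocking in rounds[:MAX_ROUNDS]:
        # An auto-fail trigger sets blocking=True regardless of the score.
        if score >= PASS_SCORE and not blocking:
            return "human sign-off"
    return "escalate" if len(rounds) >= MAX_ROUNDS else "revise and re-review"
```

Fed the dry-run's results, `gate_outcome([(68, True), (88, True), (94, False)])` returns `"human sign-off"`. The function never signs anything itself; it only says whose turn it is.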

The LinkedIn pair derived cleanly. Article: 881 words, hook landing at character 131, scoring 8/9/9 on the three axes, recommendation SHIP. Teaser: 199 words and 1,135 characters of body, hook landing at character 50, scoring 7/8/9, recommendation SHIP. The linkedin_score.py script flagged the teaser as failed on the first run because it was counting the entire file (including scaffold sections like “Selected hook” and “Alternates considered”) instead of just the ## Body section. Another v0.1.2 patch landed. Re-run clean.
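The body-extraction fix can be sketched as a single regular expression that captures everything between the `## Body` header and the next H2. This is an illustration rather than the patched linkedin_score.py; both function names are hypothetical.

```python
import re


def extract_body(markdown: str) -> str:
    """Return only the text under the '## Body' header, up to the next H2."""
    match = re.search(
        r"^## Body[ \t]*\n(.*?)(?=^## |\Z)",
        markdown,
        flags=re.MULTILINE | re.DOTALL,
    )
    return match.group(1).strip() if match else ""


def body_stats(body: str) -> dict:
    """Counts a teaser scorer would need, measured on the body alone."""
    return {
        "chars": len(body),
        "words": len(body.split()),
        "hashtags": len(re.findall(r"(?<!\w)#\w+", body)),
    }
```

Scoring `body_stats(extract_body(text))` instead of the whole file keeps scaffold sections such as "Selected hook" and "Alternates considered" out of the character count.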

The hero image generated via Higgsfield Banana, twice – once in the original prompt’s charcoal-and-blue palette during the dry-run, then again in the Frontier Founder brand palette (navy, teal, orange, cream, gold) when we staged the post for the FF blog repo. The prompt artifact was preserved both times.

End state: 35 of 40 architecture features validated end-to-end in the dry-run. Two implemented but not exercised in dry-run (cross-source paraphrase clustering, tier 1-5 source quality ladder – both because the 9-source dry-run bibliography was too small to surface them). One partial (the tool-restricted reviewer is enforced at prompt level only – runtime enforcement via a custom subagent type is on the backlog). One degraded path (multi-model Council synthesis ran in single-model fallback during the dry-run session, documented). One known bug (the discourse_sweep outline-parser).

87.5% of the architecture survived the design-to-code-to-dry-run translation with full validation. Zero features deferred to v0.2. The plugin did what we said it would do.

Where v0.1.2 stands against the 269-entry catalog

After the dry-run cleared, I went back to the feature-comparison workbook and added a fifth tab to it. The original four tabs compared the architecture design to the catalog: a 269-row per-entry comparison across the 40-feature taxonomy, a compact relevant-only feature matrix, a 29-cluster recommended-additions list (the v0.2 backlog drawn from features competitors have that we still don’t), and a methodology log. The fifth tab compares the shipped v0.1.2 to the same 269 entries, using the same 40-feature taxonomy, so the recursion closes inside one workbook.

The top-line scorecard:

| Metric | Value |
| --- | --- |
| Features validated end-to-end in the dry-run | 35 of 40 (87.5%) |
| Features implemented but not exercised in the dry-run | 2 of 40 |
| Features with a partial or degraded path | 2 of 40 |
| Features with a known bug logged for v0.1.3 | 1 of 40 |
| Features deferred to v0.2 | 0 of 40 |
| Features where v0.1.2 leads the field (catalog adoption ≤ 20%) | 33 of 40 (82.5%) |
| Features in rare-differentiator territory (catalog adoption ≤ 5%) | 21 of 40 (52.5%) |
| Features where v0.1.2 trails the field | 0 |

A slice of the rare-differentiator features the plugin ships, with their catalog adoption percentages and the work they did in this very dry-run:

| Feature | Catalog adoption | What it did in the dry-run |
| --- | --- | --- |
| Nonce-bound BLOCKING reviewer | 0.4% (1 of 269) | Caught a stale round-2 review trying to claim round-3 authority. The mismatch error forced a fresh reviewer dispatch with the new nonce. |
| AI-transparency hero prompt artifact | 1.5% | The image prompt ships alongside the rendered hero so readers can see what the model was instructed to draw. Both versions of the hero (charcoal-and-blue and then brand-palette) preserved their prompts. |
| Multi-model Council synthesis pass | 1.5% | Wired but ran in single-model fallback during this session. The graceful degradation path documented itself in state.md. |
| Cross-source paraphrase clustering | 1.5% | Implemented but the 9-source dry-run bibliography was too small to exercise it. The mechanism is present. The test corpus wasn’t large enough to trigger it. |
| Iteration cap (max N rounds then escalate) | 1.9% | Three-round cap enforced. Round 3 cleared at 94/100, on the last round the cap allows. |
| [V]/[S]/[F] verification tags | 2.2% | Caught two whitepaper sub-claims that needed primary-source replacement before publication. |
| 3-axis LinkedIn scorecard | 2.6% | Article axes 8/9/9, teaser axes 7/8/9. Both SHIP. The scorecard forced explicit per-axis judgment before the recommendation. |
| Tool-restricted reviewer (no Bash, no Edit) | 2.6% | Enforced at prompt level only in v0.1.2. Runtime enforcement via a custom subagent type is task #34 on the backlog. |
| Earned-secret stall | 5.2% | Produced the 500/0 story. Phase 1 wouldn’t proceed without it. The story then pre-filled 5 of 8 voice-interview slots in Phase 2. |
| 30-day platform-targeted discourse sweep | 5.2% | Ran across reddit, X, HN, dev.to, Substack, GitHub, LinkedIn, podcasts, scholarly. Surfaced the outline-parser bug now logged as task #33. |
| Measurable voice metrics (burstiness, TTR, sentence-length SD) | 5.2% | Burstiness 0.744, passive 7.2%, mean sentence length 13.09, all recorded in the process log so the next post’s metrics can be compared against this one’s. |

The aggregate distribution makes the case the architecture document was making. 33 of 40 features lead the field. 21 of 40 are in rare-differentiator territory at five percent adoption or below. Zero features trail. The framework wasn’t built to compete on table stakes. It was built to ship, at each phase boundary, the rarest mechanism in the catalog that did real load-bearing work for that phase’s question. Pattern 6 wasn’t adopted because we liked novelty. Pattern 6 was adopted because Phase 4 needs a review the system can trust as authentic, and a CSPRNG nonce is the cheapest available primitive that authenticates a subagent’s verdict to the rest of the pipeline.

The workbook is available as a sanitized public download: 4d-blog-engine-feature-comparison-2026-05-26.xlsx. Five tabs. The first four were built from the architecture document before any code shipped: the 269-row per-entry comparison, a compact relevant-only feature matrix, the 29-cluster recommended-additions list (which IS the v0.2 backlog, published deliberately so anyone reading along can see exactly what we’re building next), and the methodology log. The fifth tab was built from the dry-run after the architecture cleared its own Release Owner Gate. The two halves of the workbook are the before-and-after of the design-to-code translation, measured against the same 269-entry catalog with the same 40-feature taxonomy. The recursion closes there too.

Reading note for the download: the per-entry comparison’s “Notes” column carries our actual judgment calls on each of the 269 catalog rows. Some of them are charitable. Some of them are not. The whole point of publishing it is that the empirical claim has to be auditable, including the parts where reasonable people would have called something differently.

What this method actually proves

We did not write a blog post that’s polish over substance. We caught two whitepaper sub-claims that needed direct source replacement before publication (“5.2pp context drop” and “78%/1% adoption” – both flagged [S] in the verification pass and replaced with the directly-verified 3.7/3.1 dimension drops from Anthropic’s 2026 AI Fluency Index and the 24%/12% adoption figures from the SBE Council’s small-business survey). The substance check did real work.

We caught metadata drift in the JSON-LD that would otherwise have shipped quietly – the headline mismatched the YAML title, the description mismatched the excerpt. The reviewer pulled it out at round 3, the writer fixed it before push. The format check did real work.
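A drift check of this kind is small enough to sketch. The version below compares a parsed front-matter dictionary against the JSON-LD text and applies the title and excerpt length specs quoted earlier. The function name and the front-matter keys are assumptions.

```python
import json

TITLE_RANGE = (50, 60)      # character spec quoted in the post
EXCERPT_RANGE = (150, 160)  # character spec quoted in the post


def metadata_drift(front_matter: dict, json_ld_text: str) -> list:
    """List mismatches between YAML front matter and the JSON-LD block."""
    json_ld = json.loads(json_ld_text)  # raises if the JSON itself is broken
    problems = []
    if json_ld.get("headline") != front_matter["title"]:
        problems.append("headline does not match title")
    if json_ld.get("description") != front_matter["excerpt"]:
        problems.append("description does not match excerpt")
    for field, (low, high) in (("title", TITLE_RANGE), ("excerpt", EXCERPT_RANGE)):
        length = len(front_matter[field])
        if not low <= length <= high:
            problems.append(f"{field} is {length} characters, spec is {low}-{high}")
    return problems
```

Running it after every content edit, not only at the final gate, would surface a stale headline in the same round that changed the title.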

We caught the discourse_sweep outline-parser bug because we ran the system on a real piece, not on a synthetic test case. The synthetic test case wouldn’t have had the meta-structural H2s that tripped the parser. The dry-run did. The integration test did real work.

We caught the linkedin_score body-extraction bug for the same reason – the teaser scaffold file is shaped exactly the way the hook-library output contract specifies, with all the H2 sections the human reviewer needs. The script wasn’t aware of that contract. The dry-run forced the contract back into the script.

We caught the prose_lint quote-curling bug because the post had a JSON-LD <script> block in the body, and the linter was processing the whole document including the script content. The fix is to strip protected regions (code fences, inline code spans, script and style HTML blocks) before applying the curl-and-em-dash transforms. The dry-run forced the protection logic into the script.
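The protection logic can be sketched as a splitter that applies a transform to prose and passes protected regions through untouched. This is an illustration, not the patched prose_lint.py. Only the dash transform is shown, and the spaced en dash as the replacement is an assumption based on the artifact described earlier.

```python
import re

FENCE = "`" * 3  # built this way so the example can sit inside a fenced block

PROTECTED = re.compile(
    FENCE + r".*?" + FENCE          # fenced code blocks
    + r"|`[^`\n]+`"                 # inline code spans
    + r"|<script\b.*?</script>"     # script blocks, including JSON-LD
    + r"|<style\b.*?</style>",      # style blocks
    flags=re.DOTALL | re.IGNORECASE,
)


def transform_prose(text: str, transform) -> str:
    """Apply `transform` to prose only; protected regions are copied verbatim."""
    out, last = [], 0
    for match in PROTECTED.finditer(text):
        out.append(transform(text[last:match.start()]))
        out.append(match.group(0))
        last = match.end()
    out.append(transform(text[last:]))
    return "".join(out)


def replace_em_dashes(prose: str) -> str:
    # Consume the spaces around the dash so the replacement cannot double-space.
    return re.sub(r"[ \t]*\u2014[ \t]*", " \u2013 ", prose)
```

A quote-curling transform would be passed through `transform_prose` the same way, which keeps the straight quotes inside a JSON-LD block parseable.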

Each of those bugs is the kind of thing that would have shipped quietly on a less-instrumented pipeline. Each of them got caught by a specific gate doing specific work. The gates aren’t theater. The gates are the thing.

The deeper proof is this: polish without substance is not “almost good.” Polish without substance is the actual failure mode. The polish bias works because the polish reads as completeness. The human reviewer drops the check that would have caught the missing substance, because the surface signal of completeness is so strong. The 4D framework doesn’t fix this by making the AI generate substance – the AI can’t. The 4D framework fixes it by making the substance check structurally unavoidable. You can’t get to Phase 4 without the substance produced in Phase 1, Phase 2, and Phase 3. The phases compound. The gates compound. The reviewer can’t sign off on a piece whose Phase 1 artifact has no earned secret. The reviewer can’t sign off on a piece whose Phase 3 bibliography has no verified citations. The structure makes the failure mode visible.

That’s what the architecture is for. Not to polish harder. To make turd polishing impossible to complete.

The pattern generalizes

The 4D framework works for blog posts because it works for capability mapping in Workforce Automation. It works for capability mapping because it works for code review (we ship a similar gated discipline in our gstack-execution plugin for code work). It works for code review because the underlying claim is true across the board: AI fluency is structural, not lexical. Anyone can paste a prompt. What matters is the gate, the rubric, the eval, the named signature.

We’ve talked about this internally as “the bottleneck has moved from making the work to validating it.” The crawl confirmed that – every credible plugin in the field is grappling with the same problem in different vocabularies. The Workforce Automation methodology confirmed it – our $180 production scoring run on Opus came after a $39 eval gate that proved Sonnet couldn’t clear the precision floor. The blog engine dry-run confirmed it – the 94-out-of-100 round-3 score came after two BLOCKING rounds that forced specific content edits.

What we’ve been building, across SAMS and STIGViewer and RegGenome and Workforce Automation and now this blog engine, is the validation infrastructure for AI-augmented knowledge work. The frameworks differ by domain. The Release Owner Gate handles content. The Council-arbiter-eval-production methodology handles capability inference. The gstack pre-landing review handles code. But they share the same shape: structured human-in-the-loop with engineered phase boundaries and named human accountability at the gate.

If you’re a founder reading this and you’ve adopted AI for any meaningful slice of your team’s output, the question to bring back to your operation isn’t “what prompt template should we use.” The question is: where are your gates, what artifacts do they check, and whose name goes on the signature line.

The whitepaper named this. The 4D framework operationalized it. The Workforce Automation engine empirically proved the AI-for-bulk-human-for-dissent methodology at scale. The 139-repo crawl established that the field is converging on the same architecture under different names. The blog engine made the framework Monday-installable for content work. The dry-run on this very framework’s own genesis whitepaper proved the engine could clear its own gate.

We ate our own dogfood. We caught what the polish bias would have hidden. We named the failure mode in our own send folder. We built the discipline that would have caught it earlier. We wrote the post that explains all of this with the discipline itself.

A polished turd is still a turd. But you can build a kitchen that refuses to serve them.

What’s next

Two specific items, then a closing thought.

The first item is the v0.1.2 patch push. The three deterministic-script fixes that surfaced in the dry-run – quote-curling protection, em-dash whitespace collapse, body-section extraction for LinkedIn scoring – are in the working tree. They go up the next time GitHub Desktop opens cleanly. Version bumps are applied. The 4d-blog-engine moves to 0.1.2 in the marketplace.

The second item is the v0.2 backlog. Five planned increments: hook-library expansion plus multi-schema JSON-LD plus a Context/ single-source-of-truth directory (v0.2.1), past-post scan plus returning-user check plus SERP gap analysis (v0.2.2), a mode dial plus per-H2 citation capsules (v0.2.3), LinkedIn carousel as a third derivative output (v0.2.4), and a hypothesis ledger plus learning loop (v0.3.0). All five trace back to specific features in the catalog scan that competing tools had which ours didn’t. None of them require re-architecting. Each is a contained addition.

The closing thought is for any founder who’s run an AI marketing experiment and gotten back the kind of numbers we got – a reply rate well below the 2026 B2B benchmark of 3.43%, opens stuck below 20%, conversion that won’t budge no matter how good the personalization. Don’t reach for a better prompt. Don’t reach for a better model. Reach for the gate. Pick one channel. Route every AI-drafted message through one named person before it goes out, for one week. The person checks three things: does this land on a person or on a demographic? Does it sound like us? Would I send this with my own name on it? After seven days, count signed-and-sent versus stopped-and-revised. That ratio is your polish-bias baseline.

The discipline matters more than the headcount. The framework works at any scale. The substance has to come from you.

The polish never will.


This post is part of the Frontier Founder series, MoxyWolf’s running argument that the company that wins the AI era is the one built so human judgment scales. The companion post – “What the polish bias means for SMB founders in 2026” – is the first piece this engine wrote, and the practical Monday-installable version of the discipline this essay describes. The plugin source is in the moxywolf-plugins marketplace under plugins/4d-blog-engine. Next in the series: what to do when more got stopped than sent, and how to fix the upstream prompts so the gate stops being the bottleneck.

Sources retrieved 2026-05-26.

Media gallery

Feature-comparison workbook: 269 catalog entries × 40 features, plus the v0.1.2-vs-catalog tab built after the dry-run cleared. Sanitized for public release. File: 4d-blog-engine-feature-comparison-2026-05-26.xlsx