The Dark Factory - Blaz Kos

No human writes the code. No human reviews the code. Specifications go in. Working, production-grade software comes out. The only human work is writing specs and evaluating outcomes.

What Is a Dark Factory?

The term comes from manufacturing, specifically "lights-out factories": industrial plants that operate in complete darkness because no humans are present. Robots and automated systems run 24/7 with no one inside. The lights stay off because nobody is there to need them.

A Dark Factory in software is the same idea applied to code production. It differs from "AI-assisted coding" in kind, not in degree:

Humans are not faster coders. They have exited the implementation loop entirely.
Humans are not reviewers. They are not reading pull requests or approving diffs.
Humans are architects and judges. They define what needs to exist, and they evaluate whether what was built actually works.

The constraint moves from "Can we build it?" to "Should we build it?" — and "Should we build it?" has always been the harder and more interesting question.

Why This Changes Everything

Every organizational structure in software development was designed to manage a human limitation. Stand-up meetings exist because developers need to synchronize daily. Sprint planning exists because humans can only hold so many tasks in working memory. Code review exists because humans make mistakes other humans can catch.

When machines do the implementation, these structures stop being safeguards and become friction. The bottleneck moves from implementation speed to spec quality, and spec quality is entirely a function of how deeply you understand the system, the customer, and the problem.

The Five Levels of AI-Assisted Development

This framework maps where the industry stands. It starts from a baseline of "AI as a faster keyboard" (Level 0) and climbs through five levels of increasing autonomy to the fully autonomous Dark Factory. Think of it as a spectrum of how far the human has stepped back from implementation.

Level	Name	Your Job
0	Spicy Autocomplete	Write code. Accept/reject line suggestions.
1	Coding Intern	Assign scoped tasks. Review all output.
2	Junior Developer	Direct multi-file work. Read all code produced.
3	The Manager	Review at PR/feature level. Stop reading code.
4	The Product Manager	Write specs. Evaluate outcomes. Never read code.
5	The Dark Factory	Architect the system. Judge output. Factory runs itself.

Level 0 — Spicy Autocomplete

A faster tab key. The human is writing all the software — the AI just reduces keystrokes. There is no architectural shift in who is responsible for what. Most people who have "tried AI coding tools" and found them underwhelming are thinking of this level.

Level 1 — Coding Intern

"Write this function." "Build this component." The AI handles execution of a bounded task. The human handles architecture, judgment, and integration. You review all output before it goes anywhere. You are essentially a senior engineer with a tireless, fast junior who never complains about repetitive work.

Level 2 — Junior Developer

Now the AI can understand dependencies, build features that span modules, and hold context across a codebase. The output is more complex — but you are still reading every line. You remain the final quality gate. An estimated 90% of developers who call themselves "AI-native" are operating at this level.

Level 3 — The Manager

This is where the relationship starts to flip. You are not writing code and having the AI help. You are directing the AI and reviewing what it produces — at the feature level, at the PR level. The bottleneck here is psychological. Most developers hit a ceiling at Level 3 because they struggle with letting go of the code. This is also the first level where non-coders can operate naturally.

Level 4 — The Product Manager

The code is a black box. You care whether it works, not how it is written. You have defined what "working" means precisely enough that you do not need to inspect implementation details. This requires two hard-won capabilities:

The ability to write specifications precise enough that an agent can implement them correctly without human clarification.
The ability to define evaluation criteria specific enough that outcomes can be judged without reading code.

Level 4 is where non-coders who master spec-writing can compete directly with — and often outperform — experienced developers who never learned to think at this level.

Level 5 — The Dark Factory

This is the endpoint. Specifications go in. Working software comes out. The factory runs autonomously. Humans define what needs to exist and judge whether what was produced is correct. Everything else is automated.

The path to Level 5 is not about finding better tools. The tools today are sufficient. It is about redesigning every part of how you work — your organizational structure, your documentation approach, your evaluation frameworks, your psychological relationship with code.

The Design: Four Pillars

Pillar 1: The Specification System

Your spec system is the input interface to your factory. Everything the factory produces is determined by the quality of what goes in. A Dark Factory with a weak spec system produces garbage autonomously — which is worse than producing garbage slowly.

A standard spec template that every feature or product uses. Consistent structure means agents can parse it reliably.
A versioning approach — specs are living documents and changes need to be tracked.
A categorization system — distinguish between architectural specs, feature specs, and constraint specs.
A validation step before any agent runs — a checklist that confirms a spec is complete before it enters the factory.
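
As a rough illustration, the four elements above could be modeled as a small data structure. This is only a sketch: the `Spec` class, the `SpecKind` categories, and the `REQUIRED_SECTIONS` checklist are hypothetical names chosen for the example, not a prescribed format.

```python
from dataclasses import dataclass, field, replace
from enum import Enum


class SpecKind(Enum):
    ARCHITECTURAL = "architectural"
    FEATURE = "feature"
    CONSTRAINT = "constraint"


# Hypothetical checklist; substitute the sections your own template requires.
REQUIRED_SECTIONS = ("behavior", "inputs", "outputs", "evaluation")


@dataclass(frozen=True)
class Spec:
    title: str
    kind: SpecKind
    version: int = 1
    sections: dict = field(default_factory=dict)

    def revise(self, **changes: str) -> "Spec":
        """Return a new version with updated sections; the old one is kept intact."""
        merged = {**self.sections, **changes}
        return replace(self, version=self.version + 1, sections=merged)

    def missing_sections(self) -> list:
        """Required sections that are absent or blank."""
        return [s for s in REQUIRED_SECTIONS if not self.sections.get(s, "").strip()]

    def is_ready(self) -> bool:
        """The validation step: a spec enters the factory only when complete."""
        return not self.missing_sections()


draft = Spec("Email signup", SpecKind.FEATURE,
             sections={"behavior": "A visitor signs up with an email address."})
print(draft.missing_sections())   # ['inputs', 'outputs', 'evaluation']

final = draft.revise(
    inputs="An email address; reject empty or malformed values.",
    outputs="A confirmation page.",
    evaluation="Holdout scenarios for valid and invalid emails.",
)
print(final.version, final.is_ready())   # 2 True
```

Keeping each revision as a new immutable object is one simple way to track changes; storing specs in version control would serve the same purpose.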

Pillar 2: The Scenario System (Holdout Set)

Scenarios are external behavioral tests that the agent never sees during development. They live outside the codebase and serve as your quality gate.

Tests live inside the codebase. The AI can see them. The AI can game them. Scenarios live outside the codebase as a holdout set. The AI cannot see them. The agent builds the software. Then scenarios are run against it externally. This prevents your factory from shipping software that passes all tests but fails real users.

Write scenarios in plain language first, then convert to automated checks.
Scenarios must be behavioral: "When a user submits the form with a valid email, the confirmation page appears within 2 seconds."
Scenarios live in a separate repository from the code. The agent has no access to them during build.
After every agent run, the scenario suite executes against the output. Pass = ships. Fail = spec needs refinement.
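
To make the plain-language-to-automated-check step concrete, here is a minimal sketch that turns the form-submission scenario above into a check and runs it against a built application. The `Scenario` class, `run_suite`, and the `fake_submit_form` stand-in are assumptions made for illustration; a real suite would live in its own repository and call the deployed staging build.

```python
import time
from dataclasses import dataclass
from typing import Callable

App = Callable[[str], str]  # takes an email, returns the name of the page shown


@dataclass
class Scenario:
    description: str              # the plain-language statement
    check: Callable[[App], bool]  # the automated version of it


def fake_submit_form(email: str) -> str:
    """Stand-in for the agent-built app; a real run would hit staging."""
    return "confirmation" if "@" in email else "error"


def confirmation_within_two_seconds(app: App) -> bool:
    start = time.monotonic()
    page = app("user@example.com")
    elapsed = time.monotonic() - start
    return page == "confirmation" and elapsed < 2.0


SCENARIOS = [
    Scenario(
        "When a user submits the form with a valid email, "
        "the confirmation page appears within 2 seconds.",
        confirmation_within_two_seconds,
    ),
]


def run_suite(app: App, scenarios: list) -> dict:
    """Run every scenario against the built output; map description to pass/fail."""
    return {s.description: s.check(app) for s in scenarios}


results = run_suite(fake_submit_form, SCENARIOS)
ships = all(results.values())   # pass = ships, fail = the spec needs refinement
print("ships" if ships else "refine the spec")
```

The scenario only touches the application through its external behavior, which is what allows it to stay outside the codebase the agent works in.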

Pillar 3: The Agent Orchestration Layer

The automation infrastructure that routes specs to agents, manages agent runs, collects outputs, and triggers evaluation.

Agent selection — Which agent or model runs which type of spec? Different settings per task type.
Run management — How do you launch agent runs without babysitting them? Background mode, CI/CD pipelines, or orchestration tools.
Output collection — Where does the agent's output go? A holding branch, a staging environment, an output directory.
Evaluation trigger — After every agent run, evaluation runs automatically. You see a pass/fail summary, not a code diff.
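
The four responsibilities can be sketched as a single function that routes a spec, launches the build, records where the output went, and triggers evaluation. Everything named here is hypothetical: the `AGENT_SETTINGS` table, the model labels, and the `stub_build` and `stub_evaluate` functions stand in for whatever agent launcher and scenario suite you actually use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical routing table: spec category -> agent settings.
AGENT_SETTINGS = {
    "architectural": {"model": "large-model", "max_steps": 200},
    "feature": {"model": "medium-model", "max_steps": 80},
    "constraint": {"model": "small-model", "max_steps": 30},
}


@dataclass
class RunResult:
    spec_id: str
    output_location: str
    passed: bool


def run_spec(spec_id: str, category: str,
             build: Callable[[str, dict], str],
             evaluate: Callable[[str], bool]) -> RunResult:
    settings = AGENT_SETTINGS[category]          # agent selection
    location = build(spec_id, settings)          # run management and output collection
    return RunResult(spec_id, location, evaluate(location))  # evaluation trigger


def summarize(results: list) -> str:
    """Pass/fail summary; no code diff is shown."""
    return "\n".join(
        f"{'PASS' if r.passed else 'FAIL'}  {r.spec_id}  {r.output_location}"
        for r in results
    )


# Stubs standing in for a real agent launcher and a real scenario suite.
def stub_build(spec_id: str, settings: dict) -> str:
    return f"staging/{spec_id}"


def stub_evaluate(location: str) -> bool:
    return not location.endswith("broken-spec")


results = [
    run_spec("signup-form", "feature", stub_build, stub_evaluate),
    run_spec("broken-spec", "constraint", stub_build, stub_evaluate),
]
print(summarize(results))
```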

Pillar 4: The Digital Environment

Agents need a safe environment to build and test without touching real systems.

A local or sandbox environment that mirrors production behavior closely enough for integration testing.
Mocked versions of external services you depend on — payment APIs, email services, third-party data sources.
Test data sets that represent real usage patterns without containing real user data.
An automated deployment pipeline that can push agent-generated code to staging without human intervention.
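
A mocked external service can be as simple as an in-memory object with the same interface the real one would expose. The sketch below assumes a hypothetical payment gateway with a single `charge` method; the class, method, and field names are invented for the example and do not correspond to any real provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class FakePaymentGateway:
    """In-memory stand-in for a payment API; it never contacts a real provider."""
    charges: list = field(default_factory=list)

    def charge(self, customer_id: str, amount_cents: int) -> dict:
        if amount_cents <= 0:
            return {"status": "rejected", "reason": "invalid_amount"}
        record = {
            "status": "succeeded",
            "customer_id": customer_id,
            "amount_cents": amount_cents,
        }
        self.charges.append(record)   # kept so tests can inspect what was charged
        return record


def checkout(gateway, customer_id: str, cart_total_cents: int) -> str:
    """Code under test depends only on the gateway's interface, not on a vendor."""
    result = gateway.charge(customer_id, cart_total_cents)
    return "order_confirmed" if result["status"] == "succeeded" else "payment_failed"


gateway = FakePaymentGateway()
print(checkout(gateway, "test-customer-1", 1999))   # order_confirmed
print(checkout(gateway, "test-customer-1", 0))      # payment_failed
print(len(gateway.charges))                         # 1
```

Because the customer IDs and amounts are synthetic, the same fake also covers the requirement that test data contain no real user data.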

The Process: Getting to Level 4

Level 4 is where a non-coder can operate at full capability. You are not writing code. You are not reviewing code. You are writing specifications and evaluating outcomes.

Level 0–3 Thinking	Level 4 Thinking
"Is this code correct?"	"Does this behavior match the spec?"
"What does this function do?"	"What outcome should this feature produce?"
"How do I implement this?"	"How do I specify this precisely enough for an agent?"
"Let me review this diff."	"Let me define what passing looks like."
"The AI made an error in line 47."	"The spec was ambiguous about this edge case."

Job 1: Writing Specifications That Machines Can Execute

This is the hardest skill at Level 4. Most people fail not because the AI is insufficient, but because their specs are insufficient. A specification must answer seven questions:

WHAT — What does this feature do from the user's perspective? Describe behavior, not implementation.
WHO — Who is the user? What are they trying to accomplish?
INPUTS — What data, files, or user actions trigger this feature? What are the valid and invalid inputs?
OUTPUTS — What exactly does the system produce? What format? What are the edge cases?
RULES — What business logic governs the behavior? What should never happen?
EVALUATION — How will you know if this worked? What tests or scenarios will you run?
CONSTRAINTS — What are the performance, cost, security, or compatibility requirements?

Read your spec aloud to someone who has never used the product. Can they describe back to you exactly what the feature should do — including at the edges? If not, the spec is not ready for an agent.
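
A mechanical completeness check can run before the read-aloud test. The sketch below assumes a plain-text spec in which each of the seven questions appears as a `NAME:` heading; that file layout is an assumption made for the example, not a required format.

```python
SEVEN_QUESTIONS = ("WHAT", "WHO", "INPUTS", "OUTPUTS",
                   "RULES", "EVALUATION", "CONSTRAINTS")


def parse_spec(text: str) -> dict:
    """Collect the text under each 'NAME:' heading (an assumed layout)."""
    answers = {}
    current = None
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if sep and head.strip() in SEVEN_QUESTIONS:
            current = head.strip()
            answers[current] = rest.strip()
        elif current is not None and line.strip():
            # Continuation line: append it to the current answer.
            answers[current] = (answers[current] + " " + line.strip()).strip()
    return answers


def unanswered(text: str) -> list:
    """Questions that are missing from the spec or left blank."""
    answers = parse_spec(text)
    return [q for q in SEVEN_QUESTIONS if not answers.get(q)]


draft = """\
WHAT: A visitor can subscribe to the newsletter.
WHO: First-time visitors on the landing page.
INPUTS: An email address.
  Invalid: empty string, missing '@'.
OUTPUTS:
RULES: Never subscribe the same address twice.
"""
print(unanswered(draft))   # ['OUTPUTS', 'EVALUATION', 'CONSTRAINTS']
```

A check like this only proves that every question has some answer. Whether the answer is precise enough remains a human judgment.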

Job 2: Defining Evaluation Criteria Before Building

Evaluation criteria must be defined before building begins. They come in two layers:

Functional evaluation — Does the software do what the spec said? Behavioral tests defined in plain language before the agent starts building.
Quality evaluation — Does the output meet your standards beyond raw functionality? Speed, readability, error handling, edge case coverage.

Job 3: Context Engineering

A Level 4 operator builds and maintains the context infrastructure that makes every spec more effective:

Product context documents — Who are your users? What problems are they solving? What are the core principles of the product?
Technical standards docs — What tech stack? What naming conventions? What does good code look like in your system?
Past decisions log — Why did you build things the way you did? What tradeoffs were made?
Domain glossary — Define every key term precisely. Ambiguous domain language produces ambiguous software.

Job 4: Architectural Direction

Level 4 operators make architectural decisions. They do not implement them — but they define the shape of the system:

Deciding how data flows through the system — what goes where and why.
Choosing which components are separate vs. combined — and specifying the seams between them.
Defining what the system should never do — the guardrails that protect users and the business.
Setting performance and cost targets that agents must hit.

Job 5: Iteration and Quality Judgment

You specify → agent builds → you evaluate → you refine spec → agent rebuilds. When output fails evaluation, the first question is never "What did the AI do wrong?" The first question is "What was ambiguous in my spec?"
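
The cycle can be written down as a loop in which a failed evaluation always changes the spec and never the code. This is a toy sketch: `build`, `evaluate`, and `refine` are hypothetical stand-ins for an agent run, a scenario suite, and a human editing the spec.

```python
def iterate(spec, build, evaluate, refine, max_rounds=5):
    """Run the specify/build/evaluate/refine loop until evaluation passes."""
    for round_number in range(1, max_rounds + 1):
        output = build(spec)
        failures = evaluate(output)
        if not failures:
            return spec, round_number
        spec = refine(spec, failures)   # in practice, a human edits the spec here
    raise RuntimeError(f"still failing after {max_rounds} rounds: {failures}")


# Toy stand-ins: the "agent" implements exactly what the spec mentions.
def build(spec):
    return {"handles_empty_email": "empty email" in spec}


def evaluate(output):
    return [] if output["handles_empty_email"] else ["empty email accepted"]


def refine(spec, failures):
    return spec + " Reject an empty email with a visible error."


final_spec, rounds = iterate("Subscribe a visitor by email.", build, evaluate, refine)
print(rounds)       # 2
print(final_spec)
```

The first round fails because the spec never mentioned empty input, which is the "what was ambiguous in my spec?" question expressed in code.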

Building the Factory: A Practical Roadmap

Phase 1: Foundations (Weeks 1–4)

Develop the discipline of writing good specs — even if you are still at Level 3 and reviewing all the code.

Adopt a spec template and use it for every feature, no matter how small.
Write evaluation criteria before every agent run. No exceptions.
Build your context library: product context, technical standards, domain glossary.
Create a decisions log and document every architectural choice with its rationale.

Phase 2: Automation (Weeks 4–10)

Remove yourself from the loop between spec and evaluation.

Set up automated agent runs — the agent starts when the spec is marked Ready.
Build your first scenario suite — 10–20 behavioral checks for core product flows.
Create the separation: scenarios live outside the codebase.
Establish your staging environment and deploy agent outputs there automatically.

Phase 3: The Feedback Loop (Weeks 10–20)

Systematize the improvement cycle. Every failed evaluation is a spec gap. Every spec gap is a lesson.

Create a failure taxonomy: categorize every evaluation failure by type.
Update spec templates based on recurring failure patterns.
Build agent-specific context: custom system prompts that encode your standards.
Expand your scenario suite to cover every significant user-facing behavior.
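
A failure taxonomy can start as a short list of categories and a counter. In the sketch below the `FailureType` categories, the log entries, and the threshold of three are all illustrative assumptions; the point is only that recurring categories are what should drive template changes.

```python
from collections import Counter
from enum import Enum


class FailureType(Enum):
    # Hypothetical categories; grow the list as new patterns show up.
    AMBIGUOUS_SPEC = "ambiguous spec"
    MISSING_EDGE_CASE = "missing edge case"
    MISSING_CONTEXT = "missing context"
    ENVIRONMENT = "environment problem"


# One entry per failed evaluation: (spec id, category assigned during triage).
failure_log = [
    ("signup-form", FailureType.MISSING_EDGE_CASE),
    ("password-reset", FailureType.AMBIGUOUS_SPEC),
    ("invoice-export", FailureType.MISSING_EDGE_CASE),
    ("profile-page", FailureType.MISSING_EDGE_CASE),
    ("search", FailureType.MISSING_CONTEXT),
]


def recurring_failures(log, threshold=3):
    """Categories seen at least `threshold` times: candidates for a template change."""
    counts = Counter(kind for _, kind in log)
    return [kind for kind, n in counts.items() if n >= threshold]


for kind in recurring_failures(failure_log):
    print(f"Recurring: {kind.value}")   # Recurring: missing edge case
```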

Phase 4: Scale and Specialize (Months 5–12)

Dark Factory operation at scale. Multiple specs in flight simultaneously.

Run multiple agent workstreams in parallel — features, documentation, test coverage.
Implement cost controls — set per-run budgets.
Build a spec pipeline — a queue of specs that agents pull from autonomously.
Reach the Dark Factory state: the factory runs and ships while you think about the next quarter.
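
A spec queue with per-run budgets might look like the following sketch. The `RunBudget` class, the `drain_queue` function, and the per-step costs are hypothetical; real costs would come from whatever usage reporting your agent tooling provides.

```python
from collections import deque


class BudgetExceeded(Exception):
    pass


class RunBudget:
    """Tracks spend for one agent run, in cents to avoid float rounding."""

    def __init__(self, cap_cents: int):
        self.cap_cents = cap_cents
        self.spent_cents = 0

    def charge(self, cents: int) -> None:
        if self.spent_cents + cents > self.cap_cents:
            raise BudgetExceeded(f"cap of {self.cap_cents} cents would be exceeded")
        self.spent_cents += cents


def drain_queue(queue, step_costs, cap_cents):
    """Pull specs in order; halt a run as soon as it would exceed its cap."""
    outcomes = {}
    while queue:
        spec_id = queue.popleft()
        budget = RunBudget(cap_cents)          # a fresh budget for every run
        try:
            for cost in step_costs[spec_id]:   # hypothetical cost of each agent step
                budget.charge(cost)
            outcomes[spec_id] = "completed"
        except BudgetExceeded:
            outcomes[spec_id] = "halted"
    return outcomes


queue = deque(["signup-form", "report-export"])
step_costs = {"signup-form": [40, 35], "report-export": [60, 60, 60]}
print(drain_queue(queue, step_costs, cap_cents=150))
# {'signup-form': 'completed', 'report-export': 'halted'}
```

Checking the cap before each step, not after the run, is what keeps one runaway spec from consuming the budget of everything queued behind it.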

What Changes in the Organization

What Disappears

Sprint ceremonies — agents do not need weekly reprioritization.
Code review as a gate — agent output is evaluated against scenarios, not reviewed by humans.
Most of engineering management as currently practiced.
Traditional QA workflows — behavioral scenarios replace manual test passes.

What Transforms

The PM role transforms from feature requestor to spec architect.
The engineering manager role transforms from coordinator to specification designer.
The QA role transforms from running manual tests to building and maintaining the scenario suite.
The technical writer role transforms from extracting knowledge to maintaining the context library.

What Stays — And Becomes More Valuable

Deep understanding of what users actually need. Judgment about what is worth building and why. The ability to write specifications precise enough for autonomous execution. Pattern recognition in evaluation results. Architectural thinking. Customer relationships.

The Dark Factory does not need more engineers. It needs better thinkers — people who understand customers deeply, think in systems, and can articulate what needs to exist before it exists at all.

Failure Mode Reference

Symptom	Root Cause & Fix
"The agent keeps misunderstanding what I want."	Spec is ambiguous. Add Given/When/Then scenarios. Define terms in your glossary.
"The output looks right but behaves wrong."	No external scenarios. Agent is passing tests it wrote for itself. Build holdout scenarios.
"The factory is too slow."	You are watching agent runs instead of batching. Review results, not process.
"I keep needing to revise specs 3–4 times."	Missing edge cases in template. Add a required edge-case section.
"I do not trust the output enough to stop reviewing code."	Scenario coverage is too thin. Expand scenarios until the evaluation gate earns your trust.
"Costs are escalating unpredictably."	No per-run budgets. Set agent cost caps. Use cheaper models for lower-stakes tasks.