When AI Becomes a Team
Last week I built a pipeline where Claude orchestrated a team of local LLMs — assigning tasks, reviewing outputs, deciding what was good enough to keep. Qwen generated. Claude judged. The thing that struck me wasn't the output quality. It was how human it felt. Here's the agent. Drop it in your .claude/agents/ directory.
✓ Profiles all local models by capability
✓ Decomposes tasks into routable subtasks
✓ Quality gates between every phase
✓ Parallel execution where dependencies allow
✓ Final integration + polish pass
claude-lmstudio-pipeline.md
PHASE 1 — MODEL CAPABILITY PROFILING
Audit all local LLMs · Score across 10 dimensions
code-gen / reasoning / speed / structured-output / ...
Build a routing map before touching any task
PHASE 2 — TASK DECOMPOSITION
Parse task → identify all output artifacts
Map subtasks to capability dimensions
Flag critical path items · Identify parallel work
PHASE 3 — EXECUTION WITH QUALITY GATES
Route each subtask to its best-fit model
Fail fast: re-prompt → escalate → flag for human
Parallel execution where dependencies allow
PHASE 4 — INTEGRATION & POLISH
Assemble → consistency check → polish pass
Final audit against original requirements
Deliver with routing summary

How it works
1. Model capability profiling
Before routing anything, the agent audits all local LLMs and scores them across 10 dimensions: code generation, reasoning, speed, structured output, and more.
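The capability map can be as simple as a score table. A minimal sketch in Python, where the model names, dimensions, and scores are all hypothetical placeholders (a real agent would measure them by running benchmark prompts against each local model):

```python
# Hypothetical capability map: scores (0-10) per dimension for each local model.
# A real agent would populate this by benchmarking, not hard-coding.
CAPABILITY_MAP = {
    "qwen2.5-coder-14b": {"code-gen": 9, "reasoning": 6, "speed": 5, "structured-output": 8},
    "llama-3.1-8b":      {"code-gen": 6, "reasoning": 7, "speed": 8, "structured-output": 6},
    "phi-3-mini":        {"code-gen": 4, "reasoning": 5, "speed": 10, "structured-output": 5},
}

def best_model(dimension: str) -> str:
    """Return the local model with the highest score on one capability dimension."""
    return max(CAPABILITY_MAP, key=lambda m: CAPABILITY_MAP[m].get(dimension, 0))

print(best_model("code-gen"))  # qwen2.5-coder-14b
print(best_model("speed"))     # phi-3-mini
```

In practice each subtask would ask for the best model along its dominant dimension, with the full score vector available for tie-breaking.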
2. Task decomposition
Breaks your task into atomic subtasks, maps each to a capability dimension, identifies dependencies, and builds a routing plan before touching any model.
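The dependency-aware plan above can be sketched as a small scheduler: group subtasks into "waves" where everything in a wave has its dependencies already done, so a wave can run in parallel. The subtask names and dimensions here are illustrative, not from the actual agent:

```python
from dataclasses import dataclass, field

@dataclass
class Subtask:
    name: str
    dimension: str                      # capability dimension this subtask needs
    deps: list = field(default_factory=list)

def execution_waves(subtasks):
    """Group subtasks into waves; each wave's deps are satisfied, so it can run in parallel."""
    done, waves, pending = set(), [], list(subtasks)
    while pending:
        wave = [t for t in pending if all(d in done for d in t.deps)]
        if not wave:
            raise ValueError("dependency cycle in subtask plan")
        waves.append([t.name for t in wave])
        done.update(t.name for t in wave)
        pending = [t for t in pending if t.name not in done]
    return waves

plan = [
    Subtask("api-code", "code-gen"),
    Subtask("docs", "structured-output", deps=["api-code"]),
    Subtask("tests", "code-gen", deps=["api-code"]),
]
print(execution_waves(plan))  # [['api-code'], ['docs', 'tests']]
```

Anything in the same wave is safe to fan out across local models simultaneously; the critical path is whatever forces extra waves.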
3. Execution with quality gates
Each subtask runs with a precision prompt tailored to its assigned model. Failures escalate: re-prompt first, then a higher-tier model, then a human review flag.
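The escalation ladder is just a loop over models with a retry inside it. A toy sketch, with stub `run` and `check` functions standing in for real model calls and quality gates (all names hypothetical):

```python
def run_with_gates(subtask, run, check, ladder, max_retries=1):
    """Try each model in the escalation ladder, re-prompting before escalating.
    Returns (model, output) on the first pass, or ('human-review', None) if all fail."""
    for model in ladder:
        for _attempt in range(1 + max_retries):
            output = run(model, subtask)
            if check(output):            # quality gate at the subtask level
                return model, output
    return "human-review", None          # no silent failures

# Stubs simulating one weak model and one that produces valid code.
outputs = {"phi-3-mini": "junk", "qwen2.5-coder-14b": "def add(a, b): return a + b"}
run = lambda model, task: outputs[model]
check = lambda out: out.startswith("def ")

model, result = run_with_gates("write add()", run, check,
                               ladder=["phi-3-mini", "qwen2.5-coder-14b"])
print(model)  # qwen2.5-coder-14b
```

The point of the structure: a failure never reaches final assembly, and the worst case is an explicit human-review flag rather than a quietly wrong deliverable.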
4. Integration and polish
Assembles the final deliverable, resolves inconsistencies, runs a polish pass, and delivers a routing summary showing which model handled what and why.
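The routing summary at the end of the run can be a plain table of decisions and reasons. A minimal sketch with made-up decisions:

```python
def routing_summary(decisions):
    """Render which model handled which subtask, and why (the transparency rail)."""
    return "\n".join(
        f"{task:<10} -> {model:<20} ({reason})"
        for task, (model, reason) in decisions.items()
    )

decisions = {
    "api-code": ("qwen2.5-coder-14b", "top code-gen score"),
    "docs":     ("llama-3.1-8b", "good structured output, fast"),
}
print(routing_summary(decisions))
```

Keeping the reason alongside the choice is what makes the routing auditable after the fact.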
Built-in safety rails
Fail fast: quality failures are caught at the subtask level, never at final delivery.
Escalation ladder: re-prompt → higher-tier model → human review. No silent failures.
Output contracts: every subtask defines its exact format and structure before execution.
Consistency checks: cross-subtask validation (does the code match the docs?).
Transparent routing: shows all routing decisions and reasoning before executing.
Persistent memory: the capability map improves with every run and learns what works.

★ Why roles matter more than raw capability
Everyone's racing to use the best model. The more interesting question is what happens when models work for each other. Claude holds the ORCHESTRATOR and JUDGE roles explicitly — it never auto-approves anything. Local models execute fast and free; Claude decides what passes. The structure of accountability is what made it feel human. Not the output quality.