Tutorial: Agent Orchestration¶
Skill profiles are not a heuristic for matching — they are structured context you feed to an LLM orchestrator alongside cost, latency, and business constraints. The LLM makes the final routing decision; skillinfer gives it the capability data it needs.
What you'll learn
- Building profiles from partial evaluations across a pool of agents
- Combining skill profiles with operational metadata (cost, latency)
- Formatting everything into a prompt for an LLM orchestrator
- Letting the LLM reason about observed vs. inferred skills
The pattern¶
Partial evals → skillinfer profiles → LLM prompt → Routing decision
(structured context) + cost/latency
+ constraints
skillinfer handles step 2: turning sparse, partial evaluations into full skill profiles with calibrated uncertainty. The orchestrating LLM handles the rest — weighing capability against cost, noting which skills are observed vs. inferred, and applying business rules you couldn't encode in a scoring function.
Step 1: Build skill profiles¶
import pandas as pd
import skillinfer
from skillinfer import Task
# Historical data: models you've fully evaluated
history = pd.DataFrame({
"math": [0.88, 0.75, 0.70, 0.82, 0.65, 0.90, 0.72, 0.68],
"code": [0.85, 0.80, 0.65, 0.78, 0.60, 0.88, 0.70, 0.62],
"reasoning": [0.90, 0.82, 0.75, 0.85, 0.70, 0.92, 0.78, 0.72],
"writing": [0.80, 0.85, 0.90, 0.75, 0.88, 0.78, 0.82, 0.92],
"creativity":[0.70, 0.78, 0.85, 0.68, 0.82, 0.65, 0.75, 0.88],
}, index=[f"model_{i}" for i in range(8)])
pop = skillinfer.Population.from_dataframe(history, normalize=False)
Each new agent has been evaluated on a different subset of skills. skillinfer infers the rest from covariance:
agent_observations = {
"gpt-4o": {"reasoning": 0.92, "code": 0.89},
"claude-3.5": {"reasoning": 0.90, "writing": 0.95},
"llama-3-70b": {"code": 0.78},
"gemini-pro": {"reasoning": 0.85, "math": 0.88, "code": 0.82},
"mistral-large": {"writing": 0.80},
}
profiles = {
name: pop.profile().observe_many(obs)
for name, obs in agent_observations.items()
}
Now every agent has a full 5-skill profile — even llama-3-70b, which was only evaluated on code.
Step 2: Add operational metadata¶
Skill profiles are necessary but not sufficient. The orchestrator also needs cost, latency, and capacity data:
agent_metadata = {
"gpt-4o": {"cost_per_1k": 0.005, "avg_latency_ms": 800, "rate_limit_rpm": 500},
"claude-3.5": {"cost_per_1k": 0.003, "avg_latency_ms": 600, "rate_limit_rpm": 1000},
"llama-3-70b": {"cost_per_1k": 0.001, "avg_latency_ms": 1200, "rate_limit_rpm": 200},
"gemini-pro": {"cost_per_1k": 0.004, "avg_latency_ms": 700, "rate_limit_rpm": 600},
"mistral-large": {"cost_per_1k": 0.002, "avg_latency_ms": 500, "rate_limit_rpm": 800},
}
Step 3: Build the orchestrator prompt¶
This is the key step — combining skill profiles, task requirements, metadata, and constraints into a single prompt:
task = Task(
{"math": 1.0, "reasoning": 0.8, "code": 0.3},
"Solve a competition-level math problem that requires writing a Python proof",
)
# Build agent context blocks
agent_blocks = []
for name, profile in profiles.items():
lines = [f"{name}:"]
# Skill predictions — the LLM sees what's observed vs. inferred
lines.append(" Skill profile:")
for skill in sorted(task.weights, key=lambda s: -task.weights[s]):
pred = profile.predict(skill)
source = "observed" if pred["std"] < 0.01 else "inferred"
lines.append(f" {skill}: {pred['mean']:.2f} ± {pred['std']:.2f} ({source})")
# Overall task fit
result = profile.match_score(task, threshold=0.8)
lines.append(f" Task fit: {result.score:.2f} (P>0.8 = {result.p_above_threshold:.0%})")
# Operational metadata
meta = agent_metadata[name]
lines.append(f" Cost: ${meta['cost_per_1k']}/1k tokens, "
f"Latency: {meta['avg_latency_ms']}ms, "
f"Rate limit: {meta['rate_limit_rpm']} req/min")
agent_blocks.append("\n".join(lines))
The prompt gives the LLM the full picture:
prompt = f"""You are an AI orchestrator. A task has arrived and you need to
assign it to the best available agent.
## Task
{task.description}
Required skills (importance weights): {task.weights}
## Available agents
Each agent has a skill profile built from partial evaluations. "Observed" skills
were directly measured; "inferred" skills are predicted from correlated observations
(with higher uncertainty). Task fit is the expected performance score (0-1 scale).
{chr(10).join(agent_blocks)}
## Constraints
- The task is latency-sensitive: prefer agents under 1000ms
- Budget is moderate: avoid the most expensive option unless clearly the best fit
- If an agent's key skills are "inferred" rather than "observed", note the added risk
## Instructions
Pick one agent. Explain your reasoning in 2-3 sentences, referencing the skill
profiles, confidence levels, and operational constraints. Then state your choice
on a final line as: ASSIGNED: <agent name>"""
The resulting prompt looks like:
gpt-4o:
Skill profile:
math: 0.89 ± 0.04 (inferred)
reasoning: 0.92 ± 0.00 (observed)
code: 0.89 ± 0.00 (observed)
Task fit: 0.90 (P>0.8 = 100%)
Cost: $0.005/1k tokens, Latency: 800ms, Rate limit: 500 req/min
gemini-pro:
Skill profile:
math: 0.88 ± 0.00 (observed)
reasoning: 0.85 ± 0.00 (observed)
code: 0.82 ± 0.00 (observed)
Task fit: 0.86 (P>0.8 = 100%)
Cost: $0.004/1k tokens, Latency: 700ms, Rate limit: 600 req/min
claude-3.5:
Skill profile:
math: 0.82 ± 0.04 (inferred)
reasoning: 0.90 ± 0.00 (observed)
code: 0.81 ± 0.04 (inferred)
Task fit: 0.85 (P>0.8 = 99%)
Cost: $0.003/1k tokens, Latency: 600ms, Rate limit: 1000 req/min
...
Notice what the LLM can reason about that a scoring function can't:
- gpt-4o has the best task fit but its math score is inferred — should the LLM trust it?
- gemini-pro has all three skills observed (100% confidence) at a lower cost
- claude-3.5 is cheapest and fastest, but both math and code are inferred
Step 4: Call the orchestrator¶
from openai import OpenAI
client = OpenAI() # uses OPENAI_API_KEY env var
response = client.chat.completions.create(
model="gpt-4o-mini",
max_tokens=300,
messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
Considering the task requires a balance of math, reasoning, and coding skills,
gpt-4o stands out with the highest task fit (0.90) and 100% probability of
exceeding the threshold. However, its math score is inferred rather than observed.
Gemini-pro offers all three skills directly observed with 100% confidence at a
lower cost ($0.004 vs $0.005/1k) and lower latency (700ms vs 800ms). Given the
moderate budget constraint and latency sensitivity, gemini-pro provides the best
balance of confidence and cost.
ASSIGNED: gemini-pro
The LLM chose gemini-pro over gpt-4o — not because its score was higher (it wasn't), but because all its relevant skills were observed rather than inferred, and it's cheaper. This is exactly the kind of nuanced decision that requires an LLM, not a max() call.
Full example¶
See examples/orchestration.py for the complete runnable script. Without an API key, it prints the full prompt so you can inspect the context.
Key takeaway¶
skillinfer is not the orchestrator — it's the context layer. It turns partial, heterogeneous evaluations into structured skill profiles that an LLM can reason about. The LLM then combines capability data with cost, latency, confidence levels, and business constraints to make a routing decision that no scoring function could replicate.