skillinfer¶

Infer a full skill profile from a few observations.

Observe a few skills, predict the rest — with calibrated uncertainty. skillinfer learns how capabilities co-vary across a population and uses that structure to infer a full profile from partial observations.

import skillinfer

pop = skillinfer.datasets.onet()         # 894 occupations x 120 skills
profile = pop.profile()                  # new entity, nothing observed yet
profile.observe("Skill:Programming", 0.92)
print(profile.predict())                 # predict all 120 skills

Output — one observation, 120 predicted

                           feature   mean    std  ci_lower  ci_upper
    Skill:Complex Problem Solving   0.81   0.17      0.47      1.00
          Skill:Critical Thinking   0.73   0.15      0.43      1.00
              Skill:Programming     0.92   0.01      0.90      0.93  ← observed
                Skill:Mathematics   0.67   0.12      0.43      0.91
         Ability:Static Strength   0.10   0.23      0.00      0.55  ← anti-correlated
...                                 ...    ...       ...       ...
[120 rows x 5 columns]
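The `ci_lower` and `ci_upper` columns above are consistent with a 95% normal-approximation interval around `mean`, clipped to the [0, 1] score range. The sketch below is inferred from the printed numbers only and is not confirmed as skillinfer's implementation; `interval` and its parameters are hypothetical names.

```python
from statistics import NormalDist


def interval(mean, std, level=0.95, lo=0.0, hi=1.0):
    """Two-sided normal interval around `mean`, clipped to [lo, hi].

    Assumption: scores live on a 0-1 scale and uncertainty is roughly normal.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    lower = max(lo, mean - z * std)
    upper = min(hi, mean + z * std)
    return lower, upper


# Mean and std taken from the Skill:Mathematics row above
lower, upper = interval(0.67, 0.12)
print(f"{lower:.2f} {upper:.2f}")
```

With these inputs the sketch gives 0.43 and 0.91, matching the Mathematics row to two decimals. Clipping is an assumption: the two-observation table later on this page prints upper bounds above 1.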

AI agents¶

Run one benchmark, predict the rest. Build a full capability profile for any model and hand it to an orchestrator — route tasks to the right model without exhaustive evals.

Humans¶

Observe a few tasks, infer 120+ competencies. Surface skill gaps, match candidates to roles, and prioritize what to assess next — all from partial data.

Human-AI teams¶

People and models profiled in the same skill space. Compare everyone on the same dimensions, then let an orchestrator assign each subtask to whoever is strongest.

Fast and scalable¶

No training loop, no GPU, no neural network. One matrix operation gives you the exact Bayesian posterior. Each update takes under 1 ms, and the method scales to 1000+ skills.
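One way to read "one matrix operation" is as standard Gaussian conditioning: treat the population mean and covariance as a prior and update it with a noisy measurement of a single skill. The sketch below assumes that model, which this page does not state explicitly. The numbers are made up, and `observe_skill` and `noise_var` are hypothetical names, not skillinfer's API.

```python
def observe_skill(mean, cov, index, value, noise_var=1e-4):
    """Condition a Gaussian belief (mean, cov) on one noisy scalar observation.

    This is the rank-one form of the Kalman/Gaussian update:
      gain = cov[:, index] / (cov[index][index] + noise_var)
    """
    n = len(mean)
    innovation = value - mean[index]          # how surprising the observation is
    s = cov[index][index] + noise_var         # predicted variance of the observation
    gain = [cov[i][index] / s for i in range(n)]
    new_mean = [mean[i] + gain[i] * innovation for i in range(n)]
    new_cov = [
        [cov[i][j] - gain[i] * cov[index][j] for j in range(n)]
        for i in range(n)
    ]
    return new_mean, new_cov


# Hypothetical prior over three skills: programming, mathematics, static strength
skills = ["programming", "mathematics", "static_strength"]
prior_mean = [0.5, 0.5, 0.4]
prior_cov = [
    [0.04, 0.03, -0.02],
    [0.03, 0.04, -0.01],
    [-0.02, -0.01, 0.05],
]

# Observe programming = 0.92 and propagate to the other skills
post_mean, post_cov = observe_skill(prior_mean, prior_cov, index=0, value=0.92)
for name, m, row_index in zip(skills, post_mean, range(len(skills))):
    std = post_cov[row_index][row_index] ** 0.5
    print(f"{name:<16} mean={m:.2f} std={std:.2f}")
```

In this toy prior, the positively correlated skill moves up with the observation, the anti-correlated one moves down, and every variance shrinks. No iteration or training is involved, which is in keeping with the claim above.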

pip install skillinfer

Core types¶

Type	What it is	Example
`Population`	Learned covariance from a population of entities	`Population.from_dataframe(df)`
`Profile`	One entity's skill profile — gets sharper with observations	`pop.profile()` → `.observe()` → `.predict()`
`Skill`	A label-score pair for a single skill measurement	`Skill("Programming", score=0.92)`
`Task`	A weighted mix of skills describing what a job requires	`Task({"math": 1.0, "code": 0.5})`

Use cases¶

Domain	Observe	Predict	Scale
AI model selection	1-2 benchmark scores	All benchmarks + best model for a task	4,500+ models x 6-38 benchmarks
Human skill profiling	A few task observations	Full occupational skill profile	894 occupations x 120 skills
Human-AI orchestration	Partial evals for both	Who handles which subtask	Mixed pools, same skill space
Worker-task matching	Known competencies	Fit for new roles and tasks	Skills, knowledge, abilities

Examples¶

skillinfer works anywhere entities have correlated capabilities. The same API profiles an LLM, a job candidate, or a hybrid team.

AI Agents | Humans | Human-AI Teams

import pandas as pd
import skillinfer

# 4,500+ LLMs from the Open LLM Leaderboard
df = pd.read_parquet("hf://datasets/open-llm-leaderboard/contents/data/train-00000-of-00001.parquet")
benchmarks = ["IFEval", "BBH", "MATH Lvl 5", "GPQA", "MUSR", "MMLU-PRO"]
df = df[["fullname"] + benchmarks].dropna().set_index("fullname")

# Hold out Llama-3.1-70B-Instruct, build population from the rest
model = "meta-llama/Llama-3.1-70B-Instruct"
true_scores = df.loc[model]
pop = skillinfer.Population.from_dataframe(df.drop(model), normalize=False)

# Observe ONE benchmark, predict the other five
profile = pop.profile()
profile.observe("BBH", true_scores["BBH"])

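# Compare predictions to ground truth
errors = {}
for b in benchmarks:
    pred = profile.mean(b)
    errors[b] = abs(true_scores[b] - pred)
    tag = "  ← observed" if b == "BBH" else ""
    print(f"  {b:<15} true={true_scores[b]:5.1f}  pred={pred:5.1f}  err={errors[b]:4.1f}{tag}")

# Mean absolute error over the benchmarks that were not observed
unobserved = [b for b in benchmarks if b != "BBH"]
mae = sum(errors[b] for b in unobserved) / len(unobserved)
print(f"\n  MAE on {len(unobserved)} unobserved benchmarks: {mae:.1f}")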

Output — one observation predicts five benchmarks

  IFEval          true= 86.7  pred= 69.3  err=17.4
  BBH             true= 55.9  pred= 55.9  err= 0.0  ← observed
  MATH Lvl 5      true= 38.1  pred= 35.1  err= 3.0
  GPQA            true= 14.2  pred= 14.6  err= 0.4
  MUSR            true= 17.7  pred= 17.8  err= 0.1
  MMLU-PRO        true= 47.9  pred= 50.7  err= 2.9

  MAE on 5 unobserved benchmarks: 4.7

import skillinfer

# O*NET 30.2: 894 occupations x 120 skills, knowledge, abilities
pop = skillinfer.datasets.onet()

profile = pop.profile()
profile.observe("Skill:Programming", 0.92)
profile.observe("Skill:Critical Thinking", 0.85)

print(profile.predict())                   # predict all 120

Output — two observations, 120 predicted

                          feature   mean    std  ci_lower  ci_upper
          Skill:Active Learning   0.94   0.12      0.71      1.17
         Skill:Active Listening   0.74   0.10      0.55      0.93
  Skill:Complex Problem Solving   1.02   0.09      0.83      1.20
        Skill:Critical Thinking   0.85   0.01      0.83      0.87  ← observed
            Skill:Programming     0.92   0.01      0.91      0.93  ← observed
              Skill:Mathematics   0.81   0.11      0.59      1.03
...                               ...    ...       ...       ...
[120 rows x 5 columns]
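The "prioritize what to assess next" idea can be illustrated under the same Gaussian assumption: pick the unobserved skill whose measurement would remove the most total variance from the profile. This is a sketch of one possible criterion, not a description of what skillinfer does; `next_skill_to_assess`, `noise_var`, and the covariance values are hypothetical.

```python
def next_skill_to_assess(cov, names, observed, noise_var=1e-4):
    """Return the unobserved skill whose measurement shrinks total variance most.

    For a rank-one Gaussian update on skill j, the drop in trace(cov) is
    sum_i cov[i][j]**2 / (cov[j][j] + noise_var).
    """
    best_name, best_reduction = None, float("-inf")
    for j, name in enumerate(names):
        if name in observed:
            continue
        reduction = sum(row[j] ** 2 for row in cov) / (cov[j][j] + noise_var)
        if reduction > best_reduction:
            best_name, best_reduction = name, reduction
    return best_name, best_reduction


# Hypothetical current covariance over three skills (symmetric)
names = ["programming", "mathematics", "static_strength"]
cov = [
    [0.04, 0.03, -0.02],
    [0.03, 0.04, -0.01],
    [-0.02, -0.01, 0.05],
]

skill, reduction = next_skill_to_assess(cov, names, observed=set())
print(skill, round(reduction, 4))
```

Skills that co-vary strongly with many others score highest, because measuring them tightens the most predictions at once. In practice `cov` would be the profile's current posterior covariance, so the recommendation changes after each observation.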

import pandas as pd
import skillinfer
from skillinfer import Task

# Humans and models scored on the same skills
df = pd.DataFrame({
    "math":       [0.9, 0.7, 0.85, 0.95, 0.80],
    "code":       [0.8, 0.5, 0.90, 0.92, 0.70],
    "writing":    [0.6, 0.9, 0.40, 0.55, 0.85],
    "reasoning":  [0.8, 0.6, 0.80, 0.88, 0.75],
}, index=["alice", "bob", "gpt-4o", "claude", "gemini"])

pop = skillinfer.Population.from_dataframe(df, normalize=False)

# Observe one skill each
alice = pop.profile()
alice.observe("math", 0.95)
gpt4o = pop.profile()
gpt4o.observe("math", 0.88)

# Who handles a math-heavy task?
task = Task({"math": 1.0, "reasoning": 0.5})
ranking = skillinfer.rank_agents(task, {"alice": alice, "gpt-4o": gpt4o})
print(ranking)

Output — ranked by expected task performance

    agent  expected_score    std
0   alice            0.91   0.03
1  gpt-4o            0.85   0.03
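A ranking like the one above can be thought of as a weighted average of each agent's predicted skills, with the uncertainty propagated through the same weights. The sketch below assumes a weight-normalized linear score. Whether `rank_agents` computes exactly this is not stated here, and `task_score` and the belief values are invented for illustration.

```python
import math


def task_score(weights, mean, cov):
    """Weight-normalized expected score and its standard deviation.

    expected = sum_s w_s * mean_s / sum_s w_s
    variance = (w^T cov w) / (sum_s w_s)^2
    """
    total = sum(weights.values())
    expected = sum(w * mean[s] for s, w in weights.items()) / total
    variance = sum(
        wa * wb * cov[a][b]
        for a, wa in weights.items()
        for b, wb in weights.items()
    ) / total ** 2
    return expected, math.sqrt(max(variance, 0.0))


weights = {"math": 1.0, "reasoning": 0.5}

# Hypothetical beliefs: math was observed (tiny variance), reasoning is inferred
shared_cov = {
    "math": {"math": 0.0001, "reasoning": 0.0},
    "reasoning": {"math": 0.0, "reasoning": 0.0064},
}
beliefs = {
    "gpt-4o": ({"math": 0.88, "reasoning": 0.79}, shared_cov),
    "alice": ({"math": 0.95, "reasoning": 0.83}, shared_cov),
}

# Highest expected score first
ranking = sorted(
    ((name, *task_score(weights, mean, cov)) for name, (mean, cov) in beliefs.items()),
    key=lambda row: row[1],
    reverse=True,
)
for name, expected, std in ranking:
    print(f"{name:<8} expected={expected:.2f} std={std:.2f}")
```

Because the score is linear in the skills, its variance follows directly from the profile covariance. This lets a ranking report a `std` next to each expected score.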

Skill profiles as context for LLM orchestration¶

skillinfer profiles are structured context you feed to an LLM orchestrator alongside cost, latency, and business constraints. The LLM makes the routing decision; skillinfer gives it calibrated capability data.

import pandas as pd
from openai import OpenAI
import skillinfer
from skillinfer import Task

# --- 1. Build skill profiles from partial evaluations ---

history = pd.DataFrame({
    "math": [0.88, 0.75, 0.82, 0.90, 0.72], "code": [0.85, 0.80, 0.78, 0.88, 0.70],
    "reasoning": [0.90, 0.82, 0.85, 0.92, 0.78], "writing": [0.80, 0.85, 0.75, 0.78, 0.92],
}, index=[f"model_{i}" for i in range(5)])
pop = skillinfer.Population.from_dataframe(history, normalize=False)

# Each agent evaluated on different skills — skillinfer infers the rest
agents = {
    "gpt-4o":      {"reasoning": 0.92, "code": 0.89},
    "claude-3.5":  {"reasoning": 0.90, "writing": 0.95},
    "gemini-pro":  {"math": 0.88, "code": 0.82},
}
profiles = {
    name: pop.profile().observe_many(obs)
    for name, obs in agents.items()
}

# --- 2. Format profiles as LLM context ---

task = Task({"math": 1.0, "reasoning": 0.8, "code": 0.3})

agent_context = ""
for name, profile in profiles.items():
    agent_context += f"\n{name}:\n"
    for skill in ["math", "reasoning", "code"]:
        pred = profile.predict(skill)
        # A skill counts as observed only if it was measured for this agent
        source = "observed" if skill in agents[name] else "inferred"
        agent_context += f"  {skill}: {pred['mean']:.2f} ± {pred['std']:.2f} ({source})\n"

prompt = f"""Pick an agent for this task.

Task: Solve a competition math problem, write a Python proof.
Skill requirements: math (critical), reasoning (important), code (helpful).

Agent capabilities (from partial evaluations):
{agent_context}
"Observed" = directly measured. "Inferred" = predicted, higher uncertainty.

Constraints:
- Prefer agents whose critical skills (math) are observed, not inferred.
- If two agents are close, prefer the one with lower uncertainty.
- A slightly weaker agent with observed skills can be better than a
  stronger agent whose key skills are only inferred.

Pick one agent. One sentence of reasoning, then: ASSIGNED: <name>"""

# --- 3. Ask the LLM ---

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    max_tokens=100,
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)

LLM response

Gemini-pro has directly observed math (0.88) and code (0.82), while gpt-4o's
math is only inferred. Both are close, but gemini-pro's critical skill is
measured, not predicted — making it the safer choice.

ASSIGNED: gemini-pro

The LLM weighs observed against inferred evidence, balances skill scores against natural-language constraints ("slightly weaker but observed can be better"), and reaches a decision that a fixed scoring function would capture only if those constraints were encoded explicitly. Change the constraints, and the same profiles route differently.
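Since the prompt asks the model to end with `ASSIGNED: <name>`, downstream code still has to extract that name and check it against the known agents. A minimal sketch follows; `parse_assignment` is a hypothetical helper, not part of skillinfer or the OpenAI client.

```python
import re


def parse_assignment(reply, known_agents):
    """Extract the agent named on an 'ASSIGNED: <name>' line.

    Returns the matching name from `known_agents`, or None when the line is
    missing or names an agent that was never offered.
    """
    match = re.search(r"^ASSIGNED:\s*(\S+)\s*$", reply, flags=re.MULTILINE)
    if match is None:
        return None
    # Tolerate stray quoting or a trailing period around the name
    name = match.group(1).strip("`*\"'.").lower()
    lookup = {agent.lower(): agent for agent in known_agents}
    return lookup.get(name)


known = ["gpt-4o", "claude-3.5", "gemini-pro"]
reply = (
    "Gemini-pro has directly observed math (0.88) and code (0.82), while "
    "gpt-4o's math is only inferred.\n\n"
    "ASSIGNED: gemini-pro"
)
print(parse_assignment(reply, known))
```

Returning `None` for a missing or unknown name lets the caller retry or fall back to a default agent instead of routing on a hallucinated one.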

See the Agent Orchestration tutorial for the full walkthrough.