Datasets¶
Built-in datasets that return a Population ready for profiling. No downloads, no preprocessing — just import and go.
import skillinfer
pop = skillinfer.datasets.onet() # or skillinfer.datasets.esco()
profile = pop.profile()
profile.observe("Skill:Programming", 0.92)
print(profile.predict())
All datasets are shipped inside the package: O*NET and ESCO as gzipped CSV (~510 KB combined), PIAAC as a small .npz of summary statistics (~2 KB).
onet()¶
O*NET 30.2 — the U.S. Department of Labor's occupational information network.
O*NET is the most comprehensive public database of occupational skill requirements. It describes what workers in each occupation need to know and be able to do, based on surveys of incumbent workers and occupational analysts.
pop = skillinfer.datasets.onet()
print(pop)
# Population(894 entities x 120 skills, shrinkage=0.0054)
What's in it¶
| | |
|---|---|
| Entities | 894 occupations (e.g., "Software Developers", "Registered Nurses", "Chief Executives") |
| Features | 120 total: 35 skills, 33 knowledge areas, 52 abilities |
| Scale | Continuous importance ratings, normalised to [0, 1] |
| Source | O*NET 30.2, U.S. Department of Labor / ETA |
| License | CC BY 4.0 |
Feature categories¶
Features are prefixed by category:
- "Skill:" for learned capabilities (e.g., Skill:Programming, Skill:Critical Thinking, Skill:Writing)
- "Knowledge:" for domain knowledge (e.g., Knowledge:Mathematics, Knowledge:Computers and Electronics)
- "Ability:" for enduring attributes (e.g., Ability:Deductive Reasoning, Ability:Static Strength, Ability:Manual Dexterity)
How it was preprocessed¶
The raw O*NET 30.2 database contains multiple scales per feature (importance, level, relevance). We extract:
- Skills.txt, Knowledge.txt, Abilities.txt from the O*NET 30.2 database
- Filter to the Importance scale (Scale ID = "IM"), which rates each feature on a 1–5 scale
- Drop rows marked as suppressed (Recommend Suppress = "Y")
- Pivot to an occupation × feature matrix (894 × 120)
- Normalise each column to [0, 1] using min-max scaling
- Replace O*NET SOC codes with human-readable occupation titles
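The steps above can be sketched in plain Python on a few toy rows. This is an illustrative sketch, not the package's actual preprocessing code: the row keys (`code`, `category`, `element`, `scale`, `value`, `suppress`) and all rating values are made up for the example. The source does not say how cells left empty by suppression are handled, so the sketch simply leaves them absent.

```python
from collections import defaultdict

# Toy long-format rows (hypothetical keys and values, not real O*NET data).
rows = [
    {"code": "15-1252.00", "category": "Skill", "element": "Programming", "scale": "IM", "value": 4.5, "suppress": "N"},
    {"code": "15-1252.00", "category": "Skill", "element": "Programming", "scale": "LV", "value": 5.1, "suppress": "N"},
    {"code": "15-1252.00", "category": "Skill", "element": "Writing", "scale": "IM", "value": 3.2, "suppress": "N"},
    {"code": "29-1141.00", "category": "Skill", "element": "Programming", "scale": "IM", "value": 1.5, "suppress": "N"},
    {"code": "29-1141.00", "category": "Skill", "element": "Writing", "scale": "IM", "value": 3.8, "suppress": "N"},
    {"code": "11-1011.00", "category": "Skill", "element": "Programming", "scale": "IM", "value": 2.0, "suppress": "N"},
    {"code": "11-1011.00", "category": "Skill", "element": "Writing", "scale": "IM", "value": 4.4, "suppress": "Y"},
]
titles = {
    "15-1252.00": "Software Developers",
    "29-1141.00": "Registered Nurses",
    "11-1011.00": "Chief Executives",
}


def build_matrix(rows, titles):
    # Keep Importance ratings only, and drop suppressed rows.
    kept = [r for r in rows if r["scale"] == "IM" and r["suppress"] != "Y"]

    # Pivot to {occupation title: {prefixed feature: rating}}.
    matrix = defaultdict(dict)
    for r in kept:
        feature = f'{r["category"]}:{r["element"]}'
        matrix[titles[r["code"]]][feature] = r["value"]

    # Min-max scale each column to [0, 1] over the cells that are present.
    features = sorted({f for cells in matrix.values() for f in cells})
    for f in features:
        column = [cells[f] for cells in matrix.values() if f in cells]
        lo, hi = min(column), max(column)
        span = hi - lo
        for cells in matrix.values():
            if f in cells:
                cells[f] = (cells[f] - lo) / span if span else 0.0
    return dict(matrix)


matrix = build_matrix(rows, titles)
```

In this toy input the "LV" row is filtered out and the suppressed Writing rating for Chief Executives never reaches the matrix, so each column is scaled only over the occupations that actually have a value.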
Population statistics¶
| Statistic | Value |
|---|---|
| Mean feature value | 0.385 |
| Feature std | 0.228 |
| Density (non-zero entries) | 92.2% |
| Ledoit-Wolf shrinkage | 0.0054 |
| Condition number | 1,884 |
| Effective dimensions (90% var) | ~15 |
| Mean \|correlation\| | 0.336 |
| Correlation sparsity (<0.1) | 14.8% |
The high mean correlation (0.336) and low sparsity (14.8%) mean most features are correlated — the population has rich transfer structure. Observing a few skills tells you a lot about the rest.
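For readers who want to see what the two correlation statistics measure, here is a minimal sketch. It assumes that "mean |correlation|" is the average absolute Pearson correlation over distinct feature pairs and that "correlation sparsity" is the share of those pairs with |r| below 0.1; the source does not spell out either definition. The helper names and the toy columns are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_summary(columns, threshold=0.1):
    """Summarise |r| over every distinct pair of feature columns."""
    abs_r = [
        abs(pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return {
        "mean_abs_corr": sum(abs_r) / len(abs_r),
        "sparsity": sum(r < threshold for r in abs_r) / len(abs_r),
    }


# Toy columns: the first two move together, the third is unrelated to both.
toy = {
    "Skill:Programming": [0.9, 0.1, 0.5, 0.3],
    "Knowledge:Mathematics": [0.8, 0.2, 0.6, 0.4],
    "Ability:Static Strength": [0.5, 0.6, 0.6, 0.3],
}
summary = correlation_summary(toy)
```

On the toy input, one of the three pairs is strongly correlated and the other two are essentially uncorrelated, so the sparsity comes out at two thirds. A population like O*NET sits at the opposite end, with few near-zero pairs.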
Example¶
pop = skillinfer.datasets.onet()
profile = pop.profile()
profile.observe("Skill:Programming", 0.92)
profile.observe("Skill:Critical Thinking", 0.85)
print(profile.predict())
# Use a specific occupation as prior
profile = pop.profile(prior_entity="Software Developers")
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| normalize | bool | False | Re-normalise columns to [0, 1]. Data is already normalised, so this is rarely needed. |
esco()¶
ESCO v1.2.1 — the European Commission's taxonomy of Skills, Competences, Qualifications and Occupations.
ESCO is an independently curated taxonomy maintained by EU expert panels. Where O*NET uses continuous importance ratings from U.S. surveys, ESCO uses binary skill assignments curated by European domain experts. This makes it a strong cross-validation target — if skillinfer works on both, the method generalises beyond any single taxonomy.
pop = skillinfer.datasets.esco()
print(pop)
# Population(2999 entities x 134 skills, shrinkage=0.0211)
What's in it¶
| | |
|---|---|
| Entities | 2,999 occupations (e.g., "technical director", "registered nurse", "software developer") |
| Features | 134 Level-2 skill groups |
| Scale | Binary (1 = occupation requires at least one essential skill in that group, 0 = does not) |
| Source | ESCO v1.2.1, European Commission |
How it was preprocessed¶
The ESCO classification assigns ~13,000 individual skills to occupations. We aggregate to a manageable matrix:
- Load occupationSkillRelations_en.csv — maps occupations to individual skills with relation types (essential/optional)
- Filter to essential relations only
- Walk each skill up the ESCO skill hierarchy (via broaderRelationsSkillPillar_en.csv and skillsHierarchy_en.csv) to its Level-2 skill group ancestor
- Build a binary occupation × skill-group matrix: 1 if the occupation has at least one essential skill in that group, 0 otherwise
- Replace skill group URIs with human-readable preferred labels from the hierarchy
- Drop occupations with fewer than 5 skill groups (too sparse to be informative)
- Replace occupation URIs with human-readable titles from occupations_en.csv
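The aggregation above can be sketched with plain dictionaries. This is a simplified illustration rather than the package's code: the identifiers, the `level2_ancestor` and `build_binary_matrix` helpers, and the toy hierarchy are hypothetical, and the sketch assumes each skill has a single broader parent, which the real ESCO hierarchy does not guarantee. The toy call uses `min_groups=2` so the filter is visible on a tiny input; the pipeline described above uses 5.

```python
def level2_ancestor(skill, parent, level2_groups):
    """Follow child -> parent links until a Level-2 group is reached."""
    node, seen = skill, set()
    while node is not None and node not in seen:
        if node in level2_groups:
            return node
        seen.add(node)  # guard against cycles in malformed input
        node = parent.get(node)
    return None


def build_binary_matrix(relations, parent, level2_groups, min_groups=5):
    """Return ({occupation: {group: 0 or 1}}, sorted group columns)."""
    groups_by_occupation = {}
    for occupation, skill, relation_type in relations:
        if relation_type != "essential":
            continue  # optional relations are ignored
        group = level2_ancestor(skill, parent, level2_groups)
        if group is not None:
            groups_by_occupation.setdefault(occupation, set()).add(group)

    # Drop occupations that touch too few skill groups.
    kept = {o: g for o, g in groups_by_occupation.items() if len(g) >= min_groups}
    columns = sorted(set().union(*kept.values())) if kept else []
    matrix = {o: {c: int(c in g) for c in columns} for o, g in kept.items()}
    return matrix, columns


# Hypothetical toy hierarchy and relations.
parent = {
    "s:wound care": "g3:nursing",
    "g3:nursing": "L2:healthcare",
    "s:triage": "L2:healthcare",
    "s:lesson planning": "L2:teaching",
    "s:python": "g3:programming",
    "g3:programming": "L2:ict",
}
level2_groups = {"L2:healthcare", "L2:teaching", "L2:ict"}
relations = [
    ("nurse educator", "s:wound care", "essential"),
    ("nurse educator", "s:triage", "essential"),
    ("nurse educator", "s:lesson planning", "essential"),
    ("nurse educator", "s:python", "optional"),
    ("it trainer", "s:python", "essential"),
    ("it trainer", "s:lesson planning", "essential"),
    ("web developer", "s:python", "essential"),
]
matrix, columns = build_binary_matrix(relations, parent, level2_groups, min_groups=2)
```

Here "web developer" maps to a single group and is dropped, while the optional Python relation for "nurse educator" never contributes a 1.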
Population statistics¶
| Statistic | Value |
|---|---|
| Density (fraction of 1s) | 10.1% |
| Ledoit-Wolf shrinkage | 0.0211 |
| Condition number | 468 |
| Effective dimensions (90% var) | ~15 |
| Mean \|correlation\| | 0.055 |
| Correlation sparsity (<0.1) | 84.6% |
ESCO is much sparser than O*NET — only 10% of entries are 1, and 84.6% of feature correlations are near zero. The covariance structure is concentrated in a few meaningful clusters (e.g., healthcare skills co-occur, IT skills co-occur), with most skill groups being independent. Transfer is still valuable but more targeted.
Key differences from O*NET¶
| | O*NET | ESCO |
|---|---|---|
| Source | U.S. Department of Labor surveys | EU expert panel curation |
| Scale | Continuous [0, 1] | Binary |
| Features | 120 (skills + knowledge + abilities) | 134 (skill groups only) |
| Entities | 894 occupations | 2,999 occupations |
| Correlation structure | Dense (mean \|r\| = 0.34) | Sparse (mean \|r\| = 0.06) |
| Best for | Rich skill profiling, continuous predictions | Cross-validation, binary classification tasks |
Example¶
pop = skillinfer.datasets.esco()
profile = pop.profile()
profile.observe("education", 1.0)
profile.observe("teaching and training", 1.0)
print(profile.predict())
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| normalize | bool | False | Normalise columns to [0, 1]. Data is binary, so this is rarely needed. |
piaac_prior()¶
PIAAC Cycle 2 US — within-individual prior derived from the OECD Programme for the International Assessment of Adult Competencies. Unlike onet() and esco(), this dataset ships only the summary statistics (population mean and covariance over 9 dimensions). No individual records are bundled.
pop = skillinfer.datasets.piaac_prior()
print(pop)
# Population(1 entities x 9 skills, shrinkage=None)
The "1 entity" is a placeholder row holding the population mean — piaac_prior is built via Population.from_covariance and behaves correctly for everything that uses the prior structure (profile, observe, predict, match_score, rank_agents).
What's in it¶
| | |
|---|---|
| Entities | None bundled — only population summary statistics |
| Features | 9 dimensions (3 IRT-assessed skills + 6 work-use scales) |
| Scale | Min-max scaled to [0, 1] within the PIAAC sample |
| n (sample size) | 2,548 adults with complete records |
| Source | PIAAC Cycle 2 US PUF, OECD / NCES |
Dimensions¶
| Group | Names | Source columns |
|---|---|---|
| IRT-assessed proficiency | literacy, numeracy, problem_solving | Mean of 10 plausible values (PVLIT*, PVNUM*, PVAPS*) |
| Skill use at work | readwork, writwork, numwork, ictwork, influence, taskdisc | Weighted likelihood estimates (*_WLE_CA*) |
When to use it (and when not to)¶
piaac_prior() ships a within-individual prior — it tells you how skills covary across people, which is what you want when profiling a specific human from partial observations. Use onet() or esco() instead when you want a cross-occupation prior — how skills covary across job types — for cold-starting an entity by their stated occupation.
How it was preprocessed¶
- Load prgusap2.csv (PIAAC Cycle 2 US PUF, semicolon-separated)
- For each IRT-assessed dimension: take the mean of the 10 plausible values
- For each work-use scale: take the published weighted likelihood estimate
- Drop rows with any missing dimension (n=2,548 complete cases)
- Min-max scale each dimension to [0, 1] (preserves correlations; keeps the package's clipping contract)
- Compute population mean and Ledoit-Wolf shrunk covariance
- Save as a 2 KB .npz
The full prep script is at scripts/prepare_piaac_prior.py.
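As a rough illustration of the first few steps (and not a copy of that script), the sketch below averages plausible values, drops incomplete cases, and min-max scales two dimensions. The records are invented, each carries only two plausible values per dimension instead of ten to keep the example short, and the helper names are hypothetical. It stops before the Ledoit-Wolf shrinkage step.

```python
from math import sqrt
from statistics import mean


def pv_mean(record, prefix):
    """Average every plausible value whose column name starts with `prefix`."""
    values = [v for k, v in record.items() if k.startswith(prefix)]
    return None if any(v is None for v in values) else mean(values)


def min_max(values):
    """Scale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Invented records; None marks a missing value.
records = [
    {"PVLIT1": 250.0, "PVLIT2": 260.0, "PVNUM1": 240.0, "PVNUM2": 250.0},
    {"PVLIT1": 300.0, "PVLIT2": 310.0, "PVNUM1": 320.0, "PVNUM2": 300.0},
    {"PVLIT1": 200.0, "PVLIT2": None, "PVNUM1": 210.0, "PVNUM2": 190.0},
    {"PVLIT1": 280.0, "PVLIT2": 270.0, "PVNUM1": 260.0, "PVNUM2": 280.0},
]
dims = {"literacy": "PVLIT", "numeracy": "PVNUM"}

# One value per person and dimension, then keep complete cases only.
rows = [{name: pv_mean(r, prefix) for name, prefix in dims.items()} for r in records]
complete = [row for row in rows if all(v is not None for v in row.values())]

raw = {name: [row[name] for row in complete] for name in dims}
scaled = {name: min_max(column) for name, column in raw.items()}
prior_mean = {name: mean(column) for name, column in scaled.items()}

# Min-max scaling is a positive linear map, so the correlation is unchanged.
r_raw = pearson(raw["literacy"], raw["numeracy"])
r_scaled = pearson(scaled["literacy"], scaled["numeracy"])
```

The last two lines check the "preserves correlations" remark in the list above: `r_raw` and `r_scaled` agree up to floating-point error.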
Why the prior, not the rows?¶
The PIAAC public-use file is large (~35 MB), and a Python wheel is the wrong place to ship individual-level survey records. The population mean and covariance are all that the Kalman update needs from the population. By bundling only the derived prior, the dataset:
- Avoids redistributing OECD individual records
- Sidesteps the "10 plausible values per person" preprocessing question (we make the choice once, document it, and consume the result)
- Adds 2 KB to the wheel instead of tens of MB
Population statistics¶
| Statistic | Value |
|---|---|
| Sample size n | 2,548 |
| Dimensions K | 9 |
| Ledoit-Wolf shrinkage | 0.0030 |
| Top correlation | literacy ↔ numeracy (r = 0.92) |
Example¶
pop = skillinfer.datasets.piaac_prior()
profile = pop.profile()
profile.observe("literacy", 0.85)
profile.observe("numeracy", 0.80)
print(profile.predict())
Observing literacy and numeracy moves problem_solving from 0.59 (population mean) to ≈0.80 — the lift comes entirely through the off-diagonal covariance with the two assessed cognitive scores.
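One way to see where that lift comes from is the standard Gaussian conditioning step behind a Kalman update, written here for a single scalar observation. The sketch is not skillinfer's implementation: the `observe` helper, the three-dimensional prior, and every number in it are hypothetical and are not the PIAAC statistics.

```python
def observe(mean, cov, index, value, noise=0.0):
    """Condition a Gaussian (mean, cov) on one observation of dimension `index`."""
    innovation_var = cov[index][index] + noise
    # Gain: how much each dimension moves per unit of surprise in `index`.
    gain = [row[index] / innovation_var for row in cov]
    residual = value - mean[index]
    new_mean = [m + g * residual for m, g in zip(mean, gain)]
    size = len(mean)
    new_cov = [
        [cov[i][j] - gain[i] * cov[index][j] for j in range(size)]
        for i in range(size)
    ]
    return new_mean, new_cov


names = ["literacy", "numeracy", "problem_solving"]
prior_mean = [0.6, 0.6, 0.6]
prior_cov = [
    [0.040, 0.036, 0.030],
    [0.036, 0.040, 0.030],
    [0.030, 0.030, 0.040],
]

# Observe literacy only; problem_solving is never measured directly.
posterior_mean, posterior_cov = observe(prior_mean, prior_cov, index=0, value=0.8)
```

With these toy numbers the unobserved problem_solving estimate rises from 0.60 to about 0.75, and its variance shrinks, purely because of its off-diagonal covariance with literacy. If that covariance were zero, the gain for problem_solving would be zero and its estimate would not move.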