Datasets¶

Built-in datasets that return a Population ready for profiling. No downloads, no preprocessing — just import and go.

import skillinfer

pop = skillinfer.datasets.onet()      # or skillinfer.datasets.esco()
profile = pop.profile()
profile.observe("Skill:Programming", 0.92)
print(profile.predict())

All datasets are shipped inside the package: O*NET and ESCO as gzipped CSV (~510 KB combined), PIAAC as a small .npz of summary statistics (~2 KB).

`onet()`¶

O*NET 30.2 — the U.S. Department of Labor's occupational information network.

O*NET is the most comprehensive public database of occupational skill requirements. It describes what workers in each occupation need to know and be able to do, based on surveys of incumbent workers and occupational analysts.

pop = skillinfer.datasets.onet()
print(pop)
# Population(894 entities x 120 skills, shrinkage=0.0054)

What's in it¶


Entities	894 occupations (e.g., "Software Developers", "Registered Nurses", "Chief Executives")
Features	120 total: 35 skills, 33 knowledge areas, 52 abilities
Scale	Continuous importance ratings, normalised to [0, 1]
Source	O*NET 30.2, U.S. Department of Labor / ETA
License	CC BY 4.0

Feature categories¶

Features are prefixed by category:

Skill: — learned capabilities (e.g., Skill:Programming, Skill:Critical Thinking, Skill:Writing)
Knowledge: — domain knowledge (e.g., Knowledge:Mathematics, Knowledge:Computers and Electronics)
Ability: — enduring attributes (e.g., Ability:Deductive Reasoning, Ability:Static Strength, Ability:Manual Dexterity)

How it was preprocessed¶

The raw O*NET 30.2 database contains multiple scales per feature (importance, level, relevance). We extract:

Skills.txt, Knowledge.txt, Abilities.txt from the O*NET 30.2 database
Filter to the Importance scale (Scale ID = "IM"), which rates each feature on a 1–5 scale
Drop rows marked as suppressed (Recommend Suppress = "Y")
Pivot to an occupation × feature matrix (894 × 120)
Normalise each column to [0, 1] using min-max scaling
Replace O*NET SOC codes with human-readable occupation titles

Population statistics¶

Statistic	Value
Mean feature value	0.385
Feature std	0.228
Density (non-zero entries)	92.2%
Ledoit-Wolf shrinkage	0.0054
Condition number	1,884
Effective dimensions (90% var)	~15
Mean \|correlation\|	0.336
Correlation sparsity (<0.1)	14.8%

The high mean correlation (0.336) and low sparsity (14.8%) mean most features are correlated — the population has rich transfer structure. Observing a few skills tells you a lot about the rest.

Example¶

pop = skillinfer.datasets.onet()

profile = pop.profile()
profile.observe("Skill:Programming", 0.92)
profile.observe("Skill:Critical Thinking", 0.85)
print(profile.predict())

# Use a specific occupation as prior
profile = pop.profile(prior_entity="Software Developers")

Parameters

Parameter	Type	Default	Description
`normalize`	`bool`	`False`	Re-normalise columns to [0, 1]. Data is already normalised, so this is rarely needed.

`esco()`¶

ESCO v1.2.1 — the European Commission's taxonomy of Skills, Competences, Qualifications and Occupations.

ESCO is an independently curated taxonomy maintained by EU expert panels. Where O*NET uses continuous importance ratings from U.S. surveys, ESCO uses binary skill assignments curated by European domain experts. This makes it a strong cross-validation target — if skillinfer works on both, the method generalises beyond any single taxonomy.

pop = skillinfer.datasets.esco()
print(pop)
# Population(2999 entities x 134 skills, shrinkage=0.0211)

What's in it¶


Entities	2,999 occupations (e.g., "technical director", "registered nurse", "software developer")
Features	134 Level-2 skill groups
Scale	Binary (1 = occupation requires at least one essential skill in that group, 0 = does not)
Source	ESCO v1.2.1, European Commission

How it was preprocessed¶

The ESCO classification assigns ~13,000 individual skills to occupations. We aggregate to a manageable matrix:

Load occupationSkillRelations_en.csv — maps occupations to individual skills with relation types (essential/optional)
Filter to essential relations only
Walk each skill up the ESCO skill hierarchy (via broaderRelationsSkillPillar_en.csv and skillsHierarchy_en.csv) to its Level-2 skill group ancestor
Build a binary occupation × skill-group matrix: 1 if the occupation has at least one essential skill in that group, 0 otherwise
Replace skill group URIs with human-readable preferred labels from the hierarchy
Drop occupations with fewer than 5 skill groups (too sparse to be informative)
Replace occupation URIs with human-readable titles from occupations_en.csv

Population statistics¶

Statistic	Value
Density (fraction of 1s)	10.1%
Ledoit-Wolf shrinkage	0.0211
Condition number	468
Effective dimensions (90% var)	~15
Mean \|correlation\|	0.055
Correlation sparsity (<0.1)	84.6%

ESCO is much sparser than O*NET — only 10% of entries are 1, and 84.6% of feature correlations are near zero. The covariance structure is concentrated in a few meaningful clusters (e.g., healthcare skills co-occur, IT skills co-occur), with most skill groups being independent. Transfer is still valuable but more targeted.

Key differences from O*NET¶

	O*NET	ESCO
Source	U.S. Department of Labor surveys	EU expert panel curation
Scale	Continuous [0, 1]	Binary
Features	120 (skills + knowledge + abilities)	134 (skill groups only)
Entities	894 occupations	2,999 occupations
Correlation structure	Dense (mean \|r\| = 0.34)	Sparse (mean \|r\| = 0.06)
Best for	Rich skill profiling, continuous predictions	Cross-validation, binary classification tasks

Example¶

pop = skillinfer.datasets.esco()

profile = pop.profile()
profile.observe("education", 1.0)
profile.observe("teaching and training", 1.0)
print(profile.predict())

Parameters

Parameter	Type	Default	Description
`normalize`	`bool`	`False`	Normalise columns to [0, 1]. Data is binary, so this is rarely needed.

`piaac_prior()`¶

PIAAC Cycle 2 US — within-individual prior derived from the OECD Programme for the International Assessment of Adult Competencies. Unlike onet() and esco(), this dataset ships only the summary statistics (population mean and covariance over 9 dimensions). No individual records are bundled.

pop = skillinfer.datasets.piaac_prior()
print(pop)
# Population(1 entities x 9 skills, shrinkage=None)

The "1 entity" is a placeholder row holding the population mean — piaac_prior is built via Population.from_covariance and behaves correctly for everything that uses the prior structure (profile, observe, predict, match_score, rank_agents).

What's in it¶


Entities	None bundled — only population summary statistics
Features	9 dimensions (3 IRT-assessed skills + 6 work-use scales)
Scale	Min-max scaled to [0, 1] within the PIAAC sample
n (sample size)	2,548 adults with complete records
Source	PIAAC Cycle 2 US PUF, OECD / NCES

Dimensions¶

Group	Names	Source columns
IRT-assessed proficiency	`literacy`, `numeracy`, `problem_solving`	Mean of 10 plausible values (`PVLIT`, `PVNUM`, `PVAPS*`)
Skill use at work	`readwork`, `writwork`, `numwork`, `ictwork`, `influence`, `taskdisc`	Weighted likelihood estimates (`_WLE_CA`)

When to use it (and when not to)¶

piaac_prior() ships a within-individual prior — it tells you how skills covary across people, which is what you want when profiling a specific human from partial observations. Use onet() or esco() instead when you want a cross-occupation prior — how skills covary across job types — for cold-starting an entity by their stated occupation.

How it was preprocessed¶

Load prgusap2.csv (PIAAC Cycle 2 US PUF, semicolon-separated)
For each IRT-assessed dimension: take the mean of the 10 plausible values
For each work-use scale: take the published weighted likelihood estimate
Drop rows with any missing dimension (n=2,548 complete cases)
Min-max scale each dimension to [0, 1] (preserves correlations; keeps the package's clipping contract)
Compute population mean and Ledoit-Wolf shrunk covariance
Save as a 2 KB .npz

The full prep script is at scripts/prepare_piaac_prior.py.

Why the prior, not the rows?¶

The PIAAC public-use file is large (~35 MB) and shipping individual-level survey records inside a Python wheel is the wrong threshold. The covariance is the only thing the Kalman update needs from the population. By bundling only the derived prior, the dataset:

Avoids redistributing OECD individual records
Sidesteps the "10 plausible values per person" preprocessing question (we make the choice once, document it, and consume the result)
Adds 2 KB to the wheel instead of tens of MB

Population statistics¶

Statistic	Value
Sample size n	2,548
Dimensions K	9
Ledoit-Wolf shrinkage	0.0030
Top correlation	`literacy` ↔ `numeracy` (r = 0.92)

Example¶

pop = skillinfer.datasets.piaac_prior()
profile = pop.profile()
profile.observe("literacy", 0.85)
profile.observe("numeracy", 0.80)
print(profile.predict())

Observing literacy and numeracy moves problem_solving from 0.59 (population mean) to ≈0.80 — the lift comes entirely through the off-diagonal covariance with the two assessed cognitive scores.

Datasets¶

onet()¶

What's in it¶

Feature categories¶

How it was preprocessed¶

Population statistics¶

Example¶

esco()¶

What's in it¶

How it was preprocessed¶

Population statistics¶

Key differences from O*NET¶

Example¶

piaac_prior()¶

What's in it¶

Dimensions¶

When to use it (and when not to)¶

How it was preprocessed¶

Why the prior, not the rows?¶

Population statistics¶

Example¶

`onet()`¶

`esco()`¶

`piaac_prior()`¶