Population¶

The population model. Wraps an entity-feature matrix and its learned covariance structure.

import skillinfer

pop = skillinfer.Population.from_dataframe(df)

Constructors¶

`Population.from_dataframe`¶

@classmethod
def from_dataframe(
    cls,
    df: pd.DataFrame,
    normalize: bool = True,
    covariance: str = "ledoit-wolf",
) -> Population

Build a Population from a pandas DataFrame.

Parameters

Parameter	Type	Default	Description
`df`	`pd.DataFrame`	—	Rows = entities, columns = features. All values must be numeric. No NaN values.
`normalize`	`bool`	`True`	Scale each column to [0, 1]. Set to `False` if data is already on a meaningful scale (e.g., binary 0/1 data, or scores where the raw values matter).
`covariance`	`str`	`"ledoit-wolf"`	`"ledoit-wolf"` (recommended) or `"sample"`.
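As a rough illustration of what `normalize=True` implies, the sketch below scales each column to [0, 1] with plain min-max scaling in NumPy. It assumes min-max scaling is what is meant; the helper name `min_max_scale` and its handling of constant columns are hypothetical and not taken from the library.

```python
import numpy as np

def min_max_scale(X: np.ndarray) -> np.ndarray:
    """Scale each column of X to [0, 1] (assumed min-max scaling)."""
    X = np.asarray(X, dtype=float)
    if np.isnan(X).any():
        raise ValueError("NaN values are not allowed")
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    # Constant columns have zero span; map them to 0 rather than divide by zero.
    safe_span = np.where(span > 0, span, 1.0)
    return (X - lo) / safe_span

# Rows = entities, columns = features (made-up numbers).
X = np.array([[1.0, 10.0, 5.0],
              [3.0, 20.0, 5.0],
              [5.0, 40.0, 5.0]])
scaled = min_max_scale(X)
```

Binary 0/1 columns pass through this scaling unchanged, which is one reason `normalize=False` is a safe choice for such data.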

Returns: Population

`Population.from_csv`¶

@classmethod
def from_csv(
    cls,
    path: str,
    index_col: int | str = 0,
    normalize: bool = True,
    covariance: str = "ledoit-wolf",
) -> Population

Build a Population from a CSV file. Convenience wrapper around from_dataframe.

`Population.from_parquet`¶

@classmethod
def from_parquet(
    cls,
    path: str,
    normalize: bool = True,
    covariance: str = "ledoit-wolf",
) -> Population

Build a Population from a Parquet file. Apart from path, the parameters match those of from_dataframe.

`Population.from_covariance`¶

@classmethod
def from_covariance(
    cls,
    covariance: np.ndarray,
    feature_names: list[str],
    population_mean: np.ndarray,
) -> Population

Build a Population from a pre-computed covariance matrix. Use this when you have a domain-expert covariance (e.g., from a previous study) rather than estimating from data.

Parameters

Parameter	Type	Description
`covariance`	`np.ndarray`	(K, K) covariance matrix.
`feature_names`	`list[str]`	List of K feature names.
`population_mean`	`np.ndarray`	(K,) mean vector.
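One way to prepare these inputs, sketched under assumptions: a previous study reports per-feature standard deviations and a correlation matrix, and the covariance is assembled as the elementwise product of the outer product of the standard deviations with the correlations. The feature names and numbers are hypothetical, and the sanity checks are a suggestion rather than something the library is known to enforce.

```python
import numpy as np

feature_names = ["reading", "writing", "numeracy"]   # hypothetical features
std = np.array([0.20, 0.25, 0.15])                   # hypothetical std devs
corr = np.array([[1.0, 0.8, 0.3],
                 [0.8, 1.0, 0.2],
                 [0.3, 0.2, 1.0]])

# Sigma_ij = std_i * std_j * corr_ij
covariance = np.outer(std, std) * corr
population_mean = np.array([0.5, 0.4, 0.6])

# Sanity checks before handing the matrix over.
K = len(feature_names)
assert covariance.shape == (K, K)
assert population_mean.shape == (K,)
assert np.allclose(covariance, covariance.T)          # symmetric
assert np.linalg.eigvalsh(covariance).min() > 0       # positive definite

# With skillinfer installed, the documented call would then be:
# pop = skillinfer.Population.from_covariance(
#     covariance=covariance,
#     feature_names=feature_names,
#     population_mean=population_mean,
# )
```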

Core Methods¶

`profile`¶

def profile(
    self,
    prior_entity: str | None = None,
    prior_mean: np.ndarray | None = None,
    noise: float | None = None,
    method: str = "kalman",
    rank: int | None = None,
    blocks: list[list[str | int]] | dict[str, str | int] | None = None,
    n_components: int | None = None,
    gmm_random_state: int | None = 0,
    k: int = 10,
) -> Profile

Create a Profile for a new entity. Inherits skill descriptions from the Population.

Parameters

Parameter	Type	Default	Description
`prior_entity`	`str | None`	`None`	Use this entity's vector as the prior mean.
`prior_mean`	`np.ndarray | None`	`None`	Use this array directly as the prior mean.
`noise`	`float | None`	`None`	Observation noise (std dev). Default is 5% of average feature spread.
`method`	`str`	`"kalman"`	Inference method. One of `"kalman"`, `"diagonal"`, `"block-diagonal"`, `"pmf"`, `"gmm-kalman"`, `"knn"`.
`rank`	`int | None`	`None`	Top-r eigencomponents to retain when `method="pmf"`.
`blocks`	`list[list] | dict | None`	`None`	Block specification for `method="block-diagonal"`: either a list of feature lists or a `{feature: block_label}` dict.
`n_components`	`int | None`	`None`	Mixture size M when `method="gmm-kalman"`.
`gmm_random_state`	`int | None`	`0`	Seed for the EM fit (cached per population).
`k`	`int`	`10`	Neighbour count when `method="knn"`.

If neither prior_entity nor prior_mean is given, the population mean is used.

Methods

"kalman" (default): Full Ledoit-Wolf covariance; propagates evidence to all features.
"diagonal": Off-diagonal entries zeroed; only the observed feature updates (no-transfer ablation).
"block-diagonal": Covariance kept inside each block, zeroed across blocks. Useful for restricting transfer to within Skills, within Knowledge, etc.
"pmf": Rank-r eigentruncation of the covariance (PMF / probabilistic PCA prior), with r set by rank. Cheap and serves as a strong linear baseline; variance on directions outside the top-r subspace collapses to zero.
"gmm-kalman": Gaussian-mixture prior fit on this population by EM. Each observation triggers per-component Kalman updates plus mixture re-weighting; this is the only non-linear option (a surprising observation can flip the dominant cluster). Returns a GMMProfile.
"knn": Non-parametric kNN regression in observed-feature space (k neighbours, inverse-distance weighted). Returns a KNNProfile; point predictions only, with no posterior covariance, so std, CIs, and match_score.p_above_threshold are NaN / None. Often beats the Kalman filter on binary-sparse data (e.g. ESCO). Requires a population with ≥ 2 entity rows; doesn't work with piaac_prior().

Returns: Profile (or GMMProfile / KNNProfile, both subclasses of Profile)
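To make the difference between the linear methods concrete, here is a minimal NumPy sketch of a scalar-observation Kalman update run against three versions of the same covariance: full, diagonal, and block-diagonal. A rank-r truncation helper is included as well. It assumes the standard Gaussian conditioning formulas; all function names are hypothetical, blocks are given as index lists rather than feature names, and none of this is taken from the library's internals.

```python
import numpy as np

def kalman_update(mean, cov, idx, value, noise_std):
    """Condition N(mean, cov) on a noisy scalar observation of feature idx."""
    s = cov[idx, idx] + noise_std ** 2            # innovation variance
    gain = cov[:, idx] / s                        # Kalman gain, shape (K,)
    new_mean = mean + gain * (value - mean[idx])
    new_cov = cov - np.outer(gain, cov[idx, :])
    return new_mean, new_cov

def diagonal(cov):
    """Zero every off-diagonal entry (no transfer between features)."""
    return np.diag(np.diag(cov))

def block_diagonal(cov, blocks):
    """Keep covariance inside each block of indices, zero it across blocks."""
    mask = np.zeros(cov.shape, dtype=bool)
    for block in blocks:
        mask[np.ix_(block, block)] = True
    return np.where(mask, cov, 0.0)

def low_rank(cov, r):
    """Keep only the top-r eigencomponents of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(cov)              # ascending eigenvalues
    top = np.argsort(vals)[::-1][:r]
    return (vecs[:, top] * vals[top]) @ vecs[:, top].T

mean = np.full(4, 0.5)
cov = np.array([[0.04, 0.03, 0.01, 0.00],
                [0.03, 0.04, 0.01, 0.00],
                [0.01, 0.01, 0.04, 0.02],
                [0.00, 0.00, 0.02, 0.04]])
obs = dict(idx=0, value=0.9, noise_std=0.05)      # observe feature 0 at 0.9

full_mean, _ = kalman_update(mean, cov, **obs)
diag_mean, _ = kalman_update(mean, diagonal(cov), **obs)
block_mean, _ = kalman_update(mean, block_diagonal(cov, [[0, 1], [2, 3]]), **obs)
truncated = low_rank(cov, 2)
```

With the full covariance, observing feature 0 also moves features 1 and 2. The diagonal version moves only feature 0, and the block-diagonal version moves features 0 and 1 while leaving the second block at the prior mean.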

Example

profile = pop.profile()                                      # population mean prior
profile = pop.profile(prior_entity="Software Developers")    # entity-specific prior
profile = pop.profile(noise=0.1)                             # custom noise level

# Alternative inference methods
profile = pop.profile(method="diagonal")                     # no-transfer baseline
profile = pop.profile(method="pmf", rank=20)                 # rank-20 PMF prior
profile = pop.profile(
    method="block-diagonal",
    blocks=[skill_names, knowledge_names, ability_names],
)                                                            # within-category transfer only
profile = pop.profile(method="gmm-kalman", n_components=10)  # mixture-of-Gaussians prior
profile = pop.profile(method="knn", k=10)                    # non-parametric kNN baseline
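The `"knn"` method described above can be approximated by the following standalone sketch: find the k population rows closest to the new entity on the observed features, then average their full rows with inverse-distance weights. The function name, the Euclidean distance, and the small `eps` guard against division by zero are assumptions made for illustration; the library may differ in all three.

```python
import numpy as np

def knn_predict(matrix, observed, k, eps=1e-9):
    """Predict all K features from a few observed ones.

    matrix:   (N, K) entity-feature matrix
    observed: {feature_index: observed_value}
    """
    idx = np.array(sorted(observed))
    target = np.array([observed[i] for i in idx])
    # Distance is measured only in the observed-feature subspace.
    dist = np.linalg.norm(matrix[:, idx] - target, axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]
    weights = 1.0 / (dist[nearest] + eps)         # inverse-distance weights
    weights /= weights.sum()
    return weights @ matrix[nearest]              # point prediction, shape (K,)

matrix = np.array([[1.0, 1.0, 0.0],
                   [1.0, 0.8, 0.2],
                   [0.0, 0.0, 1.0],
                   [0.0, 0.2, 0.9]])
prediction = knn_predict(matrix, observed={0: 1.0}, k=2)
```

The result is a point estimate only. Nothing in this procedure yields a posterior covariance, which is consistent with the NaN / None uncertainty fields noted for KNNProfile.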

`entity`¶

def entity(self, name: str) -> np.ndarray

Get the raw feature vector for a named entity. Returns a copy as a numpy array of shape (K,).

`skill_vector`¶

def skill_vector(self, name: str) -> pd.Series

Get a named entity's feature vector as a labeled pd.Series.

`describe_skills`¶

def describe_skills(self, descriptions: dict[str, str] | list[Skill]) -> None

Attach descriptions to skill dimensions. Descriptions flow through to Profiles, appearing in predict() output.

Example

pop.describe_skills({
    "BBH": "Big-Bench Hard: diverse challenging tasks",
    "MMLU-PRO": "Professional-level multitask understanding",
})

Analysis Methods¶

`summary`¶

def summary(self) -> dict

Summary statistics for this population.

Returns: dict with keys:

Key	Type	Description
`n_entities`	`int`	Number of entities (rows)
`n_features`	`int`	Number of features (columns)
`shrinkage`	`float | None`	Ledoit-Wolf shrinkage coefficient
`condition_number`	`float`	Covariance matrix condition number
`effective_dimensions`	`int`	PCA components needed for 90% variance
`mean_correlation`	`float`	Average absolute off-diagonal correlation
`sparsity`	`float`	Fraction of |correlations| below 0.1
`top_correlations`	`list[dict]`	Top 5 feature pairs by |correlation|
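The two correlation statistics in this table can be reproduced from a correlation matrix roughly as follows. The sketch assumes each unordered feature pair is counted once (upper triangle, diagonal excluded); the library's exact convention is not stated here, and the matrix is made up.

```python
import numpy as np

corr = np.array([[1.00, 0.80, 0.05, 0.30],
                 [0.80, 1.00, -0.02, 0.20],
                 [0.05, -0.02, 1.00, -0.60],
                 [0.30, 0.20, -0.60, 1.00]])

K = corr.shape[0]
upper = np.triu_indices(K, k=1)        # each unordered pair exactly once
abs_off = np.abs(corr[upper])

mean_correlation = abs_off.mean()      # average |off-diagonal correlation|
sparsity = (abs_off < 0.1).mean()      # fraction of pairs with |r| < 0.1
```

Here two of the six pairs fall below 0.1 in absolute value, so `sparsity` is one third.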

`top_correlations`¶

def top_correlations(self, k: int = 20) -> pd.DataFrame

Top-k strongest feature-feature correlations by absolute value. Returns a DataFrame with columns [feature_a, feature_b, correlation].
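A standalone sketch of the ranking: take each feature pair once, sort by absolute correlation, and keep the signed value. It returns plain tuples in the (feature_a, feature_b, correlation) order rather than a DataFrame, and the helper name, feature names, and matrix are hypothetical.

```python
import numpy as np

def top_pairs(corr, names, k):
    """Return the k feature pairs with the largest |correlation|."""
    rows, cols = np.triu_indices(len(names), k=1)  # each pair once
    order = np.argsort(-np.abs(corr[rows, cols]), kind="stable")[:k]
    return [(names[rows[i]], names[cols[i]], float(corr[rows[i], cols[i]]))
            for i in order]

feature_names = ["a", "b", "c", "d"]
corr = np.array([[1.00, 0.80, 0.05, 0.30],
                 [0.80, 1.00, -0.02, 0.20],
                 [0.05, -0.02, 1.00, -0.60],
                 [0.30, 0.20, -0.60, 1.00]])

pairs = top_pairs(corr, feature_names, k=3)
```

Strong negative correlations rank alongside strong positive ones because the ordering uses the absolute value.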

`pca`¶

def pca(self, n_components: int = 15) -> dict

PCA decomposition. Returns dict with keys: components, explained_variance_ratio, cumulative.
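For orientation, a PCA of a covariance matrix can be sketched with an eigendecomposition as below. The dictionary keys mirror the ones listed above, but the shape of `components` (one row per component) and the way `effective_dimensions` is derived from the cumulative ratio are assumptions, not confirmed library behaviour.

```python
import numpy as np

def pca_from_covariance(cov, n_components):
    """Eigendecomposition-based PCA of a symmetric covariance matrix."""
    vals, vecs = np.linalg.eigh(cov)               # ascending eigenvalues
    order = np.argsort(vals)[::-1]                 # largest variance first
    vals, vecs = vals[order], vecs[:, order]
    ratio = vals / vals.sum()
    n = min(n_components, len(vals))
    return {
        "components": vecs[:, :n].T,               # (n, K), one row per component
        "explained_variance_ratio": ratio[:n],
        "cumulative": np.cumsum(ratio)[:n],
    }

cov = np.diag([7.0, 2.5, 0.3, 0.2])                # toy covariance
result = pca_from_covariance(cov, n_components=3)

# Smallest number of components whose cumulative ratio reaches 90%.
effective_dimensions = int(np.searchsorted(result["cumulative"], 0.90) + 1)
```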

`condition_number`¶

def condition_number(self) -> float

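Condition number of the covariance matrix (\(\lambda_{\max} / \lambda_{\min}\)), the ratio of its largest to its smallest eigenvalue. Lower values indicate a better-conditioned, numerically more stable matrix.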
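The sketch below computes this ratio with NumPy and shows how linear shrinkage toward a scaled identity lowers it. The shrinkage weight `alpha` is fixed by hand purely for illustration, whereas Ledoit-Wolf estimates it from the data; both function names are hypothetical.

```python
import numpy as np

def condition_number(cov):
    """lambda_max / lambda_min of a symmetric positive-definite matrix."""
    vals = np.linalg.eigvalsh(cov)                 # ascending eigenvalues
    return vals[-1] / vals[0]

def shrink(cov, alpha):
    """Blend cov with mu * I, where mu is the average variance."""
    k = cov.shape[0]
    mu = np.trace(cov) / k
    return (1 - alpha) * cov + alpha * mu * np.eye(k)

cov = np.diag([4.0, 1.0, 0.01])                    # one nearly degenerate direction
raw = condition_number(cov)
shrunk = condition_number(shrink(cov, alpha=0.2))
```

Shrinkage adds the same positive constant to every eigenvalue after scaling, which pulls the extremes together and reduces the ratio.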

Export¶

`to_dataframe`¶

def to_dataframe(self) -> pd.DataFrame

The entity-feature matrix as a DataFrame (copy).

`to_csv`¶

def to_csv(self, path: str) -> None

Export the entity-feature matrix to CSV. Re-import with Population.from_csv(path).
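The round trip relies on entity names surviving as the first CSV column, which is what the `index_col=0` default of `from_csv` suggests. The pandas-only sketch below illustrates that assumption with an in-memory buffer; the entity "Accountants" and all numbers are made up, and `to_csv` itself is not called.

```python
import io
import pandas as pd

# Entity names live in the index, features in the columns.
df = pd.DataFrame(
    {"python": [0.9, 0.2], "sql": [0.7, 0.4]},
    index=["Software Developers", "Accountants"],
)

buffer = io.StringIO()
df.to_csv(buffer)                                  # index written as first column
buffer.seek(0)

# Reading with index_col=0 restores entity names as the index.
restored = pd.read_csv(buffer, index_col=0)
```

Reading the same file without `index_col=0` would leave the entity names as an ordinary string column, which conflicts with the requirement that all values be numeric.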

`to_parquet`¶

def to_parquet(self, path: str) -> None

Export the entity-feature matrix to Parquet. Re-import with Population.from_parquet(path).

Properties¶

Property	Type	Description
`covariance_df`	`pd.DataFrame`	(K, K) covariance matrix with feature names
`correlation_df`	`pd.DataFrame`	(K, K) correlation matrix with feature names
`skills`	`list[Skill]`	Skill objects with descriptions

Attributes¶

Attribute	Type	Description
`matrix`	`pd.DataFrame`	(N, K) raw data
`feature_names`	`list[str]`	Feature/skill names
`entity_names`	`list[str]`	Entity names
`covariance`	`np.ndarray`	(K, K) covariance matrix
`correlation`	`np.ndarray`	(K, K) correlation matrix
`population_mean`	`np.ndarray`	(K,) mean across all entities
`shrinkage`	`float | None`	Ledoit-Wolf shrinkage coefficient
