Preprocessing

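scikit-fair implements five families of fairness preprocessing algorithms (weighting, label modification, undersampling, oversampling, and feature transformation), plus a few utility transformers.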


Weighting methods

These methods compute per-sample weights without modifying the feature matrix or labels. The weights are passed to downstream estimators via sample_weight.

Reweighing

Reference: Kamiran & Calders (2012)

Assigns weights so that the weighted dataset exhibits statistical independence between the sensitive attribute A and the label Y:

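weight(a, y) = P(A=a) * P(Y=y) / P(A=a, Y=y)

from sklearn.linear_model import LogisticRegression
from skfair.preprocessing import Reweighing

rw = Reweighing(sens_attr="sex", priv_group=1)
X_out, weights = rw.fit_transform(X, y)

# Any estimator whose fit() accepts sample_weight can consume the weights.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_out, y, sample_weight=weights)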
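To make the formula concrete, here is a minimal standalone sketch that computes the same weights from raw counts. It does not use scikit-fair; the function name reweighing_weights and the toy data are assumptions made for illustration only.

```python
from collections import Counter


def reweighing_weights(a, y):
    """Return one weight per sample: P(A=a) * P(Y=y) / P(A=a, Y=y)."""
    n = len(a)
    count_a = Counter(a)            # group sizes
    count_y = Counter(y)            # class sizes
    count_ay = Counter(zip(a, y))   # (group, class) cell sizes
    return [
        (count_a[ai] / n) * (count_y[yi] / n) / (count_ay[(ai, yi)] / n)
        for ai, yi in zip(a, y)
    ]


def weighted_positive_rate(w, a, y, group):
    """Weighted share of label 1 inside one group."""
    total = sum(wi for wi, ai in zip(w, a) if ai == group)
    positive = sum(wi for wi, ai, yi in zip(w, a, y) if ai == group and yi == 1)
    return positive / total


# Toy data: group 1 has a 75% positive rate, group 0 only 25%.
a = [1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 1, 1, 0, 1, 0, 0, 0]
w = reweighing_weights(a, y)

rate_priv = weighted_positive_rate(w, a, y, group=1)
rate_unpriv = weighted_positive_rate(w, a, y, group=0)
```

Over-represented cells such as (A=1, Y=1) receive weights below 1 and under-represented cells receive weights above 1, so the weighted positive rate comes out the same in both groups.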

FairBalance

Reference: Yu, Chakraborty & Menzies (2024)

Balances class distribution within each demographic group:

weight(a, y) = |A=a| / |A=a, Y=y|

from skfair.preprocessing import FairBalance

fb = FairBalance(sens_attr="sex", priv_group=1)
X_out, weights = fb.fit_transform(X, y)

A variant mode is available that additionally normalises by the overall group ratio.
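The base formula (not the variant mode) can be checked by hand with a few lines of plain Python. This is a sketch under assumptions: fairbalance_weights is a hypothetical helper, not part of scikit-fair, and the data are invented.

```python
from collections import Counter


def fairbalance_weights(a, y):
    """Return one weight per sample: |A=a| / |A=a, Y=y|."""
    count_a = Counter(a)            # group sizes
    count_ay = Counter(zip(a, y))   # (group, class) cell sizes
    return [count_a[ai] / count_ay[(ai, yi)] for ai, yi in zip(a, y)]


# Group 1: five positives, one negative. Group 0: one positive, three negatives.
a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
w = fairbalance_weights(a, y)

# Total weight carried by each (group, class) cell.
cell_weight = Counter()
for wi, ai, yi in zip(w, a, y):
    cell_weight[(ai, yi)] += wi
```

Within each group both classes end up carrying the same total weight, which is what "balances class distribution within each demographic group" means in practice.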

Wrapper classifiers

ReweighingClassifier and FairBalanceClassifier bundle the weighting step with any sklearn estimator:

from skfair.preprocessing import ReweighingClassifier
from sklearn.ensemble import RandomForestClassifier

clf = ReweighingClassifier(
    estimator=RandomForestClassifier(),
    sens_attr="sex",
    priv_group=1,
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Label modification methods

These methods change class labels (not features) to reduce discrimination. The sample count stays the same (clean-sampling).

Massaging

Reference: Kamiran & Calders (2012)

Ranks candidates with a logistic regression model and swaps the labels of the top-ranked ones:

  • Promotion: unprivileged samples with label 0, ranked by predicted probability of label 1
  • Demotion: privileged samples with label 1, ranked by predicted probability of label 0

from skfair.preprocessing import Massaging

sampler = Massaging(sens_attr="sex", priv_group=1)
X_fair, y_fair = sampler.fit_resample(X, y)
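The promotion and demotion steps can be sketched without the library. The version below assumes the ranker scores are already available as a list of predicted probabilities of label 1, and it takes the number of swaps from Kamiran & Calders (2012), M = disc * |priv| * |unpriv| / |D|, rounded up. The function name massage and the rounding choice are assumptions; scikit-fair's internals may differ.

```python
import math


def massage(y, priv, scores):
    """Return a relabelled copy of y.

    y      : list of 0/1 labels
    priv   : list of booleans, True for privileged samples
    scores : predicted probability of label 1 for each sample
    """
    n = len(y)
    n_priv = sum(priv)
    n_unpriv = n - n_priv
    pos_priv = sum(yi for yi, p in zip(y, priv) if p)
    pos_unpriv = sum(yi for yi, p in zip(y, priv) if not p)

    # disc * n_priv * n_unpriv / n, written with integers to avoid rounding noise.
    m = max(0, math.ceil((pos_priv * n_unpriv - pos_unpriv * n_priv) / n))

    # Promotion candidates: unprivileged negatives, most likely positive first.
    promote = sorted(
        (i for i in range(n) if not priv[i] and y[i] == 0),
        key=lambda i: scores[i],
        reverse=True,
    )[:m]
    # Demotion candidates: privileged positives, most likely negative first.
    demote = sorted(
        (i for i in range(n) if priv[i] and y[i] == 1),
        key=lambda i: scores[i],
    )[:m]

    y_new = list(y)
    for i in promote:
        y_new[i] = 1
    for i in demote:
        y_new[i] = 0
    return y_new


y = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
priv = [True] * 5 + [False] * 5
scores = [0.9, 0.8, 0.7, 0.55, 0.3, 0.8, 0.6, 0.45, 0.2, 0.1]
y_fair = massage(y, priv, scores)
```

Because every promotion is paired with a demotion, the overall number of positive labels is unchanged; only their distribution across the two groups shifts.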

Undersampling methods

FairwayRemover

Removes "ambiguous" samples where group-specific models trained on privileged and unprivileged subsets disagree. Only samples both models agree on are retained.

from skfair.preprocessing import FairwayRemover

remover = FairwayRemover(sens_attr="sex", priv_group=1)
X_clean, y_clean = remover.fit_resample(X, y)
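The agreement filter can be illustrated with a toy stand-in. In the sketch below a nearest-centroid rule replaces whatever group-specific models the library trains, so it shows only the filtering logic; all function names and data are assumptions.

```python
def fit_centroids(X, y):
    """Toy classifier: store one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, yi in zip(X, y) if yi == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label of the closest class centroid."""
    def sq_dist(c):
        return sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def remove_ambiguous(X, y, a):
    """Keep only samples on which the two group-specific models agree."""
    priv = [i for i, ai in enumerate(a) if ai == 1]
    unpriv = [i for i, ai in enumerate(a) if ai == 0]
    model_p = fit_centroids([X[i] for i in priv], [y[i] for i in priv])
    model_u = fit_centroids([X[i] for i in unpriv], [y[i] for i in unpriv])
    keep = [
        i for i in range(len(X))
        if predict(model_p, X[i]) == predict(model_u, X[i])
    ]
    return [X[i] for i in keep], [y[i] for i in keep], keep


# One feature; the privileged group gets label 1 at lower feature values.
X = [[1], [2], [6], [8], [9], [3], [4], [10], [11]]
y = [0, 0, 1, 1, 1, 0, 0, 1, 1]
a = [1, 1, 1, 1, 1, 0, 0, 0, 0]
X_clean, y_clean, kept = remove_ambiguous(X, y, a)
```

Here the sample with feature value 6 is predicted positive by the privileged-group model and negative by the unprivileged-group model, so it is the only one dropped.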

Oversampling methods

These methods generate synthetic samples to bring all (class × group) subgroup sizes into balance.

FairOversampling

Per-group class balancing using SMOTE interpolation. Within each protected group, the minority class is oversampled independently.

from skfair.preprocessing import FairOversampling

fos = FairOversampling(sens_attr="sex", priv_group=1)
X_res, y_res = fos.fit_resample(X, y)
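As a rough illustration of per-group balancing, the sketch below oversamples the smaller class inside each group by interpolating between a sample and its nearest same-cell neighbour. It is a simplification: standard SMOTE picks one of k nearest neighbours at random, whereas this version uses the single nearest one, and the function names are assumptions rather than library API.

```python
import random
from collections import Counter, defaultdict


def smote_point(x, neighbour, rng):
    """Random point on the segment between x and neighbour."""
    u = rng.random()
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbour)]


def nearest(x, pool):
    """Closest other point in pool (falls back to x if it is alone)."""
    others = [p for p in pool if p is not x] or [x]
    return min(others, key=lambda p: sum((s - t) ** 2 for s, t in zip(x, p)))


def oversample_per_group(X, y, a, seed=0):
    rng = random.Random(seed)
    cells = defaultdict(list)              # (group, class) -> samples
    for x, yi, ai in zip(X, y, a):
        cells[(ai, yi)].append(x)

    X_res, y_res, a_res = list(X), list(y), list(a)
    for group in set(a):
        labels = [label for (g, label) in cells if g == group]
        target = max(len(cells[(group, label)]) for label in labels)
        for label in labels:
            pool = cells[(group, label)]
            for _ in range(target - len(pool)):
                x = rng.choice(pool)
                X_res.append(smote_point(x, nearest(x, pool), rng))
                y_res.append(label)
                a_res.append(group)
    return X_res, y_res, a_res


X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0],
     [0.0, 0.5], [1.0, 0.5], [5.0, 6.0], [6.0, 6.0], [5.5, 5.5]]
y = [1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0]
a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
X_res, y_res, a_res = oversample_per_group(X, y, a)
cell_sizes = Counter(zip(a_res, y_res))
```

Each group is balanced against its own majority class, so group sizes can still differ from each other after resampling.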

FairSmote

Balances all four (class × group) subgroups using Differential Evolution-style mutation:

x_new = x_parent + F * (x_neighbor1 - x_neighbor2)

from skfair.preprocessing import FairSmote

fs = FairSmote(sens_attr="sex", priv_group=1, cr=0.8, f=0.8)
X_res, y_res = fs.fit_resample(X, y)
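The mutation rule is easy to try in isolation. The sketch below assumes that cr is a per-feature crossover rate (with probability cr the mutated value is used, otherwise the parent's value is kept), as is usual in Differential Evolution; that reading of cr, and the function name de_mutate, are assumptions and not confirmed library behaviour.

```python
import random


def de_mutate(parent, neighbour1, neighbour2, f=0.8, cr=0.8, rng=random):
    """Build one synthetic sample from a parent and two of its neighbours."""
    child = []
    for p, n1, n2 in zip(parent, neighbour1, neighbour2):
        if rng.random() < cr:
            # Mutate: x_new = x_parent + F * (x_neighbor1 - x_neighbor2)
            child.append(p + f * (n1 - n2))
        else:
            # Keep the parent's value for this feature.
            child.append(p)
    return child


parent = [1.0, 2.0]
n1 = [3.0, 2.0]
n2 = [1.0, 4.0]

always_mutate = de_mutate(parent, n1, n2, f=0.5, cr=1.0)
never_mutate = de_mutate(parent, n1, n2, f=0.5, cr=0.0)
```

With cr=1.0 every feature is mutated and with cr=0.0 the child is a copy of the parent; values in between mix the two.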

FAWOS

Reference: Salazar et al. (2021)

Typology-based weighted oversampling. Samples are classified by their KNN neighbourhood:

Type          Same-type neighbours    Sampling weight
Safe          4–5                     High
Borderline    2–3                     Medium
Rare          1                       Low
Outlier       0                       Excluded

from skfair.preprocessing import FAWOS

fawos = FAWOS(sens_attr="sex", priv_group=1)
X_res, y_res = fawos.fit_resample(X, y)
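The typology rule can be sketched in a few lines. The thresholds follow the table above and imply k = 5 neighbours; what counts as "same type" is passed in as an opaque per-sample tag, since the exact definition used by the library is not stated here. Function names are assumptions.

```python
def typology(same_type_neighbours):
    """Map a same-type neighbour count (out of 5) to a typology label."""
    if same_type_neighbours >= 4:
        return "safe"
    if same_type_neighbours >= 2:
        return "borderline"
    if same_type_neighbours == 1:
        return "rare"
    return "outlier"


def count_same_type_neighbours(i, X, types, k=5):
    """Count how many of the k nearest neighbours of X[i] share its tag."""
    dists = sorted(
        (sum((p - q) ** 2 for p, q in zip(X[i], X[j])), j)
        for j in range(len(X))
        if j != i
    )
    return sum(types[j] == types[i] for _, j in dists[:k])


# Six clustered points tagged "A" and one distant point tagged "B".
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [20.0]]
types = ["A", "A", "A", "A", "A", "A", "B"]

labels = [typology(count_same_type_neighbours(i, X, types)) for i in range(len(X))]
```

The isolated "B" point has no same-type neighbours, so it is labelled an outlier and would be excluded from oversampling.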

HeterogeneousFOS

Reference: Sonoda et al. (2023)

Uses heterogeneous clusters for interpolation:

  • H_y: class-heterogeneous neighbours (different class, same group)
  • H_g: group-heterogeneous neighbours (same class, different group)

For each synthetic sample, a Bernoulli draw determines which of the two clusters is used.

from skfair.preprocessing import HeterogeneousFOS

hfos = HeterogeneousFOS(sens_attr="sex", priv_group=1)
X_res, y_res = hfos.fit_resample(X, y)

Feature transformation methods

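These methods modify the feature matrix itself. Most are sklearn transformers (TransformerMixin); OptimizedPreprocessing also transforms the labels, and FairMask is a meta-estimator that wraps a classifier.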

DisparateImpactRemover

Reference: Feldman et al. (2015)

Repairs feature distributions using quantile buckets. Each non-sensitive feature is mapped towards a shared "median" distribution:

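x_repaired = (1 - λ) * x_original + λ * x_median

Here x_median is the value of the shared "median" distribution at the same quantile as x_original, and λ is set by lambda_param: lambda_param=0.0 leaves the data unchanged; lambda_param=1.0 applies full repair.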

from skfair.preprocessing import DisparateImpactRemover

repair = DisparateImpactRemover(sens_attr="sex", repair_columns=["income", "hours_per_week"], lambda_param=0.8)
X_repaired = repair.fit_transform(X)
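The repair of a single column can be sketched in plain Python. This version uses exact ranks and linear interpolation instead of quantile buckets, and takes the per-quantile median across groups as the shared target; it is an illustration under those assumptions, and repair_feature is a hypothetical name.

```python
import statistics


def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def repair_feature(values, groups, lam):
    """Blend each value with the cross-group median at the same quantile."""
    members = {}
    for i, g in enumerate(groups):
        members.setdefault(g, []).append(i)
    sorted_by_group = {
        g: sorted(values[i] for i in idx) for g, idx in members.items()
    }

    repaired = list(values)
    for g, idx in members.items():
        order = sorted(idx, key=lambda i: values[i])
        n = len(order)
        for rank, i in enumerate(order):
            # Quantile of this sample within its own group.
            q = rank / (n - 1) if n > 1 else 0.5
            # Shared target: median over groups of their value at quantile q.
            x_median = statistics.median(
                quantile(vals, q) for vals in sorted_by_group.values()
            )
            repaired[i] = (1 - lam) * values[i] + lam * x_median
    return repaired


income = [1.0, 2.0, 3.0, 4.0, 11.0, 12.0, 13.0, 14.0]
sex = [0, 0, 0, 0, 1, 1, 1, 1]

unchanged = repair_feature(income, sex, lam=0.0)
full = repair_feature(income, sex, lam=1.0)
```

With lam=1.0 both groups end up with the same set of values (6, 7, 8, 9 in this toy case), so the column no longer reveals group membership, while each sample keeps its rank within its group.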

OptimizedPreprocessing

Reference: Calmon et al. (2017)

Solves a convex optimisation problem to find a joint transformation of features and labels that minimises discrimination while preserving data utility.

from skfair.preprocessing import OptimizedPreprocessing

op = OptimizedPreprocessing(
    sens_attr="sex",
    epsilon=0.05,
)
X_out, y_out = op.fit_transform(X, y)

Warning

Small datasets with tight epsilon can make the optimisation infeasible. Use epsilon >= 0.05 and ensure you have enough samples in each subgroup.

Note

OptimizedPreprocessing requires all features to be discrete (categorical). Because of this specific data requirement, it may be excluded from automated benchmarks, which typically use datasets with mixed continuous/discrete features.

LearningFairRepresentations

Reference: Zemel et al. (2013)

Learns a fair intermediate representation by optimising three objectives simultaneously: prediction accuracy, statistical parity, and reconstruction fidelity.

from skfair.preprocessing import LearningFairRepresentations

lfr = LearningFairRepresentations(sens_attr="sex", priv_group=1, k=10)
Z = lfr.fit_transform(X, y)
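For orientation, the three objectives are combined in Zemel et al. (2013) as a weighted sum. The notation below follows the paper, not scikit-fair: the weights A_z, A_x, A_y are the paper's symbols and may not match the library's parameter names, and K is the number of prototypes (presumably what k=10 sets in the example above).

```latex
L = A_z \, L_z + A_x \, L_x + A_y \, L_y

% Statistical parity: prototype k is used equally often by both groups
L_z = \sum_{k=1}^{K} \left| M_k^{+} - M_k^{-} \right|

% Reconstruction fidelity
L_x = \sum_{n=1}^{N} \left\| x_n - \hat{x}_n \right\|^2

% Prediction accuracy (cross-entropy)
L_y = \sum_{n=1}^{N} \left[ -y_n \log \hat{y}_n - (1 - y_n) \log (1 - \hat{y}_n) \right]
```

Here M_k^{+} and M_k^{-} are the mean probabilities of assignment to prototype k in the protected and unprotected groups respectively.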

FairMask

Reference: Peng et al. (2021)

A meta-estimator that masks sensitive attribute values at inference time. During training it builds extrapolation models; at prediction time the sensitive values are replaced with synthetic estimates:

from skfair.preprocessing import FairMask
from sklearn.linear_model import LogisticRegression

clf = FairMask(
    estimator=LogisticRegression(max_iter=1000),
    sens_attr="sex",
)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Utility

IntersectionalBinarizer

Creates a binary protected-group column from complex multi-attribute criteria. Useful when the protected group is defined by the intersection of multiple attributes.

Supported criteria:

  • Equality: {"race": "White"}
  • List membership: {"race": ["White", "Asian"]}
  • Threshold: {"age": {">": 65}}

from skfair.preprocessing import IntersectionalBinarizer

binarizer = IntersectionalBinarizer(
    privileged_definition={"sex": 1, "race": ["White"]},
    group_col_name="privileged",
)
X_out = binarizer.fit_transform(X)
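The three kinds of criteria can be mimicked with a small row-matching function. The sketch below is not the library's implementation: it assumes all criteria are combined with AND (an intersection), and the operators other than ">" are assumptions added for completeness.

```python
import operator

# Comparison operators for threshold rules. Only ">" appears in the
# documentation; the others are assumed here for illustration.
_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(row, definition):
    """Return True if the row satisfies every criterion in the definition."""
    for column, rule in definition.items():
        value = row[column]
        if isinstance(rule, dict):
            # Threshold: {"age": {">": 65}}
            ok = all(_OPS[op](value, bound) for op, bound in rule.items())
        elif isinstance(rule, (list, tuple, set)):
            # List membership: {"race": ["White", "Asian"]}
            ok = value in rule
        else:
            # Equality: {"sex": 1}
            ok = value == rule
        if not ok:
            return False
    return True


rows = [
    {"sex": 1, "race": "White", "age": 70},
    {"sex": 1, "race": "Black", "age": 70},
    {"sex": 0, "race": "White", "age": 30},
]
definition = {"sex": 1, "race": ["White", "Asian"], "age": {">": 65}}

privileged = [int(matches(row, definition)) for row in rows]
```

Only the first row meets all three criteria, so it alone is flagged as privileged.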

DropColumns

Drop named columns inside an sklearn Pipeline:

from skfair.preprocessing import DropColumns
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("drop_sens", DropColumns(columns=["sex"])),
    ("clf", LogisticRegression()),
])

Pipeline integration

Samplers follow the imbalanced-learn API and work with imblearn.pipeline.Pipeline:

from imblearn.pipeline import Pipeline
from skfair.preprocessing import Massaging
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("fair", Massaging(sens_attr="sex", priv_group=1)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

Transformers (DisparateImpactRemover, IntersectionalBinarizer, DropColumns) are standard sklearn transformers and work inside a regular sklearn.pipeline.Pipeline.

Tip: We recommend always using imblearn.pipeline.Pipeline. It extends sklearn's Pipeline with fit_resample support, so it works with all scikit-fair methods (transformers, samplers, and meta-estimators) without switching imports.