skfair.preprocessing¶
from skfair.preprocessing import <ClassName>
Base class¶
skfair.preprocessing.BaseFairSampler¶
Bases: BaseSampler
Base class for fairness-aware samplers.
Extends imblearn's BaseSampler with common functionality for fair resampling algorithms. Ensures DataFrame inputs are preserved and provides utilities for sensitive attribute handling, KNN operations, and synthetic sample generation.
Parameters:
Notes
Subclasses must:
- Set _sampling_type class attribute ("over-sampling" or "clean-sampling")
- Implement _fit_resample(X, y) method
- Call super().__init__(sens_attr, random_state) in __init__
Weighting methods¶
skfair.preprocessing.Reweighing¶
Bases: BaseEstimator
Reweighing preprocessing technique (Kamiran & Calders, 2012).
The Reweighing method assigns weights to training samples such that
the weighted dataset exhibits less statistical dependence between
a sensitive attribute and the label. It does not modify the feature
matrix or the labels; instead, it outputs instance weights that can
be passed to downstream learning algorithms via sample_weight.
Conceptual Summary
Let A be a sensitive attribute (e.g., sex, race) and Y the binary label. The method computes the expected probability of each (A, Y) combination under independence:
P(A=a, Y=y)_expected = P(A=a) * P(Y=y)
and compares it to the empirical joint probabilities:
P(A=a, Y=y)_observed
Each sample receives a weight:
weight = P_expected / P_observed
so that, after weighting, the distribution of outcomes is closer to independence between A and Y.
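As a concrete illustration, the weight table described above can be computed from scratch with pandas. The helper name `reweighing_weights` below is ours for this sketch, not part of the skfair API:

```python
import pandas as pd

def reweighing_weights(A: pd.Series, Y: pd.Series) -> pd.Series:
    """Per-sample weight P_expected / P_observed for each (A, Y) cell."""
    n = len(A)
    p_a = A.value_counts() / n           # P(A=a)
    p_y = Y.value_counts() / n           # P(Y=y)
    p_joint = pd.crosstab(A, Y) / n      # observed P(A=a, Y=y)
    return pd.Series(
        [p_a[a] * p_y[y] / p_joint.loc[a, y] for a, y in zip(A, Y)],
        index=A.index,
    )

A = pd.Series([0, 0, 0, 1, 1, 1, 1, 1])  # sensitive attribute
Y = pd.Series([0, 0, 1, 0, 1, 1, 1, 1])  # binary label
w = reweighing_weights(A, Y)
```

A useful sanity check: the weights always sum to the number of samples, and after weighting the joint distribution of (A, Y) factorises exactly into the product of its marginals.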
Reference
F. Kamiran and T. Calders, "Data Preprocessing Techniques for Classification without Discrimination," Knowledge and Information Systems, 2012.
Parameters:
Attributes (after fit/fit_transform)
group_probs_ : dict
    Empirical probabilities P(A=a) and P(Y=y). Keys: ("A", a) and ("Y", y).
joint_probs_ : dict
    Empirical joint probabilities P(A=a, Y=y). Keys: (a, y).
expected_probs_ : dict
    Expected independence probabilities P(A=a)*P(Y=y). Keys: (a, y).
weight_table_ : dict
    Mapping (a, y) -> W(a, y) according to Algorithm 3.
weights_ : pandas Series of shape (n_samples,)
    Per-sample weights for the data passed to fit_transform.
fit(X, y)¶
Fit the reweighing model.
This method validates the inputs and records basic label information (e.g., classes_). It does not compute weights; weight computation happens in fit_transform.
Parameters:
Returns:
fit_transform(X, y)¶
Run Algorithm 3 (Reweighing) and return per-sample weights.
Algorithm 3 (simplified)
For each group value s and class label c:
W(s, c) := (|{X | S = s}| * |{X | Y = c}|) / (|D| * |{X | S = s, Y = c}|)
Then each instance X_i with sensitive attribute S_i and label Y_i receives weight W(S_i, Y_i).
Parameters:
Returns:
skfair.preprocessing.FairBalance¶
Bases: BaseEstimator
FairBalance preprocessing technique (Yu, Chakraborty & Menzies, 2024).
FairBalance assigns weights to training samples to achieve equalized odds
by balancing the class distribution within each demographic group. It does
not modify the feature matrix or labels; instead, it outputs instance
weights that can be passed to downstream learning algorithms via
sample_weight.
Conceptual Summary
The key insight is that violation of equalized odds is caused by different class distributions across demographic groups. FairBalance fixes this by weighting samples so each (sensitive group, class) combination has balanced influence.
For each sample with sensitive attribute A=a and label Y=y:
weight = |A=a| / |A=a, Y=y|
This ensures the weighted class distribution becomes 1:1 (balanced) within each demographic group, which is a sufficient condition for achieving smAOD=0 (smoothed maximum Average Odds Difference).
Variant Mode
When variant=True, uses FairBalanceVariant formula:
weight = 1 / |A=a, Y=y|
Then rescales all weights so they sum to n_samples. This treats all demographic groups equally regardless of their size.
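Both formulas can be illustrated from scratch with pandas; the helper name `fairbalance_weights` below is ours for this sketch, not the skfair API:

```python
import pandas as pd

def fairbalance_weights(A: pd.Series, Y: pd.Series, variant: bool = False) -> pd.Series:
    """w(a, y) = |A=a| / |A=a, Y=y|, or 1 / |A=a, Y=y| in variant mode."""
    key = A.astype(str) + "|" + Y.astype(str)
    joint = key.map(key.value_counts())      # |A=a, Y=y| per sample
    if variant:
        w = 1.0 / joint
        return w * len(A) / w.sum()          # rescale so weights sum to n
    return A.map(A.value_counts()) / joint   # |A=a| / |A=a, Y=y|

A = pd.Series([0, 0, 0, 0, 1, 1])
Y = pd.Series([1, 0, 0, 0, 1, 0])
w = fairbalance_weights(A, Y)
```

Within each group, the total weight assigned to positives then equals the total weight assigned to negatives, which is the balanced 1:1 class distribution the method targets.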
Reference
Z. Yu, J. Chakraborty, and T. Menzies, "FairBalance: How to Achieve Equalized Odds With Data Pre-processing," IEEE Transactions on Software Engineering, 2024. https://github.com/hil-se/FairBalance
Parameters:
Attributes (after fit/fit_transform)
classes_ : tuple
    The two class labels (neg_label, pos_label).
weight_table_ : dict
    Mapping (a, y) -> weight for each (sensitive attribute, label) combination.
weights_ : pandas Series of shape (n_samples,)
    Per-sample weights for the data passed to fit_transform.
group_counts_ : dict
    Count of samples in each sensitive attribute group. Keys: attribute values.
joint_counts_ : dict
    Count of samples in each (attribute, label) combination. Keys: (a, y).
fit(X, y)¶
Fit the FairBalance model.
This method validates the inputs and records basic label information. It does not compute weights; weight computation happens in fit_transform.
Parameters:
Returns:
fit_transform(X, y)¶
Compute FairBalance weights and return them with the original data.
Weight Calculation
For FairBalance (variant=False): w(A=a, Y=y) = |A=a| / |A=a, Y=y|
For FairBalanceVariant (variant=True): w(A=a, Y=y) = 1 / |A=a, Y=y| (then rescaled so sum of weights = n_samples)
Parameters:
Returns:
skfair.preprocessing.ReweighingClassifier¶
Bases: BaseEstimator, ClassifierMixin
Meta-estimator that combines Reweighing with any classifier.
This wrapper makes Reweighing compatible with sklearn Pipelines by
encapsulating the weight computation and passing weights to the
underlying classifier's fit method via sample_weight.
The workflow is:
1. fit(X, y): Compute fairness weights using Reweighing, then fit the classifier with those weights.
2. predict(X): Delegate to the fitted classifier.
Parameters:
Attributes:
Example
>>> from skfair.preprocessing import ReweighingClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = ReweighingClassifier(
...     estimator=RandomForestClassifier(),
...     sens_attr='sex',
...     priv_group=1,
...     pos_label=1
... )
>>> clf.fit(X_train, y_train)
>>> predictions = clf.predict(X_test)
fit(X, y)¶
Fit the reweighing classifier.
- Compute fairness weights using Reweighing
- Fit the underlying classifier with sample_weight
Parameters:
Returns:
predict(X)¶
Predict class labels.
Parameters:
Returns:
predict_proba(X)¶
Predict class probabilities.
Only available if the underlying estimator supports predict_proba.
Parameters:
Returns:
score(X, y, sample_weight=None)¶
Return accuracy score on the given test data.
Parameters:
Returns:
skfair.preprocessing.FairBalanceClassifier¶
Bases: BaseEstimator, ClassifierMixin
Meta-estimator that combines FairBalance with any classifier.
This wrapper makes FairBalance compatible with sklearn Pipelines by
encapsulating the weight computation and passing weights to the
underlying classifier's fit method via sample_weight.
The workflow is:
1. fit(X, y): Compute fairness weights using FairBalance, then fit the classifier with those weights.
2. predict(X): Delegate to the fitted classifier.
Parameters:
Attributes:
Example
>>> from skfair.preprocessing import FairBalanceClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = FairBalanceClassifier(
...     estimator=RandomForestClassifier(),
...     sens_attr='sex',
...     pos_label=1
... )
>>> clf.fit(X_train, y_train)
>>> predictions = clf.predict(X_test)
fit(X, y)¶
Fit the FairBalance classifier.
- Compute fairness weights using FairBalance
- Fit the underlying classifier with sample_weight
Parameters:
Returns:
predict(X)¶
Predict class labels.
Parameters:
Returns:
predict_proba(X)¶
Predict class probabilities.
Only available if the underlying estimator supports predict_proba.
Parameters:
Returns:
score(X, y, sample_weight=None)¶
Return accuracy score on the given test data.
Parameters:
Returns:
Label modification¶
skfair.preprocessing.Massaging¶
Bases: BaseFairSampler
Massaging preprocessing technique (Kamiran & Calders, 2012).
Modifies class labels of training samples to reduce the statistical dependence between the sensitive attribute and the label. Identifies candidates for promotion (unprivileged with negative label) and demotion (privileged with positive label), then swaps their labels.
Parameters:
Attributes:
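The promotion/demotion step can be sketched as follows. As in Kamiran & Calders, a ranker scores the candidates so that label swaps happen closest to the decision boundary; the helper name `massage`, the fixed `n_swaps`, and the 0/1-label assumption are ours for illustration, not the skfair API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def massage(X, y, sens, priv=1, pos=1, n_swaps=1):
    """Swap labels of the n_swaps best promotion/demotion candidates.

    Assumes binary 0/1 labels.
    """
    y = np.asarray(y).copy()
    # rank samples by positive-class probability from a ranker
    scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
    # promotion: unprivileged with negative label, highest scores first
    promo = np.where((sens != priv) & (y != pos))[0]
    promo = promo[np.argsort(scores[promo])[::-1]][:n_swaps]
    # demotion: privileged with positive label, lowest scores first
    demo = np.where((sens == priv) & (y == pos))[0]
    demo = demo[np.argsort(scores[demo])][:n_swaps]
    y[promo] = pos
    y[demo] = 1 - pos
    return y

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
sens = np.array([0, 0, 1, 1])  # 1 = privileged
y_massaged = massage(X, y, sens)
```

Because promotions and demotions happen in equal numbers, the overall class distribution is preserved while the dependence between the sensitive attribute and the label shrinks.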
skfair.preprocessing.FairwayRemover¶
Bases: BaseFairSampler
Removes 'ambiguous' data points that cause conflicting predictions between privileged and unprivileged group models.
Trains two separate models on privileged and unprivileged groups, then removes any samples where the models disagree on predictions.
Parameters:
Attributes:
Oversampling¶
skfair.preprocessing.FairOversampling¶
Bases: BaseFairSampler
Fair Oversampling (FOS) by Dablain et al.
Balances class distributions within each sensitive group independently. For each group (privileged/unprivileged), oversamples the minority class to match the majority class count using SMOTE-style interpolation.
Parameters:
Attributes:
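A simplified sketch of the per-group balancing idea, interpolating between random minority-class pairs instead of the k-nearest-neighbour selection a full SMOTE-style implementation would use; names and this simplification are ours, not the skfair implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def fair_oversample(X, y, sens):
    """Within each sensitive group, grow each minority class to parity."""
    X_new, y_new, s_new = [X], [y], [sens]
    for g in np.unique(sens):
        for c in np.unique(y):
            idx = np.where((sens == g) & (y == c))[0]
            other = np.where((sens == g) & (y != c))[0]
            deficit = len(other) - len(idx)
            if deficit <= 0 or len(idx) == 0:
                continue
            for _ in range(deficit):
                # SMOTE-style interpolation between two minority points
                a, b = X[rng.choice(idx)], X[rng.choice(idx)]
                lam = rng.random()
                X_new.append((a + lam * (b - a))[None, :])
                y_new.append([c])
                s_new.append([g])
    return np.vstack(X_new), np.concatenate(y_new), np.concatenate(s_new)

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
              [3.0, 3.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([0, 1, 1, 0, 0, 1])
sens = np.array([0, 0, 0, 1, 1, 1])
X_res, y_res, s_res = fair_oversample(X, y, sens)
```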
skfair.preprocessing.FairSmote¶
Bases: BaseFairSampler
Fair-SMOTE oversampler for fairness-aware data balancing.
Balances the dataset across all (class_label × sensitive_attribute) subgroups. Identifies the largest subgroup and oversamples all others to match that size using a Differential Evolution-style crossover.
Parameters:
Attributes:
fit_resample(X, y)¶
Resample the dataset to balance all (class, sensitive_attr) subgroups.
Overrides BaseSampler.fit_resample to preserve DataFrame input.
Parameters:
Returns:
skfair.preprocessing.FAWOS¶
Bases: BaseFairSampler
FAWOS: Fairness-Aware Oversampling by Salazar et al.
Balances the dataset by oversampling positive unprivileged instances to match the ratio of positive-to-negative in the privileged group. Uses typology-based weighted selection (Safe, Borderline, Rare, Outlier) to prioritize points that are harder to learn.
Parameters:
Attributes:
Notes
- k=5 is fixed per Napierala & Stefanowski (2016) typology thresholds.
- SMOTE generation uses same-type neighbors (same Y and S) only, matching the authors' implementation, not global neighbors.
References
Salazar, T., Santos, M. S., Araújo, H., & Abreu, P. H. (2021). FAWOS: Fairness-Aware Oversampling Algorithm Based on Distributions of Sensitive Attributes. IEEE Access, 9, 81370-81379.
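The typology assignment can be sketched directly from its definition: count how many of a point's k=5 nearest neighbours share its class (4-5 → Safe, 2-3 → Borderline, 1 → Rare, 0 → Outlier). The function name `typology` is ours for this sketch:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def typology(X, y, k=5):
    """Label each sample Safe/Borderline/Rare/Outlier from its k-NN."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, ind = nn.kneighbors(X)                # column 0 is the point itself
    same = (y[ind[:, 1:]] == y[:, None]).sum(axis=1)
    labels = np.empty(len(y), dtype=object)
    labels[same >= 4] = "Safe"               # 4-5 same-class neighbours
    labels[(same == 2) | (same == 3)] = "Borderline"
    labels[same == 1] = "Rare"
    labels[same == 0] = "Outlier"
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [0.05, 0.05], [0.2, 0.1], [5.0, 5.0]])
y = np.array([0, 0, 0, 0, 0, 0, 1])
types = typology(X, y)                       # the isolated point is an Outlier
```

FAWOS then uses these type labels to weight which points are selected as SMOTE seeds, favouring the harder-to-learn Borderline and Rare instances.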
skfair.preprocessing.HeterogeneousFOS¶
Bases: BaseFairSampler
Fair Oversampling using Heterogeneous Clusters by Sonoda et al. (2023).
Balances all (class_label × sensitive_group) subgroups to match the largest subgroup size. Uses heterogeneous clusters for interpolation:
- H_y: samples with different class but same group (class-heterogeneous)
- H_g: samples with same class but different group (group-heterogeneous)
A Bernoulli probability p_{y,g} determines which cluster to use for each synthetic sample, enabling generation of class-mix and group-mix features that improve classifier generalization and fairness.
Parameters:
Attributes:
References
Sonoda, R. (2023). Fair Oversampling Technique using Heterogeneous Clusters. arXiv:2305.13875v1
Feature transformation¶
skfair.preprocessing.DisparateImpactRemover¶
Bases: BaseEstimator, TransformerMixin
Disparate Impact Remover (Feldman et al., 2015).
Transforms feature distributions to reduce correlation with a sensitive attribute while preserving within-group rank ordering. Uses the quantile bucket method to handle uneven group sizes.
The repair formula (Definition 5.2): ȳ = (1 - λ) * y + λ * F_A⁻¹(F_x(y))
Where:
- y is the original feature value
- λ (lambda_param) controls repair level: 0 = no change, 1 = full repair
- F_x(y) maps y to its quantile rank within group x
- F_A⁻¹ maps that rank to the median distribution value
Parameters:
Attributes:
Example
>>> from skfair.preprocessing import DisparateImpactRemover
>>> repair = DisparateImpactRemover(
...     sens_attr='sex',
...     repair_columns=['age', 'income'],
...     lambda_param=1.0
... )
>>> X_repaired = repair.fit_transform(X)
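The repair mechanics can be illustrated for a single feature with numpy. This sketch assumes equal group sizes (so within-group ranks align one-to-one); it is not the skfair implementation, which uses the quantile bucket method to handle uneven groups:

```python
import numpy as np

def repair_feature(values, groups, lam=1.0):
    """Geometric repair of one feature: (1-lam)*y + lam*F_A_inv(F_x(y))."""
    values, groups = np.asarray(values, float), np.asarray(groups)
    repaired = values.copy()
    uniq = np.unique(groups)
    # sorted values per group; equal group sizes assumed in this sketch
    per_group = np.stack([np.sort(values[groups == g]) for g in uniq])
    median_dist = np.median(per_group, axis=0)       # F_A^-1 at each rank
    for g in uniq:
        idx = np.where(groups == g)[0]
        ranks = np.argsort(np.argsort(values[idx]))  # F_x: within-group rank
        repaired[idx] = (1 - lam) * values[idx] + lam * median_dist[ranks]
    return repaired

vals = [1, 2, 3, 10, 20, 30]
grp = [0, 0, 0, 1, 1, 1]
full_repair = repair_feature(vals, grp, lam=1.0)
```

With lam=1.0 both groups end up with identical value distributions while each group's internal rank ordering is preserved; with lam=0.0 the data is unchanged.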
fit(X, y=None)¶
Learn the quantile bucket structure and median distribution.
For each repair column:
1. Compute N_min (smallest group size) → number of buckets
2. Divide each group into N_min quantile buckets
3. Compute the median of each bucket per group
4. Compute the median distribution A (median across groups)
Parameters:
Returns:
transform(X)¶
Apply geometric repair to the specified columns.
For each value:
1. Find which group the sample belongs to
2. Find which bucket the value falls into (F_x mapping)
3. Get the median distribution value for that bucket (F_A⁻¹)
4. Apply: ȳ = (1 - λ) * y + λ * repaired_value
Parameters:
Returns:
skfair.preprocessing.OptimizedPreprocessing¶
Bases: BaseFairSampler
Optimized Pre-Processing for Discrimination Prevention.
Learns a randomised mapping P(X', Y' | D, X, Y) that transforms features and labels to reduce statistical parity disparity while controlling individual distortion, solved via convex optimisation (CVXPY).
Important: This algorithm operates on discrete/categorical features only. Continuous features must be discretised before use.
Parameters:
Attributes:
Example
>>> from skfair.preprocessing import OptimizedPreprocessing
>>> def distortion(old, new):
...     cost = 0.0
...     for k in old:
...         if k != 'label' and old[k] != new[k]:
...             cost += 1.0
...     if old['label'] != new['label']:
...         cost += 2.0
...     return cost
>>> op = OptimizedPreprocessing(
...     sens_attr='group',
...     features_to_transform=['age_cat', 'education'],
...     distortion_fun=distortion,
...     epsilon=0.05,
... )
skfair.preprocessing.LearningFairRepresentations¶
Bases: BaseEstimator, TransformerMixin
Learning Fair Representations (Zemel et al., 2013).
Learns a set of intermediate prototypes that simultaneously encode the data faithfully (low reconstruction error) while removing information about a sensitive attribute (statistical parity of prototype membership across groups).
The objective minimised during fit is:
L = Ax * L_x + Ay * L_y + Az * L_z
where L_x is reconstruction error, L_y is label-prediction cross-entropy, and L_z penalises differences in average prototype membership between the privileged and unprivileged groups.
transform replaces every numeric feature column with its
prototype-based reconstruction, preserving the sensitive-attribute
column unchanged.
Parameters:
Attributes:
References
R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, "Learning Fair Representations", ICML 2013.
fit(X, y)¶
Learn fair prototypes from the data.
Parameters:
Returns:
transform(X)¶
Replace numeric features with fair representations.
Parameters:
Returns:
skfair.preprocessing.FairMask¶
Bases: BaseEstimator, ClassifierMixin
Meta-estimator that masks sensitive attributes at inference time.
FairMask trains an ensemble of extrapolation models to predict sensitive attributes from non-sensitive features. At inference time, it replaces actual sensitive attribute values with synthetic values predicted by the ensemble via weighted voting. This achieves "procedural fairness" - the classifier's decision is based on what non-sensitive features imply, not the actual sensitive attribute.
The workflow is:
1. fit(X, y): Train extrapolation models (sensitive attr predictors) and fit the underlying classifier on original data.
2. predict(X): Replace sensitive attributes with synthetic values from extrapolation models, then delegate to the fitted classifier.
Parameters:
Attributes:
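A minimal single-model sketch of that workflow (the actual algorithm trains an ensemble of budget extrapolation models combined by weighted voting); the helper names and model choices here are ours, not the skfair API:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def fit_fairmask(X, y, sens_col):
    """Fit an extrapolation model for the sensitive column + the classifier."""
    rest = np.delete(X, sens_col, axis=1)
    mask_model = DecisionTreeClassifier().fit(rest, X[:, sens_col])
    clf = LogisticRegression().fit(X, y)        # trained on original data
    return mask_model, clf

def predict_fairmask(mask_model, clf, X, sens_col):
    """Replace the sensitive column with synthetic values, then predict."""
    X = X.copy()
    rest = np.delete(X, sens_col, axis=1)
    X[:, sens_col] = mask_model.predict(rest)   # masked sensitive attribute
    return clf.predict(X)

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 2))
sens = (feats[:, 0] > 0).astype(float)          # sensitive attr correlated with feats
X = np.column_stack([feats, sens])
y = (feats[:, 0] > 0).astype(int)
mask_model, clf = fit_fairmask(X, y, sens_col=2)
preds = predict_fairmask(mask_model, clf, X, sens_col=2)
```

Predictions thus depend only on what the non-sensitive features imply about the sensitive attribute, never on its recorded value.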
Example
>>> from skfair.preprocessing import FairMask
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = FairMask(
...     estimator=RandomForestClassifier(),
...     sens_attr='sex',
...     budget=10
... )
>>> clf.fit(X_train, y_train)
>>> predictions = clf.predict(X_test)
References
Peng, K., et al. "FairMask: Better Fairness via Model-based Rebalancing of Protected Attributes." arXiv:2110.01109 (2021).
fit(X, y)¶
Fit the FairMask classifier.
- Train extrapolation models to predict sensitive attribute
- Fit the underlying classifier on original data
Parameters:
Returns:
predict(X)¶
Predict class labels with masked sensitive attributes.
Parameters:
Returns:
predict_proba(X)¶
Predict class probabilities with masked sensitive attributes.
Only available if the underlying estimator supports predict_proba.
Parameters:
Returns:
score(X, y, sample_weight=None)¶
Return accuracy score on the given test data.
Parameters:
Returns:
Utilities¶
skfair.preprocessing.IntersectionalBinarizer¶
Bases: BaseEstimator, TransformerMixin
Creates a single binary Protected Group feature (1=Privileged, 0=Unprivileged) from complex, user-defined intersectional criteria.
Privilege is defined by an OR condition over a list of AND conditions (rules). Supports equality, list inclusion, and threshold operators (>, <, >=, <=, !=).
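The OR-over-AND evaluation can be sketched with pandas. The rule format used here (a list of dicts mapping column name to an (operator, value) pair) is an assumption for this illustration, not necessarily the transformer's actual parameter format:

```python
import operator
import pandas as pd

# operator tokens -> elementwise comparisons on a pandas column
OPS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le,
       "in": lambda col, vals: col.isin(vals)}

def binarize(X: pd.DataFrame, rules) -> pd.Series:
    """OR over rules; AND over the conditions inside each rule."""
    privileged = pd.Series(False, index=X.index)
    for rule in rules:
        cond = pd.Series(True, index=X.index)
        for col, (op, val) in rule.items():
            cond &= OPS[op](X[col], val)
        privileged |= cond
    return privileged.astype(int)

X = pd.DataFrame({"sex": ["M", "F", "M", "F"], "age": [40, 30, 20, 50]})
rules = [{"sex": ("==", "M"), "age": (">=", 30)},  # male AND age >= 30
         {"age": (">", 45)}]                       # OR anyone over 45
protected_group = binarize(X, rules)
```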
Parameters:
fit(X, y=None)¶
No fitting necessary for this transformation (stateless).
transform(X, y=None)¶
Applies the intersectional rules to create the binary feature.
Parameters:
Returns:
skfair.preprocessing.DropColumns¶
Bases: BaseEstimator, TransformerMixin
Drop specified columns from a DataFrame in a sklearn pipeline.
Useful for removing sensitive attributes before an estimator while keeping them available earlier in the pipeline for fairness preprocessing or evaluation.
Parameters:
Examples:
>>> from skfair.preprocessing import DropColumns
>>> from skfair.datasets import load_adult
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_adult()
>>> pipe = Pipeline([
... ("drop_sensitive", DropColumns("sex")),
... ("clf", LogisticRegression()),
... ])
>>> pipe.fit(X, y)
fit(X, y=None)¶
Record which columns exist and should be dropped.
Parameters:
Returns:
transform(X, y=None)¶
Drop the specified columns from X.
Parameters:
Returns: