skfair.preprocessing

from skfair.preprocessing import <ClassName>

Base class

skfair.preprocessing.BaseFairSampler

Bases: BaseSampler

Base class for fairness-aware samplers.

Extends imblearn's BaseSampler with common functionality for fair resampling algorithms. Ensures DataFrame inputs are preserved and provides utilities for sensitive attribute handling, KNN operations, and synthetic sample generation.

Parameters:
  • sens_attr (str) –

    Name of the sensitive attribute column in X (DataFrame).

  • random_state (int, RandomState instance or None, default: None ) –

    Controls randomization for reproducibility.

Notes

Subclasses must:
  • Set the _sampling_type class attribute ("over-sampling" or "clean-sampling")
  • Implement the _fit_resample(X, y) method
  • Call super().__init__(sens_attr, random_state) in __init__


Weighting methods

skfair.preprocessing.Reweighing

Bases: BaseEstimator

Reweighing preprocessing technique (Kamiran & Calders, 2012).

The Reweighing method assigns weights to training samples such that the weighted dataset exhibits less statistical dependence between a sensitive attribute and the label. It does not modify the feature matrix or the labels; instead, it outputs instance weights that can be passed to downstream learning algorithms via sample_weight.

Conceptual Summary

Let A be a sensitive attribute (e.g., sex, race) and Y the binary label. The method computes the expected probability of each (A, Y) combination under independence:

P(A=a, Y=y)_expected = P(A=a) * P(Y=y)

and compares it to the empirical joint probabilities:

P(A=a, Y=y)_observed

Each sample receives a weight:

weight = P_expected / P_observed

so that, after weighting, the distribution of outcomes is closer to independence between A and Y.
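The expected-over-observed weighting can be sketched directly from counts with pandas; this toy illustration is not the library's internal implementation, and the column names A and Y are placeholders:

```python
import pandas as pd

# Toy data: sensitive attribute A and binary label Y.
df = pd.DataFrame({
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "Y": [0, 0, 0, 1, 0, 1, 1, 1],
})
n = len(df)

# P(A=a) and P(Y=y) from marginal counts; P(A=a, Y=y) from joint counts.
p_a = df["A"].value_counts(normalize=True)
p_y = df["Y"].value_counts(normalize=True)
p_joint = df.groupby(["A", "Y"]).size() / n

# weight(a, y) = P_expected / P_observed = P(A=a) * P(Y=y) / P(A=a, Y=y)
weight_table = {
    (a, y): p_a[a] * p_y[y] / p_joint[(a, y)]
    for (a, y) in p_joint.index
}

weights = df.apply(lambda r: weight_table[(r["A"], r["Y"])], axis=1)
```

Under-represented combinations (here, unprivileged positives) receive weights above 1, over-represented ones below 1, and the weights sum back to the sample count.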

Reference

F. Kamiran and T. Calders, "Data Preprocessing Techniques for Classification without Discrimination," Knowledge and Information Systems, 2012.

Parameters:
  • sens_attr (str, default: None ) –

    Name of the sensitive attribute column in X (pandas DataFrame).

  • priv_group (int or str, default: 1 ) –

    Value in sens_attr treated as the privileged group. (Not used directly in the weighting formula, but stored for consistency with other preprocessors.)

  • pos_label (int or str, default: 1 ) –

    Label considered the favorable outcome.

Attributes (after fit/fit_transform)

group_probs_ : dict
    Empirical probabilities P(A=a) and P(Y=y). Keys: ("A", a) and ("Y", y).

joint_probs_ : dict
    Empirical joint probabilities P(A=a, Y=y). Keys: (a, y).

expected_probs_ : dict
    Expected independence probabilities P(A=a)*P(Y=y). Keys: (a, y).

weight_table_ : dict
    Mapping (a, y) -> W(a, y) according to Algorithm 3.

weights_ : pandas Series of shape (n_samples,)
    Per-sample weights for the data passed to fit_transform.

fit(X, y)

Fit the reweighing model.

This method validates the inputs and records basic label information (e.g., classes_). It does not compute weights; weight computation happens in fit_transform.

Parameters:
  • X (pandas DataFrame) –

    Must contain self.sens_attr.

  • y (array-like or pandas Series) –

    Binary labels.

Returns:
  • self

fit_transform(X, y)

Run Algorithm 3 (Reweighing) and return per-sample weights.

Algorithm 3 (simplified)

For each group value s and class label c:

W(s, c) := ( |{X | S = s}| * |{X | Y = c}| ) / ( |D| * |{X | S = s, Y = c}| )

Then each instance X_i with sensitive attribute S_i and label Y_i receives weight W(S_i, Y_i).

Parameters:
  • X (pandas DataFrame) –

    Feature matrix. Not modified by reweighing.

  • y (array-like or pandas Series) –

    Binary labels. Not modified by reweighing.

Returns:
  • X_out( pandas DataFrame ) –

    Same as X, passed through unchanged.

  • weights( pandas Series of shape (n_samples,) ) –

    The reweighing weights corresponding to each instance.


skfair.preprocessing.FairBalance

Bases: BaseEstimator

FairBalance preprocessing technique (Yu, Chakraborty & Menzies, 2024).

FairBalance assigns weights to training samples to achieve equalized odds by balancing the class distribution within each demographic group. It does not modify the feature matrix or labels; instead, it outputs instance weights that can be passed to downstream learning algorithms via sample_weight.

Conceptual Summary

The key insight is that violation of equalized odds is caused by different class distributions across demographic groups. FairBalance fixes this by weighting samples so each (sensitive group, class) combination has balanced influence.

For each sample with sensitive attribute A=a and label Y=y:

weight = |A=a| / |A=a, Y=y|

This ensures the weighted class distribution becomes 1:1 (balanced) within each demographic group, which is a sufficient condition for achieving smAOD=0 (smoothed maximum Average Odds Difference).

Variant Mode

When variant=True, uses FairBalanceVariant formula:

weight = 1 / |A=a, Y=y|

Then rescales all weights so they sum to n_samples. This treats all demographic groups equally regardless of their size.
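Both formulas can be sketched with pandas group counts; this is an illustrative computation under assumed column names A and Y, not the estimator's code:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0, 0, 0, 1, 1, 1, 1, 1],
    "Y": [0, 0, 1, 0, 1, 1, 1, 1],
})
n = len(df)

group = df.groupby("A").size()         # |A=a|
joint = df.groupby(["A", "Y"]).size()  # |A=a, Y=y|

# FairBalance: w(a, y) = |A=a| / |A=a, Y=y|
fb = df.apply(lambda r: group[r["A"]] / joint[(r["A"], r["Y"])], axis=1)

# FairBalanceVariant: w(a, y) = 1 / |A=a, Y=y|, rescaled to sum to n_samples
fbv = df.apply(lambda r: 1.0 / joint[(r["A"], r["Y"])], axis=1)
fbv = fbv * (n / fbv.sum())
```

With the FairBalance weights, the weighted positive and negative mass inside each demographic group comes out equal (the 1:1 class balance described above), while the variant additionally equalizes total mass across groups.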

Reference

Z. Yu, J. Chakraborty, and T. Menzies, "FairBalance: How to Achieve Equalized Odds With Data Pre-processing," IEEE Transactions on Software Engineering, 2024. https://github.com/hil-se/FairBalance

Parameters:
  • sens_attr (str, default: None ) –

    Name of the sensitive attribute column in X (pandas DataFrame).

  • variant (bool, default: False ) –

    If False, use FairBalance formula (preserves group size differences). If True, use FairBalanceVariant formula (treats all groups equally).

  • pos_label (int or str, default: 1 ) –

    Label considered the favorable outcome.

Attributes (after fit/fit_transform)

classes_ : tuple
    The two class labels (neg_label, pos_label).

weight_table_ : dict
    Mapping (a, y) -> weight for each (sensitive attribute, label) combination.

weights_ : pandas Series of shape (n_samples,)
    Per-sample weights for the data passed to fit_transform.

group_counts_ : dict
    Count of samples in each sensitive attribute group. Keys: attribute values.

joint_counts_ : dict
    Count of samples in each (attribute, label) combination. Keys: (a, y).

fit(X, y)

Fit the FairBalance model.

This method validates the inputs and records basic label information. It does not compute weights; weight computation happens in fit_transform.

Parameters:
  • X (pandas DataFrame) –

    Must contain self.sens_attr.

  • y (array-like or pandas Series) –

    Binary labels.

Returns:
  • self

fit_transform(X, y)

Compute FairBalance weights and return them with the original data.

Weight Calculation

For FairBalance (variant=False): w(A=a, Y=y) = |A=a| / |A=a, Y=y|

For FairBalanceVariant (variant=True): w(A=a, Y=y) = 1 / |A=a, Y=y| (then rescaled so sum of weights = n_samples)

Parameters:
  • X (pandas DataFrame) –

    Feature matrix. Not modified by FairBalance.

  • y (array-like or pandas Series) –

    Binary labels. Not modified by FairBalance.

Returns:
  • X_out( pandas DataFrame ) –

    Same as X, passed through unchanged.

  • weights( pandas Series of shape (n_samples,) ) –

    The FairBalance weights corresponding to each instance.


skfair.preprocessing.ReweighingClassifier

Bases: BaseEstimator, ClassifierMixin

Meta-estimator that combines Reweighing with any classifier.

This wrapper makes Reweighing compatible with sklearn Pipelines by encapsulating the weight computation and passing weights to the underlying classifier's fit method via sample_weight.

The workflow is:
  1. fit(X, y): Compute fairness weights using Reweighing, then fit the classifier with those weights.
  2. predict(X): Delegate to the fitted classifier.

Parameters:
  • estimator (classifier, default: LogisticRegression() ) –

    The classifier to train with reweighted samples. Must support the sample_weight parameter in fit().

  • sens_attr (str, default: None ) –

    Name of the sensitive attribute column in X.

  • priv_group (int or str, default: 1 ) –

    Value in sens_attr treated as the privileged group.

  • pos_label (int or str, default: 1 ) –

    Label considered the favorable outcome.

Attributes:
  • reweigher_ (Reweighing) –

    Fitted Reweighing instance.

  • estimator_ (classifier) –

    Fitted classifier instance.

  • classes_ (ndarray) –

    Class labels from the fitted classifier.

  • weights_ (pandas Series) –

    Sample weights computed during fit.

Example

>>> from skfair.preprocessing import ReweighingClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = ReweighingClassifier(
...     estimator=RandomForestClassifier(),
...     sens_attr='sex',
...     priv_group=1,
...     pos_label=1
... )
>>> clf.fit(X_train, y_train)
>>> predictions = clf.predict(X_test)

fit(X, y)

Fit the reweighing classifier.

  1. Compute fairness weights using Reweighing
  2. Fit the underlying classifier with sample_weight
Parameters:
  • X (pandas DataFrame) –

    Feature matrix. Must contain sens_attr column.

  • y (array-like) –

    Binary target labels.

Returns:
  • self

predict(X)

Predict class labels.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

Returns:
  • y_pred( ndarray ) –

    Predicted class labels.

predict_proba(X)

Predict class probabilities.

Only available if the underlying estimator supports predict_proba.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

Returns:
  • y_proba( ndarray of shape (n_samples, n_classes) ) –

    Predicted class probabilities.

score(X, y, sample_weight=None)

Return accuracy score on the given test data.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

  • y (array-like) –

    True labels.

  • sample_weight (array-like, default: None ) –

    Sample weights for scoring.

Returns:
  • score( float ) –

    Accuracy score.


skfair.preprocessing.FairBalanceClassifier

Bases: BaseEstimator, ClassifierMixin

Meta-estimator that combines FairBalance with any classifier.

This wrapper makes FairBalance compatible with sklearn Pipelines by encapsulating the weight computation and passing weights to the underlying classifier's fit method via sample_weight.

The workflow is:
  1. fit(X, y): Compute fairness weights using FairBalance, then fit the classifier with those weights.
  2. predict(X): Delegate to the fitted classifier.

Parameters:
  • estimator (classifier, default: LogisticRegression() ) –

    The classifier to train with FairBalance-weighted samples. Must support the sample_weight parameter in fit().

  • sens_attr (str, default: None ) –

    Name of the sensitive attribute column in X.

  • variant (bool, default: False ) –

    If False, use FairBalance formula (preserves group size differences). If True, use FairBalanceVariant formula (treats all groups equally).

  • pos_label (int or str, default: 1 ) –

    Label considered the favorable outcome.

Attributes:
  • fairbalance_ (FairBalance) –

    Fitted FairBalance instance.

  • estimator_ (classifier) –

    Fitted classifier instance.

  • classes_ (ndarray) –

    Class labels from the fitted classifier.

  • weights_ (pandas Series) –

    Sample weights computed during fit.

Example

>>> from skfair.preprocessing import FairBalanceClassifier
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = FairBalanceClassifier(
...     estimator=RandomForestClassifier(),
...     sens_attr='sex',
...     pos_label=1
... )
>>> clf.fit(X_train, y_train)
>>> predictions = clf.predict(X_test)

fit(X, y)

Fit the FairBalance classifier.

  1. Compute fairness weights using FairBalance
  2. Fit the underlying classifier with sample_weight
Parameters:
  • X (pandas DataFrame) –

    Feature matrix. Must contain sens_attr column.

  • y (array-like) –

    Binary target labels.

Returns:
  • self

predict(X)

Predict class labels.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

Returns:
  • y_pred( ndarray ) –

    Predicted class labels.

predict_proba(X)

Predict class probabilities.

Only available if the underlying estimator supports predict_proba.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

Returns:
  • y_proba( ndarray of shape (n_samples, n_classes) ) –

    Predicted class probabilities.

score(X, y, sample_weight=None)

Return accuracy score on the given test data.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

  • y (array-like) –

    True labels.

  • sample_weight (array-like, default: None ) –

    Sample weights for scoring.

Returns:
  • score( float ) –

    Accuracy score.


Label modification

skfair.preprocessing.Massaging

Bases: BaseFairSampler

Massaging preprocessing technique (Kamiran & Calders, 2012).

Modifies class labels of training samples to reduce the statistical dependence between the sensitive attribute and the label. Identifies candidates for promotion (unprivileged with negative label) and demotion (privileged with positive label), then swaps their labels.
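The promotion/demotion mechanics can be sketched as follows; the swap count uses the standard Kamiran & Calders formula, and the toy data, column names, and use of LogisticRegression as the ranker are assumptions of this sketch:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({"f1": rng.normal(size=40), "sex": [1] * 20 + [0] * 20})
y = pd.Series([1] * 14 + [0] * 6 + [1] * 6 + [0] * 14)  # privileged group favoured

priv = X["sex"] == 1
# Number of label swaps needed to equalize positive rates across groups.
disc = y[priv].mean() - y[~priv].mean()
m = int(np.ceil(disc * priv.sum() * (~priv).sum() / len(y)))

# Rank candidates by predicted probability of the positive class.
ranker = LogisticRegression().fit(X, y)
scores = pd.Series(ranker.predict_proba(X)[:, 1], index=X.index)

promote = scores[(~priv) & (y == 0)].nlargest(m).index  # unprivileged negatives
demote = scores[priv & (y == 1)].nsmallest(m).index     # privileged positives

y_massaged = y.copy()
y_massaged[promote] = 1
y_massaged[demote] = 0
```

Because each promotion is paired with a demotion, the overall base rate is unchanged while the per-group positive rates converge.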

Parameters:
  • sens_attr (str, default: None ) –

    Name of the sensitive attribute column in X.

  • priv_group (int or str, default: 1 ) –

    Value in sens_attr that represents the privileged group.

  • pos_label (int or str, default: 1 ) –

    Value representing the positive/favorable outcome.

  • estimator (sklearn estimator or None, default: None ) –

    Estimator used to rank candidates for label swapping. Must support predict_proba. If None, uses LogisticRegression.

Attributes:
  • ranker_ (estimator) –

    Fitted ranker used to prioritize candidates.

  • classes_ (ndarray) –

    Class labels from the ranker.


skfair.preprocessing.FairwayRemover

Bases: BaseFairSampler

Removes 'ambiguous' data points that cause conflicting predictions between privileged and unprivileged group models.

Trains two separate models on privileged and unprivileged groups, then removes any samples where the models disagree on predictions.
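The disagreement filter can be sketched in a few lines; the synthetic data, column names, and choice of LogisticRegression are assumptions, not the class's internals:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "sex": rng.integers(0, 2, size=100),
})
y = pd.Series((X["f1"] + rng.normal(scale=0.5, size=100) > 0).astype(int))

priv = X["sex"] == 1
feats = X.drop(columns="sex")

# One model per group; keep only the points both models classify the same way.
model_p = LogisticRegression().fit(feats[priv], y[priv])
model_u = LogisticRegression().fit(feats[~priv], y[~priv])

agree = model_p.predict(feats) == model_u.predict(feats)
X_clean, y_clean = X[agree], y[agree]
```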

Parameters:
  • sens_attr (str) –

    Name of the sensitive attribute column in X.

  • priv_group (int or str) –

    Value in sens_attr that represents the privileged group.

  • estimator (sklearn estimator or None, default: None ) –

    Base estimator to use for training group-specific models. If None, uses LogisticRegression(solver='liblinear').

Attributes:
  • model_p_ (estimator) –

    Model trained on the privileged group.

  • model_u_ (estimator) –

    Model trained on the unprivileged group.


Oversampling

skfair.preprocessing.FairOversampling

Bases: BaseFairSampler

Fair Oversampling (FOS) by Dablain et al.

Balances class distributions within each sensitive group independently. For each group (privileged/unprivileged), oversamples the minority class to match the majority class count using SMOTE-style interpolation.
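The SMOTE-style interpolation within one (group, class) cell can be sketched as below; smote_like is a hypothetical helper for illustration, not the estimator's API:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(X_min, n_new, k=5):
    # Generate n_new synthetic points by interpolating each chosen minority
    # point toward one of its k nearest same-class neighbors.
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]        # skip the point itself
        j = rng.choice(nbrs)
        gap = rng.random()
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

# Within one group: if a class has 2 samples vs a 6-sample majority,
# per-group balancing would request 4 synthetic minority points.
X_pos = rng.normal(size=(2, 3))
synth = smote_like(X_pos, n_new=4, k=1)
```

Each synthetic point lies on the segment between a minority sample and one of its neighbors, so the generated values stay inside the minority cell's convex hull.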

Parameters:
  • sens_attr (str) –

    Name of the sensitive attribute column in X.

  • priv_group (int or str) –

    Value in sens_attr that represents the privileged group.

  • k_neighbors (int, default: 5 ) –

    Number of nearest neighbors for SMOTE interpolation.

  • random_state (int, RandomState instance or None, default: None ) –

    Controls randomization for reproducibility.

Attributes:
  • subgroup_counts_ (dict) –

    Sample counts per (sensitive_group, class_label) after fit.

  • n_synthetic_ (dict) –

    Number of synthetic samples generated per group.


skfair.preprocessing.FairSmote

Bases: BaseFairSampler

Fair-SMOTE oversampler for fairness-aware data balancing.

Balances the dataset across all (class_label × sensitive_attribute) subgroups. Identifies the largest subgroup and oversamples all others to match that size using a Differential Evolution-style crossover.
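The Differential Evolution-style crossover for numeric features can be sketched as a single step; de_crossover is a hypothetical helper matching the cr and f parameters documented below, not the estimator's internal function:

```python
import numpy as np

rng = np.random.default_rng(0)

def de_crossover(parent, nb1, nb2, cr=0.8, f=0.8):
    # Each feature mutates with probability cr:
    #   new = parent + f * (neighbor1 - neighbor2)
    child = parent.copy()
    mask = rng.random(len(parent)) < cr
    child[mask] = parent[mask] + f * (nb1[mask] - nb2[mask])
    return child
```

With cr=1.0 every feature mutates, so the output is exactly parent + f * (nb1 - nb2); with cr < 1, unmutated features are copied from the parent.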

Parameters:
  • sens_attr (str) –

    Name of the sensitive attribute column in X.

  • cr (float, default: 0.8 ) –

    Crossover rate [0, 1]. Probability that a synthetic feature value differs from the parent.

  • f (float, default: 0.8 ) –

    Mutation amount [0, 1]. Magnitude of the differential step for numeric features: new = parent + f * (neighbor1 - neighbor2).

  • k_neighbors (int, default: 5 ) –

    Number of neighbors to use for KNN.

  • clip_numeric (bool, default: True ) –

    Whether to clip synthetic numeric values to the min/max range observed in the original data.

  • random_state (int, RandomState instance or None, default: None ) –

    Control the randomization of the algorithm.

Attributes:
  • subgroup_counts_ (dict) –

    The counts of samples per (class, sensitive_attr) subgroup.

  • max_size_ (int) –

    The target size (largest subgroup count).

fit_resample(X, y)

Resample the dataset to balance all (class, sensitive_attr) subgroups.

Overrides BaseSampler.fit_resample to preserve DataFrame input.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

  • y (array-like) –

    Target labels.

Returns:
  • X_resampled( pandas DataFrame ) –

    Resampled feature matrix.

  • y_resampled( ndarray ) –

    Resampled target labels.


skfair.preprocessing.FAWOS

Bases: BaseFairSampler

FAWOS: Fairness-Aware Oversampling by Salazar et al.

Balances the dataset by oversampling positive unprivileged instances to match the ratio of positive-to-negative in the privileged group. Uses typology-based weighted selection (Safe, Borderline, Rare, Outlier) to prioritize points that are harder to learn.
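The typology assignment and weighted seed selection can be sketched as follows; the thresholds come from the k=5 typology described in the Notes, while giving Outlier points selection weight 0 is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def typology(n_same):
    # n_same: how many of a point's k=5 nearest neighbors share its (Y, S) type.
    if n_same >= 4:
        return "safe"
    if n_same >= 2:
        return "borderline"
    if n_same == 1:
        return "rare"
    return "outlier"

# Weighted selection of SMOTE seed points; weight 0 for outliers (assumed)
# means they are never used as seeds.
weights = {"safe": 1.0, "borderline": 1.0, "rare": 1.0, "outlier": 0.0}
types = [typology(n) for n in [5, 3, 1, 0, 4]]
p = np.array([weights[t] for t in types])
p /= p.sum()
seed_idx = rng.choice(len(types), size=3, p=p)
```

Raising rare_weight relative to safe_weight shifts sampling pressure toward the harder-to-learn regions of the positive unprivileged subgroup.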

Parameters:
  • sens_attr (str) –

    Name of the sensitive attribute column in X.

  • priv_group (scalar) –

    Value in sens_attr that represents the privileged group.

  • alpha (float, default: 1.0 ) –

    Oversampling factor. When alpha=1, the positive-to-negative ratio of unprivileged groups will match the privileged group exactly. alpha < 1 creates fewer synthetic samples, alpha > 1 creates more.

  • safe_weight (float, default: 1.0 ) –

    Selection weight for Safe points (4-5 same-type neighbors).

  • borderline_weight (float, default: 1.0 ) –

    Selection weight for Borderline points (2-3 same-type neighbors).

  • rare_weight (float, default: 1.0 ) –

    Selection weight for Rare points (1 isolated same-type neighbor).

  • random_state (int, RandomState instance or None, default: None ) –

    Controls randomization for reproducibility.

Attributes:
  • n_synthetic_ (int) –

    Number of synthetic samples generated.

  • typology_counts_ (dict) –

    Counts of each typology type in PU (positive unprivileged).

Notes
  • k=5 is fixed per Napierala & Stefanowski (2016) typology thresholds.
  • SMOTE generation uses same-type neighbors (same Y and S) only, matching the authors' implementation, not global neighbors.
References

Salazar, T., Santos, M. S., Araújo, H., & Abreu, P. H. (2021). FAWOS: Fairness-Aware Oversampling Algorithm Based on Distributions of Sensitive Attributes. IEEE Access, 9, 81370-81379.


skfair.preprocessing.HeterogeneousFOS

Bases: BaseFairSampler

Fair Oversampling using Heterogeneous Clusters by Sonoda et al. (2023).

Balances all (class_label x sensitive_group) subgroups to match the largest subgroup size. Uses heterogeneous clusters for interpolation:
  • H_y: samples with different class but same group (class-heterogeneous)
  • H_g: samples with same class but different group (group-heterogeneous)

A Bernoulli probability p_{y,g} determines which cluster to use for each synthetic sample, enabling generation of class-mix and group-mix features that improve classifier generalization and fairness.
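The Bernoulli cluster choice can be sketched for a single synthetic sample; synth_sample is a hypothetical helper illustrating the mechanism, not the estimator's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def synth_sample(x, H_y, H_g, p):
    # Bernoulli(p) picks the class-heterogeneous cluster H_y, otherwise the
    # group-heterogeneous cluster H_g; interpolate toward a random member.
    cluster = H_y if rng.random() < p else H_g
    other = cluster[rng.integers(len(cluster))]
    gap = rng.random()
    return x + gap * (other - x)
```

Varying p between 0 and 1 trades off how much class-mix versus group-mix information ends up in the synthetic features.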

Parameters:
  • sens_attr (str) –

    Name of the sensitive attribute column in X.

  • k_neighbors (int, default: 5 ) –

    Number of nearest neighbors for computing local density and interpolation normalization.

  • random_state (int, RandomState instance or None, default: None ) –

    Controls randomization for reproducibility.

Attributes:
  • subgroup_counts_ (dict) –

    Sample counts per (class_label, sensitive_group) after fit.

  • n_synthetic_ (dict) –

    Number of synthetic samples generated per subgroup.

  • max_size_ (int) –

    The target size (largest subgroup count).

References

Sonoda, R. (2023). Fair Oversampling Technique using Heterogeneous Clusters. arXiv:2305.13875v1


Feature transformation

skfair.preprocessing.DisparateImpactRemover

Bases: BaseEstimator, TransformerMixin

Disparate Impact Remover (Feldman et al., 2015).

Transforms feature distributions to reduce correlation with a sensitive attribute while preserving within-group rank ordering. Uses the quantile bucket method to handle uneven group sizes.

The repair formula (Definition 5.2): ȳ = (1 - λ) * y + λ * F_A⁻¹(F_x(y))

Where:
  • y is the original feature value
  • λ (lambda_param) controls repair level: 0 = no change, 1 = full repair
  • F_x(y) maps y to its quantile rank within group x
  • F_A⁻¹ maps that rank to the median distribution value
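A numpy sketch of the repair formula; it evaluates each group's quantile function directly rather than through the transformer's precomputed bucket tables, so treat it as an illustration under that simplification:

```python
import numpy as np

def repair_column(values, groups, lam=1.0):
    # ȳ = (1 - λ) * y + λ * F_A⁻¹(F_x(y)): each value's quantile rank within
    # its own group is mapped to the median, across groups, of the groups'
    # quantile functions at that rank.
    uniq = np.unique(groups)
    out = values.astype(float).copy()
    for g in uniq:
        mask = groups == g
        v = values[mask]
        ranks = (np.argsort(np.argsort(v)) + 0.5) / len(v)   # F_x(y)
        repaired = np.median(
            [np.quantile(values[groups == h], ranks) for h in uniq], axis=0)
        out[mask] = (1 - lam) * v + lam * repaired
    return out
```

At λ = 1, two groups with shifted distributions are mapped onto the same repaired values (rank order preserved within each group); at λ = 0 the column is untouched.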

Parameters:
  • sens_attr (str) –

    Column name of the sensitive attribute in X.

  • repair_columns (list of str) –

    Column names to apply repair to.

  • lambda_param (float, default: 1.0 ) –

    Repair level between 0.0 (no repair) and 1.0 (full repair).

Attributes:
  • n_buckets_ (int) –

    Number of quantile buckets (equals smallest group size).

  • bucket_edges_ (dict) –

    For each (column, group): array of quantile edges.

  • group_medians_ (dict) –

    For each (column, group): array of bucket medians.

  • median_distribution_ (dict) –

    For each column: array of median values across all groups per bucket.

  • groups_ (array) –

    Unique values of the sensitive attribute.

Example

>>> from skfair.preprocessing import DisparateImpactRemover
>>> repair = DisparateImpactRemover(
...     sens_attr='sex',
...     repair_columns=['age', 'income'],
...     lambda_param=1.0
... )
>>> X_repaired = repair.fit_transform(X)

fit(X, y=None)

Learn the quantile bucket structure and median distribution.

For each repair column:
  1. Compute N_min (smallest group size) → number of buckets
  2. Divide each group into N_min quantile buckets
  3. Compute the median of each bucket per group
  4. Compute the median distribution A (median across groups)

Parameters:
  • X (pandas DataFrame) –

    Feature matrix containing sens_attr and repair_columns.

  • y (ignored, default: None ) –

    Not used, present for sklearn API compatibility.

Returns:
  • self

transform(X)

Apply geometric repair to the specified columns.

For each value:
  1. Find which group the sample belongs to
  2. Find which bucket the value falls into (F_x mapping)
  3. Get the median distribution value for that bucket (F_A⁻¹)
  4. Apply: ȳ = (1 - λ) * y + λ * repaired_value

Parameters:
  • X (pandas DataFrame) –

    Feature matrix to transform.

Returns:
  • X_repaired( pandas DataFrame ) –

    Transformed feature matrix with repaired columns.


skfair.preprocessing.OptimizedPreprocessing

Bases: BaseFairSampler

Optimized Pre-Processing for Discrimination Prevention.

Learns a randomised mapping P(X', Y' | D, X, Y) that transforms features and labels to reduce statistical parity disparity while controlling individual distortion, solved via convex optimisation (CVXPY).

Important: This algorithm operates on discrete/categorical features only. Continuous features must be discretised before use.

Parameters:
  • sens_attr (str, default: None ) –

    Column name of the sensitive attribute in X.

  • features_to_transform (list of str, default: None ) –

    Categorical feature columns to include in the optimisation.

  • distortion_fun (callable, default: None ) –

    Cost function (old_dict, new_dict) -> float where each dict maps feature names to values (plus 'label' for the outcome).

  • epsilon (float, default: 0.05 ) –

    Maximum allowed disparity |P(Y'=y|D=d1) - P(Y'=y|D=d2)|.

  • clist (list of float, default: None ) –

    Distortion thresholds for the excess-distortion constraint (Eq. 5). Defaults to [0.99, 1.99, 2.99].

  • dlist (list of float, default: None ) –

    Maximum probability of exceeding each threshold in clist. Defaults to [0.1, 0.05, 0.01]. Must have same length as clist.

  • random_state (int or None, default: None ) –

    Seed for the random number generator used in the randomised mapping.

Attributes:
  • mapping_ (dict) –

    Learned conditional distribution P(X', Y' | D, X, Y).

  • classes_ (tuple) –

    (neg_label, pos_label) observed during fitting.

Example

>>> from skfair.preprocessing import OptimizedPreprocessing
>>> def distortion(old, new):
...     cost = 0.0
...     for k in old:
...         if k != 'label' and old[k] != new[k]:
...             cost += 1.0
...     if old['label'] != new['label']:
...         cost += 2.0
...     return cost
>>> op = OptimizedPreprocessing(
...     sens_attr='group',
...     features_to_transform=['age_cat', 'education'],
...     distortion_fun=distortion,
...     epsilon=0.05,
... )


skfair.preprocessing.LearningFairRepresentations

Bases: BaseEstimator, TransformerMixin

Learning Fair Representations (Zemel et al., 2013).

Learns a set of intermediate prototypes that simultaneously encode the data faithfully (low reconstruction error) while removing information about a sensitive attribute (statistical parity of prototype membership across groups).

The objective minimised during fit is:

L = Ax * L_x  +  Ay * L_y  +  Az * L_z

where L_x is reconstruction error, L_y is label-prediction cross-entropy, and L_z penalises differences in average prototype membership between the privileged and unprivileged groups.
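The three loss terms can be sketched numerically; this assumes soft prototype memberships via a softmax over negative squared distances, which is one common formulation rather than this class's exact internals:

```python
import numpy as np

def lfr_loss(X, y, priv, prototypes, w, Ax=0.01, Ay=1.0, Az=50.0):
    # Soft prototype memberships M: softmax over negative squared distances
    # (row-wise shift by the minimum distance for numerical stability).
    d = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    M = np.exp(-(d - d.min(axis=1, keepdims=True)))
    M /= M.sum(axis=1, keepdims=True)

    X_hat = M @ prototypes                    # prototype-based reconstructions
    y_hat = np.clip(M @ w, 1e-6, 1 - 1e-6)    # predicted P(Y=1)

    L_x = ((X - X_hat) ** 2).sum()            # reconstruction error
    L_y = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)).sum()
    L_z = np.abs(M[priv].mean(axis=0) - M[~priv].mean(axis=0)).sum()
    return Ax * L_x + Ay * L_y + Az * L_z

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.integers(0, 2, size=10)
priv = np.array([True] * 5 + [False] * 5)
loss = lfr_loss(X, y, priv, rng.normal(size=(4, 3)), rng.random(4))
```

fit then amounts to minimising this scalar over the prototypes and w (here via L-BFGS-B), with Az pushing the average memberships of the two groups together.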

transform replaces every numeric feature column with its prototype-based reconstruction, preserving the sensitive-attribute column unchanged.

Parameters:
  • sens_attr (str) –

    Column name of the sensitive attribute in X.

  • priv_group (int or str) –

    Value identifying the privileged group in X[sens_attr].

  • k (int, default: 5 ) –

    Number of prototypes.

  • Ax (float, default: 0.01 ) –

    Weight of the reconstruction loss term.

  • Ay (float, default: 1.0 ) –

    Weight of the label-prediction loss term.

  • Az (float, default: 50.0 ) –

    Weight of the fairness loss term.

  • maxiter (int, default: 5000 ) –

    Maximum iterations for L-BFGS-B.

  • maxfun (int, default: 5000 ) –

    Maximum function evaluations for L-BFGS-B.

  • random_state (int or None, default: None ) –

    Seed for reproducibility.

  • verbose (int, default: 0 ) –

    Verbosity level. If > 0, prints the optimization result.

Attributes:
  • w_ (ndarray of shape (k,)) –

    Learned prototype-to-label weights.

  • prototypes_ (ndarray of shape (k, features_dim_)) –

    Learned prototype vectors.

  • features_dim_ (int) –

    Number of numeric feature columns seen during fit.

  • feature_columns_ (list of str) –

    Numeric column names used for the representation (excludes sens_attr).

References

.. [1] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, "Learning Fair Representations", ICML 2013.

fit(X, y)

Learn fair prototypes from the data.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix. Must contain sens_attr as a column. All other numeric columns are used as features.

  • y (array-like of shape (n_samples,)) –

    Binary labels (0/1).

Returns:
  • self

transform(X)

Replace numeric features with fair representations.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix (same schema as fit).

Returns:
  • X_fair( pandas DataFrame ) –

    Copy of X with numeric feature columns replaced by their prototype-based reconstructions. The sensitive-attribute column is preserved unchanged.


skfair.preprocessing.FairMask

Bases: BaseEstimator, ClassifierMixin

Meta-estimator that masks sensitive attributes at inference time.

FairMask trains an ensemble of extrapolation models to predict sensitive attributes from non-sensitive features. At inference time, it replaces actual sensitive attribute values with synthetic values predicted by the ensemble via weighted voting. This achieves "procedural fairness" - the classifier's decision is based on what non-sensitive features imply, not the actual sensitive attribute.

The workflow is:
  1. fit(X, y): Train extrapolation models (sensitive attr predictors) and fit the underlying classifier on original data.
  2. predict(X): Replace sensitive attributes with synthetic values from extrapolation models, then delegate to the fitted classifier.

Parameters:
  • estimator (classifier, default: LogisticRegression() ) –

    The classifier to wrap. Will be trained on original data.

  • sens_attr (str, default: None ) –

    Name of the sensitive attribute column in X.

  • budget (int, default: 10 ) –

    Number of extrapolation models in the ensemble.

  • extrapolation_model (estimator, default: LogisticRegression() ) –

    Model type used for extrapolation (predicting sensitive attribute from non-sensitive features). Will be cloned for each model in budget.

  • random_state (int, default: None ) –

    Random seed for reproducibility.

Attributes:
  • estimator_ (classifier) –

    Fitted classifier instance.

  • extrapolation_models_ (list) –

    List of fitted extrapolation models.

  • model_weights_ (ndarray) –

    Weights for each extrapolation model based on validation accuracy.

  • classes_ (ndarray) –

    Class labels from the fitted classifier.

Example

>>> from skfair.preprocessing import FairMask
>>> from sklearn.ensemble import RandomForestClassifier
>>> clf = FairMask(
...     estimator=RandomForestClassifier(),
...     sens_attr='sex',
...     budget=10
... )
>>> clf.fit(X_train, y_train)
>>> predictions = clf.predict(X_test)

References

Peng, K., et al. "FairMask: Better Fairness via Model-based Rebalancing of Protected Attributes." arXiv:2110.01109 (2021).

fit(X, y)

Fit the FairMask classifier.

  1. Train extrapolation models to predict sensitive attribute
  2. Fit the underlying classifier on original data
Parameters:
  • X (pandas DataFrame) –

    Feature matrix. Must contain sens_attr column.

  • y (array-like) –

    Target labels.

Returns:
  • self

predict(X)

Predict class labels with masked sensitive attributes.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

Returns:
  • y_pred( ndarray ) –

    Predicted class labels.

predict_proba(X)

Predict class probabilities with masked sensitive attributes.

Only available if the underlying estimator supports predict_proba.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

Returns:
  • y_proba( ndarray of shape (n_samples, n_classes) ) –

    Predicted class probabilities.

score(X, y, sample_weight=None)

Return accuracy score on the given test data.

Parameters:
  • X (pandas DataFrame) –

    Feature matrix.

  • y (array-like) –

    True labels.

  • sample_weight (array-like, default: None ) –

    Sample weights for scoring.

Returns:
  • score( float ) –

    Accuracy score.


Utilities

skfair.preprocessing.IntersectionalBinarizer

Bases: BaseEstimator, TransformerMixin

Creates a single binary Protected Group feature (1=Privileged, 0=Unprivileged) from complex, user-defined intersectional criteria.

Privilege is defined by an OR condition over a list of AND conditions (rules). Supports equality, list inclusion, and threshold operators (>, <, >=, <=, !=).
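The OR-of-ANDs rule semantics can be sketched with plain pandas; is_privileged is a hypothetical helper written for illustration, not the transformer's code:

```python
import pandas as pd

_OPS = {">": lambda s, v: s > v, "<": lambda s, v: s < v,
        ">=": lambda s, v: s >= v, "<=": lambda s, v: s <= v,
        "!=": lambda s, v: s != v}

def is_privileged(X, rules):
    # rules: one AND-dict or a list of them; a row is privileged if ANY rule
    # matches, i.e. all of that rule's conditions hold simultaneously.
    if isinstance(rules, dict):
        rules = [rules]
    result = pd.Series(False, index=X.index)
    for rule in rules:
        match = pd.Series(True, index=X.index)
        for col, cond in rule.items():
            if isinstance(cond, dict):      # threshold operators, e.g. {">": 65}
                for op, v in cond.items():
                    match &= _OPS[op](X[col], v)
            elif isinstance(cond, list):    # list inclusion
                match &= X[col].isin(cond)
            else:                           # plain equality
                match &= X[col] == cond
        result |= match
    return result.astype(int)
```

For example, with rules [{"race": "White", "sex": "Male"}, {"age": {">": 65}}], a row is flagged privileged if it is a White male or if it is over 65, regardless of the other attributes.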

Parameters:
  • privileged_definition (dict or list of dicts, default: None ) –

    Defines the criteria for the privileged group.
      • Simple (AND): {"race": "White", "sex": "Male"}
      • Complex (OR of ANDs): [{"race": "White", "sex": "Male"}, {"age": {">": 65}}]

  • group_col_name (str, default: "_is_privileged" ) –

    The name of the new binary column to be created.

  • privileged_value (int or float, default: 1 ) –

    The value representing the privileged group.

fit(X, y=None)

No fitting necessary for this transformation (stateless).

transform(X, y=None)

Applies the intersectional rules to create the binary feature.

Parameters:
  • X (DataFrame) –

    Input features containing all sensitive attributes referenced in privileged_definition.

Returns:
  • X_out( DataFrame ) –

    Copy of X with an additional binary column group_col_name.


skfair.preprocessing.DropColumns

Bases: BaseEstimator, TransformerMixin

Drop specified columns from a DataFrame in a sklearn pipeline.

Useful for removing sensitive attributes before an estimator while keeping them available earlier in the pipeline for fairness preprocessing or evaluation.

Parameters:
  • columns (str or list of str) –

    Column name(s) to drop.

Examples:

>>> from skfair.preprocessing import DropColumns
>>> from skfair.datasets import load_adult
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_adult()
>>> pipe = Pipeline([
...     ("drop_sensitive", DropColumns("sex")),
...     ("clf", LogisticRegression()),
... ])
>>> pipe.fit(X, y)

fit(X, y=None)

Record which columns exist and should be dropped.

Parameters:
  • X (DataFrame) –

    Input data.

  • y (ignored, default: None ) –
Returns:
  • self

transform(X, y=None)

Drop the specified columns from X.

Parameters:
  • X (DataFrame) –

    Input data.

  • y (ignored, default: None ) –
Returns:
  • X_transformed( DataFrame ) –

    DataFrame with the specified columns removed.