autoqild.detectors.sklearn_leakage_detector

A versatile leakage detection class built on top of the scikit-learn framework, supporting multiple estimators.

Classes

SklearnLeakageDetector(padding_name, ...[, ...])

SklearnLeakageDetector class for detecting information leakage using a scikit-learn-based model.

class autoqild.detectors.sklearn_leakage_detector.SklearnLeakageDetector(padding_name, learner_params, fit_params, hash_value, cv_iterations, n_hypothesis, base_directory, search_space, hp_iters, n_inner_folds, validation_loss, random_state=None, **kwargs)

Bases: InformationLeakageDetector

SklearnLeakageDetector class for detecting information leakage using a scikit-learn-based model.

This class extends the InformationLeakageDetector base class and incorporates hyperparameter optimization via Bayesian search, model fitting, and cross-validation using scikit-learn models. It supports the detection of information leakage in machine learning experiments by analyzing the model’s behavior with various padding techniques. The class is highly configurable and works with different search spaces, loss functions, and validation strategies.

Parameters:

padding_name (str) – The name of the padding method used in the experiments to obscure or detect leakage.
learner_params (dict) – Parameters related to the machine learning models (learners) used in the detection process.
fit_params (dict) – Parameters passed to the fit method during model training.
hash_value (str) – A unique hash value used to identify and manage result files for a specific experiment.
cv_iterations (int) – The number of cross-validation iterations to perform during model evaluation.
n_hypothesis (int) – The number of hypotheses or models to be tested for leakage.
base_directory (str) – The base directory where result files, logs, and backups are stored.
search_space (dict) – The hyperparameter search space for Bayesian optimization.
hp_iters (int) – The number of iterations for hyperparameter optimization.
n_inner_folds (int) – The number of folds for inner cross-validation during hyperparameter optimization.
validation_loss (str) – The loss function used to evaluate the performance of models during cross-validation.
random_state (int or RandomState instance, optional) – Controls the randomness for reproducibility, ensuring consistent results across different runs.
**kwargs (dict, optional) – Additional keyword arguments passed to the parent class and used in model fitting.

search_space

The hyperparameter search space used in Bayesian optimization.

Type: dict

hp_iters

The number of iterations for hyperparameter optimization.

Type: int

n_inner_folds

Number of folds for inner cross-validation.

Type: int

validation_loss

The loss function used for validation during hyperparameter tuning.

Type: str

inner_cv_iterator

Cross-validation iterator used for inner folds during hyperparameter optimization.

Type: StratifiedShuffleSplit

tabpfn_folder

Directory where TabPFN optimization results are saved.

Type: str

n_jobs

Number of parallel jobs for hyperparameter search.

Type: int

logger

Logger instance for recording the process of leakage detection.

Type: logging.Logger

detect(detection_method='log_loss_mi')

Executes the detection process to identify potential information leakage using the specified method.

Parameters:

detection_method (str) – The method to use for detecting information leakage. Defaults to 'log_loss_mi'. Options include:
- paired-t-test
- paired-t-test-random
- fishers-exact-mean
- fishers-exact-median
- mid_point_mi
- log_loss_mi
- log_loss_mi_isotonic_regression
- log_loss_mi_platt_scaling
- log_loss_mi_beta_calibration
- log_loss_mi_temperature_scaling
- log_loss_mi_histogram_binning
- p_c_softmax_mi

Returns:

detection_decision (bool) – Indicates whether any models showed significant leakage.
hypothesis_rejected (int) – The number of models flagged for leakage.

Notes

The method implements a Holm-Bonferroni correction to control the family-wise error rate for multiple models.
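To make the correction concrete, here is a minimal sketch of the Holm-Bonferroni step-down procedure in plain Python. It is not the library's code: the helper name `holm_bonferroni`, the significance level `alpha=0.05`, the example p-values, and the rule that a single rejection is enough to flag leakage are all assumptions made for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Visit the hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (0-based rank) is compared with alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            # Step-down rule: once one test fails, all larger p-values are retained.
            break
    return rejected


# Hypothetical p-values, one per evaluated model.
p_values = [0.001, 0.04, 0.03, 0.20]
rejected = holm_bonferroni(p_values, alpha=0.05)

hypothesis_rejected = sum(rejected)          # number of models flagged
detection_decision = hypothesis_rejected > 0  # assumed decision rule
```

With these p-values only the first model is flagged: 0.001 passes the threshold 0.05 / 4, but 0.03 fails the next threshold 0.05 / 3, so the procedure stops there.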

evaluate_scores(X_test, X_train, y_test, y_train, y_pred, p_pred, model, n_model)

Evaluate and store model performance metrics for the detection process.

This method computes various evaluation metrics, such as log-loss, accuracy, and confusion matrix, for the model's predictions. It also supports probability calibration using techniques like isotonic regression and Platt scaling. The results are stored and logged for further analysis.

Parameters:

X_test (array-like of shape (n_samples, n_features)) – The feature matrix for the test set.
X_train (array-like of shape (n_samples, n_features)) – The feature matrix for the training set.
y_test (array-like of shape (n_samples,)) – The true target labels for the test data.
y_train (array-like of shape (n_samples,)) – The true target labels for the training data.
y_pred (array-like of shape (n_samples,)) – The predicted target labels for the test set.
p_pred (array-like of shape (n_samples, n_classes)) – The predicted class probabilities for the test data.
model (object) – The trained model being evaluated.
n_model (int) – The index of the model in the list of evaluated models.
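The metrics named above can be illustrated with a small standard-library sketch. The helper functions and the toy values for `y_test` and `p_pred` are hypothetical and only show what each metric measures; the library itself may compute them differently, and scikit-learn offers equivalents in `sklearn.metrics` (`log_loss`, `accuracy_score`, `confusion_matrix`).

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for t, probs in zip(y_true, p_pred):
        # Clip to avoid log(0) for over-confident predictions.
        p = min(max(probs[t], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


# Toy binary problem: labels are class indices, p_pred holds class probabilities.
y_test = [0, 1, 1, 0]
p_pred = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
# Predicted label = index of the most probable class.
y_pred = [max(range(len(p)), key=p.__getitem__) for p in p_pred]

acc = accuracy(y_test, y_pred)
loss = log_loss(y_test, p_pred)
cm = confusion_matrix(y_test, y_pred, n_classes=2)
```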

fit(X, y)

Fits the model using cross-validation and performs hyperparameter optimization.

This method first checks if the model has already been fitted. If not, it runs the hyperparameter optimization process followed by cross-validation on the specified number of hypotheses. The model is trained using a stratified split of the dataset, and results are evaluated using predefined metrics.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input data used for training the models.
y (array-like of shape (n_samples,)) – The target values (class labels) corresponding to X.

Notes

During fitting, random classifier and majority voting classifier performance is also calculated for comparison.
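As an illustration of what these two reference points measure, the sketch below computes the accuracy of a majority-class predictor and of a uniformly random predictor. The function names, the seed, and the toy labels are assumptions; the library's own baseline computation is not shown in this reference.

```python
import random
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Always predict the most frequent training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return sum(label == majority_label for label in y_test) / len(y_test)


def random_baseline_accuracy(y_train, y_test, seed=0):
    """Predict a class drawn uniformly at random from the training labels."""
    rng = random.Random(seed)
    classes = sorted(set(y_train))
    guesses = [rng.choice(classes) for _ in y_test]
    return sum(g == t for g, t in zip(guesses, y_test)) / len(y_test)


# Toy imbalanced labels with a 70/30 class ratio in both splits.
y_train = [0] * 70 + [1] * 30
y_test = [0] * 14 + [1] * 6

majority_acc = majority_baseline_accuracy(y_train, y_test)
random_acc = random_baseline_accuracy(y_train, y_test)
```

A learned model that does not clearly beat both baselines gives little evidence that the inputs carry information about the labels, which is why they are useful for comparison.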

hyperparameter_optimization(X, y)

Performs Bayesian hyperparameter optimization to identify the best model parameters.

This method uses a Bayesian search strategy to explore a predefined hyperparameter search space and selects the optimal configuration based on the specified validation loss. The method performs cross-validation within the search to ensure that the selected hyperparameters generalize well.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input data to be used for training during hyperparameter optimization.
y (array-like of shape (n_samples,)) – The target values (class labels) corresponding to X.

Returns:

The size of the training dataset after reduction (if applicable).

Return type:

int

Raises:

Exception – If an error occurs during the Bayesian search fitting process.
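The inner cross-validation idea can be sketched with scikit-learn alone. This example deliberately swaps the Bayesian search for a plain loop over a few candidate values, so it only illustrates how a `StratifiedShuffleSplit` iterator and a log-loss validation score select a configuration. The synthetic data, the `LogisticRegression` learner, and the candidate values for `C` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Inner folds: stratified shuffle splits, analogous to inner_cv_iterator.
inner_cv = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

# Hypothetical one-dimensional search space (regularization strength).
candidates = [0.01, 0.1, 1.0, 10.0]

mean_losses = {}
for C in candidates:
    model = LogisticRegression(C=C, max_iter=1000)
    # scikit-learn reports negated log-loss so that higher is better.
    scores = cross_val_score(model, X, y, cv=inner_cv, scoring="neg_log_loss")
    mean_losses[C] = -scores.mean()

# Keep the candidate with the lowest mean validation loss.
best_C = min(mean_losses, key=mean_losses.get)
```

A Bayesian search replaces the fixed candidate list with a surrogate model that proposes the next configuration to try, but the scoring of each candidate on the inner folds follows the same pattern.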

reduce_dataset(X, y)

Reduces the dataset size for optimization purposes if the number of instances is too large.

This method is useful when lightweight models such as TabPFN are used and the dataset is too large to fit into memory or to optimize efficiently. It caps the dataset at a maximum number of instances.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input feature matrix.
y (array-like of shape (n_samples,)) – The target values (class labels) corresponding to X.

Returns:

Reduced versions of X and y, if applicable.

Return type:

tuple
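One plausible way to cap a dataset while keeping class proportions is stratified subsampling, sketched below with the standard library. The function name `stratified_subsample`, the `max_instances` threshold, and the seed are hypothetical; this reference does not state the threshold or the sampling scheme the method actually uses.

```python
import random
from collections import defaultdict


def stratified_subsample(X, y, max_instances=1000, seed=0):
    """Return (X, y) unchanged if small enough, else a class-balanced subsample."""
    if len(y) <= max_instances:
        return X, y
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    keep = []
    for indices in by_class.values():
        # Proportional share of the budget, with at least one row per class.
        n_keep = max(1, round(max_instances * len(indices) / len(y)))
        keep.extend(rng.sample(indices, min(n_keep, len(indices))))
    # Rounding can overshoot the budget slightly, so trim at random.
    rng.shuffle(keep)
    keep = sorted(keep[:max_instances])
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: 3000 rows with a 2:1 class ratio, reduced to 300 rows.
X = [[i] for i in range(3000)]
y = [0] * 2000 + [1] * 1000
X_small, y_small = stratified_subsample(X, y, max_instances=300)
```

In this toy case the reduced labels keep the original 2:1 ratio, with 200 rows of class 0 and 100 rows of class 1.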