autoqild.detectors.ild_base_class
Abstract base class that defines the structure and core methods for leakage detection algorithms.
Classes
InformationLeakageDetector – Identifies and diagnoses information leakage in machine learning models.
- class autoqild.detectors.ild_base_class.InformationLeakageDetector(padding_name, learner_params, fit_params, hash_value, cv_iterations, n_hypothesis, base_directory, detection_method, random_state, **kwargs)
Bases:
object
The InformationLeakageDetector class is designed to identify and diagnose information leakage in machine learning models. Information leakage occurs when a model inadvertently gains access to information that should not be available during training, leading to overly optimistic performance estimates.
This class facilitates the detection of such leakage by employing various statistical and machine learning-based methods. It supports multiple detection techniques and is capable of managing the entire process, from cross-validation setup to result storage and evaluation.
The class is built with flexibility in mind, allowing users to easily extend or customize detection techniques. It includes robust mechanisms for handling result files and backups, ensuring that detection results are safely stored and can be restored if necessary.
- Parameters:
padding_name (str) – The name of the padding method used in the experiments to potentially obscure or prevent leakage.
learner_params (dict) – Parameters related to the machine learning models (learners) used in the leakage detection process.
fit_params (dict) – Parameters passed to the fit method of the models during training.
hash_value (str) – A unique hash value used to identify and manage result files for a specific experiment.
cv_iterations (int) – The number of cross-validation iterations to perform during model evaluation.
n_hypothesis (int) – The number of hypotheses or models to be tested for leakage.
base_directory (str) – The base directory where result files, logs, and backups are stored.
detection_method (str) – The method to use for detecting information leakage. Options include:
- paired-t-test: Uses a paired t-test to compare the accuracy of models against the majority voting baseline.
- paired-t-test-random: Uses a paired t-test to compare the accuracy of models against a random classifier.
- fishers-exact-mean: Applies Fisher’s Exact Test on the confusion matrix and computes the mean p-value.
- fishers-exact-median: Applies Fisher’s Exact Test on the confusion matrix and computes the median p-value.
- estimated_mutual_information: Estimates mutual information to detect leakage.
- mid_point_mi: Detects leakage using the midpoint mutual information estimation.
- log_loss_mi: Detects leakage using log loss mutual information estimation.
- log_loss_mi_isotonic_regression: Uses log loss mutual information estimation with isotonic regression calibration.
- log_loss_mi_platt_scaling: Uses log loss mutual information estimation with Platt scaling calibration.
- log_loss_mi_beta_calibration: Uses log loss mutual information estimation with beta calibration.
- log_loss_mi_temperature_scaling: Uses log loss mutual information estimation with temperature scaling.
- log_loss_mi_histogram_binning: Uses log loss mutual information estimation with histogram binning.
- p_c_softmax_mi: Uses PC-Softmax mutual information estimation for detection.
random_state (int or RandomState instance) – Controls the randomness for reproducibility, ensuring consistent results across different runs.
**kwargs (dict, optional) – Additional keyword arguments passed to customize the detector.
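As a rough illustration of how the constructor arguments above fit together, the sketch below collects them in a plain dictionary that could be unpacked into a concrete detector subclass. It does not import autoqild: every value is a made-up placeholder, and `validate_detection_method` is a hypothetical helper, not part of the library. Only the method names are taken from the list above.

```python
# Detection method names as listed in the parameter description above.
DETECTION_METHODS = (
    "paired-t-test",
    "paired-t-test-random",
    "fishers-exact-mean",
    "fishers-exact-median",
    "estimated_mutual_information",
    "mid_point_mi",
    "log_loss_mi",
    "log_loss_mi_isotonic_regression",
    "log_loss_mi_platt_scaling",
    "log_loss_mi_beta_calibration",
    "log_loss_mi_temperature_scaling",
    "log_loss_mi_histogram_binning",
    "p_c_softmax_mi",
)


def validate_detection_method(name):
    """Hypothetical helper: reject names that are not in the documented list."""
    if name not in DETECTION_METHODS:
        raise ValueError(f"Unknown detection method: {name!r}")
    return name


# Illustrative constructor arguments; every value here is a placeholder.
detector_kwargs = {
    "padding_name": "example-padding",
    "learner_params": {},
    "fit_params": {},
    "hash_value": "abc123",
    "cv_iterations": 10,
    "n_hypothesis": 5,
    "base_directory": "./results",
    "detection_method": validate_detection_method("log_loss_mi"),
    "random_state": 42,
}
```

Checking the method name before construction surfaces a typo such as `log-loss-mi` early, instead of after cross-validation has already run.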
- logger
Logger instance used for recording the steps and processes of the leakage detection.
- Type:
logging.Logger
- padding_name
The name of the padding method, used for creating unique identifiers and managing results.
- Type:
str
- padding_code
A hash code derived from the padding name, used to uniquely identify the experiment.
- Type:
str
- fit_params
Parameters used for fitting the models during training and evaluation.
- Type:
dict
- learner_params
Parameters related to the machine learning models (learners) used in the leakage detection process.
- Type:
dict
- cv_iterations
The number of cross-validation iterations to perform during model evaluation.
- Type:
int
- n_hypothesis
The number of hypotheses or models being tested for leakage.
- Type:
int
- hash_value
A unique identifier (hash) used to manage and store results.
- Type:
str
- random_state
Random state instance that ensures reproducibility in cross-validation and other random processes.
- Type:
RandomState instance
- cv_iterator
Cross-validation iterator that manages the splitting of data into training and test sets.
- Type:
StratifiedKFold
- estimators
A list of models (estimators) that are evaluated for leakage.
- Type:
list
- results
Dictionary that stores the results of each model's evaluation, organized by metric.
- Type:
dict
- base_detector
The underlying model or detector used as the reference for detecting leakage.
- Type:
object
- base_directory
The base directory where all results, logs, and backups are stored.
- Type:
str
- detection_method
The method used for detecting information leakage, as specified by the user.
- Type:
str
- rf_name
The filename where the main results are stored.
- Type:
str
- results_file
The full path to the main results file.
- Type:
str
- rf_backup_name
The filename where backup results are stored.
- Type:
str
- results_file_backup
The full path to the backup results file.
- Type:
str
Private Methods
- __init_results_files__
Initializes the results and backup files and restores results from backup if necessary.
- _is_fitted_
Checks if the detector has already been fitted by verifying the existence of results files.
- __create_results_from_backup__
Creates results files from backup if the main results file is missing or incomplete.
- __update_backup_file__
Updates the backup results file with the latest results from the main results file.
- __format_name__(padding_name)
Formats the padding name and generates a corresponding hash code.
- __read_majority_accuracies__
Reads and returns the accuracy scores from the majority voting classifier.
- __read_random_accuracies__
Reads and returns the accuracy scores from the random classifier.
- __get_training_dataset__(X, y)
Splits the data into training and test sets using cross-validation.
- __store_results__
Stores the evaluation results into the main results file and updates the backup.
- __read_results_file__(detection_method)
Reads and returns the results for the specified detection method.
- __calculate_majority_voting_accuracy__(X_train, y_train, X_test, y_test)
Calculates and logs the accuracy of a majority voting classifier.
- __calculate_random_classifier_accuracy__(X_train, y_train, X_test, y_test)
Calculates and logs the accuracy of a random classifier.
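To make the majority voting baseline concrete, here is a minimal standard-library sketch of what such a baseline computes: it always predicts the most frequent training label and ignores the features. The function name and the toy labels are assumptions for illustration; the library's private method may be implemented differently (for example with a scikit-learn dummy classifier).

```python
from collections import Counter


def majority_voting_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Toy labels, made up for this example.
y_train = [0, 0, 0, 1, 1]
y_test = [0, 1, 0, 0]
acc = majority_voting_accuracy(y_train, y_test)  # 3 of the 4 test labels are 0
```

A model whose accuracy is not significantly above this baseline has learned nothing beyond the class prior, which is why the paired-t-test method compares against it.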
- detect(detection_method='log_loss_mi')
Detect potential information leakage using the configured detection method.
This method applies statistical tests, such as paired t-tests or Fisher’s exact tests, to determine if there is a significant difference in model performance that indicates information leakage. The results of these tests are used to decide whether leakage is present and, if so, how many models exhibit it.
- Parameters:
detection_method (str) – The method to use for detecting information leakage. Options include:
- paired-t-test: Uses a paired t-test to compare the accuracy of models against the majority voting baseline.
- paired-t-test-random: Uses a paired t-test to compare the accuracy of models against a random classifier.
- fishers-exact-mean: Applies Fisher’s Exact Test on the confusion matrix and computes the mean p-value.
- fishers-exact-median: Applies Fisher’s Exact Test on the confusion matrix and computes the median p-value.
- estimated_mutual_information: Estimates mutual information to detect leakage.
- mid_point_mi: Detects leakage using the midpoint mutual information estimation.
- log_loss_mi: Detects leakage using log loss mutual information estimation.
- log_loss_mi_isotonic_regression: Uses log loss mutual information estimation with isotonic regression calibration.
- log_loss_mi_platt_scaling: Uses log loss mutual information estimation with Platt scaling calibration.
- log_loss_mi_beta_calibration: Uses log loss mutual information estimation with beta calibration.
- log_loss_mi_temperature_scaling: Uses log loss mutual information estimation with temperature scaling.
- log_loss_mi_histogram_binning: Uses log loss mutual information estimation with histogram binning.
- p_c_softmax_mi: Uses PC-Softmax mutual information estimation for detection.
- Returns:
detection_decision (bool) – Indicates whether any models showed significant leakage.
hypothesis_rejected (int) – The number of models flagged for leakage.
Notes
The method implements a Holm-Bonferroni correction to control the family-wise error rate for multiple models.
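The Holm-Bonferroni step-down procedure mentioned in the note can be sketched in a few lines. This is a generic textbook version, not the library's code: the function name, the significance level `alpha=0.05`, the example p-values, and the rule that any rejection counts as a detection are all assumptions made for illustration.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # Step-down threshold: alpha / (m - rank), with rank starting at 0.
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return rejected


# Made-up p-values, one per model.
p_values = [0.01, 0.04, 0.03, 0.005]
rejected = holm_bonferroni(p_values, alpha=0.05)
hypothesis_rejected = sum(rejected)          # number of models flagged
detection_decision = hypothesis_rejected > 0  # assumed rule: any rejection
```

With these values the two smallest p-values (0.005 and 0.01) pass their thresholds of 0.0125 and about 0.0167, while 0.03 exceeds 0.025, so the procedure stops with two rejections.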
- evaluate_scores(X_test, X_train, y_test, y_train, y_pred, p_pred, model, n_model)
Evaluate and store model performance metrics for the detection process.
This method computes various evaluation metrics, such as log-loss, accuracy, and confusion matrix, for the model’s predictions. It also supports probability calibration using techniques like isotonic regression and Platt scaling. The results are stored and logged for further analysis.
- Parameters:
X_test (array-like of shape (n_samples, n_features)) – The feature matrix for the test set.
X_train (array-like of shape (n_samples, n_features)) – The feature matrix for the training set.
y_test (array-like of shape (n_samples,)) – The true target labels for the test set.
y_train (array-like of shape (n_samples,)) – The true target labels for the training set.
y_pred (array-like of shape (n_samples,)) – The predicted labels for the test set.
p_pred (array-like of shape (n_samples, n_classes)) – The predicted class probabilities for the test set.
model (object) – The trained model that is being evaluated.
n_model (int) – The index of the model within the list of models being evaluated.
Notes
The method handles specific metrics like log-loss-based mutual information (MI) estimation and confusion matrices, which are critical for detecting information leakage.
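One common way to turn log-loss into a mutual information estimate is to subtract it from the empirical entropy of the labels, since the log-loss upper-bounds the conditional entropy of the labels given the inputs. The sketch below shows that formulation in nats using only the standard library. It is an assumption that this matches the idea behind log_loss_mi; the library's estimator may differ in log base, clipping, or calibration, and all function names here are hypothetical.

```python
import math
from collections import Counter


def marginal_entropy(y):
    """Empirical entropy H(Y) of the labels, in nats."""
    n = len(y)
    return -sum((c / n) * math.log(c / n) for c in Counter(y).values())


def log_loss(y, p_pred):
    """Mean negative log-probability assigned to the true labels."""
    eps = 1e-15  # avoid log(0) for overconfident predictions
    return -sum(math.log(max(p[label], eps)) for label, p in zip(y, p_pred)) / len(y)


def log_loss_mi(y, p_pred):
    """Estimate MI as H(Y) minus log-loss, clipped at zero."""
    return max(0.0, marginal_entropy(y) - log_loss(y, p_pred))


# Toy binary labels and predicted class probabilities (made up).
y = [0, 1, 0, 1]
mi_uninformative = log_loss_mi(y, [[0.5, 0.5]] * 4)
mi_confident = log_loss_mi(y, [[0.99, 0.01], [0.01, 0.99]] * 2)
```

Predictions that carry no information about the label give an estimate near zero, while confident, correct predictions approach the label entropy (about 0.693 nats for balanced binary labels).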
- fit(X, y)
Fit the model using cross-validation and the specified detection method.
This function trains the model on the provided dataset, applying cross-validation based on the configured detection strategy. The method also integrates hyperparameter optimization if the model is not already fitted. It serves as the main entry point for model training, allowing subclasses to customize the fitting process for different types of detectors.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input feature matrix used for model training.
y (array-like of shape (n_samples,)) – The target values (class labels) corresponding to each row in X.
- Raises:
NotImplementedError – If the method is not implemented by the subclass.
- hyperparameter_optimization(X, y)
Perform hyperparameter optimization using Bayesian search to identify the best model parameters.
This method is intended to explore a wide range of hyperparameters using an optimization strategy (such as Bayesian search) to determine the most effective configuration for the models used in information leakage detection. The method is designed to be overridden by subclasses to implement specific optimization routines.
- Parameters:
X (array-like of shape (n_samples, n_features)) – The input feature matrix used for training during hyperparameter optimization.
y (array-like of shape (n_samples,)) – The target values (class labels) corresponding to each row in X.
- Returns:
The size of the training dataset after the reduction (if applicable).
- Return type:
int
- Raises:
NotImplementedError – If the method is not implemented by the subclass.
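Because both fit and hyperparameter_optimization raise NotImplementedError in the base class, a concrete detector has to override them. The sketch below shows that pattern with stand-in classes only: `BaseDetectorSketch` and `ToyDetector` are hypothetical names, nothing is imported from autoqild, and no real search is performed.

```python
class BaseDetectorSketch:
    """Hypothetical stand-in for an abstract detector base class."""

    def fit(self, X, y):
        raise NotImplementedError("Subclasses must implement fit().")

    def hyperparameter_optimization(self, X, y):
        raise NotImplementedError(
            "Subclasses must implement hyperparameter_optimization()."
        )


class ToyDetector(BaseDetectorSketch):
    """Toy subclass that only records that it was fitted."""

    def __init__(self):
        self.fitted = False
        self.train_size = None

    def hyperparameter_optimization(self, X, y):
        # A real subclass would run a search (e.g. Bayesian) here.
        # This sketch only reports the training-set size as an int.
        return len(X)

    def fit(self, X, y):
        # Run the (placeholder) optimization first, then mark as fitted.
        self.train_size = self.hyperparameter_optimization(X, y)
        self.fitted = True


detector = ToyDetector()
detector.fit([[0.1], [0.2], [0.3]], [0, 1, 0])
```

Calling `fit` on the base stand-in raises NotImplementedError, while the subclass completes and records a training size of 3.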