autoqild.mi_estimators.gmm_mi_estimator

Gaussian Mixture Model-based MI estimator for evaluating mutual information using probabilistic clustering.

Classes

GMMMIEstimator(n_classes, n_features[, ...])

GMMMIEstimator class for estimating Mutual Information (MI) using Gaussian Mixture Models (GMMs) and performing classification using Logistic Regression.

class autoqild.mi_estimators.gmm_mi_estimator.GMMMIEstimator(n_classes, n_features, y_cat=False, covariance_type='full', reg_covar=1e-06, val_size=0.3, n_reduced=20, reduction_technique='select_from_model_rf', random_state=42, **kwargs)[source]

Bases: MIEstimatorBase

GMMMIEstimator class for estimating Mutual Information (MI) using Gaussian Mixture Models (GMMs) and performing classification using Logistic Regression.

This class leverages GMMs to estimate mutual information and uses feature reduction techniques to create a robust classification model. It evaluates different GMMs based on goodness-of-fit measures such as AIC, BIC, and log-likelihood.
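As a rough illustration of the goodness-of-fit measures named above, the sketch below computes AIC and BIC from a log-likelihood using their standard definitions. The candidate values are hypothetical, and the sketch does not show how this class combines the three measures internally.

```python
import math


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


# Hypothetical candidates: (n_components, total log-likelihood, free parameters).
candidates = [(2, -1250.0, 11), (3, -1235.0, 17), (4, -1232.0, 23)]
n_samples = 500

best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))
best_by_bic = min(candidates, key=lambda c: bic(c[1], c[2], n_samples))
print("AIC prefers", best_by_aic[0], "components")
print("BIC prefers", best_by_bic[0], "components")
```

BIC penalizes extra parameters more heavily than AIC once the sample size is moderately large, so the two criteria can prefer different component counts, as they do for these made-up numbers.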

Parameters:
  • n_classes (int) – Number of classes in the classification data samples.

  • n_features (int) – Number of features or dimensionality of the inputs of the classification data samples.

  • y_cat (bool, optional, default=False) – Whether the target variable is treated as categorical rather than real-valued.

  • covariance_type ({full, tied, diag, spherical}, default=`full`) –

    String describing the type of covariance parameters to use. Must be one of:

    • full: each component has its own general covariance matrix.

    • tied: all components share the same general covariance matrix.

    • diag: each component has its own diagonal covariance matrix.

    • spherical: each component has its own single variance.

  • reg_covar (float, default=1e-6) – Non-negative regularization added to the diagonal of each covariance matrix. Helps keep the covariance matrices positive definite.

  • val_size (float, optional, default=0.30) – Validation set size as a proportion of the dataset to estimate GMMs.

  • n_reduced (int, optional, default=20) – Number of features to keep after feature reduction, which is applied when n_features > 100.

  • reduction_technique (str, optional, default=`select_from_model_rf`) –

    Technique to use for feature reduction, provided by scikit-learn. Must be one of:

    • recursive_feature_elimination_et: Uses ExtraTreesClassifier to recursively remove features and build a model.

    • recursive_feature_elimination_rf: Uses RandomForestClassifier to recursively remove features and build a model.

    • select_from_model_et: Meta-transformer for selecting features based on importance weights using ExtraTreesClassifier.

    • select_from_model_rf: Meta-transformer for selecting features based on importance weights using RandomForestClassifier.

    • pca: Principal Component Analysis for dimensionality reduction.

    • lda: Linear Discriminant Analysis for separating classes.

    • tsne: t-Distributed Stochastic Neighbor Embedding for visualization purposes.

    • nmf: Non-Negative Matrix Factorization for dimensionality reduction.

  • random_state (int or object, optional, default=42) – Random state for reproducibility.

  • **kwargs (dict, optional) – Additional keyword arguments.
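To make the covariance_type options more concrete, the following sketch counts the free parameters of a Gaussian mixture for each option. The helper gmm_free_parameters is hypothetical and not part of this package; the counts follow from the textbook parameterization (means, covariances, and mixing weights that sum to one).

```python
def gmm_free_parameters(n_components: int, n_features: int,
                        covariance_type: str = "full") -> int:
    """Count free parameters of a Gaussian mixture (hypothetical helper)."""
    k, d = n_components, n_features
    cov_params = {
        "full": k * d * (d + 1) // 2,   # one symmetric matrix per component
        "tied": d * (d + 1) // 2,       # one symmetric matrix shared by all
        "diag": k * d,                  # one variance per feature per component
        "spherical": k,                 # one variance per component
    }
    if covariance_type not in cov_params:
        raise ValueError(f"Unknown covariance_type: {covariance_type!r}")
    mean_params = k * d                 # one mean vector per component
    weight_params = k - 1               # weights sum to one
    return cov_params[covariance_type] + mean_params + weight_params


# Three components on 20 features (the default value of n_reduced).
for cov_type in ("full", "tied", "diag", "spherical"):
    print(cov_type, gmm_free_parameters(3, 20, cov_type))
```

The gap between full and the more constrained options grows quadratically with the number of features, which is one reason reducing the feature space before fitting a GMM can help.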

y_cat

Whether the target variable is treated as categorical rather than real-valued.

Type:

bool

num_comps

List of component counts for GMM evaluation.

Type:

list

reg_covar

Regularization parameter for the GMM covariance matrices.

Type:

float

n_models

Number of GMM models to fit and evaluate.

Type:

int

covariance_type

The covariance type for the GMM.

Type:

str

val_size

Validation set size as a proportion of the dataset.

Type:

float

n_reduced

Number of reduced features for dimensionality reduction.

Type:

int

reduction_technique

Technique used for feature reduction.

Type:

str

selection_model

The fitted feature selection model, or None if not yet fitted.

Type:

object or None

__is_fitted__

Indicates whether the model is fitted.

Type:

bool

cls_model

The classification model used after feature reduction.

Type:

LogisticRegression

best_model

The best fitted GMM model based on likelihood, or None if no model is selected.

Type:

object or None

best_gmm_model

The best fitted GMM used for mutual information estimation.

Type:

object or None

best_likelihood

The highest log-likelihood score achieved during model evaluation.

Type:

float or None

best_bic

The best Bayesian Information Criterion (BIC) score.

Type:

float or None

best_aic

The best Akaike Information Criterion (AIC) score.

Type:

float or None

best_mi

The best estimated mutual information.

Type:

float or None

best_seed

The random seed used to achieve the best model.

Type:

int or None

round

The optimal round for feature selection.

Type:

int or None

logger

Logger instance for logging information.

Type:

logging.Logger

Private Methods
__get_goodnessof_fit__(gmm, X, y)[source]

Calculate goodness-of-fit measures for the Gaussian Mixture Model (GMM) used for MI estimation.

__transform__(X, y=None)[source]

Reduce the feature matrix from ‘n_features’ features to ‘n_reduced’ features using the specified reduction technique.
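The idea behind importance-based reduction can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: select_top_features and the importance values are hypothetical, and the actual method delegates to the scikit-learn technique chosen via reduction_technique. With a tree ensemble, such importances would typically come from the fitted model's feature_importances_ attribute.

```python
def select_top_features(X, importances, n_reduced):
    """Keep the n_reduced columns of X with the largest importance scores.

    Hypothetical helper: returns the reduced rows and the kept column indices.
    """
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    keep = sorted(ranked[:n_reduced])  # preserve the original column order
    reduced = [[row[i] for i in keep] for row in X]
    return reduced, keep


X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
importances = [0.10, 0.50, 0.05, 0.35]  # made-up scores, one per feature

X_reduced, kept = select_top_features(X, importances, n_reduced=2)
print(kept)
print(X_reduced)
```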

create_classification_model(X, y, **kwd)[source]

Create the logistic regression classification model on the reduced feature space of n_reduced features.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Feature matrix.

  • y (array-like of shape (n_samples,)) – Target vector.

  • **kwd (dict, optional) – Additional keyword arguments.

decision_function(X, verbose=0)[source]

Predict confidence scores for samples. Each score is proportional to the signed distance of that sample to the hyperplane.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Feature matrix.

  • verbose (int, optional, default=0) – Verbosity level.

Returns:

decision – Decision function values.

Return type:

array-like of shape (n_samples,)
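The relation between a confidence score and the signed distance to the hyperplane can be illustrated for a generic binary linear model. The weights, bias, and samples below are made up, and the sketch does not reproduce this class's internals: the score is w·x + b, and dividing it by the norm of w gives the signed distance.

```python
import math


def decision_scores(X, weights, bias):
    """Linear confidence scores w.x + b, one per sample."""
    return [sum(w * x for w, x in zip(weights, row)) + bias for row in X]


def signed_distances(X, weights, bias):
    """Signed Euclidean distance of each sample to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [score / norm for score in decision_scores(X, weights, bias)]


weights, bias = [3.0, 4.0], -5.0          # hypothetical fitted coefficients
X = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.5]]  # hypothetical samples

scores = decision_scores(X, weights, bias)
distances = signed_distances(X, weights, bias)
print(scores)
print(distances)
```

A positive score places the sample on the positive-class side of the hyperplane, a negative score on the other side, and a score of zero on the boundary itself.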

estimate_mi(X, y, verbose=0, **kwd)[source]

Estimate mutual information using the best fitted GMM model.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Feature matrix.

  • y (array-like of shape (n_samples,)) – Target vector.

  • verbose (int, optional, default=0) – Verbosity level.

  • **kwd (dict, optional) – Additional keyword arguments.

Returns:

mi_estimated – Estimated mutual information.

Return type:

float
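For readers unfamiliar with the quantity being estimated, the sketch below evaluates the definition of mutual information on a small discrete joint distribution. It is not the GMM-based procedure used by this method, which this page does not describe in detail; the function name, the toy distributions, and the choice of nats as the unit are assumptions made for illustration.

```python
import math


def mutual_information(joint):
    """Mutual information in nats for a discrete joint table joint[i][j] = p(x_i, y_j)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0:  # zero-probability cells contribute nothing
                mi += p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]  # X tells nothing about Y
dependent = [[0.5, 0.0], [0.0, 0.5]]        # X determines Y exactly

print(mutual_information(independent))
print(mutual_information(dependent))
```

Independent variables give zero, while a binary variable that is fully determined by the other gives ln 2 nats (one bit).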

fit(X, y, verbose=0, **kwd)[source]

Fit the GMM model and estimate mutual information.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Feature matrix.

  • y (array-like of shape (n_samples,)) – Target vector.

  • verbose (int, optional, default=0) – Verbosity level.

  • **kwd (dict, optional) – Additional keyword arguments.

Returns:

self – Fitted estimator.

Return type:

GMMMIEstimator
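A minimal usage sketch, based only on the signatures documented on this page, might look as follows. It assumes autoqild and NumPy are installed; the synthetic data, the sample sizes, and the run_example wrapper are illustrative choices, and the exact behavior and output are not verified here.

```python
def run_example():
    """Fit the estimator on synthetic data and return (MI estimate, predictions)."""
    # Imports live inside the function so the sketch can be read or imported
    # without the packages installed; calling it requires both packages.
    import numpy as np
    from autoqild.mi_estimators.gmm_mi_estimator import GMMMIEstimator

    rng = np.random.default_rng(0)
    n_samples, n_features, n_classes = 300, 5, 2

    # Synthetic data: the class label shifts the mean of every feature.
    y = rng.integers(0, n_classes, size=n_samples)
    X = rng.normal(size=(n_samples, n_features)) + y[:, None]

    estimator = GMMMIEstimator(n_classes=n_classes, n_features=n_features)
    estimator.fit(X, y)               # documented as returning the fitted estimator
    mi = estimator.estimate_mi(X, y)  # documented as returning a float
    y_pred = estimator.predict(X)     # documented as returning class labels
    return mi, y_pred
```

Because the features here depend on the label, a working estimator would be expected to report a positive MI value; for features independent of the label the value should be close to zero.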

predict(X, verbose=0)[source]

Predict class labels for the input samples using the logistic regression classification model fitted on the n_reduced reduced features.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Feature matrix.

  • verbose (int, optional, default=0) – Verbosity level.

Returns:

y_pred – Predicted class labels.

Return type:

array-like of shape (n_samples,)

predict_proba(X, verbose=0)[source]

Predict class probabilities for the input samples using the logistic regression classification model fitted on the n_reduced reduced features.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Feature matrix.

  • verbose (int, optional, default=0) – Verbosity level.

Returns:

y_pred – Predicted class probabilities.

Return type:

array-like of shape (n_samples, n_classes)
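How per-class scores relate to probabilities and to predicted labels can be sketched with a softmax, which is the mapping used by multinomial logistic regression (binary models use the equivalent sigmoid). The scores below are hypothetical and the helper names are not part of this package.

```python
import math


def softmax(scores):
    """Turn per-class scores into probabilities that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def labels_from_proba(proba_rows):
    """Pick the most probable class index per sample (first index wins ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in proba_rows]


scores = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]  # made-up scores for two samples
proba = [softmax(row) for row in scores]
labels = labels_from_proba(proba)
print(proba)
print(labels)
```

Each row of probabilities has one entry per class, whereas the predicted labels form a single value per sample.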

score(X, y, sample_weight=None, verbose=0)[source]

Compute the likelihood score of the GMM model.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Feature matrix.

  • y (array-like of shape (n_samples,)) – Target vector.

  • sample_weight (array-like of shape (n_samples,), optional) – Sample weights.

  • verbose (int, optional, default=0) – Verbosity level.

Returns:

score – The score of the model based on likelihood.

Return type:

float
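As background for a likelihood-based score, the sketch below computes the average per-sample log-likelihood of a one-dimensional Gaussian mixture, which is the quantity that scikit-learn's GaussianMixture.score reports. Whether this class returns exactly that value is an assumption, and the mixture parameters and samples are made up.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def mixture_log_likelihood(x, weights, means, variances):
    """Log density of a 1-D Gaussian mixture at x, via log-sum-exp."""
    terms = [math.log(w) + gaussian_log_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def average_log_likelihood(samples, weights, means, variances):
    """Mean log-likelihood per sample (higher means a better fit)."""
    total = sum(mixture_log_likelihood(x, weights, means, variances)
                for x in samples)
    return total / len(samples)


samples = [-2.1, -1.9, 2.0, 2.2]  # hypothetical data with two clusters
good_fit = average_log_likelihood(samples, [0.5, 0.5], [-2.0, 2.0], [0.1, 0.1])
poor_fit = average_log_likelihood(samples, [0.5, 0.5], [0.0, 0.0], [0.1, 0.1])
print(good_fit, poor_fit)
```

A mixture whose components sit on the data clusters scores much higher than one centered away from them.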