autoqild.dataset_readers.synthetic_data_generator¶
Generates synthetic datasets for testing and evaluating machine learning models, with optional label noise introduced by flipping a given percentage of the labels.
Classes
SyntheticDatasetGenerator – Generator for synthetic datasets with a focus on generating data with varying class distances.
- class autoqild.dataset_readers.synthetic_data_generator.SyntheticDatasetGenerator(n_classes=2, n_features=2, samples_per_class=500, flip_y=0.1, random_state=42, fold_id=0, imbalance=0.0, gen_type='single', **kwargs)[source]¶
Bases: object
Generator for synthetic datasets with a focus on generating data with varying class distances.
This class generates synthetic datasets by adjusting the distance between class distributions, allowing for the simulation of scenarios with varying levels of overlap between classes. It is designed to help in testing classifiers on datasets with controlled class separability.
- Parameters:
n_classes (int, default=2) – Number of classes in the generated dataset.
n_features (int, default=2) – Number of features in the generated dataset.
samples_per_class (int or dict, default=500) – Number of samples per class. If an integer is provided, it is assumed that all classes have the same number of samples. If a dictionary is provided, the keys should be class labels and values should be the number of samples for each class.
flip_y (float, default=0.1) – The fraction of samples whose class labels will be randomly flipped to simulate noise.
random_state (int or RandomState instance, default=42) – Random state for reproducibility.
fold_id (int, default=0) – Fold ID used for random seed generation.
imbalance (float, default=0.0) – Proportion of the minority class in the dataset. Must be between 0 and 1.
gen_type (str, default=`single`) – Type of generation process. It can be used to modify the dataset generation method.
**kwargs (dict) – Additional keyword arguments.
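To make the effect of flip_y concrete, here is a minimal sketch of one plausible label-noise model: a fraction flip_y of the labels is replaced by a different class chosen uniformly at random. The helper name flip_labels and the exact flipping rule are assumptions for illustration only; the class may implement label noise differently.

```python
import random

def flip_labels(y, flip_y=0.1, n_classes=2, seed=42):
    """Return a copy of y with round(flip_y * len(y)) labels changed.

    Hypothetical helper, not part of autoqild.
    """
    rng = random.Random(seed)
    y_noisy = list(y)
    n_flip = int(round(flip_y * len(y_noisy)))
    # Choose distinct positions whose labels will be flipped.
    for i in rng.sample(range(len(y_noisy)), n_flip):
        # Assumption: the new label is drawn uniformly from the other classes.
        others = [c for c in range(n_classes) if c != y_noisy[i]]
        y_noisy[i] = rng.choice(others)
    return y_noisy

y = [0] * 50 + [1] * 50
y_noisy = flip_labels(y, flip_y=0.1)
n_changed = sum(a != b for a, b in zip(y, y_noisy))
```

With flip_y=0.1 and 100 samples, n_changed counts the labels that differ from the clean ones.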
- n_classes¶
Number of classes in the generated dataset.
- Type:
int
- n_features¶
Number of features in the generated dataset.
- Type:
int
- random_state¶
Random state instance for reproducibility.
- Type:
RandomState instance
- fold_id¶
Fold ID used for random seed generation.
- Type:
int
- means¶
Dictionary storing the mean vectors for each class.
- Type:
dict
- covariances¶
Dictionary storing the covariance matrices for each class.
- Type:
dict
- seeds¶
Dictionary storing the random seeds used for generating each class.
- Type:
dict
- samples_per_class¶
Dictionary storing the number of samples for each class.
- Type:
dict
- imbalance¶
Proportion of the minority class in the dataset.
- Type:
float
- gen_type¶
Type of generation process.
- Type:
str
- n_instances¶
Total number of instances in the generated dataset.
- Type:
int
- class_labels¶
Array of class labels.
- Type:
numpy.ndarray
- y_prob¶
Dictionary storing the probability of each class.
- Type:
dict
- ent_y¶
Entropy of the class distribution.
- Type:
float or None
- flip_y_prob¶
Dictionary storing the probability of flipped class labels for each class.
- Type:
dict
- flip_y¶
The fraction of samples whose class labels will be randomly flipped to simulate noise.
- Type:
float
- logger¶
Logger instance for logging information.
- Type:
logging.Logger
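The relationship between samples_per_class, n_instances and y_prob described above can be sketched as follows. This is an assumption-based illustration: the helper normalize_samples_per_class is hypothetical, and class labels are assumed to be the integers 0 to n_classes - 1.

```python
def normalize_samples_per_class(samples_per_class, n_classes):
    """Turn an int or dict specification into a per-class count dict.

    Hypothetical helper, not part of autoqild.
    """
    if isinstance(samples_per_class, int):
        # Same number of samples for every class.
        return {k: samples_per_class for k in range(n_classes)}
    return dict(samples_per_class)

counts = normalize_samples_per_class({0: 900, 1: 100}, n_classes=2)
n_instances = sum(counts.values())  # total number of instances
# Empirical class probabilities p(y = k) = count_k / n_instances.
y_prob = {k: n / n_instances for k, n in counts.items()}
```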
- Private Methods
- __generate_cov_means__[source]¶
Generate the mean vectors and covariance matrices for each class. This method creates a random orthogonal matrix and generates a positive semi-definite covariance matrix. It then calculates the mean vector for each class.
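The construction of a positive semi-definite covariance matrix from a random orthogonal matrix can be illustrated in two dimensions, where every rotation matrix is orthogonal. The sketch below builds Q diag(l1, l2) Q^T by hand. The function name, the eigenvalue range and the restriction to 2-D are assumptions made for illustration and do not reflect the library's actual code.

```python
import math
import random

def random_covariance_2d(rng):
    """Build a 2x2 covariance matrix as Q diag(l1, l2) Q^T.

    Hypothetical helper: Q = [[c, -s], [s, c]] is a random rotation,
    which is an orthogonal matrix.
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    # Non-negative eigenvalues guarantee a positive semi-definite result.
    l1, l2 = rng.uniform(0.5, 2.0), rng.uniform(0.5, 2.0)
    off_diag = c * s * (l1 - l2)
    cov = [
        [c * c * l1 + s * s * l2, off_diag],
        [off_diag, s * s * l1 + c * c * l2],
    ]
    return cov, (l1, l2)

cov, (l1, l2) = random_covariance_2d(random.Random(42))
trace = cov[0][0] + cov[1][1]
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
```

Because the trace and determinant are preserved under the orthogonal change of basis, trace equals l1 + l2 and det equals l1 * l2 up to rounding.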
- bayes_predictor_mi()[source]¶
Calculate the mutual information (MI) from the probability distribution function, according to the formula below.
\[I(X;Y) = H(X) - H(X|Y)\]
- Returns:
mutual_information – The mutual information of the dataset.
- Return type:
float
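For intuition, the identity I(X;Y) = H(X) - H(X|Y) can be estimated by Monte Carlo sampling as the average of log p(x|y) - log p(x) over draws from the joint distribution. The sketch below does this for one-dimensional Gaussian classes. It assumes natural logarithms (nats), and the names normal_pdf and monte_carlo_mi are hypothetical; it is not the library's implementation.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def monte_carlo_mi(means, stds, priors, n_samples=20000, seed=0):
    """Estimate I(X;Y) in nats as the mean of log p(x|y) - log p(x)."""
    rng = random.Random(seed)
    classes = list(range(len(means)))
    total = 0.0
    for _ in range(n_samples):
        # Draw (y, x) from the joint distribution p(y) p(x|y).
        y = rng.choices(classes, weights=priors)[0]
        x = rng.gauss(means[y], stds[y])
        p_x_given_y = normal_pdf(x, means[y], stds[y])
        # Marginal p(x) = sum_k p(y = k) p(x | y = k).
        p_x = sum(priors[k] * normal_pdf(x, means[k], stds[k]) for k in classes)
        total += math.log(p_x_given_y / p_x)
    return total / n_samples

mi_far = monte_carlo_mi([0.0, 10.0], [1.0, 1.0], [0.5, 0.5])
mi_same = monte_carlo_mi([0.0, 0.0], [1.0, 1.0], [0.5, 0.5])
```

Well-separated classes give an estimate close to H(Y) = log 2 nats, while identical class distributions give zero, which matches the idea of controlling class overlap through the distance between class distributions.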
- bayes_predictor_pc_softmax_mi()[source]¶
Calculate the mutual information (MI) using class probabilities derived from the PDF of a class label given the input data X, applying both the Softmax and PC-Softmax functions.
\[I(X;Y) = H(Y) - H(Y|X)\]
Softmax Function:
\[S(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}\]
where:
\(z_k\) is the logit or raw score for class \(k\).
\(K\) is the total number of classes.
PC-Softmax Function:
\[S_{pc}(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j} \cdot p_j}\]
where:
\(z_k\) is the logit or raw score for class \(k\).
\(p_j = \frac{\text{counts}_j}{\text{total samples}}\) is the prior probability of class \(j\).
- Returns:
softmax_emi (float) – Estimated softmax mutual information.
pc_softmax_emi (float) – Estimated PC-softmax mutual information.
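The two functions defined above can be written out directly. The following is a minimal sketch of the formulas only, with hypothetical function names; how the library turns these values into MI estimates is not shown here.

```python
import math

def softmax(z):
    """S(z_k) = exp(z_k) / sum_j exp(z_j)."""
    m = max(z)  # subtracting the maximum avoids overflow; the ratio is unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def pc_softmax(z, priors):
    """S_pc(z_k) = exp(z_k) / sum_j exp(z_j) * p_j."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    weighted = sum(e * p for e, p in zip(exps, priors))
    return [e / weighted for e in exps]

logits = [2.0, 1.0, 0.1]
priors = [0.5, 0.3, 0.2]
s = softmax(logits)
s_pc = pc_softmax(logits, priors)
s_pc_uniform = pc_softmax(logits, [1.0 / 3.0] * 3)
```

Note that the PC-Softmax outputs do not sum to one in general; instead, their prior-weighted sum does. With uniform priors, each PC-Softmax value is K times the corresponding Softmax value.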
- calculate_mi()[source]¶
Calculate the mutual information (MI) from the probability distribution function, according to the formula below.
\[I(X;Y) = H(X) - H(X|Y)\]
- Returns:
mutual_information – The mutual information of the dataset.
- Return type:
float
- entropy_y(y)[source]¶
Calculate the entropy of the class distribution in the dataset.
- Parameters:
y (array-like of shape (n_samples,)) – The labels of the dataset.
- Returns:
mi_pp – The entropy of the class distribution.
- Return type:
float
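As an illustration, the entropy of a label vector can be computed from its empirical class frequencies. The sketch below is an assumption-based example: the function name class_entropy is hypothetical and the logarithm base (2, giving bits) is chosen for illustration; the library may use a different base.

```python
import math
from collections import Counter

def class_entropy(y, base=2.0):
    """H(Y) = -sum_k p_k log(p_k), with p_k the empirical class frequency."""
    counts = Counter(y)
    n = len(y)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

h_balanced = class_entropy([0, 0, 1, 1])
h_imbalanced = class_entropy([0, 0, 0, 1])
h_single = class_entropy([1, 1, 1])
```

A balanced binary label vector has the maximum entropy of one bit, an imbalanced one has less, and a single-class vector has zero entropy.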
- generate_dataset()[source]¶
Generate the full synthetic dataset.
- Returns:
X (array-like of shape (n_samples, n_features)) – Feature matrix after applying sampling to create imbalance.
y (array-like of shape (n_samples,)) – Target vector after applying sampling to create imbalance.
- generate_samples_for_class(k_class)[source]¶
Generate synthetic samples for a specific class.
- Parameters:
k_class (int) – The class label for which to generate samples.
- Returns:
data (array-like) – A tuple containing the generated features.
labels (array-like) – A list of labels corresponding to the features.
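Since each class is described by a mean vector and a covariance matrix, per-class sampling can be sketched as drawing from a Gaussian. The example below does this in two dimensions using a hand-written Cholesky factor. The function name sample_class_2d, the 2-D restriction, and the requirement that the covariance be positive definite are assumptions for illustration, not the library's implementation.

```python
import math
import random

def sample_class_2d(mean, cov, n_samples, label, seed=0):
    """Draw n_samples points from N(mean, cov) and label them all `label`."""
    rng = random.Random(seed)
    # Cholesky factor L of the 2x2 covariance, so that L L^T = cov.
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 * l21)
    data = []
    for _ in range(n_samples):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # x = mean + L z transforms standard normal noise to N(mean, cov).
        data.append((mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2))
    labels = [label] * n_samples
    return data, labels

data, labels = sample_class_2d([3.0, -1.0], [[1.0, 0.5], [0.5, 2.0]], 500, label=1)
mean_x0 = sum(p[0] for p in data) / len(data)
mean_x1 = sum(p[1] for p in data) / len(data)
```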
- get_bayes_mi(metric_name='MCMCLogLossBayesMI')[source]¶
Get the estimated mutual information based on the specified metric.
- Parameters:
metric_name ({MCMCBayesMI, MCMCLogLossBayesMI, MCMCPCSoftmaxBayesMI, MCMCSoftmaxBayesMI}, default=`MCMCLogLossBayesMI`) –
The name of the metric to use for MI estimation. Must be one of:
MCMCLogLossBayesMI: Estimate mutual information using the log loss of the Bayes predictor.
MCMCBayesMI: Estimate mutual information using the marginal of the inputs and the conditionals of the inputs given the class labels.
MCMCPCSoftmaxBayesMI: Estimate mutual information using the MCMC PC Softmax Bayes method.
MCMCSoftmaxBayesMI: Estimate mutual information using the MCMC Softmax Bayes method.
- Returns:
mutual_information – The estimated mutual information based on the selected metric.
- Return type:
float
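One way to read the log-loss variant is through the identity I(X;Y) = H(Y) - H(Y|X), where H(Y|X) is approximated by the average log loss of the Bayes predictor on sampled data. The sketch below follows that reading. It is an assumption about the idea behind the metric, uses natural logarithms, and the name log_loss_mi is hypothetical.

```python
import math

def log_loss_mi(posteriors, labels, priors):
    """Estimate I(X;Y) = H(Y) - H(Y|X) in nats.

    posteriors[i][k] is the predictor's p(y = k | x_i); labels[i] is the true class.
    """
    h_y = -sum(p * math.log(p) for p in priors if p > 0)
    # The average negative log-likelihood of the true labels approximates H(Y|X).
    log_loss = -sum(math.log(post[y]) for post, y in zip(posteriors, labels)) / len(labels)
    return h_y - log_loss

labels = [0, 1, 0, 1]
priors = [0.5, 0.5]
mi_uninformative = log_loss_mi([[0.5, 0.5]] * 4, labels, priors)
confident = [[0.999, 0.001], [0.001, 0.999], [0.999, 0.001], [0.001, 0.999]]
mi_confident = log_loss_mi(confident, labels, priors)
```

A predictor that only returns the priors yields zero estimated MI, while a nearly perfect predictor approaches H(Y) = log 2 nats.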
- get_prob_dist_x_given_y(k_class)[source]¶
Get the multivariate normal distribution for a given class.
- Parameters:
k_class (int) – The class label for which to get the distribution.
- Returns:
The multivariate normal distribution for the given class.
- Return type:
scipy.stats._multivariate.multivariate_normal_frozen
- get_prob_flip_y_given_x(X, class_label)[source]¶
Get the probability of a flipped class label given the input data X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data.
class_label (int) – The class label for which to compute the probability.
- Returns:
prob_y_given_x – The probability of a flipped class label given the input data X.
- Return type:
array-like
- get_prob_fn_margx()[source]¶
Get the marginal probability distribution function for the input data.
- Returns:
marg_x – A function that computes the marginal probability for the input data.
- Return type:
function
- get_prob_x_given_flip_y(X, class_label)[source]¶
Get the probability of the input data X given a flipped class label.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data.
class_label (int) – The flipped class label for which to compute the probability.
- Returns:
prob_x_given_flip_y – The probability of the input data X given a flipped class label.
- Return type:
array-like
- get_prob_x_given_y(X, class_label)[source]¶
Get the probability of X given a specific class label.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data.
class_label (int) – The class label for which to compute the probability.
- Returns:
prob_x_given_y – The probability of X given the class label.
- Return type:
array-like
- get_prob_y_given_x(X, class_label)[source]¶
Get the probability of a class label given the input data X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data.
class_label (int) – The class label for which to compute the probability.
- Returns:
prob_y_given_x – The probability of the class label given the input data X.
- Return type:
array-like
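The class posterior follows from Bayes' rule, p(y = k | x) = p(x | y = k) p(y = k) / sum_j p(x | y = j) p(y = j). The sketch below applies it to one-dimensional Gaussian classes. The function names and the 1-D setting are assumptions for illustration; the library works with multivariate normal distributions.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior_y_given_x(x, class_label, means, stds, priors):
    """Bayes' rule: the joint for one class divided by the marginal p(x)."""
    joint = [priors[k] * normal_pdf(x, means[k], stds[k]) for k in range(len(means))]
    return joint[class_label] / sum(joint)

means, stds, priors = [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]
p_mid = posterior_y_given_x(1.0, 0, means, stds, priors)
p0 = posterior_y_given_x(0.2, 0, means, stds, priors)
p1 = posterior_y_given_x(0.2, 1, means, stds, priors)
```

At the midpoint between two equally likely classes with equal variance the posterior is one half, and the posteriors over all classes sum to one for any input.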