autoqild.dataset_readers.utils

Provides utility functions for dataset handling, operations, and preprocessing.

Module Attributes

`GEN_TYPES`	List of supported generation types for class imbalance (`single`, `multiple`).
`FACTOR`	A constant factor used for scaling or other operations.
`LABEL_COL`	Default label column name used in datasets.

Functions

`clean_class_label`(string)	Clean and format a class label string.
`generate_samples_per_class`(n_classes[, ...])	Generate the number of samples per class with a specified imbalance.
`pdf`(dist, x)	Compute the probability density function (PDF) for the given distribution and input data.

autoqild.dataset_readers.utils.clean_class_label(string)

Clean and format a class label string.

This function replaces underscores with spaces, capitalizes each word, and removes extra spaces so that labels are readable and consistently formatted.

Parameters: string (str) – The input class label string to be cleaned and formatted.
Returns: The cleaned and formatted class label string.
Return type: str

Example

>>> clean_class_label("class_label_example")
'Class Label Example'

Notes

This function is useful for formatting class labels in a readable way, especially when they are generated automatically or retrieved from a source where they are not human-readable.
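
The behaviour described above can be approximated in a few lines. This is a minimal sketch, not the library's source: the name `clean_class_label_sketch` is hypothetical, and the use of `str.capitalize` (which also lowercases the rest of each word) is an assumption about how "capitalizing each word" is done.

```python
def clean_class_label_sketch(string):
    """Hypothetical re-creation of the documented behaviour."""
    # Replace underscores with spaces, then split on whitespace,
    # which also discards leading, trailing, and repeated spaces.
    words = string.replace("_", " ").split()
    # Capitalize each word and join with single spaces.
    return " ".join(word.capitalize() for word in words)


print(clean_class_label_sketch("class_label_example"))   # Class Label Example
print(clean_class_label_sketch("  multi__word   label "))  # Multi Word Label
```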

autoqild.dataset_readers.utils.generate_samples_per_class(n_classes, samples=1000, imbalance=0.05, gen_type='single', logger=None, verbose=1)

Generate the number of samples per class with a specified imbalance.

This function calculates the number of samples for each class from the provided imbalance ratio and generation type. It supports both binary and multi-class scenarios, and the user can choose whether the imbalance is applied to a single class or distributed across multiple classes.

Parameters:

n_classes (int) – The number of classes in the dataset.
samples (int, default=1000) – The total number of samples across all classes.
imbalance (float, default=0.05) – The proportion of samples in the minority class (or classes if gen_type is “multiple”). The value must be less than or equal to 1/n_classes.
gen_type (str, default="single") – The type of imbalance generation:
    - “single”: Imbalance is applied to one class.
    - “multiple”: Imbalance is distributed across multiple classes.
logger (logging.Logger, optional) – Logger object for logging output. If None, a default logger is created.
verbose (int, default=1) – Verbosity level. If 1, logging information is displayed.

Returns:

samples_per_class – A dictionary where the keys are class labels (as strings) and the values are the number of samples for each class.

Return type:

dict

Raises:

ValueError – If the imbalance ratio is greater than 1/n_classes or if the generation type is not recognized.
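
To make the parameters concrete, here is a rough sketch of one way such counts could be derived. It is not the autoqild implementation. The function name `sketch_samples_per_class` is hypothetical, and several choices are assumptions: `samples` is treated as the grand total (as the parameter description states), class "0" is the minority class for "single", every class except the last is a minority class for "multiple", and the remaining samples are split evenly among the majority classes.

```python
GEN_TYPES = ["single", "multiple"]  # mirrors the documented module attribute


def sketch_samples_per_class(n_classes, samples=1000, imbalance=0.05, gen_type="single"):
    """Hypothetical sketch; the real function may allocate counts differently."""
    if n_classes < 2:
        raise ValueError("at least two classes are needed for an imbalance")
    if imbalance > 1 / n_classes:
        raise ValueError(f"imbalance {imbalance} exceeds the uniform share {1 / n_classes}")
    if gen_type not in GEN_TYPES:
        raise ValueError(f"gen_type must be one of {GEN_TYPES}, got {gen_type!r}")

    minority = int(round(samples * imbalance))  # size of each minority class
    # Assumption: "single" -> one minority class; "multiple" -> all but the last.
    n_minority = 1 if gen_type == "single" else n_classes - 1
    n_majority = n_classes - n_minority

    counts = {str(k): minority for k in range(n_minority)}
    remaining = samples - minority * n_minority
    base, extra = divmod(remaining, n_majority)
    for i in range(n_majority):
        # Spread any remainder over the first few majority classes.
        counts[str(n_minority + i)] = base + (1 if i < extra else 0)
    return counts


print(sketch_samples_per_class(3))
# {'0': 50, '1': 475, '2': 475}
print(sketch_samples_per_class(4, imbalance=0.1, gen_type="multiple"))
# {'0': 100, '1': 100, '2': 100, '3': 700}
```

Under these assumptions the counts always sum to `samples`, and the `1/n_classes` bound guarantees the majority share stays positive.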

autoqild.dataset_readers.utils.pdf(dist, x)

Compute the probability density function (PDF) for the given distribution and input data.

Parameters:

dist (scipy.stats._multivariate.multivariate_normal_frozen) – The multivariate normal distribution object.
x (array-like of shape (n_samples, n_features)) – Input data for which the PDF is computed.

Returns:

log_dist_samples – Probability density values for the input data.

Return type:

array-like
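
As an illustration of the inputs and output shape, the sketch below evaluates a frozen SciPy multivariate normal at two points. The helper `pdf_sketch` is a hypothetical stand-in, and it assumes the function returns the plain density via the frozen distribution's `pdf` method, as the description states. The returned name `log_dist_samples` may indicate otherwise, so check the source before relying on this.

```python
import numpy as np
from scipy.stats import multivariate_normal


def pdf_sketch(dist, x):
    """Hypothetical stand-in: density of `dist` at each row of `x`."""
    return dist.pdf(np.asarray(x))


# Standard 2-D normal; each row of x is one sample with two features.
dist = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
x = np.array([[0.0, 0.0], [1.0, 0.0]])
values = pdf_sketch(dist, x)

print(values.shape)  # (2,)
# At the origin the density equals 1 / (2 * pi).
print(np.isclose(values[0], 1 / (2 * np.pi)))  # True
```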

autoqild.dataset_readers.utils.FACTOR = 1.5

A constant factor used for scaling or other operations.

autoqild.dataset_readers.utils.GEN_TYPES = ['single', 'multiple']

List of supported generation types for class imbalance:

single: Imbalance is applied to one class.
multiple: Imbalance is distributed across multiple classes.

autoqild.dataset_readers.utils.LABEL_COL = 'label'

Default label column name used in datasets.