autoqild.dataset_readers.utils
Provides utility functions for dataset handling, operations, and preprocessing.
Module Attributes
GEN_TYPES | List of supported generation types for class imbalance.
FACTOR | A constant factor used for scaling or other operations.
LABEL_COL | Default label column name used in datasets.
Functions
clean_class_label(string) | Clean and format a class label string.
generate_samples_per_class(n_classes[, ...]) | Generate the number of samples per class with a specified imbalance.
pdf(dist, x) | Compute the probability density function (PDF) for the given distribution and input data.
- autoqild.dataset_readers.utils.clean_class_label(string)
Clean and format a class label string.
This function replaces underscores with spaces, capitalizes each word, and removes extra spaces so that the label is readable and consistently formatted.
- Parameters:
string (str) – The input class label string to be cleaned and formatted.
- Returns:
The cleaned and formatted class label string.
- Return type:
str
Example
>>> clean_class_label("class_label_example")
'Class Label Example'
Notes
This function is useful for formatting class labels in a readable way, especially when they are generated automatically or retrieved from a source where they are not human-readable.
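The three steps described above (underscores to spaces, capitalize each word, collapse extra spaces) can be sketched in a few lines. This is a minimal illustration, not the library's source: the name `clean_label_sketch` is hypothetical, and it assumes "capitalizing" means `str.capitalize`, which also lowercases the rest of each word.

```python
def clean_label_sketch(string: str) -> str:
    """Hypothetical stand-in for clean_class_label; not the library's code."""
    # Step 1: underscores become spaces.
    spaced = string.replace("_", " ")
    # Step 2: split() with no arguments drops leading, trailing,
    # and repeated whitespace, which removes the extra spaces.
    words = spaced.split()
    # Step 3: capitalize each word and rejoin with single spaces.
    # Assumption: str.capitalize is the intended capitalization rule.
    return " ".join(word.capitalize() for word in words)


result = clean_label_sketch("class_label_example")
messy = clean_label_sketch("  my__label ")
```

Under these assumptions, `result` matches the documented example output, and `messy` shows that doubled underscores and surrounding spaces collapse to single separators.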
- autoqild.dataset_readers.utils.generate_samples_per_class(n_classes, samples=1000, imbalance=0.05, gen_type='single', logger=None, verbose=1)
Generate the number of samples per class with a specified imbalance.
This function calculates the number of samples for each class from the provided imbalance ratio and generation type. It supports both binary and multi-class scenarios, and lets the user choose whether the imbalance is applied to a single class or distributed across multiple classes.
- Parameters:
n_classes (int) – The number of classes in the dataset.
samples (int, default=1000) – The total number of samples across all classes.
imbalance (float, default=0.05) – The proportion of samples in the minority class (or classes if gen_type is “multiple”). The value must be less than or equal to 1/n_classes.
gen_type (str, default="single") – The type of imbalance generation:
    “single”: Imbalance is applied to one class.
    “multiple”: Imbalance is distributed across multiple classes.
logger (logging.Logger, optional) – Logger object for logging output. If None, a default logger is created.
verbose (int, default=1) – Verbosity level. If 1, logging information is displayed.
- Returns:
samples_per_class – A dictionary where the keys are class labels (as strings) and the values are the number of samples for each class.
- Return type:
dict
- Raises:
ValueError – If the imbalance ratio is greater than 1/n_classes or if the generation type is not recognized.
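The documented contract (validation, string keys, a minority share of `imbalance`) can be sketched as follows. This is one plausible reading, not the library's implementation. The function name `samples_per_class_sketch` is hypothetical, and so are these choices: in "single" mode class "0" is the minority; in "multiple" mode every class except the last is a minority; and the remaining samples are split evenly over the other classes. The `logger` and `verbose` parameters are omitted.

```python
def samples_per_class_sketch(n_classes, samples=1000, imbalance=0.05,
                             gen_type="single"):
    """Hypothetical sketch of the documented behaviour; not the library's code."""
    if n_classes < 2:
        # Assumption: at least one minority and one majority class are needed.
        raise ValueError("n_classes must be at least 2")
    if gen_type not in ("single", "multiple"):
        raise ValueError(f"Unknown generation type: {gen_type!r}")
    if imbalance > 1 / n_classes:
        raise ValueError("imbalance must be <= 1 / n_classes")

    # Each minority class receives this share of the total.
    minority = int(round(imbalance * samples))
    # Assumption: which classes count as minority in each mode.
    n_minority = 1 if gen_type == "single" else n_classes - 1
    n_majority = n_classes - n_minority

    # Spread whatever is left evenly over the majority classes.
    remaining = samples - minority * n_minority
    base, extra = divmod(remaining, n_majority)

    counts = {}
    for k in range(n_classes):
        if k < n_minority:
            counts[str(k)] = minority
        else:
            # The first `extra` majority classes absorb the remainder.
            counts[str(k)] = base + (1 if k - n_minority < extra else 0)
    return counts


single = samples_per_class_sketch(3, samples=1000, imbalance=0.05)
multiple = samples_per_class_sketch(3, samples=1000, imbalance=0.05,
                                    gen_type="multiple")
```

Under these assumptions, `single` is `{"0": 50, "1": 475, "2": 475}` and `multiple` is `{"0": 50, "1": 50, "2": 900}`; both sum to the requested total of 1000.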
- autoqild.dataset_readers.utils.pdf(dist, x)
Compute the probability density function (PDF) for the given distribution and input data.
- Parameters:
dist (scipy.stats._multivariate.multivariate_normal_frozen) – The multivariate normal distribution object.
x (array-like of shape (n_samples, n_features)) – Input data for which the PDF is computed.
- Returns:
log_dist_samples – Probability density values for the input data.
- Return type:
array-like
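A minimal sketch of evaluating a frozen SciPy multivariate normal on a batch of points may help here. The wrapper name `pdf_sketch` is hypothetical and is not the library's function; it assumes the result is the plain density from the frozen distribution's `pdf` method, as the description says. The return name `log_dist_samples` might suggest log-densities, which this page does not confirm; SciPy offers `logpdf` for that case.

```python
import math

import numpy as np
from scipy.stats import multivariate_normal


def pdf_sketch(dist, x):
    """Hypothetical stand-in for pdf(dist, x); not the library's code."""
    # A frozen multivariate normal evaluates one density per row of x.
    return dist.pdf(np.asarray(x, dtype=float))


# Standard 2-D normal: zero mean, identity covariance.
dist = multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))
x = np.array([[0.0, 0.0], [1.0, 1.0]])  # shape (n_samples, n_features)

values = pdf_sketch(dist, x)
log_values = dist.logpdf(x)  # log-density variant, for comparison
```

For this distribution the density at the origin is 1 / (2π), and `np.exp(log_values)` agrees with `values`.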
- autoqild.dataset_readers.utils.FACTOR = 1.5
A constant factor used for scaling or other operations.
- autoqild.dataset_readers.utils.GEN_TYPES = ['single', 'multiple']
List of supported generation types for class imbalance:
single: Imbalance is applied to one class.
multiple: Imbalance is distributed across multiple classes.
- autoqild.dataset_readers.utils.LABEL_COL = 'label'
Default label column name used in datasets.