autoqild.dataset_readers.utils

Provides utility functions for dataset handling, operations, and preprocessing.

Module Attributes

GEN_TYPES

List of supported generation types for class imbalance.

FACTOR

A constant factor used for scaling or other operations.

LABEL_COL

Default label column name used in datasets.

Functions

clean_class_label(string)

Clean and format a class label string.

generate_samples_per_class(n_classes[, ...])

Generate the number of samples per class with a specified imbalance.

pdf(dist, x)

Compute the probability density function (PDF) for the given distribution and input data.

autoqild.dataset_readers.utils.clean_class_label(string)[source]

Clean and format a class label string.

This function replaces underscores with spaces, capitalizes each word, and removes extra spaces so that the label is readable and consistently formatted.

Parameters:

string (str) – The input class label string to be cleaned and formatted.

Returns:

The cleaned and formatted class label string.

Return type:

str

Example

>>> clean_class_label("class_label_example")
'Class Label Example'

Notes

This function is useful for formatting class labels in a readable way, especially when they are generated automatically or retrieved from a source where they are not human-readable.
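
The three steps described above (underscores to spaces, capitalizing each word, collapsing extra spaces) can be sketched as follows. This is a minimal illustration, not the library's source: the name `sketch_clean_class_label` is hypothetical, and the use of `str.capitalize` (which also lowercases the rest of each word) is an assumption about what "capitalizing" means here.

```python
def sketch_clean_class_label(string):
    """Hypothetical re-implementation of the documented behaviour."""
    # Replace underscores with spaces, then split on any whitespace run;
    # split() with no argument also drops leading/trailing/repeated spaces.
    words = string.replace("_", " ").split()
    # Assumption: "capitalizing" means first letter upper, rest lower.
    return " ".join(word.capitalize() for word in words)


print(sketch_clean_class_label("class_label_example"))
```

If the real function preserves existing upper-case letters inside a word, its output will differ from this sketch for inputs such as mixed-case acronyms.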

autoqild.dataset_readers.utils.generate_samples_per_class(n_classes, samples=1000, imbalance=0.05, gen_type='single', logger=None, verbose=1)[source]

Generate the number of samples per class with a specified imbalance.

This function calculates the number of samples for each class from the provided imbalance ratio and generation type. It supports both binary and multi-class scenarios, and lets the user choose whether the imbalance is applied to a single class or distributed across multiple classes.

Parameters:
  • n_classes (int) – The number of classes in the dataset.

  • samples (int, default=1000) – The total number of samples across all classes.

  • imbalance (float, default=0.05) – The proportion of samples in the minority class (or classes if gen_type is “multiple”). The value must be less than or equal to 1/n_classes.

  • gen_type (str, default="single") – The type of imbalance generation:

      • “single”: Imbalance is applied to one class.

      • “multiple”: Imbalance is distributed across multiple classes.

  • logger (logging.Logger, optional) – Logger object for logging output. If None, a default logger is created.

  • verbose (int, default=1) – Verbosity level. If 1, logging information is displayed.

Returns:

samples_per_class – A dictionary where the keys are class labels (as strings) and the values are the number of samples for each class.

Return type:

dict

Raises:

ValueError – If the imbalance ratio is greater than 1/n_classes or if the generation type is not recognized.
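
To make the parameters concrete, here is one way such an allocation could work. It is a sketch under stated assumptions, not the library's implementation: the name `sketch_samples_per_class` is hypothetical; it assumes that "single" gives one class `imbalance * samples` samples and splits the remainder evenly over the others, that "multiple" gives every class except one `imbalance * samples` samples, that minority classes take the highest indices, and that at least two classes are required. Only the two `ValueError` conditions on `imbalance` and `gen_type` and the string keys come from the description above.

```python
GEN_TYPES = ["single", "multiple"]


def sketch_samples_per_class(n_classes, samples=1000, imbalance=0.05,
                             gen_type="single"):
    """Hypothetical allocation of per-class sample counts."""
    if gen_type not in GEN_TYPES:
        raise ValueError(f"Unknown generation type: {gen_type!r}")
    if n_classes < 2:
        # Assumption: the documentation does not cover this case.
        raise ValueError("At least two classes are needed in this sketch")
    if imbalance > 1 / n_classes:
        raise ValueError("imbalance must be <= 1 / n_classes")

    minority = int(round(samples * imbalance))
    # Assumption: 'single' -> one minority class,
    # 'multiple' -> all classes but one are minority classes.
    n_minority = 1 if gen_type == "single" else n_classes - 1
    n_majority = n_classes - n_minority

    # Share the remaining samples as evenly as possible.
    remaining = samples - minority * n_minority
    base, extra = divmod(remaining, n_majority)

    counts = {}
    for k in range(n_classes):
        if k < n_majority:
            counts[str(k)] = base + (1 if k < extra else 0)
        else:
            counts[str(k)] = minority
    return counts


print(sketch_samples_per_class(2))
print(sketch_samples_per_class(4, imbalance=0.1, gen_type="multiple"))
```

Under these assumptions the counts always sum to `samples`; the real function may order classes or round differently.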

autoqild.dataset_readers.utils.pdf(dist, x)[source]

Compute the probability density function (PDF) for the given distribution and input data.

Parameters:
  • dist (scipy.stats._multivariate.multivariate_normal_frozen) – The multivariate normal distribution object.

  • x (array-like of shape (n_samples, n_features)) – Input data for which the PDF is computed.

Returns:

log_dist_samples – Probability density values for the input data.

Return type:

array-like
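
For reference, the density of a multivariate normal distribution, which is the quantity a frozen `multivariate_normal` object evaluates, has the standard form below. The symbols are notation introduced here and do not appear in the entry above: μ is the mean vector, Σ the covariance matrix, and d the number of features (`n_features`). Whether `pdf` returns this value directly or a transformation of it is not stated beyond the Returns description.

```latex
p(x) = (2\pi)^{-d/2} \, |\Sigma|^{-1/2}
       \exp\!\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
```

Evaluated row by row on an input of shape (n_samples, n_features), this gives one density value per sample.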

autoqild.dataset_readers.utils.FACTOR = 1.5

A constant factor used for scaling or other operations.

autoqild.dataset_readers.utils.GEN_TYPES = ['single', 'multiple']

List of supported generation types for class imbalance:

  • single: Imbalance is applied to one class.

  • multiple: Imbalance is distributed across multiple classes.

autoqild.dataset_readers.utils.LABEL_COL = 'label'

Default label column name used in datasets.