stouputils.data_science.dataset.dataset module

This module contains the Dataset class, which provides an easy way to handle ML datasets.

The Dataset class has the following attributes:

  • training_data (XyTuple): Training data containing features, labels and file paths

  • val_data (XyTuple): Validation data, in the same form as the training data

  • test_data (XyTuple): Test data containing features, labels and file paths

  • num_classes (int): Number of classes in the dataset

  • name (str): Name of the dataset

  • grouping_strategy (GroupingStrategy): Strategy for grouping images when loading

  • labels (tuple[str, ...]): Class labels (strings)

  • loading_type (Literal['image']): Type of the dataset (currently only 'image' is supported)

  • original_dataset (Dataset | None): Original dataset used for data augmentation

  • class_distribution (dict[str, dict[int, int]]): Per-class sample counts for the train/test sets

It provides methods for:

  • Loading image datasets from directories using different grouping strategies

  • Splitting data into train/test sets with stratification, keeping augmented versions of held-out images out of the training set

  • Managing class distributions and dataset metadata
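As a rough illustration of what a stratified split does, the sketch below splits each class separately so that both parts keep roughly the same class proportions. It is a standalone example, not this module's implementation; `stratified_split` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict


def stratified_split(labels: list[int], test_size: float = 0.2, seed: int = 0) -> tuple[list[int], list[int]]:
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class: dict[int, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: list[int] = []
    test: list[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test sample per class, unless the class has a single sample
        n_test = max(1, round(len(indices) * test_size)) if len(indices) > 1 else 0
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = [0] * 8 + [1] * 4
train_idx, test_idx = stratified_split(labels, test_size=0.25)
```

With 8 samples of class 0 and 4 of class 1, a quarter of each class ends up in the test part, so the 2:1 class ratio is preserved on both sides.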

DEFAULT_IMAGE_KWARGS: dict[str, Any] = {'batch_size': 1, 'color_mode': 'rgb', 'image_size': (224, 224), 'label_mode': 'categorical'}

Default keyword arguments passed to keras.utils.image_dataset_from_directory
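A common way to use such a defaults dictionary is to merge caller-supplied overrides on top of it before calling the Keras loader. The sketch below shows that pattern only. Whether this module merges its defaults this way is an assumption, and `build_image_kwargs` is a hypothetical helper.

```python
from typing import Any

# Values copied from the documented DEFAULT_IMAGE_KWARGS
DEFAULT_IMAGE_KWARGS: dict[str, Any] = {
    "batch_size": 1,
    "color_mode": "rgb",
    "image_size": (224, 224),
    "label_mode": "categorical",
}


def build_image_kwargs(**overrides: Any) -> dict[str, Any]:
    """Hypothetical helper: start from the defaults, let caller overrides win."""
    # A new dict is built, so the module-level defaults are never mutated
    return {**DEFAULT_IMAGE_KWARGS, **overrides}


kwargs = build_image_kwargs(color_mode="grayscale", image_size=(128, 128))
```

Keys that are not overridden, such as `batch_size`, keep their default values.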

class Dataset(training_data: XyTuple | list[Any], val_data: XyTuple | list[Any] | None = None, test_data: XyTuple | list[Any] | None = None, name: str = '', grouping_strategy: GroupingStrategy = GroupingStrategy.NONE, labels: tuple[str, ...] = (), loading_type: Literal['image'] = 'image')

Bases: object

Dataset class used for easy data handling.

_training_data: XyTuple

Training data as XyTuple containing X and y as NumPy arrays. This is a protected attribute accessed via the public property self.training_data.

_val_data: XyTuple

Validation data as XyTuple containing X and y as NumPy arrays. This is a protected attribute accessed via the public property self.val_data.

_test_data: XyTuple

Test data as XyTuple containing X and y as NumPy arrays. This is a protected attribute accessed via the public property self.test_data.

num_classes: int

Number of classes in the dataset (y)

name: str

Name of the dataset (a path given in the constructor is reduced to its last component, e.g. “…/data/pizza_not_pizza” becomes “pizza_not_pizza”)
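The path-to-name conversion can be sketched with the standard library. This is an assumption about the behaviour based on the example above, not the class's actual code. `dataset_name_from_path` is a hypothetical helper and the path used is made up.

```python
import os


def dataset_name_from_path(path: str) -> str:
    """Hypothetical helper: keep only the last path component."""
    # normpath drops any trailing separator, so "a/b/" and "a/b" give the same name
    return os.path.basename(os.path.normpath(path))


name = dataset_name_from_path("some_project/data/pizza_not_pizza/")
```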

loading_type: Literal['image']

Type of the dataset

grouping_strategy: GroupingStrategy

Grouping strategy for the dataset

labels: tuple[str, ...]

Tuple of class labels (strings)

class_distribution: dict[str, dict[int, int]]

Class distribution in the dataset for both training and test sets

original_dataset: Dataset | None

Original dataset used for data augmentation (can be None)

_get_num_classes(*values: Any) → int

Get the number of classes in the dataset.

Parameters:

values (NDArray[Any]) – Arrays containing class labels

Returns:

Number of unique classes

Return type:

int
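Counting unique classes across several label collections can be sketched as follows. This is a plain-Python stand-in that does not use NumPy; `count_classes` and `to_class_index` are hypothetical names. It assumes labels are either integer class indices or one-hot rows, which is what Keras produces with label_mode 'categorical'. How the real method treats each encoding is not specified here.

```python
from typing import Any


def to_class_index(label: Any) -> int:
    """Map a one-hot row such as [0, 1, 0] to 1; return integer labels as-is."""
    if isinstance(label, (list, tuple)):
        return max(range(len(label)), key=lambda i: label[i])
    return int(label)


def count_classes(*values: Any) -> int:
    """Hypothetical stand-in: number of distinct classes across all inputs."""
    unique: set[int] = set()
    for labels in values:
        unique.update(to_class_index(label) for label in labels)
    return len(unique)


n_from_indices = count_classes([0, 1, 1, 2], [2, 3])
n_from_one_hot = count_classes([[1, 0, 0], [0, 1, 0]], [[0, 0, 1]])
```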

_update_class_distribution(update_num_classes: bool = False) → None

Update the class distribution dictionary for both training and test data.
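A dictionary of the documented shape, dict[str, dict[int, int]], can be built by counting labels per split. The sketch below is illustrative only. The "train" and "test" keys and the `class_distribution` function name are assumptions, and labels are taken to be integer class indices.

```python
from collections import Counter


def class_distribution(train_labels: list[int], test_labels: list[int]) -> dict[str, dict[int, int]]:
    """Map each split name to {class index: number of samples}."""
    return {
        "train": dict(Counter(train_labels)),
        "test": dict(Counter(test_labels)),
    }


dist = class_distribution([0, 0, 1, 2, 2, 2], [0, 1, 1])
```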

exclude_augmented_images_from_val_test(original_dataset: Dataset) → None

Exclude augmented versions of validation and test images from the training set.

This ensures that augmented versions of images in the validation and test sets are not present in the training set, which would cause data leakage.

Parameters:

original_dataset (Dataset) – The original dataset containing the test images to exclude
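The idea of leakage-safe filtering can be sketched on file paths alone. The naming convention below (an "_aug_<n>" suffix before the extension) and both helper names are hypothetical. The library's real matching logic may differ.

```python
import os
import re

# Hypothetical convention: "cat_01.png" is augmented into "cat_01_aug_0.png", "cat_01_aug_1.png", ...
AUG_SUFFIX = re.compile(r"_aug_\d+$")


def original_stem(path: str) -> str:
    """File name without extension and without the augmentation suffix."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return AUG_SUFFIX.sub("", stem)


def drop_leaking_files(train_paths: list[str], held_out_paths: list[str]) -> list[str]:
    """Keep only training files whose original image is not in the held-out sets."""
    held_out = {original_stem(p) for p in held_out_paths}
    return [p for p in train_paths if original_stem(p) not in held_out]


train = ["a/cat_01.png", "a/cat_01_aug_0.png", "a/cat_02_aug_0.png", "b/dog_07_aug_3.png"]
held_out = ["a/cat_02.png"]
kept = drop_leaking_files(train, held_out)
```

Here "a/cat_02_aug_0.png" is dropped because its original, "cat_02", belongs to the held-out images.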

get_experiment_name(override_name: str = '') → str

Get the experiment name for MLflow, for example “DatasetName_GroupingStrategyName”

Parameters:

override_name (str) – Override the Dataset name

Returns:

Experiment name

Return type:

str
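The naming scheme can be sketched as below. The enum is a stand-in that contains only the NONE member shown in the constructor signature, and `experiment_name` is a hypothetical free function. Falling back to the dataset name when the override is empty is assumed behaviour, as is the exact formatting of the strategy name.

```python
from enum import Enum


class GroupingStrategy(Enum):
    """Stand-in enum; only NONE appears in the documented signature."""
    NONE = 0


def experiment_name(dataset_name: str, strategy: GroupingStrategy, override_name: str = "") -> str:
    # An empty override falls back to the dataset name (assumed behaviour)
    return f"{override_name or dataset_name}_{strategy.name}"


default_name = experiment_name("pizza_not_pizza", GroupingStrategy.NONE)
custom_name = experiment_name("pizza_not_pizza", GroupingStrategy.NONE, override_name="baseline")
```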