stouputils.data_science.dataset.xy_tuple module

This module contains the XyTuple class, which is a specialized tuple subclass for maintaining ML dataset integrity with file tracking.

XyTuple handles grouped data to preserve relationships between files from the same subject. All data is treated as grouped, even single files, for consistency.

File Structure Example:

  • dataset/class1/hello.png

  • dataset/class2/subject1/image1.png

  • dataset/class2/subject1/image2.png

Data Representation:

  1. Grouped Format (as loaded):

     • X: list[list[Any]] = [[image], [image, image], …]

     • y: list[Any] = [class1, class2, …]

     • filepaths: tuple[tuple[str, …], …] = (("hello.png",), ("subject1/image1.png", "subject1/image2.png"), …)

  2. Ungrouped Format (after XyTuple.ungroup()):

     • X: list[Any] = [image, image, image, …]

     • y: list[Any] = [class1, class2, class2, …]

     • filepaths: tuple[str, …] = ("hello.png", "subject1/image1.png", "subject1/image2.png")
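The relationship between the two formats can be sketched in plain Python. This is only an illustration of the idea (each group's label is repeated once per item in the group); the helper name `flatten_groups` is hypothetical and is not part of the library.

```python
from typing import Any


def flatten_groups(
    X: list[list[Any]],
    y: list[Any],
    filepaths: tuple[tuple[str, ...], ...],
) -> tuple[list[Any], list[Any], tuple[str, ...]]:
    """Flatten grouped data: one output row per item, label repeated per group member."""
    flat_X: list[Any] = []
    flat_y: list[Any] = []
    flat_paths: list[str] = []
    for group, label, paths in zip(X, y, filepaths):
        for item, path in zip(group, paths):
            flat_X.append(item)
            flat_y.append(label)  # every item inherits the label of its group
            flat_paths.append(path)
    return flat_X, flat_y, tuple(flat_paths)


# Placeholder strings stand in for image data
grouped_X = [["img_a"], ["img_b", "img_c"]]
grouped_y = ["class1", "class2"]
grouped_paths = (("hello.png",), ("subject1/image1.png", "subject1/image2.png"))

flat_X, flat_y, flat_paths = flatten_groups(grouped_X, grouped_y, grouped_paths)
print(flat_y)  # ['class1', 'class2', 'class2']
```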

Key Features:

  • Preserves subject-level grouping during dataset operations

  • Handles augmented files with automatic original/augmented mapping

  • Supports group-aware dataset splitting

  • Implements stratified k-fold splitting that maintains group integrity

class XyTuple(X: ndarray[Any, dtype[Any]] | list[Any], y: ndarray[Any, dtype[Any]] | list[Any], filepaths: tuple[tuple[str, ...], ...] = ())

Bases: tuple[list[list[Any]], list[Any], tuple[tuple[str, …], …]]

A tuple containing X (features) and y (labels) data with file tracking.

XyTuple handles grouped data to preserve relationships between files from the same subject. All data is treated as grouped, even single files, for consistency.

Examples

>>> data = XyTuple(X=[1, 2, 3], y=[4, 5, 6], filepaths=(("file1.jpg",), ("file2.jpg",), ("file3.jpg",)))
>>> data.X
[[1], [2], [3]]
>>> data.y
[4, 5, 6]
>>> XyTuple(X=[1, 2], y=["a", "b"]).filepaths
()
>>> isinstance(XyTuple(X=[1, 2], y=[3, 4]), tuple)
True
_X: list[list[Any]]

Features data: a list of groups of differently sized numpy arrays. Each group corresponds to one subject, which can have, for instance, multiple images.

This is a protected attribute accessed via the public property self.X.

_y: list[Any]

Labels data: either a numpy array or a list of differently sized numpy arrays.

This is a protected attribute accessed via the public property self.y.

filepaths: tuple[tuple[str, ...], ...]

Tuple of filepath groups corresponding to the features (a single file is a group with one element).

augmented_files: dict[str, str]

Dictionary mapping all files to their original filepath, e.g. {"file1_aug_1.jpg": "file1.jpg"}

property n_samples: int

Number of samples in the dataset (property).

is_empty() → bool

Check if the XyTuple is empty.

update_augmented_files() → dict[str, str]

Create a mapping of all files to their original version. If no filepaths are provided, return an empty dictionary.

Returns:

Dictionary where keys are all files (original and augmented), and values are the corresponding original file

Return type:

dict[str, str]

Examples

>>> xy = XyTuple(X=[1, 2, 3], y=[4, 5, 6], filepaths=(("file1.jpg",), ("file2.jpg",), ("file1_aug_1.jpg",)))
>>> xy.augmented_files
{'file1.jpg': 'file1.jpg', 'file2.jpg': 'file2.jpg', 'file1_aug_1.jpg': 'file1.jpg'}
>>> xy_empty = XyTuple(X=[1, 2], y=[3, 4])
>>> xy_empty.augmented_files
{}
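A minimal sketch of how such a mapping could be built, assuming augmented files follow the `<name>_aug_<n>.<ext>` naming seen in the examples above. The regular expression and the helper names `original_of` and `map_to_originals` are illustrative assumptions, not the library's code.

```python
import re

# Assumption: an augmented file is named "<name>_aug_<n>.<ext>", as in the examples above
AUG_SUFFIX = re.compile(r"_aug_\d+(?=\.[^.]+$)")


def original_of(path: str) -> str:
    """Strip the assumed augmentation suffix; non-augmented paths are returned unchanged."""
    return AUG_SUFFIX.sub("", path)


def map_to_originals(filepaths: tuple[tuple[str, ...], ...]) -> dict[str, str]:
    """Map every file (original or augmented) to its original file."""
    mapping: dict[str, str] = {}
    for group in filepaths:
        for path in group:
            mapping[path] = original_of(path)
    return mapping


mapping = map_to_originals((("file1.jpg",), ("file2.jpg",), ("file1_aug_1.jpg",)))
print(mapping)
# {'file1.jpg': 'file1.jpg', 'file2.jpg': 'file2.jpg', 'file1_aug_1.jpg': 'file1.jpg'}
```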
group_by_original() → tuple[dict[str, list[int]], dict[str, Any]]

Group samples by their original files and collect labels.

Returns:

  • dict[str, list[int]]: Mapping from original files to their sample indices

  • dict[str, Any]: Mapping from original files to their labels

Return type:

tuple[dict[str, list[int]], dict[str, Any]]

Examples

>>> xy = XyTuple(X=[1, 2, 3], y=["a", "b", "c"],
...              filepaths=(("file1.jpg",), ("file2.jpg",), ("file1_aug_2.jpg",)))
>>> indices, labels = xy.group_by_original()
>>> sorted(indices.items())
[('file1.jpg', [0, 2]), ('file2.jpg', [1])]
>>> [(x, str(y)) for x, y in sorted(labels.items())]
[('file1.jpg', 'a'), ('file2.jpg', 'b')]
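The grouping step can be approximated in plain Python as follows. This sketch assumes that each group is identified by its first filepath and that augmented files use a `_aug_<n>` suffix; both are assumptions for illustration, and `group_indices_by_original` is a hypothetical name.

```python
import re
from collections import defaultdict
from typing import Any

# Assumption: augmented files are named "<name>_aug_<n>.<ext>"
AUG_SUFFIX = re.compile(r"_aug_\d+(?=\.[^.]+$)")


def group_indices_by_original(
    filepaths: tuple[tuple[str, ...], ...],
    y: list[Any],
) -> tuple[dict[str, list[int]], dict[str, Any]]:
    """Collect sample indices and one label per original file."""
    indices: dict[str, list[int]] = defaultdict(list)
    labels: dict[str, Any] = {}
    for i, (group, label) in enumerate(zip(filepaths, y)):
        original = AUG_SUFFIX.sub("", group[0])  # assumed: first path identifies the group
        indices[original].append(i)
        labels.setdefault(original, label)  # keep the label of the first occurrence
    return dict(indices), labels


indices, labels = group_indices_by_original(
    (("file1.jpg",), ("file2.jpg",), ("file1_aug_2.jpg",)),
    ["a", "b", "c"],
)
print(indices)  # {'file1.jpg': [0, 2], 'file2.jpg': [1]}
print(labels)   # {'file1.jpg': 'a', 'file2.jpg': 'b'}
```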
get_indices_from_originals(original_to_indices: dict[str, list[int]], originals: tuple[str, ...] | list[str]) → list[int]

Get flattened list of indices for given original files.

Parameters:
  • original_to_indices (dict[str, list[int]]) – Mapping from originals to indices

  • originals (tuple[str, ...] | list[str]) – Original files to get indices for

Returns:

Flattened list of all indices associated with the originals

Return type:

list[int]

Examples

>>> xy = XyTuple(X=[1, 2, 3, 4], y=["a", "b", "c", "d"],
...              filepaths=(("file1.jpg",), ("file2.jpg",), ("file1_aug_1.jpg",), ("file3.jpg",)))
>>> orig_to_idx, _ = xy.group_by_original()
>>> sorted(xy.get_indices_from_originals(orig_to_idx, ["file1.jpg", "file3.jpg"]))
[0, 2, 3]
>>> xy.get_indices_from_originals(orig_to_idx, ["file2.jpg"])
[1]
>>> xy.get_indices_from_originals(orig_to_idx, [])
[]
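Conceptually, this lookup concatenates the index lists of the requested originals. A small stand-alone sketch, with the hypothetical helper name `indices_for` and a hand-written mapping:

```python
from itertools import chain


def indices_for(original_to_indices: dict[str, list[int]], originals: list[str]) -> list[int]:
    """Concatenate the index lists of the requested originals, in the order given."""
    return list(chain.from_iterable(original_to_indices[o] for o in originals))


orig_to_idx = {"file1.jpg": [0, 2], "file2.jpg": [1], "file3.jpg": [3]}
selected = indices_for(orig_to_idx, ["file1.jpg", "file3.jpg"])
print(selected)  # [0, 2, 3]
```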
create_subset(indices: Iterable[int]) → XyTuple

Create a new XyTuple containing only the specified indices.

Parameters:

indices (Iterable[int]) – Indices to include in the subset

Returns:

New instance containing only the specified data points

Return type:

XyTuple

Examples

>>> xy = XyTuple(X=[10, 20, 30, 40], y=["a", "b", "c", "d"],
...              filepaths=(("f1.jpg",), ("f2.jpg",), ("f3.jpg",), ("f4.jpg",)))
>>> subset = xy.create_subset([0, 2])
>>> subset.X
[[10], [30]]
>>> subset.y
['a', 'c']
>>> subset.filepaths
(('f1.jpg',), ('f3.jpg',))
>>> xy.create_subset([]).X
[]
remove_augmented_files() → XyTuple

Remove augmented files from the dataset, keeping only original files.

This method identifies augmented files by checking if the file path contains the augmentation suffix and creates a new dataset without them.

Returns:

A new XyTuple instance containing only non-augmented files

Return type:

XyTuple

Examples

>>> xy = XyTuple(X=[1, 2, 3], y=[0, 1, 0],
...              filepaths=(("file1.jpg",), ("file2.jpg",), ("file1_aug_1.jpg",)))
>>> non_aug = xy.remove_augmented_files()
>>> len(non_aug.X)
2
>>> non_aug.filepaths
(('file1.jpg',), ('file2.jpg',))
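The filtering idea can be sketched without the library. The marker `_aug_` is an assumption taken from the file names in the examples, and `drop_augmented` is a hypothetical helper operating on plain lists and tuples.

```python
from typing import Any

AUG_MARKER = "_aug_"  # assumed augmentation marker, taken from the example file names


def drop_augmented(
    X: list[Any],
    y: list[Any],
    filepaths: tuple[tuple[str, ...], ...],
) -> tuple[list[Any], list[Any], tuple[tuple[str, ...], ...]]:
    """Keep only the groups whose filepaths do not contain the augmentation marker."""
    keep = [
        i for i, group in enumerate(filepaths)
        if not any(AUG_MARKER in path for path in group)
    ]
    return (
        [X[i] for i in keep],
        [y[i] for i in keep],
        tuple(filepaths[i] for i in keep),
    )


kept_X, kept_y, kept_paths = drop_augmented(
    [[1], [2], [3]],
    [0, 1, 0],
    (("file1.jpg",), ("file2.jpg",), ("file1_aug_1.jpg",)),
)
print(kept_paths)  # (('file1.jpg',), ('file2.jpg',))
```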
split(test_size: float, seed: int | RandomState | None = None, num_classes: int | None = None, remove_augmented: bool = True) → tuple[XyTuple, XyTuple]

Stratified split of the dataset ensuring original files and their augmented versions stay together.

This function splits the dataset into train and test sets while keeping augmented versions of the same image together. It works in several steps:

  1. Groups samples by original file and collects corresponding labels

  2. Performs stratified split on the original files to maintain class distribution

  3. Creates new XyTuple instances for train and test sets using the split indices

Parameters:
  • test_size (float) – Proportion of dataset to include in test split

  • seed (int | RandomState | None) – Controls shuffling for reproducible output

  • num_classes (int | None) – Number of classes in the dataset (If None, auto-calculate)

  • remove_augmented (bool) – Whether to remove augmented files from the test set

Returns:

Train and test splits containing (features, labels, file paths)

Return type:

tuple[XyTuple, XyTuple]

Examples

>>> xy = XyTuple(X=np.arange(10), y=[0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
...              filepaths=(("f1.jpg",), ("f2.jpg",), ("f3.jpg",), ("f4.jpg",), ("f5.jpg",),
...                         ("f6.jpg",), ("f7.jpg",), ("f8.jpg",), ("f9.jpg",), ("f10.jpg",)))
>>> train, test = xy.split(test_size=0.3, seed=42)
>>> len(train.X), len(test.X)
(7, 3)
>>> train, test = xy.split(test_size=0.0)
>>> len(train.X), len(test.X)
(10, 0)
>>> train, test = xy.split(test_size=1.0)
>>> len(train.X), len(test.X)
(0, 10)
kfold_split(n_splits: int, remove_augmented: bool = True, shuffle: bool = True, random_state: int | None = None, verbose: int = 1) → Generator[tuple[XyTuple, XyTuple], None, None]

Perform stratified k-fold splits while keeping original and augmented data together.

If filepaths are not provided, performs a regular stratified k-fold split on the data.

Parameters:
  • n_splits (int) – Number of folds. Uses LeaveOneOut if -1 or too large for the dataset; any other negative value -X uses LeavePOut with p = X

  • remove_augmented (bool) – Whether to remove augmented files from the validation sets

  • shuffle (bool) – Whether to shuffle before splitting

  • random_state (int | None) – Seed for reproducible shuffling

  • verbose (int) – Verbosity level; 0 disables printing information about the splits

Returns:

Generator yielding one (train, test) pair of XyTuple instances per split

Return type:

Generator[tuple[XyTuple, XyTuple], None, None]

Raises:

ValueError – If there are fewer original files than requested splits

Examples

>>> xy = XyTuple(X=np.arange(8), y=[[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]],
...              filepaths=(("f1.jpg",), ("f2.jpg",), ("f3.jpg",), ("f4.jpg",), ("f5.jpg",),
...                          ("f6.jpg",), ("f7.jpg",), ("f8.jpg",)))
>>> splits = list(xy.kfold_split(n_splits=2, random_state=42, verbose=0))
>>> len(splits)
2
>>> len(splits[0][0].X), len(splits[0][1].X)  # First fold: train size, test size
(4, 4)
>>> xy = XyTuple(X=np.arange(8), y=[[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])
>>> splits = list(xy.kfold_split(n_splits=2, random_state=42, verbose=0))
>>> len(splits)
2
>>> len(splits[0][0].X), len(splits[0][1].X)  # First fold: train size, test size
(4, 4)
>>> xy = XyTuple(X=np.arange(4), y=[[0], [1], [0], [1]])
>>> splits = list(xy.kfold_split(n_splits=2, random_state=42, verbose=0))
>>> len(splits)
2
>>> len(splits[0][0].X), len(splits[0][1].X)
(2, 2)
>>> xy_few = XyTuple(X=[1, 2], y=[0, 1], filepaths=(("f1.jpg",), ("f2.jpg",)))
>>> splits = list(xy_few.kfold_split(n_splits=1, verbose=0))
>>> splits[0][0].X
[[1], [2]]
>>> splits[0][1].X
[]
>>> # Fallback to LeaveOneOut since n_splits is too big, so n_splits becomes -> 2
>>> xy_few = XyTuple(X=[1, 2], y=[0, 1], filepaths=(("f1.jpg",), ("f2.jpg",)))
>>> splits = list(xy_few.kfold_split(n_splits=516416584, shuffle=False, verbose=0))
>>> len(splits)
2
>>> splits[1][0].X
[[1]]
>>> splits[1][1].X
[[2]]
>>> # Fallback to LeavePOut since n_splits is negative
>>> xy_few = XyTuple(X=[1, 2, 3, 4], y=[0, 1, 0, 1])
>>> splits = list(xy_few.kfold_split(n_splits=-2, shuffle=False, verbose=1))
>>> len(splits)
6
>>> splits[0][0].X
[[3], [4]]
>>> splits[0][1].X
[[1], [2]]
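The count of 6 splits in the last example matches the number of ways to choose 2 test samples out of 4. A plain-Python sketch of leave-p-out enumeration (the helper name `leave_p_out` is hypothetical, and this is not the library's implementation) makes that explicit:

```python
from itertools import combinations
from math import comb
from typing import Iterator


def leave_p_out(n_samples: int, p: int) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) for every choice of p test samples."""
    all_indices = range(n_samples)
    for test in combinations(all_indices, p):
        train = [i for i in all_indices if i not in test]
        yield train, list(test)


splits = list(leave_p_out(n_samples=4, p=2))
print(len(splits), comb(4, 2))  # 6 6
print(splits[0])                # ([2, 3], [0, 1])
```

The first split holds out indices 0 and 1 and trains on indices 2 and 3, which corresponds to the test values [[1], [2]] and train values [[3], [4]] shown in the example above.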
ungrouped_array() → tuple[ndarray[Any, dtype[Any]], ndarray[Any, dtype[Any]], tuple[tuple[str, ...], ...]]

Ungroup data to flatten the structure.

Converts from grouped format to ungrouped format:

  • Grouped: X: list[list[Any]], y: list[Any]

  • Ungrouped: X: NDArray[Any], y: NDArray[Any]

Returns:

A tuple containing (X, y, filepaths) in ungrouped format

Return type:

tuple[NDArray[Any], NDArray[Any], tuple[tuple[str, …], …]]

Examples

>>> xy = XyTuple(X=[[np.array([1])], [np.array([2]), np.array([3])], [np.array([4])]],
...              y=[np.array(0), np.array(1), np.array(2)],
...              filepaths=(("file1.png",), ("file2.png", "file3.png"), ("file4.png", "file5.png")))
>>> X, y, filepaths = xy.ungrouped_array()
>>> len(X)
4
>>> len(y)
4
>>> filepaths
(('file1.png',), ('file2.png',), ('file3.png',), ('file4.png', 'file5.png'))
static empty() → XyTuple

Create an empty XyTuple.

Returns:

An empty XyTuple with empty lists for X and y and an empty tuple for filepaths

Return type:

XyTuple

Examples

>>> empty = XyTuple.empty()
>>> empty.X
[]
>>> empty.y
[]
>>> empty.filepaths
()