stouputils.collections module

This module provides utilities for collection manipulation:

  • unique_list: Remove duplicates from a list while preserving order using object id, hash or str

  • sort_dict_keys: Sort dictionary keys using a given order list (ascending or descending)

  • upsert_in_dataframe: Insert or update a row in a Polars DataFrame based on primary keys

  • array_to_disk: Easily handle large numpy arrays on disk using zarr for efficient storage and access.

unique_list(
list_to_clean: Iterable[T],
method: Literal['id', 'hash', 'str'] = 'str',
) -> list[T]

Remove duplicates from the list while preserving order, identifying duplicates by object id, hash, or str (the default, per the signature)

Parameters:
  • list_to_clean (Iterable[T]) – The list to clean

  • method (Literal["id", "hash", "str"]) – The method to use to identify duplicates

Returns:

The cleaned list

Return type:

list[T]

Examples

>>> unique_list([1, 2, 3, 2, 1], method="id")
[1, 2, 3]
>>> s1 = {1, 2, 3}
>>> s2 = {2, 3, 4}
>>> s3 = {1, 2, 3}
>>> unique_list([s1, s2, s1, s1, s3, s2, s3], method="id")
[{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
>>> s1 = {1, 2, 3}
>>> s2 = {2, 3, 4}
>>> s3 = {1, 2, 3}
>>> unique_list([s1, s2, s1, s1, s3, s2, s3], method="str")
[{1, 2, 3}, {2, 3, 4}]
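
The difference between the methods in the examples above can be reproduced with a minimal sketch. This is a hypothetical re-implementation for illustration only, not the stouputils source; the name unique_list_sketch and its internals are assumptions.

```python
from collections.abc import Iterable
from typing import Any, Literal, TypeVar

T = TypeVar("T")


def unique_list_sketch(
    items: Iterable[T],
    method: Literal["id", "hash", "str"] = "str",
) -> list[T]:
    """Hypothetical order-preserving de-duplication (not the library code)."""
    # Map each method name to the built-in used to derive a comparison key.
    key_of = {"id": id, "hash": hash, "str": str}[method]
    seen: set[Any] = set()
    result: list[T] = []
    for item in items:
        key = key_of(item)
        if key not in seen:  # first occurrence of this key: keep the item
            seen.add(key)
            result.append(item)
    return result


s1, s2, s3 = {1, 2, 3}, {2, 3, 4}, {1, 2, 3}
items = [s1, s2, s1, s1, s3, s2, s3]

# "id" keeps s3 because it is a distinct object, even though it equals s1.
by_id = unique_list_sketch(items, method="id")
# "str" drops s3 because its string form matches that of s1.
by_str = unique_list_sketch(items, method="str")
```

In this sketch, method="hash" raises TypeError for unhashable items such as sets, which is one reason to prefer "id" or "str" for such inputs.
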
sort_dict_keys(
dictionary: dict[T, Any],
order: list[T],
reverse: bool = False,
) -> dict[T, Any]

Sort dictionary keys using a given order list (reverse optional)

Parameters:
  • dictionary (dict[T, Any]) – The dictionary to sort

  • order (list[T]) – The order list

  • reverse (bool) – Whether to sort in reverse order. The flag is passed to sorted(), which can give a different result than reversing the order list with order.reverse()

Returns:

The sorted dictionary

Return type:

dict[T, Any]

Examples

>>> sort_dict_keys({'b': 2, 'a': 1, 'c': 3}, order=["a", "b", "c"])
{'a': 1, 'b': 2, 'c': 3}
>>> sort_dict_keys({'b': 2, 'a': 1, 'c': 3}, order=["a", "b", "c"], reverse=True)
{'c': 3, 'b': 2, 'a': 1}
>>> sort_dict_keys({'b': 2, 'a': 1, 'c': 3, 'd': 4}, order=["c", "b"])
{'c': 3, 'b': 2, 'a': 1, 'd': 4}
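
The note on reverse can be illustrated with a small sketch. It assumes that keys missing from order are ranked after all listed keys, which matches the last example above; the function name sort_dict_keys_sketch and its internals are hypothetical, not the stouputils source.

```python
from typing import Any, TypeVar

T = TypeVar("T")


def sort_dict_keys_sketch(
    dictionary: dict[T, Any],
    order: list[T],
    reverse: bool = False,
) -> dict[T, Any]:
    """Hypothetical key ordering (not the library code)."""
    # Rank of each listed key; unlisted keys all share the rank len(order).
    rank = {key: index for index, key in enumerate(order)}
    fallback = len(order)
    return dict(
        sorted(
            dictionary.items(),
            key=lambda item: rank.get(item[0], fallback),
            reverse=reverse,
        )
    )


data = {"b": 2, "a": 1, "c": 3, "d": 4}

# Listed keys first, unlisted keys after them in their original order.
plain = sort_dict_keys_sketch(data, order=["c", "b"])
# reverse=True reverses the ranking, so unlisted keys move to the front.
flag_reversed = sort_dict_keys_sketch(data, order=["c", "b"], reverse=True)
# Reversing the order list only swaps the listed keys; unlisted keys stay last.
list_reversed = sort_dict_keys_sketch(data, order=["b", "c"])
```

Under these assumptions, reverse=True and a reversed order list agree only when every key of the dictionary appears in order.
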
upsert_in_dataframe(
df: pl.DataFrame,
new_entry: dict[str, Any],
primary_keys: dict[str, Any] | None = None,
) -> pl.DataFrame

Insert or update a row in the Polars DataFrame based on primary keys.

Parameters:
  • df (pl.DataFrame) – The Polars DataFrame to update.

  • new_entry (dict[str, Any]) – The new entry to insert or update.

  • primary_keys (dict[str, Any] | None) – The primary keys identifying the row to update (default: None, i.e. no primary keys).

Returns:

The updated Polars DataFrame.

Return type:

pl.DataFrame
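
The insert-or-update behaviour described above can be sketched without Polars, using a list of dict rows as a stand-in for a DataFrame. Everything here is an assumption made for illustration: the name upsert_rows, the rule that a row matches when it agrees with every primary key, and the choice to append when nothing matches. It is not the stouputils implementation.

```python
from typing import Any, Optional

Row = dict[str, Any]


def upsert_rows(
    rows: list[Row],
    new_entry: Row,
    primary_keys: Optional[Row] = None,
) -> list[Row]:
    """Hypothetical upsert on plain rows (stand-in for a DataFrame)."""
    keys = primary_keys or {}

    def matches(row: Row) -> bool:
        # A row matches when it holds every primary key with the same value.
        return all(name in row and row[name] == value for name, value in keys.items())

    if keys and any(matches(row) for row in rows):
        # Update: overwrite the matching row's fields with the new entry.
        return [{**row, **new_entry} if matches(row) else dict(row) for row in rows]
    # Insert: no primary keys given or no matching row, so append a new row.
    return [dict(row) for row in rows] + [dict(new_entry)]


table = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
updated = upsert_rows(table, {"id": 2, "name": "B"}, primary_keys={"id": 2})
inserted = upsert_rows(table, {"id": 3, "name": "c"}, primary_keys={"id": 3})
```

The sketch returns a new list rather than mutating its input, mirroring the fact that the documented function returns the updated DataFrame.
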

array_to_disk(
data: NDArray[Any] | zarr.Array,
delete_input: bool = True,
more_data: NDArray[Any] | zarr.Array | None = None,
) -> tuple[zarr.Array, str, int]

Easily handle large numpy arrays on disk using zarr for efficient storage and access.

Zarr provides a simpler and more efficient alternative to np.memmap with better compression and chunking capabilities.

Parameters:
  • data (NDArray | zarr.Array) – The data to save/load as a zarr array

  • delete_input (bool) – Whether to delete the input data after creating the zarr array

  • more_data (NDArray | zarr.Array | None) – Additional data to append to the zarr array

Returns:

The zarr array, the directory path, and the total size in bytes

Return type:

tuple[zarr.Array, str, int]

Examples

>>> import numpy as np
>>> data = np.random.rand(1000, 1000)
>>> zarr_array = array_to_disk(data)[0]
>>> zarr_array.shape
(1000, 1000)
>>> more_data = np.random.rand(500, 1000)
>>> longer_array, dir_path, total_size = array_to_disk(zarr_array, more_data=more_data)
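
For contrast with the np.memmap approach mentioned in the description, here is a minimal sketch of plain np.memmap usage. It uses only NumPy and the standard library, not stouputils or zarr; the file name data.bin and the temporary directory are arbitrary choices for the illustration.

```python
import os
import tempfile

import numpy as np

# Arbitrary scratch location for the raw binary file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "data.bin")

data = np.arange(12, dtype=np.float64).reshape(3, 4)

# Create the file-backed array, copy the data in, and flush it to disk.
writer = np.memmap(path, dtype=data.dtype, mode="w+", shape=data.shape)
writer[:] = data
writer.flush()
del writer

# Reopening requires dtype and shape again: the raw file stores neither.
reopened = np.memmap(path, dtype=np.float64, mode="r", shape=(3, 4))
total = float(reopened.sum())

# The file is uncompressed: 12 values of 8 bytes each.
size_bytes = os.path.getsize(path)
```

With np.memmap the caller has to track dtype and shape separately and gets no compression, which is the bookkeeping that a zarr-backed helper is meant to avoid.
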