zea.data.datasets
This module provides classes and utilities for loading, validating, and managing ultrasound datasets stored in HDF5 format. It supports both local and Hugging Face Hub datasets, and offers efficient file handle caching for large collections of files.
Main Classes
H5FileHandleCache: Caches open HDF5 file handles to optimize repeated access.
Folder: Represents a group of HDF5 files in a directory, with optional validation.
Dataset: Provides an iterable interface over multiple HDF5 files or folders, with support for directory-based splitting and validation.
Functions
find_h5_files: Recursively finds HDF5 files and retrieves their dataset shapes.
split_files_by_directory: Splits files among directories according to specified ratios.
count_samples_per_directory: Counts the number of files per directory.
Features
Validation of dataset integrity with flag files and error logging.
Support for Hugging Face Hub datasets with local caching.
Utilities for dataset splitting and sample counting.
Example usage is provided in the module’s main block.
Functions
count_samples_per_directory | Count number of samples per directory.
find_h5_files | Find HDF5 files from a directory or list of directories and optionally retrieve their shapes.
split_files_by_directory | Split files according to their parent directories and given split ratios.
Classes
Dataset | Iterate over File(s) and Folder(s).
Folder | Group of HDF5 files in a folder that can be validated.
H5FileHandleCache | Cache for HDF5 file handles.
- class zea.data.datasets.Dataset(file_paths, key, search_file_tree_kwargs=None, validate=True, directory_splits=None, **kwargs)
Bases: H5FileHandleCache
Iterate over File(s) and Folder(s).
- classmethod from_config(dataset_folder, dtype, user=None, **kwargs)
Creates a Dataset from a config file.
- property n_files
Return number of files in dataset.
- property total_frames
Return total number of frames in dataset.
- class zea.data.datasets.Folder(folder_path, key=None, search_file_tree_kwargs=None, validate=True, hf_cache_dir=PosixPath('/home/docs/.cache/zea/huggingface/datasets'), **kwargs)
Bases: object
Group of HDF5 files in a folder that can be validated. This class is mostly used internally; in most cases you will want to use the Dataset class instead.
- copy(to_path, key=None, mode=None)
Copy the data for all keys or for a specific key to a new location.
By default, data is copied only if the destination file does not already contain the key. Set mode to ‘w’ to overwrite the destination file. Metadata such as dataset attributes and the scan object is always copied.
- Parameters:
to_path (str or Path) – The destination path where files will be copied.
key (str, optional) – The key to copy from the source files. If ‘all’ or ‘*’, all keys will be copied. Defaults to None, which uses the key set in the Folder instance.
mode (str, optional) – The mode in which to open the destination files. Defaults to ‘a’ (append mode), or to ‘w’ (write mode) if key is ‘all’ or ‘*’. See: https://docs.h5py.org/en/stable/high/file.html#opening-creating-files
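The default for mode described above can be summarized in a few lines. This is only a sketch of the documented rule, not the library's code; the helper name resolve_copy_mode is hypothetical.

```python
def resolve_copy_mode(key, mode=None):
    """Return the file mode implied by the documented defaults.

    Hypothetical helper for illustration; it is not part of zea.
    """
    if mode is not None:
        # An explicit mode always takes precedence.
        return mode
    # Copying every key defaults to write mode, a single key to append mode.
    return "w" if key in ("all", "*") else "a"
```

With this rule, copying a single key leaves existing destination files in place, while copying all keys starts from a fresh destination file unless a mode is passed explicitly.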
- property n_files
Return number of files in dataset.
- class zea.data.datasets.H5FileHandleCache(file_handle_cache_capacity=128)
Bases: object
Cache for HDF5 file handles.
This class manages a cache of HDF5 file handles to avoid reopening files multiple times. It uses an OrderedDict to maintain the order of file access and closes the least recently used file when the cache reaches its capacity.
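The least-recently-used strategy described above can be sketched with an OrderedDict as follows. This is a minimal illustration under stated assumptions, not the zea implementation: the class name LRUHandleCache, its methods, and the opener argument are hypothetical.

```python
from collections import OrderedDict


class LRUHandleCache:
    """Illustrative LRU cache for file-like handles (not the zea class)."""

    def __init__(self, opener, capacity=128):
        self._opener = opener          # callable: path -> handle with .close()
        self._capacity = capacity
        self._handles = OrderedDict()  # path -> handle, least recently used first

    def get(self, path):
        if path in self._handles:
            # Cache hit: mark the handle as most recently used.
            self._handles.move_to_end(path)
            return self._handles[path]
        if len(self._handles) >= self._capacity:
            # Cache full: evict and close the least recently used handle.
            _, oldest = self._handles.popitem(last=False)
            oldest.close()
        handle = self._opener(path)
        self._handles[path] = handle
        return handle

    def close_all(self):
        # Close every cached handle and empty the cache.
        while self._handles:
            _, handle = self._handles.popitem()
            handle.close()
```

For HDF5 files, the opener could be something like `lambda p: h5py.File(p, "r")`, so that repeated reads from the same file reuse one open handle.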
- zea.data.datasets.count_samples_per_directory(file_names, directories)
Count number of samples per directory.
- Parameters:
file_names (list) – List of file paths
directories (str or list) – Directory or list of directories
- Returns:
Dictionary with directory paths as keys and sample counts as values
- Return type:
dict
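As a rough illustration of the behavior described above, the following sketch counts how many file paths fall under each directory. It is a stand-in written for this page, not the library's code; the name count_files_per_directory and the rule that nested files count toward every listed ancestor directory are assumptions.

```python
from pathlib import Path


def count_files_per_directory(file_names, directories):
    """Count how many of file_names lie under each directory (illustrative)."""
    if isinstance(directories, (str, Path)):
        # Accept a single directory as well as a list of directories.
        directories = [directories]
    counts = {str(d): 0 for d in directories}
    for name in file_names:
        parents = Path(name).parents
        for d in directories:
            # Assumption: a file counts for every listed directory above it.
            if Path(d) in parents:
                counts[str(d)] += 1
    return counts
```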
- zea.data.datasets.find_h5_files(paths, key=None, search_file_tree_kwargs=None)
Find HDF5 files from a directory or list of directories and optionally retrieve their shapes.
- Parameters:
paths (str or list) – A single directory path, a list of directory paths, or a single HDF5 file path.
key (str, optional) – The key to get the file shapes for.
search_file_tree_kwargs (dict, optional) – Additional keyword arguments for the search_file_tree function. Defaults to None.
- Returns:
file_paths (list): List of file paths to the HDF5 files.
file_shapes (list): List of shapes of the HDF5 datasets.
- Return type:
tuple
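The file-discovery part of this function can be approximated with pathlib alone. The sketch below is not the zea implementation: the name find_h5_paths and the accepted extensions are assumptions, and shape retrieval is left out so that the example has no dependencies beyond the standard library.

```python
from pathlib import Path

# Assumption: files are recognized by these extensions.
H5_SUFFIXES = {".h5", ".hdf5"}


def find_h5_paths(paths):
    """Collect HDF5 file paths from files and directories (illustrative)."""
    if isinstance(paths, (str, Path)):
        # Accept a single path as well as a list of paths.
        paths = [paths]
    found = []
    for p in map(Path, paths):
        if p.is_file() and p.suffix.lower() in H5_SUFFIXES:
            found.append(p)
        elif p.is_dir():
            # Search the directory tree recursively, in sorted order.
            found.extend(
                f for f in sorted(p.rglob("*"))
                if f.is_file() and f.suffix.lower() in H5_SUFFIXES
            )
    return found
```

Given a key, the shape of each dataset could then be read with h5py, for example `h5py.File(path, "r")[key].shape`.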
- zea.data.datasets.split_files_by_directory(file_names, file_shapes, directory_list, directory_splits)
Split files according to their parent directories and given split ratios.
- Parameters:
file_names (list) – List of file paths.
file_shapes (list) – List of shapes for each file.
directory_list (list) – List of directory paths to split by.
directory_splits (list) – List of split ratios (0-1) for each directory.
- Returns:
(split_file_names, split_file_shapes)
- Return type:
tuple
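One plausible reading of the splitting described above is sketched below. It is an illustration only: the function name split_by_directory, keeping the first fraction of each directory's files, and rounding down are all assumptions rather than confirmed library behavior.

```python
from pathlib import Path


def split_by_directory(file_names, file_shapes, directory_list, directory_splits):
    """Keep a fraction of the files under each directory (illustrative)."""
    split_names, split_shapes = [], []
    for directory, ratio in zip(directory_list, directory_splits):
        if not 0 <= ratio <= 1:
            raise ValueError(f"split ratio must be in [0, 1], got {ratio}")
        # Indices of the files located under this directory.
        idx = [
            i for i, name in enumerate(file_names)
            if Path(directory) in Path(name).parents
        ]
        # Assumption: keep the first files and round the count down.
        keep = int(ratio * len(idx))
        for i in idx[:keep]:
            split_names.append(file_names[i])
            split_shapes.append(file_shapes[i])
    return split_names, split_shapes
```

Names and shapes are selected by the same indices, so the two returned lists stay aligned.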