Registry System =============== Semantiva's registry system manages component discovery, registration, and resolution across the framework. This system consists of two main parts: a dual-registry architecture for component separation and a bootstrap profile system for reproducible component loading. Architecture Overview --------------------- Semantiva uses a dual-registry system to manage different types of components while avoiding circular import dependencies. This architecture separates concerns between general component resolution and execution-specific component management. Registry Components ~~~~~~~~~~~~~~~~~~~ ``ProcessorRegistry`` Primary registry for data processors, context processors, workflow components, data collections, and fitting models. Handles dynamic class discovery from modules imported at runtime. ``NameResolverRegistry`` Stores prefix-based resolvers (``rename:``, ``delete:``, ``stringbuild:``, ``slicer:``) that expand declarative YAML strings into processor classes. ``ParameterResolverRegistry`` Maintains resolvers that transform configuration values recursively before processor instantiation. Resolvers are applied to dict values, list/tuple items, and nested structures. For example, the ``model:`` specification used by fitting pipelines. :py:class:`~semantiva.execution.component_registry.ExecutionComponentRegistry` Specialized registry for execution layer components: orchestrators, executors, and transports. Designed to avoid circular import dependencies with graph building and general class resolution. Dependency Separation ~~~~~~~~~~~~~~~~~~~~~ The dual-registry architecture solves a fundamental circular import problem: * **Graph builder** needs ``ProcessorRegistry`` to resolve processor classes from YAML * **Orchestrators** need graph builder functions for canonical specs and pipeline IDs * **ProcessorRegistry** previously needed to import orchestrators for default registration By introducing ``ExecutionComponentRegistry``, we break this cycle: .. code-block:: text Before (Circular): ProcessorRegistry → LocalSemantivaOrchestrator → graph_builder → ProcessorRegistry After (Clean): ProcessorRegistry → graph_builder ExecutionComponentRegistry → LocalSemantivaOrchestrator orchestrator/factory → ExecutionComponentRegistry Parameter Resolver System ------------------------- Parameter resolvers provide a mechanism to transform configuration values before processor instantiation. This system enables dynamic configuration resolution, environment variable substitution, and complex parameter transformations. Resolution Behavior ~~~~~~~~~~~~~~~~~~~ Parameters are **recursively transformed** before processor instantiation. Resolution applies to: * Dictionary values * List and tuple items * Nested structures (dictionaries within lists, lists within dictionaries, etc.) Resolvers are run in registration order and should be pure and idempotent. Adding Custom Parameter Resolvers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To add a custom parameter resolver: .. code-block:: python from semantiva.registry.parameter_resolver_registry import ParameterResolverRegistry def my_param_resolver(value): # Return (resolved_value, handled: bool) if isinstance(value, str) and value.startswith("myenv:"): env_var = value.split(":",1)[1] return os.environ.get(env_var, ""), True return value, False ParameterResolverRegistry.register_resolver("myenv", my_param_resolver) Built-in Resolvers ~~~~~~~~~~~~~~~~~~ The framework provides several built-in parameter resolvers: ``model:`` Resolves model specifications into ``ModelDescriptor`` objects for fitting workflows. Example: ``model:LinearRegression:param1=value1,param2=value2`` Resolver Function Interface ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Parameter resolver functions must follow this interface: .. code-block:: python def parameter_resolver(value: Any) -> tuple[Any, bool]: """Transform a parameter value. Args: value: The input parameter value to potentially transform Returns: tuple: (resolved_value, was_handled) - resolved_value: The transformed value (or original if unchanged) - was_handled: True if this resolver processed the value, False otherwise """ If ``was_handled`` is ``True``, the resolved value is used. If ``False``, the original value is passed to the next resolver in the chain. Recursive Resolution Example ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python # Input parameters with nested structure payload = { "database_url": "myenv:DATABASE_URL", "processing_config": { "batch_size": 100, "model_spec": "model:LinearRegression:learning_rate=0.01" }, "file_paths": ["myenv:INPUT_DIR/file1.txt", "myenv:INPUT_DIR/file2.txt"] } # After recursive parameter resolution resolved = { "database_url": "postgresql://localhost:5432/mydb", "processing_config": { "batch_size": 100, "model_spec": ModelDescriptor("sklearn.LinearRegression", {"learning_rate": 0.01}) }, "file_paths": ["/data/input/file1.txt", "/data/input/file2.txt"] } Bootstrap Profiles ------------------ The **Registry v1** design introduces ``RegistryProfile`` to make registry state explicit, portable, and reproducible. This system tracks modules and extension entry points that declare Semantiva components. Key Concepts ~~~~~~~~~~~~ ``RegistryProfile`` Frozen dataclass capturing four attributes: ``load_defaults`` Whether to ensure the core Semantiva modules and built-in resolvers are loaded. Defaults to ``True``. ``modules`` Python modules to import. Importing runs the Semantiva metaclass hooks, registering every component exposed by those modules. ``extensions`` Entry-point or module specifications that should be loaded via ``semantiva.registry.plugin_registry.load_extensions``. ``apply_profile(profile)`` Applies ``load_defaults`` (idempotent) and then registers ``modules`` and ``extensions`` in that order. ``current_profile()`` Captures the current process registry and returns a ``RegistryProfile`` instance. The snapshot always enables ``load_defaults`` and returns the module history that has been applied. ``fingerprint()`` Produces a SHA-256 hash of a normalised representation of the profile. The fingerprint is pinned into every SER under ``why_ok.env.registry.fingerprint``. Initialization Flow ~~~~~~~~~~~~~~~~~~~ Component registration follows a carefully orchestrated initialization sequence: 1. **ProcessorRegistry.register_modules(DEFAULT_MODULES)** * Registers core data processors, context processors, and fitting models * Ensures built-in resolvers (rename, delete, stringbuild, slicer, model) are available * Calls ``ExecutionComponentRegistry.initialize_defaults()`` 2. **ExecutionComponentRegistry.initialize_defaults()** * Imports execution components using lazy imports (no circular dependencies) * Registers default orchestrators, executors, and transports * Safe to call multiple times (idempotent) Component Resolution ~~~~~~~~~~~~~~~~~~~~ Different component types use their respective registries: **Data Processors (via resolve_symbol)**: .. code-block:: python from semantiva.registry import ProcessorRegistry, resolve_symbol # Ensure modules are registered (idempotent) ProcessorRegistry.register_modules(["semantiva.examples.test_utils"]) processor_cls = resolve_symbol("FloatValueDataSource") **Execution Components (via ExecutionComponentRegistry)**: .. code-block:: python from semantiva.execution.component_registry import ExecutionComponentRegistry # Resolves orchestrators for factory orch_cls = ExecutionComponentRegistry.get_orchestrator("LocalSemantivaOrchestrator") Factory Integration ~~~~~~~~~~~~~~~~~~~ The :py:func:`~semantiva.execution.orchestrator.factory.build_orchestrator` function uses ``ExecutionComponentRegistry`` for component resolution: .. code-block:: python from semantiva.execution.orchestrator.factory import build_orchestrator from semantiva.configurations.schema import ExecutionConfig config = ExecutionConfig( orchestrator="LocalSemantivaOrchestrator", executor="SequentialSemantivaExecutor", transport="InMemorySemantivaTransport" ) orchestrator = build_orchestrator(config) Distributed Execution --------------------- ``QueueSemantivaOrchestrator.enqueue`` now accepts an optional ``registry_profile`` parameter. If omitted, the orchestrator captures the process state via ``current_profile()``. The profile is attached to job metadata so that workers can replay the same registry configuration before constructing pipelines. YAML pipelines keep their ``extensions:`` support—``apply_profile`` is executed before YAML parsing, and ``load_pipeline_from_yaml`` still loads any inline extensions declared in the file. Programmatic Usage ------------------ Registry Profiles ~~~~~~~~~~~~~~~~~ .. code-block:: python from semantiva.registry.bootstrap import RegistryProfile, apply_profile, current_profile # Capture the current process state (defaults, modules, and paths) profile = current_profile() # Launch a distributed job with an explicit profile orchestrator.enqueue(pipeline_nodes, registry_profile=profile) # Rehydrate a profile in a worker or a separate process apply_profile(profile) Component Registration ~~~~~~~~~~~~~~~~~~~~~~ **Custom Data Processors**: .. code-block:: python # Register custom processors via ProcessorRegistry from semantiva.registry import ProcessorRegistry ProcessorRegistry.register_modules(["my_extension.processors"]) **Custom Execution Components**: .. code-block:: python # Register custom orchestrators ExecutionComponentRegistry.register_orchestrator( "CustomOrchestrator", MyCustomOrchestrator ) Best Practices -------------- 1. **Registry Selection**: Use ``resolve_symbol``/``ProcessorRegistry`` for data/context processors and ``ExecutionComponentRegistry`` for execution components. 2. **Initialization Order**: Use ``apply_profile`` or ``ProcessorRegistry.register_modules`` to ensure required modules are loaded before constructing pipelines. 3. **Lazy Imports**: When adding new execution components, use lazy imports in ``initialize_defaults()`` to avoid circular dependencies. 4. **Testing**: Both registries provide ``clear()`` methods for test isolation. 5. **Profile Management**: Use ``current_profile()`` to capture reproducible registry states for distributed execution. Idempotent Defaults ------------------- ``register_builtin_resolvers()`` installs built-in name and parameter resolvers exactly once. Re-invoking it is safe and preserves any user-provided resolvers registered with ``NameResolverRegistry`` or ``ParameterResolverRegistry``. Migration Notes --------------- The dual-registry architecture was introduced to resolve circular import issues while maintaining backward compatibility. Existing code using the new ``ProcessorRegistry`` and ``resolve_symbol`` APIs continues to work unchanged. Only the internal orchestrator factory implementation was modified to use the execution registry explicitly. The separation provides a foundation for future scalability, allowing independent evolution of data processing and execution layer components without coupling concerns.