Architecture details

This page explains how Lahuta decides what to run, what happens behind the scenes during run(), and where Python task performance can become your bottleneck.

Declaring dependencies and automatic inference

Every task node can depend on earlier nodes. You can declare dependencies explicitly with depends=[...], or let Lahuta infer built-in dependencies for Python callables.

from lahuta.pipeline import Pipeline, PipelineContext
from lahuta.sources import FileSource


def needs_system(ctx: PipelineContext) -> dict:
    system = ctx.get_system()
    return {"atoms": int(system.n_atoms) if system is not None else 0}


p = Pipeline(FileSource(["core/data/1kx2_small.cif"]))
p.add_task(name="summary", task=needs_system, depends=["system"])  # explicit

If you do not pass depends, Lahuta inspects your function and infers whether it needs system or topology from ctx.get_system() / ctx.get_topology() access patterns.
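
For example, the task below reads ctx.get_topology(), so Lahuta can infer a topology dependency without an explicit depends list (a minimal sketch reusing the pipeline p from above):

def needs_topology(ctx: PipelineContext) -> dict:
    # Accessing ctx.get_topology() is what lets Lahuta infer depends=["topology"]
    topology = ctx.get_topology()
    return {"has_topology": topology is not None}


p.add_task(name="topology_summary", task=needs_topology)  # no depends given: inferred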

Important limits:

  • inference only covers built-ins (system, topology), not your custom task names,
  • explicit depends overrides inference,
  • depends=[] means “no dependencies” and disables inference for that task.

Use explicit dependencies when task ordering is part of your logic, or when context access is dynamic and not obvious to readers.
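
Because inference never covers custom task names, ordering against your own tasks must always be explicit. The sketch below assumes the summary task from above is already registered; how a downstream task consumes an upstream result is not covered on this page, so the depends list here only pins execution order:

def report(ctx: PipelineContext) -> dict:
    # Runs after "summary" for each item. Explicit depends is required:
    # inference only recognizes the built-ins (system, topology).
    return {"stage": "report"}


p.add_task(name="report", task=report, depends=["summary"])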

Deciding whether system and topology are computed

Lahuta only runs built-ins when required by task dependencies (explicit or inferred).

  • If tasks only use ctx.path, neither system nor topology needs to run (see the sketch after this list).
  • If a task needs parsed structure arrays/object state, depend on system.
  • If a task needs chemistry-aware structure interpretation, depend on topology.
  • ContactTask implies topology automatically.
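
For instance, a task that only inspects ctx.path opts out of both built-ins, and passing depends=[] makes that explicit by switching inference off:

def path_only(ctx: PipelineContext) -> dict:
    # Touches only ctx.path, so neither system nor topology has to run.
    return {"path": str(ctx.path)}


p.add_task(name="path_only", task=path_only, depends=[])  # [] also disables inference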

You can also control built-in behavior through params:

from lahuta.pipeline import Pipeline
from lahuta.sources import FileSource
from lahuta.lib.lahuta import TopologyComputers

p = Pipeline(FileSource(["core/data/fubi.cif"]))
p.params("system").is_model = True
p.params("topology").flags = TopologyComputers.All

If you disable topology (p.params("topology").enabled = False), Lahuta blocks tasks that require topology, including contact tasks.
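
Concretely, disabling the built-in and its downstream consumers looks like this:

p.params("topology").enabled = False
# Any task that requires topology (explicitly, by inference, or implicitly
# via a contact task) is now blocked for this pipeline.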

What happens behind the scenes

At runtime, Lahuta compiles your task graph, executes it item by item, and routes emitted payloads through channels to sinks.

The flow is:

  1. The source enumerates items (files, DB records, frames, etc.).
  2. The graph is compiled with built-ins + user tasks in dependency order.
  3. For each item, Lahuta creates a per-item context (PipelineContext).
  4. Stages run in order for that item, and each stage may emit channel payloads.
  5. Channel sinks (memory/file/sharded) consume payloads via async writer queues.
  6. PipelineResult is assembled from memory sinks after run completion.

Backpressure and writer-thread settings apply at the sink layer, so a slow output destination does not silently accumulate an unbounded queue.
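
Putting the flow together, a minimal end-to-end run looks like the sketch below, reusing needs_system from the first example. run() itself appears at the top of this page; what you can read off a PipelineResult beyond collecting it is not documented here, so no accessors are shown:

p = Pipeline(FileSource(["core/data/1kx2_small.cif"]))
p.add_task(name="summary", task=needs_system, depends=["system"])

# run() compiles the graph, streams each item through the stages in
# dependency order, drains the sink writer queues, and assembles the
# PipelineResult from the memory sinks.
result = p.run()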

Caveats for slow Python tasks (GIL and throughput)

Lahuta parallelizes across items, not within a single Python function call. A heavy Python task body is still one call that must finish before that stage can move the item forward.

Two practical consequences:

  • CPU-bound Python code is typically limited by the GIL, so you should not expect linear multi-core speedup from pure Python loops.
  • If one task takes about 2 seconds per item, that stage will dominate your end-to-end runtime profile even if C++ stages finish in milliseconds.

Also note:

  • thread_safe=False serializes that Python task explicitly (at most one concurrent invocation),
  • thread_safe=True allows concurrent scheduling across items, but GIL effects still apply to CPU-bound Python code (see the sketch after this list).
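
A hedged sketch of the two modes follows. This page does not show where thread_safe is passed, so treating it as an add_task keyword argument is an assumption:

def cpu_heavy(ctx: PipelineContext) -> dict:
    # A pure-Python loop: GIL-bound no matter what thread_safe is set to.
    return {"checksum": sum(ord(c) for c in str(ctx.path))}


# Assumption: thread_safe is accepted by add_task.
p.add_task(name="checksum", task=cpu_heavy, thread_safe=True)
# thread_safe=True lets this task be scheduled concurrently across items,
# but the body above still contends on the GIL; thread_safe=False would
# cap it at one invocation at a time instead.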

Continue to Advanced usage for tuning, graph patterns, observability, and failure-handling workflows.