Architecture

Floe is a modular platform for Apache Iceberg table maintenance. Every core component can be swapped to integrate with your infrastructure.

System Architecture

[Figure: Floe architecture diagram]

Extension Points

External integrations are pluggable and selected through configuration:

Component  Interface                    Implementations
Catalog    CatalogClient                Iceberg REST, Hive, Nessie, Polaris, Lakekeeper, Gravitino
Engine     ExecutionEngine              Spark, Trino
Store      PolicyStore, OperationStore  PostgreSQL, Memory

External Scheduler Integration

Disable the built-in scheduler and trigger maintenance via API (requires a matching policy):

# Disable built-in scheduler
# The demos enable the scheduler by default
FLOE_SCHEDULER_ENABLED=false

# Trigger from external orchestrator
curl -X POST http://floe:9091/api/v1/maintenance/trigger \
  -H "Content-Type: application/json" \
  -d '{"catalog": "demo", "namespace": "db", "table": "events"}'
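An orchestrator task written in Python could issue the same request. The sketch below is illustrative only: the helper names build_trigger_request and trigger_maintenance are hypothetical, the host and port are copied from the curl example, and the Content-Type header and the shape of the response are assumptions rather than documented Floe behavior.

```python
import json
import urllib.request


def build_trigger_request(base_url: str, catalog: str, namespace: str, table: str) -> urllib.request.Request:
    """Build (without sending) the POST request for the maintenance trigger endpoint."""
    payload = {"catalog": catalog, "namespace": namespace, "table": table}
    return urllib.request.Request(
        url=f"{base_url}/api/v1/maintenance/trigger",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the API expects a JSON body declared as application/json.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_maintenance(base_url: str, catalog: str, namespace: str, table: str) -> str:
    """Send the request and return the raw response body as text."""
    req = build_trigger_request(base_url, catalog, namespace, table)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    # for example when no policy matches the table.
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read().decode("utf-8")
```

Calling trigger_maintenance("http://floe:9091", "demo", "db", "events") from a scheduler task would then mirror the curl command, provided a policy matches demo.db.events.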

Data Flow

    1. Policy created (patterns can cover many tables)
           |
           v
    2. Scheduler triggers (auto-mode) or API call
           |
           v
    3. Table health assessed (metrics from Iceberg metadata)
           |
           v
    4. Health report persisted in TableHealthStore (if enabled)
           |
           v
    5. Orchestrator validates matching policy and table
           |
           v
    6. MaintenancePlanner selects operations from health issues
           |
           v
    7. TriggerEvaluator gates operations (min interval / signals / critical deadline)
           |
           v
    8. MaintenanceDebtScore prioritizes tables (auto-mode)
           |
           v
    9. Engine executes maintenance (Spark or Trino)
           |
           +--> Spark: rewriteDataFiles(), expireSnapshots(),
           |           deleteOrphanFiles(), rewriteManifests()
           |
           +--> Trino: ALTER TABLE ... EXECUTE optimize,
           |           expire_snapshots, remove_orphan_files
           |
           v
    10. Operation recorded in Store
           |
           v
    11. Event emitted (if logging is enabled)
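
Step 7 can be pictured as a small gating function. The following is a minimal sketch under stated assumptions, not Floe's TriggerEvaluator: the names GateConfig and should_run are hypothetical, and the way the three conditions combine (the critical deadline forces a run, the minimum interval blocks a run, otherwise a signal decides) is one plausible reading of "min interval / signals / critical deadline".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class GateConfig:
    min_interval: timedelta       # never run more often than this
    critical_deadline: timedelta  # always run once this much time has passed


def should_run(last_run: Optional[datetime], now: datetime,
               signal_fired: bool, cfg: GateConfig) -> bool:
    """Decide whether an operation may run now (assumed combination of the gates)."""
    if last_run is None:
        # No history: only the health signal decides.
        return signal_fired
    elapsed = now - last_run
    if elapsed >= cfg.critical_deadline:
        # Overdue: run even without a signal.
        return True
    if elapsed < cfg.min_interval:
        # Too soon since the last run: block even if a signal fired.
        return False
    return signal_fired
```

Under these assumptions, an operation that ran ten minutes ago is skipped when min_interval is one hour, while one that last ran eight days ago runs unconditionally when critical_deadline is seven days.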

Prioritization (Auto-Mode)

  • The scheduler computes a maintenance debt score per table using health issues and recent operation outcomes (failure streaks, zero-change runs).
  • Tables with higher debt scores are prioritized when budgets (tables/operations/bytes) are enforced.
  • Throttling and backoff can defer tables after repeated zero-change runs or failures.
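
The interaction between debt scores, budgets, and backoff can be sketched as a greedy selection. This is an illustration under assumptions, not the scheduler's actual algorithm: Candidate, select_tables, the estimated_bytes field, and the deferred flag are hypothetical names, and the debt score is taken as a given input because the real weights are not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    table: str
    debt_score: float       # assumed to be computed elsewhere
    estimated_bytes: int    # hypothetical estimate of work for this table
    deferred: bool = False  # True when throttling or backoff applies


def select_tables(candidates: List[Candidate], max_tables: int, max_bytes: int) -> List[str]:
    """Pick tables in descending debt-score order while respecting both budgets."""
    selected: List[str] = []
    used_bytes = 0
    for c in sorted(candidates, key=lambda c: c.debt_score, reverse=True):
        if c.deferred:
            continue  # backed off after repeated failures or zero-change runs
        if len(selected) >= max_tables:
            break  # table budget exhausted
        if used_bytes + c.estimated_bytes > max_bytes:
            continue  # does not fit in the byte budget; try a smaller table
        selected.append(c.table)
        used_bytes += c.estimated_bytes
    return selected
```

With a budget of two tables and 100 bytes, a high-scoring table that is deferred is skipped, and a table too large for the remaining byte budget yields to a lower-scoring table that fits.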

Note: The debt score weights are currently fixed. Configurable weight parameters will be added in a future release. See Scheduler for current scoring details.

Policy Requirement

Both manual and scheduled triggers require a matching policy. Use table patterns (e.g., demo.test.*) to cover many tables without creating a policy for each one.
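
To illustrate how a single pattern can cover many tables, the sketch below matches fully qualified table names against glob-style patterns with Python's fnmatch module. It assumes shell-style wildcard semantics, which may differ from Floe's actual matching rules; find_matching_policy and the policy names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, Optional


def find_matching_policy(table_id: str, policies: Dict[str, str]) -> Optional[str]:
    """Return the name of the first policy whose pattern matches table_id, else None.

    table_id is "catalog.namespace.table"; policies maps policy name -> pattern.
    """
    for name, pattern in policies.items():
        # fnmatchcase is case-sensitive; "*" matches any run of characters, dots included.
        if fnmatchcase(table_id, pattern):
            return name
    return None


policies = {"test-tables": "demo.test.*"}
print(find_matching_policy("demo.test.events", policies))  # prints: test-tables
print(find_matching_policy("demo.db.events", policies))    # prints: None
```

In this reading, a trigger for demo.db.events is rejected because no policy covers it, while every table under demo.test is covered by one policy.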