Scheduler¶
The Floe scheduler automatically triggers maintenance operations based on policy schedules.
How It Works¶
Every minute, the scheduler:
- Acquires lock - Gets distributed lock (if enabled) to prevent duplicate runs
- Lists policies - Fetches all enabled policies
- Checks schedules - For each policy, checks each operation's schedule
- Finds tables - Lists tables matching the policy pattern
- Triggers maintenance - Runs due operations on matching tables
- Records execution - Stores completion time to calculate next run
The scheduler also applies:
- Budget limits (tables per poll, operations per poll, bytes per hour)
- Backoff after repeated failures
- Throttling after repeated zero-change runs
- Auto-mode prioritization based on maintenance debt score
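The six steps of a poll cycle can be sketched as follows. This is a minimal illustration, not Floe's implementation: the `Policy` class, `poll_once`, the `run_operation` callback, and glob-style pattern matching are all assumptions made for the example, and budgets, backoff, and throttling are left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from fnmatch import fnmatchcase


@dataclass
class Policy:
    """Hypothetical policy model: a table pattern plus per-operation run times."""
    name: str
    table_pattern: str
    enabled: bool = True
    next_run: dict = field(default_factory=dict)  # operation name -> datetime
    last_run: dict = field(default_factory=dict)  # operation name -> datetime


def poll_once(policies, tables, now, run_operation, try_lock=lambda: True):
    """Run one poll cycle; return the (policy, operation, table) triples triggered."""
    if not try_lock():                       # 1. acquire lock, or skip this poll
        return []
    triggered = []
    for policy in policies:
        if not policy.enabled:               # 2. only enabled policies
            continue
        # 3. check each operation's schedule
        due = [op for op, when in policy.next_run.items() if now >= when]
        # 4. find tables matching the policy pattern (glob matching is assumed)
        matching = [t for t in tables if fnmatchcase(t, policy.table_pattern)]
        for op in due:
            for table in matching:
                run_operation(policy.name, op, table)   # 5. trigger maintenance
                triggered.append((policy.name, op, table))
            policy.last_run[op] = now        # 6. record execution time
    return triggered
```

Passing `try_lock=lambda: False` simulates another replica holding the lock, in which case the poll returns without doing any work.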
Maintenance Debt Score¶
The maintenance debt score (0-1000) prioritizes tables in auto-mode when budgets are enforced. It combines:
- Health issues: CRITICAL=100, WARNING=50, INFO=10 points each
- Stale metadata: 0.5 points per day since last snapshot
- Failure rate: Percentage of failed operations (0-100)
- Consecutive failures: 20 points per failure in a row
⚠️ Note: The current weights are initial estimates for general use cases. Configurable weight parameters will be added in a future release to allow tuning based on your environment's priorities.
Example Scores¶
| Scenario | Score | Calculation |
|---|---|---|
| Healthy table, recent data | ~0 | No issues |
| 1 WARNING, 10 days old | 55 | 50 + (10 × 0.5) |
| 1 CRITICAL, 3 consecutive failures | 160 | 100 + (3 × 20) |
| Multiple issues, 50% failure rate | 250+ | Cumulative |
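The example scores can be reproduced with a short calculation. The sketch below assumes that the components are simply summed, that the failure rate contributes one point per percentage point, and that the result is clamped to the 0-1000 range; the function and parameter names are hypothetical.

```python
def maintenance_debt_score(critical=0, warning=0, info=0,
                           days_since_last_snapshot=0.0,
                           failure_rate_percent=0.0,
                           consecutive_failures=0):
    """Combine the documented weights into one score (summation is assumed)."""
    score = (
        critical * 100                     # health issues: CRITICAL
        + warning * 50                     # health issues: WARNING
        + info * 10                        # health issues: INFO
        + days_since_last_snapshot * 0.5   # stale metadata
        + failure_rate_percent             # failure rate (0-100), assumed 1 point per %
        + consecutive_failures * 20        # consecutive failures
    )
    # Clamp to the documented 0-1000 range (clamping is an assumption).
    return max(0.0, min(1000.0, score))
```

For instance, `maintenance_debt_score(warning=1, days_since_last_snapshot=10)` gives 55 and `maintenance_debt_score(critical=1, consecutive_failures=3)` gives 160, matching the second and third rows of the table.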
Configuration¶
# Enable/disable the scheduler
floe.scheduler.enabled=true
# Enable/disable condition-based triggering (default: true)
floe.scheduler.condition-based-triggering-enabled=true
# Enable distributed locking for multi-replica deployments (default: false)
floe.scheduler.distributed-lock-enabled=true
# Budget limits (0 = unlimited)
floe.scheduler.max-tables-per-poll=0
floe.scheduler.max-operations-per-poll=0
floe.scheduler.max-bytes-per-hour=0
# Backoff and throttling
floe.scheduler.failure-backoff-threshold=3
floe.scheduler.failure-backoff-hours=6
floe.scheduler.zero-change-threshold=5
floe.scheduler.zero-change-frequency-reduction-percent=50
floe.scheduler.zero-change-min-interval-hours=6
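One plausible reading of the backoff and throttling settings is sketched below. The property values come from the configuration above, but the semantics are an assumption: a table is skipped for `failure-backoff-hours` once it has failed `failure-backoff-threshold` times in a row, and after `zero-change-threshold` runs that changed nothing, the run frequency drops by the configured percentage, with the interval never shorter than `zero-change-min-interval-hours`. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Values mirror the example configuration above.
FAILURE_BACKOFF_THRESHOLD = 3
FAILURE_BACKOFF_HOURS = 6
ZERO_CHANGE_THRESHOLD = 5
ZERO_CHANGE_FREQUENCY_REDUCTION_PERCENT = 50
ZERO_CHANGE_MIN_INTERVAL_HOURS = 6


def in_failure_backoff(consecutive_failures: int,
                       last_failure: datetime, now: datetime) -> bool:
    """True while a repeatedly failing table should be skipped (assumed semantics)."""
    if consecutive_failures < FAILURE_BACKOFF_THRESHOLD:
        return False
    return now < last_failure + timedelta(hours=FAILURE_BACKOFF_HOURS)


def throttled_interval(base_interval: timedelta,
                       consecutive_zero_change_runs: int) -> timedelta:
    """Stretch the interval after repeated zero-change runs (assumed semantics)."""
    if consecutive_zero_change_runs < ZERO_CHANGE_THRESHOLD:
        return base_interval
    # Reducing frequency by 50% is the same as doubling the interval.
    # A reduction of 100% would divide by zero and needs separate handling.
    keep_fraction = 1 - ZERO_CHANGE_FREQUENCY_REDUCTION_PERCENT / 100
    stretched = base_interval / keep_fraction
    return max(stretched, timedelta(hours=ZERO_CHANGE_MIN_INTERVAL_HOURS))
```

Under this reading, a daily operation that has produced no changes five times in a row would move to a two-day interval.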
Schedule Configuration¶
Each operation in a policy can have its own schedule:
{
  "rewriteDataFilesSchedule": {
    "cronExpression": "0 2 * * *",
    "windowStart": "02:00",
    "windowEnd": "06:00",
    "allowedDays": "MONDAY,WEDNESDAY,FRIDAY",
    "timeout": 4,
    "enabled": true
  }
}
Fields¶
| Field | Type | Description |
|---|---|---|
| cronExpression | string | Cron schedule (e.g., 0 2 * * * = 2 AM daily) |
| interval | int | Alternative: run every N days |
| windowStart | string | Earliest start time (HH:mm) |
| windowEnd | string | Latest start time (HH:mm) |
| allowedDays | string | Comma-separated days (e.g., MONDAY,FRIDAY) |
| timeout | int | Timeout in hours |
| enabled | boolean | Enable/disable this schedule |
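The window and day fields can be read as a gate on when an operation may start. The sketch below checks a timestamp against a schedule shaped like the JSON above; the function name, the treatment of missing fields as "no restriction", and the handling of windows that cross midnight are assumptions, and the time zone in which Floe evaluates the window is not specified here.

```python
from datetime import datetime, time

DAY_NAMES = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY",
             "FRIDAY", "SATURDAY", "SUNDAY"]


def may_start(schedule: dict, now: datetime) -> bool:
    """Return True if `now` satisfies enabled, allowedDays, and the start window."""
    if not schedule.get("enabled", True):
        return False

    allowed = schedule.get("allowedDays")
    if allowed:
        days = {d.strip().upper() for d in allowed.split(",")}
        if DAY_NAMES[now.weekday()] not in days:
            return False

    start, end = schedule.get("windowStart"), schedule.get("windowEnd")
    if start and end:
        earliest, latest = time.fromisoformat(start), time.fromisoformat(end)
        current = now.time()
        if earliest <= latest:
            return earliest <= current <= latest
        # Window crossing midnight, e.g. 22:00 to 02:00 (assumed behavior).
        return current >= earliest or current <= latest
    return True
```

With the example schedule, a call at 03:00 on a Monday passes, while the same time on a Tuesday or 07:00 on a Monday does not.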
Distributed Locking¶
For deployments with multiple Floe replicas, enable distributed locking to ensure only one replica runs the scheduler:
floe.scheduler.distributed-lock-enabled=true
How It Works¶
- Before each poll cycle, the scheduler attempts to acquire the floe-scheduler lock
- Uses PostgreSQL advisory locks (fast, no table overhead)
- If the lock is held by another replica, the poll is skipped
- The lock auto-releases on connection close (crash-safe)
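The locking pattern can be sketched with PostgreSQL's `pg_try_advisory_lock` and `pg_advisory_unlock` functions, which take an integer key and return immediately. The example assumes a DB-API connection with `%s` placeholders (as in psycopg), and deriving the key from the lock name with CRC32 is purely illustrative; how Floe maps `floe-scheduler` to a key is not documented here.

```python
import zlib

LOCK_NAME = "floe-scheduler"


def lock_key(name: str) -> int:
    """Map a lock name to an integer key (CRC32 is an illustrative choice)."""
    return zlib.crc32(name.encode("utf-8"))


def try_acquire(conn, name: str = LOCK_NAME) -> bool:
    """Try to take the session-level advisory lock without blocking."""
    cur = conn.cursor()
    try:
        cur.execute("SELECT pg_try_advisory_lock(%s::bigint)", (lock_key(name),))
        return bool(cur.fetchone()[0])
    finally:
        cur.close()


def release(conn, name: str = LOCK_NAME) -> bool:
    """Release the lock; it is also dropped when the connection closes."""
    cur = conn.cursor()
    try:
        cur.execute("SELECT pg_advisory_unlock(%s::bigint)", (lock_key(name),))
        return bool(cur.fetchone()[0])
    finally:
        cur.close()
```

A replica would call `try_acquire` at the start of a poll and skip the cycle when it returns `False`. Because the lock belongs to the database session, a crashed replica cannot leave it held.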
Requirements¶
- PostgreSQL storage backend (floe.store.type=POSTGRES)
- Multiple replicas sharing the same database
Single-Replica Mode¶
When distributed-lock-enabled=false (default):
- Uses a local AtomicBoolean for concurrency control within the JVM
- Suitable for single-replica deployments
- No external dependencies
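The in-process guard follows the same skip-if-busy idea as the distributed lock. Floe uses a Java `AtomicBoolean` for this; the sketch below is only a Python analogue that uses a non-blocking `threading.Lock`, and the function name is hypothetical.

```python
import threading

# One guard per process: at most one poll runs at a time.
_poll_guard = threading.Lock()


def run_poll_if_idle(poll) -> bool:
    """Run `poll` unless another poll is already in progress in this process."""
    if not _poll_guard.acquire(blocking=False):
        return False          # a poll is already running, skip this one
    try:
        poll()
        return True
    finally:
        _poll_guard.release()
```

A poll that overlaps a running one is dropped rather than queued, so a slow cycle cannot cause a backlog of pending polls.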
Execution Tracking¶
The scheduler tracks when each operation last ran for each table to prevent duplicate executions.
Execution Record¶
Policy: daily-compaction
Operation: REWRITE_DATA_FILES
Table: warehouse.analytics.events
Last Run: 2024-01-15T02:00:00Z
Next Run: 2024-01-16T02:00:00Z
Calculation¶
- Check if operation is due: now >= nextRun
- If cron: parse expression to find next occurrence
- If interval: lastRun + interval
- Apply maintenance window constraints
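For interval-based schedules the calculation reduces to simple date arithmetic. The sketch below covers only the interval case and the due check, using hypothetical function names; cron parsing and maintenance window constraints are omitted.

```python
from datetime import datetime, timedelta, timezone


def next_run_from_interval(last_run: datetime, interval_days: int) -> datetime:
    """Interval schedules: next run is lastRun plus N days."""
    return last_run + timedelta(days=interval_days)


def is_due(now: datetime, next_run: datetime) -> bool:
    """An operation is due once the current time reaches nextRun."""
    return now >= next_run


# Values taken from the execution record above, with a one-day interval.
last_run = datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc)
next_run = next_run_from_interval(last_run, 1)
```

Here `next_run` is 2024-01-16T02:00:00Z, the same value shown as Next Run in the execution record.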
Manual Trigger¶
The scheduler handles automatic execution. To trigger maintenance manually, use the API:
curl -X POST http://localhost:9091/api/v1/maintenance/trigger \
  -H "Content-Type: application/json" \
  -d '{
    "catalog": "warehouse",
    "namespace": "analytics",
    "table": "events"
  }'
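The same request can be sent from a script. This sketch uses only the Python standard library and mirrors the curl call above; the helper names are hypothetical, and since the response format is not described here, the body is returned as raw text.

```python
import json
import urllib.request


def build_trigger_request(catalog, namespace, table,
                          base_url="http://localhost:9091"):
    """Build the POST request for the manual trigger endpoint."""
    body = json.dumps(
        {"catalog": catalog, "namespace": namespace, "table": table}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/v1/maintenance/trigger",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_maintenance(catalog, namespace, table,
                        base_url="http://localhost:9091", timeout=30):
    """Send the request and return (HTTP status, raw response text)."""
    request = build_trigger_request(catalog, namespace, table, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Calling `trigger_maintenance("warehouse", "analytics", "events")` sends the same payload as the curl example. `urlopen` raises `urllib.error.HTTPError` on a non-2xx response, so callers may want to catch it.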
Trigger Status¶
The trigger status check reports whether operations would trigger for a table. It returns per-operation status, including which conditions are met or unmet and the next eligible time.