Critical fixes:
- Enable auto-scaling (was disabled in config)
- Add periodic cache eviction (5m interval) — CacheEvictionManager never ran
- Add periodic rebalance check (30s) — proactive load balancing
- Add 5s timeout to all gRPC calls (batchResubscribe, unsubscribe, evict)
- Use Option instead of null checks (scalafix compliance)
These gaps left the coordinator unable to:
1. Scale up when instances overloaded (scaling was disabled)
2. Clean up idle games from memory (no scheduled eviction)
3. Rebalance load proactively (only on scale-up)
4. Handle hung instances (no RPC timeouts, operations could hang forever)
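The timeout fix in point 4 can be sketched with Mutiny's `ifNoItem()` operator; the stub and message types below are illustrative stand-ins, not the coordinator's actual gRPC classes:

```scala
import java.time.Duration
import io.smallrye.mutiny.Uni

// Hypothetical Mutiny-based gRPC stub; `stub` and `request` stand in for the
// real generated types. ifNoItem().after(...).fail() completes the Uni with a
// TimeoutException if the instance never replies, so a hung instance can no
// longer block the coordinator forever.
def unsubscribeWithTimeout(stub: GameInstanceStub,
                           request: UnsubscribeRequest): Uni[UnsubscribeReply] =
  stub.unsubscribe(request)
    .ifNoItem().after(Duration.ofSeconds(5))
    .fail()
```

The same wrapper pattern applies to batchResubscribe and evict calls.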
Combined with prior fixes for instance metadata parsing and heartbeat TTL,
the coordinator now handles overload scenarios correctly.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Add the quarkus-scheduler dependency and schedule the health check every 10 seconds.
Dead instances (marked with state="DEAD") are now automatically evicted instead of
accumulating indefinitely.
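With the quarkus-scheduler extension on the classpath, the periodic check looks roughly like this; the class shape and registry API are assumptions for illustration, not the actual coordinator code:

```scala
import jakarta.enterprise.context.ApplicationScoped
import io.quarkus.scheduler.Scheduled

@ApplicationScoped
class HealthMonitor(registry: InstanceRegistry) {

  // Fires every 10 seconds once quarkus-scheduler is a dependency.
  @Scheduled(every = "10s")
  def checkInstances(): Unit =
    registry.allInstances().foreach { instance =>
      // Evict instances that marked themselves dead, in addition to
      // the existing heartbeat-age eviction.
      if (instance.state == "DEAD") registry.evict(instance.id)
    }
}
```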
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Problem: Dead instances pile up indefinitely. Failed metadata parsing leaves stale data in the registry. No cleanup mechanism exists.
Changes:
1. Remove instance from registry on parse failure (corrupted metadata = unrecoverable)
2. Evict instances with state="DEAD" on next health check (was only evicting by heartbeat age)
This prevents:
- Memory leak from accumulating dead/corrupted instances
- Stale data persisting after parse failures
- Dead instances blocking resources indefinitely
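A minimal sketch of changes 1 and 2, assuming hypothetical `Instance`, `parseMetadata`, and registry helpers (the real names may differ):

```scala
// Refresh one registry entry from its raw Redis metadata. Both failure modes
// described above end in eviction rather than leaving stale state behind.
def refresh(instanceId: String, rawMetadata: String): Option[Instance] =
  parseMetadata(rawMetadata) match {
    case Some(instance) if instance.state == "DEAD" =>
      registry.evict(instanceId) // change 2: DEAD state evicts on health check
      None
    case Some(instance) =>
      Some(instance)
    case None =>
      // change 1: corrupted metadata is unrecoverable, so drop the entry
      // instead of keeping stale data in the registry.
      registry.evict(instanceId)
      None
  }
```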
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Use Option instead of null checks in HealthMonitor and InstanceRegistry,
per the Scalafix DisableSyntax rule.
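The pattern this rule pushes toward can be sketched as follows; `redis.get` stands in for whichever nullable Java API the registry wraps:

```scala
// Option(x) lifts a possibly-null Java result into Some/None, which is the
// replacement DisableSyntax expects for explicit `!= null` checks.
def findInstance(id: String): Option[InstanceRecord] =
  Option(redis.get(id)).map(parseRecord)

// Call sites pattern-match instead of comparing against null:
findInstance("game-7") match {
  case Some(record) => useRecord(record)
  case None         => handleMissing(id)
}
```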
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
On startup, load all known instances from Redis and wait 15s for them to
reconnect via gRPC. Evict instances that don't reconnect within the timeout
and delete their K8s pods.
Replace one-shot pod status check with real fabric8 Watch. On pod Terminating
event, mark instance dead. On pod Deleted event, trigger failover. Failover
now waits reactively for at least one healthy instance before distributing
orphaned games, up to a 30s timeout.
- Add startupValidationTimeout and failoverWaitTimeout config (15s, 30s)
- CoordinatorGrpcServer tracks active gRPC streams
- InstanceRegistry.loadAllFromRedis() scans and loads instances on startup
- HealthMonitor startup observer validates instances and starts K8s watch
- FailoverService.onInstanceStreamDropped returns Uni[Unit] for reactive wait
- All failover service callers updated to subscribe to Uni result
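The reactive wait in onInstanceStreamDropped can be sketched with Mutiny as below; `awaitHealthyInstance`, `registry.firstHealthy`, and `redistributeOrphanedGames` are illustrative names under the assumptions above, not the actual method signatures:

```scala
import java.time.Duration
import io.smallrye.mutiny.Uni

// Poll the registry until at least one healthy instance appears,
// re-checking every 500ms.
def awaitHealthyInstance(): Uni[Instance] =
  Uni.createFrom().item(() => registry.firstHealthy())
    .onItem().transformToUni {
      case Some(inst) => Uni.createFrom().item(inst)
      case None =>
        Uni.createFrom().item(())
          .onItem().delayIt().by(Duration.ofMillis(500))
          .flatMap(_ => awaitHealthyInstance())
    }

// Returns Uni[Unit] so callers can subscribe and observe completion/failure.
// The outer timeout bounds the whole wait at 30s, matching failoverWaitTimeout.
def onInstanceStreamDropped(deadInstanceId: String): Uni[Unit] =
  awaitHealthyInstance()
    .ifNoItem().after(Duration.ofSeconds(30)).fail()
    .flatMap(healthy => redistributeOrphanedGames(deadInstanceId, healthy))
```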
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
fabric8 disallows client.resources(classOf[GenericKubernetesResource]) — throws
KubernetesClientException at runtime. Switch to genericKubernetesResources(apiVersion, kind)
which is the correct API for CRDs.
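The corrected fabric8 usage looks roughly like this; the apiVersion, kind, and namespace values are illustrative, not the project's actual CRD coordinates:

```scala
import io.fabric8.kubernetes.client.KubernetesClientBuilder

val client = new KubernetesClientBuilder().build()

// genericKubernetesResources(apiVersion, kind) is fabric8's supported entry
// point for CRDs without a typed model class, unlike
// resources(classOf[GenericKubernetesResource]), which throws
// KubernetesClientException at runtime.
val games = client
  .genericKubernetesResources("example.com/v1", "GameInstance")
  .inNamespace("default")
  .list()
```

The same handle also supports `watch(...)`, which is what the pod/CR watch described above builds on.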
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Reviewed-on: #43
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>