fix: coordinator auto-scaling, cache eviction, rebalancing, and gRPC timeouts

Critical fixes:
- Enable auto-scaling (was disabled in config)
- Add periodic cache eviction (5m interval); CacheEvictionManager previously never ran
- Add periodic rebalance check (every 30s) for proactive load balancing
- Add 5s timeout to all gRPC calls (batchResubscribe, unsubscribe, evict)
- Use Option instead of null checks (scalafix compliance)
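
For the Option change, a minimal sketch of the pattern, not the actual
coordinator code. `lookupInstance` and the game/instance names are
hypothetical stand-ins for any Java-style call that may return null:

```scala
object OptionSketch {
  // Hypothetical Java-style lookup that returns null when nothing is found.
  def lookupInstance(gameId: String): String =
    if (gameId == "game-1") "instance-a" else null

  // Option(...) maps null to None and any other value to Some(value),
  // so callers never need an explicit null check.
  def findInstance(gameId: String): Option[String] =
    Option(lookupInstance(gameId))

  val known: String = findInstance("game-1").getOrElse("unassigned")
  val unknown: String = findInstance("game-2").getOrElse("unassigned")
}
```

Wrapping at the boundary keeps null out of the rest of the code, which is
what scalafix null-related rules typically ask for.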

These gaps left the coordinator unable to:
1. Scale up when instances were overloaded (scaling was disabled)
2. Clean up idle games from memory (no scheduled eviction)
3. Rebalance load proactively (rebalancing ran only on scale-up)
4. Handle hung instances (no RPC timeouts, operations could hang forever)
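
To illustrate point 4, a sketch of bounding a call so a hung instance
cannot block the caller forever. It uses only the Scala standard library;
`callWithTimeout` is a hypothetical helper, and the commit itself may
instead set a per-call gRPC deadline (grpc-java stubs offer
`withDeadlineAfter` for that):

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.util.Try

object TimeoutSketch {
  // Wait at most `timeout` for the call to finish. A call that does not
  // complete in time yields Failure(java.util.concurrent.TimeoutException).
  def callWithTimeout[A](call: => Future[A],
                         timeout: FiniteDuration = 5.seconds): Try[A] =
    Try(Await.result(call, timeout))

  // A call that completes immediately.
  val ok: Try[String] = callWithTimeout(Future.successful("evicted"))

  // A Promise that is never completed simulates a hung instance;
  // a short timeout keeps the example fast.
  val hung: Try[String] = callWithTimeout(Promise[String]().future, 50.millis)
}
```

Note that this only stops the caller from waiting; it does not cancel the
underlying call, which is one reason a transport-level deadline is usually
preferable for real RPCs.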

Combined with prior fixes for instance metadata parsing and heartbeat TTL,
the coordinator now handles overload scenarios correctly.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 22:20:25 +02:00
parent 3f12f695f1
commit d0c71693bb
7 changed files with 26 additions and 9 deletions
@@ -37,7 +37,7 @@ nowchess:
 stream-heartbeat-interval: PT0.2S
 cache-eviction-interval: 10m
 game-idle-threshold: 45m
-auto-scale-enabled: false
+auto-scale-enabled: true
 scale-up-threshold: 0.8
 scale-down-threshold: 0.3
 scale-min-replicas: 2