Commit Graph

232 Commits

Author SHA1 Message Date
TeamCity bcd8257db2 ci: bump version with Build-101 store-0.19.0 2026-05-19 09:06:09 +00:00
Janis d61fe97b4c feat(redis): add @Startup annotation to GameWritebackStreamListener
Build & Test (NowChessSystems) TeamCity build finished
2026-05-19 10:44:51 +02:00
TeamCity 959bb53335 ci: bump version with Build-100 store-0.18.0 2026-05-19 08:13:51 +00:00
Janis b610678005 fix(redis): add log message for starting Writeback listener
Build & Test (NowChessSystems) TeamCity build finished
2026-05-19 09:48:21 +02:00
TeamCity d0552b08b5 ci: bump version with Build-99 core-0.45.0 2026-05-19 07:15:31 +00:00
Janis 87f29a7204 feat(config): add GameWritebackEventDto to reflection targets
Build & Test (NowChessSystems) TeamCity build finished
2026-05-19 08:49:48 +02:00
TeamCity cb44f491bd ci: bump version with Build-98 core-0.44.0 2026-05-18 21:25:03 +00:00
Janis f5614c3582 fix(core): add logs to trace subscribeGame call in createGame
Build & Test (NowChessSystems) TeamCity build finished
Track whether subscribeGame is being called and completing successfully
to diagnose empty game-writeback messages.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-18 22:55:33 +02:00
TeamCity 9d960d3ee5 ci: bump version with Build-97 coordinator-0.32.0 2026-05-18 18:58:53 +00:00
Janis a9f4606b40 feat: force delete pod immediately on heartbeat loss
Build & Test (NowChessSystems) TeamCity build finished
When an instance stream drops, immediately force-delete the K8s pod (grace period 0), without waiting for health checks or pod watch events.

Reduces failover latency and ensures stale pods don't linger.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-18 20:36:26 +02:00
Janis 32a12737e3 refactor: resource-based scaling only, remove health-check triggered scaling
Scale up: only if resource constrained (CPU/memory)
Scale down: only if NOT resource constrained AND game load low
Remove: triggering scale-up on unexpected instance failures
Keep: health monitoring (mark dead, delete pod, failover games) but no scaling

Prevents cascade scaling from transient health check failures.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-18 20:36:23 +02:00
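
The scaling rules listed in the commit above can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the repository's code: `ClusterSnapshot`, `ResourceBasedScaling`, and `lowLoadThreshold` are hypothetical names, and "game load low" is modelled as an average below a threshold.

```scala
// Hypothetical types; none of these names come from the repository.
sealed trait ScalingDecision
case object ScaleUp extends ScalingDecision
case object ScaleDown extends ScalingDecision
case object NoChange extends ScalingDecision

final case class ClusterSnapshot(
    resourceConstrained: Boolean, // CPU or memory above its configured threshold
    avgGamesPerInstance: Double,  // assumed measure of "game load"
    unhealthyInstances: Int       // used for failover only, never for scaling
)

object ResourceBasedScaling {
  // Scale up only when resource constrained; scale down only when not
  // constrained AND load is low. Health failures never change the decision.
  def decide(s: ClusterSnapshot, lowLoadThreshold: Double): ScalingDecision =
    if (s.resourceConstrained) ScaleUp
    else if (s.avgGamesPerInstance < lowLoadThreshold) ScaleDown
    else NoChange
}
```

Note that `unhealthyInstances` is deliberately ignored by `decide`, which is what keeps a burst of transient health check failures from causing cascade scaling.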
TeamCity b4c75e2a0f ci: bump version with Build-96 coordinator-0.31.0 2026-05-17 20:01:45 +00:00
Janis 9bf995f47d fix: revert pod matching to original logic instanceId.contains(podName)
Build & Test (NowChessSystems) TeamCity build finished
Previous commits accidentally flipped the pod matching direction from the correct instanceId.contains(podName) to the incorrect podName.contains(instanceId), causing all health checks to fail.

Reverted all 3 locations to original working logic.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-17 21:35:16 +02:00
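
The direction of the substring check matters, as the revert above shows. This sketch assumes, purely for illustration, that an instance ID embeds the pod name followed by a suffix. The commit log does not confirm the actual ID format, and `PodMatching` and the sample values are hypothetical.

```scala
object PodMatching {
  // Direction restored by the commit: the instance ID contains the pod name.
  def matchesPod(instanceId: String, podName: String): Boolean =
    instanceId.contains(podName)

  // The flipped direction that the commit describes as incorrect.
  def flippedMatch(instanceId: String, podName: String): Boolean =
    podName.contains(instanceId)
}
```

With the assumed values `instanceId = "nowchess-core-0-a1b2c3d4"` and `podName = "nowchess-core-0"`, `matchesPod` succeeds and `flippedMatch` fails. Plain substring matching can still over-match: under the same assumption, an ID starting with `nowchess-core-10` also contains the pod name `nowchess-core-1`.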
TeamCity 8df418627c ci: bump version with Build-95 coordinator-0.30.0 2026-05-17 18:53:07 +00:00
Janis 6dbe1e62ac fix: correct pod matching logic from endsWith to contains
Build & Test (NowChessSystems) TeamCity build finished
Pod matching used endsWith(instanceId), which failed to match any pods because instanceId is a randomly generated 8-char string, not a pod name suffix. All instances were marked dead, causing cascading health check failures.

Changed to podName.contains(instanceId) to match an instanceId embedded anywhere in the pod name. This reverts the incomplete fix from the previous commit.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-17 20:26:08 +02:00
TeamCity 6311d8fd00 ci: bump version with Build-94 coordinator-0.29.0 core-0.43.0 store-0.17.0 2026-05-17 17:07:55 +00:00
Janis 5205468534 fix: remove redundant line break in LoadBalancer.scala for improved readability
Build & Test (NowChessSystems) TeamCity build finished
2026-05-17 18:41:35 +02:00
Janis 4ec5b931de feat: integrate auto-scaler to handle instance health checks and scaling
Build & Test (NowChessSystems) TeamCity build failed
2026-05-17 18:01:55 +02:00
Janis ebba729af3 fix: ensure full hierarchy registration for reflection in NativeReflectionConfig 2026-05-17 17:59:54 +02:00
Janis 5619c8223a fix: resolve 6 coordinator bugs (cache eviction, rebalance race, pod matching, lookup inefficiency)
- Add lastUpdatedMs timestamp to GameCacheDto to track actual game updates instead of heartbeat time. Fix cache eviction incorrectly marking correspondence games as idle.
- Use atomic SPOP in LoadBalancer.getGamesToMove() to prevent concurrent rebalance calls from selecting same games for migration.
- Add game→instance reverse mapping (nowchess:game:$gameId:instance) to eliminate O(instances) linear scan during cache eviction.
- Fix HealthMonitor pod matching from loose contains() to reliable endsWith() to prevent matching unintended pods with similar names.
- Update FailoverService to maintain game→instance mappings when migrating games during failover.
- Update CacheEvictionManager to use game→instance mapping for O(1) lookup instead of O(n) instance scan.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-17 17:07:29 +02:00
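
The game→instance reverse mapping described above replaces a scan over all instances with a single key lookup. A minimal sketch follows, using an in-memory map as a stand-in for Redis string keys. Only the key pattern `nowchess:game:$gameId:instance` comes from the commit message; `GameIndex` and its methods are hypothetical.

```scala
import scala.collection.mutable

// In-memory stand-in for Redis; a real implementation would issue GET/SET/DEL.
final class GameIndex {
  private val kv = mutable.Map.empty[String, String]

  // Key pattern quoted in the commit message.
  private def instanceKey(gameId: String): String =
    s"nowchess:game:$gameId:instance"

  // Called when a game is placed on, or migrated to, an instance.
  def assign(gameId: String, instanceId: String): Unit =
    kv(instanceKey(gameId)) = instanceId

  // Single key lookup instead of scanning every instance's game set.
  def instanceFor(gameId: String): Option[String] =
    kv.get(instanceKey(gameId))

  // Called on eviction so the mapping does not outlive the game.
  def remove(gameId: String): Unit = {
    kv.remove(instanceKey(gameId))
    ()
  }
}
```

The mapping stays correct only if every migration path updates it, which is why the commit also touches FailoverService.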
Janis 2d76c001fe fix: refresh Redis TTL on instance heartbeat to prevent false DEAD marking
Instances were being incorrectly marked DEAD because their Redis key TTL was
not being refreshed on heartbeat. HealthMonitor.checkRedisHeartbeat() checks
pttl > 0, which fails when the TTL expires even if the instance is alive and
sending regular heartbeats.

Now pexpire(key, heartbeatTtl) is called on each heartbeat to keep the key
alive. Prevents scaling messages from undercounting healthy instances.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-17 14:50:34 +02:00
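
The failure mode in the commit above can be reproduced without a Redis server. The sketch below uses a hypothetical in-memory store, `FakeTtlStore`, that imitates only the behaviour of PEXPIRE and PTTL needed here; `Heartbeat` and the key name used in the example are likewise assumptions, and the clock is injected so the behaviour is deterministic.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Redis; a real client would replace this class.
final class FakeTtlStore(now: () => Long) {
  private val expiresAt = mutable.Map.empty[String, Long]

  def setWithTtl(key: String, ttlMs: Long): Unit =
    expiresAt(key) = now() + ttlMs

  // Like PTTL for keys that carry a TTL: remaining ms, or -2 if missing or expired.
  def pttl(key: String): Long =
    expiresAt.get(key).map(_ - now()).filter(_ > 0).getOrElse(-2L)

  // Like PEXPIRE: resets the TTL only if the key still exists.
  def pexpire(key: String, ttlMs: Long): Boolean =
    if (pttl(key) > 0) { expiresAt(key) = now() + ttlMs; true }
    else false
}

object Heartbeat {
  // Refresh the TTL on every heartbeat so a live instance never expires.
  def onHeartbeat(store: FakeTtlStore, key: String, heartbeatTtlMs: Long): Boolean =
    store.pexpire(key, heartbeatTtlMs)

  // Same liveness rule the commit attributes to checkRedisHeartbeat: pttl > 0.
  def isAlive(store: FakeTtlStore, key: String): Boolean =
    store.pttl(key) > 0
}
```

For example, a key set with a 1000 ms TTL at time 0 and refreshed at time 800 is still alive at time 1500. Without the refresh it would have expired at time 1000 even though the instance kept sending heartbeats.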
TeamCity b58bbbc782 ci: bump version with Build-93 coordinator-0.28.0 2026-05-17 12:13:55 +00:00
Janis 1a02f9e186 fix: remove unused clearDrainingByPodName method and update HealthMonitor to clear draining instances
Build & Test (NowChessSystems) TeamCity build finished
2026-05-17 13:49:15 +02:00
TeamCity 255f43ddda ci: bump version with Build-92 coordinator-0.27.0 2026-05-16 13:59:43 +00:00
Janis f109fe3860 fix: improve pod instance ID matching logic in AutoScaler and HealthMonitor
Build & Test (NowChessSystems) TeamCity build finished
2026-05-16 15:45:41 +02:00
Janis a07bf89fae feat: add configurable CPU and memory scaling thresholds for auto-scaling
Build & Test (NowChessSystems) TeamCity build finished
2026-05-16 15:22:56 +02:00
TeamCity b47ad7ef89 ci: bump version with Build-90 coordinator-0.26.0 2026-05-16 13:15:48 +00:00
Janis 2e4ba43597 feat: implement clock expiry scanning and handling for game records (#54)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #54
2026-05-16 15:09:04 +02:00
TeamCity b0d27d2de2 ci: bump version with Build-89 coordinator-0.25.0 core-0.42.0 store-0.16.0 2026-05-16 11:31:37 +00:00
Janis 8f9eb12f66 feat: implement clock expiry scanning and handling for game records (#53)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #53
2026-05-16 13:24:48 +02:00
TeamCity 5d5fffa812 ci: bump version with Build-88 account-0.16.0 coordinator-0.24.0 core-0.41.0 store-0.15.0 2026-05-16 10:07:11 +00:00
Janis 73239088d9 fix: NCS-85 Database Writeback fails without Logs (#52)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #52
2026-05-16 11:41:56 +02:00
Janis 4ad92ab236 fix: NCS-84 More Verbose Logging (#51)
Build & Test (NowChessSystems) TeamCity build failed
Reviewed-on: #51
2026-05-16 11:20:05 +02:00
TeamCity c65a1393b9 ci: bump version with Build-87 coordinator-0.23.0 2026-05-14 09:04:57 +00:00
Janis 4a36096a55 fix: linter formatting and improve code readability
Build & Test (NowChessSystems) TeamCity build finished
2026-05-14 10:47:40 +02:00
Janis 960a419792 fix: force-delete hanging pods and remove failed instances from registry
Build & Test (NowChessSystems) TeamCity build failed
When pod deletion fails, instances remained in registry with state=DEAD,
preventing scale-down since avgLoad calculation counted them. Now:

- Use gracePeriod(0) for immediate pod deletion instead of 30s wait
  (prevents cascade when nodes are down or pods stuck terminating)
- Remove instance from registry on deletion failure anyway
  (prevents dead instances from blocking scale-down via avgLoad)

This breaks the cycle: failed deletions → scaleUp → max replicas →
more failures → more stuck instances blocking recovery.
2026-05-14 09:57:29 +02:00
TeamCity 68d6c1d36f ci: bump version with Build-86 coordinator-0.22.0 2026-05-13 22:24:05 +00:00
Janis b991878214 fix: scalafix violations in metrics check and health monitor
Build & Test (NowChessSystems) TeamCity build finished
- Wrap asInstanceOf casts with scalafix:off/on in isResourceConstrained
- Replace var with immutable List in HealthMonitor.checkInstanceHealth

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 23:59:47 +02:00
Janis 43525d41a3 fix: scale up immediately when instance is lost
Build & Test (NowChessSystems) TeamCity build failed
When an instance is evicted or fails health check, immediately trigger scale-up
to replace the lost capacity. Don't wait for the next scheduled scale check.

HealthMonitor now calls autoScaler.scaleUp() when:
1. Stale instances are evicted
2. Instance fails health check and is marked dead

Ensures quick recovery from instance loss.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 23:50:33 +02:00
Janis 6bf1013710 style: format code for improved readability and consistency
Build & Test (NowChessSystems) TeamCity build failed
2026-05-13 23:46:01 +02:00
Janis 255e2da33c feat: scale up on high CPU load, not just subscription count
AutoScaler now checks K8s pod metrics (CPU) in addition to subscription count.
Scale-up triggers if:
1. avgLoad > scaleUpThreshold * maxGamesPerCore, OR
2. Any instance has CPU > 800m

Fixes the scenario where an instance under heavy CPU load would not trigger scale-up without a high subscription count.
The AutoScaler now responds to compute utilization, not just game count.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 23:46:01 +02:00
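
The two scale-up conditions from the commit above combine into one predicate. This is a sketch under assumptions: `InstanceLoad` and `ScaleUpRule` are hypothetical names, `avgLoad` is taken to be the mean subscription count per instance, and an empty instance list is treated as "do not scale". Only the formula and the 800m limit come from the commit message.

```scala
// Hypothetical view of one instance's load.
final case class InstanceLoad(subscriptions: Int, cpuMillicores: Long)

object ScaleUpRule {
  // 800m CPU limit quoted in the commit message.
  val CpuLimitMillicores: Long = 800L

  def shouldScaleUp(
      instances: Seq[InstanceLoad],
      scaleUpThreshold: Double,
      maxGamesPerCore: Int
  ): Boolean =
    if (instances.isEmpty) false // assumption: nothing to measure, no scale-up
    else {
      // Assumption: avgLoad is the mean subscription count per instance.
      val avgLoad = instances.map(_.subscriptions).sum.toDouble / instances.size
      avgLoad > scaleUpThreshold * maxGamesPerCore ||
      instances.exists(_.cpuMillicores > CpuLimitMillicores)
    }
}
```

With `scaleUpThreshold = 0.8` and `maxGamesPerCore = 100`, a single instance with 10 subscriptions at 900m CPU triggers scale-up through the second condition alone, which is the case the commit set out to fix.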
Janis 4b3b5e7c4e fix: don't trigger scale-down if already at min replicas
checkAndScale was deciding to scale down based on avgLoad without checking
whether the deployment was already at minimum replicas. This caused unnecessary
scale-down attempts that failed and produced log noise.

Add check: only scale-down if instances.size > scaleMinReplicas

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 23:46:01 +02:00
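
The added guard amounts to one extra conjunct in the scale-down condition. A minimal sketch follows, where `scaleMinReplicas` is the name used in the commit message and `ScaleDownGuard` and `scaleDownThreshold` are hypothetical:

```scala
object ScaleDownGuard {
  // Scale down only when above the replica floor AND load is low.
  def shouldScaleDown(
      instanceCount: Int,
      scaleMinReplicas: Int,
      avgLoad: Double,
      scaleDownThreshold: Double // hypothetical name for the low-load cutoff
  ): Boolean =
    instanceCount > scaleMinReplicas && avgLoad < scaleDownThreshold
}
```

Checking the replica floor first means a low `avgLoad` at minimum replicas produces no scale-down attempt at all, rather than an attempt that fails later.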
Janis 1d121c727c fix: don't block event loop during scale-down drain
Scale-down was calling failoverService.onInstanceStreamDropped synchronously
and waiting for it to complete. Failover retries for up to 30s waiting for
healthy instances, which blocks the Quarkus event loop thread.

This caused:
- Event loop blocked for 15+ seconds
- Redis health checks timing out (also on event loop)
- Scale-down operations failing

Fix: Trigger drain asynchronously without waiting. Scale-down proceeds
immediately while drain happens in background.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 23:46:00 +02:00
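
The fix above is a fire-and-forget pattern: start the drain, attach a failure handler, and return without waiting. The sketch below illustrates it with `scala.concurrent.Future`. That choice is an assumption made for a self-contained example; the commit does not say which async primitive the coordinator uses, and `AsyncDrain`, `drain`, and `onDrainFailure` are hypothetical names.

```scala
import scala.concurrent.{ExecutionContext, Future}

object AsyncDrain {
  // Starts the drain and returns immediately. The caller's thread never
  // waits on the drain; failures are reported through the callback only.
  def scaleDown(
      instanceId: String,
      drain: String => Future[Unit],      // hypothetical non-blocking drain call
      onDrainFailure: Throwable => Unit   // e.g. log the error
  )(implicit ec: ExecutionContext): String = {
    drain(instanceId).failed.foreach(onDrainFailure)
    s"scale-down started for $instanceId"
  }
}
```

This only helps if `drain` itself is non-blocking. A function that performs the 30s retry loop before returning its `Future` would still block the calling thread.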
TeamCity 56f0030a83 ci: bump version with Build-85 coordinator-0.21.0 core-0.40.0 2026-05-13 20:48:35 +00:00
Janis d0c71693bb fix: coordinator auto-scaling, cache eviction, rebalancing, and grpc timeouts
Build & Test (NowChessSystems) TeamCity build finished
Critical fixes:
- Enable auto-scaling (was disabled in config)
- Add periodic cache eviction (5m interval) — CacheEvictionManager never ran
- Add periodic rebalance check (30s) — proactive load balancing
- Add 5s timeout to all gRPC calls (batchResubscribe, unsubscribe, evict)
- Use Option instead of null checks (scalafix compliance)

These gaps left the coordinator unable to:
1. Scale up when instances overloaded (scaling was disabled)
2. Clean up idle games from memory (no scheduled eviction)
3. Rebalance load proactively (only on scale-up)
4. Handle hung instances (no RPC timeouts, operations could hang forever)

Combined with prior fixes for instance metadata parsing and heartbeat TTL,
the coordinator now handles overload scenarios correctly.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 22:20:25 +02:00
Janis 3f12f695f1 feat: implement periodic scaling checks and enhance instance management in AutoScaler
Build & Test (NowChessSystems) TeamCity build failed
2026-05-13 22:08:22 +02:00
TeamCity 0a3c494fa8 ci: bump version with Build-84 core-0.39.0 2026-05-13 19:32:12 +00:00
Janis f7ce4df595 fix: update documentation to reflect new functions in CoordinatorGrpcServer and InstanceRegistry
Build & Test (NowChessSystems) TeamCity build finished
2026-05-13 21:07:44 +02:00
TeamCity d41c03700c ci: bump version with Build-83 coordinator-0.20.0 2026-05-13 17:06:04 +00:00
Janis 10937e756a fix: streamline logging for evicted instances in InstanceRegistry
Build & Test (NowChessSystems) TeamCity build finished
2026-05-13 18:39:32 +02:00