Commit Graph

12 Commits

Author SHA1 Message Date
Janis 2e4ba43597 feat: implement clock expiry scanning and handling for game records (#54)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #54
2026-05-16 15:09:04 +02:00
Janis 255e2da33c feat: scale up on high CPU load, not just subscription count
AutoScaler now checks K8s pod metrics (CPU) in addition to subscription count.
Scale-up triggers if:
1. avgLoad > scaleUpThreshold * maxGamesPerCore, OR
2. Any instance has CPU > 800m

Fixes the scenario where an instance under heavy CPU load would not trigger
a scale-up unless its subscription count was also high. The AutoScaler now
responds to compute utilization, not just game count.
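
The two scale-up conditions above can be sketched as a pure decision function. This is a minimal illustration, not the AutoScaler's actual code: `InstanceLoad`, `ScalingConfig`, and `shouldScaleUp` are hypothetical names, and `avgLoad` is assumed to mean the average subscription count per instance.

```scala
object ScaleUpSketch {
  // Hypothetical per-instance snapshot; field names are illustrative only.
  final case class InstanceLoad(subscriptions: Int, cpuMillicores: Long)

  // Hypothetical settings; scaleUpThreshold and maxGamesPerCore mirror the
  // names in the commit message, the CPU limit mirrors "CPU > 800m".
  final case class ScalingConfig(
      scaleUpThreshold: Double,
      maxGamesPerCore: Int,
      cpuLimitMillicores: Long = 800L
  )

  def shouldScaleUp(instances: Seq[InstanceLoad], cfg: ScalingConfig): Boolean =
    if (instances.isEmpty) false
    else {
      // Assumption: avgLoad is the mean subscription count across instances.
      val avgLoad = instances.map(_.subscriptions).sum.toDouble / instances.size
      // Condition 1: average load exceeds the configured fraction of capacity.
      val subscriptionPressure = avgLoad > cfg.scaleUpThreshold * cfg.maxGamesPerCore
      // Condition 2: any single instance is above the CPU limit.
      val cpuPressure = instances.exists(_.cpuMillicores > cfg.cpuLimitMillicores)
      subscriptionPressure || cpuPressure
    }
}
```

Because the two conditions are combined with a logical OR, a single CPU-bound instance is enough to trigger scale-up even when the average subscription count is low.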

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 23:46:01 +02:00
Janis d0c71693bb fix: coordinator auto-scaling, cache eviction, rebalancing, and grpc timeouts
Build & Test (NowChessSystems) TeamCity build finished
Critical fixes:
- Enable auto-scaling (was disabled in config)
- Add periodic cache eviction (5m interval) — CacheEvictionManager never ran
- Add periodic rebalance check (30s) — proactive load balancing
- Add 5s timeout to all gRPC calls (batchResubscribe, unsubscribe, evict)
- Use Option instead of null checks (scalafix compliance)

These gaps left the coordinator unable to:
1. Scale up when instances were overloaded (scaling was disabled)
2. Clean up idle games from memory (no scheduled eviction)
3. Rebalance load proactively (only on scale-up)
4. Handle hung instances (no RPC timeouts, operations could hang forever)

Combined with prior fixes for instance metadata parsing and heartbeat TTL,
the coordinator now handles overload scenarios correctly.
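
The effect of bounding each RPC with a timeout can be illustrated with a small stand-in. This sketch does not use gRPC and is not the coordinator's code: `withTimeout` is a hypothetical helper built on `CompletableFuture.orTimeout` (Java 9+), and the real change may instead rely on a gRPC deadline or a reactive timeout operator.

```scala
import java.util.concurrent.{CompletableFuture, TimeUnit}

object RpcTimeoutSketch {
  // Hypothetical helper: bounds an asynchronous call so that a hung peer
  // fails the future with a TimeoutException instead of blocking forever.
  def withTimeout[A](call: CompletableFuture[A], timeoutMillis: Long): CompletableFuture[A] =
    call.orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
}
```

A caller would wrap each outgoing call, for example `withTimeout(pendingCall, 5000L)` for the 5s limit named above, and treat the resulting failure as a signal that the instance may be unhealthy.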

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 22:20:25 +02:00
Janis 81b045d01b feat: add coordinator startup validation and K8s pod watch
Build & Test (NowChessSystems) TeamCity build failed
On startup, load all known instances from Redis and wait 15s for them to
reconnect via gRPC. Evict instances that don't reconnect within the timeout
and delete their K8s pods.

Replace one-shot pod status check with real fabric8 Watch. On pod Terminating
event, mark instance dead. On pod Deleted event, trigger failover. Failover
now waits reactively for at least one healthy instance before distributing
orphaned games, up to 30s timeout.

- Add startupValidationTimeout and failoverWaitTimeout config (15s, 30s)
- CoordinatorGrpcServer tracks active gRPC streams
- InstanceRegistry.loadAllFromRedis() scans and loads instances on startup
- HealthMonitor startup observer validates instances and starts K8s watch
- FailoverService.onInstanceStreamDropped returns Uni[Unit] for reactive wait
- All failover service callers updated to subscribe to Uni result
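
The startup validation described above reduces to partitioning the known instances by whether they reconnected in time. The following is a minimal sketch under that reading; `ValidationResult` and `validate` are hypothetical names, instance ids are assumed to be plain strings, and the waiting, Redis access, and pod deletion are deliberately left out.

```scala
object StartupValidationSketch {
  // Outcome of the startup check: instances that stay registered and
  // instances that should be evicted (and, per the commit, have their pods deleted).
  final case class ValidationResult(kept: Set[String], evicted: Set[String])

  // knownFromRedis: instance ids loaded at startup (hypothetical representation).
  // reconnected: ids whose gRPC stream came back before the timeout elapsed.
  def validate(knownFromRedis: Set[String], reconnected: Set[String]): ValidationResult = {
    val (kept, evicted) = knownFromRedis.partition(reconnected.contains)
    ValidationResult(kept, evicted)
  }
}
```

Ids that reconnect but were never stored in Redis are ignored by this function; how the coordinator treats such instances is not stated in the commit message.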

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 09:55:38 +02:00
Janis 3904d5ad8a feat: add OpenTelemetry trace configuration with parentbased sampler
Build & Test (NowChessSystems) TeamCity build finished
2026-05-12 19:00:08 +02:00
Janis 649566eb3f feat: NCS-78 Add Traceability to the Applications (#46)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #46
2026-05-09 20:54:18 +02:00
Janis be0b710543 fix: add instance-dead-timeout configuration and update HealthMonitor to use it for stale instance eviction
Build & Test (NowChessSystems) TeamCity build finished
2026-05-08 15:32:44 +02:00
Janis 5baf6a7cdb fix(redis): update Redis configuration with max pool size and waiting parameters
Build & Test (NowChessSystems) TeamCity build finished
2026-05-05 20:01:32 +02:00
Janis 804a4bf179 feat(logging): add DEBUG/INFO/WARN logging across services (NCS-72) (#41)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #41
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>
2026-05-02 17:33:27 +02:00
Janis 2404e6164c feat(config): update application.yml for PostgreSQL and remove staging/production configurations
2026-04-30 16:14:10 +02:00
Janis 6113432a14 feat(config): update application.yml for staging and production environments
Build & Test (NowChessSystems) TeamCity build finished
2026-04-30 10:55:20 +02:00
Janis 590924254e feat: true-microservices (#40)
Reviewed-on: #40
2026-04-29 22:06:01 +02:00