Commit Graph

32 Commits

Author SHA1 Message Date
Janis 255e2da33c feat: scale up on high CPU load, not just subscription count
AutoScaler now checks K8s pod metrics (CPU) in addition to subscription count.
Scale-up triggers if:
1. avgLoad > scaleUpThreshold * maxGamesPerCore, OR
2. Any instance has CPU > 800m

Fixes the scenario where an instance under heavy CPU load would not trigger a
scale-up without a high subscription count. The AutoScaler now responds to
compute utilization, not just game count.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 23:46:01 +02:00
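
The scale-up rule described in 255e2da33c could be sketched roughly as follows. This is a minimal illustration, not the actual AutoScaler code: InstanceLoad, ScaleUpSketch and shouldScaleUp are hypothetical names, and avgLoad is assumed here to mean the average subscription count per instance.

```scala
// Hypothetical sketch of the two scale-up triggers from the commit message.
// The real AutoScaler reads CPU from K8s pod metrics; here it is passed in.
final case class InstanceLoad(id: String, subscriptions: Int, cpuMillicores: Long)

object ScaleUpSketch {
  // 800m is the per-instance CPU limit named in the commit message.
  val CpuLimitMillicores: Long = 800L

  def shouldScaleUp(
      instances: Seq[InstanceLoad],
      scaleUpThreshold: Double,
      maxGamesPerCore: Int
  ): Boolean = {
    if (instances.isEmpty) false
    else {
      // Assumption: avgLoad is the mean subscription count across instances.
      val avgLoad = instances.map(_.subscriptions).sum.toDouble / instances.size
      val overSubscribed = avgLoad > scaleUpThreshold * maxGamesPerCore
      val cpuHot = instances.exists(_.cpuMillicores > CpuLimitMillicores)
      // Either condition alone is enough to trigger a scale-up.
      overSubscribed || cpuHot
    }
  }
}
```

With this shape, a single instance at 900m CPU triggers a scale-up even when its subscription count is low, which is the case the commit set out to fix.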
Janis 4b3b5e7c4e fix: don't trigger scale-down if already at min replicas
checkAndScale decided to scale down based on avgLoad without checking whether
the deployment was already at minimum replicas. This caused unnecessary
scale-down attempts that failed and produced log noise.

Add check: only scale down if instances.size > scaleMinReplicas

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 23:46:01 +02:00
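
The guard added in 4b3b5e7c4e might look something like the sketch below. Only the instances.size > scaleMinReplicas condition comes from the commit message; the scaleDownThreshold parameter and the function name are assumptions made for illustration.

```scala
// Hypothetical sketch: scale-down is considered only above the replica floor.
object ScaleDownSketch {
  def shouldScaleDown(
      instanceCount: Int,
      avgLoad: Double,
      scaleDownThreshold: Double, // assumed name, not taken from the commit
      scaleMinReplicas: Int
  ): Boolean =
    // Check the floor first so no scale-down is attempted at minimum replicas.
    instanceCount > scaleMinReplicas && avgLoad < scaleDownThreshold
}
```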
Janis 1d121c727c fix: don't block event loop during scale-down drain
Scale-down was calling failoverService.onInstanceStreamDropped synchronously
and waiting for it to complete. Failover retries for up to 30s waiting for
healthy instances, which blocks the Quarkus event loop thread.

This caused:
- Event loop blocked for 15+ seconds
- Redis health checks timing out (also on event loop)
- Scale-down operations failing

Fix: Trigger drain asynchronously without waiting. Scale-down proceeds
immediately while drain happens in background.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 23:46:00 +02:00
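
The non-blocking drain from 1d121c727c can be illustrated with plain Scala futures. The actual code uses Mutiny's Uni (see 81b045d01b); this sketch swaps in scala.concurrent.Future to stay self-contained, and DrainSketch, scaleDown and the callback parameters are hypothetical names.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical sketch: start the drain, do not wait for it, proceed at once.
object DrainSketch {
  def scaleDown(
      instanceId: String,
      drain: String => Future[Unit],       // stands in for the failover call
      removeReplica: String => Unit,       // stands in for the scale-down step
      onDrainFailure: Throwable => Unit    // failures are reported, not awaited
  )(implicit ec: ExecutionContext): Future[Unit] = {
    val draining = drain(instanceId)
    // Register a failure callback instead of blocking the calling thread.
    draining.failed.foreach(onDrainFailure)
    // Scale-down proceeds immediately while the drain runs in the background.
    removeReplica(instanceId)
    draining
  }
}
```

The returned future is only there so callers can observe completion if they wish; nothing in scaleDown itself waits on it.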
Janis d0c71693bb fix: coordinator auto-scaling, cache eviction, rebalancing, and grpc timeouts
Build & Test (NowChessSystems) TeamCity build finished
Critical fixes:
- Enable auto-scaling (was disabled in config)
- Add periodic cache eviction (5m interval); CacheEvictionManager previously never ran
- Add periodic rebalance check (30s) for proactive load balancing
- Add 5s timeout to all gRPC calls (batchResubscribe, unsubscribe, evict)
- Use Option instead of null checks (scalafix compliance)

These gaps left the coordinator unable to:
1. Scale up when instances were overloaded (scaling was disabled)
2. Clean up idle games from memory (no scheduled eviction)
3. Rebalance load proactively (rebalancing ran only on scale-up)
4. Handle hung instances (no RPC timeouts, operations could hang forever)

Combined with prior fixes for instance metadata parsing and heartbeat TTL,
the coordinator now handles overload scenarios correctly.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 22:20:25 +02:00
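
As a rough illustration of the 5s RPC timeout in d0c71693bb, the sketch below bounds an asynchronous call with the JDK's CompletableFuture.orTimeout. This is an assumed stand-in, not the coordinator's actual mechanism, which the commit message does not specify; RpcTimeoutSketch and withTimeout are hypothetical names.

```scala
import java.util.concurrent.{CompletableFuture, TimeUnit}

// Hypothetical sketch: give every outgoing call an upper time bound so a
// hung instance cannot stall the caller forever.
object RpcTimeoutSketch {
  def withTimeout[A](
      call: CompletableFuture[A],
      timeout: Long = 5L,                 // 5s default, as in the commit message
      unit: TimeUnit = TimeUnit.SECONDS
  ): CompletableFuture[A] =
    // Completes exceptionally with TimeoutException if `call` is still
    // pending when the timeout elapses; otherwise the result is unchanged.
    call.orTimeout(timeout, unit)
}
```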
Janis 3f12f695f1 feat: implement periodic scaling checks and enhance instance management in AutoScaler
Build & Test (NowChessSystems) TeamCity build failed
2026-05-13 22:08:22 +02:00
Janis 10937e756a fix: streamline logging for evicted instances in InstanceRegistry
Build & Test (NowChessSystems) TeamCity build finished
2026-05-13 18:39:32 +02:00
Janis 380a2cceeb feat: add periodic health check to evict dead instances
Build & Test (NowChessSystems) TeamCity build failed
Add quarkus-scheduler dependency and schedule health check every 10 seconds.
Dead instances (marked with state="DEAD") are now automatically evicted instead of
accumulating indefinitely.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 18:25:12 +02:00
Janis 43184d296d fix: remove corrupted instances immediately and evict dead instances
Problem: Dead instances pile up indefinitely. Failed metadata parsing leaves stale data in registry. No cleanup mechanism exists.

Changes:
1. Remove instance from registry on parse failure (corrupted metadata = unrecoverable)
2. Evict instances with state="DEAD" on next health check (was only evicting by heartbeat age)

This prevents:
- Memory leak from accumulating dead/corrupted instances
- Stale data persisting after parse failures
- Dead instances blocking resources indefinitely

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 18:25:12 +02:00
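
The two cleanup rules from 43184d296d (drop on parse failure, evict on state="DEAD" or stale heartbeat) could be sketched over an immutable map as below. The pipe-separated metadata format, InstanceInfo, and all function names are assumptions for illustration; the real InstanceRegistry is backed by Redis and is not shown in this log.

```scala
// Hypothetical sketch of registry cleanup; the metadata wire format is made up.
final case class InstanceInfo(id: String, state: String, lastHeartbeatMillis: Long)

object EvictionSketch {
  // Assumed format "id|state|heartbeatMillis"; anything else fails to parse.
  def parse(raw: String): Option[InstanceInfo] = raw.split('|') match {
    case Array(id, state, ts) => ts.toLongOption.map(InstanceInfo(id, state, _))
    case _                    => None
  }

  // Rule 1: corrupted metadata is unrecoverable, so the entry is removed.
  def applyUpdate(
      registry: Map[String, InstanceInfo],
      id: String,
      raw: String
  ): Map[String, InstanceInfo] =
    parse(raw) match {
      case Some(info) => registry.updated(id, info)
      case None       => registry - id
    }

  // Rule 2: evict on explicit DEAD state as well as on heartbeat age.
  def shouldEvict(i: InstanceInfo, nowMillis: Long, deadTimeoutMillis: Long): Boolean =
    i.state == "DEAD" || nowMillis - i.lastHeartbeatMillis > deadTimeoutMillis

  def healthCheck(
      registry: Map[String, InstanceInfo],
      nowMillis: Long,
      deadTimeoutMillis: Long
  ): Map[String, InstanceInfo] =
    registry.filterNot { case (_, i) => shouldEvict(i, nowMillis, deadTimeoutMillis) }
}
```

Running healthCheck on a fixed schedule, as 380a2cceeb does every 10 seconds, keeps the registry from accumulating dead entries.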
Janis d5c8da20f8 fix: update grpcServer variable to use Instance wrapper and add optional access method
Build & Test (NowChessSystems) TeamCity build finished
2026-05-13 14:42:12 +02:00
Janis ad9495afa3 fix: clean up code formatting and improve error handling in gRPC server and failover service
Build & Test (NowChessSystems) TeamCity build failed
2026-05-13 13:16:22 +02:00
Janis 2b04d7fa71 fix: replace null checks with Option in coordinator
Build & Test (NowChessSystems) TeamCity build failed
Use Option instead of null checks in HealthMonitor and InstanceRegistry
per Scalafix DisableSyntax rule.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 12:44:34 +02:00
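
The null-to-Option change in 2b04d7fa71 follows a common Scala pattern, sketched here. The podNameOf helper and the "podName" key are hypothetical; only the use of Option in place of null checks comes from the commit message.

```scala
// Hypothetical sketch: wrap possibly-null Java values in Option at the boundary.
object OptionSketch {
  // Option(x) yields None when x is null, so no explicit null check is needed.
  def podNameOf(metadata: java.util.Map[String, String]): Option[String] =
    Option(metadata).flatMap(m => Option(m.get("podName")))
}
```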
Janis 81b045d01b feat: add coordinator startup validation and K8s pod watch
Build & Test (NowChessSystems) TeamCity build failed
On startup, load all known instances from Redis and wait 15s for them to
reconnect via gRPC. Evict instances that don't reconnect within the timeout
and delete their K8s pods.

Replace one-shot pod status check with real fabric8 Watch. On pod Terminating
event, mark instance dead. On pod Deleted event, trigger failover. Failover
now waits reactively for at least one healthy instance before distributing
orphaned games, up to 30s timeout.

- Add startupValidationTimeout and failoverWaitTimeout config (15s, 30s)
- CoordinatorGrpcServer tracks active gRPC streams
- InstanceRegistry.loadAllFromRedis() scans and loads instances on startup
- HealthMonitor startup observer validates instances and starts K8s watch
- FailoverService.onInstanceStreamDropped returns Uni[Unit] for reactive wait
- All failover service callers updated to subscribe to Uni result

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 09:55:38 +02:00
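
The startup validation step in 81b045d01b boils down to comparing the instances known in Redis with those that reconnected before the timeout. A minimal sketch of that decision, with hypothetical names and the 15s wait itself left out:

```scala
// Hypothetical sketch: decide which known instances to keep after the
// startup grace period has elapsed.
object StartupValidationSketch {
  final case class Result(kept: Set[String], evicted: Set[String])

  def validate(knownFromRedis: Set[String], reconnected: Set[String]): Result =
    Result(
      kept = knownFromRedis.intersect(reconnected),
      // Instances that never reopened their gRPC stream are evicted; per the
      // commit message, their K8s pods are deleted as well.
      evicted = knownFromRedis.diff(reconnected)
    )
}
```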
Janis 3904d5ad8a feat: add OpenTelemetry trace configuration with parentbased sampler
Build & Test (NowChessSystems) TeamCity build finished
2026-05-12 19:00:08 +02:00
Janis d438e97f32 feat: add initialization metrics for various services 2026-05-11 22:37:22 +02:00
Janis 9459203e0d refactor: update timer record calls to use Runnable type
Build & Test (NowChessSystems) TeamCity build failed
2026-05-10 22:24:55 +02:00
Janis d57c488661 feat: configure logging and add OpenTelemetry support (#49)
Build & Test (NowChessSystems) TeamCity build failed
Reviewed-on: #49
2026-05-10 20:31:48 +02:00
Janis 649566eb3f feat: NCS-78 Add Traceability to the Applications (#46)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #46
2026-05-09 20:54:18 +02:00
Janis be0b710543 fix: add instance-dead-timeout configuration and update HealthMonitor to use it for stale instance eviction
Build & Test (NowChessSystems) TeamCity build finished
2026-05-08 15:32:44 +02:00
Janis 0f41f13ce6 fix: update HealthMonitor to evict instances without associated pods
Build & Test (NowChessSystems) TeamCity build finished
2026-05-08 14:10:53 +02:00
Janis b4920d3817 fix: enhance AutoScaler and InstanceRegistry for replica management and stale instance eviction
Build & Test (NowChessSystems) TeamCity build finished
2026-05-08 12:37:23 +02:00
Janis 5baf6a7cdb fix(redis): update Redis configuration with max pool size and waiting parameters
Build & Test (NowChessSystems) TeamCity build finished
2026-05-05 20:01:32 +02:00
Janis d522f7f6ed fix(coordinator): refine type casting in rolloutSpec method (#45)
Build & Test (NowChessSystems) TeamCity build failed
Reviewed-on: #45
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>
2026-05-03 12:12:39 +02:00
Janis 82d0b754be fix(coordinator): use genericKubernetesResources API for Argo Rollout scaling (#44)
Build & Test (NowChessSystems) TeamCity build failed
Reviewed-on: #44
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>
2026-05-02 22:27:18 +02:00
Janis fa3c6b2886 fix(coordinator): use genericKubernetesResources API for Argo Rollout scaling (#43)
Build & Test (NowChessSystems) TeamCity build finished
fabric8 disallows client.resources(classOf[GenericKubernetesResource]) — throws
KubernetesClientException at runtime. Switch to genericKubernetesResources(apiVersion, kind)
which is the correct API for CRDs.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Reviewed-on: #43
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>
2026-05-02 21:22:53 +02:00
Janis 804a4bf179 feat(logging): add DEBUG/INFO/WARN logging across services (NCS-72) (#41)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #41
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>
2026-05-02 17:33:27 +02:00
Janis d346c41d98 refactor: improve code formatting and readability
Build & Test (NowChessSystems) TeamCity build finished
2026-05-01 20:06:10 +02:00
Janis 2dd0501687 fix(middleware): update paths for bot generation and stockfish configuration
Build & Test (NowChessSystems) TeamCity build failed
refactor(bru): standardize authentication settings across requests
chore: add coordinator base URL to configuration files
2026-05-01 19:56:34 +02:00
Janis 2404e6164c feat(config): update application.yml for PostgreSQL and remove staging/production configurations 2026-04-30 16:14:10 +02:00
Janis 6113432a14 feat(config): update application.yml for staging and production environments
Build & Test (NowChessSystems) TeamCity build finished
2026-04-30 10:55:20 +02:00
Janis 34b9933046 feat(docker): add Dockerfiles for Quarkus application in JVM and native modes
Build & Test (NowChessSystems) TeamCity build finished
2026-04-30 09:28:02 +02:00
Janis 3f2d2bb4c9 feat(docker): add Dockerfiles for building Quarkus application in native and JVM modes
Build & Test (NowChessSystems) TeamCity build failed
2026-04-30 08:32:04 +02:00
Janis 590924254e feat: true-microservices (#40)
Reviewed-on: #40
2026-04-29 22:06:01 +02:00