Commit Graph

26 Commits

Author SHA1 Message Date
Janis 380a2cceeb feat: add periodic health check to evict dead instances
Build & Test (NowChessSystems) TeamCity build failed
Add quarkus-scheduler dependency and schedule health check every 10 seconds.
Dead instances (marked with state="DEAD") now automatically evicted instead of
accumulating indefinitely.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 18:25:12 +02:00
Janis 43184d296d fix: remove corrupted instances immediately and evict dead instances
Problem: Dead instances pile up indefinitely. Failed metadata parsing leaves stale data in registry. No cleanup mechanism exists.

Changes:
1. Remove instance from registry on parse failure (corrupted metadata = unrecoverable)
2. Evict instances with state="DEAD" on next health check (was only evicting by heartbeat age)

This prevents:
- Memory leak from accumulating dead/corrupted instances
- Stale data persisting after parse failures
- Dead instances blocking resources indefinitely

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 18:25:12 +02:00
Janis d5c8da20f8 fix: update grpcServer variable to use Instance wrapper and add optional access method
Build & Test (NowChessSystems) TeamCity build finished
2026-05-13 14:42:12 +02:00
Janis ad9495afa3 fix: clean up code formatting and improve error handling in gRPC server and failover service
Build & Test (NowChessSystems) TeamCity build failed
2026-05-13 13:16:22 +02:00
Janis 2b04d7fa71 fix: replace null checks with Option in coordinator
Build & Test (NowChessSystems) TeamCity build failed
Use Option instead of null checks in HealthMonitor and InstanceRegistry
per Scalafix DisableSyntax rule.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 12:44:34 +02:00
Janis 81b045d01b feat: add coordinator startup validation and K8s pod watch
Build & Test (NowChessSystems) TeamCity build failed
On startup, load all known instances from Redis and wait 15s for them to
reconnect via gRPC. Evict instances that don't reconnect within the timeout
and delete their K8s pods.

Replace one-shot pod status check with real fabric8 Watch. On pod Terminating
event, mark instance dead. On pod Deleted event, trigger failover. Failover
now waits reactively for at least one healthy instance before distributing
orphaned games, up to 30s timeout.

- Add startupValidationTimeout and failoverWaitTimeout config (15s, 30s)
- CoordinatorGrpcServer tracks active gRPC streams
- InstanceRegistry.loadAllFromRedis() scans and loads instances on startup
- HealthMonitor startup observer validates instances and starts K8s watch
- FailoverService.onInstanceStreamDropped returns Uni[Unit] for reactive wait
- All failover service callers updated to subscribe to Uni result

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 09:55:38 +02:00
Janis 3904d5ad8a feat: add OpenTelemetry trace configuration with parentbased sampler
Build & Test (NowChessSystems) TeamCity build finished
2026-05-12 19:00:08 +02:00
Janis d438e97f32 feat: add initialization metrics for various services 2026-05-11 22:37:22 +02:00
Janis 9459203e0d refactor: update timer record calls to use Runnable type
Build & Test (NowChessSystems) TeamCity build failed
2026-05-10 22:24:55 +02:00
Janis d57c488661 feat: configure logging and add OpenTelemetry support (#49)
Build & Test (NowChessSystems) TeamCity build failed
Reviewed-on: #49
2026-05-10 20:31:48 +02:00
Janis 649566eb3f feat: NCS-78 Add Traceability to the Applications (#46)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #46
2026-05-09 20:54:18 +02:00
Janis be0b710543 fix: add instance-dead-timeout configuration and update HealthMonitor to use it for stale instance eviction
Build & Test (NowChessSystems) TeamCity build finished
2026-05-08 15:32:44 +02:00
Janis 0f41f13ce6 fix: update HealthMonitor to evict instances without associated pods
Build & Test (NowChessSystems) TeamCity build finished
2026-05-08 14:10:53 +02:00
Janis b4920d3817 fix: enhance AutoScaler and InstanceRegistry for replica management and stale instance eviction
Build & Test (NowChessSystems) TeamCity build finished
2026-05-08 12:37:23 +02:00
Janis 5baf6a7cdb fix(redis): update Redis configuration with max pool size and waiting parameters
Build & Test (NowChessSystems) TeamCity build finished
2026-05-05 20:01:32 +02:00
Janis d522f7f6ed fix(coordinator): refine type casting in rolloutSpec method (#45)
Build & Test (NowChessSystems) TeamCity build failed
Reviewed-on: #45
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>
2026-05-03 12:12:39 +02:00
Janis 82d0b754be fix(coordinator): use genericKubernetesResources API for Argo Rollout scaling (#44)
Build & Test (NowChessSystems) TeamCity build failed
Reviewed-on: #44
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>
2026-05-02 22:27:18 +02:00
Janis fa3c6b2886 fix(coordinator): use genericKubernetesResources API for Argo Rollout scaling (#43)
Build & Test (NowChessSystems) TeamCity build finished
fabric8 disallows client.resources(classOf[GenericKubernetesResource]) — throws
KubernetesClientException at runtime. Switch to genericKubernetesResources(apiVersion, kind)
which is the correct API for CRDs.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Reviewed-on: #43
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>
2026-05-02 21:22:53 +02:00
Janis 804a4bf179 feat(logging): add DEBUG/INFO/WARN logging across services (NCS-72) (#41)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #41
Co-authored-by: Janis <janis.e.20@gmx.de>
Co-committed-by: Janis <janis.e.20@gmx.de>
2026-05-02 17:33:27 +02:00
Janis d346c41d98 refactor: improve code formatting and readability
Build & Test (NowChessSystems) TeamCity build finished
2026-05-01 20:06:10 +02:00
Janis 2dd0501687 fix(middleware): update paths for bot generation and stockfish configuration
Build & Test (NowChessSystems) TeamCity build failed
refactor(bru): standardize authentication settings across requests
chore: add coordinator base URL to configuration files
2026-05-01 19:56:34 +02:00
Janis 2404e6164c feat(config): update application.yml for PostgreSQL and remove staging/production configurations 2026-04-30 16:14:10 +02:00
Janis 6113432a14 feat(config): update application.yml for staging and production environments
Build & Test (NowChessSystems) TeamCity build finished
2026-04-30 10:55:20 +02:00
Janis 34b9933046 feat(docker): add Dockerfiles for Quarkus application in JVM and native modes
Build & Test (NowChessSystems) TeamCity build finished
2026-04-30 09:28:02 +02:00
Janis 3f2d2bb4c9 feat(docker): add Dockerfiles for building Quarkus application in native and JVM modes
Build & Test (NowChessSystems) TeamCity build failed
2026-04-30 08:32:04 +02:00
Janis 590924254e feat: true-microservices (#40)
Reviewed-on: #40
2026-04-29 22:06:01 +02:00