Compare commits

...

4 Commits

Author SHA1 Message Date
TeamCity 0060229ee9 ci: bump version with Build-81 2026-05-13 12:59:28 +00:00
Janis d5c8da20f8 fix: update grpcServer variable to use Instance wrapper and add optional access method
Build & Test (NowChessSystems) TeamCity build finished
2026-05-13 14:42:12 +02:00
Janis ad9495afa3 fix: clean up code formatting and improve error handling in gRPC server and failover service
Build & Test (NowChessSystems) TeamCity build failed
2026-05-13 13:16:22 +02:00
Janis 2b04d7fa71 fix: replace null checks with Option in coordinator
Build & Test (NowChessSystems) TeamCity build failed
Use Option instead of null checks in HealthMonitor and InstanceRegistry
per Scalafix DisableSyntax rule.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 12:44:34 +02:00
6 changed files with 69 additions and 27 deletions
+32
View File
@@ -334,3 +334,35 @@
* **middleware:** update paths for bot generation and stockfish configuration ([2dd0501](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2dd0501687db08dcd242359f6837125baf8a2fdc))
* **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
* update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
## (2026-05-13)
### Features
* add coordinator startup validation and K8s pod watch ([81b045d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/81b045d01bb054a4bc9dc9e02fc30f814e756205))
* add initialization metrics for various services ([d438e97](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d438e97f32bdde0bfc63c1b4a8cc810cdd093166))
* add OpenTelemetry trace configuration with parentbased sampler ([3904d5a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3904d5ad8ad4930ddee65287a7bfab785a6148f5))
* **config:** update application.yml for PostgreSQL and remove staging/production configurations ([2404e61](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2404e6164c3b50ffccbea5238d636060d6abe4d6))
* **config:** update application.yml for staging and production environments ([6113432](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/6113432a14c476a3a0dfc0d449e17d023697f2ba))
* configure logging and add OpenTelemetry support ([#49](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/49)) ([d57c488](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d57c4886612d1d92da0e1b79209fc83e6ef537a1))
* **docker:** add .dockerignore and .gitignore files for build exclusions ([c987d8e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/c987d8e258c0e6c4cfbdaa8381c64c410d7a2b83))
* **docker:** add Dockerfiles for building Quarkus application in native and JVM modes ([3f2d2bb](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3f2d2bb4c97fa8cddba66e1da4427c54236dfeed))
* **docker:** add Dockerfiles for Quarkus application in JVM and native modes ([34b9933](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/34b993304670cf2aa62cd2f6460cee7b9864b08e))
* **logging:** add DEBUG/INFO/WARN logging across services (NCS-72) ([#41](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/41)) ([804a4bf](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/804a4bf179e3dfb19e2be4390e7e543caf5237c6))
* NCS-78 Add Traceability to the Applications ([#46](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/46)) ([649566e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/649566eb3fcf38f91c8896a739f74ea318af312d))
* NCS-78 Add Traceability to the Applications ([#47](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/47)) ([87dfc6c](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/87dfc6c2bcce7f7d58fc641bd8d468a2e584c108))
* true-microservices ([#40](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/40)) ([5909242](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/590924254e8a2754de661a57a03e43f89ceb6299))
### Bug Fixes
* add instance-dead-timeout configuration and update HealthMonitor to use it for stale instance eviction ([be0b710](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/be0b710543b542da5c301efef7d2d587d0ba758a))
* clean up code formatting and improve error handling in gRPC server and failover service ([ad9495a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/ad9495afa3e93593b57154a187346c9b01393911))
* **coordinator:** refine type casting in rolloutSpec method ([#45](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/45)) ([d522f7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d522f7f6edf9c985f03dd16816439d4184f1a589))
* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#43](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/43)) ([fa3c6b2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/fa3c6b2886dc59c14c5dad834acc9b41e42023bb))
* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#44](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/44)) ([82d0b75](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/82d0b754be1075084944b466858672d944f9f7d8))
* **dependencies:** replace Fabric8 Kubernetes client with Quarkus Kubernetes client ([5f44570](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5f44570b357277d09f33b7296860c421e2e70ce0))
* enhance AutoScaler and InstanceRegistry for replica management and stale instance eviction ([b4920d3](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b4920d3817e58bda94d7764e608b856ce9a909f7))
* **middleware:** update paths for bot generation and stockfish configuration ([2dd0501](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2dd0501687db08dcd242359f6837125baf8a2fdc))
* **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
* replace null checks with Option in coordinator ([2b04d7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2b04d7fa713e06662bff5afe3fb3f9d04541ce51))
* update grpcServer variable to use Instance wrapper and add optional access method ([d5c8da2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d5c8da20f8805199e920ea5afbd9cdb39a078e40))
* update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
@@ -127,7 +127,8 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
_ =>
val response = DrainInstanceResponse.newBuilder().setGamesMigrated(gamesBefore).build()
responseObserver.onNext(response)
responseObserver.onCompleted(),
responseObserver.onCompleted()
,
ex =>
log.warnf(ex, "Drain failed for %s", instanceId)
responseObserver.onError(ex),
@@ -122,14 +122,17 @@ class FailoverService:
log.infof("Cleaned up games set for instance %s", instanceId)
private def waitForHealthyInstanceAsync(): Uni[InstanceMetadata] =
Uni.createFrom().deferred(() =>
instanceRegistry.getAllInstances
.filter(_.state == "HEALTHY")
.sortBy(_.subscriptionCount)
.headOption match
Uni
.createFrom()
.deferred(() =>
instanceRegistry.getAllInstances
.filter(_.state == "HEALTHY")
.sortBy(_.subscriptionCount)
.headOption match
case Some(inst) => Uni.createFrom().item(inst)
case None => Uni.createFrom().failure(new RuntimeException("no healthy instance"))
).onFailure()
case None => Uni.createFrom().failure(new RuntimeException("no healthy instance")),
)
.onFailure()
.retry()
.withBackOff(Duration.ofMillis(500))
.expireIn(config.failoverWaitTimeout.toMillis)
@@ -39,7 +39,7 @@ class HealthMonitor:
private var meterRegistry: MeterRegistry = uninitialized
@Inject
private var grpcServer: CoordinatorGrpcServer = uninitialized
private var grpcServerInstance: Instance[CoordinatorGrpcServer] = uninitialized
@Inject
private var failoverService: FailoverService = uninitialized
@@ -52,6 +52,10 @@ class HealthMonitor:
if kubeClientInstance.isUnsatisfied then None
else Some(kubeClientInstance.get())
private def grpcServerOpt: Option[CoordinatorGrpcServer] =
if grpcServerInstance.isUnsatisfied then None
else Some(grpcServerInstance.get())
def setRedisPrefix(prefix: String): Unit =
redisPrefix = prefix
@@ -133,19 +137,18 @@ class HealthMonitor:
action match
case Watcher.Action.DELETED =>
handlePodGone(pod)
case Watcher.Action.MODIFIED
if Option(pod.getMetadata.getDeletionTimestamp).isDefined =>
case Watcher.Action.MODIFIED if Option(pod.getMetadata.getDeletionTimestamp).isDefined =>
handlePodTerminating(pod)
case _ => ()
override def onClose(cause: WatcherException): Unit =
if cause != null then
log.warnf(cause, "Pod watch closed, restarting")
Option(cause).foreach { ex =>
log.warnf(ex, "Pod watch closed, restarting")
startPodWatch()
},
)
log.info("Pod watch started")
catch
case ex: Exception => log.warnf(ex, "Failed to start pod watch")
catch case ex: Exception => log.warnf(ex, "Failed to start pod watch")
private def isPodReady(pod: Pod): Boolean =
Option(pod.getStatus)
@@ -180,15 +183,17 @@ class HealthMonitor:
private def validateStartupInstances(timeoutMs: Long): Unit =
Thread.sleep(timeoutMs)
instanceRegistry.getAllInstances.foreach { inst =>
if !grpcServer.hasActiveStream(inst.instanceId) then
log.warnf(
"Startup: instance %s did not reconnect within %dms — evicting",
inst.instanceId,
timeoutMs,
)
instanceRegistry.removeInstance(inst.instanceId)
deleteK8sPod(inst.instanceId)
grpcServerOpt.foreach { grpcServer =>
instanceRegistry.getAllInstances.foreach { inst =>
if !grpcServer.hasActiveStream(inst.instanceId) then
log.warnf(
"Startup: instance %s did not reconnect within %dms — evicting",
inst.instanceId,
timeoutMs,
)
instanceRegistry.removeInstance(inst.instanceId)
deleteK8sPod(inst.instanceId)
}
}
private def handlePodTerminating(pod: Pod): Unit =
@@ -51,14 +51,15 @@ class InstanceRegistry:
keys.asScala.foreach { key =>
val instanceId = key.stripPrefix(s"$redisPrefix:instances:")
val json = syncRedis.value(classOf[String]).get(key)
if json != null then
Option(json).foreach { jsonStr =>
try
val metadata = mapper.readValue(json, classOf[InstanceMetadata])
val metadata = mapper.readValue(jsonStr, classOf[InstanceMetadata])
instances.put(instanceId, metadata)
log.infof("Startup: loaded instance %s from Redis", instanceId)
catch
case ex: Exception =>
log.warnf(ex, "Startup: failed to parse instance %s", instanceId)
}
}
def getInstance(instanceId: String): Option[InstanceMetadata] =
+1 -1
View File
@@ -1,3 +1,3 @@
MAJOR=0
MINOR=18
MINOR=19
PATCH=0