Compare commits
4 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 0060229ee9 | |
| | d5c8da20f8 | |
| | ad9495afa3 | |
| | 2b04d7fa71 | |
@@ -334,3 +334,35 @@
 * **middleware:** update paths for bot generation and stockfish configuration ([2dd0501](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2dd0501687db08dcd242359f6837125baf8a2fdc))
 * **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
 * update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
+## (2026-05-13)
+
+### Features
+
+* add coordinator startup validation and K8s pod watch ([81b045d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/81b045d01bb054a4bc9dc9e02fc30f814e756205))
+* add initialization metrics for various services ([d438e97](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d438e97f32bdde0bfc63c1b4a8cc810cdd093166))
+* add OpenTelemetry trace configuration with parentbased sampler ([3904d5a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3904d5ad8ad4930ddee65287a7bfab785a6148f5))
+* **config:** update application.yml for PostgreSQL and remove staging/production configurations ([2404e61](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2404e6164c3b50ffccbea5238d636060d6abe4d6))
+* **config:** update application.yml for staging and production environments ([6113432](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/6113432a14c476a3a0dfc0d449e17d023697f2ba))
+* configure logging and add OpenTelemetry support ([#49](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/49)) ([d57c488](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d57c4886612d1d92da0e1b79209fc83e6ef537a1))
+* **docker:** add .dockerignore and .gitignore files for build exclusions ([c987d8e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/c987d8e258c0e6c4cfbdaa8381c64c410d7a2b83))
+* **docker:** add Dockerfiles for building Quarkus application in native and JVM modes ([3f2d2bb](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3f2d2bb4c97fa8cddba66e1da4427c54236dfeed))
+* **docker:** add Dockerfiles for Quarkus application in JVM and native modes ([34b9933](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/34b993304670cf2aa62cd2f6460cee7b9864b08e))
+* **logging:** add DEBUG/INFO/WARN logging across services (NCS-72) ([#41](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/41)) ([804a4bf](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/804a4bf179e3dfb19e2be4390e7e543caf5237c6))
+* NCS-78 Add Traceability to the Applications ([#46](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/46)) ([649566e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/649566eb3fcf38f91c8896a739f74ea318af312d))
+* NCS-78 Add Traceability to the Applications ([#47](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/47)) ([87dfc6c](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/87dfc6c2bcce7f7d58fc641bd8d468a2e584c108))
+* true-microservices ([#40](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/40)) ([5909242](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/590924254e8a2754de661a57a03e43f89ceb6299))
+
+### Bug Fixes
+
+* add instance-dead-timeout configuration and update HealthMonitor to use it for stale instance eviction ([be0b710](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/be0b710543b542da5c301efef7d2d587d0ba758a))
+* clean up code formatting and improve error handling in gRPC server and failover service ([ad9495a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/ad9495afa3e93593b57154a187346c9b01393911))
+* **coordinator:** refine type casting in rolloutSpec method ([#45](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/45)) ([d522f7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d522f7f6edf9c985f03dd16816439d4184f1a589))
+* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#43](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/43)) ([fa3c6b2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/fa3c6b2886dc59c14c5dad834acc9b41e42023bb))
+* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#44](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/44)) ([82d0b75](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/82d0b754be1075084944b466858672d944f9f7d8))
+* **dependencies:** replace Fabric8 Kubernetes client with Quarkus Kubernetes client ([5f44570](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5f44570b357277d09f33b7296860c421e2e70ce0))
+* enhance AutoScaler and InstanceRegistry for replica management and stale instance eviction ([b4920d3](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b4920d3817e58bda94d7764e608b856ce9a909f7))
+* **middleware:** update paths for bot generation and stockfish configuration ([2dd0501](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2dd0501687db08dcd242359f6837125baf8a2fdc))
+* **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
+* replace null checks with Option in coordinator ([2b04d7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2b04d7fa713e06662bff5afe3fb3f9d04541ce51))
+* update grpcServer variable to use Instance wrapper and add optional access method ([d5c8da2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d5c8da20f8805199e920ea5afbd9cdb39a078e40))
+* update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
+2
-1
@@ -127,7 +127,8 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
   _ =>
     val response = DrainInstanceResponse.newBuilder().setGamesMigrated(gamesBefore).build()
     responseObserver.onNext(response)
-    responseObserver.onCompleted(),
+    responseObserver.onCompleted()
+  ,
   ex =>
     log.warnf(ex, "Drain failed for %s", instanceId)
     responseObserver.onError(ex),
+10
-7
@@ -122,14 +122,17 @@ class FailoverService:
     log.infof("Cleaned up games set for instance %s", instanceId)
 
   private def waitForHealthyInstanceAsync(): Uni[InstanceMetadata] =
-    Uni.createFrom().deferred(() =>
-      instanceRegistry.getAllInstances
-        .filter(_.state == "HEALTHY")
-        .sortBy(_.subscriptionCount)
-        .headOption match
+    Uni
+      .createFrom()
+      .deferred(() =>
+        instanceRegistry.getAllInstances
+          .filter(_.state == "HEALTHY")
+          .sortBy(_.subscriptionCount)
+          .headOption match
           case Some(inst) => Uni.createFrom().item(inst)
-          case None => Uni.createFrom().failure(new RuntimeException("no healthy instance"))
-    ).onFailure()
+          case None => Uni.createFrom().failure(new RuntimeException("no healthy instance")),
+      )
+      .onFailure()
       .retry()
       .withBackOff(Duration.ofMillis(500))
       .expireIn(config.failoverWaitTimeout.toMillis)
+21
-16
@@ -39,7 +39,7 @@ class HealthMonitor:
   private var meterRegistry: MeterRegistry = uninitialized
 
   @Inject
-  private var grpcServer: CoordinatorGrpcServer = uninitialized
+  private var grpcServerInstance: Instance[CoordinatorGrpcServer] = uninitialized
 
   @Inject
   private var failoverService: FailoverService = uninitialized
@@ -52,6 +52,10 @@ class HealthMonitor:
     if kubeClientInstance.isUnsatisfied then None
     else Some(kubeClientInstance.get())
 
+  private def grpcServerOpt: Option[CoordinatorGrpcServer] =
+    if grpcServerInstance.isUnsatisfied then None
+    else Some(grpcServerInstance.get())
+
   def setRedisPrefix(prefix: String): Unit =
     redisPrefix = prefix
 
@@ -133,19 +137,18 @@ class HealthMonitor:
               action match
                 case Watcher.Action.DELETED =>
                   handlePodGone(pod)
-                case Watcher.Action.MODIFIED
-                    if Option(pod.getMetadata.getDeletionTimestamp).isDefined =>
+                case Watcher.Action.MODIFIED if Option(pod.getMetadata.getDeletionTimestamp).isDefined =>
                   handlePodTerminating(pod)
                 case _ => ()
 
             override def onClose(cause: WatcherException): Unit =
-              if cause != null then
-                log.warnf(cause, "Pod watch closed, restarting")
+              Option(cause).foreach { ex =>
+                log.warnf(ex, "Pod watch closed, restarting")
                 startPodWatch()
+              },
         )
       log.info("Pod watch started")
-    catch
-      case ex: Exception => log.warnf(ex, "Failed to start pod watch")
+    catch case ex: Exception => log.warnf(ex, "Failed to start pod watch")
 
   private def isPodReady(pod: Pod): Boolean =
     Option(pod.getStatus)
@@ -180,15 +183,17 @@ class HealthMonitor:
 
   private def validateStartupInstances(timeoutMs: Long): Unit =
     Thread.sleep(timeoutMs)
-    instanceRegistry.getAllInstances.foreach { inst =>
-      if !grpcServer.hasActiveStream(inst.instanceId) then
-        log.warnf(
-          "Startup: instance %s did not reconnect within %dms — evicting",
-          inst.instanceId,
-          timeoutMs,
-        )
-        instanceRegistry.removeInstance(inst.instanceId)
-        deleteK8sPod(inst.instanceId)
+    grpcServerOpt.foreach { grpcServer =>
+      instanceRegistry.getAllInstances.foreach { inst =>
+        if !grpcServer.hasActiveStream(inst.instanceId) then
+          log.warnf(
+            "Startup: instance %s did not reconnect within %dms — evicting",
+            inst.instanceId,
+            timeoutMs,
+          )
+          instanceRegistry.removeInstance(inst.instanceId)
+          deleteK8sPod(inst.instanceId)
+      }
     }
 
   private def handlePodTerminating(pod: Pod): Unit =
+3
-2
@@ -51,14 +51,15 @@ class InstanceRegistry:
     keys.asScala.foreach { key =>
       val instanceId = key.stripPrefix(s"$redisPrefix:instances:")
       val json = syncRedis.value(classOf[String]).get(key)
-      if json != null then
+      Option(json).foreach { jsonStr =>
         try
-          val metadata = mapper.readValue(json, classOf[InstanceMetadata])
+          val metadata = mapper.readValue(jsonStr, classOf[InstanceMetadata])
           instances.put(instanceId, metadata)
           log.infof("Startup: loaded instance %s from Redis", instanceId)
         catch
           case ex: Exception =>
             log.warnf(ex, "Startup: failed to parse instance %s", instanceId)
+      }
     }
 
   def getInstance(instanceId: String): Option[InstanceMetadata] =
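Note on the InstanceRegistry hunk: the `if json != null` guard is replaced by `Option(json).foreach`, so the parse only runs when the lookup returned a value. A minimal, self-contained sketch of that pattern is below. The names `lookup`, `loadAll`, and the in-memory `store` are hypothetical stand-ins for illustration, not code from this repository.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a client call that returns null on a miss
// (assumption: it mirrors the null-returning `get` used in the hunk).
def lookup(store: Map[String, String], key: String): String =
  store.getOrElse(key, null)

// Copies every key that resolves to a non-null value into a fresh map.
def loadAll(store: Map[String, String], keys: List[String]): Map[String, String] =
  val loaded = mutable.Map.empty[String, String]
  keys.foreach { key =>
    // Option(null) is None, so the body is skipped for missing keys.
    Option(lookup(store, key)).foreach { value =>
      loaded.put(key, value)
    }
  }
  loaded.toMap

@main def demo(): Unit =
  // "b" is absent from the store, so only "a" should be loaded.
  println(loadAll(Map("a" -> "1"), List("a", "b")))
```

Binding the non-null value to a new name (`jsonStr` in the hunk, `value` here) keeps the nullable reference out of the block that uses it.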
@@ -1,3 +1,3 @@
 MAJOR=0
-MINOR=18
+MINOR=19
 PATCH=0