Compare commits
7 Commits
rule-0.18.0
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| 0060229ee9 | |||
| d5c8da20f8 | |||
| ad9495afa3 | |||
| 2b04d7fa71 | |||
| 81b045d01b | |||
| 118acff0e5 | |||
| a49f9be146 |
@@ -334,3 +334,35 @@
|
||||
* **middleware:** update paths for bot generation and stockfish configuration ([2dd0501](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2dd0501687db08dcd242359f6837125baf8a2fdc))
|
||||
* **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
|
||||
* update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
|
||||
## (2026-05-13)
|
||||
|
||||
### Features
|
||||
|
||||
* add coordinator startup validation and K8s pod watch ([81b045d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/81b045d01bb054a4bc9dc9e02fc30f814e756205))
|
||||
* add initialization metrics for various services ([d438e97](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d438e97f32bdde0bfc63c1b4a8cc810cdd093166))
|
||||
* add OpenTelemetry trace configuration with parentbased sampler ([3904d5a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3904d5ad8ad4930ddee65287a7bfab785a6148f5))
|
||||
* **config:** update application.yml for PostgreSQL and remove staging/production configurations ([2404e61](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2404e6164c3b50ffccbea5238d636060d6abe4d6))
|
||||
* **config:** update application.yml for staging and production environments ([6113432](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/6113432a14c476a3a0dfc0d449e17d023697f2ba))
|
||||
* configure logging and add OpenTelemetry support ([#49](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/49)) ([d57c488](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d57c4886612d1d92da0e1b79209fc83e6ef537a1))
|
||||
* **docker:** add .dockerignore and .gitignore files for build exclusions ([c987d8e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/c987d8e258c0e6c4cfbdaa8381c64c410d7a2b83))
|
||||
* **docker:** add Dockerfiles for building Quarkus application in native and JVM modes ([3f2d2bb](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3f2d2bb4c97fa8cddba66e1da4427c54236dfeed))
|
||||
* **docker:** add Dockerfiles for Quarkus application in JVM and native modes ([34b9933](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/34b993304670cf2aa62cd2f6460cee7b9864b08e))
|
||||
* **logging:** add DEBUG/INFO/WARN logging across services (NCS-72) ([#41](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/41)) ([804a4bf](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/804a4bf179e3dfb19e2be4390e7e543caf5237c6))
|
||||
* NCS-78 Add Traceability to the Applications ([#46](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/46)) ([649566e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/649566eb3fcf38f91c8896a739f74ea318af312d))
|
||||
* NCS-78 Add Traceability to the Applications ([#47](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/47)) ([87dfc6c](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/87dfc6c2bcce7f7d58fc641bd8d468a2e584c108))
|
||||
* true-microservices ([#40](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/40)) ([5909242](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/590924254e8a2754de661a57a03e43f89ceb6299))
|
||||
|
||||
### Bug Fixes
|
||||
|
||||
* add instance-dead-timeout configuration and update HealthMonitor to use it for stale instance eviction ([be0b710](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/be0b710543b542da5c301efef7d2d587d0ba758a))
|
||||
* clean up code formatting and improve error handling in gRPC server and failover service ([ad9495a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/ad9495afa3e93593b57154a187346c9b01393911))
|
||||
* **coordinator:** refine type casting in rolloutSpec method ([#45](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/45)) ([d522f7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d522f7f6edf9c985f03dd16816439d4184f1a589))
|
||||
* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#43](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/43)) ([fa3c6b2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/fa3c6b2886dc59c14c5dad834acc9b41e42023bb))
|
||||
* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#44](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/44)) ([82d0b75](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/82d0b754be1075084944b466858672d944f9f7d8))
|
||||
* **dependencies:** replace Fabric8 Kubernetes client with Quarkus Kubernetes client ([5f44570](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5f44570b357277d09f33b7296860c421e2e70ce0))
|
||||
* enhance AutoScaler and InstanceRegistry for replica management and stale instance eviction ([b4920d3](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b4920d3817e58bda94d7764e608b856ce9a909f7))
|
||||
* **middleware:** update paths for bot generation and stockfish configuration ([2dd0501](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2dd0501687db08dcd242359f6837125baf8a2fdc))
|
||||
* **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
|
||||
* replace null checks with Option in coordinator ([2b04d7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2b04d7fa713e06662bff5afe3fb3f9d04541ce51))
|
||||
* update grpcServer variable to use Instance wrapper and add optional access method ([d5c8da2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d5c8da20f8805199e920ea5afbd9cdb39a078e40))
|
||||
* update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
|
||||
|
||||
@@ -45,6 +45,8 @@ nowchess:
|
||||
k8s-namespace: default
|
||||
k8s-rollout-name: nowchess-core
|
||||
k8s-rollout-label-selector: "app=nowchess-core"
|
||||
startup-validation-timeout: 15s
|
||||
failover-wait-timeout: 30s
|
||||
|
||||
---
|
||||
# dev profile
|
||||
|
||||
+6
@@ -56,3 +56,9 @@ trait CoordinatorConfig:
|
||||
|
||||
@WithName("k8s-rollout-label-selector")
|
||||
def k8sRolloutLabelSelector: String
|
||||
|
||||
@WithName("startup-validation-timeout")
|
||||
def startupValidationTimeout: Duration
|
||||
|
||||
@WithName("failover-wait-timeout")
|
||||
def failoverWaitTimeout: Duration
|
||||
|
||||
+31
-7
@@ -9,6 +9,7 @@ import de.nowchess.coordinator.proto.{CoordinatorServiceGrpc, *}
|
||||
import io.grpc.stub.StreamObserver
|
||||
import com.fasterxml.jackson.databind.ObjectMapper
|
||||
import org.jboss.logging.Logger
|
||||
import java.util.concurrent.ConcurrentHashMap
|
||||
|
||||
@GrpcService
|
||||
@Singleton
|
||||
@@ -21,8 +22,9 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
|
||||
private var failoverService: FailoverService = uninitialized
|
||||
// scalafix:on DisableSyntax.var
|
||||
|
||||
private val mapper = ObjectMapper()
|
||||
private val log = Logger.getLogger(classOf[CoordinatorGrpcServer])
|
||||
private val mapper = ObjectMapper()
|
||||
private val log = Logger.getLogger(classOf[CoordinatorGrpcServer])
|
||||
private val activeStreams = ConcurrentHashMap.newKeySet[String]()
|
||||
|
||||
override def heartbeatStream(
|
||||
responseObserver: StreamObserver[CoordinatorCommand],
|
||||
@@ -38,6 +40,7 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
|
||||
lastInstanceId = frame.getInstanceId
|
||||
if !firstFrameSeen then
|
||||
firstFrameSeen = true
|
||||
activeStreams.add(frame.getInstanceId)
|
||||
log.infof(
|
||||
"First heartbeat from instance %s (host=%s http=%d grpc=%d)",
|
||||
frame.getInstanceId,
|
||||
@@ -60,10 +63,19 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
|
||||
|
||||
override def onError(t: Throwable): Unit =
|
||||
log.warnf(t, "Heartbeat stream error for instance %s", lastInstanceId)
|
||||
if lastInstanceId.nonEmpty then failoverService.onInstanceStreamDropped(lastInstanceId)
|
||||
if lastInstanceId.nonEmpty then
|
||||
activeStreams.remove(lastInstanceId)
|
||||
failoverService
|
||||
.onInstanceStreamDropped(lastInstanceId)
|
||||
.subscribe()
|
||||
.`with`(
|
||||
_ => (),
|
||||
ex => log.warnf(ex, "Failover for %s failed", lastInstanceId),
|
||||
)
|
||||
|
||||
override def onCompleted: Unit =
|
||||
log.infof("Heartbeat stream completed for instance %s", lastInstanceId)
|
||||
activeStreams.remove(lastInstanceId)
|
||||
|
||||
override def batchResubscribeGames(
|
||||
request: BatchResubscribeRequest,
|
||||
@@ -108,7 +120,19 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
|
||||
val instanceId = request.getInstanceId
|
||||
log.infof("Drain request for instance %s", instanceId)
|
||||
val gamesBefore = instanceRegistry.getInstance(instanceId).map(_.subscriptionCount).getOrElse(0)
|
||||
failoverService.onInstanceStreamDropped(instanceId)
|
||||
val response = DrainInstanceResponse.newBuilder().setGamesMigrated(gamesBefore).build()
|
||||
responseObserver.onNext(response)
|
||||
responseObserver.onCompleted()
|
||||
failoverService
|
||||
.onInstanceStreamDropped(instanceId)
|
||||
.subscribe()
|
||||
.`with`(
|
||||
_ =>
|
||||
val response = DrainInstanceResponse.newBuilder().setGamesMigrated(gamesBefore).build()
|
||||
responseObserver.onNext(response)
|
||||
responseObserver.onCompleted()
|
||||
,
|
||||
ex =>
|
||||
log.warnf(ex, "Drain failed for %s", instanceId)
|
||||
responseObserver.onError(ex),
|
||||
)
|
||||
|
||||
def hasActiveStream(instanceId: String): Boolean =
|
||||
activeStreams.contains(instanceId)
|
||||
|
||||
+7
-1
@@ -70,7 +70,13 @@ class CoordinatorResource:
|
||||
@Produces(Array(MediaType.APPLICATION_JSON))
|
||||
def triggerFailover(@PathParam("instanceId") instanceId: String): scala.collection.Map[String, String] =
|
||||
log.infof("Manual failover triggered for instance %s", instanceId)
|
||||
failoverService.onInstanceStreamDropped(instanceId)
|
||||
failoverService
|
||||
.onInstanceStreamDropped(instanceId)
|
||||
.subscribe()
|
||||
.`with`(
|
||||
_ => (),
|
||||
ex => log.warnf(ex, "Manual failover for %s failed", instanceId),
|
||||
)
|
||||
Map("status" -> "failover_started", "instanceId" -> instanceId)
|
||||
|
||||
@POST
|
||||
|
||||
+48
-13
@@ -8,6 +8,9 @@ import scala.compiletime.uninitialized
|
||||
import org.jboss.logging.Logger
|
||||
import de.nowchess.coordinator.dto.InstanceMetadata
|
||||
import de.nowchess.coordinator.grpc.CoreGrpcClient
|
||||
import de.nowchess.coordinator.config.CoordinatorConfig
|
||||
import io.smallrye.mutiny.Uni
|
||||
import java.time.Duration
|
||||
|
||||
@ApplicationScoped
|
||||
class FailoverService:
|
||||
@@ -21,6 +24,9 @@ class FailoverService:
|
||||
@Inject
|
||||
private var coreGrpcClient: CoreGrpcClient = uninitialized
|
||||
|
||||
@Inject
|
||||
private var config: CoordinatorConfig = uninitialized
|
||||
|
||||
private val log = Logger.getLogger(classOf[FailoverService])
|
||||
private var redisPrefix = "nowchess"
|
||||
// scalafix:on DisableSyntax.var
|
||||
@@ -28,7 +34,7 @@ class FailoverService:
|
||||
def setRedisPrefix(prefix: String): Unit =
|
||||
redisPrefix = prefix
|
||||
|
||||
def onInstanceStreamDropped(instanceId: String): Unit =
|
||||
def onInstanceStreamDropped(instanceId: String): Uni[Unit] =
|
||||
log.infof("Instance %s stream dropped, triggering failover", instanceId)
|
||||
|
||||
val startTime = System.currentTimeMillis()
|
||||
@@ -37,19 +43,32 @@ class FailoverService:
|
||||
val gameIds = getOrphanedGames(instanceId)
|
||||
log.infof("Found %d orphaned games for instance %s", gameIds.size, instanceId)
|
||||
|
||||
if gameIds.nonEmpty then
|
||||
val healthyInstances = instanceRegistry.getAllInstances
|
||||
.filter(_.state == "HEALTHY")
|
||||
.sortBy(_.subscriptionCount)
|
||||
if gameIds.isEmpty then
|
||||
cleanupDeadInstance(instanceId)
|
||||
Uni.createFrom().item(())
|
||||
else
|
||||
waitForHealthyInstanceAsync()
|
||||
.onItem()
|
||||
.transform { _ =>
|
||||
val healthyInstances = instanceRegistry.getAllInstances
|
||||
.filter(_.state == "HEALTHY")
|
||||
.sortBy(_.subscriptionCount)
|
||||
distributeGames(gameIds, healthyInstances, instanceId)
|
||||
|
||||
if healthyInstances.nonEmpty then
|
||||
distributeGames(gameIds, healthyInstances, instanceId)
|
||||
|
||||
val elapsed = System.currentTimeMillis() - startTime
|
||||
log.infof("Failover completed in %dms for instance %s", elapsed, instanceId)
|
||||
else log.warnf("No healthy instances available for failover of %s", instanceId)
|
||||
|
||||
cleanupDeadInstance(instanceId)
|
||||
val elapsed = System.currentTimeMillis() - startTime
|
||||
log.infof("Failover completed in %dms for instance %s", elapsed, instanceId)
|
||||
cleanupDeadInstance(instanceId)
|
||||
()
|
||||
}
|
||||
.onFailure()
|
||||
.recoverWithItem { _ =>
|
||||
log.errorf(
|
||||
"No healthy instance appeared within %s — games orphaned for %s",
|
||||
config.failoverWaitTimeout,
|
||||
instanceId,
|
||||
)
|
||||
()
|
||||
}
|
||||
|
||||
private def getOrphanedGames(instanceId: String): List[String] =
|
||||
val setKey = s"$redisPrefix:instance:$instanceId:games"
|
||||
@@ -101,3 +120,19 @@ class FailoverService:
|
||||
val setKey = s"$redisPrefix:instance:$instanceId:games"
|
||||
redis.key(classOf[String]).del(setKey)
|
||||
log.infof("Cleaned up games set for instance %s", instanceId)
|
||||
|
||||
private def waitForHealthyInstanceAsync(): Uni[InstanceMetadata] =
|
||||
Uni
|
||||
.createFrom()
|
||||
.deferred(() =>
|
||||
instanceRegistry.getAllInstances
|
||||
.filter(_.state == "HEALTHY")
|
||||
.sortBy(_.subscriptionCount)
|
||||
.headOption match
|
||||
case Some(inst) => Uni.createFrom().item(inst)
|
||||
case None => Uni.createFrom().failure(new RuntimeException("no healthy instance")),
|
||||
)
|
||||
.onFailure()
|
||||
.retry()
|
||||
.withBackOff(Duration.ofMillis(500))
|
||||
.expireIn(config.failoverWaitTimeout.toMillis)
|
||||
|
||||
+91
-28
@@ -2,17 +2,23 @@ package de.nowchess.coordinator.service
|
||||
|
||||
import jakarta.annotation.PostConstruct
|
||||
import jakarta.enterprise.context.ApplicationScoped
|
||||
import jakarta.enterprise.event.Observes
|
||||
import jakarta.enterprise.inject.Instance
|
||||
import jakarta.inject.Inject
|
||||
import de.nowchess.coordinator.config.CoordinatorConfig
|
||||
import io.fabric8.kubernetes.client.KubernetesClient
|
||||
import io.fabric8.kubernetes.api.model.Pod
|
||||
import io.fabric8.kubernetes.client.Watcher
|
||||
import io.fabric8.kubernetes.client.WatcherException
|
||||
import io.micrometer.core.instrument.MeterRegistry
|
||||
import io.quarkus.redis.datasource.RedisDataSource
|
||||
import io.quarkus.runtime.StartupEvent
|
||||
import scala.jdk.CollectionConverters.*
|
||||
import org.jboss.logging.Logger
|
||||
import scala.compiletime.uninitialized
|
||||
import java.time.Instant
|
||||
import de.nowchess.coordinator.grpc.CoordinatorGrpcServer
|
||||
import de.nowchess.coordinator.dto.InstanceMetadata
|
||||
|
||||
@ApplicationScoped
|
||||
class HealthMonitor:
|
||||
@@ -32,6 +38,12 @@ class HealthMonitor:
|
||||
@Inject
|
||||
private var meterRegistry: MeterRegistry = uninitialized
|
||||
|
||||
@Inject
|
||||
private var grpcServerInstance: Instance[CoordinatorGrpcServer] = uninitialized
|
||||
|
||||
@Inject
|
||||
private var failoverService: FailoverService = uninitialized
|
||||
|
||||
private val log = Logger.getLogger(classOf[HealthMonitor])
|
||||
private var redisPrefix = "nowchess"
|
||||
// scalafix:on DisableSyntax.var
|
||||
@@ -40,6 +52,10 @@ class HealthMonitor:
|
||||
if kubeClientInstance.isUnsatisfied then None
|
||||
else Some(kubeClientInstance.get())
|
||||
|
||||
private def grpcServerOpt: Option[CoordinatorGrpcServer] =
|
||||
if grpcServerInstance.isUnsatisfied then None
|
||||
else Some(grpcServerInstance.get())
|
||||
|
||||
def setRedisPrefix(prefix: String): Unit =
|
||||
redisPrefix = prefix
|
||||
|
||||
@@ -48,6 +64,15 @@ class HealthMonitor:
|
||||
meterRegistry.counter("nowchess.coordinator.health.checks").increment(0)
|
||||
meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment(0)
|
||||
|
||||
def onStartup(@Observes ev: StartupEvent): Unit =
|
||||
instanceRegistry.loadAllFromRedis()
|
||||
val loaded = instanceRegistry.getAllInstances
|
||||
log.infof("Startup: loaded %d instances from Redis", loaded.size)
|
||||
if loaded.nonEmpty then
|
||||
val timeoutMs = config.startupValidationTimeout.toMillis
|
||||
Thread.ofVirtual().start(() => validateStartupInstances(timeoutMs))
|
||||
startPodWatch()
|
||||
|
||||
def checkInstanceHealth: Unit =
|
||||
meterRegistry.counter("nowchess.coordinator.health.checks").increment()
|
||||
val evicted = instanceRegistry.evictStaleInstances(config.instanceDeadTimeout)
|
||||
@@ -98,41 +123,32 @@ class HealthMonitor:
|
||||
true
|
||||
}
|
||||
|
||||
def watchK8sPods: Unit =
|
||||
private def startPodWatch(): Unit =
|
||||
kubeClientOpt match
|
||||
case None =>
|
||||
log.debug("Kubernetes client not available for pod watch")
|
||||
case None => log.debug("K8s client unavailable, skipping pod watch")
|
||||
case Some(kube) =>
|
||||
try
|
||||
val pods = kube
|
||||
kube
|
||||
.pods()
|
||||
.inNamespace(config.k8sNamespace)
|
||||
.withLabel(config.k8sRolloutLabelSelector)
|
||||
.list()
|
||||
.getItems
|
||||
.asScala
|
||||
.watch(new Watcher[Pod]:
|
||||
override def eventReceived(action: Watcher.Action, pod: Pod): Unit =
|
||||
action match
|
||||
case Watcher.Action.DELETED =>
|
||||
handlePodGone(pod)
|
||||
case Watcher.Action.MODIFIED if Option(pod.getMetadata.getDeletionTimestamp).isDefined =>
|
||||
handlePodTerminating(pod)
|
||||
case _ => ()
|
||||
|
||||
val instances = instanceRegistry.getAllInstances
|
||||
instances.foreach { inst =>
|
||||
val matchingPod = pods.find { pod =>
|
||||
pod.getMetadata.getName.contains(inst.instanceId)
|
||||
}
|
||||
|
||||
matchingPod match
|
||||
case Some(pod) =>
|
||||
val isReady = isPodReady(pod)
|
||||
if !isReady && inst.state == "HEALTHY" then
|
||||
meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment()
|
||||
log.warnf("Pod %s not ready, marking instance %s dead", pod.getMetadata.getName, inst.instanceId)
|
||||
instanceRegistry.markInstanceDead(inst.instanceId)
|
||||
deleteK8sPod(inst.instanceId)
|
||||
case None =>
|
||||
log.warnf("No pod found for instance %s, evicting from registry", inst.instanceId)
|
||||
instanceRegistry.removeInstance(inst.instanceId)
|
||||
}
|
||||
catch
|
||||
case ex: Exception =>
|
||||
log.warnf(ex, "Failed to watch k8s pods")
|
||||
override def onClose(cause: WatcherException): Unit =
|
||||
Option(cause).foreach { ex =>
|
||||
log.warnf(ex, "Pod watch closed, restarting")
|
||||
startPodWatch()
|
||||
},
|
||||
)
|
||||
log.info("Pod watch started")
|
||||
catch case ex: Exception => log.warnf(ex, "Failed to start pod watch")
|
||||
|
||||
private def isPodReady(pod: Pod): Boolean =
|
||||
Option(pod.getStatus)
|
||||
@@ -164,3 +180,50 @@ class HealthMonitor:
|
||||
catch
|
||||
case ex: Exception =>
|
||||
log.warnf(ex, "Failed to delete pod for instance %s", instanceId)
|
||||
|
||||
private def validateStartupInstances(timeoutMs: Long): Unit =
|
||||
Thread.sleep(timeoutMs)
|
||||
grpcServerOpt.foreach { grpcServer =>
|
||||
instanceRegistry.getAllInstances.foreach { inst =>
|
||||
if !grpcServer.hasActiveStream(inst.instanceId) then
|
||||
log.warnf(
|
||||
"Startup: instance %s did not reconnect within %dms — evicting",
|
||||
inst.instanceId,
|
||||
timeoutMs,
|
||||
)
|
||||
instanceRegistry.removeInstance(inst.instanceId)
|
||||
deleteK8sPod(inst.instanceId)
|
||||
}
|
||||
}
|
||||
|
||||
private def handlePodTerminating(pod: Pod): Unit =
|
||||
findRegisteredInstance(pod).foreach { inst =>
|
||||
if inst.state == "HEALTHY" then
|
||||
meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment()
|
||||
log.warnf(
|
||||
"Pod %s terminating — marking instance %s dead",
|
||||
pod.getMetadata.getName,
|
||||
inst.instanceId,
|
||||
)
|
||||
instanceRegistry.markInstanceDead(inst.instanceId)
|
||||
}
|
||||
|
||||
private def handlePodGone(pod: Pod): Unit =
|
||||
findRegisteredInstance(pod).foreach { inst =>
|
||||
log.warnf(
|
||||
"Pod %s deleted — triggering failover for %s",
|
||||
pod.getMetadata.getName,
|
||||
inst.instanceId,
|
||||
)
|
||||
failoverService
|
||||
.onInstanceStreamDropped(inst.instanceId)
|
||||
.subscribe()
|
||||
.`with`(
|
||||
_ => (),
|
||||
ex => log.warnf(ex, "Failover for %s failed", inst.instanceId),
|
||||
)
|
||||
}
|
||||
|
||||
private def findRegisteredInstance(pod: Pod): Option[InstanceMetadata] =
|
||||
val podName = pod.getMetadata.getName
|
||||
instanceRegistry.getAllInstances.find(inst => podName.contains(inst.instanceId))
|
||||
|
||||
+21
-1
@@ -4,6 +4,7 @@ import jakarta.annotation.PostConstruct
|
||||
import jakarta.enterprise.context.ApplicationScoped
|
||||
import jakarta.inject.Inject
|
||||
import io.quarkus.redis.datasource.ReactiveRedisDataSource
|
||||
import io.quarkus.redis.datasource.RedisDataSource
|
||||
import scala.jdk.CollectionConverters.*
|
||||
import scala.compiletime.uninitialized
|
||||
import com.fasterxml.jackson.databind.ObjectMapper
|
||||
@@ -19,7 +20,10 @@ class InstanceRegistry:
|
||||
// scalafix:off DisableSyntax.var
|
||||
@Inject
|
||||
private var redis: ReactiveRedisDataSource = uninitialized
|
||||
private var redisPrefix = "nowchess"
|
||||
|
||||
@Inject
|
||||
private var syncRedis: RedisDataSource = uninitialized
|
||||
private var redisPrefix = "nowchess"
|
||||
|
||||
@Inject
|
||||
private var meterRegistry: MeterRegistry = uninitialized
|
||||
@@ -42,6 +46,22 @@ class InstanceRegistry:
|
||||
def setRedisPrefix(prefix: String): Unit =
|
||||
redisPrefix = prefix
|
||||
|
||||
def loadAllFromRedis(): Unit =
|
||||
val keys = syncRedis.key(classOf[String]).keys(s"$redisPrefix:instances:*")
|
||||
keys.asScala.foreach { key =>
|
||||
val instanceId = key.stripPrefix(s"$redisPrefix:instances:")
|
||||
val json = syncRedis.value(classOf[String]).get(key)
|
||||
Option(json).foreach { jsonStr =>
|
||||
try
|
||||
val metadata = mapper.readValue(jsonStr, classOf[InstanceMetadata])
|
||||
instances.put(instanceId, metadata)
|
||||
log.infof("Startup: loaded instance %s from Redis", instanceId)
|
||||
catch
|
||||
case ex: Exception =>
|
||||
log.warnf(ex, "Startup: failed to parse instance %s", instanceId)
|
||||
}
|
||||
}
|
||||
|
||||
def getInstance(instanceId: String): Option[InstanceMetadata] =
|
||||
Option(instances.get(instanceId))
|
||||
|
||||
|
||||
@@ -1,3 +1,3 @@
|
||||
MAJOR=0
|
||||
MINOR=18
|
||||
MINOR=19
|
||||
PATCH=0
|
||||
|
||||
@@ -1138,3 +1138,60 @@
|
||||
|
||||
* Revert "feat: add authentication permissions for metrics endpoints in application.yml" ([a298417](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/a298417b9e4d68dc73bbf40be63d9484536e9f83))
|
||||
* Revert "refactor: update metrics paths formatting in application.yml for clarity" ([3870566](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/38705663498d5f47c40dafe2f26198589ede8656))
|
||||
## (2026-05-13)
|
||||
|
||||
### Features
|
||||
|
||||
* add authentication permissions for metrics endpoints in application.yml ([04edd4d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/04edd4d6fd8a63196c36f6d67992832febc9bebb))
|
||||
* add CORS configuration and reorder JWT settings in application.yml ([a49f9be](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/a49f9be146f04c14561c305d980846a92f8c12b2))
|
||||
* add GameRules stub with PositionStatus enum ([76d4168](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/76d4168038de23e5d6083d4e8f0504fbf31d15a3))
|
||||
* add initialization metrics for various services ([d438e97](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d438e97f32bdde0bfc63c1b4a8cc810cdd093166))
|
||||
* add MovedInCheck/Checkmate/Stalemate MoveResult variants (stub dispatch) ([8b7ec57](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/8b7ec57e5ea6ee1615a1883848a426dc07d26364))
|
||||
* add OpenTelemetry trace configuration with parentbased sampler ([3904d5a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3904d5ad8ad4930ddee65287a7bfab785a6148f5))
|
||||
* **config:** update application.yml for PostgreSQL and remove staging/production configurations ([2404e61](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2404e6164c3b50ffccbea5238d636060d6abe4d6))
|
||||
* **config:** update application.yml for staging and production environments ([6113432](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/6113432a14c476a3a0dfc0d449e17d023697f2ba))
|
||||
* configure logging and add OpenTelemetry support ([#49](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/49)) ([d57c488](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d57c4886612d1d92da0e1b79209fc83e6ef537a1))
|
||||
* implement GameRules with isInCheck, legalMoves, gameStatus ([94a02ff](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/94a02ff6849436d9496c70a0f16c21666dae8e4e))
|
||||
* implement legal castling ([#1](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/1)) ([00d326c](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/00d326c1ba67711fbe180f04e1100c3f01dd0254))
|
||||
* **logging:** add DEBUG/INFO/WARN logging across services (NCS-72) ([#41](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/41)) ([804a4bf](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/804a4bf179e3dfb19e2be4390e7e543caf5237c6))
|
||||
* NCS-10 Implement Pawn Promotion ([#12](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/12)) ([13bfc16](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/13bfc16cfe25db78ec607db523ca6d993c13430c))
|
||||
* NCS-11 50-move rule ([#9](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/9)) ([412ed98](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/412ed986a95703a3b282276540153480ceed229d))
|
||||
* NCS-13 Implement Threefold Repetition ([#31](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/31)) ([767d305](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/767d3051a76c266050b6335774d66e2db2273c16))
|
||||
* NCS-14 implemented insufficient moves rule ([#30](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/30)) ([b0399a4](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b0399a4e489950083066c9538df9a84dcc7a4613))
|
||||
* NCS-16 Core Separation via Patterns ([#10](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/10)) ([1361dfc](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/1361dfc89553b146864fb8ff3526cf12cf3f293a))
|
||||
* NCS-17 Implement basic ScalaFX UI ([#14](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/14)) ([3ff8031](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3ff80318b4f16c59733a46498581a5c27f048287))
|
||||
* NCS-21 Write Scripts to automate certain tasks ([#15](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/15)) ([8051871](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/80518719d536a087d339fe02530825dc07f8b388))
|
||||
* NCS-25 Add linters to keep quality up ([#27](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/27)) ([fd4e67d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/fd4e67d4f782a7e955822d90cb909d0a81676fb2))
|
||||
* NCS-37 Quarkus integration ([#35](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/35)) ([f088c4e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/f088c4e9ffcc498d3d1b6f01e8f50042d5830d55))
|
||||
* NCS-40 Rework Draw System ([#34](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/34)) ([33e785d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/33e785d22af87724839b62ae91dfe74a05b398c3))
|
||||
* NCS-41 Bot Platform ([#33](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/33)) ([8744bee](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/8744bee2dd20966dae90a09c21a43d5b06f59e00))
|
||||
* NCS-53 changed IO to MicroService for easier scaling ([#37](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/37)) ([b5a2966](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b5a2966adafa9650f0f7d601bdeb8fdd13710327))
|
||||
* NCS-6 Implementing FEN & PGN ([#7](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/7)) ([f28e69d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/f28e69dc181416aa2f221fdc4b45c2cda5efbf07))
|
||||
* NCS-78 Add Traceability to the Applications ([#48](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/48)) ([c96a09b](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/c96a09bb5cee59fc23205bb63baa8b217a7e1b00))
|
||||
* NCS-9 En passant implementation ([#8](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/8)) ([919beb3](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/919beb3b4bfa8caf2f90976a415fe9b19b7e9747))
|
||||
* **rule:** Rules as a microservice ([#39](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/39)) ([093134d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/093134d36c6844ba02a36a28d5d044f09291cd1d))
|
||||
* true-microservices ([#40](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/40)) ([5909242](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/590924254e8a2754de661a57a03e43f89ceb6299))
|
||||
* update application.yml with new API root paths and add Micrometer and OpenTelemetry dependencies ([72ce262](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/72ce262bc491f94297700e6002fb5d0812e2cc2a))
|
||||
* wire check/checkmate/stalemate into processMove and gameLoop ([5264a22](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5264a225418b885c5e6ea6411b96f85e38837f6c))
|
||||
|
||||
### Bug Fixes
|
||||
|
||||
* add missing kings to gameLoop capture test board ([aedd787](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/aedd787b77203c2af934751dba7b784eaf165032))
|
||||
* **auth:** change InternalAuthFilter to use @Singleton and add HTTP tests for secret validation ([c08d530](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/c08d5303eb9e70d36c8eebf6a061ccb71e118fe5))
|
||||
* **auth:** update InternalAuthFilter to use @ApplicationScoped and add index-dependency configuration ([6e0fd95](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/6e0fd9523e001756ce7109e639ebb54be4fcdabf))
|
||||
* correct test board positions and captureOutput/withInput interaction ([f0481e2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/f0481e2561b779df00925b46ee281dc36a795150))
|
||||
* **heartbeat:** inject ObjectMapper into InstanceHeartbeatService ([#42](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/42)) ([0c98151](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0c981517da1f94cd10ae396e47bde2b35d0b3ba0))
|
||||
* IO microservice ([#38](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/38)) ([a386f57](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/a386f57c21d34ead6cc6f92836c52b714597e289))
|
||||
* Lints ([dc224ab](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/dc224abe26acf5361c56956006e1cc51b75b0b7e))
|
||||
* **redis:** add max pool wait time and switch to ReactiveRedisDataSource for heartbeat updates ([33e5017](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/33e5017f51a998327b180f778f73964cc10c05d3))
|
||||
* **redis:** enhance GameRedisSubscriberManager to use ReactiveRedisDataSource and improve subscription handling ([0eb752d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0eb752d4935377f75aab710b7f4eda4b29098e6a))
|
||||
* **redis:** prevent concurrent Redis heartbeat refreshes using AtomicBoolean ([847b132](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/847b13202cb909d18ca3304c27ebe17ce2312b8e))
|
||||
* **redis:** simplify refreshRedisHeartbeat logic and ensure proper error handling ([1813ea1](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/1813ea1d2d5d093f7925f87371b5e29820bf1136))
|
||||
* **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
|
||||
* update main class path in build configuration and adjust VCS directory mapping ([7b1f8b1](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/7b1f8b117623d327232a1a92a8a44d18582e0189))
|
||||
* update move validation to check for king safety ([#13](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/13)) ([e5e20c5](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/e5e20c566e368b12ca1dc59680c34e9112bf6762))
|
||||
|
||||
### Reverts
|
||||
|
||||
* Revert "feat: add authentication permissions for metrics endpoints in application.yml" ([a298417](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/a298417b9e4d68dc73bbf40be63d9484536e9f83))
|
||||
* Revert "refactor: update metrics paths formatting in application.yml for clarity" ([3870566](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/38705663498d5f47c40dafe2f26198589ede8656))
|
||||
|
||||
@@ -76,6 +76,11 @@ nowchess:
|
||||
quarkus:
|
||||
http:
|
||||
root-path: /api
|
||||
cors:
|
||||
~: true
|
||||
origins: ${CORS_ORIGINS}
|
||||
methods: GET,POST,PUT,DELETE,OPTIONS
|
||||
headers: Content-Type,Accept,Authorization
|
||||
log:
|
||||
console:
|
||||
json: true
|
||||
@@ -86,19 +91,6 @@ nowchess:
|
||||
exporter:
|
||||
otlp:
|
||||
endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317}
|
||||
mp:
|
||||
jwt:
|
||||
verify:
|
||||
publickey:
|
||||
location: ${JWT_PUBLIC_KEY_PATH:keys/public.pem}
|
||||
issuer: nowchess
|
||||
quarkus:
|
||||
http:
|
||||
cors:
|
||||
~: true
|
||||
origins: ${CORS_ORIGINS}
|
||||
methods: GET,POST,PUT,DELETE,OPTIONS
|
||||
headers: Content-Type,Accept,Authorization
|
||||
grpc:
|
||||
clients:
|
||||
rule-grpc:
|
||||
@@ -117,6 +109,12 @@ nowchess:
|
||||
url: ${RULE_SERVICE_URL}
|
||||
store-service:
|
||||
url: ${STORE_SERVICE_URL}
|
||||
mp:
|
||||
jwt:
|
||||
verify:
|
||||
publickey:
|
||||
location: ${JWT_PUBLIC_KEY_PATH:keys/public.pem}
|
||||
issuer: nowchess
|
||||
nowchess:
|
||||
redis:
|
||||
host: ${REDIS_HOST}
|
||||
|
||||
@@ -1,3 +1,3 @@
|
||||
MAJOR=0
|
||||
MINOR=36
|
||||
MINOR=37
|
||||
PATCH=0
|
||||
|
||||
Reference in New Issue
Block a user