Compare commits

..

2 Commits

Author SHA1 Message Date
TeamCity b47ad7ef89 ci: bump version with Build-90 2026-05-16 13:15:48 +00:00
Janis 2e4ba43597 feat: implement clock expiry scanning and handling for game records (#54)
Build & Test (NowChessSystems) TeamCity build finished
Reviewed-on: #54
2026-05-16 15:09:04 +02:00
5 changed files with 131 additions and 24 deletions
+47
View File
@@ -615,3 +615,50 @@
* streamline logging for evicted instances in InstanceRegistry ([10937e7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/10937e756a56e0e8fcf939decfdcaa4394506cc0))
* update grpcServer variable to use Instance wrapper and add optional access method ([d5c8da2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d5c8da20f8805199e920ea5afbd9cdb39a078e40))
* update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
## (2026-05-16)
### Features
* add coordinator startup validation and K8s pod watch ([81b045d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/81b045d01bb054a4bc9dc9e02fc30f814e756205))
* add initialization metrics for various services ([d438e97](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d438e97f32bdde0bfc63c1b4a8cc810cdd093166))
* add OpenTelemetry trace configuration with parentbased sampler ([3904d5a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3904d5ad8ad4930ddee65287a7bfab785a6148f5))
* add periodic health check to evict dead instances ([380a2cc](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/380a2cceeb5873bf93ff17a1e87d62408ef8e178))
* **config:** update application.yml for PostgreSQL and remove staging/production configurations ([2404e61](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2404e6164c3b50ffccbea5238d636060d6abe4d6))
* **config:** update application.yml for staging and production environments ([6113432](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/6113432a14c476a3a0dfc0d449e17d023697f2ba))
* configure logging and add OpenTelemetry support ([#49](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/49)) ([d57c488](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d57c4886612d1d92da0e1b79209fc83e6ef537a1))
* **docker:** add .dockerignore and .gitignore files for build exclusions ([c987d8e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/c987d8e258c0e6c4cfbdaa8381c64c410d7a2b83))
* **docker:** add Dockerfiles for building Quarkus application in native and JVM modes ([3f2d2bb](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3f2d2bb4c97fa8cddba66e1da4427c54236dfeed))
* **docker:** add Dockerfiles for Quarkus application in JVM and native modes ([34b9933](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/34b993304670cf2aa62cd2f6460cee7b9864b08e))
* implement clock expiry scanning and handling for game records ([#53](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/53)) ([8f9eb12](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/8f9eb12f663efabe4dc72b94394438652ad0ef02))
* implement clock expiry scanning and handling for game records ([#54](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/54)) ([2e4ba43](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2e4ba435978ef415b4ee2d7d2fc4af3b4e834b3d))
* implement periodic scaling checks and enhance instance management in AutoScaler ([3f12f69](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3f12f695f132b92f634d98df2c037292498b6e86))
* **logging:** add DEBUG/INFO/WARN logging across services (NCS-72) ([#41](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/41)) ([804a4bf](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/804a4bf179e3dfb19e2be4390e7e543caf5237c6))
* NCS-78 Add Traceability to the Applications ([#46](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/46)) ([649566e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/649566eb3fcf38f91c8896a739f74ea318af312d))
* NCS-78 Add Traceability to the Applications ([#47](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/47)) ([87dfc6c](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/87dfc6c2bcce7f7d58fc641bd8d468a2e584c108))
* scale up on high CPU load, not just subscription count ([255e2da](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/255e2da33c37e186ed14f2862f2d2e1b4adc59bf))
* true-microservices ([#40](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/40)) ([5909242](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/590924254e8a2754de661a57a03e43f89ceb6299))
### Bug Fixes
* add instance-dead-timeout configuration and update HealthMonitor to use it for stale instance eviction ([be0b710](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/be0b710543b542da5c301efef7d2d587d0ba758a))
* clean up code formatting and improve error handling in gRPC server and failover service ([ad9495a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/ad9495afa3e93593b57154a187346c9b01393911))
* coordinator auto-scaling, cache eviction, rebalancing, and grpc timeouts ([d0c7169](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d0c71693bb6f55fafdce5bcea0d5f38b9bb505ef))
* **coordinator:** refine type casting in rolloutSpec method ([#45](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/45)) ([d522f7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d522f7f6edf9c985f03dd16816439d4184f1a589))
* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#43](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/43)) ([fa3c6b2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/fa3c6b2886dc59c14c5dad834acc9b41e42023bb))
* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#44](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/44)) ([82d0b75](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/82d0b754be1075084944b466858672d944f9f7d8))
* **dependencies:** replace Fabric8 Kubernetes client with Quarkus Kubernetes client ([5f44570](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5f44570b357277d09f33b7296860c421e2e70ce0))
* don't block event loop during scale-down drain ([1d121c7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/1d121c727cbd4df477827cf64d065b7356a56e59))
* don't trigger scale-down if already at min replicas ([4b3b5e7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/4b3b5e7c4ed9b3cd4fe2490e9f268f2e3a0d9e85))
* enhance AutoScaler and InstanceRegistry for replica management and stale instance eviction ([b4920d3](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b4920d3817e58bda94d7764e608b856ce9a909f7))
* force-delete hanging pods and remove failed instances from registry ([960a419](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/960a419792e1161fb7241e465b7349efe4a10137))
* linter formatting and improve code readability ([4a36096](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/4a36096a5586e3f321d9c34c53e60d02bcc02c55))
* **middleware:** update paths for bot generation and stockfish configuration ([2dd0501](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2dd0501687db08dcd242359f6837125baf8a2fdc))
* NCS-84 More Verbose Logging ([#51](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/51)) ([4ad92ab](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/4ad92ab23698267f8faa59c4e18388d4a0042cca))
* **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
* remove corrupted instances immediately and evict dead instances ([43184d2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/43184d296da5a6a7b760ac90c2b739220d86bce3))
* replace null checks with Option in coordinator ([2b04d7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2b04d7fa713e06662bff5afe3fb3f9d04541ce51))
* scalafix violations in metrics check and health monitor ([b991878](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b99187821489b296d66da5a5a13f5d545b6045c6))
* scale up immediately when instance is lost ([43525d4](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/43525d41a3884c00f1db26bf3c8c4cd9a607c260))
* streamline logging for evicted instances in InstanceRegistry ([10937e7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/10937e756a56e0e8fcf939decfdcaa4394506cc0))
* update grpcServer variable to use Instance wrapper and add optional access method ([d5c8da2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d5c8da20f8805199e920ea5afbd9cdb39a078e40))
* update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
@@ -47,6 +47,8 @@ nowchess:
k8s-rollout-label-selector: "app=nowchess-core"
startup-validation-timeout: 15s
failover-wait-timeout: 30s
scale-cpu-threshold-percent: 0.8
scale-memory-threshold-percent: 0.8
---
# dev profile
@@ -62,3 +62,9 @@ trait CoordinatorConfig:
@WithName("failover-wait-timeout")
def failoverWaitTimeout: Duration
@WithName("scale-cpu-threshold-percent")
def scaleCpuThresholdPercent: Double
@WithName("scale-memory-threshold-percent")
def scaleMemoryThresholdPercent: Double
@@ -84,6 +84,34 @@ class AutoScaler:
}
// scalafix:on DisableSyntax.asInstanceOf
private def parseMillicores(s: String): Long =
if s.endsWith("n") then s.dropRight(1).toLongOption.map(_ / 1000000).getOrElse(0L)
else if s.endsWith("m") then s.dropRight(1).toLongOption.getOrElse(0L)
else s.toLongOption.map(_ * 1000).getOrElse(0L)
private def parseBytes(s: String): Long =
if s.endsWith("Ki") then s.dropRight(2).toLongOption.map(_ * 1024L).getOrElse(0L)
else if s.endsWith("Mi") then s.dropRight(2).toLongOption.map(_ * 1024L * 1024L).getOrElse(0L)
else if s.endsWith("Gi") then s.dropRight(2).toLongOption.map(_ * 1024L * 1024L * 1024L).getOrElse(0L)
else if s.endsWith("K") then s.dropRight(1).toLongOption.map(_ * 1000L).getOrElse(0L)
else if s.endsWith("M") then s.dropRight(1).toLongOption.map(_ * 1000L * 1000L).getOrElse(0L)
else if s.endsWith("G") then s.dropRight(1).toLongOption.map(_ * 1000L * 1000L * 1000L).getOrElse(0L)
else s.toLongOption.getOrElse(0L)
private def exceedsRatio(used: Long, request: Long, threshold: Double, resource: String, instanceId: String): Boolean =
if request <= 0 then false
else
val ratio = used.toDouble / request.toDouble
log.debugf(
"Instance %s %s: %d used / %d requested = %.0f%%",
instanceId,
resource,
used,
request,
ratio * 100,
)
ratio > threshold
// scalafix:off DisableSyntax.asInstanceOf
private def isResourceConstrained(instanceId: String): Boolean =
kubeClientOpt.fold(false) { kube =>
@@ -92,25 +120,37 @@ class AutoScaler:
kube.pods().inNamespace(config.k8sNamespace).withLabel(config.k8sRolloutLabelSelector).list().getItems.asScala
pods.find(_.getMetadata.getName.contains(instanceId)).exists { pod =>
try
val metricsRes = kube
.genericKubernetesResources(metricsApiVersion, "PodMetrics")
.inNamespace(config.k8sNamespace)
.withName(pod.getMetadata.getName)
.get()
val metricsMap = metricsRes.asInstanceOf[java.util.Map[String, AnyRef]]
Option(metricsMap.get("metrics"))
.map(_.asInstanceOf[java.util.Map[String, AnyRef]])
.flatMap(m => Option(m.get("containers")).map(_.asInstanceOf[java.util.List[AnyRef]]))
.filter(!_.isEmpty)
.map(_.get(0).asInstanceOf[java.util.Map[String, AnyRef]])
.flatMap(c => Option(c.get("usage")).map(_.asInstanceOf[java.util.Map[String, AnyRef]]))
.flatMap(u => Option(u.get("cpu")))
.map(_.toString)
.exists { cpuStr =>
val cpuMillis =
if cpuStr.endsWith("m") then cpuStr.dropRight(1).toLongOption.getOrElse(0L)
else cpuStr.toLongOption.map(_ * 1000).getOrElse(0L)
cpuMillis > 800
val requests = Option(pod.getSpec)
.flatMap(s => Option(s.getContainers))
.flatMap(cs => if cs.isEmpty then None else Option(cs.get(0)))
.flatMap(c => Option(c.getResources))
.flatMap(r => Option(r.getRequests))
val cpuRequestMillis = requests.flatMap(m => Option(m.get("cpu"))).map(q => parseMillicores(q.toString)).getOrElse(0L)
val memRequestBytes = requests.flatMap(m => Option(m.get("memory"))).map(q => parseBytes(q.toString)).getOrElse(0L)
if cpuRequestMillis <= 0 && memRequestBytes <= 0 then
log.debugf("No resource requests found for instance %s, skipping resource check", instanceId)
false
else
val metricsRes = kube
.genericKubernetesResources(metricsApiVersion, "PodMetrics")
.inNamespace(config.k8sNamespace)
.withName(pod.getMetadata.getName)
.get()
val metricsMap = metricsRes.asInstanceOf[java.util.Map[String, AnyRef]]
val usageOpt = Option(metricsMap.get("metrics"))
.map(_.asInstanceOf[java.util.Map[String, AnyRef]])
.flatMap(m => Option(m.get("containers")).map(_.asInstanceOf[java.util.List[AnyRef]]))
.filter(!_.isEmpty)
.map(_.get(0).asInstanceOf[java.util.Map[String, AnyRef]])
.flatMap(c => Option(c.get("usage")).map(_.asInstanceOf[java.util.Map[String, AnyRef]]))
usageOpt.exists { usage =>
val cpuUsed = Option(usage.get("cpu")).map(v => parseMillicores(v.toString)).getOrElse(0L)
val memUsed = Option(usage.get("memory")).map(v => parseBytes(v.toString)).getOrElse(0L)
exceedsRatio(cpuUsed, cpuRequestMillis, config.scaleCpuThresholdPercent, "CPU", instanceId) ||
exceedsRatio(memUsed, memRequestBytes, config.scaleMemoryThresholdPercent, "memory", instanceId)
}
catch case _: Exception => false
}
@@ -128,13 +168,25 @@ class AutoScaler:
if now - last >= 120000 && lastScaleTime.compareAndSet(last, now) then
val instances = instanceRegistry.getAllInstances.filter(_.state == "HEALTHY")
if instances.nonEmpty then
val avgLoad = instances.map(_.subscriptionCount).sum.toDouble / instances.size
val avgLoad = instances.map(_.subscriptionCount).sum.toDouble / instances.size
val scaleUpLoad = config.scaleUpThreshold * config.maxGamesPerCore
val scaleDownLoad = config.scaleDownThreshold * config.maxGamesPerCore
avgLoadRef.set(avgLoad)
val hasHighCpuOrMemory = instances.exists(inst => isResourceConstrained(inst.instanceId))
val constrainedInstance = instances.find(inst => isResourceConstrained(inst.instanceId))
val hasHighCpuOrMemory = constrainedInstance.isDefined
if avgLoad > config.scaleUpThreshold * config.maxGamesPerCore || hasHighCpuOrMemory then scaleUp()
else if avgLoad < config.scaleDownThreshold * config.maxGamesPerCore && instances.size > config.scaleMinReplicas
log.infof(
"Scale check: instances=%d avgLoad=%.1f scaleUpAt=%.1f scaleDownAt=%.1f resourceConstrained=%s",
instances.size,
avgLoad,
scaleUpLoad,
scaleDownLoad,
constrainedInstance.map(_.instanceId).getOrElse("none"),
)
if avgLoad > scaleUpLoad || hasHighCpuOrMemory then scaleUp()
else if avgLoad < scaleDownLoad && instances.size > config.scaleMinReplicas
then scaleDown()
private def patchRolloutReplicas(
+1 -1
View File
@@ -1,3 +1,3 @@
MAJOR=0
MINOR=25
MINOR=26
PATCH=0