Compare commits

..

4 Commits

Author SHA1 Message Date
TeamCity d41c03700c ci: bump version with Build-83 2026-05-13 17:06:04 +00:00
Janis 10937e756a fix: streamline logging for evicted instances in InstanceRegistry
Build & Test (NowChessSystems) TeamCity build finished
2026-05-13 18:39:32 +02:00
Janis 380a2cceeb feat: add periodic health check to evict dead instances
Build & Test (NowChessSystems) TeamCity build failed
Add quarkus-scheduler dependency and schedule health check every 10 seconds.
Dead instances (marked with state="DEAD") now automatically evicted instead of
accumulating indefinitely.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 18:25:12 +02:00
Janis 43184d296d fix: remove corrupted instances immediately and evict dead instances
Problem: Dead instances pile up indefinitely. Failed metadata parsing leaves stale data in registry. No cleanup mechanism exists.

Changes:
1. Remove instance from registry on parse failure (corrupted metadata = unrecoverable)
2. Evict instances with state="DEAD" on next health check (was only evicting by heartbeat age)

This prevents:
- Memory leak from accumulating dead/corrupted instances
- Stale data persisting after parse failures
- Dead instances blocking resources indefinitely

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 18:25:12 +02:00
5 changed files with 55 additions and 7 deletions
+35
View File
@@ -366,3 +366,38 @@
* replace null checks with Option in coordinator ([2b04d7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2b04d7fa713e06662bff5afe3fb3f9d04541ce51))
* update grpcServer variable to use Instance wrapper and add optional access method ([d5c8da2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d5c8da20f8805199e920ea5afbd9cdb39a078e40))
* update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
## (2026-05-13)
### Features
* add coordinator startup validation and K8s pod watch ([81b045d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/81b045d01bb054a4bc9dc9e02fc30f814e756205))
* add initialization metrics for various services ([d438e97](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d438e97f32bdde0bfc63c1b4a8cc810cdd093166))
* add OpenTelemetry trace configuration with parentbased sampler ([3904d5a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3904d5ad8ad4930ddee65287a7bfab785a6148f5))
* add periodic health check to evict dead instances ([380a2cc](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/380a2cceeb5873bf93ff17a1e87d62408ef8e178))
* **config:** update application.yml for PostgreSQL and remove staging/production configurations ([2404e61](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2404e6164c3b50ffccbea5238d636060d6abe4d6))
* **config:** update application.yml for staging and production environments ([6113432](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/6113432a14c476a3a0dfc0d449e17d023697f2ba))
* configure logging and add OpenTelemetry support ([#49](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/49)) ([d57c488](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d57c4886612d1d92da0e1b79209fc83e6ef537a1))
* **docker:** add .dockerignore and .gitignore files for build exclusions ([c987d8e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/c987d8e258c0e6c4cfbdaa8381c64c410d7a2b83))
* **docker:** add Dockerfiles for building Quarkus application in native and JVM modes ([3f2d2bb](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3f2d2bb4c97fa8cddba66e1da4427c54236dfeed))
* **docker:** add Dockerfiles for Quarkus application in JVM and native modes ([34b9933](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/34b993304670cf2aa62cd2f6460cee7b9864b08e))
* **logging:** add DEBUG/INFO/WARN logging across services (NCS-72) ([#41](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/41)) ([804a4bf](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/804a4bf179e3dfb19e2be4390e7e543caf5237c6))
* NCS-78 Add Traceability to the Applications ([#46](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/46)) ([649566e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/649566eb3fcf38f91c8896a739f74ea318af312d))
* NCS-78 Add Traceability to the Applications ([#47](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/47)) ([87dfc6c](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/87dfc6c2bcce7f7d58fc641bd8d468a2e584c108))
* true-microservices ([#40](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/40)) ([5909242](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/590924254e8a2754de661a57a03e43f89ceb6299))
### Bug Fixes
* add instance-dead-timeout configuration and update HealthMonitor to use it for stale instance eviction ([be0b710](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/be0b710543b542da5c301efef7d2d587d0ba758a))
* clean up code formatting and improve error handling in gRPC server and failover service ([ad9495a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/ad9495afa3e93593b57154a187346c9b01393911))
* **coordinator:** refine type casting in rolloutSpec method ([#45](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/45)) ([d522f7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d522f7f6edf9c985f03dd16816439d4184f1a589))
* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#43](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/43)) ([fa3c6b2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/fa3c6b2886dc59c14c5dad834acc9b41e42023bb))
* **coordinator:** use genericKubernetesResources API for Argo Rollout scaling ([#44](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/44)) ([82d0b75](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/82d0b754be1075084944b466858672d944f9f7d8))
* **dependencies:** replace Fabric8 Kubernetes client with Quarkus Kubernetes client ([5f44570](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5f44570b357277d09f33b7296860c421e2e70ce0))
* enhance AutoScaler and InstanceRegistry for replica management and stale instance eviction ([b4920d3](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b4920d3817e58bda94d7764e608b856ce9a909f7))
* **middleware:** update paths for bot generation and stockfish configuration ([2dd0501](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2dd0501687db08dcd242359f6837125baf8a2fdc))
* **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
* remove corrupted instances immediately and evict dead instances ([43184d2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/43184d296da5a6a7b760ac90c2b739220d86bce3))
* replace null checks with Option in coordinator ([2b04d7f](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2b04d7fa713e06662bff5afe3fb3f9d04541ce51))
* streamline logging for evicted instances in InstanceRegistry ([10937e7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/10937e756a56e0e8fcf939decfdcaa4394506cc0))
* update grpcServer variable to use Instance wrapper and add optional access method ([d5c8da2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d5c8da20f8805199e920ea5afbd9cdb39a078e40))
* update HealthMonitor to evict instances without associated pods ([0f41f13](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0f41f13ce68b76846684bab67241a122250dfaf9))
+1
View File
@@ -78,6 +78,7 @@ dependencies {
implementation("com.fasterxml.jackson.module:jackson-module-scala_3:${versions["JACKSON_SCALA"]!!}")
implementation("io.quarkus:quarkus-redis-client")
implementation("io.quarkus:quarkus-kubernetes-client")
implementation("io.quarkus:quarkus-scheduler")
testImplementation(platform("org.junit:junit-bom:${versions["JUNIT_BOM"]!!}"))
testImplementation("org.junit.jupiter:junit-jupiter")
@@ -5,6 +5,7 @@ import jakarta.enterprise.context.ApplicationScoped
import jakarta.enterprise.event.Observes
import jakarta.enterprise.inject.Instance
import jakarta.inject.Inject
import io.quarkus.scheduler.Scheduled
import de.nowchess.coordinator.config.CoordinatorConfig
import io.fabric8.kubernetes.client.KubernetesClient
import io.fabric8.kubernetes.api.model.Pod
@@ -73,7 +74,12 @@ class HealthMonitor:
Thread.ofVirtual().start(() => validateStartupInstances(timeoutMs))
startPodWatch()
def checkInstanceHealth: Unit =
@Scheduled(every = "10s")
def periodicHealthCheck(): Unit =
try checkInstanceHealth()
catch case ex: Exception => log.warnf(ex, "Health check failed")
def checkInstanceHealth(): Unit =
meterRegistry.counter("nowchess.coordinator.health.checks").increment()
val evicted = instanceRegistry.evictStaleInstances(config.instanceDeadTimeout)
if evicted.nonEmpty then
@@ -92,7 +92,9 @@ class InstanceRegistry:
Uni.createFrom().item(())
catch
case ex: Exception =>
log.warnf(ex, "Failed to parse instance metadata for %s", instanceId)
log.warnf(ex, "Failed to parse instance metadata for %s — removing from registry", instanceId)
instances.remove(instanceId)
meterRegistry.counter("nowchess.coordinator.instances.removed").increment()
Uni.createFrom().item(())
}
.onFailure()
@@ -112,15 +114,19 @@ class InstanceRegistry:
val stale = instances.asScala
.collect { case (id, inst) =>
try
if Instant.parse(inst.lastHeartbeat).isBefore(cutoff) then Some(id)
else None
val isHeartbeatStale = Instant.parse(inst.lastHeartbeat).isBefore(cutoff)
val isDead = inst.state == "DEAD"
if isHeartbeatStale || isDead then Some(id) else None
catch case _: Exception => None
}
.flatten
.toList
stale.foreach { id =>
instances.remove(id)
val inst = Option(instances.remove(id))
meterRegistry.counter("nowchess.coordinator.instances.evicted").increment()
log.warnf("Evicted stale instance %s (heartbeat older than %s)", id, maxAge)
inst.foreach { i =>
if i.state == "DEAD" then log.warnf("Evicted dead instance %s", id)
else log.warnf("Evicted stale instance %s (heartbeat older than %s)", id, maxAge)
}
}
stale
+1 -1
View File
@@ -1,3 +1,3 @@
MAJOR=0
MINOR=19
MINOR=20
PATCH=0