fix: force-delete hanging pods and remove failed instances from registry

When pod deletion failed, instances remained in the registry with state=DEAD,
preventing scale-down because the avgLoad calculation still counted them. Now:

- Use gracePeriod(0) for immediate pod deletion instead of 30s wait
  (prevents cascade when nodes are down or pods stuck terminating)
- Remove instance from registry on deletion failure anyway
  (prevents dead instances from blocking scale-down via avgLoad)
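To illustrate the second point, here is a minimal sketch (hypothetical `Instance` type and load figures, not the project's actual registry model) of why a dead entry left in the registry blocks scale-down: it retains its last-reported load, which inflates the average.

```scala
// Hypothetical registry model: a DEAD instance keeps its stale load value.
case class Instance(id: String, state: String, load: Double)

def avgLoad(instances: Seq[Instance]): Double =
  if (instances.isEmpty) 0.0
  else instances.map(_.load).sum / instances.size

val live = Seq(Instance("a", "ALIVE", 0.2), Instance("b", "ALIVE", 0.2))
// A stuck pod whose instance could not be removed from the registry:
val withDead = live :+ Instance("c", "DEAD", 0.9)

// avgLoad(live) is 0.2, but the stale DEAD entry pushes the average well
// above it, which can keep the autoscaler from ever scaling down.
```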

This breaks the cycle: failed deletions → scaleUp → max replicas →
more failures → more stuck instances blocking recovery.
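The fixed error path can be sketched as follows (hypothetical helper signatures for illustration; the real code calls `instanceRegistry.removeInstance` directly in the catch branch): even when pod deletion throws, the instance is dropped from the registry so its stale load can no longer block scale-down.

```scala
// Sketch of the fixed error path: deletion failures (node down, pod stuck
// terminating) no longer leave the instance registered.
def reapDeadInstance(instanceId: String,
                     forceDeletePod: String => Unit,     // e.g. withGracePeriod(0).delete()
                     removeFromRegistry: String => Unit): Boolean =
  try {
    forceDeletePod(instanceId)
    true
  } catch {
    case ex: Exception =>
      // Previously the instance stayed registered here; now it is removed
      // anyway so avgLoad stops counting it.
      removeFromRegistry(instanceId)
      false
  }
```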
2026-05-14 04:47:20 +02:00
parent 68d6c1d36f
commit 960a419792
@@ -185,14 +185,15 @@ class HealthMonitor:
         pods.find(pod => pod.getMetadata.getName.contains(instanceId)) match
           case Some(pod) =>
             val podName = pod.getMetadata.getName
-            kube.pods().inNamespace(config.k8sNamespace).withName(podName).delete()
+            kube.pods().inNamespace(config.k8sNamespace).withName(podName).withGracePeriod(0L).delete()
             meterRegistry.counter("nowchess.coordinator.pods.deleted").increment()
-            log.infof("Deleted pod %s for dead instance %s", podName, instanceId)
+            log.infof("Force-deleted pod %s for dead instance %s", podName, instanceId)
           case None =>
             log.debugf("No pod found for instance %s, skipping deletion", instanceId)
       catch
         case ex: Exception =>
-          log.warnf(ex, "Failed to delete pod for instance %s", instanceId)
+          log.warnf(ex, "Failed to delete pod for instance %s — removing from registry to prevent blocking scale-down", instanceId)
+          instanceRegistry.removeInstance(instanceId)
   private def validateStartupInstances(timeoutMs: Long): Unit =
     Thread.sleep(timeoutMs)