NowChessSystems/COORDINATOR_IMPLEMENTATION.md
Janis f327441089 feat(coordinator): scaffold microservice for <300ms failover and load balancing
- Add coordinator module with gRPC stream-based instance health detection
- Implement InstanceHeartbeatService in core: bidirectional stream to coordinator every 200ms
- Track game subscriptions per core via Redis Sets (SADD/SREM)
- Add gRPC handlers for batch resubscribe/unsubscribe/evict/drain operations
- Implement coordinator services: InstanceRegistry, FailoverService, LoadBalancer, AutoScaler, CacheEvictionManager
- Add REST API for metrics and manual failover/rebalance/scaling
- Proto definition: coordinator_service.proto with HeartbeatStream + batch game operations
- Failover timeline: gRPC stream drop (50-200ms) → game migration (<300ms target)
- Support for Argo Rollouts auto-scaling (k8s CRD patching via Fabric8 client)

Note: Proto compilation issues documented in COORDINATOR_IMPLEMENTATION.md. Requires:
- Add task dependency: tasks.compileScala dependsOn tasks.compileJava
- Fix deprecated @Inject var = _ → = uninitialized syntax
- Implement remaining service methods (gRPC clients, FailoverService distribution)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-26 08:34:53 +02:00


Coordinator Microservice Implementation Guide

Status: Proto Compilation Blockers (Fixable)

Completed: Module scaffold, InstanceHeartbeatService, GameRedisSubscriberManager updates, gRPC handlers, REST API stubs.

Blocking: Proto file → Java stubs not resolving in Scala imports. Solution documented below.


Architecture

Goal: <300ms failover via gRPC bidirectional stream detection + sub-1s game migration.

Core Flow:

  1. Core sends HeartbeatFrame every 200ms on stream to coordinator
  2. Core maintains the {prefix}:instance:{id}:games Redis Set (SADD on subscribe, SREM on unsubscribe)
  3. Core refreshes {prefix}:instances:{id} Redis key every 2s (5s TTL)
  4. Coordinator watches stream; on drop → immediate failover
  5. Failover: read SMEMBERS {prefix}:instance:{id}:games, then call BatchResubscribeGames on healthy cores

Key Insight: There are three detection signals (gRPC stream, Redis TTL, k8s watch), but the gRPC stream drop is the primary one (50-200ms detection).
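
The two Redis key shapes from the Core Flow are easy to mix up (instance vs instances), so a small helper may be worth having. This is a minimal sketch: the object name CoordinatorKeys and its method names are hypothetical and do not come from the codebase; only the key patterns are taken from steps 2 and 3.

```scala
// Hypothetical helper: centralizes the two Redis key shapes from the Core Flow.
object CoordinatorKeys:
  /** Set of game IDs a core is subscribed to (step 2). */
  def instanceGames(prefix: String, instanceId: String): String =
    s"$prefix:instance:$instanceId:games"

  /** Heartbeat key refreshed every 2s with a 5s TTL (step 3). */
  def instanceHeartbeat(prefix: String, instanceId: String): String =
    s"$prefix:instances:$instanceId"
```

With the prefix nowchess used in the testing checklist, instanceGames("nowchess", "core-1") yields nowchess:instance:core-1:games.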


Proto Compilation Fix

Problem

Scala code imports de.nowchess.coordinator.HeartbeatFrame, but the classes generated by the proto plugin are not visible to the Scala compiler under the default Gradle task ordering.

Solution

The Quarkus gRPC plugin generates Java stubs in build/generated/sources/protobuf/java/ during the quarkusGenerateCode task. These are compiled to .class files, but the Scala compiler can't find them at compile time because they are not on Scala's classpath early enough.

Fix: Add proto compilation order dependency in both modules:

modules/coordinator/build.gradle.kts and modules/core/build.gradle.kts:

tasks.compileScala {
    dependsOn(tasks.named("compileJava")) // Ensures Java stubs compiled first
}

Also ensure proto is on sourceSets:

sourceSets {
    main {
        proto {
            srcDir("src/main/proto")
        }
    }
}

Quarkus v3.x should handle this automatically, but explicit dependency helps.

Alternative: Use Generated Java Classes Directly

If the proto stubs are still not found, check the generated names before importing:

// Grouped imports only resolve once the stubs are on the classpath
// and the names match the generated classes exactly
import de.nowchess.coordinator.{
  CoordinatorServiceGrpc,
  HeartbeatFrame,
  // ...
}

// Until then, use fully qualified names to confirm the actual generated classes
val frame = de.nowchess.coordinator.HeartbeatFrame.newBuilder()
  .setInstanceId("...")
  .build()

Run ./gradlew clean modules:coordinator:compileJava to regenerate and inspect build/generated/sources/protobuf/java/de/nowchess/coordinator/ to see actual class names.


Code Quality Issues (Non-Blocking)

Fix in coordinator services (the = _ initializers already trigger deprecation warnings):

// OLD
@Inject var redissonClient: RedissonClient = _

// NEW
import scala.compiletime.uninitialized
@Inject var redissonClient: RedissonClient = uninitialized

Jakarta optional injection:

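// Old (doesn't compile: jakarta.inject.Inject has no `optional` attribute)
@Inject(optional = true) var kubeClient: KubernetesClient = _

// Better (use null check)
// Note: a plain @Inject is still mandatory for CDI, so the null check only helps
// if a bean always exists. If the client can be absent, inject
// jakarta.enterprise.inject.Instance[KubernetesClient] and check isResolvable.
@Inject var kubeClient: KubernetesClient = null
if (kubeClient != null) { ... }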

Method params in private helpers: Remove unused params in scaleUp(), scaleDown(), rebalance().


Missing Implementation (Phase 2)

1. InstanceHeartbeatService (DONE, needs testing)

  • Startup: generate instanceId, open gRPC stream, schedule heartbeats
  • Every 200ms: send HeartbeatFrame via stream
  • Every 2s: refresh Redis TTL on {prefix}:instances:{id}
  • addGameSubscription(gameId) → SADD {id}:games {gameId}
  • removeGameSubscription(gameId) → SREM {id}:games {gameId}
  • Shutdown: cleanup Redis + stream
  • Test: Kill core JVM, verify coordinator detects within 300ms
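
The two periodic duties in the list above (a frame every 200ms, a TTL refresh every 2s) could be scheduled roughly as follows. This is a sketch, not the actual InstanceHeartbeatService: the class name HeartbeatScheduler is hypothetical, and the two callbacks stand in for the real stream send and Redis refresh.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Hypothetical class: the callbacks stand in for the real gRPC send and Redis TTL refresh.
final class HeartbeatScheduler(sendFrame: () => Unit, refreshTtl: () => Unit):
  private val executor: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  def start(): Unit =
    // Heartbeat frame every 200ms; Redis TTL refresh every 2s.
    executor.scheduleAtFixedRate(() => safely(sendFrame), 0L, 200L, TimeUnit.MILLISECONDS)
    executor.scheduleAtFixedRate(() => safely(refreshTtl), 0L, 2000L, TimeUnit.MILLISECONDS)
    ()

  def stop(): Unit =
    executor.shutdownNow()
    ()

  // An uncaught exception would cancel the periodic task, so log and keep going.
  private def safely(action: () => Unit): Unit =
    try action()
    catch case e: Exception => System.err.println(s"heartbeat task failed: ${e.getMessage}")
```

Both tasks share one thread here, so a slow Redis call would delay heartbeats; separate executors may be preferable if the 200ms cadence has to hold under load.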

2. Coordinator HealthMonitor (skeleton done)

  • Watch gRPC streams: on onError() or onCompleted(), mark instance DEAD
  • Fallback: poll Redis heartbeat TTL expiry every 5s
  • Fallback: k8s pod watch for label app=nowchess-core, detect NotReady status
  • Decision: if gRPC drop → immediate failover (no wait)
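
The interplay between the primary signal (stream closed) and the TTL fallback can be modelled in memory before wiring in gRPC, Redis, or k8s. The sketch below is an assumption-laden illustration: HealthTracker, InstanceState, and all method names are hypothetical, and the 5s default mirrors the heartbeat TTL mentioned earlier.

```scala
import scala.collection.mutable

enum InstanceState:
  case Healthy, Dead

// Hypothetical in-memory tracker; the real HealthMonitor would also trigger failover.
// Not thread-safe: callers are assumed to serialize access.
final class HealthTracker(ttlMillis: Long = 5000L):
  private val lastSeen = mutable.Map.empty[String, Long]
  private val states = mutable.Map.empty[String, InstanceState]

  /** Record a heartbeat frame received at `nowMillis`. */
  def onHeartbeat(instanceId: String, nowMillis: Long): Unit =
    lastSeen(instanceId) = nowMillis
    states(instanceId) = InstanceState.Healthy

  /** Primary signal: the gRPC stream ended (onError or onCompleted). */
  def onStreamClosed(instanceId: String): Unit =
    states(instanceId) = InstanceState.Dead

  /** Fallback sweep: marks healthy instances dead once their last heartbeat
    * is older than the TTL, and returns the newly dead IDs. */
  def sweep(nowMillis: Long): List[String] =
    val expired = lastSeen.collect {
      case (id, seen)
          if nowMillis - seen > ttlMillis && states.get(id).contains(InstanceState.Healthy) =>
        id
    }.toList.sorted
    expired.foreach(id => states(id) = InstanceState.Dead)
    expired

  def stateOf(instanceId: String): Option[InstanceState] = states.get(instanceId)
```

Because sweep only reports instances that were still marked healthy, an instance already failed over via the stream signal is not failed over a second time by the fallback.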

3. Coordinator FailoverService (partial)

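def onInstanceStreamDropped(instanceId: String): Unit =
  val gameIds = SMEMBERS "{prefix}:instance:{id}:games"
  val healthyInstances = getAllHealthyInstances()
  
  // Nothing to migrate, or nowhere to migrate to: keep the set so a later retry can use it
  if (gameIds.nonEmpty && healthyInstances.nonEmpty) {
    // Split games evenly across healthy instances. Ceiling division keeps the
    // batch size >= 1 (grouped(0) throws) and yields at most one batch per instance.
    val batchSize = (gameIds.size + healthyInstances.size - 1) / healthyInstances.size
    gameIds.grouped(batchSize).zipWithIndex.foreach {
      case (batch, idx) =>
        val target = healthyInstances(idx % healthyInstances.size)
        call target.grpcStub.batchResubscribeGames(batch)
    }
    
    DEL "{prefix}:instance:{id}:games"
  }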
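
The distribution step can be isolated as a pure function, which makes it testable without Redis or gRPC. A minimal sketch, assuming plain round-robin assignment by index; the name distributeGames and its signature are hypothetical, and a load-aware variant would need the current subscription counts as input.

```scala
// Hypothetical pure helper: returns target instance ID -> batch of game IDs.
def distributeGames(
    gameIds: List[String],
    healthyInstanceIds: List[String]
): Map[String, List[String]] =
  if gameIds.isEmpty || healthyInstanceIds.isEmpty then Map.empty
  else
    val targets = healthyInstanceIds.toVector // constant-time indexing
    gameIds.zipWithIndex.groupMap { case (_, idx) => targets(idx % targets.size) } {
      case (gameId, _) => gameId
    }
```

Each map entry corresponds to one batchResubscribeGames call, so the number of RPCs is bounded by the number of healthy instances.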

4. Coordinator gRPC Client Stubs (need manual integration)

Create modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoreGrpcClient.scala:

@ApplicationScoped
class CoreGrpcClient:
  // Blocking stub: the call below waits for the response and reads a field from it
  @GrpcClient("core-grpc")
  private var coreStub: CoordinatorServiceGrpc.CoordinatorServiceBlockingStub = uninitialized
  
  def batchResubscribeGames(host: String, port: Int, gameIds: List[String]): Int =
    // Build request, call via dynamic stub to (host, port)
    val response = coreStub.batchResubscribeGames(...)
    response.getSubscribedCount

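A dynamic gRPC client is needed, because Quarkus doesn't support changing host:port at runtime by default. Workaround: use io.grpc:grpc-netty-shaded and build a ManagedChannel directly:

import io.grpc.ManagedChannelBuilder
// Reuse one channel per (host, port) and shut it down when the instance is removed
val channel = ManagedChannelBuilder.forAddress(host, port).usePlaintext().build()
val stub = CoordinatorServiceGrpc.newBlockingStub(channel)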

5. LoadBalancer.rebalance() (stub → full impl)

def rebalance(): Unit =
  val instances = listInstancesFromRedis()
  if instances.nonEmpty then // avoid division by zero below
    val loads = instances.map(_.subscriptionCount)
    val mean = loads.sum.toDouble / loads.size // avoid integer truncation
    
    val overloaded = instances.filter(_.subscriptionCount > maxGamesPerCore)
      .sortBy(-_.subscriptionCount) // most loaded first
    val underloaded = instances.filter(_.subscriptionCount < mean * 0.8)
      .sortBy(_.subscriptionCount) // least loaded first
    
    overloaded.foreach { over =>
      val excess = over.subscriptionCount - targetLoad
      // NOTE: this always picks the same least-loaded core; re-sort after
      // each move if several cores can be overloaded at once
      underloaded.headOption.foreach { under =>
        val toMove = getGamesToMove(over.instanceId, excess)
        over.coreGrpc.unsubscribeGames(toMove)
        under.coreGrpc.batchResubscribeGames(toMove)
        // Update Redis sets
      }
    }
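
One way to avoid sending every excess game to the same core is to plan the moves first and execute them afterwards. The sketch below is an assumption, not the project's algorithm: InstanceLoad, Move, and planRebalance are hypothetical names, and the target load is assumed to be the rounded-up mean capped at maxGamesPerCore, since the snippet above leaves targetLoad undefined.

```scala
final case class InstanceLoad(instanceId: String, subscriptionCount: Int)
final case class Move(from: String, to: String, gameCount: Int)

// Hypothetical planner: decides how many games to move, not which ones.
def planRebalance(instances: List[InstanceLoad], maxGamesPerCore: Int): List[Move] =
  if instances.isEmpty then Nil
  else
    val mean = instances.map(_.subscriptionCount).sum.toDouble / instances.size
    // Assumed definition of the target load: rounded-up mean, capped at the per-core maximum.
    val target = math.min(maxGamesPerCore, math.ceil(mean).toInt)
    // Free capacity of each underloaded instance (below 80% of the mean), least loaded first.
    var capacity = instances
      .filter(_.subscriptionCount < mean * 0.8)
      .sortBy(_.subscriptionCount)
      .map(i => i.instanceId -> (target - i.subscriptionCount))
      .filter(_._2 > 0)
    val moves = List.newBuilder[Move]
    val overloaded = instances
      .filter(_.subscriptionCount > maxGamesPerCore)
      .sortBy(-_.subscriptionCount)
    for over <- overloaded do
      var excess = over.subscriptionCount - target
      while excess > 0 && capacity.nonEmpty do
        val (underId, free) = capacity.head
        val n = math.min(excess, free)
        moves += Move(over.instanceId, underId, n)
        excess -= n
        // Drop the receiver once it is full, otherwise keep its remaining capacity.
        capacity = if free - n > 0 then (underId, free - n) :: capacity.tail else capacity.tail
    moves.result()
```

For loads of 100, 10, and 10 with maxGamesPerCore = 50, the plan moves 30 games to each of the two lightly loaded cores, leaving all three at 40.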

6. AutoScaler (stub → k8s API calls)

def scaleUp(): Unit =
  if (kubeClient != null && config.autoScaleEnabled) {
    val rollout = kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
      .get()
    
    val newReplicas = rollout.getSpec.getReplicas + 1
    rollout.getSpec.setReplicas(newReplicas)
    kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
      .createOrReplace(rollout)
  }

Requires: io.fabric8:kubernetes-client:6.13.0 (already in build.gradle.kts).

7. CacheEvictionManager (stub → full impl)

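// requires: import scala.jdk.CollectionConverters.*
def evictStaleGames(): Unit =
  val now = System.currentTimeMillis()
  // SCAN-based iteration; a plain KEYS call would block Redis on large keyspaces
  val keys = redissonClient.getKeys.getKeysByPattern("{prefix}:game:entry:*").asScala
  
  keys.foreach { key =>
    val bucket = redissonClient.getBucket[String](key)
    // The entry may disappear between the scan and the read, so guard against null
    Option(bucket.get()).foreach { json =>
      val lastUpdated = extractTimestamp(json) // Parse JSON
      
      if (now - lastUpdated > config.gameIdleThreshold.toMillis) {
        val gameId = key.stripPrefix(...)
        val instance = findInstanceWithGame(gameId)
        
        instance.foreach { inst =>
          inst.coreGrpc.evictGames(List(gameId))
        }
        bucket.delete()
      }
    }
  }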
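
The idle check itself can be factored out so that it is testable without Redis. A minimal sketch, assuming timestamps are epoch milliseconds; isStale and staleGames are hypothetical names, and how extractTimestamp parses the JSON is deliberately left out.

```scala
import java.time.Duration

// Hypothetical helper: lastUpdatedMillis is assumed to be an epoch-millisecond timestamp.
def isStale(lastUpdatedMillis: Long, nowMillis: Long, idleThreshold: Duration): Boolean =
  nowMillis - lastUpdatedMillis > idleThreshold.toMillis

/** Returns the IDs of games whose last update is older than the threshold, sorted by ID. */
def staleGames(
    lastUpdatedByGame: Map[String, Long],
    nowMillis: Long,
    idleThreshold: Duration
): List[String] =
  lastUpdatedByGame.collect {
    case (gameId, ts) if isStale(ts, nowMillis, idleThreshold) => gameId
  }.toList.sorted
```

A game updated exactly one threshold ago is not yet stale, because the comparison is strict.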

8. CoordinatorGrpcServer HeartbeatStream (stub → full impl)

override def heartbeatStream(
  responseObserver: StreamObserver[CoordinatorCommand]
): StreamObserver[HeartbeatFrame] =
  new StreamObserver[HeartbeatFrame]:
    private var lastInstanceId = ""
    
    override def onNext(frame: HeartbeatFrame): Unit =
      lastInstanceId = frame.getInstanceId
      instanceRegistry.updateInstanceFromRedis(lastInstanceId)
      
    override def onError(t: Throwable): Unit =
      log.warnf(t, "Stream error for %s", lastInstanceId)
      // Skip failover if the stream dropped before the first frame identified the instance
      if lastInstanceId.nonEmpty then failoverService.onInstanceStreamDropped(lastInstanceId)
    
    override def onCompleted(): Unit =
      log.infof("Stream completed for %s", lastInstanceId)
      responseObserver.onCompleted() // close the coordinator's half of the stream

Testing Checklist

  • Compile with proto fix
  • Start core + coordinator
  • Create game, subscribe core
  • Watch redis-cli SMEMBERS nowchess:instance:{id}:games → game appears
  • Kill core JVM via kill -9
  • Verify coordinator log shows "stream error" within 200ms
  • Verify second core receives batchResubscribeGames call within 300ms total
  • Create second core, rebalance load, verify games migrate
  • Scale up: verify Argo Rollout replica count increases
  • 45min idle game: verify coordinator calls evictGames

File Checklist

Created:

  • modules/coordinator/build.gradle.kts
  • modules/coordinator/src/main/proto/coordinator_service.proto
  • modules/coordinator/src/main/resources/application.yml
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/config/CoordinatorConfig.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/dto/InstanceMetadata.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/service/InstanceRegistry.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/service/FailoverService.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/service/LoadBalancer.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/service/AutoScaler.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/service/HealthMonitor.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/service/CacheEvictionManager.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoordinatorGrpcServer.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/resource/CoordinatorResource.scala
  • modules/coordinator/src/main/scala/de/nowchess/coordinator/CoordinatorApp.scala
  • modules/core/src/main/proto/coordinator_service.proto
  • modules/core/src/main/scala/de/nowchess/chess/service/InstanceHeartbeatService.scala
  • modules/core/src/main/scala/de/nowchess/chess/grpc/CoordinatorServiceHandler.scala

Modified:

  • settings.gradle.kts → added modules:coordinator
  • modules/core/src/main/resources/application.yml → added coordinator gRPC client + heartbeat config
  • modules/core/build.gradle.kts → (no changes, proto handled by quarkus-grpc)
  • modules/core/src/main/scala/de/nowchess/chess/redis/GameRedisSubscriberManager.scala → added InstanceHeartbeatService injection, SADD/SREM, batch ops

Next Steps (New Session)

  1. Run ./gradlew clean modules:coordinator:compileScala with proto fix
  2. Finish gRPC client stubs (Rollout, managed channels)
  3. Implement FailoverService.distributeGames() with actual core gRPC calls
  4. Implement LoadBalancer.rebalance() with game migration
  5. Implement AutoScaler with k8s API
  6. Implement CacheEvictionManager with timestamp parsing
  7. Run integration tests (manual or @QuarkusTest)
  8. Benchmark: create 5000 games, kill 1 core, measure failover time

Design Decisions (Record for Future)

  • gRPC stream as primary: TCP-level detection <200ms vs polling/TTL 5-30s trade-off
  • Redis game sets: SADD/SREM for O(1) lookup vs scanning Redis per failover
  • Argo Rollouts not StatefulSet: Respects canary/blue-green; patch via Fabric8 GenericKubernetesResource
  • Batch gRPC calls: One call per target core vs 1:1 calls per game (saves RPC overhead)
  • No persistent subscriptions: On coordinator restart, gRPC reconnects auto-trigger resubscribe; best-effort is OK

Known Gaps

  • Error handling: what if batchResubscribeGames fails? Retry? Partial migration? (Add circuit breaker)
  • Coordinator HA: single instance. Add Quorum or K8s deployment with multiple replicas + leader election if needed
  • Metrics: no Prometheus exports yet (add via quarkus-micrometer)
  • Monitoring: logs only, no alerts on failover latency SLA violation
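
For the first gap, one possible starting point before a full circuit breaker is a bounded retry with exponential backoff around each batch call. This is only a sketch of that option: retry is a hypothetical helper, `call` stands in for a batchResubscribeGames RPC, and the attempt count and backoff would have to be chosen with the 300ms failover target in mind.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper: `call` stands in for a batchResubscribeGames RPC.
// Tries up to `attemptsLeft` times, doubling the sleep between attempts.
@tailrec
def retry[A](attemptsLeft: Int, backoffMillis: Long)(call: () => A): Try[A] =
  Try(call()) match
    case success @ Success(_) => success
    case Failure(_) if attemptsLeft > 1 =>
      Thread.sleep(backoffMillis)
      retry(attemptsLeft - 1, backoffMillis * 2)(call)
    case failure => failure
```

If the final attempt still fails, the caller gets the Failure back and can hand the batch to a different healthy core instead of dropping it.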