Janis f327441089 feat(coordinator): scaffold microservice for <300ms failover and load balancing

- Add coordinator module with gRPC stream-based instance health detection
- Implement InstanceHeartbeatService in core: bidirectional stream to coordinator every 200ms
- Track game subscriptions per core via Redis Sets (SADD/SREM)
- Add gRPC handlers for batch resubscribe/unsubscribe/evict/drain operations
- Implement coordinator services: InstanceRegistry, FailoverService, LoadBalancer, AutoScaler, CacheEvictionManager
- Add REST API for metrics and manual failover/rebalance/scaling
- Proto definition: coordinator_service.proto with HeartbeatStream + batch game operations
- Failover timeline: gRPC stream drop (50-200ms) → game migration (<300ms target)
- Support for Argo Rollouts auto-scaling (k8s CRD patching via Fabric8 client)

Note: Proto compilation issues documented in COORDINATOR_IMPLEMENTATION.md. Requires:
- Add task dependency: tasks.compileScala dependsOn tasks.compileJava
- Fix deprecated @Inject var = _ → = uninitialized syntax
- Implement remaining service methods (gRPC clients, FailoverService distribution)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

2026-04-26 08:34:53 +02:00

Coordinator Microservice Implementation Guide

Status: Proto Compilation Blockers (Fixable)

Completed: Module scaffold, InstanceHeartbeatService, GameRedisSubscriberManager updates, gRPC handlers, REST API stubs.

Blocking: Proto file → Java stubs not resolving in Scala imports. Solution documented below.

Architecture

Goal: <300ms failover via gRPC bidirectional stream detection + sub-1s game migration.

Core Flow:

Core sends HeartbeatFrame every 200ms on stream to coordinator
Core maintains the {prefix}:instance:{id}:games Redis Set (SADD on subscribe, SREM on unsubscribe)
Core refreshes the {prefix}:instances:{id} Redis key every 2s (5s TTL)
Coordinator watches the stream; on drop → immediate failover
Failover: SMEMBERS {prefix}:instance:{id}:games, then call BatchResubscribeGames on healthy cores
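As a minimal sketch, the two key layouts from the flow above could be kept in one helper so that core and coordinator build them the same way. `RedisKeys` and its method names are hypothetical; only the key patterns themselves come from this guide.

```scala
// Hypothetical helper: the names are illustrative, the key patterns are from the Core Flow list.
object RedisKeys:
  /** Liveness key, refreshed every 2s with a 5s TTL. */
  def instanceKey(prefix: String, instanceId: String): String =
    s"$prefix:instances:$instanceId"

  /** Redis Set holding the game IDs this instance is subscribed to. */
  def gamesKey(prefix: String, instanceId: String): String =
    s"$prefix:instance:$instanceId:games"
```

With the prefix `nowchess` and an instance ID of `core-1`, `gamesKey` yields `nowchess:instance:core-1:games`, which matches the key shape used in the testing checklist.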

Key Insight: Three detection signals (gRPC stream, Redis TTL, k8s watch), but gRPC stream drop is primary (50–200ms detection).

Proto Compilation Fix

Problem

Scala code imports de.nowchess.coordinator.HeartbeatFrame, but the classes generated by the proto plugin are not visible to the Scala compiler under the current Gradle setup.

Solution

The Quarkus gRPC plugin generates Java stubs in build/generated/sources/protobuf/java/ during the quarkusGenerateCode task. These are compiled to .class files, but the Scala compiler can't find them at compile time because they are not on its classpath early enough.

Fix: Add proto compilation order dependency in both modules:

modules/coordinator/build.gradle.kts and modules/core/build.gradle.kts:

tasks.compileScala {
    dependsOn(tasks.named("compileJava")) // Ensures Java stubs compiled first
}

Also ensure proto is on sourceSets:

sourceSets {
    main {
        proto {
            srcDir("src/main/proto")
        }
    }
}

Quarkus v3.x should handle this automatically, but an explicit dependency helps.

Alternative: Use Generated Java Classes Directly

If the proto stubs are still not found, reference them exactly as generated:

// If this grouped import does not resolve...
import de.nowchess.coordinator.{
  CoordinatorServiceGrpc,
  HeartbeatFrame,
  // ...
}

// ...fall back to fully qualified names, after checking the actual generated names
val frame = de.nowchess.coordinator.HeartbeatFrame.newBuilder()
  .setInstanceId("...")
  .build()

Run ./gradlew clean modules:coordinator:compileJava to regenerate and inspect build/generated/sources/protobuf/java/de/nowchess/coordinator/ to see actual class names.

Code Quality Issues (Non-Blocking)

Fix in coordinator services (already have = _ deprecation warnings):

// OLD
@Inject var redissonClient: RedissonClient = _

// NEW
import scala.compiletime.uninitialized
@Inject var redissonClient: RedissonClient = uninitialized

Jakarta optional injection:

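// Old (doesn't work: jakarta.inject.Inject has no `optional` attribute)
@Inject(optional = true) var kubeClient: KubernetesClient = _

// Better (plain injection plus a defensive null check)
// Note: CDI fails at startup if no KubernetesClient bean exists at all;
// for a truly optional dependency, inject jakarta.enterprise.inject.Instance[KubernetesClient]
@Inject var kubeClient: KubernetesClient = null
if (kubeClient != null) { ... }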

Method params in private helpers: Remove unused params in scaleUp(), scaleDown(), rebalance().

Missing Implementation (Phase 2)

1. InstanceHeartbeatService (DONE, needs testing)

Startup: generate instanceId, open gRPC stream, schedule heartbeats
Every 200ms: send HeartbeatFrame via stream
Every 2s: refresh Redis TTL on {prefix}:instances:{id}
addGameSubscription(gameId) → SADD {prefix}:instance:{id}:games {gameId}
removeGameSubscription(gameId) → SREM {prefix}:instance:{id}:games {gameId}
Shutdown: cleanup Redis + stream
Test: Kill core JVM, verify coordinator detects within 300ms
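The two periodic duties above (200ms heartbeat, 2s TTL refresh) can be sketched with a plain `ScheduledExecutorService`. This is an illustration only, not the actual InstanceHeartbeatService: `HeartbeatScheduler`, `sendHeartbeat` and `refreshTtl` are hypothetical names, and the two callbacks stand in for the real gRPC stream write and Redis call.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

/** Runs two periodic tasks, mirroring the 200ms heartbeat and the 2s TTL refresh.
  * `sendHeartbeat` and `refreshTtl` are placeholders for the real stream and Redis calls.
  */
final class HeartbeatScheduler(
    sendHeartbeat: () => Unit,
    refreshTtl: () => Unit,
    heartbeatMs: Long = 200,
    ttlRefreshMs: Long = 2000
):
  private val executor: ScheduledExecutorService = Executors.newScheduledThreadPool(2)

  def start(): Unit =
    executor.scheduleAtFixedRate(() => safely(sendHeartbeat), 0, heartbeatMs, TimeUnit.MILLISECONDS)
    executor.scheduleAtFixedRate(() => safely(refreshTtl), 0, ttlRefreshMs, TimeUnit.MILLISECONDS)

  def stop(): Unit =
    executor.shutdownNow()

  // scheduleAtFixedRate stops rescheduling a task once it throws, so failures are caught here.
  private def safely(task: () => Unit): Unit =
    try task()
    catch case e: Exception => System.err.println(s"periodic task failed: ${e.getMessage}")
```

Catching exceptions inside each task matters here: a single failed Redis call would otherwise silently end all further TTL refreshes, and the instance key would expire while the core is still healthy.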

2. Coordinator HealthMonitor (skeleton done)

Watch gRPC streams: on onError() or onCompleted(), mark instance DEAD
Fallback: poll Redis heartbeat TTL expiry every 5s
Fallback: k8s pod watch for label app=nowchess-core, detect NotReady status
Decision: if gRPC drop → immediate failover (no wait)
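With three detection signals, the same dead instance may be reported more than once (stream drop first, TTL expiry and pod status later). One possible guard, sketched below under the assumption that failover should run once per outage, is a small tracker that reports only the first signal. `DeadInstanceTracker` is a hypothetical name and not part of the existing skeleton.

```scala
import java.util.concurrent.ConcurrentHashMap

/** Remembers which instances are already marked DEAD so failover runs once per outage. */
final class DeadInstanceTracker:
  private val dead = ConcurrentHashMap.newKeySet[String]()

  /** Returns true only for the first report of this instance; later reports return false. */
  def markDead(instanceId: String): Boolean =
    dead.add(instanceId)

  /** Call when the instance registers again, so a later outage triggers failover anew. */
  def markAlive(instanceId: String): Unit =
    dead.remove(instanceId)
```

A caller would then run failover only when `markDead` returns true, regardless of which of the three signals arrived first.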

3. Coordinator FailoverService (partial)

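import scala.jdk.CollectionConverters.*

def onInstanceStreamDropped(instanceId: String): Unit =
  // SMEMBERS {prefix}:instance:{id}:games
  val gameSet = redissonClient.getSet[String](s"$prefix:instance:$instanceId:games")
  val gameIds = gameSet.readAll().asScala.toList
  val healthyInstances = getAllHealthyInstances()

  if (gameIds.nonEmpty && healthyInstances.nonEmpty) {
    // Split evenly; ceiling division keeps the batch size >= 1 and
    // yields at most one batch per healthy instance
    val batchSize = (gameIds.size + healthyInstances.size - 1) / healthyInstances.size
    gameIds.grouped(batchSize).zipWithIndex.foreach {
      case (batch, idx) =>
        val target = healthyInstances(idx % healthyInstances.size)
        target.grpcStub.batchResubscribeGames(batch)
    }
    gameSet.delete() // DEL, only after the games have been handed over
  }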
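The batching step is easiest to verify in isolation. The sketch below is a pure function, assuming instances are identified by plain string IDs; `distribute` is a hypothetical name. It uses ceiling division so the batch size is never zero (which would make `grouped` throw) and never produces more batches than targets.

```scala
/** Splits `gameIds` into at most `targets.size` batches of near-equal size.
  * Returns an empty map when there is nothing to move or nowhere to move it.
  */
def distribute(gameIds: List[String], targets: List[String]): Map[String, List[String]] =
  if gameIds.isEmpty || targets.isEmpty then Map.empty
  else
    // Ceiling division: batch size >= 1 and batch count <= targets.size.
    val batchSize = (gameIds.size + targets.size - 1) / targets.size
    gameIds
      .grouped(batchSize)
      .zip(targets.iterator)
      .map((batch, target) => target -> batch)
      .toMap
```

For five games and two targets this gives batches of three and two. With fewer games than targets, the surplus targets simply receive nothing.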

4. Coordinator gRPC Client Stubs (need manual integration)

Create modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoreGrpcClient.scala:

@ApplicationScoped
class CoreGrpcClient:
  @GrpcClient("core-grpc")
  private var coreStub: CoordinatorServiceGrpc.CoordinatorServiceBlockingStub = uninitialized
  
  def batchResubscribeGames(host: String, port: Int, gameIds: List[String]): Int =
    // Build request, call via dynamic stub to (host, port)
    // A blocking stub is needed here: it returns the response directly
    val response = coreStub.batchResubscribeGames(...)
    response.getSubscribedCount

The injected client is bound to one configured address; Quarkus doesn't support changing host:port at runtime by default, so a dynamic gRPC client is needed. Workaround: use io.grpc:grpc-netty-shaded and build a ManagedChannel directly:

val channel = ManagedChannelBuilder.forAddress(host, port).usePlaintext().build()
val stub = CoordinatorServiceGrpc.newBlockingStub(channel)
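Building a new channel for every call would be wasteful, so channels are usually cached per address. The following is a sketch only: `ChannelCache` is a hypothetical name, and the type parameter `C` stands in for `io.grpc.ManagedChannel` so the example carries no gRPC dependency.

```scala
import java.util.concurrent.ConcurrentHashMap

/** Caches one connection per (host, port).
  * `C` stands in for io.grpc.ManagedChannel; `open` and `close` are supplied by the caller.
  */
final class ChannelCache[C](open: (String, Int) => C, close: C => Unit):
  private val channels = new ConcurrentHashMap[(String, Int), C]()

  /** Returns the cached channel for this address, opening it on first use. */
  def get(host: String, port: Int): C =
    channels.computeIfAbsent((host, port), key => open(key._1, key._2))

  /** Drops and closes the channel of a core that has gone away. */
  def evict(host: String, port: Int): Unit =
    Option(channels.remove((host, port))).foreach(close)
```

In the coordinator, `open` would presumably wrap the `ManagedChannelBuilder.forAddress(host, port).usePlaintext().build()` call shown above and `close` would call `shutdown()` on the channel. Evicting on failover keeps dead cores from holding open connections.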

5. LoadBalancer.rebalance() (stub → full impl)

def rebalance(): Unit =
  val instances = listInstancesFromRedis()
  if (instances.nonEmpty) { // avoid division by zero
    val loads = instances.map(_.subscriptionCount)
    val mean = loads.sum.toDouble / loads.size
    val targetLoad = mean.toInt // assumption: rebalance toward the mean load
    
    val overloaded = instances.filter(_.subscriptionCount > maxGamesPerCore)
      .sortBy(-_.subscriptionCount) // most loaded first
    val underloaded = instances.filter(_.subscriptionCount < mean * 0.8)
      .sortBy(_.subscriptionCount)  // least loaded first
    
    overloaded.foreach { over =>
      val excess = over.subscriptionCount - targetLoad
      // TODO: headOption always picks the same receiver; spread across underloaded
      underloaded.headOption.foreach { under =>
        val toMove = getGamesToMove(over.instanceId, excess)
        over.coreGrpc.unsubscribeGames(toMove)
        under.coreGrpc.batchResubscribeGames(toMove)
        // Update Redis sets
      }
    }
  }
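The planning part of a rebalance can be separated from the gRPC calls and tested on plain numbers. The sketch below is a simplified variant, not the final algorithm: it treats the integer mean as the target load and fills the least-loaded instances first, rather than using the 0.8 threshold. `Move` and `planRebalance` are hypothetical names.

```scala
import scala.collection.mutable

final case class Move(from: String, to: String, count: Int)

/** Plans moves from instances above `maxPerCore` to the least-loaded instances.
  * `loads` maps instance ID to its current subscription count.
  */
def planRebalance(loads: Map[String, Int], maxPerCore: Int): List[Move] =
  if loads.isEmpty then Nil
  else
    val target = loads.values.sum / loads.size // integer mean as target load
    val current = mutable.Map.from(loads)      // running view of loads while planning
    val moves = List.newBuilder[Move]
    for (from, load) <- loads.toList.sortBy(-_._2) if load > maxPerCore do
      var excess = load - target
      // Fill receivers in ascending load order until the excess is gone.
      for to <- current.toList.sortBy(_._2).map(_._1) if to != from do
        val room = target - current(to)
        val n = math.min(room, excess)
        if n > 0 then
          moves += Move(from, to, n)
          current(to) += n
          current(from) -= n
          excess -= n
    moves.result()
```

Because the plan is just data, the coordinator could log or expose it through the REST API before executing any unsubscribe or resubscribe calls.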

6. AutoScaler (stub → k8s API calls)

def scaleUp(): Unit =
  if (kubeClient != null && config.autoScaleEnabled) {
    val rollout = kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
      .get()
    
    val newReplicas = rollout.getSpec.getReplicas + 1
    rollout.getSpec.setReplicas(newReplicas)
    kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
      .createOrReplace(rollout)
  }

Requires: io.fabric8:kubernetes-client:6.13.0 (already in build.gradle.kts).

7. CacheEvictionManager (stub → full impl)

def evictStaleGames(): Unit =
  val now = System.currentTimeMillis()
  // getKeysByPattern iterates with SCAN; a raw KEYS call would block Redis
  // (needs import scala.jdk.CollectionConverters.*)
  val keys = redissonClient.getKeys().getKeysByPattern(s"$prefix:game:entry:*").asScala
  
  keys.foreach { key =>
    val bucket = redissonClient.getBucket[String](key)
    val json = bucket.get()
    val lastUpdated = extractTimestamp(json) // Parse JSON
    
    if (now - lastUpdated > config.gameIdleThreshold.toMillis) {
      val gameId = key.stripPrefix(...)
      val instance = findInstanceWithGame(gameId)
      
      // Tell the owning core (if any) to drop the game before deleting the entry
      instance.foreach { inst =>
        inst.coreGrpc.evictGames(List(gameId))
      }
      bucket.delete()
    }
  }
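The staleness decision itself needs neither Redis nor gRPC and can be sketched as a pure function. `isStale` and `staleGames` are hypothetical names; how the last-update timestamp is parsed from the stored JSON is left open, as in the code above.

```scala
import java.time.{Duration, Instant}

/** True when the game has been idle for longer than `threshold`. */
def isStale(lastUpdated: Instant, now: Instant, threshold: Duration): Boolean =
  Duration.between(lastUpdated, now).compareTo(threshold) > 0

/** Returns the IDs of games idle beyond `threshold`, given the last update time per game. */
def staleGames(lastUpdated: Map[String, Instant], now: Instant, threshold: Duration): Set[String] =
  lastUpdated.collect { case (gameId, ts) if isStale(ts, now, threshold) => gameId }.toSet
```

Passing `now` in as a parameter, instead of reading the clock inside the function, makes the 45-minute idle case from the testing checklist reproducible without waiting.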

8. CoordinatorGrpcServer HeartbeatStream (stub → full impl)

override def heartbeatStream(
  responseObserver: StreamObserver[CoordinatorCommand]
): StreamObserver[HeartbeatFrame] =
  new StreamObserver[HeartbeatFrame]:
    private var lastInstanceId = ""
    
    override def onNext(frame: HeartbeatFrame): Unit =
      lastInstanceId = frame.getInstanceId
      instanceRegistry.updateInstanceFromRedis(lastInstanceId)
      
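    override def onError(t: Throwable): Unit =
      log.warnf(t, "Stream error for %s", lastInstanceId)
      // Skip failover if the stream died before the first frame identified the instance
      if (lastInstanceId.nonEmpty) failoverService.onInstanceStreamDropped(lastInstanceId)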
    
    override def onCompleted(): Unit =
      log.infof("Stream completed for %s", lastInstanceId)

Testing Checklist

Compile with proto fix
Start core + coordinator
Create game, subscribe core
Watch redis-cli SMEMBERS nowchess:instance:{id}:games → game appears
Kill core JVM via kill -9
Verify coordinator log shows "stream error" within 200ms
Verify second core receives batchResubscribeGames call within 300ms total
Create second core, rebalance load, verify games migrate
Scale up: verify Argo Rollout replica count increases
45min idle game: verify coordinator calls evictGames

File Checklist

✅ Created:

modules/coordinator/build.gradle.kts
modules/coordinator/src/main/proto/coordinator_service.proto
modules/coordinator/src/main/resources/application.yml
modules/coordinator/src/main/scala/de/nowchess/coordinator/config/CoordinatorConfig.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/dto/InstanceMetadata.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/service/InstanceRegistry.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/service/FailoverService.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/service/LoadBalancer.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/service/AutoScaler.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/service/HealthMonitor.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/service/CacheEvictionManager.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoordinatorGrpcServer.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/resource/CoordinatorResource.scala
modules/coordinator/src/main/scala/de/nowchess/coordinator/CoordinatorApp.scala
modules/core/src/main/proto/coordinator_service.proto
modules/core/src/main/scala/de/nowchess/chess/service/InstanceHeartbeatService.scala
modules/core/src/main/scala/de/nowchess/chess/grpc/CoordinatorServiceHandler.scala

✅ Modified:

settings.gradle.kts → added modules:coordinator
modules/core/src/main/resources/application.yml → added coordinator gRPC client + heartbeat config
modules/core/build.gradle.kts → (no changes, proto handled by quarkus-grpc)
modules/core/src/main/scala/de/nowchess/chess/redis/GameRedisSubscriberManager.scala → added InstanceHeartbeatService injection, SADD/SREM, batch ops

Next Steps (New Session)

Run ./gradlew clean modules:coordinator:compileScala with proto fix
Finish gRPC client stubs (Rollout, managed channels)
Implement FailoverService.distributeGames() with actual core gRPC calls
Implement LoadBalancer.rebalance() with game migration
Implement AutoScaler with k8s API
Implement CacheEvictionManager with timestamp parsing
Run integration tests (manual or @QuarkusTest)
Benchmark: create 5000 games, kill 1 core, measure failover time

Design Decisions (Record for Future)

gRPC stream as primary: TCP-level detection in <200ms, versus 5-30s for polling/TTL
Redis game sets: SADD/SREM for O(1) lookup vs scanning Redis per failover
Argo Rollouts not StatefulSet: Respects canary/blue-green; patch via Fabric8 GenericKubernetesResource
Batch gRPC calls: One call per target core vs 1:1 calls per game (saves RPC overhead)
No persistent subscriptions: On coordinator restart, gRPC reconnects auto-trigger resubscribe; best-effort is OK

Known Gaps

Error handling: what if batchResubscribeGames fails? Retry? Partial migration? (Add circuit breaker)
Coordinator HA: single instance. Add Quorum or K8s deployment with multiple replicas + leader election if needed
Metrics: no Prometheus exports yet (add via quarkus-micrometer)
Monitoring: logs only, no alerts on failover latency SLA violation
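For the first gap (a failing batchResubscribeGames call), one simple starting point is bounded retry with exponential backoff. This is a different and simpler technique than the circuit breaker mentioned above, and the sketch makes assumptions: `retryWithBackoff` is a hypothetical helper, and it blocks the calling thread with `Thread.sleep`, which eats into the failover budget.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

/** Runs `op` up to `maxAttempts` times, doubling the delay after each failure.
  * Returns the first success, or the last failure once attempts are exhausted.
  */
def retryWithBackoff[A](maxAttempts: Int, initialDelayMs: Long)(op: () => A): Try[A] =
  @tailrec
  def loop(attempt: Int, delayMs: Long): Try[A] =
    Try(op()) match
      case s @ Success(_)                           => s
      case f @ Failure(_) if attempt >= maxAttempts => f
      case Failure(_) =>
        Thread.sleep(delayMs) // blocking wait; keep delays small on the failover path
        loop(attempt + 1, delayMs * 2)
  loop(1, initialDelayMs)
```

Returning a `Try` leaves the partial-migration question to the caller: on a final `Failure`, the affected batch could be handed to another healthy core instead of being dropped.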
