# Coordinator Microservice Implementation Guide

## Status: Proto Compilation Blockers (Fixable)

**Completed**: Module scaffold, InstanceHeartbeatService, GameRedisSubscriberManager updates, gRPC handlers, REST API stubs.

**Blocking**: Java stubs generated from the proto file do not resolve in Scala imports. Solution documented below.

---

## Architecture

**Goal**: <300ms failover via gRPC bidirectional stream detection + sub-1s game migration.

**Core Flow**:
1. Core sends `HeartbeatFrame` every 200ms on a stream to the coordinator
2. Core maintains the `{prefix}:instance:{id}:games` Redis Set (SADD on subscribe, SREM on unsubscribe)
3. Core refreshes the `{prefix}:instances:{id}` Redis key every 2s (5s TTL)
4. Coordinator watches the stream; on drop → immediate failover
5. Failover: `SMEMBERS {prefix}:instance:{id}:games`, then call `BatchResubscribeGames` on healthy cores

**Key Insight**: There are three detection signals (gRPC stream, Redis TTL, k8s watch), but **gRPC stream drop is primary** (50–200ms detection).

---

## Proto Compilation Fix

### Problem

Scala code imports `de.nowchess.coordinator.HeartbeatFrame`, but the classes generated by the proto plugin are not visible to the Scala compiler under Gradle.

### Solution

The Quarkus gRPC plugin generates Java stubs in `build/generated/sources/protobuf/java/` during the `quarkusGenerateCode` task. These are compiled to `.class` files, but the Scala compiler can't find them at compile time because they are not on Scala's classpath early enough.

**Fix**: Add a proto compilation order dependency in both modules.

**modules/coordinator/build.gradle.kts** and **modules/core/build.gradle.kts**:

```gradle
tasks.compileScala {
    dependsOn(tasks.named("compileJava")) // Ensures Java stubs are compiled first
}
```

Also ensure the proto directory is on the sourceSets:

```gradle
sourceSets {
    main {
        proto {
            srcDir("src/main/proto")
        }
    }
}
```

Quarkus v3.x should handle this automatically, but the explicit dependency helps.
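
To tell "stubs were never generated" apart from "stubs exist but Scala sees them too late", a small classpath probe can help. This is a minimal sketch, not part of the module: `stubOnClasspath` and `checkStubs` are hypothetical names, and the class names are the ones imported elsewhere in this guide. It only gives a meaningful answer when run with the module's compiled Java output on the classpath (for example from a test).

```scala
// Returns true if the named class can be loaded from the current classpath.
def stubOnClasspath(className: String): Boolean =
  try
    Class.forName(className)
    true
  catch
    case _: ClassNotFoundException => false
    case _: LinkageError           => false // class present, but one of its dependencies is missing

// Hypothetical entry point: prints one line per expected generated class.
@main def checkStubs(): Unit =
  val expected = List(
    "de.nowchess.coordinator.HeartbeatFrame",
    "de.nowchess.coordinator.CoordinatorServiceGrpc"
  )
  expected.foreach { name =>
    val status = if stubOnClasspath(name) then "found" else "MISSING"
    println(s"$status  $name")
  }
```

If the classes are found here but `compileScala` still fails on the import, the problem is most likely task ordering rather than code generation.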
### Alternative: Use Generated Java Classes Directly

If the proto stubs are still not found, reference them **exactly as generated**:

```scala
// Don't try to import individual types
import de.nowchess.coordinator.{
  CoordinatorServiceGrpc,
  HeartbeatFrame,
  // ...
}

// Instead, use full paths or check actual generated names
val frame = de.nowchess.coordinator.HeartbeatFrame.newBuilder()
  .setInstanceId("...")
  .build()
```

Run `./gradlew clean modules:coordinator:compileJava` to regenerate, then inspect `build/generated/sources/protobuf/java/de/nowchess/coordinator/` to see the actual class names.

---

## Code Quality Issues (Non-Blocking)

**Fix in coordinator services** (these already emit `= _` deprecation warnings):

```scala
// OLD
@Inject
var redissonClient: RedissonClient = _

// NEW
import scala.compiletime.uninitialized

@Inject
var redissonClient: RedissonClient = uninitialized
```

**Jakarta optional injection**:

```scala
// Old (doesn't work: jakarta.inject.Inject has no `optional` attribute)
@Inject(optional = true)
var kubeClient: KubernetesClient = _

// Better (use null check)
@Inject
var kubeClient: KubernetesClient = null

if (kubeClient != null) { ... }
```

Note: CDI rejects an unsatisfied `@Inject` at build time, so the null check only helps when a bean exists but may be null. If the client bean can be absent entirely, `jakarta.enterprise.inject.Instance[KubernetesClient]` with `isResolvable` is the standard alternative.

**Method params in private helpers**: Remove unused params in `scaleUp()`, `scaleDown()`, `rebalance()`.

---

## Missing Implementation (Phase 2)

### 1. **InstanceHeartbeatService** (DONE, needs testing)

- [x] Startup: generate instanceId, open gRPC stream, schedule heartbeats
- [x] Every 200ms: send `HeartbeatFrame` via stream
- [x] Every 2s: refresh Redis TTL on `{prefix}:instances:{id}`
- [x] `addGameSubscription(gameId)` → `SADD {id}:games {gameId}`
- [x] `removeGameSubscription(gameId)` → `SREM {id}:games {gameId}`
- [x] Shutdown: cleanup Redis + stream
- [ ] **Test**: Kill core JVM, verify coordinator detects within 300ms

### 2. **Coordinator HealthMonitor** (skeleton done)

- [ ] Watch gRPC streams: on `onError()` or `onCompleted()`, mark instance DEAD
- [ ] Fallback: poll Redis heartbeat TTL expiry every 5s
- [ ] Fallback: k8s pod watch for label `app=nowchess-core`, detect NotReady status
- [ ] Decision: if gRPC drop → immediate failover (no wait)

### 3. **Coordinator FailoverService** (partial)

```scala
import scala.jdk.CollectionConverters.*

def onInstanceStreamDropped(instanceId: String): Unit =
  // `prefix` is assumed to come from config
  val games = redissonClient.getSet[String](s"$prefix:instance:$instanceId:games")
  val gameIds = games.readAll().asScala.toList // SMEMBERS
  val healthyInstances = getAllHealthyInstances()
  // Keep the set if nobody can take the games; an empty target list would also divide by zero
  if healthyInstances.nonEmpty then
    // Near-equal batches, one per healthy core; batch size is never 0
    val batchSize = math.max(1, math.ceil(gameIds.size.toDouble / healthyInstances.size).toInt)
    gameIds.grouped(batchSize).zipWithIndex.foreach { case (batch, idx) =>
      val target = healthyInstances(idx % healthyInstances.size)
      target.grpcStub.batchResubscribeGames(batch) // pseudocode: build the real request here
    }
    games.delete() // DEL
```

### 4. **Coordinator gRPC Client Stubs** (need manual integration)

Create **modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoreGrpcClient.scala**:

```scala
@ApplicationScoped
class CoreGrpcClient:
  // Blocking stub: the call below reads the response synchronously
  @GrpcClient("core-grpc")
  private var coreStub: CoordinatorServiceGrpc.CoordinatorServiceBlockingStub = uninitialized

  def batchResubscribeGames(host: String, port: Int, gameIds: List[String]): Int =
    // Build request, call via dynamic stub to (host, port)
    val response = coreStub.batchResubscribeGames(...)
    response.getSubscribedCount
```

A dynamic gRPC client is still needed: Quarkus doesn't support changing host:port at runtime by default.

**Workaround**: Use `io.grpc:grpc-netty-shaded` + `ManagedChannel` directly:

```scala
import io.grpc.ManagedChannelBuilder

// Cache one channel per (host, port) and shut it down when the core is removed
val channel = ManagedChannelBuilder.forAddress(host, port).usePlaintext().build()
val stub = CoordinatorServiceGrpc.newBlockingStub(channel)
```

### 5. **LoadBalancer.rebalance()** (stub → full impl)

```scala
def rebalance(): Unit =
  val instances = listInstancesFromRedis()
  if instances.nonEmpty then // avoids dividing by zero below
    val loads = instances.map(_.subscriptionCount)
    val mean = loads.sum.toDouble / loads.size
    val overloaded = instances.filter(_.subscriptionCount > maxGamesPerCore)
      .sortBy(-_.subscriptionCount) // descending
    val underloaded = instances.filter(_.subscriptionCount < mean * 0.8)
      .sortBy(_.subscriptionCount)
    overloaded.foreach { over =>
      val excess = over.subscriptionCount - targetLoad // targetLoad is still to be defined
      // TODO: this always picks the same target; update counts after each move
      underloaded.headOption.foreach { under =>
        val toMove = getGamesToMove(over.instanceId, excess)
        over.coreGrpc.unsubscribeGames(toMove)        // pseudocode
        under.coreGrpc.batchResubscribeGames(toMove)  // pseudocode
        // Update Redis sets
      }
    }
```

### 6. **AutoScaler** (stub → k8s API calls)

```scala
def scaleUp(): Unit =
  if (kubeClient != null && config.autoScaleEnabled) {
    // `Rollout` is assumed to be a typed Argo Rollouts custom-resource class;
    // the design decision below patches via GenericKubernetesResource instead
    val resource = kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
    val rollout = resource.get() // null if the Rollout does not exist
    if (rollout != null) {
      rollout.getSpec.setReplicas(rollout.getSpec.getReplicas + 1)
      resource.createOrReplace(rollout)
    }
  }
```

Requires: `io.fabric8:kubernetes-client:6.13.0` (already in build.gradle.kts).

### 7. **CacheEvictionManager** (stub → full impl)

```scala
import scala.jdk.CollectionConverters.*

def evictStaleGames(): Unit =
  val now = System.currentTimeMillis()
  // Pattern scan (KEYS-style); `prefix` is assumed to come from config
  val keys = redissonClient.getKeys.getKeysByPattern(s"$prefix:game:entry:*").asScala
  keys.foreach { key =>
    val bucket = redissonClient.getBucket[String](key)
    val json = bucket.get()
    val lastUpdated = extractTimestamp(json) // Parse JSON
    if (now - lastUpdated > config.gameIdleThreshold.toMillis) {
      val gameId = key.stripPrefix(...)
      val instance = findInstanceWithGame(gameId)
      instance.foreach { inst =>
        inst.coreGrpc.evictGames(List(gameId)) // pseudocode
      }
      bucket.delete()
    }
  }
```

### 8. **CoordinatorGrpcServer HeartbeatStream** (stub → full impl)

```scala
override def heartbeatStream(
    responseObserver: StreamObserver[CoordinatorCommand]
): StreamObserver[HeartbeatFrame] =
  new StreamObserver[HeartbeatFrame]:
    private var lastInstanceId = ""

    override def onNext(frame: HeartbeatFrame): Unit =
      lastInstanceId = frame.getInstanceId
      instanceRegistry.updateInstanceFromRedis(lastInstanceId)

    override def onError(t: Throwable): Unit =
      log.warnf(t, "Stream error for %s", lastInstanceId)
      // Skip failover if the stream dropped before any frame identified the instance
      if lastInstanceId.nonEmpty then failoverService.onInstanceStreamDropped(lastInstanceId)

    override def onCompleted(): Unit =
      // Graceful close; HealthMonitor (section 2) should still mark the instance DEAD
      log.infof("Stream completed for %s", lastInstanceId)
```

---

## Testing Checklist

- [ ] Compile with proto fix
- [ ] Start core + coordinator
- [ ] Create game, subscribe core
- [ ] Watch `redis-cli SMEMBERS nowchess:instance:{id}:games` → game appears
- [ ] Kill core JVM via `kill -9`
- [ ] Verify coordinator log shows "stream error" within 200ms
- [ ] Verify second core receives `batchResubscribeGames` call within 300ms total
- [ ] Create second core, rebalance load, verify games migrate
- [ ] Scale up: verify Argo Rollout replica count increases
- [ ] 45min idle game: verify coordinator calls `evictGames`

---

## File Checklist

✅ **Created**:
- `modules/coordinator/build.gradle.kts`
- `modules/coordinator/src/main/proto/coordinator_service.proto`
- `modules/coordinator/src/main/resources/application.yml`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/config/CoordinatorConfig.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/dto/InstanceMetadata.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/InstanceRegistry.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/FailoverService.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/LoadBalancer.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/AutoScaler.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/HealthMonitor.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/CacheEvictionManager.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoordinatorGrpcServer.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/resource/CoordinatorResource.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/CoordinatorApp.scala`
- `modules/core/src/main/proto/coordinator_service.proto`
- `modules/core/src/main/scala/de/nowchess/chess/service/InstanceHeartbeatService.scala`
- `modules/core/src/main/scala/de/nowchess/chess/grpc/CoordinatorServiceHandler.scala`

✅ **Modified**:
- `settings.gradle.kts` → added `modules:coordinator`
- `modules/core/src/main/resources/application.yml` → added coordinator gRPC client + heartbeat config
- `modules/core/build.gradle.kts` → (no changes, proto handled by quarkus-grpc)
- `modules/core/src/main/scala/de/nowchess/chess/redis/GameRedisSubscriberManager.scala` → added InstanceHeartbeatService injection, SADD/SREM, batch ops

---

## Next Steps (New Session)

1. Run `./gradlew clean modules:coordinator:compileScala` with proto fix
2. Finish gRPC client stubs (Rollout, managed channels)
3. Implement `FailoverService.distributeGames()` with actual core gRPC calls
4. Implement `LoadBalancer.rebalance()` with game migration
5. Implement `AutoScaler` with k8s API
6. Implement `CacheEvictionManager` with timestamp parsing
7. Run integration tests (manual or `@QuarkusTest`)
8. Benchmark: create 5000 games, kill 1 core, measure failover time

---

## Design Decisions (Record for Future)

- **gRPC stream as primary**: TCP-level detection in <200ms, versus 5-30s for polling/TTL
- **Redis game sets**: SADD/SREM for O(1) lookup vs scanning Redis per failover
- **Argo Rollouts not StatefulSet**: Respects canary/blue-green; patch via Fabric8 `GenericKubernetesResource`
- **Batch gRPC calls**: One call per target core vs 1:1 calls per game (saves RPC overhead)
- **No persistent subscriptions**: On coordinator restart, gRPC reconnects auto-trigger resubscribe; best-effort is OK

---

## Known Gaps

- Error handling: what if `batchResubscribeGames` fails? Retry? Partial migration? (Add circuit breaker)
- Coordinator HA: single instance. Add Quorum or K8s deployment with multiple replicas + leader election if needed
- Metrics: no Prometheus exports yet (add via `quarkus-micrometer`)
- Monitoring: logs only, no alerts on failover latency SLA violation
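
One possible shape for the first gap above (a failed `batchResubscribeGames`) is to fall back to the next healthy core and report whatever could not be placed. This is a sketch under assumptions, not the planned design: `MigrationResult` and `migrateWithFallback` are hypothetical names, targets are plain instance IDs, and the `resubscribe` function stands in for the real RPC. It is a bounded fallback, not a circuit breaker.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical result type: games placed per target, plus games no target accepted.
final case class MigrationResult(migrated: Map[String, List[String]], orphaned: List[String])

// `resubscribe(target, games)` is a stand-in for the batchResubscribeGames RPC (assumed shape).
def migrateWithFallback(
    games: List[String],
    targets: List[String],
    resubscribe: (String, List[String]) => Try[Unit]
): MigrationResult =
  if games.isEmpty || targets.isEmpty then MigrationResult(Map.empty, games)
  else
    // Round-robin assignment: game i is first offered to target i % n.
    val initial: List[(String, List[String])] =
      games.zipWithIndex
        .groupMap { case (_, i) => targets(i % targets.size) } { case (g, _) => g }
        .toList

    // Try each candidate once, in order; return the first target that accepts the batch.
    @tailrec
    def place(batch: List[String], candidates: List[String]): Option[String] =
      candidates match
        case Nil => None
        case head :: tail =>
          resubscribe(head, batch) match
            case Success(_) => Some(head)
            case Failure(_) => place(batch, tail)

    def step(acc: MigrationResult, entry: (String, List[String])): MigrationResult =
      val (preferred, batch) = entry
      val candidates = preferred :: targets.filterNot(_ == preferred)
      place(batch, candidates) match
        case Some(t) =>
          acc.copy(migrated = acc.migrated.updated(t, acc.migrated.getOrElse(t, Nil) ++ batch))
        case None =>
          acc.copy(orphaned = acc.orphaned ++ batch)

    initial.foldLeft(MigrationResult(Map.empty, Nil))(step)
```

Each batch is offered to every target at most once, so the number of RPCs stays bounded. The `orphaned` list is what a later retry, an alert, or a metric would be driven from; deleting the instance's game set should probably wait until it is empty.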