- Add coordinator module with gRPC stream-based instance health detection
- Implement InstanceHeartbeatService in core: bidirectional stream to coordinator every 200ms
- Track game subscriptions per core via Redis Sets (SADD/SREM)
- Add gRPC handlers for batch resubscribe/unsubscribe/evict/drain operations
- Implement coordinator services: InstanceRegistry, FailoverService, LoadBalancer, AutoScaler, CacheEvictionManager
- Add REST API for metrics and manual failover/rebalance/scaling
- Proto definition: coordinator_service.proto with HeartbeatStream + batch game operations
- Failover timeline: gRPC stream drop (50-200ms) → game migration (<300ms target)
- Support for Argo Rollouts auto-scaling (k8s CRD patching via Fabric8 client)

Note: Proto compilation issues documented in COORDINATOR_IMPLEMENTATION.md. Requires:
- Add task dependency: tasks.compileScala dependsOn tasks.compileJava
- Fix deprecated @Inject var = _ → = uninitialized syntax
- Implement remaining service methods (gRPC clients, FailoverService distribution)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Coordinator Microservice Implementation Guide
Status: Proto Compilation Blockers (Fixable)
Completed: Module scaffold, InstanceHeartbeatService, GameRedisSubscriberManager updates, gRPC handlers, REST API stubs.
Blocking: Proto file → Java stubs not resolving in Scala imports. Solution documented below.
Architecture
Goal: <300ms failover via gRPC bidirectional stream detection + sub-1s game migration.
Core Flow:
- Core sends HeartbeatFrame every 200ms on stream to coordinator
- Core posts {prefix}:instance:{id}:games Redis Set (SADD on subscribe, SREM on unsubscribe)
- Core refreshes {prefix}:instances:{id} Redis key every 2s (5s TTL)
- Coordinator watches stream; on drop → immediate failover
- Failover: get SMEMBERS {id}:games, call BatchResubscribeGames on healthy cores
Key Insight: Three detection signals (gRPC stream, Redis TTL, k8s watch), but gRPC stream drop is primary (50–200ms detection).
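The two Redis keys in the core flow differ only by instance versus instances, which is easy to mistype. A minimal sketch of key builders, assuming a hypothetical CoordinatorKeys object (the object and method names are illustrative and not taken from the codebase):

```scala
// Hypothetical helper: centralizes the Redis key layout described in the core flow.
object CoordinatorKeys {
  // Redis Set holding the game IDs subscribed on one core instance.
  def instanceGames(prefix: String, instanceId: String): String =
    s"$prefix:instance:$instanceId:games"

  // Liveness key that the core refreshes every 2s with a 5s TTL.
  def instanceHeartbeat(prefix: String, instanceId: String): String =
    s"$prefix:instances:$instanceId"
}
```

Both the core (SADD/SREM) and the coordinator (SMEMBERS on failover) could share such a helper so the layout is defined in one place.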
Proto Compilation Fix
Problem
The Scala code imports de.nowchess.coordinator.HeartbeatFrame, but the classes generated by the proto plugin are not visible to the Scala compiler in the Gradle build.
Solution
The Quarkus gRPC plugin generates Java stubs in build/generated/sources/protobuf/java/ during the quarkusGenerateCode task. These are compiled to .class files, but the Scala compiler can't find them at compile time because they are not on its classpath early enough.
Fix: Add proto compilation order dependency in both modules:
modules/coordinator/build.gradle.kts and modules/core/build.gradle.kts:
tasks.compileScala {
    dependsOn(tasks.named("compileJava")) // Ensures Java stubs compiled first
}
Also ensure proto is on sourceSets:
sourceSets {
    main {
        proto {
            srcDir("src/main/proto")
        }
    }
}
Quarkus v3.x should handle this automatically, but an explicit dependency helps.
Alternative: Use Generated Java Classes Directly
If the proto stubs are still not found, reference the classes exactly as generated:
// Don't try to import individual types
import de.nowchess.coordinator.{
  CoordinatorServiceGrpc,
  HeartbeatFrame,
  // ...
}
// Instead, use full paths or check actual generated names
val frame = de.nowchess.coordinator.HeartbeatFrame.newBuilder()
  .setInstanceId("...")
  .build()
Run ./gradlew clean modules:coordinator:compileJava to regenerate and inspect build/generated/sources/protobuf/java/de/nowchess/coordinator/ to see actual class names.
Code Quality Issues (Non-Blocking)
Fix in the coordinator services, which already emit deprecation warnings for = _:
// OLD
@Inject var redissonClient: RedissonClient = _
// NEW
import scala.compiletime.uninitialized
@Inject var redissonClient: RedissonClient = uninitialized
Jakarta optional injection:
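// Old (doesn't work: jakarta.inject.Inject has no optional attribute)
@Inject(optional = true) var kubeClient: KubernetesClient = _
// Better (use null check)
@Inject var kubeClient: KubernetesClient = null
if (kubeClient != null) { ... }
// Caveat: CDI reports an unsatisfied dependency if no KubernetesClient bean exists;
// in that case inject jakarta.enterprise.inject.Instance[KubernetesClient] and check isResolvable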
Method params in private helpers: Remove unused params in scaleUp(), scaleDown(), rebalance().
Missing Implementation (Phase 2)
1. InstanceHeartbeatService (DONE, needs testing)
- Startup: generate instanceId, open gRPC stream, schedule heartbeats
- Every 200ms: send HeartbeatFrame via stream
- Every 2s: refresh Redis TTL on {prefix}:instances:{id}
- addGameSubscription(gameId) → SADD {id}:games {gameId}
- removeGameSubscription(gameId) → SREM {id}:games {gameId}
- Shutdown: cleanup Redis + stream
- Test: Kill core JVM, verify coordinator detects within 300ms
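The two timers in this list (200ms heartbeat, 2s TTL refresh) can be sketched with a plain ScheduledExecutorService. This is an illustration under assumptions, not the actual InstanceHeartbeatService: HeartbeatScheduler is a hypothetical name, and sendFrame and refreshTtl stand in for the real gRPC and Redis calls.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.util.control.NonFatal

// Hypothetical helper; sendFrame and refreshTtl are placeholders for the gRPC and Redis calls.
final class HeartbeatScheduler(sendFrame: () => Unit, refreshTtl: () => Unit) {
  // Two threads, so a slow Redis refresh does not delay the 200ms heartbeat.
  private val scheduler = Executors.newScheduledThreadPool(2)

  def start(): Unit = {
    scheduler.scheduleAtFixedRate(() => safely(sendFrame), 0L, 200L, TimeUnit.MILLISECONDS)
    scheduler.scheduleAtFixedRate(() => safely(refreshTtl), 0L, 2L, TimeUnit.SECONDS)
    ()
  }

  def stop(): Unit = {
    scheduler.shutdownNow()
    ()
  }

  // scheduleAtFixedRate suppresses all later runs once a task throws,
  // so non-fatal exceptions are swallowed here to keep the timers alive.
  private def safely(task: () => Unit): Unit =
    try task()
    catch { case NonFatal(_) => () }
}
```

Swallowing exceptions keeps the sketch short; a real service would probably log them, since a silently failing heartbeat looks like a dead instance to the coordinator.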
2. Coordinator HealthMonitor (skeleton done)
- Watch gRPC streams: on onError() or onCompleted(), mark instance DEAD
- Fallback: poll Redis heartbeat TTL expiry every 5s
- Fallback: k8s pod watch for label app=nowchess-core, detect NotReady status
- Decision: if gRPC drop → immediate failover (no wait)
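With three detection signals, the same dead instance may be reported more than once (stream drop first, then TTL expiry or the pod watch). A minimal sketch of an idempotent dead-marker, so that failover runs only for the first report; InstanceStates and its methods are hypothetical names, not part of the existing HealthMonitor skeleton:

```scala
import scala.collection.mutable

// Hypothetical helper: deduplicates "instance is dead" reports across signals.
final class InstanceStates {
  private val dead = mutable.Set.empty[String]

  // Returns true only for the first report; later reports for the same instance return false.
  def markDead(instanceId: String): Boolean = synchronized {
    dead.add(instanceId)
  }

  // Called when the instance reconnects, so a later drop triggers failover again.
  def markAlive(instanceId: String): Unit = synchronized {
    dead.remove(instanceId)
    ()
  }
}
```

A caller would trigger failover only when markDead returns true, whichever signal arrives first.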
3. Coordinator FailoverService (partial)
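def onInstanceStreamDropped(instanceId: String): Unit =
  // Pseudocode: Redis and gRPC calls are shown schematically
  val gameIds = SMEMBERS "{prefix}:instance:{instanceId}:games"
  val healthyInstances = getAllHealthyInstances()
  // Keep the Redis set when nothing can take the games, so they remain recoverable
  if healthyInstances.nonEmpty && gameIds.nonEmpty then
    // Ceiling division keeps the batch size >= 1; grouped(0) would throw
    val batchSize = (gameIds.size + healthyInstances.size - 1) / healthyInstances.size
    // Distribute games across healthy instances in equal batches
    gameIds.grouped(batchSize).zipWithIndex.foreach {
      case (batch, idx) =>
        val target = healthyInstances(idx % healthyInstances.size)
        call target.grpcStub.batchResubscribeGames(batch)
    }
    DEL "{prefix}:instance:{instanceId}:games"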
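The distribution step can be isolated as a pure function and unit-tested without Redis or gRPC. The sketch below assigns games round-robin by position and ignores current load, which is a simplification; GameDistribution and distribute are hypothetical names.

```scala
// Hypothetical helper: plans which healthy instance receives which games.
object GameDistribution {
  // Returns instanceId -> games to resubscribe there; empty if no instance is healthy.
  def distribute(gameIds: Seq[String], healthy: Seq[String]): Map[String, Seq[String]] =
    if (healthy.isEmpty) Map.empty
    else {
      val targets = healthy.toVector
      gameIds.zipWithIndex
        .groupMap { case (_, i) => targets(i % targets.size) } { case (gameId, _) => gameId }
    }
}
```

Each map entry then corresponds to one batchResubscribeGames call, which matches the one-call-per-target-core design.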
4. Coordinator gRPC Client Stubs (need manual integration)
Create modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoreGrpcClient.scala:
@ApplicationScoped
class CoreGrpcClient:
  // Blocking stub: the call below returns the response directly
  @GrpcClient("core-grpc")
  private var coreStub: CoordinatorServiceGrpc.CoordinatorServiceBlockingStub = uninitialized

  def batchResubscribeGames(host: String, port: Int, gameIds: List[String]): Int =
    // Build request, call via dynamic stub to (host, port)
    val response = coreStub.batchResubscribeGames(...)
    response.getSubscribedCount
A dynamic gRPC client is still needed, because by default Quarkus binds an injected client to a statically configured host and port. Workaround: use io.grpc:grpc-netty-shaded and build a ManagedChannel directly:
// Reuse one channel per (host, port) and shut it down when the instance is removed
val channel = ManagedChannelBuilder.forAddress(host, port).usePlaintext().build()
val stub = CoordinatorServiceGrpc.newBlockingStub(channel)
5. LoadBalancer.rebalance() (stub → full impl)
def rebalance(): Unit =
  val instances = listInstancesFromRedis()
  if instances.nonEmpty then // avoids division by zero below
    val loads = instances.map(_.subscriptionCount)
    val mean = loads.sum.toDouble / loads.size // Double avoids integer truncation
    val overloaded = instances.filter(_.subscriptionCount > maxGamesPerCore)
      .sortBy(-_.subscriptionCount) // most loaded first
    val underloaded = instances.filter(_.subscriptionCount < mean * 0.8)
      .sortBy(_.subscriptionCount)
    overloaded.foreach { over =>
      // targetLoad is not defined yet; candidates are the mean or maxGamesPerCore
      val excess = over.subscriptionCount - targetLoad
      underloaded.headOption.foreach { under =>
        val toMove = getGamesToMove(over.instanceId, excess)
        call over.coreGrpc.unsubscribeGames(toMove)
        call under.coreGrpc.batchResubscribeGames(toMove)
        // Update Redis sets
      }
    }
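The planning part of a rebalance can be separated from the gRPC and Redis side effects. The sketch below is one possible reading, under stated assumptions: the target load is the mean rounded up, a receiver is any instance below 80% of the mean, and excess is spread over several receivers instead of only the first one. RebalancePlan, Move and plan are hypothetical names.

```scala
// Hypothetical helper: computes how many games to move between instances.
object RebalancePlan {
  final case class Move(from: String, to: String, count: Int)

  // loads maps instanceId -> current subscription count.
  def plan(loads: Map[String, Int], maxGamesPerCore: Int): List[Move] =
    if (loads.isEmpty) Nil
    else {
      val mean = loads.values.sum.toDouble / loads.size
      val target = math.ceil(mean).toInt // assumed target load
      // Most loaded first; ties broken by id so the plan is deterministic.
      val overloaded = loads.toList
        .filter { case (_, load) => load > maxGamesPerCore }
        .sortBy { case (id, load) => (-load, id) }
      // Receivers paired with the number of games each can still take.
      var receivers = loads.toList
        .filter { case (_, load) => load < mean * 0.8 && load <= maxGamesPerCore }
        .sortBy { case (id, load) => (load, id) }
        .map { case (id, load) => (id, target - load) }
      val moves = List.newBuilder[Move]
      overloaded.foreach { case (from, load) =>
        var excess = load - target
        while (excess > 0 && receivers.nonEmpty) {
          val (to, capacity) = receivers.head
          val n = math.min(excess, capacity)
          moves += Move(from, to, n)
          excess -= n
          receivers =
            if (capacity > n) (to, capacity - n) :: receivers.tail
            else receivers.tail
        }
      }
      moves.result()
    }
}
```

Each Move would then translate into one unsubscribeGames call on the source core and one batchResubscribeGames call on the target, followed by the Redis set updates.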
6. AutoScaler (stub → k8s API calls)
def scaleUp(): Unit =
  if (kubeClient != null && config.autoScaleEnabled) {
    val rollout = kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
      .get()
    val newReplicas = rollout.getSpec.getReplicas + 1
    rollout.getSpec.setReplicas(newReplicas)
    kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
      .createOrReplace(rollout)
  }
Requires: io.fabric8:kubernetes-client:6.13.0 (already in build.gradle.kts).
7. CacheEvictionManager (stub → full impl)
def evictStaleGames(): Unit =
  val now = System.currentTimeMillis()
  // KEYS is O(N) and blocks Redis while it runs; prefer SCAN-based iteration in production
  val keys = KEYS "{prefix}:game:entry:*"
  keys.foreach { key =>
    val bucket = redissonClient.getBucket[String](key)
    val json = bucket.get()
    val lastUpdated = extractTimestamp(json) // Parse JSON
    if (now - lastUpdated > config.gameIdleThreshold.toMillis) {
      val gameId = key.stripPrefix(...)
      val instance = findInstanceWithGame(gameId)
      instance.foreach { inst =>
        call inst.coreGrpc.evictGames(List(gameId))
      }
      bucket.delete()
    }
  }
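The timestamp parsing and idle check can be sketched without a JSON library. This assumes the cached entry carries a numeric field named lastUpdated holding epoch milliseconds; that field name is a guess, and StaleGameCheck is a hypothetical helper. Returning an Option means a malformed entry is never treated as stale by accident.

```scala
import java.time.Duration

// Hypothetical helper; the "lastUpdated" field name is an assumption about the cached JSON.
object StaleGameCheck {
  private val LastUpdated = "\"lastUpdated\"\\s*:\\s*(\\d+)".r

  // Extracts the epoch-millis timestamp, or None if the field is missing or not a valid Long.
  def extractTimestamp(json: String): Option[Long] =
    LastUpdated.findFirstMatchIn(json).flatMap(_.group(1).toLongOption)

  // True only when a timestamp was found and it is older than the threshold.
  def isStale(json: String, nowMillis: Long, threshold: Duration): Boolean =
    extractTimestamp(json).exists(ts => nowMillis - ts > threshold.toMillis)
}
```

A regex is enough for a flat numeric field; if the entry format is nested or the field can appear more than once, a proper JSON parser would be the safer choice.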
8. CoordinatorGrpcServer HeartbeatStream (stub → full impl)
override def heartbeatStream(
    responseObserver: StreamObserver[CoordinatorCommand]
): StreamObserver[HeartbeatFrame] =
  new StreamObserver[HeartbeatFrame]:
    // Empty until the first frame arrives
    private var lastInstanceId = ""
    override def onNext(frame: HeartbeatFrame): Unit =
      lastInstanceId = frame.getInstanceId
      instanceRegistry.updateInstanceFromRedis(lastInstanceId)
    override def onError(t: Throwable): Unit =
      log.warnf(t, "Stream error for %s", lastInstanceId)
      // Skip failover if the stream failed before identifying itself
      if lastInstanceId.nonEmpty then
        failoverService.onInstanceStreamDropped(lastInstanceId)
    override def onCompleted(): Unit =
      log.infof("Stream completed for %s", lastInstanceId)
      // Close the server side of the stream as well
      responseObserver.onCompleted()
Testing Checklist
- Compile with proto fix
- Start core + coordinator
- Create game, subscribe core
- Watch redis-cli SMEMBERS nowchess:instance:{id}:games → game appears
- Kill core JVM via kill -9
- Verify coordinator log shows "stream error" within 200ms
- Verify second core receives batchResubscribeGames call within 300ms total
- Create second core, rebalance load, verify games migrate
- Scale up: verify Argo Rollout replica count increases
- 45min idle game: verify coordinator calls evictGames
File Checklist
✅ Created:
- modules/coordinator/build.gradle.kts
- modules/coordinator/src/main/proto/coordinator_service.proto
- modules/coordinator/src/main/resources/application.yml
- modules/coordinator/src/main/scala/de/nowchess/coordinator/config/CoordinatorConfig.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/dto/InstanceMetadata.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/service/InstanceRegistry.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/service/FailoverService.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/service/LoadBalancer.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/service/AutoScaler.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/service/HealthMonitor.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/service/CacheEvictionManager.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoordinatorGrpcServer.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/resource/CoordinatorResource.scala
- modules/coordinator/src/main/scala/de/nowchess/coordinator/CoordinatorApp.scala
- modules/core/src/main/proto/coordinator_service.proto
- modules/core/src/main/scala/de/nowchess/chess/service/InstanceHeartbeatService.scala
- modules/core/src/main/scala/de/nowchess/chess/grpc/CoordinatorServiceHandler.scala
✅ Modified:
- settings.gradle.kts → added modules:coordinator
- modules/core/src/main/resources/application.yml → added coordinator gRPC client + heartbeat config
- modules/core/build.gradle.kts → (no changes, proto handled by quarkus-grpc)
- modules/core/src/main/scala/de/nowchess/chess/redis/GameRedisSubscriberManager.scala → added InstanceHeartbeatService injection, SADD/SREM, batch ops
Next Steps (New Session)
- Run ./gradlew clean modules:coordinator:compileScala with proto fix
- Finish gRPC client stubs (Rollout, managed channels)
- Implement FailoverService.distributeGames() with actual core gRPC calls
- Implement LoadBalancer.rebalance() with game migration
- Implement AutoScaler with k8s API
- Implement CacheEvictionManager with timestamp parsing
- Run integration tests (manual or @QuarkusTest)
- Benchmark: create 5000 games, kill 1 core, measure failover time
Design Decisions (Record for Future)
- gRPC stream as primary: TCP-level detection in <200ms, versus 5-30s for polling or TTL expiry
- Redis game sets: SADD/SREM for O(1) lookup vs scanning Redis per failover
- Argo Rollouts not StatefulSet: Respects canary/blue-green; patch via Fabric8 GenericKubernetesResource
- Batch gRPC calls: One call per target core vs 1:1 calls per game (saves RPC overhead)
- No persistent subscriptions: On coordinator restart, gRPC reconnects auto-trigger resubscribe; best-effort is OK
Known Gaps
- Error handling: what if batchResubscribeGames fails? Retry? Partial migration? (Add circuit breaker)
- Coordinator HA: single instance. Add Quorum or K8s deployment with multiple replicas + leader election if needed
- Metrics: no Prometheus exports yet (add via quarkus-micrometer)
- Monitoring: logs only, no alerts on failover latency SLA violation
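For the error-handling gap, a bounded retry is one possible starting point before adding a full circuit breaker. The sketch below is a generic helper, not existing project code: Retry and withRetries are hypothetical names, and op would wrap a call such as batchResubscribeGames.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper: retries a failing call a bounded number of times.
object Retry {
  // Runs op up to maxAttempts times and returns the first success or the last failure.
  // delayMillis is an optional fixed pause between attempts.
  @tailrec
  def withRetries[A](maxAttempts: Int, delayMillis: Long = 0L)(op: () => A): Try[A] =
    Try(op()) match {
      case success @ Success(_) => success
      case Failure(_) if maxAttempts > 1 =>
        if (delayMillis > 0) Thread.sleep(delayMillis)
        withRetries(maxAttempts - 1, delayMillis)(op)
      case failure @ Failure(_) => failure
    }
}
```

With a 300ms failover budget, retries against the same core have to stay very short; falling back to a different healthy core after the first failure may fit the budget better than waiting.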