Janis f327441089 feat(coordinator): scaffold microservice for <300ms failover and load balancing
- Add coordinator module with gRPC stream-based instance health detection
- Implement InstanceHeartbeatService in core: bidirectional stream to coordinator every 200ms
- Track game subscriptions per core via Redis Sets (SADD/SREM)
- Add gRPC handlers for batch resubscribe/unsubscribe/evict/drain operations
- Implement coordinator services: InstanceRegistry, FailoverService, LoadBalancer, AutoScaler, CacheEvictionManager
- Add REST API for metrics and manual failover/rebalance/scaling
- Proto definition: coordinator_service.proto with HeartbeatStream + batch game operations
- Failover timeline: gRPC stream drop (50-200ms) → game migration (<300ms target)
- Support for Argo Rollouts auto-scaling (k8s CRD patching via Fabric8 client)

Note: Proto compilation issues documented in COORDINATOR_IMPLEMENTATION.md. Requires:
- Add task dependency: tasks.compileScala dependsOn tasks.compileJava
- Fix deprecated @Inject var = _ → = uninitialized syntax
- Implement remaining service methods (gRPC clients, FailoverService distribution)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-04-26 08:34:53 +02:00

# Coordinator Microservice Implementation Guide
## Status: Proto Compilation Blockers (Fixable)
**Completed**: Module scaffold, InstanceHeartbeatService, GameRedisSubscriberManager updates, gRPC handlers, REST API stubs.
**Blocking**: Proto file → Java stubs not resolving in Scala imports. Solution documented below.
---
## Architecture
**Goal**: <300ms failover via gRPC bidirectional stream detection + sub-1s game migration.
**Core Flow**:
1. Core sends `HeartbeatFrame` every 200ms on stream to coordinator
2. Core maintains the `{prefix}:instance:{id}:games` Redis Set (SADD on subscribe, SREM on unsubscribe)
3. Core refreshes `{prefix}:instances:{id}` Redis key every 2s (5s TTL)
4. Coordinator watches stream; on drop → immediate failover
5. Failover: read `SMEMBERS {prefix}:instance:{id}:games`, then call `BatchResubscribeGames` on the healthy cores
**Key Insight**: Three detection signals (gRPC stream, Redis TTL, k8s watch), but **gRPC stream drop is primary** (50-200ms detection).
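The key layout in the core flow can be kept in one place so that core and coordinator agree on names. A minimal sketch, assuming a hypothetical `RedisKeys` helper; neither the object nor its method names come from the codebase, only the key patterns do.

```scala
// Hypothetical helper: centralises the Redis key layout described in the core flow.
object RedisKeys:
  /** Set of game IDs a core is subscribed to (target of SADD/SREM/SMEMBERS). */
  def instanceGames(prefix: String, instanceId: String): String =
    s"$prefix:instance:$instanceId:games"

  /** Heartbeat key that a core refreshes every 2s with a 5s TTL. */
  def instanceHeartbeat(prefix: String, instanceId: String): String =
    s"$prefix:instances:$instanceId"
```

With the `nowchess` prefix used in the testing checklist, `RedisKeys.instanceGames("nowchess", "core-1")` yields `nowchess:instance:core-1:games`.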
---
## Proto Compilation Fix
### Problem
Scala code imports `de.nowchess.coordinator.HeartbeatFrame`, but the classes generated by the proto plugin are not visible to the Scala compiler under the current Gradle setup.
### Solution
The Quarkus gRPC plugin generates Java stubs in `build/generated/sources/protobuf/java/` during the `quarkusGenerateCode` task. These are compiled to `.class` files, but the Scala compiler can't find them at compile time because they're not on its classpath early enough.
**Fix**: Add proto compilation order dependency in both modules:
**modules/coordinator/build.gradle.kts** and **modules/core/build.gradle.kts**:
```gradle
tasks.compileScala {
    dependsOn(tasks.named("compileJava")) // Ensures Java stubs are compiled first
}
```
Also ensure proto is on sourceSets:
```gradle
sourceSets {
    main {
        proto {
            srcDir("src/main/proto")
        }
    }
}
```
Quarkus v3.x should handle this automatically, but explicit dependency helps.
### Alternative: Use Generated Java Classes Directly
If the proto stubs are still not found, refer to the classes **exactly as generated**:
```scala
// Avoid a selective import until the generated names are confirmed:
// import de.nowchess.coordinator.{
//   CoordinatorServiceGrpc,
//   HeartbeatFrame,
//   // ...
// }

// Instead, use full paths or check the actual generated names
val frame = de.nowchess.coordinator.HeartbeatFrame.newBuilder()
  .setInstanceId("...")
  .build()
```
Run `./gradlew clean modules:coordinator:compileJava` to regenerate the stubs, then inspect `build/generated/sources/protobuf/java/de/nowchess/coordinator/` to see the actual class names.
---
## Code Quality Issues (Non-Blocking)
**Fix in coordinator services** (already have `= _` deprecation warnings):
```scala
// OLD
@Inject var redissonClient: RedissonClient = _
// NEW
import scala.compiletime.uninitialized
@Inject var redissonClient: RedissonClient = uninitialized
```
**Jakarta optional injection**:
```scala
// Old (doesn't work)
@Inject(optional = true) var kubeClient: KubernetesClient = _
// Better (use null check)
@Inject var kubeClient: KubernetesClient = null
if (kubeClient != null) { ... }
```
**Method params in private helpers**: Remove unused params in `scaleUp()`, `scaleDown()`, `rebalance()`.
---
## Missing Implementation (Phase 2)
### 1. **InstanceHeartbeatService** (DONE, needs testing)
- [x] Startup: generate instanceId, open gRPC stream, schedule heartbeats
- [x] Every 200ms: send `HeartbeatFrame` via stream
- [x] Every 2s: refresh Redis TTL on `{prefix}:instances:{id}`
- [x] `addGameSubscription(gameId)` → `SADD {prefix}:instance:{id}:games {gameId}`
- [x] `removeGameSubscription(gameId)` → `SREM {prefix}:instance:{id}:games {gameId}`
- [x] Shutdown: cleanup Redis + stream
- [ ] **Test**: Kill core JVM, verify coordinator detects within 300ms
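The two periodic duties in the checklist above (a frame every 200ms, a TTL refresh every 2s) can be sketched with a plain `ScheduledExecutorService`. This is a minimal sketch, not the actual `InstanceHeartbeatService`: `HeartbeatScheduler` and its two callbacks are hypothetical names, and the real service would send a `HeartbeatFrame` and refresh the Redis key inside them.

```scala
import java.util.concurrent.{Executors, ScheduledExecutorService, TimeUnit}

// Hypothetical scheduler: runs two periodic tasks at the intervals from the checklist.
final class HeartbeatScheduler(sendFrame: () => Unit, refreshTtl: () => Unit):
  private val executor: ScheduledExecutorService =
    Executors.newSingleThreadScheduledExecutor()

  def start(): Unit =
    // scheduleAtFixedRate stops rescheduling a task once it throws, so each run is guarded.
    executor.scheduleAtFixedRate(() => safely(sendFrame), 0L, 200L, TimeUnit.MILLISECONDS)
    executor.scheduleAtFixedRate(() => safely(refreshTtl), 0L, 2000L, TimeUnit.MILLISECONDS)
    ()

  def stop(): Unit =
    executor.shutdownNow()
    ()

  private def safely(task: () => Unit): Unit =
    try task()
    catch case e: Exception => System.err.println(s"heartbeat task failed: ${e.getMessage}")
```

Guarding each run matters here: a single Redis or network exception would otherwise silently end the heartbeat, and the coordinator would then treat a healthy core as dead.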
### 2. **Coordinator HealthMonitor** (skeleton done)
- [ ] Watch gRPC streams: on `onError()` or `onCompleted()`, mark instance DEAD
- [ ] Fallback: poll Redis heartbeat TTL expiry every 5s
- [ ] Fallback: k8s pod watch for label `app=nowchess-core`, detect NotReady status
- [ ] Decision: if gRPC drop → immediate failover (no wait)
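The decision rule in this list can be written as a small pure function, which makes it easy to unit-test before the watchers exist. A minimal sketch under assumptions: the `Signal` and `Action` enums and `HealthDecision` are hypothetical, and cross-checking the fallback signals against the stream state is one possible policy, not something the design above prescribes.

```scala
// Hypothetical model of the three detection signals and the resulting decision.
enum Signal:
  case StreamDropped, RedisTtlExpired, PodNotReady

enum Action:
  case FailoverNow, ConfirmThenFailover

object HealthDecision:
  /** A gRPC stream drop is the primary signal and triggers failover at once.
    * The slower fallbacks are confirmed first while the stream is still open (assumed policy). */
  def decide(signal: Signal, streamOpen: Boolean): Action =
    signal match
      case Signal.StreamDropped => Action.FailoverNow
      case Signal.RedisTtlExpired | Signal.PodNotReady =>
        if streamOpen then Action.ConfirmThenFailover else Action.FailoverNow
```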
### 3. **Coordinator FailoverService** (partial)
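```scala
import scala.jdk.CollectionConverters.*

def onInstanceStreamDropped(instanceId: String): Unit =
  // SMEMBERS {prefix}:instance:{id}:games (`prefix` is assumed to come from config)
  val gamesSet = redissonClient.getSet[String](s"$prefix:instance:$instanceId:games")
  val gameIds = gamesSet.readAll().asScala.toList
  val healthyInstances = getAllHealthyInstances()
  if gameIds.nonEmpty && healthyInstances.nonEmpty then
    // Split games into even batches, at most one per healthy instance.
    // Ceiling division keeps the batch size >= 1; grouped(0) would throw.
    val batchSize = (gameIds.size + healthyInstances.size - 1) / healthyInstances.size
    gameIds.grouped(batchSize).zipWithIndex.foreach { case (batch, idx) =>
      val target = healthyInstances(idx % healthyInstances.size)
      target.grpcStub.batchResubscribeGames(batch) // placeholder for the real gRPC call
    }
    // DEL {prefix}:instance:{id}:games once the games have been handed over
    gamesSet.delete()
```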
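The distribution step is the part most worth testing in isolation, since it has edge cases (no targets, fewer games than targets). A minimal sketch of a round-robin assignment as a pure function; `GameDistribution` is a hypothetical name, and instances are represented by plain string IDs here rather than the real instance metadata.

```scala
object GameDistribution:
  /** Assigns game IDs to targets round-robin.
    * Returns one batch per target that received at least one game. */
  def distribute(gameIds: Seq[String], targets: Seq[String]): Map[String, Seq[String]] =
    if targets.isEmpty then Map.empty
    else
      gameIds.zipWithIndex
        .groupMap { case (_, idx) => targets(idx % targets.size) } { case (gameId, _) => gameId }
```

Each map entry then corresponds to one `batchResubscribeGames` call, which matches the "one call per target core" design decision.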
### 4. **Coordinator gRPC Client Stubs** (need manual integration)
Create **modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoreGrpcClient.scala**:
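```scala
import io.quarkus.grpc.GrpcClient
import jakarta.enterprise.context.ApplicationScoped
import scala.compiletime.uninitialized

@ApplicationScoped
class CoreGrpcClient:
  // Blocking stub: the call below returns the response directly.
  // The async CoordinatorServiceStub takes a StreamObserver and returns Unit.
  @GrpcClient("core-grpc")
  private var coreStub: CoordinatorServiceGrpc.CoordinatorServiceBlockingStub = uninitialized

  def batchResubscribeGames(host: String, port: Int, gameIds: List[String]): Int =
    // Build request, call via dynamic stub to (host, port)
    val response = coreStub.batchResubscribeGames(...)
    response.getSubscribedCount
```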
A dynamic gRPC client is still needed, because Quarkus doesn't support changing the host:port at runtime by default. **Workaround**: use `io.grpc:grpc-netty-shaded` and a `ManagedChannel` directly:
```scala
val channel = ManagedChannelBuilder.forAddress(host, port).usePlaintext().build()
val stub = CoordinatorServiceGrpc.newStub(channel)
```
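Building a channel per call would be wasteful, so the workaround probably wants one channel per core, reused across calls and closed when the core goes away. A minimal sketch of such a cache, kept generic so it runs without gRPC on the classpath: `EndpointCache` is a hypothetical name, and in the coordinator `C` would be `io.grpc.ManagedChannel`, with `create` wrapping the `ManagedChannelBuilder` call above and `close` calling `shutdown()`.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical cache: one connection per (host, port), created on first use.
final class EndpointCache[C](create: (String, Int) => C, close: C => Unit):
  private val cache = TrieMap.empty[(String, Int), C]

  /** Returns the cached connection, creating it if needed.
    * Under contention `create` may run more than once; only one result is kept. */
  def get(host: String, port: Int): C =
    cache.getOrElseUpdate((host, port), create(host, port))

  /** Drops and closes the connection for an instance that has gone away. */
  def evict(host: String, port: Int): Unit =
    cache.remove((host, port)).foreach(close)
```

Calling `evict` from the failover path keeps dead channels from accumulating as cores are replaced.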
### 5. **LoadBalancer.rebalance()** (stub → full impl)
```scala
def rebalance(): Unit =
  val instances = listInstancesFromRedis()
  if instances.nonEmpty then // avoids dividing by zero below
    val loads = instances.map(_.subscriptionCount)
    val mean = loads.sum.toDouble / loads.size
    val overloaded = instances.filter(_.subscriptionCount > maxGamesPerCore)
      .sortBy(-_.subscriptionCount) // heaviest first
    val underloaded = instances.filter(_.subscriptionCount < mean * 0.8)
      .sortBy(_.subscriptionCount) // lightest first
    overloaded.foreach { over =>
      // `targetLoad` is still to be defined (assumed to come from config)
      val excess = over.subscriptionCount - targetLoad
      underloaded.headOption.foreach { under =>
        val toMove = getGamesToMove(over.instanceId, excess)
        // Placeholders for the real gRPC calls
        over.coreGrpc.unsubscribeGames(toMove)
        under.coreGrpc.batchResubscribeGames(toMove)
        // Update Redis sets
      }
    }
```
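Separating the planning from the gRPC calls makes the rebalancing logic testable without any cores running. A minimal sketch of a greedy planner under assumptions: `Move` and `RebalancePlan` are hypothetical names, instances are plain string IDs, and the policy (shed everything above the cap to whichever instance is currently lightest) is one option, not necessarily the one `LoadBalancer` will use.

```scala
// Hypothetical plan entry: move `count` games from one instance to another.
final case class Move(from: String, to: String, count: Int)

object RebalancePlan:
  /** Plans moves from instances above `maxPerCore` to the currently lightest instances. */
  def plan(loads: Map[String, Int], maxPerCore: Int): List[Move] =
    val current = scala.collection.mutable.Map.from(loads)
    val moves = List.newBuilder[Move]
    val overloaded = loads.toList.filter(_._2 > maxPerCore).sortBy(-_._2)
    overloaded.foreach { case (from, load) =>
      var excess = load - maxPerCore
      while excess > 0 do
        val (to, toLoad) = current.minBy(_._2)
        val room = maxPerCore - toLoad
        if room <= 0 then excess = 0 // every instance is at the cap; nothing more to move
        else
          val n = math.min(excess, room)
          moves += Move(from, to, n)
          current(from) -= n
          current(to) += n
          excess -= n
    }
    moves.result()
```

Unlike picking only the first underloaded instance, this re-evaluates the lightest target after every move, so one receiver is never pushed over the cap.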
### 6. **AutoScaler** (stub → k8s API calls)
```scala
def scaleUp(): Unit =
  if (kubeClient != null && config.autoScaleEnabled) {
    val rollout = kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
      .get()
    val newReplicas = rollout.getSpec.getReplicas + 1
    rollout.getSpec.setReplicas(newReplicas)
    kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
      .createOrReplace(rollout)
  }
```
Requires: `io.fabric8:kubernetes-client:6.13.0` (already in build.gradle.kts).
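The question of when to call `scaleUp()` or `scaleDown()` can be kept apart from the Kubernetes API calls. A minimal sketch of that decision as a pure function; `ScalingDecision` and all of its threshold parameters are illustrative assumptions, not fields of `CoordinatorConfig`.

```scala
object ScalingDecision:
  /** Returns the desired replica count for the observed average load.
    * Steps by one replica at a time, like scaleUp() above. */
  def desiredReplicas(
      current: Int,
      avgGamesPerCore: Double,
      scaleUpAbove: Double,
      scaleDownBelow: Double,
      minReplicas: Int,
      maxReplicas: Int
  ): Int =
    val proposed =
      if avgGamesPerCore > scaleUpAbove then current + 1
      else if avgGamesPerCore < scaleDownBelow then current - 1
      else current
    // Clamp so the Rollout never leaves the allowed range.
    math.max(minReplicas, math.min(maxReplicas, proposed))
```

The gap between the two thresholds acts as hysteresis, so a load hovering around one value does not make the Rollout flap.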
### 7. **CacheEvictionManager** (stub → full impl)
```scala
import scala.jdk.CollectionConverters.*

def evictStaleGames(): Unit =
  val now = System.currentTimeMillis()
  // KEYS {prefix}:game:entry:* (Redisson iterates matching keys with SCAN)
  val keys = redissonClient.getKeys().getKeysByPattern(s"$prefix:game:entry:*").asScala
  keys.foreach { key =>
    val bucket = redissonClient.getBucket[String](key)
    // The entry may have expired between the scan and this read
    Option(bucket.get()).foreach { json =>
      val lastUpdated = extractTimestamp(json) // Parse JSON
      if now - lastUpdated > config.gameIdleThreshold.toMillis then
        val gameId = key.stripPrefix(...)
        findInstanceWithGame(gameId).foreach { inst =>
          inst.coreGrpc.evictGames(List(gameId)) // placeholder for the real gRPC call
        }
        bucket.delete()
    }
  }
```
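The staleness check itself needs neither Redis nor gRPC and can be tested on its own. A minimal sketch, assuming the timestamps have already been parsed out of the JSON entries; `StaleGames` is a hypothetical name.

```scala
import java.time.{Duration, Instant}

object StaleGames:
  /** Returns the IDs of games whose last update is older than `idleThreshold`. */
  def select(
      lastUpdated: Map[String, Instant],
      now: Instant,
      idleThreshold: Duration
  ): Set[String] =
    lastUpdated.collect {
      case (gameId, ts) if Duration.between(ts, now).compareTo(idleThreshold) > 0 => gameId
    }.toSet
```

Passing `now` in as a parameter, rather than reading the clock inside, lets a test cover the idle threshold without waiting for it to elapse.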
### 8. **CoordinatorGrpcServer HeartbeatStream** (stub → full impl)
```scala
override def heartbeatStream(
    responseObserver: StreamObserver[CoordinatorCommand]
): StreamObserver[HeartbeatFrame] =
  new StreamObserver[HeartbeatFrame]:
    private var lastInstanceId = ""

    override def onNext(frame: HeartbeatFrame): Unit =
      lastInstanceId = frame.getInstanceId
      instanceRegistry.updateInstanceFromRedis(lastInstanceId)

    override def onError(t: Throwable): Unit =
      log.warnf(t, "Stream error for %s", lastInstanceId)
      // Skip failover if the stream failed before the first frame identified the instance
      if lastInstanceId.nonEmpty then failoverService.onInstanceStreamDropped(lastInstanceId)

    override def onCompleted(): Unit =
      log.infof("Stream completed for %s", lastInstanceId)
      responseObserver.onCompleted() // close the coordinator's half of the stream
```
---
## Testing Checklist
- [ ] Compile with proto fix
- [ ] Start core + coordinator
- [ ] Create game, subscribe core
- [ ] Watch `redis-cli SMEMBERS nowchess:instance:{id}:games` → game appears
- [ ] Kill core JVM via `kill -9`
- [ ] Verify coordinator log shows "stream error" within 200ms
- [ ] Verify second core receives `batchResubscribeGames` call within 300ms total
- [ ] Create second core, rebalance load, verify games migrate
- [ ] Scale up: verify Argo Rollout replica count increases
- [ ] 45min idle game: verify coordinator calls `evictGames`
---
## File Checklist
**Created**:
- `modules/coordinator/build.gradle.kts`
- `modules/coordinator/src/main/proto/coordinator_service.proto`
- `modules/coordinator/src/main/resources/application.yml`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/config/CoordinatorConfig.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/dto/InstanceMetadata.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/InstanceRegistry.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/FailoverService.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/LoadBalancer.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/AutoScaler.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/HealthMonitor.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/CacheEvictionManager.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoordinatorGrpcServer.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/resource/CoordinatorResource.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/CoordinatorApp.scala`
- `modules/core/src/main/proto/coordinator_service.proto`
- `modules/core/src/main/scala/de/nowchess/chess/service/InstanceHeartbeatService.scala`
- `modules/core/src/main/scala/de/nowchess/chess/grpc/CoordinatorServiceHandler.scala`
**Modified**:
- `settings.gradle.kts` → added `modules:coordinator`
- `modules/core/src/main/resources/application.yml` → added coordinator gRPC client + heartbeat config
- `modules/core/build.gradle.kts` → (no changes, proto handled by quarkus-grpc)
- `modules/core/src/main/scala/de/nowchess/chess/redis/GameRedisSubscriberManager.scala` → added InstanceHeartbeatService injection, SADD/SREM, batch ops
---
## Next Steps (New Session)
1. Run `./gradlew clean modules:coordinator:compileScala` with proto fix
2. Finish gRPC client stubs (Rollout, managed channels)
3. Implement `FailoverService.distributeGames()` with actual core gRPC calls
4. Implement `LoadBalancer.rebalance()` with game migration
5. Implement `AutoScaler` with k8s API
6. Implement `CacheEvictionManager` with timestamp parsing
7. Run integration tests (manual or `@QuarkusTest`)
8. Benchmark: create 5000 games, kill 1 core, measure failover time
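For the benchmark in step 8, a single average hides the slow failovers that matter for the <300ms target, so a percentile is a more useful summary. A minimal sketch using the nearest-rank method; `FailoverStats` is a hypothetical helper and the sample values in the usage note are made up for illustration, not measurements.

```scala
object FailoverStats:
  /** Nearest-rank percentile of the samples, for p in (0, 100]. */
  def percentile(samplesMs: Seq[Long], p: Double): Long =
    require(samplesMs.nonEmpty, "no samples")
    require(p > 0 && p <= 100, "p must be in (0, 100]")
    val sorted = samplesMs.sorted
    val rank = math.ceil(p / 100.0 * sorted.size).toInt
    sorted(rank - 1)

  /** True when the chosen percentile stays under the failover target. */
  def meetsTarget(samplesMs: Seq[Long], p: Double, targetMs: Long): Boolean =
    percentile(samplesMs, p) < targetMs
```

For example, with illustrative samples `Seq(120L, 180L, 250L, 90L, 310L)`, the median is within a 300ms target while the maximum is not.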
---
## Design Decisions (Record for Future)
- **gRPC stream as primary**: TCP-level detection in <200ms vs 5-30s for polling/TTL
- **Redis game sets**: SADD/SREM for O(1) lookup vs scanning Redis per failover
- **Argo Rollouts not StatefulSet**: Respects canary/blue-green; patch via Fabric8 `GenericKubernetesResource`
- **Batch gRPC calls**: One call per target core vs 1:1 calls per game (saves RPC overhead)
- **No persistent subscriptions**: On coordinator restart, gRPC reconnects auto-trigger resubscribe; best-effort is OK
---
## Known Gaps
- Error handling: what if `batchResubscribeGames` fails? Retry? Partial migration? (Add circuit breaker)
- Coordinator HA: single instance. Add Quorum or K8s deployment with multiple replicas + leader election if needed
- Metrics: no Prometheus exports yet (add via `quarkus-micrometer`)
- Monitoring: logs only, no alerts on failover latency SLA violation
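For the error-handling gap, one starting point is a bounded retry with exponential backoff around each `batchResubscribeGames` call. This is a minimal sketch of plain retry, not a circuit breaker; `Retry.withBackoff` is a hypothetical helper, and the attempt count and delay would need tuning against the 300ms failover budget.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object Retry:
  /** Runs `call` up to `maxAttempts` times, doubling the delay after each failure.
    * Returns the first success, or the last failure once attempts are exhausted. */
  @tailrec
  def withBackoff[A](maxAttempts: Int, delayMs: Long)(call: () => A): Try[A] =
    Try(call()) match
      case Success(value) => Success(value)
      case Failure(_) if maxAttempts > 1 =>
        Thread.sleep(delayMs)
        withBackoff(maxAttempts - 1, delayMs * 2)(call)
      case failure => failure
```

A returned `Failure` is the point where a partial-migration policy would take over, for example by handing the batch to a different healthy core.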