feat(coordinator): scaffold microservice for <300ms failover and load balancing

- Add coordinator module with gRPC stream-based instance health detection
- Implement InstanceHeartbeatService in core: bidirectional stream to coordinator every 200ms
- Track game subscriptions per core via Redis Sets (SADD/SREM)
- Add gRPC handlers for batch resubscribe/unsubscribe/evict/drain operations
- Implement coordinator services: InstanceRegistry, FailoverService, LoadBalancer, AutoScaler, CacheEvictionManager
- Add REST API for metrics and manual failover/rebalance/scaling
- Proto definition: coordinator_service.proto with HeartbeatStream + batch game operations
- Failover timeline: gRPC stream drop (50-200ms) → game migration (<300ms target)
- Support for Argo Rollouts auto-scaling (k8s CRD patching via Fabric8 client)

Note: Proto compilation issues documented in COORDINATOR_IMPLEMENTATION.md. Requires:
- Add task dependency: tasks.compileScala dependsOn tasks.compileJava
- Fix deprecated @Inject var = _ → = uninitialized syntax
- Implement remaining service methods (gRPC clients, FailoverService distribution)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@@ -0,0 +1,316 @@
# Coordinator Microservice Implementation Guide

## Status: Proto Compilation Blockers (Fixable)

**Completed**: Module scaffold, InstanceHeartbeatService, GameRedisSubscriberManager updates, gRPC handlers, REST API stubs.

**Blocking**: The Java stubs generated from the proto file do not resolve in Scala imports. The fix is documented below.

---

## Architecture

**Goal**: <300ms failover via gRPC bidirectional stream detection + sub-1s game migration.

**Core Flow**:
1. Core sends a `HeartbeatFrame` to the coordinator every 200ms over the bidirectional stream
2. Core maintains the `{prefix}:instance:{id}:games` Redis Set (SADD on subscribe, SREM on unsubscribe)
3. Core refreshes the `{prefix}:instances:{id}` Redis key every 2s (5s TTL)
4. Coordinator watches the stream; on drop → immediate failover
5. Failover: read `SMEMBERS {prefix}:instance:{id}:games`, then call `BatchResubscribeGames` on healthy cores

**Key Insight**: There are three detection signals (gRPC stream, Redis TTL, k8s watch), but the **gRPC stream drop is primary** (50–200ms detection).

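Steps 2, 3 and 5 rely on the core and the coordinator building identical Redis keys, and the two key families differ only by `instance` vs `instances`. A minimal sketch of shared helpers could look like the following; `RedisKeys`, `instanceGames` and `instanceHeartbeat` are hypothetical names, not part of the existing modules.

```scala
// Hypothetical helpers; the key layout follows the Core Flow steps above.
object RedisKeys:
  /** Redis Set of game IDs a core instance is subscribed to (step 2). */
  def instanceGames(prefix: String, instanceId: String): String =
    s"$prefix:instance:$instanceId:games"

  /** Liveness key refreshed every 2s with a 5s TTL (step 3). */
  def instanceHeartbeat(prefix: String, instanceId: String): String =
    s"$prefix:instances:$instanceId"
```
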
---

## Proto Compilation Fix

### Problem
Scala code imports `de.nowchess.coordinator.HeartbeatFrame`, but the classes generated by the proto plugin are not visible to the Scala compiler in the Gradle build.

### Solution
The Quarkus gRPC plugin generates Java stubs in `build/generated/sources/protobuf/java/` during the `quarkusGenerateCode` task. These are compiled to `.class` files, but the Scala compiler can't find them at compile time because they are not on its classpath early enough.

**Fix**: Add a proto compilation order dependency in both modules:

**modules/coordinator/build.gradle.kts** and **modules/core/build.gradle.kts**:
```gradle
tasks.compileScala {
    dependsOn(tasks.named("compileJava")) // Ensures Java stubs are compiled first
}
```

Also ensure proto is on sourceSets:
```gradle
// Note: the `proto` block is only available when a plugin contributes it
// (for example com.google.protobuf); Quarkus reads src/main/proto by default.
sourceSets {
    main {
        proto {
            srcDir("src/main/proto")
        }
    }
}
```

Quarkus v3.x should handle this automatically, but the explicit dependency helps.

### Alternative: Use Generated Java Classes Directly
If the proto stubs are still not found, reference them **exactly as generated**:
```scala
// Avoid: a selector import of individual generated types
// import de.nowchess.coordinator.{
//   CoordinatorServiceGrpc,
//   HeartbeatFrame,
//   // ...
// }

// Instead, use fully qualified names, or check the actual generated names
val frame = de.nowchess.coordinator.HeartbeatFrame.newBuilder()
  .setInstanceId("...")
  .build()
```

Run `./gradlew clean modules:coordinator:compileJava` to regenerate the stubs, then inspect `build/generated/sources/protobuf/java/de/nowchess/coordinator/` to see the actual class names.

---

## Code Quality Issues (Non-Blocking)

**Fix in coordinator services** (these already produce `= _` deprecation warnings):

```scala
// OLD
@Inject var redissonClient: RedissonClient = _

// NEW
import scala.compiletime.uninitialized
@Inject var redissonClient: RedissonClient = uninitialized
```

**Jakarta optional injection**:
```scala
// Old (doesn't work: jakarta.inject.Inject has no `optional` attribute)
@Inject(optional = true) var kubeClient: KubernetesClient = _

// Better (use null check)
@Inject var kubeClient: KubernetesClient = null
if (kubeClient != null) { ... }
// If the bean can be absent, injecting jakarta.enterprise.inject.Instance[KubernetesClient]
// and checking isResolvable before get() is an alternative.
```

**Method params in private helpers**: Remove unused params in `scaleUp()`, `scaleDown()`, `rebalance()`.

---

## Missing Implementation (Phase 2)

### 1. **InstanceHeartbeatService** (DONE, needs testing)
- [x] Startup: generate instanceId, open gRPC stream, schedule heartbeats
- [x] Every 200ms: send `HeartbeatFrame` via stream
- [x] Every 2s: refresh Redis TTL on `{prefix}:instances:{id}`
- [x] `addGameSubscription(gameId)` → `SADD {prefix}:instance:{id}:games {gameId}`
- [x] `removeGameSubscription(gameId)` → `SREM {prefix}:instance:{id}:games {gameId}`
- [x] Shutdown: clean up Redis + stream
- [ ] **Test**: Kill core JVM, verify coordinator detects within 300ms

### 2. **Coordinator HealthMonitor** (skeleton done)
- [ ] Watch gRPC streams: on `onError()` or `onCompleted()`, mark instance DEAD
- [ ] Fallback: poll for Redis heartbeat TTL expiry every 5s
- [ ] Fallback: k8s pod watch for label `app=nowchess-core`, detect NotReady status
- [ ] Decision: if gRPC drop → immediate failover (no wait)

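### 3. **Coordinator FailoverService** (partial)
```scala
import scala.jdk.CollectionConverters.*

def onInstanceStreamDropped(instanceId: String): Unit =
  // SMEMBERS {prefix}:instance:{id}:games (`prefix` is assumed to come from config)
  val gamesSet = redissonClient.getSet[String](s"$prefix:instance:$instanceId:games")
  val gameIds = gamesSet.readAll().asScala.toList
  val healthyInstances = getAllHealthyInstances().filterNot(_.instanceId == instanceId)

  if gameIds.nonEmpty && healthyInstances.nonEmpty then
    // Ceiling division keeps the batch size >= 1 (grouped(0) would throw)
    val batchSize = math.max(1, (gameIds.size + healthyInstances.size - 1) / healthyInstances.size)

    // Distribute games round-robin, one batch per healthy instance
    gameIds.grouped(batchSize).zipWithIndex.foreach { case (batch, idx) =>
      val target = healthyInstances(idx % healthyInstances.size)
      // Assumes the instance metadata exposes host/port and CoreGrpcClient (section 4) is injected
      coreGrpcClient.batchResubscribeGames(target.host, target.port, batch)
    }

    // DEL {prefix}:instance:{id}:games, only after the games were handed off
    gamesSet.delete()
```
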
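The batching above splits games into equal groups without looking at how loaded each target already is. A load-aware variant could be sketched as a pure function that assigns each game to the currently least-loaded instance and returns one batch per target, which keeps the one-call-per-core design. `Target` and `distribute` are hypothetical names, and the load is assumed to be the current subscription count.

```scala
// Hypothetical model: an instance id plus its current subscription count.
final case class Target(instanceId: String, load: Int)

/** Returns instanceId -> batch of game IDs to resubscribe on that instance. */
def distribute(gameIds: Seq[String], targets: Seq[Target]): Map[String, Vector[String]] =
  if targets.isEmpty then Map.empty
  else
    val initialLoads = targets.map(t => t.instanceId -> t.load).toMap
    val (_, batches) =
      gameIds.foldLeft((initialLoads, Map.empty[String, Vector[String]])) {
        case ((loads, acc), gameId) =>
          // Pick the least-loaded target; ties break on instance id for determinism
          val (target, load) = loads.minBy { case (id, count) => (count, id) }
          (loads.updated(target, load + 1),
           acc.updated(target, acc.getOrElse(target, Vector.empty) :+ gameId))
      }
    batches
```

With targets `a` (load 0) and `b` (load 1), three games end up as two on `a` and one on `b`, so both instances finish at load 2. Keeping the planning step free of Redis and gRPC calls makes it easy to unit test separately from the failover path.
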
### 4. **Coordinator gRPC Client Stubs** (need manual integration)
Create **modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoreGrpcClient.scala**:
```scala
import io.quarkus.grpc.GrpcClient
import jakarta.enterprise.context.ApplicationScoped
import scala.compiletime.uninitialized

@ApplicationScoped
class CoreGrpcClient:
  // Blocking stub, so the call below returns the response directly
  // (the async stub takes a StreamObserver and returns nothing)
  @GrpcClient("core-grpc")
  private var coreStub: CoordinatorServiceGrpc.CoordinatorServiceBlockingStub = uninitialized

  def batchResubscribeGames(host: String, port: Int, gameIds: List[String]): Int =
    // Build request, call via dynamic stub to (host, port)
    val response = coreStub.batchResubscribeGames(...)
    response.getSubscribedCount
```

A dynamic gRPC client is still needed, because Quarkus does not support changing a client's host:port at runtime by default. **Workaround**: Use `io.grpc:grpc-netty-shaded` + `ManagedChannel` directly:
```scala
import io.grpc.ManagedChannelBuilder

val channel = ManagedChannelBuilder.forAddress(host, port).usePlaintext().build()
val stub = CoordinatorServiceGrpc.newBlockingStub(channel)
// Reuse one channel per (host, port) and call channel.shutdown() when the instance is removed
```

### 5. **LoadBalancer.rebalance()** (stub → full impl)
```scala
def rebalance(): Unit =
  val instances = listInstancesFromRedis()
  if instances.nonEmpty then // avoids division by zero below
    val loads = instances.map(_.subscriptionCount)
    val mean = loads.sum.toDouble / loads.size
    // targetLoad was undefined in the original sketch; the rounded-up mean is assumed here
    val targetLoad = math.ceil(mean).toInt

    val overloaded = instances.filter(_.subscriptionCount > maxGamesPerCore)
      .sortBy(-_.subscriptionCount)
    val underloaded = instances.filter(_.subscriptionCount < mean * 0.8)
      .sortBy(_.subscriptionCount)

    // Pair the most overloaded instance with the least loaded one, and so on,
    // instead of sending every excess game to the same target
    overloaded.zip(underloaded).foreach { (over, under) =>
      val excess = math.max(0, over.subscriptionCount - targetLoad)
      val toMove = getGamesToMove(over.instanceId, excess)
      over.coreGrpc.unsubscribeGames(toMove)
      under.coreGrpc.batchResubscribeGames(toMove)
      // Update Redis sets
    }
```

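The threshold logic in `rebalance()` can be checked in isolation before any gRPC call is wired in. The following is a minimal sketch under the same rules as above (overloaded means more than `maxGamesPerCore` subscriptions, underloaded means below 80% of the mean); `InstanceLoad` and `classify` are hypothetical names.

```scala
// Hypothetical model of one instance's current load.
final case class InstanceLoad(instanceId: String, subscriptionCount: Int)

/** Returns (overloaded sorted by load descending, underloaded sorted by load ascending). */
def classify(
    instances: Seq[InstanceLoad],
    maxGamesPerCore: Int
): (Seq[InstanceLoad], Seq[InstanceLoad]) =
  if instances.isEmpty then (Seq.empty, Seq.empty)
  else
    // Floating-point mean, so small clusters are not skewed by integer division
    val mean = instances.map(_.subscriptionCount).sum.toDouble / instances.size
    val overloaded = instances
      .filter(_.subscriptionCount > maxGamesPerCore)
      .sortBy(-_.subscriptionCount)
    val underloaded = instances
      .filter(_.subscriptionCount < mean * 0.8)
      .sortBy(_.subscriptionCount)
    (overloaded, underloaded)
```

For loads of 120, 40 and 20 with `maxGamesPerCore = 100`, the mean is 60 and the underloaded cutoff is 48, so the first instance is overloaded and the other two are underloaded, least loaded first.
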
### 6. **AutoScaler** (stub → k8s API calls)
```scala
def scaleUp(): Unit =
  if kubeClient != null && config.autoScaleEnabled then
    // Assumes `Rollout` is a custom-resource model class for the Argo Rollouts CRD
    kubeClient.resources(classOf[Rollout])
      .inNamespace(config.k8sNamespace)
      .withName(config.k8sRolloutName)
      .edit { rollout =>
        // Read-modify-write in one call instead of get() followed by createOrReplace()
        rollout.getSpec.setReplicas(rollout.getSpec.getReplicas + 1)
        rollout
      }
```

Requires: `io.fabric8:kubernetes-client:6.13.0` (already in build.gradle.kts).

### 7. **CacheEvictionManager** (stub → full impl)
```scala
import scala.jdk.CollectionConverters.*

def evictStaleGames(): Unit =
  val now = System.currentTimeMillis()
  // Iterate matching keys through Redisson instead of issuing a blocking KEYS call
  val keys = redissonClient.getKeys().getKeysByPattern(s"$prefix:game:entry:*").asScala

  keys.foreach { key =>
    val bucket = redissonClient.getBucket[String](key)
    // The entry may have disappeared between the key scan and this read
    Option(bucket.get()).foreach { json =>
      val lastUpdated = extractTimestamp(json) // Parse JSON

      if now - lastUpdated > config.gameIdleThreshold.toMillis then
        val gameId = key.stripPrefix(...)
        val instance = findInstanceWithGame(gameId)

        instance.foreach { inst =>
          inst.coreGrpc.evictGames(List(gameId))
        }
        bucket.delete()
    }
  }
```

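The idle check itself is the part of the eviction loop that is easiest to get wrong at the boundary, so it may be worth isolating as a pure function that can be tested without Redis. This is a sketch only; `isIdle` is a hypothetical helper, and the 45-minute value in the usage note is taken from the testing checklist rather than from a confirmed config default.

```scala
import java.time.Duration

/** True when the entry has not been updated for strictly longer than the idle threshold. */
def isIdle(lastUpdatedMillis: Long, nowMillis: Long, idleThreshold: Duration): Boolean =
  nowMillis - lastUpdatedMillis > idleThreshold.toMillis
```

With a 45-minute threshold, a game last updated 46 minutes ago is idle, while one updated exactly 45 minutes ago is not yet evicted because the comparison is strict.
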
### 8. **CoordinatorGrpcServer HeartbeatStream** (stub → full impl)
```scala
override def heartbeatStream(
    responseObserver: StreamObserver[CoordinatorCommand]
): StreamObserver[HeartbeatFrame] =
  new StreamObserver[HeartbeatFrame]:
    // Callbacks may run on different gRPC threads
    @volatile private var lastInstanceId = ""

    override def onNext(frame: HeartbeatFrame): Unit =
      lastInstanceId = frame.getInstanceId
      instanceRegistry.updateInstanceFromRedis(lastInstanceId)

    override def onError(t: Throwable): Unit =
      log.warnf(t, "Stream error for %s", lastInstanceId)
      // Skip failover if the stream failed before the first heartbeat identified the instance
      if lastInstanceId.nonEmpty then
        failoverService.onInstanceStreamDropped(lastInstanceId)

    override def onCompleted(): Unit =
      log.infof("Stream completed for %s", lastInstanceId)
      // Close the server side of the stream after a graceful client shutdown
      responseObserver.onCompleted()
```

---

## Testing Checklist

- [ ] Compile with proto fix
- [ ] Start core + coordinator
- [ ] Create game, subscribe core
- [ ] Watch `redis-cli SMEMBERS nowchess:instance:{id}:games` → game appears
- [ ] Kill core JVM via `kill -9`
- [ ] Verify coordinator log shows "stream error" within 200ms
- [ ] Verify second core receives `batchResubscribeGames` call within 300ms total
- [ ] Create second core, rebalance load, verify games migrate
- [ ] Scale up: verify Argo Rollout replica count increases
- [ ] 45min idle game: verify coordinator calls `evictGames`

---

## File Checklist

✅ **Created**:
- `modules/coordinator/build.gradle.kts`
- `modules/coordinator/src/main/proto/coordinator_service.proto`
- `modules/coordinator/src/main/resources/application.yml`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/config/CoordinatorConfig.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/dto/InstanceMetadata.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/InstanceRegistry.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/FailoverService.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/LoadBalancer.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/AutoScaler.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/HealthMonitor.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/service/CacheEvictionManager.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/grpc/CoordinatorGrpcServer.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/resource/CoordinatorResource.scala`
- `modules/coordinator/src/main/scala/de/nowchess/coordinator/CoordinatorApp.scala`
- `modules/core/src/main/proto/coordinator_service.proto`
- `modules/core/src/main/scala/de/nowchess/chess/service/InstanceHeartbeatService.scala`
- `modules/core/src/main/scala/de/nowchess/chess/grpc/CoordinatorServiceHandler.scala`

✅ **Modified**:
- `settings.gradle.kts` → added `modules:coordinator`
- `modules/core/src/main/resources/application.yml` → added coordinator gRPC client + heartbeat config
- `modules/core/build.gradle.kts` → (no changes, proto handled by quarkus-grpc)
- `modules/core/src/main/scala/de/nowchess/chess/redis/GameRedisSubscriberManager.scala` → added InstanceHeartbeatService injection, SADD/SREM, batch ops

---

## Next Steps (New Session)

1. Run `./gradlew clean modules:coordinator:compileScala` with proto fix
2. Finish gRPC client stubs (Rollout, managed channels)
3. Implement `FailoverService.distributeGames()` with actual core gRPC calls
4. Implement `LoadBalancer.rebalance()` with game migration
5. Implement `AutoScaler` with k8s API
6. Implement `CacheEvictionManager` with timestamp parsing
7. Run integration tests (manual or `@QuarkusTest`)
8. Benchmark: create 5000 games, kill 1 core, measure failover time

---

## Design Decisions (Record for Future)

- **gRPC stream as primary**: TCP-level detection in <200ms, versus 5-30s for polling/TTL
- **Redis game sets**: SADD/SREM for O(1) lookup vs scanning Redis per failover
- **Argo Rollouts not StatefulSet**: Respects canary/blue-green; patch via Fabric8 `GenericKubernetesResource`
- **Batch gRPC calls**: One call per target core vs 1:1 calls per game (saves RPC overhead)
- **No persistent subscriptions**: On coordinator restart, gRPC reconnects auto-trigger resubscribe; best-effort is OK

---

## Known Gaps

- Error handling: what if `batchResubscribeGames` fails? Retry? Partial migration? (Add circuit breaker)
- Coordinator HA: currently a single instance. If needed, add a quorum or run multiple replicas in a K8s deployment with leader election
- Metrics: no Prometheus exports yet (add via `quarkus-micrometer`)
- Monitoring: logs only, no alerts on failover latency SLA violation