Compare commits

...

2 Commits

Author SHA1 Message Date
LQ63 3bda9466b5 docs(tournament): spec
Build & Test (NowChessSystems) TeamCity build failed
Added tournament spec to docs
2026-05-13 13:32:52 +02:00
Janis 81b045d01b feat: add coordinator startup validation and K8s pod watch
Build & Test (NowChessSystems) TeamCity build failed
On startup, load all known instances from Redis and wait 15s for them to
reconnect via gRPC. Evict instances that don't reconnect within the timeout
and delete their K8s pods.

Replace one-shot pod status check with real fabric8 Watch. On pod Terminating
event, mark instance dead. On pod Deleted event, trigger failover. Failover
now waits reactively for at least one healthy instance before distributing
orphaned games, up to 30s timeout.

- Add startupValidationTimeout and failoverWaitTimeout config (15s, 30s)
- CoordinatorGrpcServer tracks active gRPC streams
- InstanceRegistry.loadAllFromRedis() scans and loads instances on startup
- HealthMonitor startup observer validates instances and starts K8s watch
- FailoverService.onInstanceStreamDropped returns Uni[Unit] for reactive wait
- All failover service callers updated to subscribe to Uni result

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 09:55:38 +02:00
9 changed files with 1131 additions and 49 deletions
+313
View File
@@ -0,0 +1,313 @@
# NowChess Tournament API
Swiss-system bot tournaments. Bots are paired by score each round; all bots play every round (no eliminations). Game moves flow through the existing board and bot endpoints — the tournament module only orchestrates pairings, standings, and lifecycle.
---
## Base path
```
/api/tournament
```
Routing: `/api/tournament``nowchess-tournament-active:8086`
---
## Authentication
All endpoints require a valid JWT (`Authorization: Bearer <token>`).
Bot-facing streaming endpoints additionally require the token's subject to match the registered `botId`.
---
## Data models
### Tournament
```json
{
"id": "t7kXq2",
"name": "Friday Night Bots",
"status": "created | started | finished",
"rounds": 5,
"currentRound": 2,
"timeControl": {
"limitSeconds": 300,
"incrementSeconds": 3
},
"createdBy": "userId",
"createdAt": "2026-05-13T18:00:00Z",
"startedAt": "2026-05-13T18:05:00Z",
"finishedAt": null
}
```
### Standing
```json
{
"rank": 1,
"botId": "bot_abc",
"botName": "StockfishClone",
"points": 3.5,
"wins": 3,
"draws": 1,
"losses": 0,
"buchholz": 9.0
}
```
Tiebreaker: Buchholz score (sum of opponents' points).
### Pairing
```json
{
"round": 2,
"whiteBot": "bot_abc",
"blackBot": "bot_xyz",
"gameId": "j0nPtcjl",
"result": "white | black | draw | ongoing"
}
```
### TournamentEvent (SSE)
```json
{ "type": "tournamentStarted", "tournamentId": "t7kXq2" }
{ "type": "roundStarted", "tournamentId": "t7kXq2", "round": 2 }
{ "type": "pairingReady", "tournamentId": "t7kXq2", "round": 2, "gameId": "j0nPtcjl", "color": "white" }
{ "type": "roundFinished", "tournamentId": "t7kXq2", "round": 2 }
{ "type": "tournamentFinished","tournamentId": "t7kXq2" }
```
---
## Endpoints
### Tournament lifecycle
#### Create tournament
```
POST /api/tournament
```
Body:
```json
{
"name": "Friday Night Bots",
"rounds": 5,
"timeControl": {
"limitSeconds": 300,
"incrementSeconds": 3
}
}
```
Response `201 Created`:
```json
{ "id": "t7kXq2" }
```
The creator becomes the tournament director. Only the director can start and delete the tournament.
---
#### Get tournament
```
GET /api/tournament/{tournamentId}
```
Response `200 OK`: `Tournament` object.
---
#### List tournaments
```
GET /api/tournament
```
Query params:
| Param | Type | Default |
|----------|---------------------------------|-----------|
| `status` | `created\|started\|finished` | (all) |
| `limit` | integer (max 50) | 20 |
| `offset` | integer | 0 |
Response `200 OK`:
```json
{
"tournaments": [ /* Tournament[] */ ],
"total": 42
}
```
---
#### Start tournament
```
POST /api/tournament/{tournamentId}/start
```
Requires at least 2 registered bots. Computes round 1 pairings (random for round 1; score-based from round 2). Creates one game per pairing via `POST /api/board/game`.
Response `200 OK`: updated `Tournament` object.
---
#### Delete tournament
```
DELETE /api/tournament/{tournamentId}
```
Only allowed while `status == "created"`. Response `204 No Content`.
---
### Bot registration
#### Register bot
```
POST /api/tournament/{tournamentId}/bots
```
Registers a bot for the tournament. Must be called before the tournament starts.
The token subject must match the bot being registered.
Body:
```json
{ "botId": "bot_abc" }
```
Response `200 OK`:
```json
{ "botId": "bot_abc", "tournamentId": "t7kXq2" }
```
---
#### Unregister bot
```
DELETE /api/tournament/{tournamentId}/bots/{botId}
```
Only allowed while `status == "created"`. Response `204 No Content`.
---
#### List registered bots
```
GET /api/tournament/{tournamentId}/bots
```
Response `200 OK`:
```json
{
"bots": [
{ "botId": "bot_abc", "botName": "StockfishClone" }
]
}
```
---
### Standings and pairings
#### Get standings
```
GET /api/tournament/{tournamentId}/standings
```
Response `200 OK`:
```json
{ "standings": [ /* Standing[] */ ] }
```
---
#### Get pairings for a round
```
GET /api/tournament/{tournamentId}/rounds/{round}/pairings
```
Response `200 OK`:
```json
{ "pairings": [ /* Pairing[] */ ] }
```
---
### Bot streaming
#### Stream tournament events
```
GET /api/tournament/{tournamentId}/stream
```
Headers: `Accept: text/event-stream`
Server-Sent Events stream scoped to this tournament. The bot receives `pairingReady` events when it is assigned a game, at which point it should connect to the existing bot game stream:
```
GET /bot/stream/game/{gameId} (existing endpoint)
POST /bot/game/{gameId}/move/{uci} (existing endpoint)
```
The tournament module never sends moves — bots do that themselves through the existing bot endpoints.
---
## Typical bot flow
```
1. POST /api/tournament # director creates tournament
2. POST /api/tournament/{id}/bots # each bot registers
3. POST /api/tournament/{id}/start # director starts
4. GET /api/tournament/{id}/stream (SSE) # each bot opens stream
-- per round --
5. receive: pairingReady { gameId, color }
6. GET /bot/stream/game/{gameId} # existing endpoint
7. POST /bot/game/{gameId}/move/{uci} # existing endpoint, repeated
-- game ends --
8. receive: roundFinished
9. GET /api/tournament/{id}/standings # optional, inspect scores
-- repeat 59 for each round --
10. receive: tournamentFinished
11. GET /api/tournament/{id}/standings # final ranking
```
---
## Error responses
| Status | Meaning |
|--------|------------------------------------------------------|
| 400 | Invalid request body or parameters |
| 401 | Missing or invalid JWT |
| 403 | Action not allowed (wrong director, wrong bot, etc.) |
| 404 | Tournament or bot not found |
| 409 | Tournament already started / bot already registered |
+623
View File
@@ -0,0 +1,623 @@
openapi: 3.0.3
info:
title: NowChess Tournament API
description: |
Swiss-system bot tournaments, modelled after the Lichess API style.
Game moves flow through the existing board and bot endpoints — this module
handles pairings, standings, and lifecycle only.
## Streaming
Endpoints marked **NDJSON** return newline-delimited JSON objects
(`application/x-ndjson`). Each line is one complete JSON object. The
connection stays open until the tournament or round ends.
## Bot flow
```
POST /api/tournament # create
POST /api/tournament/{id}/join # each bot joins
POST /api/tournament/{id}/start # director starts
GET /api/tournament/{id}/stream (NDJSON) # open before start
-- per round --
receive gameStart { gameId, color }
GET /bot/stream/game/{gameId} (existing, NDJSON)
POST /bot/game/{gameId}/move/{uci} (existing)
-- repeat --
GET /api/tournament/{id}/results (NDJSON) # final standings
```
version: 1.0.0
servers:
- url: https://st.nowchess.janis-eccarius.de
description: Staging
- url: https://nowchess.janis-eccarius.de
description: Production
- url: http://localhost:8086
description: Local
security:
- bearerAuth: []
tags:
- name: Tournament
description: Tournament lifecycle
- name: Participation
description: Join and withdraw
- name: Results
description: Standings, pairings, and game export
- name: Stream
description: NDJSON event streams
paths:
/api/tournament:
get:
tags: [Tournament]
summary: Get current tournaments
description: Returns tournaments grouped by status. No auth required.
security: []
responses:
"200":
description: Tournaments by status
content:
application/json:
schema:
type: object
properties:
created:
type: array
items:
$ref: "#/components/schemas/TournamentInfo"
started:
type: array
items:
$ref: "#/components/schemas/TournamentInfo"
finished:
type: array
items:
$ref: "#/components/schemas/TournamentInfo"
post:
tags: [Tournament]
summary: Create a new tournament
description: The authenticated user becomes the tournament director.
requestBody:
required: true
content:
application/x-www-form-urlencoded:
schema:
$ref: "#/components/schemas/CreateTournamentForm"
responses:
"201":
description: Tournament created
content:
application/json:
schema:
$ref: "#/components/schemas/Tournament"
"400":
$ref: "#/components/responses/BadRequest"
"401":
$ref: "#/components/responses/Unauthorized"
/api/tournament/{id}:
parameters:
- $ref: "#/components/parameters/id"
get:
tags: [Tournament]
summary: Get a tournament
description: Includes the first page of standings in the `standing` field.
security: []
responses:
"200":
description: Tournament with embedded standings
content:
application/json:
schema:
$ref: "#/components/schemas/Tournament"
"404":
$ref: "#/components/responses/NotFound"
delete:
tags: [Tournament]
summary: Terminate a tournament
description: Only the director may terminate. Only allowed while status is `created`.
responses:
"204":
description: Terminated
"401":
$ref: "#/components/responses/Unauthorized"
"403":
$ref: "#/components/responses/Forbidden"
"404":
$ref: "#/components/responses/NotFound"
"409":
$ref: "#/components/responses/Conflict"
/api/tournament/{id}/start:
parameters:
- $ref: "#/components/parameters/id"
post:
tags: [Tournament]
summary: Start the tournament
description: |
Only the director may start. Requires at least 2 joined bots.
Computes round 1 pairings and creates games via `POST /api/board/game`.
responses:
"200":
description: Tournament started
content:
application/json:
schema:
$ref: "#/components/schemas/Tournament"
"401":
$ref: "#/components/responses/Unauthorized"
"403":
$ref: "#/components/responses/Forbidden"
"404":
$ref: "#/components/responses/NotFound"
"409":
$ref: "#/components/responses/Conflict"
/api/tournament/{id}/join:
parameters:
- $ref: "#/components/parameters/id"
post:
tags: [Participation]
summary: Join a tournament
description: |
Register the authenticated bot for the tournament. Only allowed while
status is `created`. The token subject must be a bot account.
responses:
"200":
description: Ok
content:
application/json:
schema:
$ref: "#/components/schemas/Ok"
"401":
$ref: "#/components/responses/Unauthorized"
"403":
$ref: "#/components/responses/Forbidden"
"404":
$ref: "#/components/responses/NotFound"
"409":
$ref: "#/components/responses/Conflict"
/api/tournament/{id}/withdraw:
parameters:
- $ref: "#/components/parameters/id"
post:
tags: [Participation]
summary: Withdraw from a tournament
description: Only allowed while status is `created`.
responses:
"200":
description: Ok
content:
application/json:
schema:
$ref: "#/components/schemas/Ok"
"401":
$ref: "#/components/responses/Unauthorized"
"403":
$ref: "#/components/responses/Forbidden"
"404":
$ref: "#/components/responses/NotFound"
"409":
$ref: "#/components/responses/Conflict"
/api/tournament/{id}/results:
parameters:
- $ref: "#/components/parameters/id"
get:
tags: [Results]
summary: Get results as NDJSON stream
description: |
Streams one `Result` object per line, sorted by rank ascending.
Available at any point during or after the tournament.
security: []
parameters:
- name: nb
in: query
description: Max number of results to stream (default all)
schema:
type: integer
minimum: 1
responses:
"200":
description: NDJSON stream of results
content:
application/x-ndjson:
schema:
$ref: "#/components/schemas/Result"
"404":
$ref: "#/components/responses/NotFound"
/api/tournament/{id}/round/{round}:
parameters:
- $ref: "#/components/parameters/id"
- name: round
in: path
required: true
schema:
type: integer
minimum: 1
get:
tags: [Results]
summary: Get pairings for a round
security: []
responses:
"200":
description: Pairings for the specified round
content:
application/json:
schema:
type: object
properties:
round:
type: integer
example: 2
pairings:
type: array
items:
$ref: "#/components/schemas/Pairing"
"404":
$ref: "#/components/responses/NotFound"
/api/tournament/{id}/export/games:
parameters:
- $ref: "#/components/parameters/id"
get:
tags: [Results]
summary: Export all games
description: |
Returns all games of the tournament. Accepts both PGN and NDJSON via
the `Accept` header.
security: []
parameters:
- name: Accept
in: header
schema:
type: string
enum:
- application/x-chess-pgn
- application/x-ndjson
default: application/x-chess-pgn
responses:
"200":
description: Games in the requested format
content:
application/x-chess-pgn:
schema:
type: string
description: Standard PGN, one game per block
application/x-ndjson:
schema:
$ref: "#/components/schemas/GameExport"
"404":
$ref: "#/components/responses/NotFound"
/api/tournament/{id}/stream:
parameters:
- $ref: "#/components/parameters/id"
get:
tags: [Stream]
summary: Stream tournament events
description: |
NDJSON stream scoped to one tournament. Keep this connection open for
the full tournament lifetime.
On `gameStart` the bot connects to the existing bot endpoints:
- `GET /bot/stream/game/{gameId}` — stream game state (existing)
- `POST /bot/game/{gameId}/move/{uci}` — submit moves (existing)
responses:
"200":
description: NDJSON event stream
content:
application/x-ndjson:
schema:
$ref: "#/components/schemas/TournamentEvent"
"401":
$ref: "#/components/responses/Unauthorized"
"404":
$ref: "#/components/responses/NotFound"
components:
securitySchemes:
bearerAuth:
type: http
scheme: bearer
bearerFormat: JWT
parameters:
id:
name: id
in: path
required: true
schema:
type: string
example: t7kXq2
schemas:
Clock:
type: object
required: [limit, increment]
properties:
limit:
type: integer
description: Base time in seconds
example: 300
increment:
type: integer
description: Increment per move in seconds
example: 3
Variant:
type: object
properties:
key:
type: string
example: standard
name:
type: string
example: Standard
BotRef:
type: object
properties:
id:
type: string
example: bot_abc
name:
type: string
example: StockfishClone
Standing:
type: object
properties:
page:
type: integer
example: 1
players:
type: array
items:
$ref: "#/components/schemas/Result"
TournamentInfo:
description: Lightweight tournament summary used in list responses.
type: object
properties:
id:
type: string
example: t7kXq2
fullName:
type: string
example: Friday Night Bots Swiss
clock:
$ref: "#/components/schemas/Clock"
variant:
$ref: "#/components/schemas/Variant"
rated:
type: boolean
example: true
nbPlayers:
type: integer
example: 8
nbRounds:
type: integer
example: 5
createdBy:
type: string
example: userId
startsAt:
type: string
format: date-time
Tournament:
allOf:
- $ref: "#/components/schemas/TournamentInfo"
- type: object
properties:
status:
type: string
enum: [created, started, finished]
example: started
round:
type: integer
description: Current round number
example: 2
standing:
$ref: "#/components/schemas/Standing"
winner:
description: Present only when status is `finished`
allOf:
- $ref: "#/components/schemas/BotRef"
nullable: true
CreateTournamentForm:
type: object
required: [name, nbRounds, clockLimit, clockIncrement]
properties:
name:
type: string
example: Friday Night Bots
nbRounds:
type: integer
minimum: 1
example: 5
clockLimit:
type: integer
description: Base time in seconds
example: 300
clockIncrement:
type: integer
description: Increment per move in seconds
example: 3
rated:
type: boolean
default: true
Result:
type: object
properties:
rank:
type: integer
example: 1
points:
type: number
format: double
example: 3.5
tieBreak:
type: number
format: double
description: Buchholz score (sum of opponents' points)
example: 9.0
bot:
$ref: "#/components/schemas/BotRef"
nbGames:
type: integer
example: 4
wins:
type: integer
example: 3
draws:
type: integer
example: 1
losses:
type: integer
example: 0
Pairing:
type: object
properties:
round:
type: integer
example: 2
white:
$ref: "#/components/schemas/BotRef"
black:
$ref: "#/components/schemas/BotRef"
gameId:
type: string
example: j0nPtcjl
winner:
type: string
enum: [white, black, draw]
nullable: true
description: Null while the game is ongoing
GameExport:
description: One game object per NDJSON line.
type: object
properties:
id:
type: string
example: j0nPtcjl
round:
type: integer
example: 2
white:
$ref: "#/components/schemas/BotRef"
black:
$ref: "#/components/schemas/BotRef"
winner:
type: string
enum: [white, black, draw]
nullable: true
moves:
type: string
description: Space-separated UCI moves
example: e2e4 e7e5 g1f3
TournamentEvent:
description: |
One JSON object per NDJSON line. Discriminate on `type`.
| type | extra fields |
|------|-------------|
| `tournamentStarted` | — |
| `roundStarted` | `round` |
| `gameStart` | `round`, `gameId`, `color` |
| `roundFinished` | `round` |
| `tournamentFinished` | `winner` |
type: object
required: [type]
properties:
type:
type: string
enum:
- tournamentStarted
- roundStarted
- gameStart
- roundFinished
- tournamentFinished
round:
type: integer
example: 2
gameId:
type: string
example: j0nPtcjl
color:
type: string
enum: [white, black]
winner:
$ref: "#/components/schemas/BotRef"
Ok:
type: object
properties:
ok:
type: boolean
example: true
Error:
type: object
properties:
error:
type: string
example: tournament already started
responses:
BadRequest:
description: Invalid request body or parameters
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
Unauthorized:
description: Missing or invalid JWT
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
Forbidden:
description: Action not permitted for this user or bot
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
NotFound:
description: Tournament not found
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
Conflict:
description: Conflicting state (e.g. already started, bot already joined)
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
@@ -45,6 +45,8 @@ nowchess:
k8s-namespace: default k8s-namespace: default
k8s-rollout-name: nowchess-core k8s-rollout-name: nowchess-core
k8s-rollout-label-selector: "app=nowchess-core" k8s-rollout-label-selector: "app=nowchess-core"
startup-validation-timeout: 15s
failover-wait-timeout: 30s
--- ---
# dev profile # dev profile
@@ -56,3 +56,9 @@ trait CoordinatorConfig:
@WithName("k8s-rollout-label-selector") @WithName("k8s-rollout-label-selector")
def k8sRolloutLabelSelector: String def k8sRolloutLabelSelector: String
@WithName("startup-validation-timeout")
def startupValidationTimeout: Duration
@WithName("failover-wait-timeout")
def failoverWaitTimeout: Duration
@@ -9,6 +9,7 @@ import de.nowchess.coordinator.proto.{CoordinatorServiceGrpc, *}
import io.grpc.stub.StreamObserver import io.grpc.stub.StreamObserver
import com.fasterxml.jackson.databind.ObjectMapper import com.fasterxml.jackson.databind.ObjectMapper
import org.jboss.logging.Logger import org.jboss.logging.Logger
import java.util.concurrent.ConcurrentHashMap
@GrpcService @GrpcService
@Singleton @Singleton
@@ -21,8 +22,9 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
private var failoverService: FailoverService = uninitialized private var failoverService: FailoverService = uninitialized
// scalafix:on DisableSyntax.var // scalafix:on DisableSyntax.var
private val mapper = ObjectMapper() private val mapper = ObjectMapper()
private val log = Logger.getLogger(classOf[CoordinatorGrpcServer]) private val log = Logger.getLogger(classOf[CoordinatorGrpcServer])
private val activeStreams = ConcurrentHashMap.newKeySet[String]()
override def heartbeatStream( override def heartbeatStream(
responseObserver: StreamObserver[CoordinatorCommand], responseObserver: StreamObserver[CoordinatorCommand],
@@ -38,6 +40,7 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
lastInstanceId = frame.getInstanceId lastInstanceId = frame.getInstanceId
if !firstFrameSeen then if !firstFrameSeen then
firstFrameSeen = true firstFrameSeen = true
activeStreams.add(frame.getInstanceId)
log.infof( log.infof(
"First heartbeat from instance %s (host=%s http=%d grpc=%d)", "First heartbeat from instance %s (host=%s http=%d grpc=%d)",
frame.getInstanceId, frame.getInstanceId,
@@ -60,10 +63,19 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
override def onError(t: Throwable): Unit = override def onError(t: Throwable): Unit =
log.warnf(t, "Heartbeat stream error for instance %s", lastInstanceId) log.warnf(t, "Heartbeat stream error for instance %s", lastInstanceId)
if lastInstanceId.nonEmpty then failoverService.onInstanceStreamDropped(lastInstanceId) if lastInstanceId.nonEmpty then
activeStreams.remove(lastInstanceId)
failoverService
.onInstanceStreamDropped(lastInstanceId)
.subscribe()
.`with`(
_ => (),
ex => log.warnf(ex, "Failover for %s failed", lastInstanceId),
)
override def onCompleted: Unit = override def onCompleted: Unit =
log.infof("Heartbeat stream completed for instance %s", lastInstanceId) log.infof("Heartbeat stream completed for instance %s", lastInstanceId)
activeStreams.remove(lastInstanceId)
override def batchResubscribeGames( override def batchResubscribeGames(
request: BatchResubscribeRequest, request: BatchResubscribeRequest,
@@ -108,7 +120,18 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
val instanceId = request.getInstanceId val instanceId = request.getInstanceId
log.infof("Drain request for instance %s", instanceId) log.infof("Drain request for instance %s", instanceId)
val gamesBefore = instanceRegistry.getInstance(instanceId).map(_.subscriptionCount).getOrElse(0) val gamesBefore = instanceRegistry.getInstance(instanceId).map(_.subscriptionCount).getOrElse(0)
failoverService.onInstanceStreamDropped(instanceId) failoverService
val response = DrainInstanceResponse.newBuilder().setGamesMigrated(gamesBefore).build() .onInstanceStreamDropped(instanceId)
responseObserver.onNext(response) .subscribe()
responseObserver.onCompleted() .`with`(
_ =>
val response = DrainInstanceResponse.newBuilder().setGamesMigrated(gamesBefore).build()
responseObserver.onNext(response)
responseObserver.onCompleted(),
ex =>
log.warnf(ex, "Drain failed for %s", instanceId)
responseObserver.onError(ex),
)
def hasActiveStream(instanceId: String): Boolean =
activeStreams.contains(instanceId)
@@ -70,7 +70,13 @@ class CoordinatorResource:
@Produces(Array(MediaType.APPLICATION_JSON)) @Produces(Array(MediaType.APPLICATION_JSON))
def triggerFailover(@PathParam("instanceId") instanceId: String): scala.collection.Map[String, String] = def triggerFailover(@PathParam("instanceId") instanceId: String): scala.collection.Map[String, String] =
log.infof("Manual failover triggered for instance %s", instanceId) log.infof("Manual failover triggered for instance %s", instanceId)
failoverService.onInstanceStreamDropped(instanceId) failoverService
.onInstanceStreamDropped(instanceId)
.subscribe()
.`with`(
_ => (),
ex => log.warnf(ex, "Manual failover for %s failed", instanceId),
)
Map("status" -> "failover_started", "instanceId" -> instanceId) Map("status" -> "failover_started", "instanceId" -> instanceId)
@POST @POST
@@ -8,6 +8,9 @@ import scala.compiletime.uninitialized
import org.jboss.logging.Logger import org.jboss.logging.Logger
import de.nowchess.coordinator.dto.InstanceMetadata import de.nowchess.coordinator.dto.InstanceMetadata
import de.nowchess.coordinator.grpc.CoreGrpcClient import de.nowchess.coordinator.grpc.CoreGrpcClient
import de.nowchess.coordinator.config.CoordinatorConfig
import io.smallrye.mutiny.Uni
import java.time.Duration
@ApplicationScoped @ApplicationScoped
class FailoverService: class FailoverService:
@@ -21,6 +24,9 @@ class FailoverService:
@Inject @Inject
private var coreGrpcClient: CoreGrpcClient = uninitialized private var coreGrpcClient: CoreGrpcClient = uninitialized
@Inject
private var config: CoordinatorConfig = uninitialized
private val log = Logger.getLogger(classOf[FailoverService]) private val log = Logger.getLogger(classOf[FailoverService])
private var redisPrefix = "nowchess" private var redisPrefix = "nowchess"
// scalafix:on DisableSyntax.var // scalafix:on DisableSyntax.var
@@ -28,7 +34,7 @@ class FailoverService:
def setRedisPrefix(prefix: String): Unit = def setRedisPrefix(prefix: String): Unit =
redisPrefix = prefix redisPrefix = prefix
def onInstanceStreamDropped(instanceId: String): Unit = def onInstanceStreamDropped(instanceId: String): Uni[Unit] =
log.infof("Instance %s stream dropped, triggering failover", instanceId) log.infof("Instance %s stream dropped, triggering failover", instanceId)
val startTime = System.currentTimeMillis() val startTime = System.currentTimeMillis()
@@ -37,19 +43,32 @@ class FailoverService:
val gameIds = getOrphanedGames(instanceId) val gameIds = getOrphanedGames(instanceId)
log.infof("Found %d orphaned games for instance %s", gameIds.size, instanceId) log.infof("Found %d orphaned games for instance %s", gameIds.size, instanceId)
if gameIds.nonEmpty then if gameIds.isEmpty then
val healthyInstances = instanceRegistry.getAllInstances cleanupDeadInstance(instanceId)
.filter(_.state == "HEALTHY") Uni.createFrom().item(())
.sortBy(_.subscriptionCount) else
waitForHealthyInstanceAsync()
.onItem()
.transform { _ =>
val healthyInstances = instanceRegistry.getAllInstances
.filter(_.state == "HEALTHY")
.sortBy(_.subscriptionCount)
distributeGames(gameIds, healthyInstances, instanceId)
if healthyInstances.nonEmpty then val elapsed = System.currentTimeMillis() - startTime
distributeGames(gameIds, healthyInstances, instanceId) log.infof("Failover completed in %dms for instance %s", elapsed, instanceId)
cleanupDeadInstance(instanceId)
val elapsed = System.currentTimeMillis() - startTime ()
log.infof("Failover completed in %dms for instance %s", elapsed, instanceId) }
else log.warnf("No healthy instances available for failover of %s", instanceId) .onFailure()
.recoverWithItem { _ =>
cleanupDeadInstance(instanceId) log.errorf(
"No healthy instance appeared within %s — games orphaned for %s",
config.failoverWaitTimeout,
instanceId,
)
()
}
private def getOrphanedGames(instanceId: String): List[String] = private def getOrphanedGames(instanceId: String): List[String] =
val setKey = s"$redisPrefix:instance:$instanceId:games" val setKey = s"$redisPrefix:instance:$instanceId:games"
@@ -101,3 +120,16 @@ class FailoverService:
val setKey = s"$redisPrefix:instance:$instanceId:games" val setKey = s"$redisPrefix:instance:$instanceId:games"
redis.key(classOf[String]).del(setKey) redis.key(classOf[String]).del(setKey)
log.infof("Cleaned up games set for instance %s", instanceId) log.infof("Cleaned up games set for instance %s", instanceId)
private def waitForHealthyInstanceAsync(): Uni[InstanceMetadata] =
Uni.createFrom().deferred(() =>
instanceRegistry.getAllInstances
.filter(_.state == "HEALTHY")
.sortBy(_.subscriptionCount)
.headOption match
case Some(inst) => Uni.createFrom().item(inst)
case None => Uni.createFrom().failure(new RuntimeException("no healthy instance"))
).onFailure()
.retry()
.withBackOff(Duration.ofMillis(500))
.expireIn(config.failoverWaitTimeout.toMillis)
@@ -2,17 +2,23 @@ package de.nowchess.coordinator.service
import jakarta.annotation.PostConstruct import jakarta.annotation.PostConstruct
import jakarta.enterprise.context.ApplicationScoped import jakarta.enterprise.context.ApplicationScoped
import jakarta.enterprise.event.Observes
import jakarta.enterprise.inject.Instance import jakarta.enterprise.inject.Instance
import jakarta.inject.Inject import jakarta.inject.Inject
import de.nowchess.coordinator.config.CoordinatorConfig import de.nowchess.coordinator.config.CoordinatorConfig
import io.fabric8.kubernetes.client.KubernetesClient import io.fabric8.kubernetes.client.KubernetesClient
import io.fabric8.kubernetes.api.model.Pod import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.Watcher
import io.fabric8.kubernetes.client.WatcherException
import io.micrometer.core.instrument.MeterRegistry import io.micrometer.core.instrument.MeterRegistry
import io.quarkus.redis.datasource.RedisDataSource import io.quarkus.redis.datasource.RedisDataSource
import io.quarkus.runtime.StartupEvent
import scala.jdk.CollectionConverters.* import scala.jdk.CollectionConverters.*
import org.jboss.logging.Logger import org.jboss.logging.Logger
import scala.compiletime.uninitialized import scala.compiletime.uninitialized
import java.time.Instant import java.time.Instant
import de.nowchess.coordinator.grpc.CoordinatorGrpcServer
import de.nowchess.coordinator.dto.InstanceMetadata
@ApplicationScoped @ApplicationScoped
class HealthMonitor: class HealthMonitor:
@@ -32,6 +38,12 @@ class HealthMonitor:
@Inject @Inject
private var meterRegistry: MeterRegistry = uninitialized private var meterRegistry: MeterRegistry = uninitialized
@Inject
private var grpcServer: CoordinatorGrpcServer = uninitialized
@Inject
private var failoverService: FailoverService = uninitialized
private val log = Logger.getLogger(classOf[HealthMonitor]) private val log = Logger.getLogger(classOf[HealthMonitor])
private var redisPrefix = "nowchess" private var redisPrefix = "nowchess"
// scalafix:on DisableSyntax.var // scalafix:on DisableSyntax.var
@@ -48,6 +60,15 @@ class HealthMonitor:
meterRegistry.counter("nowchess.coordinator.health.checks").increment(0) meterRegistry.counter("nowchess.coordinator.health.checks").increment(0)
meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment(0) meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment(0)
def onStartup(@Observes ev: StartupEvent): Unit =
instanceRegistry.loadAllFromRedis()
val loaded = instanceRegistry.getAllInstances
log.infof("Startup: loaded %d instances from Redis", loaded.size)
if loaded.nonEmpty then
val timeoutMs = config.startupValidationTimeout.toMillis
Thread.ofVirtual().start(() => validateStartupInstances(timeoutMs))
startPodWatch()
def checkInstanceHealth: Unit = def checkInstanceHealth: Unit =
meterRegistry.counter("nowchess.coordinator.health.checks").increment() meterRegistry.counter("nowchess.coordinator.health.checks").increment()
val evicted = instanceRegistry.evictStaleInstances(config.instanceDeadTimeout) val evicted = instanceRegistry.evictStaleInstances(config.instanceDeadTimeout)
@@ -98,41 +119,33 @@ class HealthMonitor:
true true
} }
def watchK8sPods: Unit = private def startPodWatch(): Unit =
kubeClientOpt match kubeClientOpt match
case None => case None => log.debug("K8s client unavailable, skipping pod watch")
log.debug("Kubernetes client not available for pod watch")
case Some(kube) => case Some(kube) =>
try try
val pods = kube kube
.pods() .pods()
.inNamespace(config.k8sNamespace) .inNamespace(config.k8sNamespace)
.withLabel(config.k8sRolloutLabelSelector) .withLabel(config.k8sRolloutLabelSelector)
.list() .watch(new Watcher[Pod]:
.getItems override def eventReceived(action: Watcher.Action, pod: Pod): Unit =
.asScala action match
case Watcher.Action.DELETED =>
handlePodGone(pod)
case Watcher.Action.MODIFIED
if Option(pod.getMetadata.getDeletionTimestamp).isDefined =>
handlePodTerminating(pod)
case _ => ()
val instances = instanceRegistry.getAllInstances override def onClose(cause: WatcherException): Unit =
instances.foreach { inst => if cause != null then
val matchingPod = pods.find { pod => log.warnf(cause, "Pod watch closed, restarting")
pod.getMetadata.getName.contains(inst.instanceId) startPodWatch()
} )
log.info("Pod watch started")
matchingPod match
case Some(pod) =>
val isReady = isPodReady(pod)
if !isReady && inst.state == "HEALTHY" then
meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment()
log.warnf("Pod %s not ready, marking instance %s dead", pod.getMetadata.getName, inst.instanceId)
instanceRegistry.markInstanceDead(inst.instanceId)
deleteK8sPod(inst.instanceId)
case None =>
log.warnf("No pod found for instance %s, evicting from registry", inst.instanceId)
instanceRegistry.removeInstance(inst.instanceId)
}
catch catch
case ex: Exception => case ex: Exception => log.warnf(ex, "Failed to start pod watch")
log.warnf(ex, "Failed to watch k8s pods")
private def isPodReady(pod: Pod): Boolean = private def isPodReady(pod: Pod): Boolean =
Option(pod.getStatus) Option(pod.getStatus)
@@ -164,3 +177,48 @@ class HealthMonitor:
catch catch
case ex: Exception => case ex: Exception =>
log.warnf(ex, "Failed to delete pod for instance %s", instanceId) log.warnf(ex, "Failed to delete pod for instance %s", instanceId)
private def validateStartupInstances(timeoutMs: Long): Unit =
Thread.sleep(timeoutMs)
instanceRegistry.getAllInstances.foreach { inst =>
if !grpcServer.hasActiveStream(inst.instanceId) then
log.warnf(
"Startup: instance %s did not reconnect within %dms — evicting",
inst.instanceId,
timeoutMs,
)
instanceRegistry.removeInstance(inst.instanceId)
deleteK8sPod(inst.instanceId)
}
private def handlePodTerminating(pod: Pod): Unit =
findRegisteredInstance(pod).foreach { inst =>
if inst.state == "HEALTHY" then
meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment()
log.warnf(
"Pod %s terminating — marking instance %s dead",
pod.getMetadata.getName,
inst.instanceId,
)
instanceRegistry.markInstanceDead(inst.instanceId)
}
private def handlePodGone(pod: Pod): Unit =
findRegisteredInstance(pod).foreach { inst =>
log.warnf(
"Pod %s deleted — triggering failover for %s",
pod.getMetadata.getName,
inst.instanceId,
)
failoverService
.onInstanceStreamDropped(inst.instanceId)
.subscribe()
.`with`(
_ => (),
ex => log.warnf(ex, "Failover for %s failed", inst.instanceId),
)
}
private def findRegisteredInstance(pod: Pod): Option[InstanceMetadata] =
val podName = pod.getMetadata.getName
instanceRegistry.getAllInstances.find(inst => podName.contains(inst.instanceId))
@@ -4,6 +4,7 @@ import jakarta.annotation.PostConstruct
import jakarta.enterprise.context.ApplicationScoped import jakarta.enterprise.context.ApplicationScoped
import jakarta.inject.Inject import jakarta.inject.Inject
import io.quarkus.redis.datasource.ReactiveRedisDataSource import io.quarkus.redis.datasource.ReactiveRedisDataSource
import io.quarkus.redis.datasource.RedisDataSource
import scala.jdk.CollectionConverters.* import scala.jdk.CollectionConverters.*
import scala.compiletime.uninitialized import scala.compiletime.uninitialized
import com.fasterxml.jackson.databind.ObjectMapper import com.fasterxml.jackson.databind.ObjectMapper
@@ -19,7 +20,10 @@ class InstanceRegistry:
// scalafix:off DisableSyntax.var // scalafix:off DisableSyntax.var
@Inject @Inject
private var redis: ReactiveRedisDataSource = uninitialized private var redis: ReactiveRedisDataSource = uninitialized
private var redisPrefix = "nowchess"
@Inject
private var syncRedis: RedisDataSource = uninitialized
private var redisPrefix = "nowchess"
@Inject @Inject
private var meterRegistry: MeterRegistry = uninitialized private var meterRegistry: MeterRegistry = uninitialized
@@ -42,6 +46,21 @@ class InstanceRegistry:
def setRedisPrefix(prefix: String): Unit = def setRedisPrefix(prefix: String): Unit =
redisPrefix = prefix redisPrefix = prefix
def loadAllFromRedis(): Unit =
val keys = syncRedis.key(classOf[String]).keys(s"$redisPrefix:instances:*")
keys.asScala.foreach { key =>
val instanceId = key.stripPrefix(s"$redisPrefix:instances:")
val json = syncRedis.value(classOf[String]).get(key)
if json != null then
try
val metadata = mapper.readValue(json, classOf[InstanceMetadata])
instances.put(instanceId, metadata)
log.infof("Startup: loaded instance %s from Redis", instanceId)
catch
case ex: Exception =>
log.warnf(ex, "Startup: failed to parse instance %s", instanceId)
}
def getInstance(instanceId: String): Option[InstanceMetadata] = def getInstance(instanceId: String): Option[InstanceMetadata] =
Option(instances.get(instanceId)) Option(instances.get(instanceId))