Compare commits

...

4 Commits

Author SHA1 Message Date
LQ63 3bda9466b5 docs(tournament): spec
Build & Test (NowChessSystems) TeamCity build failed
Added tournament spec to docs
2026-05-13 13:32:52 +02:00
Janis 81b045d01b feat: add coordinator startup validation and K8s pod watch
Build & Test (NowChessSystems) TeamCity build failed
On startup, load all known instances from Redis and wait 15s for them to
reconnect via gRPC. Evict instances that don't reconnect within the timeout
and delete their K8s pods.

Replace one-shot pod status check with real fabric8 Watch. On pod Terminating
event, mark instance dead. On pod Deleted event, trigger failover. Failover
now waits reactively for at least one healthy instance before distributing
orphaned games, up to 30s timeout.

- Add startupValidationTimeout and failoverWaitTimeout config (15s, 30s)
- CoordinatorGrpcServer tracks active gRPC streams
- InstanceRegistry.loadAllFromRedis() scans and loads instances on startup
- HealthMonitor startup observer validates instances and starts K8s watch
- FailoverService.onInstanceStreamDropped returns Uni[Unit] for reactive wait
- All failover service callers updated to subscribe to Uni result

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-05-13 09:55:38 +02:00
TeamCity 118acff0e5 ci: bump version with Build-80 2026-05-13 07:21:09 +00:00
Janis a49f9be146 feat: add CORS configuration and reorder JWT settings in application.yml
Build & Test (NowChessSystems) TeamCity build finished
2026-05-13 09:03:21 +02:00
12 changed files with 1200 additions and 63 deletions
+313
View File
@@ -0,0 +1,313 @@
# NowChess Tournament API
Swiss-system bot tournaments. Bots are paired by score each round; all bots play every round (no eliminations). Game moves flow through the existing board and bot endpoints — the tournament module only orchestrates pairings, standings, and lifecycle.
---
## Base path
```
/api/tournament
```
Routing: `/api/tournament``nowchess-tournament-active:8086`
---
## Authentication
All endpoints require a valid JWT (`Authorization: Bearer <token>`).
Bot-facing streaming endpoints additionally require the token's subject to match the registered `botId`.
---
## Data models
### Tournament
```json
{
"id": "t7kXq2",
"name": "Friday Night Bots",
"status": "created | started | finished",
"rounds": 5,
"currentRound": 2,
"timeControl": {
"limitSeconds": 300,
"incrementSeconds": 3
},
"createdBy": "userId",
"createdAt": "2026-05-13T18:00:00Z",
"startedAt": "2026-05-13T18:05:00Z",
"finishedAt": null
}
```
### Standing
```json
{
"rank": 1,
"botId": "bot_abc",
"botName": "StockfishClone",
"points": 3.5,
"wins": 3,
"draws": 1,
"losses": 0,
"buchholz": 9.0
}
```
Tiebreaker: Buchholz score (sum of opponents' points).
### Pairing
```json
{
"round": 2,
"whiteBot": "bot_abc",
"blackBot": "bot_xyz",
"gameId": "j0nPtcjl",
"result": "white | black | draw | ongoing"
}
```
### TournamentEvent (SSE)
```json
{ "type": "tournamentStarted", "tournamentId": "t7kXq2" }
{ "type": "roundStarted", "tournamentId": "t7kXq2", "round": 2 }
{ "type": "pairingReady", "tournamentId": "t7kXq2", "round": 2, "gameId": "j0nPtcjl", "color": "white" }
{ "type": "roundFinished", "tournamentId": "t7kXq2", "round": 2 }
{ "type": "tournamentFinished","tournamentId": "t7kXq2" }
```
---
## Endpoints
### Tournament lifecycle
#### Create tournament
```
POST /api/tournament
```
Body:
```json
{
"name": "Friday Night Bots",
"rounds": 5,
"timeControl": {
"limitSeconds": 300,
"incrementSeconds": 3
}
}
```
Response `201 Created`:
```json
{ "id": "t7kXq2" }
```
The creator becomes the tournament director. Only the director can start and delete the tournament.
---
#### Get tournament
```
GET /api/tournament/{tournamentId}
```
Response `200 OK`: `Tournament` object.
---
#### List tournaments
```
GET /api/tournament
```
Query params:
| Param | Type | Default |
|----------|---------------------------------|-----------|
| `status` | `created\|started\|finished` | (all) |
| `limit` | integer (max 50) | 20 |
| `offset` | integer | 0 |
Response `200 OK`:
```json
{
"tournaments": [ /* Tournament[] */ ],
"total": 42
}
```
---
#### Start tournament
```
POST /api/tournament/{tournamentId}/start
```
Requires at least 2 registered bots. Computes round 1 pairings (random for round 1; score-based from round 2). Creates one game per pairing via `POST /api/board/game`.
Response `200 OK`: updated `Tournament` object.
---
#### Delete tournament
```
DELETE /api/tournament/{tournamentId}
```
Only allowed while `status == "created"`. Response `204 No Content`.
---
### Bot registration
#### Register bot
```
POST /api/tournament/{tournamentId}/bots
```
Registers a bot for the tournament. Must be called before the tournament starts.
The token subject must match the bot being registered.
Body:
```json
{ "botId": "bot_abc" }
```
Response `200 OK`:
```json
{ "botId": "bot_abc", "tournamentId": "t7kXq2" }
```
---
#### Unregister bot
```
DELETE /api/tournament/{tournamentId}/bots/{botId}
```
Only allowed while `status == "created"`. Response `204 No Content`.
---
#### List registered bots
```
GET /api/tournament/{tournamentId}/bots
```
Response `200 OK`:
```json
{
"bots": [
{ "botId": "bot_abc", "botName": "StockfishClone" }
]
}
```
---
### Standings and pairings
#### Get standings
```
GET /api/tournament/{tournamentId}/standings
```
Response `200 OK`:
```json
{ "standings": [ /* Standing[] */ ] }
```
---
#### Get pairings for a round
```
GET /api/tournament/{tournamentId}/rounds/{round}/pairings
```
Response `200 OK`:
```json
{ "pairings": [ /* Pairing[] */ ] }
```
---
### Bot streaming
#### Stream tournament events
```
GET /api/tournament/{tournamentId}/stream
```
Headers: `Accept: text/event-stream`
Server-Sent Events stream scoped to this tournament. The bot receives `pairingReady` events when it is assigned a game, at which point it should connect to the existing bot game stream:
```
GET /bot/stream/game/{gameId} (existing endpoint)
POST /bot/game/{gameId}/move/{uci} (existing endpoint)
```
The tournament module never sends moves — bots do that themselves through the existing bot endpoints.
---
## Typical bot flow
```
1. POST /api/tournament # director creates tournament
2. POST /api/tournament/{id}/bots # each bot registers
3. POST /api/tournament/{id}/start # director starts
4. GET /api/tournament/{id}/stream (SSE) # each bot opens stream
-- per round --
5. receive: pairingReady { gameId, color }
6. GET /bot/stream/game/{gameId} # existing endpoint
7. POST /bot/game/{gameId}/move/{uci} # existing endpoint, repeated
-- game ends --
8. receive: roundFinished
9. GET /api/tournament/{id}/standings # optional, inspect scores
-- repeat 59 for each round --
10. receive: tournamentFinished
11. GET /api/tournament/{id}/standings # final ranking
```
---
## Error responses
| Status | Meaning |
|--------|------------------------------------------------------|
| 400 | Invalid request body or parameters |
| 401 | Missing or invalid JWT |
| 403 | Action not allowed (wrong director, wrong bot, etc.) |
| 404 | Tournament or bot not found |
| 409 | Tournament already started / bot already registered |
+623
View File
@@ -0,0 +1,623 @@
openapi: 3.0.3
info:
title: NowChess Tournament API
description: |
Swiss-system bot tournaments, modelled after the Lichess API style.
Game moves flow through the existing board and bot endpoints — this module
handles pairings, standings, and lifecycle only.
## Streaming
Endpoints marked **NDJSON** return newline-delimited JSON objects
(`application/x-ndjson`). Each line is one complete JSON object. The
connection stays open until the tournament or round ends.
## Bot flow
```
POST /api/tournament # create
POST /api/tournament/{id}/join # each bot joins
POST /api/tournament/{id}/start # director starts
GET /api/tournament/{id}/stream (NDJSON) # open before start
-- per round --
receive gameStart { gameId, color }
GET /bot/stream/game/{gameId} (existing, NDJSON)
POST /bot/game/{gameId}/move/{uci} (existing)
-- repeat --
GET /api/tournament/{id}/results (NDJSON) # final standings
```
version: 1.0.0
servers:
- url: https://st.nowchess.janis-eccarius.de
description: Staging
- url: https://nowchess.janis-eccarius.de
description: Production
- url: http://localhost:8086
description: Local
security:
- bearerAuth: []
tags:
- name: Tournament
description: Tournament lifecycle
- name: Participation
description: Join and withdraw
- name: Results
description: Standings, pairings, and game export
- name: Stream
description: NDJSON event streams
paths:
/api/tournament:
get:
tags: [Tournament]
summary: Get current tournaments
description: Returns tournaments grouped by status. No auth required.
security: []
responses:
"200":
description: Tournaments by status
content:
application/json:
schema:
type: object
properties:
created:
type: array
items:
$ref: "#/components/schemas/TournamentInfo"
started:
type: array
items:
$ref: "#/components/schemas/TournamentInfo"
finished:
type: array
items:
$ref: "#/components/schemas/TournamentInfo"
post:
tags: [Tournament]
summary: Create a new tournament
description: The authenticated user becomes the tournament director.
requestBody:
required: true
content:
application/x-www-form-urlencoded:
schema:
$ref: "#/components/schemas/CreateTournamentForm"
responses:
"201":
description: Tournament created
content:
application/json:
schema:
$ref: "#/components/schemas/Tournament"
"400":
$ref: "#/components/responses/BadRequest"
"401":
$ref: "#/components/responses/Unauthorized"
/api/tournament/{id}:
parameters:
- $ref: "#/components/parameters/id"
get:
tags: [Tournament]
summary: Get a tournament
description: Includes the first page of standings in the `standing` field.
security: []
responses:
"200":
description: Tournament with embedded standings
content:
application/json:
schema:
$ref: "#/components/schemas/Tournament"
"404":
$ref: "#/components/responses/NotFound"
delete:
tags: [Tournament]
summary: Terminate a tournament
description: Only the director may terminate. Only allowed while status is `created`.
responses:
"204":
description: Terminated
"401":
$ref: "#/components/responses/Unauthorized"
"403":
$ref: "#/components/responses/Forbidden"
"404":
$ref: "#/components/responses/NotFound"
"409":
$ref: "#/components/responses/Conflict"
/api/tournament/{id}/start:
parameters:
- $ref: "#/components/parameters/id"
post:
tags: [Tournament]
summary: Start the tournament
description: |
Only the director may start. Requires at least 2 joined bots.
Computes round 1 pairings and creates games via `POST /api/board/game`.
responses:
"200":
description: Tournament started
content:
application/json:
schema:
$ref: "#/components/schemas/Tournament"
"401":
$ref: "#/components/responses/Unauthorized"
"403":
$ref: "#/components/responses/Forbidden"
"404":
$ref: "#/components/responses/NotFound"
"409":
$ref: "#/components/responses/Conflict"
/api/tournament/{id}/join:
parameters:
- $ref: "#/components/parameters/id"
post:
tags: [Participation]
summary: Join a tournament
description: |
Register the authenticated bot for the tournament. Only allowed while
status is `created`. The token subject must be a bot account.
responses:
"200":
description: Ok
content:
application/json:
schema:
$ref: "#/components/schemas/Ok"
"401":
$ref: "#/components/responses/Unauthorized"
"403":
$ref: "#/components/responses/Forbidden"
"404":
$ref: "#/components/responses/NotFound"
"409":
$ref: "#/components/responses/Conflict"
/api/tournament/{id}/withdraw:
parameters:
- $ref: "#/components/parameters/id"
post:
tags: [Participation]
summary: Withdraw from a tournament
description: Only allowed while status is `created`.
responses:
"200":
description: Ok
content:
application/json:
schema:
$ref: "#/components/schemas/Ok"
"401":
$ref: "#/components/responses/Unauthorized"
"403":
$ref: "#/components/responses/Forbidden"
"404":
$ref: "#/components/responses/NotFound"
"409":
$ref: "#/components/responses/Conflict"
/api/tournament/{id}/results:
parameters:
- $ref: "#/components/parameters/id"
get:
tags: [Results]
summary: Get results as NDJSON stream
description: |
Streams one `Result` object per line, sorted by rank ascending.
Available at any point during or after the tournament.
security: []
parameters:
- name: nb
in: query
description: Max number of results to stream (default all)
schema:
type: integer
minimum: 1
responses:
"200":
description: NDJSON stream of results
content:
application/x-ndjson:
schema:
$ref: "#/components/schemas/Result"
"404":
$ref: "#/components/responses/NotFound"
/api/tournament/{id}/round/{round}:
parameters:
- $ref: "#/components/parameters/id"
- name: round
in: path
required: true
schema:
type: integer
minimum: 1
get:
tags: [Results]
summary: Get pairings for a round
security: []
responses:
"200":
description: Pairings for the specified round
content:
application/json:
schema:
type: object
properties:
round:
type: integer
example: 2
pairings:
type: array
items:
$ref: "#/components/schemas/Pairing"
"404":
$ref: "#/components/responses/NotFound"
/api/tournament/{id}/export/games:
parameters:
- $ref: "#/components/parameters/id"
get:
tags: [Results]
summary: Export all games
description: |
Returns all games of the tournament. Accepts both PGN and NDJSON via
the `Accept` header.
security: []
parameters:
- name: Accept
in: header
schema:
type: string
enum:
- application/x-chess-pgn
- application/x-ndjson
default: application/x-chess-pgn
responses:
"200":
description: Games in the requested format
content:
application/x-chess-pgn:
schema:
type: string
description: Standard PGN, one game per block
application/x-ndjson:
schema:
$ref: "#/components/schemas/GameExport"
"404":
$ref: "#/components/responses/NotFound"
/api/tournament/{id}/stream:
parameters:
- $ref: "#/components/parameters/id"
get:
tags: [Stream]
summary: Stream tournament events
description: |
NDJSON stream scoped to one tournament. Keep this connection open for
the full tournament lifetime.
On `gameStart` the bot connects to the existing bot endpoints:
- `GET /bot/stream/game/{gameId}` — stream game state (existing)
- `POST /bot/game/{gameId}/move/{uci}` — submit moves (existing)
responses:
"200":
description: NDJSON event stream
content:
application/x-ndjson:
schema:
$ref: "#/components/schemas/TournamentEvent"
"401":
$ref: "#/components/responses/Unauthorized"
"404":
$ref: "#/components/responses/NotFound"
components:
securitySchemes:
bearerAuth:
type: http
scheme: bearer
bearerFormat: JWT
parameters:
id:
name: id
in: path
required: true
schema:
type: string
example: t7kXq2
schemas:
Clock:
type: object
required: [limit, increment]
properties:
limit:
type: integer
description: Base time in seconds
example: 300
increment:
type: integer
description: Increment per move in seconds
example: 3
Variant:
type: object
properties:
key:
type: string
example: standard
name:
type: string
example: Standard
BotRef:
type: object
properties:
id:
type: string
example: bot_abc
name:
type: string
example: StockfishClone
Standing:
type: object
properties:
page:
type: integer
example: 1
players:
type: array
items:
$ref: "#/components/schemas/Result"
TournamentInfo:
description: Lightweight tournament summary used in list responses.
type: object
properties:
id:
type: string
example: t7kXq2
fullName:
type: string
example: Friday Night Bots Swiss
clock:
$ref: "#/components/schemas/Clock"
variant:
$ref: "#/components/schemas/Variant"
rated:
type: boolean
example: true
nbPlayers:
type: integer
example: 8
nbRounds:
type: integer
example: 5
createdBy:
type: string
example: userId
startsAt:
type: string
format: date-time
Tournament:
allOf:
- $ref: "#/components/schemas/TournamentInfo"
- type: object
properties:
status:
type: string
enum: [created, started, finished]
example: started
round:
type: integer
description: Current round number
example: 2
standing:
$ref: "#/components/schemas/Standing"
winner:
description: Present only when status is `finished`
allOf:
- $ref: "#/components/schemas/BotRef"
nullable: true
CreateTournamentForm:
type: object
required: [name, nbRounds, clockLimit, clockIncrement]
properties:
name:
type: string
example: Friday Night Bots
nbRounds:
type: integer
minimum: 1
example: 5
clockLimit:
type: integer
description: Base time in seconds
example: 300
clockIncrement:
type: integer
description: Increment per move in seconds
example: 3
rated:
type: boolean
default: true
Result:
type: object
properties:
rank:
type: integer
example: 1
points:
type: number
format: double
example: 3.5
tieBreak:
type: number
format: double
description: Buchholz score (sum of opponents' points)
example: 9.0
bot:
$ref: "#/components/schemas/BotRef"
nbGames:
type: integer
example: 4
wins:
type: integer
example: 3
draws:
type: integer
example: 1
losses:
type: integer
example: 0
Pairing:
type: object
properties:
round:
type: integer
example: 2
white:
$ref: "#/components/schemas/BotRef"
black:
$ref: "#/components/schemas/BotRef"
gameId:
type: string
example: j0nPtcjl
winner:
type: string
enum: [white, black, draw]
nullable: true
description: Null while the game is ongoing
GameExport:
description: One game object per NDJSON line.
type: object
properties:
id:
type: string
example: j0nPtcjl
round:
type: integer
example: 2
white:
$ref: "#/components/schemas/BotRef"
black:
$ref: "#/components/schemas/BotRef"
winner:
type: string
enum: [white, black, draw]
nullable: true
moves:
type: string
description: Space-separated UCI moves
example: e2e4 e7e5 g1f3
TournamentEvent:
description: |
One JSON object per NDJSON line. Discriminate on `type`.
| type | extra fields |
|------|-------------|
| `tournamentStarted` | — |
| `roundStarted` | `round` |
| `gameStart` | `round`, `gameId`, `color` |
| `roundFinished` | `round` |
| `tournamentFinished` | `winner` |
type: object
required: [type]
properties:
type:
type: string
enum:
- tournamentStarted
- roundStarted
- gameStart
- roundFinished
- tournamentFinished
round:
type: integer
example: 2
gameId:
type: string
example: j0nPtcjl
color:
type: string
enum: [white, black]
winner:
$ref: "#/components/schemas/BotRef"
Ok:
type: object
properties:
ok:
type: boolean
example: true
Error:
type: object
properties:
error:
type: string
example: tournament already started
responses:
BadRequest:
description: Invalid request body or parameters
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
Unauthorized:
description: Missing or invalid JWT
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
Forbidden:
description: Action not permitted for this user or bot
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
NotFound:
description: Tournament not found
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
Conflict:
description: Conflicting state (e.g. already started, bot already joined)
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
@@ -45,6 +45,8 @@ nowchess:
k8s-namespace: default
k8s-rollout-name: nowchess-core
k8s-rollout-label-selector: "app=nowchess-core"
startup-validation-timeout: 15s
failover-wait-timeout: 30s
---
# dev profile
@@ -56,3 +56,9 @@ trait CoordinatorConfig:
@WithName("k8s-rollout-label-selector")
def k8sRolloutLabelSelector: String
@WithName("startup-validation-timeout")
def startupValidationTimeout: Duration
@WithName("failover-wait-timeout")
def failoverWaitTimeout: Duration
@@ -9,6 +9,7 @@ import de.nowchess.coordinator.proto.{CoordinatorServiceGrpc, *}
import io.grpc.stub.StreamObserver
import com.fasterxml.jackson.databind.ObjectMapper
import org.jboss.logging.Logger
import java.util.concurrent.ConcurrentHashMap
@GrpcService
@Singleton
@@ -21,8 +22,9 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
private var failoverService: FailoverService = uninitialized
// scalafix:on DisableSyntax.var
private val mapper = ObjectMapper()
private val log = Logger.getLogger(classOf[CoordinatorGrpcServer])
private val mapper = ObjectMapper()
private val log = Logger.getLogger(classOf[CoordinatorGrpcServer])
private val activeStreams = ConcurrentHashMap.newKeySet[String]()
override def heartbeatStream(
responseObserver: StreamObserver[CoordinatorCommand],
@@ -38,6 +40,7 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
lastInstanceId = frame.getInstanceId
if !firstFrameSeen then
firstFrameSeen = true
activeStreams.add(frame.getInstanceId)
log.infof(
"First heartbeat from instance %s (host=%s http=%d grpc=%d)",
frame.getInstanceId,
@@ -60,10 +63,19 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
override def onError(t: Throwable): Unit =
log.warnf(t, "Heartbeat stream error for instance %s", lastInstanceId)
if lastInstanceId.nonEmpty then failoverService.onInstanceStreamDropped(lastInstanceId)
if lastInstanceId.nonEmpty then
activeStreams.remove(lastInstanceId)
failoverService
.onInstanceStreamDropped(lastInstanceId)
.subscribe()
.`with`(
_ => (),
ex => log.warnf(ex, "Failover for %s failed", lastInstanceId),
)
override def onCompleted: Unit =
log.infof("Heartbeat stream completed for instance %s", lastInstanceId)
activeStreams.remove(lastInstanceId)
override def batchResubscribeGames(
request: BatchResubscribeRequest,
@@ -108,7 +120,18 @@ class CoordinatorGrpcServer extends CoordinatorServiceGrpc.CoordinatorServiceImp
val instanceId = request.getInstanceId
log.infof("Drain request for instance %s", instanceId)
val gamesBefore = instanceRegistry.getInstance(instanceId).map(_.subscriptionCount).getOrElse(0)
failoverService.onInstanceStreamDropped(instanceId)
val response = DrainInstanceResponse.newBuilder().setGamesMigrated(gamesBefore).build()
responseObserver.onNext(response)
responseObserver.onCompleted()
failoverService
.onInstanceStreamDropped(instanceId)
.subscribe()
.`with`(
_ =>
val response = DrainInstanceResponse.newBuilder().setGamesMigrated(gamesBefore).build()
responseObserver.onNext(response)
responseObserver.onCompleted(),
ex =>
log.warnf(ex, "Drain failed for %s", instanceId)
responseObserver.onError(ex),
)
def hasActiveStream(instanceId: String): Boolean =
activeStreams.contains(instanceId)
@@ -70,7 +70,13 @@ class CoordinatorResource:
@Produces(Array(MediaType.APPLICATION_JSON))
def triggerFailover(@PathParam("instanceId") instanceId: String): scala.collection.Map[String, String] =
log.infof("Manual failover triggered for instance %s", instanceId)
failoverService.onInstanceStreamDropped(instanceId)
failoverService
.onInstanceStreamDropped(instanceId)
.subscribe()
.`with`(
_ => (),
ex => log.warnf(ex, "Manual failover for %s failed", instanceId),
)
Map("status" -> "failover_started", "instanceId" -> instanceId)
@POST
@@ -8,6 +8,9 @@ import scala.compiletime.uninitialized
import org.jboss.logging.Logger
import de.nowchess.coordinator.dto.InstanceMetadata
import de.nowchess.coordinator.grpc.CoreGrpcClient
import de.nowchess.coordinator.config.CoordinatorConfig
import io.smallrye.mutiny.Uni
import java.time.Duration
@ApplicationScoped
class FailoverService:
@@ -21,6 +24,9 @@ class FailoverService:
@Inject
private var coreGrpcClient: CoreGrpcClient = uninitialized
@Inject
private var config: CoordinatorConfig = uninitialized
private val log = Logger.getLogger(classOf[FailoverService])
private var redisPrefix = "nowchess"
// scalafix:on DisableSyntax.var
@@ -28,7 +34,7 @@ class FailoverService:
def setRedisPrefix(prefix: String): Unit =
redisPrefix = prefix
def onInstanceStreamDropped(instanceId: String): Unit =
def onInstanceStreamDropped(instanceId: String): Uni[Unit] =
log.infof("Instance %s stream dropped, triggering failover", instanceId)
val startTime = System.currentTimeMillis()
@@ -37,19 +43,32 @@ class FailoverService:
val gameIds = getOrphanedGames(instanceId)
log.infof("Found %d orphaned games for instance %s", gameIds.size, instanceId)
if gameIds.nonEmpty then
val healthyInstances = instanceRegistry.getAllInstances
.filter(_.state == "HEALTHY")
.sortBy(_.subscriptionCount)
if gameIds.isEmpty then
cleanupDeadInstance(instanceId)
Uni.createFrom().item(())
else
waitForHealthyInstanceAsync()
.onItem()
.transform { _ =>
val healthyInstances = instanceRegistry.getAllInstances
.filter(_.state == "HEALTHY")
.sortBy(_.subscriptionCount)
distributeGames(gameIds, healthyInstances, instanceId)
if healthyInstances.nonEmpty then
distributeGames(gameIds, healthyInstances, instanceId)
val elapsed = System.currentTimeMillis() - startTime
log.infof("Failover completed in %dms for instance %s", elapsed, instanceId)
else log.warnf("No healthy instances available for failover of %s", instanceId)
cleanupDeadInstance(instanceId)
val elapsed = System.currentTimeMillis() - startTime
log.infof("Failover completed in %dms for instance %s", elapsed, instanceId)
cleanupDeadInstance(instanceId)
()
}
.onFailure()
.recoverWithItem { _ =>
log.errorf(
"No healthy instance appeared within %s — games orphaned for %s",
config.failoverWaitTimeout,
instanceId,
)
()
}
private def getOrphanedGames(instanceId: String): List[String] =
val setKey = s"$redisPrefix:instance:$instanceId:games"
@@ -101,3 +120,16 @@ class FailoverService:
val setKey = s"$redisPrefix:instance:$instanceId:games"
redis.key(classOf[String]).del(setKey)
log.infof("Cleaned up games set for instance %s", instanceId)
private def waitForHealthyInstanceAsync(): Uni[InstanceMetadata] =
Uni.createFrom().deferred(() =>
instanceRegistry.getAllInstances
.filter(_.state == "HEALTHY")
.sortBy(_.subscriptionCount)
.headOption match
case Some(inst) => Uni.createFrom().item(inst)
case None => Uni.createFrom().failure(new RuntimeException("no healthy instance"))
).onFailure()
.retry()
.withBackOff(Duration.ofMillis(500))
.expireIn(config.failoverWaitTimeout.toMillis)
@@ -2,17 +2,23 @@ package de.nowchess.coordinator.service
import jakarta.annotation.PostConstruct
import jakarta.enterprise.context.ApplicationScoped
import jakarta.enterprise.event.Observes
import jakarta.enterprise.inject.Instance
import jakarta.inject.Inject
import de.nowchess.coordinator.config.CoordinatorConfig
import io.fabric8.kubernetes.client.KubernetesClient
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.Watcher
import io.fabric8.kubernetes.client.WatcherException
import io.micrometer.core.instrument.MeterRegistry
import io.quarkus.redis.datasource.RedisDataSource
import io.quarkus.runtime.StartupEvent
import scala.jdk.CollectionConverters.*
import org.jboss.logging.Logger
import scala.compiletime.uninitialized
import java.time.Instant
import de.nowchess.coordinator.grpc.CoordinatorGrpcServer
import de.nowchess.coordinator.dto.InstanceMetadata
@ApplicationScoped
class HealthMonitor:
@@ -32,6 +38,12 @@ class HealthMonitor:
@Inject
private var meterRegistry: MeterRegistry = uninitialized
@Inject
private var grpcServer: CoordinatorGrpcServer = uninitialized
@Inject
private var failoverService: FailoverService = uninitialized
private val log = Logger.getLogger(classOf[HealthMonitor])
private var redisPrefix = "nowchess"
// scalafix:on DisableSyntax.var
@@ -48,6 +60,15 @@ class HealthMonitor:
meterRegistry.counter("nowchess.coordinator.health.checks").increment(0)
meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment(0)
def onStartup(@Observes ev: StartupEvent): Unit =
instanceRegistry.loadAllFromRedis()
val loaded = instanceRegistry.getAllInstances
log.infof("Startup: loaded %d instances from Redis", loaded.size)
if loaded.nonEmpty then
val timeoutMs = config.startupValidationTimeout.toMillis
Thread.ofVirtual().start(() => validateStartupInstances(timeoutMs))
startPodWatch()
def checkInstanceHealth: Unit =
meterRegistry.counter("nowchess.coordinator.health.checks").increment()
val evicted = instanceRegistry.evictStaleInstances(config.instanceDeadTimeout)
@@ -98,41 +119,33 @@ class HealthMonitor:
true
}
def watchK8sPods: Unit =
private def startPodWatch(): Unit =
kubeClientOpt match
case None =>
log.debug("Kubernetes client not available for pod watch")
case None => log.debug("K8s client unavailable, skipping pod watch")
case Some(kube) =>
try
val pods = kube
kube
.pods()
.inNamespace(config.k8sNamespace)
.withLabel(config.k8sRolloutLabelSelector)
.list()
.getItems
.asScala
.watch(new Watcher[Pod]:
override def eventReceived(action: Watcher.Action, pod: Pod): Unit =
action match
case Watcher.Action.DELETED =>
handlePodGone(pod)
case Watcher.Action.MODIFIED
if Option(pod.getMetadata.getDeletionTimestamp).isDefined =>
handlePodTerminating(pod)
case _ => ()
val instances = instanceRegistry.getAllInstances
instances.foreach { inst =>
val matchingPod = pods.find { pod =>
pod.getMetadata.getName.contains(inst.instanceId)
}
matchingPod match
case Some(pod) =>
val isReady = isPodReady(pod)
if !isReady && inst.state == "HEALTHY" then
meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment()
log.warnf("Pod %s not ready, marking instance %s dead", pod.getMetadata.getName, inst.instanceId)
instanceRegistry.markInstanceDead(inst.instanceId)
deleteK8sPod(inst.instanceId)
case None =>
log.warnf("No pod found for instance %s, evicting from registry", inst.instanceId)
instanceRegistry.removeInstance(inst.instanceId)
}
override def onClose(cause: WatcherException): Unit =
if cause != null then
log.warnf(cause, "Pod watch closed, restarting")
startPodWatch()
)
log.info("Pod watch started")
catch
case ex: Exception =>
log.warnf(ex, "Failed to watch k8s pods")
case ex: Exception => log.warnf(ex, "Failed to start pod watch")
private def isPodReady(pod: Pod): Boolean =
Option(pod.getStatus)
@@ -164,3 +177,48 @@ class HealthMonitor:
catch
case ex: Exception =>
log.warnf(ex, "Failed to delete pod for instance %s", instanceId)
private def validateStartupInstances(timeoutMs: Long): Unit =
Thread.sleep(timeoutMs)
instanceRegistry.getAllInstances.foreach { inst =>
if !grpcServer.hasActiveStream(inst.instanceId) then
log.warnf(
"Startup: instance %s did not reconnect within %dms — evicting",
inst.instanceId,
timeoutMs,
)
instanceRegistry.removeInstance(inst.instanceId)
deleteK8sPod(inst.instanceId)
}
private def handlePodTerminating(pod: Pod): Unit =
findRegisteredInstance(pod).foreach { inst =>
if inst.state == "HEALTHY" then
meterRegistry.counter("nowchess.coordinator.pods.unhealthy").increment()
log.warnf(
"Pod %s terminating — marking instance %s dead",
pod.getMetadata.getName,
inst.instanceId,
)
instanceRegistry.markInstanceDead(inst.instanceId)
}
private def handlePodGone(pod: Pod): Unit =
findRegisteredInstance(pod).foreach { inst =>
log.warnf(
"Pod %s deleted — triggering failover for %s",
pod.getMetadata.getName,
inst.instanceId,
)
failoverService
.onInstanceStreamDropped(inst.instanceId)
.subscribe()
.`with`(
_ => (),
ex => log.warnf(ex, "Failover for %s failed", inst.instanceId),
)
}
private def findRegisteredInstance(pod: Pod): Option[InstanceMetadata] =
val podName = pod.getMetadata.getName
instanceRegistry.getAllInstances.find(inst => podName.contains(inst.instanceId))
@@ -4,6 +4,7 @@ import jakarta.annotation.PostConstruct
import jakarta.enterprise.context.ApplicationScoped
import jakarta.inject.Inject
import io.quarkus.redis.datasource.ReactiveRedisDataSource
import io.quarkus.redis.datasource.RedisDataSource
import scala.jdk.CollectionConverters.*
import scala.compiletime.uninitialized
import com.fasterxml.jackson.databind.ObjectMapper
@@ -19,7 +20,10 @@ class InstanceRegistry:
// scalafix:off DisableSyntax.var
@Inject
private var redis: ReactiveRedisDataSource = uninitialized
private var redisPrefix = "nowchess"
@Inject
private var syncRedis: RedisDataSource = uninitialized
private var redisPrefix = "nowchess"
@Inject
private var meterRegistry: MeterRegistry = uninitialized
@@ -42,6 +46,21 @@ class InstanceRegistry:
def setRedisPrefix(prefix: String): Unit =
redisPrefix = prefix
def loadAllFromRedis(): Unit =
val keys = syncRedis.key(classOf[String]).keys(s"$redisPrefix:instances:*")
keys.asScala.foreach { key =>
val instanceId = key.stripPrefix(s"$redisPrefix:instances:")
val json = syncRedis.value(classOf[String]).get(key)
if json != null then
try
val metadata = mapper.readValue(json, classOf[InstanceMetadata])
instances.put(instanceId, metadata)
log.infof("Startup: loaded instance %s from Redis", instanceId)
catch
case ex: Exception =>
log.warnf(ex, "Startup: failed to parse instance %s", instanceId)
}
def getInstance(instanceId: String): Option[InstanceMetadata] =
Option(instances.get(instanceId))
+57
View File
@@ -1138,3 +1138,60 @@
* Revert "feat: add authentication permissions for metrics endpoints in application.yml" ([a298417](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/a298417b9e4d68dc73bbf40be63d9484536e9f83))
* Revert "refactor: update metrics paths formatting in application.yml for clarity" ([3870566](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/38705663498d5f47c40dafe2f26198589ede8656))
## (2026-05-13)
### Features
* add authentication permissions for metrics endpoints in application.yml ([04edd4d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/04edd4d6fd8a63196c36f6d67992832febc9bebb))
* add CORS configuration and reorder JWT settings in application.yml ([a49f9be](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/a49f9be146f04c14561c305d980846a92f8c12b2))
* add GameRules stub with PositionStatus enum ([76d4168](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/76d4168038de23e5d6083d4e8f0504fbf31d15a3))
* add initialization metrics for various services ([d438e97](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d438e97f32bdde0bfc63c1b4a8cc810cdd093166))
* add MovedInCheck/Checkmate/Stalemate MoveResult variants (stub dispatch) ([8b7ec57](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/8b7ec57e5ea6ee1615a1883848a426dc07d26364))
* add OpenTelemetry trace configuration with parentbased sampler ([3904d5a](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3904d5ad8ad4930ddee65287a7bfab785a6148f5))
* **config:** update application.yml for PostgreSQL and remove staging/production configurations ([2404e61](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/2404e6164c3b50ffccbea5238d636060d6abe4d6))
* **config:** update application.yml for staging and production environments ([6113432](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/6113432a14c476a3a0dfc0d449e17d023697f2ba))
* configure logging and add OpenTelemetry support ([#49](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/49)) ([d57c488](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/d57c4886612d1d92da0e1b79209fc83e6ef537a1))
* implement GameRules with isInCheck, legalMoves, gameStatus ([94a02ff](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/94a02ff6849436d9496c70a0f16c21666dae8e4e))
* implement legal castling ([#1](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/1)) ([00d326c](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/00d326c1ba67711fbe180f04e1100c3f01dd0254))
* **logging:** add DEBUG/INFO/WARN logging across services (NCS-72) ([#41](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/41)) ([804a4bf](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/804a4bf179e3dfb19e2be4390e7e543caf5237c6))
* NCS-10 Implement Pawn Promotion ([#12](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/12)) ([13bfc16](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/13bfc16cfe25db78ec607db523ca6d993c13430c))
* NCS-11 50-move rule ([#9](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/9)) ([412ed98](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/412ed986a95703a3b282276540153480ceed229d))
* NCS-13 Implement Threefold Repetition ([#31](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/31)) ([767d305](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/767d3051a76c266050b6335774d66e2db2273c16))
* NCS-14 implemented insufficient moves rule ([#30](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/30)) ([b0399a4](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b0399a4e489950083066c9538df9a84dcc7a4613))
* NCS-16 Core Separation via Patterns ([#10](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/10)) ([1361dfc](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/1361dfc89553b146864fb8ff3526cf12cf3f293a))
* NCS-17 Implement basic ScalaFX UI ([#14](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/14)) ([3ff8031](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/3ff80318b4f16c59733a46498581a5c27f048287))
* NCS-21 Write Scripts to automate certain tasks ([#15](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/15)) ([8051871](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/80518719d536a087d339fe02530825dc07f8b388))
* NCS-25 Add linters to keep quality up ([#27](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/27)) ([fd4e67d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/fd4e67d4f782a7e955822d90cb909d0a81676fb2))
* NCS-37 Quarkus integration ([#35](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/35)) ([f088c4e](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/f088c4e9ffcc498d3d1b6f01e8f50042d5830d55))
* NCS-40 Rework Draw System ([#34](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/34)) ([33e785d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/33e785d22af87724839b62ae91dfe74a05b398c3))
* NCS-41 Bot Platform ([#33](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/33)) ([8744bee](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/8744bee2dd20966dae90a09c21a43d5b06f59e00))
* NCS-53 changed IO to MicroService for easier scaling ([#37](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/37)) ([b5a2966](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/b5a2966adafa9650f0f7d601bdeb8fdd13710327))
* NCS-6 Implementing FEN & PGN ([#7](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/7)) ([f28e69d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/f28e69dc181416aa2f221fdc4b45c2cda5efbf07))
* NCS-78 Add Traceability to the Applications ([#48](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/48)) ([c96a09b](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/c96a09bb5cee59fc23205bb63baa8b217a7e1b00))
* NCS-9 En passant implementation ([#8](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/8)) ([919beb3](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/919beb3b4bfa8caf2f90976a415fe9b19b7e9747))
* **rule:** Rules as a microservice ([#39](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/39)) ([093134d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/093134d36c6844ba02a36a28d5d044f09291cd1d))
* true-microservices ([#40](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/40)) ([5909242](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/590924254e8a2754de661a57a03e43f89ceb6299))
* update application.yml with new API root paths and add Micrometer and OpenTelemetry dependencies ([72ce262](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/72ce262bc491f94297700e6002fb5d0812e2cc2a))
* wire check/checkmate/stalemate into processMove and gameLoop ([5264a22](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5264a225418b885c5e6ea6411b96f85e38837f6c))
### Bug Fixes
* add missing kings to gameLoop capture test board ([aedd787](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/aedd787b77203c2af934751dba7b784eaf165032))
* **auth:** change InternalAuthFilter to use @Singleton and add HTTP tests for secret validation ([c08d530](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/c08d5303eb9e70d36c8eebf6a061ccb71e118fe5))
* **auth:** update InternalAuthFilter to use @ApplicationScoped and add index-dependency configuration ([6e0fd95](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/6e0fd9523e001756ce7109e639ebb54be4fcdabf))
* correct test board positions and captureOutput/withInput interaction ([f0481e2](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/f0481e2561b779df00925b46ee281dc36a795150))
* **heartbeat:** inject ObjectMapper into InstanceHeartbeatService ([#42](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/42)) ([0c98151](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0c981517da1f94cd10ae396e47bde2b35d0b3ba0))
* IO microservice ([#38](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/38)) ([a386f57](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/a386f57c21d34ead6cc6f92836c52b714597e289))
* Lints ([dc224ab](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/dc224abe26acf5361c56956006e1cc51b75b0b7e))
* **redis:** add max pool wait time and switch to ReactiveRedisDataSource for heartbeat updates ([33e5017](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/33e5017f51a998327b180f778f73964cc10c05d3))
* **redis:** enhance GameRedisSubscriberManager to use ReactiveRedisDataSource and improve subscription handling ([0eb752d](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/0eb752d4935377f75aab710b7f4eda4b29098e6a))
* **redis:** prevent concurrent Redis heartbeat refreshes using AtomicBoolean ([847b132](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/847b13202cb909d18ca3304c27ebe17ce2312b8e))
* **redis:** simplify refreshRedisHeartbeat logic and ensure proper error handling ([1813ea1](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/1813ea1d2d5d093f7925f87371b5e29820bf1136))
* **redis:** update Redis configuration with max pool size and waiting parameters ([5baf6a7](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/5baf6a7cdbea484fc49c02e2b5a1c3919b7fa2c4))
* update main class path in build configuration and adjust VCS directory mapping ([7b1f8b1](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/7b1f8b117623d327232a1a92a8a44d18582e0189))
* update move validation to check for king safety ([#13](https://git.janis-eccarius.de/NowChess/NowChessSystems/issues/13)) ([e5e20c5](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/e5e20c566e368b12ca1dc59680c34e9112bf6762))
### Reverts
* Revert "feat: add authentication permissions for metrics endpoints in application.yml" ([a298417](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/a298417b9e4d68dc73bbf40be63d9484536e9f83))
* Revert "refactor: update metrics paths formatting in application.yml for clarity" ([3870566](https://git.janis-eccarius.de/NowChess/NowChessSystems/commit/38705663498d5f47c40dafe2f26198589ede8656))
+11 -13
View File
@@ -76,6 +76,11 @@ nowchess:
quarkus:
http:
root-path: /api
cors:
~: true
origins: ${CORS_ORIGINS}
methods: GET,POST,PUT,DELETE,OPTIONS
headers: Content-Type,Accept,Authorization
log:
console:
json: true
@@ -86,19 +91,6 @@ nowchess:
exporter:
otlp:
endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://localhost:4317}
mp:
jwt:
verify:
publickey:
location: ${JWT_PUBLIC_KEY_PATH:keys/public.pem}
issuer: nowchess
quarkus:
http:
cors:
~: true
origins: ${CORS_ORIGINS}
methods: GET,POST,PUT,DELETE,OPTIONS
headers: Content-Type,Accept,Authorization
grpc:
clients:
rule-grpc:
@@ -117,6 +109,12 @@ nowchess:
url: ${RULE_SERVICE_URL}
store-service:
url: ${STORE_SERVICE_URL}
mp:
jwt:
verify:
publickey:
location: ${JWT_PUBLIC_KEY_PATH:keys/public.pem}
issuer: nowchess
nowchess:
redis:
host: ${REDIS_HOST}
+1 -1
View File
@@ -1,3 +1,3 @@
MAJOR=0
MINOR=36
MINOR=37
PATCH=0