- Add lastUpdatedMs timestamp to GameCacheDto to track actual game updates instead of heartbeat time. Fix cache eviction incorrectly marking correspondence games as idle.
- Use atomic SPOP in LoadBalancer.getGamesToMove() to prevent concurrent rebalance calls from selecting the same games for migration.
- Add game→instance reverse mapping (nowchess:game:$gameId:instance) to eliminate the O(instances) linear scan during cache eviction.
- Fix HealthMonitor pod matching from loose contains() to reliable endsWith() to prevent matching unintended pods with similar names.
- Update FailoverService to maintain game→instance mappings when migrating games during failover.
- Update CacheEvictionManager to use the game→instance mapping for O(1) lookup instead of an O(n) instance scan.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Coordinator Module - Bug Report
Critical Bugs
1. Cache Eviction Kills Correspondence Games (HIGH)
File: CacheEvictionManager.scala:96-101
Problem: Uses the lastHeartbeat timestamp from GameCacheDto to decide whether a game is idle, but lastHeartbeat is set at store/update time, not at move time. Correspondence games with days between moves are evicted while still active.
Current Code:
private def extractLastUpdatedTimestamp(json: String): Long =
  Try {
    val parsed = objectMapper.readTree(json)
    Option(parsed.get("lastHeartbeat"))
      .filter(_.isTextual)
      .fold(0L)(lh => Instant.parse(lh.asText()).toEpochMilli)
  }.getOrElse(0L)
Impact: Active correspondence games deleted from cache after idle threshold (config-dependent, typically hours/days)
Fix: Track actual move timestamp separately in GameCacheDto or check game state instead of heartbeat
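A minimal sketch of an idle check driven by a separate update timestamp. The trimmed-down GameCacheDto shape, the isIdle helper, and the choice to keep entries that have no update timestamp are assumptions for illustration, not the module's actual code. The quoted excerpt falls back to 0L when the field is missing or unparsable. If that value is compared against the current time, such an entry would look maximally old, so the sketch models "unknown" as None instead.

```scala
import java.time.{Duration, Instant}

// Hypothetical, trimmed-down DTO: only the two timestamps relevant to eviction.
final case class GameCacheDto(lastHeartbeat: Instant, lastUpdatedMs: Option[Long])

// Idle means: a real update is recorded and it is older than the threshold.
// Assumption: an entry without lastUpdatedMs is kept rather than evicted.
def isIdle(dto: GameCacheDto, now: Instant, idleThreshold: Duration): Boolean =
  dto.lastUpdatedMs.exists { ms =>
    Duration.between(Instant.ofEpochMilli(ms), now).compareTo(idleThreshold) > 0
  }

@main def idleCheckDemo(): Unit =
  val now = Instant.parse("2024-01-10T00:00:00Z")
  // Heartbeat is three days old, but the last real update was one hour ago.
  val game = GameCacheDto(
    lastHeartbeat = now.minus(Duration.ofDays(3)),
    lastUpdatedMs = Some(now.minus(Duration.ofHours(1)).toEpochMilli)
  )
  println(isIdle(game, now, Duration.ofDays(1))) // false
```

The decision depends only on lastUpdatedMs, so the heartbeat value no longer influences eviction.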
2. Concurrent Rebalance Race Condition (HIGH)
File: LoadBalancer.scala:108-115
Problem: getGamesToMove() reads games from Redis set but doesn't remove them atomically. If multiple rebalance calls run concurrently, same game can be selected in different batches and moved to multiple instances.
Current Code:
private def getGamesToMove(instanceId: String, count: Int): List[String] =
  try
    val setKey = s"$redisPrefix:instance:$instanceId:games"
    redis.set(classOf[String]).smembers(setKey).asScala.toList.take(count) // Read-only, no removal
  catch
    case ex: Exception =>
      log.debugf(ex, "Failed to get games for %s", instanceId)
      List()
Impact: Game subscribed to 2+ instances, state corruption, double-processing
Fix: Use Redis SPOP (atomic pop) or Lua script for atomic read+remove
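The effect of an atomic pop can be sketched without a Redis connection. InMemorySets below is a hypothetical in-memory stand-in, not a real client. Its spop mirrors the semantics of Redis SPOP key count (remove and return up to count members in one step), and the "nowchess" prefix in the demo is assumed from the key format quoted elsewhere in this report.

```scala
import scala.collection.mutable

// In-memory stand-in for Redis sets; NOT a real client.
final class InMemorySets:
  private val sets = mutable.Map.empty[String, mutable.Set[String]]

  def sadd(key: String, members: String*): Unit = synchronized {
    sets.getOrElseUpdate(key, mutable.Set.empty) ++= members
  }

  // Like Redis SPOP key count: members are removed and returned in one atomic step.
  def spop(key: String, count: Int): Set[String] = synchronized {
    sets.get(key) match
      case None => Set.empty
      case Some(s) =>
        val taken = s.take(count).toSet
        s --= taken
        taken
  }

  def scard(key: String): Int = synchronized { sets.get(key).fold(0)(_.size) }

// Same key layout as the quoted code, but games are popped instead of only read.
def getGamesToMove(store: InMemorySets, redisPrefix: String, instanceId: String, count: Int): List[String] =
  store.spop(s"$redisPrefix:instance:$instanceId:games", count).toList

@main def spopDemo(): Unit =
  val store = InMemorySets()
  store.sadd("nowchess:instance:core-1:games", "g1", "g2", "g3", "g4")
  val first = getGamesToMove(store, "nowchess", "core-1", 2)
  val second = getGamesToMove(store, "nowchess", "core-1", 2)
  println(first.toSet.intersect(second.toSet).isEmpty) // true: the batches never overlap
```

One consequence of popping is that a game leaves the source set before it has been moved. If a migration fails, the caller would presumably need to add the game back to the source set.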
3. Pod Matching is Unreliable (MEDIUM)
File: HealthMonitor.scala:134, 188
Problem: Uses .contains() string matching for pod name. Pod "core-1" matches instance "core-11"; loose matching causes wrong pod operations.
Current Code:
instanceId.contains(podName) // Line 134
// and
pods.find(pod => instanceId.contains(pod.getMetadata.getName)) // Line 188
Impact: Wrong pod deleted/evicted when multiple similar names exist
Fix: Exact match or structured ID encoding
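The false positive is easy to reproduce in isolation. The instance IDs below are hypothetical, since the real ID encoding is not shown in this report. looseMatch reproduces the quoted contains() check, and suffixMatch is one possible stricter variant.

```scala
// Hypothetical instance IDs, assumed to end with the pod name.
val instanceIds = List("nowchess-core-1", "nowchess-core-11")

// The check quoted above: substring matching.
def looseMatch(instanceId: String, podName: String): Boolean =
  instanceId.contains(podName)

// Stricter variant: the instance ID must end with the pod name.
def suffixMatch(instanceId: String, podName: String): Boolean =
  instanceId.endsWith(podName)

@main def podMatchDemo(): Unit =
  println(instanceIds.filter(looseMatch(_, "core-1")))  // List(nowchess-core-1, nowchess-core-11)
  println(instanceIds.filter(suffixMatch(_, "core-1"))) // List(nowchess-core-1)
```

A suffix check still depends on how IDs are built. If one pod name can be the tail of another (for example "core-1" and "xcore-1"), only an exact match against a structured ID field rules out collisions.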
Medium Priority Bugs
4. Inefficient Game-to-Instance Lookup (MEDIUM)
File: CacheEvictionManager.scala:104-113
Problem: Linear scan through ALL instances to find which one holds a game. Runs per-game during eviction scan every 5 minutes.
Current Code:
private def findInstanceWithGame(gameId: String): Option[InstanceMetadata] =
  try
    instanceRegistry.getAllInstances.find { instance => // O(n) instances
      val setKey = s"$redisPrefix:instance:${instance.instanceId}:games"
      redis.set(classOf[String]).sismember(setKey, gameId)
    }
Impact: Eviction scans slow with many instances (100+ instances = 100+ Redis ops per game)
Fix: Maintain nowchess:game:$gameId:instance → instanceId mapping in Redis
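A reverse mapping of this kind might look as follows. GameIndex is a hypothetical in-memory stand-in for the Redis string keys, and its method names are assumptions. Only the key format comes from the fix described above.

```scala
import scala.collection.concurrent.TrieMap

// In-memory stand-in for Redis string keys; NOT a real client.
final class GameIndex(redisPrefix: String):
  private val kv = TrieMap.empty[String, String]

  private def key(gameId: String): String = s"$redisPrefix:game:$gameId:instance"

  // To be called wherever a game is added to an instance's game set
  // (initial assignment, rebalance, failover), so the mapping never goes stale.
  def assign(gameId: String, instanceId: String): Unit =
    kv.update(key(gameId), instanceId)

  // To be called when the game is evicted or finished; returns the previous owner, if any.
  def remove(gameId: String): Option[String] =
    kv.remove(key(gameId))

  // One key lookup, independent of the number of instances.
  def findInstanceId(gameId: String): Option[String] =
    kv.get(key(gameId))

@main def gameIndexDemo(): Unit =
  val index = GameIndex("nowchess")
  index.assign("g1", "core-1")
  index.assign("g1", "core-2") // game migrated: the mapping is overwritten
  println(index.findInstanceId("g1")) // Some(core-2)
```

The lookup cost moves to the write path: every code path that changes game ownership has to update the mapping as well, otherwise eviction would act on a stale owner.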
5. Instance Registry Lookup on Pod Events (MEDIUM)
File: HealthMonitor.scala:245-247
Problem: Linear search through all instances every pod state change. Pod watch fires frequently.
Current Code:
private def findRegisteredInstance(pod: Pod): Option[InstanceMetadata] =
  val podName = pod.getMetadata.getName
  instanceRegistry.getAllInstances.find(inst => inst.instanceId.contains(podName))
Impact: O(n) lookup on hot path (pod watch events)
Fix: Maintain pod-name → instanceId index or use proper ID encoding
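One way to picture the index is a map rebuilt whenever the registry changes. Everything below is a sketch under stated assumptions: InstanceMetadata is trimmed to a single field, and the "prefix:podName" ID encoding behind podNameOf is hypothetical.

```scala
// Hypothetical, trimmed-down shape; the real InstanceMetadata has more fields.
final case class InstanceMetadata(instanceId: String)

// Assumed encoding: the pod name is the last ':'-separated segment of the instance ID.
def podNameOf(instanceId: String): String =
  instanceId.split(':').last

// Built once per registry change, so each pod event costs a single map lookup.
def buildPodIndex(instances: Iterable[InstanceMetadata]): Map[String, InstanceMetadata] =
  instances.map(i => podNameOf(i.instanceId) -> i).toMap

@main def podIndexDemo(): Unit =
  val index = buildPodIndex(List(InstanceMetadata("nowchess:core-1"), InstanceMetadata("nowchess:core-11")))
  println(index.get("core-1").map(_.instanceId)) // Some(nowchess:core-1)
```

Because the key is the exact pod name, this also avoids the substring false positives of contains().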
Low Priority Bugs
6. Non-idiomatic Sorting (LOW)
File: LoadBalancer.scala:72
Problem: Uses .sortBy[Int](_.subscriptionCount).reverse, which sorts ascending and then reverses in a second pass, instead of sorting with a reversed Ordering
Current Code:
val overloaded = instances
  .filter(_.subscriptionCount > config.maxGamesPerCore)
  .sortBy[Int](_.subscriptionCount) // Type annotation unnecessary
  .reverse
Impact: Micro code-quality issue
Fix: Use .sortBy(_.subscriptionCount)(Ordering[Int].reverse)
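In the Scala standard library, a descending sort is usually expressed by passing a reversed Ordering to sortBy. A short sketch, with a hypothetical Instance case class standing in for the real instance type:

```scala
// Hypothetical shape; only the field used for sorting matters here.
final case class Instance(instanceId: String, subscriptionCount: Int)

// Keep instances above the limit, most loaded first, in a single sort.
def overloaded(instances: List[Instance], maxGamesPerCore: Int): List[Instance] =
  instances
    .filter(_.subscriptionCount > maxGamesPerCore)
    .sortBy(_.subscriptionCount)(Ordering[Int].reverse)

@main def sortDemo(): Unit =
  val all = List(Instance("a", 5), Instance("b", 12), Instance("c", 9))
  println(overloaded(all, 6).map(_.instanceId)) // List(b, c)
```

For instances with equal counts, the relative order can differ from sortBy(...).reverse, because a stable sort with a reversed ordering keeps ties in their original order while reversing afterwards flips them.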
Fix Priority
- Cache eviction (HIGH) — Data loss risk
- Rebalance race (HIGH) — State corruption risk
- Pod matching (MEDIUM) — Operational blast radius
- Game lookup (MEDIUM) — Performance under scale
- Instance lookup (MEDIUM) — Hot path perf
- Sorting (LOW) — Code style