NowChessSystems/modules/official-bots/python/COLAB_TRAINING_CONCEPT.md
Janis Eccarius 1c80abdb8a
feat(official-bots): standalone self-play + one-shot dataset builder for NNUE training
Add an easy local data pipeline feeding GPU training on Colab.

- SelfPlayMain: standalone NNUEBot self-play (no microservices) writing FENs
  for labeling; randomised openings for game diversity, sequential due to the
  shared EvaluationNNUE accumulator. Exposed via the `selfPlay` Gradle task and
  selfplay.sh.
- NNUEBot: optional fixedMoveTimeMs so self-play runs fast (default unchanged).
- NbaiLoader: honor `-Dnnue.weights=<path>` to load weights from a file before
  falling back to the bundled resource.
- build_dataset.py / dataset.sh: one command builds the entire dataset
  (Lichess eval-DB backbone + self-play + tactical + random filler), dedups,
  balances the eval histogram, writes append-only zstd shards + manifest, and
  rclone-pushes to Drive.
- train.py: NNUEDataset reads a directory of .jsonl.zst shards (streaming) in
  addition to a single file.
- NNUETraining.ipynb: clone to ephemeral /content, sync shards from Drive
  (cache-aware), train on the shards dir; removed Colab generation/upload steps.
- Concept + implementation plan docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:04:22 +02:00


Concept: NNUE Training Data — Quality, Scale, and Transfer to Colab

Local generation + labeling is not a constraint (Ryzen 9800X3D / RTX 5070 / 32 GB). So the design splits cleanly:

  • Data plane = local box. Generate, label, shard, publish. Cheap, fast, no limits.
  • Train plane = Colab. Pull a dataset version, GPU-train, export .nbai.

Colab never runs Stockfish and never sees a browser upload. Three problems below: (1) good data, (2) growing it over time, (3) getting it there easily — (3) is the priority.


1. Generating good training sets

The current weak spot

generate.py plays fully random games (random.choice(legal_moves)). Random play produces positions that never occur in real games — material chaos, nonsense pawn structures. An NNUE trained on that learns to evaluate a distribution it will never face. Fine as filler, wrong as the backbone.

What a good NNUE dataset needs

  1. Realistic position distribution. Positions should resemble what the bot actually reaches in search — from real games and engine play, not coin-flip moves.
  2. Phase coverage. Openings, middlegames, endgames all represented. Endgames are under-sampled by random play and matter most for precise eval.
  3. Eval balance. Real game data is dominated by near-equal positions. If 80% of labels sit in [-0.5, +0.5], the net learns "everything is roughly equal." Resample to flatten the eval histogram (cap per-bucket counts).
  4. Accurate labels. Deeper Stockfish = better target. Locally you can afford depth 16–20. Or skip labeling entirely with the Lichess eval DB (below).
  5. Clean positions. Dedup by FEN; drop terminal/checkmate/stalemate; the side to move should not already be in check unless intended; tag the game phase.
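
Point 3 (capping per-bucket counts to flatten the eval histogram) can be sketched as below. This is a minimal illustration, not the project's implementation: it assumes each record is a dict with an `eval` field in pawns (a hypothetical field name), and the bucket width, cap, and clamp values are placeholders to tune.

```python
import math
import random
from collections import defaultdict


def balance_by_eval(records, bucket_width=0.5, cap=2, clamp=10.0, seed=0):
    """Keep at most `cap` records per eval bucket.

    Assumes each record is a dict with an "eval" field in pawns
    (hypothetical field name; adapt to the real label format).
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        # Clamp extreme evals so mate-like scores share the outermost buckets.
        ev = max(-clamp, min(clamp, rec["eval"]))
        buckets[math.floor(ev / bucket_width)].append(rec)

    kept = []
    for key in sorted(buckets):
        group = buckets[key]
        rng.shuffle(group)          # random subset, not the first `cap` seen
        kept.extend(group[:cap])
    return kept


# Eight near-equal positions and two decisive ones: the near-equal bucket is capped.
data = [{"fen": f"pos{i}", "eval": 0.1} for i in range(8)]
data += [{"fen": "sharp", "eval": 3.2}, {"fen": "lost", "eval": -4.7}]
print(len(balance_by_eval(data, cap=2)))
```

In practice the cap would be chosen relative to the sparsest bucket worth keeping, so the decisive tails are not thrown away to match an almost empty bucket.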

| Source | Role | How | Weight |
|---|---|---|---|
| Lichess eval DB | Backbone | lichess_importer.py: millions of FENs pre-labeled by deep Stockfish, real human positions, correct sign convention | 50–70% |
| Engine self-play | Bot's own distribution | NNUEBot (or vs Stockfish) plays games; sample positions; label with local Stockfish | 20–40% |
| Tactical puzzles | Sharp/critical positions | tactical_positions_extractor.py (Lichess puzzle DB) | 5–15% |
| Random play | Cheap diversity filler | existing generate.py, capped low | ≤10% |

The backbone is real, pre-labeled data — so labeling cost is near zero and quality is high. Self-play is the part that adapts data to your bot. Random play stays only as a thin diversity sprinkle.
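
To turn the source weights into concrete per-source position counts for one build, something like the following would do. It is a sketch only: `mix_quotas` is a hypothetical helper, and the 60/25/10/5 split is just one illustrative point inside the weight ranges, not a tuned value.

```python
def mix_quotas(total, weights):
    """Split `total` positions across sources proportionally to `weights`.

    Uses largest-remainder rounding so the quotas sum exactly to `total`.
    """
    weight_sum = sum(weights.values())
    exact = {src: total * w / weight_sum for src, w in weights.items()}
    quotas = {src: int(x) for src, x in exact.items()}
    leftover = total - sum(quotas.values())
    # Hand the remaining positions to the sources with the largest fractional part.
    by_fraction = sorted(exact, key=lambda s: exact[s] - quotas[s], reverse=True)
    for src in by_fraction[:leftover]:
        quotas[src] += 1
    return quotas


# Illustrative mix (an assumption for this example).
weights = {"lichess_eval": 60, "selfplay": 25, "tactical": 10, "random": 5}
print(mix_quotas(1_000_000, weights))
```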

Self-play flywheel (the quality engine over time)

The strongest lever: net N generates the games that train net N+1.

net_vN  ──play self-play games──►  sample positions  ──label (Stockfish)──►
   ▲                                                                        │
   └──────────────── train on (backbone + new self-play) ◄─────────────────┘
                                  net_v(N+1)

Each generation, the bot reaches positions closer to its real playing distribution, labels them with a stronger-than-bot oracle (Stockfish), and learns the gap. Standard modern NNUE practice. Keep the Lichess backbone mixed in every round so the net does not overfit to its own blind spots.


2. Scaling datasets over time — append-only shards

Do not maintain one growing labeled.jsonl and re-copy it. Make a dataset an immutable set of shards plus a manifest:

datasets/
  shards/
    lichess_000001.jsonl.zst      # ~50–100k positions each, ~5–10 MB compressed
    lichess_000002.jsonl.zst
    selfplay_v7_000001.jsonl.zst
    tactical_000001.jsonl.zst
    ...
  manifest.json

manifest.json:

{
  "dataset_version": 7,
  "created": "2026-06-24T...",
  "total_positions": 4200000,
  "scale": 300.0,
  "shards": [
    {"file": "lichess_000001.jsonl.zst", "positions": 100000,
     "sha256": "...", "source": "lichess_eval", "stockfish_depth": 0},
    {"file": "selfplay_v7_000001.jsonl.zst", "positions": 80000,
     "sha256": "...", "source": "selfplay", "net": "v7", "stockfish_depth": 18}
  ]
}

Properties this buys:

  • Growth = add shards. Generate a new batch, label it, write one new shard, append one manifest entry. Never touch existing shards. O(new data), not O(total).
  • Provenance. Each shard records source + net + depth. You can later down-weight or drop a bad batch by editing the manifest, no relabeling.
  • Dedup across shards by FEN hash at build time; record dropped counts in metadata.
  • Reproducible mixes. A "dataset version" is just a manifest selecting shards + per-source sampling weights. Cheap to define many mixes over the same shard pool.
  • Resumable, cache-friendly transfer (next section) — the whole reason for shards.
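
Dedup across shards by FEN hash could work roughly like this. The sketch assumes each record carries a `fen` field (hypothetical name) and, as a design assumption, ignores the halfmove clock and fullmove number so that the same position reached at different move counts counts as one.

```python
import hashlib


def fen_key(fen):
    """Hash the first four FEN fields: placement, side to move, castling, en passant.

    Dropping the halfmove clock and fullmove number is an assumption of this
    sketch: positions that differ only in move counters become duplicates.
    """
    core = " ".join(fen.split()[:4])
    return hashlib.sha1(core.encode("ascii")).digest()


def dedup(records, seen=None):
    """Yield records whose position is new; pass the same `seen` set for every shard."""
    seen = set() if seen is None else seen
    for rec in records:
        key = fen_key(rec["fen"])
        if key not in seen:
            seen.add(key)
            yield rec


seen = set()
shard_a = [{"fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"}]
shard_b = [{"fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 4 3"},
           {"fen": "8/8/8/4k3/8/4K3/4P3/8 w - - 0 1"}]
unique_a = list(dedup(shard_a, seen))
unique_b = list(dedup(shard_b, seen))
dropped = len(shard_b) - len(unique_b)   # the count to record in the metadata
```

Storing 20-byte digests instead of full FEN strings keeps the `seen` set small enough to hold millions of positions in memory on the local box.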

dataset.py's existing ds_vN + metadata.json scheme generalizes to this directly: the dataset dir holds shards/ + manifest.json instead of one labeled.jsonl.
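
The "write one new shard, append one manifest entry" step might look like the sketch below. `register_shard` is a hypothetical helper, not an existing function in dataset.py; it assumes the shard file has already been written under shards/ and only handles the manifest side.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large shards never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_shard(dataset_dir, shard_name, positions, source, **extra):
    """Append one entry for an already-written shard to manifest.json."""
    dataset_dir = Path(dataset_dir)
    manifest_path = dataset_dir / "manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"total_positions": 0, "shards": []}

    if any(s["file"] == shard_name for s in manifest["shards"]):
        raise ValueError(f"{shard_name} is already registered; shards are immutable")

    entry = {"file": shard_name, "positions": positions,
             "sha256": sha256_of(dataset_dir / "shards" / shard_name),
             "source": source, **extra}
    manifest["shards"].append(entry)
    manifest["total_positions"] = manifest.get("total_positions", 0) + positions

    # Write to a temp file and rename so a crash never leaves a half-written manifest.
    tmp = manifest_path.with_name(manifest_path.name + ".tmp")
    tmp.write_text(json.dumps(manifest, indent=2))
    tmp.replace(manifest_path)
    return entry
```

A call such as `register_shard("datasets", "selfplay_v7_000002.jsonl.zst", 80000, "selfplay", net="v7", stockfish_depth=18)` would add provenance fields in the same shape as the manifest example; existing entries are never rewritten.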


3. Getting data to Colab easily ← top priority

Shards make this trivial: incremental sync, never a full re-upload.

Colab mounts Drive natively, so the cheapest path is to make Drive the shard store and sync into it with rclone (only uploads new/changed shards):

# Local, after building shards:
rclone copy datasets/ gdrive:NowChess/datasets --progress
#   ^ uploads only files Drive doesn't have yet or that changed. Adding 80k positions = one small shard plus the updated manifest.

Colab side, one cell:

import json, pathlib, shutil

SRC = pathlib.Path('/content/drive/MyDrive/NowChess/datasets')   # mounted Drive, no download
local = pathlib.Path('/content/datasets')
local.mkdir(parents=True, exist_ok=True)
with open(SRC / 'manifest.json') as f:              # close the file handle after reading
    manifest = json.load(f)
for sh in manifest['shards']:                       # copy Drive→local SSD (fast seq read)
    dst = local / sh['file']
    if not dst.exists():                            # cache: only copy missing shards
        shutil.copy(SRC / 'shards' / sh['file'], dst)
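
An existence check alone treats a shard that was only partially copied (for example after an interrupted session) as valid. Since the manifest records a `sha256` per shard, the cache check can verify it instead. A minimal sketch, assuming the manifest holds full hex digests; `sync_shards` is a hypothetical name:

```python
import hashlib
import shutil
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_shards(manifest, src_dir, local_dir):
    """Copy shards that are missing or fail their checksum; return the names copied."""
    src_dir, local_dir = Path(src_dir), Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for shard in manifest["shards"]:
        dst = local_dir / shard["file"]
        if dst.exists() and file_sha256(dst) == shard["sha256"]:
            continue                      # cached copy is intact
        shutil.copy(src_dir / "shards" / shard["file"], dst)
        if file_sha256(dst) != shard["sha256"]:
            raise IOError(f"checksum mismatch after copy: {shard['file']}")
        copied.append(shard["file"])
    return copied
```

Hashing every cached shard on each sync costs a full read of the local copies; if that becomes noticeable, comparing file sizes first is a cheaper (weaker) check.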

Why this wins on "easy":

  • No browser upload, ever. One rclone copy from your PC.
  • Incremental both directions. Add a shard locally → next rclone copy ships only that shard. Colab copies only shards it doesn't already have on /content.
  • Zero new infra. Drive is already mounted in the notebook.

Alternative: Gitea release per dataset version (if Drive quota hurts)

You self-host git.janis-eccarius.de. Tag ds_v7, attach shards + manifest.json as release assets. Colab reads the manifest, then downloads only the shards it lacks with parallel wget (checksum-verified). Versioned, immutable, no Drive quota, token-gated. Slightly more wiring than rclone→Drive.

Pick rclone→Drive for minimum friction; Gitea releases if you want hard versioning and to keep Drive small.

Notebook changes either way

  • Clone repo to ephemeral /content (fast), not Drive. Persist only datasets + checkpoints.
  • Drop Option A (no Colab generation) and Option B (no browser upload). One "sync dataset version" cell instead.
  • Train reads shards via a streaming .jsonl.zst loader (apply per-source sampling weights + eval-bucket balancing here). Keep burst-train + Drive checkpoints + .nbai export.
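
Per-source sampling weights in a streaming loader could be applied by interleaving one record iterator per source. The sketch below shows only that interleaving step; `weighted_interleave` is a hypothetical helper, not part of train.py, and decompressing and parsing the .jsonl.zst shards is assumed to happen inside the iterators passed in.

```python
import random


def weighted_interleave(streams, weights, seed=0):
    """Yield (source, record) pairs, picking the source at random by weight.

    `streams` maps a source name to any iterator of records. Weights must be
    positive. A source is dropped once exhausted; iteration ends when every
    source is empty.
    """
    rng = random.Random(seed)
    active = {name: iter(it) for name, it in streams.items()}
    while active:
        names = list(active)
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield name, next(active[name])
        except StopIteration:
            del active[name]        # source exhausted, renormalise over the rest


# Toy iterators standing in for per-source shard readers.
streams = {"lichess_eval": iter(range(6)), "selfplay": iter(range(3))}
weights = {"lichess_eval": 0.6, "selfplay": 0.3}
for source, record in weighted_interleave(streams, weights):
    pass  # feed `record` into batching / eval-bucket balancing here
```

Because nothing is materialised, memory use stays constant regardless of dataset size; the trade-off is that the realised mix drifts toward the larger sources once the smaller ones run out.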

Resulting workflow

LOCAL (9800X3D / RTX5070)                         COLAB (GPU)
─────────────────────────                         ───────────
import Lichess eval DB ─┐
self-play with net_vN  ─┼─► label ─► dedup ─► write new shard(s) ─► manifest++
tactical / random      ─┘                                  │
                                       rclone copy ────────┘
                                       datasets/ → Drive
                                                              │  (only new shards move)
                                                              ▼
                                    sync version → copy missing shards → train (GPU)
                                                              │
                                                       export .nbai
                                                              ▼
                              place in src/main/resources/, rebuild native image

Build order

  1. Shard format + manifest in dataset.py: write/read shards/*.jsonl.zst + manifest.json; dedup-across-shards on build; provenance per shard.
  2. Streaming .zst dataloader in train.py: read shards, apply per-source weights and eval-bucket balancing.
  3. Self-play generator in src/: NNUEBot/Stockfish self-play → positions → local Stockfish label → new shard. This is the scaling engine.
  4. dataset_sync.py: push (rclone→Drive or Gitea upload) / pull (cache-aware).
  5. Notebook rewrite: ephemeral clone, single sync cell, weighted streaming loader.
  6. Wire lichess_importer.py as the backbone shard source.

Open decisions

  • Transfer backend — rclone→Drive (easiest, recommended) vs Gitea releases (hard versioning).
  • Self-play opponent: NNUEBot against itself (own distribution) or against Stockfish (stronger, more decisive games). Likely a mix.
  • Backbone/self-play ratio — start ~60/30/10 (lichess/selfplay/tactical), tune by measured strength.
  • Shard size — 50k vs 100k positions/shard (transfer granularity vs file count).