Add an easy local data pipeline feeding GPU training on Colab.

- SelfPlayMain: standalone NNUEBot self-play (no microservices) writing FENs for labeling; randomised openings for game diversity, sequential due to the shared EvaluationNNUE accumulator. Exposed via the `selfPlay` Gradle task and selfplay.sh.
- NNUEBot: optional fixedMoveTimeMs so self-play runs fast (default unchanged).
- NbaiLoader: honor `-Dnnue.weights=<path>` to load weights from a file before falling back to the bundled resource.
- build_dataset.py / dataset.sh: one command builds the entire dataset (Lichess eval-DB backbone + self-play + tactical + random filler), dedups, balances the eval histogram, writes append-only zstd shards + manifest, and rclone-pushes to Drive.
- train.py: NNUEDataset reads a directory of .jsonl.zst shards (streaming) in addition to a single file.
- NNUETraining.ipynb: clone to ephemeral /content, sync shards from Drive (cache-aware), train on the shards dir; removed Colab generation/upload steps.
- Concept + implementation plan docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Concept: NNUE Training Data — Quality, Scale, and Transfer to Colab
Local generation + labeling is not a constraint (Ryzen 9800X3D / RTX 5070 / 32 GB). So the design splits cleanly:
- Data plane = local box. Generate, label, shard, publish. Cheap, fast, no limits.
- Train plane = Colab. Pull a dataset version, GPU-train, export .nbai.
Colab never runs Stockfish and never sees a browser upload. Three problems below: (1) good data, (2) growing it over time, (3) getting it there easily — (3) is the priority.
1. Generating good training sets
The current weak spot
generate.py plays fully random games (random.choice(legal_moves)). Random play
produces positions that never occur in real games — material chaos, nonsense pawn
structures. An NNUE trained on that learns to evaluate a distribution it will never
face. Fine as filler, wrong as the backbone.
What a good NNUE dataset needs
- Realistic position distribution. Positions should resemble what the bot actually reaches in search — from real games and engine play, not coin-flip moves.
- Phase coverage. Openings, middlegames, endgames all represented. Endgames are under-sampled by random play and matter most for precise eval.
- Eval balance. Real game data is dominated by near-equal positions. If 80% of labels sit in [-0.5, +0.5], the net learns "everything is roughly equal." Resample to flatten the eval histogram (cap per-bucket counts).
- Accurate labels. Deeper Stockfish = better target. Locally you can afford depth 16–20. Or skip labeling entirely with the Lichess eval DB (below).
- Clean positions. Dedup by FEN; drop terminal/checkmate/stalemate; the side to move should not already be in check unless intended; tag the game phase.
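The eval-balance bullet above can be sketched as a per-bucket cap. This is a minimal sketch, assuming each record is a dict with an `eval` field in pawns; the field name, the bucket width, and the cap are illustrative assumptions, not the project's actual schema.

```python
import random
from collections import defaultdict

def balance_by_eval(records, bucket_width=0.5, cap=2, seed=0):
    """Keep at most `cap` records per eval bucket to flatten the histogram."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        # Floor division groups evals into [k*w, (k+1)*w) buckets.
        buckets[int(rec["eval"] // bucket_width)].append(rec)
    kept = []
    for key in sorted(buckets):
        group = buckets[key]
        if len(group) > cap:
            group = rng.sample(group, cap)  # random subset of an over-full bucket
        kept.extend(group)
    return kept

# Four near-equal positions, three decisive ones.
positions = [{"fen": f"pos{i}", "eval": e}
             for i, e in enumerate([0.1, 0.2, 0.3, 0.4, -0.1, 1.7, 3.2])]
balanced = balance_by_eval(positions, bucket_width=0.5, cap=2)
```

With these inputs the crowded [0.0, 0.5) bucket is cut from four records to two, while the sparse buckets pass through unchanged.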
Recommended source mix (per dataset version)
| Source | Role | How | Weight |
|---|---|---|---|
| Lichess eval DB | Backbone | lichess_importer.py: millions of FENs pre-labeled by deep Stockfish, real human positions, correct sign convention | 50–70% |
| Engine self-play | Bot's own distribution | NNUEBot (or vs Stockfish) plays games; sample positions; label with local Stockfish | 20–40% |
| Tactical puzzles | Sharp/critical positions | tactical_positions_extractor.py (Lichess puzzle DB) | 5–15% |
| Random play | Cheap diversity filler | existing generate.py, capped low | ≤10% |
The backbone is real, pre-labeled data — so labeling cost is near zero and quality is high. Self-play is the part that adapts data to your bot. Random play stays only as a thin diversity sprinkle.
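To turn a source mix into concrete per-source position counts, something like the following could work. It is a sketch only: `source_quotas` is a hypothetical helper, and the 60/30/10 split is the starting ratio suggested under "Open decisions", expressed as integer percentages to avoid float rounding.

```python
def source_quotas(total, weights):
    """Split `total` positions across sources in proportion to integer weights."""
    norm = sum(weights.values())
    quotas = {src: total * w // norm for src, w in weights.items()}
    # Hand any rounding leftovers to the largest-weight source so the sum is exact.
    largest = max(weights, key=weights.get)
    quotas[largest] += total - sum(quotas.values())
    return quotas

mix = {"lichess_eval": 60, "selfplay": 30, "tactical": 10}
quotas = source_quotas(1_000_000, mix)
```

The quotas can then drive how many positions are drawn from each shard pool when a dataset version is built.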
Self-play flywheel (the quality engine over time)
The strongest lever: net N generates the games that train net N+1.
net_vN ──play self-play games──► sample positions ──label (Stockfish)──►
▲                                                                       │
└──────────────── train on (backbone + new self-play) ◄─────────────────┘
                               net_v(N+1)
Each generation, the bot reaches positions closer to its real playing distribution, labels them with a stronger-than-bot oracle (Stockfish), and learns the gap. Standard modern NNUE practice. Keep the Lichess backbone mixed in every round so the net does not overfit to its own blind spots.
2. Scaling datasets over time — append-only shards
Do not maintain one growing labeled.jsonl and re-copy it. Make a dataset an
immutable set of shards plus a manifest:
datasets/
  shards/
    lichess_000001.jsonl.zst        # ~50–100k positions each, ~5–10 MB compressed
    lichess_000002.jsonl.zst
    selfplay_v7_000001.jsonl.zst
    tactical_000001.jsonl.zst
    ...
  manifest.json
manifest.json:
{
  "dataset_version": 7,
  "created": "2026-06-24T...",
  "total_positions": 4200000,
  "scale": 300.0,
  "shards": [
    {"file": "lichess_000001.jsonl.zst", "positions": 100000,
     "sha256": "...", "source": "lichess_eval", "stockfish_depth": 0},
    {"file": "selfplay_v7_000001.jsonl.zst", "positions": 80000,
     "sha256": "...", "source": "selfplay", "net": "v7", "stockfish_depth": 18}
  ]
}
Properties this buys:
- Growth = add shards. Generate a new batch, label it, write one new shard, append one manifest entry. Never touch existing shards. O(new data), not O(total).
- Provenance. Each shard records source + net + depth. You can later down-weight or drop a bad batch by editing the manifest, no relabeling.
- Dedup across shards by FEN hash at build time; record dropped counts in metadata.
- Reproducible mixes. A "dataset version" is just a manifest selecting shards + per-source sampling weights. Cheap to define many mixes over the same shard pool.
- Resumable, cache-friendly transfer (next section) — the whole reason for shards.
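Cross-shard dedup by FEN hash might look roughly like this. The sketch assumes records are dicts with a `fen` field, and it makes one design choice the text does not specify: it hashes only the first four FEN fields (placement, side to move, castling, en passant), so positions that differ only in move counters count as duplicates.

```python
import hashlib

def fen_key(fen):
    """Hash the first four FEN fields, ignoring halfmove/fullmove counters (assumption)."""
    core = " ".join(fen.split()[:4])
    return hashlib.sha1(core.encode("utf-8")).hexdigest()

def dedup(records, seen):
    """Yield records whose FEN key is not yet in `seen`; `seen` is updated in place."""
    for rec in records:
        key = fen_key(rec["fen"])
        if key in seen:
            continue
        seen.add(key)
        yield rec

seen = set()  # shared across all shards in one build
batch = [
    {"fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"},
    {"fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 3 5"},
    {"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1"},
]
unique = list(dedup(batch, seen))
dropped = len(batch) - len(unique)  # the count to record in metadata
```

Keeping one `seen` set for the whole build is what makes the dedup span shards rather than a single file.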
dataset.py's existing ds_vN + metadata.json scheme generalizes to this directly:
the dataset dir holds shards/ + manifest.json instead of one labeled.jsonl.
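The "append one manifest entry" step could be sketched as below. `register_shard` and `sha256_file` are hypothetical names, not functions from dataset.py, and the sketch assumes the shard file has already been written under shards/. It only maintains `shards` and `total_positions`; other top-level fields such as `dataset_version` are left as found.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so a shard never has to sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def register_shard(dataset_dir, shard_name, positions, source, **extra):
    """Append one entry for an already-written shard to manifest.json."""
    dataset_dir = Path(dataset_dir)
    manifest_path = dataset_dir / "manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"total_positions": 0, "shards": []}
    # Append-only: refuse to register the same shard name twice.
    if any(s["file"] == shard_name for s in manifest["shards"]):
        raise ValueError(f"shard already registered: {shard_name}")
    entry = {"file": shard_name, "positions": positions,
             "sha256": sha256_file(dataset_dir / "shards" / shard_name),
             "source": source, **extra}
    manifest["shards"].append(entry)
    manifest["total_positions"] += positions
    # Write to a temp file, then rename, so a crash cannot leave a half-written manifest.
    tmp = manifest_path.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(manifest, indent=2))
    tmp.replace(manifest_path)
    return entry

# Demo on a throwaway directory; the shard bytes are a placeholder, not real zstd data.
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "shards").mkdir()
    (root / "shards" / "selfplay_v7_000001.jsonl.zst").write_bytes(b"placeholder")
    entry = register_shard(root, "selfplay_v7_000001.jsonl.zst", 80000,
                           "selfplay", net="v7", stockfish_depth=18)
    manifest_after = json.loads((root / "manifest.json").read_text())
```

Existing shard files are never reopened, which keeps the cost of growth proportional to the new data only.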
3. Getting data to Colab easily ← top priority
Shards make this trivial: incremental sync, never a full re-upload.
Recommended: rclone → Google Drive, read from mounted Drive
Colab mounts Drive natively, so the cheapest path is to make Drive the shard store and
sync into it with rclone (only uploads new/changed shards):
# Local, after building shards:
rclone copy datasets/ gdrive:NowChess/datasets --progress
# ^ uploads only shards Drive doesn't have yet. Adding 80k positions = one small file.
Colab side, one cell:
import json, pathlib, shutil

SRC = pathlib.Path('/content/drive/MyDrive/NowChess/datasets')  # mounted, no download
with open(SRC / 'manifest.json') as fh:
    manifest = json.load(fh)
local = pathlib.Path('/content/datasets')
local.mkdir(parents=True, exist_ok=True)
for sh in manifest['shards']:            # copy Drive→local SSD (fast seq read)
    dst = local / sh['file']
    if not dst.exists():                 # cache: only copy missing shards
        shutil.copy(SRC / 'shards' / sh['file'], dst)
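Because the cache check only tests whether a file exists, an interrupted copy could leave a truncated shard that is never refreshed. One possible guard, sketched here under the assumption that the manifest's `sha256` values are hex digests of the compressed shard files, is to verify the local copies after syncing. `verify_shards` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path

def verify_shards(manifest, shard_dir):
    """Return names of shards that are missing or fail their sha256 check."""
    bad = []
    for sh in manifest["shards"]:
        path = Path(shard_dir) / sh["file"]
        if not path.exists():
            bad.append(sh["file"])
            continue
        # Shards are a few MB each, so reading one whole file at a time is fine.
        if hashlib.sha256(path.read_bytes()).hexdigest() != sh["sha256"]:
            bad.append(sh["file"])
    return bad

# Demo: one intact shard, one truncated copy, one shard never downloaded.
with tempfile.TemporaryDirectory() as tmp_dir:
    (Path(tmp_dir) / "a.jsonl.zst").write_bytes(b"intact shard")
    (Path(tmp_dir) / "b.jsonl.zst").write_bytes(b"truncat")
    demo_manifest = {"shards": [
        {"file": "a.jsonl.zst", "sha256": hashlib.sha256(b"intact shard").hexdigest()},
        {"file": "b.jsonl.zst", "sha256": hashlib.sha256(b"truncated copy").hexdigest()},
        {"file": "c.jsonl.zst", "sha256": "0" * 64},
    ]}
    failed = verify_shards(demo_manifest, tmp_dir)
```

Any name in the returned list can be deleted locally and copied again on the next sync.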
Why this wins on "easy":
- No browser upload, ever. One rclone copy from your PC.
- Incremental both directions. Add a shard locally → next rclone copy ships only that shard. Colab copies only shards it doesn't already have on /content.
- Zero new infra. Drive is already mounted in the notebook.
Alternative: Gitea release per dataset version (if Drive quota hurts)
You self-host git.janis-eccarius.de. Tag ds_v7, attach shards + manifest.json as
release assets. Colab reads the manifest, then parallel-wget only the shards it lacks
(checksum-verified). Versioned, immutable, no Drive quota, token-gated. Slightly more
wiring than rclone→Drive.
Pick rclone→Drive for minimum friction; Gitea releases if you want hard versioning and to keep Drive small.
Notebook changes either way
- Clone repo to ephemeral /content (fast), not Drive. Persist only datasets + checkpoints.
- Drop Option A (no Colab generation) and Option B (no browser upload). One "sync dataset version" cell instead.
- Train reads shards via a streaming .jsonl.zst loader (apply per-source sampling weights + eval-bucket balancing here). Keep burst-train + Drive checkpoints + .nbai export.
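Per-source sampling weights in a streaming loader could be applied by interleaving one record iterator per source. The sketch below assumes those iterators already exist (for example, one generator per source that walks its shards); `weighted_interleave` is a hypothetical name and the weights are illustrative.

```python
import random

def weighted_interleave(streams, weights, seed=0):
    """Yield (source, record) pairs, picking the next source at random by weight."""
    rng = random.Random(seed)
    active = {name: iter(it) for name, it in streams.items()}
    while active:
        names = list(active)
        name = rng.choices(names, weights=[weights[n] for n in names])[0]
        try:
            yield name, next(active[name])
        except StopIteration:
            del active[name]  # source exhausted; keep drawing from the rest

# Stand-in streams; real ones would yield decoded JSON records from shards.
streams = {"lichess_eval": range(6), "selfplay": range(3), "tactical": range(1)}
weights = {"lichess_eval": 0.6, "selfplay": 0.3, "tactical": 0.1}
mixed = list(weighted_interleave(streams, weights))
```

Note that this variant drains every source eventually, so the weights shape the local mix rather than the totals; stopping when the first source runs dry would be the alternative if strict ratios matter.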
Resulting workflow
LOCAL (9800X3D / RTX5070)                              COLAB (GPU)
─────────────────────────                              ───────────
import Lichess eval DB ─┐
self-play with net_vN  ─┼─► label ─► dedup ─► write new shard(s) ─► manifest++
tactical / random      ─┘                                               │
                                                    rclone copy ────────┘
                                                    datasets/ → Drive
                                                           │ (only new shards move)
                                                           ▼
                                    sync version → copy missing shards → train (GPU)
                                                                           │
                                                                     export .nbai
                                                                           ▼
                                    place in src/main/resources/, rebuild native image
Build order
- Shard format + manifest in dataset.py: write/read shards/*.jsonl.zst + manifest.json; dedup-across-shards on build; provenance per shard.
- Streaming .zst dataloader in train.py: read shards, apply per-source weights and eval-bucket balancing.
- Self-play generator in src/: NNUEBot/Stockfish self-play → positions → local Stockfish label → new shard. This is the scaling engine.
- dataset_sync.py: push (rclone→Drive or Gitea upload) / pull (cache-aware).
- Notebook rewrite: ephemeral clone, single sync cell, weighted streaming loader.
- Wire lichess_importer.py as the backbone shard source.
Open decisions
- Transfer backend — rclone→Drive (easiest, recommended) vs Gitea releases (hard versioning).
- Self-play opponent — NNUEBot vs itself (own distribution) or NNUEBot vs Stockfish (stronger, more decisive games). Likely a mix.
- Backbone/self-play ratio — start ~60/30/10 (lichess/selfplay/tactical), tune by measured strength.
- Shard size — 50k vs 100k positions/shard (transfer granularity vs file count).