feat(official-bots): standalone self-play + one-shot dataset builder for NNUE training

Add an easy local data pipeline feeding GPU training on Colab.

- SelfPlayMain: standalone NNUEBot self-play (no microservices) writing FENs for labeling; randomised openings for game diversity, sequential due to the shared EvaluationNNUE accumulator. Exposed via the `selfPlay` Gradle task and selfplay.sh.
- NNUEBot: optional fixedMoveTimeMs so self-play runs fast (default unchanged).
- NbaiLoader: honor `-Dnnue.weights=<path>` to load weights from a file before falling back to the bundled resource.
- build_dataset.py / dataset.sh: one command builds the entire dataset (Lichess eval-DB backbone + self-play + tactical + random filler), dedups, balances the eval histogram, writes append-only zstd shards + manifest, and rclone-pushes to Drive.
- train.py: NNUEDataset reads a directory of .jsonl.zst shards (streaming) in addition to a single file.
- NNUETraining.ipynb: clone to ephemeral /content, sync shards from Drive (cache-aware), train on the shards dir; removed Colab generation/upload steps.
- Concept + implementation plan docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:04:22 +02:00
parent c8cbcdca3b
commit 1c80abdb8a
11 changed files with 909 additions and 198 deletions
@@ -0,0 +1,212 @@
+# Concept: NNUE Training Data — Quality, Scale, and Transfer to Colab
+
+Local generation + labeling is **not** a constraint (Ryzen 9800X3D / RTX 5070 / 32 GB).
+So the design splits cleanly:
+
+- **Data plane = local box.** Generate, label, shard, publish. Cheap, fast, no limits.
+- **Train plane = Colab.** Pull a dataset version, GPU-train, export `.nbai`.
+
+Colab never runs Stockfish and never sees a browser upload. Three problems below:
+**(1) good data, (2) growing it over time, (3) getting it there easily** — (3) is the priority.
+
+---
+
+## 1. Generating *good* training sets
+
+### The current weak spot
+
+`generate.py` plays **fully random games** (`random.choice(legal_moves)`). Random play
+produces positions that never occur in real games — material chaos, nonsense pawn
+structures. An NNUE trained on that learns to evaluate a distribution it will never
+face. Fine as filler, wrong as the backbone.
+
+### What a good NNUE dataset needs
+
+1. **Realistic position distribution.** Positions should resemble what the bot actually
+   reaches in search — from real games and engine play, not coin-flip moves.
+2. **Phase coverage.** Openings, middlegames, endgames all represented. Endgames are
+   under-sampled by random play and matter most for precise eval.
+3. **Eval balance.** Real game data is dominated by near-equal positions. If 80% of
+   labels sit in `[-0.5, +0.5]`, the net learns "everything is roughly equal." Resample
+   to flatten the eval histogram (cap per-bucket counts).
+4. **Accurate labels.** Deeper Stockfish = better target. Locally you can afford
+   depth 16–20. Or skip labeling entirely with the Lichess eval DB (below).
+5. **Clean positions.** Dedup by FEN; drop terminal/checkmate/stalemate; the side to
+   move should not already be in check unless intended; tag the game phase.
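A minimal sketch of points 3 and 5 above, assuming each labeled record is a dict with a `fen` string and an `eval` in pawns. The field names and the helpers `fen_key`, `dedup_by_fen`, and `balance_by_eval` are illustrative and not taken from the repo. It dedups on the first four FEN fields, then caps how many records each eval bucket may keep:

```python
import random
from collections import defaultdict


def fen_key(fen: str) -> str:
    """Identity of a position: board, side to move, castling, en passant.

    Assumption: the halfmove clock and fullmove number (FEN fields 5 and 6)
    are ignored, so the same position reached at different move numbers
    counts as a duplicate.
    """
    return " ".join(fen.split()[:4])


def dedup_by_fen(records):
    """Keep the first record seen for each distinct position."""
    seen = set()
    unique = []
    for rec in records:
        key = fen_key(rec["fen"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def balance_by_eval(records, bucket_width=0.5, cap=2, clip=10.0, seed=0):
    """Keep at most `cap` records per eval bucket to flatten the histogram."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        # Clip extreme evals so huge scores share the outermost buckets.
        ev = max(-clip, min(clip, rec["eval"]))
        buckets[int(ev // bucket_width)].append(rec)
    kept = []
    for idx in sorted(buckets):
        group = buckets[idx]
        rng.shuffle(group)          # random subset, not the first `cap` seen
        kept.extend(group[:cap])
    return kept
```

Running `dedup_by_fen` before `balance_by_eval` keeps duplicates from eating into a bucket's cap. The bucket width, cap, and clip value are placeholders to tune per build.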
+
+### Recommended source mix (per dataset version)
+
+| Source | Role | How | Weight |
+|---|---|---|---|
+| **Lichess eval DB** | Backbone | `lichess_importer.py` — millions of FENs **pre-labeled** by deep Stockfish, real human positions, correct sign convention | 50–70% |
+| **Engine self-play** | Bot's own distribution | NNUEBot (or vs Stockfish) plays games; sample positions; label with local Stockfish | 20–40% |
+| **Tactical puzzles** | Sharp/critical positions | `tactical_positions_extractor.py` (Lichess puzzle DB) | 5–15% |
+| **Random play** | Cheap diversity filler | existing `generate.py`, capped low | ≤10% |
+
+The backbone is real, pre-labeled data — so labeling cost is near zero and quality is
+high. Self-play is the part that adapts data to *your* bot. Random play stays only as
+a thin diversity sprinkle.
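The weights in the table can be turned into per-source position counts for one build. A small sketch, assuming a hypothetical helper `source_quotas` and a 60/30/10 split (which falls inside the ranges above) as sample input:

```python
def source_quotas(total, weights):
    """Split `total` positions across sources in proportion to `weights`.

    Assumes every weight is positive. Counts are rounded down first, then the
    leftover positions go to the sources with the largest fractional parts so
    the quotas always sum to `total`.
    """
    norm = sum(weights.values())
    exact = {src: total * w / norm for src, w in weights.items()}
    quotas = {src: int(x) for src, x in exact.items()}
    leftover = total - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - quotas[s], reverse=True)
    for src in by_remainder[:leftover]:
        quotas[src] += 1
    return quotas


mix = {"lichess": 60, "selfplay": 30, "tactical": 10}
print(source_quotas(1000, mix))
```

A real build would also have to clamp each quota to the number of positions actually available from that source.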
+
+### Self-play flywheel (the quality engine over time)
+
+The strongest lever: **net N generates the games that train net N+1.**
+
+```
+net_vN  ──play self-play games──►  sample positions  ──label (Stockfish)──►
+   ▲                                                                        │
+   └──────────────── train on (backbone + new self-play) ◄─────────────────┘
+                                  net_v(N+1)
+```
+
+Each generation, the bot reaches positions closer to its real playing distribution,
+labels them with a stronger-than-bot oracle (Stockfish), and learns the gap. Standard
+modern NNUE practice. Keep the Lichess backbone mixed in every round so the net does
+not overfit to its own blind spots.
+
+---
+
+## 2. Scaling datasets over time — append-only shards
+
+Do **not** maintain one growing `labeled.jsonl` and re-copy it. Make a dataset an
+**immutable set of shards plus a manifest**:
+
+```
+datasets/
+  shards/
+    lichess_000001.jsonl.zst      # ~50–100k positions each, ~5–10 MB compressed
+    lichess_000002.jsonl.zst
+    selfplay_v7_000001.jsonl.zst
+    tactical_000001.jsonl.zst
+    ...
+  manifest.json
+```
+
+`manifest.json`:
+
+```json
+{
+  "dataset_version": 7,
+  "created": "2026-06-24T...",
+  "total_positions": 4200000,
+  "scale": 300.0,
+  "shards": [
+    {"file": "lichess_000001.jsonl.zst", "positions": 100000,
+     "sha256": "...", "source": "lichess_eval", "stockfish_depth": 0},
+    {"file": "selfplay_v7_000001.jsonl.zst", "positions": 80000,
+     "sha256": "...", "source": "selfplay", "net": "v7", "stockfish_depth": 18}
+  ]
+}
+```
+
+Properties this buys:
+
+- **Growth = add shards.** Generate a new batch, label it, write one new shard, append
+  one manifest entry. Never touch existing shards. O(new data), not O(total).
+- **Provenance.** Each shard records source + net + depth. You can later down-weight or
+  drop a bad batch by editing the manifest, no relabeling.
+- **Dedup across shards** by FEN hash at build time; record dropped counts in metadata.
+- **Reproducible mixes.** A "dataset version" is just a manifest selecting shards +
+  per-source sampling weights. Cheap to define many mixes over the same shard pool.
+- **Resumable, cache-friendly transfer** (next section) — the whole reason for shards.
+
+`dataset.py`'s existing `ds_vN` + `metadata.json` scheme generalizes to this directly:
+the dataset dir holds `shards/` + `manifest.json` instead of one `labeled.jsonl`.
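Appending a shard could look roughly like the sketch below. It assumes a hypothetical helper `add_shard` that receives the shard as already-compressed bytes (producing the `.jsonl.zst` payload is left out). It is not the layout code from `dataset.py`:

```python
import hashlib
import json
from pathlib import Path


def add_shard(dataset_dir, name, payload, positions, **provenance):
    """Write one immutable shard and append its entry to manifest.json."""
    root = Path(dataset_dir)
    shard_dir = root / "shards"
    shard_dir.mkdir(parents=True, exist_ok=True)

    shard_path = shard_dir / name
    if shard_path.exists():
        # Existing shards are never rewritten; new data gets a new file name.
        raise FileExistsError(f"refusing to overwrite shard {name}")
    shard_path.write_bytes(payload)

    manifest_path = root / "manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"total_positions": 0, "shards": []}

    entry = {
        "file": name,
        "positions": positions,
        "sha256": hashlib.sha256(payload).hexdigest(),
        **provenance,                 # e.g. source, net, stockfish_depth
    }
    manifest["shards"].append(entry)
    manifest["total_positions"] += positions

    # Write to a temp file and rename, so an interrupted run cannot leave a
    # half-written manifest behind.
    tmp_path = root / "manifest.json.tmp"
    tmp_path.write_text(json.dumps(manifest, indent=2))
    tmp_path.replace(manifest_path)
    return entry
```

The work done per call depends only on the new shard plus one manifest rewrite, which matches the O(new data) property listed above.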
+
+---
+
+## 3. Getting data to Colab easily  ← top priority
+
+Shards make this trivial: **incremental sync, never a full re-upload.**
+
+### Recommended: rclone → Google Drive, read from mounted Drive
+
+Colab mounts Drive natively, so the cheapest path is to make Drive the shard store and
+sync into it with `rclone` (only uploads new/changed shards):
+
+```bash
+# Local, after building shards:
+rclone copy datasets/ gdrive:NowChess/datasets --progress
+#   ^ uploads only shards Drive doesn't have yet. Adding 80k positions = one small file.
+```
+
+Colab side, one cell:
+
+```python
+import json, pathlib, shutil
+
+SRC = '/content/drive/MyDrive/NowChess/datasets'   # mounted Drive, no explicit download
+with open(f'{SRC}/manifest.json') as f:
+    manifest = json.load(f)
+local = pathlib.Path('/content/datasets'); local.mkdir(exist_ok=True)
+for sh in manifest['shards']:                       # copy Drive→local SSD (fast seq read)
+    dst = local / sh['file']
+    if not dst.exists():                            # cache: only copy missing shards
+        tmp = dst.with_name(dst.name + '.part')     # an interrupted copy must not look cached
+        shutil.copy(f"{SRC}/shards/{sh['file']}", tmp)
+        tmp.replace(dst)                            # rename only after the copy finished
+```
+
+Why this wins on "easy":
+- **No browser upload, ever.** One `rclone copy` from your PC.
+- **Incremental in both directions.** Add a shard locally → next `rclone copy` ships only
+  that shard. Colab copies only shards it doesn't already have on `/content`.
+- **Zero new infra.** Drive is already mounted in the notebook.
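The cache check in the cell above only tests whether a file exists. A slightly more defensive sketch, assuming every manifest entry carries a real hex `sha256` and using the hypothetical helpers `sha256_of` and `sync_shards`, re-copies any shard whose checksum does not match:

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sync_shards(src_dir, dst_dir):
    """Copy shards that are missing or corrupt locally; return their names."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    manifest = json.loads((src / "manifest.json").read_text())

    copied = []
    for shard in manifest["shards"]:
        target = dst / shard["file"]
        if target.exists() and sha256_of(target) == shard["sha256"]:
            continue                               # valid cached copy
        tmp = target.with_name(target.name + ".part")
        shutil.copyfile(src / "shards" / shard["file"], tmp)
        if sha256_of(tmp) != shard["sha256"]:
            tmp.unlink()
            raise ValueError(f"checksum mismatch for {shard['file']}")
        tmp.replace(target)                        # publish only verified files
        copied.append(shard["file"])
    return copied
```

Hashing every cached shard on each sync costs a full read of the local copy. With ~5–10 MB shards that is probably acceptable, but it is a trade-off against the plain existence check.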
+
+### Alternative: Gitea release per dataset version (if Drive quota hurts)
+
+You self-host `git.janis-eccarius.de`. Tag `ds_v7`, attach shards + `manifest.json` as
+release assets. Colab reads the manifest, then parallel-`wget` only the shards it lacks
+(checksum-verified). Versioned, immutable, no Drive quota, token-gated. Slightly more
+wiring than rclone→Drive.
+
+Pick rclone→Drive for minimum friction; Gitea releases if you want hard versioning and
+to keep Drive small.
+
+### Notebook changes either way
+
+- Clone repo to **ephemeral `/content`** (fast), not Drive. Persist only datasets +
+  checkpoints.
+- Drop the notebook's Option A (generating data in Colab) and Option B (browser
+  upload). A single "sync dataset version" cell replaces both.
+- Train reads shards via a streaming `.jsonl.zst` loader (apply per-source sampling
+  weights + eval-bucket balancing here). Keep burst-train + Drive checkpoints + `.nbai`
+  export.
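One way to apply per-source sampling weights while streaming is to pick the source of each next record at random. A sketch under stated assumptions: `weighted_mix` is a hypothetical helper, each stream is any iterable of already-decoded records (zstd decompression and JSON parsing are left out), and all weights are positive:

```python
import random


def weighted_mix(streams, weights, seed=0):
    """Yield (source, record) pairs, drawing sources in proportion to `weights`.

    A source is removed once it runs dry, so the remaining sources keep
    their relative proportions until every stream is exhausted.
    """
    rng = random.Random(seed)
    active = {name: iter(stream) for name, stream in streams.items()}
    while active:
        names = list(active)
        picked = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        try:
            yield picked, next(active[picked])
        except StopIteration:
            del active[picked]      # exhausted: stop sampling this source


streams = {"lichess": range(6), "selfplay": range(3)}
for source, record in weighted_mix(streams, {"lichess": 0.6, "selfplay": 0.3}):
    print(source, record)
```

Order inside each source is preserved and only the interleaving is random, so the sketch works with sequential shard reads. Eval-bucket balancing would be a separate filter applied to the mixed stream.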
+
+---
+
+## Resulting workflow
+
+```
+LOCAL (9800X3D / RTX5070)                         COLAB (GPU)
+─────────────────────────                         ───────────
+import Lichess eval DB ─┐
+self-play with net_vN  ─┼─► label ─► dedup ─► write new shard(s) ─► manifest++
+tactical / random      ─┘                                  │
+                                       rclone copy ────────┘
+                                       datasets/ → Drive
+                                                              │  (only new shards move)
+                                                              ▼
+                                    sync version → copy missing shards → train (GPU)
+                                                              │
+                                                       export .nbai
+                                                              ▼
+                              place in src/main/resources/, rebuild native image
+```
+
+## Build order
+
+1. **Shard format + manifest** in `dataset.py`: write/read `shards/*.jsonl.zst` +
+   `manifest.json`; dedup-across-shards on build; provenance per shard.
+2. **Streaming `.zst` dataloader** in `train.py`: read shards, apply per-source weights
+   and eval-bucket balancing.
+3. **Self-play generator** in `src/`: NNUEBot/Stockfish self-play → positions → local
+   Stockfish label → new shard. This is the scaling engine.
+4. **`dataset_sync.py`**: `push` (rclone→Drive or Gitea upload) / `pull` (cache-aware).
+5. **Notebook rewrite**: ephemeral clone, single sync cell, weighted streaming loader.
+6. Wire `lichess_importer.py` as the backbone shard source.
+
+## Open decisions
+
+- **Transfer backend** — rclone→Drive (easiest, recommended) vs Gitea releases (hard
+  versioning).
+- **Self-play opponent** — NNUEBot against itself (own distribution) or against
+  Stockfish (stronger, more decisive games). Likely a mix.
+- **Backbone/self-play ratio** — start ~60/30/10 (lichess/selfplay/tactical), tune by
+  measured strength.
+- **Shard size** — 50k vs 100k positions/shard (transfer granularity vs file count).