# Concept: NNUE Training Data (Quality, Scale, and Transfer to Colab)

Local generation + labeling is **not** a constraint (Ryzen 9800X3D / RTX 5070 / 32 GB). So the design splits cleanly:

- **Data plane = local box.** Generate, label, shard, publish. Cheap, fast, no limits.
- **Train plane = Colab.** Pull a dataset version, GPU-train, export `.nbai`. Colab never runs Stockfish and never sees a browser upload.

Three problems below: **(1) good data, (2) growing it over time, (3) getting it there easily**. Problem (3) is the priority.

---

## 1. Generating *good* training sets

### The current weak spot

`generate.py` plays **fully random games** (`random.choice(legal_moves)`). Random play produces positions that never occur in real games: material chaos, nonsense pawn structures. An NNUE trained on that learns to evaluate a distribution it will never face. Fine as filler, wrong as the backbone.

### What a good NNUE dataset needs

1. **Realistic position distribution.** Positions should resemble what the bot actually reaches in search: positions from real games and engine play, not coin-flip moves.
2. **Phase coverage.** Openings, middlegames, endgames all represented. Endgames are under-sampled by random play and matter most for precise eval.
3. **Eval balance.** Real game data is dominated by near-equal positions. If 80% of labels sit in `[-0.5, +0.5]`, the net learns "everything is roughly equal." Resample to flatten the eval histogram (cap per-bucket counts).
4. **Accurate labels.** Deeper Stockfish = better target. Locally you can afford depth 16–20. Or skip labeling entirely with the Lichess eval DB (below).
5. **Clean positions.** Dedup by FEN; drop terminal/checkmate/stalemate; the side to move should not already be in check unless intended; tag the game phase.
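
Points 3 and 5 above (cap per-bucket counts, dedup by FEN) can be sketched in a few lines. This is a minimal illustration, not the project's code: the record fields `fen` and `eval` (in pawns), the helper names, and the choice to ignore the two move counters when comparing FENs are all assumptions.

```python
import random
from collections import defaultdict


def position_key(fen):
    # Assumption: FENs that differ only in the halfmove/fullmove counters
    # describe the same position, so only the first four fields are compared.
    return " ".join(fen.split()[:4])


def dedup_and_balance(records, bucket_width=0.5, cap=2, clip=10.0, seed=0):
    """Drop duplicate positions, then keep at most `cap` records per eval bucket."""
    rng = random.Random(seed)
    seen = set()
    buckets = defaultdict(list)
    for rec in records:
        key = position_key(rec["fen"])
        if key in seen:
            continue  # duplicate position: keep the first occurrence only
        seen.add(key)
        # Clip extreme evals so huge scores share the outermost buckets.
        ev = max(-clip, min(clip, rec["eval"]))
        buckets[int(ev // bucket_width)].append(rec)
    kept = []
    for bucket in sorted(buckets):
        group = buckets[bucket]
        rng.shuffle(group)         # random subset, not "first seen"
        kept.extend(group[:cap])   # the cap flattens the histogram
    return kept
```

With a realistic `cap` (thousands per bucket rather than 2), crowded near-equal buckets are trimmed while sparse decisive buckets pass through untouched.
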
### Recommended source mix (per dataset version)

| Source | Role | How | Weight |
|---|---|---|---|
| **Lichess eval DB** | Backbone | `lichess_importer.py`: millions of FENs **pre-labeled** by deep Stockfish, real human positions, correct sign convention | 50–70% |
| **Engine self-play** | Bot's own distribution | NNUEBot plays itself (or Stockfish); sample positions; label with local Stockfish | 20–40% |
| **Tactical puzzles** | Sharp/critical positions | `tactical_positions_extractor.py` (Lichess puzzle DB) | 5–15% |
| **Random play** | Cheap diversity filler | existing `generate.py`, capped low | ≤10% |

The backbone is real, pre-labeled data, so labeling cost is near zero and quality is high. Self-play is the part that adapts data to *your* bot. Random play stays only as a thin diversity sprinkle.

### Self-play flywheel (the quality engine over time)

The strongest lever: **net N generates the games that train net N+1.**

```
net_vN ──play self-play games──► sample positions ──label (Stockfish)──►
▲                                                                       │
└──────────────── train on (backbone + new self-play) ◄─────────────────┘
                              net_v(N+1)
```

Each generation, the bot reaches positions closer to its real playing distribution, labels them with a stronger-than-bot oracle (Stockfish), and learns the gap. Standard modern NNUE practice. Keep the Lichess backbone mixed in every round so the net does not overfit to its own blind spots.

---

## 2. Scaling datasets over time: append-only shards

Do **not** maintain one growing `labeled.jsonl` and re-copy it. Make a dataset an **immutable set of shards plus a manifest**:

```
datasets/
  shards/
    lichess_000001.jsonl.zst        # ~50–100k positions each, ~5–10 MB compressed
    lichess_000002.jsonl.zst
    selfplay_v7_000001.jsonl.zst
    tactical_000001.jsonl.zst
    ...
  manifest.json
```

`manifest.json`:

```json
{
  "dataset_version": 7,
  "created": "2026-06-24T...",
  "total_positions": 4200000,
  "scale": 300.0,
  "shards": [
    {"file": "lichess_000001.jsonl.zst", "positions": 100000, "sha256": "...",
     "source": "lichess_eval", "stockfish_depth": 0},
    {"file": "selfplay_v7_000001.jsonl.zst", "positions": 80000, "sha256": "...",
     "source": "selfplay", "net": "v7", "stockfish_depth": 18}
  ]
}
```

Properties this buys:

- **Growth = add shards.** Generate a new batch, label it, write one new shard, append one manifest entry. Never touch existing shards. O(new data), not O(total).
- **Provenance.** Each shard records source + net + depth. You can later down-weight or drop a bad batch by editing the manifest, no relabeling.
- **Dedup across shards** by FEN hash at build time; record dropped counts in metadata.
- **Reproducible mixes.** A "dataset version" is just a manifest selecting shards + per-source sampling weights. Cheap to define many mixes over the same shard pool.
- **Resumable, cache-friendly transfer** (next section), which is the whole reason for shards.

`dataset.py`'s existing `ds_vN` + `metadata.json` scheme generalizes to this directly: the dataset dir holds `shards/` + `manifest.json` instead of one `labeled.jsonl`.

---

## 3. Getting data to Colab easily ← top priority

Shards make this trivial: **incremental sync, never a full re-upload.**

### Recommended: rclone → Google Drive, read from mounted Drive

Colab mounts Drive natively, so the cheapest path is to make Drive the shard store and sync into it with `rclone` (only uploads new/changed shards):

```bash
# Local, after building shards:
rclone copy datasets/ gdrive:NowChess/datasets --progress
# ^ uploads only shards Drive doesn't have yet. Adding 80k positions = one small file.
```

Colab side, one cell:

```python
import json, shutil, pathlib

SRC = '/content/drive/MyDrive/NowChess/datasets'   # mounted, no download

with open(f'{SRC}/manifest.json') as fh:           # close the handle after reading
    manifest = json.load(fh)

local = pathlib.Path('/content/datasets')
local.mkdir(parents=True, exist_ok=True)

for sh in manifest['shards']:                      # copy Drive→local SSD (fast seq read)
    dst = local / sh['file']
    if not dst.exists():                           # cache: only copy missing shards
        shutil.copy(f"{SRC}/shards/{sh['file']}", dst)
```

Why this wins on "easy":

- **No browser upload, ever.** One `rclone copy` from your PC.
- **Incremental in both directions.** Add a shard locally → next `rclone copy` ships only that shard. Colab copies only shards it doesn't already have on `/content`.
- **Zero new infra.** Drive is already mounted in the notebook.

### Alternative: Gitea release per dataset version (if Drive quota hurts)

You self-host `git.janis-eccarius.de`. Tag `ds_v7`, attach shards + `manifest.json` as release assets. Colab reads the manifest, then fetches only the shards it lacks with parallel `wget` calls (checksum-verified). Versioned, immutable, no Drive quota, token-gated. Slightly more wiring than rclone→Drive.

Pick rclone→Drive for minimum friction; pick Gitea releases if you want hard versioning and to keep Drive small.

### Notebook changes either way

- Clone repo to **ephemeral `/content`** (fast), not Drive. Persist only datasets + checkpoints.
- Drop Option A (no Colab generation) and Option B (no browser upload). One "sync dataset version" cell instead.
- Train reads shards via a streaming `.jsonl.zst` loader (apply per-source sampling weights + eval-bucket balancing here). Keep burst-train + Drive checkpoints + `.nbai` export.
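
Whichever backend moves the shards, the `sha256` field in the manifest allows a check before training that every cached shard is present and intact. The sketch below uses only the standard library; the function names are hypothetical, and it assumes shards sit flat in one cache directory, as in the sync cell.

```python
import hashlib
import json
import pathlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large shards never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_bad_shards(manifest_path, shard_dir):
    """Return manifest file names that are missing or fail their checksum."""
    with open(manifest_path) as fh:
        manifest = json.load(fh)
    shard_dir = pathlib.Path(shard_dir)
    bad = []
    for entry in manifest["shards"]:
        path = shard_dir / entry["file"]
        if not path.exists() or sha256_of(path) != entry["sha256"]:
            bad.append(entry["file"])
    return bad
```

In the notebook this could run right after the sync cell, for example `find_bad_shards(f'{SRC}/manifest.json', '/content/datasets')`, deleting and re-copying anything it reports. That also catches shards left half-copied by an interrupted session, which a plain `dst.exists()` check would accept.
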
---

## Resulting workflow

```
LOCAL (9800X3D / RTX5070)                          COLAB (GPU)
─────────────────────────                          ───────────
import Lichess eval DB ─┐
self-play with net_vN  ─┼─► label ─► dedup ─► write new shard(s) ─► manifest++
tactical / random      ─┘                                               │
                                                    rclone copy ────────┘
                                                    datasets/ → Drive
                                                        │  (only new shards move)
                                                        ▼
                                                   sync version → copy missing shards → train (GPU)
                                                                                             │
                                                                                       export .nbai
                                                                                             ▼
                                                   place in src/main/resources/, rebuild native image
```

## Build order

1. **Shard format + manifest** in `dataset.py`: write/read `shards/*.jsonl.zst` + `manifest.json`; dedup-across-shards on build; provenance per shard.
2. **Streaming `.zst` dataloader** in `train.py`: read shards, apply per-source weights and eval-bucket balancing.
3. **Self-play generator** in `src/`: NNUEBot/Stockfish self-play → positions → local Stockfish label → new shard. This is the scaling engine.
4. **`dataset_sync.py`**: `push` (rclone→Drive or Gitea upload) / `pull` (cache-aware).
5. **Notebook rewrite**: ephemeral clone, single sync cell, weighted streaming loader.
6. Wire `lichess_importer.py` as the backbone shard source.

## Open decisions

- **Transfer backend**: rclone→Drive (easiest, recommended) vs Gitea releases (hard versioning).
- **Self-play opponent**: NNUEBot against itself (own distribution) or against Stockfish (stronger, more decisive games). Likely a mix.
- **Backbone/self-play ratio**: start ~60/30/10 (lichess/selfplay/tactical), tune by measured strength.
- **Shard size**: 50k vs 100k positions/shard (transfer granularity vs file count).
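
To make the ratio decision concrete, here is one way a 60/30/10 mix could be drawn from per-source pools. It is a sketch under assumptions: `mix_sources`, the in-memory `pools` dict, and the source names are illustrative, and a real loader would stream from shards rather than hold pools in RAM.

```python
import random


def mix_sources(pools, weights, total, seed=0):
    """Draw about `total` positions, split across sources in proportion to `weights`."""
    rng = random.Random(seed)
    norm = sum(weights.values())
    mixed = []
    for source, weight in weights.items():
        pool = pools.get(source, [])
        want = round(total * weight / norm)
        take = min(want, len(pool))  # a small pool simply contributes less
        mixed.extend(rng.sample(pool, take))
    rng.shuffle(mixed)  # interleave sources so batches are not grouped by origin
    return mixed
```

Calling `mix_sources(pools, {"lichess_eval": 60, "selfplay": 30, "tactical": 10}, total=1_000_000)` would draw roughly 600k/300k/100k positions. Changing the ratio is then a one-line edit to the weights, with no shard rewritten.
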