# Concept: NNUE Training Data — Quality, Scale, and Transfer to Colab

Local generation + labeling is **not** a constraint (Ryzen 9800X3D / RTX 5070 / 32 GB).
So the design splits cleanly:

- **Data plane = local box.** Generate, label, shard, publish. Cheap, fast, no limits.
- **Train plane = Colab.** Pull a dataset version, GPU-train, export `.nbai`.

Colab never runs Stockfish and never sees a browser upload. Three problems below:
**(1) good data, (2) growing it over time, (3) getting it there easily** — (3) is the priority.

---

## 1. Generating *good* training sets

### The current weak spot

`generate.py` plays **fully random games** (`random.choice(legal_moves)`). Random play
produces positions that never occur in real games — material chaos, nonsense pawn
structures. An NNUE trained on that learns to evaluate a distribution it will never
face. Fine as filler, wrong as the backbone.

### What a good NNUE dataset needs

1. **Realistic position distribution.** Positions should resemble what the bot actually
   reaches in search — from real games and engine play, not coin-flip moves.
2. **Phase coverage.** Openings, middlegames, endgames all represented. Endgames are
   under-sampled by random play and matter most for precise eval.
3. **Eval balance.** Real game data is dominated by near-equal positions. If 80% of
   labels sit in `[-0.5, +0.5]`, the net learns "everything is roughly equal." Resample
   to flatten the eval histogram (cap per-bucket counts).
4. **Accurate labels.** Deeper Stockfish = better target. Locally you can afford
   depth 16–20. Or skip labeling entirely with the Lichess eval DB (below).
5. **Clean positions.** Dedup by FEN; drop terminal/checkmate/stalemate; the side to
   move should not already be in check unless intended; tag the game phase.

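
The eval-balance step in item 3 can be sketched as a per-bucket cap. This is a minimal sketch under stated assumptions, not the project's actual balancing code: the record field `eval` (assumed to be in pawns), the bucket width, the clamp and the cap are all illustrative choices.

```python
import math
import random
from collections import defaultdict


def balance_by_eval(records, bucket_width=0.5, cap=2, clamp=10.0, seed=0):
    """Keep at most `cap` records per eval bucket.

    `records` is a list of dicts with a hypothetical "eval" field in pawns.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        # Clamp extreme scores so huge or mate-like values share the outer buckets.
        ev = max(-clamp, min(clamp, rec["eval"]))
        buckets[math.floor(ev / bucket_width)].append(rec)

    kept = []
    for key in sorted(buckets):
        group = buckets[key]
        if len(group) > cap:
            # Over-full bucket: keep a random subset of exactly `cap` records.
            group = rng.sample(group, cap)
        kept.extend(group)
    return kept
```

Capping, rather than reweighting, keeps every rare decisive position and only thins the crowded near-equal buckets.
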
### Recommended source mix (per dataset version)

| Source | Role | How | Weight |
|---|---|---|---|
| **Lichess eval DB** | Backbone | `lichess_importer.py` — millions of FENs **pre-labeled** by deep Stockfish, real human positions, correct sign convention | 50–70% |
| **Engine self-play** | Bot's own distribution | NNUEBot (or vs Stockfish) plays games; sample positions; label with local Stockfish | 20–40% |
| **Tactical puzzles** | Sharp/critical positions | `tactical_positions_extractor.py` (Lichess puzzle DB) | 5–15% |
| **Random play** | Cheap diversity filler | existing `generate.py`, capped low | ≤10% |

The backbone is real, pre-labeled data — so labeling cost is near zero and quality is
high. Self-play is the part that adapts data to *your* bot. Random play stays only as
a thin diversity sprinkle.

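
To make the weights concrete, a mix can be turned into per-source position counts for a target dataset size. The helper below is a hypothetical sketch (the function name and the integer-percentage weights are assumptions); the example mix is the 60/30/10 starting point listed under open decisions.

```python
def source_quotas(total, weights):
    """Split `total` positions across sources in proportion to integer weights."""
    norm = sum(weights.values())
    # Integer arithmetic avoids float rounding surprises.
    quotas = {src: total * w // norm for src, w in weights.items()}
    # Hand any rounding leftover to the heaviest source so the quotas sum to `total`.
    leftover = total - sum(quotas.values())
    quotas[max(weights, key=weights.get)] += leftover
    return quotas


mix = {"lichess_eval": 60, "selfplay": 30, "tactical": 10}
print(source_quotas(1_000_000, mix))
```
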
### Self-play flywheel (the quality engine over time)

The strongest lever: **net N generates the games that train net N+1.**

```
net_vN ──play self-play games──► sample positions ──label (Stockfish)──►
  ▲                                                                       │
  └──────────────── train on (backbone + new self-play) ◄─────────────────┘
                                 net_v(N+1)
```

Each generation, the bot reaches positions closer to its real playing distribution,
labels them with a stronger-than-bot oracle (Stockfish), and learns the gap. Standard
modern NNUE practice. Keep the Lichess backbone mixed in every round so the net does
not overfit to its own blind spots.

---

## 2. Scaling datasets over time — append-only shards

Do **not** maintain one growing `labeled.jsonl` and re-copy it. Make a dataset an
**immutable set of shards plus a manifest**:

```
datasets/
  shards/
    lichess_000001.jsonl.zst      # ~50–100k positions each, ~5–10 MB compressed
    lichess_000002.jsonl.zst
    selfplay_v7_000001.jsonl.zst
    tactical_000001.jsonl.zst
    ...
  manifest.json
```

`manifest.json`:

```json
{
  "dataset_version": 7,
  "created": "2026-06-24T...",
  "total_positions": 4200000,
  "scale": 300.0,
  "shards": [
    {"file": "lichess_000001.jsonl.zst", "positions": 100000,
     "sha256": "...", "source": "lichess_eval", "stockfish_depth": 0},
    {"file": "selfplay_v7_000001.jsonl.zst", "positions": 80000,
     "sha256": "...", "source": "selfplay", "net": "v7", "stockfish_depth": 18}
  ]
}
```

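
Registering a finished shard could look roughly like the sketch below. It assumes the layout shown above (`shards/` next to `manifest.json`) and reuses the manifest field names; `register_shard` and `sha256_of` are hypothetical helpers, not functions from `dataset.py`. The manifest is written to a temporary file and then renamed, so an interrupted run should not leave a half-written manifest.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large shards never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_shard(dataset_dir, shard_name, positions, **provenance):
    """Append one manifest entry for an already-written shard file."""
    dataset_dir = Path(dataset_dir)
    manifest_path = dataset_dir / "manifest.json"
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())
    else:
        manifest = {"total_positions": 0, "shards": []}

    # Shards are immutable: registering the same file twice is an error.
    if any(s["file"] == shard_name for s in manifest["shards"]):
        raise ValueError(f"{shard_name} is already registered")

    entry = {
        "file": shard_name,
        "positions": positions,
        "sha256": sha256_of(dataset_dir / "shards" / shard_name),
        **provenance,  # e.g. source="selfplay", net="v7", stockfish_depth=18
    }
    manifest["shards"].append(entry)
    manifest["total_positions"] = sum(s["positions"] for s in manifest["shards"])

    # Write to a temp file, then rename over the old manifest.
    tmp_path = dataset_dir / "manifest.json.tmp"
    tmp_path.write_text(json.dumps(manifest, indent=2))
    tmp_path.replace(manifest_path)
    return entry
```
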
Properties this buys:

- **Growth = add shards.** Generate a new batch, label it, write one new shard, append
  one manifest entry. Never touch existing shards. O(new data), not O(total).
- **Provenance.** Each shard records source + net + depth. You can later down-weight or
  drop a bad batch by editing the manifest, no relabeling.
- **Dedup across shards** by FEN hash at build time; record dropped counts in metadata.
- **Reproducible mixes.** A "dataset version" is just a manifest selecting shards +
  per-source sampling weights. Cheap to define many mixes over the same shard pool.
- **Resumable, cache-friendly transfer** (next section) — the whole reason for shards.

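
Cross-shard dedup by FEN hash might be sketched as follows. Two choices here are assumptions rather than project decisions: the key covers only the first four FEN fields (so positions that differ only in move counters count as duplicates), and the hash is truncated to 8 bytes to keep the `seen` set small. The record field `fen` is also a hypothetical name. Passing the same `seen` set across calls carries the dedup state from one shard to the next.

```python
import hashlib


def fen_key(fen):
    """Hash placement, side to move, castling and en passant; ignore move counters."""
    core = " ".join(fen.split()[:4])
    return hashlib.sha1(core.encode("utf-8")).digest()[:8]


def dedup(records, seen=None):
    """Return (kept_records, dropped_count); `seen` is updated in place."""
    seen = set() if seen is None else seen
    kept, dropped = [], 0
    for rec in records:
        key = fen_key(rec["fen"])
        if key in seen:
            dropped += 1  # report this count in the dataset metadata
            continue
        seen.add(key)
        kept.append(rec)
    return kept, dropped
```
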
`dataset.py`'s existing `ds_vN` + `metadata.json` scheme generalizes to this directly:
the dataset dir holds `shards/` + `manifest.json` instead of one `labeled.jsonl`.

---

## 3. Getting data to Colab easily ← top priority

Shards make this trivial: **incremental sync, never a full re-upload.**

### Recommended: rclone → Google Drive, read from mounted Drive

Colab mounts Drive natively, so the cheapest path is to make Drive the shard store and
sync into it with `rclone` (only uploads new/changed shards):

```bash
# Local, after building shards:
rclone copy datasets/ gdrive:NowChess/datasets --progress
# ^ uploads only shards Drive doesn't have yet. Adding 80k positions = one small file.
```

Colab side, one cell:

```python
import json
import pathlib
import shutil

SRC = '/content/drive/MyDrive/NowChess/datasets'   # mounted Drive, no download step

with open(f'{SRC}/manifest.json') as fh:
    manifest = json.load(fh)

local = pathlib.Path('/content/datasets')
local.mkdir(exist_ok=True)

for sh in manifest['shards']:          # copy Drive→local SSD (fast sequential read)
    dst = local / sh['file']
    if not dst.exists():               # cache: only copy missing shards
        shutil.copy(f"{SRC}/shards/{sh['file']}", dst)
```

Why this wins on "easy":

- **No browser upload, ever.** One `rclone copy` from your PC.
- **Incremental both directions.** Add a shard locally → next `rclone copy` ships only
  that shard. Colab copies only shards it doesn't already have on `/content`.
- **Zero new infra.** Drive is already mounted in the notebook.

### Alternative: Gitea release per dataset version (if Drive quota hurts)

You self-host `git.janis-eccarius.de`. Tag `ds_v7`, attach shards + `manifest.json` as
release assets. Colab reads the manifest, then parallel-`wget` only the shards it lacks
(checksum-verified). Versioned, immutable, no Drive quota, token-gated. Slightly more
wiring than rclone→Drive.

Pick rclone→Drive for minimum friction; Gitea releases if you want hard versioning and
to keep Drive small.

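
With either backend, the "only the shards it lacks, checksum-verified" step amounts to comparing local files against the manifest. A minimal sketch, assuming the `sha256` fields hold real hex digests; `shards_to_fetch` is a hypothetical helper name:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def shards_to_fetch(manifest, local_dir):
    """Return shard file names that are missing locally or fail their checksum."""
    local_dir = Path(local_dir)
    todo = []
    for entry in manifest["shards"]:
        path = local_dir / entry["file"]
        if not path.exists() or sha256_of(path) != entry["sha256"]:
            todo.append(entry["file"])
    return todo
```

Checking the hash as well as existence also catches a shard that was truncated by an interrupted copy, which a plain `exists()` test would treat as cached.
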
### Notebook changes either way

- Clone repo to **ephemeral `/content`** (fast), not Drive. Persist only datasets +
  checkpoints.
- Drop Option A (no Colab generation) and Option B (no browser upload). One "sync
  dataset version" cell instead.
- Train reads shards via a streaming `.jsonl.zst` loader (apply per-source sampling
  weights + eval-bucket balancing here). Keep burst-train + Drive checkpoints + `.nbai`
  export.

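
A streaming shard reader with per-source down-sampling could look like the sketch below. It is not the `NNUEDataset` in `train.py`: the function names are hypothetical, the per-source keep probabilities stand in for the sampling weights, and decompression relies on the third-party `zstandard` package (imported lazily, so plain `.jsonl` files also work). In PyTorch, a generator like this would typically be wrapped in an `IterableDataset`.

```python
import io
import json
import random
from pathlib import Path


def open_shard(path):
    """Open a shard as a text stream; .zst files are decompressed on the fly."""
    path = Path(path)
    if path.suffix == ".zst":
        import zstandard  # third-party dependency: pip install zstandard

        raw = open(path, "rb")
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        return io.TextIOWrapper(reader, encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def stream_positions(shards_dir, entries, keep_prob=None, seed=0):
    """Yield one record per JSONL line without loading whole shards.

    `entries` are manifest shard entries ("file", "source", ...).
    `keep_prob` maps a source name to the probability of keeping each of
    its records; sources not listed are kept in full.
    """
    rng = random.Random(seed)
    keep_prob = keep_prob or {}
    for entry in entries:
        p = keep_prob.get(entry.get("source"), 1.0)
        with open_shard(Path(shards_dir) / entry["file"]) as fh:
            for line in fh:
                if not line.strip():
                    continue  # tolerate blank lines
                if p >= 1.0 or rng.random() < p:
                    yield json.loads(line)
```
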
---

## Resulting workflow

```
LOCAL (9800X3D / RTX5070)                           COLAB (GPU)
─────────────────────────                           ───────────
import Lichess eval DB ─┐
self-play with net_vN  ─┼─► label ─► dedup ─► write new shard(s) ─► manifest++
tactical / random      ─┘                                               │
                                                    rclone copy ────────┘
                                                    datasets/ → Drive
                                                            │  (only new shards move)
                                                            ▼
                                                      sync version → copy missing shards → train (GPU)
                                                                                                │
                                                                                          export .nbai
                                                                                                ▼
                                                      place in src/main/resources/, rebuild native image
```

## Build order

1. **Shard format + manifest** in `dataset.py`: write/read `shards/*.jsonl.zst` +
   `manifest.json`; dedup-across-shards on build; provenance per shard.
2. **Streaming `.zst` dataloader** in `train.py`: read shards, apply per-source weights
   and eval-bucket balancing.
3. **Self-play generator** in `src/`: NNUEBot/Stockfish self-play → positions → local
   Stockfish label → new shard. This is the scaling engine.
4. **`dataset_sync.py`**: `push` (rclone→Drive or Gitea upload) / `pull` (cache-aware).
5. **Notebook rewrite**: ephemeral clone, single sync cell, weighted streaming loader.
6. Wire `lichess_importer.py` as the backbone shard source.

## Open decisions

- **Transfer backend** — rclone→Drive (easiest, recommended) vs Gitea releases (hard
  versioning).
- **Self-play opponent** — NNUEBot vs itself (own distribution) or NNUEBot vs Stockfish
  (stronger, more decisive games). Likely a mix.
- **Backbone/self-play ratio** — start ~60/30/10 (lichess/selfplay/tactical), tune by
  measured strength.
- **Shard size** — 50k vs 100k positions/shard (transfer granularity vs file count).

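
The shard-size trade-off is easy to put numbers on. The sketch below assumes roughly 100 compressed bytes per position, which is simply what the "~50–100k positions, ~5–10 MB" figure above implies; real shards will vary with record format and compression level. `shard_plan` is a hypothetical helper.

```python
import math


def shard_plan(total_positions, positions_per_shard, bytes_per_position=100):
    """Return (number_of_shards, approximate_shard_size_in_MB)."""
    shards = math.ceil(total_positions / positions_per_shard)
    shard_mb = positions_per_shard * bytes_per_position / 1e6
    return shards, shard_mb


for size in (50_000, 100_000):
    shards, mb = shard_plan(4_200_000, size)
    print(f"{size:>7} positions/shard -> {shards} shards of ~{mb:.0f} MB")
```

For the 4.2M-position manifest shown earlier, this gives 84 shards of about 5 MB at 50k per shard, or 42 shards of about 10 MB at 100k per shard.
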