Commit Graph

5 Commits

Author SHA1 Message Date
Janis Eccarius 9d656624d8 fix(official-bots): stream NNUE features as sparse indices to stop host OOM
Build & Test (NowChessSystems) TeamCity build finished
Densifying the 98304-dim HalfKP vector per item filled host RAM and crashed the
Colab runtime even at small batch sizes. The dataset now yields only the ~64
active feature indices; a custom collate carries (row, col) pairs and the
training loop scatters them into a dense [B, INPUT_SIZE] tensor on the GPU. Host
RAM stays tiny; GPU holds one dense batch transiently.

- NNUEDataset.__getitem__ returns indices via new fen_to_indices.
- fen_to_features now derives from fen_to_indices (kept for external callers).
- _collate_sparse builds row/col index batches; loaders use it.
- train/val loops scatter to a GPU dense batch; loss weighting uses batch size.
- Notebook: BATCH_SIZE 4096 -> 8192 (host no longer the limit; GPU is).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:28:53 +02:00
Janis Eccarius e2b4342f60 fix(official-bots): prevent Colab OOM in NNUE training
Build & Test (NowChessSystems) TeamCity build finished
Dense 98304-dim HalfKP features at batch_size=16384 cost ~6.4 GB/batch on the
host; with 8 hardcoded DataLoader workers and prefetch this OOM-killed the Colab
runtime.

- train.py: adaptive DataLoader workers (min(4, cpu_count), Colab free tier = 2),
  overridable via NNUE_LOADER_WORKERS; persistent_workers only when > 0.
- NNUETraining.ipynb: lower BATCH_SIZE 16384 -> 4096 with a memory-cost note.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:18:18 +02:00
Janis Eccarius 1c80abdb8a feat(official-bots): standalone self-play + one-shot dataset builder for NNUE training
Build & Test (NowChessSystems) TeamCity build finished
Add an easy local data pipeline feeding GPU training on Colab.

- SelfPlayMain: standalone NNUEBot self-play (no microservices) writing FENs
  for labeling; randomised openings for game diversity, sequential due to the
  shared EvaluationNNUE accumulator. Exposed via the `selfPlay` Gradle task and
  selfplay.sh.
- NNUEBot: optional fixedMoveTimeMs so self-play runs fast (default unchanged).
- NbaiLoader: honor `-Dnnue.weights=<path>` to load weights from a file before
  falling back to the bundled resource.
- build_dataset.py / dataset.sh: one command builds the entire dataset
  (Lichess eval-DB backbone + self-play + tactical + random filler), dedups,
  balances the eval histogram, writes append-only zstd shards + manifest, and
  rclone-pushes to Drive.
- train.py: NNUEDataset reads a directory of .jsonl.zst shards (streaming) in
  addition to a single file.
- NNUETraining.ipynb: clone to ephemeral /content, sync shards from Drive
  (cache-aware), train on the shards dir; removed Colab generation/upload steps.
- Concept + implementation plan docs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-24 22:04:22 +02:00
Janis Eccarius 9f9140cb58 fix: modified training pipeline
Build & Test (NowChessSystems) TeamCity build finished
2026-06-24 19:37:26 +02:00
Janis fa10852bc9 feat(official-bots): add Google Colab notebook for NNUE training (NCS-111) (#81)
Build & Test (NowChessSystems) TeamCity build finished
Adds python/NNUETraining.ipynb with five sections:
- Setup: mount Drive, clone/update repo, install deps + Stockfish
- Data: Option A (generate + label) or Option B (upload existing labeled.jsonl)
- Train: standard epoch loop or burst mode (recommended for Colab free tier)
- Export: convert best .pt checkpoint to .nbai via export.py
- Download: pull .nbai and .pt to local machine via files.download

Checkpoints and datasets are persisted to Google Drive so training
survives session disconnects and can be resumed automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Janis Eccarius <eccariusjanis@gmail.com>
Reviewed-on: #81
2026-06-24 19:33:24 +02:00