fix(official-bots): stream NNUE features as sparse indices to stop host OOM
Build & Test (NowChessSystems) TeamCity build finished
Densifying the 98304-dim HalfKP vector per item filled host RAM and crashed the Colab runtime even at small batch sizes. The dataset now yields only the ~64 active feature indices; a custom collate carries (row, col) pairs and the training loop scatters them into a dense [B, INPUT_SIZE] tensor on the GPU. Host RAM stays tiny; GPU holds one dense batch transiently.

- NNUEDataset.__getitem__ returns indices via new fen_to_indices.
- fen_to_features now derives from fen_to_indices (kept for external callers).
- _collate_sparse builds row/col index batches; loaders use it.
- train/val loops scatter to a GPU dense batch; loss weighting uses batch size.
- Notebook: BATCH_SIZE 4096 -> 8192 (host no longer the limit; GPU is).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
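The collate-then-scatter flow described above can be sketched as follows. This is a minimal illustration, not the code from this commit: it uses plain Python lists in place of tensors so it runs anywhere, and `collate_sparse` and `scatter_dense` are hypothetical names standing in for `_collate_sparse` and the scatter step in the training loop. Each dataset item is assumed to be an `(active_indices, target)` pair.

```python
from typing import List, Sequence, Tuple

INPUT_SIZE = 98304  # feature dimension quoted in the commit message


def collate_sparse(batch: Sequence[Tuple[List[int], float]]):
    """Flatten per-item active indices into parallel (row, col) lists.

    Hypothetical stand-in for _collate_sparse. The item layout
    (active_indices, target) is an assumption made for this sketch.
    """
    rows: List[int] = []
    cols: List[int] = []
    targets: List[float] = []
    for row, (indices, target) in enumerate(batch):
        rows.extend([row] * len(indices))  # one row id per active feature
        cols.extend(indices)               # the feature index itself
        targets.append(target)
    return rows, cols, targets


def scatter_dense(rows, cols, batch_size, input_size=INPUT_SIZE):
    """Build the dense [batch_size, input_size] 0/1 matrix from the pairs."""
    dense = [[0.0] * input_size for _ in range(batch_size)]
    for r, c in zip(rows, cols):
        dense[r][c] = 1.0
    return dense


# Tiny usage example with an 8-wide input instead of 98304.
batch = [([0, 5, 7], 0.25), ([2, 5], -1.0)]
rows, cols, targets = collate_sparse(batch)
dense = scatter_dense(rows, cols, batch_size=len(batch), input_size=8)
```

Only `rows`, `cols` and `targets` need to cross from the loader workers to the training loop. With PyTorch tensors the scatter step is typically an index assignment such as `dense[rows, cols] = 1.0` on a zero tensor already allocated on the device, so the dense batch never exists in host memory.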
@@ -92,7 +92,7 @@
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "from train import train_nnue, burst_train, DEFAULT_HIDDEN_SIZES\n\nWEIGHTS_DIR = Path(DRIVE_ROOT) / 'weights'\nWEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\nOUTPUT_FILE = str(WEIGHTS_DIR / 'nnue_weights.pt')\n\n# ── Training hyperparameters ──────────────────────────────────────────────────\nHIDDEN_SIZES = DEFAULT_HIDDEN_SIZES\n# fen_to_features builds a DENSE 98304-dim input, so a batch costs\n# batch_size * 98304 * 4 bytes on the host (× DataLoader prefetch). On Colab's\n# ~12 GB RAM keep this small; raise it only if you have headroom.\nBATCH_SIZE = 4096\nEPOCHS = 100\nEARLY_STOPPING = 10 # None to disable\nSUBSAMPLE_RATIO = 1.0\n\n# Resume from latest checkpoint if one exists\ncheckpoints = sorted(WEIGHTS_DIR.glob('nnue_weights_v*.pt'))\nCHECKPOINT = str(checkpoints[-1]) if checkpoints else None\nif CHECKPOINT:\n print(f'Resuming from checkpoint: {CHECKPOINT}')\nelse:\n print('Starting training from scratch.')",
"source": "from train import train_nnue, burst_train, DEFAULT_HIDDEN_SIZES\n\nWEIGHTS_DIR = Path(DRIVE_ROOT) / 'weights'\nWEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\nOUTPUT_FILE = str(WEIGHTS_DIR / 'nnue_weights.pt')\n\n# ── Training hyperparameters ──────────────────────────────────────────────────\nHIDDEN_SIZES = DEFAULT_HIDDEN_SIZES\n# Features are streamed as sparse indices and densified on the GPU per batch, so\n# host RAM is no longer the limit — GPU memory is. A dense batch is\n# batch_size * 98304 * 4 bytes on the GPU (~3.2 GB at 8192 on a 16 GB T4).\nBATCH_SIZE = 8192\nEPOCHS = 100\nEARLY_STOPPING = 10 # None to disable\nSUBSAMPLE_RATIO = 1.0\n\n# Resume from latest checkpoint if one exists\ncheckpoints = sorted(WEIGHTS_DIR.glob('nnue_weights_v*.pt'))\nCHECKPOINT = str(checkpoints[-1]) if checkpoints else None\nif CHECKPOINT:\n print(f'Resuming from checkpoint: {CHECKPOINT}')\nelse:\n print('Starting training from scratch.')",
"id": "train-config"
},
{
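The memory figures in the notebook comments follow from `batch_size * 98304 * 4` bytes per dense float32 batch. The sketch below reproduces that arithmetic and, for contrast, estimates the size of the sparse index batch. The sparse estimate rests on two assumptions that are not stated in the diff: exactly 64 active indices per item (the commit says "~64") and 8-byte integer row and column entries.

```python
INPUT_SIZE = 98304         # dense feature dimension from the notebook comment
BYTES_PER_FLOAT32 = 4
ACTIVE_PER_ITEM = 64       # assumption: the commit message says "~64"
BYTES_PER_INDEX = 8        # assumption: int64 row and col entries


def dense_batch_bytes(batch_size: int, input_size: int = INPUT_SIZE) -> int:
    """Bytes for one dense float32 batch of shape [batch_size, input_size]."""
    return batch_size * input_size * BYTES_PER_FLOAT32


def sparse_batch_bytes(batch_size: int, active: int = ACTIVE_PER_ITEM) -> int:
    """Bytes for the (row, col) index pairs of one batch."""
    return batch_size * active * 2 * BYTES_PER_INDEX


for bs in (4096, 8192):
    print(
        f"batch {bs}: dense {dense_batch_bytes(bs) / 1e9:.2f} GB, "
        f"sparse {sparse_batch_bytes(bs) / 1e6:.1f} MB"
    )
```

At 8192 the dense batch comes to 3,221,225,472 bytes, matching the "~3.2 GB" in the new comment, while the index pairs under the assumptions above stay in the single-digit megabyte range.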