# NNUE Training Guide: Incremental Training & Versioning

## Overview

The improved `train_nnue.py` now supports:

1. **Incremental training** — Resume from a checkpoint and continue training on new data
2. **Automatic versioning** — Each training run is saved as `nnue_weights_v{N}.pt`
3. **Metadata tracking** — Date, position count, depth, and losses stored in JSON
4. **CLI flags** — Full control over training parameters

## Quick Start

### First Training Run (Fresh Start)

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```

This saves:

- `nnue_weights_v1.pt` — The trained weights
- `nnue_weights_v1_metadata.json` — Training metadata
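Under the hood, the next version number has to be derived from the files already on disk. A minimal sketch of how such auto-versioning can work (the helper `next_version_path` is illustrative, not the actual `train_nnue.py` API):

```python
import re
from pathlib import Path

def next_version_path(base: Path) -> Path:
    """Find the next free versioned filename, e.g. nnue_weights_v3.pt."""
    stem, suffix = base.stem, base.suffix  # "nnue_weights", ".pt"
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(suffix)}$")
    versions = [
        int(m.group(1))
        for p in base.parent.glob(f"{stem}_v*{suffix}")
        if (m := pattern.match(p.name))
    ]
    return base.parent / f"{stem}_v{max(versions, default=0) + 1}{suffix}"
```

With no versioned files present this yields `_v1`; with `_v1` and `_v2` on disk it yields `_v3`.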

### Continue Training (Incremental)

Add more positions to `training_data.jsonl`, then rerun the same command:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```

The trainer will:

1. Detect that `nnue_weights.pt` exists
2. Load it as a checkpoint automatically
3. Continue training on all the data
4. Save the result as `nnue_weights_v2.pt` with updated metadata

Alternatively, specify a checkpoint explicitly:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```

## Advanced Usage

### Custom Training Parameters

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --epochs 30 \
    --batch-size 2048 \
    --lr 5e-4 \
    --stockfish-depth 14
```

- `--epochs` — Number of passes through the data (default: 20)
- `--batch-size` — Samples per gradient update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth of the Stockfish evaluations (recorded in metadata only)

### Explicit Checkpoint

Resume from a specific checkpoint (instead of `nnue_weights.pt`):

```bash
python train_nnue.py training_data_v2.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt
```

### Disable Versioning

Save directly to the output file without versioning:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
```

This overwrites `nnue_weights.pt` instead of creating `nnue_weights_v2.pt`.

## Incremental Training Workflow

A typical workflow for improving the model over time:

**Step 1: Initial Training**
```bash
# Generate 500K positions with Stockfish
./run_pipeline.sh

# This saves:
# - nnue_weights_v1.pt
# - nnue_weights_v1_metadata.json
```

**Step 2: Generate More Positions**
```bash
# Later, generate 500K more positions
# Append to training_data.jsonl or create a new file

# Label with Stockfish at depth 16 (more thorough)
python label_positions.py positions_batch2.txt training_data_batch2.jsonl stockfish --stockfish-depth 16

# Combine datasets
cat training_data_batch1.jsonl training_data_batch2.jsonl > training_data_combined.jsonl
```

**Step 3: Continue Training**
```bash
# Train on the combined data, starting from the v1 checkpoint
python train_nnue.py training_data_combined.jsonl nnue_weights.pt

# Saves:
# - nnue_weights_v2.pt (improved)
# - nnue_weights_v2_metadata.json
```

**Step 4: Benchmark & Choose**
```bash
# Test both versions in matches
# If v2 is better, use it; otherwise keep v1

# Update NNUEWeights.scala with the best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
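Concatenated batches can contain duplicate positions, which over-weights them during training. A hedged sketch that de-duplicates a combined JSONL file, assuming each line is a JSON object with a `fen` field (the actual field names in `training_data.jsonl` may differ):

```python
import json

def dedupe_jsonl(in_path: str, out_path: str) -> int:
    """Keep the last occurrence of each FEN (later labels are usually deeper)."""
    latest = {}
    with open(in_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            latest[rec["fen"]] = rec
    with open(out_path, "w") as f:
        for rec in latest.values():
            f.write(json.dumps(rec) + "\n")
    return len(latest)  # number of unique positions written
```

Keeping the last occurrence means a depth-16 relabel of an old position replaces its depth-12 label after concatenation.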

## Metadata File Format

Each training session generates a JSON metadata file, e.g., `nnue_weights_v2_metadata.json`:

```json
{
  "version": 2,
  "date": "2026-04-07T21:45:30.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
```

### Fields

- **version**: Training version number (v1, v2, etc.)
- **date**: ISO timestamp of training start
- **num_positions**: Total positions in the dataset
- **stockfish_depth**: Depth of the Stockfish evaluations (from the command-line flag)
- **epochs**: Number of training passes
- **batch_size**: Training batch size
- **learning_rate**: Adam optimizer learning rate
- **final_val_loss**: Best validation loss achieved
- **device**: GPU (`cuda`) or CPU used for training
- **checkpoint**: Previous model used as the starting point (`null` if trained from scratch)
- **notes**: Win-rate comparison (currently TBD — requires benchmark games)
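Because every version writes one of these files, the training history can be summarized programmatically. A small sketch, assuming the metadata format shown above:

```python
import json
from pathlib import Path

def summarize_versions(directory: str = ".") -> list[dict]:
    """Collect all version metadata files, best (lowest) validation loss first."""
    records = []
    for path in sorted(Path(directory).glob("nnue_weights_v*_metadata.json")):
        with open(path) as f:
            records.append(json.load(f))
    return sorted(records, key=lambda r: r["final_val_loss"])

for rec in summarize_versions():
    print(f"v{rec['version']}: val_loss={rec['final_val_loss']:.6f} "
          f"({rec['num_positions']} positions, depth {rec['stockfish_depth']})")
```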

## Checkpoint Logic

When you run training, the trainer resolves a checkpoint in this order:

1. **Explicit checkpoint** — If you provide `--checkpoint`, use it
2. **Auto-detect** — If the output file (e.g., `nnue_weights.pt`) exists, load it
3. **From scratch** — Otherwise, initialize with random weights

Example:

```bash
# First run: from scratch (no nnue_weights.pt exists)
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Creates v1 from scratch, saves as nnue_weights_v1.pt

# Second run: auto-detects nnue_weights.pt as checkpoint
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Loads nnue_weights_v1.pt (because nnue_weights.pt = v1), saves as v2

# Third run: explicit checkpoint
python train_nnue.py training_data_huge.jsonl nnue_weights.pt --checkpoint nnue_weights_v2.pt
# → Loads v2, saves as v3
```
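The three-step resolution above can be sketched as follows (the function name `resolve_checkpoint` is illustrative, not the actual `train_nnue.py` API):

```python
from pathlib import Path
from typing import Optional

def resolve_checkpoint(output: Path, explicit: Optional[Path] = None) -> Optional[Path]:
    """Return the checkpoint to load, or None to start from scratch."""
    if explicit is not None:   # 1. the --checkpoint flag always wins
        return explicit
    if output.exists():        # 2. auto-detect the output file
        return output
    return None                # 3. fresh random initialization
```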

## Resuming Interrupted Training

If training is interrupted (power loss, Ctrl-C), you can resume by rerunning the same command:

```bash
# Original command
python train_nnue.py training_data.jsonl nnue_weights.pt

# If interrupted, the same command will:
# 1. Detect that nnue_weights_v1.pt exists (or a higher version)
# 2. Auto-load it as a checkpoint
# 3. Resume training
# 4. Save the next version (v2, v3, etc.)
```

## Performance Tips

### Tune Memory, Time, and Stability

```bash
# Smaller batch size: slower, but uses less GPU memory
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024

# Fewer epochs: faster, at some cost in accuracy
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5

# Lower learning rate: slower convergence, but more stable
python train_nnue.py training_data.jsonl nnue_weights.pt --lr 5e-4
```

### Accelerate on GPU

If you have an NVIDIA GPU with CUDA:

```bash
# Training will automatically use CUDA
# Check the metadata "device" field: it should be "cuda", not "cpu"
python train_nnue.py training_data.jsonl nnue_weights.pt
```

If training uses the CPU even though a GPU is available:

```bash
# Reinstall PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
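Before launching a long run, it can help to confirm that PyTorch actually sees the GPU. This uses the standard `torch.cuda` API:

```python
import torch

# Same check the trainer presumably performs when picking a device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```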

### Efficient Incremental Training

```bash
# Fine-tune v1 on slightly different data (few epochs, reduced learning rate)
python train_nnue.py new_positions.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt \
    --epochs 3 \
    --lr 5e-4

# Full retraining on combined data (slower, but better)
python train_nnue.py all_positions.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt \
    --epochs 20 \
    --lr 1e-3
```

## Version Management

### List All Versions

```bash
ls -la nnue_weights_v*.pt
ls -la nnue_weights_v*_metadata.json
```

### Compare Versions

```bash
grep "final_val_loss" nnue_weights_v1_metadata.json
grep "final_val_loss" nnue_weights_v2_metadata.json
grep "final_val_loss" nnue_weights_v3_metadata.json
```

A lower validation loss indicates a better fit to the Stockfish labels; benchmark games remain the final arbiter of playing strength.

### Benchmark Best Version

After training multiple versions, benchmark them against each other:

```bash
# Export v1 and play some games
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Export v2 and benchmark
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Keep the best, archive the others
```

### Archive Old Versions

```bash
# Keep only recent versions
mkdir -p old_models
mv nnue_weights_v1.pt old_models/
mv nnue_weights_v1_metadata.json old_models/
```

## Troubleshooting

### "FileNotFoundError: training_data.jsonl not found"

```bash
# Make sure you're in the python/ directory
cd modules/bot/python

# Or provide the full path
python train_nnue.py /full/path/to/training_data.jsonl nnue_weights.pt
```

### "CUDA out of memory"

Reduce the batch size:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 2048
```

### Training seems slow (using CPU instead of GPU)

```bash
# Check the metadata of a training run
grep device nnue_weights_v1_metadata.json

# If it says "cpu", reinstall PyTorch with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
```

### "Checkpoint file corrupted"

```bash
# Start over from scratch (don't load the corrupted checkpoint)
python train_nnue.py training_data.jsonl nnue_weights_fresh.pt --no-versioning

# Or resume from an earlier version
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```

## Integration with Pipeline

The `run_pipeline.sh` script now supports incremental training:

```bash
# First run: generates data, trains v1
./run_pipeline.sh

# Add more positions
# ... generate more, label more ...

# Second run: trains on the combined data as v2
./run_pipeline.sh
```

## Example: Full Workflow

```bash
cd modules/bot/python

# Session 1: Initial training
chmod +x run_pipeline.sh
export STOCKFISH_PATH=/usr/bin/stockfish
./run_pipeline.sh
# Creates: nnue_weights_v1.pt, nnue_weights_v1_metadata.json

# Session 2: Improve with deeper analysis
# (manually evaluate more positions at depth 14)
python label_positions.py positions_v2.txt training_data_v2.jsonl \
    /usr/bin/stockfish --stockfish-depth 14

# Combine and retrain
cat training_data_v1.jsonl training_data_v2.jsonl > training_data_combined.jsonl

python train_nnue.py training_data_combined.jsonl nnue_weights.pt \
    --epochs 25 \
    --stockfish-depth 14
# Creates: nnue_weights_v2.pt, nnue_weights_v2_metadata.json

# Session 3: Benchmark and choose
# Test both v1 and v2 with matches...
# If v2 is better, export and use it
python export_weights.py nnue_weights_v2.pt \
    ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala

cd ../..
./compile && ./test
```

## See Also

- `train_nnue.py --help` — Command-line help
- `README_NNUE.md` — Complete pipeline documentation
- `NNUE_IMPLEMENTATION_SUMMARY.md` — Technical architecture