# NNUE Training Guide: Incremental Training & Versioning

## Overview

The improved `train_nnue.py` now supports:

1. **Incremental training** — Resume from a checkpoint and continue training on new data
2. **Automatic versioning** — Each training run is saved as `nnue_weights_v{N}.pt`
3. **Metadata tracking** — Date, position count, depth, and losses stored in JSON
4. **CLI flags** — Full control over training parameters

## Quick Start

### First Training Run (Fresh Start)

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```

This saves:

- `nnue_weights_v1.pt` — The trained weights
- `nnue_weights_v1_metadata.json` — Training metadata
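Under the hood, the next version number has to be derived from the files already on disk. A minimal sketch of how such auto-versioning can work (the helper `next_version_path` is illustrative, not the actual `train_nnue.py` API):

```python
import re
from pathlib import Path

def next_version_path(base: Path) -> Path:
    """Find the next free versioned filename, e.g. nnue_weights_v3.pt."""
    stem, suffix = base.stem, base.suffix  # "nnue_weights", ".pt"
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(suffix)}$")
    versions = [
        int(m.group(1))
        for p in base.parent.glob(f"{stem}_v*{suffix}")
        if (m := pattern.match(p.name))
    ]
    return base.parent / f"{stem}_v{max(versions, default=0) + 1}{suffix}"
```

With no versioned files present this yields `_v1`; with `_v1` and `_v2` on disk it yields `_v3`.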

### Continue Training (Incremental)

Add more positions to `training_data.jsonl`, then rerun the same command:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```

The trainer will:

1. Detect that `nnue_weights.pt` exists
2. Load it as a checkpoint automatically
3. Continue training on all the data
4. Save the result as `nnue_weights_v2.pt` with updated metadata

Alternatively, specify a checkpoint explicitly:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```

## Advanced Usage

### Custom Training Parameters

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --epochs 30 \
    --batch-size 2048 \
    --lr 5e-4 \
    --stockfish-depth 14
```

- `--epochs` — Number of passes through the data (default: 20)
- `--batch-size` — Samples per gradient update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth of the Stockfish evaluations (recorded in metadata only)

### Explicit Checkpoint

Resume from a specific checkpoint (instead of `nnue_weights.pt`):

```bash
python train_nnue.py training_data_v2.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt
```

### Disable Versioning

Save directly to the output file without versioning:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
```

This overwrites `nnue_weights.pt` instead of creating `nnue_weights_v2.pt`.

## Incremental Training Workflow

A typical workflow for improving the model over time:

**Step 1: Initial Training**
```bash
# Generate 500K positions with Stockfish
./run_pipeline.sh

# This saves:
# - nnue_weights_v1.pt
# - nnue_weights_v1_metadata.json
```

**Step 2: Generate More Positions**
```bash
# Later, generate 500K more positions
# Append to training_data.jsonl or create a new file

# Label with Stockfish at depth 16 (more thorough)
python label_positions.py positions_batch2.txt training_data_batch2.jsonl stockfish --stockfish-depth 16

# Combine datasets
cat training_data_batch1.jsonl training_data_batch2.jsonl > training_data_combined.jsonl
```

**Step 3: Continue Training**
```bash
# Train on the combined data, starting from the v1 checkpoint
python train_nnue.py training_data_combined.jsonl nnue_weights.pt

# Saves:
# - nnue_weights_v2.pt (improved)
# - nnue_weights_v2_metadata.json
```

**Step 4: Benchmark & Choose**
```bash
# Test both versions in matches
# If v2 is better, use it; otherwise keep v1

# Update NNUEWeights.scala with the best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
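Concatenated batches can contain duplicate positions, which over-weights them during training. A hedged sketch that de-duplicates a combined JSONL file, assuming each line is a JSON object with a `fen` field (the actual field names in `training_data.jsonl` may differ):

```python
import json

def dedupe_jsonl(in_path: str, out_path: str) -> int:
    """Keep the last occurrence of each FEN (later labels are usually deeper)."""
    latest = {}
    with open(in_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            latest[rec["fen"]] = rec
    with open(out_path, "w") as f:
        for rec in latest.values():
            f.write(json.dumps(rec) + "\n")
    return len(latest)  # number of unique positions written
```

Keeping the last occurrence means a depth-16 relabel of an old position replaces its depth-12 label after concatenation.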

## Metadata File Format

Each training session generates a JSON metadata file, e.g., `nnue_weights_v2_metadata.json`:

```json
{
  "version": 2,
  "date": "2026-04-07T21:45:30.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
```

### Fields

- **version**: Training version number (v1, v2, etc.)
- **date**: ISO timestamp of training start
- **num_positions**: Total positions in the dataset
- **stockfish_depth**: Depth of the Stockfish evaluations (from the command-line flag)
- **epochs**: Number of training passes
- **batch_size**: Training batch size
- **learning_rate**: Adam optimizer learning rate
- **final_val_loss**: Best validation loss achieved
- **device**: GPU (`cuda`) or CPU used for training
- **checkpoint**: Previous model used as the starting point (`null` if trained from scratch)
- **notes**: Win-rate comparison (currently TBD — requires benchmark games)
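Because every version writes one of these files, the training history can be summarized programmatically. A small sketch, assuming the metadata format shown above:

```python
import json
from pathlib import Path

def summarize_versions(directory: str = ".") -> list[dict]:
    """Collect all version metadata files, best (lowest) validation loss first."""
    records = []
    for path in sorted(Path(directory).glob("nnue_weights_v*_metadata.json")):
        with open(path) as f:
            records.append(json.load(f))
    return sorted(records, key=lambda r: r["final_val_loss"])

for rec in summarize_versions():
    print(f"v{rec['version']}: val_loss={rec['final_val_loss']:.6f} "
          f"({rec['num_positions']} positions, depth {rec['stockfish_depth']})")
```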

## Checkpoint Logic

When you run training, the trainer resolves a checkpoint in this order:

1. **Explicit checkpoint** — If you provide `--checkpoint`, use it
2. **Auto-detect** — If the output file (e.g., `nnue_weights.pt`) exists, load it
3. **From scratch** — Otherwise, initialize with random weights

Example:

```bash
# First run: from scratch (no nnue_weights.pt exists)
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Creates v1 from scratch, saves as nnue_weights_v1.pt

# Second run: auto-detects nnue_weights.pt as checkpoint
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Loads nnue_weights_v1.pt (because nnue_weights.pt = v1), saves as v2

# Third run: explicit checkpoint
python train_nnue.py training_data_huge.jsonl nnue_weights.pt --checkpoint nnue_weights_v2.pt
# → Loads v2, saves as v3
```
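The three-step resolution above can be sketched as follows (the function name `resolve_checkpoint` is illustrative, not the actual `train_nnue.py` API):

```python
from pathlib import Path
from typing import Optional

def resolve_checkpoint(output: Path, explicit: Optional[Path] = None) -> Optional[Path]:
    """Return the checkpoint to load, or None to start from scratch."""
    if explicit is not None:   # 1. the --checkpoint flag always wins
        return explicit
    if output.exists():        # 2. auto-detect the output file
        return output
    return None                # 3. fresh random initialization
```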

## Resuming Interrupted Training

If training is interrupted (power loss, Ctrl-C), you can resume by rerunning the same command:

```bash
# Original command
python train_nnue.py training_data.jsonl nnue_weights.pt

# If interrupted, the same command will:
# 1. Detect that nnue_weights_v1.pt exists (or a higher version)
# 2. Auto-load it as a checkpoint
# 3. Resume training
# 4. Save the next version (v2, v3, etc.)
```

## Performance Tips

### Tune Memory, Time, and Stability

```bash
# Smaller batch size: slower, but uses less GPU memory
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024

# Fewer epochs: faster, at some cost in accuracy
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5

# Lower learning rate: slower convergence, but more stable
python train_nnue.py training_data.jsonl nnue_weights.pt --lr 5e-4
```

### Accelerate on GPU

If you have an NVIDIA GPU with CUDA:

```bash
# Training will automatically use CUDA
# Check the metadata "device" field: it should be "cuda", not "cpu"
python train_nnue.py training_data.jsonl nnue_weights.pt
```

If training uses the CPU even though a GPU is available:

```bash
# Reinstall PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
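Before launching a long run, it can help to confirm that PyTorch actually sees the GPU. This uses the standard `torch.cuda` API:

```python
import torch

# Same check the trainer presumably performs when picking a device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Training device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```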

### Efficient Incremental Training

```bash
# Fine-tune v1 on slightly different data (few epochs, reduced learning rate)
python train_nnue.py new_positions.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt \
    --epochs 3 \
    --lr 5e-4

# Full retraining on combined data (slower, but better)
python train_nnue.py all_positions.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt \
    --epochs 20 \
    --lr 1e-3
```

## Version Management

### List All Versions

```bash
ls -la nnue_weights_v*.pt
ls -la nnue_weights_v*_metadata.json
```

### Compare Versions

```bash
grep "final_val_loss" nnue_weights_v1_metadata.json
grep "final_val_loss" nnue_weights_v2_metadata.json
grep "final_val_loss" nnue_weights_v3_metadata.json
```

A lower validation loss indicates a better fit to the Stockfish labels; benchmark games remain the final arbiter of playing strength.

### Benchmark Best Version

After training multiple versions, benchmark them against each other:

```bash
# Export v1 and play some games
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Export v2 and benchmark
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Keep the best, archive the others
```

### Archive Old Versions

```bash
# Keep only recent versions
mkdir -p old_models
mv nnue_weights_v1.pt old_models/
mv nnue_weights_v1_metadata.json old_models/
```

## Troubleshooting

### "FileNotFoundError: training_data.jsonl not found"

```bash
# Make sure you're in the python/ directory
cd modules/bot/python

# Or provide the full path
python train_nnue.py /full/path/to/training_data.jsonl nnue_weights.pt
```

### "CUDA out of memory"

Reduce the batch size:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 2048
```

### Training seems slow (using CPU instead of GPU)

```bash
# Check the metadata of a training run
grep device nnue_weights_v1_metadata.json

# If it says "cpu", reinstall PyTorch with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
```

### "Checkpoint file corrupted"

```bash
# Start over from scratch (don't load the corrupted checkpoint)
python train_nnue.py training_data.jsonl nnue_weights_fresh.pt --no-versioning

# Or resume from an earlier version
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```

## Integration with Pipeline

The `run_pipeline.sh` script now supports incremental training:

```bash
# First run: generates data, trains v1
./run_pipeline.sh

# Add more positions
# ... generate more, label more ...

# Second run: trains on the combined data as v2
./run_pipeline.sh
```

## Example: Full Workflow

```bash
cd modules/bot/python

# Session 1: Initial training
chmod +x run_pipeline.sh
export STOCKFISH_PATH=/usr/bin/stockfish
./run_pipeline.sh
# Creates: nnue_weights_v1.pt, nnue_weights_v1_metadata.json

# Session 2: Improve with deeper analysis
# (manually evaluate more positions at depth 14)
python label_positions.py positions_v2.txt training_data_v2.jsonl \
    /usr/bin/stockfish --stockfish-depth 14

# Combine and retrain
cat training_data_v1.jsonl training_data_v2.jsonl > training_data_combined.jsonl

python train_nnue.py training_data_combined.jsonl nnue_weights.pt \
    --epochs 25 \
    --stockfish-depth 14
# Creates: nnue_weights_v2.pt, nnue_weights_v2_metadata.json

# Session 3: Benchmark and choose
# Test both v1 and v2 with matches...
# If v2 is better, export and use it
python export_weights.py nnue_weights_v2.pt \
    ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala

cd ../..
./compile && ./test
```

## See Also

- `train_nnue.py --help` — Command-line help
- `README_NNUE.md` — Complete pipeline documentation
- `NNUE_IMPLEMENTATION_SUMMARY.md` — Technical architecture