# NNUE Training Guide: Incremental Training & Versioning

## Overview

The improved `train_nnue.py` now supports:

1. **Incremental training** — Resume from checkpoint, continue training on new data
2. **Automatic versioning** — Each training run saved as `nnue_weights_v{N}.pt`
3. **Metadata tracking** — Date, positions, depth, losses stored in JSON
4. **CLI flags** — Full control over training parameters

## Quick Start

### First Training Run (Fresh Start)

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```

This saves:

- `nnue_weights_v1.pt` — The trained weights
- `nnue_weights_v1_metadata.json` — Training metadata

### Continue Training (Incremental)

Add more positions to `training_data.jsonl`, then:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```

The trainer will:

1. Detect that `nnue_weights.pt` exists
2. Load it as a checkpoint automatically
3. Continue training on all data
4. Save as `nnue_weights_v2.pt` with updated metadata

Alternatively, specify a checkpoint explicitly:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```

## Advanced Usage

### Custom Training Parameters

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
  --epochs 30 \
  --batch-size 2048 \
  --lr 5e-4 \
  --stockfish-depth 14
```

- `--epochs` — How many passes through the data (default: 20)
- `--batch-size` — Samples per gradient update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth of Stockfish evaluation (recorded in metadata only)

### Explicit Checkpoint

Resume from a specific checkpoint (other than `nnue_weights.pt`):

```bash
python train_nnue.py training_data_v2.jsonl nnue_weights.pt \
  --checkpoint nnue_weights_v1.pt
```

### Disable Versioning

Save directly to the output file without versioning:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
```

This overwrites `nnue_weights.pt` instead of creating `nnue_weights_v2.pt`.

## Incremental Training Workflow

Typical workflow for improving the model over time:

**Step 1: Initial Training**

```bash
# Generate 500K positions with Stockfish
./run_pipeline.sh

# This saves:
# - nnue_weights_v1.pt
# - nnue_weights_v1_metadata.json
```

**Step 2: Generate More Positions**

```bash
# Later, generate 500K more positions
# Append to training_data.jsonl or create a new file

# Label with Stockfish at depth 16 (more thorough)
python label_positions.py positions_batch2.txt training_data_batch2.jsonl stockfish --stockfish-depth 16

# Combine datasets
cat training_data_batch1.jsonl training_data_batch2.jsonl > training_data_combined.jsonl
```

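Note that combining batches with `cat` keeps every record, so a position present in both batches is trained on twice. If that is unwanted, a short script can drop duplicates. This is a sketch under an assumption about the schema: each JSONL line is an object with a `"fen"` field identifying the position (adjust `key` if the actual field name differs).

```python
import json

def dedupe_jsonl(paths, out_path, key="fen"):
    """Merge JSONL datasets, keeping the first record seen per position.

    Assumes each line is a JSON object containing `key` (here "fen").
    Earlier files win, so list the batch whose labels you prefer first.
    """
    seen = set()
    kept = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    record = json.loads(line)
                    if record[key] in seen:
                        continue  # duplicate position: skip it
                    seen.add(record[key])
                    out.write(line + "\n")
                    kept += 1
    return kept
```

Since earlier files take priority, put the deeper-labeled batch first if you want its evaluations to win for overlapping positions.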
**Step 3: Continue Training**

```bash
# Train on combined data, starting from v1 checkpoint
python train_nnue.py training_data_combined.jsonl nnue_weights.pt

# Saves:
# - nnue_weights_v2.pt (improved)
# - nnue_weights_v2_metadata.json
```

**Step 4: Benchmark & Choose**

```bash
# Test both versions in matches
# If v2 is better, use it; otherwise keep v1

# Update NNUEWeights.scala with best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```

## Metadata File Format

Each training session generates a JSON metadata file, e.g., `nnue_weights_v2_metadata.json`:

```json
{
  "version": 2,
  "date": "2026-04-07T21:45:30.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
```

### Fields

- **version**: Training version number (v1, v2, etc.)
- **date**: ISO timestamp of training start
- **num_positions**: Total positions in dataset
- **stockfish_depth**: Depth of Stockfish evaluations (from command-line flag)
- **epochs**: Number of training passes
- **batch_size**: Training batch size
- **learning_rate**: Adam optimizer learning rate
- **final_val_loss**: Best validation loss achieved
- **device**: GPU (`cuda`) or CPU used for training
- **checkpoint**: Previous model used as starting point (`null` if from scratch)
- **notes**: Win rate comparison (currently TBD — requires benchmark)

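To sanity-check that a metadata file contains all of these fields, a small helper can diff them against the expected set. This is a sketch: `check_metadata` is a hypothetical helper, and the field list should track the trainer's actual schema.

```python
import json

# Field names as documented above; update if the trainer's schema changes.
REQUIRED_FIELDS = {
    "version", "date", "num_positions", "stockfish_depth",
    "epochs", "batch_size", "learning_rate", "final_val_loss",
    "device", "checkpoint", "notes",
}

def check_metadata(path):
    """Return the set of required fields missing from a metadata JSON file.

    An empty set means the file has every documented field.
    """
    with open(path) as f:
        meta = json.load(f)
    return REQUIRED_FIELDS - meta.keys()
```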
## Checkpoint Logic

When you run training, the trainer checks for checkpoints in this order:

1. **Explicit checkpoint** — If you provide `--checkpoint`, use it
2. **Auto-detect** — If the output file exists (e.g., `nnue_weights.pt`), load it
3. **From scratch** — Otherwise, initialize with random weights

Example:

```bash
# First run: from scratch (no nnue_weights.pt exists)
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Creates v1 from scratch, saves as nnue_weights_v1.pt

# Second run: auto-detect nnue_weights.pt as checkpoint
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Loads nnue_weights_v1.pt (because nnue_weights.pt = v1), saves as v2

# Third run: explicit checkpoint
python train_nnue.py training_data_huge.jsonl nnue_weights.pt --checkpoint nnue_weights_v2.pt
# → Loads v2, saves as v3
```

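The lookup order above can be sketched in Python. This is an illustration of the logic only, not the actual `train_nnue.py` code; `resolve_checkpoint` is a hypothetical helper name.

```python
import os

def resolve_checkpoint(explicit, output_path):
    """Mirror the checkpoint lookup order described above (illustrative sketch).

    Returns the path of the checkpoint to load, or None to train from scratch.
    """
    if explicit is not None:
        # 1. An explicit --checkpoint always wins.
        return explicit
    if os.path.exists(output_path):
        # 2. Auto-detect: reuse the existing output file as the checkpoint.
        return output_path
    # 3. Otherwise start from randomly initialized weights.
    return None
```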
## Resuming Interrupted Training

If training is interrupted (power loss, ^C), you can resume:

```bash
# Original command
python train_nnue.py training_data.jsonl nnue_weights.pt

# If interrupted, the same command will:
# 1. Detect nnue_weights_v1.pt exists (or a higher version)
# 2. Auto-load it as checkpoint
# 3. Resume training
# 4. Save next version (v2, v3, etc.)
```

## Performance Tips

### Trade Off Speed, Memory, and Stability

```bash
# Smaller batch size: slower, but uses less memory
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024

# Fewer epochs: shorter training time
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5

# Lower learning rate: slower convergence, but more stable
python train_nnue.py training_data.jsonl nnue_weights.pt --lr 5e-4
```

### Accelerate on GPU

If you have an NVIDIA GPU with CUDA:

```bash
# Training will automatically use CUDA
# Check the metadata "device" field: it should be "cuda", not "cpu"
python train_nnue.py training_data.jsonl nnue_weights.pt
```

If training uses the CPU even though a GPU is available:

```bash
# Reinstall PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

### Efficient Incremental Training

```bash
# Fine-tune v1 on slightly different data (few epochs, reduced learning rate)
python train_nnue.py new_positions.jsonl nnue_weights.pt \
  --checkpoint nnue_weights_v1.pt \
  --epochs 3 \
  --lr 5e-4

# Full retraining on combined data (slower, but usually better)
python train_nnue.py all_positions.jsonl nnue_weights.pt \
  --checkpoint nnue_weights_v1.pt \
  --epochs 20 \
  --lr 1e-3
```

## Version Management

### List All Versions

```bash
ls -la nnue_weights_v*.pt
ls -la nnue_weights_v*_metadata.json
```

### Compare Versions

```bash
grep "final_val_loss" nnue_weights_v1_metadata.json
grep "final_val_loss" nnue_weights_v2_metadata.json
grep "final_val_loss" nnue_weights_v3_metadata.json
```

A lower validation loss generally indicates a better model, but confirm with benchmark games.

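With many versions, grepping each file by hand gets tedious. A short script can rank all metadata files by validation loss; this is a sketch that assumes the metadata format shown earlier and is run from the directory containing the files.

```python
import glob
import json

def rank_versions(pattern="nnue_weights_v*_metadata.json"):
    """Return (version, final_val_loss) pairs sorted best-first (lowest loss)."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            meta = json.load(f)
        rows.append((meta["version"], meta["final_val_loss"]))
    # Lower validation loss first
    return sorted(rows, key=lambda r: r[1])

if __name__ == "__main__":
    for version, loss in rank_versions():
        print(f"v{version}: final_val_loss={loss:.6f}")
```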
### Benchmark Best Version

After training multiple versions, benchmark them:

```bash
# Export v1 and play some games
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Export v2 and benchmark
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Keep the best, archive others
```

### Archive Old Versions

```bash
# Keep only recent versions
mkdir -p old_models
mv nnue_weights_v1.pt old_models/
mv nnue_weights_v1_metadata.json old_models/
```

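The moves above can be automated. This sketch keeps the newest N versions and archives the rest; it assumes the `nnue_weights_v{N}.pt` naming convention used throughout this guide and is run from the weights directory.

```python
import glob
import os
import re
import shutil

def archive_old_versions(keep=2, archive_dir="old_models"):
    """Move all but the newest `keep` weight versions, plus their metadata,
    into archive_dir. Relies on the nnue_weights_v{N}.pt naming scheme."""
    os.makedirs(archive_dir, exist_ok=True)
    versions = []
    for path in glob.glob("nnue_weights_v*.pt"):
        m = re.fullmatch(r"nnue_weights_v(\d+)\.pt", os.path.basename(path))
        if m:
            versions.append(int(m.group(1)))
    # Archive everything except the `keep` highest version numbers.
    for v in sorted(versions)[:-keep]:
        for name in (f"nnue_weights_v{v}.pt", f"nnue_weights_v{v}_metadata.json"):
            if os.path.exists(name):
                shutil.move(name, os.path.join(archive_dir, name))
```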
## Troubleshooting

### "FileNotFoundError: training_data.jsonl not found"

```bash
# Make sure you're in the python/ directory
cd modules/bot/python

# Or provide the full path
python train_nnue.py /full/path/to/training_data.jsonl nnue_weights.pt
```

### "CUDA out of memory"

Reduce the batch size:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 2048
```

### Training seems slow (using CPU instead of GPU)

```bash
# Check the metadata of a training run
grep device nnue_weights_v1_metadata.json

# If it says "cpu", reinstall PyTorch with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
```

### "checkpoint file corrupted"

```bash
# Start over from scratch (don't load the corrupted checkpoint)
python train_nnue.py training_data.jsonl nnue_weights_fresh.pt --no-versioning

# Or resume from an earlier version
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```

## Integration with Pipeline

The `run_pipeline.sh` script now supports incremental training:

```bash
# First run: generates data, trains v1
./run_pipeline.sh

# Add more positions
# ... generate more, label more ...

# Second run: trains on combined data as v2
./run_pipeline.sh
```

## Example: Full Workflow

```bash
cd modules/bot/python

# Session 1: Initial training
chmod +x run_pipeline.sh
export STOCKFISH_PATH=/usr/bin/stockfish
./run_pipeline.sh
# Creates: nnue_weights_v1.pt, nnue_weights_v1_metadata.json

# Session 2: Improve with deeper analysis
# (manually evaluate more positions at depth 14)
python label_positions.py positions_v2.txt training_data_v2.jsonl \
  /usr/bin/stockfish --stockfish-depth 14

# Combine and retrain
cat training_data_v1.jsonl training_data_v2.jsonl > training_data_combined.jsonl

python train_nnue.py training_data_combined.jsonl nnue_weights.pt \
  --epochs 25 \
  --stockfish-depth 14
# Creates: nnue_weights_v2.pt, nnue_weights_v2_metadata.json

# Session 3: Benchmark and choose
# Test both v1 and v2 with matches...
# If v2 is better, export and use it
python export_weights.py nnue_weights_v2.pt \
  ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala

cd ../..
./compile && ./test
```

## See Also

- `train_nnue.py --help` — Command-line help
- `README_NNUE.md` — Complete pipeline documentation
- `NNUE_IMPLEMENTATION_SUMMARY.md` — Technical architecture