# NNUE Training Guide: Incremental Training & Versioning
## Overview
The improved `train_nnue.py` now supports:
1. **Incremental training** — Resume from checkpoint, continue training on new data
2. **Automatic versioning** — Each training run saved as `nnue_weights_v{N}.pt`
3. **Metadata tracking** — Date, positions, depth, losses stored in JSON
4. **CLI flags** — Full control over training parameters
## Quick Start
### First Training Run (Fresh Start)
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```
This saves:
- `nnue_weights_v1.pt` — The trained weights
- `nnue_weights_v1_metadata.json` — Training metadata
### Continue Training (Incremental)
Add more positions to `training_data.jsonl`, then:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```
The trainer will:
1. Detect `nnue_weights.pt` exists
2. Load it as a checkpoint automatically
3. Continue training on all data
4. Save as `nnue_weights_v2.pt` with updated metadata
Alternatively, specify a checkpoint explicitly:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```
## Advanced Usage
### Custom Training Parameters
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
--epochs 30 \
--batch-size 2048 \
--lr 5e-4 \
--stockfish-depth 14
```
- `--epochs` — How many passes through the data (default: 20)
- `--batch-size` — Samples per gradient update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth of Stockfish evaluation (for metadata only)
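The flags above map onto a standard `argparse` interface. A minimal sketch of how `train_nnue.py` might declare them, using the defaults documented above (the `build_parser` helper and the `--stockfish-depth` default of 12 are illustrative assumptions, not the script's actual internals):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI surface described above; the real script may differ."""
    p = argparse.ArgumentParser(description="Train the NNUE evaluation network.")
    p.add_argument("data", help="JSONL file with labeled positions")
    p.add_argument("output", help="output weights file, e.g. nnue_weights.pt")
    p.add_argument("--epochs", type=int, default=20, help="passes through the data")
    p.add_argument("--batch-size", type=int, default=4096, help="samples per gradient update")
    p.add_argument("--lr", type=float, default=1e-3, help="Adam learning rate")
    p.add_argument("--stockfish-depth", type=int, default=12, help="recorded in metadata only")
    p.add_argument("--checkpoint", default=None, help="explicit checkpoint to resume from")
    p.add_argument("--no-versioning", action="store_true", help="overwrite output directly")
    return p

args = build_parser().parse_args(
    ["training_data.jsonl", "nnue_weights.pt", "--epochs", "30", "--lr", "5e-4"]
)
print(args.epochs, args.lr, args.batch_size)  # → 30 0.0005 4096
```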
### Explicit Checkpoint
Resume from a specific checkpoint (not `nnue_weights.pt`):
```bash
python train_nnue.py training_data_v2.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt
```
### Disable Versioning
Save directly to output file without versioning:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
```
This overwrites `nnue_weights.pt` instead of creating `nnue_weights_v2.pt`.
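The versioned filenames follow a simple pattern, so the next version number can be derived from what is already on disk. A stdlib sketch of that logic, assuming the `nnue_weights_v{N}.pt` convention above (`next_version_path` is an illustrative helper name, not necessarily the script's real function):

```python
import glob
import os
import re

def next_version_path(output: str) -> str:
    """Return the next nnue_weights_v{N}.pt path for the given output file."""
    stem, ext = os.path.splitext(output)  # e.g. "nnue_weights", ".pt"
    pattern = re.compile(re.escape(stem) + r"_v(\d+)" + re.escape(ext) + r"$")
    versions = [
        int(m.group(1))
        for f in glob.glob(f"{stem}_v*{ext}")
        if (m := pattern.match(f))
    ]
    # No versions yet → v1; otherwise one past the highest existing version.
    return f"{stem}_v{max(versions, default=0) + 1}{ext}"

print(next_version_path("nnue_weights.pt"))  # "nnue_weights_v1.pt" if none exist yet
```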
## Incremental Training Workflow
Typical workflow for improving the model over time:
**Step 1: Initial Training**
```bash
# Generate 500K positions with Stockfish
./run_pipeline.sh
# This saves:
# - nnue_weights_v1.pt
# - nnue_weights_v1_metadata.json
```
**Step 2: Generate More Positions**
```bash
# Later, generate 500K more positions
# Append to training_data.jsonl or create new one
# Label with Stockfish at depth 16 (more thorough)
python label_positions.py positions_batch2.txt training_data_batch2.jsonl stockfish --stockfish-depth 16
# Combine datasets
cat training_data_batch1.jsonl training_data_batch2.jsonl > training_data_combined.jsonl
```
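Plain `cat` will keep duplicate records if a position was labeled in both batches. A hedged stdlib sketch that concatenates JSONL files while dropping exact duplicate lines (it treats each line as an opaque string, so it assumes nothing about the record schema; `merge_jsonl` is an illustrative helper, not part of the pipeline):

```python
def merge_jsonl(inputs, output):
    """Concatenate JSONL files into `output`, skipping exact duplicate lines."""
    seen = set()
    kept = 0
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")
                        kept += 1
    return kept

# merge_jsonl(["training_data_batch1.jsonl", "training_data_batch2.jsonl"],
#             "training_data_combined.jsonl")
```

Note the trade-offs: it holds every line in memory, and the same position labeled with two different evaluations is kept twice (only byte-identical lines are deduplicated).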
**Step 3: Continue Training**
```bash
# Train on combined data, starting from v1 checkpoint
python train_nnue.py training_data_combined.jsonl nnue_weights.pt
# Saves:
# - nnue_weights_v2.pt (improved)
# - nnue_weights_v2_metadata.json
```
**Step 4: Benchmark & Choose**
```bash
# Test both versions in matches
# If v2 is better, use it; otherwise keep v1
# Update NNUEWeights.scala with best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
## Metadata File Format
Each training session generates a JSON metadata file, e.g., `nnue_weights_v2_metadata.json`:
```json
{
"version": 2,
"date": "2026-04-07T21:45:30.123456",
"num_positions": 1000000,
"stockfish_depth": 12,
"epochs": 20,
"batch_size": 4096,
"learning_rate": 0.001,
"final_val_loss": 0.0234567,
"device": "cuda",
"checkpoint": "nnue_weights_v1.pt",
"notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
```
### Fields
- **version**: Training version number (v1, v2, etc.)
- **date**: ISO timestamp of training start
- **num_positions**: Total positions in dataset
- **stockfish_depth**: Depth of Stockfish evaluations (from command-line flag)
- **epochs**: Number of training passes
- **batch_size**: Training batch size
- **learning_rate**: Adam optimizer learning rate
- **final_val_loss**: Best validation loss achieved
- **device**: GPU (cuda) or CPU used for training
- **checkpoint**: Previous model used as starting point (null if from scratch)
- **notes**: Win rate comparison (currently TBD — requires benchmark)
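A sketch of how such a metadata file might be written with the stdlib (the field set is taken from the example above; the `write_metadata` helper name and signature are illustrative, not the script's actual code):

```python
import json
from datetime import datetime

def write_metadata(path, version, num_positions, stockfish_depth, epochs,
                   batch_size, learning_rate, final_val_loss, device, checkpoint):
    """Dump one training session's metadata next to the versioned weights."""
    meta = {
        "version": version,
        "date": datetime.now().isoformat(),
        "num_positions": num_positions,
        "stockfish_depth": stockfish_depth,
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "final_val_loss": final_val_loss,
        "device": device,
        "checkpoint": checkpoint,  # None serializes to JSON null (from scratch)
        "notes": "Win rate vs classical eval: TBD (requires benchmark games)",
    }
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```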
## Checkpoint Logic
When you run training, the trainer checks for checkpoints in this order:
1. **Explicit checkpoint** — If you provide `--checkpoint`, use it
2. **Auto-detect** — If output file exists (e.g., `nnue_weights.pt`), load it
3. **From scratch** — Otherwise, initialize with random weights
Example:
```bash
# First run: from scratch (no nnue_weights.pt exists)
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Creates v1 from scratch, saves as nnue_weights_v1.pt
# Second run: auto-detect nnue_weights.pt as checkpoint
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Loads nnue_weights_v1.pt (because nnue_weights.pt = v1), saves as v2
# Third run: explicit checkpoint
python train_nnue.py training_data_huge.jsonl nnue_weights.pt --checkpoint nnue_weights_v2.pt
# → Loads v2, saves as v3
```
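The three-step resolution order above can be sketched as (function name and return convention are illustrative; `None` stands in for "initialize from scratch"):

```python
import os
from typing import Optional

def resolve_checkpoint(output: str, explicit: Optional[str] = None) -> Optional[str]:
    """Pick a checkpoint per the priority order above; None means from scratch."""
    if explicit:                # 1. an explicit --checkpoint always wins
        return explicit
    if os.path.exists(output):  # 2. auto-detect the output file if it exists
        return output
    return None                 # 3. otherwise fall back to random weights
```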
## Resuming Interrupted Training
If training is interrupted (power loss, ^C), you can resume:
```bash
# Original command
python train_nnue.py training_data.jsonl nnue_weights.pt
# If interrupted, rerunning the same command will:
# 1. Detect the existing weights file (per the checkpoint logic above)
# 2. Auto-load it as a checkpoint
# 3. Resume training
# 4. Save the next version (v2, v3, etc.)
```
## Performance Tips
### Balance Speed, Memory, and Stability
```bash
# Smaller batch size = slower but less memory
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024
# Fewer epochs
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5
# Lower learning rate = slower convergence but more stable
python train_nnue.py training_data.jsonl nnue_weights.pt --lr 5e-4
```
### Accelerate on GPU
If you have an NVIDIA GPU with CUDA:
```bash
# Training will automatically use CUDA
# Check metadata device field: should be "cuda" not "cpu"
python train_nnue.py training_data.jsonl nnue_weights.pt
```
If training uses CPU but GPU is available:
```bash
# Reinstall PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
### Efficient Incremental Training
```bash
# Fine-tune v1 on slightly different data (few epochs, reduced learning rate)
python train_nnue.py new_positions.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt \
--epochs 3 \
--lr 5e-4
# Full retraining on combined data (slower, better)
python train_nnue.py all_positions.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt \
--epochs 20 \
--lr 1e-3
```
## Version Management
### List All Versions
```bash
ls -la nnue_weights_v*.pt
ls -la nnue_weights_v*_metadata.json
```
### Compare Versions
```bash
grep "final_val_loss" nnue_weights_v*_metadata.json
```
A lower validation loss generally indicates a better model, but confirm with benchmark games before switching versions.
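The same comparison can also be scripted. A stdlib sketch that prints every version's loss and returns the best one, assuming only the metadata fields documented above (`best_version` is an illustrative helper):

```python
import glob
import json

def best_version(pattern="nnue_weights_v*_metadata.json"):
    """Return (path, final_val_loss) for the version with the lowest loss."""
    results = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            meta = json.load(f)
        results.append((path, meta["final_val_loss"]))
        print(f"{path}: v{meta['version']} loss={meta['final_val_loss']:.6f}")
    return min(results, key=lambda r: r[1]) if results else None

# best_version()  # prints one line per version, returns the lowest-loss one
```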
### Benchmark Best Version
After training multiple versions, benchmark them:
```bash
# Export v1 and play some games
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test
# Export v2 and benchmark
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test
# Keep the best, archive others
```
### Archive Old Versions
```bash
# Keep only recent versions
mkdir -p old_models
mv nnue_weights_v1.pt old_models/
mv nnue_weights_v1_metadata.json old_models/
```
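Moving old versions by hand scales poorly once there are many. A sketch that keeps the newest `keep` versions in place and moves the rest (plus their metadata) into the archive directory; the filename pattern is assumed from the convention above and `archive_old_versions` is an illustrative helper:

```python
import os
import re
import shutil

def archive_old_versions(keep=2, archive_dir="old_models"):
    """Move all but the newest `keep` weight versions (and metadata) aside."""
    pat = re.compile(r"^nnue_weights_v(\d+)\.pt$")
    versions = sorted(
        int(m.group(1)) for f in os.listdir(".") if (m := pat.match(f))
    )
    os.makedirs(archive_dir, exist_ok=True)
    # keep=0 archives everything; otherwise drop the last `keep` entries.
    for v in versions[:-keep] if keep else versions:
        for name in (f"nnue_weights_v{v}.pt", f"nnue_weights_v{v}_metadata.json"):
            if os.path.exists(name):
                shutil.move(name, os.path.join(archive_dir, name))

# archive_old_versions(keep=2)  # keeps the two newest versions in place
```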
## Troubleshooting
### "FileNotFoundError: training_data.jsonl not found"
```bash
# Make sure you're in the python/ directory
cd modules/bot/python
# Or provide full path
python train_nnue.py /full/path/to/training_data.jsonl nnue_weights.pt
```
### "CUDA out of memory"
Reduce batch size:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 2048
```
### Training seems slow (using CPU not GPU)
```bash
# Check metadata of a training run
grep device nnue_weights_v1_metadata.json
# If "cpu", reinstall PyTorch with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
```
### "checkpoint file corrupted"
```bash
# Start over from scratch (don't load corrupted checkpoint)
python train_nnue.py training_data.jsonl nnue_weights_fresh.pt --no-versioning
# Or resume from earlier version
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```
## Integration with Pipeline
The `run_pipeline.sh` script now supports incremental training:
```bash
# First run: generates data, trains v1
./run_pipeline.sh
# Add more positions
# ... generate more, label more ...
# Second run: trains on combined data as v2
./run_pipeline.sh
```
## Example: Full Workflow
```bash
cd modules/bot/python
# Session 1: Initial training
chmod +x run_pipeline.sh
export STOCKFISH_PATH=/usr/bin/stockfish
./run_pipeline.sh
# Creates: nnue_weights_v1.pt, nnue_weights_v1_metadata.json
# Session 2: Improve with deeper analysis
# (manually evaluate more positions at depth 14)
python label_positions.py positions_v2.txt training_data_v2.jsonl \
/usr/bin/stockfish --stockfish-depth 14
# Combine and retrain
cat training_data_v1.jsonl training_data_v2.jsonl > training_data_combined.jsonl
python train_nnue.py training_data_combined.jsonl nnue_weights.pt \
--epochs 25 \
--stockfish-depth 14
# Creates: nnue_weights_v2.pt, nnue_weights_v2_metadata.json
# Session 3: Benchmark and choose
# Test both v1 and v2 with matches...
# If v2 is better, export and use
python export_weights.py nnue_weights_v2.pt \
../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
cd ../..
./compile && ./test
```
## See Also
- `train_nnue.py --help` — Command-line help
- `README_NNUE.md` — Complete pipeline documentation
- `NNUE_IMPLEMENTATION_SUMMARY.md` — Technical architecture