# NNUE Training Guide: Incremental Training & Versioning

## Overview
The improved `train_nnue.py` now supports:

- Incremental training: resume from a checkpoint and continue training on new data
- Automatic versioning: each training run is saved as `nnue_weights_v{N}.pt`
- Metadata tracking: date, position count, Stockfish depth, and losses stored in JSON
- CLI flags: full control over training parameters
## Quick Start

### First Training Run (Fresh Start)

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```

This saves:

- `nnue_weights_v1.pt`: the trained weights
- `nnue_weights_v1_metadata.json`: training metadata
### Continue Training (Incremental)

Add more positions to `training_data.jsonl`, then:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```
The trainer will:

1. Detect that `nnue_weights.pt` exists
2. Load it as a checkpoint automatically
3. Continue training on all data
4. Save the result as `nnue_weights_v2.pt` with updated metadata
Alternatively, specify a checkpoint explicitly:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```
## Advanced Usage

### Custom Training Parameters

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --epochs 30 \
    --batch-size 2048 \
    --lr 5e-4 \
    --stockfish-depth 14
```
- `--epochs`: number of passes through the data (default: 20)
- `--batch-size`: samples per gradient update (default: 4096)
- `--lr`: learning rate (default: 1e-3)
- `--stockfish-depth`: depth of the Stockfish evaluations (recorded in metadata only)
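
For orientation, the flag handling in `train_nnue.py` presumably boils down to a standard `argparse` setup. A sketch under that assumption (the flag names and defaults mirror the list above; everything else is illustrative):

```python
import argparse

def parse_args() -> argparse.Namespace:
    # Hypothetical reconstruction of the CLI described above.
    parser = argparse.ArgumentParser(description="Train the NNUE network.")
    parser.add_argument("data", help="JSONL file with labeled positions")
    parser.add_argument("output", help="output weights file, e.g. nnue_weights.pt")
    parser.add_argument("--checkpoint", default=None,
                        help="explicit checkpoint to resume from")
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--batch-size", type=int, default=4096)
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--stockfish-depth", type=int, default=12,
                        help="recorded in metadata only")
    parser.add_argument("--no-versioning", action="store_true",
                        help="overwrite the output file instead of versioning")
    return parser.parse_args()
```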
### Explicit Checkpoint

Resume from a specific checkpoint (not `nnue_weights.pt`):

```bash
python train_nnue.py training_data_v2.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt
```
### Disable Versioning

Save directly to the output file without versioning:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
```

This overwrites `nnue_weights.pt` instead of creating `nnue_weights_v2.pt`.
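
With versioning enabled, the next version number is presumably derived from the versioned files already on disk. A minimal sketch of that idea (the function and its exact placement are hypothetical):

```python
import re
from pathlib import Path

def next_version(output: str) -> int:
    # Scan for existing siblings like nnue_weights_v3.pt and bump the highest N.
    out = Path(output)
    pattern = re.compile(rf"{re.escape(out.stem)}_v(\d+)\.pt$")
    versions = [int(m.group(1))
                for p in out.parent.glob(f"{out.stem}_v*.pt")
                if (m := pattern.match(p.name))]
    return max(versions, default=0) + 1
```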
## Incremental Training Workflow

A typical workflow for improving the model over time:

### Step 1: Initial Training

```bash
# Generate 500K positions with Stockfish
./run_pipeline.sh

# This saves:
# - nnue_weights_v1.pt
# - nnue_weights_v1_metadata.json
```
### Step 2: Generate More Positions

```bash
# Later, generate 500K more positions.
# Append to training_data.jsonl or create a new file.

# Label with Stockfish at depth 16 (more thorough)
python label_positions.py positions_batch2.txt training_data_batch2.jsonl stockfish --stockfish-depth 16

# Combine datasets
cat training_data_batch1.jsonl training_data_batch2.jsonl > training_data_combined.jsonl
```
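
For reference, each line of these JSONL files holds one labeled position. The exact field names depend on `label_positions.py`, so treat this as a hypothetical example (a FEN plus a centipawn score):

```json
{"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "eval": 35}
```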
### Step 3: Continue Training

```bash
# Train on the combined data, starting from the v1 checkpoint
python train_nnue.py training_data_combined.jsonl nnue_weights.pt

# Saves:
# - nnue_weights_v2.pt (improved)
# - nnue_weights_v2_metadata.json
```
### Step 4: Benchmark & Choose

```bash
# Test both versions in matches.
# If v2 is better, use it; otherwise keep v1.

# Update NNUEWeights.scala with the best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
## Metadata File Format

Each training session generates a JSON metadata file, e.g., `nnue_weights_v2_metadata.json`:
```json
{
  "version": 2,
  "date": "2026-04-07T21:45:30.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
```
### Fields

- `version`: training version number (v1, v2, etc.)
- `date`: ISO timestamp of the training start
- `num_positions`: total positions in the dataset
- `stockfish_depth`: depth of the Stockfish evaluations (from the command-line flag)
- `epochs`: number of training passes
- `batch_size`: training batch size
- `learning_rate`: Adam optimizer learning rate
- `final_val_loss`: best validation loss achieved
- `device`: device used for training (`cuda` or `cpu`)
- `checkpoint`: previous model used as the starting point (`null` if trained from scratch)
- `notes`: win-rate comparison (currently TBD; requires benchmark games)
## Checkpoint Logic

When you run training, the trainer looks for a checkpoint in this order:

1. Explicit checkpoint: if you pass `--checkpoint`, use it
2. Auto-detect: if the output file (e.g., `nnue_weights.pt`) already exists, load it
3. From scratch: otherwise, initialize with random weights
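
In code, that resolution order is a small cascade; a sketch (assuming checkpoints are ordinary PyTorch files loaded with `torch.load`; the function name is illustrative):

```python
import os
from typing import Optional

import torch

def resolve_checkpoint(output_path: str, explicit: Optional[str]):
    """Return a loaded checkpoint, or None to train from scratch."""
    # 1. An explicit --checkpoint always wins.
    if explicit is not None:
        return torch.load(explicit)
    # 2. Auto-detect: reuse the output file left behind by a previous run.
    if os.path.exists(output_path):
        return torch.load(output_path)
    # 3. Otherwise start from randomly initialized weights.
    return None
```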
Example:

```bash
# First run: from scratch (no nnue_weights.pt exists)
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Creates v1 from scratch, saves as nnue_weights_v1.pt

# Second run: auto-detects nnue_weights.pt as the checkpoint
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Loads nnue_weights_v1.pt (because nnue_weights.pt = v1), saves as v2

# Third run: explicit checkpoint
python train_nnue.py training_data_huge.jsonl nnue_weights.pt --checkpoint nnue_weights_v2.pt
# → Loads v2, saves as v3
```
## Resuming Interrupted Training

If training is interrupted (power loss, `Ctrl-C`), you can resume:

```bash
# Original command
python train_nnue.py training_data.jsonl nnue_weights.pt

# If interrupted, the same command will:
# 1. Detect that nnue_weights_v1.pt (or a higher version) exists
# 2. Auto-load it as the checkpoint
# 3. Resume training
# 4. Save the next version (v2, v3, etc.)
```
## Performance Tips

### Trade Off Speed, Memory, and Stability

```bash
# Smaller batch size: slower, but uses less memory
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024

# Fewer epochs: faster runs
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5

# Lower learning rate: slower convergence, but more stable
python train_nnue.py training_data.jsonl nnue_weights.pt --lr 5e-4
```
### Accelerate on GPU

If you have an NVIDIA GPU with CUDA:

```bash
# Training will automatically use CUDA.
# Check the metadata "device" field: it should be "cuda", not "cpu".
python train_nnue.py training_data.jsonl nnue_weights.pt
```
If training uses the CPU even though a GPU is available:

```bash
# Reinstall PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
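
Before launching a long run, you can confirm that PyTorch actually sees the GPU:

```bash
python -c "import torch; print(torch.cuda.is_available())"
# True  → training will run on the GPU
# False → still CPU-only; check the driver and the installed torch build
```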
### Efficient Incremental Training

```bash
# Fine-tune v1 on slightly different data (few epochs, reduced learning rate)
python train_nnue.py new_positions.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt \
    --epochs 3 \
    --lr 5e-4

# Full retraining on combined data (slower, but better)
python train_nnue.py all_positions.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt \
    --epochs 20 \
    --lr 1e-3
```
## Version Management

### List All Versions

```bash
ls -la nnue_weights_v*.pt
ls -la nnue_weights_v*_metadata.json
```
### Compare Versions

```bash
grep '"final_val_loss"' nnue_weights_v*_metadata.json
```

A lower validation loss indicates a better fit, but confirm with benchmark games before switching versions.
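
With more than a couple of versions, a short script beats eyeballing grep output. A sketch that assumes the metadata layout shown earlier:

```python
import json
from pathlib import Path

# Rank every versioned model by its recorded validation loss.
runs = []
for path in sorted(Path(".").glob("nnue_weights_v*_metadata.json")):
    meta = json.loads(path.read_text())
    runs.append((meta["final_val_loss"], meta["version"], meta["num_positions"]))

for loss, version, num_positions in sorted(runs):
    print(f"v{version}: val_loss={loss:.6f}  ({num_positions:,} positions)")
```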
### Benchmark Best Version

After training multiple versions, benchmark them:

```bash
# Export v1 and play some games
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Export v2 and benchmark
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Keep the best, archive the others
```
### Archive Old Versions

```bash
# Keep only recent versions
mkdir -p old_models
mv nnue_weights_v1.pt old_models/
mv nnue_weights_v1_metadata.json old_models/
```
## Troubleshooting

### "FileNotFoundError: training_data.jsonl not found"

```bash
# Make sure you're in the python/ directory
cd modules/bot/python

# Or provide the full path
python train_nnue.py /full/path/to/training_data.jsonl nnue_weights.pt
```
"CUDA out of memory"
Reduce batch size:
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 2048
### Training seems slow (using CPU, not GPU)

```bash
# Check the metadata of a training run
grep device nnue_weights_v1_metadata.json

# If it says "cpu", reinstall PyTorch with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
```
"checkpoint file corrupted"
# Start over from scratch (don't load corrupted checkpoint)
python train_nnue.py training_data.jsonl nnue_weights_fresh.pt --no-versioning
# Or resume from earlier version
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
## Integration with Pipeline

The `run_pipeline.sh` script now supports incremental training:

```bash
# First run: generates data, trains v1
./run_pipeline.sh

# Add more positions
# ... generate more, label more ...

# Second run: trains on the combined data as v2
./run_pipeline.sh
```
## Example: Full Workflow

```bash
cd modules/bot/python

# Session 1: Initial training
chmod +x run_pipeline.sh
export STOCKFISH_PATH=/usr/bin/stockfish
./run_pipeline.sh
# Creates: nnue_weights_v1.pt, nnue_weights_v1_metadata.json

# Session 2: Improve with deeper analysis
# (manually evaluate more positions at depth 14)
python label_positions.py positions_v2.txt training_data_v2.jsonl \
    /usr/bin/stockfish --stockfish-depth 14

# Combine and retrain
cat training_data_v1.jsonl training_data_v2.jsonl > training_data_combined.jsonl
python train_nnue.py training_data_combined.jsonl nnue_weights.pt \
    --epochs 25 \
    --stockfish-depth 14
# Creates: nnue_weights_v2.pt, nnue_weights_v2_metadata.json

# Session 3: Benchmark and choose
# Test both v1 and v2 with matches...
# If v2 is better, export and use it
python export_weights.py nnue_weights_v2.pt \
    ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
cd ../..
./compile && ./test
```
## See Also

- `train_nnue.py --help`: command-line help
- `README_NNUE.md`: complete pipeline documentation
- `NNUE_IMPLEMENTATION_SUMMARY.md`: technical architecture