NowChessSystems/modules/bot/python/TRAINING_GUIDE.md

NNUE Training Guide: Incremental Training & Versioning

Overview

The improved train_nnue.py now supports:

  1. Incremental training — Resume from checkpoint, continue training on new data
  2. Automatic versioning — Each training run saved as nnue_weights_v{N}.pt
  3. Metadata tracking — Date, positions, depth, losses stored in JSON
  4. CLI flags — Full control over training parameters

Quick Start

First Training Run (Fresh Start)

python train_nnue.py training_data.jsonl nnue_weights.pt

This saves:

  • nnue_weights_v1.pt — The trained weights
  • nnue_weights_v1_metadata.json — Training metadata

Continue Training (Incremental)

Add more positions to training_data.jsonl, then:

python train_nnue.py training_data.jsonl nnue_weights.pt

The trainer will:

  1. Detect nnue_weights.pt exists
  2. Load it as a checkpoint automatically
  3. Continue training on all data
  4. Save as nnue_weights_v2.pt with updated metadata

Alternatively, specify a checkpoint explicitly:

python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt

Advanced Usage

Custom Training Parameters

python train_nnue.py training_data.jsonl nnue_weights.pt \
  --epochs 30 \
  --batch-size 2048 \
  --lr 5e-4 \
  --stockfish-depth 14

  • --epochs — How many passes through the data (default: 20)
  • --batch-size — Samples per gradient update (default: 4096)
  • --lr — Learning rate (default: 1e-3)
  • --stockfish-depth — Depth of Stockfish evaluation (for metadata only)

Explicit Checkpoint

Resume from a specific checkpoint (not nnue_weights.pt):

python train_nnue.py training_data_v2.jsonl nnue_weights.pt \
  --checkpoint nnue_weights_v1.pt

Disable Versioning

Save directly to output file without versioning:

python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning

This overwrites nnue_weights.pt instead of creating nnue_weights_v2.pt.
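The version numbering can be sketched as follows — a minimal, hypothetical helper showing how the next `nnue_weights_v{N}.pt` name could be derived from existing files; the actual logic lives inside train_nnue.py and may differ:

```python
import re
from pathlib import Path

def next_versioned_path(output: str) -> Path:
    """Return the next free <stem>_v{N}<suffix> path next to the base output file."""
    base = Path(output)
    stem, suffix = base.stem, base.suffix
    # Match e.g. "nnue_weights_v2.pt" and capture the version number
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(suffix)}$")
    versions = [
        int(m.group(1))
        for p in base.parent.glob(f"{stem}_v*{suffix}")
        if (m := pattern.match(p.name))
    ]
    # Highest existing version + 1; v1 if no versioned file exists yet
    return base.with_name(f"{stem}_v{max(versions, default=0) + 1}{suffix}")
```

With versioning enabled, each run writes to the path this returns; with --no-versioning, the base output file is simply overwritten.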

Incremental Training Workflow

Typical workflow for improving the model over time:

Step 1: Initial Training

# Generate 500K positions with Stockfish
./run_pipeline.sh

# This saves:
# - nnue_weights_v1.pt
# - nnue_weights_v1_metadata.json

Step 2: Generate More Positions

# Later, generate 500K more positions
# Append to training_data.jsonl or create new one

# Label with Stockfish at depth 16 (more thorough)
python label_positions.py positions_batch2.txt training_data_batch2.jsonl stockfish --stockfish-depth 16

# Combine datasets
cat training_data_batch1.jsonl training_data_batch2.jsonl > training_data_combined.jsonl

Step 3: Continue Training

# Train on combined data, starting from v1 checkpoint
python train_nnue.py training_data_combined.jsonl nnue_weights.pt

# Saves:
# - nnue_weights_v2.pt (improved)
# - nnue_weights_v2_metadata.json

Step 4: Benchmark & Choose

# Test both versions in matches
# If v2 is better, use it; otherwise keep v1

# Update NNUEWeights.scala with best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala

Metadata File Format

Each training session generates a JSON metadata file, e.g., nnue_weights_v2_metadata.json:

{
  "version": 2,
  "date": "2026-04-07T21:45:30.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}

Fields

  • version: Training version number (v1, v2, etc.)
  • date: ISO timestamp of training start
  • num_positions: Total positions in dataset
  • stockfish_depth: Depth of Stockfish evaluations (from command-line flag)
  • epochs: Number of training passes
  • batch_size: Training batch size
  • learning_rate: Adam optimizer learning rate
  • final_val_loss: Best validation loss achieved
  • device: GPU (cuda) or CPU used for training
  • checkpoint: Previous model used as starting point (null if from scratch)
  • notes: Win rate comparison (currently TBD — requires benchmark)

Checkpoint Logic

When you run training, the trainer checks for checkpoints in this order:

  1. Explicit checkpoint — If you provide --checkpoint, use it
  2. Auto-detect — If output file exists (e.g., nnue_weights.pt), load it
  3. From scratch — Otherwise, initialize with random weights

Example:

# First run: from scratch (no nnue_weights.pt exists)
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Creates v1 from scratch, saves as nnue_weights_v1.pt

# Second run: auto-detect nnue_weights.pt as checkpoint
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Loads nnue_weights_v1.pt (because nnue_weights.pt = v1), saves as v2

# Third run: explicit checkpoint
python train_nnue.py training_data_huge.jsonl nnue_weights.pt --checkpoint nnue_weights_v2.pt
# → Loads v2, saves as v3
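The three-step resolution order above can be sketched in a few lines — `resolve_checkpoint` is a hypothetical helper illustrating the precedence, not the actual function in train_nnue.py:

```python
from pathlib import Path
from typing import Optional

def resolve_checkpoint(output: str, explicit: Optional[str] = None) -> Optional[Path]:
    """Pick the checkpoint to load, or None to start from random weights."""
    if explicit:                 # 1. an explicit --checkpoint always wins
        return Path(explicit)
    out = Path(output)
    if out.exists():             # 2. auto-detect: reuse the existing output file
        return out
    return None                  # 3. from scratch
```

The key design point is that an explicit flag always overrides auto-detection, so you can pin training to any earlier version regardless of what is on disk.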

Resuming Interrupted Training

If training is interrupted (power loss, Ctrl-C), re-running the same command resumes from the most recently saved weights:

# Original command
python train_nnue.py training_data.jsonl nnue_weights.pt

# If interrupted, the same command will:
# 1. Detect that nnue_weights.pt (or a saved versioned checkpoint) exists
# 2. Auto-load it as the checkpoint
# 3. Resume training from those weights
# 4. Save the next version (v2, v3, etc.)
#
# Note: if the very first run is interrupted before any weights are saved,
# there is nothing to resume and training simply restarts from scratch.

Performance Tips

Reduce Training Time

# Smaller batch size = slower but less memory
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024

# Fewer epochs
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5

# Lower learning rate = slower convergence but more stable
python train_nnue.py training_data.jsonl nnue_weights.pt --lr 5e-4

Accelerate on GPU

If you have an NVIDIA GPU with CUDA support:

# Training will automatically use CUDA
# Check metadata device field: should be "cuda" not "cpu"
python train_nnue.py training_data.jsonl nnue_weights.pt

If training uses CPU but GPU is available:

# Reinstall PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
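Before kicking off a long run, you can confirm what PyTorch will use with a quick check (this only assumes torch is installed; the print format is illustrative):

```python
# Verify that PyTorch actually sees the GPU before a long training run.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"training device: {device}")
if device == "cuda":
    # torch.cuda.get_device_name reports the GPU model string
    print("GPU:", torch.cuda.get_device_name(0))
```

If this prints cpu on a machine with an NVIDIA GPU, the reinstall command above is usually the fix.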

Efficient Incremental Training

# Fine-tune v1 on slightly different data (high learning rate)
python train_nnue.py new_positions.jsonl nnue_weights.pt \
  --checkpoint nnue_weights_v1.pt \
  --epochs 3 \
  --lr 5e-4

# Full retraining on combined data (slower, better)
python train_nnue.py all_positions.jsonl nnue_weights.pt \
  --checkpoint nnue_weights_v1.pt \
  --epochs 20 \
  --lr 1e-3

Version Management

List All Versions

ls -la nnue_weights_v*.pt
ls -la nnue_weights_v*_metadata.json

Compare Versions

grep "final_val_loss" nnue_weights_v*_metadata.json

Lower validation loss generally indicates a better model, but confirm with benchmark games before switching versions.
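Instead of grepping by hand, the metadata files can be compared programmatically — a minimal sketch assuming the JSON format shown earlier; `best_version` is a hypothetical helper, not part of the pipeline:

```python
import json
from pathlib import Path
from typing import Optional

def best_version(directory: str = ".") -> Optional[int]:
    """Return the version number with the lowest final_val_loss, or None."""
    candidates = []
    for meta in Path(directory).glob("nnue_weights_v*_metadata.json"):
        data = json.loads(meta.read_text())
        candidates.append((data["final_val_loss"], data["version"]))
    # min() compares the (loss, version) tuples by loss first
    return min(candidates)[1] if candidates else None
```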

Benchmark Best Version

After training multiple versions, benchmark them:

# Export v1 and play some games
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Export v2 and benchmark
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Keep the best, archive others

Archive Old Versions

# Keep only recent versions
mkdir -p old_models
mv nnue_weights_v1.pt old_models/
mv nnue_weights_v1_metadata.json old_models/

Troubleshooting

"FileNotFoundError: training_data.jsonl not found"

# Make sure you're in the python/ directory
cd modules/bot/python

# Or provide full path
python train_nnue.py /full/path/to/training_data.jsonl nnue_weights.pt

"CUDA out of memory"

Reduce batch size:

python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 2048

Training seems slow (using CPU not GPU)

# Check metadata of a training run
grep device nnue_weights_v1_metadata.json

# If "cpu", reinstall PyTorch with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118

"checkpoint file corrupted"

# Start over from scratch (don't load corrupted checkpoint)
python train_nnue.py training_data.jsonl nnue_weights_fresh.pt --no-versioning

# Or resume from earlier version
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt

Integration with Pipeline

The run_pipeline.sh script now supports incremental training:

# First run: generates data, trains v1
./run_pipeline.sh

# Add more positions
# ... generate more, label more ...

# Second run: trains on combined data as v2
./run_pipeline.sh

Example: Full Workflow

cd modules/bot/python

# Session 1: Initial training
chmod +x run_pipeline.sh
export STOCKFISH_PATH=/usr/bin/stockfish
./run_pipeline.sh
# Creates: nnue_weights_v1.pt, nnue_weights_v1_metadata.json

# Session 2: Improve with deeper analysis
# (manually evaluate more positions at depth 14)
python label_positions.py positions_v2.txt training_data_v2.jsonl \
  /usr/bin/stockfish --stockfish-depth 14

# Combine and retrain
cat training_data_v1.jsonl training_data_v2.jsonl > training_data_combined.jsonl

python train_nnue.py training_data_combined.jsonl nnue_weights.pt \
  --epochs 25 \
  --stockfish-depth 14
# Creates: nnue_weights_v2.pt, nnue_weights_v2_metadata.json

# Session 3: Benchmark and choose
# Test both v1 and v2 with matches...
# If v2 is better, export and use
python export_weights.py nnue_weights_v2.pt \
  ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala

cd ../..
./compile && ./test

See Also

  • train_nnue.py --help — Command-line help
  • README_NNUE.md — Complete pipeline documentation
  • NNUE_IMPLEMENTATION_SUMMARY.md — Technical architecture