Incremental Training & Versioning: New Features

Summary

train_nnue.py now supports:

Checkpoint Loading — Resume from previous models
Automatic Versioning — v1, v2, v3... naming
Metadata Tracking — Date, positions, losses, depth
CLI Arguments — Full control via command line


Feature 1: Automatic Checkpoint Detection

When you run training, the trainer automatically looks for and loads existing weights:

# First run: nnue_weights.pt doesn't exist
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Trains from scratch, saves as nnue_weights_v1.pt

# Second run: nnue_weights.pt exists (symlink to v1)
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Auto-loads nnue_weights_v1.pt as checkpoint
# → Continues training
# → Saves as nnue_weights_v2.pt

No command-line flag is needed: existing weights are detected automatically.
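The second-run comment above describes nnue_weights.pt as a symlink to the newest versioned file. How train_nnue.py maintains that link is not shown in this document, so the following is only a minimal sketch of one way to do it, assuming a POSIX filesystem. The helper name point_latest is hypothetical.

```python
import os
from pathlib import Path


def point_latest(link_path: str, target: str) -> None:
    """Make link_path a symlink to target, replacing any existing link.

    Hypothetical helper for illustration; not taken from train_nnue.py.
    """
    link = Path(link_path)
    tmp = link.with_name(link.name + ".tmp")
    # Clear a stale temporary link left behind by an earlier failed run.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    # Store a relative target so the whole directory can be moved intact.
    os.symlink(os.path.relpath(target, start=str(link.parent)), tmp)
    # Renaming over the old link swaps it in one step on POSIX systems.
    os.replace(tmp, link)
```

Creating the link under a temporary name and renaming it means a reader of nnue_weights.pt sees either the old target or the new one, never a missing file.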


Feature 2: Explicit Checkpoint

Override auto-detection with --checkpoint:

# Use v1 as starting point, ignore any other weights
python train_nnue.py training_data.jsonl nnue_weights.pt \
  --checkpoint nnue_weights_v1.pt

# Or load from external checkpoint
python train_nnue.py training_data.jsonl nnue_weights.pt \
  --checkpoint /path/to/backup_model.pt
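The precedence described in Features 1 and 2 (an explicit --checkpoint wins, otherwise an existing output file is used, otherwise training starts from scratch) can be sketched as below. This is an illustration under those stated rules, not the code in train_nnue.py; resolve_checkpoint is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def resolve_checkpoint(output_file: str, explicit: Optional[str] = None) -> Optional[Path]:
    """Pick the weights file to start from, or None to train from scratch.

    Hypothetical helper for illustration; not taken from train_nnue.py.
    """
    if explicit is not None:
        path = Path(explicit)
        # An explicit checkpoint that is missing is an error, not a silent
        # fallback to auto-detection.
        if not path.is_file():
            raise FileNotFoundError(f"checkpoint not found: {path}")
        return path
    candidate = Path(output_file)
    # is_file() follows symlinks, so a link to nnue_weights_v1.pt counts.
    return candidate if candidate.is_file() else None
```

Failing loudly on a missing explicit path is a design choice in this sketch: it avoids quietly resuming from different weights than the ones requested.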

Feature 3: Automatic Versioning

Models are saved with version numbers:

First run:

nnue_weights_v1.pt          ← Model weights
nnue_weights_v1_metadata.json ← Training info

Second run:

nnue_weights_v2.pt          ← Model weights
nnue_weights_v2_metadata.json ← Training info

Third run:

nnue_weights_v3.pt
nnue_weights_v3_metadata.json

Disable with --no-versioning:

python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
# → Saves directly to nnue_weights.pt (no version number)
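The naming scheme above can be reproduced with a small helper that scans for existing versioned files and returns the next free name. This is a sketch that assumes versions are derived from the files on disk; the actual script may track them differently. The function names are hypothetical.

```python
import re
from pathlib import Path


def next_versioned_path(output_file: str) -> Path:
    """Return the next free `<stem>_vN<suffix>` path beside output_file.

    Hypothetical helper for illustration; not taken from train_nnue.py.
    """
    base = Path(output_file)
    pattern = re.compile(
        rf"^{re.escape(base.stem)}_v(\d+){re.escape(base.suffix)}$"
    )
    versions = [
        int(match.group(1))
        for path in base.parent.glob(f"{base.stem}_v*{base.suffix}")
        if (match := pattern.match(path.name))
    ]
    # Use max + 1 rather than a count so gaps never cause a name to be reused.
    next_version = max(versions, default=0) + 1
    return base.with_name(f"{base.stem}_v{next_version}{base.suffix}")


def metadata_path(weights_path: Path) -> Path:
    """Return the matching `<stem>_metadata.json` path for a weights file."""
    return weights_path.with_name(f"{weights_path.stem}_metadata.json")
```

With this approach, moving every versioned file out of the directory would restart numbering at v1, which is worth keeping in mind when archiving.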

Feature 4: Training Metadata

Each saved model is accompanied by a JSON metadata file that records:

{
  "version": 2,
  "date": "2026-04-07T15:30:45.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD"
}

Useful for:

  • Tracking progress — Compare final_val_loss across versions
  • Reproducibility — Know exactly how each model was trained
  • Debugging — Identify which positions/depth produced best results
  • Benchmarking — Record win rates (manually added to notes)
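A record with the fields shown above could be written as follows. The field names come from the example JSON; the helper write_metadata and its signature are assumptions made for illustration.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional


def write_metadata(
    path: Path,
    *,
    version: int,
    num_positions: int,
    stockfish_depth: int,
    epochs: int,
    batch_size: int,
    learning_rate: float,
    final_val_loss: float,
    device: str,
    checkpoint: Optional[str] = None,
    notes: str = "",
) -> Dict[str, Any]:
    """Write one metadata record next to a saved model and return it.

    Hypothetical helper for illustration; not taken from train_nnue.py.
    """
    record = {
        "version": version,
        # Local time in ISO 8601 form, e.g. 2026-04-07T15:30:45.123456
        "date": datetime.now().isoformat(),
        "num_positions": num_positions,
        "stockfish_depth": stockfish_depth,
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "final_val_loss": final_val_loss,
        "device": device,
        # None is serialized as null when training started from scratch.
        "checkpoint": checkpoint,
        "notes": notes,
    }
    path.write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return record
```

Keyword-only arguments make each call site read like the JSON it produces, which helps when fields are added later.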

Feature 5: CLI Arguments

Full control over training via command-line flags:

python train_nnue.py training_data.jsonl nnue_weights.pt \
  --epochs 30 \
  --batch-size 2048 \
  --lr 5e-4 \
  --stockfish-depth 14 \
  --checkpoint nnue_weights_v1.pt

All flags:

  • --epochs — Number of training passes (default: 20)
  • --batch-size — Samples per update (default: 4096)
  • --lr — Learning rate (default: 1e-3)
  • --stockfish-depth — Stockfish depth the data was labeled at, recorded in metadata only (default: 12)
  • --checkpoint — Resume from checkpoint (default: auto-detect)
  • --no-versioning — Disable versioning
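The flags listed above map directly onto an argparse parser. The sketch below mirrors the flag names and defaults in this document; the argument types (int, float) are assumptions, and build_parser is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags and defaults listed in this document.

    Sketch for illustration; the real parser in train_nnue.py may differ.
    """
    parser = argparse.ArgumentParser(
        prog="train_nnue.py",
        description="Train NNUE neural network for chess evaluation",
    )
    # Both positionals are optional (nargs="?") and fall back to defaults.
    parser.add_argument("data_file", nargs="?", default="training_data.jsonl",
                        help="Path to training data (default: training_data.jsonl)")
    parser.add_argument("output_file", nargs="?", default="nnue_weights.pt",
                        help="Output file base name (default: nnue_weights.pt)")
    parser.add_argument("--checkpoint", default=None,
                        help="Checkpoint to resume from (default: auto-detect)")
    parser.add_argument("--epochs", type=int, default=20,
                        help="Number of epochs to train (default: 20)")
    parser.add_argument("--batch-size", type=int, default=4096,
                        help="Batch size (default: 4096)")
    parser.add_argument("--lr", type=float, default=1e-3,
                        help="Learning rate (default: 1e-3)")
    parser.add_argument("--stockfish-depth", type=int, default=12,
                        help="Stockfish depth of the labels, metadata only (default: 12)")
    parser.add_argument("--no-versioning", action="store_true",
                        help="Save directly to output_file without a version number")
    return parser
```

argparse converts dashes to underscores, so the values are read as args.batch_size, args.stockfish_depth and args.no_versioning.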

Workflow Examples

Scenario 1: Continuous Improvement

# Initial training: 500K positions
./run_pipeline.sh
# → nnue_weights_v1.pt created

# Add more positions (500K more)
python label_positions.py positions_v2.txt training_data_v2.jsonl stockfish

# Combine and retrain
cat training_data.jsonl training_data_v2.jsonl > all_data.jsonl
python train_nnue.py all_data.jsonl nnue_weights.pt
# → Loads v1, trains on all 1M positions
# → nnue_weights_v2.pt created

# Export best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala

Scenario 2: Hyperparameter Tuning

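# Baseline
python train_nnue.py data.jsonl nnue_weights.pt
# → v1 with default settings

# Try lower learning rate, starting from the v1 baseline
python train_nnue.py data.jsonl nnue_weights.pt --lr 5e-4 \
  --checkpoint nnue_weights_v1.pt
# → v2 with lr=5e-4

# Try higher learning rate, again starting from v1
# (without --checkpoint, auto-detection would continue from v2 instead)
python train_nnue.py data.jsonl nnue_weights.pt --lr 2e-3 \
  --checkpoint nnue_weights_v1.pt
# → v3 with lr=2e-3

# Compare metadata (grep prints the file name next to each loss)
grep final_val_loss nnue_weights_v*_metadata.json
# → v2 and v3 both continue from v1, so compare them with each other
#   and pick the lower loss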

Scenario 3: Interrupted Training Resume

# Start training
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Epoch 30 of 50, then crash/interrupt

# Resume: same command
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Auto-detects checkpoint, continues from epoch 30
# → Completes to epoch 50

Command-Line Help

View all options:

python train_nnue.py --help

Output:

usage: train_nnue.py [-h] [--checkpoint CHECKPOINT] [--epochs EPOCHS]
                     [--batch-size BATCH_SIZE] [--lr LR]
                     [--stockfish-depth STOCKFISH_DEPTH] [--no-versioning]
                     [data_file] [output_file]

Train NNUE neural network for chess evaluation

positional arguments:
  data_file             Path to training_data.jsonl (default: training_data.jsonl)
  output_file           Output file base name (default: nnue_weights.pt)

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        Path to checkpoint file to resume training from (optional)
  --epochs EPOCHS       Number of epochs to train (default: 20)
  --batch-size BATCH_SIZE
                        Batch size (default: 4096)
  --lr LR               Learning rate (default: 1e-3)
  --stockfish-depth STOCKFISH_DEPTH
                        Stockfish depth used for evaluations (for metadata, default: 12)
  --no-versioning       Disable automatic versioning (save directly to output file)

Key Differences from Previous Version

| Feature            | Before              | After                  |
|--------------------|---------------------|------------------------|
| Checkpoint support | No                  | Yes (auto + explicit)  |
| Versioning         | Single file         | v1, v2, v3...          |
| Metadata tracking  | No                  | JSON with all info     |
| CLI arguments      | Limited             | Full argparse          |
| Resumed training   | Always from scratch | Resume from checkpoint |
| Training history   | Lost                | Tracked in metadata    |

Integration with Pipeline

The run_pipeline.sh and run_pipeline.bat scripts automatically use versioning:

./run_pipeline.sh
# First run:
# - Generates data
# - Trains model
# - Creates nnue_weights_v1.pt + metadata
# - Exports to NNUEWeights.scala

# Second run:
# - Auto-detects v1, loads as checkpoint
# - Continues training on all data
# - Creates nnue_weights_v2.pt + metadata
# - Exports updated NNUEWeights.scala

Tips & Tricks

List all versions with losses:

for f in nnue_weights_v*_metadata.json; do
  # Match the quoted keys so other fields containing these words are skipped
  version=$(grep '"version"' "$f" | head -1)
  loss=$(grep '"final_val_loss"' "$f")
  echo "$version | $loss"
done

Auto-export best version:

# Find version with lowest loss
# (the cut on "_" assumes the base name is nnue_weights)
BEST=$(for f in nnue_weights_v*_metadata.json; do
  echo "$f $(grep '"final_val_loss"' "$f" | cut -d: -f2)"
done | sort -k2 -n | head -1 | cut -d_ -f3 | cut -d. -f1)
# BEST now holds a tag such as v2

python export_weights.py "nnue_weights_${BEST}.pt" ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
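The same selection can be done by parsing the JSON rather than grepping it, which is less sensitive to formatting. This is a minimal sketch that assumes the metadata layout shown under Feature 4; best_version is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Optional


def best_version(directory: str = ".", stem: str = "nnue_weights") -> Optional[Path]:
    """Return the weights path whose metadata has the lowest final_val_loss.

    Hypothetical helper for illustration. Returns None if no metadata file
    with a final_val_loss field is found.
    """
    best_loss = None
    best_path = None
    for meta_file in Path(directory).glob(f"{stem}_v*_metadata.json"):
        record = json.loads(meta_file.read_text(encoding="utf-8"))
        loss = record.get("final_val_loss")
        if loss is None:
            continue  # skip records that lack a loss
        if best_loss is None or loss < best_loss:
            best_loss = loss
            # nnue_weights_v2_metadata.json -> nnue_weights_v2.pt
            best_path = meta_file.with_name(
                meta_file.name.replace("_metadata.json", ".pt")
            )
    return best_path
```

The lowest loss is only a fair comparison when the versions were validated on the same data, so treat the result as a starting point if the dataset changed between runs.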

Archive old versions:

mkdir -p archive
mv nnue_weights_v{1,2,3}.pt archive/
mv nnue_weights_v{1,2,3}_metadata.json archive/
# Keep only v4+

See Also

  • TRAINING_GUIDE.md — Detailed examples and workflows
  • README_NNUE.md — Complete pipeline documentation
  • train_nnue.py --help — Command-line reference