# Incremental Training & Versioning: New Features

## Summary

`train_nnue.py` now supports:

- ✅ Checkpoint Loading — Resume from previous models
- ✅ Automatic Versioning — `v1`, `v2`, `v3`... naming
- ✅ Metadata Tracking — Date, positions, losses, depth
- ✅ CLI Arguments — Full control via command line
## Feature 1: Automatic Checkpoint Detection

When you run training, the trainer automatically looks for and loads existing weights:

```bash
# First run: nnue_weights.pt doesn't exist
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Trains from scratch, saves as nnue_weights_v1.pt

# Second run: nnue_weights.pt exists (symlink to v1)
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Auto-loads nnue_weights_v1.pt as checkpoint
# → Continues training
# → Saves as nnue_weights_v2.pt
```

No command-line flag needed — existing weights are detected automatically.
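The detection step can be sketched roughly as follows. This is illustrative only, not the script's actual code; the function name `find_latest_checkpoint` is hypothetical. The idea is to scan the output directory for files matching the `<stem>_vN.pt` pattern and pick the highest `N`:

```python
import re
from pathlib import Path

def find_latest_checkpoint(base="nnue_weights.pt", directory="."):
    """Return the filename of the highest-versioned weights file, or None.

    Hypothetical sketch: scans for <stem>_vN.pt and keeps the largest N.
    """
    stem = Path(base).stem  # e.g. "nnue_weights"
    pattern = re.compile(r"^" + re.escape(stem) + r"_v(\d+)\.pt$")
    best_name, best_version = None, -1
    for entry in Path(directory).iterdir():
        match = pattern.match(entry.name)
        if match and int(match.group(1)) > best_version:
            best_version = int(match.group(1))
            best_name = entry.name
    return best_name
```

Comparing the captured version numbers as integers (not strings) means `v10` correctly beats `v9`.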
## Feature 2: Explicit Checkpoint

Override auto-detection with `--checkpoint`:

```bash
# Use v1 as the starting point, ignoring any other weights
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt

# Or load from an external checkpoint
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --checkpoint /path/to/backup_model.pt
```
## Feature 3: Automatic Versioning

Models are saved with version numbers:

```
First run:
  nnue_weights_v1.pt             ← model weights
  nnue_weights_v1_metadata.json  ← training info

Second run:
  nnue_weights_v2.pt             ← model weights
  nnue_weights_v2_metadata.json  ← training info

Third run:
  nnue_weights_v3.pt
  nnue_weights_v3_metadata.json
```

Disable with `--no-versioning`:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
# → Saves directly to nnue_weights.pt (no version number)
```
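Choosing the next version number can be sketched like this (again illustrative; `next_version_path` is a hypothetical name, not the script's API). The next save path is one past the highest existing version, or the plain base name when versioning is disabled:

```python
import re
from pathlib import Path

def next_version_path(base="nnue_weights.pt", directory=".", versioning=True):
    """Pick the next save path: <stem>_vN.pt, one past the highest existing N.

    Hypothetical sketch of the versioning scheme described above.
    """
    if not versioning:
        # --no-versioning: save directly to the base filename
        return str(Path(directory) / base)
    stem = Path(base).stem
    pattern = re.compile(r"^" + re.escape(stem) + r"_v(\d+)\.pt$")
    versions = [int(m.group(1))
                for f in Path(directory).iterdir()
                if (m := pattern.match(f.name))]
    return str(Path(directory) / f"{stem}_v{max(versions, default=0) + 1}.pt")
```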
## Feature 4: Training Metadata

Each model save includes a JSON metadata file tracking:

```json
{
  "version": 2,
  "date": "2026-04-07T15:30:45.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD"
}
```

Useful for:

- Tracking progress — compare `final_val_loss` across versions
- Reproducibility — know exactly how each model was trained
- Debugging — identify which positions/depth produced the best results
- Benchmarking — record win rates (added manually to `notes`)
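Writing such a file is straightforward with the standard library. A minimal sketch, assuming a hypothetical helper named `write_metadata` with the field set shown above:

```python
import json
from datetime import datetime
from pathlib import Path

def write_metadata(weights_path, version, num_positions, final_val_loss,
                   stockfish_depth=12, epochs=20, batch_size=4096,
                   learning_rate=1e-3, device="cpu", checkpoint=None,
                   notes="Win rate vs classical eval: TBD"):
    """Write <stem>_metadata.json next to the weights file; return its path.

    Hypothetical sketch mirroring the metadata fields documented above.
    """
    meta = {
        "version": version,
        "date": datetime.now().isoformat(),
        "num_positions": num_positions,
        "stockfish_depth": stockfish_depth,
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "final_val_loss": final_val_loss,
        "device": device,
        "checkpoint": checkpoint,
        "notes": notes,
    }
    stem = Path(weights_path).with_suffix("")  # strip the .pt extension
    meta_path = Path(f"{stem}_metadata.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path
```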
## Feature 5: CLI Arguments

Full control over training via command-line flags:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --epochs 30 \
    --batch-size 2048 \
    --lr 5e-4 \
    --stockfish-depth 14 \
    --checkpoint nnue_weights_v1.pt
```

All flags:

- `--epochs` — Number of training passes (default: 20)
- `--batch-size` — Samples per update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth for metadata (default: 12)
- `--checkpoint` — Resume from checkpoint (default: auto-detect)
- `--no-versioning` — Disable versioning
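The flag surface above maps directly onto a standard `argparse` setup. A sketch of what such a parser could look like (defaults taken from this document; this is not necessarily the script's exact source):

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented train_nnue.py flags (sketch)."""
    p = argparse.ArgumentParser(
        description="Train NNUE neural network for chess evaluation")
    # Positional arguments are optional, with the documented defaults
    p.add_argument("data_file", nargs="?", default="training_data.jsonl")
    p.add_argument("output_file", nargs="?", default="nnue_weights.pt")
    p.add_argument("--checkpoint", default=None,
                   help="Resume from this checkpoint (default: auto-detect)")
    p.add_argument("--epochs", type=int, default=20)
    p.add_argument("--batch-size", type=int, default=4096)
    p.add_argument("--lr", type=float, default=1e-3)
    p.add_argument("--stockfish-depth", type=int, default=12)
    p.add_argument("--no-versioning", action="store_true")
    return p
```

Note that `argparse` turns `--batch-size` into the attribute `args.batch_size` and `--no-versioning` into `args.no_versioning`.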
## Workflow Examples

### Scenario 1: Continuous Improvement

```bash
# Initial training: 500K positions
./run_pipeline.sh
# → nnue_weights_v1.pt created

# Add 500K more positions
python label_positions.py positions_v2.txt training_data_v2.jsonl stockfish

# Combine and retrain
cat training_data.jsonl training_data_v2.jsonl > all_data.jsonl
python train_nnue.py all_data.jsonl nnue_weights.pt
# → Loads v1, trains on all 1M positions
# → nnue_weights_v2.pt created

# Export the best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
### Scenario 2: Hyperparameter Tuning

```bash
# Baseline
python train_nnue.py data.jsonl nnue_weights.pt
# → v1 with default settings

# Try a lower learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 5e-4
# → v2 with lr=5e-4

# Try a higher learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 2e-3
# → v3 with lr=2e-3

# Compare metadata
cat nnue_weights_v*_metadata.json | grep final_val_loss
# → Pick the lowest loss
```
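Because every version ships with a metadata file, the comparison can also be done programmatically. A small sketch (the helper name `best_version` is illustrative) that selects the metadata file with the lowest `final_val_loss`:

```python
import json
from pathlib import Path

def best_version(directory="."):
    """Return (metadata_path, loss) for the lowest final_val_loss, or None.

    Illustrative sketch: scans nnue_weights_v*_metadata.json in a directory.
    """
    scored = []
    for f in sorted(Path(directory).glob("nnue_weights_v*_metadata.json")):
        loss = json.loads(f.read_text())["final_val_loss"]
        scored.append((loss, f))
    if not scored:
        return None
    loss, f = min(scored, key=lambda pair: pair[0])
    return str(f), loss
```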
### Scenario 3: Interrupted Training Resume

```bash
# Start training
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Crash/interrupt at epoch 30 of 50

# Resume: same command
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Auto-detects checkpoint, continues from epoch 30
# → Completes to epoch 50
```
## Command-Line Help

View all options:

```bash
python train_nnue.py --help
```

Output:

```
usage: train_nnue.py [-h] [--checkpoint CHECKPOINT] [--epochs EPOCHS]
                     [--batch-size BATCH_SIZE] [--lr LR]
                     [--stockfish-depth STOCKFISH_DEPTH] [--no-versioning]
                     [data_file] [output_file]

Train NNUE neural network for chess evaluation

positional arguments:
  data_file             Path to training_data.jsonl (default: training_data.jsonl)
  output_file           Output file base name (default: nnue_weights.pt)

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        Path to checkpoint file to resume training from (optional)
  --epochs EPOCHS       Number of epochs to train (default: 20)
  --batch-size BATCH_SIZE
                        Batch size (default: 4096)
  --lr LR               Learning rate (default: 1e-3)
  --stockfish-depth STOCKFISH_DEPTH
                        Stockfish depth used for evaluations (for metadata, default: 12)
  --no-versioning       Disable automatic versioning (save directly to output file)
```
## Key Differences from Previous Version
| Feature | Before | After |
|---|---|---|
| Checkpoint support | ❌ No | ✅ Yes (auto + explicit) |
| Versioning | ❌ Single file | ✅ v1, v2, v3... |
| Metadata tracking | ❌ No | ✅ JSON with all info |
| CLI arguments | ❌ Limited | ✅ Full argparse |
| Resumed training | ❌ Always from scratch | ✅ Resume from checkpoint |
| Training history | ❌ Lost | ✅ Tracked in metadata |
## Integration with Pipeline

The `run_pipeline.sh` and `run_pipeline.bat` scripts automatically use versioning:

```bash
./run_pipeline.sh

# First run:
# - Generates data
# - Trains model
# - Creates nnue_weights_v1.pt + metadata
# - Exports to NNUEWeights.scala

# Second run:
# - Auto-detects v1, loads it as checkpoint
# - Continues training on all data
# - Creates nnue_weights_v2.pt + metadata
# - Exports updated NNUEWeights.scala
```
## Tips & Tricks

List all versions with losses:

```bash
for f in nnue_weights_v*_metadata.json; do
    version=$(grep '"version"' "$f" | head -1)
    loss=$(grep final_val_loss "$f")
    echo "$version | $loss"
done
```
Auto-export the best version:

```bash
# Find the version with the lowest loss
BEST=$(for f in nnue_weights_v*_metadata.json; do
    echo "$f $(grep final_val_loss "$f" | cut -d: -f2)"
done | sort -k2 -n | head -1 | cut -d_ -f3 | cut -d. -f1)

python export_weights.py nnue_weights_$BEST.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
Archive old versions:

```bash
mkdir -p archive
mv nnue_weights_v{1,2,3}.pt archive/
mv nnue_weights_v{1,2,3}_metadata.json archive/
# Keep only v4+
```
## See Also

- `TRAINING_GUIDE.md` — Detailed examples and workflows
- `README_NNUE.md` — Complete pipeline documentation
- `train_nnue.py --help` — Command-line reference