NowChess/NowChessSystems

Janis 6a9ac55b31 feat: refactor AlphaBetaSearch and ClassicalBot for improved evaluation and organization

2026-04-16 18:49:58 +02:00

Incremental Training & Versioning: New Features

Summary

train_nnue.py now supports:

✅ Checkpoint Loading — Resume from previous models
✅ Automatic Versioning — v1, v2, v3... naming
✅ Metadata Tracking — Date, positions, losses, depth
✅ CLI Arguments — Full control via command line

Feature 1: Automatic Checkpoint Detection

When you run training, the trainer automatically looks for and loads existing weights:

# First run: nnue_weights.pt doesn't exist
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Trains from scratch, saves as nnue_weights_v1.pt

# Second run: nnue_weights.pt exists (symlink to v1)
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Auto-loads nnue_weights_v1.pt as checkpoint
# → Continues training
# → Saves as nnue_weights_v2.pt

No command-line flag is needed: existing weights are detected automatically.
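The detection step could look roughly like the sketch below. It is not taken from train_nnue.py: `find_latest_checkpoint` is a hypothetical helper, and it assumes detection works by scanning for the highest `_vN` file next to the output path. The real script may instead follow the `nnue_weights.pt` symlink.

```python
import re
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(output_file: str) -> Optional[Path]:
    """Return the highest-numbered <stem>_vN<suffix> next to output_file, or None.

    Hypothetical helper: assumes versioned files sit in the same directory
    as the output file and follow the <stem>_vN<suffix> naming shown above.
    """
    base = Path(output_file)
    pattern = re.compile(
        rf"^{re.escape(base.stem)}_v(\d+){re.escape(base.suffix)}$"
    )
    best_version, best_path = 0, None
    for candidate in base.parent.glob(f"{base.stem}_v*{base.suffix}"):
        match = pattern.match(candidate.name)
        # Compare numerically so that v10 ranks above v2.
        if match and int(match.group(1)) > best_version:
            best_version, best_path = int(match.group(1)), candidate
    return best_path


if __name__ == "__main__":
    checkpoint = find_latest_checkpoint("nnue_weights.pt")
    print(checkpoint if checkpoint else "no checkpoint found, training from scratch")
```

Comparing the version numbers as integers matters once there are ten or more versions, because a plain string sort would place v10 before v2.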

Feature 2: Explicit Checkpoint

Override auto-detection with --checkpoint:

# Use v1 as starting point, ignore any other weights
python train_nnue.py training_data.jsonl nnue_weights.pt \
  --checkpoint nnue_weights_v1.pt

# Or load from external checkpoint
python train_nnue.py training_data.jsonl nnue_weights.pt \
  --checkpoint /path/to/backup_model.pt

Feature 3: Automatic Versioning

Models are saved with version numbers:

First run:

nnue_weights_v1.pt          ← Model weights
nnue_weights_v1_metadata.json ← Training info

Second run:

nnue_weights_v2.pt          ← Model weights
nnue_weights_v2_metadata.json ← Training info

Third run:

nnue_weights_v3.pt
nnue_weights_v3_metadata.json

Disable with --no-versioning:

python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
# → Saves directly to nnue_weights.pt (no version number)
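As an illustration of the naming scheme, the next version's file names can be derived from the files already on disk. The function below is a hypothetical sketch, not the script's own code; it assumes the next version is one higher than the largest existing `_vN` number.

```python
import re
from pathlib import Path
from typing import Tuple


def next_versioned_paths(output_file: str) -> Tuple[Path, Path]:
    """Return (weights_path, metadata_path) for the next version.

    Hypothetical helper: assumes the <stem>_vN<suffix> and
    <stem>_vN_metadata.json naming shown in this section.
    """
    base = Path(output_file)
    pattern = re.compile(
        rf"^{re.escape(base.stem)}_v(\d+){re.escape(base.suffix)}$"
    )
    versions = [
        int(match.group(1))
        for path in base.parent.glob(f"{base.stem}_v*{base.suffix}")
        if (match := pattern.match(path.name))
    ]
    # No existing versions means this is the first run, so start at v1.
    next_version = max(versions, default=0) + 1
    weights = base.with_name(f"{base.stem}_v{next_version}{base.suffix}")
    metadata = base.with_name(f"{base.stem}_v{next_version}_metadata.json")
    return weights, metadata


if __name__ == "__main__":
    weights_path, metadata_path = next_versioned_paths("nnue_weights.pt")
    print(weights_path, metadata_path)
```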

Feature 4: Training Metadata

Each saved model comes with a JSON metadata file that records the training run:

{
  "version": 2,
  "date": "2026-04-07T15:30:45.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD"
}

Useful for:

Tracking progress: Compare final_val_loss across versions
Reproducibility: Know exactly how each model was trained
Debugging: Identify which position count and Stockfish depth produced the best results
Benchmarking: Record win rates (added manually to notes)
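To compare versions programmatically, the metadata files can be read with the standard `json` module. This is a minimal sketch that assumes the `version` and `final_val_loss` keys shown in the example above; `rank_versions` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import List, Tuple


def rank_versions(directory: str = ".") -> List[Tuple[int, float]]:
    """Return (version, final_val_loss) pairs sorted by ascending loss."""
    results = []
    for path in Path(directory).glob("nnue_weights_v*_metadata.json"):
        with path.open(encoding="utf-8") as handle:
            meta = json.load(handle)
        results.append((meta["version"], meta["final_val_loss"]))
    return sorted(results, key=lambda pair: pair[1])


if __name__ == "__main__":
    for version, loss in rank_versions():
        print(f"v{version}: final_val_loss={loss:.6f}")
```

Parsing the JSON avoids the trailing commas and whitespace that a grep-based comparison has to strip, and it sorts correctly when a loss is written in scientific notation.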

Feature 5: CLI Arguments

Full control over training via command-line flags:

python train_nnue.py training_data.jsonl nnue_weights.pt \
  --epochs 30 \
  --batch-size 2048 \
  --lr 5e-4 \
  --stockfish-depth 14 \
  --checkpoint nnue_weights_v1.pt

All flags:

--epochs: Number of training passes over the data (default: 20)
--batch-size: Samples per gradient update (default: 4096)
--lr: Learning rate (default: 1e-3)
--stockfish-depth: Stockfish depth that was used to label the data; recorded in metadata only (default: 12)
--checkpoint: Checkpoint file to resume from (default: auto-detect)
--no-versioning: Save directly to the output file without a version number

Workflow Examples

Scenario 1: Continuous Improvement

# Initial training: 500K positions
./run_pipeline.sh
# → nnue_weights_v1.pt created

# Add more positions (500K more)
python label_positions.py positions_v2.txt training_data_v2.jsonl stockfish

# Combine and retrain
cat training_data.jsonl training_data_v2.jsonl > all_data.jsonl
python train_nnue.py all_data.jsonl nnue_weights.pt
# → Loads v1, trains on all 1M positions
# → nnue_weights_v2.pt created

# Export best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala

Scenario 2: Hyperparameter Tuning

# Baseline
python train_nnue.py data.jsonl nnue_weights.pt
# → v1 with default settings

# Try lower learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 5e-4
# → v2 with lr=5e-4 (auto-detection resumes from v1, not from scratch)

# Try higher learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 2e-3
# → v3 with lr=2e-3 (resumes from v2)

# Compare metadata (grep prints the file name next to each match)
grep final_val_loss nnue_weights_v*_metadata.json
# → Pick the lowest loss

Scenario 3: Interrupted Training Resume

# Start training
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Epoch 30 of 50, then crash/interrupt

# Resume: same command
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Auto-detects checkpoint, continues from epoch 30
# → Completes to epoch 50

Command-Line Help

View all options:

python train_nnue.py --help

Output:

usage: train_nnue.py [-h] [--checkpoint CHECKPOINT] [--epochs EPOCHS]
                     [--batch-size BATCH_SIZE] [--lr LR]
                     [--stockfish-depth STOCKFISH_DEPTH] [--no-versioning]
                     [data_file] [output_file]

Train NNUE neural network for chess evaluation

positional arguments:
  data_file             Path to training_data.jsonl (default: training_data.jsonl)
  output_file           Output file base name (default: nnue_weights.pt)

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        Path to checkpoint file to resume training from (optional)
  --epochs EPOCHS       Number of epochs to train (default: 20)
  --batch-size BATCH_SIZE
                        Batch size (default: 4096)
  --lr LR               Learning rate (default: 1e-3)
  --stockfish-depth STOCKFISH_DEPTH
                        Stockfish depth used for evaluations (for metadata, default: 12)
  --no-versioning       Disable automatic versioning (save directly to output file)
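
The help text above is consistent with an argparse setup along the following lines. This is a sketch reconstructed from the printed usage, not the script's actual source; `build_parser` is a hypothetical name, and the argument types are assumptions inferred from the defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented train_nnue.py options."""
    parser = argparse.ArgumentParser(
        prog="train_nnue.py",
        description="Train NNUE neural network for chess evaluation",
    )
    # Both positionals are optional (nargs="?") and fall back to defaults.
    parser.add_argument("data_file", nargs="?", default="training_data.jsonl",
                        help="Path to training_data.jsonl (default: training_data.jsonl)")
    parser.add_argument("output_file", nargs="?", default="nnue_weights.pt",
                        help="Output file base name (default: nnue_weights.pt)")
    parser.add_argument("--checkpoint", default=None,
                        help="Path to checkpoint file to resume training from (optional)")
    parser.add_argument("--epochs", type=int, default=20,
                        help="Number of epochs to train (default: 20)")
    parser.add_argument("--batch-size", type=int, default=4096,
                        help="Batch size (default: 4096)")
    parser.add_argument("--lr", type=float, default=1e-3,
                        help="Learning rate (default: 1e-3)")
    parser.add_argument("--stockfish-depth", type=int, default=12,
                        help="Stockfish depth used for evaluations (for metadata, default: 12)")
    parser.add_argument("--no-versioning", action="store_true",
                        help="Disable automatic versioning (save directly to output file)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

argparse converts dashes in option names to underscores, so `--batch-size` is read as `args.batch_size` and `--no-versioning` as `args.no_versioning`.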

Key Differences from Previous Version

| Feature | Before | After |
|---|---|---|
| Checkpoint support | ❌ No | ✅ Yes (auto + explicit) |
| Versioning | ❌ Single file | ✅ v1, v2, v3... |
| Metadata tracking | ❌ No | ✅ JSON with all info |
| CLI arguments | ❌ Limited | ✅ Full argparse |
| Resumed training | ❌ Always from scratch | ✅ Resume from checkpoint |
| Training history | ❌ Lost | ✅ Tracked in metadata |

Integration with Pipeline

The run_pipeline.sh and run_pipeline.bat scripts automatically use versioning:

./run_pipeline.sh
# First run:
# - Generates data
# - Trains model
# - Creates nnue_weights_v1.pt + metadata
# - Exports to NNUEWeights.scala

# Second run:
# - Auto-detects v1, loads as checkpoint
# - Continues training on all data
# - Creates nnue_weights_v2.pt + metadata
# - Exports updated NNUEWeights.scala

Tips & Tricks

List all versions with losses:

for f in nnue_weights_v*_metadata.json; do
  # Quote "$f" so paths with spaces survive; match the quoted JSON keys only
  version=$(grep '"version"' "$f" | head -1)
  loss=$(grep '"final_val_loss"' "$f")
  echo "$version | $loss"
done

Auto-export best version:

# Find the version with the lowest final_val_loss
# (tr strips the spaces and trailing comma; sort -g also handles scientific notation)
BEST=$(for f in nnue_weights_v*_metadata.json; do
  echo "$f $(grep '"final_val_loss"' "$f" | cut -d: -f2 | tr -d ' ,')"
done | sort -k2 -g | head -1 | cut -d_ -f3 | cut -d. -f1)

python export_weights.py "nnue_weights_${BEST}.pt" ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala

Archive old versions:

mkdir -p archive
mv nnue_weights_v{1,2,3}.pt archive/
mv nnue_weights_v{1,2,3}_metadata.json archive/
# Keep only v4+
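
When the number of versions grows, listing them by hand in a brace expansion becomes tedious. The sketch below does the same archiving by version threshold. `archive_old_versions` is a hypothetical helper, not part of the repository, and it assumes the `nnue_weights_vN` naming used throughout this document.

```python
import re
import shutil
from pathlib import Path
from typing import List


def archive_old_versions(directory: str, keep_from: int,
                         archive_dir: str = "archive") -> List[str]:
    """Move weights and metadata with version < keep_from into archive_dir.

    Returns the names of the files that were moved.
    """
    root = Path(directory)
    target = root / archive_dir
    target.mkdir(parents=True, exist_ok=True)
    pattern = re.compile(r"^nnue_weights_v(\d+)(?:_metadata\.json|\.pt)$")
    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(root.iterdir()):
        match = pattern.match(path.name)
        if path.is_file() and match and int(match.group(1)) < keep_from:
            shutil.move(str(path), str(target / path.name))
            moved.append(path.name)
    return moved


if __name__ == "__main__":
    # Equivalent to the shell commands above: archive v1 to v3, keep v4+.
    print(archive_old_versions(".", keep_from=4))
```

If `nnue_weights.pt` is a symlink to one of the versioned files, archiving that version would leave the link dangling, so it is safest to keep the version it points to.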
