# Incremental Training & Versioning: New Features

## Summary

`train_nnue.py` now supports:

- ✅ **Checkpoint Loading**: Resume from previous models
- ✅ **Automatic Versioning**: v1, v2, v3... naming
- ✅ **Metadata Tracking**: Date, positions, losses, depth
- ✅ **CLI Arguments**: Full control via command line

---

## Feature 1: Automatic Checkpoint Detection

When you run training, the trainer automatically looks for existing weights at the output path and loads them. No command-line flag is needed:

```bash
# First run: nnue_weights.pt doesn't exist
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Trains from scratch, saves as nnue_weights_v1.pt

# Second run: nnue_weights.pt exists (symlink to v1)
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Auto-loads nnue_weights_v1.pt as checkpoint
# → Continues training
# → Saves as nnue_weights_v2.pt
```

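The detection step can be pictured roughly as follows. This is a minimal sketch, not the code in `train_nnue.py`: the helper name `detect_checkpoint` is hypothetical, and it assumes the lookup is keyed on the output path and follows the symlink to the versioned file.

```python
from pathlib import Path
from typing import Optional


def detect_checkpoint(output_path: str) -> Optional[Path]:
    """Return the weights file to resume from, or None to train from scratch.

    Hypothetical helper: the name and the lookup rule are assumptions.
    """
    path = Path(output_path)
    # exists() follows symlinks, so a dangling link counts as "no checkpoint".
    if not path.exists():
        return None
    # Follow a link such as nnue_weights.pt -> nnue_weights_v1.pt.
    return path.resolve()
```
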
---

## Feature 2: Explicit Checkpoint

Override auto-detection with `--checkpoint`:

```bash
# Use v1 as starting point, ignore any other weights
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt

# Or load from external checkpoint
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --checkpoint /path/to/backup_model.pt
```

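The precedence between the explicit flag and auto-detection might look like the sketch below. The function name `choose_checkpoint` is hypothetical, and raising on a missing explicit file is an assumption about sensible behavior rather than something the script is documented to do.

```python
from pathlib import Path
from typing import Optional


def choose_checkpoint(explicit: Optional[str], output_path: str) -> Optional[Path]:
    """Pick the checkpoint to load; `explicit` models the --checkpoint value."""
    if explicit is not None:
        candidate = Path(explicit)
        if not candidate.is_file():
            # Assumption: fail loudly rather than silently train from scratch.
            raise FileNotFoundError(f"checkpoint not found: {candidate}")
        return candidate
    # No flag given: fall back to whatever already sits at the output path.
    fallback = Path(output_path)
    return fallback if fallback.exists() else None
```
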
---

## Feature 3: Automatic Versioning

Each training run saves its model under the next version number, together with a metadata file:

**First run:**
```
nnue_weights_v1.pt              ← Model weights
nnue_weights_v1_metadata.json   ← Training info
```

**Second run:**
```
nnue_weights_v2.pt              ← Model weights
nnue_weights_v2_metadata.json   ← Training info
```

**Third run:**
```
nnue_weights_v3.pt
nnue_weights_v3_metadata.json
```

Disable with `--no-versioning`:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
# → Saves directly to nnue_weights.pt (no version number)
```

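One way to derive the next pair of file names is to scan the output directory for existing `_vN` files. The sketch below assumes exactly the naming scheme shown in this section; the helper `next_versioned_paths` is hypothetical and may differ from what `train_nnue.py` does internally.

```python
import re
from pathlib import Path
from typing import Tuple


def next_versioned_paths(output_path: str) -> Tuple[Path, Path]:
    """Return (weights_path, metadata_path) for the next unused version."""
    base = Path(output_path)
    # Matches e.g. nnue_weights_v12.pt but not nnue_weights_v12_metadata.json.
    pattern = re.compile(rf"{re.escape(base.stem)}_v(\d+){re.escape(base.suffix)}")
    highest = 0
    for entry in base.parent.iterdir():
        match = pattern.fullmatch(entry.name)
        if match:
            highest = max(highest, int(match.group(1)))
    version = highest + 1
    weights = base.with_name(f"{base.stem}_v{version}{base.suffix}")
    metadata = base.with_name(f"{base.stem}_v{version}_metadata.json")
    return weights, metadata
```
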
---

## Feature 4: Training Metadata

Each saved model comes with a JSON metadata file that records how it was trained:

```json
{
  "version": 2,
  "date": "2026-04-07T15:30:45.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD"
}
```

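A metadata file with these fields could be produced along the following lines. This is only a sketch: `write_metadata` is a hypothetical helper, the keyword defaults mirror the CLI defaults listed in this document, and the real script may assemble the record differently.

```python
import json
from datetime import datetime
from pathlib import Path


def write_metadata(path, *, version, num_positions, final_val_loss,
                   checkpoint=None, stockfish_depth=12, epochs=20,
                   batch_size=4096, learning_rate=1e-3, device="cpu", notes=""):
    """Write one metadata record as JSON and return it (hypothetical helper)."""
    record = {
        "version": version,
        "date": datetime.now().isoformat(),  # local time, ISO 8601
        "num_positions": num_positions,
        "stockfish_depth": stockfish_depth,
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "final_val_loss": final_val_loss,
        "device": device,
        "checkpoint": checkpoint,  # None when training started from scratch
        "notes": notes,
    }
    Path(path).write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return record
```
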
### Useful for:

- **Tracking progress**: Compare `final_val_loss` across versions
- **Reproducibility**: Know exactly how each model was trained
- **Debugging**: Identify which positions/depth produced the best results
- **Benchmarking**: Record win rates (manually added to `notes`)

---

## Feature 5: CLI Arguments

Full control over training via command-line flags:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --epochs 30 \
    --batch-size 2048 \
    --lr 5e-4 \
    --stockfish-depth 14 \
    --checkpoint nnue_weights_v1.pt
```

**All flags:**

- `--epochs`: Number of training passes (default: 20)
- `--batch-size`: Samples per update (default: 4096)
- `--lr`: Learning rate (default: 1e-3)
- `--stockfish-depth`: Stockfish depth used for the evaluations, recorded in the metadata (default: 12)
- `--checkpoint`: Resume from checkpoint (default: auto-detect)
- `--no-versioning`: Disable versioning and save directly to the output file

---

## Workflow Examples

### Scenario 1: Continuous Improvement

```bash
# Initial training: 500K positions
./run_pipeline.sh
# → nnue_weights_v1.pt created

# Add more positions (500K more)
python label_positions.py positions_v2.txt training_data_v2.jsonl stockfish

# Combine and retrain
cat training_data.jsonl training_data_v2.jsonl > all_data.jsonl
python train_nnue.py all_data.jsonl nnue_weights.pt
# → Loads v1, trains on all 1M positions
# → nnue_weights_v2.pt created

# Export best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```

### Scenario 2: Hyperparameter Tuning

```bash
# Baseline
python train_nnue.py data.jsonl nnue_weights.pt
# → v1 with default settings

# Try lower learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 5e-4
# → v2 with lr=5e-4

# Try higher learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 2e-3
# → v3 with lr=2e-3

# Compare metadata (grep prefixes each match with its file name)
grep final_val_loss nnue_weights_v*_metadata.json
# → Pick the lowest loss
# Note: with auto-detection (Feature 1), each run starts from the previous
# version's weights, so these losses do not come from independent runs.
```

### Scenario 3: Interrupted Training Resume

```bash
# Start training
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Epoch 30 of 50, then crash/interrupt

# Resume: same command
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Auto-detects checkpoint, continues from epoch 30
# → Completes to epoch 50
```

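Continuing from a given epoch presupposes that progress is persisted while training runs. This document does not say how `train_nnue.py` does that, so the following is purely an illustrative sketch: the JSON sidecar file, the `completed_epochs` key, and the `train_one_epoch` callback are all assumptions.

```python
import json
from pathlib import Path


def run_epochs(total_epochs, state_path, train_one_epoch):
    """Run the epochs that are still missing and return the list of epochs run.

    `state_path` is a hypothetical JSON sidecar storing the last finished
    epoch; `train_one_epoch(epoch)` stands in for the real training step.
    """
    state_file = Path(state_path)
    done = 0
    if state_file.exists():
        done = json.loads(state_file.read_text())["completed_epochs"]
    ran = []
    for epoch in range(done + 1, total_epochs + 1):
        train_one_epoch(epoch)
        # Record progress only after the epoch finished successfully.
        state_file.write_text(json.dumps({"completed_epochs": epoch}))
        ran.append(epoch)
    return ran
```
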
---

## Command-Line Help

View all options:

```bash
python train_nnue.py --help
```

Output:
```
usage: train_nnue.py [-h] [--checkpoint CHECKPOINT] [--epochs EPOCHS]
                     [--batch-size BATCH_SIZE] [--lr LR]
                     [--stockfish-depth STOCKFISH_DEPTH] [--no-versioning]
                     [data_file] [output_file]

Train NNUE neural network for chess evaluation

positional arguments:
  data_file             Path to training_data.jsonl (default: training_data.jsonl)
  output_file           Output file base name (default: nnue_weights.pt)

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        Path to checkpoint file to resume training from (optional)
  --epochs EPOCHS       Number of epochs to train (default: 20)
  --batch-size BATCH_SIZE
                        Batch size (default: 4096)
  --lr LR               Learning rate (default: 1e-3)
  --stockfish-depth STOCKFISH_DEPTH
                        Stockfish depth used for evaluations (for metadata, default: 12)
  --no-versioning       Disable automatic versioning (save directly to output file)
```

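A parser that yields help text of this shape can be declared with the standard `argparse` module. The sketch below is reconstructed from the help output above; the function name `build_parser` and the argument types (`int`, `float`) are assumptions, not taken from the script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (reconstruction)."""
    parser = argparse.ArgumentParser(
        prog="train_nnue.py",
        description="Train NNUE neural network for chess evaluation",
    )
    # Both positionals are optional and fall back to the documented defaults.
    parser.add_argument("data_file", nargs="?", default="training_data.jsonl",
                        help="Path to training_data.jsonl (default: training_data.jsonl)")
    parser.add_argument("output_file", nargs="?", default="nnue_weights.pt",
                        help="Output file base name (default: nnue_weights.pt)")
    parser.add_argument("--checkpoint", default=None,
                        help="Path to checkpoint file to resume training from (optional)")
    parser.add_argument("--epochs", type=int, default=20,
                        help="Number of epochs to train (default: 20)")
    parser.add_argument("--batch-size", type=int, default=4096,
                        help="Batch size (default: 4096)")
    parser.add_argument("--lr", type=float, default=1e-3,
                        help="Learning rate (default: 1e-3)")
    parser.add_argument("--stockfish-depth", type=int, default=12,
                        help="Stockfish depth used for evaluations (for metadata, default: 12)")
    parser.add_argument("--no-versioning", action="store_true",
                        help="Disable automatic versioning (save directly to output file)")
    return parser
```
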
---

## Key Differences from Previous Version

| Feature | Before | After |
|---------|--------|-------|
| Checkpoint support | ❌ No | ✅ Yes (auto + explicit) |
| Versioning | ❌ Single file | ✅ v1, v2, v3... |
| Metadata tracking | ❌ No | ✅ JSON with all info |
| CLI arguments | ❌ Limited | ✅ Full argparse |
| Resumed training | ❌ Always from scratch | ✅ Resume from checkpoint |
| Training history | ❌ Lost | ✅ Tracked in metadata |

---

## Integration with Pipeline

The `run_pipeline.sh` and `run_pipeline.bat` scripts automatically use versioning:

```bash
./run_pipeline.sh
# First run:
# - Generates data
# - Trains model
# - Creates nnue_weights_v1.pt + metadata
# - Exports to NNUEWeights.scala

# Second run:
# - Auto-detects v1, loads as checkpoint
# - Continues training on all data
# - Creates nnue_weights_v2.pt + metadata
# - Exports updated NNUEWeights.scala
```

---

## Tips & Tricks

### List all versions with losses:

```bash
for f in nnue_weights_v*_metadata.json; do
    version=$(grep version "$f" | head -1)
    loss=$(grep final_val_loss "$f")
    echo "$version | $loss"
done
```

### Auto-export best version:

```bash
# Find version with lowest loss (BEST ends up as e.g. "v2")
BEST=$(for f in nnue_weights_v*_metadata.json; do
    echo "$f $(grep final_val_loss "$f" | cut -d: -f2)"
done | sort -k2 -n | head -1 | cut -d_ -f3 | cut -d. -f1)

python export_weights.py "nnue_weights_$BEST.pt" ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```

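The same selection can be done by parsing the JSON instead of grepping it, which avoids depending on the exact text layout of the files. A minimal sketch, assuming the metadata fields shown in Feature 4; `best_version` is a hypothetical helper and not part of the pipeline.

```python
import json
from pathlib import Path


def best_version(directory=".", pattern="nnue_weights_v*_metadata.json"):
    """Return (version, final_val_loss) for the lowest validation loss."""
    records = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in Path(directory).glob(pattern)
    ]
    if not records:
        raise FileNotFoundError(f"no metadata files matching {pattern!r}")
    best = min(records, key=lambda record: record["final_val_loss"])
    return best["version"], best["final_val_loss"]
```
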
### Archive old versions:

```bash
mkdir -p archive
mv nnue_weights_v{1,2,3}.pt archive/
mv nnue_weights_v{1,2,3}_metadata.json archive/
# Keep only v4+
```

---

## See Also

- `TRAINING_GUIDE.md`: Detailed examples and workflows
- `README_NNUE.md`: Complete pipeline documentation
- `train_nnue.py --help`: Command-line reference