# Incremental Training & Versioning: New Features

## Summary

`train_nnue.py` now supports:

✅ **Checkpoint Loading** — Resume from previous models
✅ **Automatic Versioning** — v1, v2, v3... naming
✅ **Metadata Tracking** — Date, positions, losses, depth
✅ **CLI Arguments** — Full control via command line

---

## Feature 1: Automatic Checkpoint Detection

When you run training, the trainer automatically looks for and loads existing weights:

```bash
# First run: nnue_weights.pt doesn't exist
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Trains from scratch, saves as nnue_weights_v1.pt

# Second run: nnue_weights.pt exists (symlink to v1)
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Auto-loads nnue_weights_v1.pt as checkpoint
# → Continues training
# → Saves as nnue_weights_v2.pt
```

**No command-line flag needed** — existing weights are detected automatically.

---

## Feature 2: Explicit Checkpoint

Override auto-detection with `--checkpoint`:

```bash
# Use v1 as starting point, ignore any other weights
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt

# Or load from an external checkpoint
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --checkpoint /path/to/backup_model.pt
```

---

## Feature 3: Automatic Versioning

Models are saved with version numbers.

**First run:**

```
nnue_weights_v1.pt               ← Model weights
nnue_weights_v1_metadata.json    ← Training info
```

**Second run:**

```
nnue_weights_v2.pt               ← Model weights
nnue_weights_v2_metadata.json    ← Training info
```

**Third run:**

```
nnue_weights_v3.pt
nnue_weights_v3_metadata.json
```

Disable with `--no-versioning`:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
# → Saves directly to nnue_weights.pt (no version number)
```

---

## Feature 4: Training Metadata

Each model save includes a JSON metadata file tracking:

```json
{
  "version": 2,
  "date": "2026-04-07T15:30:45.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD"
}
```

### Useful for:

- **Tracking progress** — Compare val_loss across versions
- **Reproducibility** — Know exactly how each model was trained
- **Debugging** — Identify which positions/depth produced the best results
- **Benchmarking** — Record win rates (manually added to notes)

---

## Feature 5: CLI Arguments

Full control over training via command-line flags:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --epochs 30 \
    --batch-size 2048 \
    --lr 5e-4 \
    --stockfish-depth 14 \
    --checkpoint nnue_weights_v1.pt
```

**All flags:**

- `--epochs` — Number of training passes (default: 20)
- `--batch-size` — Samples per update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth for metadata (default: 12)
- `--checkpoint` — Resume from checkpoint (default: auto-detect)
- `--no-versioning` — Disable versioning

---

## Workflow Examples

### Scenario 1: Continuous Improvement

```bash
# Initial training: 500K positions
./run_pipeline.sh
# → nnue_weights_v1.pt created

# Add more positions (500K more)
python label_positions.py positions_v2.txt training_data_v2.jsonl stockfish

# Combine and retrain
cat training_data.jsonl training_data_v2.jsonl > all_data.jsonl
python train_nnue.py all_data.jsonl nnue_weights.pt
# → Loads v1, trains on all 1M positions
# → nnue_weights_v2.pt created

# Export best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```

### Scenario 2: Hyperparameter Tuning

```bash
# Baseline
python train_nnue.py data.jsonl nnue_weights.pt
# → v1 with default settings

# Try lower learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 5e-4
# → v2 with lr=5e-4

# Try higher learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 2e-3
# → v3 with lr=2e-3

# Compare metadata
cat nnue_weights_v*_metadata.json | grep final_val_loss
# → Pick the lowest loss
```

### Scenario 3: Interrupted Training Resume

```bash
# Start training
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Epoch 30 of 50, then crash/interrupt

# Resume: same command
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Auto-detects checkpoint, continues from epoch 30
# → Completes to epoch 50
```

---

## Command-Line Help

View all options:

```bash
python train_nnue.py --help
```

Output:

```
usage: train_nnue.py [-h] [--checkpoint CHECKPOINT] [--epochs EPOCHS]
                     [--batch-size BATCH_SIZE] [--lr LR]
                     [--stockfish-depth STOCKFISH_DEPTH] [--no-versioning]
                     [data_file] [output_file]

Train NNUE neural network for chess evaluation

positional arguments:
  data_file             Path to training_data.jsonl (default: training_data.jsonl)
  output_file           Output file base name (default: nnue_weights.pt)

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        Path to checkpoint file to resume training from (optional)
  --epochs EPOCHS       Number of epochs to train (default: 20)
  --batch-size BATCH_SIZE
                        Batch size (default: 4096)
  --lr LR               Learning rate (default: 1e-3)
  --stockfish-depth STOCKFISH_DEPTH
                        Stockfish depth used for evaluations (for metadata, default: 12)
  --no-versioning       Disable automatic versioning (save directly to output file)
```

---

## Key Differences from Previous Version

| Feature | Before | After |
|---------|--------|-------|
| Checkpoint support | ❌ No | ✅ Yes (auto + explicit) |
| Versioning | ❌ Single file | ✅ v1, v2, v3... |
| Metadata tracking | ❌ No | ✅ JSON with all info |
| CLI arguments | ❌ Limited | ✅ Full argparse |
| Resume training | ❌ Always from scratch | ✅ Resume from checkpoint |
| Training history | ❌ Lost | ✅ Tracked in metadata |

---

## Integration with Pipeline

The `run_pipeline.sh` and `run_pipeline.bat` scripts automatically use versioning:

```bash
./run_pipeline.sh

# First run:
# - Generates data
# - Trains model
# - Creates nnue_weights_v1.pt + metadata
# - Exports to NNUEWeights.scala

# Second run:
# - Auto-detects v1, loads as checkpoint
# - Continues training on all data
# - Creates nnue_weights_v2.pt + metadata
# - Exports updated NNUEWeights.scala
```

---

## Tips & Tricks

### List all versions with losses:

```bash
for f in nnue_weights_v*_metadata.json; do
    version=$(grep version "$f" | head -1)
    loss=$(grep final_val_loss "$f")
    echo "$version | $loss"
done
```

### Auto-export best version:

```bash
# Find version with lowest loss
BEST=$(for f in nnue_weights_v*_metadata.json; do
    echo "$f $(grep final_val_loss "$f" | cut -d: -f2)"
done | sort -k2 -n | head -1 | cut -d_ -f3 | cut -d. -f1)

python export_weights.py nnue_weights_$BEST.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```

### Archive old versions:

```bash
mkdir -p archive
mv nnue_weights_v{1,2,3}.pt archive/
mv nnue_weights_v{1,2,3}_metadata.json archive/
# Keep only v4+
```

---

## See Also

- `TRAINING_GUIDE.md` — Detailed examples and workflows
- `README_NNUE.md` — Complete pipeline documentation
- `train_nnue.py --help` — Command-line reference
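---

## Appendix: Versioning Logic Sketch

The auto-versioning scheme (scan for existing `_vN` files, then save as the next number) can be sketched in Python. This is an illustrative sketch only, not the actual code from `train_nnue.py`; `next_version_path` is a hypothetical helper name:

```python
import re
from pathlib import Path


def next_version_path(base: str) -> Path:
    """Given a base like 'nnue_weights.pt', return the next versioned path,
    e.g. 'nnue_weights_v3.pt' if v1 and v2 already exist."""
    base_path = Path(base)
    stem, suffix = base_path.stem, base_path.suffix  # 'nnue_weights', '.pt'
    # Match only files of the exact form <stem>_v<digits><suffix>.
    pattern = re.compile(rf"^{re.escape(stem)}_v(\d+){re.escape(suffix)}$")
    versions = [
        int(m.group(1))
        for p in base_path.parent.glob(f"{stem}_v*{suffix}")
        if (m := pattern.match(p.name))
    ]
    # No versioned files yet -> start at v1.
    next_v = max(versions, default=0) + 1
    return base_path.with_name(f"{stem}_v{next_v}{suffix}")
```

With `nnue_weights_v1.pt` and `nnue_weights_v2.pt` on disk, `next_version_path("nnue_weights.pt")` yields `nnue_weights_v3.pt`.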
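---

## Appendix: Comparing Versions in Python

As a cross-platform alternative to the shell loops in Tips & Tricks, the metadata files can be compared with a short Python script. This is a sketch assuming the metadata schema shown in Feature 4; `summarize_versions` is a hypothetical helper, not part of the pipeline:

```python
import glob
import json


def summarize_versions(pattern: str = "nnue_weights_v*_metadata.json"):
    """Collect (version, final_val_loss, date) tuples from metadata files,
    sorted so the model with the lowest validation loss comes first."""
    rows = []
    for path in glob.glob(pattern):
        with open(path) as f:
            meta = json.load(f)
        rows.append((meta["version"], meta["final_val_loss"], meta["date"]))
    rows.sort(key=lambda r: r[1])  # lowest validation loss first
    return rows


if __name__ == "__main__":
    for version, loss, date in summarize_versions():
        print(f"v{version}  loss={loss:.6f}  trained={date}")
```

The first row of the result names the version to pass to `export_weights.py`.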