# NNUE Training Guide: Incremental Training & Versioning

## Overview

The improved `train_nnue.py` now supports:

1. **Incremental training** — Resume from a checkpoint and continue training on new data
2. **Automatic versioning** — Each training run is saved as `nnue_weights_v{N}.pt`
3. **Metadata tracking** — Date, positions, depth, and losses stored in JSON
4. **CLI flags** — Full control over training parameters

## Quick Start

### First Training Run (Fresh Start)

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```

This saves:

- `nnue_weights_v1.pt` — The trained weights
- `nnue_weights_v1_metadata.json` — Training metadata

### Continue Training (Incremental)

Add more positions to `training_data.jsonl`, then:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```

The trainer will:

1. Detect that `nnue_weights.pt` exists
2. Load it as a checkpoint automatically
3. Continue training on all data
4. Save the result as `nnue_weights_v2.pt` with updated metadata

Alternatively, specify a checkpoint explicitly:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```

## Advanced Usage

### Custom Training Parameters

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
    --epochs 30 \
    --batch-size 2048 \
    --lr 5e-4 \
    --stockfish-depth 14
```

- `--epochs` — How many passes through the data (default: 20)
- `--batch-size` — Samples per gradient update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth of the Stockfish evaluations (for metadata only)

### Explicit Checkpoint

Resume from a specific checkpoint (not `nnue_weights.pt`):

```bash
python train_nnue.py training_data_v2.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt
```

### Disable Versioning

Save directly to the output file without versioning:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
```

This overwrites `nnue_weights.pt` instead of creating `nnue_weights_v2.pt`.
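With versioning enabled, the next version number is inferred from the files already present. A rough shell sketch of that numbering scheme (illustrative only, not the script's actual implementation):

```bash
# Find the highest existing version of nnue_weights_v{N}.pt
# and compute the next one (sketch; train_nnue.py does this internally)
latest=$(ls nnue_weights_v*.pt 2>/dev/null \
  | sed 's/.*_v\([0-9]*\)\.pt/\1/' | sort -n | tail -n 1)
next=$(( ${latest:-0} + 1 ))
echo "next run would be saved as nnue_weights_v${next}.pt"
```

If no versioned file exists yet, `latest` is empty and the next version is v1.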
## Incremental Training Workflow

Typical workflow for improving the model over time:

**Step 1: Initial Training**

```bash
# Generate 500K positions with Stockfish
./run_pipeline.sh

# This saves:
# - nnue_weights_v1.pt
# - nnue_weights_v1_metadata.json
```

**Step 2: Generate More Positions**

```bash
# Later, generate 500K more positions
# Append to training_data.jsonl or create a new one

# Label with Stockfish at depth 16 (more thorough)
python label_positions.py positions_batch2.txt training_data_batch2.jsonl stockfish --stockfish-depth 16

# Combine datasets
cat training_data_batch1.jsonl training_data_batch2.jsonl > training_data_combined.jsonl
```

**Step 3: Continue Training**

```bash
# Train on the combined data, starting from the v1 checkpoint
python train_nnue.py training_data_combined.jsonl nnue_weights.pt

# Saves:
# - nnue_weights_v2.pt (improved)
# - nnue_weights_v2_metadata.json
```

**Step 4: Benchmark & Choose**

```bash
# Test both versions in matches
# If v2 is better, use it; otherwise keep v1

# Update NNUEWeights.scala with the best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```

## Metadata File Format

Each training session generates a JSON metadata file, e.g., `nnue_weights_v2_metadata.json`:

```json
{
  "version": 2,
  "date": "2026-04-07T21:45:30.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
```

### Fields

- **version**: Training version number (v1, v2, etc.)
- **date**: ISO timestamp of training start
- **num_positions**: Total positions in the dataset
- **stockfish_depth**: Depth of the Stockfish evaluations (from the command-line flag)
- **epochs**: Number of training passes
- **batch_size**: Training batch size
- **learning_rate**: Adam optimizer learning rate
- **final_val_loss**: Best validation loss achieved
- **device**: GPU (cuda) or CPU used for training
- **checkpoint**: Previous model used as the starting point (null if trained from scratch)
- **notes**: Win rate comparison (currently TBD — requires benchmark)

## Checkpoint Logic

When you run training, the trainer checks for checkpoints in this order:

1. **Explicit checkpoint** — If you provide `--checkpoint`, use it
2. **Auto-detect** — If the output file exists (e.g., `nnue_weights.pt`), load it
3. **From scratch** — Otherwise, initialize with random weights

Example:

```bash
# First run: from scratch (no nnue_weights.pt exists)
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Creates v1 from scratch, saves as nnue_weights_v1.pt

# Second run: auto-detects nnue_weights.pt as the checkpoint
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Loads nnue_weights_v1.pt (because nnue_weights.pt = v1), saves as v2

# Third run: explicit checkpoint
python train_nnue.py training_data_huge.jsonl nnue_weights.pt --checkpoint nnue_weights_v2.pt
# → Loads v2, saves as v3
```

## Resuming Interrupted Training

If training is interrupted (power loss, ^C), you can resume:

```bash
# Original command
python train_nnue.py training_data.jsonl nnue_weights.pt

# If interrupted, rerunning the same command will:
# 1. Detect that nnue_weights_v1.pt exists (or a higher version)
# 2. Auto-load it as the checkpoint
# 3. Resume training
# 4. Save the next version (v2, v3, etc.)
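
# Before rerunning, you can check which versioned checkpoint exists
# (highest version number = most recent; illustrative check using the
# nnue_weights_v{N}.pt naming from above):
ls nnue_weights_v*.pt 2>/dev/null | sort -V | tail -n 1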
```

## Performance Tips

### Reduce Training Time

```bash
# Smaller batch size = slower but uses less memory
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024

# Fewer epochs
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5

# Lower learning rate = slower convergence but more stable
python train_nnue.py training_data.jsonl nnue_weights.pt --lr 5e-4
```

### Accelerate on GPU

If you have an NVIDIA GPU with CUDA:

```bash
# Training will automatically use CUDA
# Check the metadata "device" field: it should be "cuda", not "cpu"
python train_nnue.py training_data.jsonl nnue_weights.pt
```

If training uses the CPU even though a GPU is available:

```bash
# Reinstall PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

### Efficient Incremental Training

```bash
# Fine-tune v1 on slightly different data (few epochs, reduced learning rate)
python train_nnue.py new_positions.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt \
    --epochs 3 \
    --lr 5e-4

# Full retraining on combined data (slower, better)
python train_nnue.py all_positions.jsonl nnue_weights.pt \
    --checkpoint nnue_weights_v1.pt \
    --epochs 20 \
    --lr 1e-3
```

## Version Management

### List All Versions

```bash
ls -la nnue_weights_v*.pt
ls -la nnue_weights_v*_metadata.json
```

### Compare Versions

```bash
grep "final_val_loss" nnue_weights_v1_metadata.json
grep "final_val_loss" nnue_weights_v2_metadata.json
grep "final_val_loss" nnue_weights_v3_metadata.json
```

A lower validation loss indicates a better model (on held-out data).
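The per-version comparison above can also be done in one pass. This sketch (assuming the metadata layout shown earlier, with `final_val_loss` on its own line followed by a comma) prints each version's loss sorted ascending, so the best model comes first:

```bash
# One line per version: "loss  metadata-file", lowest validation loss first
grep '"final_val_loss"' nnue_weights_v*_metadata.json \
  | sed 's/\(.*\):.*: \(.*\),/\2 \1/' \
  | sort -n
```

With many versions this avoids opening each metadata file by hand.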
### Benchmark Best Version

After training multiple versions, benchmark them:

```bash
# Export v1 and play some games
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Export v2 and benchmark
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test

# Keep the best, archive the others
```

### Archive Old Versions

```bash
# Keep only recent versions
mkdir -p old_models
mv nnue_weights_v1.pt old_models/
mv nnue_weights_v1_metadata.json old_models/
```

## Troubleshooting

### "FileNotFoundError: training_data.jsonl not found"

```bash
# Make sure you're in the python/ directory
cd modules/bot/python

# Or provide the full path
python train_nnue.py /full/path/to/training_data.jsonl nnue_weights.pt
```

### "CUDA out of memory"

Reduce the batch size:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 2048
```

### Training seems slow (using CPU, not GPU)

```bash
# Check the metadata of a training run
grep device nnue_weights_v1_metadata.json

# If it says "cpu", reinstall PyTorch with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
```

### "checkpoint file corrupted"

```bash
# Start over from scratch (don't load the corrupted checkpoint)
python train_nnue.py training_data.jsonl nnue_weights_fresh.pt --no-versioning

# Or resume from an earlier version
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```

## Integration with Pipeline

The `run_pipeline.sh` script now supports incremental training:

```bash
# First run: generates data, trains v1
./run_pipeline.sh

# Add more positions
# ... generate more, label more ...
# Second run: trains on the combined data as v2
./run_pipeline.sh
```

## Example: Full Workflow

```bash
cd modules/bot/python

# Session 1: Initial training
chmod +x run_pipeline.sh
export STOCKFISH_PATH=/usr/bin/stockfish
./run_pipeline.sh
# Creates: nnue_weights_v1.pt, nnue_weights_v1_metadata.json

# Session 2: Improve with deeper analysis
# (manually evaluate more positions at depth 14)
python label_positions.py positions_v2.txt training_data_v2.jsonl \
    /usr/bin/stockfish --stockfish-depth 14

# Combine and retrain
cat training_data_v1.jsonl training_data_v2.jsonl > training_data_combined.jsonl
python train_nnue.py training_data_combined.jsonl nnue_weights.pt \
    --epochs 25 \
    --stockfish-depth 14
# Creates: nnue_weights_v2.pt, nnue_weights_v2_metadata.json

# Session 3: Benchmark and choose
# Test both v1 and v2 with matches...

# If v2 is better, export and use it
python export_weights.py nnue_weights_v2.pt \
    ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
cd ../..
./compile && ./test
```

## See Also

- `train_nnue.py --help` — Command-line help
- `README_NNUE.md` — Complete pipeline documentation
- `NNUE_IMPLEMENTATION_SUMMARY.md` — Technical architecture