# Incremental Training & Versioning: New Features

## Summary

`train_nnue.py` now supports:

- ✅ **Checkpoint Loading**: Resume from previous models
- ✅ **Automatic Versioning**: v1, v2, v3... naming
- ✅ **Metadata Tracking**: Date, positions, losses, depth
- ✅ **CLI Arguments**: Full control via command line

---

## Feature 1: Automatic Checkpoint Detection

When you run training, the trainer automatically looks for and loads existing weights:

```bash
# First run: nnue_weights.pt doesn't exist
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Trains from scratch, saves as nnue_weights_v1.pt

# Second run: nnue_weights.pt exists (symlink to v1)
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Auto-loads nnue_weights_v1.pt as checkpoint
# → Continues training
# → Saves as nnue_weights_v2.pt
```

**No command-line flag is needed:** existing weights are detected automatically.
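
The detection step could look roughly like the sketch below. It is not taken from `train_nnue.py`: the helper name `find_latest_checkpoint` and the rule "pick the highest `_v<N>` file next to the output file" are assumptions based on the file names shown above.

```python
import re
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(output_file: str) -> Optional[Path]:
    """Return the highest-numbered <stem>_v<N><suffix> beside output_file, or None.

    Hypothetical helper: the real trainer may detect checkpoints differently.
    """
    out = Path(output_file)
    pattern = re.compile(
        rf"^{re.escape(out.stem)}_v(\d+){re.escape(out.suffix)}$"
    )
    best_version, best_path = 0, None
    for candidate in out.parent.glob(f"{out.stem}_v*{out.suffix}"):
        match = pattern.match(candidate.name)
        # Compare versions numerically so that v10 ranks above v9.
        if match and int(match.group(1)) > best_version:
            best_version, best_path = int(match.group(1)), candidate
    return best_path
```

Calling `find_latest_checkpoint("nnue_weights.pt")` in an empty directory returns `None`, which corresponds to the "train from scratch" case.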

---

## Feature 2: Explicit Checkpoint

Override auto-detection with `--checkpoint`:

```bash
# Use v1 as starting point, ignore any other weights
python train_nnue.py training_data.jsonl nnue_weights.pt \
  --checkpoint nnue_weights_v1.pt

# Or load from external checkpoint
python train_nnue.py training_data.jsonl nnue_weights.pt \
  --checkpoint /path/to/backup_model.pt
```

---

## Feature 3: Automatic Versioning

Models are saved with version numbers:

**First run:**
```
nnue_weights_v1.pt          ← Model weights
nnue_weights_v1_metadata.json ← Training info
```

**Second run:**
```
nnue_weights_v2.pt          ← Model weights
nnue_weights_v2_metadata.json ← Training info
```

**Third run:**
```
nnue_weights_v3.pt
nnue_weights_v3_metadata.json
```

Disable with `--no-versioning`:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
# → Saves directly to nnue_weights.pt (no version number)
```
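
The naming scheme above can be expressed as a small sketch. The function names `weights_path` and `metadata_path` are hypothetical and only mirror the file names listed in this section; the trainer's own code may build them differently.

```python
from pathlib import Path


def weights_path(output_file: str, version: int, versioning: bool = True) -> Path:
    """Return where the weights go: <stem>_v<N><suffix>, or output_file as-is."""
    out = Path(output_file)
    if not versioning:
        # Mirrors --no-versioning: save directly to the given file name.
        return out
    return out.with_name(f"{out.stem}_v{version}{out.suffix}")


def metadata_path(output_file: str, version: int) -> Path:
    """Return the metadata file name that accompanies a versioned model."""
    out = Path(output_file)
    return out.with_name(f"{out.stem}_v{version}_metadata.json")
```

For example, `weights_path("nnue_weights.pt", 2)` and `metadata_path("nnue_weights.pt", 2)` yield the two file names listed under "Second run".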

---

## Feature 4: Training Metadata

Each saved model is accompanied by a JSON metadata file that records how it was trained:

```json
{
  "version": 2,
  "date": "2026-04-07T15:30:45.123456",
  "num_positions": 1000000,
  "stockfish_depth": 12,
  "epochs": 20,
  "batch_size": 4096,
  "learning_rate": 0.001,
  "final_val_loss": 0.0234567,
  "device": "cuda",
  "checkpoint": "nnue_weights_v1.pt",
  "notes": "Win rate vs classical eval: TBD"
}
```

### Useful for:
- **Tracking progress** — Compare val_loss across versions
- **Reproducibility** — Know exactly how each model was trained
- **Debugging** — Identify which positions/depth produced best results
- **Benchmarking** — Record win rates (manually added to notes)
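
A metadata file with the fields shown above could be written along these lines. This is a minimal sketch: `write_metadata` is a hypothetical name, and the trainer may collect or serialize these values differently.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_metadata(
    path: str,
    *,
    version: int,
    num_positions: int,
    stockfish_depth: int,
    epochs: int,
    batch_size: int,
    learning_rate: float,
    final_val_loss: float,
    device: str,
    checkpoint: Optional[str] = None,
    notes: str = "",
) -> dict:
    """Write one metadata record as pretty-printed JSON and return it."""
    record = {
        "version": version,
        "date": datetime.now().isoformat(),  # ISO 8601 timestamp, local time
        "num_positions": num_positions,
        "stockfish_depth": stockfish_depth,
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "final_val_loss": final_val_loss,
        "device": device,
        "checkpoint": checkpoint,  # None when training from scratch (assumption)
        "notes": notes,
    }
    Path(path).write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return record
```

Because the file is plain JSON, win rates or other benchmark results can be added to `notes` by hand afterwards.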

---

## Feature 5: CLI Arguments

Full control over training via command-line flags:

```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
  --epochs 30 \
  --batch-size 2048 \
  --lr 5e-4 \
  --stockfish-depth 14 \
  --checkpoint nnue_weights_v1.pt
```

**All flags:**
- `--epochs`: Number of training passes over the data (default: 20)
- `--batch-size`: Samples per update (default: 4096)
- `--lr`: Learning rate (default: 1e-3)
- `--stockfish-depth`: Stockfish depth used to label the data; recorded in the metadata only (default: 12)
- `--checkpoint`: Checkpoint file to resume from (default: auto-detect)
- `--no-versioning`: Save directly to the output file instead of a versioned name
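
For readers who want to see how such an interface is typically declared, here is an `argparse` sketch reconstructed from the flags, defaults, and help text documented in this file. It is an approximation, not a copy of the parser in `train_nnue.py`; the function name `build_parser` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented flags (reconstruction, not the original)."""
    parser = argparse.ArgumentParser(
        prog="train_nnue.py",
        description="Train NNUE neural network for chess evaluation",
    )
    # Both positionals are optional and fall back to the documented defaults.
    parser.add_argument("data_file", nargs="?", default="training_data.jsonl",
                        help="Path to training_data.jsonl")
    parser.add_argument("output_file", nargs="?", default="nnue_weights.pt",
                        help="Output file base name")
    parser.add_argument("--checkpoint", default=None,
                        help="Path to checkpoint file to resume training from")
    parser.add_argument("--epochs", type=int, default=20,
                        help="Number of epochs to train")
    parser.add_argument("--batch-size", type=int, default=4096,
                        help="Batch size")
    parser.add_argument("--lr", type=float, default=1e-3,
                        help="Learning rate")
    parser.add_argument("--stockfish-depth", type=int, default=12,
                        help="Stockfish depth used for evaluations (metadata only)")
    parser.add_argument("--no-versioning", action="store_true",
                        help="Save directly to the output file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Note that `argparse` exposes dashed flags with underscores, so `--batch-size` becomes `args.batch_size` and `--no-versioning` becomes `args.no_versioning`.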

---

## Workflow Examples

### Scenario 1: Continuous Improvement

```bash
# Initial training: 500K positions
./run_pipeline.sh
# → nnue_weights_v1.pt created

# Add more positions (500K more)
python label_positions.py positions_v2.txt training_data_v2.jsonl stockfish

# Combine and retrain
cat training_data.jsonl training_data_v2.jsonl > all_data.jsonl
python train_nnue.py all_data.jsonl nnue_weights.pt
# → Loads v1, trains on all 1M positions
# → nnue_weights_v2.pt created

# Export best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```

### Scenario 2: Hyperparameter Tuning

```bash
# Baseline
python train_nnue.py data.jsonl nnue_weights.pt
# → v1 with default settings

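# Try lower learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 5e-4
# → v2 with lr=5e-4 (auto-detection resumes from v1, so this fine-tunes v1
#   rather than retraining from scratch)

# Try higher learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 2e-3
# → v3 with lr=2e-3 (again resumes from the existing weights)

# Compare metadata (grep prints the file name next to each loss)
grep final_val_loss nnue_weights_v*_metadata.json
# → Pick the lowest loss, keeping in mind that later versions have
#   trained for more epochs in total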
```

### Scenario 3: Interrupted Training Resume

```bash
# Start training
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Epoch 30 of 50, then crash/interrupt

# Resume: same command
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Auto-detects checkpoint, continues from epoch 30
# → Completes to epoch 50
```

---

## Command-Line Help

View all options:

```bash
python train_nnue.py --help
```

Output:
```
usage: train_nnue.py [-h] [--checkpoint CHECKPOINT] [--epochs EPOCHS]
                     [--batch-size BATCH_SIZE] [--lr LR]
                     [--stockfish-depth STOCKFISH_DEPTH] [--no-versioning]
                     [data_file] [output_file]

Train NNUE neural network for chess evaluation

positional arguments:
  data_file             Path to training_data.jsonl (default: training_data.jsonl)
  output_file           Output file base name (default: nnue_weights.pt)

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
                        Path to checkpoint file to resume training from (optional)
  --epochs EPOCHS       Number of epochs to train (default: 20)
  --batch-size BATCH_SIZE
                        Batch size (default: 4096)
  --lr LR               Learning rate (default: 1e-3)
  --stockfish-depth STOCKFISH_DEPTH
                        Stockfish depth used for evaluations (for metadata, default: 12)
  --no-versioning       Disable automatic versioning (save directly to output file)
```

---

## Key Differences from Previous Version

| Feature | Before | After |
|---------|--------|-------|
| Checkpoint support | ❌ No | ✅ Yes (auto + explicit) |
| Versioning | ❌ Single file | ✅ v1, v2, v3... |
| Metadata tracking | ❌ No | ✅ JSON with all info |
| CLI arguments | ❌ Limited | ✅ Full argparse |
| Resumed training | ❌ Always from scratch | ✅ Resume from checkpoint |
| Training history | ❌ Lost | ✅ Tracked in metadata |

---

## Integration with Pipeline

The `run_pipeline.sh` and `run_pipeline.bat` scripts automatically use versioning:

```bash
./run_pipeline.sh
# First run:
# - Generates data
# - Trains model
# - Creates nnue_weights_v1.pt + metadata
# - Exports to NNUEWeights.scala

# Second run:
# - Auto-detects v1, loads as checkpoint
# - Continues training on all data
# - Creates nnue_weights_v2.pt + metadata
# - Exports updated NNUEWeights.scala
```

---

## Tips & Tricks

### List all versions with losses:

```bash
for f in nnue_weights_v*_metadata.json; do
  # Quote "$f" so paths with spaces survive; match the JSON keys exactly
  version=$(grep '"version"' "$f" | head -1)
  loss=$(grep '"final_val_loss"' "$f")
  echo "$version | $loss"
done
```

### Auto-export best version:

```bash
# Find version with lowest loss (BEST holds a version tag such as v2)
BEST=$(for f in nnue_weights_v*_metadata.json; do
  echo "$f $(grep final_val_loss "$f" | cut -d: -f2)"
done | sort -k2 -n | head -1 | cut -d_ -f3 | cut -d. -f1)

python export_weights.py "nnue_weights_${BEST}.pt" ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
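
The shell pipeline above depends on the exact layout of the JSON text. As an alternative, the same selection can be sketched in Python by parsing the metadata files properly. The function name `best_weights_file` is hypothetical, and the sketch assumes the `<stem>_v<N>_metadata.json` naming described earlier.

```python
import json
from pathlib import Path
from typing import Optional


def best_weights_file(directory: str = ".", stem: str = "nnue_weights") -> Optional[Path]:
    """Return the .pt path whose metadata reports the lowest final_val_loss."""
    best_loss, best_path = float("inf"), None
    for meta_file in Path(directory).glob(f"{stem}_v*_metadata.json"):
        try:
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
            loss = float(meta["final_val_loss"])
        except (OSError, ValueError, KeyError, TypeError):
            # Skip unreadable, malformed, or incomplete metadata files.
            continue
        if loss < best_loss:
            best_loss = loss
            # nnue_weights_v2_metadata.json -> nnue_weights_v2.pt
            best_path = meta_file.with_name(
                meta_file.name.replace("_metadata.json", ".pt")
            )
    return best_path


if __name__ == "__main__":
    print(best_weights_file())
```

The printed path can then be passed to `export_weights.py` in place of the `$BEST` substitution.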

### Archive old versions:

```bash
mkdir -p archive
mv nnue_weights_v{1,2,3}.pt archive/
mv nnue_weights_v{1,2,3}_metadata.json archive/
# Keep only v4+
```

---

## See Also

- `TRAINING_GUIDE.md` — Detailed examples and workflows
- `README_NNUE.md` — Complete pipeline documentation
- `train_nnue.py --help` — Command-line reference