# NNUE Training Guide: Incremental Training & Versioning
## Overview
The improved `train_nnue.py` now supports:
1. **Incremental training** — Resume from checkpoint, continue training on new data
2. **Automatic versioning** — Each training run saved as `nnue_weights_v{N}.pt`
3. **Metadata tracking** — Date, positions, depth, losses stored in JSON
4. **CLI flags** — Full control over training parameters
## Quick Start
### First Training Run (Fresh Start)
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```
This saves:
- `nnue_weights_v1.pt` — The trained weights
- `nnue_weights_v1_metadata.json` — Training metadata
### Continue Training (Incremental)
Add more positions to `training_data.jsonl`, then:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```
The trainer will:
1. Detect `nnue_weights.pt` exists
2. Load it as a checkpoint automatically
3. Continue training on all data
4. Save as `nnue_weights_v2.pt` with updated metadata
Alternatively, specify a checkpoint explicitly:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```
## Advanced Usage
### Custom Training Parameters
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
--epochs 30 \
--batch-size 2048 \
--lr 5e-4 \
--stockfish-depth 14
```
- `--epochs` — How many passes through the data (default: 20)
- `--batch-size` — Samples per gradient update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth of Stockfish evaluation (for metadata only)
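The flags above map onto a standard `argparse` interface. A minimal sketch of how `train_nnue.py` might declare them, using the defaults documented above (the `build_parser` helper and the `--stockfish-depth` default of 12 are illustrative assumptions, not the script's actual internals):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI surface described above; the real script may differ."""
    p = argparse.ArgumentParser(description="Train the NNUE evaluation network.")
    p.add_argument("data", help="JSONL file with labeled positions")
    p.add_argument("output", help="output weights file, e.g. nnue_weights.pt")
    p.add_argument("--epochs", type=int, default=20, help="passes through the data")
    p.add_argument("--batch-size", type=int, default=4096, help="samples per gradient update")
    p.add_argument("--lr", type=float, default=1e-3, help="Adam learning rate")
    p.add_argument("--stockfish-depth", type=int, default=12, help="recorded in metadata only")
    p.add_argument("--checkpoint", default=None, help="explicit checkpoint to resume from")
    p.add_argument("--no-versioning", action="store_true", help="overwrite output directly")
    return p

args = build_parser().parse_args(
    ["training_data.jsonl", "nnue_weights.pt", "--epochs", "30", "--lr", "5e-4"]
)
print(args.epochs, args.lr, args.batch_size)  # → 30 0.0005 4096
```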
### Explicit Checkpoint
Resume from a specific checkpoint (not `nnue_weights.pt`):
```bash
python train_nnue.py training_data_v2.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt
```
### Disable Versioning
Save directly to output file without versioning:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
```
This overwrites `nnue_weights.pt` instead of creating `nnue_weights_v2.pt`.
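The versioned filenames follow a simple pattern, so the next version number can be derived from what is already on disk. A stdlib sketch of that logic, assuming the `nnue_weights_v{N}.pt` convention above (`next_version_path` is an illustrative helper name, not necessarily the script's real function):

```python
import glob
import os
import re

def next_version_path(output: str) -> str:
    """Return the next nnue_weights_v{N}.pt path for the given output file."""
    stem, ext = os.path.splitext(output)  # e.g. "nnue_weights", ".pt"
    pattern = re.compile(re.escape(stem) + r"_v(\d+)" + re.escape(ext) + r"$")
    versions = [
        int(m.group(1))
        for f in glob.glob(f"{stem}_v*{ext}")
        if (m := pattern.match(f))
    ]
    # No versions yet → v1; otherwise one past the highest existing version.
    return f"{stem}_v{max(versions, default=0) + 1}{ext}"

print(next_version_path("nnue_weights.pt"))  # "nnue_weights_v1.pt" if none exist yet
```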
## Incremental Training Workflow
Typical workflow for improving the model over time:
**Step 1: Initial Training**
```bash
# Generate 500K positions with Stockfish
./run_pipeline.sh
# This saves:
# - nnue_weights_v1.pt
# - nnue_weights_v1_metadata.json
```
**Step 2: Generate More Positions**
```bash
# Later, generate 500K more positions
# Append to training_data.jsonl or create new one
# Label with Stockfish at depth 16 (more thorough)
python label_positions.py positions_batch2.txt training_data_batch2.jsonl stockfish --stockfish-depth 16
# Combine datasets
cat training_data_batch1.jsonl training_data_batch2.jsonl > training_data_combined.jsonl
```
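Plain `cat` will keep duplicate records if a position was labeled in both batches. A hedged stdlib sketch that concatenates JSONL files while dropping exact duplicate lines (it treats each line as an opaque string, so it assumes nothing about the record schema; `merge_jsonl` is an illustrative helper, not part of the pipeline):

```python
def merge_jsonl(inputs, output):
    """Concatenate JSONL files into `output`, skipping exact duplicate lines."""
    seen = set()
    kept = 0
    with open(output, "w") as out:
        for path in inputs:
            with open(path) as f:
                for line in f:
                    line = line.strip()
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")
                        kept += 1
    return kept

# merge_jsonl(["training_data_batch1.jsonl", "training_data_batch2.jsonl"],
#             "training_data_combined.jsonl")
```

Note the trade-offs: it holds every line in memory, and the same position labeled with two different evaluations is kept twice (only byte-identical lines are deduplicated).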
**Step 3: Continue Training**
```bash
# Train on combined data, starting from v1 checkpoint
python train_nnue.py training_data_combined.jsonl nnue_weights.pt
# Saves:
# - nnue_weights_v2.pt (improved)
# - nnue_weights_v2_metadata.json
```
**Step 4: Benchmark & Choose**
```bash
# Test both versions in matches
# If v2 is better, use it; otherwise keep v1
# Update NNUEWeights.scala with best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
## Metadata File Format
Each training session generates a JSON metadata file, e.g., `nnue_weights_v2_metadata.json`:
```json
{
"version": 2,
"date": "2026-04-07T21:45:30.123456",
"num_positions": 1000000,
"stockfish_depth": 12,
"epochs": 20,
"batch_size": 4096,
"learning_rate": 0.001,
"final_val_loss": 0.0234567,
"device": "cuda",
"checkpoint": "nnue_weights_v1.pt",
"notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
```
### Fields
- **version**: Training version number (v1, v2, etc.)
- **date**: ISO timestamp of training start
- **num_positions**: Total positions in dataset
- **stockfish_depth**: Depth of Stockfish evaluations (from command-line flag)
- **epochs**: Number of training passes
- **batch_size**: Training batch size
- **learning_rate**: Adam optimizer learning rate
- **final_val_loss**: Best validation loss achieved
- **device**: GPU (cuda) or CPU used for training
- **checkpoint**: Previous model used as starting point (null if from scratch)
- **notes**: Win rate comparison (currently TBD — requires benchmark)
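A sketch of how such a metadata file might be written with the stdlib (the field set is taken from the example above; the `write_metadata` helper name and signature are illustrative, not the script's actual code):

```python
import json
from datetime import datetime

def write_metadata(path, version, num_positions, stockfish_depth, epochs,
                   batch_size, learning_rate, final_val_loss, device, checkpoint):
    """Dump one training session's metadata next to the versioned weights."""
    meta = {
        "version": version,
        "date": datetime.now().isoformat(),
        "num_positions": num_positions,
        "stockfish_depth": stockfish_depth,
        "epochs": epochs,
        "batch_size": batch_size,
        "learning_rate": learning_rate,
        "final_val_loss": final_val_loss,
        "device": device,
        "checkpoint": checkpoint,  # None serializes to JSON null (from scratch)
        "notes": "Win rate vs classical eval: TBD (requires benchmark games)",
    }
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```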
## Checkpoint Logic
When you run training, the trainer checks for checkpoints in this order:
1. **Explicit checkpoint** — If you provide `--checkpoint`, use it
2. **Auto-detect** — If output file exists (e.g., `nnue_weights.pt`), load it
3. **From scratch** — Otherwise, initialize with random weights
Example:
```bash
# First run: from scratch (no nnue_weights.pt exists)
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Creates v1 from scratch, saves as nnue_weights_v1.pt
# Second run: auto-detect nnue_weights.pt as checkpoint
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Loads nnue_weights_v1.pt (because nnue_weights.pt = v1), saves as v2
# Third run: explicit checkpoint
python train_nnue.py training_data_huge.jsonl nnue_weights.pt --checkpoint nnue_weights_v2.pt
# → Loads v2, saves as v3
```
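The three-step resolution order above can be sketched as (function name and return convention are illustrative; `None` stands in for "initialize from scratch"):

```python
import os
from typing import Optional

def resolve_checkpoint(output: str, explicit: Optional[str] = None) -> Optional[str]:
    """Pick a checkpoint per the priority order above; None means from scratch."""
    if explicit:                # 1. an explicit --checkpoint always wins
        return explicit
    if os.path.exists(output):  # 2. auto-detect the output file if it exists
        return output
    return None                 # 3. otherwise fall back to random weights
```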
## Resuming Interrupted Training
If training is interrupted (power loss, ^C), you can resume:
```bash
# Original command
python train_nnue.py training_data.jsonl nnue_weights.pt
# If interrupted, rerunning the same command will:
# 1. Detect the existing weights file (per the checkpoint logic above)
# 2. Auto-load it as a checkpoint
# 3. Resume training
# 4. Save the next version (v2, v3, etc.)
```
## Performance Tips
### Balance Speed, Memory, and Stability
```bash
# Smaller batch size = slower but less memory
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024
# Fewer epochs
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5
# Lower learning rate = slower convergence but more stable
python train_nnue.py training_data.jsonl nnue_weights.pt --lr 5e-4
```
### Accelerate on GPU
If you have an NVIDIA GPU with CUDA:
```bash
# Training will automatically use CUDA
# Check metadata device field: should be "cuda" not "cpu"
python train_nnue.py training_data.jsonl nnue_weights.pt
```
If training uses CPU but GPU is available:
```bash
# Reinstall PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
### Efficient Incremental Training
```bash
# Fine-tune v1 on slightly different data (few epochs, reduced learning rate)
python train_nnue.py new_positions.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt \
--epochs 3 \
--lr 5e-4
# Full retraining on combined data (slower, better)
python train_nnue.py all_positions.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt \
--epochs 20 \
--lr 1e-3
```
## Version Management
### List All Versions
```bash
ls -la nnue_weights_v*.pt
ls -la nnue_weights_v*_metadata.json
```
### Compare Versions
```bash
grep "final_val_loss" nnue_weights_v*_metadata.json
```
A lower validation loss generally indicates a better model, but confirm with benchmark games before switching versions.
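The same comparison can also be scripted. A stdlib sketch that prints every version's loss and returns the best one, assuming only the metadata fields documented above (`best_version` is an illustrative helper):

```python
import glob
import json

def best_version(pattern="nnue_weights_v*_metadata.json"):
    """Return (path, final_val_loss) for the version with the lowest loss."""
    results = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            meta = json.load(f)
        results.append((path, meta["final_val_loss"]))
        print(f"{path}: v{meta['version']} loss={meta['final_val_loss']:.6f}")
    return min(results, key=lambda r: r[1]) if results else None

# best_version()  # prints one line per version, returns the lowest-loss one
```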
### Benchmark Best Version
After training multiple versions, benchmark them:
```bash
# Export v1 and play some games
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test
# Export v2 and benchmark
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test
# Keep the best, archive others
```
### Archive Old Versions
```bash
# Keep only recent versions
mkdir -p old_models
mv nnue_weights_v1.pt old_models/
mv nnue_weights_v1_metadata.json old_models/
```
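Moving old versions by hand scales poorly once there are many. A sketch that keeps the newest `keep` versions in place and moves the rest (plus their metadata) into the archive directory; the filename pattern is assumed from the convention above and `archive_old_versions` is an illustrative helper:

```python
import os
import re
import shutil

def archive_old_versions(keep=2, archive_dir="old_models"):
    """Move all but the newest `keep` weight versions (and metadata) aside."""
    pat = re.compile(r"^nnue_weights_v(\d+)\.pt$")
    versions = sorted(
        int(m.group(1)) for f in os.listdir(".") if (m := pat.match(f))
    )
    os.makedirs(archive_dir, exist_ok=True)
    # keep=0 archives everything; otherwise drop the last `keep` entries.
    for v in versions[:-keep] if keep else versions:
        for name in (f"nnue_weights_v{v}.pt", f"nnue_weights_v{v}_metadata.json"):
            if os.path.exists(name):
                shutil.move(name, os.path.join(archive_dir, name))

# archive_old_versions(keep=2)  # keeps the two newest versions in place
```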
## Troubleshooting
### "FileNotFoundError: training_data.jsonl not found"
```bash
# Make sure you're in the python/ directory
cd modules/bot/python
# Or provide full path
python train_nnue.py /full/path/to/training_data.jsonl nnue_weights.pt
```
### "CUDA out of memory"
Reduce batch size:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 2048
```
### Training seems slow (using CPU not GPU)
```bash
# Check metadata of a training run
grep device nnue_weights_v1_metadata.json
# If "cpu", reinstall PyTorch with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
```
### "checkpoint file corrupted"
```bash
# Start over from scratch (don't load corrupted checkpoint)
python train_nnue.py training_data.jsonl nnue_weights_fresh.pt --no-versioning
# Or resume from earlier version
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```
## Integration with Pipeline
The `run_pipeline.sh` script now supports incremental training:
```bash
# First run: generates data, trains v1
./run_pipeline.sh
# Add more positions
# ... generate more, label more ...
# Second run: trains on combined data as v2
./run_pipeline.sh
```
## Example: Full Workflow
```bash
cd modules/bot/python
# Session 1: Initial training
chmod +x run_pipeline.sh
export STOCKFISH_PATH=/usr/bin/stockfish
./run_pipeline.sh
# Creates: nnue_weights_v1.pt, nnue_weights_v1_metadata.json
# Session 2: Improve with deeper analysis
# (manually evaluate more positions at depth 14)
python label_positions.py positions_v2.txt training_data_v2.jsonl \
/usr/bin/stockfish --stockfish-depth 14
# Combine and retrain
cat training_data_v1.jsonl training_data_v2.jsonl > training_data_combined.jsonl
python train_nnue.py training_data_combined.jsonl nnue_weights.pt \
--epochs 25 \
--stockfish-depth 14
# Creates: nnue_weights_v2.pt, nnue_weights_v2_metadata.json
# Session 3: Benchmark and choose
# Test both v1 and v2 with matches...
# If v2 is better, export and use
python export_weights.py nnue_weights_v2.pt \
../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
cd ../..
./compile && ./test
```
## See Also
- `train_nnue.py --help` — Command-line help
- `README_NNUE.md` — Complete pipeline documentation
- `NNUE_IMPLEMENTATION_SUMMARY.md` — Technical architecture