feat: refactor AlphaBetaSearch and ClassicalBot for improved evaluation and organization

2026-04-07 22:46:44 +02:00
parent 558f43d0f6
commit 6a9ac55b31
28 changed files with 3618 additions and 12 deletions
# Debugging the NNUE Pipeline
## Common Issues & Solutions
### Issue 1: Empty training_data.jsonl
**Symptom:** After running the pipeline, `training_data.jsonl` is empty or doesn't exist.
**Diagnosis:** Run labeling with verbose output:
```bash
python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --verbose
```
**Check these in order:**
#### 1. Is `positions.txt` empty?
```bash
wc -l positions.txt
```
If 0 lines: positions generator is failing. See Issue 2.
If >0 lines: positions exist. Check step 2.
#### 2. Is Stockfish installed and working?
```bash
# Linux/macOS
which stockfish
stockfish --version
# Windows
where stockfish
C:\path\to\stockfish.exe --version
```
If not found: Install from https://stockfishchess.org
#### 3. Is the Stockfish path correct?
```bash
# Check what path the labeler is using
export STOCKFISH_PATH=/your/path/to/stockfish
echo $STOCKFISH_PATH
python label_positions.py positions.txt training_data.jsonl $STOCKFISH_PATH --verbose
```
The script will print at the top: `Using Stockfish: /path/to/stockfish`
#### 4. Check the error summary
After running with verbose, look for the summary:
```
============================================================
LABELING SUMMARY
============================================================
Successfully evaluated: 0 ← This should be > 0
Skipped (duplicates): 0
Skipped (invalid): 0
Errors: 0
```
If "Successfully evaluated" is 0, positions aren't being saved.
---
### Issue 2: Empty positions.txt
**Symptom:** `positions.txt` is empty after running `generate_positions.py`
**Diagnosis:** Check the generation summary:
```bash
python generate_positions.py positions.txt --games 10000
```
Expected output:
```
============================================================
POSITION GENERATION SUMMARY
============================================================
Total games: 10000
Saved positions: 1234 ← This should be > 0
Filtered (check): 2345
Filtered (captures): 4321
Filtered (game over): 1100
Total filtered: 7766
Acceptance rate: 12.34%
============================================================
```
**If Saved positions = 0:**
The filters are too strict! Try with `--no-filter-captures`:
```bash
python generate_positions.py positions.txt --games 10000 --no-filter-captures
```
This allows positions with available captures, which should greatly increase the output.
---
### Issue 3: Stockfish Errors During Labeling
**Symptom:** Labeling runs but shows errors like:
```
Error evaluating position: rnbqkbnr/pppppppp...
SomeError: [error details]
```
**Solutions:**
1. **Check Stockfish is responsive:**
```bash
# Test Stockfish directly: it should print its id lines and "uciok"
printf "uci\nquit\n" | stockfish
```
2. **Try with lower depth** (faster, fewer timeouts):
```bash
python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --depth 8
```
3. **Use explicit path** instead of relying on PATH:
```bash
python label_positions.py positions.txt training_data.jsonl /usr/games/stockfish
```
4. **Check if FENs in positions.txt are valid:**
```bash
head -5 positions.txt
```
Output should look like:
```
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
rnbqkbnr/ppp1pppp/8/3p4/4P3/8/PPPP1PPP/RNBQKBNR w KQkq d6 0 2
```
---
### Issue 4: Training Fails - No Valid Data
**Symptom:** `train_nnue.py` crashes with:
```
IndexError: list index out of range
```
**Cause:** `training_data.jsonl` is empty or contains invalid JSON.
**Debug:**
```bash
# Check file size
ls -lh training_data.jsonl
# Count valid lines
python -c "import json; lines = [1 for line in open('training_data.jsonl') if json.loads(line)]; print(f'Valid lines: {len(lines)}')"
# Look at first few lines
head -3 training_data.jsonl
```
Expected output:
```
{"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "eval": 45}
{"fen": "rnbqkbnr/ppp1pppp/8/3p4/4P3/8/PPPP1PPP/RNBQKBNR w KQkq d6 0 2", "eval": 48}
```
If empty: go back to Issue 1.
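Note that the `python -c` one-liner above raises on the first malformed line. A more forgiving check, as an illustrative sketch (the helper name is hypothetical, not part of the pipeline):

```python
import json

def count_valid_lines(path):
    """Count lines that parse as JSON objects with "fen" and "eval" keys.

    Unlike a bare json.loads() loop, this skips malformed or blank lines
    instead of raising on the first one.
    """
    valid = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore malformed lines
            if isinstance(obj, dict) and "fen" in obj and "eval" in obj:
                valid += 1
    return valid
```

Run it as `python -c "...; print(count_valid_lines('training_data.jsonl'))"` or save it as a small script.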
---
## Step-by-Step Verification
Run this to verify each step works:
```bash
cd modules/bot/python
# Step 1: Generate 1000 positions (quick test)
echo "Testing position generation..."
python generate_positions.py test_positions.txt --games 1000 --no-filter-captures
# Check output
if [ ! -s test_positions.txt ]; then
echo "ERROR: test_positions.txt is empty"
exit 1
fi
POSITIONS=$(wc -l < test_positions.txt)
echo "✓ Generated $POSITIONS positions"
# Step 2: Label positions (quick test with 100 positions)
echo "Testing Stockfish labeling..."
export STOCKFISH_PATH=$(command -v stockfish || command -v /usr/games/stockfish || echo "stockfish")
if ! command -v "$STOCKFISH_PATH" &> /dev/null; then
echo "ERROR: Stockfish not found"
echo " Install: apt-get install stockfish (Linux) or brew install stockfish (Mac)"
exit 1
fi
head -100 test_positions.txt > test_positions_100.txt
python label_positions.py test_positions_100.txt test_training_data.jsonl $STOCKFISH_PATH --depth 8
# Check output
if [ ! -s test_training_data.jsonl ]; then
echo "ERROR: test_training_data.jsonl is empty"
echo " Run again with --verbose:"
python label_positions.py test_positions_100.txt test_training_data.jsonl $STOCKFISH_PATH --depth 8 --verbose
exit 1
fi
EVALS=$(wc -l < test_training_data.jsonl)
echo "✓ Evaluated $EVALS positions"
# Step 3: Test training
echo "Testing training..."
python train_nnue.py test_training_data.jsonl test_weights.pt --epochs 1 --batch-size 32 --no-versioning
if [ ! -f test_weights.pt ]; then
echo "ERROR: training failed"
exit 1
fi
echo "✓ Training works"
echo ""
echo "All tests passed! Pipeline is working correctly."
echo "You can now run the full pipeline with:"
echo " ./run_pipeline.sh"
```
Save as `test_pipeline.sh` and run:
```bash
chmod +x test_pipeline.sh
./test_pipeline.sh
```
---
## Common Error Messages
### "Stockfish not found at stockfish"
```bash
# Set the full path
export STOCKFISH_PATH=/usr/games/stockfish
# Or on Windows:
set STOCKFISH_PATH=C:\stockfish\stockfish.exe
```
### "No such file or directory: positions.txt"
```bash
# Make sure you're in the right directory
cd modules/bot/python
# Or provide full path
python label_positions.py /full/path/to/positions.txt training_data.jsonl stockfish
```
### "JSONDecodeError" in training
```bash
# training_data.jsonl has invalid JSON
# Regenerate it:
rm training_data.jsonl
python label_positions.py positions.txt training_data.jsonl stockfish
```
### "CUDA out of memory"
```bash
# Reduce batch size
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024
```
---
## Getting More Information
### Verbose Output
All scripts support `--verbose` for detailed debugging:
```bash
python label_positions.py positions.txt training_data.jsonl stockfish --verbose
```
This prints:
- Which Stockfish is being used
- Error details for each failed position
- Summary of what passed/failed/skipped
### File Size Checks
```bash
# Check all files
ls -lh positions.txt training_data.jsonl nnue_weights.pt
# Count lines
echo "Positions: $(wc -l < positions.txt)"
echo "Training data: $(wc -l < training_data.jsonl)"
```
### Quick Tests
```bash
# Test position generation (100 games)
python generate_positions.py test_pos.txt --games 100 --no-filter-captures
# Test Stockfish labeling (10 positions)
head -10 test_pos.txt > test_pos_10.txt
python label_positions.py test_pos_10.txt test_data_10.jsonl stockfish --depth 6
# Test training (on test data)
python train_nnue.py test_data_10.jsonl test_model.pt --epochs 1 --batch-size 8
```
---
## Pipeline Workflow with Debugging
```bash
# 1. Generate positions
python generate_positions.py positions.txt --games 100000 --no-filter-captures
# Should output: Saved positions: ~20000-40000 (depends on filter)
# 2. Label with Stockfish
export STOCKFISH_PATH=$(which stockfish)
python label_positions.py positions.txt training_data.jsonl $STOCKFISH_PATH --depth 10
# Should output: Successfully evaluated: > 0
# 3. Train model
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5
# Should output: Training summary with version info
# 4. Export to Scala
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
# Should output: NNUEWeights.scala created
# 5. Compile Scala
cd ../..
./compile
# Should output: BUILD SUCCESSFUL
```
---
## Performance Monitoring
While labeling is running, monitor progress:
```bash
# In another terminal
watch -n 5 'wc -l modules/bot/python/training_data.jsonl'
# Or on macOS
while true; do echo $(wc -l < modules/bot/python/training_data.jsonl) positions labeled; sleep 5; done
```
This lets you watch the labeled-position count grow and estimate the labeling rate.
---
## Still Stuck?
1. **Read the full output** — Don't skip error messages
2. **Check file sizes** — `ls -lh` shows if files are being created
3. **Run with `--verbose`** — Shows exactly what's failing
4. **Test individual steps** — Don't run full pipeline, test pieces
5. **Check Stockfish** — `stockfish --version` confirms it works
For more help, see:
- `README_NNUE.md` — Complete pipeline docs
- `TRAINING_GUIDE.md` — Training workflows
- `INCREMENTAL_TRAINING.md` — Versioning & checkpoints
# Incremental Training & Versioning: New Features
## Summary
`train_nnue.py` now supports:
- **Checkpoint Loading** — Resume from previous models
- **Automatic Versioning** — v1, v2, v3... naming
- **Metadata Tracking** — Date, positions, losses, depth
- **CLI Arguments** — Full control via command line
---
## Feature 1: Automatic Checkpoint Detection
When you run training, the trainer automatically looks for and loads existing weights:
```bash
# First run: nnue_weights.pt doesn't exist
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Trains from scratch, saves as nnue_weights_v1.pt
# Second run: nnue_weights.pt exists (symlink to v1)
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Auto-loads nnue_weights_v1.pt as checkpoint
# → Continues training
# → Saves as nnue_weights_v2.pt
```
**No command-line flag needed** — automatic detection of existing weights!
---
## Feature 2: Explicit Checkpoint
Override auto-detection with `--checkpoint`:
```bash
# Use v1 as starting point, ignore any other weights
python train_nnue.py training_data.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt
# Or load from external checkpoint
python train_nnue.py training_data.jsonl nnue_weights.pt \
--checkpoint /path/to/backup_model.pt
```
---
## Feature 3: Automatic Versioning
Models are saved with version numbers:
**First run:**
```
nnue_weights_v1.pt ← Model weights
nnue_weights_v1_metadata.json ← Training info
```
**Second run:**
```
nnue_weights_v2.pt ← Model weights
nnue_weights_v2_metadata.json ← Training info
```
**Third run:**
```
nnue_weights_v3.pt
nnue_weights_v3_metadata.json
```
Disable with `--no-versioning`:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
# → Saves directly to nnue_weights.pt (no version number)
```
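The numbering scheme is simple enough to sketch. This is an illustrative helper (`next_version` is a hypothetical name, not the actual code in `train_nnue.py`):

```python
import re
from pathlib import Path

def next_version(base="nnue_weights", directory="."):
    """Return N+1, where N is the highest existing {base}_v{N}.pt version."""
    pattern = re.compile(rf"{re.escape(base)}_v(\d+)\.pt$")
    versions = [int(m.group(1))
                for p in Path(directory).glob(f"{base}_v*.pt")
                if (m := pattern.match(p.name))]
    # No versioned files yet -> start at v1
    return max(versions, default=0) + 1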
---
## Feature 4: Training Metadata
Each model save includes a JSON metadata file tracking:
```json
{
"version": 2,
"date": "2026-04-07T15:30:45.123456",
"num_positions": 1000000,
"stockfish_depth": 12,
"epochs": 20,
"batch_size": 4096,
"learning_rate": 0.001,
"final_val_loss": 0.0234567,
"device": "cuda",
"checkpoint": "nnue_weights_v1.pt",
"notes": "Win rate vs classical eval: TBD"
}
```
### Useful for:
- **Tracking progress** — Compare val_loss across versions
- **Reproducibility** — Know exactly how each model was trained
- **Debugging** — Identify which positions/depth produced best results
- **Benchmarking** — Record win rates (manually added to notes)
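Given these metadata files, picking the best version programmatically is a one-liner over `final_val_loss` (illustrative helper, assuming the JSON layout shown above):

```python
import json

def best_version(metadata_files):
    """Return the metadata file with the lowest final_val_loss."""
    def loss(path):
        with open(path) as f:
            return json.load(f)["final_val_loss"]
    return min(metadata_files, key=loss)
```

For example, `best_version(glob.glob("nnue_weights_v*_metadata.json"))` names the candidate to export.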
---
## Feature 5: CLI Arguments
Full control over training via command-line flags:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
--epochs 30 \
--batch-size 2048 \
--lr 5e-4 \
--stockfish-depth 14 \
--checkpoint nnue_weights_v1.pt
```
**All flags:**
- `--epochs` — Number of training passes (default: 20)
- `--batch-size` — Samples per update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth for metadata (default: 12)
- `--checkpoint` — Resume from checkpoint (default: auto-detect)
- `--no-versioning` — Disable versioning
---
## Workflow Examples
### Scenario 1: Continuous Improvement
```bash
# Initial training: 500K positions
./run_pipeline.sh
# → nnue_weights_v1.pt created
# Add more positions (500K more)
python label_positions.py positions_v2.txt training_data_v2.jsonl stockfish
# Combine and retrain
cat training_data.jsonl training_data_v2.jsonl > all_data.jsonl
python train_nnue.py all_data.jsonl nnue_weights.pt
# → Loads v1, trains on all 1M positions
# → nnue_weights_v2.pt created
# Export best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
### Scenario 2: Hyperparameter Tuning
```bash
# Baseline
python train_nnue.py data.jsonl nnue_weights.pt
# → v1 with default settings
# Try lower learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 5e-4
# → v2 with lr=5e-4
# Try higher learning rate
python train_nnue.py data.jsonl nnue_weights.pt --lr 2e-3
# → v3 with lr=2e-3
# Compare metadata
cat nnue_weights_v*_metadata.json | grep final_val_loss
# → Pick the lowest loss
```
### Scenario 3: Interrupted Training Resume
```bash
# Start training
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Epoch 30 of 50, then crash/interrupt
# Resume: same command
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 50
# → Auto-detects checkpoint, continues from epoch 30
# → Completes to epoch 50
```
---
## Command-Line Help
View all options:
```bash
python train_nnue.py --help
```
Output:
```
usage: train_nnue.py [-h] [--checkpoint CHECKPOINT] [--epochs EPOCHS]
[--batch-size BATCH_SIZE] [--lr LR]
[--stockfish-depth STOCKFISH_DEPTH] [--no-versioning]
[data_file] [output_file]
Train NNUE neural network for chess evaluation
positional arguments:
data_file Path to training_data.jsonl (default: training_data.jsonl)
output_file Output file base name (default: nnue_weights.pt)
optional arguments:
-h, --help show this help message and exit
--checkpoint CHECKPOINT
Path to checkpoint file to resume training from (optional)
--epochs EPOCHS Number of epochs to train (default: 20)
--batch-size BATCH_SIZE
Batch size (default: 4096)
--lr LR Learning rate (default: 1e-3)
--stockfish-depth STOCKFISH_DEPTH
Stockfish depth used for evaluations (for metadata, default: 12)
--no-versioning Disable automatic versioning (save directly to output file)
```
---
## Key Differences from Previous Version
| Feature | Before | After |
|---------|--------|-------|
| Checkpoint support | ❌ No | ✅ Yes (auto + explicit) |
| Versioning | ❌ Single file | ✅ v1, v2, v3... |
| Metadata tracking | ❌ No | ✅ JSON with all info |
| CLI arguments | ❌ Limited | ✅ Full argparse |
| Resumed training | ❌ Always from scratch | ✅ Resume from checkpoint |
| Training history | ❌ Lost | ✅ Tracked in metadata |
---
## Integration with Pipeline
The `run_pipeline.sh` and `run_pipeline.bat` scripts automatically use versioning:
```bash
./run_pipeline.sh
# First run:
# - Generates data
# - Trains model
# - Creates nnue_weights_v1.pt + metadata
# - Exports to NNUEWeights.scala
# Second run:
# - Auto-detects v1, loads as checkpoint
# - Continues training on all data
# - Creates nnue_weights_v2.pt + metadata
# - Exports updated NNUEWeights.scala
```
---
## Tips & Tricks
### List all versions with losses:
```bash
for f in nnue_weights_v*_metadata.json; do
version=$(grep version $f | head -1)
loss=$(grep final_val_loss $f)
echo "$version | $loss"
done
```
### Auto-export best version:
```bash
# Find version with lowest loss
BEST=$(for f in nnue_weights_v*_metadata.json; do
echo "$f $(grep final_val_loss $f | cut -d: -f2)"
done | sort -k2 -n | head -1 | cut -d_ -f3 | cut -d. -f1)
python export_weights.py nnue_weights_$BEST.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
### Archive old versions:
```bash
mkdir -p archive
mv nnue_weights_v{1,2,3}.pt archive/
mv nnue_weights_v{1,2,3}_metadata.json archive/
# Keep only v4+
```
---
## See Also
- `TRAINING_GUIDE.md` — Detailed examples and workflows
- `README_NNUE.md` — Complete pipeline documentation
- `train_nnue.py --help` — Command-line reference
# NNUE Training Pipeline
This directory contains the complete NNUE (Efficiently Updatable Neural Network) training pipeline for the Now-Chess bot.
## Overview
The pipeline generates 500,000 random chess positions, evaluates them with Stockfish, trains a neural network, and exports the weights as Scala code for integration into the engine.
## Prerequisites
Install Python dependencies:
```bash
pip install -r requirements.txt
```
Ensure Stockfish is installed. You can:
- Install via package manager: `apt-get install stockfish` (Linux) or `brew install stockfish` (macOS)
- Or download from [stockfish.org](https://stockfishchess.org)
Set the Stockfish path:
```bash
export STOCKFISH_PATH=/path/to/stockfish
```
## Pipeline Steps
### Quick Run
Run the entire pipeline:
```bash
chmod +x run_pipeline.sh
./run_pipeline.sh
```
This automatically runs all 4 steps in sequence and confirms each succeeds before continuing.
### Individual Steps
#### Step 1: Generate Positions
Generate 500,000 random chess positions:
```bash
python3 generate_positions.py positions.txt
```
Output: `positions.txt` (one FEN per line)
- Plays 8-20 random opening moves
- Filters out checks, captures available, and game-over positions
- Shows progress bar with tqdm
#### Step 2: Label with Stockfish
Evaluate each position with Stockfish at depth 12:
```bash
export STOCKFISH_PATH=/path/to/stockfish
python3 label_positions.py positions.txt training_data.jsonl $STOCKFISH_PATH
```
Output: `training_data.jsonl` (one JSON per line)
- Format: `{"fen": "...", "eval": 123}` (centipawns)
- Evals clamped to [-2000, 2000] to avoid mate score outliers
- Supports resuming if interrupted (checks for existing entries)
- Shows progress bar with tqdm
**Note:** This step is slow (~24-36 hours for 500K positions at depth 12). You can reduce games or use lower depth for testing.
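The clamping rule is simple enough to sketch (the helper name is illustrative; the actual logic lives in `label_positions.py`):

```python
import json

def clamp_eval(cp: int, limit: int = 2000) -> int:
    """Clamp a centipawn score to [-limit, limit], taming mate-score outliers."""
    return max(-limit, min(limit, cp))

# A mate score like +32000 collapses to the cap; ordinary evals pass through
entry = {"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
         "eval": clamp_eval(32000)}
print(json.dumps(entry))  # "eval": 2000
```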
#### Step 3: Train NNUE Model
Train the neural network:
```bash
python3 train_nnue.py training_data.jsonl nnue_weights.pt
```
Output: `nnue_weights.pt` (PyTorch model weights)
Architecture:
- Input: 768 binary features (12 piece types × 64 squares)
- Hidden 1: 256 neurons + ReLU
- Hidden 2: 32 neurons + ReLU
- Output: 1 neuron (sigmoid applied to eval/400)
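The architecture is compact enough to sketch in PyTorch. Layer names `l1`/`l2`/`l3` follow the inference comment in the Scala section below; the exact names in `train_nnue.py` may differ:

```python
import torch
import torch.nn as nn

class NNUE(nn.Module):
    """768 -> 256 -> 32 -> 1, as described above."""
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(768, 256)
        self.l2 = nn.Linear(256, 32)
        self.l3 = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.relu(self.l1(x))
        x = torch.relu(self.l2(x))
        return self.l3(x)  # raw score; the training target is sigmoid(eval / 400)

model = NNUE()
out = model(torch.zeros(2, 768))  # batch of 2 empty feature vectors
print(tuple(out.shape))           # → (2, 1)
```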
Training:
- 20 epochs, batch size 4096, Adam optimizer (lr=1e-3)
- 90% train / 10% validation split
- Saves best weights by validation loss
- Shows train/val loss per epoch
**Note:** Requires GPU for reasonable speed (~2-4 hours). CPU falls back to ~8-16 hours.
#### Step 4: Export to Scala
Export weights as Scala code:
```bash
python3 export_weights.py nnue_weights.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
Output: `NNUEWeights.scala`
- Object with `val` arrays for each layer's weights and biases
- Format: `Array[Float]` with precision sufficient for inference
- Includes shape comments for reference
## Scala Integration
### Step 5: NNUE Evaluator
Create `NNUE.scala` in `src/main/scala/de/nowchess/bot/bots/nnue/`:
```scala
package de.nowchess.bot.bots.nnue
class NNUE:
// Load weights from NNUEWeights.scala
// Convert Position to 768-feature vector
// Run inference: l1→ReLU→l2→ReLU→l3
// Return centipawn score
```
### Step 6: Integration
Implement `NNUEBot` that uses the NNUE evaluator for move selection.
## File Reference
| File | Purpose |
|------|---------|
| `requirements.txt` | Python dependencies |
| `generate_positions.py` | Step 1: Position generator |
| `label_positions.py` | Step 2: Stockfish labeler |
| `train_nnue.py` | Step 3: NNUE trainer |
| `export_weights.py` | Step 4: Weight exporter |
| `run_pipeline.sh` | Master script (runs steps 1-4) |
| `positions.txt` | Output: Raw FENs (500K) |
| `training_data.jsonl` | Output: FEN+eval pairs |
| `nnue_weights.pt` | Output: Trained weights |
| `../src/main/scala/.../NNUEWeights.scala` | Output: Scala weights |
## Tips
- **For testing:** Reduce `generate_positions.py` to 10,000 games for quick iteration
- **Resume labeling:** Run step 2 again; it skips already-evaluated positions
- **GPU acceleration:** Install CUDA for PyTorch to speed up training
- **Stockfish tuning:** Lower depth (e.g., 8 instead of 12) for faster labeling
- **Batch size:** Increase to 8192 for faster training if memory allows; decrease (e.g., 2048) on out-of-memory errors
## Troubleshooting
**ImportError: No module named 'chess'**
- Run: `pip install -r requirements.txt`
**Stockfish not found**
- Check: `which stockfish` or set `export STOCKFISH_PATH=/full/path/to/stockfish`
**CUDA out of memory**
- Reduce batch size in `train_nnue.py` (e.g., 2048)
- Or use CPU: Remove CUDA check and device setup
**Training loss not decreasing**
- Check data quality: Sample some entries from `training_data.jsonl`
- Try a different learning rate (e.g., 5e-4 or 1e-2) for experimentation
- Verify Stockfish depth was sufficient (depth ≥ 10)
## References
- [NNUE Overview](https://www.chessprogramming.org/NNUE)
- [python-chess](https://python-chess.readthedocs.io/)
- [PyTorch](https://pytorch.org/)
- [Stockfish](https://stockfishchess.org/)
# NNUE Training Guide: Incremental Training & Versioning
## Overview
The improved `train_nnue.py` now supports:
1. **Incremental training** — Resume from checkpoint, continue training on new data
2. **Automatic versioning** — Each training run saved as `nnue_weights_v{N}.pt`
3. **Metadata tracking** — Date, positions, depth, losses stored in JSON
4. **CLI flags** — Full control over training parameters
## Quick Start
### First Training Run (Fresh Start)
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```
This saves:
- `nnue_weights_v1.pt` — The trained weights
- `nnue_weights_v1_metadata.json` — Training metadata
### Continue Training (Incremental)
Add more positions to `training_data.jsonl`, then:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt
```
The trainer will:
1. Detect `nnue_weights.pt` exists
2. Load it as a checkpoint automatically
3. Continue training on all data
4. Save as `nnue_weights_v2.pt` with updated metadata
Alternatively, specify a checkpoint explicitly:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```
## Advanced Usage
### Custom Training Parameters
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt \
--epochs 30 \
--batch-size 2048 \
--lr 5e-4 \
--stockfish-depth 14
```
- `--epochs` — How many passes through the data (default: 20)
- `--batch-size` — Samples per gradient update (default: 4096)
- `--lr` — Learning rate (default: 1e-3)
- `--stockfish-depth` — Depth of Stockfish evaluation (for metadata only)
### Explicit Checkpoint
Resume from a specific checkpoint (not `nnue_weights.pt`):
```bash
python train_nnue.py training_data_v2.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt
```
### Disable Versioning
Save directly to output file without versioning:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --no-versioning
```
This overwrites `nnue_weights.pt` instead of creating `nnue_weights_v2.pt`.
## Incremental Training Workflow
Typical workflow for improving the model over time:
**Step 1: Initial Training**
```bash
# Generate 500K positions with Stockfish
./run_pipeline.sh
# This saves:
# - nnue_weights_v1.pt
# - nnue_weights_v1_metadata.json
```
**Step 2: Generate More Positions**
```bash
# Later, generate 500K more positions
# Append to training_data.jsonl or create new one
# Label with Stockfish at depth 16 (more thorough)
python label_positions.py positions_batch2.txt training_data_batch2.jsonl stockfish --depth 16
# Combine datasets
cat training_data_batch1.jsonl training_data_batch2.jsonl > training_data_combined.jsonl
```
**Step 3: Continue Training**
```bash
# Train on combined data, starting from v1 checkpoint
python train_nnue.py training_data_combined.jsonl nnue_weights.pt
# Saves:
# - nnue_weights_v2.pt (improved)
# - nnue_weights_v2_metadata.json
```
**Step 4: Benchmark & Choose**
```bash
# Test both versions in matches
# If v2 is better, use it; otherwise keep v1
# Update NNUEWeights.scala with best version
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
```
## Metadata File Format
Each training session generates a JSON metadata file, e.g., `nnue_weights_v2_metadata.json`:
```json
{
"version": 2,
"date": "2026-04-07T21:45:30.123456",
"num_positions": 1000000,
"stockfish_depth": 12,
"epochs": 20,
"batch_size": 4096,
"learning_rate": 0.001,
"final_val_loss": 0.0234567,
"device": "cuda",
"checkpoint": "nnue_weights_v1.pt",
"notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
```
### Fields
- **version**: Training version number (v1, v2, etc.)
- **date**: ISO timestamp of training start
- **num_positions**: Total positions in dataset
- **stockfish_depth**: Depth of Stockfish evaluations (from command-line flag)
- **epochs**: Number of training passes
- **batch_size**: Training batch size
- **learning_rate**: Adam optimizer learning rate
- **final_val_loss**: Best validation loss achieved
- **device**: GPU (cuda) or CPU used for training
- **checkpoint**: Previous model used as starting point (null if from scratch)
- **notes**: Win rate comparison (currently TBD — requires benchmark)
## Checkpoint Logic
When you run training, the trainer checks for checkpoints in this order:
1. **Explicit checkpoint** — If you provide `--checkpoint`, use it
2. **Auto-detect** — If output file exists (e.g., `nnue_weights.pt`), load it
3. **From scratch** — Otherwise, initialize with random weights
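The resolution order above can be sketched as follows (names and signature are illustrative, not `train_nnue.py`'s actual code):

```python
from pathlib import Path

def resolve_checkpoint(output_file, explicit=None):
    """Pick a checkpoint using the three-step order described above."""
    if explicit is not None:
        return explicit          # 1. explicit --checkpoint wins
    if Path(output_file).exists():
        return output_file       # 2. auto-detect existing output weights
    return None                  # 3. no checkpoint: train from scratch
```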
Example:
```bash
# First run: from scratch (no nnue_weights.pt exists)
python train_nnue.py training_data.jsonl nnue_weights.pt
# → Creates v1 from scratch, saves as nnue_weights_v1.pt
# Second run: auto-detect nnue_weights.pt as checkpoint
python train_nnue.py training_data_bigger.jsonl nnue_weights.pt
# → Loads nnue_weights_v1.pt (because nnue_weights.pt = v1), saves as v2
# Third run: explicit checkpoint
python train_nnue.py training_data_huge.jsonl nnue_weights.pt --checkpoint nnue_weights_v2.pt
# → Loads v2, saves as v3
```
## Resuming Interrupted Training
If training is interrupted (power loss, ^C), you can resume:
```bash
# Original command
python train_nnue.py training_data.jsonl nnue_weights.pt
# If interrupted, the same command will:
# 1. Detect nnue_weights_v1.pt exists (or a higher version)
# 2. Auto-load it as checkpoint
# 3. Resume training
# 4. Save next version (v2, v3, etc.)
```
## Performance Tips
### Reduce Training Time
```bash
# Smaller batch size = slower but less memory
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024
# Fewer epochs
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5
# Lower learning rate = slower convergence but more stable
python train_nnue.py training_data.jsonl nnue_weights.pt --lr 5e-4
```
### Accelerate on GPU
If you have NVIDIA GPU with CUDA:
```bash
# Training will automatically use CUDA
# Check metadata device field: should be "cuda" not "cpu"
python train_nnue.py training_data.jsonl nnue_weights.pt
```
If training uses CPU but GPU is available:
```bash
# Reinstall PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
### Efficient Incremental Training
```bash
# Fine-tune v1 on slightly different data (high learning rate)
python train_nnue.py new_positions.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt \
--epochs 3 \
--lr 5e-4
# Full retraining on combined data (slower, better)
python train_nnue.py all_positions.jsonl nnue_weights.pt \
--checkpoint nnue_weights_v1.pt \
--epochs 20 \
--lr 1e-3
```
## Version Management
### List All Versions
```bash
ls -la nnue_weights_v*.pt
ls -la nnue_weights_v*_metadata.json
```
### Compare Versions
```bash
cat nnue_weights_v1_metadata.json | grep "final_val_loss"
cat nnue_weights_v2_metadata.json | grep "final_val_loss"
cat nnue_weights_v3_metadata.json | grep "final_val_loss"
```
Lower val loss = better model.
### Benchmark Best Version
After training multiple versions, benchmark them:
```bash
# Export v1 and play some games
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test
# Export v2 and benchmark
python export_weights.py nnue_weights_v2.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
./compile && ./test
# Keep the best, archive others
```
### Archive Old Versions
```bash
# Keep only recent versions
mkdir -p old_models
mv nnue_weights_v1.pt old_models/
mv nnue_weights_v1_metadata.json old_models/
```
## Troubleshooting
### "FileNotFoundError: training_data.jsonl not found"
```bash
# Make sure you're in the python/ directory
cd modules/bot/python
# Or provide full path
python train_nnue.py /full/path/to/training_data.jsonl nnue_weights.pt
```
### "CUDA out of memory"
Reduce batch size:
```bash
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 2048
```
### Training seems slow (using CPU not GPU)
```bash
# Check metadata of a training run
cat nnue_weights_v1_metadata.json | grep device
# If "cpu", reinstall PyTorch with CUDA support
pip install torch --index-url https://download.pytorch.org/whl/cu118
```
### "checkpoint file corrupted"
```bash
# Start over from scratch (don't load corrupted checkpoint)
python train_nnue.py training_data.jsonl nnue_weights_fresh.pt --no-versioning
# Or resume from earlier version
python train_nnue.py training_data.jsonl nnue_weights.pt --checkpoint nnue_weights_v1.pt
```
## Integration with Pipeline
The `run_pipeline.sh` script now supports incremental training:
```bash
# First run: generates data, trains v1
./run_pipeline.sh
# Add more positions
# ... generate more, label more ...
# Second run: trains on combined data as v2
./run_pipeline.sh
```
## Example: Full Workflow
```bash
cd modules/bot/python
# Session 1: Initial training
chmod +x run_pipeline.sh
export STOCKFISH_PATH=/usr/bin/stockfish
./run_pipeline.sh
# Creates: nnue_weights_v1.pt, nnue_weights_v1_metadata.json
# Session 2: Improve with deeper analysis
# (manually evaluate more positions at depth 14)
python label_positions.py positions_v2.txt training_data_v2.jsonl \
/usr/bin/stockfish --depth 14
# Combine and retrain
cat training_data_v1.jsonl training_data_v2.jsonl > training_data_combined.jsonl
python train_nnue.py training_data_combined.jsonl nnue_weights.pt \
--epochs 25 \
--stockfish-depth 14
# Creates: nnue_weights_v2.pt, nnue_weights_v2_metadata.json
# Session 3: Benchmark and choose
# Test both v1 and v2 with matches...
# If v2 is better, export and use
python export_weights.py nnue_weights_v2.pt \
../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
cd ../..
./compile && ./test
```
## See Also
- `train_nnue.py --help` — Command-line help
- `README_NNUE.md` — Complete pipeline documentation
- `NNUE_IMPLEMENTATION_SUMMARY.md` — Technical architecture
#!/usr/bin/env python3
"""Export NNUE weights to Scala code."""
import torch
import sys
from pathlib import Path


def export_weights_to_scala(weights_file, output_file):
    """Load PyTorch weights and export as Scala code."""
    if not Path(weights_file).exists():
        print(f"Error: Weights file not found at {weights_file}")
        sys.exit(1)

    # Load weights on CPU; weights_only is omitted for compatibility with older PyTorch versions
    state_dict = torch.load(weights_file, map_location='cpu')

    # Create output directory if needed
    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    with open(output_file, 'w') as f:
        f.write("package de.nowchess.bot.bots.nnue\n\n")
        f.write("object NNUEWeights:\n")
        for layer_name, tensor in sorted(state_dict.items()):
            # Sanitize name
            safe_name = layer_name.replace('.', '_').replace(' ', '_')
            # Convert tensor to flat list
            values = tensor.flatten().tolist()
            # Format as Scala array
            f.write(f"\n  val {safe_name} = Array(\n")
            # Write values in chunks for readability
            chunk_size = 16
            for i in range(0, len(values), chunk_size):
                chunk = values[i:i + chunk_size]
                formatted_chunk = ", ".join(f"{v:.10g}f" for v in chunk)
                f.write(f"    {formatted_chunk}")
                if i + chunk_size < len(values):
                    f.write(",\n")
                else:
                    f.write("\n")
            f.write("  )\n")
            # Store shape for reference
            shape = list(tensor.shape)
            f.write(f"  // Shape: {shape}\n")

    print(f"Weights exported to {output_file}")


if __name__ == "__main__":
    weights_file = "nnue_weights.pt"
    output_file = "../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala"
    if len(sys.argv) > 1:
        weights_file = sys.argv[1]
    if len(sys.argv) > 2:
        output_file = sys.argv[2]
    export_weights_to_scala(weights_file, output_file)
@@ -0,0 +1,110 @@
#!/usr/bin/env python3
"""Generate 500,000 random chess positions for NNUE training."""
import chess
import random
import sys
from pathlib import Path
from tqdm import tqdm
def play_random_game_and_collect_positions(output_file, total_games=500000, filter_captures=True):
"""Play random games and save positions after 8-20 random moves.
Returns:
Number of valid positions saved
"""
positions_count = 0
filtered_check = 0
filtered_captures = 0
filtered_game_over = 0
with open(output_file, 'w') as f:
with tqdm(total=total_games, desc="Generating positions") as pbar:
for game_num in range(total_games):
board = chess.Board()
# Play 8-20 random opening moves
num_moves = random.randint(8, 20)
for move_num in range(num_moves):
if board.is_game_over():
break
legal_moves = list(board.legal_moves)
if not legal_moves:
break
move = random.choice(legal_moves)
board.push(move)
# Skip if game over
if board.is_game_over():
filtered_game_over += 1
pbar.update(1)
continue
# Skip if in check
if board.is_check():
filtered_check += 1
pbar.update(1)
continue
# Check if any captures are available (if filtering enabled)
if filter_captures:
has_captures = any(board.is_capture(move) for move in board.legal_moves)
if has_captures:
filtered_captures += 1
pbar.update(1)
continue
# Save valid position
fen = board.fen()
f.write(fen + '\n')
positions_count += 1
pbar.update(1)
# Print summary
print()
print("=" * 60)
print("POSITION GENERATION SUMMARY")
print("=" * 60)
print(f"Total games: {total_games}")
print(f"Saved positions: {positions_count}")
print(f"Filtered (check): {filtered_check}")
print(f"Filtered (captures): {filtered_captures}")
print(f"Filtered (game over): {filtered_game_over}")
print(f"Total filtered: {filtered_check + filtered_captures + filtered_game_over}")
print(f"Acceptance rate: {positions_count / total_games * 100:.2f}%")
print("=" * 60)
print()
if positions_count == 0:
print("WARNING: No valid positions were generated!")
print("This might happen if:")
print(" - The filter criteria are too strict (captures, checks)")
print(" - Try using: --no-filter-captures to accept positions with captures")
return 0
return positions_count
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Generate random chess positions for NNUE training")
parser.add_argument("output_file", nargs="?", default="positions.txt",
help="Output file for positions (default: positions.txt)")
    parser.add_argument("--games", type=int, default=500000,
                        help="Number of games to play (default: 500000)")
parser.add_argument("--no-filter-captures", action="store_true",
help="Include positions with available captures (increases output)")
args = parser.parse_args()
count = play_random_game_and_collect_positions(
output_file=args.output_file,
total_games=args.games,
filter_captures=not args.no_filter_captures
)
sys.exit(0 if count > 0 else 1)
@@ -0,0 +1,198 @@
#!/usr/bin/env python3
"""Label positions with Stockfish evaluations."""
import json
import chess.engine
import sys
import os
from pathlib import Path
from tqdm import tqdm
def label_positions_with_stockfish(positions_file, output_file, stockfish_path, batch_size=100, depth=12, verbose=False):
"""Read positions and label them with Stockfish evaluations.
Args:
positions_file: Path to positions.txt
output_file: Path to training_data.jsonl
stockfish_path: Path to stockfish binary
batch_size: Batch size (not used, kept for compatibility)
depth: Stockfish depth
verbose: Print detailed error messages
"""
# Check if stockfish exists
if not Path(stockfish_path).exists():
        print(f"Error: Stockfish not found at {stockfish_path}")
        print("Set the STOCKFISH_PATH environment variable or pass the path as an argument")
sys.exit(1)
print(f"Using Stockfish: {stockfish_path}")
# Check if positions file exists
if not Path(positions_file).exists():
print(f"Error: Positions file not found at {positions_file}")
sys.exit(1)
# Load existing evaluations if resuming
evaluated_fens = set()
position_count = 0
if Path(output_file).exists():
with open(output_file, 'r') as f:
for line in f:
try:
data = json.loads(line)
evaluated_fens.add(data['fen'])
position_count += 1
except json.JSONDecodeError:
pass
print(f"Resuming from {position_count} already evaluated positions")
# Count total positions
with open(positions_file, 'r') as f:
total_lines = sum(1 for _ in f)
if total_lines == 0:
print(f"Error: Positions file is empty ({positions_file})")
sys.exit(1)
print(f"Total positions to process: {total_lines}")
print(f"Using depth: {depth}")
print()
# Initialize engine
try:
engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
except Exception as e:
print(f"Error: Could not start Stockfish engine")
print(f" Stockfish path: {stockfish_path}")
print(f" Error: {e}")
sys.exit(1)
# Track statistics
evaluated = 0
skipped_invalid = 0
skipped_duplicate = 0
errors = 0
try:
with open(positions_file, 'r') as f:
with open(output_file, 'a') as out:
with tqdm(total=total_lines, initial=position_count, desc="Labeling positions") as pbar:
for fen in f:
fen = fen.strip()
# Skip empty lines
if not fen:
skipped_invalid += 1
pbar.update(1)
continue
# Skip already evaluated
if fen in evaluated_fens:
skipped_duplicate += 1
pbar.update(1)
continue
try:
# Validate FEN
board = chess.Board(fen)
if not board.is_valid():
skipped_invalid += 1
pbar.update(1)
continue
# Evaluate at specified depth
info = engine.analyse(board, chess.engine.Limit(depth=depth))
if info.get('score') is None:
skipped_invalid += 1
pbar.update(1)
continue
score = info['score'].white()
# Convert to centipawns
if score.is_mate():
# Use large values for mate scores
eval_cp = 2000 if score.mate() > 0 else -2000
else:
eval_cp = score.cp
# Clamp to [-2000, 2000]
eval_cp = max(-2000, min(2000, eval_cp))
                            # Save evaluation
                            data = {"fen": fen, "eval": eval_cp}
                            out.write(json.dumps(data) + '\n')
                            out.flush()  # Force write to disk
                            evaluated_fens.add(fen)  # Skip later duplicates within this file
                            evaluated += 1
except Exception as e:
errors += 1
if verbose:
print(f"Error evaluating position: {fen[:50]}...")
print(f" {type(e).__name__}: {e}")
pbar.update(1)
continue
pbar.update(1)
finally:
engine.quit()
# Print summary
print()
print("=" * 60)
print("LABELING SUMMARY")
print("=" * 60)
print(f"Successfully evaluated: {evaluated}")
print(f"Skipped (duplicates): {skipped_duplicate}")
print(f"Skipped (invalid): {skipped_invalid}")
print(f"Errors: {errors}")
print(f"Total processed: {evaluated + skipped_duplicate + skipped_invalid + errors}")
print("=" * 60)
print()
if evaluated == 0:
print("WARNING: No positions were successfully evaluated!")
print("Check that:")
print(" 1. positions.txt is not empty")
print(" 2. positions.txt contains valid FENs")
print(" 3. Stockfish is installed and working")
print(" 4. Stockfish path is correct")
return False
print(f"✓ Labeling complete. Output saved to {output_file}")
return True
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Label chess positions with Stockfish evaluations")
parser.add_argument("positions_file", nargs="?", default="positions.txt",
help="Input positions file (default: positions.txt)")
parser.add_argument("output_file", nargs="?", default="training_data.jsonl",
help="Output file (default: training_data.jsonl)")
parser.add_argument("stockfish_path", nargs="?", default=None,
help="Path to Stockfish binary (default: $STOCKFISH_PATH or 'stockfish')")
parser.add_argument("--depth", type=int, default=12,
help="Stockfish depth (default: 12)")
parser.add_argument("--verbose", action="store_true",
help="Print detailed error messages")
args = parser.parse_args()
# Determine Stockfish path
stockfish_path = args.stockfish_path or os.environ.get("STOCKFISH_PATH", "stockfish")
success = label_positions_with_stockfish(
positions_file=args.positions_file,
output_file=args.output_file,
stockfish_path=stockfish_path,
depth=args.depth,
verbose=args.verbose
)
sys.exit(0 if success else 1)
@@ -0,0 +1,249 @@
#!/usr/bin/env python3
"""Central NNUE pipeline CLI for training and exporting models."""
import argparse
import os
import sys
import subprocess
from pathlib import Path
def get_python_cmd():
    """Get available Python command."""
    if os.name == 'nt':
        return "python"
    import shutil
    # shutil.which avoids spawning a shell just to probe for python3
    return "python3" if shutil.which("python3") else "python"
def list_checkpoints():
"""List available checkpoint versions."""
checkpoints = sorted(Path(".").glob("nnue_weights_v*.pt"))
if not checkpoints:
return []
return [int(cp.stem.split("_v")[1]) for cp in checkpoints]
def run_generate_positions(num_games):
"""Generate random positions."""
positions_file = "positions.txt"
print(f"Generating {num_games} positions...")
result = subprocess.run(
[get_python_cmd(), "generate_positions.py", positions_file, "--games", str(num_games)],
capture_output=False
)
if result.returncode != 0:
print("ERROR: Position generation failed")
return False
return Path(positions_file).exists()
def run_label_positions(stockfish_path):
"""Label positions with Stockfish."""
positions_file = "positions.txt"
output_file = "training_data.jsonl"
if not Path(positions_file).exists():
print("ERROR: positions.txt not found")
return False
print("Labeling positions with Stockfish...")
result = subprocess.run(
[get_python_cmd(), "label_positions.py", positions_file, output_file, stockfish_path],
capture_output=False
)
if result.returncode != 0:
print("ERROR: Position labeling failed")
return False
return Path(output_file).exists()
def run_train(positions_file, output_weights, from_checkpoint=None):
"""Train NNUE model."""
if not Path(positions_file).exists():
print(f"ERROR: {positions_file} not found")
return False
print(f"Training model (output: {output_weights})...")
if from_checkpoint:
print(f" Starting from checkpoint: {from_checkpoint}")
cmd = [get_python_cmd(), "train_nnue.py", positions_file, output_weights]
if from_checkpoint:
cmd.extend(["--checkpoint", from_checkpoint])
result = subprocess.run(cmd, capture_output=False)
if result.returncode != 0:
print("ERROR: Training failed")
return False
return True # train_nnue creates versioned file, not the base name
def run_export(weights_file, output_file):
"""Export weights to Scala."""
if not Path(weights_file).exists():
print(f"ERROR: {weights_file} not found")
return False
print(f"Exporting {weights_file} to Scala...")
result = subprocess.run(
[get_python_cmd(), "export_weights.py", weights_file, output_file],
capture_output=False
)
if result.returncode != 0:
print("ERROR: Export failed")
return False
return Path(output_file).exists()
def cmd_train(args):
"""Handle train command."""
stockfish_path = args.stockfish or os.environ.get("STOCKFISH_PATH", "/usr/games/stockfish")
# Determine checkpoint
checkpoint = None
if args.from_checkpoint:
checkpoint_version = args.from_checkpoint
checkpoint = f"nnue_weights_v{checkpoint_version}.pt"
if not Path(checkpoint).exists():
print(f"ERROR: Checkpoint {checkpoint} not found")
return False
else:
available = list_checkpoints()
if available:
latest = max(available)
checkpoint = f"nnue_weights_v{latest}.pt"
print(f"No checkpoint specified, using latest: v{latest}")
# Generate or use existing positions
if args.positions_file:
if not Path(args.positions_file).exists():
print(f"ERROR: {args.positions_file} not found")
return False
positions_file = args.positions_file
else:
positions_file = "positions.txt"
num_games = args.games or 500000
if not run_generate_positions(num_games):
return False
# Label positions
if not run_label_positions(stockfish_path):
return False
print("\nStarting training...")
# Train (train_nnue.py handles versioning internally)
if not run_train("training_data.jsonl", "nnue_weights.pt", checkpoint):
return False
# Show created version
available = list_checkpoints()
new_version = max(available) if available else 1
print(f"\n✓ Training complete: nnue_weights_v{new_version}.pt")
return True
def cmd_export(args):
"""Handle export command."""
weights_file = args.weights
# Auto-detect if version is specified
if not weights_file.endswith(".pt"):
weights_file = f"nnue_weights_v{weights_file}.pt"
if not Path(weights_file).exists():
print(f"ERROR: {weights_file} not found")
return False
# Determine version from filename
version = Path(weights_file).stem.split("_v")[1] if "_v" in weights_file else "1"
output_file = f"../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights_v{version}.scala"
if not run_export(weights_file, output_file):
return False
print(f"✓ Export complete: {output_file}")
return True
def cmd_list(args):
"""List available checkpoints."""
available = list_checkpoints()
if not available:
print("No checkpoints found")
return True
print("Available checkpoints:")
for v in available:
weights_file = f"nnue_weights_v{v}.pt"
size = Path(weights_file).stat().st_size / (1024**2) # MB
print(f" v{v} ({size:.1f} MB)")
return True
def main():
parser = argparse.ArgumentParser(
description="NNUE pipeline CLI for training and exporting models",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Train with 500k random positions
python nnue.py train
# Train from checkpoint v2
python nnue.py train --from-checkpoint 2
# Train with custom positions file
python nnue.py train --positions-file my_positions.txt
# Train with 200k games
python nnue.py train --games 200000
# Export specific weights version
python nnue.py export 2
# Export with full filename
python nnue.py export nnue_weights_v3.pt
# List available checkpoints
python nnue.py list
"""
)
subparsers = parser.add_subparsers(dest="command", help="Command to run")
# Train subcommand
train_parser = subparsers.add_parser("train", help="Train NNUE model")
train_parser.add_argument(
"--from-checkpoint",
type=int,
help="Start training from checkpoint version (e.g., 2)"
)
train_parser.add_argument(
"--games",
type=int,
help="Number of games to generate (default: 500000)"
)
train_parser.add_argument(
"--positions-file",
help="Use existing positions file instead of generating"
)
train_parser.add_argument(
"--stockfish",
help="Path to Stockfish binary (default: $STOCKFISH_PATH or /usr/games/stockfish)"
)
train_parser.set_defaults(func=cmd_train)
# Export subcommand
export_parser = subparsers.add_parser("export", help="Export weights to Scala")
export_parser.add_argument(
"weights",
help="Weights file or version (e.g., 2 or nnue_weights_v2.pt)"
)
export_parser.set_defaults(func=cmd_export)
# List subcommand
list_parser = subparsers.add_parser("list", help="List available checkpoints")
list_parser.set_defaults(func=cmd_list)
args = parser.parse_args()
if not args.command:
parser.print_help()
return 0
success = args.func(args)
return 0 if success else 1
if __name__ == "__main__":
sys.exit(main())
Binary file not shown.
@@ -0,0 +1,13 @@
{
"version": 1,
"date": "2026-04-07T22:37:15.093371",
"num_positions": 1223,
"stockfish_depth": 12,
"epochs": 20,
"batch_size": 4096,
"learning_rate": 0.001,
"final_val_loss": 0.0162429828196764,
"device": "cuda",
"checkpoint": null,
"notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
@@ -0,0 +1,4 @@
chess==1.11.2
torch==2.11.0
tqdm==4.67.3
numpy==2.4.4
@@ -0,0 +1,66 @@
@echo off
REM NNUE Training Pipeline for Windows
setlocal enabledelayedexpansion
echo.
echo === NNUE Training Pipeline ===
echo.
REM Get the directory where this script is located
set SCRIPT_DIR=%~dp0
cd /d "%SCRIPT_DIR%"
REM Step 1: Generate positions
echo Step 1: Generating 500,000 random positions...
python generate_positions.py positions.txt
if not exist positions.txt (
echo ERROR: positions.txt not created
exit /b 1
)
echo [OK] Positions generated
echo.
REM Step 2: Label positions with Stockfish
echo Step 2: Labeling positions with Stockfish (depth 12^)...
if "%STOCKFISH_PATH%"=="" (
set STOCKFISH_PATH=stockfish
)
python label_positions.py positions.txt training_data.jsonl "%STOCKFISH_PATH%"
if not exist training_data.jsonl (
echo ERROR: training_data.jsonl not created
exit /b 1
)
echo [OK] Positions labeled
echo.
REM Step 3: Train NNUE model
echo Step 3: Training NNUE model (20 epochs^)...
python train_nnue.py training_data.jsonl nnue_weights.pt
if not exist nnue_weights.pt (
echo ERROR: nnue_weights.pt not created
exit /b 1
)
echo [OK] Model trained
echo.
REM Step 4: Export weights to Scala
echo Step 4: Exporting weights to Scala...
python export_weights.py nnue_weights.pt ..\src\main\scala\de\nowchess\bot\bots\nnue\NNUEWeights.scala
if not exist ..\src\main\scala\de\nowchess\bot\bots\nnue\NNUEWeights.scala (
echo ERROR: NNUEWeights.scala not created
exit /b 1
)
echo [OK] Weights exported
echo.
echo === Pipeline Complete ===
echo.
echo Next steps:
echo 1. Navigate to project root: cd ..\..
echo 2. Compile: .\compile.bat
echo 3. Test: .\test.bat
echo.
endlocal
@@ -0,0 +1,78 @@
#!/bin/bash
# NNUE Training Pipeline (bash version)
# Works on Linux, macOS, and Windows (with Git Bash or WSL)
set -e # Exit on error
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$SCRIPT_DIR"
# Use python or python3 (check which is available)
PYTHON_CMD="python3"
if ! command -v python3 &> /dev/null; then
PYTHON_CMD="python"
fi
echo "=== NNUE Training Pipeline ==="
echo ""
echo "Python command: $PYTHON_CMD"
echo "Working directory: $SCRIPT_DIR"
echo ""
# Step 1: Generate positions
echo "Step 1: Generating 500,000 random positions..."
$PYTHON_CMD generate_positions.py positions.txt
if [ ! -f positions.txt ]; then
echo "ERROR: positions.txt not created"
exit 1
fi
echo "✓ Positions generated"
echo ""
# Step 2: Label positions with Stockfish
echo "Step 2: Labeling positions with Stockfish (depth 12)..."
STOCKFISH_PATH="${STOCKFISH_PATH:-/usr/games/stockfish}"
echo "Using Stockfish: $STOCKFISH_PATH"
$PYTHON_CMD label_positions.py positions.txt training_data.jsonl "$STOCKFISH_PATH"
if [ ! -f training_data.jsonl ]; then
echo "ERROR: training_data.jsonl not created"
exit 1
fi
echo "✓ Positions labeled"
echo ""
# Step 3: Train NNUE model with versioning
echo "Step 3: Training NNUE model (20 epochs)..."
# Auto-detect latest version and increment
LATEST_VERSION=$(ls -1 nnue_weights_v*.pt 2>/dev/null | sed 's/nnue_weights_v//;s/\.pt$//' | sort -n | tail -1)
NEW_VERSION=$((${LATEST_VERSION:-0} + 1))
WEIGHTS_FILE="nnue_weights_v${NEW_VERSION}.pt"
echo "Creating version v${NEW_VERSION}..."
$PYTHON_CMD train_nnue.py training_data.jsonl "$WEIGHTS_FILE"
if [ ! -f "$WEIGHTS_FILE" ]; then
echo "ERROR: $WEIGHTS_FILE not created"
exit 1
fi
echo "✓ Model trained: $WEIGHTS_FILE"
echo ""
# Step 4: Export weights to Scala
echo "Step 4: Exporting weights to Scala..."
SCALA_FILE="../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights_v${NEW_VERSION}.scala"
$PYTHON_CMD export_weights.py "$WEIGHTS_FILE" "$SCALA_FILE"
if [ ! -f "$SCALA_FILE" ]; then
echo "ERROR: $SCALA_FILE not created"
exit 1
fi
echo "✓ Weights exported: $SCALA_FILE"
echo ""
echo "=== Pipeline Complete ==="
echo ""
echo "Next steps:"
echo "1. Navigate to project root: cd ../.."
echo "2. Compile: ./compile"
echo "3. Test: ./test"
@@ -0,0 +1,301 @@
#!/usr/bin/env python3
"""Train NNUE neural network for chess evaluation."""
import json
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import sys
from pathlib import Path
from tqdm import tqdm
import chess
from datetime import datetime
import re
class NNUEDataset(Dataset):
"""Dataset of chess positions with evaluations."""
def __init__(self, data_file):
self.positions = []
self.evals = []
with open(data_file, 'r') as f:
for line in f:
try:
data = json.loads(line)
fen = data['fen']
eval_cp = data['eval']
self.positions.append(fen)
self.evals.append(eval_cp)
except (json.JSONDecodeError, KeyError):
pass
def __len__(self):
return len(self.positions)
def __getitem__(self, idx):
fen = self.positions[idx]
eval_cp = self.evals[idx]
features = fen_to_features(fen)
target = torch.sigmoid(torch.tensor(eval_cp / 400.0, dtype=torch.float32))
return features, target
def fen_to_features(fen):
"""Convert FEN to 768-dimensional binary feature vector."""
# Piece type to index: pawn=0, knight=1, bishop=2, rook=3, queen=4, king=5
piece_to_idx = {'p': 0, 'n': 1, 'b': 2, 'r': 3, 'q': 4, 'k': 5,
'P': 6, 'N': 7, 'B': 8, 'R': 9, 'Q': 10, 'K': 11}
features = torch.zeros(768, dtype=torch.float32)
try:
board = chess.Board(fen)
# 12 piece types × 64 squares = 768
for square in chess.SQUARES:
piece = board.piece_at(square)
if piece is not None:
piece_char = piece.symbol()
if piece_char in piece_to_idx:
piece_idx = piece_to_idx[piece_char]
feature_idx = piece_idx * 64 + square
features[feature_idx] = 1.0
    except ValueError:
        # Invalid FEN: fall through and return the all-zero feature vector
        pass
return features
class NNUE(nn.Module):
"""NNUE neural network architecture."""
def __init__(self):
super().__init__()
self.l1 = nn.Linear(768, 256)
self.relu1 = nn.ReLU()
self.l2 = nn.Linear(256, 32)
self.relu2 = nn.ReLU()
self.l3 = nn.Linear(32, 1)
self.sigmoid = nn.Sigmoid()
    def forward(self, x):
        x = self.l1(x)
        x = self.relu1(x)
        x = self.l2(x)
        x = self.relu2(x)
        x = self.l3(x)
        # Squash to (0, 1) so outputs match the sigmoid-scaled targets in NNUEDataset
        return self.sigmoid(x)
def find_next_version(base_name="nnue_weights"):
"""Find the next version number for model versioning.
Looks for nnue_weights_v*.pt files and returns the next version number.
If no versioned files exist, returns 1.
"""
pattern = re.compile(rf"{re.escape(base_name)}_v(\d+)\.pt")
versions = []
for file in Path(".").glob(f"{base_name}_v*.pt"):
match = pattern.match(file.name)
if match:
versions.append(int(match.group(1)))
if versions:
return max(versions) + 1
return 1
def save_metadata(weights_file, metadata):
"""Save training metadata alongside the weights file.
Args:
weights_file: Path to the .pt file (e.g., nnue_weights_v1.pt)
metadata: Dictionary with training info
"""
metadata_file = weights_file.replace(".pt", "_metadata.json")
with open(metadata_file, "w") as f:
json.dump(metadata, f, indent=2, default=str)
return metadata_file
def train_nnue(data_file, output_file="nnue_weights.pt", epochs=20, batch_size=4096, lr=1e-3, checkpoint=None, stockfish_depth=12, use_versioning=True):
"""Train the NNUE model.
Args:
data_file: Path to training_data.jsonl
output_file: Where to save best weights (or base name if use_versioning=True)
epochs: Number of training epochs
batch_size: Training batch size
lr: Learning rate
checkpoint: Optional path to checkpoint file to resume from
stockfish_depth: Depth used in Stockfish evaluation (for metadata)
use_versioning: If True, save as nnue_weights_v{N}.pt with metadata
"""
print("Loading dataset...")
dataset = NNUEDataset(data_file)
num_positions = len(dataset)
print(f"Dataset size: {num_positions}")
# Split 90% train, 10% validation
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
from torch.utils.data import random_split
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
# Device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Model
model = NNUE().to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
# Load checkpoint if provided
checkpoint_to_load = checkpoint
if checkpoint_to_load is None and Path(output_file).exists():
# Auto-detect checkpoint: if output file already exists, use it as checkpoint
checkpoint_to_load = output_file
start_epoch = 0
if checkpoint_to_load is not None and Path(checkpoint_to_load).exists():
print(f"Loading checkpoint from {checkpoint_to_load}...")
try:
checkpoint_state = torch.load(checkpoint_to_load, map_location=device)
model.load_state_dict(checkpoint_state)
print(f"✓ Checkpoint loaded successfully")
except Exception as e:
print(f"Warning: Could not load checkpoint: {e}")
print("Training from scratch instead")
best_val_loss = float('inf')
best_model_state = None
print(f"Training for {epochs} epochs (starting from epoch {start_epoch + 1})...")
print()
training_start_time = datetime.now()
for epoch in range(start_epoch, start_epoch + epochs):
# Train
model.train()
train_loss = 0.0
epoch_display = epoch + 1
total_epochs = start_epoch + epochs
with tqdm(total=len(train_loader), desc=f"Epoch {epoch_display}/{total_epochs} - Train") as pbar:
for batch_features, batch_targets in train_loader:
batch_features = batch_features.to(device)
batch_targets = batch_targets.to(device).unsqueeze(1)
optimizer.zero_grad()
outputs = model(batch_features)
loss = criterion(outputs, batch_targets)
loss.backward()
optimizer.step()
train_loss += loss.item() * batch_features.size(0)
pbar.update(1)
train_loss /= len(train_dataset)
# Validation
model.eval()
val_loss = 0.0
with torch.no_grad():
with tqdm(total=len(val_loader), desc=f"Epoch {epoch_display}/{total_epochs} - Val") as pbar:
for batch_features, batch_targets in val_loader:
batch_features = batch_features.to(device)
batch_targets = batch_targets.to(device).unsqueeze(1)
outputs = model(batch_features)
loss = criterion(outputs, batch_targets)
val_loss += loss.item() * batch_features.size(0)
pbar.update(1)
val_loss /= len(val_dataset)
print(f"Epoch {epoch_display}: Train Loss = {train_loss:.6f}, Val Loss = {val_loss:.6f}")
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            # Clone the tensors: state_dict() returns live references that
            # later epochs would otherwise overwrite
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
# Save best model
if best_model_state is not None:
# Determine final output file with versioning
final_output_file = output_file
metadata = {}
if use_versioning:
base_name = output_file.replace(".pt", "")
version = find_next_version(base_name)
final_output_file = f"{base_name}_v{version}.pt"
# Prepare metadata
metadata = {
"version": version,
"date": training_start_time.isoformat(),
"num_positions": num_positions,
"stockfish_depth": stockfish_depth,
"epochs": epochs,
"batch_size": batch_size,
"learning_rate": lr,
"final_val_loss": float(best_val_loss),
"device": str(device),
"checkpoint": str(checkpoint) if checkpoint else None,
"notes": "Win rate vs classical eval: TBD (requires benchmark games)"
}
torch.save(best_model_state, final_output_file)
print(f"Best model saved to {final_output_file}")
# Save metadata if versioning is enabled
if use_versioning and metadata:
metadata_file = save_metadata(final_output_file, metadata)
print(f"Metadata saved to {metadata_file}")
print(f"\nTraining Summary:")
print(f" Version: v{metadata['version']}")
print(f" Positions: {metadata['num_positions']}")
print(f" Stockfish depth: {metadata['stockfish_depth']}")
print(f" Epochs: {metadata['epochs']}")
print(f" Final validation loss: {metadata['final_val_loss']:.6f}")
print(f" Device: {metadata['device']}")
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Train NNUE neural network for chess evaluation")
parser.add_argument("data_file", nargs="?", default="training_data.jsonl",
help="Path to training_data.jsonl (default: training_data.jsonl)")
parser.add_argument("output_file", nargs="?", default="nnue_weights.pt",
help="Output file base name (default: nnue_weights.pt)")
parser.add_argument("--checkpoint", type=str, default=None,
help="Path to checkpoint file to resume training from (optional)")
parser.add_argument("--epochs", type=int, default=20,
help="Number of epochs to train (default: 20)")
parser.add_argument("--batch-size", type=int, default=4096,
help="Batch size (default: 4096)")
parser.add_argument("--lr", type=float, default=1e-3,
help="Learning rate (default: 1e-3)")
parser.add_argument("--stockfish-depth", type=int, default=12,
help="Stockfish depth used for evaluations (for metadata, default: 12)")
parser.add_argument("--no-versioning", action="store_true",
help="Disable automatic versioning (save directly to output file)")
args = parser.parse_args()
train_nnue(
data_file=args.data_file,
output_file=args.output_file,
epochs=args.epochs,
batch_size=args.batch_size,
lr=args.lr,
checkpoint=args.checkpoint,
stockfish_depth=args.stockfish_depth,
use_versioning=not args.no_versioning
)