# Debugging the NNUE Pipeline
## Common Issues & Solutions
### Issue 1: Empty training_data.jsonl
**Symptom:** After running the pipeline, `training_data.jsonl` is empty or doesn't exist.
**Diagnosis:** Run labeling with verbose output:
```bash
python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --verbose
```
**Check these in order:**
#### 1. Is `positions.txt` empty?
```bash
wc -l positions.txt
```
If 0 lines: the position generator is failing. See Issue 2.
If >0 lines: positions exist. Continue with step 2.
#### 2. Is Stockfish installed and working?
```bash
# Linux/macOS
which stockfish
stockfish --version
# Windows
where stockfish
C:\path\to\stockfish.exe --version
```
If not found: Install from https://stockfishchess.org
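Finding the binary does not prove that it answers UCI commands. The following is a minimal sketch of a handshake check, assuming the engine speaks standard UCI (`uci` is answered with `uciok`, `isready` with `readyok`). The function name `uci_handshake` is hypothetical and is not part of the pipeline scripts.

```python
import subprocess


def uci_handshake(command, timeout=10.0):
    """Return True if the engine replies `uciok` and `readyok` to a basic UCI handshake.

    `command` is the argument list used to start the engine,
    for example ["/usr/games/stockfish"].
    """
    try:
        result = subprocess.run(
            command,
            input="uci\nisready\nquit\n",  # standard UCI commands
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or the engine did not exit in time.
        return False
    replies = {line.strip() for line in result.stdout.splitlines()}
    return "uciok" in replies and "readyok" in replies
```

Calling `uci_handshake(["stockfish"])` returns `False` when the binary is missing or never completes the handshake, which separates an installation problem from a labeling problem.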
#### 3. Is the Stockfish path correct?
```bash
# Set the path explicitly, confirm it, and pass it to the labeler
export STOCKFISH_PATH=/your/path/to/stockfish
echo "$STOCKFISH_PATH"
python label_positions.py positions.txt training_data.jsonl "$STOCKFISH_PATH" --verbose
```
The script will print at the top: `Using Stockfish: /path/to/stockfish`
#### 4. Check the error summary
After running with verbose, look for the summary:
```
============================================================
LABELING SUMMARY
============================================================
Successfully evaluated: 0 ← This should be > 0
Skipped (duplicates): 0
Skipped (invalid): 0
Errors: 0
```
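If "Successfully evaluated" is 0, no positions were written to the output file. Check the other counters: a high "Skipped (invalid)" count points to malformed FENs in `positions.txt`, and a high "Errors" count points to Stockfish failures (see Issue 3).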
---
### Issue 2: Empty positions.txt
**Symptom:** `positions.txt` is empty after running `generate_positions.py`
**Diagnosis:** Check the generation summary:
```bash
python generate_positions.py positions.txt --games 10000
```
Expected output:
```
============================================================
POSITION GENERATION SUMMARY
============================================================
Total games: 10000
Saved positions: 1234 ← This should be > 0
Filtered (check): 2345
Filtered (captures): 4321
Filtered (game over): 1100
Total filtered: 7766
Acceptance rate: 12.34%
============================================================
```
**If Saved positions = 0:**
The filters are too strict! Try with `--no-filter-captures`:
```bash
python generate_positions.py positions.txt --games 10000 --no-filter-captures
```
This allows positions with available captures, which should greatly increase the output.
---
### Issue 3: Stockfish Errors During Labeling
**Symptom:** Labeling runs but shows errors like:
```
Error evaluating position: rnbqkbnr/pppppppp...
SomeError: [error details]
```
**Solutions:**
1. **Check Stockfish is responsive:**
```bash
# Test Stockfish directly: a working engine answers with "uciok" and "readyok"
printf 'uci\nisready\nquit\n' | stockfish
```
2. **Try with lower depth** (faster, fewer timeouts):
```bash
python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --depth 8
```
3. **Use explicit path** instead of relying on PATH:
```bash
python label_positions.py positions.txt training_data.jsonl /usr/games/stockfish
```
4. **Check if FENs in positions.txt are valid:**
```bash
head -5 positions.txt
```
Output should look like:
```
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
```
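Inspecting the first few lines by eye will not catch a malformed FEN deeper in the file. The sketch below performs a structural check only (six fields, eight ranks of eight squares, plausible side, castling, en passant, and move counters) using the standard library. It assumes standard chess FENs, does not verify that a position is legal, and the name `fen_problem` is hypothetical.

```python
import re

CASTLING = re.compile(r"-|K?Q?k?q?")
EN_PASSANT = re.compile(r"-|[a-h][36]")


def fen_problem(fen):
    """Return a description of the first structural problem, or None if none is found."""
    fields = fen.split()
    if len(fields) != 6:
        return f"expected 6 fields, found {len(fields)}"
    board, side, castling, en_passant, halfmove, fullmove = fields
    ranks = board.split("/")
    if len(ranks) != 8:
        return f"expected 8 ranks, found {len(ranks)}"
    for index, rank in enumerate(ranks):
        squares = 0
        for char in rank:
            if char in "12345678":
                squares += int(char)  # a digit encodes a run of empty squares
            elif char in "pnbrqkPNBRQK":
                squares += 1
            else:
                return f"unexpected character {char!r} in rank {8 - index}"
        if squares != 8:
            return f"rank {8 - index} covers {squares} squares, expected 8"
    if side not in ("w", "b"):
        return f"side to move must be 'w' or 'b', got {side!r}"
    if not CASTLING.fullmatch(castling):
        return f"bad castling field {castling!r}"
    if not EN_PASSANT.fullmatch(en_passant):
        return f"bad en passant field {en_passant!r}"
    if not halfmove.isdecimal():
        return f"bad halfmove clock {halfmove!r}"
    if not fullmove.isdecimal() or int(fullmove) < 1:
        return f"bad fullmove number {fullmove!r}"
    return None
```

Running `fen_problem` over every line of `positions.txt` and printing the line number of each non-`None` result shows whether the labeler's "Skipped (invalid)" count is explained by the input file.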
---
### Issue 4: Training Fails - No Valid Data
**Symptom:** `train_nnue.py` crashes with:
```
IndexError: list index out of range
```
**Cause:** `training_data.jsonl` is empty or contains invalid JSON.
**Debug:**
```bash
# Check file size
ls -lh training_data.jsonl
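# Count valid lines (blank or malformed lines are skipped instead of raising)
python - <<'EOF'
import json
valid = 0
for line in open('training_data.jsonl'):
    try:
        json.loads(line)
    except json.JSONDecodeError:
        continue
    valid += 1
print(f'Valid lines: {valid}')
EOF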
# Look at first few lines
head -3 training_data.jsonl
```
Expected output:
```
{"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "eval": 45}
{"fen": "rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2", "eval": 48}
```
If empty: go back to Issue 1.
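When the file is not empty but training still fails, it helps to know which lines are unusable. The sketch below reports line numbers and reasons. It assumes each record needs the `fen` and `eval` keys shown in the expected output above; whether `train_nnue.py` requires exactly these keys is an assumption, and `scan_training_data` is a hypothetical helper name.

```python
import json


def scan_training_data(path):
    """Return (valid_count, problems), where problems is a list of (line_number, reason)."""
    valid = 0
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                problems.append((number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((number, f"invalid JSON: {error.msg}"))
                continue
            # Assumed record shape: {"fen": ..., "eval": ...}
            if not isinstance(record, dict) or "fen" not in record or "eval" not in record:
                problems.append((number, "missing 'fen' or 'eval' key"))
                continue
            valid += 1
    return valid, problems
```

A result of `(0, [])` means the file is empty, which points back to Issue 1. A non-empty `problems` list points to a labeling run that was interrupted or wrote corrupt lines.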
---
## Step-by-Step Verification
Run this to verify each step works:
```bash
#!/usr/bin/env bash
# Assumes the script is started from the repository root.
cd modules/bot/python || exit 1

# Step 1: Generate 1000 positions (quick test)
echo "Testing position generation..."
python generate_positions.py test_positions.txt --games 1000 --no-filter-captures
# Check output
if [ ! -s test_positions.txt ]; then
    echo "ERROR: test_positions.txt is empty"
    exit 1
fi
POSITIONS=$(wc -l < test_positions.txt)
echo "✓ Generated $POSITIONS positions"

# Step 2: Label positions (quick test with 100 positions)
echo "Testing Stockfish labeling..."
STOCKFISH_PATH=$(command -v stockfish || command -v /usr/games/stockfish || echo "stockfish")
export STOCKFISH_PATH
if ! command -v "$STOCKFISH_PATH" > /dev/null 2>&1; then
    echo "ERROR: Stockfish not found"
    echo "  Install: apt-get install stockfish (Linux) or brew install stockfish (Mac)"
    exit 1
fi
head -100 test_positions.txt > test_positions_100.txt
python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8
# Check output
if [ ! -s test_training_data.jsonl ]; then
    echo "ERROR: test_training_data.jsonl is empty"
    echo "  Running again with --verbose:"
    python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8 --verbose
    exit 1
fi
EVALS=$(wc -l < test_training_data.jsonl)
echo "✓ Evaluated $EVALS positions"

# Step 3: Test training
echo "Testing training..."
python train_nnue.py test_training_data.jsonl test_weights.pt --epochs 1 --batch-size 32 --no-versioning
if [ ! -f test_weights.pt ]; then
    echo "ERROR: training failed"
    exit 1
fi
echo "✓ Training works"
echo ""
echo "All tests passed! Pipeline is working correctly."
echo "You can now run the full pipeline with:"
echo "  ./run_pipeline.sh"
```
Save as `test_pipeline.sh` and run:
```bash
chmod +x test_pipeline.sh
./test_pipeline.sh
```
---
## Common Error Messages
### "Stockfish not found at stockfish"
```bash
# Set the full path
export STOCKFISH_PATH=/usr/games/stockfish
# Or on Windows:
set STOCKFISH_PATH=C:\stockfish\stockfish.exe
```
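The lookup can also be done in Python before any labeling starts. The sketch below tries an explicit argument, then the `STOCKFISH_PATH` environment variable, then `stockfish` on `PATH`, then `/usr/games/stockfish`. This order and the name `resolve_stockfish` are assumptions for illustration and may differ from what `label_positions.py` actually does.

```python
import os
import shutil


def resolve_stockfish(explicit=None, environ=None, fallbacks=("/usr/games/stockfish",)):
    """Return the first usable Stockfish executable, or None if none is found."""
    environ = os.environ if environ is None else environ
    candidates = [explicit, environ.get("STOCKFISH_PATH"), "stockfish", *fallbacks]
    for candidate in candidates:
        if not candidate:
            continue
        # shutil.which handles bare names (searched on PATH) and full paths alike,
        # and only returns entries that exist and are executable.
        found = shutil.which(candidate)
        if found:
            return found
    return None
```

A `None` result means none of the candidates is an executable file, so the fix is to install Stockfish or correct `STOCKFISH_PATH` rather than to rerun the labeler.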
### "No such file or directory: positions.txt"
```bash
# Make sure you're in the right directory
cd modules/bot/python
# Or provide full path
python label_positions.py /full/path/to/positions.txt training_data.jsonl stockfish
```
### "JSONDecodeError" in training
```bash
# training_data.jsonl has invalid JSON
# Regenerate it:
rm training_data.jsonl
python label_positions.py positions.txt training_data.jsonl stockfish
```
### "CUDA out of memory"
```bash
# Reduce batch size
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024
```
---
## Getting More Information
### Verbose Output
All scripts support `--verbose` for detailed debugging:
```bash
python label_positions.py positions.txt training_data.jsonl stockfish --verbose
```
This prints:
- Which Stockfish is being used
- Error details for each failed position
- Summary of what passed/failed/skipped
### File Size Checks
```bash
# Check all files
ls -lh positions.txt training_data.jsonl nnue_weights.pt
# Count lines
echo "Positions: $(wc -l < positions.txt)"
echo "Training data: $(wc -l < training_data.jsonl)"
```
### Quick Tests
```bash
# Test position generation (100 games)
python generate_positions.py test_pos.txt --games 100 --no-filter-captures
# Test Stockfish labeling (10 positions)
head -10 test_pos.txt > test_pos_10.txt
python label_positions.py test_pos_10.txt test_data_10.jsonl stockfish --depth 6
# Test training (on test data)
python train_nnue.py test_data_10.jsonl test_model.pt --epochs 1 --batch-size 8
```
---
## Pipeline Workflow with Debugging
```bash
# 1. Generate positions
python generate_positions.py positions.txt --games 100000 --no-filter-captures
# Should output: Saved positions: ~20000-40000 (depends on filter)
# 2. Label with Stockfish
export STOCKFISH_PATH=$(which stockfish)
python label_positions.py positions.txt training_data.jsonl $STOCKFISH_PATH --depth 10
# Should output: Successfully evaluated: > 0
# 3. Train model
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5
# Should output: Training summary with version info
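# 4. Export to Scala (if versioning is enabled, use the versioned file name reported in the training summary, e.g. nnue_weights_v1.pt)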
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
# Should output: NNUEWeights.scala created
# 5. Compile Scala
cd ../..
./compile
# Should output: BUILD SUCCESSFUL
```
---
## Performance Monitoring
While labeling is running, monitor progress:
```bash
# In another terminal
watch -n 5 'wc -l modules/bot/python/training_data.jsonl'
# Or on macOS
while true; do echo $(wc -l < modules/bot/python/training_data.jsonl) positions labeled; sleep 5; done
```
This shows the running total of labeled positions. To get positions per second, divide the difference between two consecutive readings by the 5-second interval.
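The throughput can also be computed directly. The following sketch samples the line count of the output file at a fixed interval and prints the rate. The helper names `count_lines` and `monitor` are hypothetical, and the figures lag behind if the labeler buffers its writes.

```python
import time


def count_lines(path):
    """Return the number of lines currently in the file, or 0 if it does not exist yet."""
    try:
        with open(path, "rb") as handle:
            return sum(1 for _ in handle)
    except FileNotFoundError:
        return 0


def monitor(path, interval=5.0, samples=None):
    """Print the labeled-position count and throughput every `interval` seconds.

    Runs until interrupted when `samples` is None, otherwise stops after
    that many readings.
    """
    previous = count_lines(path)
    taken = 0
    while samples is None or taken < samples:
        time.sleep(interval)
        current = count_lines(path)
        rate = (current - previous) / interval  # positions per second
        print(f"{current} positions labeled ({rate:.1f}/s)")
        previous = current
        taken += 1
```

For example, `monitor("modules/bot/python/training_data.jsonl")` prints one reading every 5 seconds until stopped with Ctrl+C.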
---
## Still Stuck?
1. **Read the full output:** don't skip error messages
2. **Check file sizes:** `ls -lh` shows whether files are being created
3. **Run with `--verbose`:** shows exactly what is failing
4. **Test individual steps:** run each stage on its own instead of the full pipeline
5. **Check Stockfish:** `stockfish --version` confirms it works
For more help, see:
- `README_NNUE.md`: complete pipeline docs
- `TRAINING_GUIDE.md`: training workflows
- `INCREMENTAL_TRAINING.md`: versioning & checkpoints