Debugging the NNUE Pipeline
Common Issues & Solutions
Issue 1: Empty training_data.jsonl
Symptom: After running the pipeline, training_data.jsonl is empty or doesn't exist.
Diagnosis: Run labeling with verbose output:
python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --verbose
Check these in order:
1. Is positions.txt empty?
wc -l positions.txt
If 0 lines: positions generator is failing. See Issue 2.
If >0 lines: positions exist. Check step 2.
2. Is Stockfish installed and working?
# Linux/macOS
which stockfish
stockfish --version
# Windows
where stockfish
C:\path\to\stockfish.exe --version
If not found: Install from https://stockfishchess.org
3. Is the Stockfish path correct?
# Check what path the labeler is using
export STOCKFISH_PATH=/your/path/to/stockfish
echo $STOCKFISH_PATH
python label_positions.py positions.txt training_data.jsonl $STOCKFISH_PATH --verbose
The script will print at the top: Using Stockfish: /path/to/stockfish
4. Check the error summary
After running with verbose, look for the summary:
============================================================
LABELING SUMMARY
============================================================
Successfully evaluated: 0 ← This should be > 0
Skipped (duplicates): 0
Skipped (invalid): 0
Errors: 0
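If "Successfully evaluated" is 0, no position was labeled, so nothing was written to training_data.jsonl. The Skipped and Errors counts show where the input went; if every count is 0, the labeler read no positions at all.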
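Steps 2 and 3 above come down to finding an executable Stockfish binary. The following is a minimal Python sketch of that lookup, not the code in label_positions.py, whose actual lookup may differ. The helper name resolve_stockfish and the precedence (explicit argument, then STOCKFISH_PATH, then PATH) are assumptions made for illustration.

```python
import os
import shutil
from typing import Optional


def resolve_stockfish(cli_path: Optional[str] = None) -> Optional[str]:
    """Return an absolute path to a Stockfish executable, or None if none is found.

    Hypothetical helper: the candidate order below is an assumption,
    not the documented behavior of label_positions.py.
    """
    candidates = [cli_path, os.environ.get("STOCKFISH_PATH"), "stockfish"]
    for candidate in candidates:
        if not candidate:
            continue
        # shutil.which handles bare names (searched on PATH) and explicit paths,
        # and only returns entries that exist and are executable.
        found = shutil.which(candidate)
        if found:
            return os.path.abspath(found)
    return None
```

Calling resolve_stockfish("/usr/games/stockfish") returns the absolute path when that file is executable, and otherwise falls through to STOCKFISH_PATH and PATH. A None result corresponds to the "If not found" case in step 2.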
Issue 2: Empty positions.txt
Symptom: positions.txt is empty after running generate_positions.py
Diagnosis: Check the generation summary:
python generate_positions.py positions.txt --games 10000
Expected output:
============================================================
POSITION GENERATION SUMMARY
============================================================
Total games: 10000
Saved positions: 1234 ← This should be > 0
Filtered (check): 2345
Filtered (captures): 4321
Filtered (game over): 1100
Total filtered: 7766
Acceptance rate: 12.34%
============================================================
If Saved positions = 0:
The filters are too strict! Try with --no-filter-captures:
python generate_positions.py positions.txt --games 10000 --no-filter-captures
This allows positions with available captures, which should greatly increase the output.
Issue 3: Stockfish Errors During Labeling
Symptom: Labeling runs but shows errors like:
Error evaluating position: rnbqkbnr/pppppppp...
SomeError: [error details]
Solutions:
- Check Stockfish is responsive:
# Test Stockfish directly
echo "position startpos" | stockfish
echo "quit" | stockfish
- Try with lower depth (faster, fewer timeouts):
python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --depth 8
- Use explicit path instead of relying on PATH:
python label_positions.py positions.txt training_data.jsonl /usr/games/stockfish
- Check if FENs in positions.txt are valid:
head -5 positions.txt
Output should look like:
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
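Checking FENs by eye with head only covers the first few lines. A rough structural check can scan the whole file. The sketch below assumes positions.txt holds one six-field FEN per line, as in the sample output. The function names looks_like_fen and invalid_lines are hypothetical. It checks shape only (field count, rank count, squares per rank, side to move, move counters) and does not verify that a position is legal.

```python
from typing import List, Tuple


def looks_like_fen(line: str) -> bool:
    """Structural check only: six fields, eight ranks, eight squares per rank."""
    fields = line.split()
    if len(fields) != 6:
        return False
    board, side, _castling, _en_passant, halfmove, fullmove = fields
    ranks = board.split("/")
    if len(ranks) != 8:
        return False
    for rank in ranks:
        squares = 0
        for ch in rank:
            if ch in "12345678":
                squares += int(ch)   # a digit stands for that many empty squares
            elif ch in "pnbrqkPNBRQK":
                squares += 1         # a letter stands for one piece
            else:
                return False
        if squares != 8:
            return False
    if side not in ("w", "b"):
        return False
    # Castling and en passant fields are not inspected in this sketch.
    return halfmove.isdigit() and fullmove.isdigit()


def invalid_lines(path: str) -> List[Tuple[int, str]]:
    """Return (line_number, text) for every non-blank line that fails the check."""
    bad = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip() and not looks_like_fen(line):
                bad.append((number, line.rstrip("\n")))
    return bad
```

An empty list from invalid_lines("positions.txt") suggests the errors come from Stockfish rather than from the input. A strict legality check would need a chess library.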
Issue 4: Training Fails - No Valid Data
Symptom: train_nnue.py crashes with:
IndexError: list index out of range
Cause: training_data.jsonl is empty or contains invalid JSON.
Debug:
# Check file size
ls -lh training_data.jsonl
# Count valid lines (aborts with JSONDecodeError at the first invalid or blank line)
python -c "import json; lines = [1 for line in open('training_data.jsonl') if json.loads(line)]; print(f'Valid lines: {len(lines)}')"
# Look at first few lines
head -3 training_data.jsonl
Expected output:
{"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "eval": 45}
{"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "eval": 48}
If empty: go back to Issue 1.
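The one-liner above stops at the first bad line, so it cannot report how much of the file is usable. A small sketch that counts valid records and lists the offending line numbers, assuming each record is a JSON object with the "fen" and "eval" keys shown in the expected output (the function name count_jsonl is hypothetical):

```python
import json
from typing import List, Tuple


def count_jsonl(path: str) -> Tuple[int, List[int]]:
    """Return (valid_count, invalid_line_numbers) for a JSON Lines file."""
    valid = 0
    invalid = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are skipped, not counted as errors
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                invalid.append(number)
                continue
            # Key names "fen" and "eval" are taken from the sample output above.
            if isinstance(record, dict) and "fen" in record and "eval" in record:
                valid += 1
            else:
                invalid.append(number)
    return valid, invalid
```

A result of (0, []) means the file is empty, which points back to Issue 1. A non-empty second element lists the lines to inspect or drop before training.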
Step-by-Step Verification
Run this to verify each step works:
#!/usr/bin/env bash
cd modules/bot/python
# Step 1: Generate 1000 positions (quick test)
echo "Testing position generation..."
python generate_positions.py test_positions.txt --games 1000 --no-filter-captures
# Check output
if [ ! -s test_positions.txt ]; then
echo "ERROR: test_positions.txt is empty"
exit 1
fi
POSITIONS=$(wc -l < test_positions.txt)
echo "✓ Generated $POSITIONS positions"
# Step 2: Label positions (quick test with 100 positions)
echo "Testing Stockfish labeling..."
export STOCKFISH_PATH="$(which stockfish || which /usr/games/stockfish || echo "stockfish")"
if ! command -v "$STOCKFISH_PATH" > /dev/null 2>&1; then
echo "ERROR: Stockfish not found"
echo " Install: apt-get install stockfish (Linux) or brew install stockfish (Mac)"
exit 1
fi
head -100 test_positions.txt > test_positions_100.txt
python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8
# Check output
if [ ! -s test_training_data.jsonl ]; then
echo "ERROR: test_training_data.jsonl is empty"
echo " Run again with --verbose:"
python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8 --verbose
exit 1
fi
EVALS=$(wc -l < test_training_data.jsonl)
echo "✓ Evaluated $EVALS positions"
# Step 3: Test training
echo "Testing training..."
python train_nnue.py test_training_data.jsonl test_weights.pt --epochs 1 --batch-size 32 --no-versioning
if [ ! -f test_weights.pt ]; then
echo "ERROR: training failed"
exit 1
fi
echo "✓ Training works"
echo ""
echo "All tests passed! Pipeline is working correctly."
echo "You can now run the full pipeline with:"
echo " ./run_pipeline.sh"
Save as test_pipeline.sh and run:
chmod +x test_pipeline.sh
./test_pipeline.sh
Common Error Messages
"Stockfish not found at stockfish"
# Set the full path
export STOCKFISH_PATH=/usr/games/stockfish
# Or on Windows:
set STOCKFISH_PATH=C:\stockfish\stockfish.exe
"No such file or directory: positions.txt"
# Make sure you're in the right directory
cd modules/bot/python
# Or provide full path
python label_positions.py /full/path/to/positions.txt training_data.jsonl stockfish
"JSONDecodeError" in training
# training_data.jsonl has invalid JSON
# Regenerate it:
rm training_data.jsonl
python label_positions.py positions.txt training_data.jsonl stockfish
"CUDA out of memory"
# Reduce batch size
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024
Getting More Information
Verbose Output
All scripts support --verbose for detailed debugging:
python label_positions.py positions.txt training_data.jsonl stockfish --verbose
This prints:
- Which Stockfish is being used
- Error details for each failed position
- Summary of what passed/failed/skipped
File Size Checks
# Check all files
ls -lh positions.txt training_data.jsonl nnue_weights.pt
# Count lines
echo "Positions: $(wc -l < positions.txt)"
echo "Training data: $(wc -l < training_data.jsonl)"
Quick Tests
# Test position generation (100 games)
python generate_positions.py test_pos.txt --games 100 --no-filter-captures
# Test Stockfish labeling (10 positions)
head -10 test_pos.txt > test_pos_10.txt
python label_positions.py test_pos_10.txt test_data_10.jsonl stockfish --depth 6
# Test training (on test data)
python train_nnue.py test_data_10.jsonl test_model.pt --epochs 1 --batch-size 8
Pipeline Workflow with Debugging
# 1. Generate positions
python generate_positions.py positions.txt --games 100000 --no-filter-captures
# Should output: Saved positions: ~20000-40000 (depends on filter)
# 2. Label with Stockfish
export STOCKFISH_PATH=$(which stockfish)
python label_positions.py positions.txt training_data.jsonl $STOCKFISH_PATH --depth 10
# Should output: Successfully evaluated: > 0
# 3. Train model
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5
# Should output: Training summary with version info
# 4. Export to Scala
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
# Should output: NNUEWeights.scala created
# 5. Compile Scala
cd ../..
./compile
# Should output: BUILD SUCCESSFUL
Performance Monitoring
While labeling is running, monitor progress:
# In another terminal
watch -n 5 'wc -l modules/bot/python/training_data.jsonl'
# Or on macOS
while true; do echo $(wc -l < modules/bot/python/training_data.jsonl) positions labeled; sleep 5; done
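This shows the running count of labeled positions every 5 seconds. The difference between two successive readings, divided by 5, gives positions evaluated per second.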
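The rate can also be computed directly instead of by hand. A minimal polling sketch, assuming the labeler appends one line per labeled position to training_data.jsonl (the function names are hypothetical). If the labeler buffers its writes, the readings will lag behind the real progress.

```python
import time


def count_lines(path: str) -> int:
    """Count lines in a file; a file that does not exist yet counts as 0."""
    try:
        with open(path, "rb") as handle:
            return sum(1 for _ in handle)
    except FileNotFoundError:
        return 0


def positions_per_second(previous: int, current: int, interval: float) -> float:
    """Rate between two line-count readings taken `interval` seconds apart."""
    return (current - previous) / interval


def monitor(path: str, interval: float = 5.0, samples: int = 3) -> None:
    """Print the line count and the rate since the previous reading."""
    previous = count_lines(path)
    for _ in range(samples):
        time.sleep(interval)
        current = count_lines(path)
        rate = positions_per_second(previous, current, interval)
        print(f"{current} positions labeled ({rate:.1f}/s)")
        previous = current
```

Calling monitor("modules/bot/python/training_data.jsonl") in another terminal prints three readings five seconds apart. Raise samples for a longer watch.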
Still Stuck?
- Read the full output: don't skip error messages
- Check file sizes: ls -lh shows if files are being created
- Run with --verbose: shows exactly what's failing
- Test individual steps: don't run the full pipeline, test pieces
- Check Stockfish: stockfish --version confirms it works
For more help, see:
- README_NNUE.md: Complete pipeline docs
- TRAINING_GUIDE.md: Training workflows
- INCREMENTAL_TRAINING.md: Versioning & checkpoints