Debugging the NNUE Pipeline

Common Issues & Solutions

Issue 1: Empty training_data.jsonl

Symptom: After running the pipeline, training_data.jsonl is empty or doesn't exist.

Diagnosis: Run labeling with verbose output:

python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --verbose

Check these in order:

1. Is positions.txt empty?

wc -l positions.txt

If 0 lines: positions generator is failing. See Issue 2.

If >0 lines: positions exist. Check step 2.

2. Is Stockfish installed and working?

# Linux/macOS
which stockfish
stockfish --version

# Windows
where stockfish
C:\path\to\stockfish.exe --version

If not found: Install from https://stockfishchess.org

3. Is the Stockfish path correct?

# Set the path explicitly, then confirm what the shell will pass to the labeler
export STOCKFISH_PATH=/your/path/to/stockfish
echo "$STOCKFISH_PATH"

python label_positions.py positions.txt training_data.jsonl "$STOCKFISH_PATH" --verbose

The script will print at the top: Using Stockfish: /path/to/stockfish

4. Check the error summary

After running with verbose, look for the summary:

============================================================
LABELING SUMMARY
============================================================
Successfully evaluated: 0        ← This should be > 0
Skipped (duplicates):   0
Skipped (invalid):      0
Errors:                 0

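If "Successfully evaluated" is 0, no position was labeled, so nothing was written to training_data.jsonl. The other three counters show where the positions went: skipped as duplicates, rejected as invalid, or failed with an error (see Issue 3).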
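
The first three checks above can be run in one go. The following is a minimal sketch, not part of the pipeline: the function names `engine_answers_uci` and `preflight` are hypothetical, and it relies only on the standard UCI handshake, in which an engine replies `uciok` to the `uci` command.

```python
import shutil
import subprocess
from pathlib import Path


def engine_answers_uci(command, timeout=10.0):
    """Return True if the engine started by `command` replies 'uciok' to 'uci'."""
    try:
        result = subprocess.run(
            command,
            input="uci\nquit\n",
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or an engine that hangs.
        return False
    return "uciok" in result.stdout


def preflight(positions_path, stockfish="stockfish"):
    """Collect human-readable problems; an empty list means all checks passed."""
    problems = []
    path = Path(positions_path)
    if not path.is_file() or path.stat().st_size == 0:
        problems.append(f"{positions_path} is missing or empty (see Issue 2)")
    resolved = shutil.which(stockfish)  # accepts a bare name or a full path
    if resolved is None:
        problems.append(f"Stockfish not found: {stockfish}")
    elif not engine_answers_uci([resolved]):
        problems.append(f"{resolved} did not answer the UCI handshake")
    return problems
```

Calling `preflight("positions.txt", "/path/to/stockfish")` before labeling returns a list of problems; an empty list means the input file and the engine both look usable.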


Issue 2: Empty positions.txt

Symptom: positions.txt is empty after running generate_positions.py

Diagnosis: Check the generation summary:

python generate_positions.py positions.txt --games 10000

Expected output:

============================================================
POSITION GENERATION SUMMARY
============================================================
Total games:               10000
Saved positions:           1234        ← This should be > 0
Filtered (check):          2345
Filtered (captures):       4321
Filtered (game over):      1100
Total filtered:            7766
Acceptance rate:           12.34%
============================================================

If Saved positions = 0:

The filters are too strict! Try with --no-filter-captures:

python generate_positions.py positions.txt --games 10000 --no-filter-captures

This allows positions with available captures, which should greatly increase the output.
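
To decide which filter to relax, it helps to see which one rejects the most positions. The sketch below redoes the arithmetic of the sample summary above. The function name `summarize_filters` is hypothetical, and whether generate_positions.py computes its acceptance rate as saved positions divided by total games is an assumption; the sample figures are merely consistent with that reading.

```python
def summarize_filters(total_games, saved, filtered):
    """Summarize a generation run.

    `filtered` maps a filter name to the number of positions it rejected.
    """
    total_filtered = sum(filtered.values())
    # The filter with the largest count is the first candidate to relax.
    dominant = max(filtered, key=filtered.get) if filtered else None
    saved_pct = 100.0 * saved / total_games if total_games else 0.0
    return {
        "total_filtered": total_filtered,
        "saved_per_game_pct": saved_pct,
        "dominant_filter": dominant,
    }


# Figures copied from the sample summary in this section.
sample = summarize_filters(
    total_games=10000,
    saved=1234,
    filtered={"check": 2345, "captures": 4321, "game over": 1100},
)
print(sample)
```

In the sample, the capture filter rejects more positions than the other two, which is why `--no-filter-captures` is the first thing to try.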


Issue 3: Stockfish Errors During Labeling

Symptom: Labeling runs but shows errors like:

Error evaluating position: rnbqkbnr/pppppppp...
  SomeError: [error details]

Solutions:

  1. Check Stockfish is responsive:

    # Test Stockfish directly; a responsive engine replies with "readyok"
    printf "isready\nquit\n" | stockfish
    
  2. Try with lower depth (faster, fewer timeouts):

    python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --depth 8
    
  3. Use explicit path instead of relying on PATH:

    python label_positions.py positions.txt training_data.jsonl /usr/games/stockfish
    
  4. Check if FENs in positions.txt are valid:

    head -5 positions.txt
    

    Output should look like:

    rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
    rnbqkbnr/pppppppp/8/8/3P4/8/PPP1PPPP/RNBQKBNR b KQkq d3 0 1
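
Eyeballing the first five lines does not scale to a large positions.txt. The following is a minimal structural check using only the standard library. The function name `fen_problems` is hypothetical, and it only verifies the shape of a FEN (six fields, eight ranks of eight squares, a valid side to move), not whether the position is legal.

```python
def fen_problems(fen):
    """Return a list of structural problems; empty means the FEN looks well-formed."""
    fields = fen.split()
    if len(fields) != 6:
        return [f"expected 6 space-separated fields, got {len(fields)}"]
    problems = []
    placement, side = fields[0], fields[1]
    ranks = placement.split("/")
    if len(ranks) != 8:
        problems.append(f"expected 8 ranks, got {len(ranks)}")
    for index, rank in enumerate(ranks):
        squares = 0
        for ch in rank:
            if ch in "12345678":
                squares += int(ch)  # a digit is a run of empty squares
            elif ch in "pnbrqkPNBRQK":
                squares += 1
            else:
                problems.append(f"rank {8 - index}: unexpected character {ch!r}")
        if squares != 8:
            problems.append(f"rank {8 - index}: covers {squares} squares, not 8")
    if side not in ("w", "b"):
        problems.append(f"side to move must be 'w' or 'b', got {side!r}")
    return problems
```

Running it over every line of positions.txt and printing the line numbers with a non-empty result shows whether the generator or the labeler is at fault.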
    

Issue 4: Training Fails - No Valid Data

Symptom: train_nnue.py crashes with:

IndexError: list index out of range

Cause: training_data.jsonl is empty or contains invalid JSON.

Debug:

# Check file size
ls -lh training_data.jsonl

# Count parseable lines (stops with JSONDecodeError at the first invalid or blank line)
python -c "import json; lines = [1 for line in open('training_data.jsonl') if json.loads(line)]; print(f'Valid lines: {len(lines)}')"

# Look at first few lines
head -3 training_data.jsonl

Expected output:

{"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "eval": 45}
{"fen": "rnbqkbnr/pppppppp/8/8/3P4/8/PPP1PPPP/RNBQKBNR b KQkq d3 0 1", "eval": 48}

If empty: go back to Issue 1.
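
If the file is not empty but training still fails, it helps to know which lines are bad. The sketch below counts valid records and reports the line numbers of the rest. The function name `check_jsonl` is hypothetical, and treating "fen" and "eval" as the required keys is an assumption based on the expected output above, not on what train_nnue.py actually checks.

```python
import json


def check_jsonl(path, required_keys=("fen", "eval")):
    """Return (valid_count, bad_line_numbers) for a JSON Lines file."""
    valid = 0
    bad_lines = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Covers truncated records and blank lines alike.
                bad_lines.append(number)
                continue
            if isinstance(record, dict) and all(key in record for key in required_keys):
                valid += 1
            else:
                bad_lines.append(number)
    return valid, bad_lines
```

For example, `valid, bad = check_jsonl("training_data.jsonl")` followed by `print(valid, bad[:10])` shows the number of usable records and the first few offending line numbers.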


Step-by-Step Verification

Run this to verify each step works:

#!/usr/bin/env bash
cd modules/bot/python

# Step 1: Generate 1000 positions (quick test)
echo "Testing position generation..."
python generate_positions.py test_positions.txt --games 1000 --no-filter-captures

# Check output
if [ ! -s test_positions.txt ]; then
    echo "ERROR: test_positions.txt is empty"
    exit 1
fi
POSITIONS=$(wc -l < test_positions.txt)
echo "✓ Generated $POSITIONS positions"

# Step 2: Label positions (quick test with 100 positions)
echo "Testing Stockfish labeling..."
export STOCKFISH_PATH=$(which stockfish || which /usr/games/stockfish || echo "stockfish")
if ! command -v "$STOCKFISH_PATH" &> /dev/null; then
    echo "ERROR: Stockfish not found"
    echo "  Install: apt-get install stockfish (Linux) or brew install stockfish (Mac)"
    exit 1
fi

head -100 test_positions.txt > test_positions_100.txt
python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8

# Check output
if [ ! -s test_training_data.jsonl ]; then
    echo "ERROR: test_training_data.jsonl is empty"
    echo "  Re-running with --verbose:"
    python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8 --verbose
    exit 1
fi
EVALS=$(wc -l < test_training_data.jsonl)
echo "✓ Evaluated $EVALS positions"

# Step 3: Test training
echo "Testing training..."
python train_nnue.py test_training_data.jsonl test_weights.pt --epochs 1 --batch-size 32 --no-versioning

if [ ! -f test_weights.pt ]; then
    echo "ERROR: training failed"
    exit 1
fi
echo "✓ Training works"

echo ""
echo "All tests passed! Pipeline is working correctly."
echo "You can now run the full pipeline with:"
echo "  ./run_pipeline.sh"

Save the script as test_pipeline.sh in the repository root (it changes into modules/bot/python itself), then run:

chmod +x test_pipeline.sh
./test_pipeline.sh

Common Error Messages

"Stockfish not found at stockfish"

# Set the full path
export STOCKFISH_PATH=/usr/games/stockfish
# Or on Windows:
set STOCKFISH_PATH=C:\stockfish\stockfish.exe

"No such file or directory: positions.txt"

# Make sure you're in the right directory
cd modules/bot/python

# Or provide full path
python label_positions.py /full/path/to/positions.txt training_data.jsonl stockfish

"JSONDecodeError" in training

# training_data.jsonl has invalid JSON
# Regenerate it:
rm training_data.jsonl
python label_positions.py positions.txt training_data.jsonl stockfish

"CUDA out of memory"

# Reduce batch size
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024

Getting More Information

Verbose Output

All scripts support --verbose for detailed debugging:

python label_positions.py positions.txt training_data.jsonl stockfish --verbose

This prints:

  • Which Stockfish is being used
  • Error details for each failed position
  • Summary of what passed/failed/skipped

File Size Checks

# Check all files
ls -lh positions.txt training_data.jsonl nnue_weights.pt

# Count lines
echo "Positions: $(wc -l < positions.txt)"
echo "Training data: $(wc -l < training_data.jsonl)"

Quick Tests

# Test position generation (100 games)
python generate_positions.py test_pos.txt --games 100 --no-filter-captures

# Test Stockfish labeling (10 positions)
head -10 test_pos.txt > test_pos_10.txt
python label_positions.py test_pos_10.txt test_data_10.jsonl stockfish --depth 6

# Test training (on test data)
python train_nnue.py test_data_10.jsonl test_model.pt --epochs 1 --batch-size 8

Pipeline Workflow with Debugging

# 1. Generate positions
python generate_positions.py positions.txt --games 100000 --no-filter-captures
# Should output: Saved positions: ~20000-40000 (depends on filter)

# 2. Label with Stockfish
export STOCKFISH_PATH=$(which stockfish)
python label_positions.py positions.txt training_data.jsonl $STOCKFISH_PATH --depth 10
# Should output: Successfully evaluated: > 0

# 3. Train model
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5
# Should output: Training summary with version info

# 4. Export to Scala (nnue_weights_v1.pt assumes step 3 wrote a versioned file; use the name it reports)
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
# Should output: NNUEWeights.scala created

# 5. Compile Scala
cd ../..
./compile
# Should output: BUILD SUCCESSFUL

Performance Monitoring

While labeling is running, monitor progress:

# In another terminal
watch -n 5 'wc -l modules/bot/python/training_data.jsonl'

# Or where watch is not installed (e.g. a default macOS setup)
while true; do echo "$(wc -l < modules/bot/python/training_data.jsonl) positions labeled"; sleep 5; done

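This shows the cumulative number of labeled positions every 5 seconds. To get positions per second, subtract the previous reading from the current one and divide by 5.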
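
The rate calculation can also be automated. The following is a small sketch with hypothetical helper names (`count_lines`, `positions_per_second`, `monitor`). It assumes the labeler appends one line per evaluated position to the output file while it runs.

```python
import time


def count_lines(path):
    """Count lines in a file without loading it into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def positions_per_second(previous_count, current_count, interval_seconds):
    """Labeling rate between two readings taken interval_seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (current_count - previous_count) / interval_seconds


def monitor(path, interval_seconds=5.0, samples=3):
    """Print the cumulative count and the rate since the previous reading."""
    previous = count_lines(path)
    for _ in range(samples):
        time.sleep(interval_seconds)
        current = count_lines(path)
        rate = positions_per_second(previous, current, interval_seconds)
        print(f"{current} positions labeled ({rate:.1f}/s)")
        previous = current
```

For example, `monitor("modules/bot/python/training_data.jsonl")` prints three readings five seconds apart. A rate that stays at 0.0 while the labeler is still running suggests that Stockfish is hanging or that every position is being skipped.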


Still Stuck?

  1. Read the full output: don't skip error messages
  2. Check file sizes: ls -lh shows if files are being created
  3. Run with --verbose: shows exactly what's failing
  4. Test individual steps: don't run the full pipeline, test pieces
  5. Check Stockfish: stockfish --version confirms it works

For more help, see:

  • README_NNUE.md — Complete pipeline docs
  • TRAINING_GUIDE.md — Training workflows
  • INCREMENTAL_TRAINING.md — Versioning & checkpoints