# Debugging the NNUE Pipeline

## Common Issues & Solutions

### Issue 1: Empty training_data.jsonl

**Symptom:** After running the pipeline, `training_data.jsonl` is empty or doesn't exist.

**Diagnosis:** Run labeling with verbose output:

```bash
python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --verbose
```

**Check these in order:**

#### 1. Is `positions.txt` empty?

```bash
wc -l positions.txt
```

If 0 lines: the position generator is failing. See Issue 2.

If >0 lines: positions exist. Check step 2.

#### 2. Is Stockfish installed and working?

```bash
# Linux/macOS
which stockfish
stockfish --version

# Windows
where stockfish
C:\path\to\stockfish.exe --version
```

If not found: Install from https://stockfishchess.org

#### 3. Is the Stockfish path correct?

```bash
# Check what path the labeler is using
export STOCKFISH_PATH=/your/path/to/stockfish
echo "$STOCKFISH_PATH"
python label_positions.py positions.txt training_data.jsonl "$STOCKFISH_PATH" --verbose
```

The script will print at the top: `Using Stockfish: /path/to/stockfish`

#### 4. Check the error summary

After running with `--verbose`, look for the summary:

```
============================================================
LABELING SUMMARY
============================================================
Successfully evaluated:  0  ← This should be > 0
Skipped (duplicates):    0
Skipped (invalid):       0
Errors:                  0
```

If "Successfully evaluated" is 0, no positions were saved. The other counters show whether positions were skipped or failed with errors.

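
The first three checks above can also be scripted. The following is a minimal sketch, not part of the pipeline: the `diagnose` function and its messages are hypothetical, and it only assumes that the positions file is plain text with one position per line and that Stockfish is given either as a full path or as a name on `PATH`.

```python
import os
import shutil


def diagnose(positions_path, stockfish_path):
    """Return a list of problems found, checked in the order used in this guide.

    Hypothetical helper for illustration; it does not call label_positions.py.
    """
    problems = []

    # Check 1: the positions file must exist and hold at least one non-blank line.
    if not os.path.isfile(positions_path):
        problems.append(f"{positions_path} does not exist (see Issue 2)")
    else:
        with open(positions_path, encoding="utf-8") as f:
            count = sum(1 for line in f if line.strip())
        if count == 0:
            problems.append(f"{positions_path} is empty (see Issue 2)")

    # Checks 2 and 3: the engine must resolve to an executable, either as a
    # full path or as a command name found on PATH.
    if shutil.which(stockfish_path) is None:
        problems.append(f"Stockfish not found or not executable: {stockfish_path}")

    return problems


if __name__ == "__main__":
    engine = os.environ.get("STOCKFISH_PATH", "stockfish")
    issues = diagnose("positions.txt", engine)
    for issue in issues:
        print("PROBLEM:", issue)
    if not issues:
        print("Inputs look fine; run the labeler with --verbose next.")
```

An empty result only means the inputs look usable; the labeling summary from step 4 is still the place to see why individual positions were skipped.
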

---

### Issue 2: Empty positions.txt

**Symptom:** `positions.txt` is empty after running `generate_positions.py`

**Diagnosis:** Check the generation summary:

```bash
python generate_positions.py positions.txt --games 10000
```

Expected output:

```
============================================================
POSITION GENERATION SUMMARY
============================================================
Total games:           10000
Saved positions:       1234  ← This should be > 0
Filtered (check):      2345
Filtered (captures):   4321
Filtered (game over):  1100
Total filtered:        7766
Acceptance rate:       12.34%
============================================================
```

**If Saved positions = 0:**

The filters are too strict! Try with `--no-filter-captures`:

```bash
python generate_positions.py positions.txt --games 10000 --no-filter-captures
```

This allows positions with available captures, which should greatly increase the output.

---

### Issue 3: Stockfish Errors During Labeling

**Symptom:** Labeling runs but shows errors like:

```
Error evaluating position: rnbqkbnr/pppppppp...
SomeError: [error details]
```

**Solutions:**

1. **Check Stockfish is responsive:**

   ```bash
   # Test Stockfish directly
   echo "position startpos" | stockfish
   echo "quit" | stockfish
   ```

2. **Try with lower depth** (faster, fewer timeouts):

   ```bash
   python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --depth 8
   ```

3. **Use explicit path** instead of relying on PATH:

   ```bash
   python label_positions.py positions.txt training_data.jsonl /usr/games/stockfish
   ```

4. **Check if FENs in positions.txt are valid:**

   ```bash
   head -5 positions.txt
   ```

   Output should look like:

   ```
   rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
   rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
   ```

---

### Issue 4: Training Fails - No Valid Data

**Symptom:** `train_nnue.py` crashes with:

```
IndexError: list index out of range
```

**Cause:** `training_data.jsonl` is empty or contains invalid JSON.

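
To tell those two causes apart, the file can be scanned line by line without stopping at the first bad record. This is a minimal sketch: `count_jsonl_records` is a hypothetical helper, and the required `fen` and `eval` keys are an assumption based on the sample records shown in this guide, so adjust them if your labeler writes different fields.

```python
import json


def count_jsonl_records(path, required_keys=("fen", "eval")):
    """Return (valid, invalid) line counts for a JSONL file.

    A line is valid if it parses as a JSON object containing every key in
    required_keys. Blank lines are ignored. Bad lines are counted, not raised.
    """
    valid = 0
    invalid = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                invalid += 1
                continue
            if isinstance(record, dict) and all(k in record for k in required_keys):
                valid += 1
            else:
                invalid += 1
    return valid, invalid
```

Calling `count_jsonl_records("training_data.jsonl")` and getting `(0, 0)` points to an empty file (Issue 1), while a non-zero second value points to malformed records.
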

**Debug:**

```bash
# Check file size
ls -lh training_data.jsonl

# Count valid lines (lines that fail to parse are skipped, not fatal)
python - <<'EOF'
import json
valid = 0
with open('training_data.jsonl') as f:
    for line in f:
        try:
            json.loads(line)
            valid += 1
        except json.JSONDecodeError:
            pass
print(f'Valid lines: {valid}')
EOF

# Look at first few lines
head -3 training_data.jsonl
```

Expected output:

```
{"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "eval": 45}
{"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "eval": 48}
```

If empty: go back to Issue 1.

---

## Step-by-Step Verification

Run this to verify each step works:

```bash
#!/usr/bin/env bash
cd modules/bot/python

# Step 1: Generate positions from 1000 games (quick test)
echo "Testing position generation..."
python generate_positions.py test_positions.txt --games 1000 --no-filter-captures

# Check output
if [ ! -s test_positions.txt ]; then
    echo "ERROR: test_positions.txt is empty"
    exit 1
fi

POSITIONS=$(wc -l < test_positions.txt)
echo "✓ Generated $POSITIONS positions"

# Step 2: Label positions (quick test with 100 positions)
echo "Testing Stockfish labeling..."
export STOCKFISH_PATH=$(which stockfish || which /usr/games/stockfish || echo "stockfish")

if ! command -v "$STOCKFISH_PATH" &> /dev/null; then
    echo "ERROR: Stockfish not found"
    echo "  Install: apt-get install stockfish (Linux) or brew install stockfish (Mac)"
    exit 1
fi

head -100 test_positions.txt > test_positions_100.txt
python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8

# Check output
if [ ! -s test_training_data.jsonl ]; then
    echo "ERROR: test_training_data.jsonl is empty"
    echo "  Run again with --verbose:"
    python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8 --verbose
    exit 1
fi

EVALS=$(wc -l < test_training_data.jsonl)
echo "✓ Evaluated $EVALS positions"

# Step 3: Test training
echo "Testing training..."
python train_nnue.py test_training_data.jsonl test_weights.pt --epochs 1 --batch-size 32 --no-versioning

if [ ! -f test_weights.pt ]; then
    echo "ERROR: training failed"
    exit 1
fi

echo "✓ Training works"
echo ""
echo "All tests passed! Pipeline is working correctly."
echo "You can now run the full pipeline with:"
echo "  ./run_pipeline.sh"
```

Save as `test_pipeline.sh` and run:

```bash
chmod +x test_pipeline.sh
./test_pipeline.sh
```

---

## Common Error Messages

### "Stockfish not found at stockfish"

```bash
# Set the full path
export STOCKFISH_PATH=/usr/games/stockfish

# Or on Windows:
set STOCKFISH_PATH=C:\stockfish\stockfish.exe
```

### "No such file or directory: positions.txt"

```bash
# Make sure you're in the right directory
cd modules/bot/python

# Or provide full path
python label_positions.py /full/path/to/positions.txt training_data.jsonl stockfish
```

### "JSONDecodeError" in training

```bash
# training_data.jsonl has invalid JSON
# Regenerate it:
rm training_data.jsonl
python label_positions.py positions.txt training_data.jsonl stockfish
```

### "CUDA out of memory"

```bash
# Reduce batch size
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024
```

---

## Getting More Information

### Verbose Output

All scripts support `--verbose` for detailed debugging:

```bash
python label_positions.py positions.txt training_data.jsonl stockfish --verbose
```

This prints:

- Which Stockfish is being used
- Error details for each failed position
- Summary of what passed/failed/skipped

### File Size Checks

```bash
# Check all files
ls -lh positions.txt training_data.jsonl nnue_weights.pt

# Count lines
echo "Positions: $(wc -l < positions.txt)"
echo "Training data: $(wc -l < training_data.jsonl)"
```

### Quick Tests

```bash
# Test position generation (100 games)
python generate_positions.py test_pos.txt --games 100 --no-filter-captures

# Test Stockfish labeling (10 positions)
head -10 test_pos.txt > test_pos_10.txt
python label_positions.py test_pos_10.txt test_data_10.jsonl stockfish --depth 6

# Test training (on test data)
python train_nnue.py test_data_10.jsonl test_model.pt --epochs 1 --batch-size 8
```

---

## Pipeline Workflow with Debugging

```bash
# 1. Generate positions
python generate_positions.py positions.txt --games 100000 --no-filter-captures
# Should output: Saved positions: ~20000-40000 (depends on filter)

# 2. Label with Stockfish
export STOCKFISH_PATH=$(which stockfish)
python label_positions.py positions.txt training_data.jsonl "$STOCKFISH_PATH" --depth 10
# Should output: Successfully evaluated: > 0

# 3. Train model
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5
# Should output: Training summary with version info

# 4. Export to Scala
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
# Should output: NNUEWeights.scala created

# 5. Compile Scala
cd ../..
./compile
# Should output: BUILD SUCCESSFUL
```

---

## Performance Monitoring

While labeling is running, monitor progress:

```bash
# In another terminal
watch -n 5 'wc -l modules/bot/python/training_data.jsonl'

# Or on macOS
while true; do echo "$(wc -l < modules/bot/python/training_data.jsonl) positions labeled"; sleep 5; done
```

This shows the running count of labeled positions every 5 seconds. The difference between two consecutive counts, divided by 5, gives the number of positions evaluated per second.

---

## Still Stuck?

1. **Read the full output**: Don't skip error messages
2. **Check file sizes**: `ls -lh` shows if files are being created
3. **Run with `--verbose`**: Shows exactly what's failing
4. **Test individual steps**: Don't run the full pipeline, test pieces
5. **Check Stockfish**: `stockfish --version` confirms it works

For more help, see:

- `README_NNUE.md`: Complete pipeline docs
- `TRAINING_GUIDE.md`: Training workflows
- `INCREMENTAL_TRAINING.md`: Versioning & checkpoints
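
As a cross-platform alternative to the shell loops in Performance Monitoring, the labeling rate can be computed directly from the output file. This is a minimal sketch under the assumption that the labeler appends one line per evaluated position; `count_lines`, `rate_per_second`, and `monitor` are hypothetical helper names, not part of the pipeline scripts.

```python
import time


def count_lines(path):
    """Number of lines currently in the file (0 if it does not exist yet)."""
    try:
        with open(path, "rb") as f:
            return sum(1 for _ in f)
    except FileNotFoundError:
        return 0


def rate_per_second(count_before, count_after, elapsed_seconds):
    """Positions labeled per second between two samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (count_after - count_before) / elapsed_seconds


def monitor(path, interval=5.0, samples=3):
    """Print the running total and the measured rate a few times, then return."""
    previous = count_lines(path)
    previous_time = time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        current = count_lines(path)
        now = time.monotonic()
        rate = rate_per_second(previous, current, now - previous_time)
        print(f"{current} positions labeled ({rate:.1f}/s)")
        previous, previous_time = current, now
```

For example, `monitor("modules/bot/python/training_data.jsonl")` prints three samples five seconds apart. A rate that stays at 0.0 while the labeler is running suggests that positions are being skipped or that Stockfish is not responding.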