
# Debugging the NNUE Pipeline

## Common Issues & Solutions

### Issue 1: Empty training_data.jsonl

**Symptom:** After running the pipeline, `training_data.jsonl` is empty or doesn't exist.

**Diagnosis:** Run labeling with verbose output:

```bash
python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --verbose
```

**Check these in order:**

#### 1. Is `positions.txt` empty?

```bash
wc -l positions.txt
```

If 0 lines: the position generator is failing. See Issue 2.

If >0 lines: positions exist. Continue with step 2.

#### 2. Is Stockfish installed and working?

```bash
# Linux/macOS
which stockfish
stockfish --version

# Windows
where stockfish
C:\path\to\stockfish.exe --version
```

If not found: Install from https://stockfishchess.org
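
Finding the binary does not prove it speaks UCI. The sketch below sends the standard UCI handshake (`uci`, `isready`, `quit`) and checks for the `uciok` and `readyok` replies. The helper name `engine_responds` and the 10-second timeout are illustrative assumptions, not part of the pipeline scripts.

```python
import subprocess


def engine_responds(command, timeout=10.0):
    """Return True if the process started by `command` answers the UCI handshake.

    `command` is an argument list such as ["/usr/games/stockfish"].
    Hypothetical helper for manual checks; the labeler may probe differently.
    """
    try:
        result = subprocess.run(
            command,
            input="uci\nisready\nquit\n",  # standard UCI commands
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary missing, not executable, or hanging without a reply.
        return False
    return "uciok" in result.stdout and "readyok" in result.stdout
```

Calling `engine_responds(["/path/to/stockfish"])` should return `True` for a working install. A `False` result points to a wrong path, a missing executable bit, or a binary that is not a UCI engine.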

#### 3. Is the Stockfish path correct?

```bash
# Set the path explicitly and confirm the value
export STOCKFISH_PATH=/your/path/to/stockfish
echo "$STOCKFISH_PATH"

python label_positions.py positions.txt training_data.jsonl "$STOCKFISH_PATH" --verbose
```

The script will print at the top: `Using Stockfish: /path/to/stockfish`
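
To see which executable a given setting resolves to, a lookup along these lines can help. The resolution order (explicit argument, then `STOCKFISH_PATH`, then `stockfish` on `PATH`) is an assumption made for illustration, so the labeler itself may resolve the path differently. `resolve_stockfish` is a hypothetical name.

```python
import os
import shutil


def resolve_stockfish(cli_path=None, env=None):
    """Return the first candidate that is an executable file, or None.

    Assumed order: explicit argument, STOCKFISH_PATH, then "stockfish" on PATH.
    This mirrors the commands in this guide, not the labeler's actual code.
    """
    env = os.environ if env is None else env
    candidates = [cli_path, env.get("STOCKFISH_PATH"), "stockfish"]
    for candidate in candidates:
        if not candidate:
            continue
        # shutil.which accepts both bare names and full paths.
        found = shutil.which(candidate)
        if found:
            return found
    return None
```

If this returns `None`, none of the candidates is executable, which matches the situation where the labeler cannot start the engine.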

#### 4. Check the error summary

After running with `--verbose`, look for the summary:

```
============================================================
LABELING SUMMARY
============================================================
Successfully evaluated: 0        ← This should be > 0
Skipped (duplicates):   0
Skipped (invalid):      0
Errors:                 0
```

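If "Successfully evaluated" is 0, no positions were written to `training_data.jsonl`. The other three counters show why: positions were skipped as duplicates, rejected as invalid, or failed with errors (see Issue 3). If every counter is 0, the labeler processed no positions at all, so recheck steps 1 to 3.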
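
When only some positions are missing from the output, it can help to list exactly which ones. This sketch assumes one FEN per line in `positions.txt`, a `"fen"` key in each JSONL record (as in the sample output under Issue 4), and that the labeler writes each FEN back unchanged. `unlabeled_fens` is a hypothetical helper.

```python
import json
import os


def unlabeled_fens(positions_path, labeled_path):
    """Return FENs listed in the positions file but absent from the JSONL output."""
    labeled = set()
    if os.path.exists(labeled_path):  # a missing output file counts as empty
        with open(labeled_path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                try:
                    labeled.add(json.loads(line)["fen"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    continue  # malformed records are ignored here

    missing = []
    seen = set()
    with open(positions_path, encoding="utf-8") as handle:
        for line in handle:
            fen = line.strip()
            # Report each missing FEN once, in file order.
            if fen and fen not in labeled and fen not in seen:
                seen.add(fen)
                missing.append(fen)
    return missing
```

Feeding a few of the returned FENs back through the labeler with `--verbose` narrows the problem to specific positions.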

---

### Issue 2: Empty positions.txt

**Symptom:** `positions.txt` is empty after running `generate_positions.py`.

**Diagnosis:** Check the generation summary:

```bash
python generate_positions.py positions.txt --games 10000
```

Expected output:

```
============================================================
POSITION GENERATION SUMMARY
============================================================
Total games:               10000
Saved positions:           1234        ← This should be > 0
Filtered (check):          2345
Filtered (captures):       4321
Filtered (game over):      1100
Total filtered:            7766
Acceptance rate:           12.34%
============================================================
```

**If Saved positions = 0:**

The filters are too strict! Try with `--no-filter-captures`:

```bash
python generate_positions.py positions.txt --games 10000 --no-filter-captures
```

This allows positions with available captures, which should greatly increase the output.

---

### Issue 3: Stockfish Errors During Labeling

**Symptom:** Labeling runs but shows errors like:
```
Error evaluating position: rnbqkbnr/pppppppp...
  SomeError: [error details]
```

**Solutions:**

1. **Check Stockfish is responsive:**
   ```bash
   # Test Stockfish directly: a working engine replies with "uciok" and "readyok"
   printf 'uci\nisready\nquit\n' | stockfish
   ```

2. **Try with lower depth** (faster, fewer timeouts):
   ```bash
   python label_positions.py positions.txt training_data.jsonl /path/to/stockfish --depth 8
   ```

3. **Use explicit path** instead of relying on PATH:
   ```bash
   python label_positions.py positions.txt training_data.jsonl /usr/games/stockfish
   ```

4. **Check if FENs in positions.txt are valid:**
   ```bash
   head -5 positions.txt
   ```

   Output should look like:
   ```
   rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1
   rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2
   ```
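
Eyeballing the first five lines will not catch a malformed FEN deeper in the file. The check from step 4 can be sketched as a structural validator for standard-chess FENs: six fields, eight ranks of eight squares, and well-formed side, castling, en passant, and counter fields. It does not verify that a position is legal or reachable, and `fen_problem` is a hypothetical name, not a function from the pipeline.

```python
def fen_problem(fen):
    """Return a description of the first structural problem in `fen`, or None."""
    fields = fen.split()
    if len(fields) != 6:
        return f"expected 6 fields, found {len(fields)}"
    board, side, castling, en_passant, halfmove, fullmove = fields

    ranks = board.split("/")
    if len(ranks) != 8:
        return f"expected 8 ranks, found {len(ranks)}"
    for index, rank in enumerate(ranks):
        squares = 0
        for char in rank:
            if char in "12345678":
                squares += int(char)  # a digit is a run of empty squares
            elif char in "pnbrqkPNBRQK":
                squares += 1
            else:
                return f"unexpected character {char!r} in rank {8 - index}"
        if squares != 8:
            return f"rank {8 - index} covers {squares} squares"

    if side not in ("w", "b"):
        return f"side to move is {side!r}"
    if castling != "-" and not (set(castling) <= set("KQkq")):
        return f"castling field is {castling!r}"
    if en_passant != "-" and not (
        len(en_passant) == 2 and en_passant[0] in "abcdefgh" and en_passant[1] in "36"
    ):
        return f"en passant field is {en_passant!r}"
    if not (halfmove.isdigit() and fullmove.isdigit()):
        return "move counters are not numeric"
    return None
```

Running this over every line of `positions.txt` and printing the line numbers where it returns a message gives a quick list of candidates for the errors seen during labeling.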

---

### Issue 4: Training Fails - No Valid Data

**Symptom:** `train_nnue.py` crashes with:
```
IndexError: list index out of range
```

**Cause:** `training_data.jsonl` is empty or contains invalid JSON.

**Debug:**

```bash
# Check file size
ls -lh training_data.jsonl

# Count valid lines (stops with JSONDecodeError at the first malformed or blank line)
python -c "import json; n = sum(1 for line in open('training_data.jsonl') if isinstance(json.loads(line), dict)); print(f'Valid lines: {n}')"

# Look at first few lines
head -3 training_data.jsonl
```

Expected output:
```
{"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "eval": 45}
{"fen": "rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2", "eval": 48}
```

If empty: go back to Issue 1.
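
If the file is not empty but training still fails, a scan that reports every bad line, rather than stopping at the first one, is more useful. This is a minimal sketch that assumes each record is a JSON object with the `fen` and `eval` keys shown in the sample output above. `scan_jsonl` is a hypothetical helper, not part of `train_nnue.py`.

```python
import json


def scan_jsonl(path, required_keys=("fen", "eval")):
    """Return (valid_count, problems); problems holds (line_number, reason) pairs."""
    valid = 0
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                problems.append((number, "blank line"))
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((number, f"invalid JSON: {error.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((number, f"missing keys: {', '.join(missing)}"))
                continue
            valid += 1
    return valid, problems
```

A result with `valid == 0` corresponds to the crash described above. A short `problems` list usually means the file was truncated or hand-edited, and removing those lines may be enough.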

---

## Step-by-Step Verification

Run this to verify each step works:

```bash
#!/usr/bin/env bash
cd modules/bot/python

# Step 1: Generate 1000 positions (quick test)
echo "Testing position generation..."
python generate_positions.py test_positions.txt --games 1000 --no-filter-captures

# Check output
if [ ! -s test_positions.txt ]; then
    echo "ERROR: test_positions.txt is empty"
    exit 1
fi
POSITIONS=$(wc -l < test_positions.txt)
echo "✓ Generated $POSITIONS positions"

# Step 2: Label positions (quick test with 100 positions)
echo "Testing Stockfish labeling..."
export STOCKFISH_PATH=$(which stockfish || which /usr/games/stockfish || echo "stockfish")
if ! command -v "$STOCKFISH_PATH" &> /dev/null; then
    echo "ERROR: Stockfish not found"
    echo "  Install: apt-get install stockfish (Linux) or brew install stockfish (macOS)"
    exit 1
fi

head -100 test_positions.txt > test_positions_100.txt
python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8

# Check output
if [ ! -s test_training_data.jsonl ]; then
    echo "ERROR: test_training_data.jsonl is empty"
    echo "  Re-running with --verbose:"
    python label_positions.py test_positions_100.txt test_training_data.jsonl "$STOCKFISH_PATH" --depth 8 --verbose
    exit 1
fi
EVALS=$(wc -l < test_training_data.jsonl)
echo "✓ Evaluated $EVALS positions"

# Step 3: Test training
echo "Testing training..."
python train_nnue.py test_training_data.jsonl test_weights.pt --epochs 1 --batch-size 32 --no-versioning

if [ ! -f test_weights.pt ]; then
    echo "ERROR: training failed"
    exit 1
fi
echo "✓ Training works"

echo ""
echo "All tests passed! Pipeline is working correctly."
echo "You can now run the full pipeline with:"
echo "  ./run_pipeline.sh"
```

Save as `test_pipeline.sh` and run:

```bash
chmod +x test_pipeline.sh
./test_pipeline.sh
```

---

## Common Error Messages

### "Stockfish not found at stockfish"

```bash
# Set the full path
export STOCKFISH_PATH=/usr/games/stockfish
# Or on Windows:
set STOCKFISH_PATH=C:\stockfish\stockfish.exe
```

### "No such file or directory: positions.txt"

```bash
# Make sure you're in the right directory
cd modules/bot/python

# Or provide full path
python label_positions.py /full/path/to/positions.txt training_data.jsonl stockfish
```

### "JSONDecodeError" in training

```bash
# training_data.jsonl has invalid JSON
# Regenerate it:
rm training_data.jsonl
python label_positions.py positions.txt training_data.jsonl stockfish
```

### "CUDA out of memory"

```bash
# Reduce batch size
python train_nnue.py training_data.jsonl nnue_weights.pt --batch-size 1024
```

---

## Getting More Information

### Verbose Output

All scripts support `--verbose` for detailed debugging:

```bash
python label_positions.py positions.txt training_data.jsonl stockfish --verbose
```

This prints:
- Which Stockfish is being used
- Error details for each failed position
- Summary of what passed/failed/skipped

### File Size Checks

```bash
# Check all files
ls -lh positions.txt training_data.jsonl nnue_weights.pt

# Count lines
echo "Positions: $(wc -l < positions.txt)"
echo "Training data: $(wc -l < training_data.jsonl)"
```

### Quick Tests

```bash
# Test position generation (100 games)
python generate_positions.py test_pos.txt --games 100 --no-filter-captures

# Test Stockfish labeling (10 positions)
head -10 test_pos.txt > test_pos_10.txt
python label_positions.py test_pos_10.txt test_data_10.jsonl stockfish --depth 6

# Test training (on test data)
python train_nnue.py test_data_10.jsonl test_model.pt --epochs 1 --batch-size 8
```

---

## Pipeline Workflow with Debugging

```bash
# 1. Generate positions
python generate_positions.py positions.txt --games 100000 --no-filter-captures
# Should output: Saved positions: ~20000-40000 (depends on filter)

# 2. Label with Stockfish
export STOCKFISH_PATH=$(which stockfish)
python label_positions.py positions.txt training_data.jsonl $STOCKFISH_PATH --depth 10
# Should output: Successfully evaluated: > 0

# 3. Train model
python train_nnue.py training_data.jsonl nnue_weights.pt --epochs 5
# Should output: Training summary with version info

# 4. Export to Scala (pass the versioned weights file named in the training summary, e.g. nnue_weights_v1.pt)
python export_weights.py nnue_weights_v1.pt ../src/main/scala/de/nowchess/bot/bots/nnue/NNUEWeights.scala
# Should output: NNUEWeights.scala created

# 5. Compile Scala
cd ../..
./compile
# Should output: BUILD SUCCESSFUL
```

---

## Performance Monitoring

While labeling is running, monitor progress:

```bash
# In another terminal
watch -n 5 'wc -l modules/bot/python/training_data.jsonl'

# Or on macOS
while true; do echo $(wc -l < modules/bot/python/training_data.jsonl) positions labeled; sleep 5; done
```

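This shows the running total of labeled positions. To estimate positions per second, divide the difference between two consecutive readings by the 5-second interval.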
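
A rate can also be computed directly by counting lines twice and dividing by the elapsed time. The sketch below assumes the labeler appends one line per evaluated position. If the labeler buffers its writes, the count will lag behind the real progress. `count_lines` and `labeling_rate` are hypothetical helpers.

```python
import time


def count_lines(path):
    """Count newline characters in `path`; a missing file counts as 0."""
    try:
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so large files do not have to fit in memory.
            return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 20), b""))
    except FileNotFoundError:
        return 0


def labeling_rate(path, interval=5.0):
    """Return positions per second, measured over one `interval`-second window."""
    before = count_lines(path)
    time.sleep(interval)
    after = count_lines(path)
    return (after - before) / interval
```

For example, `labeling_rate("modules/bot/python/training_data.jsonl")` blocks for five seconds and returns the average rate over that window. A value of `0.0` while the labeler is running suggests it is stalled or still buffering output.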

---

## Still Stuck?

1. **Read the full output** — Don't skip error messages
2. **Check file sizes** — `ls -lh` shows if files are being created
3. **Run with `--verbose`** — Shows exactly what's failing
4. **Test individual steps** — Don't run full pipeline, test pieces
5. **Check Stockfish** — `stockfish --version` confirms it works

For more help, see:
- `README_NNUE.md` — Complete pipeline docs
- `TRAINING_GUIDE.md` — Training workflows
- `INCREMENTAL_TRAINING.md` — Versioning & checkpoints