NowChessSystems/modules/bot/python/DATASETS.md

# Training Dataset Management

The NNUE training pipeline now features versioned dataset management, similar to model versioning. This prevents data loss and allows you to maintain multiple training configurations.

## Directory Structure

```
datasets/
  ds_v1/
    labeled.jsonl       # Training data: {"fen": "...", "eval": 0.5, "eval_raw": 150}
    metadata.json       # Version info and composition
  ds_v2/
    labeled.jsonl
    metadata.json
```
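
The layout above can be inspected with a few lines of Python. This is a minimal sketch, not the pipeline's own code: `list_versions` and `next_dataset_dir` are hypothetical helper names, and numbering the next dataset as "highest existing version plus one" is an assumption.

```python
import re
from pathlib import Path

# Matches directory names such as ds_v1, ds_v12 (naming taken from the layout above).
DS_PATTERN = re.compile(r"^ds_v(\d+)$")


def list_versions(root: Path) -> list[int]:
    """Return the sorted version numbers of all ds_vN directories under root."""
    if not root.is_dir():
        return []
    versions = []
    for child in root.iterdir():
        match = DS_PATTERN.match(child.name)
        if child.is_dir() and match:
            versions.append(int(match.group(1)))
    return sorted(versions)


def next_dataset_dir(root: Path) -> Path:
    """Path for a new dataset: one past the highest existing version (assumed rule)."""
    versions = list_versions(root)
    return root / f"ds_v{(versions[-1] if versions else 0) + 1}"
```

For example, `next_dataset_dir(Path("datasets"))` returns `datasets/ds_v3` when `ds_v1` and `ds_v2` exist.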

## Metadata Schema

Each dataset has a `metadata.json` file tracking its composition:

```json
{
  "version": 1,
  "created": "2026-04-13T15:30:45.123456",
  "total_positions": 1000000,
  "stockfish_depth": 12,
  "sources": [
    {
      "type": "generated",
      "count": 500000,
      "params": {
        "num_positions": 500000,
        "min_move": 1,
        "max_move": 50
      }
    },
    {
      "type": "tactical",
      "count": 300000,
      "max_puzzles": 300000
    },
    {
      "type": "file_import",
      "count": 200000,
      "path": "/path/to/original_file.txt"
    }
  ]
}
```
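
A metadata file of this shape can be read and sanity-checked as sketched below. The function names are hypothetical, and the consistency check assumes that `total_positions` equals the sum of the per-source `count` values, as it does in the example above; the pipeline itself may not guarantee this.

```python
import json
from pathlib import Path


def load_metadata(dataset_dir: Path) -> dict:
    """Read metadata.json from a dataset directory such as datasets/ds_v1."""
    with open(dataset_dir / "metadata.json", encoding="utf-8") as fh:
        return json.load(fh)


def source_summary(metadata: dict) -> dict[str, int]:
    """Sum the position counts per source type (generated, tactical, file_import)."""
    summary: dict[str, int] = {}
    for source in metadata.get("sources", []):
        summary[source["type"]] = summary.get(source["type"], 0) + source["count"]
    return summary


def counts_are_consistent(metadata: dict) -> bool:
    """Assumed invariant: total_positions is the sum of all source counts."""
    return sum(source_summary(metadata).values()) == metadata["total_positions"]
```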

## TUI Workflow

### Main Menu
```
1 - Manage Training Data
2 - Train Model
3 - Export Model
4 - Exit
```

### Training Data Management Submenu
```
1 - Create new dataset
2 - Extend existing dataset
3 - View all datasets
4 - Delete dataset
5 - Back
```

## Creating a Dataset

Use the "Create new dataset" option to add data from one or more sources:

1. **Generate random positions** — Play random games and sample positions
   - Number of positions
   - Move range (min/max move number to sample from)
   - Number of worker threads

2. **Import from file** — Load positions from a FEN file
   - File must contain one FEN string per line
   - Duplicates are automatically removed

3. **Extract tactical puzzles** — Download and extract Lichess puzzle database
   - Maximum number of puzzles to include
   - Automatically filters for tactical themes (forks, pins, mates, etc.)

You can combine multiple sources in a single dataset creation session. All positions are:
- Deduplicated (only unique FENs are kept)
- Labeled with Stockfish evaluations
- Saved to `datasets/ds_vN/labeled.jsonl`
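
Combining several sources and keeping only unique FENs can be sketched as follows. This is an illustration, not the pipeline's implementation: `unique_fens` is a hypothetical name, and it treats two positions as duplicates only when their FEN strings are identical after stripping whitespace. The real deduplication may normalize FENs differently.

```python
from typing import Iterable, Iterator


def unique_fens(*sources: Iterable[str]) -> Iterator[str]:
    """Yield each FEN once, in first-seen order, across all given sources."""
    seen: set[str] = set()
    for source in sources:
        for line in source:
            fen = line.strip()
            # Skip blank lines and FENs already produced by an earlier source.
            if fen and fen not in seen:
                seen.add(fen)
                yield fen
```

Because an open text file is an iterable of lines, a one-FEN-per-line file can be passed directly as one of the sources.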

## Extending a Dataset

Use "Extend existing dataset" to add more positions to an existing dataset:

1. Select the dataset version to extend
2. Choose data sources (same options as creation)
3. Confirm labeling parameters
4. New positions are:
   - Labeled with Stockfish
   - Deduplicated against the target dataset
   - Merged into the existing `labeled.jsonl`
5. The dataset's `metadata.json` is updated with a new source entry
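
The merge step can be pictured with the sketch below. It is a simplified illustration under stated assumptions: `extend_dataset` is a hypothetical function, the incoming records are already labeled, duplicates are detected by exact `fen` match, and `total_positions` is recomputed as the number of unique FENs in the file.

```python
import json
from pathlib import Path


def extend_dataset(dataset_dir: Path, new_records: list[dict], source_entry: dict) -> int:
    """Append unseen labeled records to labeled.jsonl and update metadata.json.

    Returns the number of records actually added.
    """
    labeled_path = dataset_dir / "labeled.jsonl"
    metadata_path = dataset_dir / "metadata.json"

    # Collect the FENs already stored in this dataset.
    with open(labeled_path, encoding="utf-8") as fh:
        seen = {json.loads(line)["fen"] for line in fh if line.strip()}

    # Append only new positions (assumes the existing file ends with a newline).
    added = 0
    with open(labeled_path, "a", encoding="utf-8") as fh:
        for record in new_records:
            if record["fen"] in seen:
                continue
            seen.add(record["fen"])
            fh.write(json.dumps(record) + "\n")
            added += 1

    # Record the new source and the updated total in the metadata.
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    metadata["sources"].append({**source_entry, "count": added})
    metadata["total_positions"] = len(seen)
    metadata_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return added
```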

## Training with a Dataset

When you start training (Standard or Burst mode), you'll be prompted to select a dataset version. The TUI will display all available datasets with:
- Version number
- Total number of positions
- Source types (generated, tactical, imported)
- Stockfish depth used
- Creation date

## Legacy Data Migration

If you have existing labeled data in `data/training_data.jsonl` from before this update:

1. Open the "Manage Training Data" menu
2. Choose "Create new dataset"
3. Select "Import from file"
4. Point to `data/training_data.jsonl`
5. Complete the dataset creation

Alternatively, you can manually copy the file to `datasets/ds_v1/labeled.jsonl` and create a `metadata.json` file.
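
For the manual route, a `metadata.json` matching the schema above could be generated along these lines. This is a sketch under assumptions: `build_legacy_metadata` is a hypothetical helper, the Stockfish depth of the legacy data must be supplied by you because it cannot be read from the file, and which fields the TUI strictly requires is not specified here.

```python
import json
from datetime import datetime
from pathlib import Path


def build_legacy_metadata(labeled_path: Path, version: int,
                          stockfish_depth: int, original_path: str) -> dict:
    """Build a metadata dict for a labeled.jsonl file that was copied in by hand."""
    # Count non-empty lines; each one is expected to hold a single labeled position.
    with open(labeled_path, encoding="utf-8") as fh:
        count = sum(1 for line in fh if line.strip())
    return {
        "version": version,
        "created": datetime.now().isoformat(),
        "total_positions": count,
        "stockfish_depth": stockfish_depth,
        "sources": [
            {"type": "file_import", "count": count, "path": original_path},
        ],
    }


def write_metadata(dataset_dir: Path, metadata: dict) -> None:
    """Write the metadata next to labeled.jsonl in the dataset directory."""
    (dataset_dir / "metadata.json").write_text(
        json.dumps(metadata, indent=2), encoding="utf-8"
    )
```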

## Viewing Dataset Details

Use "View all datasets" to see a table of all datasets with:
- Version number
- Position count
- Source composition
- Stockfish depth
- Creation date

## Deleting a Dataset

Use "Delete dataset" to remove a dataset and free up disk space. **This action cannot be undone.**

⚠️ The system does not check whether a dataset was used to train an existing model checkpoint. Before deleting, make sure you no longer need the data to retrain or continue training a model.

## Technical Details

### Deduplication Strategy

When extending a dataset, positions are deduplicated **within that dataset only**. This allows different datasets to contain overlapping positions if desired.

When creating a new dataset from multiple sources, all sources are combined and deduplicated before labeling.

### Labeled Position Format

Each line in `labeled.jsonl` is a JSON object:
```json
{
  "fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
  "eval": 0.0,
  "eval_raw": 0
}
```

- `fen`: The position in Forsyth-Edwards Notation
- `eval`: Evaluation normalized to the [-1, 1] range with tanh
- `eval_raw`: Raw Stockfish evaluation in centipawns
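
The relationship between the two evaluation fields can be sketched as below. The scaling constant is an assumption: this document only states that tanh is used, not what the centipawn value is divided by, so `ASSUMED_SCALE_CP` is a placeholder and not the pipeline's actual setting.

```python
import math

# Hypothetical scale: centipawn value at which the normalized eval reaches tanh(1).
ASSUMED_SCALE_CP = 400.0


def normalize_eval(eval_raw: float, scale_cp: float = ASSUMED_SCALE_CP) -> float:
    """Squash a centipawn evaluation into the [-1, 1] range with tanh."""
    return math.tanh(eval_raw / scale_cp)


def denormalize_eval(eval_norm: float, scale_cp: float = ASSUMED_SCALE_CP) -> float:
    """Invert normalize_eval; only defined for values strictly between -1 and 1."""
    return math.atanh(eval_norm) * scale_cp
```

A balanced position (`eval_raw` of 0) maps to 0.0, and very large advantages saturate towards 1 or -1, which keeps extreme scores from dominating the training signal.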

### Storage Location

Datasets are stored in the `datasets/` directory relative to the script location. The old `data/` directory is preserved for backward compatibility but is not actively used by the new system.

## Performance Tips

- **Smaller datasets train faster**: start with 100k-500k positions
- **Deduplication matters**: use "Extend existing dataset" to grow a dataset without adding redundant positions
- **Stockfish depth**: depth 12-14 balances accuracy and labeling speed
- **Workers**: use 4-8 workers for labeling if your machine supports it; more workers label faster but use more CPU and memory