# Training Dataset Management
The NNUE training pipeline now features versioned dataset management, similar to model versioning. This prevents data loss and allows you to maintain multiple training configurations.
## Directory Structure
```
datasets/
  ds_v1/
    labeled.jsonl    # Training data: {"fen": "...", "eval": 0.5, "eval_raw": 150}
    metadata.json    # Version info and composition
  ds_v2/
    labeled.jsonl
    metadata.json
```
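The `ds_vN` naming above can be handled with a small helper. The following is a minimal sketch, not the pipeline's actual code: the function names are hypothetical, and it assumes a new dataset takes the highest existing version number plus one (the document does not say how numbers are reused after a deletion).

```python
import re
from pathlib import Path

# Matches directory names such as ds_v1, ds_v12 (layout taken from the tree above).
_VERSION_DIR = re.compile(r"^ds_v(\d+)$")


def existing_versions(root: Path) -> list[int]:
    """Return the sorted version numbers of all ds_vN directories under root."""
    if not root.is_dir():
        return []
    versions = []
    for entry in root.iterdir():
        match = _VERSION_DIR.match(entry.name)
        if entry.is_dir() and match:
            versions.append(int(match.group(1)))
    return sorted(versions)


def next_version(root: Path) -> int:
    """Assumption: the next dataset gets max(existing) + 1, or 1 if none exist."""
    versions = existing_versions(root)
    return versions[-1] + 1 if versions else 1
```

Directories that do not match the `ds_vN` pattern are ignored, so unrelated folders inside `datasets/` do not affect numbering.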
## Metadata Schema
Each dataset has a `metadata.json` file tracking its composition:
```json
{
  "version": 1,
  "created": "2026-04-13T15:30:45.123456",
  "total_positions": 1000000,
  "stockfish_depth": 12,
  "sources": [
    {
      "type": "generated",
      "count": 500000,
      "params": {
        "num_positions": 500000,
        "min_move": 1,
        "max_move": 50
      }
    },
    {
      "type": "tactical",
      "count": 300000,
      "max_puzzles": 300000
    },
    {
      "type": "file_import",
      "count": 200000,
      "path": "/path/to/original_file.txt"
    }
  ]
}
```
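A quick consistency check on a `metadata.json` can catch hand-edited files that drifted from the schema. This is a sketch under assumptions: `check_metadata` is a hypothetical helper, and the rule that the per-source `count` values sum to `total_positions` is inferred from the example above (500000 + 300000 + 200000 = 1000000), not confirmed by the pipeline.

```python
import json

# Top-level fields shown in the schema example above.
REQUIRED_FIELDS = ("version", "created", "total_positions", "stockfish_depth", "sources")


def check_metadata(meta: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in meta]

    # Assumption: source counts add up to total_positions, as in the example.
    counted = sum(src.get("count", 0) for src in meta.get("sources", []))
    if "total_positions" in meta and counted != meta["total_positions"]:
        problems.append(
            f"source counts sum to {counted}, expected {meta['total_positions']}"
        )
    return problems


if __name__ == "__main__":
    with open("datasets/ds_v1/metadata.json", encoding="utf-8") as handle:
        for problem in check_metadata(json.load(handle)):
            print(problem)
```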
## TUI Workflow
### Main Menu
```
1 - Manage Training Data
2 - Train Model
3 - Export Model
4 - Exit
```
### Training Data Management Submenu
```
1 - Create new dataset
2 - Extend existing dataset
3 - View all datasets
4 - Delete dataset
5 - Back
```
## Creating a Dataset
Use the "Create new dataset" option to add data from one or more sources:
1. **Generate random positions**: Play random games and sample positions
   - Number of positions
   - Move range (min/max move number to sample from)
   - Number of worker threads
2. **Import from file**: Load positions from a FEN file
   - File must contain one FEN string per line
   - Duplicates are automatically removed
3. **Extract tactical puzzles**: Download the Lichess puzzle database and extract positions from it
   - Maximum number of puzzles to include
   - Automatically filters for tactical themes (forks, pins, mates, etc.)

You can combine multiple sources in a single dataset creation session. All positions are:
- Deduplicated (only unique FENs are kept)
- Labeled with Stockfish evaluations
- Saved to `datasets/ds_vN/labeled.jsonl`
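The combine-then-deduplicate step can be sketched as follows. `merge_unique` is a hypothetical name, and the sketch assumes duplicates are detected by exact string comparison of the FEN after trimming whitespace; the real pipeline may normalize FENs differently.

```python
from typing import Iterable


def merge_unique(*sources: Iterable[str]) -> list[str]:
    """Combine FEN sources, keeping the first occurrence of each position.

    Blank lines are skipped and the original order is preserved.
    """
    seen: set[str] = set()
    unique: list[str] = []
    for source in sources:
        for line in source:
            fen = line.strip()
            if fen and fen not in seen:
                seen.add(fen)
                unique.append(fen)
    return unique
```

Deduplicating before labeling, as described above, means Stockfish is never asked to evaluate the same position twice within one creation session.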
## Extending a Dataset
Use "Extend existing dataset" to add more positions to an existing dataset:
1. Select the dataset version to extend
2. Choose data sources (same options as creation)
3. Confirm labeling parameters
4. New positions are:
   - Labeled with Stockfish
   - Deduplicated against the positions already in the target dataset
   - Merged into the existing `labeled.jsonl`
5. The dataset's `metadata.json` is updated with a new source entry
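The merge step can be illustrated with a short sketch that appends only unseen positions to `labeled.jsonl`. The helper names are hypothetical, and the sketch assumes the existing file ends with a newline and that each record carries a `fen` key as in the labeled position format.

```python
import json
from pathlib import Path


def load_known_fens(labeled_path: Path) -> set[str]:
    """Collect the FEN of every record already stored in labeled.jsonl."""
    known: set[str] = set()
    if labeled_path.exists():
        with labeled_path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    known.add(json.loads(line)["fen"])
    return known


def append_new_records(labeled_path: Path, records: list[dict]) -> int:
    """Append records whose FEN is not yet present; return how many were added."""
    known = load_known_fens(labeled_path)
    added = 0
    with labeled_path.open("a", encoding="utf-8") as handle:
        for record in records:
            if record["fen"] in known:
                continue  # already in this dataset, or repeated within the batch
            handle.write(json.dumps(record) + "\n")
            known.add(record["fen"])
            added += 1
    return added
```

The returned count is what a new source entry in `metadata.json` would plausibly record, since skipped duplicates do not grow the dataset.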
## Training with a Dataset
When you start training (Standard or Burst mode), you'll be prompted to select a dataset version. The TUI will display all available datasets with:
- Version number
- Total number of positions
- Source types (generated, tactical, imported)
- Stockfish depth used
- Creation date
## Legacy Data Migration
If you have existing labeled data in `data/training_data.jsonl` from before this update:
1. Open the "Manage Training Data" menu
2. Choose "Create new dataset"
3. Select "Import from file"
4. Point to `data/training_data.jsonl`
5. Complete the dataset creation
Alternatively, you can manually copy the file to `datasets/ds_v1/labeled.jsonl` and create a `metadata.json` file.
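The manual route could look roughly like the sketch below. `migrate_legacy` is a hypothetical helper; the metadata fields mirror the schema example, the source is recorded as `file_import` by assumption, and `stockfish_depth` must be supplied by you because the legacy file does not record it.

```python
import json
import shutil
from datetime import datetime
from pathlib import Path


def migrate_legacy(legacy_file: Path, datasets_root: Path, stockfish_depth: int) -> Path:
    """Copy a legacy labeled file into datasets/ds_v1 and write a metadata.json."""
    target_dir = datasets_root / "ds_v1"
    target_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an existing ds_v1

    labeled = target_dir / "labeled.jsonl"
    shutil.copyfile(legacy_file, labeled)

    # Count non-empty lines; each one is assumed to be a labeled position.
    with labeled.open(encoding="utf-8") as handle:
        count = sum(1 for line in handle if line.strip())

    metadata = {
        "version": 1,
        "created": datetime.now().isoformat(),
        "total_positions": count,
        "stockfish_depth": stockfish_depth,  # supplied by the caller, not detected
        "sources": [{"type": "file_import", "count": count, "path": str(legacy_file)}],
    }
    (target_dir / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return target_dir
```

For example, `migrate_legacy(Path("data/training_data.jsonl"), Path("datasets"), stockfish_depth=12)` would create `datasets/ds_v1/`, provided depth 12 is what the legacy data was labeled with.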
## Viewing Dataset Details
Use "View all datasets" to see a table of all datasets with:
- Version number
- Position count
- Source composition
- Stockfish depth
- Creation date
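The same overview can be reproduced outside the TUI by reading each `metadata.json` directly. This is a minimal sketch with a hypothetical `summarize` helper; it relies only on the fields shown in the metadata schema and is not the TUI's own implementation.

```python
import json
from pathlib import Path


def summarize(root: Path) -> list[tuple]:
    """Return one (version, positions, source types, depth, date) row per dataset."""
    rows = []
    for meta_path in root.glob("ds_v*/metadata.json"):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        source_types = ", ".join(sorted({src["type"] for src in meta["sources"]}))
        rows.append((
            meta["version"],
            meta["total_positions"],
            source_types,
            meta["stockfish_depth"],
            meta["created"][:10],  # date part of the ISO timestamp
        ))
    return sorted(rows, key=lambda row: row[0])  # order by version number


if __name__ == "__main__":
    for row in summarize(Path("datasets")):
        print(*row, sep="\t")
```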
## Deleting a Dataset
Use "Delete dataset" to remove a dataset and free up disk space. **This action cannot be undone.**
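⚠️ The system does not prevent deleting datasets that model checkpoints were trained on. Before deleting, check that you no longer need to retrain or reproduce any model built from that dataset.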
## Technical Details
### Deduplication Strategy
When extending a dataset, positions are deduplicated **within that dataset only**. This allows different datasets to contain overlapping positions if desired.
When creating a new dataset from multiple sources, all sources are combined and deduplicated before labeling.
### Labeled Position Format
Each line in `labeled.jsonl` is a JSON object:
```json
{
  "fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
  "eval": 0.0,
  "eval_raw": 0
}
```
- `fen`: The position in Forsyth-Edwards Notation
- `eval`: Normalized evaluation ([-1, 1] range using tanh)
- `eval_raw`: Raw Stockfish evaluation in centipawns
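The relationship between `eval_raw` and `eval` can be sketched as a tanh squashing of the centipawn score. The scaling constant below is a placeholder assumption, since this document does not state the value the pipeline uses, and both function names are hypothetical.

```python
import json
import math

# Hypothetical scaling constant in centipawns; NOT taken from the pipeline.
SCALE_CP = 400.0


def normalize_eval(eval_raw: int, scale: float = SCALE_CP) -> float:
    """Squash a centipawn score into the [-1, 1] range with tanh."""
    return math.tanh(eval_raw / scale)


def make_record(fen: str, eval_raw: int, scale: float = SCALE_CP) -> dict:
    """Build one labeled.jsonl record with the three documented fields."""
    return {"fen": fen, "eval": normalize_eval(eval_raw, scale), "eval_raw": eval_raw}


if __name__ == "__main__":
    start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
    print(json.dumps(make_record(start, 0)))  # one JSON object per line
```

Because tanh is odd and monotonic, the sign and ordering of evaluations are preserved while large advantages saturate toward ±1.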
### Storage Location
Datasets are stored in the `datasets/` directory relative to the script location. The old `data/` directory is preserved for backward compatibility but is not actively used by the new system.
## Performance Tips
- **Smaller datasets train faster**: Start with 100k-500k positions
- **Deduplication matters**: Use the extend functionality to grow a dataset without adding redundant positions
- **Stockfish depth**: Depth 12-14 balances accuracy and labeling speed
- **Workers**: Use 4-8 workers for labeling if your machine supports it; more workers label faster but use more CPU and memory