# Training Dataset Management

The NNUE training pipeline now features versioned dataset management, similar to model versioning. This prevents data loss and allows you to maintain multiple training configurations.

## Directory Structure

```
datasets/
  ds_v1/
    labeled.jsonl    # Training data: {"fen": "...", "eval": 0.5, "eval_raw": 150}
    metadata.json    # Version info and composition
  ds_v2/
    labeled.jsonl
    metadata.json
```

## Metadata Schema

Each dataset has a `metadata.json` file tracking its composition:

```json
{
  "version": 1,
  "created": "2026-04-13T15:30:45.123456",
  "total_positions": 1000000,
  "stockfish_depth": 12,
  "sources": [
    {
      "type": "generated",
      "count": 500000,
      "params": {
        "num_positions": 500000,
        "min_move": 1,
        "max_move": 50
      }
    },
    {
      "type": "tactical",
      "count": 300000,
      "max_puzzles": 300000
    },
    {
      "type": "file_import",
      "count": 200000,
      "path": "/path/to/original_file.txt"
    }
  ]
}
```

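As a quick sanity check on this schema, the sketch below verifies that `total_positions` equals the sum of the per-source `count` values, as it does in the example above. The helper name `check_metadata` is hypothetical and not part of the pipeline, and the rule that the two numbers must always agree is an assumption drawn from that one example.

```python
import json

def check_metadata(meta: dict) -> bool:
    """Return True if total_positions equals the sum of the source counts."""
    # Assumption: every source entry carries a "count" field, as in the example.
    source_total = sum(source["count"] for source in meta["sources"])
    return meta["total_positions"] == source_total

# The example metadata from above, trimmed to the fields the check reads.
example = json.loads("""
{
  "version": 1,
  "total_positions": 1000000,
  "sources": [
    {"type": "generated", "count": 500000},
    {"type": "tactical", "count": 300000},
    {"type": "file_import", "count": 200000}
  ]
}
""")

print(check_metadata(example))  # True: 500000 + 300000 + 200000 == 1000000
```
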
## TUI Workflow
### Main Menu
```
1 - Manage Training Data
2 - Train Model
3 - Export Model
4 - Exit
```

### Training Data Management Submenu
```
1 - Create new dataset
2 - Extend existing dataset
3 - View all datasets
4 - Delete dataset
5 - Back
```

## Creating a Dataset

Use the "Create new dataset" option to add data from one or more sources:

1. **Generate random positions**: play random games and sample positions
   - Number of positions
   - Move range (min/max move number to sample from)
   - Number of worker threads

2. **Import from file**: load positions from a FEN file
   - File must contain one FEN string per line
   - Duplicates are automatically removed

3. **Extract tactical puzzles**: download the Lichess puzzle database and extract positions from it
   - Maximum number of puzzles to include
   - Automatically filters for tactical themes (forks, pins, mates, etc.)

You can combine multiple sources in a single dataset creation session. All positions are:
- Deduplicated (only unique FENs are kept)
- Labeled with Stockfish evaluations
- Saved to `datasets/ds_vN/labeled.jsonl`

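The combine-then-deduplicate step can be sketched as follows. This is a minimal illustration, not the pipeline's code: `load_fens` and `combine_sources` are hypothetical names, and treating two positions as duplicates only when their FEN strings match exactly is an assumption.

```python
def load_fens(lines):
    """Yield non-empty, whitespace-stripped FEN strings from an iterable of lines."""
    for line in lines:
        fen = line.strip()
        if fen:
            yield fen

def combine_sources(*sources):
    """Merge several FEN iterables, keeping the first occurrence of each FEN."""
    # dict.fromkeys keeps insertion order, so the result is stable.
    return list(dict.fromkeys(fen for source in sources for fen in source))

# Two toy sources that share one position.
generated = ["8/8/8/8/8/8/8/K6k w - - 0 1", "8/8/8/8/8/8/8/K5k1 w - - 0 1"]
imported = load_fens([
    "8/8/8/8/8/8/8/K6k w - - 0 1\n",   # duplicate of the first generated FEN
    "\n",                              # blank line, skipped
    "8/8/8/8/8/8/8/K4k2 w - - 0 1\n",
])

unique = combine_sources(generated, imported)
print(len(unique))  # 3
```

Deduplicating before labeling, as described above, means Stockfish is never run twice on the same position.
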
## Extending a Dataset

Use "Extend existing dataset" to add more positions to an existing dataset:

1. Select the dataset version to extend
2. Choose data sources (same options as creation)
3. Confirm labeling parameters
4. New positions are:
   - Labeled with Stockfish
   - Deduplicated against the target dataset
   - Merged into the existing `labeled.jsonl`
5. The dataset's `metadata.json` is updated with the new source entry

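A minimal sketch of the deduplicate-and-merge step, assuming the new records are already labeled and that the existing `labeled.jsonl` ends with a newline. The function name `extend_labeled` is hypothetical, and matching on the exact `fen` string is an assumption about how the pipeline identifies duplicates.

```python
import json
import tempfile
from pathlib import Path

def extend_labeled(labeled_path: Path, new_records: list[dict]) -> int:
    """Append records whose FEN is not yet in labeled_path; return how many were added."""
    seen = set()
    if labeled_path.exists():
        with labeled_path.open() as f:
            for line in f:
                if line.strip():
                    seen.add(json.loads(line)["fen"])
    added = 0
    with labeled_path.open("a") as f:
        for record in new_records:
            if record["fen"] not in seen:
                seen.add(record["fen"])  # also guards against duplicates within the batch
                f.write(json.dumps(record) + "\n")
                added += 1
    return added

# Demo in a temporary directory so no real dataset is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "labeled.jsonl"
    start = {"fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
             "eval": 0.0, "eval_raw": 0}
    other = {"fen": "8/8/8/8/8/8/8/K6k w - - 0 1", "eval": 0.0, "eval_raw": 0}
    path.write_text(json.dumps(start) + "\n")
    added = extend_labeled(path, [start, other])
    total_lines = len(path.read_text().splitlines())
    print(added, total_lines)  # 1 2
```
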
## Training with a Dataset

When you start training (Standard or Burst mode), you'll be prompted to select a dataset version. The TUI will display all available datasets with:
- Version number
- Total number of positions
- Source types (generated, tactical, imported)
- Stockfish depth used
- Creation date

## Legacy Data Migration

If you have existing labeled data in `data/training_data.jsonl` from before this update:

1. Open the "Manage Training Data" menu
2. Choose "Create new dataset"
3. Select "Import from file"
4. Point to `data/training_data.jsonl`
5. Complete the dataset creation

Alternatively, you can manually copy the file to `datasets/ds_v1/labeled.jsonl` and create a matching `metadata.json` file next to it (see Metadata Schema).

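The manual route could look roughly like the sketch below. The helper `migrate_legacy` is hypothetical, the field names follow the Metadata Schema section, and recording the legacy file as a single `file_import` source is an assumption. The Stockfish depth used for the old labels is not recorded anywhere in the file, so you have to supply it yourself.

```python
import json
import shutil
import tempfile
from datetime import datetime
from pathlib import Path

def migrate_legacy(legacy_file: Path, datasets_dir: Path, stockfish_depth: int) -> Path:
    """Copy a legacy JSONL file into datasets_dir/ds_v1/ and write metadata.json beside it."""
    target = datasets_dir / "ds_v1"
    target.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an existing ds_v1
    labeled = target / "labeled.jsonl"
    shutil.copy(legacy_file, labeled)
    with labeled.open() as f:
        count = sum(1 for line in f if line.strip())
    metadata = {
        "version": 1,
        "created": datetime.now().isoformat(),
        "total_positions": count,
        "stockfish_depth": stockfish_depth,  # must be supplied by the caller
        "sources": [{"type": "file_import", "count": count, "path": str(legacy_file)}],
    }
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return target

# Demo against a throwaway directory with a one-line legacy file.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "training_data.jsonl"
    legacy.write_text('{"fen": "8/8/8/8/8/8/8/K6k w - - 0 1", "eval": 0.0, "eval_raw": 0}\n')
    target = migrate_legacy(legacy, Path(tmp) / "datasets", stockfish_depth=12)
    meta = json.loads((target / "metadata.json").read_text())
    print(meta["total_positions"], meta["sources"][0]["type"])  # 1 file_import
```

On a real checkout the call would be something like `migrate_legacy(Path("data/training_data.jsonl"), Path("datasets"), stockfish_depth=12)`, with the depth replaced by whatever was used to label the legacy data.
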
## Viewing Dataset Details

Use "View all datasets" to see a table of all datasets with:
- Version number
- Position count
- Source composition
- Stockfish depth
- Creation date

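Outside the TUI, the same overview can be assembled by reading each `metadata.json`. This is a sketch under the directory layout and schema described earlier; `list_datasets` is a hypothetical helper, not a function exposed by the pipeline.

```python
import json
import tempfile
from pathlib import Path

def list_datasets(datasets_dir: Path) -> list[dict]:
    """Return one summary row per ds_vN/metadata.json, sorted by version number."""
    rows = []
    for meta_path in datasets_dir.glob("ds_v*/metadata.json"):
        meta = json.loads(meta_path.read_text())
        rows.append({
            "version": meta["version"],
            "positions": meta["total_positions"],
            "sources": sorted({source["type"] for source in meta["sources"]}),
            "depth": meta["stockfish_depth"],
            "created": meta["created"],
        })
    return sorted(rows, key=lambda row: row["version"])

# Demo: build two tiny datasets in a temporary directory and list them.
with tempfile.TemporaryDirectory() as tmp:
    for version, count in [(2, 250), (1, 100)]:
        ds = Path(tmp) / f"ds_v{version}"
        ds.mkdir()
        meta = {"version": version, "created": "2026-04-13T15:30:45.123456",
                "total_positions": count, "stockfish_depth": 12,
                "sources": [{"type": "generated", "count": count}]}
        (ds / "metadata.json").write_text(json.dumps(meta))
    rows = list_datasets(Path(tmp))
    for row in rows:
        print(row["version"], row["positions"], ",".join(row["sources"]), row["depth"])
```

Sorting on the integer `version` field rather than the directory name keeps `ds_v10` after `ds_v2`.
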
## Deleting a Dataset

Use "Delete dataset" to remove a dataset and free up disk space. **This action cannot be undone.**

⚠️ The system does not prevent deleting datasets used by model checkpoints. Before deleting, check that no checkpoint you still need was trained on that dataset.

## Technical Details
### Deduplication Strategy

When extending a dataset, positions are deduplicated **within that dataset only**. This allows different datasets to contain overlapping positions if desired.

When creating a new dataset from multiple sources, all sources are combined and deduplicated before labeling.

### Labeled Position Format

Each line in `labeled.jsonl` is a JSON object:
```json
{
  "fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
  "eval": 0.0,
  "eval_raw": 0
}
```

- `fen`: The position in Forsyth-Edwards Notation
- `eval`: Normalized evaluation ([-1, 1] range using tanh)
- `eval_raw`: Raw Stockfish evaluation in centipawns

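The relationship between `eval_raw` and `eval` can be sketched as a tanh squash. The document does not state the scaling constant, so the `scale` value of 400 centipawns below is purely a placeholder assumption, and `normalize_eval` is a hypothetical name.

```python
import math

def normalize_eval(eval_raw: int, scale: float = 400.0) -> float:
    """Squash a centipawn score with tanh; `scale` is an assumed constant, not the pipeline's."""
    return math.tanh(eval_raw / scale)

print(normalize_eval(0))  # 0.0, matching the starting-position example above
```

Whatever constant the pipeline actually uses, a smaller `scale` saturates toward -1 or 1 sooner, while a larger one keeps typical middlegame scores in the near-linear part of the curve.
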
### Storage Location

Datasets are stored in the `datasets/` directory relative to the script location. The old `data/` directory is preserved for backward compatibility but is not actively used by the new system.

## Performance Tips

- **Smaller datasets train faster**: start with 100k-500k positions
- **Deduplication matters**: use the extend functionality to build up your dataset without redundant data
- **Stockfish depth**: depth 12-14 balances accuracy and labeling speed
- **Workers**: use 4-8 workers for labeling if your machine supports it; more workers label faster but use more CPU and memory