# Training Dataset Management
The NNUE training pipeline now features versioned dataset management, similar to model versioning. This prevents data loss and allows you to maintain multiple training configurations.
## Directory Structure

```
datasets/
  ds_v1/
    labeled.jsonl   # Training data: {"fen": "...", "eval": 0.5, "eval_raw": 150}
    metadata.json   # Version info and composition
  ds_v2/
    labeled.jsonl
    metadata.json
```
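Dataset directories follow the `ds_vN` naming pattern. A minimal sketch of how the next free version number could be derived (the helper name is illustrative, not the project's actual API):

```python
import re
from pathlib import Path

def next_dataset_version(root: Path) -> int:
    """Scan root for ds_vN directories and return N+1 for the highest N found."""
    versions = []
    for entry in root.glob("ds_v*"):
        m = re.fullmatch(r"ds_v(\d+)", entry.name)
        if m and entry.is_dir():
            versions.append(int(m.group(1)))
    return max(versions, default=0) + 1
```

With `ds_v1/` and `ds_v2/` present, this returns 3; on an empty `datasets/` directory it returns 1.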
## Metadata Schema

Each dataset has a `metadata.json` file tracking its composition:
```json
{
  "version": 1,
  "created": "2026-04-13T15:30:45.123456",
  "total_positions": 1000000,
  "stockfish_depth": 12,
  "sources": [
    {
      "type": "generated",
      "count": 500000,
      "params": {
        "num_positions": 500000,
        "min_move": 1,
        "max_move": 50
      }
    },
    {
      "type": "tactical",
      "count": 300000,
      "max_puzzles": 300000
    },
    {
      "type": "file_import",
      "count": 200000,
      "path": "/path/to/original_file.txt"
    }
  ]
}
```
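As a quick sanity check (a sketch, not part of the shipped tooling), the per-source counts should sum to `total_positions`:

```python
import json

def check_metadata(metadata: dict) -> bool:
    """Return True if the source counts add up to total_positions."""
    return sum(src["count"] for src in metadata["sources"]) == metadata["total_positions"]

meta = json.loads("""{
  "version": 1,
  "total_positions": 1000000,
  "stockfish_depth": 12,
  "sources": [
    {"type": "generated", "count": 500000},
    {"type": "tactical", "count": 300000},
    {"type": "file_import", "count": 200000}
  ]
}""")
print(check_metadata(meta))  # True
```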
## TUI Workflow

### Main Menu

```
1 - Manage Training Data
2 - Train Model
3 - Export Model
4 - Exit
```

### Training Data Management Submenu

```
1 - Create new dataset
2 - Extend existing dataset
3 - View all datasets
4 - Delete dataset
5 - Back
```
## Creating a Dataset

Use the "Create new dataset" option to add data from one or more sources:

- **Generate random positions** — Play random games and sample positions
  - Number of positions
  - Move range (min/max move number to sample from)
  - Number of worker threads
- **Import from file** — Load positions from a FEN file
  - File must contain one FEN string per line
  - Duplicates are automatically removed
- **Extract tactical puzzles** — Download and extract the Lichess puzzle database
  - Maximum number of puzzles to include
  - Automatically filters for tactical themes (forks, pins, mates, etc.)
You can combine multiple sources in a single dataset creation session. All positions are:

- Deduplicated (only unique FENs are kept)
- Labeled with Stockfish evaluations
- Saved to `datasets/ds_vN/labeled.jsonl`
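A sketch of that dedupe-label-save step, assuming a caller-supplied `label` function in place of the real Stockfish call and a tanh scale of 400 centipawns (both are assumptions, not confirmed by the pipeline):

```python
import json
import math
from pathlib import Path
from typing import Callable, Iterable

def build_dataset(fens: Iterable[str], label: Callable[[str], int], out: Path) -> int:
    """Deduplicate FENs, label each unique one, and write them to labeled.jsonl.

    `label` stands in for the real Stockfish call and returns a raw
    centipawn score; the tanh scale of 400 is an assumed constant.
    Returns the number of positions written.
    """
    seen: set[str] = set()
    written = 0
    with out.open("w") as f:
        for fen in fens:
            if fen in seen:
                continue  # only unique FENs are kept
            seen.add(fen)
            raw = label(fen)
            record = {"fen": fen, "eval": math.tanh(raw / 400), "eval_raw": raw}
            f.write(json.dumps(record) + "\n")
            written += 1
    return written
```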
## Extending a Dataset

Use "Extend existing dataset" to add more positions to an existing dataset:

1. Select the dataset version to extend
2. Choose data sources (same options as creation)
3. Confirm labeling parameters

New positions are:

- Labeled with Stockfish
- Deduplicated against the target dataset (preventing duplicates)
- Merged into the existing `labeled.jsonl`

Metadata is then updated with the new source entry.
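Deduplicating against the target dataset can be sketched as loading the existing FENs first and filtering the incoming batch (the function name is illustrative, not the project's actual API):

```python
import json
from pathlib import Path

def new_unique_positions(dataset_file: Path, candidates: list[str]) -> list[str]:
    """Return candidate FENs not already present in a dataset's labeled.jsonl."""
    seen = set()
    if dataset_file.exists():
        with dataset_file.open() as f:
            seen = {json.loads(line)["fen"] for line in f if line.strip()}
    unique = []
    for fen in candidates:
        if fen not in seen:
            seen.add(fen)  # also dedupe within the incoming batch itself
            unique.append(fen)
    return unique
```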
## Training with a Dataset
When you start training (Standard or Burst mode), you'll be prompted to select a dataset version. The TUI will display all available datasets with:
- Version number
- Total number of positions
- Source types (generated, tactical, imported)
- Stockfish depth used
- Creation date
## Legacy Data Migration

If you have existing labeled data in `data/training_data.jsonl` from before this update:

1. Open the "Manage Training Data" menu
2. Choose "Create new dataset"
3. Select "Import from file"
4. Point to `data/training_data.jsonl`
5. Complete the dataset creation

Alternatively, you can manually copy the file to `datasets/ds_v1/labeled.jsonl` and create a matching `metadata.json` file.
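The manual alternative can be scripted; a sketch that mirrors the metadata schema above (counting positions by line and the assumed depth of 12 are both guesses that should be adjusted to match the legacy data):

```python
import json
import shutil
from datetime import datetime
from pathlib import Path

def migrate_legacy(legacy: Path, datasets_root: Path) -> Path:
    """Copy legacy labeled data into datasets/ds_v1/ and write a metadata.json."""
    target = datasets_root / "ds_v1"
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy(legacy, target / "labeled.jsonl")
    # Assumes one labeled position per non-empty line, as in the JSONL format.
    count = sum(1 for line in legacy.open() if line.strip())
    metadata = {
        "version": 1,
        "created": datetime.now().isoformat(),
        "total_positions": count,
        "stockfish_depth": 12,  # assumed default; set to the depth actually used
        "sources": [{"type": "file_import", "count": count, "path": str(legacy)}],
    }
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return target
```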
## Viewing Dataset Details
Use "View all datasets" to see a table of all datasets with:
- Version number
- Position count
- Source composition
- Stockfish depth
- Creation date
## Deleting a Dataset
Use "Delete dataset" to remove a dataset and free up disk space. This action cannot be undone.
⚠️ The system does not prevent deleting datasets used by model checkpoints. Plan accordingly.
## Technical Details

### Deduplication Strategy
When extending a dataset, positions are deduplicated within that dataset only. This allows different datasets to contain overlapping positions if desired.
When creating a new dataset from multiple sources, all sources are combined and deduplicated before labeling.
### Labeled Position Format

Each line in `labeled.jsonl` is a JSON object:

```json
{
  "fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
  "eval": 0.0,
  "eval_raw": 0
}
```
- `fen`: The position in Forsyth-Edwards Notation
- `eval`: Normalized evaluation ([-1, 1] range using tanh)
- `eval_raw`: Raw Stockfish evaluation in centipawns
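The normalized `eval` is derived from `eval_raw` with tanh; a sketch assuming a scale of 400 centipawns (the constant the pipeline actually uses is not stated here and may well differ):

```python
import math

def normalize_eval(eval_raw: int, scale: float = 400.0) -> float:
    """Squash a raw centipawn score into the [-1, 1] range with tanh.

    The scale of 400 is an assumed constant, not confirmed by the pipeline.
    """
    return math.tanh(eval_raw / scale)

print(normalize_eval(0))               # 0.0
print(round(normalize_eval(150), 3))   # 0.358 (with this assumed scale)
```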
### Storage Location
Datasets are stored in the datasets/ directory relative to the script location. The old data/ directory is preserved for backward compatibility but is not actively used by the new system.
## Performance Tips
- Smaller datasets train faster — Start with 100k-500k positions
- Deduplication matters — Use the extend functionality to build up your dataset without redundant data
- Stockfish depth — Depth 12-14 balances accuracy and labeling speed
- Workers — Use 4-8 workers for labeling if your machine supports it; more workers speed labeling up at the cost of extra CPU and memory
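The worker setting maps naturally onto a pool; a minimal sketch with a placeholder in place of the actual Stockfish labeling call (if labeling shells out to an engine process, threads can suffice, though the real pipeline may use processes instead):

```python
from concurrent.futures import ThreadPoolExecutor

def label_position(fen: str) -> int:
    """Placeholder for a per-position engine evaluation; returns a dummy score."""
    return len(fen)  # stand-in, NOT a real Stockfish evaluation

def label_batch(fens: list[str], workers: int = 4) -> list[int]:
    """Label positions concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(label_position, fens))
```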