# Training Dataset Management
The NNUE training pipeline now features versioned dataset management, similar to model versioning. This prevents data loss and allows you to maintain multiple training configurations.
## Directory Structure

```
datasets/
  ds_v1/
    labeled.jsonl   # Training data: {"fen": "...", "eval": 0.5, "eval_raw": 150}
    metadata.json   # Version info and composition
  ds_v2/
    labeled.jsonl
    metadata.json
```
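Dataset directories follow the `ds_vN` naming pattern. A minimal sketch of how the next free version number could be derived (the helper name is illustrative, not the project's actual API):

```python
import re
from pathlib import Path

def next_dataset_version(root: Path) -> int:
    """Scan root for ds_vN directories and return N+1 for the highest N found."""
    versions = []
    for entry in root.glob("ds_v*"):
        m = re.fullmatch(r"ds_v(\d+)", entry.name)
        if m and entry.is_dir():
            versions.append(int(m.group(1)))
    return max(versions, default=0) + 1
```

With `ds_v1/` and `ds_v2/` present, this returns 3; on an empty `datasets/` directory it returns 1.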
## Metadata Schema

Each dataset has a `metadata.json` file tracking its composition:
```json
{
  "version": 1,
  "created": "2026-04-13T15:30:45.123456",
  "total_positions": 1000000,
  "stockfish_depth": 12,
  "sources": [
    {
      "type": "generated",
      "count": 500000,
      "params": {
        "num_positions": 500000,
        "min_move": 1,
        "max_move": 50
      }
    },
    {
      "type": "tactical",
      "count": 300000,
      "max_puzzles": 300000
    },
    {
      "type": "file_import",
      "count": 200000,
      "path": "/path/to/original_file.txt"
    }
  ]
}
```
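As a quick sanity check (a sketch, not part of the shipped tooling), the per-source counts should sum to `total_positions`:

```python
import json

def check_metadata(metadata: dict) -> bool:
    """Return True if the source counts add up to total_positions."""
    return sum(src["count"] for src in metadata["sources"]) == metadata["total_positions"]

meta = json.loads("""{
  "version": 1,
  "total_positions": 1000000,
  "stockfish_depth": 12,
  "sources": [
    {"type": "generated", "count": 500000},
    {"type": "tactical", "count": 300000},
    {"type": "file_import", "count": 200000}
  ]
}""")
print(check_metadata(meta))  # True
```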
## TUI Workflow

### Main Menu

```
1 - Manage Training Data
2 - Train Model
3 - Export Model
4 - Exit
```

### Training Data Management Submenu

```
1 - Create new dataset
2 - Extend existing dataset
3 - View all datasets
4 - Delete dataset
5 - Back
```
## Creating a Dataset

Use the "Create new dataset" option to add data from one or more sources:

- **Generate random positions** — Play random games and sample positions
  - Number of positions
  - Move range (min/max move number to sample from)
  - Number of worker threads
- **Import from file** — Load positions from a FEN file
  - File must contain one FEN string per line
  - Duplicates are automatically removed
- **Extract tactical puzzles** — Download and extract the Lichess puzzle database
  - Maximum number of puzzles to include
  - Automatically filters for tactical themes (forks, pins, mates, etc.)
You can combine multiple sources in a single dataset creation session. All positions are:

- Deduplicated (only unique FENs are kept)
- Labeled with Stockfish evaluations
- Saved to `datasets/ds_vN/labeled.jsonl`
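A sketch of that dedupe-label-save step, assuming a caller-supplied `label` function in place of the real Stockfish call and a tanh scale of 400 centipawns (both are assumptions, not confirmed by the pipeline):

```python
import json
import math
from pathlib import Path
from typing import Callable, Iterable

def build_dataset(fens: Iterable[str], label: Callable[[str], int], out: Path) -> int:
    """Deduplicate FENs, label each unique one, and write them to labeled.jsonl.

    `label` stands in for the real Stockfish call and returns a raw
    centipawn score; the tanh scale of 400 is an assumed constant.
    Returns the number of positions written.
    """
    seen: set[str] = set()
    written = 0
    with out.open("w") as f:
        for fen in fens:
            if fen in seen:
                continue  # only unique FENs are kept
            seen.add(fen)
            raw = label(fen)
            record = {"fen": fen, "eval": math.tanh(raw / 400), "eval_raw": raw}
            f.write(json.dumps(record) + "\n")
            written += 1
    return written
```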
## Extending a Dataset

Use "Extend existing dataset" to add more positions to an existing dataset:

1. Select the dataset version to extend
2. Choose data sources (same options as creation)
3. Confirm labeling parameters

New positions are:

- Labeled with Stockfish
- Deduplicated against the target dataset (preventing duplicates)
- Merged into the existing `labeled.jsonl`

Metadata is then updated with the new source entry.
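Deduplicating against the target dataset can be sketched as loading the existing FENs first and filtering the incoming batch (the function name is illustrative, not the project's actual API):

```python
import json
from pathlib import Path

def new_unique_positions(dataset_file: Path, candidates: list[str]) -> list[str]:
    """Return candidate FENs not already present in a dataset's labeled.jsonl."""
    seen = set()
    if dataset_file.exists():
        with dataset_file.open() as f:
            seen = {json.loads(line)["fen"] for line in f if line.strip()}
    unique = []
    for fen in candidates:
        if fen not in seen:
            seen.add(fen)  # also dedupe within the incoming batch itself
            unique.append(fen)
    return unique
```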
## Training with a Dataset
When you start training (Standard or Burst mode), you'll be prompted to select a dataset version. The TUI will display all available datasets with:
- Version number
- Total number of positions
- Source types (generated, tactical, imported)
- Stockfish depth used
- Creation date
## Legacy Data Migration

If you have existing labeled data in `data/training_data.jsonl` from before this update:

1. Open the "Manage Training Data" menu
2. Choose "Create new dataset"
3. Select "Import from file"
4. Point to `data/training_data.jsonl`
5. Complete the dataset creation

Alternatively, you can manually copy the file to `datasets/ds_v1/labeled.jsonl` and create a matching `metadata.json` file.
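The manual alternative can be scripted; a sketch that mirrors the metadata schema above (counting positions by line and the assumed depth of 12 are both guesses that should be adjusted to match the legacy data):

```python
import json
import shutil
from datetime import datetime
from pathlib import Path

def migrate_legacy(legacy: Path, datasets_root: Path) -> Path:
    """Copy legacy labeled data into datasets/ds_v1/ and write a metadata.json."""
    target = datasets_root / "ds_v1"
    target.mkdir(parents=True, exist_ok=True)
    shutil.copy(legacy, target / "labeled.jsonl")
    # Assumes one labeled position per non-empty line, as in the JSONL format.
    count = sum(1 for line in legacy.open() if line.strip())
    metadata = {
        "version": 1,
        "created": datetime.now().isoformat(),
        "total_positions": count,
        "stockfish_depth": 12,  # assumed default; set to the depth actually used
        "sources": [{"type": "file_import", "count": count, "path": str(legacy)}],
    }
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return target
```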
## Viewing Dataset Details
Use "View all datasets" to see a table of all datasets with:
- Version number
- Position count
- Source composition
- Stockfish depth
- Creation date
## Deleting a Dataset
Use "Delete dataset" to remove a dataset and free up disk space. This action cannot be undone.
⚠️ The system does not prevent deleting datasets used by model checkpoints. Plan accordingly.
## Technical Details

### Deduplication Strategy
When extending a dataset, positions are deduplicated within that dataset only. This allows different datasets to contain overlapping positions if desired.
When creating a new dataset from multiple sources, all sources are combined and deduplicated before labeling.
### Labeled Position Format

Each line in `labeled.jsonl` is a JSON object:

```json
{
  "fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
  "eval": 0.0,
  "eval_raw": 0
}
```
- `fen`: The position in Forsyth-Edwards Notation
- `eval`: Normalized evaluation ([-1, 1] range using tanh)
- `eval_raw`: Raw Stockfish evaluation in centipawns
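The normalized `eval` is derived from `eval_raw` with tanh; a sketch assuming a scale of 400 centipawns (the constant the pipeline actually uses is not stated here and may well differ):

```python
import math

def normalize_eval(eval_raw: int, scale: float = 400.0) -> float:
    """Squash a raw centipawn score into the [-1, 1] range with tanh.

    The scale of 400 is an assumed constant, not confirmed by the pipeline.
    """
    return math.tanh(eval_raw / scale)

print(normalize_eval(0))               # 0.0
print(round(normalize_eval(150), 3))   # 0.358 (with this assumed scale)
```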
### Storage Location
Datasets are stored in the datasets/ directory relative to the script location. The old data/ directory is preserved for backward compatibility but is not actively used by the new system.
## Performance Tips
- Smaller datasets train faster — Start with 100k-500k positions
- Deduplication matters — Use the extend functionality to build up your dataset without redundant data
- Stockfish depth — Depth 12-14 balances accuracy and labeling speed
- Workers — Use 4-8 workers for labeling if your machine supports it; more workers speed labeling up at the cost of extra CPU and memory
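The worker setting maps naturally onto a pool; a minimal sketch with a placeholder in place of the actual Stockfish labeling call (if labeling shells out to an engine process, threads can suffice, though the real pipeline may use processes instead):

```python
from concurrent.futures import ThreadPoolExecutor

def label_position(fen: str) -> int:
    """Placeholder for a per-position engine evaluation; returns a dummy score."""
    return len(fen)  # stand-in, NOT a real Stockfish evaluation

def label_batch(fens: list[str], workers: int = 4) -> list[int]:
    """Label positions concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(label_position, fens))
```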