SparkFiles.get() on the driver returns a driver-local path. When that path was
passed to spark.read.text(), the executor tried to open it on its own
filesystem (a separate pod) and silently read 0 rows.
Fix: download and decompress the Lichess PGN into NOWCHESS_PGN_CACHE_DIR
(default /tmp), which must point to a filesystem shared between the driver and
executor pods. In the k8s deployment this is the spark-analytics-output PVC
mounted at /spark-output, so set
NOWCHESS_PGN_CACHE_DIR=/spark-output/.pgn-cache.
Also caches the decompressed file across runs: the download is skipped if the
file is already present.
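
A minimal sketch of the cache lookup described above, in plain Scala. The
object name PgnCache, the method names, and the fetch callback (a stand-in
for the download + decompress step) are hypothetical; only the environment
variable and its /tmp default come from this commit.

```scala
import java.nio.file.{Files, Path, Paths}

object PgnCache {
  // Resolve the cache directory from the environment; /tmp is the documented default.
  def cacheDir(env: Map[String, String]): Path =
    Paths.get(env.getOrElse("NOWCHESS_PGN_CACHE_DIR", "/tmp"))

  // Return the cached file, invoking `fetch` only when the file is missing.
  // `fetch` is a hypothetical stand-in for the download + decompress step.
  def cachedPgn(dir: Path, fileName: String)(fetch: Path => Unit): Path = {
    Files.createDirectories(dir)
    val target = dir.resolve(fileName)
    if (!Files.exists(target)) fetch(target)
    target
  }
}
```

The returned path is only readable by the executor if the directory sits on
the shared volume; the sketch does not check that.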
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds GameLengthJob, ColorAdvantageJob, EloDistributionJob, TimeControlJob,
DailyActivityJob, RatingMismatchJob, and TerminationStatsJob, bringing the
total number of batch pipelines to 11 (+ 1 streaming).
Extends GameSource with loadExtended() / fromLichessPgnExtended(), which
extract WhiteElo, BlackElo, TimeControl, UTCDate, UTCTime, Termination, and
ECO from PGN headers. The JDBC path returns nulls for the extended columns,
so all existing jobs are unaffected.
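
For illustration, the header extraction can be sketched in plain Scala with a
regex over PGN tag pairs. This is not the GameSource code: the object
PgnHeaders and its methods are hypothetical, escaped quotes inside tag values
are not handled, and a missing header is modelled as None where the real
columns would hold null.

```scala
object PgnHeaders {
  // Matches one PGN tag pair such as: [WhiteElo "1850"]
  private val TagPair = """\[(\w+)\s+"([^"]*)"\]""".r

  // Header names taken from the commit message above.
  val ExtendedTags: Seq[String] =
    Seq("WhiteElo", "BlackElo", "TimeControl", "UTCDate", "UTCTime", "Termination", "ECO")

  // Parse all tag pairs of one game into a name -> value map.
  def parse(pgn: String): Map[String, String] =
    TagPair.findAllMatchIn(pgn).map(m => m.group(1) -> m.group(2)).toMap

  // Extended columns in a fixed order; None plays the role of a SQL null.
  def extended(pgn: String): Seq[Option[String]] = {
    val tags = parse(pgn)
    ExtendedTags.map(name => tags.get(name))
  }
}
```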
PlayerStatsJob gains a CSV output alongside the existing Parquet write so
the analytics webview can display player statistics without pyarrow.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
apache/spark:3.5.4-scala2.13-java17-ubuntu does not exist on Docker Hub.
The oldest available scala2.13 image is 4.0.3. Bump the compileOnly deps and
the Dockerfile base image to match.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each batch job now writes its results to a Postgres table in addition to
the existing Parquet/CSV output:
- OpeningBookJob → analytics_opening_stats
- PlayerStatsJob → analytics_player_stats
- PlayerClusteringJob → analytics_player_clusters + analytics_cluster_archetypes
- PlayerGraphJob → analytics_player_graph
MLlib Vector columns are excluded from the JDBC write by reusing the
already-selected scalar DataFrame in PlayerClusteringJob.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Pin jar output to analytics.jar (no version suffix) so Dockerfile COPY is stable
- Add Dockerfile based on apache/spark:3.5.4-scala2.13-java17-ubuntu
- Add versions.env (0.1.0) matching GitOps overlay image tag
- Add analytics-image.yml CI workflow following native-image.yml conventions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New standalone modules:analytics submodule with two Spark jobs:
- OpeningBookJob: reads game_records.pgn, extracts first N plies using
pure Catalyst SQL expressions (no UDFs), aggregates win/draw/loss rates
per opening sequence, writes Parquet + CSV top-1000 summary.
- PlayerStatsJob: unions each game into a player-centric view, aggregates
total_games/wins/losses/draws/avg_move_count/win_rate per player_id,
writes Parquet.
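
The player-centric union in PlayerStatsJob can be illustrated with plain
Scala collections instead of the DataFrame API the job actually uses. The
Game and PlayerStats case classes are hypothetical simplifications, and two
points are assumptions rather than facts from this commit: win_rate is taken
as wins / total_games, and any result other than "1-0" or "0-1" is counted
as a draw.

```scala
// Hypothetical, simplified game row; the real schema is not shown in the commit.
final case class Game(whiteId: String, blackId: String, result: String, moveCount: Int)

final case class PlayerStats(totalGames: Int, wins: Int, losses: Int, draws: Int,
                             avgMoveCount: Double, winRate: Double)

object PlayerView {
  // One game becomes two rows, one from each player's point of view.
  // Outcome encoding: 1 = win, 0 = draw, -1 = loss.
  def explode(g: Game): Seq[(String, Int, Int)] = {
    val whiteOutcome = g.result match {
      case "1-0" => 1
      case "0-1" => -1
      case _     => 0 // assumption: everything else is treated as a draw
    }
    Seq((g.whiteId, whiteOutcome, g.moveCount), (g.blackId, -whiteOutcome, g.moveCount))
  }

  // Group the per-player rows and compute the aggregates named in the commit.
  def aggregate(games: Seq[Game]): Map[String, PlayerStats] =
    games.flatMap(explode).groupBy(_._1).map { case (id, rows) =>
      val total = rows.size
      val wins = rows.count(_._2 == 1)
      val losses = rows.count(_._2 == -1)
      val draws = total - wins - losses
      val avgMoves = rows.map(_._3).sum.toDouble / total
      id -> PlayerStats(total, wins, losses, draws, avgMoves, wins.toDouble / total)
    }
}
```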
Module uses Scala 3 calling spark-sql_2.13 via JVM binary compatibility
(DataFrame API only; no spark.implicits._ / typed Datasets). Spark is
compileOnly; the fat jar bundles only scala3-library + postgresql driver.
Submit via spark-submit; see build.gradle.kts header for invocation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>