VS Code Extension · v0.9.2

Catch Spark performance
issues before production

CatalystOps analyzes your PySpark and Databricks code inline as you type — 30+ anti-pattern detectors, dry-run plan analysis on your cluster, and actionable fixes. No context switching.

pipeline.py
1 from pyspark.sql import functions as F
2
3 events = spark.read.parquet("s3://bucket/events/")
4 daily = events.groupBy("date").count() ← full scan when materialized
5 hourly = events.collect() ← OOM risk on large datasets
6
7 result = daily.join(F.broadcast(small_df), "date")
CODE_COUNT_001 · Warning
count() > 0 triggers a full Spark job just to check emptiness. Use df.isEmpty() instead — it short-circuits on the first partition.
Quick fix
if not df.isEmpty(): ...
30+
Anti-pattern rules
0
Clusters needed for local analysis
2
Execution modes — cluster & serverless
Free
MIT licensed
Features

Everything in one place

From static code analysis to live Databricks plan inspection — without leaving the editor.

Local Static Analysis

Detects 30+ PySpark and Databricks anti-patterns inline as you type. No cluster required. Catches collect(), UDFs, cross joins, unsafe writes, SQL injection, schema drift, and more.
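For intuition, here is a minimal sketch of what inline static analysis means in practice: scan source lines for known anti-pattern shapes and report line-numbered findings. The rules and messages below are illustrative stand-ins, not CatalystOps' actual rule engine.

```python
import re

# Illustrative rule table — CatalystOps' real detectors are richer than
# these regexes; this only sketches the idea of inline linting.
RULES = [
    (re.compile(r"\.collect\(\)"), "collect() pulls all data to the driver — OOM risk"),
    (re.compile(r"\.crossJoin\("), "crossJoin() produces a Cartesian product"),
    (re.compile(r"\.count\(\)\s*>\s*0"), "count() > 0 runs a full job — prefer isEmpty()"),
]

def lint(source: str) -> list[tuple[int, str]]:
    """Return (line_number, message) pairs for matched anti-patterns."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, message in RULES:
            if pattern.search(line):
                findings.append((lineno, message))
    return findings

snippet = "rows = df.collect()\nif df.count() > 0:\n    pass\n"
for lineno, message in lint(snippet):
    print(f"line {lineno}: {message}")
```

Because rules like these need only the source text, they run locally with zero cluster access — which is why the extension can flag issues the moment a `.py` file opens.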

Dry-Run Plan Analysis

Submits a neutralized version of your script to Databricks (cluster or Serverless) and returns the physical Catalyst plan — with sort-merge join detection, broadcast thresholds, shuffle analysis, and cost estimation.
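To make "neutralized" concrete: the idea is to strip side effects so Databricks only runs the Catalyst planner. CatalystOps' actual rewriter is not shown here; this toy sketch (assuming single-line `.write...` chains) swaps each write action for a plan-only `explain("formatted")` call, which is the standard PySpark way to print a physical plan.

```python
import re

def neutralize(script: str) -> str:
    # Toy rewrite: replace each single-line .write... chain with a
    # plan-only explain() call, so no table or file is ever written.
    return re.sub(r"\.write\b.*$", '.explain("formatted")', script, flags=re.MULTILINE)

print(neutralize('result.write.mode("overwrite").parquet("s3://out/")'))
# → result.explain("formatted")
```

The planner still resolves joins, broadcast thresholds, and shuffles for the rewritten statement, so the returned physical plan reflects what the real write would have executed.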

Job Run Analysis

Analyze any past Databricks job run from the Jobs sidebar — no re-execution needed. CatalystOps reads the Spark event log from DBFS, extracts physical plans, and opens an interactive DAG showing operator trees, filter conditions, and issue badges.
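Spark event logs are newline-delimited JSON, one event object per line. A minimal sketch of pulling physical plans out of one might look like this — the event and field names below match common Spark versions, but exact names can vary, and CatalystOps' own parser is not shown here.

```python
import json

# SQL execution start events carry the formatted physical plan text.
SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

def extract_plans(event_log: str) -> list[str]:
    """Pull physical plan text out of a Spark event log (JSON lines)."""
    plans = []
    for line in event_log.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("Event") == SQL_START:
            plans.append(event.get("physicalPlanDescription", ""))
    return plans

# Synthetic one-line event log for illustration.
sample = json.dumps({
    "Event": SQL_START,
    "physicalPlanDescription": "== Physical Plan ==\nSortMergeJoin ...",
})
print(extract_plans(sample)[0].splitlines()[0])
# → == Physical Plan ==
```

Because the log already contains the plan, a past run can be analyzed without re-executing the job — only a read from DBFS is needed.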

One-Click SSH to Clusters

The Clusters panel lists every interactive cluster in your workspace. Click SSH on any cluster — CatalystOps starts it if stopped, runs setup automatically, and opens VS Code Remote SSH directly on the driver. No terminal commands needed.

Billing Dashboard

Tracks DBU and dollar spend per period directly from system.billing.usage with a 1-hour cache. After each serverless run, optionally fetches actual DBU consumption.

MCP Server

Exposes a Streamable HTTP MCP server auto-discovered by VS Code 1.99+. Lets Claude and other AI tools analyze your PySpark code, fetch billing summaries, and run dry runs through natural language.

How it works

From install to insight in minutes

1

Install & open a Python file

Install from the VS Code Marketplace. The moment you open a .py file, local analysis kicks in — no configuration needed. All 30+ rules immediately flag any PySpark anti-patterns.

2

Connect to Databricks (optional)

Add your workspace URL and personal access token via CatalystOps: Configure Databricks Connection. Pick cluster or Serverless execution mode. CatalystOps reads your ~/.databrickscfg automatically if it exists.
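If you would rather pre-create the profile, a standard ~/.databrickscfg looks like the fragment below (placeholder values — substitute your own workspace URL and token):

```ini
[DEFAULT]
host  = https://my-workspace.cloud.databricks.com
token = dapiXXXXXXXXXXXXXXXX
```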

3

Run a dry-run analysis

Press ⌘⇧K to submit the current file. CatalystOps neutralizes side-effects, executes the Catalyst planner on your cluster, and returns the physical plan with cost annotations, join strategies, and actionable fixes — all in the sidebar.

MCP Integration

Talk to your Spark pipeline with Claude

CatalystOps ships a built-in MCP server auto-discovered by VS Code 1.99+. Claude and other AI clients can call CatalystOps tools directly — analyze code, fetch billing data, run dry-runs, and read plan results through natural language.

  • analyze_pyspark — run local analysis on any code snippet
  • run_dry_run — trigger a Databricks dry-run from chat
  • get_billing_summary — ask about DBU spend in plain English
  • get_plan_analysis — read last dry-run plan results from Claude
  • list_clusters / list_job_runs / get_job_run_plan — inspect workspace clusters and historical job plans
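For context on what a tool call looks like on the wire: MCP requests are JSON-RPC 2.0, and `tools/call` with a tool name plus arguments is the standard shape. The argument key below is an illustrative assumption, not CatalystOps documentation.

```python
import json

# Sketch of an MCP client request invoking a CatalystOps tool.
# JSON-RPC 2.0 envelope and "tools/call" come from the MCP spec;
# the "code" argument key is assumed for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "analyze_pyspark",
        "arguments": {"code": "rows = df.collect()"},  # key name assumed
    },
}
print(json.dumps(request, indent=2))
```

In practice you never build this payload by hand — VS Code 1.99+ discovers the server and Claude issues these calls for you — but the shape explains how "natural language" requests reach the extension's tools.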
Claude chat
You
What issues are in my active file?
Claude
↳ calling get_active_file_issues…
Found 3 issues in pipeline.py:

Line 12 — collect() pulls all data to the driver. OOM risk on large datasets.
Line 18 — Global orderBy shuffles all data to one partition.
Line 31 — Streaming query has no .trigger() — continuous micro-batches.
You
Fix the streaming issue and run a dry run
Claude
↳ calling run_dry_run…
Dry run complete. The plan shows a Sort-Merge Join on line 8 — one side is small enough to broadcast. Want me to add a broadcast hint?

Start optimizing today

Free, open-source, and available for any Databricks or PySpark project.

Also available on Open VSX for Cursor, Theia, and other editors.