VS Code Extension · v0.9.2

Catch Spark performance
issues before production

CatalystOps analyzes your PySpark and Databricks code inline as you type — 30+ anti-pattern detectors, dry-run plan analysis on your cluster, and actionable fixes. No context switching.

pipeline.py
1 from pyspark.sql import functions as F
2
3 events = spark.read.parquet("s3://bucket/events/")
4 daily = events.groupBy("date").count() ← full scan when materialized
5 hourly = events.collect() ← OOM risk on large datasets
6
7 result = daily.join(F.broadcast(small_df), "date")
CODE_COUNT_001 · Warning
count() > 0 triggers a full Spark job just to check emptiness. Use df.isEmpty() instead — it short-circuits on the first partition.
Quick fix
if not df.isEmpty(): ...
30+
Anti-pattern rules
0
Clusters needed for local analysis
2
Execution modes — cluster & serverless
Free
MIT licensed
Features

Everything in one place

From static code analysis to live Databricks plan inspection — without leaving the editor.

Local Static Analysis

Detects 30+ PySpark and Databricks anti-patterns inline as you type. No cluster required. Catches collect(), UDFs, cross joins, unsafe writes, SQL injection, schema drift, and more.
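For intuition, here is a minimal sketch of what inline static analysis means in practice: scan source lines for known anti-pattern shapes and report line-numbered findings. The rules and messages below are illustrative stand-ins, not CatalystOps' actual rule engine.

```python
import re

# Illustrative rule table — CatalystOps' real detectors are richer than
# these regexes; this only sketches the idea of inline linting.
RULES = [
    (re.compile(r"\.collect\(\)"), "collect() pulls all data to the driver — OOM risk"),
    (re.compile(r"\.crossJoin\("), "crossJoin() produces a Cartesian product"),
    (re.compile(r"\.count\(\)\s*>\s*0"), "count() > 0 runs a full job — prefer isEmpty()"),
]

def lint(source: str) -> list[tuple[int, str]]:
    """Return (line_number, message) pairs for matched anti-patterns."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, message in RULES:
            if pattern.search(line):
                findings.append((lineno, message))
    return findings

snippet = "rows = df.collect()\nif df.count() > 0:\n    pass\n"
for lineno, message in lint(snippet):
    print(f"line {lineno}: {message}")
```

Because rules like these need only the source text, they run locally with zero cluster access — which is why the extension can flag issues the moment a `.py` file opens.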

Dry-Run Plan Analysis

Submits a neutralized version of your script to Databricks (cluster or Serverless) and returns the physical Catalyst plan — with sort-merge join detection, broadcast thresholds, shuffle analysis, and cost estimation.
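To make "neutralized" concrete: the idea is to strip side effects so Databricks only runs the Catalyst planner. CatalystOps' actual rewriter is not shown here; this toy sketch (assuming single-line `.write...` chains) swaps each write action for a plan-only `explain("formatted")` call, which is the standard PySpark way to print a physical plan.

```python
import re

def neutralize(script: str) -> str:
    # Toy rewrite: replace each single-line .write... chain with a
    # plan-only explain() call, so no table or file is ever written.
    return re.sub(r"\.write\b.*$", '.explain("formatted")', script, flags=re.MULTILINE)

print(neutralize('result.write.mode("overwrite").parquet("s3://out/")'))
# → result.explain("formatted")
```

The planner still resolves joins, broadcast thresholds, and shuffles for the rewritten statement, so the returned physical plan reflects what the real write would have executed.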

Job Run Analysis

Analyze any past Databricks job run from the Jobs sidebar — no re-execution needed. CatalystOps reads the Spark event log from DBFS, extracts physical plans, and opens an interactive DAG showing operator trees, filter conditions, and issue badges.
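Spark event logs are newline-delimited JSON, one event object per line. A minimal sketch of pulling physical plans out of one might look like this — the event and field names below match common Spark versions, but exact names can vary, and CatalystOps' own parser is not shown here.

```python
import json

# SQL execution start events carry the formatted physical plan text.
SQL_START = "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart"

def extract_plans(event_log: str) -> list[str]:
    """Pull physical plan text out of a Spark event log (JSON lines)."""
    plans = []
    for line in event_log.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("Event") == SQL_START:
            plans.append(event.get("physicalPlanDescription", ""))
    return plans

# Synthetic one-line event log for illustration.
sample = json.dumps({
    "Event": SQL_START,
    "physicalPlanDescription": "== Physical Plan ==\nSortMergeJoin ...",
})
print(extract_plans(sample)[0].splitlines()[0])
# → == Physical Plan ==
```

Because the log already contains the plan, a past run can be analyzed without re-executing the job — only a read from DBFS is needed.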

One-Click SSH to Clusters

The Clusters panel lists every interactive cluster in your workspace. Click SSH on any cluster — CatalystOps starts it if stopped, runs setup automatically, and opens VS Code Remote SSH directly on the driver. No terminal commands needed.

Billing Dashboard

Tracks DBU and dollar spend per period directly from system.billing.usage with a 1-hour cache. After each serverless run, optionally fetches actual DBU consumption.

MCP Server

Exposes a Streamable HTTP MCP server auto-discovered by VS Code 1.99+. Lets Claude and other AI tools analyze your PySpark code, fetch billing summaries, and run dry runs through natural language.

How it works

From install to insight in minutes

1

Install & open a Python file

Install from the VS Code Marketplace. The moment you open a .py file, local analysis kicks in — no configuration needed. All 30+ rules immediately flag any PySpark anti-patterns.

2

Connect to Databricks (optional)

Add your workspace URL and personal access token via CatalystOps: Configure Databricks Connection. Pick cluster or Serverless execution mode. CatalystOps reads your ~/.databrickscfg automatically if it exists.
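If you would rather pre-create the profile, a standard ~/.databrickscfg looks like the fragment below (placeholder values — substitute your own workspace URL and token):

```ini
[DEFAULT]
host  = https://my-workspace.cloud.databricks.com
token = dapiXXXXXXXXXXXXXXXX
```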

3

Run a dry-run analysis

Press ⌘⇧K to submit the current file. CatalystOps neutralizes side-effects, executes the Catalyst planner on your cluster, and returns the physical plan with cost annotations, join strategies, and actionable fixes — all in the sidebar.

MCP Integration

Talk to your Spark pipeline with Claude

CatalystOps ships a built-in MCP server auto-discovered by VS Code 1.99+. Claude and other AI clients can call CatalystOps tools directly — analyze code, fetch billing data, run dry-runs, and read plan results through natural language.

  • analyze_pyspark — run local analysis on any code snippet
  • run_dry_run — trigger a Databricks dry-run from chat
  • get_billing_summary — ask about DBU spend in plain English
  • get_plan_analysis — read last dry-run plan results from Claude
  • list_clusters / list_job_runs / get_job_run_plan — inspect workspace clusters and historical job plans
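For context on what a tool call looks like on the wire: MCP requests are JSON-RPC 2.0, and `tools/call` with a tool name plus arguments is the standard shape. The argument key below is an illustrative assumption, not CatalystOps documentation.

```python
import json

# Sketch of an MCP client request invoking a CatalystOps tool.
# JSON-RPC 2.0 envelope and "tools/call" come from the MCP spec;
# the "code" argument key is assumed for illustration.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "analyze_pyspark",
        "arguments": {"code": "rows = df.collect()"},  # key name assumed
    },
}
print(json.dumps(request, indent=2))
```

In practice you never build this payload by hand — VS Code 1.99+ discovers the server and Claude issues these calls for you — but the shape explains how "natural language" requests reach the extension's tools.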
Claude chat
You
What issues are in my active file?
Claude
↳ calling get_active_file_issues…
Found 3 issues in pipeline.py:

Line 12 — collect() pulls all data to the driver. OOM risk on large datasets.
Line 18 — Global orderBy shuffles all data to one partition.
Line 31 — Streaming query has no .trigger() — continuous micro-batches.
You
Fix the streaming issue and run a dry run
Claude
↳ calling run_dry_run…
Dry run complete. The plan shows a Sort-Merge Join on line 8 — one side is small enough to broadcast. Want me to add a broadcast hint?

Start optimizing today

Free, open-source, and available for any Databricks or PySpark project.

Also available on Open VSX for Cursor, Theia, and other editors.