Snowflake introduces Snowpark Connect for Apache Spark™ in public preview

Snowflake, the AI Data Cloud company, has announced the public preview of Snowpark Connect for Apache Spark™, a new feature that allows Spark users to run their code directly on Snowflake’s engine. The integration promises significant performance improvements, cost reductions, and notable operational simplification for organizations managing data-intensive workloads.

Built on a decoupled client-server architecture, Snowpark Connect separates user code from the Spark cluster that performs the processing. This architecture, introduced by the Apache Spark™ community in version 3.4 as Spark Connect, is what allows Spark jobs to be powered directly by Snowflake’s engine.

Thanks to this integration, users can run modern Spark code, including the Spark DataFrame API, Spark SQL, and user-defined functions (UDFs), without maintaining separate Spark environments or worrying about dependencies, version compatibility, or upgrades. Snowflake manages the entire process automatically, handling dynamic scaling and performance optimization and removing that operational burden from developers.
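As a rough illustration, the snippet below exercises all three of those surfaces using the standard PySpark Spark Connect API. The endpoint URL and the sample data are placeholders, and how Snowpark Connect actually bootstraps the session may differ from this generic setup:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Standard Spark Connect session creation; the "sc://" endpoint is a
# placeholder -- Snowpark Connect's own session setup may differ.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# 1. DataFrame API
orders = spark.createDataFrame(
    [(1, "widget", 19.99), (2, "gadget", 5.49), (3, "widget", 7.25)],
    ["order_id", "product", "amount"],
)
orders.groupBy("product").agg(F.sum("amount").alias("revenue")).show()

# 2. Spark SQL over a temporary view
orders.createOrReplaceTempView("orders")
big = spark.sql("SELECT order_id, product FROM orders WHERE amount > 10")

# 3. A simple Python UDF applied to the SQL result
shout = F.udf(lambda s: s.upper(), StringType())
big.select("order_id", shout("product").alias("product_uc")).show()
```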

Furthermore, moving data processing into Snowflake provides a unified governance framework from the very start of the data flow, ensuring consistency, security, and regulatory compliance throughout the entire lifecycle without duplicated effort.

An internal Snowflake study reveals that customers using Snowpark to build pipelines in Python, Java, or Scala have achieved an average performance improvement of 5.6x and cost savings of 41% compared with traditionally managed Spark environments.

With this initiative, Snowflake reinforces its commitment to offering efficient and unified tools for developers and data scientists by integrating the best of Spark within its cloud ecosystem.

Developed on Spark Connect and Snowflake’s architecture

Snowpark Connect for Spark leverages the decoupled architecture of Spark Connect, in which an application sends an unresolved logical plan to a remote Spark cluster for processing. This client-server separation has been fundamental to Snowpark’s design from the beginning. Snowpark Connect currently supports Spark 3.5.x, ensuring compatibility with the latest features and improvements in that release line.
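That division of labor is visible in client code: transformations only build up the logical plan locally, and an action is what ships it to the server. A minimal sketch against a generic Spark Connect endpoint (the URL is a placeholder):

```python
from pyspark.sql import SparkSession

# Generic Spark 3.5.x Spark Connect session; the endpoint is a placeholder.
spark = SparkSession.builder.remote("sc://my-endpoint:15002").getOrCreate()

# Each transformation merely extends an unresolved logical plan on the client.
df = (
    spark.range(1_000_000)
         .filter("id % 2 = 0")
         .selectExpr("id * 2 AS doubled")
)

# Nothing executes until an action: the plan is serialized, sent to the
# server, and resolved, optimized, and run there.
print(df.count())
```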

This innovation eliminates the need to move data between Spark and Snowflake, a process traditionally associated with extra cost, latency, and governance complexity. Organizations can now run Spark DataFrame, SQL, and UDF code within Snowflake from Snowflake Notebooks, Jupyter notebooks, Snowflake stored procedures, VS Code, Airflow, or Snowpark Submit, with seamless access to Snowflake tables, Apache Iceberg™ tables (Snowflake-managed or externally managed), and cloud storage, as the sketch below illustrates.
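For instance, a single session could read from all three storage options through the standard Spark API. The table identifiers and S3 path below are illustrative placeholders, and the catalog wiring depends on how your Snowflake environment is configured:

```python
# Illustrative placeholders throughout; "spark" is a session created as in
# the earlier examples.
sales = spark.read.table("analytics.public.sales")         # Snowflake table
events = spark.read.table("lake.db.events")                # Iceberg table
clicks = spark.read.parquet("s3://my-bucket/raw/clicks/")  # cloud storage

print(clicks.count())  # file-based source, read like any other Spark input

# Join the governed sources and persist the result back into Snowflake.
enriched = sales.join(events, "user_id", "left")
enriched.write.mode("append").saveAsTable("analytics.public.sales_enriched")
```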

Working with an open lakehouse

Snowpark Connect for Spark works with Apache Iceberg™ tables, including externally managed Iceberg tables and catalog-linked databases. This allows you to harness the power, performance, ease of use, and governance of the Snowflake platform without moving your data or rewriting your Spark code.
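As a hedged sketch of what “no rewrite” can mean in practice: Iceberg tables support MERGE INTO through Spark SQL, so an upsert written for any Iceberg-enabled Spark cluster should carry over as-is. The catalog, schema, and table names are placeholders, and whether every such statement is supported in the preview is an assumption worth verifying:

```python
# Build a small updates set and expose it to Spark SQL.
updates = spark.createDataFrame(
    [(42, "ada@example.com")], ["customer_id", "email"]
)
updates.createOrReplaceTempView("updates")

# MERGE INTO is standard Iceberg-on-Spark SQL; names are placeholders.
spark.sql("""
    MERGE INTO lake.db.customers AS t
    USING updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.email = s.email
    WHEN NOT MATCHED THEN INSERT *
""")
```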
