The Better Data Warehouse??

By:  James Dinkel

Last month Databricks CEO Ali Ghodsi claimed that Databricks had surpassed Snowflake in Data Warehousing capabilities.  This does not even pass the sniff test.  Snowflake is widely regarded as the premier data warehouse platform on the planet by all measures – in terms of features, performance and price-performance.

When we last checked in on SQL Warehouse / Unity Catalog in January of last year, there were a number of gaps.  Let’s check back in – how many did Databricks fix and how many are still outstanding?

Hint: Most of the big problems in SQL Warehouse are still broken or the feature does not yet exist.

Please note this is not an exhaustive list of problems with Databricks SQL Warehouse, but everything listed here is something we have found to be a problem.

Yet to fix:

UDFs (Severity: Very High)

Databricks supports UDFs in Python, Scala, and SQL. However, in Databricks SQL Warehouse, only SQL UDFs are allowed. Unlike pandas UDFs or vectorized functions, SQL UDFs in Databricks are not vectorized, meaning execution occurs row by row, leading to poor performance on large tables.

This issue is further exacerbated by Databricks’ implementation of dynamic column masking and row filtering, both of which require a UDF to function. The most common use case for dynamic masking or row filtering is on large fact tables, but because these security measures rely on non-vectorized UDFs, query performance degrades significantly at scale.
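
For illustration, here is a minimal sketch of the pattern (table, column, function, and group names are hypothetical): the mask is just a SQL UDF, so it is evaluated row by row when applied to a large fact table.

    -- Hypothetical example: a SQL UDF used as a column mask.
    -- Because it is a plain SQL UDF, it is evaluated row by row.
    CREATE FUNCTION mask_ssn(ssn STRING)
      RETURNS STRING
      RETURN CASE
        WHEN is_account_group_member('pii_admins') THEN ssn
        ELSE '***-**-****'
      END;

    -- Attach the mask to a column on a large fact table.
    ALTER TABLE sales_fact ALTER COLUMN customer_ssn SET MASK mask_ssn;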

ACID – Transactional (Severity: Very High)

ACID capabilities in Databricks SQL Warehouse are single-statement only.  There is no support for transactional blocks (e.g. Start Transaction, <multi-statement sql>, End Transaction).  This is a major limitation for DWH / ETL pipelines, as transactional blocks are widely used.
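
As a rough illustration of what is missing, a multi-statement transaction of the following shape (standard SQL, hypothetical table names) has no equivalent in Databricks SQL Warehouse today; in Snowflake it runs as a single atomic unit.

    -- Standard multi-statement transaction block
    -- (runs in Snowflake; not supported in Databricks SQL Warehouse).
    BEGIN TRANSACTION;
      DELETE FROM orders_target WHERE load_date = '2025-01-31';
      INSERT INTO orders_target
        SELECT * FROM orders_staging WHERE load_date = '2025-01-31';
      UPDATE load_audit SET status = 'LOADED' WHERE load_date = '2025-01-31';
    COMMIT;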

Cloning (Severity: Very High)

Databricks offers Deep Clone and Shallow Clone (for Delta only), both with severe limitations.  Deep Clone fully copies both metadata and data – no pointers, it is a full copy – which is very expensive and time consuming.  Shallow Clone references the source table's data files and copies the metadata only, and it has issues.  Shallow clone was introduced for Unity Catalog managed tables in 13.3 LTS and for Unity Catalog external tables in 14.2 LTS.  Both fail approximately 10% of the time, which is highly problematic because cloning needs to succeed 100% of the time to be usable.  Consider this common use case:

==> Create a shallow clone.
==> Perform an update or delete of data on the source table.
==> Run a compaction if non-serverless (force ACID to do its thing); for serverless, compaction is already turned on by default.
==> Wait the retention time period for non-serverless; for serverless, wait 7 days.
==> Select from the shallow clone and get a file-not-found exception.  The clone is rendered useless.

Snowflake inherently uses immutable partitions and just keeps track of clones with pointers.  Databricks is continually doing compaction and is prone to such file-not-found exceptions.  For reference, we tested both Unity Catalog managed tables and Unity Catalog external tables 30 times each and got 3 failures on each.
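
A minimal sketch of the two clone flavors (hypothetical table names):

    -- Deep clone: copies metadata AND data - expensive on large tables.
    CREATE TABLE sales_fact_deep_copy DEEP CLONE sales_fact;

    -- Shallow clone: copies metadata only and points at the source files.
    -- Breaks with file-not-found errors if the source files are later
    -- compacted or vacuumed away.
    CREATE TABLE sales_fact_shallow_copy SHALLOW CLONE sales_fact;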

Concurrency (Severity: Very High)

Databricks enforces a concurrency limit of 1000 queries before a 429 error is thrown, but in real-world scenarios, you will struggle to reach this limit. For meaningful workloads with moderate query complexity, performance degradation typically occurs much earlier (50-150 concurrent queries, depending on workload complexity).

Databricks offers Intelligent Workload Management (IWM) with Serverless SQL warehouses, which can dynamically spin up additional clusters when query queuing occurs. However, Serverless SQL is already expensive, and scaling inefficient processes across multiple clusters compounds the cost – you are multiplying a costly and inefficient process by still more costly and inefficient processes.

Cross-cloud Replication (Severity: Very High)

Cross-cloud replication in Databricks is a highly manual process. Unlike Snowflake, which provides Snowgrid for seamless cross-region and cross-cloud replication (including failover, failback and app redirection), Databricks does not have a native equivalent—this feature simply does not exist.

While Databricks offers Delta Sharing, this is not the same as replication; it enables data access across platforms but does not replicate or synchronize data across clouds. As a result, organizations must build a custom replication solution or purchase an expensive third-party tool, adding significant operational complexity and cost. Delta Sharing is a federated query model where consumers must deal with poor performance and pay egress costs on every query.
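
For comparison, a sketch of what Snowflake's replication setup looks like (account and database names are hypothetical); Databricks has no native counterpart to these commands.

    -- On the source Snowflake account: allow replication to another account.
    ALTER DATABASE sales_db ENABLE REPLICATION TO ACCOUNTS myorg.azure_west_account;

    -- On the target account: create the replica and refresh it.
    CREATE DATABASE sales_db AS REPLICA OF myorg.aws_east_account.sales_db;
    ALTER DATABASE sales_db REFRESH;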

Stored Procedures (Severity: Very High)

Databricks does not have SQL stored procedures.  Instead, you will need to create a Python stored procedure: convert your SQL stored procedure to spark.sql calls, convert to a pandas DataFrame, and run it in a SparkSession.  That is a lot of extra work, and you need to be careful with syntax when converting the logic.  Databricks does have this on the roadmap, but it will take some time.
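
For reference, this is the kind of SQL-language stored procedure (Snowflake Scripting shown here, with hypothetical table names and logic) that currently has no native equivalent in Databricks SQL Warehouse:

    -- Sketch of a Snowflake Scripting (SQL) stored procedure.
    CREATE OR REPLACE PROCEDURE refresh_daily_totals(p_load_date DATE)
      RETURNS STRING
      LANGUAGE SQL
    AS
    $$
    BEGIN
      DELETE FROM daily_totals WHERE load_date = :p_load_date;
      INSERT INTO daily_totals
        SELECT load_date, region, SUM(amount)
        FROM sales_fact
        WHERE load_date = :p_load_date
        GROUP BY load_date, region;
      RETURN 'daily_totals refreshed';
    END;
    $$;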

Semi-structured (Severity: High)

Databricks introduced the VARIANT data type in Databricks Runtime 15.3, but there are limitations. Previously, users had to rely on Apache Spark for semi-structured data parsing, requiring explicit transformations (explode, flatten, get_json_object) to achieve efficient querying. With the introduction of VARIANT, SQL-based operations on semi-structured data are now possible, reducing complexity. However, VARIANT columns in Databricks cannot be used as clustering keys, partition keys, or Z-order keys. They are also not suitable for comparisons, grouping, ordering, or set operations, and you cannot use column generation to create a variant column.  Delta does not collect minValues or maxValues stats for variant columns.  As a result, users must choose between VARIANT (simpler SQL queries but lagging performance) and Apache Spark transformations (more complex but acceptable performance).
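
A small sketch of what the VARIANT path looks like (hypothetical table and field names); note that the variant column itself cannot serve as a clustering, partition, or Z-order key:

    -- Parse raw JSON into a VARIANT column.
    CREATE TABLE events AS
    SELECT parse_json(raw_json) AS payload
    FROM raw_events;

    -- Query nested fields with the path syntax, casting as needed.
    SELECT payload:device.os::string        AS device_os,
           payload:metrics.latency_ms::int  AS latency_ms
    FROM events;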

ACID – Performance (Severity: High)

Both Snowflake and Databricks offer full ACID capabilities. At small volumes, performance differences may be negligible. However, at large volumes, Snowflake’s 16MB immutable micropartitions with auto-clustering will significantly outperform Delta Lake’s distributed storage model, which relies on a base file + delta transactional log compaction.

Consider this use case: Load 10 years of daily data, each day containing 50GB+ of new records, then perform an update across all days.

==> If delta.autoOptimize.autoCompact = true (default for UC) → The cluster will be heavily taxed as it continuously merges small files in the background.
==> If delta.autoOptimize.autoCompact = false → The accumulation of small files will degrade performance over time, eventually making queries unusable due to excessive file scanning.

This issue is common, especially when business logic changes require historical data updates (e.g., restating financials or reclassifying customer segments).

In contrast, Snowflake uses immutable micropartitions.  Snowflake is just updating pointers to locations (not rewriting data).
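
A rough sketch of the trade-off described above (hypothetical table and column names); either setting of the property leaves you paying somewhere:

    -- Option 1: background compaction on - the cluster pays the merge cost.
    ALTER TABLE daily_sales SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true');

    -- Option 2: compaction off - small files accumulate until queries degrade.
    ALTER TABLE daily_sales SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'false');

    -- The historical restatement that triggers the pain in either case:
    UPDATE daily_sales SET segment = 'SMB' WHERE annual_revenue < 1000000;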

Windowing (Severity: High)

Databricks SQL Warehouse supports many window functions, but certain advanced analytical functions may have different implementations or performance trade-offs compared to traditional data warehouses. Execution performance depends on query optimization, Spark’s distributed architecture, and memory-intensive operations, which can lead to slower performance on large datasets compared to platforms with native columnar execution like Snowflake.

Example: LEAD Window Function on a Large Fact Table
Suppose you need to calculate the next transaction date for each customer in a billions-row dataset with millions of unique customers, making customer_id a high-cardinality partition column.

With Databricks, Spark must repartition the data by customer_id; when the working set does not fit into executor memory, this leads to heavy data shuffling, high I/O, and significant network overhead.
With Snowflake, when executing window functions like LEAD, micropartitions and clustering optimize storage access, reducing the need for shuffling and minimizing I/O and network overhead, leading to faster and more efficient execution.


As a result, analytical workloads that require partitioning on high-cardinality columns often perform better in Snowflake than in Databricks SQL Warehouse due to Snowflake’s optimized storage-layer pruning and native query execution engine.
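
The query shape in question is simple (hypothetical table and column names); the cost difference comes entirely from how each engine executes the PARTITION BY:

    -- Next transaction date per customer on a billions-row fact table.
    SELECT customer_id,
           transaction_date,
           LEAD(transaction_date) OVER (
             PARTITION BY customer_id
             ORDER BY transaction_date
           ) AS next_transaction_date
    FROM transactions_fact;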

Delta UniForm (Severity: High)

The objective of Delta Lake UniForm is to automatically generate Apache Iceberg metadata alongside Delta Lake metadata, enabling interoperability between different data processing engines. This process utilizes an asynchronous service for metadata synchronization.

Consequently, if a job fails or a cluster terminates before the metadata generation completes, the Iceberg metadata may become outdated or inconsistent with the Delta table. To resolve this discrepancy, you may need to manually trigger the metadata synchronization using the MSCK REPAIR TABLE <table-name> SYNC METADATA command or, in certain cases, reload the table if the metadata corruption is severe.

It is rumored that Databricks has fixed this in the last week, but I have seen no evidence of such.

To mitigate this risk, it is crucial to either (a) implement a monitoring job that detects metadata mismatches and alerts you when resynchronization or a full table reload is required, or (b) not use Delta UniForm at all and instead use Delta Direct from Snowflake.  Delta Direct consumes Delta Lake metadata directly, without needing a translation layer like UniForm.
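
A sketch of the moving parts (hypothetical table name): UniForm is enabled via table properties, and the repair command mentioned above is what you reach for when the Iceberg metadata falls behind.

    -- Enable UniForm: Iceberg metadata is generated asynchronously
    -- alongside the Delta metadata.
    ALTER TABLE sales_fact SET TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    );

    -- If the async metadata generation is interrupted, force a resync.
    MSCK REPAIR TABLE sales_fact SYNC METADATA;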

UC pipelines (Severity: High)

Unity Catalog pipelines are significantly slower than legacy Hive Metastore (HMS) pipelines, especially when governance features like row filters and audit logging are enabled. These features introduce additional processing overhead, increasing query latency and execution times. Users migrating from HMS to Unity Catalog have reported noticeable slowdowns in deployment and data pipeline performance, making optimization strategies essential to maintain efficiency.

Unity Catalog (Severity: High)

Unity Catalog Open-Source and Unity Catalog Databricks are separately maintained platforms; they are not the same.  The open-source Unity Catalog allows a great number of read clients, while the Databricks Unity Catalog allows Databricks clients only.  The OSS Unity Catalog allows both Iceberg API reads and writes, but lacks security features.  The Databricks Unity Catalog allows Iceberg reads only, and only if UniForm is enabled.

SQL Warehouse Unity Catalog Metastore Cache (Severity: High)

Spark does not have a metadata cache, except in the driver of the current session.  When Databricks created SQL Warehouse, they did like Snowflake (which has had a metadata cache since the early days and has frequently improved it) and created a metadata cache, except they did it only for the current cluster.  So when new clusters start up, the metadata cache starts anew.  Snowflake, however, externalizes the metadata cache from the cluster, so even in multi-cluster warehouses the fully loaded cache is available to new and existing clusters alike.

Unity Catalog Migrations (Severity: High)

Databricks is encouraging the move away from Hive Metastores (HMS).  The process of moving dedicated HMSs to Unity Catalog is very painful.  The basic process is:

(1) Prerequisites – storage locations, cluster access mode, and noting HMS permissions (these need to be recreated later).
(2) Set up UC – create the metastore, storage credentials, and locations.
(3) Convert HMS databases into schemas – for managed Hive tables this requires a CTAS statement, and data cannot change during the process.
(4) Migrate comments and grants (involves SHOW GRANT on HMS, then applying the grants on UC).
(5) Update jobs, notebooks, dashboards, scripts, etc.
(6) Validate with select count(*), show tables, etc.

Databricks is adding federation from the HMS catalog to tide users over during lengthy migrations.
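
Steps (3) and (4) boil down to statements of roughly this shape (catalog, schema, table, and group names are hypothetical):

    -- (3) Copy a managed Hive table into Unity Catalog with CTAS
    --     (data must not change while this runs).
    CREATE TABLE main_catalog.finance.orders AS
    SELECT * FROM hive_metastore.finance.orders;

    -- (4) Re-apply the permissions noted from the HMS side.
    GRANT SELECT ON TABLE main_catalog.finance.orders TO `analysts`;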

Access (Severity: Medium)

There are quite a number of access mode limitations and restrictions (roughly 20-25).  For example, in standard access mode, I cannot use Databricks Runtime ML together with Unity Catalog metadata; I must use dedicated access mode.  If I use dedicated access mode, though, I prohibit others from utilizing the same cluster.  See the full list of access mode limitations here:  https://docs.databricks.com/aws/en/compute/access-mode-limitations

For context, the Snowflake solution has an architecture where the notebook runs in a container but all SQL is executed in the warehouse. Therefore, fine-grained access control is straightforward and consistent with regular worksheet-based SQL queries.

Two notable items have been fixed since January 2024 but are not yet up to where Snowflake is:

RBAC (Severity: High)

On August 6th, 2024, row filters and column masks went GA to prevent access to sensitive data by unauthorized users. These filters and masks are implemented as SQL user-defined functions (UDFs), which, however, is a performance problem (see the UDF issue above).
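
The GA feature follows the same UDF-based pattern (hypothetical table, column, function, and group names), which is why the performance concern above carries over:

    -- Row filter implemented as a SQL UDF, evaluated per row.
    CREATE FUNCTION region_filter(region STRING)
      RETURNS BOOLEAN
      RETURN is_account_group_member('admins') OR region = 'US';

    -- Apply the filter to a large fact table.
    ALTER TABLE sales_fact SET ROW FILTER region_filter ON (region);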

Materialized Views (Severity: High)

Databricks announced Materialized Views in Public Preview in June 2024. However, its implementation differs from Snowflake and other platforms. In Databricks, users must explicitly configure materialized view refreshes when the base table changes. Refreshes can be performed fully or incrementally, either synchronously or asynchronously, to prevent stale data.

In contrast, Snowflake automatically maintains materialized views in the background, ensuring that queries always access the most up-to-date precomputed data. This automation reduces the risk of stale results and eliminates the operational burden of managing refresh schedules, making Snowflake’s implementation more seamless and efficient for real-time analytics.

In addition to preventing stale data, Snowflake also evaluates queries directed at base tables to determine whether a materialized view would be faster than the underlying table, and redirects the query to the materialized view if so.
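
A sketch of the operational difference (hypothetical view and table names): in Databricks the refresh is something you schedule or trigger, whereas Snowflake maintains the view behind the scenes.

    -- Databricks: create the materialized view...
    CREATE MATERIALIZED VIEW daily_revenue_mv AS
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date;

    -- ...and keep it fresh yourself, on demand or on a schedule.
    REFRESH MATERIALIZED VIEW daily_revenue_mv;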

Two notable items have been fixed since January 2024:

Performance (Severity: High)

Databricks introduced Liquid Clustering on February 18, 2025, as a replacement for traditional Z-ordering and partitioning techniques. While this approach aims to improve query performance and reduce data skew, it remains to be seen how effectively it competes with Snowflake’s micropartitioning and automatic clustering. 

Note: Liquid Clustering cannot be used alongside Z-ordering or partitioning on the same table. However, you can alter a table to switch between Liquid Clustering and traditional partitioning methods as needed.
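
For reference, adopting it is a small change per table (hypothetical table and column names), and it replaces rather than complements PARTITIONED BY / ZORDER on that table:

    -- Create a table with Liquid Clustering instead of partitions/Z-order.
    CREATE TABLE sales_fact (
      sale_id     BIGINT,
      customer_id BIGINT,
      sale_date   DATE,
      amount      DECIMAL(18, 2)
    )
    CLUSTER BY (customer_id, sale_date);

    -- Or switch an existing table over.
    ALTER TABLE sales_fact CLUSTER BY (customer_id, sale_date);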

Performance (Severity: High)

Databricks introduced automated stats collection in its January 2025 release.  This is a big deal, as the planner can now make decisions based upon table size and cardinality in joins.  This is something Snowflake has done all along, but this release brings Databricks more on par with Snowflake in this area of performance.
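
Before this release, keeping the optimizer informed meant running statistics collection yourself, roughly like this (hypothetical table name):

    -- Manual statistics collection, now automated by Databricks.
    ANALYZE TABLE sales_fact COMPUTE STATISTICS FOR ALL COLUMNS;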

We believe these higher-priority items that have yet to be fixed will take Databricks longer to address than the simpler things they have already fixed.  We presume there are a lot of Databricks engineers working in the background on them, and we will keep an eye on the release notes over the next several years to see when they get fixed.