Snowflake: Power Up, Costs Down

By: James Dinkel

In today’s data-driven landscape, velocity and efficiency are paramount. However, for organizations navigating the stringent regulations surrounding sensitive data – think HIPAA, GDPR, or PCI DSS – the pursuit of performance cannot come at the expense of robust privacy controls. In our experience, customers often struggle to balance query performance, data governance, and costs when scaling and optimizing database queries for sensitive use cases on data platforms that are not built ground up to handle complex analytics.
To shed light on these challenges, we conducted a series of tests that compare Snowflake, a proven leader in data analytics and modern data warehousing, against Databricks across various analytics use cases centered on handling sensitive customer data.

The results are telling:

Lower is Better

Chart showing the time elapsed for complex queries across Snowflake and Databricks

Cost advantage –

In addition to the performance advantage, Snowflake also has the price advantage. Databricks Serverless is 1.4x more expensive than Snowflake Enterprise Edition ($8.40/hr compared to Snowflake Enterprise Edition at $6/hr).

Even Snowflake’s most specialized Business Critical is less expensive than Databricks’ ($8/hr versus Databricks Serverless at $8.40/hr).

But how do these performance deltas translate into tangible inefficiencies in a real-world environment dealing with sensitive data? Let’s delve into the details.

How does this unfold in a real-world environment?

Situation: Customers in highly regulated industries like finance and healthcare must adhere to strict privacy laws (HIPAA, GDPR, PCI DSS). Data masking is a cornerstone of compliance, effectively mitigating legal risks by obscuring confidential data. For example, Security numbers or credit card details should be strictly controlled, limiting the potential for internal breaches. Data masking limits exposure and reduces risk. Such security controls are also critical when sharing data both within and outside the organization.

How data teams traditionally apply data masking: Customers often use SQL functions such as User-Defined Functions (UDFs) to simplify, optimize, and scale database queries, and implement controls, saving time, enabling privacy, and reducing complexity. This methodology is cumbersome and inefficient.

How Databricks applies data masking: Data masking in Databricks is primarily facilitated by User-Defined Functions (UDFs). Given that Databricks is not a purpose-built data warehouse for complex analytics, its UDF support lacks consistency, notably in vectorization. Instead, Databricks relies on the underlying query engine (photon) to vectorize the operation. In simple cases, this usually works. In complex scenarios, it does not, resulting in inefficient row-by-row execution, especially impacting performance on large tables for those complex query structures like subqueries. This limitation also affects the performance of dynamic column masking and row filtering, which are built on this same not-always-vectorized UDF framework. Although Databricks Photon offers vectorization for certain UDFs, its incomplete coverage doesn’t fully address the underlying performance challenge.

Moreover, customers must use higher Databricks editions (accruing higher costs) to overcome these performance challenges, but they still do not achieve the price/ performance they get with Snowflake.

How Snowflake enables governance controls: Snowflake has comprehensive UDF support and implements security and governance controls out of the box without any additional scripting. Snowflake’s fully managed governance and privacy controls mean customers do not need to pay more for a feature that is natively built in for query efficiency and data protection.

What is clear from above is that Databricks SQL is still developing its SQL capabilities. On the other hand, Snowflake is purpose-built for complex and sensitive analytical use cases, making it easy and efficient for you.

How to do the tests yourself:

Test 1: UDF Masking in Subquery Expression.
Purpose: Analyze behavioral patterns and identify high-risk users based on their actions.

1. Users Table (users)

Row count: 10 million

Structure: Each record represents a unique user, identified by user_id and associated with an email address.

Purpose: Serves as the primary entity for identifying individuals in the system.

2. User Activity Table (user_activity)

- Row count: 500 million (generated in chunks of 50 million)
- Structure: Captures user activity logs including:
  - activity_id: Unique identifier for each activity
  - user_id: Foreign key referencing the users table
  - activity_type: Indicates type of user behavior (suspect page or normal page)
  - activity_time: Timestamp of the event

Email Masking Requirement: For privacy reasons, we mask the username portion of email addresses. If masking conditions are met, the username part (before the @) is replaced with “xxxx”, preserving the domain.
Example: john.doe@example.com → xxxx@example.com>

The query builds a derived table named masked_users_high_activity that filters and selects users who:

Have accessed a ‘suspect page’ at least once (EXISTS subquery)
Have performed more than 5 activities total, regardless of type (COUNT(*) > 5)

The query returns:

user_id
Conditionally masked email address, based on activity volume

-- UDF Function
CREATE OR REPLACE FUNCTION mask_subquery(email STRING)
RETURNS STRING
RETURN CONCAT('xxxx@', SPLIT(email, '@')[1]);

-- Query
CREATE OR REPLACE TABLE default.masked_users_high_activity2 AS
SELECT 
    u.user_id,
    (
        SELECT mask_subquery(MAX(u2.email))
        FROM users u2
        WHERE 
            u2.user_id = u.user_id AND
            (
                SELECT COUNT(*)
                FROM user_activity ua
                WHERE ua.user_id = u2.user_id
            ) > 5
    ) AS masked_email_if_high_activity
FROM users u
WHERE EXISTS (
    SELECT 1
    FROM user_activity ua1
    WHERE 
        ua1.user_id = u.user_id AND 
        ua1.activity_type = 'suspect page'
);

Results: As expected, on complex queries, Snowflake vastly outperforms Databricks by almost 2x (margin of 103% and 78% for masked and unmasked queries, respectively). The mask on Databricks comes with a 3-second penalty, whereas Snowflake has none.

Use Case	Time (seconds)
Snowflake, unmasked	12.0
Snowflake, masked	12.0
Databricks, unmasked	21.4
Databricks, masked	24.3

Test 2: Simple, conditional masking

Purpose of the test: The objective is to conditionally mask the email address based on the user’s country.

User’s table with geo information

Data Size: 409 Million
Description: Customer-related data
Columns: Index, CustomerId, FirstName, LastName, Company, City, Country, Phone1, Phone2, Email, SubscriptionDate, Website

Masking UDF: The objective of the masking logic is to conditionally mask the email address based on the user’s country. This is done using a rule-based approach where email addresses are masked only if the country does not begin with a letter between ‘A’ and ‘H’ (inclusive).

-- UDF
CREATE OR REPLACE FUNCTION mask_pii_country_new(email STRING, country STRING)
RETURNS STRING
RETURN CASE
    WHEN UPPER(SUBSTR(country, 1, 1)) BETWEEN 'A' AND 'H' THEN email
    ELSE REGEXP_REPLACE(email, '^[^@]+', 'xxxx')
END;

-- Query
CREATE OR REPLACE TABLE default.customers_dynamic_mask_temp AS
SELECT
    firstname,
    lastname,
    country,
    mask_pii_country_new(Email, country) AS masked_email
FROM default.customers_dynamic_mask
WHERE SubscriptionDate > '2021-01-01';

Results: Also, as expected, on the simpler query, Snowflake still outperforms Databricks by a margin of 12% on both the unmasked and masked queries.

Use Case	Time (seconds)
Snowflake, unmasked	22.7
Snowflake, masked	24.3
Databricks, unmasked	25.48
Databricks, masked	27.49

In both cases, we used a small Snowflake warehouse and a small Databricks serverless cluster:

Snowflake: small, enterprise=$6/hr., business critical=$8/hr
Databricks: small serverless, cost=12DBU/hr = $8.40/hr.

What powers this high Snowflake performance and efficiency?

Snowflake’s platform powers simplicity across the customer journey:

Simplicity in set-up – Snowflake is a fully managed platform and does not require elaborate set-up processes.

Simplicity in platform scaling – Snowflake is built on a foundation of high performance and efficiency. With elastic scaling, micro-partitions, high concurrency support, automatic performance improvements, and intelligent workload optimization features, Snowflake is one of the fastest analytics platforms in the industry.

Simplicity in performing complex analytics – Snowflake has had robust analytics capabilities for years, including support for vectorized UDFs, stored procedures & multi-table transaction support, and automatic MV refreshes.

Simplicity in enabling strong end-to-end security and governance – Snowflake governance is foundational with Horizon catalog – out-of-the-box row filtering, dynamic data masking, tag-based masking, fine-grained access controls with no significant performance impact.

Conclusion

Snowflake has meticulously built an industry-leading analytics platform that is not only fully managed and constantly improving but also extends to meet customers’ requirements as an open lakehouse, modern warehouse, or a global data mesh while preserving its simplicity. In addition, all of this is backed by a robust engine that powers one of the fastest data analytics platforms in the market and provides cost efficiencies. While Databricks might make several claims but many of them fall apart because of the complexity Databricks passes along to the users. Don’t just take our word, try Snowflake today. Need help? Squadron Data can help you get started.

Snowflake: Power Up, Costs Down

Snowflake: Power Up, Costs Down

Contact Us