Snowflake DWH Migrations: Success where others failed!
Plus: Why migrations to other platforms fail (and will continue to for the foreseeable future)
Recently we had a customer come to us who had struggled with moving their data warehouse workloads to Databricks. They had been at it for a year at the time and hadn’t made much progress on any of the workloads. After several discussions, they pivoted from a Databricks migration to instead use Squadron Data to move those workloads to Snowflake.
We are in our third month of migration on the first workload, and it is currently running well in test. It will go to production next month. 😊
We are performing code conversion on the other workload. It will go to production at the end of the year. 😊
Last week, the customer referred us to executives at another company. We ❤ referrals!
This is not uncommon; you see a lot of this happening. It raises the question: why? Is it the Systems Integrator? Is it issues within the customer organization itself? Is it the target platform selected? The truth is that failure can be attributed to any one of those, or all of them. The target platform selection, however, makes a huge difference.
Squadron Data has a tremendous amount of experience in this space. Our migration toolkit, formerly known and sold as the Cloudera Data Warehouse Migrator, migrated customers from Teradata, Exadata and Netezza to Hive. When we exited Cloudera, we re-wrote the toolkit for Hive-to-Databricks and Hive-to-Snowflake migrations, knowing full well we would throw one away. We used this as an exercise to understand the technology of the target platforms more deeply.
Here are six big reasons why a DWH migration to Snowflake is much easier, and has a much higher probability of success, than one to Databricks, Fabric or other platforms.
- Ease of use / Simplicity of the platform –
- Snowflake is a fully managed service. You don’t have to worry about underlying instances, cluster management, tuning, etc. Both compute and storage infrastructure are entirely managed for you and scaled independently (see the sketch at the end of this section).
- Databricks, on the other hand, relies heavily on Spark. The cluster management and data-engineering-heavy nature of Spark adds complexity. Many of the features require Delta Lake, which brings its own set of complexities and risk. Confusion often abounds over which cluster type to use for which workload.
- Fabric is an attempt by Microsoft to provide a fully managed service, but it is still evolving, and you still have to manage all of the individual services (Azure Data Lake, ADF, Synapse, Power BI, etc.). Managing the different services is not seamless, and Synapse in and of itself is a lot of work.
Neither Databricks nor Fabric is anywhere close to Snowflake in ease of use of the platform. Maybe someday they will get there, but certainly not anytime soon.
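To illustrate the point, here is a minimal sketch of what "managing compute" looks like in Snowflake (the warehouse name and sizes are illustrative): a few SQL statements, with no instances or clusters to configure.

```sql
-- Create a compute warehouse; storage is managed entirely separately.
CREATE WAREHOUSE IF NOT EXISTS migration_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60     -- suspend after 60 idle seconds
  AUTO_RESUME    = TRUE;  -- resume automatically when a query arrives

-- Scale compute up for a heavy conversion or validation run...
ALTER WAREHOUSE migration_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ...and back down when it is done. Storage is unaffected either way.
ALTER WAREHOUSE migration_wh SET WAREHOUSE_SIZE = 'XSMALL';
```

That is essentially the whole "cluster management" story on the Snowflake side.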
- Tooling around the platform – metadata replication, for example, among other things.
- Snowflake has quite a lot of tooling around the platform – things like Zero Copy Clone, which makes data, metadata and security policies available in other environments without physically copying the data (a minimal example follows after this list). This makes the testing and code-promotion process much easier than on the other platforms (half the battle on large-scale migrations is ‘how do I test and validate this?’).
- Databricks has no such metadata replication. When Databricks was backed by the Hive Metastore (HMS), you would replicate the contents of a number of metastore tables (DBS, TBLS, TBL_PRIVS, etc.). With Unity Catalog, you now have to either iterate through the Unity Catalog API (GET /api/2.0/unity-catalog/catalogs) or iterate through a series of SQL commands (show tables in catalog_name.schema_name). Carrying these scripts around is a lot of work and leads to questions like ‘is test really the same as prod?’ or problems like ‘test succeeded, prod failed – the schemas were not the same.’
- With Fabric, it is even worse. Because Fabric is a collection of services, one must do this across all of the services. The data warehousing component, Synapse, is a restart of Azure SQL Data Warehouse. To replicate metadata from one Synapse instance to another, you have three choices:
- In Synapse Studio, right-click on a database, generate a script, export it, and run the import on the other instance – one database at a time (no fun).
- Create an ADF pipeline that reads the metadata and drops it somewhere, then create a script (or stored proc) on the target that reads and imports it (also not fun).
- Back up and restore one database at a time (also not fun, and a really bad idea if you have PII or other sensitive data).
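As a concrete example of why this matters for testing, here is a minimal sketch of Zero Copy Clone (the database names are illustrative). The clone is immediately usable and carries the objects and metadata with it, but no data is physically copied.

```sql
-- Stand up a test environment as a clone of production; no data is copied.
CREATE DATABASE edw_test CLONE edw_prod;

-- Run the converted workload and validation queries against edw_test,
-- then throw the environment away when testing is complete.
DROP DATABASE edw_test;
```

Because the clone is built from the same metadata as production, ‘is test really the same as prod?’ stops being a question.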
- Security and Data Governance Model
- Snowflake has Horizon Catalog – it is hands down the most robust and powerful RBAC engine on the planet. Security administrators can create policies by resource (db/schema/table/column) or by data classification (by tag – PII, CPNI, PCI, etc.). They can also apply column-level masking and row-level filtering (a minimal policy sketch follows after this list). The RBAC engine has been around for quite some time and is very mature.
- Databricks is newer to the RBAC game. In the old days, Privacera and Okera were used as query interceptors to provide authorization controls; Databricks purchased Okera and is integrating it. With Unity Catalog (GA’ed in June 2022), security administrators can create policies based on resource, but not by classification. Unity Catalog can integrate with Azure Purview or AWS Glue to provide policy by classification, though that is not part of Databricks itself. I’m not aware of any row-level filtering capabilities. Masking at the table level became possible in December 2023, but the mask has to be specified at create-table time and must use an external function. Administrators commonly worked around this (and still do) by creating a view with ‘select mask(column) from db.schema.tbl’, granting access to the view, and denying access to the underlying table. Be prepared: this is a lot of work.
- Fabric is new to the game. The applicable services here are Purview (governance) and Synapse (DW). Synapse does provide RBAC at the resource level and Purview does provide tagging capabilities, but Synapse does not yet provide access control based on the Purview tags. At some point Microsoft will provide this integration, but I have not seen a roadmap date for when it will be made available. This raises a larger question: with lineage/classification/governance in one service (Purview), the warehouse in another (Synapse) and data engineering in yet another (ADF), how messy will it be for Microsoft to tie authorization policy control together? Snowflake has had uniform, consistent policy design across the stack for some time, Databricks is actively working toward it, and Fabric will find it very hard to get there.
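To make the Snowflake side concrete, here is a minimal sketch of policy by classification. The tag, policy, role and table names are all illustrative, and tag-based masking and row access policies require the appropriate Snowflake edition.

```sql
-- Classify a column by tagging it as PII (illustrative names throughout).
CREATE TAG IF NOT EXISTS pii;
ALTER TABLE edw.customer MODIFY COLUMN email SET TAG pii = 'email';

-- Define a masking policy and attach it to the tag once; every column
-- tagged as PII is then masked consistently.
CREATE MASKING POLICY pii_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_READER') THEN val ELSE '***MASKED***' END;
ALTER TAG pii SET MASKING POLICY pii_mask;

-- Row-level filtering via a row access policy (illustrative predicate).
CREATE ROW ACCESS POLICY region_filter AS (region STRING) RETURNS BOOLEAN ->
  CURRENT_ROLE() = 'GLOBAL_ANALYST' OR region = 'US';
ALTER TABLE edw.customer ADD ROW ACCESS POLICY region_filter ON (region);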
- SQL Compliance
- Snowflake is fully ANSI SQL compliant, so that is a great start. Snowflake ACID compliance has been around for a long time and is fully baked. Things like Teradata BTEQ, query band, WLM, etc. will require conversion (a small conversion sketch follows after this list). Stored procs are straightforward. The effort here is less than with the others, but there is effort.
- Databricks offers Spark SQL, which is only partially ANSI SQL compliant. ACID-like functionality is only available via Delta, so you will be stuck using Delta, and it is not a one-for-one mapping – extra work is required here. Things like Teradata BTEQ, query band, WLM, stored procs, etc. will require conversion. While technically possible with a heavy lift, it is going to be the most difficult of the three for sure.
- Fabric (Synapse) is mostly ANSI SQL compliant (dedicated SQL pools, not serverless), so some dialect conversion is required that is not needed with Snowflake. Things like Teradata BTEQ, query band, WLM, stored procs, etc. will require conversion; within Synapse, those are converted to T-SQL, a proprietary Microsoft extension. A conversion to Synapse requires more work than a conversion to Snowflake, but not as much as one to Databricks.
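As a small, hedged example of the BTEQ side of this (the table names and logic are illustrative, not taken from a real conversion), a typical BTEQ error-handling pattern maps onto a Snowflake Scripting block roughly like this:

```sql
-- Teradata BTEQ original (shown as comments):
--   INSERT INTO edw.sales_daily SELECT * FROM staging.sales WHERE load_dt = CURRENT_DATE;
--   .IF ERRORCODE <> 0 THEN .QUIT 1;

-- One possible Snowflake Scripting equivalent:
EXECUTE IMMEDIATE $$
BEGIN
    INSERT INTO edw.sales_daily
    SELECT * FROM staging.sales WHERE load_dt = CURRENT_DATE;
EXCEPTION
    WHEN OTHER THEN
        RAISE;  -- surface the failure to the orchestrator, as .QUIT 1 did
END;
$$;
```

The point is not that the conversion is free – BTEQ scripts still have to be rewritten – but that the target constructs exist natively in Snowflake SQL.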
- Migration Accelerators
- Snowflake has SnowConvert, its native EDW migration tooling and assessment kit. SnowConvert is available for partners and customers to use. It is very powerful and has been used on the largest and most complex Teradata migrations in the world. We at Squadron Data also have our own migration toolkits for Teradata, Netezza, Exadata and Hive. We are retiring portions of our data warehouse migrator in favor of SnowConvert, which says a lot right there about the power of SnowConvert. SnowConvert has been used many times over many years. One common legacy pattern in the 2013-2017 time period was to use Spark for ETL ingest to create a set of final tables and then TPT those tables into Teradata. Snowflake has this covered too, with the Snowpark Migration Accelerator (SMA). SMA migrates legacy Spark workloads (whether open-source Spark, Databricks or Cloudera) to Snowpark.
- Databricks has Brick Builders, a collection of accelerators from third-party partners such as TCS, Wipro and Accenture. From a data warehouse migration perspective, TCS (and potentially others) has a data warehouse migration toolkit. I’m not aware of any success they have had. I am aware of one attempted Teradata offload where they were not able to perform the SQL and BTEQ conversion; they pivoted to attempting the conversion by hand with offshore resources, but the volume of code was too large. See SQL compliance above on why this is a very challenging thing to do.
- Fabric – there is Azure Synapse Pathway. I’m not familiar with how well or how poorly it performs.
- Concurrency limits
- Snowflake – Snowflake scales endlessly in both compute and storage. I am sure there is a limit somewhere, but we have not been able to find it. Last week we demonstrated to a customer more than 1,000 concurrent queries with their datasets and their queries – obviously, they were impressed. I know of folks who have done 2,500 concurrent queries, and I presume more is possible (a multi-cluster warehouse sketch follows after this list).
- Databricks – Databricks recently increased the limit from 1,000 to 2,000 concurrent tasks before a 429 error is thrown. But in practice, can you get much over 100? Remember, Unity Catalog is Java, and so are the Spark driver and executors. I would like to see real-world use cases (with sizable query compilation, metadata pulls, locking, and cache checks/invalidations) sustain well north of 100 concurrent queries before I would say the practical concurrency limit is well north of 100. TBD on what that number really is.
- Fabric (Synapse) – the limit for a single pool is 128 concurrent queries (in practice, it is about half of that – 60-70). The limit across all pools is 1,000. The roughly 60-query concurrency ceiling is a big problem, and I know of at least one large customer who killed a very large migration program after 18 months and many millions of dollars because of it. Be very wary.
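For reference, the way Snowflake absorbs that kind of concurrency is a multi-cluster warehouse. A minimal sketch follows (the name, size and cluster counts are illustrative, and multi-cluster warehouses require the appropriate edition):

```sql
-- Clusters are added automatically as concurrent queries start to queue,
-- and removed as the load drops; no manual tuning is involved.
CREATE WAREHOUSE IF NOT EXISTS bi_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 10
  SCALING_POLICY    = 'STANDARD';
```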
Migrations are hard enough – especially when there are literally years of code and tech debt piled on them. You may as well go with the platform that scales endlessly, is a fully managed service, is governed consistently, is the simplest and easiest to use, and has the best tooling for migrations. That one, of course, is Snowflake.