MON AUGUST 15TH, 2022 11:40 AM MONTREAL, CANADA
Hey Guys,
Just as there is Databricks vs. Snowflake, there is DevOps vs. MLOps. While I’m not a technical person, I often find myself thinking about this.
For software developers this is already rather intuitive:
DevOps methodology helps improve communication between your developers and ops teams working on projects. It best serves the following purposes:
you can launch new features faster
it increases customer satisfaction, and developer satisfaction too
feedback loops enable better communication
Key principles of DevOps (a small code sketch follows this list):
Automation
Iteration
Self-service
Continuous improvement
Continuous testing
Collaboration
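To make “automation” and “continuous testing” a bit more concrete, here is a minimal sketch of the kind of check a DevOps pipeline would run on every commit. The feature-flag helper and test names are hypothetical, purely for illustration:

```python
# Hypothetical example: a tiny automated test that a CI pipeline
# (run via pytest on every commit) would execute before any deploy.

def is_feature_enabled(flags: dict, name: str) -> bool:
    """Return True if a feature flag exists and is switched on."""
    return bool(flags.get(name, False))


def test_feature_flag_defaults_to_off():
    # New features should ship "dark" until ops turns them on.
    assert is_feature_enabled({}, "new_checkout") is False


def test_feature_flag_can_be_enabled():
    assert is_feature_enabled({"new_checkout": True}, "new_checkout") is True


if __name__ == "__main__":
    # Fallback: run the checks directly without a test runner.
    test_feature_flag_defaults_to_off()
    test_feature_flag_can_be_enabled()
    print("all checks passed")
```

In a real pipeline, a CI service such as GitHub Actions or Jenkins runs tests like these automatically on every change and blocks the release if they fail; that is the automation and continuous-testing piece in practice.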
Machine Learning Operations (MLOps)
If you think of how all this plays out in the real world, there appears to be a lack of a good bridge between DevOps and MLOps. Correct me if I’m wrong.
AI has been heralded as the new “brains” for software applications, a role long held by databases. Think about it: ML models depend on specific combinations of hardware and software infrastructure. Without the right infrastructure, the models either cannot perform well enough to be viable or, in some cases, become prohibitively costly.
According to Databricks, MLOps stands for Machine Learning Operations. MLOps is a core function of Machine Learning engineering, focused on streamlining the process of taking machine learning models to production, and then maintaining and monitoring them. MLOps is a collaborative function, often comprising data scientists, DevOps engineers, and IT.
How DevOps and MLOps operate together seems to be a bit lacking. There’s a lot of inefficiency and wasted effort.
Today there is no efficient bridge between the creation of ML models and the process of getting them into production. To illustrate this: the average time to production for ML models is 12 weeks. That’s roughly three months, which is not ideal.
The MLOps loop can be complicated with some bottlenecks along the way: data collection, data processing, feature engineering, data labeling, model building, training, optimizing, deploying, risk monitoring, and retraining. And in each organization, different people and teams may own one or more steps.
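To give a feel for the “risk monitoring” and “retraining” end of that loop, here is a minimal, purely illustrative sketch. Using scikit-learn is my assumption, not something the loop prescribes; the threshold and data are made up. The check retrains a model when its accuracy on freshly labeled data slips:

```python
# Illustrative sketch only: monitor a deployed model's accuracy on fresh,
# labeled data and retrain if it degrades. Thresholds and data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

RETRAIN_THRESHOLD = 0.80  # hypothetical acceptable accuracy


def monitor_and_maybe_retrain(model, X_new, y_new):
    """Score the model on newly labeled data; retrain on it if accuracy slips."""
    live_accuracy = accuracy_score(y_new, model.predict(X_new))
    if live_accuracy < RETRAIN_THRESHOLD:
        model.fit(X_new, y_new)  # in real systems: retrain on the full history
    return model, live_accuracy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 3))
    y_train = (X_train[:, 0] > 0).astype(int)
    model = LogisticRegression().fit(X_train, y_train)

    # Simulate drifted production data whose labels follow a different feature.
    X_live = rng.normal(loc=1.5, size=(100, 3))
    y_live = (X_live[:, 1] > 1.5).astype(int)
    model, acc = monitor_and_maybe_retrain(model, X_live, y_live)
    print(f"accuracy on fresh data before the retrain check: {acc:.2f}")
```

In practice this sort of check lives in a scheduler and a model registry rather than a script, and the monitoring and retraining halves are often owned by different teams, which is exactly where the hand-off friction shows up.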
Why AI Falls Flat
What’s worse, nearly half of the models are shelved for performance or cost reasons, which makes AI less transformational than many hoped. Organizations have to think harder about how to integrate DevOps and MLOps, and about what tools can help.
I sometimes read SeattleDataguy, maybe one of the best Substacks on data science right now in 2022:
This is more his realm of expertise.
Clearly, in the real world, the reasons why A.I. isn’t so transformative have to be dealt with head on. If AI is to be the “brains” of applications, a world where ML models are heavily specialized, requiring unique and customized workflows and tools, is problematic.
Companies like Snowflake and Databricks are looking to create easier access to applications, machine learning models, and dashboards through their data marketplaces. They want to be your data platform, not your data warehouse or lakehouse. - Seattle Data Guy
One of the reasons I like Seattle Data Guy is because he’s also often a guest on YouTube podcasts; I find this supplements his Substack and LinkedIn posts well. In case you are wondering who this guy really is, it’s Benjamin Rogojan.
Ben on what is Data Science
Ben Rogojan is a data engineering solutions architect with expertise in data architecture and statistics. He focuses on developing end-to-end data solutions that help take data from raw format into data products and analytics.
Ben has nearly 50k followers on Medium. I believe he does consulting as well. I also view him as definitely a pioneer of Substack’s data science community. On his LinkedIn, he says he talks about #bigdata, #datainfra, #datascience, #dataengineering, and #datawarehousing. LinkedIn has an incredible data science community (check out my list). I recommend you super-follow (tap on the notification bell) all of the people on this list.
MLOps Cycle
For developing machine learning solutions, the standard lifecycle goes like this (a minimal code sketch follows the list):
Requirement gathering
Exploratory data analysis
Feature engineering
Feature selection
Model creation
Model hyperparameter tuning
Model deployment
Retraining, if needed
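For readers who like to see the steps as code, here is a minimal sketch of roughly steps 3 through 7. Scikit-learn and its built-in toy dataset are my assumptions, not part of any standard: feature engineering, feature selection, model creation, hyperparameter tuning, and “deployment” reduced to saving the trained artifact.

```python
# Minimal sketch of part of the ML lifecycle: feature selection, model creation,
# hyperparameter tuning, and a toy "deployment" (persisting the model to disk).
# Dataset and parameter grid are illustrative assumptions, not recommendations.
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = Pipeline([
    ("scale", StandardScaler()),                   # feature engineering
    ("select", SelectKBest(f_classif, k=10)),      # feature selection
    ("model", LogisticRegression(max_iter=1000)),  # model creation
])

# Model hyperparameter tuning via cross-validated grid search.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))

# "Deployment" in miniature: persist the tuned pipeline for a serving process.
joblib.dump(search.best_estimator_, "model.joblib")
```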
The fact is, once an ML model is trained and ready, we should be able to work with it as we do with any other software module, because it is just code and data.
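As a small, hypothetical illustration of that point: once the artifact from the sketch above is saved, any service can load it and call it like an ordinary function, with no ML-specific machinery in sight.

```python
# Illustrative only: once trained, a model is just code plus data on disk.
# Loading it and calling predict() looks like using any other library module.
# Assumes the "model.joblib" file produced in the previous sketch exists.
import joblib

_model = joblib.load("model.joblib")


def predict(features):
    """Wrap the trained model so callers treat it like a normal function."""
    return _model.predict([features])[0]


# Usage: pass one row of feature values (30 numbers for the toy dataset above).
# label = predict([14.1, 20.0, 92.0, ...])  # '...' stands for the remaining values
```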
The theory goes that since DevOps came first, MLOps has to integrate better with it and its loop. It still seems to lack a good bridge.
As you know, MLOps originated as a term for a set of best practices to design, build, deploy, and maintain machine-learning models in production. As the field evolves, however, its scope has expanded to the whole of ML lifecycle management.
It’s no surprise the Databricks blog often mentions MLOps.
So the current reality is sub-optimal at most organizations. Siloed teams of data engineers, data scientists, IT ops professionals, auditors, business domain experts, and ML engineering teams operate in a patchwork arrangement that bogs down the process. It’s not good. This means A.I. isn’t being implemented properly.
According to some ML engineers, when model creation and model deployment are forced together into one mega-process, it usually limits flexibility and choice in a way that creates obstacles. Organizations clearly need to revamp how they integrate their DevOps and MLOps vis-à-vis model creation as distinct from model deployment. I don’t know what the answer is, but these problems are unique to each organization and to the field as a whole.
Databricks vs. Snowflake
In some sense I view the Databricks vs. Snowflake debate also as symbolic. Snowflake is a relational database management system and analytics data warehouse for structured and semi-structured data.
Again, I’m not an engineer. Both are incredible companies. With enterprises large and small racing to build out their data infrastructure, one foundational piece these enterprise companies all need is an easy place to store their data.
Databricks has auto-scaling of clusters but is supposedly not so user-friendly. The UI is more complex, as it is aimed at a technical audience. It requires more manual input when it comes to things like resizing clusters, updating configurations, or switching options. There is a steeper learning curve to overcome.
Databricks helped popularize what is called a data lake, a place where you can dump all of your data, no matter the format. This is super convenient.
Some Terms
A data warehouse is the database of choice for general-purpose analytics, including reporting, dashboards, ad hoc queries, and any other high-performance analytics.
A data lake is a data store (only) for any raw structured, semi-structured, and unstructured data that makes data easily accessible to anyone. You can use it as a batch source for a data warehouse or any other workload.
A data lakehouse is often described as a new, open data management architecture that combines the best of a data lake with a data warehouse. The goal is to implement the best of a data lake and a data warehouse, and to reduce complexity by moving more analytics directly against the data lake, thereby eliminating the need for multiple query engines.
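To make those definitions a little more tangible, here is a toy sketch of the two access patterns. The file paths, table names, and the use of pandas plus SQLite as a stand-in warehouse are all assumptions for illustration only.

```python
# Toy contrast between the two patterns; paths and table names are made up.
# (pandas' to_parquet needs the pyarrow or fastparquet package installed.)
import os
import sqlite3

import pandas as pd

# Data lake pattern: dump raw files of any shape into cheap storage,
# then read them back later for batch processing (schema-on-read).
os.makedirs("lake", exist_ok=True)
raw_events = pd.DataFrame({"user": ["a", "b", "a"], "amount": [10.0, 5.0, 7.5]})
raw_events.to_parquet("lake/events.parquet")
events = pd.read_parquet("lake/events.parquet")

# Data warehouse pattern: load curated, structured tables and serve
# high-performance SQL for reporting and dashboards (SQLite as a stand-in).
conn = sqlite3.connect("warehouse.db")
events.to_sql("fact_events", conn, if_exists="replace", index=False)
report = pd.read_sql(
    "SELECT user, SUM(amount) AS total FROM fact_events GROUP BY user", conn
)
print(report)
```

A lakehouse engine such as Delta Lake aims to let you run that second, warehouse-style query directly against the files sitting in the lake, which is the “moving analytics against the data lake” idea described above.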
In reality, in 2022, I think many companies use Databricks and Snowflake together, so they aren’t really direct competitors per se. That being said, they are rising giants that are overlapping. Functionally, Databricks and Snowflake have been steadily moving into each other’s core markets - ETL and data processing, and data warehousing/lakehousing - for some time as they both try to become a data platform of choice for multiple workloads.
I think over time Databricks and Snowflake will create a better bridge between DevOps and MLOps, among others. This will reduce friction between A.I. model creation and model deployment, thereby reducing cost and improving efficiency, making A.I. easier to implement in the real world.
On the business side, I cannot wait for Databricks to go public with an IPO. Snowflake (SNOW) has a lot of great momentum; incredibly, it already has a market cap of $54.3 billion, with gross margins of 64%. By the time Databricks goes public, it could be worth approximately what Snowflake is worth, or maybe a little less. Databricks is worth around $38 billion following its latest fundraise of $1.6 billion in August 2021, led by Counterpoint Global.
How do you see DevOps and MLOps evolving together, and the data science community forming on Substack?
Thanks for reading! If you want to support the channel and allow me to continue to write newsletters, feel free to get access to more content.
Databricks and Snowflake, though similar, might be targeting different users and marketing: the former technical users, the latter nontechnical or business users. How to bridge DevOps and MLOps might depend on the kinds of data that need integrating. A single interface with options to choose the various data types might lead to an automation path. Who would build that interface, and how? An open-source project might evolve into such a product.