Databricks' AI Business and Product Strategy
Unstructured Data Management is Becoming Even More Valuable with the Growth of AI Applications
This article appeared as a guest post in the AI Supremacy newsletter earlier this week.
Founded in 2013 as a data storage and management company, Databricks has been successfully growing into a data + AI venture, as the title of last week’s conference suggests. Over the past year in particular, the San Francisco-based firm has added an entire layer of AI development tools to complement and integrate with its data lakehouse platform. The new products come from in-house development as well as acquisitions, the most notable being MosaicML in July 2023. This is an overview of Databricks’ AI business strategy and product portfolio.
With data, and particularly unstructured data, representing such an important part of the ML value chain, Databricks’ expansion into the adjacent market of AI development and deployment tools is an organic growth opportunity into an emerging segment with tremendous potential. At the same time, not adapting to the emergence of AI constitutes a significant risk to Databricks’ core product offering, as customers would have to duplicate or move their data to other platforms to use it for their AI solutions.
As companies experiment with and implement ways to differentiate and capture market share with AI, their proprietary data comes to the forefront as an important source of competitive advantage in the following ways:
The quality of the data used to train a new model is very important, potentially more so than the quantity, as shown, among others, by Microsoft’s Phi series, a collection of small, high-performing models trained on textbook-quality data that score as high as much larger models on popular benchmarks. For companies, this means that they don’t have to license 3rd-party solutions with hundreds of billions of parameters in order to develop well-performing AI applications, provided they manage the quality of their training data.
There is demand in the market for highly specialized and accurate models for a wide variety of use cases for which public training data is not available in sufficient quantity. Companies with exclusive and proprietary datasets in areas such as finance, legal, healthcare and logistics are well positioned to develop models and AI products tailored to specific, in-demand use cases that cannot be served as well by general-purpose solutions.
The vast majority of data generated and stored by businesses is unstructured and, until now, relatively inaccessible to analysis and the extraction of insights. Generative AI makes it possible for companies to uncover intelligence from those vast quantities of unstructured data about their market, customers and operations to create new product lines and marketing strategies or streamline processes.
Databricks is positioning itself in the market as the ideal partner for businesses to take advantage of these value drivers in the data space and realize competitive advantage from using their proprietary, high-quality, exclusive (and largely unstructured) datasets to build AI applications for internal operations or commercialization.
Databricks’ AI business strategy can be described as follows:
Build an end-to-end AI deployment platform, complete with LLMs and all the tooling necessary, to allow customers to develop and deploy their own AI applications
Integrate the AI developer tools in the overall Databricks data management platform and enable customers to leverage their business data stored there to train, fine-tune and use models
Develop an ecosystem of up- and downstream partnerships and 3rd-party data connectors (e.g. cloud providers, consulting agencies, analytics companies) to facilitate access to Databricks services
At the product level, the integration of AI tools augments the customer value proposition:
Current users get the opportunity to test and build AI within their data storage service, without having to search for other platforms and duplicate / manage the data
Customer retention and engagement are improved, as the Databricks platform will be used for even more business functions and will be even more integrated into customers’ operations
Potential users are enticed with comprehensive, flexible and private AI development tools, which, if successfully deployed, can lead to more data management usage
Last week, at the 10th Data + AI Summit in San Francisco, the company announced an array of new AI products and features centered on the themes of serverless AI development, compound systems (also called agents) and integration with Databricks’ data storage and management platform.
The first theme is meant to strengthen the value proposition of Mosaic AI and Databricks’ ML development tools and make them available to as many engineers and researchers from as many companies as possible. A serverless development platform is easier and faster to use and more accessible to businesses with limited engineering resources.
Compound AI systems are intelligent agents made up of multiple models and subsystems, which can perform more complex tasks and use other software tools to complete projects. An early, and so far the most popular, example of a compound system is Retrieval-Augmented Generation (RAG), which uses a search index and engine alongside a foundation model in order to generate answers and provide quotations and sources for its responses.
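To make the retrieval-plus-generation pattern concrete, here is a minimal sketch in Python. It is not Databricks' or Mosaic AI's implementation: the corpus, the document ids, the keyword-overlap scoring and the `generate` placeholder are all assumptions made for illustration. A production system would typically use a vector index for retrieval and a real foundation model for generation.

```python
from collections import Counter

# Hypothetical toy corpus; ids and texts are invented for this example.
DOCUMENTS = {
    "doc-1": "The lakehouse stores structured and unstructured business data.",
    "doc-2": "Vector search finds passages that are similar to a query.",
    "doc-3": "Quarterly revenue grew because of new enterprise customers.",
}


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def score(query, text):
    """Crude relevance score: number of query tokens also found in the text."""
    q, t = Counter(tokenize(query)), Counter(tokenize(text))
    return sum(min(q[w], t[w]) for w in q)


def retrieve(query, k=2):
    """Return up to k (doc_id, text) pairs with a non-zero score, best first."""
    ranked = sorted(
        DOCUMENTS.items(),
        key=lambda item: score(query, item[1]),
        reverse=True,
    )
    return [(d, t) for d, t in ranked[:k] if score(query, t) > 0]


def build_prompt(query, passages):
    """Put the retrieved passages in front of the question, tagged by id."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the sources below and cite their ids.\n"
        f"{sources}\nQuestion: {query}"
    )


def generate(prompt):
    """Placeholder for a foundation-model call (assumption, not a real API)."""
    return "(model answer would go here)"


def answer(query):
    """Retrieve, build the prompt, generate, and return answer plus sources."""
    passages = retrieve(query)
    prompt = build_prompt(query, passages)
    return {
        "answer": generate(prompt),
        "sources": [doc_id for doc_id, _ in passages],
    }
```

Calling `answer("Why did revenue grow?")` returns the placeholder answer together with the ids of the passages that were handed to the model, which is what lets a RAG system show its sources alongside the response.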
One of the advantages of compound systems is increased accuracy, as the models within can each be fine-tuned to deliver better results for one sub-task. Additionally, the models can be trained to check each other’s work and correct it, if necessary, before they send it to the end-user.
All of the AI development tools are tightly integrated with the data storage and management platform and allow businesses to leverage their data system to build their AI applications.
Last week’s AI product announcements from Databricks include:
Lakehouse IQ - a knowledge engine for business data that can search, query and answer questions in natural language. The engine processes the data, usage patterns, and org charts of a company, generates a custom understanding of the business jargon and market context, and returns answers to employee questions. The Lakehouse IQ engine is used to power natural language features within the Databricks platform, such as in the SQL Editor, and can be accessed through its API to create 3rd-party AI enterprise assistants.
AI/BI - an agentic system that draws insights from the data stored on the Databricks platform. It provides low-code dashboards with standard business queries and a conversational interface with extended reasoning capabilities and contextual data that can answer more complex questions. AI/BI Dashboards are generally available on AWS and Azure and in public preview on GCP. Genie, the conversational interface, is available to all AWS and Azure customers in public preview, with availability on GCP coming soon.
Mosaic AI tools to support model and agent building and deployment:
API for fine-tuning of foundation models
Vector Search with support for Customer Managed Keys and Hybrid Search
Agent framework for development of compound systems
Model Serving support for agents and RAG systems
AI Tool Catalog and Function-Calling with support for SQL functions, Python functions, model endpoints, remote functions and retrievers
Agent evaluation and alignment tools
Earlier this year, Databricks released DBRX, a general-purpose, mixture-of-experts model with 132B total parameters, of which 36B are active on any input.
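The gap between total and active parameters comes from how a mixture-of-experts layer works: a router picks a few experts for each input and only those are evaluated. The sketch below illustrates the general technique with top-k routing in plain Python. It is not DBRX's actual architecture: the number of experts, the value of `k` and the toy expert functions are assumptions chosen for illustration, and only the 132B and 36B figures come from the text above.

```python
import math

# Figures quoted in the article (in billions of parameters).
TOTAL_PARAMS_B = 132
ACTIVE_PARAMS_B = 36

# Share of the model's parameters used for any single input.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B


def top_k_route(router_logits, k):
    """Pick the k experts with the highest router scores.

    Returns (expert_index, weight) pairs whose weights come from a softmax
    over the selected scores, so they sum to 1.
    """
    order = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    top = order[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, experts, router_logits, k):
    """Evaluate only the routed experts and mix their outputs by weight."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Hypothetical toy experts: each one just scales its input.
toy_experts = [lambda x, s=s: x * s for s in (1.0, 2.0, 3.0, 4.0)]
```

With the quoted figures, `active_fraction` works out to roughly 27%, meaning that about a quarter of the model's weights take part in processing any given input, which is why such a model can be cheaper to run than a dense model of the same total size.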
Last week, the company announced the launch of Shutterstock ImageAI Powered by Databricks, a text-to-image model built using Mosaic AI and trained exclusively on Shutterstock’s image repository. ImageAI is available in private preview on Mosaic AI Model Serving and live on Shutterstock.com/ai-image-generator.
Databricks’ AI growth in the past year has been supported by a series of acquisitions:
In March 2024, it acquired Lilac, a tool for data scientists to search, cluster, and analyze text datasets with a focus on generative AI. The tool allows researchers to explore data clusters, derive new data categories using human feedback and classifiers, and tailor datasets based on these insights. It also enables analysis of model outputs for bias or toxicity, and preparation of data for RAG and fine-tuning or pre-training LLMs.
In January 2024, it bought Einblick - an AI-native collaboration platform that helps users solve data problems with just one sentence. The tool integrates AI directly into the authoring surface to enable users to transform their thoughts into data workflows, which the product then enhances with contextual information and breaks into smaller solvable chunks using SQL, Python, and higher level logical operators.
In July 2023, it acquired MosaicML, a platform for building ML models and deploying them in AI applications.
In May 2023, Databricks purchased Okera, an AI-centric data governance company. Okera’s offering includes an AI-powered interface to discover, classify, and tag sensitive data such as personally identifiable information (PII). These tags enable data governance stakeholders to assess compliance and create no-code access policies that improve visibility and control over data.
Through its Ventures arm, Databricks is also an active investor in AI start-ups. Launched in December 2021, the investing department has recently announced an AI fund dedicated to enterprise applications. Publicly announced investments in AI companies include:
A March 2024 investment in Mistral AI, which provides foundation models and an AI deployment platform
A March 2024 investment in Unstructured, a company making tools for the pre-processing of unstructured data for use in developing large language models
A February 2024 investment in Glean, a provider of an enterprise search and document Q&A solution
A January 2024 investment in the Series B of Anomalo, a tool for the detection, root cause analysis, and resolution of data quality issues
A January 2024 investment in Perplexity, a consumer-focused, generative AI search and answer engine
An October 2023 investment in the Series A of Cleanlab, a data and AI pipeline quality management tool
By expanding its platform to include ML developer tools and enabling customers to build competitive advantage from their proprietary, exclusive data, Databricks is positioning itself at the forefront of AI innovation and securing an even brighter future for the business and its customers.
I find myself beyond fascinated by the companies that Databricks, Snowflake, Anthropic and OpenAI will acquire and invest in.
Companies like these are fascinating because watching them is like witnessing future big tech companies while they are still young. And whether they are generational or not, they must pretend that they embody the best of what the fashion of AI can bring in enchantments and little conveniences for their customers.
May all startups that go public in America capitalize their dreams for new capabilities of the future. These are the duopolies of the next era.