Market Map & Analysis: AI Synthetic Data Companies
3 Segments Emerge: Computer Vision, Specialized and General Purpose Synthetic Data Generation Platforms
Synthetic data tools are important to the development of ML because they augment or replace real-world data to train and improve models, protect sensitive information, mitigate bias and validate the quality of AI applications. Synthetic data is particularly useful in sectors and use cases with regulations or limitations for data handling and in instances where real-world information is difficult or costly to obtain.
The functions that synthetic data performs include:
Identify gaps and biases in real-world datasets
Construct datasets with specific, customizable properties
Teach a model for a specific use case and environment
Test and improve the accuracy of a model
Lower the cost and time to market
Explore the effects of certain data characteristics on a model
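As a rough illustration of the first two functions, the sketch below measures how under-represented each class is in a small labeled dataset and then tops up the minority class with synthetic rows. The function names `class_gaps` and `synthesize` and the toy data are hypothetical, and the Gaussian jitter is only a naive stand-in for the generative models that commercial platforms use.

```python
import random
from collections import Counter


def class_gaps(labels):
    """Return how many extra samples each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


def synthesize(rows, labels, seed=0, noise=0.05):
    """Create jittered copies of real rows until every class is equally represented.

    Assumption: adding small Gaussian noise to a real row is an acceptable
    placeholder for a learned generator in this toy example.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    new_rows, new_labels = [], []
    for label, missing in class_gaps(labels).items():
        for _ in range(missing):
            base = rng.choice(by_class[label])
            new_rows.append([x + rng.gauss(0, noise) for x in base])
            new_labels.append(label)
    return new_rows, new_labels


# Toy dataset: three "ok" rows and a single "fraud" row.
rows = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.25], [0.90, 0.80]]
labels = ["ok", "ok", "ok", "fraud"]

print(class_gaps(labels))  # {'ok': 0, 'fraud': 2}
extra_rows, extra_labels = synthesize(rows, labels)
print(Counter(labels + extra_labels))  # both classes now have 3 rows
```

In practice the generated rows would also be checked for realism and privacy leakage before being mixed into a training set.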
Synthetic data for computer vision got its start in the world of autonomous vehicles (AVs). It is particularly useful there given the difficulty and importance of edge cases and the need to generate substantial amounts of data to simulate the long tail of driving scenarios. Computer-simulated worlds make it possible to automatically produce thousands or millions of permutations of any imaginable driving scenario, for example by changing the locations of other cars, adding or removing pedestrians, increasing or decreasing vehicle speeds, or adjusting the weather.
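The combinatorial growth of scenario permutations can be sketched in a few lines. The parameter names and values below are hypothetical; a real simulator would expose far more dimensions and would render sensor data for each combination rather than just listing it.

```python
from itertools import product

# Hypothetical scenario parameters, chosen only for illustration.
SCENARIO_SPACE = {
    "weather": ["clear", "rain", "fog", "snow"],
    "time_of_day": ["day", "dusk", "night"],
    "pedestrian_count": [0, 1, 5],
    "ego_speed_kph": [30, 50, 80],
}


def enumerate_scenarios(space):
    """Yield one dict per combination of parameter values (Cartesian product)."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


scenarios = list(enumerate_scenarios(SCENARIO_SPACE))
print(len(scenarios))  # 4 * 3 * 3 * 3 = 108
print(scenarios[0])
```

Four small parameters already yield 108 scenarios; each added parameter multiplies the count, which is how simulated worlds reach thousands or millions of variations.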
With the advent of Large Language Models, synthetic data is also increasingly being generated for other data types, such as text, tabular and time-series data. Some companies specialize in synthetic data for a given sector, such as healthcare or financial services, while others offer general purpose tools. Types of data that can be synthetically generated include physical, email and HTTP addresses; individual and business names; dates and times; file names; phone numbers and Social Security Numbers; SKUs and product names; and Internet usage and browser history information.
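A minimal sketch of generating several of these field types with only the Python standard library follows. The name and street lists, the field formats and the `fake_record` and `generate` helpers are all assumptions made for illustration; commercial tools instead learn formats and distributions from real data.

```python
import random
import string
from datetime import date, timedelta

# Tiny hypothetical vocabularies; real tools draw on much larger sources.
FIRST_NAMES = ["Ana", "Ben", "Chen", "Dara", "Eli"]
LAST_NAMES = ["Ito", "Khan", "Lopez", "Meyer", "Novak"]
STREETS = ["Oak St", "Pine Ave", "Maple Rd"]


def fake_record(rng):
    """Build one synthetic customer record from a seeded random generator."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    letters = "".join(rng.choices(string.ascii_uppercase, k=3))
    return {
        "name": f"{first} {last}",
        # example.com is a domain reserved for documentation and examples.
        "email": f"{first}.{last}@example.com".lower(),
        "address": f"{rng.randint(1, 9999)} {rng.choice(STREETS)}",
        # 555-0100 through 555-0199 are reserved for fictional use.
        "phone": f"555-01{rng.randint(0, 99):02d}",
        # SSNs with area numbers 900-999 are not issued, so these cannot
        # collide with a real person's number.
        "ssn": f"9{rng.randint(0, 99):02d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}",
        "sku": f"{letters}-{rng.randint(0, 99999):05d}",
        # A date in 2020.
        "signup_date": (date(2020, 1, 1) + timedelta(days=rng.randint(0, 364))).isoformat(),
    }


def generate(n, seed=0):
    """Return n records; the same seed always yields the same records."""
    rng = random.Random(seed)
    return [fake_record(rng) for _ in range(n)]


for record in generate(3, seed=42):
    print(record)
```

Seeding the generator keeps test datasets reproducible, and drawing identifiers from reserved ranges avoids accidentally emitting a value that belongs to a real person.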
Computer Vision Companies Span Human-Centric Use Cases and Outdoor and Environmental Perception
Computer vision is expanding beyond automotive to use cases such as retail shelf design and management, fitness applications, virtual try-on for apparel and cosmetics, and building security and intelligence applications.
Neurolabs provides new-generation image recognition technology for retail and consumer packaged goods. It has image recognition datasets for over 100,000 SKUs and is developing ChatCPG, a retail shelf auditing tool for insights into in-store visits and customer behaviour. Headquartered in Edinburgh, UK, it raised a US$3.5M Seed round in May 2022.
Synthesis AI generates fully labeled synthetic humans in near-infinite combinations of environments, lighting and scenarios, to train ML models on scenes with multiple humans engaging in a variety of activities. Use cases for its technology include avatar creation, pose understanding, ID verification, AR/VR/XR, driver monitoring and pedestrian detection, among others. Headquartered in San Francisco, CA, it raised a US$17M Series A in April 2022.
Datagen generates data on full-body humans in context, interacting with objects and environments, and is developing a synthetic data management platform with generation, versioning and sharing capabilities. With headquarters in Tel Aviv, Israel, it raised a US$50M Series B in March 2022.
Parallel Domain specializes in automotive environments with synthetic data generation for a variety of driving scenarios, regions, agents and environmental conditions. Its use cases include Advanced Driver Assistance, L4/L5 autonomy and drones / aerial applications. Headquartered in Vancouver, Canada, it raised a US$30M Series B in November 2022.
Rendered.ai provides a PaaS for generating physics-based synthetic data, such as procedurally modeled landscapes, vegetation, buildings, water bodies, and cities. The use cases for its technology include automotive, medical imagery, robotics, satellite imagery, and security. With headquarters in Bellevue, WA, Rendered.ai raised a US$6M Seed round in October 2021.
Specialized Synthetic Data Platforms Compete with General Purpose Tools
MDClone’s ADAMS platform is a self-service data analytics environment for healthcare collaboration, research, and innovation. Its use cases include access to and exploration of medical data, as well as cross-organizational sharing and collaboration. With headquarters in Tel Aviv, Israel, it last raised a US$50M Series B in March 2022.
FinCrime Dynamics provides synthetic data for financial crime prevention. Its technology enables financial institutions to create tailored data based on the latest financial crime simulations and use it to build high-performance machine learning models for crime detection. Its headquarters are in Cambridge, UK.
With the tagline “Set your enterprise data free”, Hazy re-engineers enterprise data to make it faster, easier and safer to use. Its customers include financial services institutions, telecommunications companies, and government and research agencies. Hazy has its HQ in London, UK, and it last raised a US$9M Series A in March 2023.
Gretel’s motto is “We help developers build with data, together.” It develops technology to generate synthetic datasets with the same characteristics as real-world data, so that AI models can be trained and tested without compromising quality or privacy. It supports relational, tabular, unstructured text, time-series and image data, and it can validate models and use cases with quality and privacy scores. Headquartered in San Diego, CA, it raised a US$50M Series B in October 2021.
Tonic’s motto is “We fake things seriously.” It creates realistic, targeted, representative test data based on a customer's real-world data, preserving critical relationships and maintaining input-to-output consistency across tables and databases. It can also model, shape, and size the data to specific requirements. With headquarters in San Francisco, CA, it last raised a US$35M Series B in September 2021.
Datomize’s tagline is “Limited data. Unlimited insights” and it enables the generation of synthetic data with lower bias. Its technology offers the Data Health Report, which validates source data and provides quality and balance scores and insights into data gaps. With headquarters in Tel Aviv, Israel, Datomize raised a US$6M Seed round in February 2021.
Synthesized is “the fastest way to create trusted data” and its use cases for synthetic data generation include customer data analytics, customer churn projection, data monetization in financial services, bias mitigation in recruitment processes, data sharing with third parties, fraud prevention, clinical trials, insurance quote conversion, and ecommerce trend projection. With headquarters in London, UK, Synthesized last raised a US$2.8M Seed round in March 2020.
Diveplane can create a safe, accurate synthetic twin of real-world data, with use cases in financial services, healthcare and defense applications. Its technology can also be applied to data on buying habits and trends, population and census information, creditworthiness and financial transactions, travel and vacation preferences, and medical records and diagnosis trends. Diveplane has its headquarters in Raleigh, NC, and it last raised a US$25M Series A round in September 2022.
Mostly.ai provides a synthetic data platform for smart data imputation, data augmentation, rebalancing, automated data privacy and diversity, and data exploration. Based in Vienna, Austria, the company raised a US$25M Series B round in January 2022.
The more generative AI develops into a varied and rich universe of consumer and enterprise applications, the more demand there will be for synthetic data to augment and improve ML models. Like the AI field itself, this sector is only just emerging, and it is exciting to see it unfold across its wide variety of use cases.