Growth Vectors in ML Foundation Models - Cost & Performance, Embeddings
Cost- and Performance-Optimized Models and Embeddings Models Have Seen Tremendous Growth in the Past Six Months
In the past few months, a multitude of new foundation models have been announced by both established AI companies and start-ups. This growth is driven by the need to build AI applications optimized for a wide variety of use cases, cost requirements and usage levels across Consumer and Enterprise verticals.
Here is a breakdown of the growth vectors that have been developing at the ML foundation model level, with representative examples for each of them, focused on releases from notable AI companies in the past few months. This collection includes models that are production-ready, having undergone fine-tuning for safety and alignment, as well as models that are meant for research and have not gone through that safety hardening.
The main categories are:
Cost & Performance: Optimized for low compute cost and reduced latency
Embeddings: Represent training data in vector form
Visual: Images and videos
Audio: Sounds, music and speech
3D: Three-dimensional assets
General Purpose: Versatile LLMs that handle a large variety of use cases
Domain Focus: Optimized for a vertical, such as finance, legal, health
Language Focus: Optimized for multiple or specific languages
This is an update to my November 2023 post on the types of differentiation for foundation models that were starting to emerge at the time.
Part 1 features models developed for Cost & Performance and for Embeddings, with details on the other categories coming in the next posts.
Cost & Performance
Launched in February 2024, the Gemma series features 2B and 7B parameter text-to-text models with an 8k context length, created by Google and derived from the Gemini series. Optimized to run on consumer hardware, such as developer laptops or desktop computers, Gemma is available for testing on Kaggle, for training and deployment through Google Cloud Vertex AI, and through Hugging Face (featuring a 🤗 Transformers integration) and GitHub. Released under a custom license, Gemma models have open weights and come in pre-trained and instruction-tuned variants.
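Since the models ship with a 🤗 Transformers integration, a minimal sketch of running Gemma locally could look like the example below. The model id google/gemma-2b-it and the prompt are assumptions chosen for illustration; the weights are gated, so the license must be accepted on Hugging Face first.

```python
# Minimal sketch: running Gemma via the 🤗 Transformers integration.
# Assumes the gated "google/gemma-2b-it" checkpoint and an accepted license on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Explain in one sentence why small models matter for on-device use."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```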
Stable LM 2 1.6B is a 1.6 billion parameter small language model with a context length of 4,096 tokens, trained on multilingual data in English, Spanish, German, Italian, French, Portuguese, and Dutch. Launched in January 2024, the model was trained on a mix of publicly available and synthetic datasets, utilizing Direct Preference Optimization (DPO). It can be used both commercially and non-commercially with a Stability AI Membership and can be tested on Hugging Face. Its technical report is available here.
Released in December 2023, Mixtral 8x7B is a sparse mixture-of-experts (SMoE) model with open weights. Licensed under Apache 2.0, it outperforms Llama 2 70B on most benchmarks with 6x faster inference. The Mixtral model handles code, English, French, Italian, German and Spanish with a context length of 32k tokens. It is deployed on the Mistral platform behind the mistral-small endpoint, currently in beta, and available through the Hugging Face repository.
Released in December 2023, phi-2 is a 2.7B parameter model from Microsoft Research with a 2,048-token context length, optimized for reasoning and language understanding. Trained on a mixture of synthetic and web datasets created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, it is claimed to “achieve better performance compared to 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math.” Phi-2 is available through the Azure AI Machine Learning Studio, where you can find the source code and model weights, released under the Microsoft Research License.
Launched in November 2023, Orca 2 is a 7B and 13B parameter model series optimized for reasoning abilities by imitating the step-by-step reasoning traces of more capable LLMs. Created by fine-tuning the Llama 2 models on synthetic data meant to teach various reasoning techniques, such as step-by-step processing, recall then generate, recall-reason-generate, extract-generate, and direct answer methods, Orca 2 “attains performance levels similar or better to those of models 5-10x larger, as assessed on complex tasks that test advanced reasoning abilities in zero-shot settings”, according to the technical report. The model series is available through Microsoft’s Hugging Face repository and the Azure AI Machine Learning Studio.
Launched in November 2023, the Nemotron-3 8B family from NVIDIA was developed with compute performance improvements in mind. Built to integrate with the NVIDIA TensorRT-LLM open-source library and the NeMo deployment framework, the models are meant to enable cutting-edge accuracy, low latency, and high throughput. Available through the Azure AI Studio and Hugging Face, the Nemotron-3 catalog includes a base model, three versions for chatbots and one version for Q&A applications. With a context length of 4,096 tokens, Nemotron allows for dynamic steering of responses by specifying desired attributes like quality, helpfulness, and toxicity at inference time, and is available under the NVIDIA AI Foundation Models Community License.
Made available in November 2023, Titan Text Lite from Amazon is a model with a 4k context length optimized for summarization and copywriting tasks. Together with Titan Text Express, an 8k context window model, it is exclusively available on Amazon Bedrock.
Launched in September 2023, Mistral 7B has been developed to be “compute efficient, helpful and trustworthy” and claims to outperform larger models, such as Llama 2 13B and Llama 1 34B. With support for English and code and an 8k token context length, it is available under an Apache 2.0 open source license on Mistral AI’s website, as well as through Amazon Bedrock, Microsoft’s Azure AI Studio, Google’s Vertex AI and Hugging Face.
Released in July 2023, MPT-7B-8k from MosaicML (acquired by Databricks) is a 7B parameter open source LLM with an 8k context length that specializes in document summarization and question answering. Available under the Creative Commons BY-SA 3.0 license for the base and long-form instruction versions and the Creative Commons BY-NC-SA 4.0 license for the chat version, MPT-7B-8k is optimized for faster training and inference, and for fine-tuning on domain-specific data on the MosaicML platform.
Embeddings
Embeddings are numerical representations of text that can be used to measure the relatedness between the concepts in a given piece of content. They are lists of floating point numbers that can be stored in a vector database and that capture the relationships between words and phrases across a large number of dimensions. The vector length, or the number of dimensions, is determined by the chosen embedding model.
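As a concrete illustration, relatedness between two embedding vectors is typically measured with cosine similarity. Here is a minimal sketch; the vectors are made-up four-dimensional examples rather than output from a real embedding model.

```python
# Minimal sketch: measuring relatedness between two embedding vectors
# with cosine similarity (vectors are illustrative, not from a real model).
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vec_cat = [0.21, -0.53, 0.88, 0.10]
vec_kitten = [0.19, -0.49, 0.90, 0.12]
vec_invoice = [-0.72, 0.33, -0.05, 0.61]

print(cosine_similarity(vec_cat, vec_kitten))   # close to 1.0: related concepts
print(cosine_similarity(vec_cat, vec_invoice))  # much lower: unrelated concepts
```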
Released in January 2024, text-embedding-3-small and text-embedding-3-large are two models from OpenAI that generate vectors with a default length of 1,536 and 3,072 dimensions, respectively. Both models are available through OpenAI’s embeddings API endpoint.
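A minimal sketch of calling that endpoint with the official openai Python client (v1+) might look like this; the input sentence is just an example, and an OPENAI_API_KEY is assumed to be set in the environment.

```python
# Minimal sketch: requesting an embedding from OpenAI's embeddings endpoint.
# Assumes the openai Python package (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Embeddings turn text into vectors of floating point numbers.",
)

vector = response.data[0].embedding
print(len(vector))  # 1536 for text-embedding-3-small
```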
Embed v3 was launched in November 2023 by Cohere and is optimized to evaluate both how well a query matches a document's topic and the overall quality of the content. This means that it can rank the highest-quality documents at the top, which is especially helpful when dealing with noisy datasets. The Embed v3 series features two English-only and two multilingual versions, with a dimension length of 1,024 for the base models and of 384 for the light versions. The multilingual models support 100+ languages.
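For comparison, a minimal sketch of generating Embed v3 vectors with Cohere's Python SDK is shown below; the model name embed-english-v3.0 and the input_type value follow Cohere's documented conventions for the v3 series, and the text and API key are placeholders.

```python
# Minimal sketch: generating an embedding with Cohere's Embed v3 series.
# Assumes the cohere Python SDK and a valid API key; the text is a placeholder.
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")

response = co.embed(
    texts=["How do I rotate my API keys safely?"],
    model="embed-english-v3.0",
    input_type="search_query",  # v3 models expect an input_type
)

vector = response.embeddings[0]
print(len(vector))  # 1024 for the base English model
```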
Launched in September 2023, Titan Text Embeddings from Amazon supports more than 25 languages, handles up to 8k tokens of input text and outputs vectors of 1,536 dimensions. Another model from Amazon is Titan Multimodal Embeddings, released in November 2023, which handles inputs of English text, images of up to 25MB, or combinations thereof, and generates embeddings with output sizes of 256, 384 or 1,024 dimensions. Both of Amazon’s models are available through Bedrock.
SONAR is a multilingual and multimodal fixed-size sentence embedding space introduced by Meta in August 2023. Covering 200 languages, it allows for text-to-text and speech-to-text machine translation, including zero-shot language and modality combinations. Released under a combination of licenses (MIT for the code and a non-commercial license for some of the speech encoders), SONAR is available on GitHub and is described in the research paper here.
A technology with as many applications as GenAI will continue to grow in a large variety of directions to support the various use cases and implementations it is meant for. Stay tuned for Part 2, with more details on the other growth vectors at the foundation model level.
If you know of more representative models for these categories that I haven’t mentioned here yet, let me know so I can add them.
Bonus: if you haven’t already, check out my latest guest post on the AI Supremacy newsletter about the portfolio of start-up investments that OpenAI and its Start-up Fund have made in 2023 and 2024 here.
The "cost and performance focused ML models " show quite a large range of differently sized LLMs. From 1.6B up to 8x7B, maybe a more narrow overview would be of value. What do you think?