20 Popular Open Source AI Developer Tools
Lots of Opportunity to Contribute and Benefit in Open Source AI
This article is a guest post on the AI Supremacy newsletter from earlier this week.
This is a collection of some of the most popular open source AI and ML developer tools, ranked by the number of stars they have on GitHub, for projects active in 2023 and 2024. It focuses on developer applications used to train and deploy ML models and AI agents. Its purpose is to highlight the breadth and diversity of the tools and frameworks being built by the open source AI community, and the vast potential in the space.
The collection contains:
Model training and inference frameworks:
General purpose ones - TensorFlow, PyTorch, Hugging Face Transformers, Diffusers
Optimized to run on local clients - GPT4All, Ollama, MLC LLM
AI agent development kits - AutoGen, Semantic Kernel, Unity ML Agents Toolkit
Vector databases - Milvus, Faiss
Orchestration tools - LangChain, LlamaIndex, Flowise, MindsDB
Compute optimization libraries - ColossalAI, DeepSpeed, Ray, vLLM
TensorFlow is a widely used library for training and inference of ML models that offers significant versatility and scalability across platforms, from consumer hardware to clusters of servers. Open sourced in November 2015 by Google, it has since collected over 182,000 stars on GitHub.
Hugging Face Transformers is a library that provides access to the vast collection of open source models hosted on Hugging Face, along with tools to fine-tune and deploy them. It is compatible with TensorFlow, PyTorch and JAX and includes support for text, image and audio models. Launched in 2018, Transformers has gathered 124,000 stars on GitHub.
LangChain is a framework for developing and deploying applications powered by large language models, built by the company of the same name. The tool provides modular building blocks and components for building custom chains. It also allows inspection, monitoring and evaluation of applications and can turn any chain into a REST API. Launched in October 2022, LangChain has achieved over 81,000 stars on GitHub.
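The idea of building an application out of modular, chainable steps can be illustrated without the library. The following is a minimal plain-Python sketch, not LangChain's API: `make_chain`, `build_prompt`, `fake_llm` and `strip_brackets` are hypothetical names, and the "model" is a stub that only echoes its input.

```python
from functools import reduce
from typing import Callable

# A step is any function from text to text (hypothetical convention for this sketch).
Step = Callable[[str], str]

def make_chain(*steps: Step) -> Step:
    """Compose steps left to right: the output of one step feeds the next."""
    def run(text: str) -> str:
        return reduce(lambda value, step: step(value), steps, text)
    return run

def build_prompt(question: str) -> str:
    # Prompt-template step.
    return f"Answer briefly: {question}"

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call; it only echoes the prompt.
    return f"[model output for: {prompt}]"

def strip_brackets(text: str) -> str:
    # Output-parsing step: remove the surrounding square brackets.
    return text.strip("[]")

chain = make_chain(build_prompt, fake_llm, strip_brackets)
result = chain("What is a vector database?")
print(result)
```

Because every step shares the same signature, steps can be swapped or reordered without touching the rest of the chain, which is the property that makes this style of framework modular.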
PyTorch is a very popular general purpose framework and library for training and inference of ML models. It was developed by Facebook’s AI Research lab and released in October 2016, and has over 77,000 stars on GitHub.
GPT4All, provided by Nomic AI, is a client that can install and run AI models on consumer-grade and edge hardware. Optimized for CPU-only, no-internet environments, GPT4All runs on Windows, macOS and Ubuntu and can run a range of models, including Alpaca, Llama, Pythia, Mosaic, Falcon, StableLM and custom GPT4All ones. Launched in March 2023, it has gained over 63,000 stars on GitHub.
Ollama is a tool that enables users to run open source LLMs locally on their Windows, macOS or Linux machines. With support for a variety of models, including Gemma, Llama, Mistral, Mixtral, Command-R and LLaVA, the tool has gathered over 52,000 stars on GitHub since its launch in February 2023. Besides the GitHub repository, resources for Ollama are available on the website and Discord channel.
ColossalAI provides a collection of parallelism components for distributed training and inference of models with a few lines of code. Its tools support data, pipeline, tensor and sequence parallelism, as well as a zero redundancy optimizer and a method for automatic management of parallelization. Since its creation in 2021, the ColossalAI repository has gained over 37,000 stars on GitHub.
DeepSpeed, provided by Microsoft, is a deep learning optimization library for distributed training and inference. It is designed for large to very large models, whether they are trained at scale on hundreds or thousands of GPUs or run on resource-constrained systems, while still delivering low latency and high throughput for inference. Since its launch in May 2020, it has gathered over 32,000 stars on GitHub.
LlamaIndex is a data framework for LLM-based applications that use Retrieval Augmented Generation. The tool provides the abstractions needed to ingest, structure and access private or domain-specific data more easily, so that it can be injected into LLMs. Its components include data connectors that get existing data from its native source and format, data indexes that structure it in intermediate representations, and engines that provide natural language access to the data. Since its launch in November 2022, LlamaIndex has collected over 30,000 stars on GitHub.
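The ingest, index and query stages of Retrieval Augmented Generation can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not LlamaIndex's API: the function names are hypothetical, the "index" is a bag of words per document rather than an embedding index, and the final prompt would be passed to an LLM that is not shown here.

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents: dict[str, str]) -> dict[str, Counter]:
    """Intermediate representation: one bag of words per document."""
    return {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}

def retrieve(index: dict[str, Counter], query: str, top_k: int = 1) -> list[str]:
    """Rank documents by how often they contain the words of the query."""
    query_words = set(tokenize(query))
    scores = {doc_id: sum(bag[w] for w in query_words) for doc_id, bag in index.items()}
    ranked = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ranked[:top_k]

def build_prompt(documents: dict[str, str], doc_ids: list[str], question: str) -> str:
    """Inject the retrieved text into the prompt that an LLM would receive."""
    context = "\n".join(documents[doc_id] for doc_id in doc_ids)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hypothetical private documents, invented for this example.
documents = {
    "hr": "Employees accrue 25 vacation days per year.",
    "it": "Reset your password from the account settings page.",
}
question = "How many vacation days do employees get?"
index = build_index(documents)
hits = retrieve(index, question)
prompt = build_prompt(documents, hits, question)
print(prompt)
```

A production framework replaces the word-count scoring with embeddings and a vector store, but the overall flow of connectors, index and query engine has the same shape.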
Ray is a unified compute framework for scaling AI and Python workloads, from reinforcement learning and deep learning to hyperparameter tuning and model serving. It consists of a core distributed runtime and a set of AI libraries for simplifying ML compute, including ones that deal with datasets, distributed training, hyperparameter tuning and inference. Developed by Anyscale, Ray has over 30,000 stars on GitHub.
Milvus is a vector database built to power embedding similarity search and AI applications. It is used to make unstructured data search more accessible regardless of the deployment environment. Created by Zilliz and hosted by the Linux Foundation AI & Data organization, Milvus has gathered over 36,000 stars on GitHub.
Faiss (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors, developed by Meta AI. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning. Since its launch in March 2017, it has collected over 27,000 stars on GitHub.
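To make "similarity search over dense vectors" concrete, here is a brute-force baseline in plain Python. It is only a sketch of the exact search that vector libraries and databases accelerate with specialized indexes, not the Faiss or Milvus API; `squared_l2` and `search` are hypothetical helper names and the vectors are made up.

```python
import heapq

def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(vectors: list[list[float]], query: list[float], k: int) -> list[tuple[float, int]]:
    """Return (distance, index) pairs for the k stored vectors nearest to the query."""
    scored = ((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

# Toy two-dimensional "embeddings"; real ones have hundreds of dimensions.
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
neighbours = search(vectors, [1.0, 1.0], k=2)
for distance, position in neighbours:
    print(position, distance)
```

This scan compares the query against every stored vector, so its cost grows linearly with the size of the collection. That cost is what approximate and compressed indexes are designed to avoid.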
AutoGen is a multi-agent conversation framework that works as a high-level abstraction for building workflows using multiple LLMs. Developed by Microsoft, it is meant to support the development of modular and complex agents that perform sophisticated tasks using AI models, as well as a variety of other tools and components. Since its launch in September 2023, AutoGen has collected over 24,000 stars on GitHub.
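A multi-agent conversation is, at its core, a loop in which agents take turns replying to each other until a stop condition is met. The sketch below shows that loop in plain Python. It is not the AutoGen API: `Agent`, `converse` and the two reply functions are hypothetical, and the replies are hard-coded rules standing in for LLM calls.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    reply: Callable[[str], str]  # stand-in for an LLM-backed reply function

def converse(first: Agent, second: Agent, opening: str, max_turns: int = 6):
    """Alternate messages between two agents until one says DONE or turns run out."""
    transcript = [(first.name, opening)]
    speaker, message = second, opening
    for _ in range(max_turns):
        message = speaker.reply(message)
        transcript.append((speaker.name, message))
        if "DONE" in message:
            break
        # Hand the turn to the other agent.
        speaker = first if speaker is second else second
    return transcript

def writer_reply(message: str) -> str:
    # Hypothetical rule: produce a second draft when asked to revise.
    return "draft v2" if "revise" in message else "draft v1"

def reviewer_reply(message: str) -> str:
    # Hypothetical rule: approve only the second draft.
    return "approved, DONE" if message == "draft v2" else "please revise"

writer = Agent("writer", writer_reply)
reviewer = Agent("reviewer", reviewer_reply)
transcript = converse(reviewer, writer, "please write a summary")
for name, message in transcript:
    print(f"{name}: {message}")
```

The `max_turns` cap matters in practice: without it, two agents that never emit the stop signal would keep talking indefinitely.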
Hugging Face Diffusers is a library for fine-tuning and deployment of pretrained diffusion models for generating images, audio, and 3D objects. It includes diffusion pipelines for inference, interchangeable noise schedulers for different diffusion speeds and output quality and access to pretrained models from the Hugging Face platform. Diffusers has over 22,000 stars on GitHub.
Flowise is a low-code, drag-and-drop tool for developing customized LLM orchestration flows and AI agents. Built around customization and modularity, the framework supports integrations with other frameworks, the creation of autonomous agents for various tasks and the integration of open source, locally run LLMs. Since its launch in February 2023, Flowise has collected over 21,000 stars on GitHub.
MindsDB is a platform that automates pipelines connecting real-time enterprise data to AI systems. It is used to train and customize models, automate tasks, define and execute trigger events and provide observability. Since its launch in 2017, it has gathered over 21,000 stars on GitHub.
Semantic Kernel from Microsoft is an SDK that lets developers build AI agents that can call on existing code. It lets them mix conventional programming languages, like C# and Python, with LLMs using prompt templating, chaining, and planning capabilities in order to build AI experiences into existing applications. Released in March 2023, Semantic Kernel has gathered over 17,000 stars on GitHub.
vLLM is a high-throughput and memory-efficient inference engine for LLMs that uses PagedAttention, an algorithm for the management of attention keys and values. Developed at UC Berkeley, it was launched in September 2023 with the paper “Efficient Memory Management for Large Language Model Serving with PagedAttention” and has since collected over 17,000 stars on GitHub.
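The core idea behind paged management of attention keys and values is that the cache is split into fixed-size blocks, and each sequence keeps a table mapping its logical token positions to physical blocks that are allocated only when needed. The following plain-Python sketch shows only that bookkeeping. It is not vLLM's implementation or API: `BlockAllocator` and `Sequence` are hypothetical classes, and no actual tensors are stored.

```python
class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV-cache blocks")
        return self.free_blocks.pop(0)

    def release(self, blocks: list[int]) -> None:
        # Blocks from a finished sequence go back to the pool.
        self.free_blocks.extend(blocks)


class Sequence:
    """Tracks which physical blocks hold one sequence's keys and values."""

    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is requested only when the previous one is full.
        if self.num_tokens % self.allocator.block_size == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, position: int) -> tuple[int, int]:
        """Map a logical token position to (physical block, offset in block)."""
        block_size = self.allocator.block_size
        return self.block_table[position // block_size], position % block_size


allocator = BlockAllocator(num_blocks=8, block_size=4)
seq_a, seq_b = Sequence(allocator), Sequence(allocator)
for _ in range(5):
    seq_a.append_token()
for _ in range(2):
    seq_b.append_token()
print(seq_a.block_table, seq_b.block_table, seq_a.locate(4))
```

Because memory is claimed one block at a time, a sequence never reserves space for tokens it has not generated yet, and blocks released by finished sequences can be reused by others immediately.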
Machine Learning Compilation for Large Language Models (MLC LLM) is a universal deployment solution that lets any large language model be deployed natively, through native APIs and with compiler acceleration. The mission of the project is to enable everyone to develop, optimize and deploy AI models natively on their own devices with ML compilation techniques. Developed by researchers at Carnegie Mellon University, MLC LLM has over 16,000 stars on GitHub.
The Unity Machine Learning Agents Toolkit enables games and simulations to serve as environments for training intelligent agents. It provides PyTorch-based algorithms that let game developers train intelligent agents for 2D, 3D and VR/AR games. The agents can be used for multiple purposes, including controlling NPC behavior (in a variety of settings such as multi-agent and adversarial), automated testing of game builds and evaluating different game design decisions pre-release. The toolkit has gathered over 16,000 stars on GitHub.