llama.cpp
🚧 Cortex.cpp is currently under development. Our documentation outlines the intended behavior of Cortex, which may not yet be fully implemented in the codebase.
llamacpp was formerly called "Nitro".
Introduction
llamacpp is a C++ inference library that any server can load at runtime. It submodules (and occasionally upstreams) llama.cpp for GGUF inference. llama.cpp runs on CPU and GPU, and is optimized for inference.
In addition to llama.cpp, llamacpp adds:
- Model orchestration, such as model warm-up and concurrent models.
Usage
cortex engines llama.cpp init
The command will check, download, and install these dependencies:
- Windows:
  - engine.dll
  - CUDA 11.7: cublas64_11.dll, cublasLt64_11.dll, cudart64_110.dll
  - CUDA 12.2: cublas64_12.dll, cublasLt64_12.dll, cudart64_12.dll, cudnn_ops_infer64_8.dll, cudnn64_8.dll
  - CUDA 12.4: cublas64_12.dll, cublasLt64_12.dll, cudart64_12.dll, nvrtc64_120_0.dll
  - MSBuild libraries: msvcp140.dll, vcruntime140.dll, vcruntime140_1.dll
- Linux:
  - libengine.so
  - CUDA 11.7: libcublas.so.11, libcublasLt.so.11, libcudart.so.11.0
  - CUDA 12.2: libcublas.so.12, libcublasLt.so.12, libcudart.so.12
  - CUDA 12.4: libcublas.so.12, libcublasLt.so.12
- macOS:
  - libengine.dylib
To include llamacpp in your own server implementation, follow the steps here.
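For a rough picture of what that integration involves, the sketch below loads the engine's shared library at runtime with dlopen and resolves a factory function. The export name (`get_engine`) and the header path for the engine interface are assumptions for illustration; follow the linked steps for the actual integration details.

```cpp
// Hypothetical sketch: load the llamacpp engine shared library at runtime.
// The factory symbol name ("get_engine") is an assumption for illustration.
#include <dlfcn.h>
#include <iostream>

#include "cortex-common/enginei.h"  // assumed location of the engine interface (see Code Structure)

int main() {
  // Open the engine library built for this platform (libengine.so on Linux,
  // libengine.dylib on macOS, engine.dll on Windows via LoadLibrary).
  void* handle = dlopen("./libengine.so", RTLD_NOW | RTLD_LOCAL);
  if (!handle) {
    std::cerr << "dlopen failed: " << dlerror() << std::endl;
    return 1;
  }

  // Resolve a factory function that creates the engine instance.
  using EngineFactory = EngineI* (*)();  // EngineI: interface described below
  auto create_engine =
      reinterpret_cast<EngineFactory>(dlsym(handle, "get_engine"));
  if (!create_engine) {
    std::cerr << "dlsym failed: " << dlerror() << std::endl;
    return 1;
  }

  EngineI* engine = create_engine();
  // ... call LoadModel / HandleChatCompletion on `engine` (see Interface below)

  dlclose(handle);
  return 0;
}
```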
Get GGUF Models
You can download precompiled models from the Cortex Hub on Hugging Face. These models include configurations, tokenizers, and dependencies tailored for optimal performance with this engine.
Interface
llamacpp has the following interfaces:
- HandleChatCompletion: Processes chat completion tasks.
void HandleChatCompletion(std::shared_ptr<Json::Value> jsonBody, std::function<void(Json::Value&&, Json::Value&&)>&& callback);
- HandleEmbedding: Generates embeddings for the input data provided.
void HandleEmbedding(std::shared_ptr<Json::Value> jsonBody, std::function<void(Json::Value&&, Json::Value&&)>&& callback);
- LoadModel: Loads a model based on the specifications.
void LoadModel(std::shared_ptr<Json::Value> jsonBody, std::function<void(Json::Value&&, Json::Value&&)>&& callback);
- UnloadModel: Unloads a model as specified.
void UnloadModel(std::shared_ptr<Json::Value> jsonBody, std::function<void(Json::Value&&, Json::Value&&)>&& callback);
- GetModelStatus: Retrieves the status of a model.
void GetModelStatus(std::shared_ptr<Json::Value> jsonBody, std::function<void(Json::Value&&, Json::Value&&)>&& callback);
All the interfaces above take the following parameters:
| Parameter | Description |
|---|---|
| `jsonBody` | The request body in JSON format. |
| `callback` | A function that handles the response. |
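As an illustration of how a host server might drive these interfaces, the sketch below loads a model and then requests a chat completion. It assumes the abstract interface from enginei.h is exposed as a class named EngineI, the request fields (model_path, messages, stream) are modeled on the OpenAI-compatible format rather than taken from the engine's actual schema, and the callback is assumed to receive a status object followed by the result payload.

```cpp
// Hypothetical usage sketch of the engine interface. Field names and the
// callback argument order are assumptions; consult the engine's request
// schema for the real contract.
#include <functional>
#include <iostream>
#include <memory>

#include <json/value.h>            // Json::Value from jsoncpp

#include "cortex-common/enginei.h"  // assumed declaration of EngineI

void RunChat(EngineI* engine) {
  // 1. Load a GGUF model.
  auto load_req = std::make_shared<Json::Value>();
  (*load_req)["model_path"] = "./models/tinyllama.gguf";  // assumed field name
  engine->LoadModel(load_req, [](Json::Value&& status, Json::Value&& result) {
    std::cout << "LoadModel status: " << status.toStyledString();
  });

  // 2. Send an OpenAI-style chat completion request.
  auto chat_req = std::make_shared<Json::Value>();
  (*chat_req)["stream"] = false;
  Json::Value msg;
  msg["role"] = "user";
  msg["content"] = "Hello!";
  (*chat_req)["messages"].append(msg);

  engine->HandleChatCompletion(
      chat_req, [](Json::Value&& status, Json::Value&& result) {
        // The callback delivers the response once the engine finishes.
        std::cout << result.toStyledString();
      });
}
```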
Architecture
Main Components
llamacpp is built from several key components:
- enginei: An engine interface definition that extends to all engines, handling endpoint logic and facilitating communication between cortex.cpp and llama engine.
- llama engine: Exposes APIs for embedding and inference. It loads and unloads models and simplifies API calls to llama.cpp.
- llama.cpp: Submodule from the llama.cpp repository that provides the core functionality for embeddings and inference.
- llama server context: A wrapper that offers a more straightforward, user-friendly interface for the llama.cpp APIs.
Communication Protocols
The diagram above illustrates how the llamacpp communication protocols work:
- Streaming: Responses are processed and returned one token at a time.
- RESTful: The response is processed as a whole. After the llama server context completes the entire process, a single result is returned to cortex.cpp.
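In code, the difference shows up in how often the callback fires: with streaming, the engine can invoke the callback once per token until the status signals completion, while the RESTful path invokes it a single time with the full response. The sketch below is illustrative only; the stream, is_done, and data fields are assumed names, not the engine's confirmed schema.

```cpp
// Illustrative only: consuming streaming vs. RESTful responses from
// HandleChatCompletion. Field names ("stream", "is_done", "data") are
// assumptions for the sake of the example.
#include <iostream>
#include <memory>
#include <string>

#include <json/value.h>

#include "cortex-common/enginei.h"  // assumed declaration of EngineI

void Ask(EngineI* engine, bool stream) {
  auto req = std::make_shared<Json::Value>();
  (*req)["stream"] = stream;
  Json::Value msg;
  msg["role"] = "user";
  msg["content"] = "Write a haiku about GPUs.";
  (*req)["messages"].append(msg);

  // Accumulates streamed tokens across repeated callback invocations.
  auto output = std::make_shared<std::string>();

  engine->HandleChatCompletion(
      req, [output, stream](Json::Value&& status, Json::Value&& result) {
        if (stream) {
          // Streaming: the callback is invoked repeatedly, one chunk at a
          // time, until the status signals completion.
          *output += result.get("data", "").asString();  // assumed field
          if (status.get("is_done", false).asBool()) {   // assumed field
            std::cout << *output << std::endl;
          }
        } else {
          // RESTful: a single callback carries the whole response.
          std::cout << result.toStyledString() << std::endl;
        }
      });
}
```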
Code Structure
.
├── base                            # Engine interface definition
│   └── cortex-common               # Common interfaces used for all engines
│       └── enginei.h               # Define abstract classes and interface methods for engines
├── examples                        # Server example to integrate engine
│   └── server.cc                   # Example server demonstrating engine integration
├── llama.cpp                       # Upstream llama.cpp repository
│   └── (files from upstream llama.cpp)
├── src                             # Source implementation for llama.cpp
│   ├── chat_completion_request.h   # OpenAI compatible request handling
│   ├── llama_client_slot           # Manage vector of slots for parallel processing
│   ├── llama_engine                # Implementation of llamacpp engine for model loading and inference
│   └── llama_server_context        # Context management for chat completion requests
│       ├── slot                    # Struct for slot management
│       ├── llama_context           # Struct for llama context management
│       ├── chat_completion         # Struct for chat completion management
│       └── embedding               # Struct for embedding management
└── third-party                     # Dependencies of the llamacpp project
    └── (list of third-party dependencies)
Roadmap
The future plans for llamacpp are focused on enhancing performance and expanding capabilities. Key areas of improvement include:
- Performance Enhancements: Optimizing speed and reducing memory usage to ensure efficient processing of tasks.
- Multimodal Model Compatibility: Expanding support to include a variety of multimodal models, enabling a broader range of applications and use cases.
To follow the latest developments of llamacpp, please see the GitHub Repository.