llama.cpp

llama.cpp is a C/C++ inference engine for deploying large language models locally. It's lightweight and doesn't require Python, CUDA, or other heavy server infrastructure. llama.cpp uses the GGUF file format, which supports quantized model weights and memory-mapping to reduce memory usage on your device.

Browse the Hub for models already available in GGUF format.

llama.cpp can convert Transformers models to GGUF, so they can be run by its standalone C++ executables, with the convert_hf_to_gguf.py script included in the llama.cpp repository.

python3 convert_hf_to_gguf.py ./models/openai/gpt-oss-20b

The conversion process works as follows (a short sketch of the first two steps follows the list).

  1. The script loads the model configuration with AutoConfig.from_pretrained() and extracts the vocabulary with AutoTokenizer.from_pretrained().
  2. Based on the config’s architecture field, the script selects a converter class from its internal registry. The registry maps Transformers architecture names (like LlamaForCausalLM) to corresponding converter classes.
  3. The converter class extracts config parameters, maps Transformers tensor names to GGUF tensor names, transforms tensors, and packages the vocabulary.
  4. The output is a single GGUF file containing the model weights, tokenizer, and metadata.
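The first two steps use public Transformers APIs. The sketch below illustrates them with a toy registry; the registry contents and converter names are illustrative placeholders, not llama.cpp's internal classes.

# Illustrative sketch of steps 1-2. The registry below is a placeholder;
# llama.cpp maintains its own internal mapping of converter classes.
from transformers import AutoConfig, AutoTokenizer

model_id = "openai/gpt-oss-20b"  # a local path such as ./models/openai/gpt-oss-20b also works

config = AutoConfig.from_pretrained(model_id)        # step 1: load the model configuration
tokenizer = AutoTokenizer.from_pretrained(model_id)  # step 1: extract the vocabulary

# step 2: map the config's architecture name to a converter (names are illustrative)
CONVERTER_REGISTRY = {
    "LlamaForCausalLM": "LlamaConverter",
    "GptOssForCausalLM": "GptOssConverter",
}
architecture = config.architectures[0]
print(f"{architecture} -> {CONVERTER_REGISTRY.get(architecture, 'unsupported architecture')}")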

Deploy the model locally from the command line with llama-cli or start a web UI with llama-server. Add the -hf flag to indicate the model is from the Hub.

Run interactively from the command line:

llama-cli -hf openai/gpt-oss-20b

Or serve the model with a web UI:

llama-server -hf openai/gpt-oss-20b
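Once llama-server is running, it also exposes an OpenAI-compatible HTTP API. A minimal sketch, assuming the server is listening on its default port 8080:

# Minimal sketch: query a running llama-server instance, assuming the
# default port 8080 and its OpenAI-compatible chat completions endpoint.
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Summarize the GGUF format in one sentence."}]},
    timeout=120,
)
print(response.json()["choices"][0]["message"]["content"])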
