---
license: gemma
language:
- en
pipeline_tag: text-generation
tags:
- litert
- litert-lm
- gemma
- agent
- tool-calling
- function-calling
- multimodal
- on-device
library_name: litert-lm
---

# Agent Gemma 3n E2B - Tool Calling Edition

A specialized version of **Gemma 3n E2B** optimized for **on-device tool/function calling** with LiteRT-LM. While Google's standard LiteRT-LM models focus on general text generation, this model is specifically designed for agentic workflows with advanced tool calling capabilities.

## Why This Model?

Google's official LiteRT-LM releases provide excellent on-device inference but don't include built-in tool calling support. This model bridges that gap by providing:

- ✅ **Native tool/function calling** via Jinja templates
- ✅ **Multimodal support** (text, vision, audio)
- ✅ **On-device optimized** - no cloud API required
- ✅ **INT4 quantized** - efficient memory usage
- ✅ **Production ready** - tested and validated

Perfect for building AI agents that need to interact with external tools, APIs, or functions while running completely on-device.

## Model Details

- **Base Model**: Gemma 3n E2B
- **Format**: LiteRT-LM v1.4.0
- **Quantization**: INT4
- **Size**: ~3.2 GB
- **Tokenizer**: SentencePiece
- **Capabilities**:
  - Advanced tool/function calling
  - Multi-turn conversations with tool interactions
  - Vision processing (images)
  - Audio processing
  - Streaming responses

## Tool Calling Example

The model ships with a Jinja template that supports OpenAI-style function calling:

```python
from litert_lm import Engine, Conversation

# Load the model
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="cpu")
conversation = Conversation.create(engine)

# Define tools the model can use
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    },
    {
        "name": "search_web",
        "description": "Search the internet for information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    }
]

# Have a conversation with tool calling
message = {
    "role": "user",
    "content": "What's the weather in San Francisco and latest news about AI?"
}

response = conversation.send_message(message, tools=tools)
print(response)
```

### Example Output

The model generates structured tool calls:

```
<start_function_call>call:get_weather{location:San Francisco,unit:celsius}<end_function_call>
<start_function_call>call:search_web{query:latest AI news}<end_function_call>
<start_function_response>
```

You then execute the functions and send back the results:

```python
# Execute tools (your implementation)
weather = get_weather("San Francisco", "celsius")
news = search_web("latest AI news")

# Send tool responses back
tool_response = {
    "role": "tool",
    "content": [
        {
            "name": "get_weather",
            "response": {"temperature": 18, "condition": "partly cloudy"}
        },
        {
            "name": "search_web",
            "response": {"results": ["OpenAI releases GPT-5...", "..."]}
        }
    ]
}

final_response = conversation.send_message(tool_response)
print(final_response)
# "The weather in San Francisco is 18°C and partly cloudy.
#  In AI news, OpenAI has released GPT-5..."
```

## Advanced Features

### Multi-Modal Tool Calling

Combine vision, audio, and tool calling:

```python
message = {
    "role": "user",
    "content": [
        {"type": "image", "data": image_bytes},
        {"type": "text", "text": "What's in this image? Search for more info about it."}
    ]
}

response = conversation.send_message(message, tools=[search_tool])
# The model can see the image AND call search functions
```

### Streaming Tool Calls

Receive tool calls as they're generated:

```python
def on_token(token):
    if "<start_function_call>" in token:
        print("Tool being called...")
    print(token, end="", flush=True)

conversation.send_message_async(message, tools=tools, callback=on_token)
```

### Nested Tool Execution

The model can chain tool calls:

```python
# User: "Book me a flight to Tokyo and reserve a hotel"
# Model: calls check_flights() → calls book_hotel() → confirms both
```
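
Chaining like this can be driven with a small loop that keeps feeding tool results back until the reply contains no further call marker. A minimal sketch, assuming responses arrive as plain strings; `run_turn`, `execute`, and `StubConversation` are hypothetical names, not part of the LiteRT-LM API:

```python
# Hypothetical driver loop for chained tool calls: re-send tool results
# until the model replies without requesting another call.
def run_turn(conversation, message, tools, execute, max_steps=5):
    response = conversation.send_message(message, tools=tools)
    for _ in range(max_steps):
        if "<start_function_call>" not in response:
            break
        # Execute the requested tools and return the results as a tool turn
        response = conversation.send_message(
            {"role": "tool", "content": execute(response)})
    return response

# Stub conversation that requests two tool calls before answering,
# just to exercise the loop:
class StubConversation:
    def __init__(self):
        self.turns = 0

    def send_message(self, message, tools=None):
        self.turns += 1
        if self.turns < 3:
            return "<start_function_call>call:check_flights{...}<end_function_call>"
        return "Flight and hotel are booked."

print(run_turn(StubConversation(), {"role": "user", "content": "Book a trip"},
               [], lambda response: []))
# Flight and hotel are booked.
```

The `max_steps` cap guards against a model that keeps emitting calls indefinitely.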

## Performance

Benchmarked on CPU (no GPU acceleration):

- **Prefill Speed**: 21.20 tokens/sec
- **Decode Speed**: 11.44 tokens/sec
- **Time to First Token**: ~1.6 s
- **Cold Start**: ~4.7 s
- **Tool Call Latency**: ~100-200 ms additional

GPU acceleration provides a 3-5x speedup on supported hardware.
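
These figures support a quick back-of-envelope latency model: total time is roughly time-to-first-token plus the remaining tokens divided by decode speed. A small sketch using the CPU numbers above (`estimate_latency` is an illustrative helper, not a runtime API):

```python
# Rough end-to-end latency estimate from the CPU benchmarks above.
TTFT_S = 1.6        # time to first token (s)
DECODE_TPS = 11.44  # decode speed (tokens/s)

def estimate_latency(output_tokens: int) -> float:
    """Approximate seconds to generate `output_tokens` tokens."""
    return TTFT_S + max(output_tokens - 1, 0) / DECODE_TPS

# A 100-token reply takes roughly 10.3 seconds on CPU.
print(f"{estimate_latency(100):.1f}s")
# 10.3s
```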

## Installation & Usage

### Requirements

1. **LiteRT-LM Runtime** - build from source:

   ```bash
   git clone https://github.com/google-ai-edge/LiteRT.git
   cd LiteRT/LiteRT-LM
   bazel build -c opt //runtime/engine:litert_lm_main
   ```

2. **Supported Platforms**: Linux (clang), macOS, Android

### Quick Start

```bash
# Download the model
wget https://huggingface.co/kontextdev/agent-gemma/resolve/main/gemma-3n-E2B-it-agent-fixed.litertlm

# Run with a simple prompt
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=cpu \
  --input_prompt="Hello, I need help with some tasks"

# Run with GPU (if available)
./bazel-bin/runtime/engine/litert_lm_main \
  --model_path=gemma-3n-E2B-it-agent-fixed.litertlm \
  --backend=gpu \
  --input_prompt="What can you help me with?"
```

### Python API (Recommended)

```python
from litert_lm import Engine, Conversation, SessionConfig

# Initialize
engine = Engine.create("gemma-3n-E2B-it-agent-fixed.litertlm", backend="gpu")

# Configure the session
config = SessionConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.9
)

# Start a conversation
conversation = Conversation.create(engine, config)

# Define your tools
tools = [...]  # Your function definitions

# Chat with tool calling
while True:
    user_input = input("You: ")
    response = conversation.send_message(
        {"role": "user", "content": user_input},
        tools=tools
    )

    # Handle tool calls if present
    if has_tool_calls(response):
        results = execute_tools(extract_calls(response))
        response = conversation.send_message({
            "role": "tool",
            "content": results
        })

    print(f"Agent: {response['content']}")
```
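
The `has_tool_calls` and `execute_tools` helpers are left to your implementation. One minimal sketch, assuming responses are plain strings and calls have been parsed into `(name, args)` pairs; these helpers and the `registry` pattern are illustrative, not part of `litert_lm`:

```python
# Minimal versions of the helper functions used above (hypothetical,
# not provided by litert_lm itself).
def has_tool_calls(response: str) -> bool:
    """True when the model's reply contains at least one tool call."""
    return "<start_function_call>" in response

def execute_tools(calls, registry):
    """Run each (name, args) call through a name -> callable registry."""
    return [{"name": name, "response": registry[name](**args)}
            for name, args in calls]

# Example: a fake weather function wired into the registry
registry = {
    "get_weather": lambda location, unit="celsius": {
        "temperature": 18, "condition": "partly cloudy"}
}
results = execute_tools([("get_weather", {"location": "San Francisco"})], registry)
print(results[0]["response"]["condition"])
# partly cloudy
```

A registry keyed by function name keeps dispatch declarative: adding a tool means adding one entry, with no branching logic.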

## Tool Call Format

The model uses the following format for tool interactions:

**Function Declaration** (system/developer role):
```
<start_of_turn>developer
<start_function_declaration>
{
  "name": "function_name",
  "description": "What it does",
  "parameters": {...}
}
<end_function_declaration>
<end_of_turn>
```

**Function Call** (assistant):
```
<start_function_call>call:function_name{arg1:value1,arg2:value2}<end_function_call>
```

**Function Response** (tool role):
```
<start_function_response>response:function_name{result:value}<end_function_response>
```
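
The call syntax is regular enough to extract with a regex. A minimal parsing sketch, assuming simple `key:value` arguments as shown above (real template output may use richer, JSON-style arguments, which would need a proper parser):

```python
import re

# Matches <start_function_call>call:name{arg1:v1,arg2:v2}<end_function_call>
CALL_RE = re.compile(
    r"<start_function_call>call:(\w+)\{([^}]*)\}<end_function_call>")

def extract_calls(text: str):
    """Return [(function_name, {arg: value, ...}), ...] from model output."""
    calls = []
    for name, arg_str in CALL_RE.findall(text):
        args = dict(pair.split(":", 1)
                    for pair in arg_str.split(",") if ":" in pair)
        calls.append((name, args))
    return calls

out = ("<start_function_call>call:get_weather"
       "{location:San Francisco,unit:celsius}<end_function_call>")
print(extract_calls(out))
# [('get_weather', {'location': 'San Francisco', 'unit': 'celsius'})]
```

Note that `split(":", 1)` keeps any colons inside the value intact, and a reply with no call markers simply yields an empty list.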

## Use Cases

### Personal AI Assistant
- Calendar management
- Email sending
- Web searching
- File operations

### IoT & Smart Home
- Device control
- Sensor monitoring
- Automation workflows
- Voice commands

### Development Tools
- Code generation with API calls
- Database queries
- Deployment automation
- Testing & debugging

### Business Applications
- CRM integration
- Data analysis
- Report generation
- Customer support
|
| | ## Model Architecture |
| |
|
| | Built on Gemma 3n E2B with 9 optimized components: |
| |
|
| | ``` |
| | Section 0: LlmMetadata (Agent Jinja template) |
| | Section 1: SentencePiece Tokenizer |
| | Section 2: TFLite Embedder |
| | Section 3: TFLite Per-Layer Embedder |
| | Section 4: TFLite Audio Encoder (HW accelerated) |
| | Section 5: TFLite End-of-Audio Detector |
| | Section 6: TFLite Vision Adapter |
| | Section 7: TFLite Vision Encoder |
| | Section 8: TFLite Prefill/Decode (INT4) |
| | ``` |
| |
|
| | All components are optimized for on-device inference with hardware acceleration support. |

## Comparison

| Feature | Standard Gemma LiteRT-LM | This Model |
|---------|--------------------------|------------|
| Text Generation | ✅ | ✅ |
| Tool Calling | ❌ | ✅ |
| Multimodal | ✅ | ✅ |
| Streaming | ✅ | ✅ |
| On-Device | ✅ | ✅ |
| Jinja Templates | Basic | Advanced agent template |
| INT4 Quantization | ✅ | ✅ |

## Limitations

- **Tool Execution**: The model generates tool calls but doesn't execute them - you must implement the actual functions
- **Context Window**: Limited to 4096 tokens (configurable)
- **Streaming Tool Calls**: Partial tool calls may need buffering
- **Hardware Requirements**: Minimum 4 GB RAM recommended
- **GPU Fallback**: On CPU-only systems, inference falls back to the CPU backend

## Tips for Best Results

1. **Clear Tool Descriptions**: Provide detailed function descriptions
2. **Schema Validation**: Validate tool call arguments before execution
3. **Error Handling**: Handle malformed tool calls gracefully
4. **Context Management**: Keep conversation history concise
5. **Temperature**: Use 0.7-0.9 for creative tasks, 0.3-0.5 for precise tool calls
6. **Batching**: Process multiple tool calls in parallel when possible
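
Tip 2 can be implemented as a lightweight check against the JSON-Schema-style `parameters` block used in the tool definitions. A sketch (not a full JSON Schema validator; `validate_args` is an illustrative helper):

```python
# Lightweight argument check against a tool's "parameters" schema.
def validate_args(tool: dict, args: dict) -> list:
    """Return a list of problems; an empty list means the call looks valid."""
    params = tool["parameters"]
    errors = [f"missing required '{r}'"
              for r in params.get("required", []) if r not in args]
    for key, value in args.items():
        spec = params["properties"].get(key)
        if spec is None:
            errors.append(f"unknown argument '{key}'")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"'{key}' must be one of {spec['enum']}")
    return errors

weather_tool = {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location"],
    },
}
print(validate_args(weather_tool, {"unit": "kelvin"}))
# ["missing required 'location'", "'unit' must be one of ['celsius', 'fahrenheit']"]
```

Rejecting a malformed call before execution lets you send the error list back as the tool response, so the model can retry with corrected arguments.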

## License

This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model.

## Citation

```bibtex
@misc{agent-gemma-litertlm,
  title={Agent Gemma 3n E2B - Tool Calling Edition},
  author={kontextdev},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/kontextdev/agent-gemma}}
}
```

## Links

- [LiteRT-LM GitHub](https://github.com/google-ai-edge/LiteRT/tree/main/LiteRT-LM)
- [Gemma Model Family](https://ai.google.dev/gemma)
- [LiteRT Documentation](https://ai.google.dev/edge/litert)
- [Tool Calling Guide](https://ai.google.dev/gemma/docs/function-calling)

## Support

For issues or questions:
- Open an issue on [GitHub](https://github.com/google-ai-edge/LiteRT/issues)
- Check the [LiteRT-LM docs](https://ai.google.dev/edge/litert/inference)
- Community forum: [Google AI Edge](https://discuss.ai.google.dev/)

---

Built with ❤️ for the on-device AI community