local setup with a coding agent

#15
by lightenup - opened

Hi - thanks for creating and sharing this exceptional model!

Is anyone using Qwen3-Coder-Next locally with a coding agent? I am assuming it works well with Qwen3-Coder, but what about other agents? Which inference engine (vLLM, SGLang, llama.cpp, ...) do you use, and on which hardware?

vllm + opencode

@kyr0 vLLM with FP8 or bfloat16 weights? Which version? Can you share your launch command? What kind of experience are you getting? Does everything work, or are there issues you had to work around?

I am currently using Codex with llama.cpp (autoparser branch, https://github.com/ggml-org/llama.cpp/pull/18675) and the Responses API, with this llama-server launch command:

-m /app/models/Qwen3-Coder-Next-UD-Q8_K_XL-00001-of-00003.gguf --jinja --threads -1 -ngl 99 --ctx-size 150000 --temp 1.0 --min-p 0.01 --top-p 0.95 --top-k 40 --port 8060 --host 0.0.0.0 -a qwen3-coder-next
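
For reference, before wiring Codex up, the server can be sanity-checked with the usual OpenAI-compatible endpoints that llama-server exposes (port and model alias taken from the launch command above):

# list the served model; should return the alias "qwen3-coder-next"
curl -s http://localhost:8060/v1/models | jq

# minimal chat completion to confirm generation and the chat template work
curl -s http://localhost:8060/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder-next", "messages": [{"role": "user", "content": "Say hi"}]}' \
  | jq -r '.choices[0].message.content'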

I have no tool-call issues, but agent turns often end prematurely: the model generates something like "Let me read source_file.c:" and then doesn't emit the tool call, so Codex assumes the model is already finished. I can tell it to continue, but of course that gets tedious for tasks that require 100+ tool calls. (I am considering patching Codex to auto-continue until the model explicitly declares that it is finished.)
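
For illustration, the auto-continue idea could look roughly like the loop below, run against the plain chat-completions endpoint instead of the Responses API. This is a hypothetical sketch, not the actual Codex patch; the prompt, the iteration cap, and the trailing-colon completion heuristic are made up:

# Hypothetical auto-continue loop (illustration only).
# Resend "continue" until the model either emits a tool call
# or stops with an answer that doesn't look cut off.
MSGS='[{"role":"user","content":"Explain the build system of this repo."}]'
for i in $(seq 1 20); do
  RESP=$(curl -s http://localhost:8060/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\":\"qwen3-coder-next\",\"messages\":$MSGS}")
  CONTENT=$(echo "$RESP" | jq -r '.choices[0].message.content // ""')
  NTOOLS=$(echo "$RESP" | jq '.choices[0].message.tool_calls | length')
  echo "$CONTENT"
  # a real agent would execute the tool call here; the sketch just stops
  [ "$NTOOLS" -gt 0 ] && break
  # crude heuristic: an answer that doesn't end in ":" counts as finished
  case "$CONTENT" in *:) ;; *) break ;; esac
  MSGS=$(echo "$MSGS" | jq -c --arg c "$CONTENT" \
    '. + [{"role":"assistant","content":$c},{"role":"user","content":"continue"}]')
done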

Other than that, I'd say it's about on the level of gpt-oss-120b (high) for "explain this and that in the code base" tasks, although it needs far fewer tokens.

@lightenup No worries, here is my Makefile, which should contain everything you need. I'm running it with the NVIDIA Container Toolkit + Docker.

# Makefile for managing the Qwen3-Coder-Next-FP8-512k model container

# -- Model serving settings

# Example cURL call: 
# curl -H "Authorization: Bearer local-dev-key" -s http://baradcuda:8908/v1/models | jq
SERVED_NAME        ?= qwen3-coder-next-fp8-512k
API_KEY            ?= local-dev-key
PORT               ?= 8908

# -- Container settings
IMAGE              ?= vllm/vllm-openai:v0.15.1-cu130
CONTAINER          ?= qwen3-coder-next-fp8-512k

# -- GPU settings
GPU_MEM_UTIL       ?= 0.92
GPU_NO             ?= 0
GPU_UUID           := $(shell nvidia-smi --query-gpu=gpu_uuid --format=csv,noheader,nounits -i $(GPU_NO))
GPU_DEVICE         ?= "device=$(GPU_UUID)"

HF_HOME            ?= /var/lib/docker/container_volumes/hf_models
VLLM_CACHE_ROOT    ?= $(HF_HOME)/vllm_cache

# -- Inference settings
MODEL              ?= Qwen/Qwen3-Coder-Next-FP8
# 512k = 512 * 1024 = 524288
# 256k = 256 * 1024 = 262144
MAX_LEN            ?= 524288
MAX_SEQS           ?= 1
BATCH_TOKS         ?= 2048
DTYPE              ?= bfloat16
# use flashinfer for DGX spark
ATTN_BACKEND       ?= "FLASH_ATTN"

## -- Commands
.PHONY: start stop logs logs-once status

# Start the container with the specified settings, removing any existing container with the same name
start:
    docker rm -f $(CONTAINER) >/dev/null 2>&1 || true
    docker run -d --name $(CONTAINER) \
      --gpus $(GPU_DEVICE) --ipc=host \
      -p $(PORT):8000 \
      -e HF_HOME="$(HF_HOME)" \
      -e HF_TOKEN \
      -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
      -e TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1 \
      -e VLLM_CACHE_ROOT="$(VLLM_CACHE_ROOT)" \
      -v $(HF_HOME):$(HF_HOME) \
      -v $(VLLM_CACHE_ROOT):$(VLLM_CACHE_ROOT) \
      $(IMAGE) \
        --model "$(MODEL)" \
        --served-model-name "$(SERVED_NAME)" \
        --host 0.0.0.0 --port 8000 \
        --api-key "$(API_KEY)" \
        --dtype "$(DTYPE)" \
        --gpu-memory-utilization "$(GPU_MEM_UTIL)" \
        --max-model-len "$(MAX_LEN)" \
        --kv-cache-dtype fp8 \
        --calculate-kv-scales \
        --enable-chunked-prefill \
        --max-num-seqs "$(MAX_SEQS)" \
        --max-num-batched-tokens "$(BATCH_TOKS)" \
        --attention-backend "$(ATTN_BACKEND)" \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_coder

# Force stop and remove the container
stop:
    docker rm -f $(CONTAINER)

# Interactively follows the container logs, useful for debugging and monitoring 
logs:
    docker logs -f $(CONTAINER)

# For CI: print logs once without following, to avoid hanging if container is not running
logs-once:
    docker logs $(CONTAINER)

# Check the status of the container, including whether it's running and a sample of the model's API response
status:
	@echo "== $(CONTAINER) =="
	@docker ps --filter "name=$(CONTAINER)"
	@echo
	@curl -H "Authorization: Bearer $(API_KEY)" -s http://localhost:$(PORT)/v1/models | head -c 1200 | jq || true
	@echo
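
For completeness, typical usage once the Makefile is in place (port, API key, and served model name are the defaults from the variables above):

make start     # remove any old container, pull the image if needed, start serving
make logs      # follow the startup logs until the model has finished loading (Ctrl-C to detach)
make status    # container state plus a sample /v1/models response

# any OpenAI-compatible client (opencode, Codex, plain curl, ...) can then be pointed at the endpoint:
curl -s http://localhost:8908/v1/chat/completions \
  -H "Authorization: Bearer local-dev-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-coder-next-fp8-512k", "messages": [{"role": "user", "content": "Say hi"}]}' \
  | jq -r '.choices[0].message.content'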
