Efficient Inference: a paper collection by admarcosai
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2311.04934, 32 upvotes)
- Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models (arXiv:2311.08692, 13 upvotes)
- Exponentially Faster Language Modelling (arXiv:2311.10770, 119 upvotes)
- Memory Augmented Language Models through Mixture of Word Experts (arXiv:2311.10768, 19 upvotes)
- Unlocking Anticipatory Text Generation: A Constrained Approach for Faithful Decoding with Large Language Models (arXiv:2312.06149, 3 upvotes)
- SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985, 40 upvotes)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet (arXiv:2312.08361, 27 upvotes)
- Steering Llama 2 via Contrastive Activation Addition (arXiv:2312.06681, 14 upvotes)
- Context Tuning for Retrieval Augmented Generation (arXiv:2312.05708, 16 upvotes)
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv:2312.12456, 45 upvotes)
- Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon (arXiv:2401.03462, 28 upvotes)
- Efficient LLM inference solution on Intel GPU (arXiv:2401.05391, 11 upvotes)
- Supervised Knowledge Makes Large Language Models Better In-context Learners (arXiv:2312.15918, 9 upvotes)
- BlackMamba: Mixture of Experts for State-Space Models (arXiv:2402.01771, 25 upvotes)
- BitDelta: Your Fine-Tune May Only Be Worth One Bit (arXiv:2402.10193, 21 upvotes)
- Speculative Streaming: Fast LLM Inference without Auxiliary Models (arXiv:2402.11131, 42 upvotes)
- Smaller Language Models Are Better Instruction Evolvers (arXiv:2412.11231, 28 upvotes)