Deployment Techniques for LLMs/SLMs on Edge Devices
Modern Large Language Models (LLMs) are computationally expensive because they process billions of parameters and long token sequences. To make inference and fine-tuning practical, the ecosystem relies heavily on optimization techniques such as:
- KV Caching
- Quantization
- Flash Attention
- LoRA/QLoRA
- Speculative Decoding
- Tensor Parallelism
- Prefix Caching
- Mixture of Experts (MoE)
- Continuous Batching
- Paged Attention
These optimizations reduce:
- GPU memory consumption
- Latency
- Power usage
- Training cost
- Inference cost
KV Cache (Key-Value Cache)
Problem
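Without a cache, every decoding step of autoregressive generation recomputes the key and value projections for all previous tokens.
Without caching:
- Step 1 computes token 1
- Step 2 recomputes token 1, then computes token 2
- Step 3 recomputes tokens 1 and 2, then computes token 3
The redundant work grows with every step, so the total cost of generating n tokens grows at least quadratically in n.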
Solution
The KV cache stores, for every layer:
- Key tensors
- Value tensors
for all tokens processed so far (prompt and generated).
When a new token arrives:
- Only the new token's query, key, and value are computed
- Cached keys and values of earlier tokens are reused
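The reuse described above can be sketched with a toy single-head attention. This is a minimal illustration, not a real model: `project` is a hypothetical stand-in for the learned Q/K/V projections, and the token vectors are made up. The cached and uncached paths produce the same output, but the cached path projects each token only once.

```python
import math

def attend(q, keys, values):
    """Single-head attention of one query over lists of key/value vectors."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def project(x, scale):
    """Hypothetical stand-in for a learned projection (plain scaling)."""
    return [scale * xi for xi in x]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens whose K/V were computed

    def step(self, x):
        # Project only the new token; reuse everything cached so far.
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        self.projections += 1
        return attend(project(x, 1.0), self.keys, self.values)

def step_without_cache(tokens):
    """Recompute K/V for every token seen so far, then attend for the last one."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(project(tokens[-1], 1.0), keys, values), len(tokens)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = KVCache()
recomputed = 0
for t in range(1, len(tokens) + 1):
    cached_out = cache.step(tokens[t - 1])
    plain_out, n = step_without_cache(tokens[:t])
    recomputed += n
    # Both paths must agree on the attention output.
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, plain_out))

print("K/V projections with cache:", cache.projections)
print("K/V projections without cache:", recomputed)
```

With the cache, the number of K/V projections equals the number of tokens; without it, the count is the sum 1 + 2 + ... + n.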
Benefits
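| Benefit | Impact |
| Lower per-token latency | Each step processes one new token instead of the full sequence |
| Faster generation | Roughly 2x–20x, depending on model and sequence length |
| Lower compute | No repeated K/V projections for earlier tokens |
| Better streaming performance | Per-token latency stays low as the sequence grows |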
KV Cache Architecture
Input Tokens
↓
Embedding Layer
↓
Transformer Layer
↓
Store K/V tensors in GPU memory
↓
Reuse during next token generation
KV Cache Challenges
1. GPU Memory Explosion
For long contexts:
- 32k
- 64k
- 128k
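the KV cache becomes massive.
Formula:
Memory ≈ 2 × Layers × KV Heads × Head Dimension × Sequence Length × Batch Size × Bytes per Element
(the factor 2 covers keys and values)
For a 70B model:
- With a Llama-style configuration (80 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16), a single 128k-token sequence needs about 43 GB (40 GiB) of VRAM for the KV cache alone.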
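The size estimate can be checked with a few lines of arithmetic. The sketch below counts both keys and values, the batch size, and the bytes per element. The configuration (80 layers, 8 KV heads, head dimension 128, FP16) is an assumption that matches a Llama-style 70B model; the second call shows the same model without grouped-query attention (64 KV heads) for comparison.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

seq_len = 128 * 1024  # 128k tokens

# Assumed Llama-style 70B configuration with grouped-query attention.
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=seq_len)
# Same model if every one of the 64 attention heads kept its own K/V.
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=seq_len)

print(f"GQA, 8 KV heads:  {gqa / 2**30:.0f} GiB")
print(f"MHA, 64 KV heads: {mha / 2**30:.0f} GiB")
```

The comparison also shows why grouped-query attention matters for long contexts: the cache shrinks in proportion to the number of KV heads.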
Advanced KV Cache Optimizations
Paged Attention
Used by:
- vLLM
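Instead of one contiguous memory region per sequence:
- The cache is divided into fixed-size blocks (pages)
- A per-sequence block table maps logical token positions to physical blocks
- Fragmentation is reduced
- More sequences fit into a batch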
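The paging idea can be illustrated with a toy allocator. This is only a sketch of the bookkeeping, not how vLLM is implemented: the class name, block size, and method names are all hypothetical, and no actual tensors are stored.

```python
class PagedKVAllocator:
    """Toy block-table allocator; names and sizes are illustrative only."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.tables = {}                     # sequence id -> list of physical block ids
        self.lengths = {}                    # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        # Allocate a new block only when the last one is full (or none exists yet).
        if n == len(table) * self.block_size:
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        # Physical (block, offset) where this token's K/V would be written.
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        # Return all blocks of a finished sequence to the free pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")  # sequence "a" now spans two blocks
alloc.append_token("b")      # sequence "b" takes a third block
print(alloc.tables, "free:", alloc.free)
```

Because blocks are allocated on demand and need not be adjacent, at most one partially filled block per sequence is wasted.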
Prefix Caching
The KV cache of a shared prompt prefix (for example, a system prompt) is computed once and reused across requests.
Useful for:
- Chatbots
- Enterprise copilots
- Repeated templates
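A minimal sketch of the lookup, assuming the cache is keyed by a hash of the prefix tokens. `fake_prefill` is a hypothetical stand-in for the expensive prefill pass, and the token ids are made up; real engines typically cache at block granularity rather than whole prefixes.

```python
import hashlib

class PrefixCache:
    """Maps a hash of the prefix tokens to its precomputed KV state."""

    def __init__(self):
        self.store = {}
        self.hits = 0

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = hashlib.sha256(repr(tuple(prefix_tokens)).encode()).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.store[key] = compute_kv(prefix_tokens)
        return self.store[key]

def fake_prefill(tokens):
    """Hypothetical stand-in for the prefill pass that builds the KV cache."""
    fake_prefill.calls += 1
    return [("kv", t) for t in tokens]

fake_prefill.calls = 0

cache = PrefixCache()
system_prompt = [101, 7, 42, 9]  # made-up token ids shared by all requests
for user_turn in ([5, 6], [8], [3, 3, 3]):
    prefix_kv = cache.get_or_compute(system_prompt, fake_prefill)
    # Only the user-specific suffix would still need its own prefill.
    suffix_len = len(user_turn)

print("prefill calls:", fake_prefill.calls, "cache hits:", cache.hits)
```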
Sliding Window Attention
Only the most recent tokens are kept in the cache, so memory stays bounded no matter how long the sequence grows.
Good for:
- Streaming applications
- Edge inference
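One simple way to picture the bounded cache is a fixed-length queue that drops the oldest entry when a new one arrives. The sketch below uses strings in place of real key/value tensors, and the class name is hypothetical.

```python
from collections import deque

class SlidingWindowKV:
    """Keeps K/V entries for at most `window` most recent tokens."""

    def __init__(self, window):
        # A deque with maxlen discards the oldest item automatically.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

kv = SlidingWindowKV(window=3)
for pos in range(5):
    kv.append(f"k{pos}", f"v{pos}")

print(list(kv.keys))  # only the last three positions remain
```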
Quantized KV Cache
The KV tensors themselves are quantized:
- FP16 → INT8 or INT4
This cuts the cache footprint to roughly one half (INT8) or one quarter (INT4) of FP16, at some cost in accuracy.
Quantization
What is Quantization?
Reducing the numeric precision of weights (and optionally activations):
| Precision | Memory relative to FP32 |
| FP32 | 100% |
| FP16 | 50% |
| INT8 | 25% |
| INT4 | 12.5% |
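As a concrete illustration, the sketch below applies symmetric per-tensor INT8 quantization to a handful of made-up weights. This is one simple scheme chosen for the example; production methods usually quantize per channel or per group.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [qi * scale for qi in q]

weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale, "max round-trip error:", max_error)
```

The round-trip error is bounded by half the scale, which is why a few very large weights (outliers) hurt accuracy: they inflate the scale for every other weight.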
Types of Quantization
Post Training Quantization (PTQ)
The model is trained normally, then quantized afterwards, usually with a small calibration dataset.
Advantages:
- Fast
- Cheap
- Easy
Disadvantages:
- Accuracy drop possible
Quantization Aware Training (QAT)
Quantization is simulated during training so the weights can adapt to the reduced precision.
Advantages:
- Better accuracy
Disadvantages:
- Expensive training
Popular Quantization Methods and Formats
GPTQ
- One-shot, weight-only post-training quantization
- Widely used, typically at 4 bits
- Efficient for GPU inference
AWQ (Activation-aware Weight Quantization)
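Uses activation statistics to find the small fraction of weights that matter most and protects them by scaling before quantization.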
Improves:
- Accuracy
- Stability
GGUF
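A model file format rather than an algorithm; it packages quantized weights and metadata in a single file.
Used heavily in: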
- Ollama
- llama.cpp ecosystem
Ideal for:
- CPU inference
- Edge devices
BitsAndBytes
Popular library integrated with Hugging Face Transformers.
Supports:
- 8-bit loading
- 4-bit loading
- QLoRA
QLoRA
QLoRA combines:
- 4-bit quantization
- LoRA adapters
Enables fine-tuning:
- 7B models on a single consumer GPU
- 65B-class models on a single 48 GB GPU, as reported by the QLoRA authors
Core idea:
- Freeze quantized base model
- Train small low-rank adapters
LoRA (Low Rank Adaptation)
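Instead of updating the full weight matrix, LoRA learns a low-rank update:
W' = W + ΔW, with ΔW = B × A
Where:
- W is frozen (d × k)
- B (d × r) and A (r × k) are trainable, with rank r much smaller than d and k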
Benefits:
- Tiny trainable params
- Faster fine-tuning
- Low VRAM usage
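The low-rank update can be sketched in plain Python. The dimensions, the rank, and the weight values below are made up for illustration. Two details are assumptions not stated above but common in LoRA implementations: the update is scaled by `alpha / r`, and B starts at zero so the adapted layer initially behaves exactly like the frozen one.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

d_out, d_in, r = 64, 64, 4  # illustrative sizes; r is the adapter rank
alpha = 8                   # assumed scaling hyperparameter

W = [[0.01 * (i + j) for j in range(d_in)] for i in range(d_out)]  # frozen weights
B = [[0.0] * r for _ in range(d_out)]                               # d_out x r, zero init
A = [[0.001 * (i - j) for j in range(d_in)] for i in range(r)]      # r x d_in

def effective_weight(W, A, B, alpha, r):
    """W' = W + (alpha / r) * B @ A; only A and B would receive gradients."""
    delta = [[(alpha / r) * v for v in row] for row in matmul(B, A)]
    return add(W, delta)

full_params = d_out * d_in
lora_params = r * (d_out + d_in)
print(f"full: {full_params} params, LoRA: {lora_params} params "
      f"({100 * lora_params / full_params:.1f}%)")
```

The savings grow with layer size: the full matrix has d_out × d_in parameters, while the adapter has only r × (d_out + d_in).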
Flash Attention
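Attention is computed without materializing the full attention matrix, using:
- Kernel fusion
- Tiling
- Keeping working tiles in fast on-chip GPU SRAM
Benefits:
- Faster attention (exact, not an approximation)
- Memory that grows linearly rather than quadratically with sequence length
- Longer contexts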
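The tiling idea rests on an "online softmax": keys and values are processed one tile at a time while a running maximum, a running normalizer, and a running output are kept. The CPU sketch below shows only that numerical trick for a single query, with made-up vectors; it says nothing about the actual GPU kernels, and the function names are hypothetical.

```python
import math

def naive_attention(q, keys, values):
    """Reference: computes all scores at once, then a standard softmax."""
    scale = 1 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    """Processes K/V tile by tile, keeping only running statistics."""
    scale = 1 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
exact = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=2)
print(exact, tiled)
```

The two results agree up to floating-point rounding, which is the sense in which the tiled computation is exact.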
Speculative Decoding
Two models:
- Small draft model
- Large verifier model
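The draft model proposes several tokens quickly.
The large model verifies them in a single forward pass and keeps the longest accepted prefix.
Benefits:
- Up to roughly 2x–5x inference speedup, depending on how often drafted tokens are accepted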
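The draft-and-verify loop can be sketched for the greedy case, where a drafted token is accepted only if it equals the token the large model would have chosen. Both "models" below are hypothetical toy functions over digit sequences, and verification is written as a loop even though a real system scores all drafted positions in one batched forward pass. Sampling-based variants use a probabilistic acceptance rule that this sketch does not cover.

```python
def speculative_generate(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; output equals greedy decoding of the target."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. Verify all drafted positions (one batched target pass in practice).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(out + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # keep the target's token and stop
                break
        else:
            # All k drafts matched: the same pass yields one extra token.
            accepted.append(target_next(out + draft))
        out.extend(accepted)
    return out[:len(prompt) + num_tokens], target_passes

def target_next(seq):
    """Toy 'large model': counts upward modulo 10."""
    return (seq[-1] + 1) % 10

def draft_next(seq):
    """Toy 'draft model': agrees with the target except after a 6."""
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10

tokens, passes = speculative_generate(target_next, draft_next, [0], num_tokens=12)
print(tokens, "target passes:", passes)
```

The speedup comes from the target model running fewer passes than the number of generated tokens; the more often the draft agrees, the fewer passes are needed.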
Continuous Batching
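Requests are merged into the running batch at every decoding step: finished sequences leave and waiting requests take their slots, instead of waiting for the whole batch to finish.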
Used by:
- NVIDIA TensorRT-LLM
- Anyscale Ray Serve
- vLLM
Improves:
- Throughput
- GPU utilization
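The scheduling difference can be simulated without any model. In this sketch a request is just an id and a number of tokens to generate (assumed to be at least one), and each step stands for one forward pass that produces one token per running request. The function names and the workload are made up.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Returns the step at which each request finishes."""
    waiting = deque(requests)
    running = {}       # request id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Fill any free slots before every decoding step.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1      # one forward pass, one token per running request
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                finished_at[rid] = step
                del running[rid]
    return finished_at

def static_batching(requests, max_batch):
    """Baseline: a batch occupies the GPU until its longest request ends."""
    finished_at, step = {}, 0
    for i in range(0, len(requests), max_batch):
        batch = requests[i:i + max_batch]
        step += max(n for _, n in batch)
        for rid, _ in batch:
            finished_at[rid] = step
    return finished_at

requests = [("a", 2), ("b", 8), ("c", 3), ("d", 2)]
continuous = continuous_batching(requests, max_batch=2)
static = static_batching(requests, max_batch=2)
print("continuous:", continuous)
print("static:    ", static)
```

In the static baseline, short requests wait for the longest request in their batch; with continuous batching, their slots are reused as soon as they finish.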
Fine-Tuning Optimizations
Gradient Checkpointing
Stores only a subset of activations during the forward pass.
Recomputes the rest during the backward pass.
Tradeoff:
- Lower memory
- More compute
ZeRO Optimization
Developed by:
- Microsoft DeepSpeed
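Stages:
- ZeRO-1: partitions optimizer states
- ZeRO-2: also partitions gradients
- ZeRO-3: also partitions parameters
Each stage splits more of the training state across GPUs instead of replicating it on every GPU.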
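The effect of each stage on per-GPU memory can be estimated with simple arithmetic. The sketch below assumes the accounting commonly used for mixed-precision Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer states (FP32 weight copy, momentum, variance). Activations and buffers are ignored, and the 7.5B-parameter, 64-GPU setup is only an example.

```python
def zero_bytes_per_gpu(params, gpus, stage):
    """Model-state memory per GPU for mixed-precision Adam (activations excluded)."""
    weights, grads, optim = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        optim /= gpus    # ZeRO-1: optimizer states are partitioned
    if stage >= 2:
        grads /= gpus    # ZeRO-2: gradients are partitioned as well
    if stage >= 3:
        weights /= gpus  # ZeRO-3: parameters are partitioned as well
    return weights + grads + optim

for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    label = "no ZeRO" if stage == 0 else f"ZeRO-{stage}"
    print(f"{label}: {gb:.1f} GB per GPU")
```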
Best Production Stack
| Component | Recommended |
| Inference Engine | vLLM |
| Quantization | AWQ/GPTQ |
| Fine-Tuning | QLoRA |
| Attention | Flash Attention 2 |
| Serving (NVIDIA GPUs) | TensorRT-LLM |
| CPU Edge | llama.cpp |
Edge AI means running models locally on:
- Phones
- IoT devices
- Embedded systems
- Raspberry Pi
- Automotive systems
- Industrial gateways
Goals:
- Low latency
- Offline capability
- Data privacy
- Reduced cloud cost
Challenges
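| Challenge | Description |
| Low RAM | Phones and embedded boards have limited memory, shared with the OS and other apps |
| Limited compute | No datacenter GPUs; only CPU, mobile GPU, or NPU |
| Thermal constraints | Sustained load triggers thermal throttling |
| Battery usage | Power draw is critical on battery-powered devices |
| Storage | Model files are large relative to device storage |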
Small Language Models (SLMs)
Examples:
- Microsoft Phi
- Google Gemma
- Meta Llama 3 8B
- TinyLlama
- Mistral 7B
SLMs dominate edge deployment.
Deployment Architectures
1. Fully Local
Everything on device.
Pros:
- Privacy
- Offline
Cons:
- Limited performance
2. Hybrid Edge-Cloud
Edge handles:
- Embeddings
- Intent classification
Cloud handles:
- Heavy generation
Best for enterprise systems.
3. Split Computing
Layers split between:
- Device
- Edge gateway
- Cloud
Emerging architecture.
Optimization Techniques
Quantization
Usually the most impactful optimization on edge devices, since it shrinks both RAM use and storage.
Typical:
- 4-bit GGUF
- INT8
- Mixed precision
Model Distillation
Teacher → Student model.
A small student model is trained to reproduce the outputs of a large teacher model, compressing the teacher's knowledge into fewer parameters.
Benefits:
- Faster inference
- Lower RAM
- Better edge suitability
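A common way to train the student is to match the teacher's temperature-softened output distribution. The sketch below computes that loss for a single prediction; the logits and the temperature of 2.0 are made up, and the T² scaling follows the usual convention for this loss rather than anything stated above. In practice this term is typically combined with the ordinary cross-entropy on the true labels.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]    # made-up logits for three classes
good_student = [3.8, 1.1, 0.4]
bad_student = [0.5, 4.0, 1.0]

print("close student:", distillation_loss(good_student, teacher))
print("far student:  ", distillation_loss(bad_student, teacher))
```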
ONNX Runtime
Runs models exported to the ONNX graph format and applies graph-level optimizations.
Supports:
- CPU acceleration
- Mobile acceleration
- Hardware-specific kernels
TensorRT
Optimized for:
- NVIDIA Jetson
- Edge GPUs
Performs:
- Kernel fusion
- Memory optimization
- Precision tuning
Core ML
Used for:
- iOS deployment
Optimized for:
- Apple Neural Engine
Android Deployment
Typical stack:
- TensorFlow Lite
- ONNX Runtime Mobile
- llama.cpp Android
llama.cpp
One of the most important edge inference frameworks.
Features:
- CPU optimized
- Metal acceleration
- Vulkan support
- Quantized GGUF models
Ideal for:
- Raspberry Pi
- Macs
- Offline assistants
Edge RAG
Common architecture:
Documents
↓
Local Embedding Model
↓
Lightweight Vector DB
↓
Local LLM
Vector DBs:
- Chroma
- FAISS
- SQLite-VSS
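The pipeline above can be sketched end to end with the standard library. Everything here is a stand-in: `embed` uses bag-of-words counts instead of a real local embedding model, `TinyVectorStore` is a hypothetical in-memory replacement for a vector database, and the documents are made up. The final prompt is what would be passed to the local LLM.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for a local embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    """In-memory store with brute-force cosine search."""

    def __init__(self):
        self.items = []

    def add(self, doc):
        self.items.append((embed(doc), doc))

    def search(self, query, k=1):
        qv = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(qv, item[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]

store = TinyVectorStore()
for doc in [
    "reset the router by holding the button for ten seconds",
    "the pump requires maintenance every six months",
    "firmware updates are installed over usb",
]:
    store.add(doc)

question = "how do i reset the router"
context = store.search(question, k=1)
# The retrieved text is placed in the prompt for the local LLM.
prompt = f"Context: {context[0]}\nQuestion: {question}\nAnswer:"
print(prompt)
```

Brute-force search like this is often sufficient for small on-device document sets; the listed vector stores matter once the collection no longer fits a linear scan.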
Memory Management
Critical on edge systems.
Strategies:
- Streaming inference
- Sliding context windows
- KV cache compression
- CPU-GPU offloading
Emerging Trends
On-device multimodal AI
Text + Vision + Audio locally.
Federated Fine-Tuning
Edge devices collaboratively fine-tune.
TinyMoE
Mixture-of-Experts for mobile.




