Deployment Techniques for LLMs/SLMs on Edge Devices

Precisionrecalls@gmail.com April 8, 2025

Modern Large Language Models (LLMs) are computationally expensive because they process billions of parameters and long token sequences. To make inference and fine-tuning practical, the ecosystem relies heavily on optimization techniques such as:

KV Caching
Quantization
Flash Attention
LoRA/QLoRA
Speculative Decoding
Tensor Parallelism
Prefix Caching
Mixture of Experts (MoE)
Continuous Batching
Paged Attention

These optimizations reduce:

GPU memory consumption
Latency
Power usage
Training cost
Inference cost

KV Cache (Key-Value Cache)

Problem

Without caching, autoregressive generation recomputes the keys and values of every previous token at each decoding step.

Without caching:

Token 1 computed once
Token 2 recomputes token 1
Token 3 recomputes token 1 and 2
The redundant work grows quadratically with sequence length.

Solution

KV cache stores:

Key matrices
Value matrices

for previously generated tokens.

When a new token arrives:

Only the new token's query, key, and value are computed
Cached keys and values of earlier tokens are reused
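
The reuse described above can be sketched for a single attention head in plain Python. This is a minimal illustration rather than a production kernel: the weight matrices `Wq`, `Wk`, `Wv`, the dimensions, and every function name are made up for the example. It counts how many key/value projections each strategy performs and keeps both sets of outputs so they can be compared.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def generate_without_cache(xs, Wq, Wk, Wv):
    """Re-project keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(xs)):
        keys = [matvec(Wk, x) for x in xs[: t + 1]]
        values = [matvec(Wv, x) for x in xs[: t + 1]]
        projections += t + 1
        outputs.append(attend(matvec(Wq, xs[t]), keys, values))
    return outputs, projections

def generate_with_cache(xs, Wq, Wk, Wv):
    """Project only the new token and append it to the cache."""
    keys, values = [], []          # this pair of lists is the KV cache
    outputs, projections = [], 0
    for x in xs:
        keys.append(matvec(Wk, x))
        values.append(matvec(Wv, x))
        projections += 1
        outputs.append(attend(matvec(Wq, x), keys, values))
    return outputs, projections

def rand_matrix(dim):
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

random.seed(0)
DIM, STEPS = 4, 6
Wq, Wk, Wv = rand_matrix(DIM), rand_matrix(DIM), rand_matrix(DIM)
xs = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(STEPS)]

slow, slow_proj = generate_without_cache(xs, Wq, Wk, Wv)
fast, fast_proj = generate_with_cache(xs, Wq, Wk, Wv)
print(slow_proj, fast_proj)
```

For 6 tokens the uncached loop performs 21 key/value projections (1 + 2 + ... + 6) while the cached loop performs 6, and the attention outputs are the same.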

Benefits

Benefit	Impact
Lower latency	Huge
Faster generation	2x–20x
Lower compute	Significant
Better streaming performance	Excellent

KV Cache Architecture

Input Tokens

↓

Embedding Layer

↓

Transformer Layer

↓

Store K/V tensors in GPU memory

↓

Reuse during next token generation

KV Cache Challenges

1. GPU Memory Explosion

For long contexts:

32k
64k
128k

KV cache becomes massive.

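Formula:

Memory ≈ 2 × Layers × KV Heads × Head Dimension × Sequence Length × Batch Size × Bytes per Element

The factor of 2 accounts for storing both keys and values.

For a 70B model at long context lengths:

KV cache can consume >40GB VRAM alone.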
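
As a rough worked example of this kind of estimate, the sketch below multiplies out the cache size, counting both keys and values and the bytes per stored element. The configuration (80 layers, 8 key/value heads, head dimension 128, 128k tokens, one sequence) is an assumed 70B-class setup chosen for illustration, and `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Assumed 70B-class configuration (illustrative, not taken from the article).
fp16_bytes = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                            seq_len=128 * 1024, bytes_per_elem=2)
int8_bytes = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                            seq_len=128 * 1024, bytes_per_elem=1)

print(f"FP16 cache: {fp16_bytes / 2**30:.0f} GiB")
print(f"INT8 cache: {int8_bytes / 2**30:.0f} GiB")
```

Under these assumptions a single 128k-token sequence needs about 40 GiB in FP16, and storing the cache in INT8 halves that.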

Advanced KV Cache Optimizations

Paged Attention

Used by:

vLLM

Instead of contiguous memory allocation:

Cache divided into fixed-size pages (blocks)
Reduces fragmentation
Enables efficient batching
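
The paging idea can be sketched as a small block allocator. This is a toy model under stated assumptions, not vLLM's implementation: the class name `PagedKVCache`, the block size, and the block-table layout are all hypothetical. Each sequence gets a table of physical block ids, and a new block is claimed only when the previous one is full.

```python
class PagedKVCache:
    """Toy block allocator: sequences own lists of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one token; return (block id, offset in block)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:      # first token or last block full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]  # needs 2 blocks
cache.append_token("b")                                # needs 1 block
free_before = len(cache.free_blocks)
cache.free("a")                                        # blocks become reusable
free_after = len(cache.free_blocks)
```

Because blocks do not have to be contiguous, memory freed by one finished request can be handed to any other request immediately.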

Prefix Caching

Reusable system prompts are cached.

Useful for:

Chatbots
Enterprise copilots
Repeated templates
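
A minimal sketch of the idea, assuming the cache is keyed by a hash of the prefix tokens. `PrefixCache` and `fake_prefill` are hypothetical names, and the "KV state" here is a placeholder dictionary rather than real tensors.

```python
import hashlib

class PrefixCache:
    """Maps a token prefix to its precomputed KV state."""

    def __init__(self, compute_kv):
        self.compute_kv = compute_kv  # hypothetical callable: tokens -> KV state
        self.store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(tokens):
        joined = ",".join(str(t) for t in tokens)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def get(self, prefix_tokens):
        key = self._key(prefix_tokens)
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.compute_kv(prefix_tokens)
        return self.store[key]

prefill_calls = []

def fake_prefill(tokens):
    """Stand-in for the expensive prefill pass over the system prompt."""
    prefill_calls.append(len(tokens))
    return {"num_tokens": len(tokens)}

cache = PrefixCache(fake_prefill)
system_prompt = [101, 7, 42, 9, 13]  # made-up token ids
for _ in range(3):                   # three requests share one system prompt
    kv_state = cache.get(system_prompt)
```

Only the first request pays for the prefill; the next two reuse the stored state.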

Sliding Window Attention

Only the keys and values of the most recent tokens are retained; older entries are evicted.

Good for:

Streaming applications
Edge inference
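
One simple way to picture the eviction is a fixed-length queue. The sketch below uses `collections.deque` with `maxlen`; the class name and the string placeholders standing in for key/value tensors are assumptions for illustration.

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps K/V entries for only the most recent `window` tokens."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        # When the deque is full, the oldest entry is dropped automatically.
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

cache = SlidingWindowKVCache(window=4)
for position in range(10):
    cache.append(f"k{position}", f"v{position}")

print(list(cache.keys))
```

After 10 tokens with a window of 4, only the entries for positions 6 to 9 remain, so memory stays constant regardless of how long generation runs.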

Quantized KV Cache

KV tensors themselves are quantized:

FP16 → INT8 or INT4

Reduces the cache's memory footprint to roughly half (INT8) or a quarter (INT4) of FP16.

Quantization

What is Quantization?

Reducing the numerical precision of weights (and sometimes activations):

Precision	Memory (relative to FP32)
FP32	100%
FP16	50%
INT8	25%
INT4	12.5%
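
To make the table concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization together with a weight-size estimate. It is one simple scheme chosen for illustration, not the method used by any particular library, and the function names and sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [v * scale for v in quantized]

def weight_bytes(num_params, bits):
    """Storage needed for the weights alone at a given bit width."""
    return num_params * bits // 8

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_bytes(7_000_000_000, bits) / 1e9:.1f} GB")
```

For a 7B-parameter model this works out to 28 GB at FP32, 14 GB at FP16, 7 GB at INT8, and 3.5 GB at INT4, ignoring the small overhead of scales and other metadata. The rounding error per weight is at most half of `scale`.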

Types of Quantization

Post Training Quantization (PTQ)

The model is trained normally, then quantized afterwards without retraining.

Advantages:

Fast
Cheap
Easy

Disadvantages:

Accuracy drop possible

Quantization Aware Training (QAT)

Quantization is simulated during training so the weights adapt to the reduced precision.

Advantages:

Better accuracy

Disadvantages:

Expensive training

Popular Quantization Algorithms

GPTQ

One-shot quantization
Widely used
Efficient for inference

AWQ (Activation-aware Weight Quantization)

Uses activation statistics to find the most important weight channels and protects them during quantization.

Improves:

Accuracy
Stability

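GGUF

A model file format rather than an algorithm; it stores quantized weights and metadata in a single file.

Used heavily in: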

Ollama
llama.cpp ecosystem

Ideal for:

CPU inference
Edge devices

BitsAndBytes

Popular library integrated with Hugging Face Transformers.

Supports:

8-bit loading
4-bit loading
QLoRA

QLoRA

QLoRA combines:

4-bit quantization
LoRA adapters

Enables fine-tuning:

7B models on a single consumer GPU
70B-class models on far less hardware than full fine-tuning requires

Core idea:

Freeze quantized base model
Train small low-rank adapters

LoRA (Low Rank Adaptation)

Instead of updating the full weight matrix, LoRA learns a low-rank update:

W' = W + ΔW, with ΔW = B × A

Where:

W is frozen
B and A are small trainable matrices of rank r

Benefits:

Tiny trainable params
Faster fine-tuning
Low VRAM usage
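
A small numeric sketch of the low-rank update, written as a frozen matrix plus the product of two small trainable matrices. The matrices `A` and `B`, the `alpha / rank` scaling, and the example shapes are assumptions for illustration (the scaling follows a common convention in LoRA implementations).

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def full_param_count(d_out, d_in):
    return d_out * d_in

def lora_param_count(d_out, d_in, rank):
    return rank * (d_in + d_out)

W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen 2 x 3 weight
A = [[0.5, -0.5, 1.0]]                   # rank-1 down-projection (1 x 3)
B_init = [[0.0], [0.0]]                  # zero init: adapter starts as a no-op
B_trained = [[1.0], [-1.0]]              # pretend values after training
x = [1.0, 2.0, 3.0]

y_start = lora_forward(W, A, B_init, x, alpha=2.0, rank=1)
y_adapted = lora_forward(W, A, B_trained, x, alpha=2.0, rank=1)

full = full_param_count(4096, 4096)
lora = lora_param_count(4096, 4096, rank=8)
print(f"trainable fraction: {lora / full:.2%}")
```

For a 4096 × 4096 layer with rank 8, the adapter trains 65,536 parameters instead of 16,777,216, which is about 0.39% of the full matrix.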

Flash Attention

Attention optimized using:

Kernel fusion
Tiling
Keeping tiles in fast on-chip GPU SRAM

Benefits:

Faster attention
Less memory (the full attention matrix is never written to GPU main memory)
Longer contexts
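
The tiling part can be illustrated on the CPU with an online softmax that processes keys and values one tile at a time. This sketch only shows the arithmetic trick; it does not model kernel fusion or real SRAM management, and the function names, tile size, and random inputs are assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference: compute every score first, then softmax and weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    """Process keys/values tile by tile with a running (online) softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * dot(q, k) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

random.seed(1)
DIM, SEQ = 4, 7
q = [random.uniform(-1, 1) for _ in range(DIM)]
keys = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(SEQ)]
values = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(SEQ)]

reference = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=3)
```

The tiled version only ever holds one tile of scores at a time, yet it returns the same result as the reference up to floating-point rounding.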

Speculative Decoding

Two models:

Small draft model
Large verifier model

The draft model proposes several tokens quickly.
The large model verifies them in a single forward pass and keeps the accepted prefix.

Benefits:

Up to 2x–5x inference speedup, depending on how often the drafted tokens are accepted
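
The draft-and-verify loop can be sketched with two toy next-token functions. This is the greedy variant only (sampling-based variants use an acceptance rule instead of exact matching), and `target_next`, `draft_next`, and the arithmetic they use are invented stand-ins for real models.

```python
def target_next(context):
    """Stand-in for the large model: a deterministic toy rule."""
    return (context[-1] * 3 + 1) % 10

def draft_next(context):
    """Stand-in for the small model: agrees with the target except after a 4."""
    return 0 if context[-1] == 4 else target_next(context)

def greedy_generate(prompt, max_new_tokens):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens

def speculative_generate(prompt, max_new_tokens, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            proposal.append(ctx[-1])
        # 2. The target checks all k positions. A real system would score them
        #    in one batched forward pass, so this counts as a single pass.
        target_passes += 1
        accepted = []
        for i, tok in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if expected != tok:
                accepted.append(expected)  # fix the first mismatch, then stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the same pass yields one extra token.
            accepted.append(target_next(tokens + proposal))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes

baseline = greedy_generate([1], 12)
speculative, passes = speculative_generate([1], 12, k=4)
```

The speculative output is identical to plain greedy decoding with the target model, but it needs fewer target passes than the 12 that one-token-at-a-time decoding would take.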

Continuous Batching

Requests join and leave the running batch at every decoding step instead of waiting for the whole batch to finish.

Used by:

NVIDIA TensorRT-LLM
Anyscale Ray Serve
vLLM

Improves:

Throughput
GPU utilization
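
A small simulation shows why this helps throughput. It compares static batching (a batch runs until its longest request finishes) with continuous batching (a free slot is refilled immediately). The request lengths and both function names are made up, and each "step" stands for one decoding iteration that produces one token per running request.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch occupies the GPU until its longest request is done."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave and waiting requests join at every step."""
    waiting = deque(lengths)
    running = []   # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decoding step emits one token per running request
        running = [r - 1 for r in running if r > 1]
    return steps

# Output lengths (in tokens) of six hypothetical requests, each at least 1.
request_lengths = [2, 8, 3, 8, 2, 2]
static_steps = static_batching_steps(request_lengths, batch_size=2)
continuous_steps = continuous_batching_steps(request_lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With these lengths and two slots, static batching needs 18 steps while continuous batching needs 13, because short requests no longer leave a slot idle while a long one finishes.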

Fine-Tuning Optimizations

Gradient Checkpointing

Stores only a subset of activations during the forward pass.
Recomputes the rest during the backward pass.

Tradeoff:

Lower memory
More compute
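
The tradeoff can be seen on a toy chain of scalar layers. The sketch below computes the same gradient twice: once storing every activation, and once storing only segment boundaries and recomputing inside each segment. The layer function, weights, and segment length are arbitrary choices for illustration.

```python
import math

def layer(x, w):
    return math.tanh(w * x)

def layer_grad(x, w):
    """Derivative of layer(x, w) with respect to x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def backward_full(x, weights):
    """Store every activation, then backpropagate."""
    acts = [x]
    for w in weights:
        acts.append(layer(acts[-1], w))
    grad = 1.0
    for i in reversed(range(len(weights))):
        grad *= layer_grad(acts[i], weights[i])
    return grad, len(acts)

def backward_checkpointed(x, weights, segment):
    """Store only segment inputs; recompute the rest during backward."""
    checkpoints = {}
    a = x
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = a
        a = layer(a, w)
    grad = 1.0
    peak_stored = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, len(weights))
        acts = [checkpoints[start]]          # input of the segment's first layer
        for w in weights[start:end - 1]:     # recompute inputs of later layers
            acts.append(layer(acts[-1], w))
        peak_stored = max(peak_stored, len(checkpoints) + len(acts))
        for i in reversed(range(start, end)):
            grad *= layer_grad(acts[i - start], weights[i])
    return grad, peak_stored

weights = [0.9, -1.1, 0.7, 1.3] * 4          # 16 toy layers
grad_full, stored_full = backward_full(0.5, weights)
grad_ckpt, stored_ckpt = backward_checkpointed(0.5, weights, segment=4)
```

Both runs give the same gradient, but the checkpointed run holds at most 8 values at once instead of 17, at the cost of repeating most of the forward computation.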

ZeRO Optimization

Developed by:

Microsoft DeepSpeed

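Stages:

ZeRO-1: partitions optimizer states
ZeRO-2: partitions optimizer states and gradients
ZeRO-3: partitions optimizer states, gradients, and parameters

Each stage splits the listed training state across GPUs.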
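
A back-of-the-envelope estimate of per-GPU memory for each stage, assuming mixed-precision training with Adam: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for FP32 optimizer state. The byte counts are assumptions about the training setup, `zero_bytes_per_gpu` is a hypothetical helper, and activations and buffers are ignored.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Model-state memory per GPU; stage 0 means plain data parallelism."""
    params = 2 * num_params   # FP16 weights
    grads = 2 * num_params    # FP16 gradients
    optim = 12 * num_params   # FP32 master weights + Adam momentum and variance
    if stage >= 1:
        optim /= num_gpus     # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus     # ZeRO-2: also partition gradients
    if stage >= 3:
        params /= num_gpus    # ZeRO-3: also partition parameters
    return params + grads + optim

for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs, these assumptions give roughly 120 GB, 31.4 GB, 16.6 GB, and 1.9 GB per GPU for stages 0 through 3.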

Best Production Stack

Component	Recommended
Inference Engine	vLLM
Quantization	AWQ/GPTQ
Fine-Tuning	QLoRA
Attention	Flash Attention 2
Serving	TensorRT-LLM
CPU Edge	llama.cpp

Edge AI means running models locally on:

Phones
IoT devices
Embedded systems
Raspberry Pi
Automotive systems
Industrial gateways

Goals:

Low latency
Offline capability
Data privacy
Reduced cloud cost

Challenges

Challenge	Description
Low RAM	Mobile and embedded devices have limited memory
Limited compute	No datacenter-class GPUs
Thermal constraints	Sustained load causes heat throttling
Battery usage	Power budget is critical
Storage	Model files are large

Small Language Models (SLMs)

Examples:

Microsoft Phi
Google Gemma
Meta Llama 3 8B
TinyLlama
Mistral 7B

SLMs dominate edge deployment.

Deployment Architectures

1. Fully Local

Everything on device.

Pros:

Privacy
Offline

Cons:

Limited performance

2. Hybrid Edge-Cloud

Edge handles:

Embeddings
Intent classification

Cloud handles:

Heavy generation

Best for enterprise systems.

3. Split Computing

Layers split between:

Device
Edge gateway
Cloud

Emerging architecture.

Optimization Techniques

Quantization

Most important optimization.

Typical:

4-bit GGUF
INT8
Mixed precision

Model Distillation

Teacher → Student model.

Large model knowledge compressed into smaller model.

Benefits:

Faster inference
Lower RAM
Better edge suitability
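
One common way to express the teacher-to-student transfer is a loss on temperature-softened output distributions. The sketch below computes the KL divergence between teacher and student softmax outputs, scaled by the squared temperature; the logits and the temperature value are made up, and real training would usually combine this with the ordinary task loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher values give softer distributions."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # close to the teacher
poor_student = [0.5, 4.0, 1.0]   # prefers a different class

loss_good = distillation_loss(good_student, teacher)
loss_poor = distillation_loss(poor_student, teacher)
loss_same = distillation_loss(teacher, teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the student's distribution drifts away from it.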

ONNX Runtime

Runs models exported to the ONNX graph format and applies graph-level optimizations.

Supports:

CPU acceleration
Mobile acceleration
Hardware-specific kernels

TensorRT

Optimized for:

NVIDIA Jetson
Edge GPUs

Performs:

Kernel fusion
Memory optimization
Precision tuning

Core ML

Used for:

iOS deployment

Optimized for:

Apple Neural Engine

Android Deployment

Typical stack:

TensorFlow Lite
ONNX Runtime Mobile
llama.cpp Android

llama.cpp

One of the most important edge inference frameworks.

Features:

CPU optimized
Metal acceleration
Vulkan support
Quantized GGUF models

Ideal for:

Raspberry Pi
Macs
Offline assistants

Edge RAG

Common architecture:

Documents

↓

Local Embedding Model

↓

Lightweight Vector DB

↓

Local LLM

Vector DBs:

Chroma
FAISS
SQLite-VSS
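
The pipeline above can be sketched end to end in plain Python. Everything here is a toy stand-in: the bag-of-words `embed` function replaces a real local embedding model, `TinyVectorStore` replaces a vector database, the documents are invented, and the final call to a local LLM is left out, so the sketch stops at building the prompt.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a real system would use a local model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TinyVectorStore:
    """In-memory stand-in for a lightweight vector database."""

    def __init__(self):
        self.items = []

    def add(self, text):
        self.items.append((embed(text), text))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]

def build_prompt(question, store, k=1):
    """Retrieve context and assemble the prompt for the local LLM."""
    context = "\n".join(store.search(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

store = TinyVectorStore()
store.add("Hold the reset button for ten seconds to reset the router.")
store.add("The battery lasts eight hours on a full charge.")
store.add("Firmware updates are installed from the settings menu.")

question = "How do I reset the router?"
top_hit = store.search(question)[0]
prompt = build_prompt(question, store)
```

All three stages run on the device, so documents and queries never leave it; only the embedding model and the store would need to be swapped for real components.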

Memory Management

Critical on edge systems.

Strategies:

Streaming inference
Sliding context windows
KV cache compression
CPU-GPU offloading

Emerging Trends

On-device multimodal AI

Text + Vision + Audio locally.

Federated Fine-Tuning

Edge devices collaboratively fine-tune.

TinyMoE

Mixture-of-Experts for mobile.
