Deployment Techniques for LLMs/SLMs on Edge Devices

Modern Large Language Models (LLMs) are computationally expensive because they process billions of parameters and long token sequences. To make inference and fine-tuning practical, the ecosystem relies heavily on optimization techniques such as:

  • KV Caching
  • Quantization
  • Flash Attention
  • LoRA/QLoRA
  • Speculative Decoding
  • Tensor Parallelism
  • Prefix Caching
  • Mixture of Experts (MoE)
  • Continuous Batching
  • Paged Attention

These optimizations reduce:

  • GPU memory consumption
  • Latency
  • Power usage
  • Training cost
  • Inference cost

KV Cache (Key-Value Cache)

Problem

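Without a cache, each autoregressive decoding step recomputes the key and value projections for every previous token before attending over them.

Without caching:

  • Step 1 computes K/V for token 1
  • Step 2 recomputes K/V for token 1, then computes token 2
  • Step 3 recomputes K/V for tokens 1 and 2, then computes token 3

The total work spent on keys and values grows quadratically with sequence length.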

Solution

KV cache stores:

  • Key matrices
  • Value matrices

for previously generated tokens.

When a new token arrives:

  • Only the new token's query, key, and value are computed
  • Cached keys and values for earlier tokens are reused
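
To make the reuse concrete, here is a minimal single-head sketch in plain Python. The `project` helper is a hypothetical stand-in for the learned Q/K/V projections (a real model uses weight matrices), and the class and function names are illustrative only. It checks that decoding with a cache produces the same outputs as recomputing K and V for the whole prefix at every step.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

def project(x, scale):
    """Hypothetical stand-in for a learned linear projection (W_q, W_k or W_v)."""
    return [scale * xi for xi in x]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Process one new token embedding, reusing K/V of earlier tokens."""
        self.keys.append(project(x, 0.5))    # K for the new token only
        self.values.append(project(x, 2.0))  # V for the new token only
        return attend(project(x, 1.0), self.keys, self.values)

def step_without_cache(xs):
    """Recompute K and V for the whole prefix, as an uncached decoder would."""
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(project(xs[-1], 1.0), keys, values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kv_cache = KVCache()
cached_out = [kv_cache.step(x) for x in tokens]
uncached_out = [step_without_cache(tokens[: i + 1]) for i in range(len(tokens))]
```

The cached path does one projection per step, while the uncached path does one per token in the prefix, which is where the savings come from.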

Benefits

Benefit | Impact
Lower latency | Huge
Faster generation | 2x–20x
Lower compute | Significant
Better streaming performance | Excellent

KV Cache Architecture

Input Tokens

    ↓

Embedding Layer

    ↓

Transformer Layer

    ↓

Store K/V tensors in GPU memory

    ↓

Reuse during next token generation


KV Cache Challenges

1. GPU Memory Explosion

For long contexts:

  • 32k
  • 64k
  • 128k

KV cache becomes massive.

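Formula:

Memory ≈ 2 (K and V) × Layers × KV Heads × Head Dimension × Sequence Length × Batch Size × Bytes per Element

For a 70B model:

  • KV cache can consume >40GB VRAM alone at a 128k context (FP16, single sequence, assuming 80 layers, 8 KV heads, and head dimension 128).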
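
The memory estimate can be worked through as a short calculation. This sketch counts both K and V and multiplies by the bytes per element. The configuration (80 layers, 8 KV heads, head dimension 128, FP16) is an assumption modeled on a Llama-style 70B model with grouped-query attention; other architectures will give different numbers.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to hold K and V for every layer, head and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Llama-style 70B configuration (not taken from the article).
config_70b = dict(layers=80, kv_heads=8, head_dim=128)

for ctx in (32_768, 65_536, 131_072):
    gib = kv_cache_bytes(seq_len=ctx, **config_70b) / 2**30
    print(f"{ctx:>7} tokens: {gib:.0f} GiB per sequence")
```

Under these assumptions a single 128k-token sequence needs about 40 GiB, and the figure scales linearly with batch size.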

Advanced KV Cache Optimizations

Paged Attention

Used by:

  • vLLM

Instead of contiguous memory allocation:

  • Cache divided into pages
  • Reduces fragmentation
  • Enables efficient batching
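
The paging idea can be sketched as a small allocator. This is a toy model of the bookkeeping only, not vLLM's actual implementation: the class name, the page size, and the string sequence ids are all hypothetical. Each sequence gets a block table mapping its logical positions to fixed-size physical pages, so memory is claimed one page at a time instead of as one contiguous region.

```python
class PagedKVAllocator:
    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # seq_id -> list of physical page ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one token; return (physical_page, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.page_size == 0:  # no page yet, or last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.page_size

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_pages=4, page_size=2)
for _ in range(3):
    alloc.append_token("a")  # 3 tokens -> 2 pages
alloc.append_token("b")      # 1 token  -> 1 page
pages_in_use = 4 - len(alloc.free_pages)
alloc.release("a")           # pages become reusable by any other sequence
```

At most one partially filled page per sequence is wasted, which is why paging reduces fragmentation.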

Prefix Caching

Reusable system prompts are cached.

Useful for:

  • Chatbots
  • Enterprise copilots
  • Repeated templates
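
A minimal sketch of the idea, assuming the cache is keyed by a hash of the prefix token ids. `fake_prefill` is a hypothetical stand-in for the expensive prefill pass that would normally produce the K/V tensors. Production engines typically cache at the granularity of KV blocks rather than whole prompts.

```python
import hashlib

class PrefixCache:
    def __init__(self):
        self.store = {}
        self.hits = 0

    @staticmethod
    def key(token_ids):
        return hashlib.sha256(repr(tuple(token_ids)).encode()).hexdigest()

    def get_or_compute(self, prefix_ids, compute_kv):
        k = self.key(prefix_ids)
        if k in self.store:
            self.hits += 1                         # shared prefix: skip prefill
        else:
            self.store[k] = compute_kv(prefix_ids)  # first request pays the cost
        return self.store[k]

def fake_prefill(token_ids):
    """Hypothetical stand-in for computing K/V over the prompt."""
    return [(t * 0.5, t * 2.0) for t in token_ids]

system_prompt = [101, 7, 42, 9]
prefix_cache = PrefixCache()
kv_first = prefix_cache.get_or_compute(system_prompt, fake_prefill)
kv_second = prefix_cache.get_or_compute(system_prompt, fake_prefill)
```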

Sliding Window Attention

Only recent tokens retained.

Good for:

  • Streaming applications
  • Edge inference
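
As a rough illustration, a bounded deque is enough to model a cache that keeps only the most recent positions. The class name and window size are hypothetical, and the tuples stand in for real key and value tensors.

```python
from collections import deque

class SlidingWindowKV:
    def __init__(self, window):
        # deque(maxlen=...) silently drops the oldest entry once full
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

window_cache = SlidingWindowKV(window=4)
for pos in range(10):
    window_cache.append(("k", pos), ("v", pos))

retained = [pos for _, pos in window_cache.keys]
```

Memory stays constant no matter how long generation runs, at the cost of losing access to tokens that fall outside the window.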

Quantized KV Cache

KV tensors themselves are quantized:

  • FP16 → INT8 → INT4

Reduces memory footprint significantly.


Quantization

What is Quantization?

Reducing the numerical precision of weights:

Precision | Memory (relative to FP32)
FP32 | 100%
FP16 | 50%
INT8 | 25%
INT4 | 12.5%
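
To put the percentages in absolute terms, here is a small calculation for a model with 7 billion parameters (the parameter count is chosen for illustration). It counts weights only and ignores the small overhead of quantization scales and zero points.

```python
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_gb(num_params, precision):
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return num_params * BITS[precision] / 8 / 1e9

for name in BITS:
    print(f"{name}: {weight_gb(7e9, name):.1f} GB")
```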

Types of Quantization

Post Training Quantization (PTQ)

Model trained normally then compressed.

Advantages:

  • Fast
  • Cheap
  • Easy

Disadvantages:

  • Accuracy drop possible
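
One of the simplest post-training schemes is symmetric absmax quantization to INT8, sketched below with one scale for the whole tensor. This is an illustrative baseline under that assumption, not what GPTQ or AWQ do; practical methods usually use per-channel or per-group scales. The weight values are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error per weight is bounded by half the scale, which is the source of the possible accuracy drop.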

Quantization Aware Training (QAT)

Quantization simulated during training.

Advantages:

  • Better accuracy

Disadvantages:

  • Expensive training

Popular Quantization Algorithms

GPTQ

  • One-shot quantization
  • Widely used
  • Efficient for inference

AWQ (Activation-aware Weight Quantization)

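Uses activation statistics to identify the most salient weight channels and scales them so they lose less precision during quantization.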

Improves:

  • Accuracy
  • Stability

GGUF

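A model file format rather than an algorithm: it stores quantized weights and metadata in a single file.

Used heavily in: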

  • Ollama
  • llama.cpp ecosystem

Ideal for:

  • CPU inference
  • Edge devices

BitsAndBytes

Popular quantization library integrated with Hugging Face Transformers.

Supports:

  • 8-bit loading
  • 4-bit loading
  • QLoRA

QLoRA

QLoRA combines:

  • 4-bit quantization
  • LoRA adapters

Enables fine-tuning:

  • 7B models on a single consumer GPU
  • 70B models with limited hardware

Core idea:

  • Freeze quantized base model
  • Train small low-rank adapters

LoRA (Low Rank Adaptation)

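Instead of updating the full weight matrix, LoRA learns a low-rank update:

W' = W + ΔW, where ΔW = B·A

Where:

  • W is frozen
  • B and A are small trainable matrices of rank r, with r much smaller than the dimensions of W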

Benefits:

  • Tiny trainable params
  • Faster fine-tuning
  • Low VRAM usage
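
A small sketch of the arithmetic, writing the update as ΔW = B·A with B of shape d_out × r and A of shape r × d_in. The matrices, the 4096 × 4096 layer size, and the rank of 8 are illustrative assumptions. In practice B typically starts at zero so that training begins from the unmodified base model.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA trainable params) for one matrix."""
    return d_out * d_in, rank * (d_out + d_in)

# Frozen 4x4 weight and a rank-1 update: B is 4x1, A is 1x4.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[0.1], [0.0], [0.0], [0.2]]
A = [[1.0, 2.0, 0.0, 0.0]]
delta_W = matmul(B, A)
W_adapted = [[w + d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta_W)]

full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"trainable: {lora:,} of {full:,} ({100 * lora / full:.2f}%)")
```

For this layer size the adapter trains well under one percent of the parameters that full fine-tuning would update.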

Flash Attention

Attention optimized using:

  • Kernel fusion
  • Tiling
  • GPU SRAM optimization

Benefits:

  • Faster attention
  • Less memory (the full attention matrix is never materialized)
  • Longer contexts
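
The tiling trick rests on an online softmax: keys and values are processed one tile at a time while a running maximum and running sum are maintained, so the full score matrix never has to exist. The sketch below shows this for a single query in plain Python and checks it against a naive implementation. It illustrates the numerical idea only, not the fused GPU kernel; the tile size and input values are arbitrary.

```python
import math

def naive_attention(q, keys, values):
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    """Process K/V tile by tile with a running max and running softmax sum."""
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile, v_tile = keys[start:start + tile], values[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in k_tile]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first tile
        running_sum *= rescale                     # re-base earlier tiles
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
naive_out = naive_attention(q, keys, values)
tiled_out = tiled_attention(q, keys, values, tile=2)
```

Both functions compute exact attention; only the order of operations and the amount of intermediate storage differ.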

Speculative Decoding

Two models:

  • Small draft model
  • Large verifier model

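The draft model proposes several tokens quickly.
The large model verifies them together in a single forward pass and keeps the longest prefix it agrees with.

Benefits:

  • Roughly 2x–5x inference speedup, depending on how often draft tokens are accepted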
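
Here is a toy sketch of the greedy variant. Both "models" are hypothetical deterministic functions rather than neural networks, and the draft is rigged to disagree with the target at every fourth position. Real systems verify all draft positions in one batched forward pass and, with sampling, use a probabilistic acceptance rule; this simplified version also omits the bonus token a real verifier yields when every draft token is accepted.

```python
def target_next(context):
    """Hypothetical large model: greedy next token for a context."""
    return (sum(context) * 31 + len(context)) % 50

def draft_next(context):
    """Hypothetical small model: agrees with the target most of the time."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50

def greedy_decode(prompt, num_tokens):
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_next(out))
    return out

def speculative_decode(prompt, num_tokens, k=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < num_tokens:
        # 1. The draft proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. The target checks all k positions (one batched pass in a real engine).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(out + draft[:i])
            accepted.append(expected)  # the target's own token is always kept
            if expected != draft[i]:   # first mismatch ends this round
                break
        out.extend(accepted)
    return out[:len(prompt) + num_tokens], target_passes

spec_tokens, passes = speculative_decode([1, 2, 3], num_tokens=20)
baseline_tokens = greedy_decode([1, 2, 3], num_tokens=20)
```

The output matches plain greedy decoding with the large model, but it needs fewer large-model passes than tokens generated.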

Continuous Batching

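Requests are merged into the running batch dynamically: new requests join at each decoding iteration, and finished ones leave without waiting for the rest of the batch.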

Used by:

  • NVIDIA TensorRT-LLM
  • Anyscale Ray Serve
  • vLLM

Improves:

  • Throughput
  • GPU utilization
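
The scheduling difference can be simulated in a few lines. This sketch counts decode iterations only and assumes every request needs at least one token; the request ids and lengths are made up, and real schedulers also handle prefill and memory limits.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """requests: list of (request_id, tokens_to_generate). Returns finish step per id."""
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit new requests into free slots at every iteration.
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        step += 1  # one decode iteration for the whole batch
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:  # finished sequences leave immediately
                del running[rid]
                finished_at[rid] = step
    return finished_at

def static_batching(requests, max_batch):
    finished_at, step = {}, 0
    for i in range(0, len(requests), max_batch):
        batch = requests[i:i + max_batch]
        step += max(n for _, n in batch)  # the batch waits for its longest request
        for rid, _ in batch:
            finished_at[rid] = step
    return finished_at

reqs = [("a", 2), ("b", 8), ("c", 3), ("d", 2)]
cont = continuous_batching(reqs, max_batch=2)
stat = static_batching(reqs, max_batch=2)
```

In this example the continuous scheduler finishes all four requests in 8 iterations, while static batching needs 11 because short requests are held back by the long one.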

Fine-Tuning Optimizations

Gradient Checkpointing

Stores fewer activations.
Recomputes during backward pass.

Tradeoff:

  • Lower memory
  • More compute
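
The tradeoff can be shown on a toy chain of scalar layers, x_{i+1} = tanh(w_i · x_i), with the gradient of the final output taken with respect to the input. The checkpointed version keeps only every few activations and recomputes each segment during the backward pass. This is a sketch of the idea, not how a deep learning framework implements it; the layer function, weights, and segment length are illustrative.

```python
import math

def layer(w, x):
    return math.tanh(w * x)

def grad_full(ws, x0):
    """Store every activation, then backpropagate."""
    acts = [x0]
    for w in ws:
        acts.append(layer(w, acts[-1]))
    g = 1.0
    for w, out in zip(reversed(ws), reversed(acts[1:])):
        g *= w * (1.0 - out * out)  # derivative of tanh(w * x) w.r.t. x
    return g, len(acts)

def grad_checkpointed(ws, x0, every):
    """Store only segment boundaries; recompute the rest when needed."""
    checkpoints = {0: x0}
    x = x0
    for i, w in enumerate(ws):
        x = layer(w, x)
        if (i + 1) % every == 0 and i + 1 < len(ws):
            checkpoints[i + 1] = x
    g, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(ws))
        seg = [checkpoints[start]]
        for w in ws[start:end]:  # recompute this segment's activations
            seg.append(layer(w, seg[-1]))
        peak = max(peak, len(checkpoints) + len(seg) - 1)
        for w, out in zip(reversed(ws[start:end]), reversed(seg[1:])):
            g *= w * (1.0 - out * out)
    return g, peak

ws = [0.9, -1.1, 0.7, 1.3, -0.8, 1.05, 0.95, -1.2]
g_full, stored_full = grad_full(ws, 0.4)
g_ckpt, stored_ckpt = grad_checkpointed(ws, 0.4, every=3)
```

Both functions return the same gradient, but the checkpointed one holds fewer activations at its peak and pays for it with a second forward pass over each segment.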

ZeRO Optimization

Developed by:

  • Microsoft DeepSpeed

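Stages:

  • ZeRO-1: partitions optimizer states
  • ZeRO-2: also partitions gradients
  • ZeRO-3: also partitions parameters

Each stage shards its state across data-parallel GPUs instead of replicating it on every GPU.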
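
A rough per-GPU memory model helps show what each stage buys: ZeRO-1 shards optimizer states, ZeRO-2 additionally shards gradients, and ZeRO-3 additionally shards parameters. The byte counts assume mixed-precision Adam (2 bytes per parameter for FP16 weights, 2 for FP16 gradients, 12 for FP32 optimizer state) and ignore activations and buffers. The model size and GPU count are illustrative.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stages 0 to 3."""
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3: also shard parameters
    return params + grads + optim

for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, num_gpus=64, stage=stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions a 7.5B-parameter model drops from 120 GB of model state per GPU with plain data parallelism to under 2 GB per GPU with ZeRO-3 on 64 GPUs.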


Best Production Stack

Component | Recommended
Inference Engine | vLLM
Quantization | AWQ/GPTQ
Fine-Tuning | QLoRA
Attention | Flash Attention 2
Serving | TensorRT-LLM
CPU Edge | llama.cpp

Edge AI means running models locally on:

  • Phones
  • IoT devices
  • Embedded systems
  • Raspberry Pi
  • Automotive systems
  • Industrial gateways

Goals:

  • Low latency
  • Offline capability
  • Data privacy
  • Reduced cloud cost

Challenges

Challenge | Description
Low RAM | Mobile devices limited
Limited compute | No datacenter GPUs
Thermal constraints | Heat throttling
Battery usage | Power critical
Storage | Models huge

Small Language Models (SLMs)

Examples:

  • Microsoft Phi
  • Google Gemma
  • Meta Llama 3 8B
  • TinyLlama
  • Mistral 7B

SLMs dominate edge deployment.


Deployment Architectures

1. Fully Local

Everything on device.

Pros:

  • Privacy
  • Offline

Cons:

  • Limited performance

2. Hybrid Edge-Cloud

Edge handles:

  • Embeddings
  • Intent classification

Cloud handles:

  • Heavy generation

Best for enterprise systems.


3. Split Computing

Layers split between:

  • Device
  • Edge gateway
  • Cloud

Emerging architecture.


Optimization Techniques

Quantization

Most important optimization.

Typical:

  • 4-bit GGUF
  • INT8
  • Mixed precision

Model Distillation

Teacher → Student model.

Large model knowledge compressed into smaller model.

Benefits:

  • Faster inference
  • Lower RAM
  • Better edge suitability
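
A common way to transfer the teacher's knowledge is to train the student to match the teacher's softened output distribution. The sketch below computes that loss for a single example as a temperature-scaled KL divergence. The logits and the temperature of 2.0 are illustrative assumptions, and in practice this term is usually combined with the ordinary cross-entropy loss on the true labels.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]
loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
```

A student whose logits track the teacher's gets a loss near zero, while one that ranks the classes differently is penalized heavily.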

ONNX Runtime

Runs models exported to the ONNX graph format and applies graph-level optimizations.

Supports:

  • CPU acceleration
  • Mobile acceleration
  • Hardware-specific kernels

TensorRT

Optimized for:

  • NVIDIA Jetson
  • Edge GPUs

Performs:

  • Kernel fusion
  • Memory optimization
  • Precision tuning

CoreML

Used for:

  • iOS deployment

Optimized for:

  • Apple Neural Engine

Android Deployment

Typical stack:

  • TensorFlow Lite
  • ONNX Runtime Mobile
  • llama.cpp Android

llama.cpp

One of the most important edge inference frameworks.

Features:

  • CPU optimized
  • Metal acceleration
  • Vulkan support
  • Quantized GGUF models

Ideal for:

  • Raspberry Pi
  • Macs
  • Offline assistants

Edge RAG

Common architecture:

Documents

   ↓

Local Embedding Model

   ↓

Lightweight Vector DB

   ↓

Local LLM

Vector DBs:

  • Chroma
  • FAISS
  • SQLite-VSS
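
The pipeline above can be sketched end to end with toy components. Here `embed` is a hypothetical bag-of-words stand-in for a local embedding model, `TinyVectorStore` is an in-memory list rather than Chroma, FAISS, or SQLite-VSS, and the final prompt string is what would be handed to the local LLM. The documents are invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for a local embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    def __init__(self):
        self.items = []  # list of (text, vector)

    def add(self, text):
        self.items.append((text, embed(text)))

    def search(self, query, k=1):
        qv = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(qv, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore()
for doc in ["reset router by holding power button",
            "the battery lasts ten hours on a full charge",
            "firmware updates install over wifi"]:
    store.add(doc)

question = "how long does the battery last"
context = store.search(question, k=1)
prompt = f"Context: {context[0]}\nQuestion: {question}"
```

Swapping in a real embedding model and vector index changes the quality of retrieval but not the shape of the flow.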

Memory Management

Critical on edge systems.

Strategies:

  • Streaming inference
  • Sliding context windows
  • KV cache compression
  • CPU-GPU offloading

Emerging Trends

On-device multimodal AI

Text + Vision + Audio locally.

Federated Fine-Tuning

Edge devices collaboratively fine-tune.

TinyMoE

Mixture-of-Experts for mobile.

