Zero-Egress RAG: Running Llama on Local Silicon
For certain organizations, cloud is not an option. Not "private cloud." Not "single-tenant." Zero cloud. Zero internet. Zero packets leaving the building.
This is not paranoia. It is a rational response to specific threat models:
- Defense contractors bound by ITAR and classified information handling requirements
- Engineering firms protecting proprietary designs and trade secrets
- Legal practices maintaining attorney-client privilege for high-stakes litigation
- Family offices protecting multi-generational wealth intelligence
- M&A advisory handling material non-public information
For these organizations, this guide provides the technical blueprint for deploying production-grade RAG systems entirely on local hardware.
The Zero-Egress Architecture
What "Zero-Egress" Actually Means
A true zero-egress deployment has no:
- Internet connectivity: No WAN interface on the AI system
- Cloud API calls: All inference runs on local compute
- External telemetry: No usage data transmitted anywhere
- Update dependencies: Patches applied via secure physical media
The machine running your AI is as isolated as your most sensitive air-gapped systems.
Why This Matters
Consider the attack surface of a cloud-connected AI system:
1. Query interception: Network traffic between your system and the cloud provider
2. Provider compromise: Data breach at OpenAI, Anthropic, or Azure
3. State-level access: Legal or covert access in the provider's jurisdiction
4. Supply chain: Dependencies on external model serving infrastructure
An air-gapped system eliminates categories 1-4 entirely. The remaining attack vectors are physical access and insider threat—both of which you already manage for classified or highly sensitive systems.
Hardware Specifications
Option 1: Apple Silicon (Recommended for Most Use Cases)
Apple's unified memory architecture makes M-series chips exceptionally suited for local LLM inference:
M2 Ultra (Entry Configuration)
- CPU: 24-core (16 performance + 8 efficiency)
- GPU: 76-core
- Neural Engine: 32-core
- Memory: 128GB unified (minimum recommended)
- Storage: 2TB SSD (minimum)
- Power: 300W under load
M3 Max (Mid-Range Configuration)
- CPU: 16-core (12 performance + 4 efficiency)
- GPU: 40-core
- Neural Engine: 16-core
- Memory: 128GB unified
- Storage: 2TB SSD
- Power: 150W under load
M4 Ultra (Premium Configuration, expected Q2 2026)
- CPU: 32-core
- GPU: 80-core
- Memory: 256GB unified
- Optimal for: 70B+ parameter models with full context
Why Apple Silicon?
- Unified Memory: No GPU VRAM bottleneck—the full 128GB/256GB is available to the model
- Power Efficiency: 300W vs. 700W+ for equivalent NVIDIA setup
- Thermal Management: Runs without datacenter cooling infrastructure
- Form Factor: Mac Studio fits on a desk; no rack infrastructure required
- Security: Secure Enclave for key management, FileVault encryption
Option 2: NVIDIA GPU Cluster (Maximum Performance)
For organizations requiring the highest throughput or running multiple concurrent inference streams:
Single-GPU Configuration
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- CPU: AMD Ryzen 9 7950X or Intel i9-14900K
- RAM: 128GB DDR5
- Storage: 4TB NVMe RAID
- Power: 600W under load
Suitable for: 7B-13B models at full precision, 30B+ with quantization
Multi-GPU Configuration
- GPUs: 2x NVIDIA A100 (80GB) or 4x RTX 4090
- CPU: AMD EPYC or Intel Xeon
- RAM: 256GB+ DDR5
- Storage: 8TB NVMe RAID
- Power: 1.5kW under load
Suitable for: 70B models at full precision, batch inference
Enterprise Configuration
- GPUs: 8x NVIDIA H100 (80GB)
- Interconnect: NVLink
- RAM: 512GB+
- Power: 5kW under load
Suitable for: Real-time multimodal, 100B+ models, concurrent users
Hardware Comparison Matrix
| Configuration | Model Size | Tokens/sec | Document Capacity | Power | Cost (Est.) |
|---|---|---|---|---|---|
| M2 Ultra 128GB | 70B Q4 | 15-20 | 50K documents | 300W | $8,000 |
| M3 Max 128GB | 30B Q8 | 25-35 | 30K documents | 150W | $6,500 |
| M4 Ultra 256GB | 70B Q8 | 30-40 | 100K documents | 350W | $12,000 |
| RTX 4090 | 13B FP16 | 40-60 | 20K documents | 600W | $4,000 |
| 2x A100 80GB | 70B FP16 | 50-80 | 80K documents | 1.2kW | $30,000 |
| 8x H100 | 405B Q8 | 100+ | 500K documents | 5kW | $300,000 |
Model Selection and Quantization
Recommended Models for Air-Gapped Deployment
Primary Recommendation: Llama 3.1 Family
| Model | Parameters | VRAM Required (FP16) | VRAM Required (Q4) |
|---|---|---|---|
| Llama 3.1 8B | 8B | 16GB | 5GB |
| Llama 3.1 70B | 70B | 140GB | 40GB |
| Llama 3.1 405B | 405B | 810GB | 230GB |
Alternative Options:
- Mistral Large: Strong reasoning, efficient architecture
- Qwen 2.5: Excellent multilingual support (APAC deployments)
- DeepSeek-V2: Strong coding and technical capabilities
Quantization Strategies
Quantization reduces model size and memory requirements at the cost of some accuracy:
Q8 (8-bit quantization)
- Memory reduction: ~50%
- Quality loss: Minimal (<1% on most benchmarks)
- Recommended for: Production deployments with sufficient hardware
Q4 (4-bit quantization)
- Memory reduction: ~75%
- Quality loss: Noticeable (2-5% on complex reasoning)
- Recommended for: Memory-constrained deployments
GGUF Format
- Optimized for CPU+GPU inference
- Native support in llama.cpp
- Best compatibility with Apple Silicon
AWQ (Activation-aware Weight Quantization)
- Better quality than naive quantization
- Optimized for GPU inference
- Recommended for NVIDIA deployments
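As a rough rule of thumb, resident memory is parameter count times bits per weight, plus headroom for the KV cache. The sketch below illustrates the arithmetic behind the figures above; the bits-per-weight values (Q4_K_M averages roughly 4.5 bits once scales and metadata are included) and the fixed 4 GB cache allowance are approximations, and real file sizes vary by quantization variant and architecture.

```python
# Back-of-envelope memory estimate for quantized weights.
# Bits-per-weight figures and the KV-cache allowance are approximations.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.5}

def model_memory_gb(params_billions: float, quant: str, kv_cache_gb: float = 4.0) -> float:
    """Estimate resident memory in GB: weights plus a fixed KV-cache allowance."""
    weight_gb = params_billions * BITS_PER_WEIGHT[quant] / 8.0
    return weight_gb + kv_cache_gb

for quant in BITS_PER_WEIGHT:
    print(f"Llama 3.1 70B @ {quant}: ~{model_memory_gb(70, quant):.0f} GB")
```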
Model Deployment Stack
┌─────────────────────────────────────────┐
│ Application Layer │
│ (RAG Orchestration, Query API) │
├─────────────────────────────────────────┤
│ Inference Engine │
│ llama.cpp / vLLM / Ollama │
├─────────────────────────────────────────┤
│ Vector Database │
│ Qdrant / Chroma / LanceDB │
├─────────────────────────────────────────┤
│ Model Runtime │
│ GGUF (Apple) / AWQ (NVIDIA) │
├─────────────────────────────────────────┤
│ Hardware Layer │
│ Apple Silicon / NVIDIA GPU │
└─────────────────────────────────────────┘
Software Architecture
Inference Engines
For Apple Silicon: llama.cpp with Metal
- Native Metal GPU acceleration
- Optimal unified memory utilization
- GGUF model format support
For NVIDIA: vLLM or TensorRT-LLM
- PagedAttention for efficient memory
- Tensor parallelism for multi-GPU
- High throughput batch inference
Universal: Ollama
- Simplified deployment
- Cross-platform compatibility
- Built-in model management
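A minimal llama-cpp-python sketch for the Apple Silicon path, assuming the package was built with Metal support and that a Q4 GGUF of Llama 3.1 70B sits at the (hypothetical) path shown:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.1-70b-instruct-q4_k_m.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload all layers to the Metal GPU
    n_ctx=8192,        # context window; raise if memory allows
    use_mmap=True,     # memory-map the GGUF file rather than copying it into RAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the retrieved context."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```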
Vector Database Selection
For air-gapped deployments, the vector database must run entirely locally:
Qdrant (Recommended)
- Rust-based, high performance
- HNSW indexing
- Payload filtering for access control
Chroma
- Python-native
- Simple deployment
- Good for smaller document sets
LanceDB
- Embedded, no server required
- Excellent for edge deployments
- Lance columnar format
Milvus (Enterprise)
- Distributed architecture
- Billion-vector scale
- Complex access control
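A minimal Qdrant sketch in embedded (on-disk) mode, using payload filtering as a retrieval-time access control. The collection name, payload keys, and 1024-dimension vectors (BGE-large) are illustrative assumptions, and the classic `search` call is shown for brevity:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(path="/data/qdrant")  # local storage, no network listener

client.create_collection(
    collection_name="engineering_docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # BGE-large dims
)

client.upsert(
    collection_name="engineering_docs",
    points=[PointStruct(id=1, vector=[0.0] * 1024,  # dummy vector for the sketch
                        payload={"project": "bridge-2021", "clearance": "internal"})],
)

# Restrict retrieval to documents the querying role is allowed to see.
hits = client.search(
    collection_name="engineering_docs",
    query_vector=[0.0] * 1024,
    query_filter=Filter(must=[FieldCondition(key="clearance",
                                             match=MatchValue(value="internal"))]),
    limit=5,
)
```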
Document Processing Pipeline
Air-gapped document processing requires local tooling:
- OCR: Tesseract + pdf2image
- Table extraction: Camelot, Tabula
- Embedding: Local sentence-transformers (all-MiniLM, E5, BGE)
- Chunking: LangChain or custom recursive splitter (see the sketch below)
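A minimal sketch of the chunk-and-embed stage, assuming the sentence-transformers and LangChain text-splitter packages, along with the BGE weights, were transferred across the air gap in advance; the model path and chunk sizes are illustrative:

```python
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
embedder = SentenceTransformer("/models/bge-large-en-v1.5")  # local weights, no hub download

def embed_document(text: str):
    """Split a document into overlapping chunks and return (chunk, vector) pairs."""
    chunks = splitter.split_text(text)
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return list(zip(chunks, vectors))
```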
Network Isolation Architecture
Physical Isolation
┌─────────────────────────────────────────────────────────┐
│ SECURE ZONE │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ AI Server │◄──►│ Switch │◄──►│ Admin │ │
│ │ (Air-Gap) │ │ (Isolated) │ │ Terminal │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ▲ │
│ │ USB (Updates Only) │
│ ▼ │
│ ┌─────────────┐ │
│ │ Update │ │
│ │ Station │ │
│ └─────────────┘ │
│ │
│ ══════════════════ AIR GAP ════════════════════════ │
└─────────────────────────────────────────────────────────┘
│
│ (No Connection)
│
┌─────────────────────────────────────────────────────────┐
│ CORPORATE NETWORK │
└─────────────────────────────────────────────────────────┘
Access Control
Physical Access:
- Dedicated secure room or cabinet
- Access logging and biometric authentication
- Tamper-evident seals on hardware
Logical Access:
- Local user accounts only (no domain join)
- Role-based access to query and admin functions
- Session timeouts and automatic logout
Update Protocol:
- Prepare update package on internet-connected system
- Scan for malware on isolated scanning station
- Transfer via write-once or hardware write-protected removable media
- Apply update with cryptographic verification
- Log update with checksum and timestamp
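A minimal verification step for the last two items, assuming the expected SHA-256 digest was recorded on the connected preparation system and carried across the air gap alongside the package; the file name and digest placeholder are illustrative:

```python
import hashlib

def verify_update(package_path: str, expected_sha256: str) -> bool:
    """Return True only if the package hashes to the digest recorded at preparation time."""
    digest = hashlib.sha256()
    with open(package_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # read 1 MiB at a time
            digest.update(block)
    return digest.hexdigest() == expected_sha256

# Refuse to proceed on mismatch.
if not verify_update("/media/updates/llama-3.1-70b-q4.gguf",
                     "<digest from preparation system>"):
    raise SystemExit("Update package failed verification; do not apply.")
```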
Performance Optimization
Memory Management
For Apple Silicon:
- Allocate 90% of unified memory to model
- Use memory-mapped GGUF files
- Enable Metal memory optimization
For NVIDIA:
- Maximize VRAM allocation
- Use PagedAttention (vLLM)
- Enable Flash Attention 2
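For the NVIDIA path, a hedged vLLM sketch: PagedAttention is vLLM's built-in KV-cache manager, and gpu_memory_utilization bounds how much VRAM it reserves. The model path and parallelism degree are illustrative for a 2x A100 box:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.1-70b-instruct",  # local weights, no hub download
    tensor_parallel_size=2,                  # shard across both A100s
    gpu_memory_utilization=0.92,             # leave headroom for activations
)

outputs = llm.generate(
    ["Summarize the retrieved context."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```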
Inference Optimization
Batch Processing:
- Batch similar queries for throughput
- Use continuous batching where available
Context Caching:
- Cache system prompts
- Reuse KV cache for follow-up queries
Quantization Tuning:
- Test Q8 vs Q4 for your specific use case
- Consider mixed precision for critical layers
Benchmarking Your Deployment
Expected performance ranges:
| Hardware | Model | Batch Size | Tokens/sec | First Token Latency |
|---|---|---|---|---|
| M2 Ultra | 70B Q4 | 1 | 15-20 | 500ms |
| M4 Ultra | 70B Q8 | 1 | 30-40 | 300ms |
| RTX 4090 | 13B FP16 | 1 | 40-60 | 100ms |
| 2x A100 | 70B FP16 | 8 | 200+ | 150ms |
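A simple probe for the two right-hand columns, using llama-cpp-python streaming; llm is assumed to be the handle created in the earlier inference-engine sketch, and token counts are approximated by streamed chunks:

```python
import time

def benchmark(llm, prompt: str, max_tokens: int = 256):
    """Measure first-token latency and overall generation throughput."""
    start = time.perf_counter()
    first_token_ms = None
    n_chunks = 0
    for _chunk in llm(prompt, max_tokens=max_tokens, stream=True):
        if first_token_ms is None:
            first_token_ms = (time.perf_counter() - start) * 1000
        n_chunks += 1
    elapsed = time.perf_counter() - start
    return {"first_token_ms": first_token_ms, "tokens_per_sec": n_chunks / elapsed}
```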
Deployment Checklist
Pre-Deployment
- Security clearance for AI system installation
- Physical space with appropriate environmental controls
- Network isolation verification
- Hardware procurement and inspection
- Software package preparation and verification
Installation
- Hardware setup in secure location
- OS installation (minimal, hardened)
- Inference stack deployment
- Vector database configuration
- Model loading and verification
- Document ingestion pipeline testing
Validation
- Query accuracy testing with known documents
- Performance benchmarking
- Access control verification
- Audit logging confirmation
- Update procedure testing
Operational
- User training
- Incident response procedures
- Backup and recovery testing
- Scheduled security reviews
- Performance monitoring
Case Study: Engineering Firm Deployment
Context
- 200-person structural engineering firm
- Proprietary design methodologies (trade secrets)
- Client NDAs prohibiting cloud AI usage
- 30 years of project documentation
Hardware Selection
- 2x Mac Studio M2 Ultra (128GB each)
- Primary + hot standby configuration
- Dedicated isolated network segment, air-gapped from the corporate network
Software Stack
- Ollama for inference (Llama 3.1 70B Q4)
- Qdrant for vector storage
- Custom Python RAG application
- Local embedding model (BGE-large)
Document Ingestion
- 50,000 PDF drawings and specifications
- 10,000 calculation sheets
- 5,000 project reports
- Total: ~2M pages processed
Results (12 Months)
- Average query latency: 3-5 seconds
- Document retrieval accuracy: 94%
- Engineer time saved on precedent research: 40%
- Zero security incidents
- Full compliance with client NDAs
Lessons Learned
- OCR quality is critical—invest in document scanning quality
- Domain-specific embedding models outperform generic ones
- Query logging enables continuous improvement
- Regular model updates improve quality significantly
Getting Started
For organizations evaluating zero-egress AI infrastructure:
Phase 1: Hardware Selection (Week 1)
- Assess document volume and query patterns
- Select hardware configuration
- Plan physical installation
Phase 2: Software Stack (Week 2)
- Package software for air-gapped installation
- Configure inference and vector database
- Set up document processing pipeline
Phase 3: Pilot Deployment (Weeks 3-4)
- Install in secure location
- Ingest sample document set
- Validate query accuracy and performance
Phase 4: Production Rollout (Month 2)
- Complete document ingestion
- User training and adoption
- Establish update and monitoring procedures
Next Steps
For organizations requiring zero-egress AI infrastructure:
- Hardware Consultation: Selection guidance for your specific requirements
- Security Architecture Review: Integration with existing security controls
- Pilot Planning: Scope definition and success criteria
Schedule Architecture Review | Explore Sovereign Deployment