
Zero-Egress RAG: Running Llama on Local Silicon for Regulated Enterprises

A technical blueprint for deploying air-gapped AI knowledge systems on Apple Silicon and NVIDIA GPU clusters. Hardware specifications, quantization strategies, and network isolation architectures for enterprises requiring complete data sovereignty.

January 1, 2026
15 min read
Oxaide Team


For certain organizations, cloud is not an option. Not "private cloud." Not "single-tenant." Zero cloud. Zero internet. Zero packets leaving the building.

This is not paranoia. It is a rational response to specific threat models:

  • Defense contractors bound by ITAR and classified information handling requirements
  • Engineering firms protecting proprietary designs and trade secrets
  • Legal practices maintaining attorney-client privilege for high-stakes litigation
  • Family offices protecting multi-generational wealth intelligence
  • M&A advisory handling material non-public information

For these organizations, this guide provides the technical blueprint for deploying production-grade RAG systems entirely on local hardware.

The Zero-Egress Architecture

What "Zero-Egress" Actually Means

A true zero-egress deployment has no:

  • Internet connectivity: No WAN interface on the AI system
  • Cloud API calls: All inference runs on local compute
  • External telemetry: No usage data transmitted anywhere
  • Update dependencies: Patches applied via secure physical media

The machine running your AI is as isolated as your most sensitive air-gapped systems.

Why This Matters

Consider the attack surface of a cloud-connected AI system:

  1. Query interception: Network traffic between your system and cloud provider
  2. Provider compromise: Data breach at OpenAI, Anthropic, or Azure
  3. State-level access: Legal or covert access in provider's jurisdiction
  4. Supply chain: Dependencies on external model serving infrastructure

An air-gapped system eliminates categories 1-4 entirely. The remaining attack vectors are physical access and insider threat—both of which you already manage for classified or highly sensitive systems.

Hardware Specifications

Option 1: Apple Silicon (Recommended for Most Use Cases)

Apple's unified memory architecture makes M-series chips exceptionally suited for local LLM inference:

M2 Ultra (Entry Configuration)

  • CPU: 24-core (16 performance + 8 efficiency)
  • GPU: 76-core
  • Neural Engine: 32-core
  • Memory: 128GB unified (minimum recommended)
  • Storage: 2TB SSD (minimum)
  • Power: 300W under load

M3 Max (Mid-Range Configuration)

  • CPU: 16-core (12 performance + 4 efficiency)
  • GPU: 40-core
  • Neural Engine: 16-core
  • Memory: 128GB unified
  • Storage: 2TB SSD
  • Power: 150W under load

M4 Ultra (Premium Configuration, available Q2 2026)

  • CPU: 32-core
  • GPU: 80-core
  • Memory: 256GB unified
  • Optimal for: 70B+ parameter models with full context

Why Apple Silicon?

  1. Unified Memory: No GPU VRAM bottleneck—the full 128GB/256GB is available to the model
  2. Power Efficiency: 300W vs. 700W+ for equivalent NVIDIA setup
  3. Thermal Management: Runs without datacenter cooling infrastructure
  4. Form Factor: Mac Studio fits on a desk; no rack infrastructure required
  5. Security: Secure Enclave for key management, FileVault encryption

Option 2: NVIDIA GPU Cluster (Maximum Performance)

For organizations requiring the highest throughput or running multiple concurrent inference streams:

Single-GPU Configuration

  • GPU: NVIDIA RTX 4090 (24GB VRAM)
  • CPU: AMD Ryzen 9 7950X or Intel i9-14900K
  • RAM: 128GB DDR5
  • Storage: 4TB NVMe RAID
  • Power: 600W under load

Suitable for: 7B models at full precision (FP16), 13B with 8-bit quantization, 30B+ with 4-bit quantization

Multi-GPU Configuration

  • GPUs: 2x NVIDIA A100 (80GB) or 4x RTX 4090
  • CPU: AMD EPYC or Intel Xeon
  • RAM: 256GB+ DDR5
  • Storage: 8TB NVMe RAID
  • Power: 1.5kW under load

Suitable for: 70B models at full precision, batch inference

Enterprise Configuration

  • GPUs: 8x NVIDIA H100 (80GB)
  • Interconnect: NVLink
  • RAM: 512GB+
  • Power: 5kW under load

Suitable for: Real-time multimodal, 100B+ models, concurrent users

Hardware Comparison Matrix

Configuration    | Model Size | Tokens/sec | Document Capacity | Power | Cost (Est.)
M2 Ultra 128GB   | 70B Q4     | 15-20      | 50K documents     | 300W  | $8,000
M3 Max 128GB     | 30B Q8     | 25-35      | 30K documents     | 150W  | $6,500
M4 Ultra 256GB   | 70B Q8     | 30-40      | 100K documents    | 350W  | $12,000
RTX 4090         | 13B Q8     | 40-60      | 20K documents     | 600W  | $4,000
2x A100 80GB     | 70B FP16   | 50-80      | 80K documents     | 1.2kW | $30,000
8x H100          | 405B       | 100+       | 500K documents    | 5kW   | $300,000

Model Selection and Quantization

Recommended Models for Air-Gapped Deployment

Primary Recommendation: Llama 3.1 Family

Model          | Parameters | VRAM Required (FP16) | VRAM Required (Q4)
Llama 3.1 8B   | 8B         | 16GB                 | 5GB
Llama 3.1 70B  | 70B        | 140GB                | 40GB
Llama 3.1 405B | 405B       | 810GB                | 230GB

Alternative Options:

  • Mistral Large: Strong reasoning, efficient architecture
  • Qwen 2.5: Excellent multilingual support (APAC deployments)
  • DeepSeek-V2: Strong coding and technical capabilities

Quantization Strategies

Quantization reduces model size and memory requirements at the cost of some accuracy:

Q8 (8-bit quantization)

  • Memory reduction: ~50%
  • Quality loss: Minimal (<1% on most benchmarks)
  • Recommended for: Production deployments with sufficient hardware

Q4 (4-bit quantization)

  • Memory reduction: ~75%
  • Quality loss: Noticeable (2-5% on complex reasoning)
  • Recommended for: Memory-constrained deployments

GGUF Format

  • Optimized for CPU+GPU inference
  • Native support in llama.cpp
  • Best compatibility with Apple Silicon

AWQ (Activation-aware Weight Quantization)

  • Better quality than naive quantization
  • Optimized for GPU inference
  • Recommended for NVIDIA deployments
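
To size hardware before procurement, a quick back-of-envelope calculation of weight memory is usually enough. The sketch below is a rough Python estimate under simple assumptions (Q4 variants average roughly 4.5 bits per weight, Q8 roughly 8.5); budget an extra 10-20% on top for KV cache and runtime overhead, which grows with context length.

# Rough weight-memory estimate for quantized models. Bits-per-weight values
# are approximations (Q4_K variants average ~4.5 bits, Q8_0 ~8.5, FP16 is 16);
# add 10-20% for KV cache and runtime overhead.
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    # params_billion * 1e9 weights * bits / 8 bytes, expressed in GB
    return params_billion * bits_per_weight / 8

for label, bits in [("FP16", 16.0), ("Q8", 8.5), ("Q4", 4.5)]:
    print(f"Llama 3.1 70B at {label}: ~{weight_memory_gb(70, bits):.0f} GB of weights")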

Model Deployment Stack

┌─────────────────────────────────────────┐
│           Application Layer              │
│    (RAG Orchestration, Query API)        │
├─────────────────────────────────────────┤
│          Inference Engine                │
│   llama.cpp / vLLM / Ollama              │
├─────────────────────────────────────────┤
│          Vector Database                 │
│    Qdrant / Chroma / LanceDB             │
├─────────────────────────────────────────┤
│          Model Runtime                   │
│   GGUF (Apple) / AWQ (NVIDIA)            │
├─────────────────────────────────────────┤
│          Hardware Layer                  │
│   Apple Silicon / NVIDIA GPU             │
└─────────────────────────────────────────┘

Software Architecture

Inference Engines

For Apple Silicon: llama.cpp with Metal

  • Native Metal GPU acceleration
  • Optimal unified memory utilization
  • GGUF model format support

For NVIDIA: vLLM or TensorRT-LLM

  • PagedAttention for efficient memory
  • Tensor parallelism for multi-GPU
  • High throughput batch inference

Universal: Ollama

  • Simplified deployment
  • Cross-platform compatibility
  • Built-in model management
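
Whichever engine you choose, the application layer only ever talks to a loopback endpoint. As a minimal sketch, the snippet below queries a local Ollama instance over its HTTP API on 127.0.0.1 using only the Python standard library; the model tag and prompt are illustrative, and the model is assumed to have been imported onto the air-gapped machine offline.

# Minimal query against a local Ollama instance. Assumes Ollama is bound to
# localhost (its default) and a Llama 3.1 70B model is already present locally.
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # request never leaves the host

payload = {
    "model": "llama3.1:70b",       # example local model tag
    "prompt": "Summarize the retrieved design notes in three bullet points.",
    "stream": False,               # single JSON response instead of a token stream
    "options": {"temperature": 0.1, "num_ctx": 8192},
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])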

Vector Database Selection

For air-gapped deployments, the vector database must run entirely locally:

Qdrant (Recommended)

  • Rust-based, high performance
  • HNSW indexing
  • Payload filtering for access control

Chroma

  • Python-native
  • Simple deployment
  • Good for smaller document sets

LanceDB

  • Embedded, no server required
  • Excellent for edge deployments
  • Lance columnar format

Milvus (Enterprise)

  • Distributed architecture
  • Billion-vector scale
  • Complex access control
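
A minimal sketch of the Qdrant option follows, using the qdrant-client library in embedded mode so vectors never leave local disk, with a payload filter standing in for role-based access control. The collection name, vector size (1024 matches BGE-large), paths, and payload fields are illustrative assumptions.

# Sketch: local, embedded Qdrant collection with payload-based access control.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(path="/srv/rag/qdrant")  # embedded mode, data stays on local disk

client.create_collection(
    collection_name="engineering_docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # BGE-large dims
)

client.upsert(
    collection_name="engineering_docs",
    points=[
        PointStruct(
            id=1,
            vector=[0.01] * 1024,  # replace with a real embedding
            payload={"text": "Sample chunk text", "project": "bridge-042",
                     "clearance": "internal"},
        )
    ],
)

# Restrict retrieval to documents the requesting role is allowed to see.
hits = client.search(
    collection_name="engineering_docs",
    query_vector=[0.01] * 1024,  # replace with the query embedding
    query_filter=Filter(
        must=[FieldCondition(key="clearance", match=MatchValue(value="internal"))]
    ),
    limit=5,
)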

Document Processing Pipeline

Air-gapped document processing requires local tooling:

  • OCR: Tesseract + pdf2image
  • Table Extraction: Camelot, Tabula
  • Embedding: Local sentence-transformers (all-MiniLM, E5, BGE)
  • Chunking: LangChain or custom recursive splitter
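
As a sketch of the chunking and embedding stages above, the snippet below splits a document with LangChain's recursive splitter and embeds the chunks with a locally stored BGE model via sentence-transformers. Paths are illustrative; on an air-gapped host the model weights must already be on disk, since no Hugging Face download is possible.

# Sketch of the chunking and embedding stages, all running locally.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # characters per chunk; tune for your documents
    chunk_overlap=200,    # overlap preserves context across chunk boundaries
)

# Local copy of a BGE-large model, transferred via the offline update workflow.
embedder = SentenceTransformer("/srv/rag/models/bge-large-en-v1.5")

with open("/srv/rag/ingest/spec-001.txt", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

embeddings = embedder.encode(chunks, normalize_embeddings=True, batch_size=32)
print(f"{len(chunks)} chunks -> embeddings of dim {embeddings.shape[1]}")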

Network Isolation Architecture

Physical Isolation

┌─────────────────────────────────────────────────────────┐
│                    SECURE ZONE                           │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │  AI Server  │◄──►│   Switch    │◄──►│ Admin       │  │
│  │  (Air-Gap)  │    │ (Isolated)  │    │ Terminal    │  │
│  └─────────────┘    └─────────────┘    └─────────────┘  │
│         ▲                                                │
│         │ USB (Updates Only)                            │
│         ▼                                                │
│  ┌─────────────┐                                        │
│  │  Update     │                                        │
│  │  Station    │                                        │
│  └─────────────┘                                        │
│                                                          │
│  ══════════════════ AIR GAP ════════════════════════    │
└─────────────────────────────────────────────────────────┘
                          │
                          │ (No Connection)
                          │
┌─────────────────────────────────────────────────────────┐
│                 CORPORATE NETWORK                        │
└─────────────────────────────────────────────────────────┘

Access Control

Physical Access:

  • Dedicated secure room or cabinet
  • Access logging and biometric authentication
  • Tamper-evident seals on hardware

Logical Access:

  • Local user accounts only (no domain join)
  • Role-based access to query and admin functions
  • Session timeouts and automatic logout

Update Protocol:

  1. Prepare update package on internet-connected system
  2. Scan for malware on isolated scanning station
  3. Transfer via write-once media (USB-C with write protection)
  4. Apply update with cryptographic verification
  5. Log update with checksum and timestamp
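
Steps 4 and 5 are straightforward to script. Below is an illustrative sketch that recomputes SHA-256 hashes against a manifest prepared on the connected side and appends an audit log entry; the manifest format, paths, and log location are assumptions, and verifying the signature on the manifest itself (for example with GPG) is assumed to happen before this step.

# Sketch of the verification step: recompute SHA-256 hashes of files on the
# write-once media and compare them to a manifest prepared on the connected side.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

UPDATE_DIR = Path("/media/update-2026-01")
MANIFEST = UPDATE_DIR / "manifest.json"   # {"files": {"relative/path": "sha256hex", ...}}
LOG_FILE = Path("/var/log/airgap-updates.log")

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

manifest = json.loads(MANIFEST.read_text())
for rel_path, expected in manifest["files"].items():
    if sha256(UPDATE_DIR / rel_path) != expected:
        raise SystemExit(f"Checksum mismatch for {rel_path}; aborting update.")

# Step 5: log the verified update with checksum count and timestamp.
with LOG_FILE.open("a") as log:
    log.write(f"{datetime.now(timezone.utc).isoformat()} verified {MANIFEST} "
              f"({len(manifest['files'])} files)\n")
print("All checksums verified; safe to apply update.")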

Performance Optimization

Memory Management

For Apple Silicon:

  • Allocate 90% of unified memory to model
  • Use memory-mapped GGUF files
  • Enable Metal memory optimization

For NVIDIA:

  • Maximize VRAM allocation
  • Use PagedAttention (vLLM)
  • Enable Flash Attention 2
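
As a concrete example of the memory-mapped GGUF loading and full GPU offload recommended above, the sketch below uses the llama-cpp-python bindings; the model path, quantization, and context size are assumptions, and the same settings apply whether the GPU backend is Metal or CUDA.

# Sketch: load a quantized GGUF model so the weights are memory-mapped rather
# than copied, with every layer offloaded to the GPU backend.
from llama_cpp import Llama

llm = Llama(
    model_path="/srv/rag/models/llama-3.1-70b-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=-1,   # offload every layer to the GPU backend
    n_ctx=8192,        # context window; larger contexts grow the KV cache
    use_mmap=True,     # map the file into memory instead of loading a copy
    use_mlock=False,   # set True to pin weights in RAM if you have headroom
    verbose=False,
)

out = llm("Answer from the provided context only: what is the design load?",
          max_tokens=256, temperature=0.1)
print(out["choices"][0]["text"])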

Inference Optimization

Batch Processing:

  • Batch similar queries for throughput
  • Use continuous batching where available

Context Caching:

  • Cache system prompts
  • Reuse KV cache for follow-up queries

Quantization Tuning:

  • Test Q8 vs Q4 for your specific use case
  • Consider mixed precision for critical layers

Benchmarking Your Deployment

Expected performance ranges:

Hardware  | Model    | Batch Size | Tokens/sec | First Token Latency
M2 Ultra  | 70B Q4   | 1          | 15-20      | 500ms
M4 Ultra  | 70B Q8   | 1          | 30-40      | 300ms
RTX 4090  | 13B Q8   | 1          | 40-60      | 100ms
2x A100   | 70B FP16 | 8          | 200+       | 150ms
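
You can reproduce these two metrics against your own deployment with a short script. The sketch below streams a generation from a local Ollama endpoint, timing the first token client-side and reading tokens per second from the eval_count and eval_duration fields (reported in nanoseconds) of the final stream object; the model tag and prompt are illustrative, and results will vary with quantization, context length, and thermal conditions.

# Rough benchmark against a local Ollama endpoint: first-token latency from the
# streamed response, steady-state tokens/sec from the final stats object.
import json
import time
import urllib.request

payload = {"model": "llama3.1:70b",
           "prompt": "Explain shear walls in one paragraph.",
           "stream": True}
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

start = time.perf_counter()
first_token_latency = None
with urllib.request.urlopen(req) as resp:
    for line in resp:                      # one JSON object per streamed line
        chunk = json.loads(line)
        if first_token_latency is None and chunk.get("response"):
            first_token_latency = time.perf_counter() - start
        if chunk.get("done"):
            tok_per_sec = chunk["eval_count"] / (chunk["eval_duration"] / 1e9)

print(f"first token: {first_token_latency * 1000:.0f} ms, "
      f"generation: {tok_per_sec:.1f} tokens/sec")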

Deployment Checklist

Pre-Deployment

  • Security clearance for AI system installation
  • Physical space with appropriate environmental controls
  • Network isolation verification
  • Hardware procurement and inspection
  • Software package preparation and verification

Installation

  • Hardware setup in secure location
  • OS installation (minimal, hardened)
  • Inference stack deployment
  • Vector database configuration
  • Model loading and verification
  • Document ingestion pipeline testing

Validation

  • Query accuracy testing with known documents
  • Performance benchmarking
  • Access control verification
  • Audit logging confirmation
  • Update procedure testing

Operational

  • User training
  • Incident response procedures
  • Backup and recovery testing
  • Scheduled security reviews
  • Performance monitoring

Case Study: Engineering Firm Deployment

Context

  • 200-person structural engineering firm
  • Proprietary design methodologies (trade secrets)
  • Client NDAs prohibiting cloud AI usage
  • 30 years of project documentation

Hardware Selection

  • 2x Mac Studio M2 Ultra (128GB each)
  • Primary + hot standby configuration
  • Dedicated VLAN with air-gap to corporate network

Software Stack

  • Ollama for inference (Llama 3.1 70B Q4)
  • Qdrant for vector storage
  • Custom Python RAG application
  • Local embedding model (BGE-large)
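
As a minimal illustration (not the firm's actual code), the pieces above compose roughly like this: embed the question locally, retrieve from Qdrant, and prompt the local model to answer from the retrieved chunks only. All names and paths are hypothetical.

# Illustrative end-to-end query: local embedding -> local retrieval -> local LLM.
import json
import urllib.request
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("/srv/rag/models/bge-large-en-v1.5")
store = QdrantClient(path="/srv/rag/qdrant")

question = "What lateral load assumptions did we use on recent warehouse projects?"
hits = store.search(
    collection_name="engineering_docs",
    query_vector=embedder.encode(question, normalize_embeddings=True).tolist(),
    limit=5,
)
context = "\n\n".join((h.payload or {}).get("text", "") for h in hits)

prompt = (f"Answer using only the context below. Cite document IDs.\n\n"
          f"Context:\n{context}\n\nQuestion: {question}")
req = urllib.request.Request(
    "http://127.0.0.1:11434/api/generate",
    data=json.dumps({"model": "llama3.1:70b", "prompt": prompt,
                     "stream": False}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])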

Document Ingestion

  • 50,000 PDF drawings and specifications
  • 10,000 calculation sheets
  • 5,000 project reports
  • Total: ~2M pages processed

Results (12 Months)

  • Average query latency: 3-5 seconds
  • Document retrieval accuracy: 94%
  • Engineer time saved on precedent research: 40%
  • Zero security incidents
  • Full compliance with client NDAs

Lessons Learned

  1. OCR quality is critical; invest in high-quality document scanning
  2. Domain-specific embedding models outperform generic ones
  3. Query logging enables continuous improvement
  4. Regular model updates improve quality significantly

Getting Started

For organizations evaluating zero-egress AI infrastructure:

Phase 1: Hardware Selection (Week 1)

  • Assess document volume and query patterns
  • Select hardware configuration
  • Plan physical installation

Phase 2: Software Stack (Week 2)

  • Package software for air-gapped installation
  • Configure inference and vector database
  • Set up document processing pipeline

Phase 3: Pilot Deployment (Weeks 3-4)

  • Install in secure location
  • Ingest sample document set
  • Validate query accuracy and performance

Phase 4: Production Rollout (Month 2)

  • Complete document ingestion
  • User training and adoption
  • Establish update and monitoring procedures

Next Steps

For organizations requiring zero-egress AI infrastructure:

  1. Hardware Consultation: Selection guidance for your specific requirements
  2. Security Architecture Review: Integration with existing security controls
  3. Pilot Planning: Scope definition and success criteria

Schedule Architecture Review | Explore Sovereign Deployment

