Zero-Egress RAG: Running Llama on Local Silicon
For certain organizations, cloud is not an option. Not "private cloud." Not "single-tenant." Zero cloud. Zero internet. Zero packets leaving the building.
This is not paranoia. It is a rational response to specific threat models:
- Defense contractors bound by ITAR and classified information handling requirements
- Engineering firms protecting proprietary designs and trade secrets
- Legal practices maintaining attorney-client privilege for high-stakes litigation
- Family offices protecting multi-generational wealth intelligence
- M&A advisory handling material non-public information
For these organizations, this guide provides the technical blueprint for deploying production-grade RAG systems entirely on local hardware.
The Zero-Egress Architecture
What "Zero-Egress" Actually Means
A true zero-egress deployment has no:
- Internet connectivity: No WAN interface on the AI system
- Cloud API calls: All inference runs on local compute
- External telemetry: No usage data transmitted anywhere
- Update dependencies: Patches applied via secure physical media
The machine running your AI is as isolated as your most sensitive air-gapped systems.
Why This Matters
Consider the attack surface of a cloud-connected AI system:
1. Query interception: Network traffic between your system and the cloud provider
2. Provider compromise: Data breach at OpenAI, Anthropic, or Azure
3. State-level access: Legal or covert access in the provider's jurisdiction
4. Supply chain: Dependencies on external model serving infrastructure
An air-gapped system eliminates categories 1-4 entirely. The remaining attack vectors are physical access and insider threat—both of which you already manage for classified or highly sensitive systems.
Hardware Specifications
Option 1: Apple Silicon (Recommended for Most Use Cases)
Apple's unified memory architecture makes M-series chips exceptionally suited for local LLM inference:
M2 Ultra (Entry Configuration)
- CPU: 24-core (16 performance + 8 efficiency)
- GPU: 76-core
- Neural Engine: 32-core
- Memory: 128GB unified (minimum recommended)
- Storage: 2TB SSD (minimum)
- Power: 300W under load
M3 Max (Mid-Range Configuration)
- CPU: 16-core (12 performance + 4 efficiency)
- GPU: 40-core
- Neural Engine: 16-core
- Memory: 128GB unified
- Storage: 2TB SSD
- Power: 150W under load
M4 Ultra (Premium Configuration, expected Q2 2026)
- CPU: 32-core
- GPU: 80-core
- Memory: 256GB unified
- Optimal for: 70B+ parameter models with full context
Why Apple Silicon?
- Unified Memory: No GPU VRAM bottleneck—the full 128GB/256GB is available to the model
- Power Efficiency: 300W vs. 700W+ for equivalent NVIDIA setup
- Thermal Management: Runs without datacenter cooling infrastructure
- Form Factor: Mac Studio fits on a desk; no rack infrastructure required
- Security: Secure Enclave for key management, FileVault encryption
Option 2: NVIDIA GPU Cluster (Maximum Performance)
For organizations requiring the highest throughput or running multiple concurrent inference streams:
Single-GPU Configuration
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- CPU: AMD Ryzen 9 7950X or Intel i9-14900K
- RAM: 128GB DDR5
- Storage: 4TB NVMe RAID
- Power: 600W under load
Suitable for: 7B-13B models at full precision, 30B+ with quantization
Multi-GPU Configuration
- GPUs: 2x NVIDIA A100 (80GB) or 4x RTX 4090
- CPU: AMD EPYC or Intel Xeon
- RAM: 256GB+ DDR5
- Storage: 8TB NVMe RAID
- Power: 1.5kW under load
Suitable for: 70B models at full precision, batch inference
Enterprise Configuration
- GPUs: 8x NVIDIA H100 (80GB)
- Interconnect: NVLink
- RAM: 512GB+
- Power: 5kW under load
Suitable for: Real-time multimodal, 100B+ models, concurrent users
Hardware Comparison Matrix
| Configuration | Model Size | Tokens/sec | Document Capacity | Power | Cost (Est.) |
|---|---|---|---|---|---|
| M2 Ultra 128GB | 70B Q4 | 15-20 | 50K documents | 300W | $8,000 |
| M3 Max 128GB | 30B Q8 | 25-35 | 30K documents | 150W | $6,500 |
| M4 Ultra 256GB | 70B Q8 | 30-40 | 100K documents | 350W | $12,000 |
| RTX 4090 | 13B FP16 | 40-60 | 20K documents | 600W | $4,000 |
| 2x A100 80GB | 70B FP16 | 50-80 | 80K documents | 1.2kW | $30,000 |
| 8x H100 | 405B Q8 | 100+ | 500K documents | 5kW | $300,000 |
Model Selection and Quantization
Recommended Models for Air-Gapped Deployment
Primary Recommendation: Llama 3.1 Family
| Model | Parameters | VRAM Required (FP16) | VRAM Required (Q4) |
|---|---|---|---|
| Llama 3.1 8B | 8B | 16GB | 5GB |
| Llama 3.1 70B | 70B | 140GB | 40GB |
| Llama 3.1 405B | 405B | 810GB | 230GB |
Alternative Options:
- Mistral Large: Strong reasoning, efficient architecture
- Qwen 2.5: Excellent multilingual support (APAC deployments)
- DeepSeek-V2: Strong coding and technical capabilities
Quantization Strategies
Quantization reduces model size and memory requirements at the cost of some accuracy:
Q8 (8-bit quantization)
- Memory reduction: ~50%
- Quality loss: Minimal (<1% on most benchmarks)
- Recommended for: Production deployments with sufficient hardware
Q4 (4-bit quantization)
- Memory reduction: ~75%
- Quality loss: Noticeable (2-5% on complex reasoning)
- Recommended for: Memory-constrained deployments
GGUF Format
- Optimized for CPU+GPU inference
- Native support in llama.cpp
- Best compatibility with Apple Silicon
AWQ (Activation-aware Weight Quantization)
- Better quality than naive quantization
- Optimized for GPU inference
- Recommended for NVIDIA deployments
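As a rough rule of thumb, resident memory is parameter count times bits per weight, plus headroom for the KV cache. The sketch below illustrates the arithmetic behind the figures above; the bits-per-weight values (Q4_K_M averages roughly 4.5 bits once scales and metadata are included) and the fixed 4 GB cache allowance are approximations, and real file sizes vary by quantization variant and architecture.

```python
# Back-of-envelope memory estimate for quantized weights.
# Bits-per-weight figures and the KV-cache allowance are approximations.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.5}

def model_memory_gb(params_billions: float, quant: str, kv_cache_gb: float = 4.0) -> float:
    """Estimate resident memory in GB: weights plus a fixed KV-cache allowance."""
    weight_gb = params_billions * BITS_PER_WEIGHT[quant] / 8.0
    return weight_gb + kv_cache_gb

for quant in BITS_PER_WEIGHT:
    print(f"Llama 3.1 70B @ {quant}: ~{model_memory_gb(70, quant):.0f} GB")
```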
Model Deployment Stack
┌─────────────────────────────────────────┐
│ Application Layer │
│ (RAG Orchestration, Query API) │
├─────────────────────────────────────────┤
│ Inference Engine │
│ llama.cpp / vLLM / Ollama │
├─────────────────────────────────────────┤
│ Vector Database │
│ Qdrant / Chroma / LanceDB │
├─────────────────────────────────────────┤
│ Model Runtime │
│ GGUF (Apple) / AWQ (NVIDIA) │
├─────────────────────────────────────────┤
│ Hardware Layer │
│ Apple Silicon / NVIDIA GPU │
└─────────────────────────────────────────┘
Software Architecture
Inference Engines
For Apple Silicon: llama.cpp with Metal
- Native Metal GPU acceleration
- Optimal unified memory utilization
- GGUF model format support
For NVIDIA: vLLM or TensorRT-LLM
- PagedAttention for efficient memory
- Tensor parallelism for multi-GPU
- High throughput batch inference
Universal: Ollama
- Simplified deployment
- Cross-platform compatibility
- Built-in model management
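A minimal llama-cpp-python sketch for the Apple Silicon path, assuming the package was built with Metal support and that a Q4 GGUF of Llama 3.1 70B sits at the (hypothetical) path shown:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-3.1-70b-instruct-q4_k_m.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload all layers to the Metal GPU
    n_ctx=8192,        # context window; raise if memory allows
    use_mmap=True,     # memory-map the GGUF file rather than copying it into RAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize the retrieved context."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```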
Vector Database Selection
For air-gapped deployments, the vector database must run entirely locally:
Qdrant (Recommended)
- Rust-based, high performance
- HNSW indexing
- Payload filtering for access control
Chroma
- Python-native
- Simple deployment
- Good for smaller document sets
LanceDB
- Embedded, no server required
- Excellent for edge deployments
- Lance columnar format
Milvus (Enterprise)
- Distributed architecture
- Billion-vector scale
- Complex access control
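A minimal Qdrant sketch in embedded (on-disk) mode, using payload filtering as a retrieval-time access control. The collection name, payload keys, and 1024-dimension vectors (BGE-large) are illustrative assumptions, and the classic `search` call is shown for brevity:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(path="/data/qdrant")  # local storage, no network listener

client.create_collection(
    collection_name="engineering_docs",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # BGE-large dims
)

client.upsert(
    collection_name="engineering_docs",
    points=[PointStruct(id=1, vector=[0.0] * 1024,  # dummy vector for the sketch
                        payload={"project": "bridge-2021", "clearance": "internal"})],
)

# Restrict retrieval to documents the querying role is allowed to see.
hits = client.search(
    collection_name="engineering_docs",
    query_vector=[0.0] * 1024,
    query_filter=Filter(must=[FieldCondition(key="clearance",
                                             match=MatchValue(value="internal"))]),
    limit=5,
)
```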
Document Processing Pipeline
Air-gapped document processing requires local tooling:
- OCR: Tesseract + pdf2image
- Table extraction: Camelot, Tabula
- Embedding: Local sentence-transformers (all-MiniLM, E5, BGE)
- Chunking: LangChain or custom recursive splitter (see the sketch below)
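A minimal sketch of the chunk-and-embed stage, assuming the sentence-transformers and LangChain text-splitter packages, along with the BGE weights, were transferred across the air gap in advance; the model path and chunk sizes are illustrative:

```python
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
embedder = SentenceTransformer("/models/bge-large-en-v1.5")  # local weights, no hub download

def embed_document(text: str):
    """Split a document into overlapping chunks and return (chunk, vector) pairs."""
    chunks = splitter.split_text(text)
    vectors = embedder.encode(chunks, normalize_embeddings=True)
    return list(zip(chunks, vectors))
```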
Network Isolation Architecture
Physical Isolation
┌─────────────────────────────────────────────────────────┐
│ SECURE ZONE │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ AI Server │◄──►│ Switch │◄──►│ Admin │ │
│ │ (Air-Gap) │ │ (Isolated) │ │ Terminal │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ▲ │
│ │ USB (Updates Only) │
│ ▼ │
│ ┌─────────────┐ │
│ │ Update │ │
│ │ Station │ │
│ └─────────────┘ │
│ │
│ ══════════════════ AIR GAP ════════════════════════ │
└─────────────────────────────────────────────────────────┘
│
│ (No Connection)
│
┌─────────────────────────────────────────────────────────┐
│ CORPORATE NETWORK │
└─────────────────────────────────────────────────────────┘
Access Control
Physical Access:
- Dedicated secure room or cabinet
- Access logging and biometric authentication
- Tamper-evident seals on hardware
Logical Access:
- Local user accounts only (no domain join)
- Role-based access to query and admin functions
- Session timeouts and automatic logout
Update Protocol:
- Prepare update package on internet-connected system
- Scan for malware on isolated scanning station
- Transfer via write-once or hardware write-protected removable media
- Apply update with cryptographic verification
- Log update with checksum and timestamp
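A minimal verification step for the last two items, assuming the expected SHA-256 digest was recorded on the connected preparation system and carried across the air gap alongside the package; the file name and digest placeholder are illustrative:

```python
import hashlib

def verify_update(package_path: str, expected_sha256: str) -> bool:
    """Return True only if the package hashes to the digest recorded at preparation time."""
    digest = hashlib.sha256()
    with open(package_path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):  # read 1 MiB at a time
            digest.update(block)
    return digest.hexdigest() == expected_sha256

# Refuse to proceed on mismatch.
if not verify_update("/media/updates/llama-3.1-70b-q4.gguf",
                     "<digest from preparation system>"):
    raise SystemExit("Update package failed verification; do not apply.")
```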
Performance Optimization
Memory Management
For Apple Silicon:
- Allocate 90% of unified memory to model
- Use memory-mapped GGUF files
- Enable Metal memory optimization
For NVIDIA:
- Maximize VRAM allocation
- Use PagedAttention (vLLM)
- Enable Flash Attention 2
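For the NVIDIA path, a hedged vLLM sketch: PagedAttention is vLLM's built-in KV-cache manager, and gpu_memory_utilization bounds how much VRAM it reserves. The model path and parallelism degree are illustrative for a 2x A100 box:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/llama-3.1-70b-instruct",  # local weights, no hub download
    tensor_parallel_size=2,                  # shard across both A100s
    gpu_memory_utilization=0.92,             # leave headroom for activations
)

outputs = llm.generate(
    ["Summarize the retrieved context."],
    SamplingParams(max_tokens=256, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```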
Inference Optimization
Batch Processing:
- Batch similar queries for throughput
- Use continuous batching where available
Context Caching:
- Cache system prompts
- Reuse KV cache for follow-up queries
Quantization Tuning:
- Test Q8 vs Q4 for your specific use case
- Consider mixed precision for critical layers
Benchmarking Your Deployment
Expected performance ranges:
| Hardware | Model | Batch Size | Tokens/sec | First Token Latency |
|---|---|---|---|---|
| M2 Ultra | 70B Q4 | 1 | 15-20 | 500ms |
| M4 Ultra | 70B Q8 | 1 | 30-40 | 300ms |
| RTX 4090 | 13B FP16 | 1 | 40-60 | 100ms |
| 2x A100 | 70B FP16 | 8 | 200+ | 150ms |
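A simple probe for the two right-hand columns, using llama-cpp-python streaming; llm is assumed to be the handle created in the earlier inference-engine sketch, and token counts are approximated by streamed chunks:

```python
import time

def benchmark(llm, prompt: str, max_tokens: int = 256):
    """Measure first-token latency and overall generation throughput."""
    start = time.perf_counter()
    first_token_ms = None
    n_chunks = 0
    for _chunk in llm(prompt, max_tokens=max_tokens, stream=True):
        if first_token_ms is None:
            first_token_ms = (time.perf_counter() - start) * 1000
        n_chunks += 1
    elapsed = time.perf_counter() - start
    return {"first_token_ms": first_token_ms, "tokens_per_sec": n_chunks / elapsed}
```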
Deployment Checklist
Pre-Deployment
- Security clearance for AI system installation
- Physical space with appropriate environmental controls
- Network isolation verification
- Hardware procurement and inspection
- Software package preparation and verification
Installation
- Hardware setup in secure location
- OS installation (minimal, hardened)
- Inference stack deployment
- Vector database configuration
- Model loading and verification
- Document ingestion pipeline testing
Validation
- Query accuracy testing with known documents
- Performance benchmarking
- Access control verification
- Audit logging confirmation
- Update procedure testing
Operational
- User training
- Incident response procedures
- Backup and recovery testing
- Scheduled security reviews
- Performance monitoring
Case Study: Engineering Firm Deployment
Context
- 200-person structural engineering firm
- Proprietary design methodologies (trade secrets)
- Client NDAs prohibiting cloud AI usage
- 30 years of project documentation
Hardware Selection
- 2x Mac Studio M2 Ultra (128GB each)
- Primary + hot standby configuration
- Dedicated isolated network segment, air-gapped from the corporate network
Software Stack
- Ollama for inference (Llama 3.1 70B Q4)
- Qdrant for vector storage
- Custom Python RAG application
- Local embedding model (BGE-large)
Document Ingestion
- 50,000 PDF drawings and specifications
- 10,000 calculation sheets
- 5,000 project reports
- Total: ~2M pages processed
Results (12 Months)
- Average query latency: 3-5 seconds
- Document retrieval accuracy: 94%
- Engineer time saved on precedent research: 40%
- Zero security incidents
- Full compliance with client NDAs
Lessons Learned
- OCR quality is critical—invest in document scanning quality
- Domain-specific embedding models outperform generic ones
- Query logging enables continuous improvement
- Regular model updates improve quality significantly
Getting Started
For organizations evaluating zero-egress AI infrastructure:
Phase 1: Hardware Selection (Week 1)
- Assess document volume and query patterns
- Select hardware configuration
- Plan physical installation
Phase 2: Software Stack (Week 2)
- Package software for air-gapped installation
- Configure inference and vector database
- Set up document processing pipeline
Phase 3: Pilot Deployment (Weeks 3-4)
- Install in secure location
- Ingest sample document set
- Validate query accuracy and performance
Phase 4: Production Rollout (Month 2)
- Complete document ingestion
- User training and adoption
- Establish update and monitoring procedures
Next Steps
For organizations requiring zero-egress AI infrastructure:
- Hardware Consultation: Selection guidance for your specific requirements
- Security Architecture Review: Integration with existing security controls
- Pilot Planning: Scope definition and success criteria
Schedule Architecture Review | Explore Sovereign Deployment