The most frequently asked question in technology committees in 2025 is no longer “which model to use,” but “where to run AI”. Amid latency concerns, token costs, regulatory compliance, and data leaks, an increasing number of organizations are exploring local deployment of language models with elastic cloud support for peak loads and new use cases. Positioned in this intermediate space is SoaxNG—the orchestration layer of OASIX Cloud (Grupo Aire), built on OpenStack—enabling Ollama deployment with Open WebUI, blending on-site privacy and scalability.
Its goal: hybrid ecosystems where sensitive data remains under direct control, while cloud infrastructure provides capacity and resilience when business demands it.
What Ollama adds in SoaxNG environments
Ollama has become the go-to “local runtime” for GGUF models (a quantized format that reduces memory and inference costs), noted for its operational simplicity: download, run, chat. In an enterprise context, this simplicity is amplified by its integration with SoaxNG, which provides orchestration, isolation, and container lifecycle management.
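As a minimal illustration of that "download, run, chat" flow, the sketch below talks to Ollama's REST API. It assumes a local instance listening on the default port 11434 and uses llama3.2 as an example model name; exact request fields can vary slightly between Ollama versions.

```python
import requests

OLLAMA = "http://localhost:11434"  # default Ollama endpoint (assumption: local instance)

# Pull a model (the API equivalent of `ollama pull`); llama3.2 is only an example name.
requests.post(f"{OLLAMA}/api/pull",
              json={"model": "llama3.2", "stream": False},
              timeout=600)

# Ask a single question (the API equivalent of `ollama run`).
resp = requests.post(
    f"{OLLAMA}/api/generate",
    json={"model": "llama3.2",
          "prompt": "Summarize in one sentence what GGUF quantization is.",
          "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```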
Why add Open WebUI?
- Adoption curve. Open WebUI offers a visual interface that removes the reliance on the command line. This is key for extending AI beyond technical teams: legal, marketing, customer support, or operations can test, iterate, and share without opening a terminal.
- Collaboration. Conversation history, prompt templates, document upload (PDF/images) with OCR, and model-specific adjustments (temperature, top-p, context size) help standardize workflows.
- Extensibility. From the UI itself, users can download/manage models, configure server ports/IPs, and enable embedding or vision modules if available.
Deployment architecture: containers, profiles, and persistence
The recommended deployment pattern is containerized:
- Resource isolation. Each Ollama instance runs in its own container with granular CPU/GPU allocation. SoaxNG manages this via its orchestration engine on OpenStack, supporting multi-tenancy and separation between development and production.
- Scalability. SoaxNG auto-adjusts resources and replicas during inference peaks. For heavier models, profiles with GPU and adequate memory are assigned per case.
- Persistence. Volumes connect to OASIX’s Flash Scale Premium systems, ensuring GGUF models > 100 GB are stored and served without bottlenecks.
Typical stack (via Docker Compose, with CPU/GPU profiles):
- Ollama Core – runtime for GGUF models.
- Open WebUI – unified management and chat interface.
- Nginx – reverse proxy with TLS and load balancing.
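Once the three containers are up, a quick smoke test helps confirm the pieces are wired together. This is a sketch only: the hostnames and paths below are assumptions about a typical layout (Nginx terminating TLS in front of Ollama and Open WebUI), not values fixed by SoaxNG.

```python
import requests

# Assumed endpoints behind the Nginx reverse proxy; adjust to your deployment.
OLLAMA_URL = "https://ai.example.internal/ollama"  # proxied Ollama API (assumption)
WEBUI_URL = "https://ai.example.internal"          # Open WebUI front end (assumption)

# 1) Ollama answers and lists the locally available GGUF models.
models = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10).json().get("models", [])
print("Models available:", [m["name"] for m in models])

# 2) Open WebUI is reachable through the proxy (TLS handled by Nginx).
assert requests.get(WEBUI_URL, timeout=10).status_code == 200, "Open WebUI not reachable"
```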
Supported models and resource profiles
SoaxNG offers predefined profiles to speed up deployment across main model families:
| Model | Min. vCPU | Min. RAM | Storage | Main use case |
|---|---|---|---|---|
| DeepSeek-R1 | 8 | 32 GB | 150 GB | Reasoning and analysis |
| Llama 3.2 | 4 | 16 GB | 45 GB | General text generation |
| CodeLlama-70B | 12 | 64 GB | 85 GB | Development support |
| LLaVA-1.6 | 6 | 24 GB | 35 GB | Vision and documentation |
Note: actual requirements depend on model size and quantization, context length, and desired throughput. The GGUF model catalog grows weekly; in enterprise settings, standardizing profiles per service level (latency, concurrency) and per data sensitivity is critical.
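For capacity planning, the table can also be expressed as a simple profile map. The structure below is purely illustrative (it is not SoaxNG profile syntax), with model identifiers chosen as examples.

```python
# Illustrative resource profiles derived from the table above (not SoaxNG syntax).
PROFILES = {
    "deepseek-r1":   {"vcpu": 8,  "ram_gb": 32, "storage_gb": 150, "use": "reasoning and analysis"},
    "llama3.2":      {"vcpu": 4,  "ram_gb": 16, "storage_gb": 45,  "use": "general text generation"},
    "codellama:70b": {"vcpu": 12, "ram_gb": 64, "storage_gb": 85,  "use": "development support"},
    "llava:1.6":     {"vcpu": 6,  "ram_gb": 24, "storage_gb": 35,  "use": "vision and documentation"},
}

def fits(model: str, vcpu: int, ram_gb: int) -> bool:
    """Check whether a container flavor meets the minimum profile for a model."""
    profile = PROFILES[model]
    return vcpu >= profile["vcpu"] and ram_gb >= profile["ram_gb"]

print(fits("llama3.2", vcpu=8, ram_gb=32))  # True
```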
Use cases already delivering ROI
1) Automated cybersecurity (SOC)
- Automated playbooks: generating incident-response (IR) procedures for new CVEs, mapped to MITRE ATT&CK.
- Accelerated forensics: ingesting 1 TB/day of logs to look for APT patterns and correlations (a triage sketch follows this list).
- Simulation: realistic attack scenarios to train red teams and evaluate controls.
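As a minimal sketch of the forensics/triage idea, the snippet below asks a locally hosted model to label a single log line. The endpoint and model name are assumptions; in practice this would run as a batch job over the log pipeline, with the model kept entirely on-premises.

```python
import requests

OLLAMA = "http://localhost:11434"  # assumption: local Ollama instance
MODEL = "deepseek-r1"              # example model; any local reasoning model works

def triage_log_line(line: str) -> str:
    """Ask the local model to flag a log line as benign, suspicious, or likely-APT."""
    prompt = (
        "You are a SOC analyst. Classify the following log line as "
        "'benign', 'suspicious' or 'likely-APT' and give a one-line reason.\n\n"
        f"LOG: {line}"
    )
    r = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": MODEL, "prompt": prompt, "stream": False,
              "options": {"temperature": 0.1}},  # low temperature for consistent labels
        timeout=120,
    )
    return r.json()["response"]

print(triage_log_line("powershell.exe -enc SQBFAFgA... spawned by winword.exe"))
```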
2) Process automation
- Document processing: extracting clause elements from contracts (vision + text); see the sketch after this list.
- Regulatory monitoring: tracking ENISA/GDPR changes, with alerts and summaries.
- Technical documentation: creating manuals and procedures with validation from cross-functional teams.
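A minimal sketch of the clause-extraction case, assuming a local Ollama instance and a text model (a vision model such as LLaVA would be used for scanned documents after OCR). The clause categories in the system prompt are examples, not a fixed schema.

```python
import json
import requests

OLLAMA = "http://localhost:11434"  # assumption: local Ollama instance
MODEL = "llama3.2"                 # example text model; swap for a vision model on scans

def extract_clauses(contract_text: str) -> dict:
    """Return key contract clauses as structured JSON extracted by a local model."""
    r = requests.post(
        f"{OLLAMA}/api/chat",
        json={
            "model": MODEL,
            "stream": False,
            "format": "json",  # ask Ollama to constrain the output to valid JSON
            "messages": [
                {"role": "system",
                 "content": "Extract parties, term, termination and liability clauses. "
                            "Answer only with JSON."},
                {"role": "user", "content": contract_text},
            ],
        },
        timeout=180,
    )
    return json.loads(r.json()["message"]["content"])
```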
3) Intelligent DevOps
- Secure code: static/dynamic analysis with correction suggestions.
- Optimization: scalability recommendations based on telemetry and costs.
- Incident management: classifying tickets and initial RCA to reduce MTTR.
Security and governance: Zero-Trust and compliance
Adopting local AI requires a minimal-trust (Zero-Trust) architecture with controls aligned to European standards.
Zero-Trust “out of the box”
- Homomorphic encryption during inference flows with sensitive data (health/finance).
- NVIDIA Confidential Computing: a TEE for GPUs that isolates models and reduces the attack surface.
- Granular RBAC: permissions per model/prompt/output and traceability.
Compliance
- ENS Alto for public administrations in Spain.
- GDPR – Art. 35: predefined DPIA for processing personal data.
- ISO 27001/27017: secure management and controls in cloud.
- Periodic audits with compliance models to detect deviations.
Open WebUI: detailed menu for non-technical teams
- Login and authentication. Initial administrator setup, with control over the database and configuration. Corporate SSO is supported.
- Model selection. Catalog of downloaded models with options to add/remove and test without leaving the UI.
- Main chat. Central area with multiple conversations and history; ideal for playbooks, internal Q&A, and guided testing.
- Connection and parameters. Server IP/port, context size, temperature, top-p, etc. (these map onto Ollama API options; see the sketch after this list).
- Audio and images. Input via microphone, analysis, and/or image generation (if supported by the model).
- OCR and documents. Upload PDF/images to extract text and include in the context.
- Prompt templates. Reusable library to standardize tasks.
- Internet search. Available depending on configuration; useful when up-to-date information matters.
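The generation knobs exposed in the UI correspond directly to options of the Ollama API. A minimal sketch, assuming the default local endpoint and llama3.2 as an example model:

```python
import requests

# The same knobs Open WebUI exposes (temperature, top-p, context size) expressed
# as Ollama API options; endpoint and model name are assumptions.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Draft a short internal FAQ answer about password rotation.",
        "stream": False,
        "options": {
            "temperature": 0.3,  # lower = more deterministic
            "top_p": 0.9,        # nucleus sampling cut-off
            "num_ctx": 8192,     # context window size in tokens
        },
    },
    timeout=120,
)
print(resp.json()["response"])
```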
Why “local + cloud” is a strategic decision
- Sovereignty and privacy. Local AI prevents sensitive data from being sent to third parties. With SoaxNG, control remains on-premises or in private cloud, extending to OASIX Cloud when more capacity is needed.
- Latency and costs. Fewer network hops cut latency and per-call costs. For recurring workloads (internal RAG, classification, extraction), resident models typically come out ahead.
- Compliance. Keeping data and logs within EU jurisdiction simplifies GDPR, ENS, and audits (ISO).
- Scalability. Cloud handles peaks and facilitates experiments without risking data. The key is effective perimeter control and observability.
Best practices for production deployment
- Start with a scoped use case (e.g., internal assistant for FAQs and policies, or document processing of a specific type).
- Define profiles for GPU/CPU and SLOs (latency, throughput, context window) per model.
- Traceability: enable prompt/output logs with data protection and well-defined retention policies (a minimal logging sketch follows this list).
- Human-in-the-loop: set up review for critical tasks (legal, compliance, customer).
- Regular evaluation (quality, biases, drift), using validation datasets and metrics for accuracy and usefulness.
- Secrets management and rotation: credentials, keys, and internal store access.
- Continuity plan: rollbacks, snapshots of models, volume backups, and incident recovery procedures.
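On traceability, the sketch below wraps a local model call and appends prompt/output metadata to an audit trail. The file path and endpoint are illustrative assumptions; in production the records would go to a managed store with the retention and data-protection rules defined above.

```python
import json
import time
import requests

OLLAMA = "http://localhost:11434"   # assumption: local Ollama instance
AUDIT_LOG = "prompt_audit.jsonl"    # illustrative path; retention handled elsewhere

def generate_with_audit(model: str, prompt: str) -> str:
    """Call the local model and append prompt/output metadata to an audit trail."""
    r = requests.post(
        f"{OLLAMA}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    output = r.json()["response"]
    record = {
        "ts": time.time(),
        "model": model,
        "prompt": prompt,   # consider masking personal data before persisting
        "output": output,
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return output
```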
Adoption in Spain: a pathway to digital sovereignty
For Spanish organizations, the Ollama + SoaxNG duo offers a pragmatic route towards generative AI without sacrificing sovereignty: simplified installation, visual management, and security controls with ENS/ISO certifications, easing public procurement and audits. The hybrid approach—local where data must be protected, cloud where growth is desired—is, today, the most realistic strategy for delivering immediate value.
Conclusion
The convergence of local AI and cloud is no longer a philosophical debate: it is an operational architecture. Ollama minimizes friction when deploying models close to the data; Open WebUI makes AI accessible to everyone in the organization; and SoaxNG provides the structure (orchestration, profiles, persistence, security) that an enterprise environment demands. If the goal is to move faster without losing control, this is a solid starting point.
The next step? Choose a pilot use case, define success metrics, and measure. The advantage will not belong to the biggest model, but to the ability to turn it into repeatable processes that improve the business and compliance at the same time.
Frequently asked questions
What advantages does running LLMs with Ollama have over consuming an external service?
Lower latency, more predictable costs, and control over the data. This is key when there is sensitive information, regulatory requirements, or a need to customize models without exposing prompts and outputs to third parties.
Can I start without a GPU?
Yes. Many GGUF models run on CPU for prototypes and light use cases. For concurrency and large contexts, a GPU brings a notable improvement. SoaxNG allows profiles tailored to each case.
How are security and compliance managed?
With Zero-Trust (granular RBAC, isolation, TEEs on GPUs), encryption, and compliance artifacts (ENS, GDPR with DPIA, ISO 27001/27017). Traceability of prompts and outputs facilitates audits.
Which models are suitable to start with?
It depends on the use case: Llama 3.x for assistants and general text; DeepSeek-R1 for reasoning; CodeLlama for development; and LLaVA for documents/vision. The key is adjusting quantization and context to the expected SLO.