The AI acceleration cloud company strengthens its presence in Europe with operational infrastructure in Sweden and a technical training tour kicking off in Amsterdam. It promises lower latency for northern and central Europe, data residency within the EU, and a practical menu of techniques for adapting open models.
Together AI announced a new step in its European strategy: it now has GPU infrastructure operational in Sweden and, concurrently, will launch a series of free workshops to train engineers and technical teams in fine-tuning and deploying open models. The first workshop—focused on updating and customizing LLMs—will take place in Amsterdam on September 10.
This move combines technological muscle with educational support. On one hand, a Nordic region brings compute closer to users in the north and center of the continent; on the other, the events cover highly specific content: post-training (SFT, preference optimization, verifiable rewards), custom speculative decoding (with cited speedups of 1.85× or more on models like DeepSeek R1), and quantization to compress LLMs and reduce inference costs. The company's ambition is that clients not only consume capabilities but also learn to leverage open models at lower cost and latency.
Infrastructure in Sweden: data residency and milliseconds worth their weight in gold
The new Together AI region in Sweden powers its serverless inference API for a range of popular open models—such as gpt-oss, DeepSeek, Meta Llama, and Qwen—and also allows clients to request GPU clusters and dedicated endpoints directly within Swedish territory.
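For context on what "serverless inference API" means in practice, below is a minimal sketch of a chat completion call, assuming the `together` Python SDK's OpenAI-style chat completions interface; the model identifier is illustrative, and how requests are pinned to a particular region (such as Sweden) is not shown here.

```python
# Minimal chat completion against the serverless inference API.
# Assumes the `together` Python SDK and a TOGETHER_API_KEY environment variable;
# the model ID is illustrative, any of the hosted open models could be used.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative open model
    messages=[{"role": "user", "content": "Summarize EU data residency requirements in one sentence."}],
)

print(response.choices[0].message.content)
```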
The company highlights two immediate operational benefits:
- Data compliance and residency within the EU. By locating GPU servers in Sweden, European legal and security teams have a jurisdictional anchor to meet governance and transparency requirements. In regulated sectors or with strict audits, this is not just a bonus: it prevents unnecessary data transfers and simplifies discussions with regulators and risk committees.
- Perceptible reduction in latency. Bringing inference closer to the end user can shave 50 to 70 ms off round-trip network time, which, according to the company, translates into response-time improvements of up to 25–30% in reactive applications (for example, a request that takes 220 ms end to end and sheds 60 ms of network round trip responds about 27% faster). In chat, assisted editing, tool-calling agents, or incremental completion flows, these milliseconds are noticeable.
For those needing reserved capacity, dedicated endpoints and custom clusters remain the answer. An example from Caesar (caesar.xyz), a platform focused on deep research for knowledge professionals, illustrates this hybrid approach:
“Currently, we use Together’s dedicated endpoints (a Llama 4 Maverick deployment on 8×H200) to power our transformation stage with high concurrency and wide context windows. As we approach public launch, we are excited to deploy our workloads in Together AI’s new Sweden region to offer lower latency and meet regional data requirements. The combination of dedicated capacity and serverless elasticity allows us to scale quickly as demand grows.” — Mark McKenzie, founder of Caesar.
The clear message for the market is: dedicated capacity when load is stable or SLAs demand it, and serverless to absorb spikes and contain costs in unpredictable scenarios—two modes that coexist and can be orchestrated from the same platform.
The other pillar: practical workshops to upskill teams
The company isn’t just powering up servers; it also aims to raise the skill level of developers and data teams. Alongside the Sweden launch, it is introducing a training tour focused on AI skills. The first session, in Amsterdam on September 10, focuses on how to update and customize open models using production-ready methods.
The topics target three critical areas that now distinguish prototypes from robust systems:
- Post-training with SFT, preferences, and verifiable rewards:
  - SFT with domain-specific data: curate and blend niche datasets (e.g., legal, financial, industrial) so the model speaks the “language” of the industry.
  - Preference optimization: fine-tune responses against quality criteria set by the team (style, accuracy, tone, safety).
  - Verifiable rewards: introduce measurable signals (tests, checkers, rules) that reduce subjectivity and help scale alignment without high labeling costs; see the first sketch after this list.
- Custom speculative decoding:
  - Use a “draft” model tuned to the domain to pre-generate tokens and accelerate inference of the larger model; see the second sketch after this list.
  - Combined with well-calibrated acceptance/rejection strategies, Together AI cites speedups above 1.85× on pipelines like DeepSeek R1, a significant improvement for high-traffic serving or consistently low latency.
- Quantization to fit large LLMs into modest environments:
  - Compression techniques that reduce memory and FLOPs, lower GPU requirements, and cut cost per query, enabling inference on smaller devices or cheaper infrastructure.
  - For many organizations, quantization plus nearby endpoints is enough to go from “we don’t have high-end GPUs” to “the business can serve with reasonable SLAs and sustainable margins.”
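To make the “verifiable rewards” idea concrete, here is a small illustrative sketch (not taken from Together AI's materials): a rule-based check that scores a completion on format compliance, the kind of measurable signal that can replace part of the human labeling budget. The output schema is hypothetical.

```python
import json

def format_reward(completion: str) -> float:
    """Toy verifiable reward: 1.0 if the output is valid JSON with the
    required keys, 0.0 otherwise. Tests, checkers, and rules like this
    turn subjective quality judgments into scriptable signals."""
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    required_keys = {"answer", "citations"}  # hypothetical output schema
    return 1.0 if required_keys.issubset(parsed) else 0.0

# Scores like these can be fed to a preference/RL-style optimizer as rewards.
print([format_reward(c) for c in ('{"answer": "42", "citations": []}', "not json")])
# -> [1.0, 0.0]
```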
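And to illustrate the mechanics behind speculative decoding, the following toy sketch (with stand-in functions, not a real serving stack) shows the draft-then-verify loop: a small draft model proposes several tokens, the large target model verifies them, and the longest accepted prefix is kept, which is how multiple tokens per expensive forward pass (and speedups like the cited 1.85×) become possible.

```python
import random

def draft_propose(context, k=4):
    """Stand-in for a small, domain-tuned draft model proposing k tokens."""
    return [f"<draft_token_{i}>" for i in range(k)]

def target_verifies(context, token):
    """Stand-in for the large target model's check; real systems compare
    draft and target probabilities for the token and accept stochastically."""
    return random.random() < 0.8  # illustrative acceptance rate

def speculative_step(context):
    """One draft-then-verify step: keep the longest verified prefix of the
    draft. The better the draft matches the target, the more tokens are
    accepted per expensive target-model pass, and the larger the speedup."""
    accepted = []
    for token in draft_propose(context):
        if target_verifies(context + accepted, token):
            accepted.append(token)
        else:
            break
    return accepted

print(speculative_step([]))
```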
Leading the plan, CEO Vipul Ved Prakash emphasizes the ecosystem: “Europe is at the forefront of AI innovation, and we are committed to equipping its developers and researchers with the infrastructure and experience needed for success. Investments in Sweden and the European engineering community demonstrate our dedication to fostering high-performance, reliable, and scalable AI in the region.”
What does Together AI really address?
Beyond the rhetoric, Together AI’s positioning is concrete: train, fine-tune, and run generative AI models on a specialized cloud that prioritizes performance, control, and cost. The platform supports open and custom models across multiple modalities, and lets clients choose deployment methods with adjustable levels of privacy and security. In essence: it doesn’t impose a single model or a closed consumption pattern; it enables composable architectures.
Operationally, the Swedish region extends the global network powering its serverless API. On the capacity side, dedicated endpoints and on-demand GPU clusters provide stable performance and isolation, vital when workloads are critical, prompts involve very large context windows, or the business demands fine-grained traceability of throughput.
In practice, this means a platform team can:
- Anchor data and traffic within the EU to streamline compliance.
- Reduce latency for users in northern and central Europe without re-architecting the entire stack.
- Combine elastic (serverless) consumption with fixed (dedicated) capacity, depending on load patterns.
- Adopt modern post-training, decoding, and quantization techniques without starting from scratch.
Why now? Latency, costs, and talent
The European context helps explain the timing. Alongside the rise of agents, copilots, and conversational experiences, technical teams face three forces:
- Latency as UX: every 50–70 ms saved changes users’ perception in chat, generative search, or in-app assistance. In mature markets, this difference can tip the scales.
- Token costs: the debate is no longer just “which model,” but how much it costs to serve at scale. Techniques like speculative decoding and quantization are direct efficiency levers.
- Talent scarcity: talent exists but not everywhere or with the same “stack.” Workshops aim to shorten the gap between academic papers and what truly works in real-world stacks with metrics, observability, and SLAs.
From a business perspective, having a Nordic region shortens network hops toward markets like Sweden, Denmark, Norway, Finland, the Netherlands, or Germany. Legal anchoring within the EU reduces friction in purchasing, auditing, security, and risk management, especially in banking, healthcare, or public sectors.
What Amsterdam brings: recipes, not just concepts
The promise of Together AI’s Model Shaping workshop isn’t just slide summaries. The curriculum is designed so attendees internalize practices that help in day-to-day operations:
- How to select and clean domain data for effective SFT without incurring excessive labeling costs.
- How to define verifiable reward functions to improve the model where it matters (format compliance, hallucination avoidance, terminological consistency).
- How to calibrate a “draft” model for speculative decoding, and where to set thresholds to avoid compromising quality for speed.
- How to choose the right quantization strategy based on hardware, precision requirements, and the sensitivity of the specific use case; a sketch of one common approach follows this list.
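As a point of reference for the quantization item, the sketch below shows one common open-source route: 4-bit loading with Hugging Face transformers and bitsandbytes. The library choice and model ID are assumptions for illustration, not necessarily the stack the workshop will teach.

```python
# Illustrative 4-bit quantized loading with transformers + bitsandbytes.
# Library choice and model ID are assumptions for the example, not the
# workshop's prescribed stack; requires a CUDA GPU with bitsandbytes installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example open model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # store weights in 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bf16 for quality
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                       # fits in far less VRAM than fp16 weights
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```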
In essence: take home reproducible procedures that enable moving from prototypes to scalable, cost-controlled systems.
An identity forged around “pragmatic openness”
As a leading AI acceleration cloud, Together AI describes itself as committed to open collaboration, innovation, and transparency. This isn’t mere ideology: performance is essential, but so is giving clients options for control (models, endpoints, data residency, privacy) and support in the form of applicable knowledge. In that framework, Sweden is both a point of presence and a statement of intent for Europe.
The plan aims for a virtuous circle:
- Infrastructure close to users and data,
- Tools to efficiently customize open models,
- Training so teams adopt the latest without steep learning curves.
If the ecosystem responds—with projects transforming latency and costs into better experiences and margins—the effort will have been worthwhile.
The essentials in four key points
- New Swedish region now operational: serverless API, dedicated endpoints, and on-demand GPU clusters with EU data residency.
- Latency: typical cuts of 50–70 ms, with response-time reductions of up to 25–30% in interactive apps.
- Free workshops: starting in Amsterdam (September 10) on post-training (SFT, preferences, verifiable rewards), speculative decoding (speedups above 1.85× on pipelines like DeepSeek R1), and quantization.
- CEO message: “Europe is at the forefront”; Together AI will invest in infrastructure and the engineering community to foster reliable, scalable AI in the region.
FAQs
1) What benefits does a European company gain deploying inference in Together AI’s Swedish region?
Primarily two things: EU data residency (key for compliance and auditing) and lower latency for users in northern and central Europe (typical cuts of 50–70 ms, with response-time improvements of up to 25–30%). This translates into better UX and less legal friction.
2) What’s the practical difference between using the serverless API and a dedicated endpoint?
The serverless API provides elasticity and pay-as-you-go pricing; ideal for spikes, testing, and variable demand services. A dedicated endpoint guarantees reserved capacity, stable performance, and isolation, useful for critical loads, large context windows, or strict SLAs. Many organizations combine both: fixed capacity + elasticity for spikes.
3) Which specific techniques are addressed in the Amsterdam workshop and why do they matter?
Topics include SFT, preference optimization, and verifiable rewards (which align the model with the domain at reasonable cost), speculative decoding (which accelerates inference with a “draft” model, with cited speedups above 1.85×), and quantization (which reduces hardware requirements and cost per query). These are direct levers to improve quality and lower cost and latency.
4) What models does the API support, and how does this fit with control and security requirements?
Together AI’s API supports open and custom models—including gpt-oss, DeepSeek, Meta Llama, Qwen—and offers deployment options with various levels of isolation, traceability, and privacy. Coupled with EU data residency (Sweden region), it enables designing compliant architectures without sacrificing performance.
Note: The information is sourced from the official Together AI announcement regarding infrastructure opening in Sweden and the launch of their European workshop series, including the first Model Shaping workshop in Amsterdam. The latency figures (50–70 ms and 25–30%) and techniques (SFT, preferences, rewards, speculative decoding > 1.85×, quantization) were provided by the company.