French startup Kog AI has made a significant leap in AI model inference, announcing that its inference engine generates tokens up to 3.5 times faster than leading solutions such as vLLM and TensorRT-LLM on AMD Instinct™ MI300X GPUs. The result positions Kog at the forefront of next-generation inference platforms and underscores Europe's commitment to a sovereign, independent technological infrastructure.
In the era of generative AI, the bottleneck is no longer training but inference. The speed at which a model produces sequential output, measured in tokens per second per request, has become critical for autonomous agents, virtual assistants, real-time voice applications, and advanced reasoning models. Yet the most widely used inference engines remain optimized for high-throughput chat serving, often neglecting the latency of individual request streams.
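To make the metric concrete, here is a minimal sketch of how per-request token speed differs from aggregate throughput: it times a single streamed generation and divides the token count by wall-clock time. The `stream_tokens` function is a hypothetical stand-in for any streaming inference client, not Kog's API.

```python
# Minimal sketch of the metric discussed above: tokens per second for a single
# request, as opposed to batch throughput across many concurrent users.
# `stream_tokens` is a placeholder for a real streaming inference client.
import time
from typing import Iterable, Iterator


def stream_tokens(prompt: str) -> Iterator[str]:
    # Placeholder: yield tokens as the engine would produce them.
    for tok in ["Hello", ",", " this", " is", " a", " reply", "."]:
        time.sleep(0.02)  # simulate per-token generation latency
        yield tok


def tokens_per_second(prompt: str) -> float:
    start = time.perf_counter()
    count = sum(1 for _ in stream_tokens(prompt))
    elapsed = time.perf_counter() - start
    return count / elapsed


if __name__ == "__main__":
    rate = tokens_per_second("Explain KV caching briefly.")
    print(f"{rate:.1f} tokens/s for one request")
```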
Kog AI has shared preliminary results showing its engine outperforms competitors across all key metrics. Highlights include:
– Up to 3.5× faster token generation compared to current engines on MI300X.
– Consistent performance across models ranging from 1 billion to 32 billion parameters (Llama, Mistral, Qwen).
– A record inter-GPU latency of just 4 microseconds, up to four times lower than typical communication libraries.
Kog’s inference engine performs especially well on compact models (1B to 7B parameters), which, when properly fine-tuned, can match or surpass the accuracy of much larger models on specific tasks, dramatically lowering infrastructure costs while increasing speed as much as tenfold.
Unlike other solutions, Kog did not merely optimize existing frameworks: the engine was built from scratch in C++ and highly optimized assembly to eliminate bottlenecks at both the hardware and software levels. A key innovation is KCCL (Kog Collective Communications Library), an internally developed communication library that has achieved the lowest latency ever recorded for distributed inference across GPUs.
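KCCL itself is proprietary and its API has not been published, but the kind of figure Kog cites (microseconds per collective operation) is normally obtained with a small-payload latency microbenchmark. The sketch below illustrates that methodology using the generic torch.distributed stack (NCCL, or RCCL on ROCm builds) rather than KCCL; the launch command and all names are standard PyTorch, not Kog's.

```python
# Hedged sketch: a generic inter-GPU latency microbenchmark, NOT Kog's KCCL.
# It times tiny all-reduce operations so the measurement is dominated by
# communication latency rather than bandwidth.
# Launch with: torchrun --nproc_per_node=<num_gpus> bench.py
import os
import time

import torch
import torch.distributed as dist


def main() -> None:
    dist.init_process_group(backend="nccl")  # RCCL is used transparently on ROCm
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device="cuda")  # 4-byte payload: latency-bound, not bandwidth-bound

    for _ in range(100):  # warm-up so setup costs are excluded
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 1000
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    if dist.get_rank() == 0:
        print(f"avg all-reduce latency: {elapsed / iters * 1e6:.1f} µs")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```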
The result is a flexible system deployable on-premises, in the cloud, or in hybrid environments, packaged as APIs or Docker containers and ready for critical use cases such as real-time voice transcription, autonomous agents, or contextual assistants with advanced reasoning.
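Kog has not published its API schema, so the following client sketch assumes a hypothetical OpenAI-compatible endpoint exposed by such a container; the URL, port, model name, and response shape are illustrative only.

```python
# Hedged sketch: calling a containerized inference API over HTTP.
# Assumes a hypothetical OpenAI-compatible endpoint; not Kog's documented API.
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical container port

payload = {
    "model": "llama-3.1-8b-instruct",  # placeholder model identifier
    "messages": [{"role": "user", "content": "Summarize today's meeting notes."}],
    "max_tokens": 128,
}

resp = requests.post(ENDPOINT, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```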
Kog AI’s announcement reflects not only a technical breakthrough but also a strategic statement. The French startup epitomizes a new wave of European technological innovation aimed at reducing dependence on American and Asian infrastructures and developing a sovereign, agile, and highly specialized AI.
“Modern AI applications cannot afford high latencies or inefficient infrastructure,” says Kog. “Our goal is to make real-time inference a standard, not an exception.”
Amid rising inference costs and latency issues that threaten user experience, Kog’s scalable, efficient, and sovereign approach offers an attractive alternative. Its partnership with AMD and its use of the powerful MI300X accelerator demonstrate that Europe can not only compete but also lead in designing cutting-edge solutions for the next generation of AI.