P99 CONF 2025 | LLM Inference Optimization by Chip Huyen
Go to https://www.p99conf.io/ for P99 CONF talks on demand and to learn more.
This talk discusses why LLM inference is slow and which latency metrics matter, then covers techniques that make inference fast, including batching strategies, parallelism, and prompt caching. Not all latency problems are engineering problems, though, so the talk also looks at tricks for hiding latency at the application level.
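As a rough, framework-agnostic illustration of the prompt-caching idea mentioned in the abstract: the expensive work done for a shared prompt prefix (for example, a system prompt) is computed once and reused across requests instead of being redone. The Python sketch below is hypothetical; encode_prefix and generate are placeholder functions, not a real inference API, and a production server would cache KV-cache blocks rather than strings.

from functools import lru_cache

@lru_cache(maxsize=1024)
def encode_prefix(prefix: str) -> str:
    # Hypothetical stand-in for the expensive prefill of a shared prefix.
    # A real serving stack would return cached KV-cache blocks, not a string.
    return f"<kv-state for {prefix!r}>"

def generate(context: str, user_message: str) -> str:
    # Placeholder for the decode loop that produces the completion.
    return f"<completion for {user_message!r} given {context}>"

def answer(system_prompt: str, user_message: str) -> str:
    kv_state = encode_prefix(system_prompt)  # cache hit after the first request
    return generate(kv_state, user_message)

if __name__ == "__main__":
    sys_prompt = "You are a helpful assistant."
    print(answer(sys_prompt, "What is p99 latency?"))
    print(answer(sys_prompt, "Why is LLM decoding slow?"))  # prefix work reused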
P99 CONF is the premier technical conference for engineers obsessed with high-performance, low-latency applications. Join developers and architects from leading companies like Shopify, Uber, Disney, Netflix, Google, LinkedIn, ShareChat, Meta, Square, Lyft, and American Express — plus many more — as they share deep technical insights on topics such as Rust, Go, Zig, distributed data systems, Kubernetes, and AI/ML infrastructure. Hear from world-class experts like Michael Stonebraker (Postgres), Bryan Cantrill (DTrace, Oxide), Avi Kivity (KVM, ScyllaDB), Carl Lerche (Tokio, AWS), and other pioneers shaping the future of performance engineering. Experience it all — for free, from anywhere in the world. Explore talks and learn more at p99conf.io/
