ALTO: An Efficient Network Orchestrator for Compound AI Systems

Bibliographic Details
Title: ALTO: An Efficient Network Orchestrator for Compound AI Systems
Authors: Santhanam, Keshav; Raghavan, Deepti; Rahman, Muhammad Shahir; Venkatesh, Thejas; Kunjal, Neha; Thaker, Pratiksha; Levis, Philip; Zaharia, Matei
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Information Retrieval
Description: We present ALTO, a network orchestrator for efficiently serving compound AI systems such as pipelines of language models. ALTO achieves high throughput and low latency by taking advantage of an optimization opportunity specific to generative language models: streaming intermediate outputs. As language models produce outputs token by token, ALTO exposes opportunities to stream intermediate outputs between stages when possible. We highlight two new challenges of correctness and load balancing which emerge when streaming intermediate data across distributed pipeline stage instances. We also motivate the need for an aggregation-aware routing interface and distributed prompt-aware scheduling to address these challenges. We demonstrate the impact of ALTO's partial output streaming on a complex chatbot verification pipeline, increasing throughput by up to 3x for a fixed latency target of 4 seconds per request while also reducing tail latency by 1.8x compared to a baseline serving approach.
Document Type: Working Paper
Open Access: http://arxiv.org/abs/2403.04311
Accession Number: edsarx.2403.04311
Database: arXiv