How can we reduce energy consumption and costs in LLM clusters while maintaining top-tier performance?
Time | August 20, 2025, 16:00 |
---|---|
Speaker | Dr. Jovan Stojković |
Venue | Palata nauke, Horizont hall, 4th floor |
You will hear the answer from Dr. Jovan Stojković, who recently earned his PhD at the prestigious University of Illinois Urbana-Champaign. Jovan works on current topics at the intersection of computer architecture and machine learning, and in this talk he will present two innovative systems:
DynamoLLM: an energy-management framework for LLM environments that optimizes energy consumption and reduces emissions and costs while preserving performance.
TAPAS: an intelligent task-scheduling system for GPU clusters that reduces the load on power delivery and cooling with minimal impact on operation.
Join us and find out how these technologies are changing the way we think about the energy efficiency and performance of data centers!
Toward Energy-Efficient LLM Inference Serving Systems
Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs, causing the inference clusters to 1) consume large amounts of energy and produce substantial carbon emissions, and 2) provision high power and cooling capacities, resulting in a high datacenter Total Cost of Ownership (TCO). In this talk, I will present two systems that address these challenges: DynamoLLM and TAPAS.
DynamoLLM is the first energy-management framework for LLM inference environments. It automatically and dynamically reconfigures the inference cluster to optimize the energy and cost of LLM serving under the service's performance SLOs. DynamoLLM saves energy, lowers operational carbon emissions, and reduces cost to the customer, while meeting the latency SLOs.
TAPAS is a thermal- and power-aware scheduling scheme designed for GPU clusters in the cloud. TAPAS optimizes power and cooling oversubscription while keeping the impact on performance minimal. By using smart workload placement, request routing, and configuration tuning, TAPAS reduces thermal and power throttling events, boosting system efficiency without affecting latency or the quality of results.
Biography:
Jovan Stojkovic is an incoming Assistant Professor in the Computer Science department at the University of Texas at Austin. Jovan recently completed his PhD at the University of Illinois at Urbana-Champaign under the guidance of Professor Josep Torrellas. Jovan’s interests are in the architecture and systems for cloud and datacenter computing. His research has received multiple awards, including the HPCA Best Paper Award, an IEEE Micro Top Pick Honorable Mention, the W. J. Poppelbaum Memorial Award, the Kenichi Miura Award, and an invitation to speak at the Heidelberg Laureate Forum. Jovan completed his undergraduate studies at the University of Belgrade, School of Electrical Engineering.