Cloud Blog: PyTorch/XLA 2.5: vLLM support and an improved developer experience

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/whats-new-with-pytorchxla-2-5/
Source: Cloud Blog
Title: PyTorch/XLA 2.5: vLLM support and an improved developer experience

Feedly Summary: Machine learning engineers are bullish on PyTorch/XLA, a Python package that uses the XLA deep learning compiler to connect the PyTorch deep learning framework and Cloud TPUs. And now, PyTorch/XLA 2.5 is here, along with a set of improvements to add support for vLLM and enhance the overall developer experience. Featured in this release are:

A clarified proposal for deprecating the older torch_xla API in favor of the existing PyTorch API, providing a simpler developer experience. One example of this is the migration of the existing distributed API.

A series of improvements to the torch_xla.compile function that improve the debugging experience for developers during development.

Experimental support in vLLM for TPUs, allowing you to extend your existing deployments while leveraging the same vLLM interface across your TPUs.

Let’s take a look at each of these enhancements.
Streamlining the torch_xla API
With PyTorch/XLA 2.5, we’re taking a significant step toward making the API more consistent with upstream PyTorch. Our north star is to minimize the learning curve for developers already familiar with PyTorch, making it easier to use XLA devices. In practice, this means gradually deprecating custom PyTorch/XLA API calls where more mature upstream functionality exists and migrating those calls over to their PyTorch counterparts; features that have not yet migrated remain in the existing torch_xla module.
In the spirit of a simpler developer experience, in this release we migrated to some of the existing PyTorch distributed API functions when running models on top of PyTorch/XLA. Historically, the distributed API calls lived under the torch_xla module; in this update, we migrated most of them to torch.distributed.

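To give a concrete feel for the change, here is a minimal sketch of a collective call before and after the migration, assuming a typical all-reduce inside a multiprocess training function; the exact set of calls covered by the migration in this release may differ:

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the "xla" process-group backend
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    device = xm.xla_device()
    tensor = torch.ones(2, 2, device=device)

    # Before: collectives went through the torch_xla helper module, e.g.
    # reduced = xm.all_reduce(xm.REDUCE_SUM, tensor)

    # After: the standard torch.distributed API, backed by XLA.
    dist.init_process_group("xla", init_method="xla://")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)


if __name__ == "__main__":
    xmp.spawn(_mp_fn)
```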


Improvements to ‘torch_xla.compile’
We’ve also added a few new compilation features to help you debug and catch potential issues in your model code. For example, a ‘full_graph’ mode emits an error message when there’s more than one compilation graph. This helps you discover issues caused by multiple compilation graphs early on, during compilation.
Additionally, you can now specify an expected number of recompilations for compiled functions. This can help you debug performance issues in which a function might be getting recompiled more times than expected, for example, when it has unexpected dynamism.
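As an illustration, here is a minimal sketch of how these options can be used; full_graph is the mode described above, while the recompilation-limit keyword is shown under the name num_different_graphs_allowed, which is an assumption about the 2.5 spelling rather than a confirmed signature:

```python
import torch
import torch_xla
import torch_xla.core.xla_model as xm


def step(tensor):
    return torch.cos(torch.sin(tensor))


# full_graph=True asks the compiler to capture `step` as a single graph and
# emit an error if tracing produces more than one.
# num_different_graphs_allowed (keyword name assumed here) bounds how many
# distinct graphs, i.e. recompilations, are expected before an error is raised.
compiled_step = torch_xla.compile(step, full_graph=True,
                                  num_different_graphs_allowed=1)

tensor = torch.randn(4, 4, device=xm.xla_device())
out = compiled_step(tensor)
```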
You can now also give compiled functions an understandable name instead of an automatically created one. By naming compiled targets, you get more context in debugging messages, making it easier to figure out where a problem may be. Here’s an example of what that looks like in practice:

```
# named code
@torch_xla.compile
def dummy_cos_sin_decored(self, tensor):
    return torch.cos(torch.sin(tensor))

# target dumped HLO renamed with named code function name
…
module_0021.SyncTensorsGraph.4.hlo_module_config.txt
module_0021.SyncTensorsGraph.4.target_arguments.txt
module_0021.SyncTensorsGraph.4.tpu_comp_env.txt
module_0024.dummy_cos_sin_decored.5.before_optimizations.txt
module_0024.dummy_cos_sin_decored.5.execution_options.txt
module_0024.dummy_cos_sin_decored.5.flagfile
module_0024.dummy_cos_sin_decored.5.hlo_module_config.txt
module_0024.dummy_cos_sin_decored.5.target_arguments.txt
module_0024.dummy_cos_sin_decored.5.tpu_comp_env.txt
…
```

Looking at the output above, you can compare the original and the named output generated from the same file: ‘SyncTensorsGraph’ is the automatically generated name, while the files containing dummy_cos_sin_decored are the renamed dumps produced for the decorated function in the code example.
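If you would rather set the name explicitly than rely on the decorated function’s name, torch_xla.compile also takes a name argument; the snippet below is a sketch under that assumption, with decoder_step chosen purely for illustration:

```python
import torch
import torch_xla


def decoder_step(tensor):
    return torch.cos(torch.sin(tensor))


# "decoder_step" (name chosen here for illustration) appears in debug messages
# and HLO dump filenames instead of an auto-generated name.
decoder_step = torch_xla.compile(decoder_step, name="decoder_step")
```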
vLLM on TPU (experimental)
If you use vLLM to serve models on GPUs, you can now switch to TPU as a backend. vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. vLLM on TPU retains the same vLLM interface that developers love, including direct integration into Hugging Face Model Hub to simplify model experimentation on TPU. 
Switching your vLLM endpoint to TPU is a matter of a few config changes. Aside from the TPU image, everything else remains the same: request payload, metrics used for autoscaling, load balancing, model source code, etc. For details, see the installation guide. 
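As a sketch of what stays constant, the following uses the standard vLLM Python API; the model name is only an example, and running it on TPU assumes the TPU-enabled vLLM build from the installation guide:

```python
from vllm import LLM, SamplingParams

# The same vLLM Python API is used regardless of backend; with the TPU build
# installed on a TPU VM, this serves the model on TPU instead of GPU.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", max_model_len=512)

outputs = llm.generate(
    ["What is a Cloud TPU?"],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```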
Other vLLM features we’ve extended to TPU include Pallas kernels such as paged attention and flash attention, as well as performance optimizations in the dynamo bridge, all of which are now part of the PyTorch/XLA repository (code). While vLLM is available to PyTorch TPU users, this work is still ongoing, and we look forward to rolling out additional features and optimizations in future releases.
Start using PyTorch/XLA 2.5
You can start taking advantage of these latest features by downloading the latest release through your Python package manager. Or, if this is your first time hearing about PyTorch/XLA, check out the project’s GitHub page for installation instructions and more detailed information.
For a full list of changes, check out the release notes!

AI Summary and Description: Yes

**Summary:** The text discusses the release of PyTorch/XLA 2.5, highlighting significant improvements to the interaction between the PyTorch deep learning framework and Cloud TPUs. The enhancements include a streamlined API, debugging improvements in the compilation process, and experimental support for vLLM on TPUs, which are particularly relevant for developers and data scientists working with machine learning in cloud environments.

**Detailed Description:**
The release of PyTorch/XLA 2.5 introduces various enhancements that contribute to a better developer experience while working with the PyTorch deep learning framework, especially in the context of using Cloud TPUs. The implications of these updates are crucial for AI and infrastructure professionals focused on developing and deploying machine learning models efficiently.

– **API Streamlining:**
– The introduction of a clarified proposal for phasing out the old `torch_xla` API in favor of the existing PyTorch API. This change is aimed at reducing complexity for developers already accustomed to the PyTorch environment.
– Distribution APIs were migrated to standard PyTorch APIs, consolidating functionality and easing the learning curve for new users.

– **Improvements to `torch_xla.compile`:**
– Enhanced debugging capabilities with a new ‘full_graph’ mode, which emits error messages for multiple compilation graphs. This caters to developers who may run into complications caused by excessive duplications in graphs.
– Function recompilation tracking helps identify unexpected behaviors during model execution, a critical feature for model optimization.
– A new function naming feature improves clarity in debugging by allowing developers to specify meaningful names for compiled functions, which aids in tracking down issues more effectively.

– **Experimental vLLM Support on TPU:**
– vLLM, known for high-throughput and efficient model serving, now supports TPU, enabling users to transition seamlessly from GPU without altering their existing configuration aside from necessary TPU-specific changes.
– Integration with the Hugging Face Model Hub remains unchanged, further simplifying the process for developers looking to experiment with various models.

– **Future Developments:**
– Ongoing work on additional improvements and support for vLLM on TPUs is planned, indicating a commitment to evolving the tool to meet user needs and performance expectations.

This release is particularly significant for professionals in AI and cloud infrastructure, facilitating not only easier integration and deployment of machine learning models but also providing tools that enhance performance optimization and debugging capabilities.