The Imperative for Machine Learning Compilation
From Development to Deployment: The Core Challenge
Machine Learning Compilation (MLC) represents the critical technological bridge that transforms a machine learning model from its abstract, development-centric form into a concrete, high-performance deployment artifact.1 This process is not a mere translation but a sophisticated optimization pipeline designed to navigate the complexities of modern hardware and software ecosystems. At its core, MLC addresses the fundamental disconnect between how models are created and how they must be executed in production environments. The development form of a model encompasses the high-level abstractions used by data scientists and researchers. This typically involves model architectures defined in popular frameworks such as PyTorch, TensorFlow, or JAX, along with the associated learned parameters, or weights.1 These frameworks prioritize productivity, flexibility, and ease of experimentation, allowing for rapid prototyping and iteration. However, the very features that make them powerful for development—dynamic graph execution, Python-level control flow, and a vast library of operators—introduce significant overhead that is unacceptable in performance-critical deployment scenarios.

In contrast, the deployment form is a lean, optimized package containing only the essential components required for inference. This includes low-level, hardware-specific executable code for each model operation, efficient routines for managing resources like memory, and stable Application Programming Interfaces (APIs) for integration into larger applications, such as a Java API for an Android mobile application.1 A crucial aspect of this transformation is integration and dependency minimization. For instance, deploying a flower classification model to a camera app should not require packaging code related to natural language processing, such as embedding table lookups. MLC enables the selective assembly of necessary components, drastically reducing the final application’s size and broadening the range of devices it can be deployed on, a paramount concern for resource-constrained environments like mobile phones and edge devices.1
The Combinatorial Explosion Problem: Hardware and Model Diversity
The primary impetus for the rise of sophisticated ML compilers is a systemic challenge known as the “combinatorial explosion” of models and hardware.4 The field of machine learning is characterized by a dual-pronged, rapid evolution. On one axis, model architectures are constantly advancing, from established Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to the now-dominant Transformer architectures and their ever-growing variants.5 Each new architecture introduces unique computational patterns and operator requirements.
On the other axis, the hardware landscape has fragmented into a vast and heterogeneous ecosystem. Beyond general-purpose CPUs, workloads are now deployed on Graphics Processing Units (GPUs) with specialized cores like NVIDIA’s TensorCores, Google’s Tensor Processing Units (TPUs), custom Neural Processing Units (NPUs) in mobile devices, and reconfigurable hardware like Field-Programmable Gate Arrays (FPGAs).6 Each of these hardware targets possesses a unique architecture, memory hierarchy, and instruction set, demanding tailored code to unlock its peak performance.6
The result is a matrix of M models and N hardware targets, where the engineering effort required to manually optimize each model for each target scales multiplicatively. Relying on human engineers to hand-craft optimized kernels (e.g., in CUDA for NVIDIA GPUs) for every permutation is prohibitively expensive, time-consuming, and unsustainable.6 Early deep learning frameworks mitigated this by relying on a small set of vendor-provided, hand-tuned libraries like Intel’s MKL for CPUs and NVIDIA’s cuDNN for GPUs.5 However, this approach creates a dependency bottleneck; these libraries often lag behind the latest research in model architecture and are non-existent for novel or specialized hardware accelerators.5
This is the chasm that ML compilers are designed to bridge. They serve as a vital abstraction layer, automating the complex task of generating performant, hardware-specific code from a high-level model description.6 By decoupling the model definition from the hardware implementation, compilers tackle the combinatorial explosion problem head-on, enabling researchers to focus on model innovation while ensuring their work can be efficiently deployed across the full spectrum of computing platforms.
A Primer on Key Compiler Optimization Techniques
To achieve this hardware-specific optimization, ML compilers employ a suite of powerful transformation techniques. While these techniques are diverse, they share a common underlying goal: to restructure the computation and data layout of a model to better align with the architectural realities of the target hardware, with a particular focus on managing the memory hierarchy. The performance of modern processors is often limited not by their computational speed but by the latency and bandwidth of memory access—the so-called “memory wall.” Consequently, the most impactful compiler optimizations are those that minimize data movement and maximize data reuse within the fastest tiers of the memory system.
Operator Fusion (Kernel Fusion)
Operator fusion, also known as kernel fusion, is one of the most critical optimizations in ML compilers. It addresses the significant overhead associated with memory access and kernel launches by combining multiple, sequential operators from the computation graph into a single, monolithic kernel.9
The primary motivation for fusion is the reduction of data movement to and from global memory (e.g., DRAM or HBM on a GPU), which is orders of magnitude slower than on-chip memory like registers or L1/L2 caches.12 Without fusion, the result of each individual operator must be written out to global memory, only to be immediately read back by the next operator in the sequence. This constant traffic becomes a major performance bottleneck. By fusing operators, intermediate results can be kept in fast, local memory—such as GPU registers or shared memory—and consumed directly by the next stage of the fused computation.12 This elimination of redundant memory write/read cycles drastically reduces pressure on memory bandwidth.11 A secondary benefit is the reduction in kernel launch overhead; invoking a single large kernel is far more efficient than launching many small, independent ones.14
A canonical example of this technique is the fusion of a convolution layer, a bias addition, and a Rectified Linear Unit (ReLU) activation function.11 In a non-fused execution, the workflow would be:
- Launch a convolution kernel, read inputs from global memory, write outputs to global memory.
- Launch an element-wise addition kernel, read the convolution outputs and bias from global memory, write results to global memory.
- Launch a ReLU kernel, read the addition results from global memory, write final outputs to global memory.
A fused kernel performs all three operations in a single pass. The convolution output is computed and stored in a register, the bias is added to it, the ReLU function is applied, and only the final result is written to global memory. This transformation can yield profound performance improvements, sometimes doubling the throughput.12
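To make the memory-traffic argument concrete, the following pure-NumPy sketch contrasts an unfused pipeline, which materializes a full intermediate tensor after every step, with a conceptually fused loop that keeps intermediates in scalar locals. A matrix multiplication stands in for the convolution, and the names and shapes are arbitrary; the Python loop illustrates the access pattern only, not real kernel performance.

```python
import numpy as np

def unfused(x, w, b):
    # Each stage writes a full-sized intermediate tensor to memory,
    # which the next stage must read back.
    t0 = x @ w                   # matmul stands in for the convolution
    t1 = t0 + b                  # bias add: one more full read/write pass
    return np.maximum(t1, 0.0)   # ReLU: yet another full read/write pass

def fused(x, w, b):
    # Conceptual fusion: each output element is produced in one pass and the
    # intermediate values live only in scalar locals ("registers"), never in
    # a full-sized temporary array.
    m, k = x.shape
    _, n = w.shape
    out = np.empty((m, n), dtype=x.dtype)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += x[i, p] * w[p, j]
            acc += b[j]
            out[i, j] = acc if acc > 0.0 else 0.0
    return out

x = np.random.rand(4, 8).astype(np.float32)
w = np.random.rand(8, 3).astype(np.float32)
b = np.random.rand(3).astype(np.float32)
assert np.allclose(unfused(x, w, b), fused(x, w, b), atol=1e-5)
```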
However, fusion is not always straightforward. Compilers must navigate complex data dependencies that can prevent fusion. For example, reduction operations, which are central to attention mechanisms in Transformers, introduce loop-carried dependencies that are challenging for traditional fusion heuristics.13 Advanced compilers are beginning to explore more aggressive strategies, such as intentionally breaking certain dependencies and then constructing algebraic “repair” terms to compensate, thereby enabling fusion in cases where it was previously deemed impossible.13
Constant Folding
Constant folding is a fundamental optimization that reduces runtime computation by pre-calculating parts of the model graph that are static and do not depend on user input.9 It is the process of identifying and evaluating constant expressions at compile time and replacing them with their final computed values.17
In the context of machine learning, a “constant expression” can be an entire subgraph of operations where all inputs are known at compile time. This most commonly applies to operations involving model parameters (weights), which are fixed after training, or other static configuration values. For example, if a model includes a step to normalize weights by a constant factor, the compiler can perform this normalization once during compilation and embed the final normalized weights directly into the executable.19
This technique is often used in tandem with constant propagation, where the value of a constant variable is substituted into subsequent expressions that use it.17 The combined effect is a cascade of simplifications that can eliminate significant portions of the original computation graph. The benefits are twofold: it reduces the number of instructions that need to be executed at runtime, leading to lower latency, and it can reduce the size of the final model by eliminating the need to store intermediate constant tensors.16
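As a minimal sketch of the weight-normalization example above, the snippet below folds the constant subexpression (scaling fixed weights) once ahead of time, so that only input-dependent work remains at runtime. The names and shapes are illustrative.

```python
import numpy as np

# Toy "graph": normalize fixed weights by a fixed scale, then apply them to a
# runtime input. Everything except the final matmul is known at compile time.
WEIGHTS = np.random.rand(8, 4).astype(np.float32)
SCALE = np.float32(0.5)

def run_unfolded(x):
    w = WEIGHTS * SCALE            # constant work redone on every call
    return x @ w

# Constant folding: evaluate the constant subgraph once and embed the result
# in the deployed artifact.
FOLDED_WEIGHTS = WEIGHTS * SCALE

def run_folded(x):
    return x @ FOLDED_WEIGHTS      # only input-dependent work remains

x = np.random.rand(2, 8).astype(np.float32)
assert np.allclose(run_unfolded(x), run_folded(x))
```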
Memory Layout Optimization
The logical representation of a tensor in a high-level framework (e.g., a 4D tensor for an image batch with dimensions for batch, channels, height, and width) is distinct from its physical layout in memory. The order of these dimensions in memory can have a dramatic impact on performance, and memory layout optimization is the process of reordering them to best suit the target hardware.21
The most common example is the transformation between NCHW (Batch, Channels, Height, Width) and NHWC (Batch, Height, Width, Channels) layouts. While many frameworks like PyTorch traditionally default to NCHW, hardware accelerators like NVIDIA GPUs with TensorCores and Google TPUs often achieve significantly higher performance with the NHWC layout. This is because NHWC places the channel dimension last, which often leads to more coalesced memory accesses for convolution operations. Coalesced access occurs when parallel threads or processing elements access contiguous blocks of memory simultaneously, maximizing the effective use of memory bandwidth.22
An ML compiler with knowledge of the target hardware’s preferences can automatically insert layout transformation nodes into the graph. It analyzes the model and determines the optimal layout for each operator, inserting transposes where necessary to convert between layouts. This is a critical hardware-specific optimization that bridges the gap between the framework’s logical tensor view and the physical memory architecture of the device, ensuring that data is presented to the compute units in the most efficient format possible.24 This optimization extends beyond simple dimension reordering to include more complex memory management strategies like efficient register allocation to minimize spills to main memory and careful planning of memory buffers.26
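The NumPy sketch below distinguishes a logical axis permutation from a physical NCHW-to-NHWC layout change; the materialized copy is what a compiler-inserted layout-transform node produces. (In PyTorch, the analogous request would be tensor.to(memory_format=torch.channels_last).)

```python
import numpy as np

# A batch of 2 RGB images of size 4x4, stored NCHW.
x_nchw = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)

# Logical reinterpretation: axes are permuted but no data moves yet
# (the result is just a strided view).
x_nhwc_view = x_nchw.transpose(0, 2, 3, 1)

# Physical layout change: materialize a contiguous NHWC buffer so that the
# channel values of each pixel sit next to each other in memory.
x_nhwc = np.ascontiguousarray(x_nhwc_view)

print(x_nchw.strides)  # (192, 64, 16, 4): channel stride is large in NCHW
print(x_nhwc.strides)  # (192, 48, 12, 4): channels are innermost in NHWC
```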
Quantization
Quantization is a powerful compression and optimization technique that reduces the numerical precision of a model’s parameters (weights) and, in many cases, its activations (intermediate results).9 Typically, models are trained using 32-bit floating-point numbers (FP32) for their wide dynamic range and high precision. Quantization converts these values to lower-precision formats, most commonly 16-bit floating-point (FP16 or BFloat16) or 8-bit integers (INT8).28 This technique is transformative for several reasons:
- Reduced Model Size and Memory Footprint: By reducing the number of bits per parameter, the overall model size is drastically reduced. An INT8-quantized model, for example, is approximately four times smaller than its FP32 equivalent. This is critical for deployment on devices with limited storage and RAM, such as microcontrollers and mobile phones.30
- Faster Inference Speed: Many modern processors, from high-end GPUs with TensorCores to mobile NPUs, have specialized hardware units designed to perform integer arithmetic much faster than floating-point arithmetic. Executing a model using INT8 operations can lead to a significant increase in throughput and a reduction in latency.29
- Lower Power Consumption: The combination of reduced data movement (due to a smaller memory footprint) and simpler integer computations results in a substantial decrease in energy consumption. This is a crucial benefit for battery-powered edge devices.31
There are two primary methodologies for applying quantization, each with its own trade-offs:
- Post-Training Quantization (PTQ): This approach is applied to an already trained FP32 model. The process involves analyzing the distribution of weights and activations to determine an optimal mapping from the FP32 range to the lower-precision INT8 range. This often requires a small, representative “calibration dataset” to capture the typical dynamic range of the activations.34 PTQ is relatively simple and fast to apply but can sometimes lead to a noticeable drop in model accuracy, as the model was not trained with quantization in mind. A minimal numerical sketch of this FP32-to-INT8 mapping follows the list.
- Quantization-Aware Training (QAT): To mitigate the accuracy loss associated with PTQ, QAT simulates the effect of quantization during the training or fine-tuning process. It inserts “fake quantization” nodes into the model graph, which mimic the rounding and clipping errors that will occur during integer-based inference. By making these errors part of the training loss function, the model learns to become robust to them, adjusting its weights to minimize the impact of the precision reduction.28 QAT typically yields higher accuracy than PTQ but requires access to the training pipeline and is more computationally expensive.
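To ground the FP32-to-INT8 mapping described above, here is a minimal NumPy sketch of symmetric post-training quantization. In a real PTQ flow the scale for activations would be estimated from a calibration dataset rather than taken from the tensor itself, and production schemes add refinements such as per-channel scales and zero points.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.2, size=(4, 8)).astype(np.float32)

# Symmetric quantization: map the observed FP32 range onto [-127, 127].
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to inspect the rounding error introduced by the 8-bit mapping.
dequantized = q.astype(np.float32) * scale
print("max quantization error:", np.abs(weights - dequantized).max())
print("storage per weight: 4 bytes (FP32) -> 1 byte (INT8)")
```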
Auto-Tuning and Search-Based Optimization
While the optimizations described above are powerful, their effectiveness often depends on choosing the right parameters—for example, what tile size to use for a loop, which specific group of operators to fuse, or what loop unrolling factor to apply. The optimal choice for these parameters is highly dependent on the specific operator, its input tensor shapes, and the intricate details of the target hardware architecture. Manually defined heuristics struggle to cover this vast and complex search space.
Auto-tuning automates this process by transforming it into a search problem.39 Instead of relying on a fixed set of rules, an auto-tuning system explores a vast space of possible optimization configurations, empirically measuring their performance on the actual target hardware to find the best one.21
The process generally follows these steps:
- Define a Search Space: The compiler defines a parameterized space of possible implementations (or “schedules”) for a given computational kernel. This space includes choices for loop tiling, reordering, vectorization, parallelization, and other low-level transformations.
- Propose Candidates: A search algorithm (e.g., simulated annealing, genetic algorithms, or random search) proposes candidate configurations from this space.
- Generate and Benchmark: For each candidate, the compiler generates the corresponding code and benchmarks its execution time on the target hardware.
- Update and Iterate: The performance feedback from the benchmark is used to guide the search algorithm toward more promising regions of the search space.
Because exhaustively searching this space is often intractable, advanced auto-tuners like those in Apache TVM employ machine learning-based cost models.6 These models are trained on-the-fly using the benchmark data collected during the search. They learn to predict the performance of a given configuration without needing to run a full benchmark, allowing the system to prune unpromising paths and explore the search space much more efficiently.22 This learning-based, empirical approach is a cornerstone of modern ML compilers, enabling them to generate highly specialized, high-performance code for a wide array of hardware, even for novel architectures where no hand-tuned libraries exist. The primary drawback of this powerful technique is that the search process can be extremely time-consuming, sometimes taking hours to optimize a single model.12
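The sketch below is a deliberately tiny, pure-Python version of this search loop: the search space is a handful of tile sizes for a blocked matrix multiplication, and every candidate is benchmarked empirically on the host machine. Real auto-tuners such as AutoTVM or MetaSchedule operate on compiler schedules, explore vastly larger spaces, and use a learned cost model to avoid benchmarking every candidate.

```python
import time
import numpy as np

def blocked_matmul(a, b, tile):
    # Candidate implementation: loop tiling with a configurable tile size.
    n = a.shape[0]
    out = np.zeros((n, n), dtype=a.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                out[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, k0:k0 + tile] @ b[k0:k0 + tile, j0:j0 + tile]
                )
    return out

def benchmark(tile, a, b, repeats=3):
    # Empirical measurement on the actual machine, as an auto-tuner would do.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        blocked_matmul(a, b, tile)
        best = min(best, time.perf_counter() - start)
    return best

n = 256
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

search_space = [16, 32, 64, 128, 256]                 # the "schedule" knob
results = {tile: benchmark(tile, a, b) for tile in search_space}
best_tile = min(results, key=results.get)
print(f"best tile size on this machine: {best_tile}")
```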
Apache TVM: A Multi-Level Compiler Stack with Low-Level Control
Apache TVM (Tensor Virtual Machine) is an open-source machine learning compiler framework designed to close the gap between productivity-focused deep learning frameworks and the diverse, performance-focused landscape of hardware backends.42 It has emerged as a powerful solution for achieving high performance and portability, particularly for deploying models on a wide range of devices, from server-class GPUs to resource-constrained embedded systems.
Architectural Philosophy: Separating “What” from “How”
The foundational design principle of TVM is the separation of concerns, a concept heavily inspired by the Halide language for image processing.8 TVM’s architecture rigorously separates the algorithmic description of a computation (the “what”) from its schedule, which dictates the low-level implementation strategy (the “how”). For example, the algorithm for a matrix multiplication is mathematically fixed, but its implementation can vary dramatically—it could be tiled for cache efficiency, parallelized across multiple cores, or vectorized using SIMD instructions.
TVM extends this paradigm by introducing a third layer of separation: the hardware interface. This allows the schedule to be further decoupled from the specific hardware primitives or intrinsics of a target device.43 This three-way separation—algorithm, schedule, and hardware target—is the key to TVM’s flexibility and extensibility. It enables the compiler to explore a vast space of possible implementations for a given deep learning operator and map the optimal one to arbitrary hardware, including novel and specialized accelerators like FPGAs and custom ASICs for which traditional, hand-tuned libraries do not exist.41 TVM also embraces a “Python-first” development philosophy that empowers developers to easily customize the compilation pipeline, define new optimizations, and add support for new hardware backends, making it a highly adaptable and research-friendly framework.44
The TVM Compilation Workflow: A Layered Approach
The end-to-end compilation process in TVM is a structured, multi-stage pipeline that progressively lowers a high-level model representation into optimized, hardware-specific machine code.45
- Model Creation and Import: The process begins with obtaining a model representation. This can be done either by constructing a model directly using TVM’s nn.Module API, which is syntactically similar to PyTorch, or by importing a pre-trained model from an external framework such as PyTorch, TensorFlow, or ONNX. This initial step produces an IRModule, which serves as the central data structure and container for the model throughout the entire compilation process. The IRModule holds all the necessary information, including both high-level and low-level function representations.45
- High-Level Transformation (in Relax): The IRModule initially represents the model using high-level relax.Functions. At this stage, the compiler applies a series of target-independent, graph-level optimizations. These are analogous to traditional compiler passes and include techniques like constant folding, dead code elimination, and, most importantly, operator fusion, which combines multiple operators into a single, more efficient kernel.21
- Lowering and Tensor Program Optimization (in TIR): Following high-level optimization, the abstract operators within the Relax functions are “lowered” into low-level implementations defined as tir.PrimFuncs (TensorIR Primitive Functions). This is the most critical stage for hardware-specific performance tuning. Here, the compiler applies a “schedule” to each tir.PrimFunc, which explicitly defines the low-level execution strategy. This includes specifying loop structures, memory access patterns, data layouts, and mapping computations to parallel execution units like GPU thread blocks and threads.45
- Target Translation (Codegen): Once the TIR functions are fully optimized and scheduled, the final code generation phase translates them into an executable format for the specified hardware target. This could be LLVM Intermediate Representation (IR) for compilation to x86 or ARM CPUs, CUDA C source code for NVIDIA GPUs, or standard C code for deployment on microcontrollers.45
- Runtime Execution: The final compiled artifact is packaged into a runtime.Module. This is a self-contained, deployable library with a minimal runtime API that allows the compiled functions to be loaded and executed in a variety of programming languages, including Python, C++, Java, and JavaScript. This universal runtime system ensures that the optimized model can be easily integrated into production applications across different platforms.45
The Dual-Layer Intermediate Representation: Relax and TIR
TVM’s ability to perform both high-level and low-level optimizations in a coordinated manner stems from its unique multi-level Intermediate Representation (IR) system. This dual-layer architecture, consisting of Relax and TensorIR (TIR), is a direct architectural solution to the inherent tension between maintaining a global, graph-level view for general optimizations and having fine-grained, low-level control for hardware-specific tuning.45
A single-level IR presents a trade-off. A pure graph representation, for instance, is excellent for reasoning about operator fusion but completely abstracts away the implementation details of those operators, making it difficult to optimize for a specific hardware’s memory hierarchy or parallel execution model.49 Conversely, a purely low-level IR exposes all implementation details, enabling precise hardware tuning, but loses the global model structure, which complicates graph-level transformations. TVM’s dual-IR system explicitly resolves this dichotomy by providing distinct representations for each level of abstraction, with a well-defined process for lowering between them. This structure enables “joint high- and low-level optimizations,” a key differentiator of the TVM stack.41
Relax (High-Level Graph IR)
Relax, the successor to TVM’s original high-level IR, Relay, serves as the primary representation for end-to-end models.45 It is a functional, high-level IR that captures the model’s overall structure as a computational graph. Unlike simpler static dataflow graphs, Relax is highly expressive, with native support for complex control flow (e.g., conditionals and loops), recursion, and advanced data structures.48 This expressiveness is crucial for representing modern, dynamic neural network architectures.
The primary roles of Relax in the compilation pipeline are to serve as the ingestion point for models imported from external frameworks and to be the substrate upon which graph-level optimizations are performed. Passes operating on Relax functions can analyze the entire model to identify opportunities for operator fusion, perform data layout transformations, and apply other high-level rewrites before any hardware-specific details are considered.21
TensorIR (TIR) (Low-Level Tensor Program IR)
TensorIR (TIR) is TVM’s low-level IR, designed to represent the concrete implementation of an individual operator or a fused group of operators.45 A TIR function, or tir.PrimFunc, describes a computation not as a graph of abstract operations, but as a program with explicit, nested loop structures, multi-dimensional memory buffer accesses, and constructs for managing parallelism (e.g., thread bindings) and vectorization.48
TIR is the level at which the hardware-specific “schedule” is applied. Using TIR’s schedule primitives, a developer or an automated system can programmatically transform the implementation of an operator. For example, one can reorder loops to improve data locality, “tile” loops to fit data into caches, bind outer loops to GPU thread blocks and inner loops to threads within a block, and unroll loops to enable instruction-level parallelism. This explicit, low-level control is what allows TVM to generate highly optimized code that is precisely tailored to the architectural nuances of a specific hardware target.42 The lowering process from Relax to TIR is the critical bridge between the two levels, and cross-level transformations can even use information from TIR function patterns to make more intelligent fusion decisions at the Relax level.45
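A rough sketch of this workflow using TVM's tensor-expression and TensorIR scheduling APIs is shown below: a matrix multiplication is declared as an algorithm, lowered to a tir.PrimFunc, and then tiled, reordered, and parallelized with schedule primitives. The exact API surface shifts between TVM releases, so treat the names here as an illustrative outline rather than version-exact code.

```python
import tvm
from tvm import te

# The algorithm (the "what"): a 1024x1024 matmul in tensor-expression form.
M = N = K = 1024
A = te.placeholder((M, K), name="A", dtype="float32")
B = te.placeholder((K, N), name="B", dtype="float32")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

# Lower to a TIR PrimFunc and wrap it in an IRModule.
mod = tvm.IRModule({"main": te.create_prim_func([A, B, C])})

# The schedule (the "how"): tile the spatial loops and parallelize the outer one.
sch = tvm.tir.Schedule(mod)
block = sch.get_block("C")
i, j, kk = sch.get_loops(block)
i0, i1 = sch.split(i, factors=[None, 32])
j0, j1 = sch.split(j, factors=[None, 32])
sch.reorder(i0, j0, kk, i1, j1)
sch.parallel(i0)

print(sch.mod.script())                  # inspect the transformed TIR program
lib = tvm.build(sch.mod, target="llvm")  # generate code for a CPU target
```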
Automated Optimization: AutoTVM and MetaSchedule
Perhaps the most defining feature of Apache TVM is its sophisticated infrastructure for automating the otherwise laborious process of performance tuning. The traditional method for supporting a new hardware target involves an expert manually writing and optimizing a library of computational kernels (e.g., convolutions, matrix multiplications).5 This requires deep, specialized knowledge of the hardware and is a massive engineering undertaking. TVM’s philosophy is to automate this task through search-based optimization.21
This automation is realized through TVM’s auto-tuning frameworks, primarily AutoTVM and its more recent successor, MetaSchedule.21 These systems work by first defining a vast, parameterized search space of possible schedule configurations for a given TIR program. This space can be enormous, encompassing billions of potential combinations of loop tiling sizes, unrolling factors, fusion choices, and other transformation parameters.22
Exhaustively generating and benchmarking every single configuration in this space would be computationally infeasible. To navigate this complexity efficiently, TVM employs a machine learning-based cost model.6 This model is trained on-the-fly during the tuning process. It observes the performance of a small number of configurations that are benchmarked on the actual target hardware. Using this data, it learns a function that predicts the likely performance of other, untested configurations. A search algorithm, such as simulated annealing or a genetic algorithm, then uses the predictions from this cost model to intelligently guide the exploration of the search space, prioritizing configurations that are likely to yield high performance and pruning unpromising regions.6
This empirical, learning-driven approach has profound consequences. It democratizes hardware support by dramatically lowering the barrier to entry for enabling new and exotic hardware like FPGAs or custom ASICs.43 Instead of requiring a full team of kernel-writing experts, a hardware vendor can instead focus on describing the hardware’s capabilities to TVM (e.g., its memory hierarchy and available primitives) and then leverage the auto-tuner to automatically generate a library of high-performance kernels. This methodology allows TVM to produce code that is often competitive with, and in some cases even surpasses, the performance of industry-standard, hand-tuned libraries like NVIDIA’s cuDNN, especially for less common operator configurations or on hardware where such libraries are not available.41 The engineering burden is thus transformed: the challenge shifts from low-level, manual kernel programming to the higher-level task of defining effective search spaces and ensuring the ML cost model is accurate for the target architecture.
OpenXLA: A High-Level, Framework-Integrated Compiler
XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra, designed to optimize machine learning computations with a focus on improving execution speed, reducing memory usage, and enhancing portability across mainstream hardware accelerators.62 Originally developed within Google to accelerate TensorFlow workloads on its custom TPU hardware, XLA has since evolved into the cornerstone of the OpenXLA project—an industry-wide collaboration involving major players like Google, NVIDIA, Intel, and AMD—to create an open and interoperable compiler ecosystem for machine learning.12
Architectural Philosophy: Performance and Portability through High-Level Abstraction
The core philosophy of XLA is to provide performance and portability through a high-level, graph-based abstraction. Unlike TVM, which exposes fine-grained control over low-level implementation details, XLA operates at a higher level of abstraction, focusing on optimizing the computational graph as a whole.49 Its primary goal is to take a subgraph of operations from a frontend framework and compile it into a small number of highly optimized, fused kernels.62 This approach aims to eliminate the overhead of the framework’s runtime, reduce memory bandwidth by keeping intermediate values in registers, and enable aggressive, model-specific optimizations.62
A key tenet of XLA’s design is portability. By defining a set of high-level, hardware-agnostic operations, XLA aims to allow a large fraction of ML models to run on new hardware backends with minimal to no modification.62 This contrasts with the laborious process of rewriting models to use new, hardware-specific monolithic operators. XLA’s architecture is therefore strategically designed for production environments, prioritizing stability, seamless integration with major frameworks, and high performance on large-scale, mainstream hardware like GPUs and TPUs.12
The XLA Compilation Workflow: From Framework to Machine Code
The XLA compilation process is a well-defined, multi-stage pipeline that transforms a high-level graph from a frontend framework into optimized native machine code.62
- Graph Ingestion (StableHLO): The compilation pipeline begins when a frontend framework, such as JAX, TensorFlow, or PyTorch, provides a model or function to be compiled. This computation is represented in StableHLO, an MLIR-based dialect that serves as a versioned and stable “portability layer” between the rapidly evolving frameworks and the compiler itself. This stable interface is crucial for maintaining a decoupled and interoperable ecosystem.62
- Target-Independent Optimizations: The StableHLO graph is immediately lowered into XLA’s internal High-Level Operations (HLO) intermediate representation. At this stage, XLA applies a series of powerful, hardware-agnostic optimization passes to the HLO graph. These include standard compiler optimizations like Common Subexpression Elimination (CSE) and algebraic simplification, as well as ML-specific transformations like target-independent operator fusion and buffer analysis to plan runtime memory allocation.62
- Target-Specific HLO Optimizations: The optimized HLO graph is then handed off to a specific hardware backend (e.g., the GPU backend or CPU backend). This backend performs a second round of optimizations that are tailored to the target hardware’s architecture. For example, the GPU backend may apply fusion patterns that are beneficial for the GPU programming model, decide how to partition the computation into parallel CUDA streams, or pattern-match certain HLO subgraphs to highly optimized, hand-written library calls like those from cuDNN or cuBLAS.62
- Code Generation (LLVM): In the final stage, the backend performs target-specific code generation. For the CPU and GPU backends, XLA heavily leverages the LLVM compiler infrastructure. The backend emits Low-Level Virtual Machine (LLVM) IR, which is a low-level, hardware-agnostic assembly language. LLVM then takes over, performing its own suite of sophisticated low-level optimizations (such as instruction scheduling and register allocation) before generating the final native machine code for the target architecture (e.g., x86 assembly for CPUs or PTX for NVIDIA GPUs).62
This reliance on a high-level graph IR (HLO) and a mature code generation framework (LLVM) is a direct cause of both XLA’s primary strength and its main weakness. It allows XLA to achieve excellent performance on mainstream hardware like CPUs and GPUs by focusing on ML-specific graph optimizations while delegating the complex task of final machine code generation to the highly optimized LLVM.66 However, this same architecture makes it challenging to support truly novel hardware, such as optical or neuromorphic processors, for which an LLVM backend does not exist. Adding support for such a device would require building a complete HLO-to-machine-code backend from scratch—a monumental engineering effort.70
The High-Level Operations (HLO) Intermediate Representation
XLA’s architecture is centered on a single primary level of abstraction, the High-Level Operations (HLO) IR, which exists in two distinct but related forms: the external, stable interface (StableHLO) and the internal, mutable compiler IR (HLO).
StableHLO: The Portability Layer
StableHLO is an MLIR dialect that serves as the official, public-facing ingestion interface for the XLA compiler.44 Its creation addresses a critical challenge in the fragmented ML ecosystem: the need for a stable contract between ML frameworks (the “producers” of computation graphs) and ML compilers (the “consumers”). Frameworks and compilers evolve at different, often rapid, paces. Without a stable interface, any internal change in the compiler could break compatibility with all upstream frameworks, leading to a maintenance nightmare.
StableHLO solves this by providing versioning and strong compatibility guarantees (e.g., backward and forward compatibility), ensuring that a model exported to a specific version of StableHLO will be consumable by a compiler that supports that version.67 This strategic decision prioritizes ecosystem stability and interoperability, allowing the XLA compiler team to freely evolve the internal HLO representation and backend optimizations without disrupting the frameworks that depend on it. It is a production-oriented design that fosters a decoupled, multi-vendor ecosystem.
HLO: The Internal Compiler IR
Once a StableHLO program is ingested, it is converted into XLA’s internal HLO representation, which is the substrate for all subsequent compiler optimizations.48 HLO represents the entire computation as a Directed Acyclic Graph (DAG) where the nodes are high-level, hardware-agnostic tensor operations.9 The HLO instruction set is intentionally small and stable, comprising fewer than 100 operations such as convolution, dot, reduce, and various element-wise operations.72
All optimizations within XLA are implemented as graph-to-graph transformations on this HLO representation. This graph-centric approach is particularly well-suited for powerful, global optimizations like operator fusion, where the compiler can analyze dependencies across a large subgraph and combine many operations into one. However, this high-level view comes at the cost of control. HLO abstracts away the low-level implementation details of each operation, offering less fine-grained control over aspects like cache tiling or thread mapping compared to a lower-level IR like TVM’s TIR.49 These decisions are instead delegated to the hardware-specific backend during the final code generation stage.
Seamless Integration and Just-In-Time (JIT) Compilation
A key factor in XLA’s widespread adoption is its tight and user-friendly integration with major ML frameworks, which enables Just-In-Time (JIT) compilation with minimal code changes from the developer.73 JIT compilation defers the compilation of a function or model until it is first executed at runtime.
- In JAX: JIT compilation is a central and idiomatic feature of the JAX library, typically invoked with the @jax.jit decorator. When a JIT-compiled function is first called, JAX “traces” its execution with abstract placeholder values (tracers) to capture the sequence of operations as a JAX-native intermediate representation called jaxpr. This jaxpr is then converted to StableHLO and sent to the XLA compiler. XLA compiles this graph into highly optimized machine code, which is then cached. All subsequent calls to the function with inputs of the same shape and type will bypass the Python interpreter entirely and directly execute the fast, cached binary, leading to dramatic performance improvements.74 A short sketch of this flow appears at the end of this subsection.
- In TensorFlow: XLA can be enabled in two primary ways. For fine-grained control, developers can use the @tf.function(jit_compile=True) decorator. This provides “must-compile” semantics, meaning the entire decorated function will be compiled by XLA; if any part is incompatible, an error is raised. The second method is “auto-clustering,” which can be enabled via an environment variable. In this mode, the TensorFlow runtime attempts to automatically identify subgraphs within the model that are compatible with XLA, compile them, and replace them with a single “cluster” operator, while the rest of the model executes in the standard TensorFlow runtime.62
- In PyTorch: The integration of XLA with PyTorch is primarily to enable PyTorch models to run on XLA-compatible hardware, most notably Google TPUs. With the advent of PyTorch 2.0, this integration is streamlined through the torch.compile API. By specifying backend="openxla", developers can direct PyTorch’s new compiler stack to use XLA for code generation. In this flow, the TorchDynamo component captures a graph of the PyTorch model, which is then lowered and passed to the OpenXLA backend for compilation and execution.79
This seamless JIT integration is a hallmark of XLA’s design, making powerful compiler optimizations accessible to a broad audience of ML practitioners without requiring them to become compiler experts.
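To illustrate the JAX path described in the first bullet above, the sketch below JIT-compiles a small function and then inspects the two intermediate forms mentioned in this section, the traced jaxpr and the lowered StableHLO text. It assumes a reasonably recent JAX installation; the function and shapes are arbitrary.

```python
import jax
import jax.numpy as jnp

def dense_gelu(x, w, b):
    # An ordinary JAX function: a dense layer followed by GELU.
    return jax.nn.gelu(x @ w + b)

# jax.jit defers compilation: the first call traces the function to a jaxpr,
# lowers it to StableHLO, hands it to XLA, and caches the compiled executable.
fast_dense_gelu = jax.jit(dense_gelu)

kx, kw = jax.random.split(jax.random.PRNGKey(0))
x = jax.random.normal(kx, (8, 128))
w = jax.random.normal(kw, (128, 256))
b = jnp.zeros((256,))

y = fast_dense_gelu(x, w, b)   # first call: trace + compile + run
y = fast_dense_gelu(x, w, b)   # later calls: cached executable only

# The intermediate representations can be inspected directly:
print(jax.make_jaxpr(dense_gelu)(x, w, b))                 # traced jaxpr
print(jax.jit(dense_gelu).lower(x, w, b).as_text()[:400])  # StableHLO text
```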
Comparative Analysis: TVM vs. XLA
Apache TVM and OpenXLA represent two of the most influential and technologically sophisticated approaches to machine learning compilation. While both aim to solve the same fundamental problem—bridging the gap between high-level ML models and diverse hardware targets—they do so with distinct philosophies, architectures, and trade-offs. Their differences reflect a classic tension in compiler design: the balance between fine-grained control and high-level abstraction, and between research-oriented flexibility and production-focused stability. They are not merely competing tools but represent two divergent and successful evolutionary paths in the ML systems landscape.
Core Philosophical and Architectural Differences
The most fundamental distinction between TVM and XLA lies in their level of abstraction. XLA operates as a high-level, graph-centric compiler. It abstracts away the intricate details of how an operation is implemented, focusing instead on optimizing the dataflow between a well-defined set of high-level operations (HLO).49 This approach simplifies the compiler’s frontend and allows it to perform powerful global graph optimizations.
In stark contrast, TVM is a multi-level compiler stack that explicitly exposes both high-level and low-level representations. Its architecture provides a high-level graph IR (Relax) for model-wide optimizations, but crucially, it also offers a low-level tensor program IR (TIR) that gives developers or automated systems explicit control over the implementation details of each operator, such as loop structures and memory access patterns.45 This architectural choice reflects a “lowering” philosophy, where a program is progressively transformed through layers of abstraction, with more hardware-specific detail introduced at each step. XLA’s approach is more akin to “translation,” where a high-level specification (HLO) is translated into another representation (like LLVM IR) for a backend to handle.
These architectural differences are rooted in their origins and primary objectives. XLA was developed within Google to serve its large-scale production needs, particularly for its own TPU hardware. This genesis fostered a focus on stability, seamless integration with major frameworks like TensorFlow and JAX, and scalability for massive datacenter workloads.12 TVM, conversely, originated as a university research project. Its goal was to tackle the problem of performance portability across a highly fragmented and diverse hardware landscape, including edge devices and novel academic accelerators for which no mature software stack existed.12 This led to a design that prioritizes flexibility, extensibility, and the ability to generate high-performance code from scratch without relying on pre-existing vendor libraries.
Intermediate Representations: A Tale of Two Stacks
The differing philosophies of TVM and XLA are most clearly manifested in their choice of Intermediate Representations (IRs).
- TVM’s Dual IR (Relax + TIR): TVM’s two-level IR system enables a powerful separation of concerns. Graph-level optimizations, such as operator fusion and data layout transformation, are performed on the high-level Relax IR, which maintains a global view of the entire model. Subsequently, the model is lowered to TIR, where hardware-specific, low-level optimizations are applied. This includes intricate loop transformations, memory scoping, and mapping computations to parallel hardware threads. This dual-stack approach allows for co-optimization, where insights from the low-level TIR representation can inform optimization decisions at the high-level Relax stage, enabling a more holistic optimization strategy.45
- XLA’s Single-Level IR (HLO): XLA’s architecture is centered on a single primary level of abstraction: the HLO graph. All compiler optimizations are expressed as graph-to-graph transformations on this representation. This design simplifies the compiler and makes it easier to reason about global dataflow optimizations. However, it provides no mechanism within the core compiler to control fine-grained implementation details. The responsibility for generating efficient machine code from an HLO operator is delegated entirely to the hardware-specific backend, which might use a general-purpose compiler like LLVM or rely on proprietary, vendor-specific code generation techniques.62 This makes XLA less adaptable to hardware that lacks a mature backend compiler infrastructure.
Hardware Support and Extensibility
The two compilers offer different models for extensibility and support for new hardware targets.
- TVM: Designed from the ground up for hardware extensibility. To add support for a new accelerator, a developer needs to describe its specific instructions or primitives to TIR and define a search space of valid schedules. TVM’s auto-tuner can then be used to automatically discover high-performance kernels for that target.43 This significantly lowers the barrier to entry for novel hardware and is a key reason for TVM’s adoption in the hardware research community. The downside of this flexibility is a tendency toward ecosystem fragmentation, as hardware vendors often create and maintain their own forks of TVM with custom modifications.6
- XLA: Extensibility is managed through a more formal, pluggable backend architecture built around the PJRT plugin interface.81 This allows third parties to develop backends for their hardware that can be integrated with frameworks like JAX and TensorFlow. However, writing a complete HLO-to-machine-code backend is a substantial engineering effort, far more involved than defining schedules in TVM. Historically, official support within the main OpenXLA repository has been focused on CPUs, NVIDIA/AMD GPUs, and Google TPUs, with support for other hardware often residing in downstream, vendor-maintained forks.12
Performance, Usability, and Ideal Use Cases
Choosing between TVM and XLA involves a trade-off between performance characteristics, developer experience, and the specific deployment target.
- Performance: Direct performance comparisons are highly dependent on the specific model, hardware, and level of tuning. TVM’s strength lies in its auto-tuner’s ability to generate highly specialized kernels that can outperform even hand-tuned vendor libraries, particularly for unconventional operator shapes or on hardware where such libraries are immature or unavailable.61 XLA, on the other hand, is exceptionally well-optimized for large-scale training and inference on mainstream datacenter hardware. Its capabilities for automatic model parallelism (SPMD) are particularly strong, making it a go-to choice for training massive language models across thousands of accelerator chips.12 Benchmarks generally show both compilers to be highly competitive, with the performance leader varying by workload.82
- Usability: XLA offers a remarkably seamless developer experience, especially within the JAX ecosystem, where JIT compilation is often as simple as adding a @jit decorator.74 This “just works” experience allows for rapid prototyping and iteration. TVM’s compilation process is more explicit, requiring the user to manage distinct steps of model import, optimization (including an optional, lengthy tuning phase), and building.46 While this provides more granular control, it also presents a steeper learning curve and can significantly increase compilation times, sometimes to several hours, which can hinder developer productivity.12
- Ideal Use Cases:
- TVM is the preferred choice for deploying models across a diverse and heterogeneous range of hardware targets. It excels in scenarios involving edge devices (e.g., mobile phones, IoT devices), embedded systems (via its microTVM extension for microcontrollers 47), and novel or custom accelerators (FPGAs, ASICs) where its ability to generate optimized code from scratch is a key advantage.84
- XLA is the dominant compiler for large-scale training and inference on mainstream datacenter hardware. Its robust support for GPUs and TPUs, tight integration with TensorFlow and JAX, and advanced features for distributed training make it the standard for both cutting-edge research and production deployment of large models at scale.12
The following table provides a consolidated summary of the architectural and philosophical distinctions between Apache TVM and OpenXLA.
| Feature Dimension | Apache TVM | OpenXLA |
| --- | --- | --- |
| Core Philosophy | Performance portability via separation of concerns (algorithm, schedule, hardware). | High-level abstraction for seamless framework integration and large-scale optimization. |
| Primary Abstraction | Multi-level: High-level graph (Relax) and low-level tensor programs (TIR). | High-level: Graph of hardware-agnostic operations (HLO). |
| Intermediate Reps. | Relax: Functional graph IR for end-to-end models. TIR: Low-level loop-nest IR for operator implementation. | StableHLO: Versioned, portable MLIR dialect for framework interoperability. HLO: Internal graph IR for optimization. |
| Optimization Engine | Primarily search-based auto-tuning with ML cost models (MetaSchedule). | Primarily rule-based graph passes, with backend-specific optimizations and some auto-tuning capabilities. |
| Extensibility Model | Highly flexible; add new hardware by defining TIR schedules and primitives. | Pluggable backend architecture (PJRT); requires implementing a full HLO backend. |
| Hardware Sweet Spot | Diverse and heterogeneous hardware: edge, mobile, FPGAs, novel ASICs. | Mainstream datacenter hardware: GPUs, TPUs, and CPUs at scale. |
| Developer Experience | More explicit and controllable compilation pipeline. Can have long auto-tuning times. | “Just works” via JIT decorators (@jit). Less control but faster iteration. |
| Ecosystem & Governance | Apache Software Foundation project with a broad open-source community. | Industry consortium (Google, NVIDIA, Intel, etc.) focused on production stability. |
The Broader Compiler Ecosystem and Practical Challenges
While TVM and XLA are foundational pillars of machine learning compilation, they do not operate in a vacuum. The production deployment landscape is a heterogeneous ecosystem of specialized tools, each with its own strengths and target use cases. An effective MLOps strategy often involves orchestrating a pipeline of these tools rather than relying on a single, monolithic solution. This reality underscores that the “best” compiler is context-dependent, and practitioners must navigate a complex set of trade-offs involving performance, portability, and ease of use. Furthermore, the very act of compilation, which abstracts away hardware complexity, introduces its own set of practical challenges related to unsupported features and the difficulty of debugging optimized, “black box” code.
A Heterogeneous Toolkit: TensorRT, ONNX Runtime, and TorchInductor
The broader ecosystem includes several key players that complement or offer alternatives to TVM and XLA.
- NVIDIA TensorRT: TensorRT is a high-performance deep learning inference optimizer and runtime library developed by NVIDIA, exclusively for its own GPUs.85 Its primary function is to take a trained model and apply a suite of aggressive, GPU-specific optimizations. These include precision calibration (for FP16, BF16, and INT8), extensive layer and tensor fusion based on a vast library of hand-tuned rules, and kernel auto-tuning to select the fastest implementation for the specific target GPU architecture.86 TensorRT is not a general-purpose compiler in the same vein as TVM or XLA but rather a final-stage optimization tool. It often serves as a high-performance backend that can be targeted by other systems; for example, a model can be exported from a framework to the ONNX format and then ingested by TensorRT for final optimization and deployment.88
- ONNX Runtime: The ONNX Runtime is a cross-platform, high-performance inference engine designed to execute models saved in the Open Neural Network Exchange (ONNX) format.90 The core value proposition of ONNX and its runtime is interoperability. A developer can train a model in PyTorch, export it to the standardized ONNX format, and then deploy it in a completely different environment, such as a C++ or Java application, using the ONNX Runtime without any dependency on PyTorch.91 The ONNX Runtime itself features a modular architecture based on “Execution Providers” (EPs). These EPs act as backends that delegate computation to hardware-specific libraries. For instance, when running on an NVIDIA GPU, the ONNX Runtime can use the TensorRT EP to pass subgraphs to TensorRT for maximum performance. On an Intel CPU, it can use the OpenVINO EP. This pluggable architecture allows the ONNX Runtime to act as a unified inference frontend that can leverage the best available acceleration technology on any given platform.92
- PyTorch 2.0’s TorchInductor: With the release of PyTorch 2.0, the framework introduced its own native, deeply integrated compiler stack, accessible via the simple torch.compile() API.94 The default backend for this stack is TorchInductor. This modern compiler leverages several components: TorchDynamo safely captures graphs from Python bytecode, AOTAutograd traces the backward pass, and TorchInductor itself lowers the PyTorch operations into a low-level, loop-based IR.95 For code generation, TorchInductor primarily targets OpenAI’s Triton language to generate high-performance GPU kernels, and it generates C++/OpenMP for CPUs.94 TorchInductor represents the culmination of lessons learned from earlier compilers, aiming to provide significant performance gains with minimal code changes while preserving PyTorch’s dynamic and Pythonic feel.
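As a minimal usage sketch (assuming PyTorch 2.x), torch.compile wraps an ordinary module and routes it through TorchDynamo and TorchInductor on the first call; the model and shapes are arbitrary.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# torch.compile captures the model with TorchDynamo and, by default, lowers it
# through TorchInductor (Triton kernels on GPU, C++/OpenMP on CPU).
compiled_model = torch.compile(model)      # backend="inductor" is the default

x = torch.randn(32, 256)
out = compiled_model(x)                    # first call triggers compilation

# The same entry point can route to other backends, e.g. the OpenXLA backend
# mentioned earlier, when the corresponding plugin (torch_xla) is installed:
# compiled_xla = torch.compile(model, backend="openxla")
```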
The existence of this diverse toolkit suggests that a sophisticated deployment pipeline might involve multiple stages. For example, a common workflow is to train in PyTorch, export to ONNX for a framework-agnostic artifact, and then use TensorRT to generate the final, highly optimized engine for deployment on NVIDIA GPUs.89 This multi-tool approach allows practitioners to leverage the best features of each component for different stages of the lifecycle.
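A condensed sketch of such a pipeline is shown below: a PyTorch module is exported to ONNX and executed with ONNX Runtime, listing execution providers in order of preference so that TensorRT or CUDA is used when available and the CPU otherwise. The tiny model, file name, and tensor names are placeholders; ONNX Runtime generally falls back to the remaining providers if a requested one is not available.

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

model = nn.Sequential(nn.Linear(16, 8), nn.ReLU()).eval()
dummy = torch.randn(1, 16)

# Step 1: export a framework-agnostic ONNX artifact.
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Step 2: run it with ONNX Runtime, preferring hardware-specific
# execution providers and falling back to the CPU otherwise.
session = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider",  # delegates subgraphs to TensorRT
               "CUDAExecutionProvider",
               "CPUExecutionProvider"],
)
outputs = session.run(None, {"input": np.random.rand(1, 16).astype(np.float32)})
print(outputs[0].shape)
```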
Common Implementation Hurdles and Mitigation Strategies
Despite their power, ML compilers introduce an abstraction layer that can be both “leaky” (failing to support all features) and opaque (making it difficult to debug). These two issues—unsupported operators and the “black box” nature of compiled code—are the most common hurdles developers face.
The “Unsupported Operator” Problem
A frequent and frustrating challenge arises when a model employs an operator that is not natively supported by the target compiler or hardware backend.9 The rapid pace of ML research means new, custom layers and operations are constantly being invented, and compilers inevitably lag behind.9 When an unsupported operator is encountered, the compilation process may fail outright, or, more insidiously, the compiler may decide to “fall back” to a different execution engine.98 This typically involves partitioning the graph, running the unsupported operator on the CPU using the original framework’s slow, eager-mode runtime, and then transferring the data back to the accelerator to continue the compiled execution. This context switching and data movement can completely negate any performance gains from compilation.
Several strategies exist to mitigate this problem:
- Framework Fallback: While potentially slow, this is often the default and easiest solution. The compiler handles the partitioning automatically, ensuring correctness at the cost of performance.
- Operator Decomposition: Many unsupported operators can be expressed as a combination of simpler, supported primitive operations. Developers can manually define this decomposition, creating a “composite operator” that the compiler can then understand and optimize.99 A small sketch of such a decomposition follows this list.
- Custom Operator Plugins: For performance-critical operators, the most effective solution is to write a custom, low-level implementation (e.g., in CUDA) and register it as a plugin with the compiler. Tools like TensorRT and TVM provide well-defined interfaces for adding these custom operators, allowing the compiler to integrate them seamlessly into the optimized graph.88
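As a toy illustration, not tied to any particular compiler's plugin API, the sketch below expresses a GELU activation (a common "unsupported operator" on older backends) using only primitives that virtually every backend provides: multiply, add, power, and tanh. Real composite-operator registration mechanisms differ from compiler to compiler.

```python
import numpy as np

def gelu_decomposed(x):
    # GELU (tanh approximation) rewritten with universally supported
    # primitives: mul, add, pow, tanh.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3.0, 3.0, 7, dtype=np.float32)
print(gelu_decomposed(x))
```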
The “Black Box” Challenge: Debugging Compiled Models
The very process of optimization—fusing operators, changing data layouts, altering numerical precision—transforms the model into a state that is far removed from the original, human-readable source code. This creates a significant debugging challenge.101 When a compiled model produces incorrect results, exhibits numerical instability (e.g., outputs NaNs), or performs worse than expected, tracing the root cause can be exceptionally difficult.103 Traditional debugging tools like step-through debuggers are often ineffective because the execution flow no longer maps directly to the source Python code.
Debugging compiled models requires a different set of techniques:
- Incremental Compilation and Bisection: A pragmatic first step is to start with a known-good, simple model and incrementally add layers or complexity from the problematic model until the issue reappears. This helps to isolate the specific operator or subgraph causing the failure.105
- Inspection of Intermediate Representations: Most compilers provide flags or APIs to dump their internal IR at various stages of the compilation process (e.g., XLA’s HLO graph, TVM’s Relax and TIR functions).77 By examining these intermediate forms, an expert can verify whether the compiler’s transformations are correct and identify where an incorrect optimization might have been applied.
- Numerical Comparison: A common technique is to run both the original, eager-mode model and the compiled model with the same input and compare their outputs layer by layer. This allows the developer to pinpoint the exact location in the model where the numerical results begin to diverge, narrowing down the source of the error. A minimal sketch of this check follows the list.
- Compiler-Specific Tooling: As compilers mature, they are increasingly equipped with specialized debugging and logging features. For example, PyTorch’s torch.compile can be configured via the TORCH_LOGS environment variable to provide detailed information on “graph breaks” (where the compiler had to fall back to eager mode) and “recompilations” (where dynamic inputs forced the compiler to generate new code), which are common sources of performance issues.104 Specialized tools like Amazon SageMaker Debugger can capture intermediate tensors during execution for offline analysis.102
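A minimal version of the numerical-comparison technique is sketched below for torch.compile; the model is arbitrary and the tolerances are illustrative. The commented TORCH_LOGS invocation shows how graph breaks and recompilations can be surfaced when hunting performance issues.

```python
# Surface graph breaks / recompilations when diagnosing performance:
#   TORCH_LOGS="graph_breaks,recompiles" python run_model.py
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 10)).eval()
compiled = torch.compile(model)

x = torch.randn(4, 64)
with torch.no_grad():
    eager_out = model(x)        # reference: original eager execution
    compiled_out = compiled(x)  # optimized: compiled execution

# Quantify where and by how much the two execution paths diverge.
max_abs_diff = (eager_out - compiled_out).abs().max().item()
print(f"max |eager - compiled| = {max_abs_diff:.3e}")
assert torch.allclose(eager_out, compiled_out, rtol=1e-4, atol=1e-5)
```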
Ultimately, both the unsupported operator problem and the debugging challenge stem from the same root cause: the abstraction gap created by compilation. In exchange for automated performance optimization, we sacrifice transparency and direct control. Therefore, the future usability of ML compilers depends not only on more powerful optimization algorithms but also on the development of better tools for managing this abstraction gap, including improved diagnostics, visualizers, and more ergonomic mechanisms for extending the compiler with custom logic.
Future Directions and Concluding Remarks
The field of machine learning compilation is at a pivotal juncture, driven by the unprecedented demands of next-generation models and the relentless pursuit of deploying powerful AI on an ever-expanding spectrum of hardware. As models grow in scale and complexity, and as hardware becomes more specialized, the role of the compiler is evolving from a mere optimization tool to a critical enabler of innovation. The future trajectory of MLC is being shaped by two dominant forces: the unique challenges posed by Large Language Models (LLMs) and the transformative potential of using machine learning to design the compilers themselves.
The New Frontier: Compiling Large Language Models (LLMs)
The rise of Large Language Models and other Transformer-based architectures has introduced a new class of compilation challenges that push existing systems to their limits. These models are characterized by:
- Massive Scale: With hundreds of billions or even trillions of parameters, LLMs require sophisticated memory management and parallelism strategies that go beyond simple operator fusion.
- Dynamism: The inference process for LLMs, particularly during text generation, is inherently dynamic. The shapes of tensors in the key-value (KV) cache change with every generated token, creating challenges for static compilers that prefer fixed tensor shapes (a minimal sketch of this shape growth follows the list).12
- Complex Operators: The attention mechanism, the core of the Transformer architecture, involves a sequence of operations with complex data dependencies that are difficult for traditional compilers to fuse and optimize effectively.
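The shape dynamism mentioned above can be seen in a toy NumPy model of a KV cache during autoregressive decoding: the sequence-length axis grows by one at every step, so a compiler that specializes code to fixed shapes would have to recompile, pad, or otherwise plan around this growth. Head counts and dimensions here are arbitrary.

```python
import numpy as np

num_heads, head_dim = 8, 64
k_cache = np.zeros((num_heads, 0, head_dim), dtype=np.float16)
v_cache = np.zeros((num_heads, 0, head_dim), dtype=np.float16)

# Autoregressive decoding: the sequence-length dimension grows by one on every
# step, so the tensor shapes seen by the compiler change at each iteration.
for step in range(4):
    new_k = np.random.rand(num_heads, 1, head_dim).astype(np.float16)
    new_v = np.random.rand(num_heads, 1, head_dim).astype(np.float16)
    k_cache = np.concatenate([k_cache, new_k], axis=1)
    v_cache = np.concatenate([v_cache, new_v], axis=1)
    print(step, k_cache.shape)  # (8, 1, 64), (8, 2, 64), (8, 3, 64), (8, 4, 64)
```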
This new frontier has spurred a wave of specialized research and development. Projects like MLC LLM (Machine Learning Compilation for Large Language Models) are focused on creating high-performance, universal deployment solutions that allow any LLM to be compiled and run natively on a wide array of devices, from high-end GPUs to mobile phones and even web browsers via WebAssembly and WebGPU (through the companion WebLLM project).106 This effort to “democratize” LLM deployment relies heavily on advanced compiler acceleration techniques to manage the immense computational and memory requirements.
Furthermore, a fascinating recursive trend is emerging, exemplified by projects like Meta’s LLM Compiler. This research explores using an LLM itself as a component of the compiler. By pre-training a model on a massive corpus of compiler-specific data, such as LLVM-IR and assembly code, the LLM learns the patterns and semantics of code optimization. It can then be fine-tuned to perform complex compiler tasks, such as predicting the optimal sequence of optimization passes to reduce code size or even disassembling machine code back into a high-level IR.109 This points toward a future where “foundation models of compiler optimization” could automate and potentially surpass human-engineered heuristics.
The Rise of ML-Driven Compiler Design and Auto-Tuning
The success of using machine learning for auto-tuning specific kernels within TVM is a precursor to a much broader trend: applying ML to optimize the entire compilation process. The design of a traditional compiler is replete with complex, heuristic-based decisions, such as which optimization passes to run and in what order (the classic “phase-ordering problem”), which is known to be NP-hard.40
Modern research is increasingly focused on replacing these hand-tuned heuristics with learned models. Techniques from reinforcement learning, imitation learning, and Bayesian optimization are being used to automatically discover superior optimization strategies for a wide range of compiler tasks, including instruction selection, register allocation, and loop vectorization.111 This represents a paradigm shift toward a “meta-level” of optimization: we are moving from using compilers to optimize ML models, to using ML to optimize the compilers themselves. This recursive application of AI holds the potential to create compilers that can adapt and learn to generate better code for new hardware architectures automatically, significantly reducing the manual effort required to build and maintain high-performance compiler backends.39
Synthesis and Concluding Remarks
Machine learning compilation has evolved from a niche research area into an indispensable component of the modern AI stack. Compilers like Apache TVM and OpenXLA provide the critical link that enables the innovations of ML research to be deployed efficiently and scalably in the real world. They are the automated solution to the otherwise intractable problem of optimizing a rapidly expanding universe of models for an equally diverse landscape of hardware.
This analysis has revealed that TVM and XLA, while sharing a common goal, represent two distinct and valid philosophical approaches to compiler design. XLA, with its high-level graph abstraction and focus on stable integration, is an exemplar of a production-driven compiler optimized for the mainstream, large-scale computing ecosystem. TVM, with its multi-level IR and learning-based optimization engine, embodies a flexible, research-oriented approach designed for performance portability and extensibility to the furthest reaches of the hardware frontier. They are not simply competitors but rather two successful points in a vast design space, each tailored to a different set of priorities.
Looking forward, the single greatest catalyst for innovation in ML compilation will likely be the push to run large, powerful models on resource-constrained edge devices. The immense gap between the requirements of models like LLMs and the capabilities of edge hardware creates an extreme optimization challenge that will demand radical advances in every technique discussed in this report—from more aggressive quantization and novel fusion strategies to more sophisticated, learned memory management. Projects like MLC LLM are at the vanguard of this movement. As AI becomes more pervasive, the ability to compile and run complex models efficiently, privately, and with low latency on-device will be paramount. The continued development of intelligent, automated, and adaptable compiler technology is therefore not just an academic pursuit but a foundational requirement for the future of artificial intelligence.
