A Systematic Analysis of Parallel Execution in Deterministic Systems: From Optimistic Concurrency to Block-STM and Optimal Scheduling

I. The Challenge of Deterministic Parallelism in High-Throughput Systems

Introduction to Deterministic State Machine Replication (SMR)

At the core of many modern distributed systems, particularly blockchains, lies the principle of deterministic State Machine Replication (SMR). This paradigm allows a distributed network of entities to agree upon and apply a sequence of operations—typically packaged as blocks of transactions—to a shared state. The foundational requirement of SMR is absolute determinism: every entity that executes a given block of transactions must arrive at the exact same final state.1 This guarantee of identical outcomes across a decentralized network is the bedrock of trust and consistency, but it also presents a formidable performance challenge.

The Sequential Bottleneck

The most straightforward method to ensure determinism is sequential execution. By processing transactions one after another, in a predefined order, the system guarantees that every node will compute the same state transitions and reach the same final state. However, this approach is fundamentally unscalable. In an era of multi-core processors, sequential execution leaves vast computational resources idle, creating a severe throughput bottleneck that is incapable of meeting the demands of high-performance applications.2 As transaction volumes grow, a sequentially executed blockchain falls further behind: latency climbs while throughput remains capped by a single core, highlighting the urgent need for parallel execution models that can harness modern hardware without sacrificing deterministic consistency.

 

A Taxonomy of Parallel Execution Models

 

The quest to overcome the sequential bottleneck has given rise to several distinct architectural models for parallel transaction execution in blockchains. These models differ primarily in where and how they determine transaction dependencies and schedule execution.

  • The Ethereum Model (Pre-Execution): In this model, the responsibility for execution is centralized. A single node, the block proposer, executes the transactions, computes the resulting state differences (state-diffs), and packages this information along with the transaction list into a block. This block is then propagated across the network, where other nodes (validators) need only to validate the pre-computed state changes against the transactions.4 While this simplifies the task for validators, it concentrates the entire execution workload on the proposer, creating a centralized performance ceiling defined by the capacity of a single machine.
  • The Solana Model (Static Dependencies): This model takes a static, upfront approach to dependency analysis. To be included in a block, transactions must explicitly declare which accounts (state) they will read from and write to. This a priori knowledge allows the runtime to construct a dependency graph and schedule non-conflicting transactions for parallel execution.4 The primary drawback of this approach is that it shifts a significant burden onto developers, who must accurately predict and declare all state accesses. This can be complex, error-prone, and may force transactions that do not actually conflict to be executed sequentially out of an abundance of caution.2
  • The Sui Model (Object-Centric): The Sui model refines the dependency concept by focusing on data ownership. State is organized into “objects,” and transactions are classified based on whether they access owned objects or shared objects. Transactions that only touch objects owned by a single address can be processed in parallel with minimal coordination. Transactions involving shared objects require more complex consensus and ordering. This model excels at parallelizing transactions with clear, non-overlapping data access patterns but requires a more structured approach to state management.
  • The Aptos Model (Post-Ordering Acceleration): This model represents a paradigm shift by decoupling transaction ordering from execution. The block proposer’s primary role is to establish a definitive, sequential order of transactions within a block. This ordered block is then agreed upon via consensus. Following consensus, each validator node independently executes the entire block in parallel, using a sophisticated engine to achieve a final state that is identical to the one that would have been produced by sequential execution.4 This approach democratizes the execution workload across all validators and is the model that Block-STM is designed to power.

The evolution from the Ethereum model to the Aptos model signifies a fundamental redistribution of computational responsibility in a distributed system. In the former, a single proposer node bears the heavy load of execution, while validators perform the lighter task of verification. This creates an inherent asymmetry and a single point of performance limitation. The Aptos model, in contrast, makes every validator responsible for the full execution workload. This architectural choice is only viable if a highly efficient parallel execution engine is available to every node. Without such an engine, each validator would simply be encumbered by the same sequential bottleneck that the proposer faced, nullifying any systemic benefit. Therefore, the feasibility and success of the Aptos model are inextricably linked to the performance of an advanced execution engine like Block-STM. It strategically trades a centralized execution bottleneck for a distributed parallel execution challenge, a challenge that Block-STM is purpose-built to solve.

 

II. Paradigms of Concurrency Control: Pessimism vs. Optimism

 

Concurrency control is the mechanism by which a system coordinates simultaneous accesses to shared data to ensure consistency and integrity.6 In the context of transaction processing, this means preserving the illusion that each transaction executes in isolation, a property known as serializability. The two dominant philosophical approaches to achieving this are pessimistic and optimistic concurrency control.

 

Pessimistic Concurrency Control (PCC): The “Ask for Permission” Approach

 

Pessimistic Concurrency Control (PCC) operates under the assumption that conflicts between transactions are a common occurrence and must be actively prevented.

  • Core Mechanism: The primary tool of PCC is locking. Before a transaction can read or write a piece of data, it must first acquire a lock on that data item. If another transaction already holds a conflicting lock, the requesting transaction must wait until the lock is released.8 The most well-known PCC protocol is two-phase locking (2PL), which governs the acquisition and release of these locks to ensure serializability.6 (A minimal locking sketch follows this list.)
  • Strengths: PCC’s main advantage is its guarantee of correctness. By preventing conflicts from ever occurring, it ensures that the database state is never corrupted.
  • Weaknesses: This guarantee comes at a significant cost. The overhead of acquiring and releasing locks can be substantial, especially in systems with high transaction volumes.10 More importantly, PCC can severely limit concurrency, as transactions are forced to wait even if a conflict would not have ultimately materialized. This can lead to poor resource utilization and reduced throughput. Furthermore, PCC systems are vulnerable to deadlocks, a situation where two or more transactions are stuck waiting for each other to release locks, requiring a separate mechanism for detection and resolution.9 Under high-contention workloads, the performance of PCC protocols degrades significantly as lock contention and waiting times increase.11
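To make the locking discipline concrete, the following is a minimal, illustrative sketch of a strict two-phase locking scheme with shared ("S") and exclusive ("X") lock modes. The `LockManager` class and its methods are assumptions made for illustration, not any particular system's API; production lock managers add wait queues, deadlock detection, and fairness policies.

```python
# Minimal sketch of strict two-phase locking (2PL): every lock is acquired
# before the corresponding access (growing phase) and released only at
# commit or abort (shrinking phase). Illustrative only.

class LockManager:
    def __init__(self):
        self.locks = {}                             # item -> (mode, set of holder txn ids)

    def try_acquire(self, txn_id, item, mode):      # mode is "S" (shared) or "X" (exclusive)
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn_id})
            return True
        held_mode, holders = held
        if mode == "S" and held_mode == "S":
            holders.add(txn_id)                     # shared locks are compatible
            return True
        if holders == {txn_id}:                     # sole holder may upgrade S -> X
            self.locks[item] = ("X" if mode == "X" else held_mode, holders)
            return True
        return False                                # conflict: caller must wait or abort

    def release_all(self, txn_id):                  # shrinking phase, at commit/abort
        for item in list(self.locks):
            _, holders = self.locks[item]
            holders.discard(txn_id)
            if not holders:
                del self.locks[item]

lm = LockManager()
assert lm.try_acquire(1, "acct:A", "S")
assert lm.try_acquire(2, "acct:A", "S")             # two readers coexist
assert not lm.try_acquire(2, "acct:A", "X")         # writer blocked by txn 1's shared lock
lm.release_all(1)
lm.release_all(2)
```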

 

Optimistic Concurrency Control (OCC): The “Ask for Forgiveness” Approach

 

Optimistic Concurrency Control (OCC) is founded on the opposite assumption: that conflicts between transactions are rare.10 Instead of preventing conflicts with locks, it allows transactions to execute freely and only checks for conflicts at the very end, resolving them after the fact.

  • The Three Phases of OCC: The lifecycle of a transaction under OCC is typically divided into three distinct phases 13 (a minimal code sketch follows this list):
  1. Read Phase: The transaction executes its logic. It reads values from the shared database but buffers all its writes in a private workspace or transactional log.8 These tentative writes are not visible to any other transaction. During this phase, the system records the set of data items the transaction has read (its read-set) and the set of data items it intends to write (its write-set).
  2. Validation Phase: This is the critical conflict-detection step. Before the transaction is allowed to commit, the system validates its execution. It checks whether any other transaction has committed changes to the data items in the current transaction’s read-set since the read phase began.10 If a conflict is detected (i.e., the data read is now stale), the transaction’s validation fails.
  3. Write Phase: If validation is successful, the transaction is committed. Its buffered writes are applied to the database, making them permanent and visible to all subsequent transactions.13 If validation fails, the transaction is aborted, its private writes are discarded, and it is typically restarted from the beginning.15
  • Strengths: In environments with low data contention, OCC can achieve significantly higher throughput than PCC. By avoiding the overhead of locking, it minimizes coordination and allows for a high degree of parallelism.7 Because it is a non-locking protocol, OCC is also inherently free from deadlocks.14
  • Weaknesses: The primary drawback of OCC is its performance degradation under high contention. When conflicts are frequent, the rate of transaction aborts and re-executions skyrockets. This leads to a large amount of wasted computational work, as the effort spent on aborted transactions is thrown away.13 In extreme cases, the system can enter a state of “contention collapse,” where throughput plummets as transactions are repeatedly aborted.16 The cost of concurrency is thus shifted from the upfront coordination of locking to the backend penalty of rollback and re-execution.10
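The following is a minimal, single-threaded sketch of the read/validate/write lifecycle described above, assuming a simple per-key version counter for conflict detection. The `OccTransaction` class and the version scheme are illustrative assumptions, not the API of any particular system.

```python
# Minimal sketch of the OCC read/validate/write lifecycle. Each committed
# write bumps a per-key version; validation re-checks that every key read
# still carries the version observed during the read phase.

database = {"x": (0, 0), "y": (0, 0)}   # key -> (value, version)

class OccTransaction:
    def __init__(self, db):
        self.db = db
        self.read_set = {}      # key -> version observed
        self.write_set = {}     # key -> new value (private workspace)

    # --- Read phase: reads hit the database, writes are buffered privately ---
    def read(self, key):
        if key in self.write_set:
            return self.write_set[key]
        value, version = self.db[key]
        self.read_set[key] = version
        return value

    def write(self, key, value):
        self.write_set[key] = value

    # --- Validation phase: detect commits that invalidated our reads ---
    def validate(self):
        return all(self.db[k][1] == v for k, v in self.read_set.items())

    # --- Write phase: apply buffered writes, or abort on validation failure ---
    def try_commit(self):
        if not self.validate():
            return False                         # abort: caller discards and re-executes
        for key, value in self.write_set.items():
            _, version = self.db[key]
            self.db[key] = (value, version + 1)
        return True

t = OccTransaction(database)
t.write("x", t.read("x") + 1)
assert t.try_commit()                            # no concurrent writer: commit succeeds
```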

The decision between PCC and OCC can be viewed through an economic lens, as a trade-off based on the anticipated “cost of conflict.” PCC is akin to purchasing a comprehensive insurance policy. It requires paying a fixed, upfront premium—the overhead of locking—on every single transaction, regardless of whether a conflict was ever likely. This premium buys the certainty that no conflicts will occur. OCC, on the other hand, operates like a “pay-as-you-go” model with a deductible. It forgoes the upfront premium, but must pay a potentially high penalty—the cost of an abort and re-execution—whenever a conflict actually materializes.

When the conflict rate is low, the cumulative cost of these occasional penalties in OCC is far less than the cumulative cost of the mandatory premiums in PCC, making OCC the more efficient strategy. However, as the conflict rate rises, the frequency and total cost of OCC’s penalties can increase dramatically, eventually surpassing the cost of PCC’s steady premiums. This inflection point marks the onset of contention collapse. The central goal of an advanced optimistic system like Block-STM is to fundamentally alter this economic balance. It does so not by trying to prevent conflicts, but by dramatically reducing the cost of the penalty associated with an abort, thereby extending the viability of the optimistic approach into much higher-contention regimes.

The following table provides a comparative summary of these two fundamental approaches to concurrency control.

| Feature | Pessimistic Concurrency Control (PCC) | Optimistic Concurrency Control (OCC) |
|---|---|---|
| Core Assumption | Conflicts are likely and must be prevented. | Conflicts are rare and can be resolved after detection. |
| Primary Mechanism | Locking (e.g., Two-Phase Locking). | Read/Validate/Write phases with private workspaces. |
| Conflict Handling | Prevention: Transactions wait for locks to be released. | Detection & Resolution: Conflicting transactions are aborted and restarted. |
| Performance (Low Contention) | Sub-optimal due to constant locking overhead. | High throughput due to minimal coordination. |
| Performance (High Contention) | Degrades due to lock contention and long wait times. | Performance collapses due to high abort rates and wasted work. |
| Key Bottleneck | Lock manager, wait queues. | Validation phase, cost of re-execution. |
| Deadlock Risk | High; requires deadlock detection and resolution mechanisms. | None; deadlock-free by design. |

 

III. Block-STM: Turning the Ordering Curse into a Performance Blessing

 

Block-STM is a state-of-the-art, in-memory parallel execution engine designed specifically for deterministic SMR systems like the Aptos blockchain.17 It is built upon the principles of Software Transactional Memory (STM), a concurrency control paradigm that provides high-level abstractions to make parallel programming appear more like simple sequential execution.19 However, Block-STM is not a general-purpose STM library; it is a highly specialized system that leverages unique properties of the blockchain use-case to achieve unprecedented performance.1

 

The Central Innovation: Leveraging the Preset Serialization Order

 

The most profound innovation of Block-STM is its ability to transform a traditional constraint of blockchain systems into its greatest performance asset. Unlike general-purpose OCC or STM systems, which must dynamically discover a valid serializable order among concurrent transactions, Block-STM is given a block where the final serialization order is already determined by the proposer: $tx_1 < tx_2 < \dots < tx_n$.1 In conventional thinking, this fixed order is a “curse” that constrains parallelism. Block-STM ingeniously inverts this, using the preset order as a global, unambiguous source of truth for dependency resolution, thereby turning the ordering curse into a performance blessing.3

This preset order acts as an implicit, decentralized coordination mechanism, drastically reducing the need for explicit and costly synchronization primitives between threads. In a typical parallel system, if two threads contend for a resource, they must engage in an expensive runtime negotiation (e.g., atomic compare-and-swap, lock acquisition) to resolve the conflict. In Block-STM, the outcome of any conflict between $tx_i$ and $tx_j$ (where $i < j$) is pre-ordained: the effects of $tx_i$ must prevail. This deterministic rule is embedded directly into the engine’s core logic, allowing it to replace expensive, dynamic synchronization with cheap, predictable operations. It transforms a potentially chaotic race for state access into an orderly, pre-negotiated process, which is the foundational reason for its low overhead and remarkable scalability.

 

Core Components Deep Dive

 

Block-STM’s architecture is a synthesis of established techniques and novel optimizations, all centered around the preset order.

 

Multi-Version Data Structure

 

To manage concurrent state access, Block-STM employs a multi-version data structure. This structure sidesteps write-write conflicts entirely by allowing multiple versions of the same data location to coexist in memory.3 When a transaction writes to a location, it creates a new version tagged with its unique identifier, which consists of its index in the block and its current “incarnation” number (since a transaction may be re-executed multiple times).20

The preset order dictates a simple yet powerful rule for reads: when a transaction $tx_j$ needs to read a memory location, the data structure provides it with the value written by the transaction $tx_i$ with the highest index such that $i < j$.3 This ensures that every read is deterministic and consistent with the final serial order, even while transactions are executing speculatively and out of sequence.
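A minimal sketch of this read rule follows, assuming a dictionary-based store keyed by memory location. The `MultiVersionMemory` class and its layout are illustrative, not the Aptos implementation.

```python
# Sketch of a multi-version data structure. Each location maps to
# {txn_index: (incarnation, value)}; a reader with index j sees the write
# from the highest txn index i < j, or the pre-block ("storage") value if
# no such write exists.

class MultiVersionMemory:
    def __init__(self, base_state):
        self.base = dict(base_state)            # state before the block
        self.versions = {}                      # location -> {txn_idx: (incarnation, value)}

    def write(self, location, txn_idx, incarnation, value):
        self.versions.setdefault(location, {})[txn_idx] = (incarnation, value)

    def read(self, location, reader_idx):
        entries = self.versions.get(location, {})
        lower = [i for i in entries if i < reader_idx]
        if not lower:
            # No speculative write below us: fall back to the pre-block state.
            return None, self.base.get(location)
        i = max(lower)
        incarnation, value = entries[i]
        return (i, incarnation), value          # version observed + value

mv = MultiVersionMemory({"balance:A": 100})
mv.write("balance:A", txn_idx=1, incarnation=0, value=90)
mv.write("balance:A", txn_idx=4, incarnation=0, value=70)
version, value = mv.read("balance:A", reader_idx=3)
assert version == (1, 0) and value == 90        # tx_3 sees tx_1's write, not tx_4's
```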

 

Speculative Execution and Validation

 

Worker threads in the system speculatively execute transactions in parallel, assuming no conflicts will occur.3 Each execution of a transaction is termed an “incarnation”.1 During an incarnation, the engine meticulously tracks all memory accesses:

  • Read-set: A log of each memory location read, along with the specific version (transaction ID and incarnation number) that was observed.3
  • Write-set: A private buffer of all memory locations and the new values the transaction intends to write.3

After an incarnation completes, its write-set is applied to the multi-version data structure, and a validation task is scheduled. The validation process is the heart of conflict detection. It involves re-reading every location in the transaction’s read-set and comparing the currently visible version against the version originally recorded.3 If all versions match, the validation succeeds, indicating that the transaction’s view of the world remains consistent. If a mismatch is found, it signifies that a transaction with a lower index has since been re-executed and produced a new version of the data. This constitutes a Read-Write conflict, causing the validation to fail and triggering an abort and subsequent re-execution of the current transaction.
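The sketch below illustrates that validation step: the recorded read-set is replayed against the multi-version structure, and any version mismatch invalidates the incarnation. The helper `latest_version_below` is an illustrative stand-in for the multi-version read rule described earlier.

```python
# Sketch of read-set validation: re-resolve every location the incarnation
# read and compare the version now visible with the version it observed.

def latest_version_below(versions, location, reader_idx):
    """Return the (txn_idx, incarnation) visible to reader_idx, or None."""
    entries = versions.get(location, {})
    lower = [i for i in entries if i < reader_idx]
    if not lower:
        return None                              # read fell through to pre-block state
    i = max(lower)
    return (i, entries[i][0])                    # entries[i] = (incarnation, value)

def validate_read_set(versions, read_set, reader_idx):
    """read_set: list of (location, version_observed) captured during execution."""
    for location, observed in read_set:
        if latest_version_below(versions, location, reader_idx) != observed:
            return False                         # stale read -> abort and re-execute
    return True

versions = {"balance:A": {1: (0, 90)}}
read_set_of_tx3 = [("balance:A", (1, 0))]
assert validate_read_set(versions, read_set_of_tx3, reader_idx=3)

versions["balance:A"][1] = (1, 85)               # tx_1 re-executed (incarnation 1)
assert not validate_read_set(versions, read_set_of_tx3, reader_idx=3)
```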

 

The Collaborative Scheduler

 

Coordinating the execution and validation tasks across dozens of threads is a critical challenge. Naive approaches using standard concurrent data structures like priority queues are notoriously difficult to scale.1 Block-STM introduces a novel, low-overhead collaborative scheduler that elegantly sidesteps this problem.17 The scheduler prioritizes tasks associated with lower-indexed transactions, as their outcomes affect all subsequent transactions. Crucially, because the transactions are identified by a compact and bounded set of indices (from 1 to $n$), the scheduler can implement this prioritization using a few shared atomic counters instead of complex, lock-heavy data structures. This counting-based approach is a key optimization made possible only by the preset serialization order.1
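The sketch below illustrates the counter-based dispatch idea in a deliberately simplified, single-threaded form: two indices replace a priority queue, and workers always take the lowest-indexed pending task. It omits the production scheduler's completion tracking, wave numbering, and true atomic counters, for which a plain lock stands in here; all names are illustrative.

```python
# Highly simplified illustration of counter-based task dispatch: instead of a
# concurrent priority queue, one index tracks the next transaction to execute
# and another the next to validate; lower-indexed work is always preferred.
from threading import Lock

class Scheduler:
    def __init__(self, num_txns):
        self.n = num_txns
        self.execution_idx = 0       # next txn index needing (re-)execution
        self.validation_idx = 0      # next txn index needing validation
        self.lock = Lock()           # stands in for atomic fetch-and-increment

    def next_task(self):
        with self.lock:
            if self.validation_idx < self.execution_idx and self.validation_idx < self.n:
                i, self.validation_idx = self.validation_idx, self.validation_idx + 1
                return ("validate", i)
            if self.execution_idx < self.n:
                i, self.execution_idx = self.execution_idx, self.execution_idx + 1
                return ("execute", i)
            return None              # no work currently available

    def on_abort(self, txn_idx):
        # After a re-execution, the aborted txn and everything above it may
        # need to be executed/validated again, so the counters move back down.
        with self.lock:
            self.execution_idx = min(self.execution_idx, txn_idx)
            self.validation_idx = min(self.validation_idx, txn_idx)

sched = Scheduler(num_txns=3)
tasks = list(iter(sched.next_task, None))        # single-threaded demo run
assert tasks == [("execute", 0), ("validate", 0), ("execute", 1),
                 ("validate", 1), ("execute", 2), ("validate", 2)]
```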

 

Dynamic Dependency Estimation

 

The single greatest threat to performance in an OCC system is a cascade of aborts, where the re-execution of one transaction triggers a chain reaction of validation failures in dependent transactions.3 Block-STM’s most innovative feature for mitigating this is its dynamic dependency estimation mechanism.

When a transaction $tx_i$ fails validation and is aborted, its write-set is not merely discarded. Instead, the engine uses this information as a strong hint about where $tx_i$ is likely to write in its next incarnation. It marks every memory location in the aborted write-set with a special ESTIMATE marker in the multi-version data structure.3 Later, if another transaction $tx_j$ attempts to read a location marked this way, it immediately knows that it has a potential dependency on the outcome of $tx_i$. Rather than proceeding with stale data and guaranteeing a future abort, $tx_j$ can pause its execution and wait for the dependency to be resolved.3 This “just-in-time” dependency detection has two major advantages over pre-execution approaches that try to compute a full dependency graph upfront: (1) estimates are generated lazily and only when a conflict actually occurs, avoiding unnecessary overhead, and (2) the estimates are based on a more recent and therefore more accurate execution state.17
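A minimal sketch of this mechanism, assuming the same dictionary-based multi-version layout as before; the `ESTIMATE` sentinel and the `("dependency", i)` return convention are illustrative choices.

```python
# Sketch of dynamic dependency estimation: when tx_i aborts, the locations in
# its last write-set are marked ESTIMATE rather than removed. A later reader
# that hits an ESTIMATE learns it depends on tx_i's next incarnation and can
# suspend instead of reading stale data and aborting later.

ESTIMATE = object()                              # sentinel marker

class MultiVersionMemory:
    def __init__(self):
        self.versions = {}                       # location -> {txn_idx: value or ESTIMATE}

    def write(self, location, txn_idx, value):
        self.versions.setdefault(location, {})[txn_idx] = value

    def mark_estimates(self, txn_idx, last_write_set):
        for location in last_write_set:
            self.versions.setdefault(location, {})[txn_idx] = ESTIMATE

    def read(self, location, reader_idx):
        entries = self.versions.get(location, {})
        lower = [i for i in entries if i < reader_idx]
        if not lower:
            return ("base", None)                # fall through to pre-block state
        i = max(lower)
        if entries[i] is ESTIMATE:
            return ("dependency", i)             # wait for tx_i to re-execute
        return ("value", entries[i])

mv = MultiVersionMemory()
mv.write("counter", txn_idx=2, value=7)
mv.mark_estimates(txn_idx=2, last_write_set=["counter"])   # tx_2 was aborted
status, dep = mv.read("counter", reader_idx=5)
assert status == "dependency" and dep == 2       # tx_5 suspends until tx_2 finishes
```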

 

Lazy, Block-Level Commits

 

Finally, Block-STM leverages the operational context of blockchains, where state updates are applied atomically at the block level. This allows the engine to forgo the expensive synchronization required for committing each transaction individually. Instead, it accumulates all state changes from successful transaction executions in the multi-version data structure. Only after all transactions in the block have been successfully executed and validated does the system perform a single, lazy commit of the entire block’s state changes. This is achieved efficiently using a “double-collect” technique managed by two atomic counters, minimizing synchronization to the very end of the process.1
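The sketch below illustrates only the block-level write-back: once every transaction has executed and validated, the final state is materialized by taking, for each location, the value written by the highest-indexed transaction. It deliberately does not model the double-collect counter technique itself, and the function name is illustrative.

```python
# Sketch of lazy, block-level commit: nothing is written back per transaction;
# after the whole block has executed and passed validation, the accumulated
# multi-version writes are flattened into the final post-block state.

def materialize_block(base_state, versions):
    """versions: location -> {txn_idx: value} accumulated during execution."""
    final_state = dict(base_state)
    for location, entries in versions.items():
        final_state[location] = entries[max(entries)]   # highest-indexed write wins
    return final_state

base = {"balance:A": 100, "balance:B": 50}
versions = {"balance:A": {1: 90, 4: 70}, "balance:B": {2: 60}}
assert materialize_block(base, versions) == {"balance:A": 70, "balance:B": 60}
```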

 

IV. Conflict Detection, Abort Rates, and Mitigation Strategies

 

The performance of any optimistic concurrency control system is ultimately determined by its ability to manage conflicts and minimize the resulting aborts. While Block-STM’s architecture is designed to be robust, understanding the nature of conflicts and the strategies to mitigate them is crucial.

 

Anatomy of a Transaction Conflict

 

A conflict arises when two or more transactions access the same memory location, and at least one of these accesses is a write operation.8 There are three canonical types of data conflicts 22:

  • Write-Write (W→W): Two transactions attempt to write to the same location. Block-STM’s multi-version data structure inherently resolves these conflicts by storing both writes as distinct versions.
  • Write-Read (W→R): One transaction reads a location after another transaction has written to it. The preset order and multi-version read rule in Block-STM manage this by ensuring the reading transaction always sees the correct, preceding version.
  • Read-Write (R→W): One transaction reads a location, and a concurrent transaction later writes to it. This is the most problematic type of conflict in an optimistic system and the primary cause of aborts.

 

Primary Cause of Aborts: Validation Failure

 

In Block-STM, a transaction is aborted and scheduled for re-execution when its validation fails. This occurs when the assumptions made during its speculative execution are proven false. The canonical scenario is an R→W conflict that violates the preset serial order. For example, consider transaction $tx_j$. During its execution, it reads a value written by $tx_i$ (where $i < j$). Subsequently, $tx_i$ itself is aborted (due to a conflict with an even earlier transaction, $tx_k$) and is re-executed. In its new incarnation, $tx_i$ writes a different value to that same memory location. When $tx_j$’s validation task runs, it re-reads the location and discovers that the version it originally read is no longer the latest version produced by a transaction with an index less than $j$. This version mismatch signals a broken dependency, invalidates the entire execution of $tx_j$, and forces an abort.3

 

The Critical Impact of Abort Cascades

 

The true performance risk is not a single abort, but a cascade of aborts. The re-execution of a single, low-indexed transaction can have far-reaching consequences. If $tx_1$ aborts, its new write-set might invalidate the read-sets of $tx_2$, $tx_5$, and $tx_{100}$, and any other transaction that read the values it originally wrote. When these dependent transactions are subsequently validated, they too will fail and be aborted. This chain reaction can lead to an enormous amount of wasted computation, potentially bringing the entire parallel execution process to a standstill. As noted in the Block-STM research, minimizing these cascades is “the name of the performance game” for any STM system.3

 

Advanced Abort Reduction Techniques

 

Beyond Block-STM’s foundational dynamic dependency estimation, research has explored more granular and semantically aware strategies to reduce the high cost of aborts.

  • Operation-Level vs. Transaction-Level Conflict Resolution: A fundamental limitation of traditional OCC, including Block-STM, is its coarse-grained, transaction-level resolution. If a conflict is detected in just one of a transaction’s many operations, the entire transaction must be aborted and restarted from scratch.23 A more advanced approach, exemplified by ParallelEVM, introduces operation-level concurrency control. This system uses a detailed operation log to trace dependencies within a single transaction. Upon a validation failure, it does not abort the whole transaction. Instead, it enters a redo phase where it identifies and re-executes only the specific operations that were directly affected by the conflict, while preserving the results of all other, conflict-free operations.23 This fine-grained approach significantly reduces wasted work, with evaluations showing it can achieve a 4.28x speedup on real-world Ethereum workloads, compared to just 2.49x for a traditional OCC implementation.23
  • Data-Driven Timestamp Management (TicToc): Conventional timestamp-ordering (T/O) protocols often cause unnecessary aborts by assigning a rigid, static timestamp to a transaction before it executes. The TicToc algorithm offers a more flexible, optimistic approach. It avoids a centralized timestamp allocator and instead embeds read and write timestamps directly into the data items themselves. A transaction’s final commit timestamp is computed lazily at validation time, based on the versions of the data it actually accessed.25 This data-driven approach allows TicToc to find a valid serializable ordering for transaction interleavings that would have been needlessly aborted by a stricter, pre-ordained T/O protocol. This flexibility has been shown to reduce abort rates by a factor of 3.3x in certain workloads while improving throughput.25 (A simplified sketch of this timestamp scheme follows this list.)
  • Transaction Healing: Taking semantic awareness a step further, the concept of “transaction healing” seeks to repair a conflicted transaction rather than aborting it. This mechanism captures the dependency graph of operations within a transaction prior to execution. If validation fails, instead of discarding all work, the system uses the dependency graph to judiciously restore only the specific non-serializable operations and heal the inconsistent states that resulted from them.26 This approach attempts to maximize the re-utilization of partial computational work, offering a promising alternative to the costly abort-and-restart cycle, especially in highly contended environments.
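As referenced above, the following is a simplified, single-threaded sketch of the data-driven timestamp idea behind TicToc. The record layout and method names are illustrative assumptions, and the sketch omits the locking and concurrent read-timestamp extension details of the published protocol.

```python
# Simplified sketch of TicToc-style data-driven timestamps: each record carries
# a write timestamp (wts) and a read timestamp (rts), and a transaction's commit
# timestamp is computed lazily from the records it actually touched.

class Record:
    def __init__(self, value):
        self.value, self.wts, self.rts = value, 0, 0

class TicTocTxn:
    def __init__(self, store):
        self.store = store
        self.reads = {}      # key -> (wts, rts) snapshot at read time
        self.writes = {}     # key -> new value (buffered)

    def read(self, key):
        rec = self.store[key]
        self.reads[key] = (rec.wts, rec.rts)
        return rec.value

    def write(self, key, value):
        self.writes[key] = value

    def try_commit(self):
        # Commit timestamp must cover everything read and exceed the read
        # timestamp of everything overwritten.
        commit_ts = 0
        for wts, _ in self.reads.values():
            commit_ts = max(commit_ts, wts)
        for key in self.writes:
            commit_ts = max(commit_ts, self.store[key].rts + 1)
        # Validate reads whose snapshot interval does not already cover commit_ts.
        for key, (wts, rts) in self.reads.items():
            if commit_ts > rts:
                rec = self.store[key]
                if rec.wts != wts:
                    return False                     # record changed since read: abort
                rec.rts = max(rec.rts, commit_ts)    # extend validity interval instead
        for key, value in self.writes.items():
            rec = self.store[key]
            rec.value, rec.wts, rec.rts = value, commit_ts, commit_ts
        return True

store = {"x": Record(10), "y": Record(20)}
t = TicTocTxn(store)
t.write("y", t.read("x") + 1)                        # read x, write y
assert t.try_commit() and store["y"].value == 11
```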

 

V. The Influence of State Access Patterns on Parallelism

 

The degree of parallelism that can be extracted from a block of transactions is not an intrinsic property of the execution engine alone. It is fundamentally constrained by the inherent data dependencies within the workload, which are dictated by the patterns of state access. Understanding and classifying these patterns is essential for reasoning about system performance and designing scalable applications.

 

Formal Classification of State Access

 

The interactions between transactions and the shared state can be classified along two primary, orthogonal dimensions: state partitionability and access ordering.27

  • State Partitionability: This dimension describes how the state is structured and accessed.
    • Non-Partitionable State: The state is monolithic, and any transaction can potentially read or modify any part of it. This creates a high probability of dependencies between any two transactions, severely limiting parallelization. A canonical example is a global counter that is incremented by every transaction in a block.
    • Partitionable State: The state can be logically divided into disjoint subsets or partitions. Each transaction is known to access only a specific partition, often determined by a hash function on transaction inputs (e.g., an account address). Transactions operating on different partitions are independent and can be executed in parallel without any risk of conflict.27 Updating individual user balances, where each transaction touches only the sender and receiver accounts, is a classic example of a workload with a partitionable state.
  • State Access Ordering: This dimension describes the temporal dependencies between state updates.
    • Ordered Access: The execution of a transaction $x_t$ is strictly dependent on the state produced by the execution of the immediately preceding transaction, $x_{t-1}$. This creates a rigid, sequential chain of dependencies that inherently resists parallelization.27 The Block-STM engine is specifically designed to enforce this property while still finding opportunities for speculative parallelism.
    • Relaxed Access: The execution of transaction $x_t$ depends on the state produced by some previous transaction, but not necessarily $x_{t-1}$. This allows for more flexibility and out-of-order execution, increasing the potential for parallelism.27

The interplay between the execution engine and the workload’s intrinsic access patterns is what ultimately determines real-world performance. The engine provides the potential for parallel speedup, but the workload’s access patterns define the opportunity for parallelism. A highly sophisticated engine like Block-STM running a workload with a non-partitionable, ordered state access pattern (e.g., all transactions modifying a single, central account) will yield minimal speedup, as the workload is inherently sequential. Conversely, a simpler parallel execution model could achieve high performance on a workload that is “embarrassingly parallel”—one characterized by a highly partitionable state and relaxed access dependencies.28

This dynamic reveals that maximizing system throughput is a shared responsibility. While protocol engineers must design powerful execution engines, application and smart contract developers must also design their systems to be “parallelism-friendly.” By favoring architectures that partition state (e.g., sharding data across multiple contracts or accounts instead of concentrating it in one) and minimizing strict, ordered dependencies, developers can create workloads that provide greater opportunity for the underlying engine to extract parallelism. This co-design of workload and engine is essential for achieving true scalability.

The following table summarizes the implications of these state access patterns for concurrency.

| Access Pattern | Description | Example | Parallelism Potential | Suitable Concurrency Model |
|---|---|---|---|---|
| Non-Partitionable & Ordered | Transactions access a shared global state and must be applied in strict sequence. | A global configuration contract updated by every transaction. | Very Low (Inherently Sequential) | Sequential execution or low-overhead parallel models like Block-STM. |
| Partitionable & Ordered | State is partitioned, but the final state must reflect a strict transaction order. | A series of transfers between distinct accounts, where the final balances must match the sequential order. | High (with Speculation) | Optimistic models with preset order (e.g., Block-STM). |
| Non-Partitionable & Relaxed | Transactions access a shared global state but can be applied in any order. | A decentralized voting system where votes are tallied at the end, and the order of casting is irrelevant. | Moderate | General-purpose OCC or STM systems. |
| Partitionable & Relaxed | State is partitioned, and transactions on different partitions are independent and can be reordered freely. | Minting NFTs where each mint is an independent transaction to a new token ID. | Very High (Embarrassingly Parallel) | Static dependency analysis (e.g., Solana model) or simple parallel execution. |

 

VI. The Frontier of Optimal Transaction Scheduling

 

While concurrency control mechanisms like OCC react to conflicts as they happen, transaction scheduling is a proactive discipline that aims to arrange the execution of transactions to prevent conflicts from occurring in the first place. The ultimate goal of a scheduler is to find an execution plan that maximizes throughput and minimizes the total time required to process a batch of transactions, a metric known as the makespan.30 However, finding a truly optimal schedule is an NP-Complete problem, meaning it is computationally intractable for all but the smallest inputs.30 Consequently, practical systems rely on a variety of efficient heuristics and, increasingly, machine learning techniques.

 

Heuristic-Based Scheduling Algorithms

 

  • Shortest Makespan First (SMF): This is a greedy, heuristic-based strategy that iteratively builds a schedule. At each step, from the pool of unscheduled transactions, the scheduler selects the transaction that is predicted to cause the smallest increase to the schedule’s total makespan.30 To make this computationally feasible in real-time, implementations often use a sampling-based variant: instead of evaluating all pending transactions, the scheduler randomly samples a small number $k$ (e.g., 5) and chooses the best one from that small sample.30 This provides a good balance between schedule quality and decision-making overhead.
  • Queue-Based Scheduling (Hashing/Clustering): This approach simplifies the scheduling problem by partitioning it. Transactions are first analyzed to predict which state or resources they are likely to access. They are then grouped into different queues or buckets based on these predictions, often by hashing parts of the transaction data (like the accounts involved or SQL WHERE clauses).30 The core strategy is to place transactions that are likely to conflict with each other into the same queue. The scheduler then enforces a simple rule: transactions within the same queue are executed serially to avoid conflicts, but all the queues are processed in parallel relative to each other. This effectively transforms a complex, arbitrary dependency graph into a much simpler and more manageable structure of inter-queue parallelism and intra-queue serialism.30
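A minimal sketch of the queue-based strategy just described: transactions are bucketed by a hash of the state they are predicted to access, each queue is drained serially, and distinct queues run in parallel. The prediction input, the routing rule, and the queue count are illustrative assumptions.

```python
# Sketch of queue-based scheduling: hash each transaction's predicted state
# accesses to a queue; likely-conflicting transactions land in the same queue
# and run serially there, while distinct queues can be drained in parallel.
from concurrent.futures import ThreadPoolExecutor

NUM_QUEUES = 4

def assign_queue(predicted_keys):
    # Route by the lexicographically smallest predicted key; real systems may
    # split or merge transactions that span several partitions.
    return hash(min(predicted_keys)) % NUM_QUEUES

def schedule(transactions):
    """transactions: list of (txn_id, predicted_keys). Returns queue contents."""
    queues = [[] for _ in range(NUM_QUEUES)]
    for txn_id, keys in transactions:
        queues[assign_queue(keys)].append(txn_id)
    return queues

def drain(queue, execute):
    for txn_id in queue:                             # intra-queue: strictly serial
        execute(txn_id)

txns = [(1, {"acct:A"}), (2, {"acct:B"}), (3, {"acct:A"}), (4, {"acct:C"})]
queues = schedule(txns)                              # txns 1 and 3 share a queue
with ThreadPoolExecutor(max_workers=NUM_QUEUES) as pool:   # inter-queue: parallel
    for q in queues:
        pool.submit(drain, q, lambda txn_id: None)   # placeholder execute function
```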

 

Learning-Based and Adaptive Scheduling Policies

 

As workloads become more dynamic and complex, static heuristics can prove suboptimal. A new frontier in scheduling involves using machine learning and adaptive policies that can learn and respond to workload characteristics in real-time.

  • Reinforcement Learning (RL): The scheduling problem can be framed for a reinforcement learning agent. The agent observes the current state of the system (e.g., the set of pending transactions, the resources currently in use by in-flight transactions) and chooses an action (which transaction to schedule next). After executing the action, it receives a reward based on the outcome (e.g., a positive reward for a committed transaction, a negative reward for an abort). Over time, the agent learns a policy that maps states to actions in a way that maximizes its cumulative reward, effectively learning to optimize for throughput.30
  • Adaptive Mechanisms (DCoS): Systems can employ adaptive strategies that treat different types of transactions differently. The Dynamic Contention Scheduling (DCoS) protocol, for instance, uses a dual-granularity architecture. It first identifies transactions that access highly contended “hot” data. These high-risk transactions are then scheduled by a sophisticated, DRL-based executor that carefully manages their execution to minimize conflicts. The remaining “cold” transactions, which are unlikely to conflict, can be processed concurrently with a simpler, lower-overhead mechanism. This approach focuses the expensive optimization effort precisely where it is needed most.33
  • Adaptive Just-In-Time Transaction Scheduling (AJITTS): This technique is specifically designed for OCC systems and focuses on minimizing a transaction’s “vulnerability window”—the period between when it finishes execution and when it is validated and certified. A longer window means a higher probability that a conflicting transaction will commit, causing an abort. AJITTS adaptively throttles the start of transaction execution, aiming for each transaction to complete “just in time” for its turn at the certification queue. This minimizes the vulnerability window, thereby reducing the overall abort rate without sacrificing resource utilization.34

Transaction scheduling and concurrency control are not independent problems; they are two sides of the same coin, representing proactive and reactive strategies for managing data contention. A pure OCC system like Block-STM is largely reactive: it permits concurrent execution and then reacts to the resulting conflicts via validation and aborts. A scheduler, in contrast, is proactive: it analyzes the workload and attempts to impose an execution order that will prevent conflicts from arising. The most advanced systems recognize this symbiotic relationship. Block-STM, while reactive, implicitly benefits from the proactive scheduling performed by the block proposer who sets the initial transaction order. More sophisticated systems explicitly integrate the two. A smart scheduler can pre-process a batch of transactions, identifying likely conflicts and arranging the workload to be more “optimism-friendly”—for example, by ensuring conflicting transactions are not scheduled to run concurrently. This proactive ordering reduces the number of conflicts the backend OCC mechanism has to resolve, lowering the abort rate and allowing the system as a whole to sustain high throughput even under significant contention. The future of high-performance transaction processing likely resides in these hybrid systems that intelligently combine proactive scheduling with reactive concurrency control.

 

VII. Empirical Evaluation and Real-World Performance

 

The theoretical advantages of parallel execution models must be validated through rigorous empirical evaluation. The performance of Block-STM, in particular, has been extensively benchmarked, demonstrating its effectiveness in a production environment.

 

Performance of Block-STM in the Aptos Blockchain

 

Block-STM is the production execution engine for the Aptos blockchain, where its performance has set new standards for Layer-1 throughput.

  • Throughput (Transactions Per Second – TPS): Benchmark results show that Block-STM can achieve exceptionally high throughput. In controlled tests, it has reached up to 170,000 TPS on the Aptos benchmarks and 110,000 TPS on the earlier Diem benchmarks.1 The figure of over 160,000 TPS is widely cited as its peak capability under optimal conditions.3
  • Scalability and Speedup: A key feature of Block-STM is its ability to adapt to the level of inherent parallelism in a workload and scale with available hardware. When running on a 32-thread machine, the engine demonstrates a speedup of up to 20x over a single-threaded, sequential execution for low-contention workloads.1 Even under high-contention workloads, where conflicts are frequent, it still provides a substantial speedup of up to 9x.1 This demonstrates the effectiveness of its dynamic dependency estimation and collaborative scheduling in managing conflicts efficiently.
  • Low Overhead: A critical design goal for any parallel system is to avoid penalizing workloads that are not parallelizable. Block-STM excels in this regard. For a workload that is completely sequential (i.e., every transaction depends on the previous one), Block-STM incurs a maximum performance overhead of only 30% compared to a purely sequential execution engine.1 This ensures that the system is performant across the entire spectrum of workloads.
  • Real-World Adoption: The proven success and open-source nature of Block-STM have led to its recognition and adoption by other major blockchain projects seeking to enhance their own execution capabilities. Prominent platforms such as Polygon, Sei, and Starknet have integrated or are in the process of integrating Block-STM into their technology stacks, cementing its position as a state-of-the-art industry solution.2

The following table consolidates the key performance metrics reported for Block-STM.

 

| Metric | Low-Contention Workload | High-Contention Workload | Sequential Workload | Source(s) |
|---|---|---|---|---|
| Peak Throughput (TPS) | Up to 170,000 (Aptos) | Up to 80,000 (Aptos) | N/A | 1 |
| Speedup (vs. Sequential, 32 Threads) | Up to 20x | Up to 9x | ~0.7x | 17 |
| Overhead | N/A | N/A | At most 30% | 1 |

 

Performance of General OCC in High-Contention Workloads

 

Academic research and industry experience have consistently shown that the performance of generic OCC protocols is highly sensitive to data contention.

  • Studies using standard database benchmarks confirm that as contention increases, the performance of OCC degrades rapidly due to the overwhelming cost of repeated transaction aborts.16 This can lead to a “contention collapse” where throughput for certain transaction types approaches zero.16
  • However, performance is not solely a function of the abstract protocol but also of its implementation. Research has shown that careful engineering choices—such as avoiding the use of language runtime locks during the abort process—can prevent the most catastrophic performance collapses, allowing OCC to remain viable even on high-contention benchmarks like TPC-C.16
  • Furthermore, hybrid strategies demonstrate significant promise. By partitioning a workload into conflict-free clusters (which can be executed with no concurrency control) and a small set of “residual” conflicting transactions (handled by a traditional protocol), systems can achieve up to a 2x throughput improvement over standard two-phase locking in high-contention scenarios.39 This reinforces the idea that proactively scheduling and structuring workloads is key to unlocking performance.

 

VIII. Synthesis and Future Directions

 

The pursuit of high-throughput, deterministic transaction processing has led to a remarkable evolution in parallel execution models. This analysis reveals a clear and interconnected causal chain that governs system performance. At the most fundamental level, the state access patterns of a workload define the theoretical upper bound on achievable parallelism. Application designs that foster partitionable state create opportunities for concurrency, while those reliant on global, ordered state impose inherent sequentiality.

To capitalize on these opportunities, transaction scheduling algorithms act as a proactive layer, analyzing incoming transactions and attempting to arrange them in an execution plan that maximizes parallel execution and minimizes potential conflicts. Finally, the concurrency control model serves as the reactive enforcement layer. An advanced optimistic engine like Block-STM executes this plan speculatively, using sophisticated mechanisms to efficiently detect and resolve the remaining, unavoidable conflicts in a way that preserves the required deterministic outcome. Block-STM’s key contribution is its masterful use of a preset transaction order to transform this reactive resolution process from a costly, synchronization-heavy bottleneck into a low-overhead, highly scalable operation.

Looking ahead, several promising research avenues emerge from this synthesis, aiming to further erase the boundaries between these layers and push the limits of performance.

  • Hardware-Software Co-design: The core operations of a parallel execution engine—such as version lookups in a multi-version data structure, read-set validation, and dependency tracking—are performance-critical hot spots. Future research could explore the use of specialized hardware, such as FPGAs or custom ASICs, to accelerate these specific tasks, offloading them from the CPU and further reducing execution latency.
  • Tighter Integration of Scheduling and Execution: Current systems largely treat scheduling and execution as distinct phases. A more advanced architecture could feature a tight feedback loop between the two. The execution engine could provide real-time telemetry on observed conflict rates, abort cascades, and data hot-spots directly to the scheduler. The scheduler, in turn, could use this live data to dynamically adjust its scheduling policy, for example, by shifting from a purely optimistic schedule to a more conservative one for transactions known to be targeting a suddenly contended resource.
  • Cross-Layer Optimizations and Semantic Awareness: The ultimate frontier lies in breaking down the abstraction barriers between the application layer and the execution engine. If smart contracts could provide semantic hints to the underlying runtime—for example, declaring that certain operations are commutative or that a specific state update can tolerate a bounded degree of staleness—the execution engine could leverage this information to make more intelligent concurrency decisions. This would move beyond generic, syntax-based conflict detection (i.e., “did two transactions touch the same memory location?”) toward a more powerful, application-aware concurrency model that could permit forms of parallelism that are currently considered unsafe. This holistic, cross-layer approach represents the next logical step in the quest to build truly scalable and efficient deterministic systems.