key: cord-0047297-eacnmw1p authors: Piatka, Christian; Amslinger, Rico; Haas, Florian; Weis, Sebastian; Altmeyer, Sebastian; Ungerer, Theo title: Investigating Transactional Memory for High Performance Embedded Systems date: 2020-06-12 journal: Architecture of Computing Systems - ARCS 2020 DOI: 10.1007/978-3-030-52794-5_8 sha: 60b2e96419abcb943a58e64cac33961543ce03b7 doc_id: 47297 cord_uid: eacnmw1p We present a Transaction Management Unit (TMU) for Hardware Transactional Memories (HTMs). Our TMU enables three different contention management strategies, which can be applied according to the workload. Additionally, the TMU enables unbounded transactions in terms of size. Our approach tackles two challenges of traditional HTMs: (1) potentially high abort rates, (2) missing support for unbounded transactions. By enhancing a simulator with a transactional memory and our TMU, we demonstrate that our TMU achieves speedups of up to 4.2 and reduces abort rates by a factor of up to 11.6 for some of the STAMP benchmarks. To fully utilize multicores, the ability to generate efficient parallel code is essential. Because in-depth parallelization has proven to be very error prone, alternative synchronization mechanisms, such as transactional memories (TM) [12] , evolved to be a subject of research. Implementations of hardware transactional memories were integrated into commercial high performance chips from Intel and IBM. Despite their benefits, current available commercial HTMs (e.g. Intel's TSX) do not meet the high requirements of embedded systems. Commercial HTMs statically implement contention management strategies, which can lead to high abort rates. To meet the high requirements concerning power consumption, embedded systems depend on low abort rates. Another disadvantage of COTS HTMs is that they bound transactions in several ways. The size of a transaction is limited by the capacity and associativity of the L1 cache. In addition, transactions are aborted by events like interrupts, which limits their duration. This can negatively impact performance and complicates usability, leading to more programming errors, which is unacceptable for embedded systems due to the increasing demands of computational power and the scarce resources provided. For our work, we developed two challenges: (1) Lowering abort rates by providing effective contention management. (2) Enabling unbounded transactions in terms of size. We want to achieve these goals by implementing a Transaction Management Unit (TMU). The main contributions of this paper are: (1) The design of a flexible TMU, which enables the user to apply different contention management strategies. (2) A solution to enable unbounded transactions, considering their size. The rest of this paper is structured as follows: After giving an overview on the state of the art of transactional memories, we will describe our TMU. In the following section, our proposal is evaluated. At the end of this paper, we discuss related work and conclude by summing up our results and describing future work. A transaction is a sequence of instructions that is monitored by the transactional memory system. The beginning and the end of a transaction are usually marked by special instructions. To ensure a correct execution of the program, the transactional memory system has to ensure that every transaction fulfills the following three criteria: (1) Transactions have to be executed atomically, which means that they commit or abort as a whole. (2) Transactions have to run isolated, which means that they do not impact each other. (3) The transactional executions have to be serializable, which means that a sequential execution with a matching output exists. To fulfill these criteria, the transactional memory system has to ensure that values, which are consumed in a transaction, are not modified by another transaction running in parallel. For this purpose, read and write accesses in a transaction are logged at cache line granularity in a read and write set. A conflict occurs whenever a write access of a transaction tries to manipulate a cache line, which is already added to the read or write set of a competing transaction. A conflict also occurs, when a read access tries to read a cache line already contained in the write set of another transaction. To keep track of read and write accesses inside of transactions, HTMs usually utilize the cache coherence protocol. If a conflict is detected, it has to be resolved by the HTM, by aborting all but one of the conflicting transactions. This involves setting back all the memory modifications performed during the transactions (rollback). Additionally, the read and write sets have to be cleared and the register files have to be restored. Afterwards, the aborted transactions have to be restarted. Another source of transactional aborts are physical limits of the hardware, or interrupts. Physical restrictions are usually based on the size or associativity of the L1 caches, which are used to store the transactional read and write sets. Most HTM systems do not implement any mechanisms to allow the read or write set to overflow the size or associativity of the cache. This limits the transactions in terms of consumed and modified cache lines. Interrupts limit a transaction concerning its duration. Frequent transaction aborts because of physical limits or interrupts can be critical for the performance, since a significant amount of work has to be discarded. Due to these physical restrictions, a programmer usually has to provide an alternative path of execution utilizing different synchronization mechanisms, which are not affected by physical hardware boundaries. The fallback path is a weak spot for transactional memory usage. It uses alternative synchronization, which can be error prone and reduce performance, depending on the depth of parallelization. Additionally, it takes away one of the main advantages, which is the easy usability, because the fallback path increases the complexity of the parallel code. Providing an efficient alternative would render transactional memory superfluous. In this section, we first describe the implementation and the hardware setup of our system. Afterwards, we give a short overview of the selection of contention management strategies we implemented. At the end of this section, we explain how our solution is able to support unbounded transactions concerning their size. The multicore system we consider consists of up to 16 cores. Each core has a private L1 instruction cache (L1I$) as well as a private L1 data cache (L1D$). The TMU is integrated into the L2 cache and is able to monitor the messages relevant for the transactional execution. As depicted in Fig. 1 , our TMU is integrated into the shared L2 cache. We consider a multicore system with N cores (we assume N ≤ 16). The TMU monitors the execution and collects as well as provides data concerning the transactions. To be able to favor a transaction over others, our TMU is able to store priorities. Four (log 2 16) bits are needed at most to be able to save a specific priority for every core. The priority of a transaction can be specified by the programmer at transaction start. The default priorities are the core IDs. In order to store priorities, timestamps, or performance counters, our TMU provides a 64 bit value. The timestamps can be set at different times (e.g. transaction start, transaction commit, etc.) depending on the conditions of the contention management strategy. Performance counters provide information concerning the transactional execution, e.g. the number of committed transactions. To mark whether a transaction runs unbounded, the TMU provides an additional bit per core. The TMU consists of a memory, which stores information, relevant for the transactional execution of each core. The information stored depends on the applied strategy. To resolve a conflict, the TMU takes as inputs (1) the core ID of the core running the transaction, which detected the conflict, and (2) the core IDs of the cores running the conflicting transactions. After comparing the corresponding data, the TMU signals to abort the transaction that detected the conflict, or the conflicting transactions. We optimistically estimate an upper bound for the hardware costs, when considering a 16 core multicore, by: Because we rely on a dual ported memory, we doubled the assumed memory capacity in our estimation (Eq. 1). A comparator works similar to an adder, which is the reason why the hardware costs are approximately double the amount of the bits compared (Eq. (2)). Even if we consider the double amount for additional hardware cost, our approach consumes less than 0.05% of the space provided by a 2 MB L2 cache. Whenever a conflict between two running transactions occurs, the responsibility for resolving the conflict is handed over to the TMU. Depending on the strategy, the priority, a timestamp, or the number of commits are stored in the TMU. After comparing the relevant data, the TMU determines which of the conflicting transactions are aborted. We implemented three strategies: priority: The transaction with the higher priority is allowed to continue. This strategy allows to enforce an ordered commit of the transaction and a prioritization of a transaction over others. We would like to utilize this strategy in the future to enable various real-time strategies (e.g. mixed criticality). timestamp [16] : The transaction, that started first can carry on. Taking the timestamp of a transaction into account reduces the indeterminism concerning the aborts and guarantees progress. commit: The transaction on the core, which committed fewer transactions is able to continue. This strategy leads to a more balanced execution, because cores, which were not able to commit transactions, are favored when conflicts occur. Unbounded transactions always overrule the contention management strategy. Transactions have to abort whenever a transaction's read or write set exceeds the size or associativity of the L1 cache. Therefore, most HTMs have to provide a fallback mechanism consisting of an alternative execution path. This does not only make it harder for the programmer to write efficient and correct code, it can also be crucial for performance because of the loss of already computed work. The TMU monitors the transactional execution and sets a bit whenever a transaction is forced to run in unbounded mode. Whenever the bit is set, conflicts are resolved favoring the unbounded transaction. Therefore, the transaction will never be aborted, which means it does not have to be rolled back. Since it is guaranteed that the transaction will succeed, the backup version of the cache line is not needed, which allows the transaction to use the entire cache hierarchy. The TMU can only support one unbounded transaction at the time. If another transaction or thread tries to perform a conflicting access, the TMU takes actions to suppress them, e.g. by stalling the core. For the implementation of our approach, we utilized the gem5 simulator [4] . We selected the STAMP benchmark suite [6] to evaluate our approach. In this section we will describe in detail our evaluation methodology followed by the presentation of our results. The gem5 [4] is a cycle accurate processor simulator. It offers the possibility to choose an instruction set architecture out of a selection such as ARMv7, x86, etc. Furthermore, the periphery can be configured freely. The configuration of our system is described in Table 1 . We chose this configuration, as it models a contemporary embedded multicore. High-performance embedded systems as smartphones exhibit similar specifications. Due to the long run times entailed by the large input set of the STAMP benchmark suite [6] and the authors' recommendation to use the smaller input configuration for simulators, we chose to do our evaluation with the small input configuration depicted in Table 2 . yada -a20 -i inputs/633.2 For our baseline, we implemented a transactional memory system into the ARMbased gem5 simulator [4] . The implementation of the interface is similar to those offered by Intel TSX [13] and the newly announced ARM TME [3] . Our interface allows the programmer to explicitly start and end transactions using the corresponding commands, which are provided by our transactional memory system. Our baseline HTM detects and resolves conflicts eagerly: Conflicts are detected instantly when the conflicting memory access occurs (in contrast to detecting them at commit time). When a conflict occurs, the transaction, that detects the conflict, aborts to resolve it. We provide a fallback path with regular POSIX Thread synchronization in our baseline implementation. In our baseline as well as in the runs supported by our TMU, a transaction is executed in the fallback path, if the attempt to execute the transaction failed 100 times. Trying to re-execute a transaction for 100 times makes sense, because the execution of a transaction in the fallback path prohibits the other cores to execute work. Whenever the read or write of a transaction in our baseline exceeds the L1 cache, it is directly executed in the fallback path. We evaluated the STAMP benchmark suite [6] . Figure 2 depicts the evaluation of the eight STAMP benchmarks bayes, genome, intruder, kmeans, labyrinth, vacation, ssca2 and yada. Each graph depicts three lines. All lines show the absolute speedup compared to the reference execution (one core, no synchronization). We focused on the region of interest, which are the parts executed by transactions, because they can be quite small compared to the entire benchmark, making it difficult to show the effects of our work. We calculated the speedup as shown by Eq. 3. The labeling of the lines in Fig. 2 indicates whether the line represents the baseline execution or a contention management strategy combined with unbounded transactions. In the following, we describe in detail the behavior of the evaluated benchmarks: bayes: For the benchmark bayes, we were able to beat the baseline execution for most executions. Up to 62 unbounded transactions are executed, which shows that it is beneficial to implement unbounded transactions concerning their size. We are able to achieve the best speedup for the execution with four cores and the timestamp strategy. Considering the entire run time of the benchmark, the part in which transactions were executed, is extremely short. genome: The benchmark genome scales quite well. Our system produces the same results as the baseline. The features of the TMU become relevant for the executions with eight and sixteen cores. For these runs, we are able to significantly lower the number of aborted transactions. In these executions, transactions are started, which do not fit in the L1 cache and therefore have to be executed in the fallback path. Because of our TMU, we are prepared for these cases and are able to continue without having to abort them. intruder : For the benchmark intruder, we made similar observations as with the benchmark genome. The main difference is that the positive effects of our extensions only take effect for the execution with sixteen cores. For this execution, the baseline contention management strategy performs so poorly, that we are able to lower the abort rate by a factor of 11.6. Within the execution of the benchmark, no transaction faces a capacity problem, which means we achieve the speedup only through better contention management. kmeans: For the benchmark kmeans, we were not able to outperform the baseline. The reason for this is that no transaction, for any execution, faces a capacity problem, which means no unbounded transaction has to be executed. Additionally, the benchmark has hardly any conflicts, which eliminates the grounds of what we can improve. labyrinth: The baseline execution for the benchmark labyrinth is below one, because the benchmark launches fairly big transactions, which do not fit in the L1 cache and cause the transaction to abort and execute in the fallback path. The already achieved computational progress is discarded, which is Results of the execution of the STAMP benchmark suite [6] . For the benchmarks genome, intruder and yada we improved performance compared to the baseline implementation. why the baseline execution falls below one. For this benchmark, our work is beneficial. We manage to raise the speedup above the baseline execution and one. yada: The baseline execution for the benchmark yada suffers from a lot of conflicts, due to poor contention management. To commit around 4900 transactions, up to 33590 aborts occur. Additionally, some transactions face a problem with the capacity of the L1 cache. Therefore, the TMU handles up to 170 unbounded transactions, which is beneficial to performance and allows us to improve the baseline execution with both strategies. Additionally, we are able to achieve speedups bigger than one. In Fig. 3 , we depicted the number of aborts for every benchmark of the STAMP benchmark suite [6] . The line labeled as baseline depicts the baseline execution. The other lines represent an execution with a contention management strategy. Because we want to show the benefits of the implemented contention management strategies, we disabled the unbounded transactions for this evaluation. For most of the benchmarks, we are able to lower the number of the aborts. Especially for executions with 8 and 16 cores. For these executions, the contention management strategy of the baseline performs poorly and we are able to reduce the number of aborts significantly for some benchmarks. For the benchmarks, which already had a low abort rate, our strategies were not beneficial and sometimes even caused more aborts (e.g. ssca2 ). Our strategies perform best, when the contention between the transactions is high. Our evaluation showed that our work is beneficial to a transactional memory system in terms of performance and abort rates. There are some proposals describing how to handle unbounded transactions e.g. [2, 7, 8, 14] . In this section we describe and discuss solutions for similar problems. The authors of [5] focused on unbounded transactions. They proposed a permissions-only cache, which allows large transactions by only tracking read and write bits without the corresponding data. Once the permissions-only cache overflows, one of two proposed implementations, to handle unbounded transactions, can be utilized. ONE-TM-Serialized only allows one overflowed transaction at a time and stalls the other cores. ONE-TM-Concurrent allows several concurrent transactions to run in parallel, although only one transaction can run in overflowed-mode. Our approach possesses the same runtime characteristics as ONE-TM-Concurrent, the difference between the approaches lies in the management of the unbounded transaction. In contrast to [5] , where the authors propose to use per block meta data to ensure the correct execution of the unbounded transaction, we use our TMU to manage and protect the unbounded transaction. Due to a LogTM-style [14] baseline transactional memory system, the authors' proposal is also able to survive interrupting actions performed by the operating system. Contention management strategies were no subject to the authors. The authors of [9] adapted an HTM for embedded systems, focusing on energy consumption and complexity. In this proposal, the authors evaluate different cache structures and three different contention management schemes (eager, lazy, We were able to reduce the number of aborts for most of the benchmarks. Especially the benchmarks, for which a lot of aborts were saved (genome, intruder, etc.), achieved significant speedups, which can be observed in Fig. 2. forced-serial), concerning their complexity and energy consumption. The authors also provide a mechanism to support overflowing transactions (exceeding cache limits) by running them in serial mode. Because the authors are very sensitive for complexity, they only allow a simple execution mode for unbounded transactions. Therefore, running a transaction in serial mode means that all other CPUs get suspended. The overflowed transaction can now run isolated and is able to utilize the complete memory hierarchy. Our approach differs from [9] , because we focus on performance and abort rates. Therefore, we provide a more complex execution mode for unbounded transactions. Furthermore, we offer several different and more complex contention-management strategies. The work describing the most relevant contention management policies focus on software transactional memories (STM) [10, 11, 16, 17] . Later work has applied some of these contention management strategies to HTMs e.g. [15] . In contrast to our work, the authors focused on using HTM as an synchronization primitive for an operating system as well as managing it in the scheduler. This made it necessary to implement a new more complex HTM. The authors focused, concerning the contention management strategies, on finding a well performing policy in most of the cases. This makes sense, since the best working policy is workload dependent, as also mentioned by the authors. In our work, we present a TMU, which is located in the shared L2 cache and costs approximately less than 0.05% of the space of a 2 MB L2 cache. We provide three different contention resolution policies and enable unbounded transactions. In our evaluation, we did not consider the contention management strategy, which enables priorities, because in our perspective it would not produce any interesting results concerning its execution time. The priority strategy will be of more focus in our future work. By our evaluation with the gem5 simulator [4] and the STAMP benchmark suite [6] , we show that the TMU is beneficial for performance and is able to reduce the number of aborted transactions. The work we present in this paper is the foundation to employ several other features. To further increase performance, we also would like to enable threadlevel speculation, which will utilize the TMU to ensure correct execution. Our research also concerns fault tolerance utilizing transactional memory [1] , where we will also consider investigating the use of the TMU. Because safety is a major issue for embedded systems, we want to try to utilize our TMU to enable mixed criticality and real time for hardware transactional memories. Redundant execution on heterogeneous multi-cores utilizing transactional memory Unbounded transactional memory Transactional memory extension (TME) intrinsics The gem5 simulator Making the fast case common and the uncommon case simple in unbounded transactional memory STAMP: stanford transactional applications for multi-processing Unbounded page-based transactional memory Hybrid transactional memory Embedded-TM: energy and complexity-effective hardware transactional memory for embedded multicore systems Polymorphic contention management Toward a theory of transactional contention managers Transactional memory: architectural support for lockfree data structures Intel Corporation: Intel Transactional Synchronization Extensions (Intel TSX) Overview LogTM: logbased transactional memory TxLinux: using and managing hardware transactional memory in an operating system Contention management in dynamic software transactional memory Advanced contention management for dynamic software transactional memory