A research on GPU transactional memory

doi:10.11959/j.issn.2096-0271.2020029

Abstract

GPUs are one of the most important architectures in parallel computing. However, in scenarios with heavy data contention, programmers often have to design complex parallel schemes. To simplify this process, GPU transactional memory handles the complex data synchronization and parallelism internally and exposes only a simple API. This paper first introduces the research background of GPU transactional memory. It then discusses the designs and strategies proposed for GPU transactional memory in recent years and analyzes the problems that the different designs encounter and their solutions, covering both hardware and software implementations. Finally, it summarizes the current state of GPU transactional memory and gives an outlook on its future development.

Keywords: GPU, transactional memory, parallel computing

CLC number: TP302

Yuzhe LIN (林玉哲), Weihua ZHANG (张为华). A research on GPU transactional memory[J]. Big Data Research (大数据), 2020, 6(4): 3-17.

Figures/Tables (9): Figures 1-8, Table 1


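Aspect	GPU STM	GPU HTM
Implementation	Software	Hardware
Goal	Provide an API that reduces the difficulty of parallel programming	Provide an API that reduces the difficulty of parallel programming
Key issues to address	Resolving the deadlock and livelock caused by SIMT; version management; conflict detection	Making the SIMT stack support rollback; version management; conflict detection
Performance	Limited by the software design	Greater potential, but currently implemented mostly in simulators
Implementation difficulty	Relatively easy	Requires modifying the hardware design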
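
The table above names version management and conflict detection as concerns shared by the software and hardware approaches. The following is a minimal sketch of those two ideas, assuming a word-based design with buffered writes and commit-time validation. It is a single-threaded CPU illustration in Python, not GPU code and not the design of any system surveyed in the paper; every name in it (`VersionedMemory`, `Transaction`, `atomic_increment`, `interfere`) is hypothetical.

```python
"""Toy word-based software transactional memory (single-threaded illustration)."""


class VersionedMemory:
    """Shared memory in which every address carries a version counter."""

    def __init__(self, size):
        self.values = [0] * size
        self.versions = [0] * size


class Transaction:
    """Buffers writes and records the versions it read until commit."""

    def __init__(self, memory):
        self.memory = memory
        self.read_set = {}   # address -> version observed at first read
        self.write_set = {}  # address -> value to publish at commit

    def read(self, addr):
        # A transaction sees its own buffered writes first.
        if addr in self.write_set:
            return self.write_set[addr]
        self.read_set.setdefault(addr, self.memory.versions[addr])
        return self.memory.values[addr]

    def write(self, addr, value):
        self.write_set[addr] = value

    def commit(self):
        # Conflict detection: abort if any address we read has changed since.
        for addr, seen in self.read_set.items():
            if self.memory.versions[addr] != seen:
                return False
        # Version management: publish buffered writes and bump the versions.
        for addr, value in self.write_set.items():
            self.memory.values[addr] = value
            self.memory.versions[addr] += 1
        return True


def atomic_increment(memory, addr, interfere=None):
    """Retry a read-modify-write transaction until it commits.

    `interfere` is an optional callback (an assumption of this sketch) that is
    invoked once, between the read and the commit of the first attempt, to
    simulate a conflicting thread. Returns the number of attempts needed.
    """
    attempts = 0
    while True:
        attempts += 1
        tx = Transaction(memory)
        tx.write(addr, tx.read(addr) + 1)
        if interfere is not None:
            interfere()
            interfere = None
        if tx.commit():
            return attempts


if __name__ == "__main__":
    mem = VersionedMemory(4)
    print(atomic_increment(mem, 0))
    rival = lambda: atomic_increment(mem, 0)
    print(atomic_increment(mem, 0, interfere=rival), mem.values[0])
```

In this sketch the caller only writes the body of the transaction and the retry loop hides the abort, which is the kind of simplification the abstract attributes to a transactional API. A real implementation would additionally have to make the validate-and-publish step in `commit` atomic across threads, and on a GPU it would have to cope with the SIMT deadlock and livelock issues listed in the table; neither is modeled here.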