Journal on Communications ›› 2023, Vol. 44 ›› Issue (2): 160-171. doi: 10.11959/j.issn.1000-436x.2023009

• Academic paper •

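  • About the authors: Cheng ZHONG (钟诚) (1964- ), male, born in Guiping, Guangxi, Ph.D., professor and doctoral supervisor at Guangxi University. His main research interests are parallel and distributed computing, bioinformatics computing, and network information security.
    Hui SUN (孙辉) (1997- ), male, born in Yinchuan, Ningxia, master's student at Guangxi University. His main research interests are parallel computing, computational biology, and biological information security.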

Parallel algorithm for sensitive sequence recognition from long-read genome data with high error rate

Cheng ZHONG1,2, Hui SUN1,2   

  1. School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China
  2. Key Laboratory of Parallel, Distributed and Intelligent Computing of Guangxi Universities and Colleges, Nanning 530004, China
  • Revised: 2022-11-11 Online: 2023-02-25 Published: 2023-02-01
  • Supported by:
    The National Natural Science Foundation of China (61962004); Innovation Project of Guangxi Graduate Education (YCSW2021020)


Abstract:

To address the difficulty that existing algorithms have in effectively identifying sensitive sequences in long-read genomic data with a high error rate, a recognition algorithm based on cooperative CPU and GPU parallel computing, called CGPU-F3SR, was proposed. First, each long read in the genomic data was partitioned into multiple short reads, and a Bloom filtering mechanism was introduced to avoid repeated computation on the partitioned short reads. Second, a k-mer coding strategy was used to extract the error information of all short reads in parallel, and an improved sequence similarity calculation model was proposed to raise recognition accuracy. Finally, the CPU and GPU cooperated in parallel to accelerate the short-read similarity calculation and thus improve recognition efficiency. As a result, two types of sensitive sequences, short tandem repeats and disease-related sequences, could be identified efficiently and accurately in long-read genomic data with a high error rate. Experimental results on recognizing sensitive sequences in long-read genomic data with read lengths of 100~400 kbp show that, compared with other parallel algorithms of the same kind, the proposed CPU/GPU parallel algorithm CGPU-F3SR increases recognition accuracy and precision by 7.77% and 43.07% on average, respectively, reduces the false positive rate by 7.41% on average, and increases recognition throughput by 2.44 times on average.
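
The abstract does not define its evaluation metrics. Assuming the usual confusion-matrix definitions apply (TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives over the classified sequences; these symbols are an assumption, not taken from the paper), accuracy, precision, and false positive rate would read:

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Precision} &= \frac{TP}{TP + FP}, \\
\text{FPR}       &= \frac{FP}{FP + TN}.
\end{align}
```

Under these definitions, a gain in precision together with a drop in false positive rate both point to fewer non-sensitive sequences being flagged as sensitive.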

Key words: sensitive sequence recognition, filtering, similarity calculation, sequence alignment, parallel computing
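
The first stages outlined in the abstract (partitioning a long read, Bloom filtering of the pieces, and k-mer coding) can be illustrated with a minimal single-threaded Python sketch. It is not the CGPU-F3SR implementation: the segment length, the Bloom filter parameters, the 2-bit base coding, and the Jaccard score are all assumptions chosen for illustration, and the Jaccard score merely stands in for the paper's improved similarity model, which the abstract does not specify. All function and class names are hypothetical.

```python
import hashlib

# Assumed 2-bit coding of the four bases; other symbols are treated as errors.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


class BloomFilter:
    """Tiny Bloom filter over strings; the sizes are illustrative assumptions."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def split_read(long_read, segment_len):
    """Cut a long read into consecutive segments; the last may be shorter."""
    return [
        long_read[i:i + segment_len] for i in range(0, len(long_read), segment_len)
    ]


def unique_segments(segments, bloom):
    """Keep only segments the Bloom filter has not seen, so each is scored once."""
    fresh = []
    for seg in segments:
        if seg not in bloom:
            bloom.add(seg)
            fresh.append(seg)
    return fresh


def kmer_codes(seq, k):
    """Set of integer codes of all k-mers in seq; k-mers with non-ACGT symbols are skipped."""
    codes = set()
    for i in range(len(seq) - k + 1):
        value = 0
        for base in seq[i:i + k]:
            if base not in CODE:
                break
            value = (value << 2) | CODE[base]
        else:
            codes.add(value)
    return codes


def jaccard(a, b):
    """Jaccard similarity of two k-mer code sets (stand-in similarity score)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    segments = split_read("ACGTACGTAC", 4)
    fresh = unique_segments(segments, BloomFilter())
    print(segments, fresh)
    print(jaccard(kmer_codes("ACGT", 2), kmer_codes("ACGA", 2)))
```

A Bloom filter can report false positives, so a genuinely new segment may occasionally be skipped; the filter size and number of hash functions trade memory against that risk. In a CPU/GPU setting, the per-segment k-mer coding and similarity scoring are the natural candidates for parallel execution, since each segment can be processed independently.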
