Chinese Journal of Network and Information Security, 2018, Vol. 4, Issue (6): 1-10. doi: 10.11959/j.issn.2096-109x.2018048
• Review •
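Tuosiyu MING, Hongchang CHEN
Revised:
2018-06-01
Online:
2018-06-15
Published:
2018-08-08
About the authors:
Tuosiyu MING (1994-), male, born in Changsha, Hunan; master's student at the National Digital Switching System Engineering and Technological R&D Center; main research interest: text summarization. Hongchang CHEN (1964-), male, born in Zhengzhou, Henan; professor and doctoral supervisor at the National Digital Switching System Engineering and Technological R&D Center; main research interest: telecommunication network information security.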
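Abstract:
With the explosive growth of information on the Internet, extracting useful information from massive volumes of data has become a key technical problem. Text summarization condenses large amounts of data into concise document information, effectively reducing information overload for users, and has become a research hotspot. This paper classifies and analyzes text summarization methods from recent years, in China and abroad, together with their concrete implementations; compares the advantages and disadvantages of traditional methods and deep learning summarization methods; and offers an outlook on future research directions.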
Tuosiyu MING, Hongchang CHEN. Research progress and trend of text summarization[J]. Chinese Journal of Network and Information Security, 2018, 4(6): 1-10.
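Table 1
Advantages and disadvantages of text summarization methods
Method | Advantages | Disadvantages |
Statistical methods | Rely on surface regularities of the text; simple and intuitive; avoid complex syntactic and grammatical analysis; easy to implement and widely used; need no training data; fast to run | Use only surface-level word features without fully exploiting word-sense relations and semantic features, which is a significant limitation |
Methods based on external semantic resources | Improve on statistical methods by exploiting inter-word and word-sense relations, which raises the semantic quality of summaries to some extent | Heavily constrained by the vocabulary the resource covers; depend strongly on the document title; keyword quality is sensitive to word segmentation; the choice of similarity threshold affects lexical chain construction; the output lacks grammatical and semantic coherence |
Graph ranking methods | Suit loosely structured texts that cover many topics; can account for global relations among words, phrases, or sentences while computing sentence weights; unsupervised and language-independent; need no processing of large corpora | Usually consider only similarity between sentence nodes and ignore document discourse structure and sentence context; the quality of the similarity measure determines whether keywords and sentences are ranked correctly; do not make full use of the data; do not account for information redundancy |
Statistical machine learning methods | Offer a wide choice of features and of classifiers to train; can incorporate open-ended features to improve classification accuracy | Require manually annotated datasets; performance depends strictly on training data quality; supervised or semi-supervised, and slower to run than unsupervised methods |
Deep learning methods | Reduce reliance on manual effort and can be trained efficiently; combine with various neural network architectures and Sequence-to-Sequence models; generated summaries are highly readable and accurate | Poor interpretability; require large manually annotated datasets; complex neural network architectures make them slow and time-consuming to run; place demands on computing hardware |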
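To make the first and third rows of Table 1 concrete, the following is a minimal Python sketch of extractive sentence scoring, not the implementation of any method surveyed in the paper. It contrasts a statistical scorer (mean word frequency) with a graph ranking scorer (a PageRank-style iteration over a sentence similarity graph). All function names, the naive sentence splitter, the log-normalized word-overlap similarity, and the damping factor of 0.85 are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption; real systems use a tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens; no stop-word removal or stemming."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def frequency_scores(sentences):
    """Statistical scoring: mean document-level frequency of a sentence's words."""
    counts = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for s in sentences:
        words = tokenize(s)
        scores.append(sum(counts[w] for w in words) / len(words) if words else 0.0)
    return scores


def overlap_similarity(a, b):
    """Shared distinct words, normalized by the log sizes of both word sets (assumed measure)."""
    wa, wb = set(tokenize(a)), set(tokenize(b))
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # log(1) == 0 would make the denominator degenerate
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def graph_rank_scores(sentences, damping=0.85, iterations=50):
    """Graph ranking: PageRank-style iteration on a weighted sentence graph."""
    n = len(sentences)
    if n == 0:
        return []
    weights = [[overlap_similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbor j passes on its score in proportion to the edge weight j -> i.
            rank = sum(weights[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if out_sum[j] > 0)
            updated.append((1 - damping) / n + damping * rank)
        scores = updated
    return scores


def summarize(text, scorer, k=1):
    """Return the k highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    scores = scorer(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(text, frequency_scores, k=2)` or `summarize(text, graph_rank_scores, k=2)` selects two sentences with either scorer. Neither scorer removes redundant sentences, and the graph scorer depends entirely on the similarity function, which matches the weaknesses Table 1 lists for graph ranking methods.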