Big Data Research, 2021, Vol. 7, Issue (1): 76-93. doi: 10.11959/j.issn.2096-0271.2021006
• Special topic: Data-driven intelligent software development •
Jian ZHANG1,2,3, Xiangxin MENG1,2,3, Hailong SUN1,2,3, Xu WANG1,2,3, Xudong LIU1,2,3
Online: 2021-01-15
Published: 2021-01-01
About the author:
Jian ZHANG (1994- ), male, is a Ph.D. candidate at the School of Computer Science and Engineering, Beihang University. His main research interests include software engineering, source code analysis, and natural language …
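Abstract:
Improving the intelligence of software development by mining and exploiting the knowledge contained in software big data has become a hot research topic in software engineering. However, research on software developers and their group collaboration methods has not yet produced systematic results. To address this problem, this work takes the developer community as its object of study, analyzes developers' historical behavior data in depth, investigates key technologies for intelligent collaboration, and builds a supporting environment on that basis. First, massive developer-related data are collected and analyzed. Second, a capability feature model of software developers and a model of their collaboration relationships are presented, and a developer knowledge graph is constructed. Third, with the developer knowledge graph as support, collaborative development methods based on intelligent recommendation are described. Based on these key technologies, the corresponding supporting tools were developed and an intelligent collaborative development environment was built. Finally, future work is discussed.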
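As an illustration only, the idea of recommending collaborators from a developer knowledge graph can be sketched as follows. This is a toy model under stated assumptions, not the system described in the paper: the class `DeveloperGraph`, its methods, the skill weights, and the collaboration bonus of 0.5 are all hypothetical and were chosen for readability.

```python
from collections import defaultdict


class DeveloperGraph:
    """Toy developer knowledge graph (hypothetical, for illustration)."""

    def __init__(self):
        # developer -> {skill: weight}; weights stand in for ability scores
        self.skills = defaultdict(dict)
        # developer -> set of developers they have worked with
        self.collaborators = defaultdict(set)

    def add_skill(self, developer, skill, weight):
        self.skills[developer][skill] = weight

    def add_collaboration(self, a, b):
        # Collaboration is modeled as an undirected edge.
        self.collaborators[a].add(b)
        self.collaborators[b].add(a)

    def recommend(self, required_skills, requester=None, top_k=3):
        """Rank developers by summed skill weight, plus a small bonus
        for developers who already collaborated with the requester."""
        scores = {}
        for dev, owned in self.skills.items():
            if dev == requester:
                continue
            score = sum(owned.get(s, 0.0) for s in required_skills)
            if requester is not None and dev in self.collaborators[requester]:
                score += 0.5  # illustrative bonus, not taken from the paper
            if score > 0:
                scores[dev] = score
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [dev for dev, _ in ranked[:top_k]]


# Small hypothetical graph used to exercise the sketch.
graph = DeveloperGraph()
graph.add_skill("alice", "python", 0.9)
graph.add_skill("bob", "python", 0.6)
graph.add_skill("carol", "java", 0.9)
graph.add_collaboration("dave", "bob")

by_skill_only = graph.recommend(["python"])
with_history = graph.recommend(["python"], requester="dave")
```

In this sketch, ranking by skill alone puts `alice` ahead of `bob`, while taking the requester's collaboration history into account moves `bob` to the top. A production system would presumably learn such weights from behavior data rather than fix them by hand.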
Cite this article:
Jian ZHANG, Xiangxin MENG, Hailong SUN, Xu WANG, Xudong LIU. Data driven intelligent collaboration of software developers[J]. Big Data Research, 2021, 7(1): 76-93.
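Table 3. Main software big data aggregated in the system

| Item | GitHub | Stack Overflow | Topcoder | CSDN | Enterprise data: Neusoft Corporation | Enterprise data: Wonders Information Co., Ltd. |
|---|---|---|---|---|---|---|
| Data type | Development process data | Q&A data | Crowdsourced development data | Blogs, forums, Q&A | Development process data (GitLab) | Development process data (GitLab) |
| Data collection | API + direct download | Direct download | API | Web crawler | Real-time collection via API | Real-time collection via API |
| Data storage | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB |
| Data volume: raw data | 12.9 TB | 323 GB | 8.1 GB | 7.6 GB | N/A | N/A |
| Data volume: resource counts | Developers: 32.41 million; projects: 125 million; commits: 1.37 billion | Users: 10.53 million; questions: 17.74 million; answers: 27.11 million; tags: 55,000 | Developers: 40,500; projects: 3,900; submissions: 86,900; challenges: 37,300 | Users: 987,000; blogs: 1.045 million; Q&A: 157,000; forum posts: 579,000 | Code commits: 30,000; behavior records: 200 million+; projects: 280+; developers: 130+; issues: 15,000 | Developers: 296; projects: 978 |
| Knowledge graph | Nodes: 2.24 billion; relations: 7.45 billion | Nodes: 160 million; relations: 590 million | Nodes: 78,000; relations: 18.29 million | Nodes: 7.06 million; relations: 12.94 million | N/A | Nodes: 12,000; relations: 37,000 |
| Types of processed data | Developer capability attributes, developer collaboration relationships, development resources (all sources) | | | | | |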