Big Data Research ›› 2021, Vol. 7 ›› Issue (1): 76-93. doi: 10.11959/j.issn.2096-0271.2021006
Jian ZHANG1,2,3, Xiangxin MENG1,2,3, Hailong SUN1,2,3, Xu WANG1,2,3, Xudong LIU1,2,3
Online: 2021-01-15
Published: 2021-01-01
About the author:
Jian ZHANG (1994- ), male, is a PhD candidate at the School of Computer Science and Engineering, Beihang University. His main research interests include software engineering, source code analysis, and natural language …
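Abstract:
Mining and exploiting the knowledge contained in software big data to make software development more intelligent has become a hot research topic in software engineering. However, research on software developers and on how they collaborate as a group has not yet produced systematic results. To address this problem, this work takes the developer community as its research object, analyzes developers' historical behavior data in depth, studies the key technologies for intelligent collaboration, and builds a corresponding supporting environment on that basis. First, massive developer-related data was collected and analyzed. Second, a capability model of software developers and a model of their collaboration relationships were proposed, and a developer knowledge graph was constructed. Third, collaborative development methods based on intelligent recommendation, supported by the developer knowledge graph, were described. Building on these key technologies, supporting tools were developed and an intelligent collaborative development environment was built. Finally, future work was discussed.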
Jian ZHANG, Xiangxin MENG, Hailong SUN, Xu WANG, Xudong LIU. Data driven intelligent collaboration of software developers[J]. Big Data Research, 2021, 7(1): 76-93.
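Table 3
Main software big data aggregated in the system
Item | GitHub | Stack Overflow | Topcoder | CSDN | Enterprise data: Neusoft Corporation | Enterprise data: Wonders Information Co., Ltd.
Data type | Development process data | Q&A data | Crowdsourced development data | Blogs, forums, Q&A | Development process data (GitLab) | Development process data (GitLab)
Data collection | API + direct download | Direct download | API | Crawler | Real-time collection via API | Real-time collection via API
Data storage | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB | MySQL, Neo4j, MongoDB
Data volume: raw data | 12.9 TB | 323 GB | 8.1 GB | 7.6 GB | N/A | N/A
Data volume: resource counts | Developers: 32.41 million; projects: 125 million; commits: 1.37 billion | Users: 10.53 million; questions: 17.74 million; answers: 27.11 million; tags: 55,000 | Developers: 40,500; projects: 3,900; submissions: 86,900; challenges: 37,300 | Users: 987,000; blogs: 1.045 million; Q&A: 157,000; forum posts: 579,000 | Code commits: 30,000; behavior records: 200 million+; projects: 280+; developers: 130+; issues: 15,000 | Developers: 296; projects: 978
Knowledge graph | Nodes: 2.24 billion; relations: 7.45 billion | Nodes: 160 million; relations: 590 million | Nodes: 78,000; relations: 18.29 million | Nodes: 7.06 million; relations: 12.94 million | N/A | Nodes: 12,000; relations: 37,000
Processed data types (all sources) | Developer capability attributes, developer collaboration relationships, development resources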
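Table 3 reports node and relation counts for the developer knowledge graph and lists developer capability attributes and collaboration relationships among the processed data. As a rough illustration only, the Python sketch below builds a toy in-memory graph of that kind and ranks candidate collaborators by skill overlap, with a small bonus for prior collaboration. The class name `DeveloperGraph`, its methods, the scoring rule, and the sample developers are all hypothetical; none of them come from the article, whose table lists MySQL, Neo4j, and MongoDB as storage and does not specify a recommendation algorithm.

```python
from collections import defaultdict


class DeveloperGraph:
    """Toy developer knowledge graph (hypothetical schema, not the article's)."""

    def __init__(self):
        self.skills = defaultdict(set)         # developer -> skill tags
        self.collaborators = defaultdict(set)  # developer -> past co-workers

    def add_skill(self, developer, skill):
        """Record a capability attribute for a developer."""
        self.skills[developer].add(skill)

    def add_collaboration(self, a, b):
        """Record an undirected collaboration relationship."""
        self.collaborators[a].add(b)
        self.collaborators[b].add(a)

    def recommend(self, required_skills, near=None, top_k=3):
        """Rank developers by skill overlap with the task.

        Assumed scoring rule: one point per matching skill, plus 0.5 if the
        candidate has already collaborated with `near`. Ties break by name.
        """
        required = set(required_skills)
        known = self.collaborators.get(near, set())
        scored = []
        for dev, skills in self.skills.items():
            if dev == near:
                continue  # never recommend the requester to themselves
            overlap = len(skills & required)
            if overlap == 0:
                continue  # no relevant skill at all
            bonus = 0.5 if dev in known else 0.0
            scored.append((overlap + bonus, dev))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [dev for _, dev in scored[:top_k]]


if __name__ == "__main__":
    graph = DeveloperGraph()
    for dev, skill in [("alice", "python"), ("alice", "neo4j"),
                       ("bob", "python"), ("carol", "java"),
                       ("dave", "python"), ("dave", "neo4j")]:
        graph.add_skill(dev, skill)
    graph.add_collaboration("alice", "bob")
    print(graph.recommend(["python", "neo4j"], near="bob"))
```

A graph database such as Neo4j would replace the two dictionaries at realistic scale; the dictionaries are used here only to keep the sketch self-contained.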