基于优化反馈的组合在线学习

doi:10.11959/j.issn.2096-0271.2021052

摘要/Abstract

摘要：

组合在线学习问题研究如何在与环境的交互过程中学习未知参数，逐步找到最优的目标组合。该问题有丰富的应用场景，如广告投放、搜索和推荐等。首先阐述了组合在线学习问题的定义及其框架——组合多臂老虎机问题，归纳了此框架下的经典算法和研究进展；然后具体介绍了该问题的两个实际应用——在线影响力最大化和在线排序学习问题，以及其研究进展；最后展望了组合在线学习问题的未来研究方向。

关键词: 组合多臂老虎机, 在线学习, 在线影响力最大化, 在线排序学习

Abstract:

Combinatorial online learning studies how to learn the unknown parameters and gradually find the optimal combination of targets during the interactions with the environment.This problem has a wide range of applications including advertisement placement, searching and recommendation.Firstly, the definition of combinatorial online learning and its general framework – the problem of combinatorial multi-armed bandits were introduced, and its traditional algorithms and research progress were summarized.Then, the related works of two specific applications, online influence maximization and online learning to rank, were introduced.Finally, the prospective directions of further researches on combinatorial online learning were discussed.

Key words: combinatorial multi-armed bandits, online learning, online influence maximization, online learning to rank

中图分类号:

TP181

孔芳, 杨悦然, 陈卫, 李帅. 基于优化反馈的组合在线学习[J]. 大数据, 2021, 7(5): 111-130.

Fang KONG, Yueran YANG, Wei CHEN, Shuai LI. Combinatorial online learning based on optimizing feedbacks[J]. Big Data Research, 2021, 7(5): 111-130.

图/表 1

参考文献 72

[1]	ALON N , CESA-BIANCHI N ,, DEKEL O ,et al. Online learning with feedback graphs:beyond bandits[J]. arXiv preprint,2015,arXiv:1502.07617.
[2]	LYKOURIS T , TARDOS é ,, WALI D . Feedback graph regret bounds for Thompson sampling and UCB[C]// Proceedings of the 31st International Conference on Algorithmic Learning Theory.[S.l.:s.n.], 2020.
[3]	BARTóK G , FOSTER D P , PáL D , ,et al. Partial monitoring-classification,regret bounds,and algorithms[J]. Mathematics of Operations Research, 2014,39(4): 967-997.
[4]	AUER P , CESA-BIANCHI N , FISCHER P . Finite-time analysis of the multiarmed bandit problem[J]. Machine Learning, 2002,47(2): 235-256.
[5]	THOMPSON W R . On the likelihood that one unknown probability exceeds another in view of the evidence of two samples[J]. Biometrika, 1933,25(3/4): 285-294.
[6]	BUBECK S , MUNOS R , STOLTZ G . Pure exploration in multi-armed bandits problems[M]// Lecture Notes in Computer Science. Heidelberg: Springer, 2009: 23-37.
[7]	AUDIBERT J Y , BUBECK S , MUNOS R . Best arm identification in multi-armed bandits[C]// Proceedings of the 23rd Annual Conference on Learning Theory.[S.l.:s.n.], 2010: 41-53.
[8]	KARNIN Z , KOREN T , SOMEKH O . Almost optimal exploration in multiarmed bandits[C]// Proceedings of the International Conference on Machine Learning. New York:ACM Press, 2013: 1238-1246.
[9]	GARIVIER A , KAUFMANN E . Optimal best arm identification with fixed confidence[C]// Proceedings of the 29th Annual Conference on Learning Theory.[S.l.:s.n.], 2016.
[10]	WU Y F , SHARIFF R , LATTIMORE T ,et al. Conservative bandits[C]// Proceedings of the 33rd International Conference on Machine Learning. New York:ACM Press, 2016.
[11]	AUER P , CESA-BIANCHI N ,, FREUND Y ,et al. The nonstochastic multiarmed bandit problem[J]. SIAM Journal on Computing, 2002,32(1): 48-77.
[12]	LI L H , CHU W , LANGFORD J ,et al. A contextual-bandit approach to personalized news article recommendation[C]// Proceedings of the 19th International Conference on World Wide Web. New York:ACM Press, 2010.
[13]	GAI Y , KRISHNAMACHARI B , JAIN R . Learning multiuser channel allocations in cognitive radio networks:a combinatorial multi-armed bandit formulation[C]// Proceedings of 2010 IEEE Symposium on New Frontiers in Dynamic Spectrum. Piscataway:IEEE Press, 2010: 1-9.
[14]	CHEN W , HU W , LI F ,et al. Combinatorial multi-armed bandit with general reward functions[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems.[S.l.:s.n.], 2016.
[15]	WANG Q S , CHEN W . Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. New York:ACM Press, 2017: 1161-1171.
[16]	CHEN W , WANG Y J , YUAN Y . combinatorial multi-armed bandit:general framework and applications[C]// Proceedings of the 30th International Conference on Machine Learning.[S.l.:s.n.], 2013: 151-159.
[17]	GAI Y , KRISHNAMACHARI B , JAIN R . Combinatorial network optimization with unknown variables:multi-armed bandits with linear rewards and individual observations[J]. IEEE/ACM Transactions on Networking, 2012,20(5): 1466-1478.
[18]	CHEN W , WANG Y J , YUAN Y ,et al. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms[J]. Journal of Machine Learning Research, 2016,17(1): 1746-1778.
[19]	ABBASI-YADKORI Y , PáL D , SZEPESVáRI C . Improved algorithms for linear stochastic bandits[C]// Proceedings of the 24th International Conference on Neural Information Processing Systems. New York:ACM Press, 2011: 2312-2320.
[20]	DANI V , HAYES T P , KAKADE S M . Stochastic linear optimization under bandit feedback[Z]. 2008.
[21]	RUSMEVICHIENTONG P , TSITSIKLIS J N . Linearly parameterized bandits[J]. Mathematics of Operations Research, 2010,35(2): 395-411.
[22]	KVETON B , WEN Z , ASHKAN A ,et al. Matroid bandits:fast combinatorial optimization with learning[C]// Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence.[S.l.:s.n.], 2014.
[23]	CHEN S , LIN T , KING I ,et al. Combinatorial pure exploration of multiarmed bandits[C]// Proceedings of the 28th Conference on Neural Information Processing Systems.[S.l.:s.n.], 2014: 379-387.
[24]	GABILLON V , LAZARIC A , GHAVAMZADEH M ,et al. Improved learning complexity in combinatorial pure exploration bandits[C]// Proceedings of the 19th International Conference on Artificial Intelligence and Statistics.[S.l.:s.n.], 2016.
[25]	KUROKI Y , XU L Y , MIYAUCHI A ,et al. Polynomial-time algorithms for multiplearm identification with full-bandit feedback[J]. Neural Computation, 2020,32(9): 1733-1773.
[26]	REJWAN I , MANSOUR Y . Top-k combinatorial bandits with full-bandit feedback[C]// Proceedings of the 31st International Conference on Algorithmic Learning Theory.[S.l.:s.n.], 2020.
[27]	HUANG W R , OK J , LI L ,et al. Combinatorial pure exploration with continuous and separable reward functions and its applications[C]// Proceedings of the 27th International Joint Conference on Artificial Intelligence. New York:ACM Press, 2018.
[28]	DU Y H , KUROKI Y , CHEN W . Combinatorial pure exploration with fullbandit or partial linear feedback[J]. arXiv preprint,2020,arXiv:2006.07905.
[29]	CHEN W , DU Y H , HUANG L B ,et al. Combinatorial pure exploration for dueling bandit[C]// Proceedings of the 37th International Conference on Machine Learning.[S.l.:s.n.], 2020.
[30]	COMBES R , TALEBI M S , PROUTIERE A ,et al. Combinatorial bandits revisited[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. New York:ACM Press, 2015: 2116-2124.
[31]	GARIVIER A , CAPPé O ,, . The KL-UCB algorithm for bounded stochastic bandits and beyond[C]// Proceedings of the 24th Annual Conference on Learning Theory.[S.l.:s.n.], 2011.
[32]	CUVELIER T , COMBES R , GOURDIN E . Statistically efficient,polynomial-time algorithms for combinatorial semibandits[J]. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2021,5(1): 1-31.
[33]	QIN L J , CHEN S Y , ZHU X Y . Contextual combinatorial bandit and its application on diversified online recommendation[C]// Proceedings of the 2014 SIAM International Conference on Data Mining. Philadelphia:Society for Industrial and Applied Mathematics, 2014.
[34]	CHEN L X , XU J , LU Z . Contextual combinatorial multi-armed bandits with volatile arms and submodular reward[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. New York:ACM Press, 2018: 3251-3260.
[35]	ZUO J H , LIU X T , JOE-WONG C , ,et al. Online competitive influence maximization[J]. arXiv preprint,2020,arXiv:2006.13411.
[36]	KOMIYAMA J , HONDA J , NAKAGAWA H . Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays[C]// Proceedings of the 32nd International Conference on International Conference on Machine Learning. New York:ACM Press, 2015: 1152-1161.
[37]	WANG S W , CHEN W . Thompson sampling for combinatorial semi-bandits[C]// Proceedings of the 35th International Conference on Machine Learning.[S.l.:s.n.], 2018.
[38]	WANG Q S , CHEN W . Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. New York:ACM Press, 2017: 1161-1171.
[39]	HüYüK A , TEKIN C . Analysis of Thompson sampling for combinatorial multi-armed bandit with probabilistically triggered arms[C]// Proceedings of Machine Learning Research.[S.l.:s.n.], 2019.
[40]	HüYüK A , TEKIN C . Thompson sampling for combinatorial network optimization in unknown environments[J]. IEEE/ACM Transactions on Networking, 2020,28(6): 2836-2849.
[41]	WANG S W , CHEN W . Thompson sampling for combinatorial semi-bandits[C]// Proceedings of the 35th International Conference on Machine Learning.[S.l.:s.n.], 2018: 5114-5122.
[42]	ZHOU H Z , WANG L D , VARSHNEY L ,et al. A near-optimal change-detection based algorithm for piecewise-stationary combinatorial semi-bandits[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020,34(4): 6933-6940.
[43]	CHEN W , WANG L W , ZHAO H Y ,et al. Combinatorial semi-bandit in the non stationary environment[J]. arXiv preprint,2020,arXiv:2002.03580.
[44]	ZHANG X J , LI S , LIU W W . Contextual combinatorial conservative bandits[J]. arXiv preprint,2019,arXiv:1911.11337.
[45]	CESA-BIANCHI N , LUGOSI G . Combinatorial bandits[J]. Journal of Computer and System Sciences, 2012,78(5): 1404-1422.
[46]	BUBECK S , CESA-BIANCHI N ,, KAKADE S M . Towards minimax policies for online linear optimization with bandit feedback[J]. Journal of Machine Learning Research, 2012,23.
[47]	SAKAUE S , ISHIHATA M , MINATO S I . Practical adversarial combinatorial bandit algorithm via compression of decision sets[J]. arXiv preprint,2017,arXiv:1707.08300.
[48]	LIN T , ABRAHAO B , KLEINBERG R ,et al. Combinatorial partial monitoring game with linear feedback and its applications[C]// Proceedings of the 31st International Conference on Machine Learning.[S.l.:s.n.], 2014.
[49]	CHAUDHURI S , TEWARI A . Phased exploration with greedy exploitation in stochastic combinatorial partial monitoring games[J]. arXiv preprint,2016,arXiv:1608.06403.
[50]	CHEN X Y , ZHENG K , ZHOU Z X ,et al. (Locally) Differentially private combinatorial semi-bandits[C]// Proceedings of the 37th International Conference on Machine Learning.[S.l.:s.n.], 2020.
[51]	CHEN Y W , HOFMANN K.Online learning to rank:absolute vs . relative[C]// Proceedings of the 24th International Conference on World Wide Web. New York:ACM Press, 2015.
[52]	KVETON B , SZEPESVARI C , WEN Z ,et al. Cascading bandits:learning to rank in the cascade model[C]// Proceedings of the 32nd International Conference on Machine Learning. New York:ACM Press, 2015: 767-776.
[53]	KVETON B , WEN Z , ASHKAN A ,et al. Combinatorial cascading bandits[J]. Advances in Neural Information Processing Systems, 2015,28: 1450-1458.
[54]	LI S , WANG B X , ZHANG S Y ,et al. Contextual combinatorial cascading bandits[C]// Proceedings of the 33rd International Conference on International Conference on Machine Learning. New York:ACM Press, 2016: 1245-1253.
[55]	KATARIYA S , KVETON B , SZEPESVáRI C ,, et al . DCM bandits:learning to rank with multiple clicks[C]// Proceedings of the 33rd International Conference on International Conference on Machine Learning. New York:ACM Press, 2016: 1215-1224.
[56]	CAO J Y , SUN W , SHEN Z J M ,et al. Fatigue-aware bandits for dependent click models[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020,34(4): 3341-3348.
[57]	KVETON B , LI C , LATTIMORE T ,et al. BubbleRank:safe online learning to rerank via implicit click feedback[C]// Proceedings of the 35th Uncertainty in Artificial Intelligence Conference.[S.l.:s.n.], 2020.
[58]	ZOGHI M , TUNYS T , GHAVAMZADEH M ,et al. Online learning to rank in stochastic click models[C]// Proceedings of the 34th International Conference on Machine Learning.[S.l.:s.n.], 2017.
[59]	ZHU Z A , CHEN W Z , MINKA T ,et al. A novel click model and its applications to online advertising[C]// Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. New York:ACM Press, 2010.
[60]	LI S , LATTIMORE T , SZEPESVARI C . Online learning to rank with features[C]// Proceedings of the 36th International Conference on Machine Learning.[S.l.:s.n.], 2019.
[61]	LATTIMORE T , KVETON B , LI S ,et al. TopRank:a practical algorithm for online stochastic ranking[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems.[S.l.:s.n.], 2018.
[62]	YUE Y S , GUESTRIN C . Linear submodular bandits and their application to diversified retrieval[C]// Proceedings of the 24th International Conference on Neural Information Processing Systems. New York:ACM Press, 2011: 2483-2491.
[63]	YU B S , FANG M , TAO D C . Linear submodular bandits with a knapsack constraint[C]// Proceedings of the 30th AAAI Conference on Artificial Intelligence. New York:ACM Press, 2016: 1380-1386.
[64]	CHEN L , KRAUSE A , KARBASI A . Interactive submodular bandit[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems.[S.l.:s.n.], 2017.
[65]	TAKEMORI S , SATO M , SONODA T ,et al. Submodular bandit problem under multiple constraints[C]// Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence.[S.l.:s.n.], 2020.
[66]	孔芳, 李奇之, 李帅 . 在线影响力最大化研究综述[J]. 计算机科学, 2020,47(5): 7-13.
	KONG F , LI Q Z , LI S . Survey on online influence maximization[J]. Computer Science, 2020,47(5): 7-13.
[67]	KEMPE D , KLEINBERG J , TARDOS é , . Maximizing the spread of influence through a social network[C]// Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.[S.l.:s.n.], 2003: 137-146.
[68]	WEN Z , KVETON B , VALKO M ,et al. Online influence maximization under independent cascade model with semibandit feedback[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems.[S.l.:s.n.], 2017: 3026-3036.
[69]	WU Q Y , LI Z G , WANG H Z ,et al. Factorization bandits for online influence maximization[C]// Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery ＆ Data Mining. New York:ACM Press, 2019: 636-646.
[70]	VASWANI S , LAKSHMANAN L V S , SCHMIDT M . Influence maximization with bandits[J]. arXiv preprint,2015,arXiv:1503.00024.
[71]	LI S , KONG F , TANG K J ,et al. Online influence maximization under linear threshold model[J]. arXiv preprint,2020,arXiv:2011.06378.
[72]	VASWANI S , KVETON B , WEN Z ,et al. Model-independent online learning for influence maximization[C]// Proceedings of the 34th International Conference on Machine Learning. New York:ACM Press, 2017: 3530-3539.