
Combinatorial online learning based on optimizing feedbacks

KONG Fang1, YANG Yueran1, CHEN Wei2, LI Shuai1

1 John Hopcroft Center for Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China

2 Microsoft Research Asia, Beijing 100080, China


Online: 2021-09-15

Fund supported: The National Natural Science Foundation of China (62006151), The National Natural Science Foundation of China (62076161), Shanghai Sailing Program

Abstract

Combinatorial online learning studies how to learn unknown parameters and gradually find the optimal combination of targets during interactions with the environment. The problem has a wide range of applications, including advertisement placement, search, and recommendation. This survey first introduces the definition of combinatorial online learning and its general framework, the combinatorial multi-armed bandit problem, and summarizes its classical algorithms and research progress. It then reviews work on two specific applications, online influence maximization and online learning to rank. Finally, promising directions for further research on combinatorial online learning are discussed.

Keywords: combinatorial multi-armed bandits; online learning; online influence maximization; online learning to rank

KONG Fang, YANG Yueran, CHEN Wei, LI Shuai. Combinatorial online learning based on optimizing feedbacks. Big Data Research[J], 2021, 7(5): 2021052. DOI: 10.11959/j.issn.2096-0271.2021052

2 The combinatorial online learning problem

2.1 The multi-armed bandit problem

$R(T)=\mathbb{E}\left[\sum_{t=1}^{T}\left(\mu^{*}-X_{A_t,t}\right)\right]$ (1)
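As a concrete illustration of the regret in equation (1), the following sketch (not from the survey; the arm means, horizon, and the classical UCB1 index are assumptions made here for illustration) simulates a bandit run on Bernoulli arms and accumulates the expected per-round regret $\mu^{*}-\mu_{A_t}$:

```python
import math
import random

def ucb1_regret(mu, T, seed=0):
    """Simulate UCB1 on Bernoulli arms and return the cumulative regret R(T).

    mu: true arm means (unknown to the learner); T: horizon.
    """
    rng = random.Random(seed)
    m = len(mu)
    counts = [0] * m      # T_{i,t}: number of pulls of arm i so far
    means = [0.0] * m     # empirical means \hat{mu}_i
    best = max(mu)        # mu^*
    regret = 0.0
    for t in range(1, T + 1):
        if t <= m:
            i = t - 1     # initialization: pull each arm once
        else:
            # UCB index: \hat{mu}_i + sqrt(2 ln t / T_i)
            i = max(range(m),
                    key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        x = 1.0 if rng.random() < mu[i] else 0.0   # Bernoulli reward of arm i
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]     # incremental mean update
        regret += best - mu[i]                     # expected per-round regret
    return regret

# Regret grows sublinearly (roughly logarithmically for fixed gaps).
r = ucb1_regret([0.5, 0.4, 0.3], T=10000)
```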

2.2 The combinatorial multi-armed bandit problem

Gai Y et al. [13] were the first to model the multi-user channel allocation problem as a combinatorial multi-armed bandit. In this model, each arm is a combination of multiple components; the rewards of pulling the same arm are independent over time, but the rewards of different arms are dependent because they share some components. The framework was subsequently further generalized and systematically formulated [14,15,16,17].

● Full-information feedback: this feedback type assumes that, after pulling a super arm, the player observes the outputs of all base arms, regardless of whether a base arm is contained in the pulled super arm. This is a highly idealized setting that rarely arises in practice.

● Semi-bandit feedback: since full-information feedback is idealized, the more common setting in practice is that the player only observes the feedback of the base arms contained in the chosen super arm, i.e., semi-bandit feedback. For example, in search ranking, the platform can observe the user's satisfaction with each listed item.

● Bandit feedback: this feedback type assumes the player only observes the total reward of the pulled super arm and no information about individual base arms. In recommendation, for instance, a user's click indicates approval of the recommended bundle as a whole, and the platform often cannot tell how satisfied the user is with any specific product.

$R(T)=T\cdot \alpha \beta \cdot \text{OPT}_{\mu }-\mathbb{E}\left[\sum_{t=1}^{T}r_{\mu }\left(S_{t}\right)\right]$ (2)

When α = β = 1, the offline algorithm returns the exact optimal solution, and the (α, β)-approximation cumulative expected regret coincides with the true cumulative expected regret.
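Equation (2) translates directly into code; the following hypothetical helper (names are illustrative, not from the paper) computes the (α, β)-approximation regret from a run's realized rewards:

```python
def approx_regret(T, alpha, beta, opt, rewards):
    """(alpha, beta)-approximation regret of equation (2): compare the
    realized rewards r_mu(S_t) against alpha * beta times the optimal
    expected per-round reward OPT_mu over the horizon T."""
    return T * alpha * beta * opt - sum(rewards)
```

With α = β = 1 this reduces to the usual regret against the exact optimum.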

3 Combinatorial online learning algorithms based on optimizing feedbacks

Gai Y et al. [17] were the first to consider the case where the reward in a combinatorial multi-armed bandit is linear, and proposed the learning with linear reward (LLR) algorithm to solve it, outlined below.

For i = 1 to m
    t = t + 1
    Play an arbitrary super arm containing base arm i; update the empirical means $\hat{\mu}_i$ and pull counts $T_{i,t}$ of the observed base arms
End for
While True do
    t = t + 1
    Play the super arm $a=\arg\max_{a}\sum_{i\in A_a}a_i\left(\hat{\mu}_i+\sqrt{\frac{(L+1)\ln t}{T_{i,t}}}\right)$
    Update the empirical means and pull counts of the observed base arms
End while

The LLR algorithm uses the UCB idea to balance exploitation and exploration. In each round, the learner may choose a super arm containing at most L base arms, where L is a hyperparameter. Each super arm $A_a$ is represented as a weighted combination of all base arms; the weight vector is an m-dimensional vector $a$, with $a_i\ge 0$ denoting the weight of base arm i in the super arm, and $A_a$ contains exactly the base arms i with $a_i\neq 0$. The upper confidence bound of a super arm is the weighted sum of the upper confidence bounds of the base arms it contains, and in each round the algorithm selects the super arm with the largest upper confidence bound. This framework covers a rich set of application scenarios, such as finding maximum matchings, computing shortest paths, and computing minimum spanning trees.
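A minimal Python sketch of the LLR selection rule described above, under the simplifying assumption that the candidate super arms are given explicitly as weight vectors (in applications an offline oracle would optimize over them instead):

```python
import math

def llr_step(super_arms, mu_hat, counts, t, L):
    """One selection step of the LLR index rule (illustrative sketch).

    super_arms: list of m-dimensional weight vectors a (a_i >= 0); a base
    arm i belongs to a super arm iff its weight a_i is nonzero.
    mu_hat / counts: empirical means and pull counts of the m base arms.
    Returns the index of the super arm maximizing the weighted UCB sum.
    """
    def index(a):
        s = 0.0
        for i, w in enumerate(a):
            if w > 0:  # base arm i is contained in this super arm
                s += w * (mu_hat[i] + math.sqrt((L + 1) * math.log(t) / counts[i]))
        return s
    return max(range(len(super_arms)), key=lambda k: index(super_arms[k]))
```

After playing the chosen super arm, the learner observes each contained base arm's output (semi-bandit feedback) and updates `mu_hat` and `counts` accordingly.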

While True do
    $t\leftarrow t+1$
    $\overline{\mu}_i=\min\left\{\hat{\mu}_i+\sqrt{3\ln t/\left(2T_i\right)},1\right\}$ for every base arm i
    $S=\text{Oracle}\left(\overline{\mu}_1,\overline{\mu}_2,\cdots ,\overline{\mu}_m\right)$
    Play S; update the empirical means $\hat{\mu}_i$ and counts $T_i$ of the observed base arms
End while
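The loop above can be sketched in Python as a single round of this UCB-style rule, with the offline oracle passed in as a function (the function and variable names here are illustrative, not from the papers):

```python
import math

def cucb_round(t, mu_hat, counts, oracle):
    """One round of the UCB loop above (sketch): inflate each base arm's
    empirical mean by its confidence radius, clip to [0, 1], and hand the
    optimistic vector to the offline oracle, which returns a super arm S_t.
    """
    m = len(mu_hat)
    mu_bar = [min(mu_hat[i] + math.sqrt(3 * math.log(t) / (2 * counts[i])), 1.0)
              for i in range(m)]
    return oracle(mu_bar)  # super arm chosen by the (alpha, beta)-oracle
```

For example, when the offline problem is "pick the best K base arms", the oracle is simply a top-K selection over the optimistic means.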

Komiyama J et al. [36] studied using TS to solve the combinatorial multi-armed bandit problem in which the reward of a super arm is the sum of the outputs of its base arms and the super arm size is fixed to K, and proposed the multiple-play Thompson sampling (MP-TS) algorithm. The work considers base arms whose outputs are Bernoulli random variables. The algorithm maintains a Beta distribution Beta($A_i$, $B_i$) for each base arm i ∈ [m]; in each round it samples a random variable $\theta_i$ from Beta($A_i$, $B_i$) for every base arm i, sorts all base arms by these samples in decreasing order, selects the top K base arms as the super arm for the round, and then updates the corresponding Beta distributions according to the observed outputs of these base arms. MP-TS achieves a cumulative regret upper bound of $O\left(\sum_{i\in\left[m\right]\setminus\left[K\right]}\frac{\Delta_{i,K}\log T}{d\left(\mu_i,\mu_K\right)}\right)$, where $d(\mu_i,\mu_K)$ is the KL divergence between the expected rewards of a suboptimal base arm i and the optimal arm K; this upper bound matches the optimal regret for this class of problems.

While True do
    t ← t + 1
    Sample $\theta_i\left(t\right)\sim\text{Beta}\left(A_i,B_i\right)$ for each base arm i, forming $\theta\left(t\right)=\left(\theta_1\left(t\right),\cdots ,\theta_m\left(t\right)\right)$
    Play the super arm consisting of the K base arms with the largest $\theta_i\left(t\right)$; update the corresponding Beta distributions with the observed outputs
End while
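A Python sketch of one MP-TS round and its posterior update, assuming Bernoulli base arms as in the original setting (function names are illustrative):

```python
import random

def mp_ts_round(A, B, K, rng=random):
    """One round of MP-TS (sketch): sample theta_i ~ Beta(A_i, B_i) for
    every base arm and play the K arms with the largest samples.

    A, B: Beta posterior parameters per base arm.
    Returns the chosen super arm as a list of K base-arm indices.
    """
    m = len(A)
    theta = [rng.betavariate(A[i], B[i]) for i in range(m)]
    return sorted(range(m), key=lambda i: -theta[i])[:K]

def mp_ts_update(A, B, chosen, outcomes):
    """Posterior update: add each played arm's Bernoulli outcome."""
    for i, x in zip(chosen, outcomes):
        A[i] += x          # success count
        B[i] += 1 - x      # failure count
```

As the posteriors concentrate, the sampled values track the true means and the top-K selection converges to the optimal super arm.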

4 Applications

4.1 Online learning to rank

Zhu Z A et al. [59] dropped specific model assumptions and, motivated by real recommendation click scenarios, proposed the more general generalized click model, which covers all three models above as special cases. The generalized click model assumes that the probability of a user examining an item decreases as the item's position in the recommended list moves down. The probability of clicking an item depends on the item's attractiveness, and the probability of continuing to browse depends on whether the current item was clicked. The user's browsing and clicking behavior over the list can be described by the Bayesian network structure shown in Figure 1, where $A_i$ denotes the event that the user clicks the i-th item and continues browsing, $B_i$ that the user does not click the i-th item and continues browsing, $E_i$ that the user examines the i-th item, $C_i$ that the user clicks the i-th item, and $R_i$ denotes the relevance/attractiveness of the i-th item. That is, the user's examination behavior and the item's attractiveness jointly determine the click on that item, and the probability of examining the (i+1)-th item depends on the distribution of the user's actions on the i-th item.
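The browsing process just described can be sketched as a small simulator. The parameters `p_cont_click` and `p_cont_noclick` (continuation probabilities after a click and after no click) are an illustrative simplification of the generalized click model, in which these probabilities may depend on the item:

```python
import random

def simulate_clicks(attract, p_cont_click, p_cont_noclick, rng=random):
    """Simulate one user session under a generalized-click-model-style
    process (assumption: examination proceeds top-down; after examining
    item i the user clicks with probability attract[i], then continues
    to the next position with a probability depending on the click).

    Returns the list of 0/1 click indicators for the ranked list.
    """
    clicks = []
    examining = True
    for a in attract:
        if not examining:
            clicks.append(0)       # never examined, hence never clicked
            continue
        c = 1 if rng.random() < a else 0   # C_i: click given examination E_i
        clicks.append(c)
        cont = p_cont_click if c else p_cont_noclick
        examining = rng.random() < cont    # E_{i+1} depends on (E_i, C_i)
    return clicks
```

Setting `p_cont_click = 0` and `p_cont_noclick = 1` recovers the cascade model, where the user stops at the first click.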

4.2 Online influence maximization

Chen W et al. [14-15] first modeled online influence maximization under the IC model as a combinatorial multi-armed bandit with probabilistically triggered arms (CMAB-T). In this model, each edge of the social network is a base arm, the chosen seed set is the learner's action, and the set of all out-edges of the seed set is the super arm. The expected reward of a base arm is the weight of the corresponding edge. As information propagates through the network, more and more edges are randomly triggered, and the weights on them can be observed. Under this model, the CUCB algorithm [14-15] can be applied to the problem. By proving that online influence maximization under the IC model satisfies the triggering probability modulated (TPM) bounded smoothness condition, CUCB achieves a cumulative regret upper bound of $\tilde{O}\left(nm\sqrt{T}\right)$, where n and m denote the numbers of nodes and edges in the network, respectively [15].
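For intuition, one IC-model diffusion can be sketched as a Monte-Carlo simulation (an illustrative helper, not part of CUCB itself); the edges that each newly activated node attempts are exactly the probabilistically triggered base arms whose outcomes become observable:

```python
import random

def ic_spread(edges, weights, seeds, rng=random):
    """One diffusion under the independent cascade (IC) model (sketch):
    each newly activated node tries each of its outgoing edges once,
    succeeding independently with that edge's weight.

    edges: dict node -> list of out-neighbors; weights: dict (u, v) -> prob.
    Returns the set of activated nodes (the influence of the seed set).
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in edges.get(u, []):
                if v not in active and rng.random() < weights[(u, v)]:
                    active.add(v)   # edge (u, v) fires and activates v
                    nxt.append(v)
        frontier = nxt
    return active
```

Averaging `len(ic_spread(...))` over many runs estimates the expected influence of a seed set.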

The LT model captures correlations in information diffusion, which makes the theoretical analysis of online influence maximization more challenging. Not until 2020 did Li S et al. [71] study online influence maximization under the LT model and give the first theoretical results. The work assumes node-level feedback, i.e., the activation status of every node at each step of the diffusion can be observed, and based on this designs the LT-LinUCB algorithm, which achieves a cumulative regret upper bound of $\tilde{O}\left(n^{7/2}m^{1/2}\sqrt{T}\right)$. In addition, Li S et al. proposed the OIM-ETC (OIM-explore then commit) algorithm, which solves online influence maximization under both the IC and LT models and achieves a cumulative regret upper bound of $\tilde{O}\left(\left(nm\right)^{4/3}T^{2/3}\right)$. Vaswani S et al. [72] also considered a new feedback type, namely the reachability between any two nodes, and designed the DILinUCB (diffusion-independent LinUCB) algorithm based on it. The algorithm applies to both the IC and LT models, but its surrogate approximate objective does not guarantee rigorous theoretical results.
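For comparison with the IC simulation, an LT-model diffusion can be sketched similarly (an illustrative helper, not the LT-LinUCB algorithm itself; thresholds are drawn uniformly as in the standard LT model):

```python
import random

def lt_spread(in_edges, weights, seeds, rng=random):
    """One diffusion under the linear threshold (LT) model (sketch): each
    node draws a threshold uniformly from [0, 1] and activates once the
    total weight of its active in-neighbors reaches that threshold.

    in_edges: dict node -> list of in-neighbors; weights: dict (u, v) ->
    weight of edge u -> v (each node's in-weights should sum to at most 1).
    """
    thresh = {v: rng.random() for v in in_edges}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, preds in in_edges.items():
            if v in active:
                continue
            if sum(weights[(u, v)] for u in preds if u in active) >= thresh[v]:
                active.add(v)   # accumulated influence passed v's threshold
                changed = True
    return active
```

Unlike IC, activation here depends on the joint weight of all active in-neighbors, which is the correlation that complicates the online analysis.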

5 Future research directions

● Designing effective algorithms for specific application scenarios. Taking online influence maximization as an example, most existing work develops general-purpose algorithms whose cumulative regret upper bounds are derived from the TPM condition. Real applications, however, involve many more details worth attention, such as users' clicking habits and user fatigue in online learning to rank, or the patterns of information diffusion in online influence maximization. General algorithms can rarely account for all these details and fit a concrete problem setting, so designing effective algorithms for specific problems, and proving cumulative regret lower bounds for them, remain worthwhile research questions.

● Combining combinatorial online learning with reinforcement learning. The MAB problem is a special case of reinforcement learning, and CMAB combines combinatorial optimization with MAB. A further question is whether this combination can be extended to general reinforcement learning, exploring a broader and more general combinatorial online learning framework that covers more real application scenarios.

● Studying combinatorial online learning with delayed and batched feedback. Real applications often exhibit more complex feedback patterns. In online learning to rank, a user may not immediately visit an item of interest, creating a time lag between the player's action and the feedback, i.e., delayed feedback; sometimes an advertising provider only collects data periodically in batches, i.e., batched feedback. For MAB, algorithms for both feedback patterns have made considerable progress, while for CMAB, algorithms that handle these complex feedback forms still need to be explored.

● Deepening the study of distributed CMAB. Consider multiple players simultaneously choosing super arms with the same objective: the players can communicate, but the information each player can share per round is limited; each player makes choices based on its own observations, receives its own independent rewards, and eventually each player selects the super arm it believes to be optimal. Distributed exploration methods have seen applications and breakthroughs in MAB, but the CMAB setting still requires much deeper study.

● Finding more practical application scenarios and experimenting on real data sets. The data sets to which existing algorithms can be applied are still limited, and most current work experiments on synthetic data. Applying the algorithms to more real data sets and solving more practical problems is a research direction of significant practical importance.

References

ALON N , CESA-BIANCHI N , DEKEL O ,et al.

Online learning with feedback graphs:beyond bandits

[J]. arXiv preprint,2015,arXiv:1502.07617.

LYKOURIS T , TARDOS É , WALI D .

Feedback graph regret bounds for Thompson sampling and UCB

[C]// Proceedings of the 31st International Conference on Algorithmic Learning Theory.[S.l.:s.n.], 2020.

BARTÓK G , FOSTER D P , PÁL D ,et al.

Partial monitoring-classification,regret bounds,and algorithms

[J]. Mathematics of Operations Research, 2014,39(4): 967-997.

AUER P , CESA-BIANCHI N , FISCHER P .

Finite-time analysis of the multiarmed bandit problem

[J]. Machine Learning, 2002,47(2): 235-256.

THOMPSON W R .

On the likelihood that one unknown probability exceeds another in view of the evidence of two samples

[J]. Biometrika, 1933,25(3/4): 285-294.

BUBECK S , MUNOS R , STOLTZ G .

Pure exploration in multi-armed bandits problems

[M]// Lecture Notes in Computer Science. Heidelberg: Springer, 2009: 23-37.

AUDIBERT J Y , BUBECK S , MUNOS R .

Best arm identification in multi-armed bandits

[C]// Proceedings of the 23rd Annual Conference on Learning Theory.[S.l.:s.n.], 2010: 41-53.

KARNIN Z , KOREN T , SOMEKH O .

Almost optimal exploration in multiarmed bandits

[C]// Proceedings of the International Conference on Machine Learning. New York:ACM Press, 2013: 1238-1246.

GARIVIER A , KAUFMANN E .

Optimal best arm identification with fixed confidence

[C]// Proceedings of the 29th Annual Conference on Learning Theory.[S.l.:s.n.], 2016.

WU Y F , SHARIFF R , LATTIMORE T ,et al.

Conservative bandits

[C]// Proceedings of the 33rd International Conference on Machine Learning. New York:ACM Press, 2016.

AUER P , CESA-BIANCHI N , FREUND Y ,et al.

The nonstochastic multiarmed bandit problem

[J]. SIAM Journal on Computing, 2002,32(1): 48-77.

LI L H , CHU W , LANGFORD J ,et al.

A contextual-bandit approach to personalized news article recommendation

[C]// Proceedings of the 19th International Conference on World Wide Web. New York:ACM Press, 2010.

GAI Y , KRISHNAMACHARI B , JAIN R .

Learning multiuser channel allocations in cognitive radio networks:a combinatorial multi-armed bandit formulation

[C]// Proceedings of 2010 IEEE Symposium on New Frontiers in Dynamic Spectrum. Piscataway:IEEE Press, 2010: 1-9.

CHEN W , HU W , LI F ,et al.

Combinatorial multi-armed bandit with general reward functions

[C]// Proceedings of the 30th International Conference on Neural Information Processing Systems.[S.l.:s.n.], 2016.

WANG Q S , CHEN W .

Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications

[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. New York:ACM Press, 2017: 1161-1171.

CHEN W , WANG Y J , YUAN Y .

Combinatorial multi-armed bandit:general framework and applications

[C]// Proceedings of the 30th International Conference on Machine Learning.[S.l.:s.n.], 2013: 151-159.

GAI Y , KRISHNAMACHARI B , JAIN R .

Combinatorial network optimization with unknown variables:multi-armed bandits with linear rewards and individual observations

[J]. IEEE/ACM Transactions on Networking, 2012,20(5): 1466-1478.

CHEN W , WANG Y J , YUAN Y ,et al.

Combinatorial multi-armed bandit and its extension to probabilistically triggered arms

[J]. Journal of Machine Learning Research, 2016,17(1): 1746-1778.

ABBASI-YADKORI Y , PÁL D , SZEPESVÁRI C .

Improved algorithms for linear stochastic bandits

[C]// Proceedings of the 24th International Conference on Neural Information Processing Systems. New York:ACM Press, 2011: 2312-2320.

DANI V , HAYES T P , KAKADE S M .

Stochastic linear optimization under bandit feedback

[Z]. 2008.

RUSMEVICHIENTONG P , TSITSIKLIS J N .

Linearly parameterized bandits

[J]. Mathematics of Operations Research, 2010,35(2): 395-411.

KVETON B , WEN Z , ASHKAN A ,et al.

Matroid bandits:fast combinatorial optimization with learning

[C]// Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence.[S.l.:s.n.], 2014.

CHEN S , LIN T , KING I ,et al.

Combinatorial pure exploration of multiarmed bandits

[C]// Proceedings of the 28th Conference on Neural Information Processing Systems.[S.l.:s.n.], 2014: 379-387.

GABILLON V , LAZARIC A , GHAVAMZADEH M ,et al.

Improved learning complexity in combinatorial pure exploration bandits

[C]// Proceedings of the 19th International Conference on Artificial Intelligence and Statistics.[S.l.:s.n.], 2016.

KUROKI Y , XU L Y , MIYAUCHI A ,et al.

Polynomial-time algorithms for multiple-arm identification with full-bandit feedback

[J]. Neural Computation, 2020,32(9): 1733-1773.

REJWAN I , MANSOUR Y .

Top-k combinatorial bandits with full-bandit feedback

[C]// Proceedings of the 31st International Conference on Algorithmic Learning Theory.[S.l.:s.n.], 2020.

HUANG W R , OK J , LI L ,et al.

Combinatorial pure exploration with continuous and separable reward functions and its applications

[C]// Proceedings of the 27th International Joint Conference on Artificial Intelligence. New York:ACM Press, 2018.

DU Y H , KUROKI Y , CHEN W .

Combinatorial pure exploration with full-bandit or partial linear feedback

[J]. arXiv preprint,2020,arXiv:2006.07905.

CHEN W , DU Y H , HUANG L B ,et al.

Combinatorial pure exploration for dueling bandit

[C]// Proceedings of the 37th International Conference on Machine Learning.[S.l.:s.n.], 2020.

COMBES R , TALEBI M S , PROUTIERE A ,et al.

Combinatorial bandits revisited

[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. New York:ACM Press, 2015: 2116-2124.

GARIVIER A , CAPPÉ O .

The KL-UCB algorithm for bounded stochastic bandits and beyond

[C]// Proceedings of the 24th Annual Conference on Learning Theory.[S.l.:s.n.], 2011.

CUVELIER T , COMBES R , GOURDIN E .

Statistically efficient,polynomial-time algorithms for combinatorial semi-bandits

[J]. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2021,5(1): 1-31.

QIN L J , CHEN S Y , ZHU X Y .

Contextual combinatorial bandit and its application on diversified online recommendation

[C]// Proceedings of the 2014 SIAM International Conference on Data Mining. Philadelphia:Society for Industrial and Applied Mathematics, 2014.

CHEN L X , XU J , LU Z .

Contextual combinatorial multi-armed bandits with volatile arms and submodular reward

[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. New York:ACM Press, 2018: 3251-3260.

ZUO J H , LIU X T , JOE-WONG C ,et al.

Online competitive influence maximization

[J]. arXiv preprint,2020,arXiv:2006.13411.

KOMIYAMA J , HONDA J , NAKAGAWA H .

Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays

[C]// Proceedings of the 32nd International Conference on International Conference on Machine Learning. New York:ACM Press, 2015: 1152-1161.

WANG S W , CHEN W .

Thompson sampling for combinatorial semi-bandits

[C]// Proceedings of the 35th International Conference on Machine Learning.[S.l.:s.n.], 2018.

WANG Q S , CHEN W .

Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications

[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. New York:ACM Press, 2017: 1161-1171.

HÜYÜK A , TEKIN C .

Analysis of Thompson sampling for combinatorial multi-armed bandit with probabilistically triggered arms

[C]// Proceedings of Machine Learning Research.[S.l.:s.n.], 2019.

HÜYÜK A , TEKIN C .

Thompson sampling for combinatorial network optimization in unknown environments

[J]. IEEE/ACM Transactions on Networking, 2020,28(6): 2836-2849.

WANG S W , CHEN W .

Thompson sampling for combinatorial semi-bandits

[C]// Proceedings of the 35th International Conference on Machine Learning.[S.l.:s.n.], 2018: 5114-5122.

ZHOU H Z , WANG L D , VARSHNEY L ,et al.

A near-optimal change-detection based algorithm for piecewise-stationary combinatorial semi-bandits

[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020,34(4): 6933-6940.

CHEN W , WANG L W , ZHAO H Y ,et al.

Combinatorial semi-bandit in the non-stationary environment

[J]. arXiv preprint,2020,arXiv:2002.03580.

ZHANG X J , LI S , LIU W W .

Contextual combinatorial conservative bandits

[J]. arXiv preprint,2019,arXiv:1911.11337.

CESA-BIANCHI N , LUGOSI G .

Combinatorial bandits

[J]. Journal of Computer and System Sciences, 2012,78(5): 1404-1422.

BUBECK S , CESA-BIANCHI N , KAKADE S M .

Towards minimax policies for online linear optimization with bandit feedback

[J]. Journal of Machine Learning Research, 2012,23.

SAKAUE S , ISHIHATA M , MINATO S I .

Practical adversarial combinatorial bandit algorithm via compression of decision sets

[J]. arXiv preprint,2017,arXiv:1707.08300.

LIN T , ABRAHAO B , KLEINBERG R ,et al.

Combinatorial partial monitoring game with linear feedback and its applications

[C]// Proceedings of the 31st International Conference on Machine Learning.[S.l.:s.n.], 2014.

CHAUDHURI S , TEWARI A .

Phased exploration with greedy exploitation in stochastic combinatorial partial monitoring games

[J]. arXiv preprint,2016,arXiv:1608.06403.

CHEN X Y , ZHENG K , ZHOU Z X ,et al.

(Locally) Differentially private combinatorial semi-bandits

[C]// Proceedings of the 37th International Conference on Machine Learning.[S.l.:s.n.], 2020.

CHEN Y W , HOFMANN K .

Online learning to rank:absolute vs. relative

[C]// Proceedings of the 24th International Conference on World Wide Web. New York:ACM Press, 2015.

KVETON B , SZEPESVARI C , WEN Z ,et al.

Cascading bandits:learning to rank in the cascade model

[C]// Proceedings of the 32nd International Conference on Machine Learning. New York:ACM Press, 2015: 767-776.

KVETON B , WEN Z , ASHKAN A ,et al.

Combinatorial cascading bandits

[J]. Advances in Neural Information Processing Systems, 2015,28: 1450-1458.

LI S , WANG B X , ZHANG S Y ,et al.

Contextual combinatorial cascading bandits

[C]// Proceedings of the 33rd International Conference on International Conference on Machine Learning. New York:ACM Press, 2016: 1245-1253.

KATARIYA S , KVETON B , SZEPESVÁRI C ,et al.

DCM bandits:learning to rank with multiple clicks

[C]// Proceedings of the 33rd International Conference on International Conference on Machine Learning. New York:ACM Press, 2016: 1215-1224.

CAO J Y , SUN W , SHEN Z J M ,et al.

Fatigue-aware bandits for dependent click models

[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020,34(4): 3341-3348.

KVETON B , LI C , LATTIMORE T ,et al.

BubbleRank:safe online learning to rerank via implicit click feedback

[C]// Proceedings of the 35th Uncertainty in Artificial Intelligence Conference.[S.l.:s.n.], 2020.

ZOGHI M , TUNYS T , GHAVAMZADEH M ,et al.

Online learning to rank in stochastic click models

[C]// Proceedings of the 34th International Conference on Machine Learning.[S.l.:s.n.], 2017.

ZHU Z A , CHEN W Z , MINKA T ,et al.

A novel click model and its applications to online advertising

[C]// Proceedings of the 3rd ACM International Conference on Web Search and Data Mining. New York:ACM Press, 2010.

LI S , LATTIMORE T , SZEPESVARI C .

Online learning to rank with features

[C]// Proceedings of the 36th International Conference on Machine Learning.[S.l.:s.n.], 2019.

LATTIMORE T , KVETON B , LI S ,et al.

TopRank:a practical algorithm for online stochastic ranking

[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems.[S.l.:s.n.], 2018.

YUE Y S , GUESTRIN C .

Linear submodular bandits and their application to diversified retrieval

[C]// Proceedings of the 24th International Conference on Neural Information Processing Systems. New York:ACM Press, 2011: 2483-2491.

YU B S , FANG M , TAO D C .

Linear submodular bandits with a knapsack constraint

[C]// Proceedings of the 30th AAAI Conference on Artificial Intelligence. New York:ACM Press, 2016: 1380-1386.

CHEN L , KRAUSE A , KARBASI A .

Interactive submodular bandit

[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems.[S.l.:s.n.], 2017.

TAKEMORI S , SATO M , SONODA T ,et al.

Submodular bandit problem under multiple constraints

[C]// Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence.[S.l.:s.n.], 2020.


KONG F , LI Q Z , LI S .

Survey on online influence maximization

[J]. Computer Science, 2020,47(5): 7-13.

KEMPE D , KLEINBERG J , TARDOS É .

Maximizing the spread of influence through a social network

[C]// Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.[S.l.:s.n.], 2003: 137-146.

WEN Z , KVETON B , VALKO M ,et al.

Online influence maximization under independent cascade model with semi-bandit feedback

[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems.[S.l.:s.n.], 2017: 3026-3036.

WU Q Y , LI Z G , WANG H Z ,et al.

Factorization bandits for online influence maximization

[C]// Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York:ACM Press, 2019: 636-646.

VASWANI S , LAKSHMANAN L V S , SCHMIDT M .

Influence maximization with bandits

[J]. arXiv preprint,2015,arXiv:1503.00024.

LI S , KONG F , TANG K J ,et al.

Online influence maximization under linear threshold model

[J]. arXiv preprint,2020,arXiv:2011.06378.

VASWANI S , KVETON B , WEN Z ,et al.

Model-independent online learning for influence maximization

[C]// Proceedings of the 34th International Conference on Machine Learning. New York:ACM Press, 2017: 3530-3539.
