Applications of reinforcement learning in the field of resource optimization

doi:10.11959/j.issn.2096-0271.2021053

Abstract:

Resource optimization problems are widespread in social and economic activity, and the large volumes of data accumulated in this field lay the foundation for applying reinforcement learning. Because resource optimization covers a wide range of problems, three important classes are singled out: the resource balancing problem, the resource allocation problem, and the bin packing problem. For each class, recent reinforcement learning work is surveyed, with a detailed discussion of how each work formulates the problem and designs the agent.

Key words: reinforcement learning, resource optimization, multi-agent system

Chinese Library Classification (CLC) number: TP399

Cite this article:

Jinyu WANG, Xinran WEI, Wenlei SHI, Jia ZHANG. Applications of reinforcement learning in the field of resource optimization[J]. Big Data Research, 2021, 7(5): 131-149.
