• Title/Summary/Keyword: Multi-armed bandit

Combining Multiple Strategies for Sleeping Bandits with Stochastic Rewards and Availability

  • Choi, Sanghee; Chang, Hyeong Soo
    • Journal of KIISE, v.44 no.1, pp.63-70, 2017
  • This paper considers the problem of combining multiple strategies for solving sleeping bandit problems with stochastic rewards and stochastic availability. It proposes an algorithm, called sleepComb($\Phi$), whose idea is to select an appropriate strategy at each time step based on $\epsilon_t$-probabilistic switching, the switching rule used in the well-known parameter-based $\epsilon_t$-greedy heuristic. The algorithm converges to the "best" strategy, properly defined for the sleeping bandit problem. Experimental results show that sleepComb($\Phi$) converges to the "best" strategy more rapidly than other combining algorithms and chooses that strategy more frequently.
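A minimal Python sketch of the $\epsilon_t$-probabilistic switching idea: with a decaying probability $\epsilon_t$ the learner picks a uniformly random strategy among those currently available (awake), and otherwise exploits the strategy with the best empirical mean. The decay schedule, reward model, and availability probability below are illustrative assumptions, not the paper's exact sleepComb($\Phi$) construction.

```python
import random

def eps_t(t, c=1.0):
    """Decaying exploration rate; the schedule min(1, c/(t+1)) is illustrative."""
    return min(1.0, c / (t + 1))

def select_strategy(t, avg_reward, available):
    """With probability eps_t, explore a uniformly random available
    strategy; otherwise exploit the best empirical mean so far."""
    if random.random() < eps_t(t):
        return random.choice(available)
    return max(available, key=lambda i: avg_reward[i])

# Toy run: 3 strategies with hidden Bernoulli reward means; each
# strategy is independently "awake" with probability 0.8 per step.
K, T = 3, 2000
true_mean = [0.3, 0.5, 0.7]
avg_reward, counts = [0.0] * K, [0] * K
for t in range(T):
    available = [i for i in range(K) if random.random() < 0.8]
    if not available:
        continue
    i = select_strategy(t, avg_reward, available)
    r = 1.0 if random.random() < true_mean[i] else 0.0
    counts[i] += 1
    avg_reward[i] += (r - avg_reward[i]) / counts[i]
print("pulls per strategy:", counts)
```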

Opportunistic Spectrum Access Based on a Constrained Multi-Armed Bandit Formulation

  • Ai, Jing; Abouzeid, Alhussein A.
    • Journal of Communications and Networks, v.11 no.2, pp.134-147, 2009
  • Tracking and exploiting instantaneous spectrum opportunities are fundamental challenges in opportunistic spectrum access (OSA), given the bursty traffic of primary users and the limited spectrum-sensing capability of secondary users. To take advantage of the history of spectrum sensing and access decisions, a sequential decision framework is widely used to design optimal policies. However, many existing schemes, based on a partially observable Markov decision process (POMDP) framework, show that optimal policies are non-stationary in nature, which makes them difficult to compute and implement. This work therefore pursues stationary OSA policies, which are efficient yet low-complexity, while still incorporating practical factors such as spectrum-sensing errors and a priori unknown statistical spectrum knowledge. First, with an approximation of channel evolution, OSA is formulated in a multi-armed bandit (MAB) framework. As a result, the optimal policy is specified by the well-known Gittins index rule: the channel with the largest Gittins index is always selected. Then, closed-form formulas with tunable approximation are derived for the Gittins indices, and a reinforcement learning algorithm is designed to calculate them, depending on whether the Markovian channel parameters are available a priori. Finally, extensive experiments demonstrate the superiority of the scheme over existing schemes in terms of policy quality and optimality.
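The selection rule itself has a simple structure: compute an index per channel and always sense the channel with the largest index. The sketch below shows that structure in Python with a UCB-style surrogate index; it is explicitly not the paper's closed-form Gittins index, whose derivation depends on the Markovian channel model and discount factor.

```python
import math

def surrogate_index(mean, n, t):
    """UCB-style surrogate index (NOT the paper's closed-form Gittins
    index): empirical mean plus an exploration bonus."""
    if n == 0:
        return float("inf")      # sense every channel at least once
    return mean + math.sqrt(2 * math.log(t + 1) / n)

def select_channel(stats, t):
    """Index rule: always sense the channel with the largest index.
    stats[i] = (empirical mean reward, number of times sensed)."""
    return max(range(len(stats)),
               key=lambda i: surrogate_index(stats[i][0], stats[i][1], t))

# Example: channel 2 currently looks best, but under-sampled channel 0
# carries a large exploration bonus and wins the index race.
print(select_channel([(0.4, 2), (0.5, 50), (0.6, 60)], t=112))
```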

Reinforcement Learning-Based Illuminance Control Method for Building Lighting System

  • Kim, Jongmin; Kim, Sunyong
    • Journal of IKEEE, v.26 no.1, pp.56-61, 2022
  • Various efforts have been made worldwide to respond to environmental problems such as climate change. Research on artificial intelligence (AI)-based energy management has been widely conducted as one of the most effective ways to alleviate the climate change problem. In particular, buildings, which account for more than 20% of the total energy delivered worldwide, have been a major target for energy management through building energy management systems (BEMS). In this paper, we propose a multi-armed bandit (MAB)-based energy management algorithm that efficiently decides the energy consumption level of the lighting system in each room of a building while minimizing the discomfort of the occupants of each room.
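A minimal sketch of how such a bandit could be set up: arms are discrete illuminance setpoints, and the reward trades off energy use against occupant discomfort. The setpoints, the reward model, and the use of UCB1 are all illustrative assumptions; the abstract does not specify the paper's exact formulation.

```python
import math
import random

LEVELS = [300, 400, 500, 600, 700]      # candidate illuminance setpoints (lux)

def reward(lux):
    """Hypothetical reward: penalize energy use plus occupant discomfort
    around a noisy preferred illuminance. The paper's actual reward
    model is not given in the abstract."""
    energy = 0.001 * lux
    discomfort = abs(lux - random.gauss(520, 60)) / 1000
    return -(energy + discomfort)

counts, means = [0] * len(LEVELS), [0.0] * len(LEVELS)
for t in range(1, 3001):
    # UCB1 index: empirical mean + sqrt(2 ln t / n); untried arms first.
    ucb = [means[i] + math.sqrt(2 * math.log(t) / counts[i])
           if counts[i] else float("inf") for i in range(len(LEVELS))]
    i = max(range(len(LEVELS)), key=ucb.__getitem__)
    r = reward(LEVELS[i])
    counts[i] += 1
    means[i] += (r - means[i]) / counts[i]
best = max(range(len(LEVELS)), key=counts.__getitem__)
print("most-used setpoint:", LEVELS[best], "lux")
```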

Adaptive algorithm for optimal real-time pricing in cognitive radio enabled smart grid network

  • Das, Deepa; Rout, Deepak Kumar
    • ETRI Journal, v.42 no.4, pp.585-595, 2020
  • Integration of multiple communication technologies in a smart grid (SG) enables the use of cognitive radio (CR) technology to improve reliability and security with low latency by adaptively and effectively allocating spectral resources. The versatile features of CR enable the smart meter to select either the unlicensed or the licensed band for transmitting data to the utility company, thus reducing communication outages. Demand response management is regarded as the control unit of the SG, balancing the load by regulating the real-time price to the benefit of both the utility company and consumers. In this study, the joint allocation of transmission power to the smart meter and of consumer demand is formulated as a two-stage multi-armed bandit game in which the players select their optimal strategies non-cooperatively, without any prior information about the medium. Furthermore, a real-time pricing adaptation method based on the players' historical rewards is proposed and validated through numerical results.
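To make the game structure concrete, the toy sketch below has each player (a smart meter choosing a transmit power, a consumer choosing a demand level) run an independent epsilon-greedy bandit, with rewards coupled through a toy real-time price. Every arm value and payoff function here is hypothetical; the paper's two-stage formulation is more elaborate.

```python
import random

def eps_greedy(means, counts, eps=0.1):
    """Each player runs its own independent epsilon-greedy learner."""
    if random.random() < eps or 0 in counts:
        return random.randrange(len(means))
    return max(range(len(means)), key=means.__getitem__)

POWER = [0.5, 1.0, 1.5]    # hypothetical transmit-power arms (smart meter)
DEMAND = [2.0, 4.0, 6.0]   # hypothetical demand arms (consumer, kW)
p_mean, p_cnt = [0.0] * 3, [0] * 3
d_mean, d_cnt = [0.0] * 3, [0] * 3
for t in range(5000):
    i, j = eps_greedy(p_mean, p_cnt), eps_greedy(d_mean, d_cnt)
    price = 0.2 + 0.05 * DEMAND[j]      # toy real-time price rises with load
    # Hypothetical payoffs: transmission succeeds more often at higher
    # power but costs energy; consumer utility falls as the price rises.
    r_meter = (0.8 if random.random() < 0.2 * POWER[i] else 0.0) - 0.1 * POWER[i]
    r_cons = DEMAND[j] * (1.0 - price) + random.gauss(0, 0.1)
    p_cnt[i] += 1; p_mean[i] += (r_meter - p_mean[i]) / p_cnt[i]
    d_cnt[j] += 1; d_mean[j] += (r_cons - d_mean[j]) / d_cnt[j]
print("meter power choice:", POWER[max(range(3), key=p_mean.__getitem__)])
print("consumer demand choice:", DEMAND[max(range(3), key=d_mean.__getitem__)])
```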

A Heuristic Time Sharing Policy for Backup Resources in Cloud System

  • Li, Xinyi; Qi, Yong; Chen, Pengfei; Zhang, Xiaohui
    • KSII Transactions on Internet and Information Systems (TIIS), v.10 no.7, pp.3026-3049, 2016
  • Cloud computing promises high performance and cost-efficiency. However, most cloud infrastructures operate at low utilization, which greatly undermines cost-effectiveness. Previous works focus on efficient virtual machine (VM) consolidation strategies to increase the utilization of virtual resources in production environments, but overlook the under-utilization of backup virtual resources. We propose a heuristic time-sharing policy for backup VMs derived from the restless multi-armed bandit problem. The proposed policy increases the utilization of backup virtual resources while providing high availability. Results from both simulation and prototype-system experiments show that the traditional 1:1 backup provision can be extended to 1:M (M≫1) between backup VMs and service VMs, and that the utilization of backup VMs can be enhanced significantly.
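The flavor of such a time-sharing policy can be sketched as follows: one backup VM is shared by M service VMs, and each slot the backup is given to the service VM whose estimated failure risk, weighted by how stale its backup is, is highest. This myopic index is a simple stand-in for illustration; the paper's heuristic derived from the restless MAB may differ.

```python
import random

M = 8                                   # service VMs sharing one backup VM
fail_p = [random.uniform(0.001, 0.01) for _ in range(M)]  # hidden risks
staleness = [0] * M                     # slots since last backup of VM i
est_p, obs = [0.005] * M, [0] * M       # running risk estimates

def myopic_index(i):
    """Simple stand-in for a restless-bandit index: the expected loss of
    NOT backing up VM i grows with its risk estimate and staleness."""
    return est_p[i] * staleness[i]

for t in range(10000):
    i = max(range(M), key=myopic_index)
    staleness[i] = 0                    # VM i receives the backup slot
    for k in range(M):
        if k != i:
            staleness[k] += 1
        failed = random.random() < fail_p[k]
        obs[k] += 1
        est_p[k] += (failed - est_p[k]) / obs[k]
print("riskiest VM by estimate:", max(range(M), key=est_p.__getitem__))
```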

Hybrid Offloading Technique Based on Auction Theory and Reinforcement Learning in MEC Industrial IoT Environment

  • Bae Hyeon Ji; Kim Sung Wook
    • KIPS Transactions on Computer and Communication Systems, v.12 no.9, pp.263-272, 2023
  • The Industrial Internet of Things (IIoT) is an important factor in increasing production efficiency in industrial sectors, enabling large-scale data collection, exchange, and analysis through massive connectivity. However, as traffic grows explosively with the recent spread of IIoT, an allocation method that can process this traffic efficiently is required. In this thesis, I propose a two-stage task-offloading decision method to increase successful task throughput in an IIoT environment. I consider a hybrid offloading system that can offload compute-intensive tasks either to a mobile edge computing server via a cellular link or to a nearby IIoT device via a device-to-device (D2D) link. The first stage designs an incentive mechanism that prevents devices participating in task offloading from acting selfishly, which would hinder improvements in task throughput; specifically, McAfee's mechanism is used to control the selfish behavior of task-processing devices and to increase overall system throughput. In the second stage, I propose a multi-armed bandit (MAB)-based task-offloading decision method for a non-stationary environment, accounting for the irregular movement of IIoT devices. Experimental results show that the proposed method outperforms existing methods in terms of overall system throughput, communication failure rate, and regret.
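For the non-stationary MAB stage, one standard option is sliding-window UCB, which computes arm statistics over only the most recent plays so the policy can track devices that move. Below is a minimal sketch assuming Bernoulli offloading-success rewards; the paper's exact variant may differ.

```python
import math
from collections import deque

class SlidingWindowUCB:
    """Sliding-window UCB: arm statistics are computed over the last
    `window` plays only, so the policy can track a drifting environment.
    A standard non-stationary bandit technique; the paper's exact
    variant may differ."""

    def __init__(self, n_arms, window=500, c=2.0):
        self.hist = deque()             # (arm, reward) pairs, newest last
        self.n_arms, self.window, self.c = n_arms, window, c

    def select(self, t):
        n, s = [0] * self.n_arms, [0.0] * self.n_arms
        for arm, r in self.hist:
            n[arm] += 1
            s[arm] += r
        def index(a):
            if n[a] == 0:
                return float("inf")     # try every arm at least once
            bonus = math.sqrt(self.c * math.log(min(t, self.window)) / n[a])
            return s[a] / n[a] + bonus
        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.hist.append((arm, reward))
        if len(self.hist) > self.window:
            self.hist.popleft()         # forget plays outside the window

policy = SlidingWindowUCB(n_arms=4)
# Usage: for t = 1..T: a = policy.select(t); then policy.update(a, reward)
```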

The UCT algorithm applied to find the best first move in the game of Tic-Tac-Toe

  • Lee, Byung-Doo; Park, Dong-Soo; Choi, Young-Wook
    • Journal of Korea Game Society, v.15 no.5, pp.109-118, 2015
  • The game of Go, which originated in ancient China, is regarded as one of the most difficult challenges in the field of AI. Over the past few years, the top computer Go programs based on Monte-Carlo tree search (MCTS) have surprisingly beaten professional players with a handicap. MCTS simulates random sequences of legal moves until the game ends, and has replaced the traditional knowledge-based approach. We applied the UCT algorithm, an MCTS variant, to the game of Tic-Tac-Toe to find the best first move, and compared it with the result generated by pure MCTS. Furthermore, to build intuition for UCB, we introduced and compared the performance of the epsilon-greedy and UCB algorithms for solving the multi-armed bandit problem.
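At the heart of UCT is the UCB1 score applied at each tree node: a child's average reward plus an exploration bonus that shrinks as the child is visited more often. A minimal sketch of that selection step follows; the dictionary node representation is an illustrative assumption.

```python
import math

def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score used at each node of UCT: the child's average reward
    plus an exploration bonus that shrinks with more visits."""
    if visits == 0:
        return float("inf")             # unvisited children are tried first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children):
    """children: list of {'value': total reward, 'visits': count} dicts
    (this node representation is illustrative, not the paper's)."""
    parent_visits = sum(ch["visits"] for ch in children) or 1
    return max(range(len(children)),
               key=lambda i: ucb1(children[i]["value"],
                                  children[i]["visits"], parent_visits))

children = [{"value": 3.0, "visits": 5},
            {"value": 1.0, "visits": 1},
            {"value": 0.0, "visits": 0}]
print(select_child(children))           # prints 2: the unvisited child
```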

Thompson sampling based path selection algorithm in multipath communication system (다중경로 통신 시스템에서 톰슨 샘플링을 이용한 경로 선택 기법)

  • Chung, Byung Chang
    • Journal of the Korea Institute of Information and Communication Engineering, v.25 no.12, pp.1960-1963, 2021
  • In this paper, we propose a multi-play Thompson sampling algorithm for a multipath communication system. Multipath communication systems have advantages in communication capacity, robustness, survivability, and so on. It is important to select an appropriate network path according to the status of each individual path; however, it is hard to obtain the quality of all paths simultaneously. To solve this issue, we apply Thompson sampling, which is popular in the machine learning area. We identify some issues that arise when the algorithm is applied directly to the proposed system and suggest modifications. Through simulation, we verify that the proposed algorithm can utilize all of the network paths. In summary, the proposed algorithm can be applied for path allocation in multipath-based communication systems.
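A common way to realize multi-play Thompson sampling with Bernoulli rewards is to keep a Beta posterior per path, draw one sample from each, and use the m paths with the highest samples. A minimal sketch under those assumptions; the paper's modifications for the multipath setting are not reproduced here.

```python
import random

def multiplay_thompson(successes, failures, m):
    """Multi-play Thompson sampling with Beta posteriors: draw one sample
    per path from Beta(s+1, f+1) and use the m highest-sampled paths."""
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return sorted(range(len(samples)), key=samples.__getitem__,
                  reverse=True)[:m]

# Toy run: 5 paths used 2 at a time; per-path delivery success is
# Bernoulli with hidden rates true_q (illustrative numbers).
true_q = [0.9, 0.8, 0.6, 0.4, 0.2]
s, f = [0] * 5, [0] * 5
for t in range(5000):
    for i in multiplay_thompson(s, f, m=2):
        if random.random() < true_q[i]:
            s[i] += 1
        else:
            f[i] += 1
print("plays per path:", [si + fi for si, fi in zip(s, f)])
```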

Path selection algorithm for multi-path system based on deep Q learning (Deep Q 학습 기반의 다중경로 시스템 경로 선택 알고리즘)

  • Chung, Byung Chang; Park, Heasook
    • Journal of the Korea Institute of Information and Communication Engineering, v.25 no.1, pp.50-55, 2021
  • A multi-path system utilizes multiple networks simultaneously and is expected to enhance the speed, reliability, and security of communication. In this paper, we focus on path selection in a multi-path system. To select the optimal path, we propose a deep reinforcement learning algorithm whose reward is based on the round-trip time (RTT) of each network. Unlike a multi-armed bandit model, deep Q learning is applied to handle rapidly changing conditions. Because RTT measurements arrive with delay, we also suggest a compensation algorithm for the delayed reward. Moreover, we implement a testbed learning server, containing a distributed database and a TensorFlow module, to efficiently operate and evaluate the deep learning algorithm. Simulations show that the proposed algorithm outperforms the lowest-RTT baseline by about 20%.
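As a compact stand-in for the paper's deep Q network, the sketch below uses tabular Q-learning over coarsely discretized RTT states. The state encoding, hyperparameters, and negative-RTT reward convention are illustrative assumptions, not the paper's design.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1       # illustrative hyperparameters
Q = defaultdict(float)                  # Q[(state, action)] -> value

def discretize(rtts, bins=(20, 50, 100)):
    """Map each path's RTT (ms) to a coarse bucket. The paper feeds raw
    measurements to a deep network; this tabular encoding is a stand-in."""
    return tuple(sum(r > b for b in bins) for r in rtts)

def choose_path(state, n_paths):
    if random.random() < EPS:
        return random.randrange(n_paths)
    return max(range(n_paths), key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, n_paths):
    best_next = max(Q[(next_state, a)] for a in range(n_paths))
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Toy interaction: 3 paths with noisy RTTs; reward = -RTT of chosen path.
state = (0, 0, 0)
for t in range(2000):
    rtts = [random.gauss(m, 5) for m in (30, 60, 90)]
    a = choose_path(state, 3)
    next_state = discretize(rtts)
    update(state, a, -rtts[a], next_state, 3)
    state = next_state
print("greedy path in last state:", max(range(3), key=lambda a: Q[(state, a)]))
```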

Optimal Exploration-Exploitation Strategies in Reinforcement Learning for Online Banner Advertising: The Impact of Word-of-Mouth Effects

  • Bumsoo Kim; Gun Jea Yu; Joonkyum Lee
    • Journal of Service Research and Studies, v.14 no.2, pp.1-17, 2024
  • One of the most important decisions for managers in the online banner advertising industry is choosing the best banner alternative to expose to customers. Since the click probability of each banner alternative is difficult to know in advance, managers must experiment with multiple alternatives, estimate each alternative's click probability from customer clicks, and find the optimal alternative. In this reinforcement learning process, the main decision problem is to find the optimal balance between exploitation, which uses the accumulated click-probability estimates, and exploration, which tries new alternatives to find potentially better options. In this study, we analyze the impact of word-of-mouth effects and of the number of alternatives on the optimal exploration-exploitation strategy. More specifically, we focus on the word-of-mouth effect, whereby a banner's click-through rate increases as customers who click the exposed banner promote the related product to those around them, and we add this effect to the overall reinforcement learning process. We analyze the problem with a multi-armed bandit model, and the results show that the larger the word-of-mouth effect and the fewer the banner alternatives, the higher the optimal exploration level. As the word-of-mouth effect raises the probability of customers clicking on a banner, the value of previously accumulated click-through-rate estimates decreases, and the value of exploring new alternatives therefore increases. Additionally, when the number of advertising alternatives is small, a larger increase in the optimal exploration level is observed as the magnitude of the word-of-mouth effect grows. This study provides meaningful academic and managerial implications at a time when online word-of-mouth and its impact on society and business are becoming more important.
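The interplay the study describes can be mimicked in a toy simulation: an epsilon-greedy learner chooses among banners whose click probabilities drift upward with each click, a crude word-of-mouth effect. All numbers below are hypothetical and the model is not the paper's; the sketch only illustrates why accumulated estimates lose value as click rates drift.

```python
import random

def run(eps, wom=0.0005, K=3, T=20000):
    """Epsilon-greedy banner selection where every click nudges that
    banner's click probability upward: a toy word-of-mouth effect, not
    the paper's exact model."""
    p = [0.02, 0.03, 0.04]              # initial click probabilities (toy)
    means, counts, clicks = [0.0] * K, [0] * K, 0
    for t in range(T):
        if random.random() < eps:
            i = random.randrange(K)     # explore a random banner
        else:
            i = max(range(K), key=means.__getitem__)  # exploit estimates
        r = 1.0 if random.random() < p[i] else 0.0
        if r:
            p[i] = min(1.0, p[i] + wom) # word-of-mouth lifts future CTR
            clicks += 1
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]
    return clicks

# Sweep exploration levels; with word-of-mouth drift, stale estimates
# lose value, which is the intuition behind the study's finding.
for eps in (0.01, 0.05, 0.1, 0.2):
    print(eps, run(eps))
```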