Deep Contextual Bandit and Reinforcement Learning for IRS-assisted
MU-MIMO Systems
Abstract
The combination of multiple-input multiple-output (MIMO) and
intelligent reflecting surfaces (IRSs) is foreseen as a key enabler of
beyond-5G (B5G) and 6G networks. In this work, two approaches are
considered for the joint optimization of the IRS phase-shift matrix and
MIMO precoders of an IRS-assisted multi-stream (MS) multi-user MIMO
(MU-MIMO) system with the aim of maximizing the system sum-rate for
every channel realization. The first one is a novel contextual bandit
(CB) approach with continuous state and action spaces called deep
contextual bandit-oriented deep deterministic policy gradient
(DCB-DDPG). The second is an innovative deep reinforcement learning
(DRL) formulation in which the states, actions, and rewards are chosen
so that the Markov property required by the Markov decision process
(MDP) underlying reinforcement learning (RL) is satisfied. Both
proposals substantially outperform state-of-the-art heuristic methods
in scenarios with high multi-user interference.
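
To make the optimization objective concrete, the following is a minimal sketch of how a sum-rate could be evaluated for one channel realization, given a set of IRS phase shifts and a precoder. It is an illustration under simplifying assumptions that are not taken from this work: each user has a single antenna and receives a single stream (the system considered here is multi-stream), the IRS is modeled as a diagonal matrix of unit-modulus phase shifts, and all names (effective_channel, sum_rate, h_direct, h_reflect, g, phases, precoder, noise_power) are hypothetical.

```python
import cmath
import math


def effective_channel(h_direct, h_reflect, g, phases):
    """Return the K x M effective channel H_d + H_r diag(e^{j*theta}) G.

    Assumed shapes (illustrative only): h_direct is K x M (base station
    to users), h_reflect is K x N (IRS to users), g is N x M (base
    station to IRS), and phases holds the N IRS phase shifts in radians.
    """
    n_users, n_tx, n_irs = len(h_direct), len(h_direct[0]), len(phases)
    shifts = [cmath.exp(1j * theta) for theta in phases]
    return [
        [
            h_direct[k][m]
            + sum(h_reflect[k][n] * shifts[n] * g[n][m] for n in range(n_irs))
            for m in range(n_tx)
        ]
        for k in range(n_users)
    ]


def sum_rate(h_eff, precoder, noise_power=1.0):
    """Sum-rate in bit/s/Hz for single-antenna, single-stream users.

    precoder[m][j] is the weight applied at antenna m to user j's stream.
    """
    n_users, n_tx = len(h_eff), len(h_eff[0])
    total = 0.0
    for k in range(n_users):
        # Power received at user k from the stream intended for user j.
        power = [
            abs(sum(h_eff[k][m] * precoder[m][j] for m in range(n_tx))) ** 2
            for j in range(n_users)
        ]
        # Streams intended for the other users act as interference.
        interference = sum(power) - power[k]
        total += math.log2(1.0 + power[k] / (interference + noise_power))
    return total
```

In a learning-based formulation of this kind, the channel realization would plausibly play the role of the context or state, the pair (phases, precoder) the action, and a quantity such as sum_rate the reward; the exact definitions used by DCB-DDPG and the DRL formulation are those given in the body of the paper, not this sketch.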