Training Demand Prediction Models by Decision Error for Two-Stage Lot-Sizing Problems

Abstract—Demand prediction to support appropriate production decisions is being actively studied. Many prediction models are designed to minimize the prediction error, measured as the difference between the predicted and ground-truth demand. However, these models ignore the effect of the prediction error on downstream production decisions. This prompted our study, which focuses on demand prediction models for two-stage uncapacitated lot-sizing problems. In this paper, we present a prediction model that minimizes the decision error, which is measured by the optimization objective of the lot-sizing problem. Our model mitigates the impact of prediction errors by leveraging the structure of the lot-sizing problem. To enhance the ability to accommodate imperfect data, such as data based on inaccurate information, we subsequently extend the prediction models to distributionally robust versions. We consider a worst-case formulation in the feature space to enhance the robustness of the model to data imperfection. Numerical experiments demonstrate that the proposed prediction models are significantly superior to traditional prediction methods when the model being trained is misspecified. In addition, the robust extension enables the models to train well on imperfect datasets while requiring less training data.

Note to Practitioners—This study is motivated by the two-stage lot-sizing problem in manufacturing systems, which involves deciding when to produce (the set-up decisions) before the demand is revealed, while the production quantity decisions are adjusted during production. Predicting the demand properly when making the set-up decisions helps to decrease the production cost. This paper proposes a novel method that trains demand prediction models to minimize the total production cost. Practitioners can deploy the proposed method to train popular prediction models, including linear, decision tree, and neural network models. It is shown that the production cost associated with decisions based on our models is lower than that associated with traditional prediction models. Furthermore, our method enhances the ability of the model to handle model misspecification and imperfect data. Practitioners can apply our method to overcome the problems caused by imperfect data, such as contaminated and inaccurate measurements.


Hailei Gong, Yanzi Zhang, and Zhi-Hai Zhang, Member, IEEE

I. INTRODUCTION
Lot-sizing has become one of the most important problems in operations management since the seminal work of [1], and the technique has been widely applied to production planning and inventory control, among other fields. Classic lot-sizing problems usually involve two types of decisions: how much to produce (the production quantity decisions) and when to produce (the set-up decisions), so that the given deterministic demand over T periods is satisfied.
In practice, due to the long set-up time, the decision maker sometimes needs to make the set-up decisions before observing the uncertain demand. Many techniques, such as stochastic optimization and rolling-horizon planning [2], have been developed to address uncertain demand. Motivated by this challenge, we propose a prediction model tailored for the two-stage lot-sizing problem [3], [4], [5]. We consider the setting in which the set-up decisions are made in the first stage (the planning stage) before the actual demand becomes fully known, and the production quantity decisions are made in the second stage (the operational stage) based on more accurate demand information. Compared with the classic lot-sizing model, the two-stage stochastic model enhances the robustness and flexibility of solutions against demand fluctuations in real production situations because it enables the production quantity to be adjusted in the second stage.
A key problem here is finding a way to forecast the demand in the first stage, before orders are placed, and to make appropriate set-up decisions based on this forecast. In the current era of big data, historical production data are collected and processed using one of the many available machine-learning (ML) prediction models developed for this purpose. Typically, an ML model f(•) is used to predict the demand d_t from auxiliary data, or features, such as seasonality, customer demographics, and weather forecasts. An effective prediction model helps the decision maker forecast future demand and then make set-up decisions in the first stage. The entire two-stage decision process is shown in Figure 1. Generally, the objectives of the prediction model and the optimization model are not consistent. The prediction and optimization models function independently and are not designed to consider the way in which they mutually influence one another. As shown in Figure 1, the goal of the prediction model is to minimize the prediction error, whereas that of the optimization model is to minimize the production cost. The impact of the prediction error on the downstream optimization model is not considered in the prediction model. For example, when a common loss function such as the mean square error (MSE) is used to train prediction models, over- and under-predictions of the same magnitude are penalized equally. In the optimization model, by contrast, because the overage and underage costs differ, over- and under-predictions are penalized by different production costs. Such a difference can be problematic. First, model misspecification is widespread in demand forecasting. In practice, the decision maker may assume an incorrect demand model, either because the real demand function is inaccessible owing to knowledge limitations, or because a simple model is preferred for interpretability. Second, in
reality, imperfect data, such as outliers and contaminated or inaccurate records, are inevitable, and they cause the prediction model to make prediction errors. The subsequent amplification of these prediction errors by the optimization model [6] seriously affects the quality of decision-making.
In this setting, we propose using the decision error to train the prediction model. We focus on training the prediction models such that they effectively utilize the model structure of two-stage lot-sizing problems. That is, we attempt to find the prediction model that minimizes the total production cost (including the fixed cost, variable cost, and inventory holding cost) of the two-stage lot-sizing problem rather than minimizing the prediction error between the predicted and ground-truth demand. This approach links the prediction and optimization models more closely and mitigates the influence of model misspecification. In addition, we introduce distributionally robust optimization to improve the robustness of the models. Inspired by recent developments in robust learning, we consider learning from the worst-case distribution around the empirical distribution. This scheme guarantees good performance with respect to distributions close to the empirical distribution and ensures that the models are robust to data imperfections.
Our contributions are summarized as follows:
1) We propose a new method to train prediction models for two-stage lot-sizing problems. The method does not measure the quality of a prediction by the prediction error. Rather, the prediction quality is measured by the decision error, which is the difference between the actual and optimal production costs. The demand predicted using the method contributes to enhancing decision-making.
2) We propose distributionally robust models based on the Wasserstein distance to enhance the robustness and generalization of the prediction models. This method improves the ability of the models to accommodate imperfect data.
3) We propose algorithms to iteratively train the corresponding models. Both the non-robust and robust models can be trained efficiently by calling a solver (such as Gurobi).
4) Numerical studies demonstrate that the prediction models trained by the decision error outperform MSE-based models in the event of model misspecification. Remarkably, when imperfect data are included, the robust models significantly outperform competing models.

The remainder of the paper is organized as follows: Section II presents a comprehensive review of the related literature. In Section III, we present the two-stage lot-sizing problems and the decision error. Then, prediction models are developed in Section IV. In Section V, we evaluate the performance of the proposed models. Finally, Section VI draws conclusions and outlines future research directions. The proofs of the propositions and theorems in this paper can be found in Appendix A.

II. LITERATURE REVIEW
Our study is related to three streams of literature: the lot-sizing problem, distributionally robust optimization, and machine learning.

A. Lot-Sizing Problem
The lot-sizing problem is the most celebrated model in inventory management [1], and many variants have been developed for more practical situations [2], [7], [8]. In many applications, model parameters such as the demand are unknown when production plans are made. To accommodate these uncertainties, the lot-sizing problem with stochastic parameters has attracted considerable attention.
Models that consider stochastic parameters have been extensively studied in recent years [9]. Markowitz et al. studied two queuing control problems, which are variants of stochastic lot-sizing problems, and proposed a dynamic control policy to solve them [10]. Guan and Miller considered stochastic uncapacitated lot-sizing problems; a scenario tree was proposed to describe the evolution of the stochastic parameters, and a polynomial-time algorithm was developed to solve the resulting model dynamically [11]. Löhndorf and Minner proposed simulation-based optimization methods for stochastic lot-sizing problems [12]; in contrast with previous studies, the optimization process was treated as a black box.
Levi and Shi considered stochastic lot-sizing problems with general demand distributions and developed novel approaches to compute near-optimal policies [13]. Helber et al. considered stochastic production yield and used a service-level constraint to limit the backlog [14]. Akartunalı and Dauzère-Pérès considered the possibility that the demand timing might be stochastic and proposed a novel model based thereupon [15]. Quezada et al. studied the uncapacitated lot-sizing problem with uncertain demand and costs and proposed an efficient algorithm to solve it [16].
Compared with the classic lot-sizing models, researchers have developed two-stage models to efficiently hedge against demand fluctuations. Zhang considered the two-stage minimax robust lot-sizing problem with interval-uncertain demands and showed that the problem can be solved in polynomial time under certain conditions [3]. Li et al. proposed a risk-averse two-stage stochastic model for a joint line balancing and lot-sizing problem under demand uncertainty [5]. Slama et al. considered uncertain refurbishing durations in disassembly lot-sizing problems and formulated the problem as a two-stage model [17]. Modeling methods other than stochastic optimization, such as distributionally robust optimization, have also been proposed to accommodate stochastic demand [4].
Although stochastic lot-sizing problems have been comprehensively studied, most studies have focused on the development of optimization models. Few have considered effective demand prediction for improved decision-making. Our study is intended to bridge this research gap.

B. Distributionally Robust Optimization
Motivated by the fact that there might be uncertainties in the training data, robust optimization is used in the machine learning and operations research fields [18]. Simester et al. pointed out that the quality of training data, affected by phenomena such as covariate shift and concept shift, can dramatically influence the performance of model-driven methods [19]. This observation led researchers to focus on improving the robustness of data-driven models. For example, Bertsimas et al. obtained computationally tractable formulations for the three most widely used classification methods: support vector machines, logistic regression, and decision trees [20]. In addition, Hong et al. proposed a learning-based robust optimization framework to dynamically estimate the key parameters in optimization problems [21]. Volpi et al. and Christina et al. proposed distributionally robust methods to enhance the ability of a model to process unseen data [22], [23], [24].
Because traditional robust optimization can lead to overly conservative decisions, distributionally robust optimization has been adopted to mitigate this conservatism and has received extensive attention [25], [26]. Distributionally robust optimization has had empirical success in operations research. For instance, Basciftci et al. and Saif and Delage studied distributionally robust facility location problems, and their numerical results showed that the robust version outperforms the stochastic one [27], [28]. Gong and Zhang examined the distributionally robust optimization of pricing and network design problems and utilized a novel ambiguity set to describe the uncertain quality [29].
Recent studies have also shown that an internal relationship exists between distributionally robust optimization and regularization in machine learning. Gao et al. developed a general theory explaining the connection between variation regularization and Wasserstein distributionally robust optimization [30]. Blanchet et al. showed that several regularized machine-learning estimators, such as square-root Lasso and regularized logistic regression, can be represented as solutions to distributionally robust optimization problems [31]. Shafieezadeh-Abadeh et al. introduced new regularization techniques for machine-learning methods using ideas from distributionally robust optimization [32].
In our study, to address the problems resulting from model misspecification and imperfect data, we enhance the robustness and generalization of prediction models by applying distributionally robust optimization.
C. Machine Learning

Recently, end-to-end research frameworks have attracted considerable attention. These methods output optimization results directly from input features by training machine-learning models such as neural networks. Wilder et al. integrated discrete optimization problems into deep learning and other predictive models by directly training the machine-learning model with the optimization algorithm [39]. Bertsimas and Kallus developed a framework to prescribe optimal decisions directly from data [40]. Oroojlooyjadid et al. applied a deep learning model to optimize the order quantities for the newsvendor problem based on features of the demand data [33]. Notz and Pibernik proposed a machine-learning approach termed kernelized empirical risk minimization to prescribe optimal decisions for capacity management problems [41]. The end-to-end framework is attractive because it outputs decisions directly from feature data. However, owing to its lack of interpretability and the large amount of training data it requires, it performs poorly in combination with complex optimization problems. Elmachtoub and Grigas [42] proposed a smart "predict-then-optimize" framework to more effectively exploit the structural features of the optimization model, and Elmachtoub et al. [43] extended the framework to decision trees.
In this study, we use the decision error to connect the prediction models and the lot-sizing optimization model.Compared with the end-to-end models, we utilize the structure of the optimization model while retaining the interpretability of the prediction models.

III. PROBLEM DESCRIPTION
The notation used throughout the paper is summarized as follows:

I_t Continuous variable; inventory quantity in period t ∈ {1, . . ., T}.
X_ij Continuous variable; proportion of d_j satisfied by the production in period i.

Parameters
In this paper, unless explicitly noted, lowercase bold letters such as a represent vectors, and uppercase bold letters such as A represent matrices. A random variable is denoted by ã, and a predicted variable by â. Lowercase bold letters represent vectors of the corresponding scalars, such as d = (d_1, . . ., d_T).
Herein, we introduce the two-stage lot-sizing problem in Section III-A and demonstrate our approach to evaluate the decision error in Section III-B.

A. Two-Stage Lot-Sizing Problems
The two-stage lot-sizing problem can be formalized as follows:

First stage:

min_Z c⊤Z + E_{d∼D}[p(d, Z)] (1a)
s.t. Z_t ∈ {0, 1}, t = 1, . . ., T (1b)

Second stage:

p(d, Z) = min_{X,I} Σ_{t=1}^{T} (p_t X_t + h_t I_t) (2a)
s.t. I_t = I_{t−1} + X_t − d_t, t = 1, . . ., T (2b)
X_t ≤ M Z_t, t = 1, . . ., T (2c)
X_t ≥ 0, I_t ≥ 0, t = 1, . . ., T (2d)

The first-stage objective function (1a) calculates the sum of the fixed cost and the expected second-stage cost. The second-stage objective function (2a) calculates the sum of the variable cost and the inventory holding cost for the given first-stage decisions Z and demand d. Constraints (2b) are the inventory-balance constraints, where the initial inventory is assumed to be I_0 = 0. Constraints (2c) prohibit X_t from being positive unless Z_t = 1, where M is a sufficiently large positive number. Constraints (1b) and (2d) are domain constraints.
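To make the second-stage structure concrete, the following sketch computes p(d, Z) for fixed set-up decisions by serving each period's demand from the cheapest open set-up period at or before it, which is optimal in the uncapacitated case. This is an illustrative implementation of ours; the function and variable names are not from the paper.

```python
def second_stage_cost(Z, d, p, h):
    """Optimal second-stage cost p(d, Z) for fixed set-ups Z: each
    period's demand is served from the open set-up period i <= t that
    minimizes the unit production cost p[i] plus holding costs h[i..t-1]."""
    total = 0.0
    for t in range(len(d)):
        if d[t] == 0:
            continue
        # per-unit cost of covering period t from each open set-up period i
        candidates = [p[i] + sum(h[i:t]) for i in range(t + 1) if Z[i] == 1]
        if not candidates:          # no open set-up can reach period t
            return float("inf")
        total += min(candidates) * d[t]
    return total
```

For the two-period example used later in this section (p = (20, 20), h = (10, 10), d = (25, 25)), producing everything in period 1 costs 20·25 + 30·25 = 1250, while producing in both periods costs 1000.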
In the first stage, we assume that the demand d can be predicted based on certain related and observable features x ∈ X, where X is the feature space and ξ = (x, d) is governed by a probability distribution P. For example, the sales of e-commerce products are highly related to promotional activities, whereas the demand for commodities such as automotive parts is cyclical. Denote by D_x the conditional distribution of the demand given features x. The goal of the decision maker in the first stage is to find an optimal solution Z that minimizes the following objective:

min_Z c⊤Z + E_{d∼D_x}[p(d, Z) | x] (3)

The following proposition asserts that p(d, Z) is linear in d.
Proposition 1: For a fixed feasible solution Z, p(d, Z) is linear in d; that is, p(d, Z) = w⊤(Z)d, where w(Z) is determined only by Z.
For convenience, we abbreviate w⊤(Z) as w⊤ and write c = (c_1, . . ., c_T). We rewrite the objective function (3):

min_Z c⊤Z + E_{d∼D_x}[p(d, Z) | x]
= min_Z c⊤Z + E_{d∼D_x}[w⊤d | x] (4a)
= min_Z c⊤Z + w⊤E_{d∼D_x}[d | x] (4b)
≈ min_Z c⊤Z + w⊤d̂ (4c)

The first equality follows from the fact that p(d, Z) is linear in d (Proposition 1). The second equality follows from the linearity of expectation. Equation (4c) implies that, by using a prediction d̂ for E_{d∼D_x}[d | x], the contextual stochastic model (3) is transformed into a deterministic one, that is, min_Z c⊤Z + w⊤d̂. By solving this deterministic optimization problem, the decision maker obtains the first-stage set-up decisions Z. The following section demonstrates how to evaluate the quality of the predicted demand d̂.
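Once a prediction d̂ is substituted into (4c), the resulting deterministic uncapacitated problem can be solved by the classic Wagner-Whitin dynamic program. Below is a minimal sketch of ours (names and structure are illustrative, not the paper's implementation), using the cost data of the model above:

```python
def lot_size(d, c, p, h):
    """Wagner-Whitin style DP for the deterministic uncapacitated
    lot-sizing problem: choose set-up periods to minimize fixed,
    variable, and holding costs. Returns (optimal cost, set-ups Z)."""
    T = len(d)
    INF = float("inf")
    F = [0.0] + [INF] * T          # F[t]: optimal cost of periods 1..t
    last = [0] * (T + 1)           # last[t]: set-up period serving period t
    for t in range(1, T + 1):
        for i in range(1, t + 1):  # candidate: last set-up in period i
            cost = F[i - 1] + c[i - 1]
            for j in range(i, t + 1):
                unit = p[i - 1] + sum(h[i - 1:j - 1])  # produce j's demand in i
                cost += unit * d[j - 1]
            if cost < F[t]:
                F[t], last[t] = cost, i
    Z = [0] * T                    # recover the set-up decisions
    t = T
    while t > 0:
        Z[last[t] - 1] = 1
        t = last[t] - 1
    return F[T], Z
```

With the parameters of the example in Section III-B (c = (300, 300), p = (20, 20), h = (10, 10)), demand (25, 25) is served entirely from period 1 at cost 1550, while demand (25, 35) triggers a second set-up.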

B. Decision Error
For given historical data (x, d), a prediction model outputs the predicted demand d̂ = f(x). Finding a way to evaluate the quality of d̂ is the key to training the prediction model. The square error, which has the form ||d̂ − d||², is one of the most popular criteria for evaluating the quality of a predicted value. However, the square error ignores the impact of the predicted value on the downstream lot-sizing problem.
In this study, we propose to evaluate the prediction quality by the decision error of the two-stage lot-sizing problem. We evaluate the decision error by the following steps:
1) In the first stage, the optimization model (1) determines the set-up decisions Z based on the predicted demand d̂.
2) In the second stage, the final production quantity decisions are determined based on the pre-determined set-up decisions Z and the historical ground-truth demand d.
3) We calculate the actual production cost, including the fixed cost in the first stage and the variable cost and inventory holding cost in the second stage, denoted as cost_predict.
4) We calculate the ex-post optimal production cost by re-solving the lot-sizing problem under perfect information. That is, we plug the ground-truth demand d into the first-stage optimization problem and obtain the ex-post optimal set-up and production quantity decisions. Denote the ex-post optimal cost as cost_opt.
5) We calculate the decision error (DE), which is the difference between cost_predict and cost_opt.

The entire evaluation process is demonstrated in Figure 2.
To explicitly express the decision error of the two-stage lot-sizing problem, we reformulate the deterministic lot-sizing problem (4c) into the location form (5). The decision error is then

DE(d̂_e, d_e) = w⊤(d̂_e)d_e − w⊤(d_e)d_e,

where w⊤(d̂_e)d_e is the actual cost cost_predict and w⊤(d_e)d_e is the ex-post optimal cost cost_opt.

We exemplify the difference between the square error and the decision error with a two-period lot-sizing problem. The parameters are c = (300, 300), p = (20, 20), h = (10, 10), and d = (25, 25). We consider the influence of the predicted demand in the second period, d̂_2. The square error is (d̂_2 − d_2)². We now consider the decision error. If d̂_2 ≤ 30, the optimal decision in the first stage is Z = (1, 0), X = (d̂_1 + d̂_2, 0). In the second stage, the final production quantity is adjusted to X = (50, 0) based on the real demand d. The actual cost is c⊤Z + p(d, Z) = 300 + 1250 = 1550. If d̂_2 > 30, the optimal decision in the first stage is Z = (1, 1), X = (d̂_1, d̂_2). In the second stage, the final production quantity is adjusted to X = (25, 25). The actual cost is c⊤Z + p(d, Z) = 600 + 1000 = 1600. Under perfect information, the ex-post optimal decision for the real demand d is Z = (1, 0), X = (50, 0), and the optimal cost is 1550. Thus, for d̂_2 ≤ 30, the decision error is 1550 − 1550 = 0, and for d̂_2 > 30, the decision error is 1600 − 1550 = 50. Figure 3 illustrates the difference.
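The five evaluation steps and the two-period example can be reproduced with a small script. The sketch below (ours, purely illustrative; it brute-forces over set-up vectors rather than solving a MIP) plans on the predicted demand, re-costs the plan on the ground-truth demand, and subtracts the ex-post optimum:

```python
from itertools import product

def second_stage(Z, d, p, h):
    # optimal adjusted production cost for fixed set-ups Z and demand d
    total = 0.0
    for t, dt in enumerate(d):
        opts = [p[i] + sum(h[i:t]) for i in range(t + 1) if Z[i]]
        if dt and not opts:
            return float("inf")      # demand cannot be served
        total += min(opts) * dt if dt else 0.0
    return total

def plan(d_hat, c, p, h):
    # first stage: choose set-ups minimizing cost under the predicted demand
    return min(product([0, 1], repeat=len(d_hat)),
               key=lambda Z: sum(ci * zi for ci, zi in zip(c, Z))
               + second_stage(Z, d_hat, p, h))

def decision_error(d_hat, d, c, p, h):
    Z = plan(d_hat, c, p, h)                       # step 1: plan on prediction
    actual = sum(ci * zi for ci, zi in zip(c, Z)) \
        + second_stage(Z, d, p, h)                 # steps 2-3: cost_predict
    Z_opt = plan(d, c, p, h)                       # step 4: ex-post optimum
    optimal = sum(ci * zi for ci, zi in zip(c, Z_opt)) \
        + second_stage(Z_opt, d, p, h)             # cost_opt
    return actual - optimal                        # step 5: decision error
```

For the example's parameters, a prediction d̂ = (25, 35) yields a decision error of 50, whereas d̂ = (25, 28) yields 0, matching the discussion above.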

IV. PREDICTION MODELS
Herein, we introduce the prediction models f(•) that are used to predict the ground-truth demand d and demonstrate how to train them by the decision error. In Section IV-A, we propose the linear model trained by the decision error. We extend the linear model to a distributionally robust version to enhance its robustness in Section IV-B. Extensions to decision tree and neural network models are presented in Section IV-C.

A. Linear Model
We consider the linear regression prediction model f(x) = Bx, where B ∈ R^{T×a} is the coefficient matrix of the linear model. The linear model is used because it is common in practice and can easily be extended to quadratic or higher-dimensional models. Moreover, compared with time series analysis, regression models that consider more external factors can predict the impact of emergencies [44].
Given N training samples {(x_1, d_1), . . ., (x_N, d_N)}, where x_n is a feature vector and d_n is the real demand of sample n, the most popular method to fit the model to the training data is to minimize the empirical risk

min_{f(•)} (1/N) Σ_{n=1}^{N} l(f(x_n), d_n),

where l(•, •) is the loss function. We take the decision error DE as the loss function, and a sensible goal is to find a prediction model that minimizes the mean decision error:

min_{f(•)} (1/N) Σ_{n=1}^{N} DE(f(x_n), d_n).

For the linear regression prediction model f(x) = Bx, our target is to find the coefficient matrix B that minimizes

min_B (1/N) Σ_{n=1}^{N} DE(Bx_n, d_n).

Unfortunately, the decision error DE is non-convex and discontinuous in d̂_e (as shown in Fig. 3), which makes it difficult to train the prediction parameters B directly. To optimize the parameters efficiently, we derive a tractable surrogate loss function in Proposition 2.
Proposition 2: The decision error DE(d̂_e, d_e) is bounded above by a surrogate loss defined over S = {w : (5b)–(5d), w_0 = Σ_{t=1}^{T} f_t Z_t, w_j = Σ_{i=1}^{j} X_ij c_ij}; that is, the term on the right-hand side is an upper bound on the term on the left-hand side. Furthermore, the surrogate is convex in d̂_e.
As shown in [42], the surrogate loss is Fisher consistent with respect to the decision loss, which means that the coefficient matrix that minimizes the surrogate loss also minimizes the true decision loss in most cases. Because of the convexity of the surrogate function, it is more efficient to find the optimal coefficient matrix that minimizes this loss. Thus, the problem becomes that of minimizing the average surrogate loss over the training data, which we refer to as problem (10). We reformulate problem (10) in Theorem 1.
Theorem 1: Solving problem (10) is equivalent to solving optimization model (11).

Clearly, problem (11) is almost a linear optimization problem except for the bilinear term αB. However, α and B always appear in pairs in the objective function and constraints; hence, we can substitute C = αB so that problem (11) can be solved directly by popular optimization solvers such as Gurobi. In the next section, we tailor the linear prediction model into a distributionally robust one to enhance its ability to address imperfect data.
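Before turning to the robust extension, the decision-error training objective can be probed with a toy experiment. The hedged sketch below (ours, purely illustrative) fits a one-coefficient linear predictor by naive random local search on the mean decision error; the paper instead minimizes the convex surrogate via problem (11), so this loop only serves to make the training objective concrete.

```python
import random
from itertools import product

def second_stage(Z, d, p, h):
    # optimal adjusted production cost for fixed set-ups Z and demand d
    total = 0.0
    for t, dt in enumerate(d):
        opts = [p[i] + sum(h[i:t]) for i in range(t + 1) if Z[i]]
        if dt and not opts:
            return float("inf")
        total += min(opts) * dt if dt else 0.0
    return total

def total_cost(Z, d, c, p, h):
    return sum(ci * zi for ci, zi in zip(c, Z)) + second_stage(Z, d, p, h)

def decision_error(d_hat, d, c, p, h):
    plan = lambda dem: min(product([0, 1], repeat=len(dem)),
                           key=lambda Z: total_cost(Z, dem, c, p, h))
    return total_cost(plan(d_hat), d, c, p, h) - total_cost(plan(d), d, c, p, h)

def mean_de(b, data, c, p, h):
    # one-coefficient linear predictor: predicted demand in period t is b * x_t
    return sum(decision_error([b * xt for xt in x], d, c, p, h)
               for x, d in data) / len(data)

random.seed(0)
c, p, h = (300, 300), (20, 20), (10, 10)
data = [((10, 14), (25, 25)), ((12, 10), (30, 26)), ((8, 16), (20, 33))]
b, best = 0.0, mean_de(0.0, data, c, p, h)
for _ in range(300):                     # naive random local search
    cand = b + random.uniform(-1.0, 1.0)
    loss = mean_de(cand, data, c, p, h)
    if loss <= best:
        b, best = cand, loss
```

The feature data and the scalar-coefficient parameterization are invented for this toy run; the paper's model uses a full coefficient matrix B.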

B. Distributionally Robust Linear Model
To enhance the robustness and generalization of the prediction model, we extend the linear model to a distributionally robust one. In Section IV-B1, we formulate the distributionally robust linear prediction model. In Section IV-B2, we propose a cutting-plane algorithm to solve the robust model.

1) Distributionally Robust Formulation: In almost all cases, the estimated nominal distribution P̂_N differs from the true distribution P. Moreover, in production systems, because of the existence of measurement errors, the collected data are usually imperfect. At the same time, because consumer preferences change over time, the relationship between the demand d and the features x also changes. These factors prevent models that perform well on the training set from performing well in practice.
To enhance the robustness of the prediction models, we propose a Wasserstein distributionally robust model that considers ambiguous features x. The Wasserstein distance between two feature distributions Q and Q′ on R^p is defined as

W(Q, Q′) = inf_{π∈Π(Q,Q′)} E_π[||x − x′||],

where Π(Q, Q′) is the set of all joint distributions with marginals Q and Q′, and || • || is an arbitrary norm on R^p. We describe the uncertainty of x by a set of probability distributions. We construct the ambiguity set B_ϵ(P̂_N) = {Q ∈ P : W(Q, P̂_N) ≤ ϵ}, which is a ball of radius ϵ > 0 centered at the empirical distribution P̂_N with respect to the Wasserstein distance.
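For intuition about the distance underlying the ambiguity set: between two empirical distributions with equally many equally weighted atoms on the real line, the type-1 Wasserstein distance reduces to matching sorted samples. A minimal sketch of ours (not the paper's code):

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two empirical distributions with the same
    number of equally weighted atoms on R: the optimal transport plan
    matches sorted samples, so the distance is the mean absolute gap."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

A distribution Q then belongs to the ambiguity ball B_ϵ(P̂_N) exactly when its distance to the empirical distribution is at most ϵ.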
The prediction model is then determined by minimizing the objective value under the worst-case distribution in the ambiguity set, which leads to the following min-max optimization problem:

min_B sup_{Q∈B_ϵ(P̂_N)} E_Q[DE] (12)

By minimizing the worst-case decision error, we decrease the decision error for all distributions in the ambiguity set. Therefore, it is reasonable to believe that the solution of the distributionally robust model also performs well in the face of the above-mentioned problems. We demonstrate that problem (12) can be re-expressed in the following form.
Theorem 2: The robust model (12) is equivalent to model (13), where ||•||_* is the dual norm defined by ||z||_* = sup_{||ξ||≤1} z⊤ξ.

Interestingly, Theorem 2 asserts that the distributionally robust model (12) is equivalent to a regularized version of model (10). The model seeks a solution that balances the trade-off between the decision error and the feature ambiguity. The first term in the objective function (13a) is the average decision error, and the second term regularizes the ambiguity. Regularization-based models, such as the Lasso and ridge regression, are widely used in machine learning and perform reasonably well in practice. However, unlike traditional regularization, which imposes a direct penalty on the model coefficients, our model takes into account the objective of the two-stage lot-sizing model. In fact, the regularization term (13b) attempts to minimize the effect of the ambiguity of x on the objective value of the lot-sizing problem.
2) Cutting-Plane Method: Unfortunately, the above optimization problem is difficult to solve because of the inner maximization problem max_{w∈S} ||−αB⊤w + αB⊤w(d_e^n)||_* in the constraints. We were therefore motivated to develop a cutting-plane algorithm to solve the problem.
By enumerating all vertices w_v ∈ V(S), constraint (13b) can be replaced by constraint (14). Thus, we can eliminate the maximization problem by imposing these constraints. However, enumerating all vertices is computationally intractable, which led us to develop a cutting-plane algorithm to avoid this problem. We first ignore constraint (14) and solve the resulting optimization model. Then, we check whether the optimal solution satisfies the constraints; if it does, it is the global optimum; otherwise, we find the violated vertices and add the related constraints to the optimization model. These steps are repeated in a delayed constraint-generation fashion until the algorithm converges to the global optimal solution. Details are presented in Algorithm 1.
Algorithm 1 Cutting-Plane Algorithm
1: Initialization: Set OBJ = 0; ignore constraint (14) in problem (13).
2: Solve problem (13); obtain an optimal solution B, α, λ and the optimal objective value obj.
3: Set OBJ = obj.
4: for n = 1 to N do: add constraint (14) to problem (13) for each i with λ_i > λ.

Proposition 3 suggests that, for the L_1 and L_∞ norms, solving the distributionally robust model (13) can be transformed into solving a series of MIP problems. Thus, it is amenable to popular optimization solvers.
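The delayed constraint-generation pattern of Algorithm 1 can be illustrated on a toy minimax problem min_x max_i (a_i·x + b_i): a master problem works with a subset of the linear pieces, and an oracle adds the most violated one until no violation remains. This is an illustrative sketch of ours (a grid-search master standing in for the MIP solver), not the paper's algorithm.

```python
def solve_master(pieces, grid):
    # master: minimize the upper envelope of the current cut set on a grid
    return min(grid, key=lambda x: max(a * x + b for a, b in pieces))

def cutting_plane(all_pieces, grid, tol=1e-9):
    active = [all_pieces[0]]                   # start with one constraint
    while True:
        x = solve_master(active, grid)
        val = max(a * x + b for a, b in active)
        # oracle: find the most violated ignored constraint at x
        worst = max(all_pieces, key=lambda ab: ab[0] * x + ab[1])
        if worst[0] * x + worst[1] <= val + tol:
            return x, worst[0] * x + worst[1]  # all constraints satisfied
        active.append(worst)                   # add the violated cut
```

For the pieces max(x, −x, 0.5x + 0.25), the method converges to the true minimax point x = −1/6 with value 1/6 after adding only the cuts it needs.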

C. Extension to Complex Prediction Models
In this section, we extend our proposed method to decision-tree-based and neural-network-based prediction models.
1) CART Model: Tree-based methods are conceptually simple yet powerful and are widely used in practice. Below, we briefly explain how the decision error can be used to train a popular tree-based regression method known as Classification and Regression Trees (CART). CART uses a set of features and thresholds to divide the feature space X into a set of rectangles and fits a simple model (such as a constant) in each subspace. Under the traditional mean-square-error criterion, heuristics are used to iteratively generate the branches of the tree. That is, at each step we search for a splitting variable x_j and a splitting point s, where x_j is the j-th dimension of x. The feature space is split into two regions, R_1(j, s) = {x | x_j ≤ s} and R_2(j, s) = {x | x_j > s}.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Then, we find the optimal splitting variable x_j and optimal splitting point s by solving the following optimization problem:

min_{j,s} [ min_{c_1} Σ_{x_i ∈ R_1(j,s)} (d_i − c_1)² + min_{c_2} Σ_{x_i ∈ R_2(j,s)} (d_i − c_2)² ].

We extend CART to find the splitting variables and points that minimize the decision error rather than the square error, that is,

min_{j,s} [ min_{d̂_1} Σ_{x_i ∈ R_1(j,s)} DE(d̂_1, d_i) + min_{d̂_2} Σ_{x_i ∈ R_2(j,s)} DE(d̂_2, d_i) ].

Fortunately, the inner minimization problem can be solved efficiently.
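To make the split-selection step concrete, here is a minimal sketch of the exhaustive search over (j, s) with a pluggable per-region loss. The loss below is the classical squared-error criterion around the region mean; the paper's method would substitute the decision error DE instead. The helper names are ours, not the paper's.

```python
# Hedged sketch: exhaustive CART split search minimizing a pluggable
# per-region loss. Swap region_loss for a decision-error evaluation to
# obtain the criterion described above.

def region_loss(ds):
    """Squared error of a region around its mean (toy stand-in for DE)."""
    if not ds:
        return 0.0
    mean = sum(ds) / len(ds)
    return sum((d - mean) ** 2 for d in ds)

def best_split(X, d):
    """X: list of feature vectors, d: list of scalar targets."""
    p = len(X[0])
    best = None  # (loss, j, s)
    for j in range(p):
        for s in sorted({x[j] for x in X}):
            left = [d[i] for i, x in enumerate(X) if x[j] <= s]
            right = [d[i] for i, x in enumerate(X) if x[j] > s]
            loss = region_loss(left) + region_loss(right)
            if best is None or loss < best[0]:
                best = (loss, j, s)
    return best

X = [[0.1], [0.2], [0.8], [0.9]]
d = [10.0, 10.0, 50.0, 50.0]
loss, j, s = best_split(X, d)  # splits cleanly between 0.2 and 0.8
```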
Proposition 4: For a given splitting variable x_j and splitting point s, the inner minimization is solved by the average of the d_i in regions R_1 and R_2, respectively.

The supremum problem in the robust counterpart is computationally intractable due to the discreteness of the CART model. We therefore consider the following Lagrangian relaxation with penalty parameter γ. Because the Wasserstein distance W(Q, P̂_N) is continuous, W(Q, P̂_N) = 0 if and only if Q = P̂_N, and the decision loss is bounded, the solution of problem (18) is a good approximation of the optimal solution of problem (17) according to Theorem 1 of [45]. Taking the dual reformulation of the Lagrangian relaxation, we can obtain an efficient solution by iteratively updating the worst-case feature x (see Algorithm 3 for the full description).
Algorithm 2 Algorithm for Training a CART Model
1: In the feature space in which the training dataset is located, recursively divide each region into two subregions and decide the output value on each subregion to construct a binary decision tree.
2: for (j, s) where j ∈ {1, …, p} and s ∈ {candidate splitting points} do
3: Solve the following optimization problem by plugging in the region averages d̄_1 and d̄_2 (Proposition 4).
4: ⋮
5: Solve two deterministic lot-sizing problems to obtain w(d̄_1) and w(d̄_2).
6: Obtain the optimal objective value obj_{j,s}.
7: end for
8: Find the (j, s) pair that minimizes the objective obj_{j,s}.
9: Divide the region by the selected (j, s) and determine the corresponding output value; call the steps from line 2 to line 8 for the subregions until the stop condition is met.
10: Output the decision tree.

Figure 4 shows a simple example that illustrates the behavior of CART trained by the mean square error (C-M) versus that trained by the decision error (C-D) and its robust version (C-R). The parameter settings are similar to those in the example shown in Figure 3. The horizontal line represents the decision transition point: if d_2 > 30, Z = (1, 1); otherwise Z = (1, 0). The vertical lines, from left to right, represent the optimal splitting points of C-R, C-D, and C-M, respectively. As shown in Figure 4, C-M divides the feature space at the splitting point x = 0.55 because it divides the samples according to the Euclidean distance between demands. On the other hand, C-D identifies the correct decision transition point with the splitting point x = 0.35. C-R divides the feature space at the splitting point x = 0.325. This is because the cost increases caused by misclassifying the points on the left and the points on the right differ: misclassifying the points on the right leads to a greater cost increase, so the C-R model moves the splitting point slightly to the left to reduce the misclassification probability of those points.
2) Neural Network Model: Neural networks are popular predictive models. In this section, we present details of the expansion of the decision error to train a neural network model for solving lot-sizing problems. A simple example of a neural network is shown in Figure 5. The multilayer feedforward neural network accepts the external input x via the input layer, the hidden layers process the signal, and the predicted demand d̂ is output by the neurons in the output layer. At each node k of a given layer l + 1, the input value is β_k^{l+1} = Σ_j γ_j^l w_{j,k}^{l+1}. An activation function g_j^l(·) transforms the input value of a node into its output value; commonly used activation functions include ReLU, leaky ReLU, and the sigmoid. The neural network learns by adjusting the connection weights w_{j,k}^l between neurons according to the training data.
For given data, we train a neural network that minimizes the surrogate decision error (23). The following proposition provides the gradients of the loss function with respect to the weights of the network.
Proposition 5: The gradient of the surrogate loss function with respect to the weights of the network is expressed in terms of the derivative of the activation function and the hidden-layer outputs γ_h^2, and can be calculated by the chain rule.
Then, for a given learning rate η, we can update the weights iteratively to train the neural network. Details are shown in Algorithm 4. Considering that the data may be imperfect, we propose a robust neural network model, following the ideas in Section IV-B, to enhance the generalization performance of the network, where l(θ; (x, d)) denotes the loss of the neural network model given the model parameters θ and the training data (x, d).
We propose Algorithm 5 to iteratively solve the Lagrangian relaxation version of problem (20).

Algorithm 4 (continued)
6: Calculate the gradient according to Proposition 5.
7: Update the network parameters according to the gradient and the learning rate η.
8: end for
9: Call the steps from line 3 to line 8 until the stop condition is met.
10: Output the neural network.
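The forward-pass / loss / gradient-step pattern of Algorithm 4 can be illustrated with a deliberately tiny stand-in: a one-parameter linear predictor trained against an asymmetric, decision-style cost, using a finite-difference gradient in place of the closed-form gradient of Proposition 5. The loss and all names here are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch of the per-sample training loop in the spirit of
# Algorithm 4: forward pass, loss evaluation, gradient step.

def asymmetric_loss(pred, actual, over=1.0, under=3.0):
    """Toy decision-style cost: under-prediction is penalized more."""
    err = pred - actual
    return over * err if err >= 0 else -under * err

def train(data, lr=0.05, epochs=200, eps=1e-6):
    w = 0.0
    for _ in range(epochs):
        for x, dval in data:
            # Finite-difference stand-in for the gradient of Prop. 5.
            def loss_at(wv):
                return asymmetric_loss(wv * x, dval)
            grad = (loss_at(w + eps) - loss_at(w - eps)) / (2 * eps)
            w -= lr * grad  # gradient step with learning rate lr
    return w

data = [(1.0, 2.0), (2.0, 4.0)]
w = train(data)  # oscillates around the ground-truth slope w = 2
```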

V. COMPUTATIONAL STUDY
In this section, we present the computational results of experiments on synthetic data, which we used to evaluate the performance of the prediction models proposed in Section IV.

A. Instance Generation and Experimental Setup
We randomly generated instances of the lot-sizing problem. The parameters used to generate the instances were similar to those employed by [16]; that is, the production-to-holding cost ratio was p/h̄ = 2 and the setup-to-holding cost ratio was c/h̄ = 200. The holding cost h_t in each period was drawn from the uniform distribution U[0, 10]. The production cost p_t and the setup cost c_t were drawn from the corresponding uniform distributions, respectively, where h̄ = Σ_t h_t / T.
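The instance generator above can be sketched as follows. The text fixes h_t ~ U[0, 10] and the ratios p/h̄ = 2 and c/h̄ = 200, but does not fully specify the distributions of p_t and c_t, so we draw them uniformly around the target means as an assumption; the ±10% spread is ours.

```python
# Hedged sketch of the instance generator; the p_t and c_t spreads are
# assumptions, only the ratios to the mean holding cost come from the text.
import random

def generate_instance(T, seed=0):
    rng = random.Random(seed)
    h = [rng.uniform(0, 10) for _ in range(T)]   # h_t ~ U[0, 10]
    h_bar = sum(h) / T                            # mean holding cost
    # assumed: uniform in [0.9, 1.1] times the target mean p/h_bar = 2
    p = [2 * h_bar * rng.uniform(0.9, 1.1) for _ in range(T)]
    # assumed: uniform in [0.9, 1.1] times the target mean c/h_bar = 200
    c = [200 * h_bar * rng.uniform(0.9, 1.1) for _ in range(T)]
    return h, p, c

h, p, c = generate_instance(T=5)
```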
We generated three sets of instances: the instances in set 1 involved a short planning horizon (T = 5), those in set 2 a medium planning horizon (T = 10), and those in set 3 a long planning horizon (T = 15).
The numerical experiments were conducted on a computer with an Intel Core i7 3.40 GHz processor and 48 GB of RAM running Windows 10. The proposed linear and CART prediction models were coded in Python 3.9 and solved by Gurobi 10.0.0 with the default settings. The corresponding neural network models were coded in PyTorch 1.13.1.

B. Model Performance

1) Comparison of the Proposed Models: In this section, we describe our evaluation of the efficiency of the nine proposed models: L-M, L-D, L-R, C-M, C-D, C-R, N-M, N-D, and N-R.

Algorithm 5 (continued)
6: Calculate the gradient according to Proposition 5.
7: Update the network parameters according to the gradient and the learning rate η_1.
8: Calculate the gradient ∇_x l(θ; (x, d)).
⋮
11: end if
12: end for
13: Call the steps from line 3 to line 12 until the stop condition is met.
14: Output the neural network.

We employed a polynomial kernel function [46] to generate the demand d according to the following process. First, we generated a random matrix B ∈ R^{T×a}, where T is the number of periods and a is the feature dimensionality; each entry of B is a Bernoulli random variable equal to 1 with probability 0.5. Then, the feature vector x ∈ R^a was generated from the uniform distribution U[0, 1]. Finally, the demand vector d was generated according to d = d_base · (Bx)^deg. Here, d_base is a positive integer representing the baseline demand, and we set d_base = 200; deg is a positive integer representing the degree of the polynomial function that generates the actual demand.
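The demand-generation process just described can be sketched directly; function and variable names are our own conventions.

```python
# Hedged sketch of the demand generator: B is a T x a Bernoulli(0.5)
# matrix, x ~ U[0, 1]^a, and d = d_base * (Bx)^deg elementwise.
import random

def generate_demand(T, a, deg, d_base=200, seed=1):
    rng = random.Random(seed)
    B = [[1 if rng.random() < 0.5 else 0 for _ in range(a)]
         for _ in range(T)]                        # Bernoulli(0.5) entries
    x = [rng.uniform(0, 1) for _ in range(a)]      # features in [0, 1]
    Bx = [sum(B[t][j] * x[j] for j in range(a)) for t in range(T)]
    d = [d_base * (v ** deg) for v in Bx]          # elementwise power
    return x, d

x, d = generate_demand(T=5, a=3, deg=2)
```

Larger values of `deg` amplify the nonlinearity of the mapping from x to d, which is exactly the misspecification knob discussed above.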
Note that the parameter deg controls the degree of model misspecification for linear models. As the value of deg increases, the demand becomes increasingly nonlinear and fluctuates more violently, and the performance of the prediction model deteriorates. Controlling the value of deg therefore enabled us to test the robustness of the models.
In addition, we added noise to the features of the training and test sets to simulate contaminated and inaccurate information in the data, such as measurement error. The noise had a normal distribution with a mean value of 0 and a standard deviation of 0.

For the linear models, the training of model L-M is equivalent to solving a least-squares regression problem, which can be solved explicitly using the linear algebra module numpy.linalg in Python. The parameters of models L-D and L-R were obtained by solving problems (11) and (13) with the solver Gurobi. For the CART models, the splitting nodes and thresholds were calculated iteratively to generate the decision trees. For model C-M, the splitting criterion was the mean square error, whereas for models C-D and C-R, the criterion was the decision error. Further, for model C-R, we added an additional step to update the features. For the neural network models, we built our models with PyTorch. Based on the results of a pre-experiment, we adopted a four-layer network structure with (a, 5, 5, T) neurons in the respective layers (Figure 5). All of the neural networks employed the leaky ReLU activation function, leakyReLU(x) = max(0, x) + leaky · min(0, x) with leaky = 0.01. We used the adaptive moment estimation (Adam) optimizer to optimize the weights of the network, with a learning rate of 1 and a batch size of 32.
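The leaky ReLU activation used in the experiments follows directly from its definition:

```python
# leakyReLU(x) = max(0, x) + leaky * min(0, x), with leaky = 0.01,
# exactly as defined in the experimental setup above.
def leaky_relu(x, leaky=0.01):
    return max(0.0, x) + leaky * min(0.0, x)

pos = leaky_relu(2.0)   # positive inputs pass through unchanged: 2.0
neg = leaky_relu(-1.0)  # negative inputs are scaled by 0.01: -0.01
```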
The results for the instances with T = 5 are reported in Figure 6. For the linear models, the performance of the L-R model is vastly superior to that of the L-M model in most cases. The average optimality ratio of the L-R model is 1.013, whereas the average optimality ratios of the L-D and L-M models are 1.036 and 1.071, respectively. An interesting result is that for (deg, n) ∈ {(1, 25), (1, 50), (1, 100), (2, 25)}, the L-M model slightly dominates, which is consistent with the findings of Elmachtoub and Grigas [42]. The slight dominance of the L-M model in these cases can be explained by the prior knowledge of the model: when deg = 1 or 2, the ground-truth demand is approximately linear, in which case the parameters of the L-M model are almost the best estimators. As the parameter deg increases, the performance of the L-M model gradually worsens. On the other hand, the performance of the models trained by the decision error is relatively stable and does not deteriorate significantly as the model misspecification increases. This indicates that models trained by the decision error mitigate the impact of model misspecification on prediction and are more robust to inaccurate information because they leverage the additional information from the lot-sizing problems.
Interestingly, all the CART models perform comparably, with the C-R model slightly outperforming the others. This is reasonable because, compared with the other two types of models (the linear and neural network models), the behavior of the CART models more closely resembles that of classification models. The output of the CART models is determined by a set of discrete splitting points, which makes the output relatively robust to the input: as long as the input does not cross the threshold of a splitting point, the output value does not change.
For the neural network models, when the sample size is relatively small (n = 25), the optimality ratios of N-D and N-R are less than 1.05, whereas that of N-M is near 1.10, and the performance of N-D and N-R is more stable. This shows that the decision error can more effectively guide the training of neural networks and increase training stability. An increase in the sample size gradually narrows the gap between the N-M model and the models trained by the decision error (N-D and N-R), because neural network models can approximate any continuous function as long as sufficient data are available.
Overall, the results show that models trained by the decision error perform more stably and effectively than models trained by the mean square error when model misspecification exists. By leveraging the information of the downstream lot-sizing problems, the models trained by the decision error mitigate the influence of inaccurate information caused by model misspecification.
2) Comparison of the Running Time: In this section, we report the running times of the proposed models in Table I. The running times of L-M, C-M, and N-M do not increase significantly as the sample size increases, which is plausible because the computation of the mean square error in Python is highly efficient. The running times of the models trained with the decision error increase linearly with the scale of the problem, because training these models requires iteratively calling the solver (Gurobi) to solve the relevant optimization problems; the training time therefore increases linearly with the number of samples. Fortunately, once the prediction model has been trained, it is usually suitable for use for a period of time in the future. In addition, for models such as neural networks, when new data are observed, we only need to continue training the original model, which does not significantly increase the training time. For large-scale problems, parallel computing can be used to accelerate model training, because the structure of the optimization model is the same across samples. Therefore, the computing time is not expected to become a bottleneck in most cases.
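The parallelization remark can be sketched with Python's standard `concurrent.futures`: because each training sample requires solving an independent lot-sizing subproblem, the per-sample evaluations can be distributed across workers. Here `solve_subproblem` is a trivial placeholder for the per-sample Gurobi solve.

```python
# Hedged sketch: fan out independent per-sample subproblem solves.
# solve_subproblem is a stand-in; in the paper's setting it would call
# Gurobi on one deterministic lot-sizing instance.
from concurrent.futures import ThreadPoolExecutor

def solve_subproblem(sample):
    x, d = sample
    return x * d  # placeholder for a per-sample optimal cost

samples = [(1, 10), (2, 20), (3, 30)]
with ThreadPoolExecutor(max_workers=4) as pool:
    losses = list(pool.map(solve_subproblem, samples))
total_loss = sum(losses)
```

For CPU-bound solver calls, a process pool (or the solver's own thread controls) would typically replace the thread pool; the fan-out structure is the same.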
3) Sensitivity Analysis: Sensitivity analyses of the random noise, the sample size n, the parameter deg, and the planning horizon T are presented in this section. The results for the CART and neural network models are similar to those for the linear models; due to space limitations, only the results for the linear models are shown. Figures 7 and 8 display the results.

TABLE I COMPARISON RESULTS OF THE RUNNING TIME
Note that, to simulate contaminated and inaccurate information in the data, such as sensor measurement errors, we added a certain amount of noise to the features of both the training and test sets. In this section, we varied the standard deviation of the noise from 0.1 to 0.5. Figure 7 reports the mean (lines) and the 10th and 90th quantiles (shaded regions) of the optimality ratio. As the amount of noise increases, the mean optimality gap of the L-M model increases from 1.023 to 1.088, and that of the L-D model increases from 1.002 to 1.033, whereas that of the L-R model barely increases. These tendencies suggest that the robust model is highly stable under random noise.
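The noise-injection step itself is straightforward; a minimal sketch (function name and seeding are our own conventions):

```python
# Hedged sketch: add zero-mean Gaussian noise with standard deviation
# sigma to every feature entry, as in the sensitivity experiments.
import random

def add_feature_noise(X, sigma, seed=0):
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in X]

clean = [[0.2, 0.7], [0.5, 0.1]]
noisy = add_feature_noise(clean, sigma=0.3)
```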
Our analysis of the sample size revealed that the L-R model can perform well with only a small amount of training data. Its optimality ratio reaches 1.027 with only 25 training samples, whereas the L-D model requires 50-100 training samples to achieve the same optimality ratio, and the optimality ratio of the L-M model only reaches 1.034 with 500 training samples. The analysis of deg indicated that the optimality ratio of the L-M model increases as deg increases, whereas for the models trained by the decision error, the optimality ratio first increases and then decreases. This can be explained by the training objective of the models: the models trained with the decision error aim to find the decision transition point that minimizes this error (see Figure 4). As the parameter deg grows, demand polarization becomes more pronounced (i.e., some periods have high demand and others have low demand), which helps the models find the transition point.
The analysis of the planning horizon indicated that the models L-D and L-R are more robust to an increasing number of periods. When the planning horizon is extended from 5 to 15, the optimality ratio of the L-M model increases by 0.018, whereas those of the L-D and L-R models only increase by 0.05 and 0.02, respectively.
In summary, our results show that models trained with the decision error outperform those trained with the prediction error when a certain degree of model misspecification occurs. Robust models can effectively accommodate training data containing inaccurate or contaminated information, such as measurement errors. Moreover, robust models require only a small number of training samples to achieve good prediction results.

VI. CONCLUSION
In this study, we investigated the possibility of using the decision error to train demand prediction models for solving two-stage lot-sizing problems. We proposed a linear prediction model that minimizes the decision error by leveraging the structure of the lot-sizing problems; this approach links the prediction and optimization models more closely. We also extended our model in several directions. We extended the use of the decision error to CART models and proposed a new criterion for branching. For neural network models, we revised the loss function and the subgradient used to update the parameters. In addition, we extended the models to distributionally robust versions and proposed corresponding algorithms to solve them iteratively. Extensive numerical experiments were conducted to evaluate the performance of the proposed prediction models. Generally, the results showed that the prediction models utilizing the structure of the lot-sizing problems outperformed the MSE-based models in the case of model misspecification, and the robust extension of the models allows them to train well on imperfect datasets while requiring less training data. This research could be extended in the following directions. First, it would be interesting to extend the decision error criterion to other operations research problems, such as inventory management and vehicle routing problems. Second, additional factors affecting lot-sizing problems, such as capacity constraints, lead times, and rolling planning, may also yield promising results. Third, it would be of interest to consider solving the distributionally robust model under arbitrary norms. Finally, extensions to other popular machine learning models, such as the support vector machine (SVM), may also yield promising results.
X*_{ij}(d^e) are the optimal solutions for a given d^e. Let w_0(d^e) = Σ_{j=1}^T c_j Z*_j(d^e), w_j(d^e) = Σ_{i=1}^j X*_{ij}(d^e)(p_i + h_{i,j−1}), and w(d^e) = (w_0(d^e), w_1(d^e), …, w_T(d^e))^⊤. Then the decision error (DE) is:

Fig. 3. Geometric illustration of two types of error.

1) Model Formulation: Although a set of training samples ξ^n = (x^n, d^n), n ∈ {1, 2, …, N}, is observed, the distribution P that governs the data generation is unobservable. Consider that the goal of the decision maker is to find a solution that minimizes the objective under the real distribution P. In this situation, the empirical distribution P̂_N = (1/N) Σ_{n=1}^N δ_{ξ^n}, estimated from the N training samples ξ^n, is used as a surrogate for the unknown distribution P. Here, δ_{ξ^n} denotes the Dirac point mass at the training sample ξ^n.

Algorithm 1 (continued)
5: Check optimality by solving the problem max_{w∈S} ||−αB^⊤w + αB^⊤w(d_e^n)||_* for the given B and α, and obtain the optimal objective value λ_n.
6: end for
7: if λ ≥ λ_n for n = 1 to N then
8: STOP. Output the optimal value OBJ and the optimal solution (B, α, λ).
9: else
10: Add constraint (14) to problem (13) for each n with λ_n > λ, and return to line 2.

Proposition 3: The problem max_{w∈S} ||−αB^⊤w + αB^⊤w(d_e^n)||_* in line 5 of Algorithm 1 is NP-hard. For the L_1 and L_∞ norms, it can be converted into a mixed-integer programming (MIP) problem.

Algorithm 3 Algorithm for Training a Robust CART Model
1: In the feature space in which the training dataset is located, recursively divide each region into two subregions and decide the output value on each subregion to construct a binary decision tree.
2: for k = 1, …, K do
3: ⋮
18: end for
19: Call the steps from line 2 to line 18 for the subregions until the stop condition is met.
20: Output the decision tree.

Algorithm 4 Algorithm for Training a Neural Network Model
1: Input: training data D = {(x_1, d_1), …, (x_n, d_n)} and learning rate η.
2: Randomly initialize the network parameters.
3: for (x, d) ∈ D do
4: Calculate the predicted demand d̂ by forward propagation.
5: Solve the following optimization model to calculate the decision error: max_{w∈S} {w^⊤ d_e − α w^⊤ d̂_e} + α w(d_e)^⊤ d̂_e − w(d_e)^⊤ d_e.
6: ⋮

1) L-M: the linear regression model trained by the mean square error.
2) L-D: the linear regression model trained by the decision error (model (11)).
3) L-R: the distributionally robust linear regression model trained by the decision error (model (13)).
4) C-M: the classification and regression tree trained by the mean square error (model (15)).
5) C-D: the classification and regression tree trained by the decision error (model (16)).
6) C-R: the distributionally robust classification and regression tree trained by the decision error (model (18)).
7) N-M: the neural network model trained by the mean square error.
8) N-D: the neural network model trained by the decision error.
9) N-R: the distributionally robust neural network model trained by the decision error (model (21)).

The size of the training set, n ∈ {5, 25, 50, 100, 500}, and the value of the parameter deg ∈ {1, 2, 4, 8} were varied. The performance of the trained models was evaluated using the criterion Optimality Ratio = cost_predict / cost_opt on a test set of 100 samples.
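The evaluation criterion is a direct ratio of costs and can be computed as follows (function names are our own conventions):

```python
# Optimality ratio = cost under the predicted-demand decisions divided
# by the cost of the hindsight-optimal decisions; averaged over a test set.

def optimality_ratio(cost_predict, cost_opt):
    return cost_predict / cost_opt

def mean_optimality_ratio(pairs):
    """pairs: iterable of (cost_predict, cost_opt) for each test sample."""
    pairs = list(pairs)
    return sum(optimality_ratio(cp, co) for cp, co in pairs) / len(pairs)

avg = mean_optimality_ratio([(110.0, 100.0), (105.0, 100.0)])
```

A ratio of 1 means the predicted demand induced decisions as cheap as the hindsight-optimal ones; larger values quantify the extra production cost caused by prediction.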
c_t  Fixed setup cost in period t ∈ {1, …, T}
p_t  Unit production cost in period t ∈ {1, …, T}
h_t  Unit inventory cost in period t ∈ {1, …, T}
h_{i,j}  Unit cost of holding one unit of inventory from period i to period j + 1, h_{i,j} = Σ_{t=i}^{j} h_t
Z_t  Binary variable; indicator of production in period t (= 1) or not (= 0)
X_t  Continuous variable; production quantity in period t ∈ {1, …, T}
I_t  Continuous variable; inventory level at the end of period t
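With this notation, the deterministic uncapacitated lot-sizing subproblems that appear throughout the paper admit the classic Wagner-Whitin dynamic program. The sketch below is the standard textbook formulation, not code from the paper, and assumes zero initial inventory and no backlogging: producing in period i to cover the demand of period j costs (p_i + h_{i,j−1}) per unit.

```python
# Hedged sketch: Wagner-Whitin DP for the deterministic uncapacitated
# lot-sizing problem with setup costs c, unit production costs p,
# unit holding costs h, and demands d (0-based periods).

def lot_sizing_cost(c, p, h, d):
    """Minimum total setup + production + holding cost for demands d."""
    T = len(d)
    best = [0.0] + [float("inf")] * T  # best[t]: cost of periods 0..t-1
    for i in range(T):                 # candidate production period
        cost = c[i]                    # setup cost paid once in period i
        for j in range(i, T):          # serve demand of period j from i
            # holding d[j] from period i to j costs sum(h[i:j]) per unit
            cost += (p[i] + sum(h[i:j])) * d[j]
            best[j + 1] = min(best[j + 1], best[i] + cost)
    return best[T]

# With a large setup cost, one batch in period 1 beats two setups:
# 100 + 1*10 + (1 + 1)*10 = 130 versus 110 + 110 = 220.
total = lot_sizing_cost(c=[100, 100], p=[1, 1], h=[1, 1], d=[10, 10])
```

The O(T²) double loop suffices for the horizons used in the experiments (T ≤ 15); the same recursion underlies the w(d^e) vectors evaluated during training.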