Multi-task Gaussian process upper confidence bound for hyperparameter tuning and its application for simulation studies of additive manufacturing

Abstract In many scientific and engineering applications, Bayesian Optimization (BO) is a powerful tool for hyperparameter tuning of a machine learning model, materials design and discovery, etc. Multi-task BO is a general method to efficiently optimize multiple different, but correlated, “black-box” functions. The objective of this work is to develop an algorithm for multi-task BO with automatic task selection so that only one task evaluation is needed per query round. Specifically, a new algorithm, namely, Multi-Task Gaussian Process Upper Confidence Bound (MT-GPUCB), is proposed to achieve this objective. The MT-GPUCB is a two-step algorithm, where the first step chooses which query point to evaluate, and the second step automatically selects the most informative task to evaluate. Under the bandit setting, a theoretical analysis is provided to show that our proposed MT-GPUCB is no-regret under some mild conditions. Our proposed algorithm is verified experimentally on a range of synthetic functions. In addition, our algorithm is applied to Additive Manufacturing simulation software, namely, Flow-3D Weld, to determine material property values, ensuring the quality of simulation output. The results clearly show the advantages of our query strategy for both design point and task.


Introduction
In machine learning, the proper setting of hyperparameters in the algorithms (for example, regularization weights, learning rates, etc.) is crucial for achieving satisfying performance.
A poor setting of hyperparameters may result in a useless model even when the model structure is correct. In materials design and discovery, how to choose the chemical structure, composition, or processing conditions of a material to meet design criteria is a key problem.
There are many other examples of design problems in advertising, healthcare informatics, manufacturing, and so on. Any significant advance in automated design can result in immediate product improvements and innovation across a wide range of domains.
Bayesian optimization (BO) Jones et al. (1998); Shahriari et al. (2015) has emerged as a powerful tool for these various design problems. Fundamentally, it is a general method to efficiently optimize "black-box" functions for which only weak prior knowledge is available, typically characterized by expensive and noisy function evaluations, a lack of gradient information, and high levels of non-convexity. BO is impacting a wide range of areas, including combinatorial optimization Williams et al. (2000); Hutter et al. (2011), automatic machine learning Bergstra et al. (2011); Snoek et al. (2012), material design Frazier and Wang (2016), and reinforcement learning Brochu et al. (2010).
Bayesian optimization is a sequential model-based approach to solving the "black-box" optimization problem. For a given task, the method iterates over the following steps until the available computational budget is exhausted: 1) a set of evaluated points is used to learn a probabilistic regression model p(f) of the objective function f [typically in the form of a Gaussian process (GP) Rasmussen (2003)]; 2) p(f) is used to induce a proper acquisition function that leverages the uncertainty in the posterior to trade off exploration and exploitation; 3) the acquisition function is optimized to determine the next query point to be evaluated; and 4) the regression data set in 1) is augmented with the newly evaluated point.
Different from the single-task Bayesian optimization introduced above, multi-task Bayesian optimization (MTBO) Swersky et al. (2013) is a general method to efficiently optimize multiple different but correlated "black-box" functions. The settings for multi-task Bayesian optimization widely exist in many real-world applications. For example, K-fold cross-validation Bengio and Grandvalet (2004) is a widely used technique to estimate the generalization error of a machine learning model for a given set of hyperparameters. However, it needs to retrain a model K times using all K training-validation splits. The validation errors of a model trained on K different training-validation splits can be treated as K "black-box" functions, which need to be minimized as K different tasks. These K tasks will be highly correlated since the data are randomly partitioned among the K training-validation splits. The performance of our proposed method in the application of fast cross-validation Swersky et al. (2013); Moss et al. (2020) is presented in Section 6.1, which aims at minimizing the average validation error in K-fold cross-validation.
Another motivating example comes from additive manufacturing. A material scientist might want to find the raw material properties (for example, thermal conductivity, laser absorptivity, surface tension, etc.), which are difficult to measure at high temperatures Yan et al. (2020). These raw material property values are the input of the Computational Fluid Dynamics (CFD) software Flow Science (2019) to simulate the melt pool dynamics for different printing conditions. While the printing conditions (for example, laser power and dwell time) can be varied for a given type of material, the raw material properties are generally provided in the form of a suggested range, not specific values. Appropriate setting of these raw material properties is critical to the quality of the simulation output. For one specific printing setting, the task is to determine raw material property values that ensure simulation quality. In the case of multiple printing settings, there are multiple tasks that need to be optimized. The performance of our proposed method in the application of raw material properties determination for multiple printing settings is presented in Section 6.2.
Gaussian process upper confidence bound (GP-UCB) Srinivas et al. (2012) is one of the popular and efficient algorithms in Bayesian optimization, using confidence bounds to deal with the exploitation-exploration trade-off. GP-UCB cannot be directly applied to the multi-task case, where one specific printing condition can be treated as one single task.
Based on GP-UCB, Dai et al. (2020) developed a method to optimize each task separately, where the objective functions for all tasks are modeled by a multi-output Gaussian process Williams et al. (2007). However, there are two drawbacks to this strategy: first, it is not sample-efficient since every task must be evaluated independently; second, for different tasks, the algorithm may find different values for the same raw material properties, which makes the results difficult to use.
As an alternative, MTBO Swersky et al. (2013) aims to optimize all tasks simultaneously and makes use of the shared information and structure to speed up the optimization process. Despite MTBO's well-known efficiency, there is still room for improvement to better address the needs of the applications of cross-validation and raw material properties determination identified above. Specifically, it is neither effective nor efficient for the MTBO algorithm to query a point to be evaluated for all tasks simultaneously in each round, since these tasks are correlated. Therefore, fully utilizing the correlations among tasks may provide room to further improve the accuracy of the MTBO algorithm.
The objective of this work is to develop an algorithm for multi-task Bayesian optimization with automatic task selection so that only one task evaluation is needed per query round. To achieve this objective, a new algorithm, namely, multi-task Gaussian process upper confidence bound (MT-GPUCB), is proposed. The MT-GPUCB is a two-step algorithm, where the first step chooses which query point to evaluate, and the second step automatically selects the most informative task to evaluate. The contributions of this paper are summarized as follows:
• Multi-task Gaussian process upper confidence bound is proposed to provide the query strategy for both design point and task in each round to balance exploration-exploitation across tasks.
• Under some mild conditions, multi-task Gaussian process upper confidence bound is proved to be a no-regret learning algorithm. That is, the algorithm will eventually converge to the optimal solution.
Remark 1 (Discussion on mean/integrated response BO). The strategy of our method to select a query point and task in each round improves on MTBO Swersky et al. (2013): our method selects tasks by considering the task correlations, whereas MTBO does not. It is conceptually similar to the mean/integrated response BO proposed by Williams et al. (2000); Janusevskis and Le Riche (2013); Toscano-Palmerin and Frazier (2018); Tan (2020). These reference papers consider optimization problems (where the objective function is the sum or integral of an output) that depend on control and environmental variables. Those control and environmental variables are analogous to query points and tasks in our proposed method. The objective of the mean/integrated response BO is to find the optimal control variable under all environmental variables. Therefore, mean/integrated response BO can be considered a form of multi-task Bayesian optimization. However, there are distinct differences between our method and the above references in terms of modeling the relationship between tasks.
The mean/integrated response BO methods assume that the environmental variables (tasks) independently follow a specific distribution. As a result, their performance deteriorates when a small number of tasks (≤ 10, for example) are highly correlated, because the correlation between tasks is not fully utilized when they are modeled independently. In our paper, there is no distributional assumption on the tasks. Instead, our method uses a multi-task Gaussian process to model the correlation between tasks through a task correlation matrix, where the task correlation is better handled. Besides, the acquisition function in our method is more intuitive and simpler to implement without involving the computationally expensive integration used by the above methods. In Toscano-Palmerin and Frazier (2018), the experimental results show that one of our benchmarks, namely, MTBO Swersky et al. (2013), had competitive performance with their proposed method when the number of tasks is small (m ≤ 10). In our experiments, our method performs much better than MTBO when the number of tasks is small (m ≤ 10). Due to this observation and implementation issues, the proposed method in Toscano-Palmerin and Frazier (2018) is not compared in our paper.
Despite the difference in modeling, it is still necessary to compare with the state of the art in mean/integrated response BO. EIQ from Tan (2020) is used as one of the benchmarks in Sections 5 and 6 since it is the most representative and effective method in the literature. In our paper, the case of a small number of tasks is the focus of our algorithm, where all experiments have fewer than ten tasks. For all experiments, our algorithm exhibits the best performance among all benchmarks because it can exploit the task correlation explicitly.
The remainder of this paper is organized as follows. A brief review of the theoretical foundations of single-task Bayesian optimization is provided in Section 2. The proposed multi-task Gaussian process upper confidence bound algorithm is introduced in Section 3. The regret analysis of our proposed algorithm is provided in Section 4, followed by the numerical studies and actual case studies in Sections 5 and 6 for testing and validation of the proposed algorithm. Finally, the conclusions and future work are discussed in Section 7.

Theoretical Foundations of Bayesian Optimization
To begin with, single-task Bayesian optimization (BO) Snoek et al. (2012) considers the problem of sequentially optimizing an unknown function f : X → R. The goal is to find

x* = arg max_{x∈X} f(x)    (1)

as quickly as possible. To model the "black-box" function f, a Gaussian process assumption on f is introduced, which is a Bayesian statistical approach for modeling functions. A standard Gaussian process (GP) Rasmussen (2003) is a stochastic nonparametric approach for regression that extends the concept of multivariate Gaussian distributions to infinite dimensions. To enforce implicit properties like smoothness without relying on any parametric assumptions, a GP is used to model f, written as GP(µ(x), k(x, x′)), which is completely specified by its mean and covariance functions, where µ(x) = E[f(x)] represents the mean function, often set to 0, and k(x, x′) = E[(f(x) − µ(x))(f(x′) − µ(x′))] is the covariance (kernel) function. The kernel function k encodes smoothness properties of sample functions f drawn from the GP, and is the most critical component of a GP. Finite-dimensional linear, squared exponential, and Matérn kernels are common choices for the kernel function Williams and Rasmussen (2006). Throughout this paper, we further assume bounded variance by restricting k(x, x) ≤ 1, for all x ∈ X.
A major computational benefit of working with GPs is that posterior inference can be performed in closed form. Given collected observations y_T = [y_1, . . ., y_T]^T at design inputs A_T = {x_1, . . ., x_T}, the posterior over f is again a GP, with mean µ_T(x), covariance k_T(x, x′), and variance σ²_T(x):

µ_T(x) = k_T(x)^T (K_T + σ²I)^{-1} y_T,
k_T(x, x′) = k(x, x′) − k_T(x)^T (K_T + σ²I)^{-1} k_T(x′),
σ²_T(x) = k_T(x, x),    (2)

where k_T(x) = [k(x_1, x), . . ., k(x_T, x)]^T and K_T = [k(x, x′)]_{x,x′∈A_T} is the kernel matrix.
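For concreteness, the closed-form posterior update in (2) can be sketched as follows. This is a minimal NumPy illustration (not the paper's Matlab implementation), assuming a squared-exponential kernel with k(x, x) = 1 and a zero prior mean:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between row-stacked inputs A and B; k(x, x) = 1."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * sq / length_scale**2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.01, length_scale=1.0):
    """Closed-form GP posterior mean and variance at X_test (zero prior mean)."""
    K = rbf_kernel(X_train, X_train, length_scale)       # kernel matrix K_T
    k_star = rbf_kernel(X_train, X_test, length_scale)   # cross-covariances k_T(x)
    # Cholesky factorization for a stable solve of (K_T + sigma^2 I)^{-1}
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = k_star.T @ alpha                                # posterior mean
    v = np.linalg.solve(L, k_star)
    var = 1.0 - np.sum(v**2, axis=0)                     # posterior variance (k(x,x) = 1)
    return mu, var
```

At a training input the posterior mean collapses onto the observation and the variance shrinks toward the noise level, while far from all data the variance returns to the prior value of 1.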

Multi-Task Gaussian Process Upper Confidence Bound
In our setting of multi-task Bayesian optimization, we can define a "black-box" function over the composite set X × Z, namely, f : X × Z → R, where X ⊂ R^d is the design space and Z = {1, . . ., M} is the set of task indices. Assume that the (noisy) observation y_t = f(x_t, m_t) + ε_t is the result of querying point x_t on task m_t, where ε_t is independent Gaussian noise following N(0, σ²).
Our objective is to develop an algorithm that contains: Step 1, choose the query point x_t ∈ X that benefits all tasks; Step 2, select the most informative task (i.e., m_t ∈ Z) to evaluate for better sample efficiency.
To model the correlations between tasks, our underlying Gaussian process model over f must be extended across the task space. By defining a kernel over X × Z, the posterior over f can be similarly calculated through (2). Although increasing the dimension of the kernel for X to incorporate Z provides a very flexible model, it is argued by Kandasamy et al. (2017) that overly flexible models can harm optimization speed by requiring too much learning, restricting the sharing of information across the task space. Therefore, it is common to use more restrictive separable kernels that better model specific aspects of the given problem. A common kernel for multi-task spaces is the intrinsic coregionalization kernel of Álvarez and Lawrence (2011). This kernel defines a covariance between design-parameter and task pairs of

k_multi((x, m), (x′, m′)) = k_X(x, x′) ⊗ k_Z(m, m′),

where ⊗ denotes the Kronecker product, k_X measures the relationship between inputs, and k_Z measures the relationship between tasks, allowing the sharing of information across the task space. Once the composite kernel k_multi is determined, the prediction of f over the composite set X × Z follows (2).
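On a discrete grid of n points and M tasks, the separable construction amounts to a Kronecker product of the two kernel matrices. A minimal sketch, assuming K_X and K_Z have already been evaluated (names illustrative):

```python
import numpy as np

def icm_kernel(K_X, K_Z):
    """Intrinsic coregionalization kernel over the composite space X x Z:
    every (point, task) pair covaries as k_X(x, x') * k_Z(m, m').
    Returns an (M*n, M*n) matrix in task-major ordering."""
    return np.kron(K_Z, K_X)
```

For example, with two points and two tasks, the entry coupling point 1 on task 1 with point 1 on task 2 equals k_Z(1, 2) · k_X(x_1, x_1), so information observed on one task directly updates the posterior on the other.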
The following assumption is made based on our applications.
Assumption 1. x* = arg max_{x∈X} f(x, m) for all tasks m. That is, f(x, m) has the same set of optimal solutions (not necessarily unique) for all different tasks. However, the tasks may have very different global trends.
Assumption 1 is valid for our application of raw material properties determination. Different printing settings represent different tasks (m = 1, . . ., M); however, they share the same raw material properties (x*) that need to be identified. Our objective is to find the common optimal solution shared by the different tasks (3) using as few samples as possible.
In the setting of single-task Bayesian optimization, the Bayesian experimental design rule used in Chaloner and Verdinelli (1995),

x_t = arg max_{x∈X} σ_{t−1}(x),    (4)

can be wasteful since it aims at decreasing uncertainty globally, not just searching where the maxima might be. Another approach to (1) is to pick points as x_t = arg max_{x∈X} µ_{t−1}(x), maximizing the expected reward based on the posterior so far. However, this pure exploitation rule is too greedy and tends to get stuck in shallow local optima. Instead, GP-UCB Srinivas et al. (2012) is a combined strategy that chooses

x_t = arg max_{x∈X} µ_{t−1}(x) + √β_t σ_{t−1}(x),    (5)

where β_t are appropriate constants, and µ_{t−1}(x) and σ_{t−1}(x) can be calculated from the GP posterior defined in (2). This objective prefers both points x where f is uncertain (large σ_{t−1}(·), exploration) and points where we expect to achieve high rewards (large µ_{t−1}(·), exploitation), since it implicitly negotiates the exploration-exploitation tradeoff.
However, GP-UCB is not ready to be used for multi-task Bayesian optimization since there is only one task in its objective. In our proposed multi-task Bayesian optimization, we would like an algorithm that automatically selects a task to evaluate for a given point to speed up the optimization, because it is not necessary to evaluate all tasks for a given query point due to the correlations among tasks. Motivated by GP-UCB and Bayesian experimental design, our proposed MT-GPUCB (Algorithm 1) provides (1) a query strategy for the design point and (2) automatic task selection performed at the query point selected in (1).

Algorithm 1: MT-GPUCB
Input: input spaces X, Z; GP prior µ_0 = 0, σ_0, k_X and k_Z
1 for t = 1, 2, . . . do
2     Step 1: choose x_t by (6)
3     Step 2: choose m_t by (7)
4     Query (x_t, m_t) and observe y_t
5     Perform the GP Bayesian update to obtain µ_t(x, m) and σ_t(x, m) for all m

Specifically, our proposed algorithm contains two main steps.
• Step 1: the algorithm selects the query point x_t that has the largest summation of UCBs over all tasks (as shown in Line 2 of Algorithm 1), namely,

x_t = arg max_{x∈X} Σ_{m=1}^M [µ_{t−1}(x, m) + √β_t σ_{t−1}(x, m)],    (6)

which means that the selected query point x_t has the "best" potential to perform well for all tasks simultaneously. The objective function in (6) is the summation form of (5) over the different tasks, so it is a reasonable upper bound on the objective function in (3).
• Step 2: given the query point selected in Step 1, the algorithm selects the task that has the largest information gain to play (as shown in Line 3 of Algorithm 1), namely,

m_t = arg max_{m∈Z} σ_{t−1}(x_t, m).    (7)

That is, the most informative task is selected to represent all the other tasks.
The criterion in (7) originates from the criterion in (4). However, our criterion selects which task to evaluate given the query point selected in Step 1.
By querying one task for a given point in each round, our algorithm can achieve better sample efficiency than querying all tasks in one round. Despite the simplicity and easy interpretation of Algorithm 1, a theoretical performance guarantee can also be derived in the next section, where the regret in the bandit setting is analyzed.
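The two selection steps can be sketched on a discrete candidate grid as follows. This is a minimal illustration, assuming the posterior means and standard deviations are already available for every (point, task) pair; using the posterior standard deviation for Step 2 is one concrete instantiation of the information-gain criterion in (7), since for Gaussian observations the information gain is monotone in the predictive variance:

```python
import numpy as np

def mt_gpucb_select(mu, sigma, beta_t):
    """One round of MT-GPUCB on a discrete candidate grid.

    mu, sigma: arrays of shape (n_candidates, M) holding the posterior mean
    and standard deviation for every (point, task) pair.
    Step 1 (Eq. 6): pick the point whose UCB summed over all tasks is largest.
    Step 2 (Eq. 7): at that point, pick the task with the largest posterior
    uncertainty, i.e. the most informative task.
    """
    ucb = mu + np.sqrt(beta_t) * sigma          # UCB per (point, task)
    x_idx = int(np.argmax(ucb.sum(axis=1)))     # Step 1: query point
    m_idx = int(np.argmax(sigma[x_idx]))        # Step 2: task to evaluate
    return x_idx, m_idx
```

After the selected pair (x_t, m_t) is evaluated, the multi-task GP posterior is updated and the round repeats, so only one task evaluation is spent per round.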

Regret Analysis for Multi-Task Gaussian Process Upper Confidence Bound
A natural performance metric in the context of bandits Srinivas et al. (2012) is the cumulative regret, the loss in reward due to not knowing the optimal solution of f beforehand. Given Assumption 1, for our choice x_t in round t, define the instantaneous multi-task regret as

r_t = Σ_{m=1}^M [f(x*, m) − f(x_t, m)].

Even though only one task is evaluated in each round, the regret for all tasks is incurred since all tasks are equally important. This is quite different from the definition in Krause and Ong (2011), where only the evaluated task enters the instantaneous regret. Our definition also adds difficulties to the theoretical analysis.
The cumulative regret after T rounds is the sum of instantaneous regrets: R(T) := Σ_{t=1}^T r_t. A desirable asymptotic property of an algorithm is to be no-regret (also called a sub-linear convergence rate): lim_{T→+∞} R(T)/T = 0. Similar to the GP-UCB regret analysis Krause and Ong (2011), our bound depends on a term capturing the information gain between query choices and "black-box" functions. Specifically, define

γ_T := max_{A⊂X×Z, |A|=T} I(y_A; f_A)    (8)

as the maximum information gain over T rounds of queries, where I(y_T; f_T) is the information gain Cover and Thomas (1991). It is the mutual information between f_T = [f(x_t, m_t)]_{t≤T} and the observations y_T = f_T + ε_T at these points,

I(y_T; f_T) = H(y_T) − H(y_T | f_T),

quantifying the reduction in uncertainty about f from revealing y_T. Here, ε_T ∼ N(0, σ²I). For a Gaussian, H(N(µ, Σ)) = (1/2) log|2πeΣ|, so that in our setting

I(y_T; f_T) = (1/2) log|I + σ^{−2}K_T|,

where K_T = [k_multi((x, m), (x′, m′))]_{(x,m),(x′,m′)∈A_T}. Three conditions are analyzed for regret bounds. Note that none of the results subsume the others, and so all cases may be of use. For the first two conditions, we assume a known GP prior and (1) a finite X or (2) an infinite X with mild assumptions about k_multi. A third (and perhaps more "agnostic") way to express assumptions about f is to require that f has low "complexity" as quantified by the Reproducing Kernel Hilbert Space (RKHS, Wahba (1990)) norm associated with the kernel k_multi. The following theorem shows that our MT-GPUCB algorithm converges to the optimal solution at a sub-linear convergence rate.
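For the Gaussian setting above, the mutual information reduces to the log-determinant expression; a minimal sketch:

```python
import numpy as np

def information_gain(K, noise_var):
    """Mutual information I(y_T; f_T) = 1/2 log det(I + sigma^{-2} K)
    for a Gaussian likelihood, given the kernel matrix K of the queried
    (point, task) pairs and the noise variance sigma^2."""
    T = K.shape[0]
    # slogdet is numerically stabler than log(det(...)) for large T
    sign, logdet = np.linalg.slogdet(np.eye(T) + K / noise_var)
    return 0.5 * logdet
```

Informative queries are those whose kernel matrix has large eigenvalues relative to the noise; γ_T in (8) is the largest such value over all size-T query sets.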
Theorem 1. Under Assumption 1 and k(x, x) ≤ 1, pick δ ∈ (0, 1) and suppose one of the following conditions holds:
1. X is finite, f is sampled from a multi-task Gaussian process with known noise variance σ², and β_t = 2 log(M|X|t²π²/(6δ));
2. X ⊂ [0, r]^d is compact and convex, d ∈ N, r > 0. Suppose that the kernel k_multi satisfies the following high-probability bound on the derivatives of GP sample paths f: for some constants a, b > 0,

Pr{ sup_{x∈X} |∂f/∂x_j| > L } ≤ a e^{−(L/b)²},  j = 1, . . ., d.

Choose β_t = 2 log(2Mt²π²/(3δ)) + 2d log(t²rdb√(log(4dMa/δ)));
3. f lies in the RKHS associated with k_multi with ‖f‖_{k_multi} ≤ B, and the noise is uniformly bounded by σ. Choose β_t = 2B² + 300γ_t log³(t/δ).
Running Algorithm 1, a regret bound of R(T) = O*(√(Tβ_T γ_T)) holds with high probability (at least 1 − δ), where O*(·) suppresses logarithmic factors. Proof. See the proof in Appendix A.
Theorem 1 shows the regret bound under three independent conditions. Conditions (1) and (2) correspond to the cases of a finite discrete and a convex compact design space, respectively. Under condition (1) or (2), the cumulative regret is bounded in terms of the maximum information gain with respect to the multi-task GP defined over X × Z.
The smoothness assumption on k(x, x′) in condition (2) disqualifies GPs with highly erratic sample paths. It holds for stationary kernels k(x, x′) = k(x − x′) that are four times differentiable (Theorem 5 of Ghosal et al. (2006)), such as the squared exponential and Matérn kernels with ν > 2. Under condition (3), a regret bound is obtained in a more agnostic setting, where no prior on f is assumed and much weaker assumptions are made about the noise process. The theoretical results in Theorem 1 will be further verified experimentally using synthetic functions in Section 5.
Importantly, note that upper bounds are needed for the information gain γ_T defined in (8) so that Algorithm 1 is no-regret. γ_T is a problem-dependent quantity determined by properties of both the kernel and the design input space that govern the growth of regret. In Krause and Ong (2011) (Section 5.2), upper bounds on γ_T have been derived for common kernels such as finite-dimensional linear, squared exponential, and Matérn kernels (ν > 1) that guarantee the no-regret conclusion is valid. For the Matérn kernels (ν > 1), γ_T = O(T^{d(d+1)/(2ν+d(d+1))} log T), which is sub-linear in T.

Remark 2 (Discussion on mild conditions). Under the Gaussian process assumption with a proper kernel (for example, Matérn kernels with ν > 2), condition (1) or (2) can easily be satisfied in many problems. For example, the number of layers and the number of neurons per layer in a deep neural network can only take discrete values Garrido-Merchán and Hernández-Lobato (2020); Maftouni et al. (2020), which suits condition (1); the BoxConstraint and KernelScale in a Gaussian kernel support vector machine (SVM) Han et al. (2012) can form a compact convex design space, which suits condition (2).

Numerical Study
To evaluate the performance of the proposed MT-GPUCB (Algorithm 1), a numerical illustration of our algorithm on six synthetic functions is conducted in this section. In all analyses, EIQ Tan (2020), MTBO Swersky et al. (2013), CGP-UCB Krause and Ong (2011), and GP-UCB Srinivas et al. (2012) are selected as benchmarks for comparison with the proposed algorithm; these are state-of-the-art methods in the related area. EIQ is an acquisition function designed for the expected quadratic loss. MTBO is an entropy search based algorithm for multi-task BO, while CGP-UCB is the algorithm in which the task to be evaluated is randomly selected. GP-UCB in the setting of multi-task BO represents the algorithm where all tasks are evaluated in each round without considering the task correlation. Throughout this paper, the maximin Latin hypercube design Joseph and Hung (2008), which demonstrates good space-filling and first-dimension projection properties, is implemented to obtain initial design points. The code for MT-GPUCB is implemented in Matlab 2019a. The CPU of the computer used in the experiments is an Intel® Core™ Processor i7-6820HQ (Quad Core 2.70 GHz, 3.60 GHz Turbo, 8MB, 45W).
Choice of β_t in (6) for practical considerations: β_t, as specified by Theorem 1, involves unknown constants and tends to be conservative in practice Srinivas et al. (2012). For better empirical performance, a more aggressive setting is required. Following the recommendations in Kandasamy et al. (2015, 2017, 2019), we set it to be of the correct "order", namely, β_t = 0.2d log(2t). This offers a good tradeoff between exploration and exploitation, and captures the correct dependence on d and t in Theorem 1.

Experimental Settings
Ackley, Bohachevsky, Colville, Levy, Powell, and Rastrigin are selected from Surjanovic and Bingham (2020) for numerical comparison; most of them have many local optimal solutions and are therefore hard to optimize. For each of these functions, coefficients can be changed to simulate multiple correlated tasks with the same optimal solution, matching Assumption 1.

Table 1: Optimal solution and coefficient vector for synthetic functions used in this numerical study.

The optimal solution and the coefficient vector used to encode the different tasks for each function are listed in Table 1. The visualization of four synthetic functions in Task 1 that lie in two-dimensional space is shown in Figure 1 to illustrate their complex optimization landscapes. The design space X, the number of tasks, the number of initial design points, and the number of additional function evaluations for the six synthetic functions are summarized in Table 2, where d and M denote the dimension of the design space and the number of tasks, respectively. For each synthetic function, the number of additional function evaluations (T) is set to the number at which one of the methods achieves near-optimal performance. Due to the different characteristics of the synthetic functions (for example, the number of tasks, the dimension of the design space, and the complexity of optimization), the numbers of function evaluations (T) differ considerably across functions.
Once the task encoding matrix is known, each row represents the coefficients of one task and serves as the input to calculate the r in (10). Matérn (ν = 5/2) kernels are selected for both k_X and k_Z, where the distance matrix in k_Z(m, m′) is constructed from the coefficients of the different tasks. Note that the product of two Matérn (ν = 5/2) kernels is still Matérn (ν = 5/2). Thus, the smoothness assumption on the kernel in Theorem 1 under condition (2) is satisfied. The Matérn kernel Genton (2001) is given by

k(r) = (2^{1−ν}/Γ(ν)) (√(2ν) r/ℓ)^ν B_ν(√(2ν) r/ℓ),    (10)

where r is the distance between inputs, ℓ is the length scale, ν controls the smoothness of sample paths (the smaller, the rougher), and B_ν is a modified Bessel function. The kernel parameters are learned by maximizing the marginal log likelihood. We maximized the acquisition function (6) by densely sampling 10,000 points from a d-dimensional low-discrepancy Sobol sequence and starting Matlab fmincon (a local optimizer) from the sampled point with the highest value. All analyses in this section are repeated 50 times to obtain the mean and standard deviation for comparison.
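For ν = 5/2 the Matérn kernel admits a well-known closed form that avoids the Bessel function; a minimal sketch (illustrative Python, not the paper's Matlab code):

```python
import numpy as np

def matern52(r, length_scale=1.0):
    """Matérn kernel with nu = 5/2 in its closed form (no Bessel function):
    k(r) = (1 + sqrt(5) r/l + 5 r^2 / (3 l^2)) * exp(-sqrt(5) r/l),
    where r is the distance between inputs and l is the length scale."""
    s = np.sqrt(5.0) * np.asarray(r, dtype=float) / length_scale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)
```

The kernel equals 1 at r = 0 (consistent with the bounded-variance assumption k(x, x) ≤ 1) and decays monotonically with distance; sample paths are twice differentiable, satisfying the smoothness requirement ν > 2 of Theorem 1.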

Numerical Results
For quantitative comparison, the gap measure Malkomes and Garnett (2018) is reported, which is defined as

Gap = [Σ_{m=1}^M f(x_best, m) − Σ_{m=1}^M f(x_first, m)] / [Σ_{m=1}^M f(x*, m) − Σ_{m=1}^M f(x_first, m)],

where Σ_{m=1}^M f(x_first, m) is the maximum function value among the initial design points, Σ_{m=1}^M f(x_best, m) is the best value found by the algorithm, and Σ_{m=1}^M f(x*, m) is the optimum value. "0" means there is no improvement over the initial design points and "1" means the algorithm found the optimum.
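The gap measure is a direct ratio of the three task-summed quantities just described; a minimal sketch for a maximization problem:

```python
def gap_measure(f_first, f_best, f_opt):
    """Gap measure of Malkomes and Garnett (2018) for maximization.
    Inputs are the task-summed objective values at the best initial design
    point, the best point found, and the true optimum. Returns 0 when the
    algorithm never improves on the initial design and 1 when it finds
    the optimum."""
    return (f_best - f_first) / (f_opt - f_first)
```

For example, an algorithm that closes half the distance between the initial best and the optimum scores 0.5.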
The results of the gap measure are summarized in Table 3. Our proposed algorithm has the best performance for all six synthetic functions in terms of mean and standard deviation. Specifically, our algorithm can find a better solution than the other benchmarks within the same budget. CGP-UCB has the second-best performance, GPUCB has the worst performance, and EIQ has the second-worst performance. This result demonstrates the effectiveness of the automatic task selection strategy of our proposed algorithm. The red bold numbers represent the best performance in each scenario.
Although our applications are not online or real-time, the computational efficiency of our algorithm is also examined in this study. The average computation time (in seconds) per round is reported in Table 4. EIQ and MTBO are very time-consuming since they involve numerical integration, which is computationally expensive. Our proposed algorithm takes a similar computation time to CGP-UCB and GPUCB because they share the same form of acquisition function. In summary, the results in Tables 3 and 4 show that MT-GPUCB achieves not only the best optimization accuracy but also highly efficient computation. In addition, the simple regret, defined as min_{0≤t≤T} r_t, is plotted against the number of evaluations in Figure 2. This measure is more relevant to pure search problems (i.e., no exploitation) and captures how quickly the algorithms find the optimal point. As shown in Figure 2, our algorithm consistently converges faster than the benchmark algorithms. In most cases, our algorithm reaches the best performance achieved by the other benchmarks with far fewer evaluations. GPUCB is the worst across all synthetic functions since it evaluates all tasks for each query point, which is wasteful. EIQ is ineffective for all synthetic functions since it does not fully consider the correlation among tasks.

Real-World Case Studies
In this section, two real-world case studies are used to evaluate the performance of the proposed algorithm. Section 6.1 presents Application 1: fast cross-validation to determine the optimal set of hyperparameters for machine learning models. Section 6.2 presents Application 2: raw material properties determination using the Flow-3D Weld software under different printing conditions. The same comparison methods, Matlab settings, and acquisition-function maximization method described in Section 5 are used in this section. For k_X(x, x′) and k_Z(m, m′) ∈ R^{M×M}, a Matérn (ν = 5/2) kernel with a separate length scale per predictor is selected for both kernels. One-hot encoding Hastie et al. (2009) is used to construct the task features so that k_Z(m, m′) can be calculated as k_Z(m, m′) = k(1_m, 1_{m′}), where 1_m ∈ R^M is the vector whose m-th entry equals 1 and whose remaining entries are 0.
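Under one-hot task features, every pair of distinct tasks sits at the same Euclidean distance (√2), so k_Z has 1 on its diagonal and a single shared off-diagonal value. A minimal sketch using the Matérn (ν = 5/2) form on the one-hot vectors (illustrative Python):

```python
import numpy as np

def one_hot_task_kernel(M, length_scale=1.0):
    """k_Z(m, m') from one-hot task features 1_m in R^M.
    All distinct task pairs are at distance sqrt(2), so the resulting matrix
    is 1 on the diagonal with one shared off-diagonal value (Matérn nu=5/2)."""
    E = np.eye(M)                                           # one-hot feature per task
    D = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
    s = np.sqrt(5.0) * D / length_scale
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)
```

The learned length scale then controls a single between-task correlation strength, which is a deliberately simple choice when no richer task descriptors are available.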
For both SVM and CNN, 20 initial design points are selected by a maximin Latin hypercube design. If a variable belongs to [10⁻³, 10³], we search over the exponents, that is, [−3, 3]. Then, 100 and 50 additional evaluations are conducted for SVM and CNN, respectively. All analyses in this section are repeated 20 times to obtain the mean and standard deviation for comparison. Since the optimal hyperparameters are unknown, the average cross-validation accuracy is plotted against the number of evaluations in Figure 3. For both the SVM and CNN models, our algorithm achieves the highest average accuracy in estimating the hyperparameters while remaining very stable, as shown by the confidence intervals.
EIQ and GP-UCB have the lowest average accuracy, while GP-UCB is unstable for both SVM and CNN.CGP-UCB and MTBO demonstrate competitive performance at the early stage, while MT-GPUCB can keep improving as the number of evaluations goes on.This experiment shows that our algorithm makes nontrivial decisions regarding which fold to query, which can steadily improve the average accuracy.

Application 2: Raw Material Properties Determination for Flow-3D Weld
Flow-3D Weld Flow Science (2019) is a simulation software based on computational fluid dynamics (CFD). It provides powerful insights into laser welding processes such as electron beam melting (EBM) and selective laser melting (SLM) Gokuldoss et al. (2017), which are representative powder bed fusion additive manufacturing processes for machine learning applications in quality and reliability Shen et al. (2020, 2021). When we use Flow-3D Weld, the metal powder material (for example, Ti6Al4V), its raw material properties (for example, fluid absorption rate (FAR) and thermal conductivity (TC)), and the printing conditions (for example, laser power and dwell time) are the inputs to the software. The output from Flow-3D Weld is the melt pool geometry (changing over time); an example frame of the melt pool boundary is shown in Figure 4a. The melt pool information is a critical intermediate measure that reflects the outcome of a laser powder bed fusion process. Accurate raw material property values are required for Flow-3D Weld to simulate this process with high accuracy; otherwise, these values have to be randomly selected within the given range, which cannot guarantee the simulation accuracy of Flow-3D.
For the same raw material (Ti6Al4V powder in this paper), its properties should remain the same under different printing conditions (the tasks defined in this paper). Therefore, these raw material properties are treated as the hyperparameters in our proposed model. This paper aims to determine these hyperparameters based on a number of tasks (namely, actual AM experiments; in this paper, Flow-3D Weld is used to synthesize the AM experiments). The problem studied in this subsection follows the procedure below.
• Step 1 (AM experiments using Flow-3D simulation): AM experiments are simulated using Flow-3D Weld. Ti6Al4V (powder) is selected as the raw material for the laser melting process. One laser melting experiment lasts 2 milliseconds (ms), during which the laser is turned on. Two different printing conditions (tasks) are applied, namely, Condition 1: 50% laser power, 0.8 ms dwell time (time the laser is on), and Condition 2: 40% laser power, 1 ms dwell time. We set the raw material properties of Ti6Al4V, namely, fluid absorption rate (FAR) = 0.3780 and thermal conductivity (TC) = 1.64×10^6 cm·g/(s^3·K), in all experiments to generate the actual AM experimental data. These values are randomly chosen from the ranges provided by the software; they are treated as a black box unknown to our developed algorithm, and they serve as the ground truth for testing the performance of our proposed method. This is how our paper simulates the impact of specific raw material properties on metal AM printing. • Step 2 (Multi-task learning to determine the actual raw material properties, FAR = 0.3780 and TC = 1.64×10^6 cm·g/(s^3·K)): our algorithm searches for these property values, as detailed later in this subsection. Together with the optimal solution, a visualization of our "black-box" function is shown in Figure 4b. It shows that our problem is very challenging, with multiple local optima; in particular, the geometry of the surface is very complicated near the optimal solution.
For both k_X(x, x′) and k_Z(m, m′) ∈ R^{M×M}, the Matérn (ν = 5/2) kernel with a separate length scale per predictor is selected. The printing conditions are used to construct the task features for each task. We search FAR over the range [0.34, 0.41] and TC over the range [1.4, 1.9] (multiplied by 10^6 when input into the software).
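A Matérn 5/2 kernel with a separate length scale per predictor (ARD) can be sketched as below. The length-scale values are illustrative assumptions, not those fitted in the paper; per-predictor scales matter here because FAR and TC live on very different numeric ranges:

```python
import numpy as np

def matern52_ard(x1, x2, length_scales):
    """Matern nu=5/2 with a separate length scale per predictor (ARD):
    each coordinate difference is scaled before taking the distance."""
    r = np.sqrt(np.sum(((x1 - x2) / length_scales) ** 2))
    s = np.sqrt(5.0) * r
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

# FAR in [0.34, 0.41] and TC in [1.4, 1.9] (scaled units), so the two
# predictors need different length scales. Values are illustrative only.
x1 = np.array([0.3780, 1.64])
x2 = np.array([0.40, 1.5])
print(matern52_ard(x1, x2, length_scales=np.array([0.05, 0.25])))
```

The kernel equals 1 when the two inputs coincide and decays toward 0 as the scaled distance grows, so poorly chosen length scales directly distort the GP's notion of which designs are "close".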
Twenty initial design points are selected by a maximin Latin hypercube design, and 16 additional evaluations are determined by the BO algorithms. The distance to the optimal solution (ground truth) is plotted in Figure 5 for the different algorithms. At the beginning, our algorithm learns at a relatively slow pace; after that, it converges toward the optimal solution much faster than the other algorithms. In the end, our algorithm converges to a much better solution than any other benchmark.

Conclusion
In this paper, a multi-task Gaussian process upper confidence bound (MT-GPUCB) algorithm is developed for optimization over different but correlated tasks. The proposed MT-GPUCB couples the query strategy for the design point with automatic task selection in each round of multi-task Bayesian optimization to improve sample efficiency. Under some mild conditions, our algorithm is a no-regret algorithm that converges to the optimal solution at a sub-linear rate. Experimentally, our algorithm is validated on synthetic functions and on real-world case studies in fast cross-validation and raw material property determination. Based on the convergence speed in these case studies, our MT-GPUCB clearly outperforms state-of-the-art algorithms in related areas.
In addition, some aspects of MT-GPUCB deserve further investigation. First, kernel selection is critical to the performance of a Gaussian process. Therefore, how to select a proper kernel is one of the next research steps; in particular, how to construct the task kernel k_Z is the key question. Second, the maximization of the acquisition function is a non-trivial problem Wilson et al. (2018); hence, how to efficiently optimize the acquisition function can be further investigated. Third, our query strategy for selecting both the design point and the task is not limited to UCB-based acquisition functions; probability of improvement, expected improvement, and entropy-based acquisition functions can be explored in the future.
Based on the Cauchy–Schwarz inequality, we have the following bound, where the last inequality follows from the definition of γ_T. Therefore,

A.2 Proof of Theorem 1 under Condition (2)
Recall that the proof for the finite case is based on Lemma 1, which paves the way for Lemma 2. However, Lemma 1 does not hold for infinite X. From Lemma 4 to Lemma 7, the extension to infinite X is accomplished via a discretization trick. First, observe that we have confidence on all decisions actually chosen.
Proof. Follows directly from Lemma 1.
Purely for the sake of analysis, we use a set of discretizations X_t ⊂ X, where X_t will be used at time t in the analysis. Lemma 5. Pick δ ∈ (0, 1) and set β_t = 2 log(M |X_t| π_t / δ), where Σ_{t≥1} π_t^{-1} = 1 and π_t > 0.
setting described in Assumption 1. The forms of these functions are summarized as follows (a, b, . . . are coefficients that can be tuned to generate different tasks, except for Bohachevsky): i. Ackley (a, b > 0

Figure 1: Visualization of 4 synthetic functions in two-dimensional space (formulated as minimization problems).

Figure 2: Simple regret vs. number of evaluations for different synthetic functions and methods (95% confidence interval; Log means log_e).
Cross-validation Bengio and Grandvalet (2004) is a widely used technique for estimating the generalization error of machine learning models, but it requires retraining a model K times. For a given set of hyperparameters, the generalization error is usually obtained by averaging the validation errors from model training on K training-validation splits. This can be prohibitively expensive with complex models and large datasets. With a good GP model, we can very likely obtain a high-quality estimate of the generalization error by evaluating the model trained on a single training-validation split. To speed up hyperparameter tuning using cross-validation, Algorithm 1 is applied to dynamically determine which hyperparameters and which training-validation split to query in each round. The datasets and machine learning models used in this application are as follows: (i) Train a Gaussian kernel support vector machine (SVM) on the Arcene Cancer Dataset Guyon et al. (2005) for two-class classification. The Arcene Cancer Dataset is a mass-spectrometric dataset containing 7000 continuous input variables. In addition, there are 200 records, of which 88 are cancer patients and 112 are healthy patients. (ii) Train a convolutional neural network (CNN) Lawrence et al. (1997) on 2000 handwritten digit samples (0∼9) from the MNIST dataset LeCun et al. (1998) for deep learning classification. Each image has size 28 × 28. The CNN used in this article contains three convolutional layers and one fully connected layer. For the i-th convolutional layer (i = 1, . . ., 3), the number of filters is 2^{i+2} with size 3 × 3; batch normalization and max pooling are used together with the ReLU activation function. The final fully connected layer uses the softmax activation function for classification. The maximum number of epochs is set to 20, which makes the CNN converge in all cases. The hyperparameters that need to be tuned via cross-validation for the above two machine learning models are summarized below: (i) Determine BoxConstraint ∈ [10^-5, 10^3] for the SVM; specifically, BoxConstraint ∈ [10^-3, 10^3] and KernelScale ∈ [10^-3, 10^3] for training the Gaussian kernel SVM with 5-fold cross-validation. BoxConstraint and KernelScale are the two key determinants of SVM performance. (ii) Determine InitialLearnRate ∈ [10^-5, 10^0], L2Regularization ∈ [10^-5, 10^0], and Momentum ∈ [0.4, 1] for training the CNN with 5-fold cross-validation. InitialLearnRate and Momentum are the two most important parameters of the stochastic gradient descent with momentum (SGDM) Sutskever et al. (2013) optimizer. L2Regularization can help improve the generalization ability of the CNN.
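The fold-as-task idea above can be sketched as follows: each cross-validation fold is one task, and only the fold selected by the algorithm is evaluated in a round, instead of all K. The dummy_score function below is a hypothetical stand-in for the actual SVM/CNN training and validation:

```python
import numpy as np

def cv_fold_indices(n, K, rng):
    """Split n samples into K folds; each fold plays the role of one
    'task' in the multi-task BO formulation."""
    idx = rng.permutation(n)
    return np.array_split(idx, K)

def evaluate_one_fold(train_and_score, hyperparams, folds, m):
    """Evaluate only fold m (the task selected by the algorithm)
    instead of retraining the model on all K splits."""
    val = folds[m]
    train = np.concatenate([f for j, f in enumerate(folds) if j != m])
    return train_and_score(hyperparams, train, val)

# Hypothetical scorer standing in for SVM/CNN training; peaks at C = 10.
def dummy_score(hp, train, val):
    return 1.0 / (1.0 + abs(hp["C"] - 10.0))

rng = np.random.default_rng(1)
folds = cv_fold_indices(200, 5, rng)
acc = evaluate_one_fold(dummy_score, {"C": 8.0}, folds, m=2)
print(acc)
```

Each round therefore costs one model fit instead of K, which is where the reported speed-up over full cross-validation comes from.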
Figure 3: Average accuracy of cross-validation on SVM and CNN for different datasets and algorithms (95% Confidence interval).
Figure 4: Visualization of (a) the output from Flow-3D Weld; (b) the surrogate model of the image loss.
In Step 2 (determination of the raw material properties), our algorithm is applied to guide a sequence of AM experiments (simulated using Flow-3D Weld) to determine the true values of FAR and TC used in Step 1. Specifically, our algorithm aims to minimize the image loss between the image from a queried simulation and the image from the AM data in Step 1, defined as the l2 norm of the difference between the two images. The image loss is used to construct the "black-box" function f in our multi-task BO. The performance of different algorithms is evaluated by the distance between the values of FAR and TC queried in Step 2 and the actual values of FAR and TC set in Step 1. To help the reader appreciate the difficulty of our problem, 50 points are selected by a maximin Latin hypercube design, where 20 points are sampled near the optimal solution
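The image loss can be written in a few lines; the sketch below assumes the two melt-pool frames are arrays of equal size:

```python
import numpy as np

def image_loss(sim_img, ref_img):
    """l2 norm of the pixel-wise difference between a queried simulation
    frame and the reference melt-pool image from Step 1."""
    return np.linalg.norm(sim_img.astype(float) - ref_img.astype(float))

# Identical frames give zero loss; any pixel mismatch increases it.
a = np.zeros((4, 4))
b = np.zeros((4, 4)); b[0, 0] = 3.0
print(image_loss(a, a), image_loss(a, b))  # 0.0 3.0
```

Minimizing this loss over (FAR, TC) is the "black-box" objective, so a queried simulation whose melt pool matches the reference frames pixel-for-pixel would recover the ground-truth properties.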

Figure 5: The distance to the optimal solution vs. number of evaluations for different algorithms (Log means log_e).

The number of initial design points is calculated from 5Md, where d and M are

Table 2: Summary of experimental settings in this numerical study.

Table 3: Results for the gap measure performance across 50 repetitions for different synthetic functions and algorithms.

Table 4: Results for the computation time of each round across 50 repetitions for different synthetic functions and algorithms.
Furthermore, to show the convergence behavior of each algorithm, the simple regret, defined as min_{0≤t≤T} r_t, is plotted against the number of evaluations in Figure 2.