Bayesian Optimization Methods for Hyperparameter Search

Machine learning involves numerous hyperparameters—model hyperparameters, optimizer hyperparameters, loss function hyperparameters, etc. Users need to set these based on experience and adjust according to training results. Optimal values vary by task and dataset, with no universal empirical values.

This step is often tedious and time-consuming. To simplify it, hyperparameter optimization research aims to search for optimal hyperparameters automatically. Common methods include grid search and random search, along with more advanced methods such as heuristic search and Bayesian optimization. This article introduces Bayesian optimization for hyperparameter search, a common approach that Google also offers as a service on Google Cloud. We focus on the GPR (Gaussian Process Regression) + GP-BUCB (Gaussian Process Batch Upper Confidence Bound) method.

More precisely, Bayesian optimization is an algorithmic framework for optimizing black-box functions that cannot be written down explicitly. In hyperparameter search, this function maps hyperparameters \(x\) to a final benefit \(y\), which is not necessarily the loss; it could combine the loss with evaluation metrics, e.g., in CTR estimation: \(y = \alpha_1 \cdot \mathrm{auc} - \alpha_2 \cdot \mathrm{logloss}\).
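As a toy illustration of such a combined objective (the weights and metric values below are invented for illustration, not taken from the article):

```python
def benefit(auc, logloss, alpha1=1.0, alpha2=0.5):
    """Combine evaluation metrics into one scalar objective.

    alpha1 and alpha2 are illustrative weights, not prescribed values.
    """
    return alpha1 * auc - alpha2 * logloss

# Higher AUC and lower logloss both increase the benefit y.
y = benefit(auc=0.82, logloss=0.45)
```

The search then treats `benefit` as a black box of the hyperparameters: each evaluation requires a full training run, which is why sample-efficient methods like Bayesian optimization are attractive.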

Common Bayesian optimization methods consist of two parts: Gaussian Process Regression (GPR) and an exploration–exploitation trade-off. We describe each part in detail below.

Gaussian Process Regression (GPR)

First, as the name suggests, GPR is a regression method: given known x and y values, it predicts the y corresponding to a new x. Unlike common regression methods, which predict a single value, GPR predicts a distribution for the new y.
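As a minimal sketch of this idea, the zero-mean GP predictive distribution can be computed directly with NumPy. The RBF kernel, length scale, noise level, and toy data below are illustrative assumptions, and a production implementation would use a Cholesky solve rather than an explicit inverse:

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """RBF kernel matrix between the row-vectors of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gpr_predict(X_train, y_train, X_new, noise=1e-6):
    """Zero-mean GP posterior: predictive mean and std at X_new."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf(X_train, X_new)    # cross-covariance, train vs. new
    K_ss = rbf(X_new, X_new)     # covariance among the new points
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, np.sqrt(np.clip(np.diag(cov), 0.0, None))

# Toy data (invented): three observations of an unknown function.
X_train = np.array([[1.0], [3.0], [5.0]])
y_train = np.sin(X_train).ravel()

mean, std = gpr_predict(X_train, y_train, np.array([[2.0], [3.0]]))
# std is near 0 at the observed x = 3, larger at the unseen x = 2.
```

The key output is the pair (mean, std) per query point: the standard deviation shrinks near observed data and grows away from it, which is exactly what the exploration–exploitation step later exploits.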

Where does “Gaussian” come in? Bayesian optimization is a Bayesian method, and Bayesian methods require priors. Here, the Gaussian serves as the prior: we assume all the y values jointly follow a zero-mean multivariate Gaussian distribution (note that each y individually then follows a one-dimensional Gaussian distribution). Compared with the one-dimensional Gaussian, the multivariate Gaussian replaces the mean with a mean vector and the variance with a covariance matrix (see How does multivariate Gaussian distribution develop from univariate?).

“Process” refers to a stochastic process: a collection of random variables, any finite number of which are jointly Gaussian, forms a Gaussian process. But this point isn’t crucial for understanding GPR.

GPR can be derived from either the weight-space or the function-space view. See Gaussian Processes for Machine Learning, pages 7-35, for details. For simplicity, we describe it from the function-space perspective, using intuitive explanations. Some notation may differ from the PDF.

In the training set, assume we have 3 x values, \(x_1, x_2, x_3\), with corresponding y values \(f_1, f_2, f_3\). Since we assume all y follow a zero-mean multivariate joint Gaussian distribution, we can write:

\[\begin{bmatrix} f_1 \\ f_2 \\ f_3 \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},\; K = \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix}\right)\]

The key question: where does the covariance matrix K come from? Another important assumption:

If two x values are similar (close together), their corresponding y values are more strongly correlated. For example, if \(x_3\) is close to \(x_2\), then \(f_3\) and \(f_2\) are more strongly correlated than \(f_3\) and \(f_1\). In other words, the covariance matrix is a function of the x values, not the y values.

How do we measure similarity between two x values? Many choices exist. GPR commonly uses the RBF kernel (the same kernel used in SVMs; in principle, any valid SVM kernel works here). The RBF kernel implicitly maps x into a higher-dimensional feature space and computes the inner product there, improving expressiveness.

  • Covariance:

\[k(x_1, x_2) = E\big[(x_1 - E(x_1)) \cdot (x_2 - E(x_2))\big]\]

  • Correlation coefficient:

\[k(x_1, x_2) = \frac{E\big[(x_1 - E(x_1)) \cdot (x_2 - E(x_2))\big]}{\sigma_{x_1}\sigma_{x_2}}\]
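The RBF kernel formula itself is not written out above; a common form, with an assumed length-scale \(\ell\), is \(k(x_i, x_j) = \exp\!\big(-\lVert x_i - x_j\rVert^2 / (2\ell^2)\big)\). A minimal sketch of building the covariance matrix K from it, showing that nearby x values get higher covariance:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """RBF kernel: similarity decays with squared distance between inputs."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * length_scale ** 2))

def covariance_matrix(X, length_scale=1.0):
    """Build K with K[i, j] = k(x_i, x_j); a function of X only, not of y."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = rbf_kernel(X[i], X[j], length_scale)
    return K

# x_1 and x_2 are close, x_3 is far away, so K[0, 1] is near 1
# while K[0, 2] is near 0.
X = np.array([[1.0], [1.1], [5.0]])
K = covariance_matrix(X)
```

Note that K is symmetric with ones on the diagonal here (each point is maximally similar to itself), matching its role as the covariance matrix of the joint Gaussian above.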