Requests for technical support from the VASP group should be posted in the VASP-forum.

# Difference between revisions of "Machine learning force field: Theory"

Here we present the theory for on-the-fly machine learning force fields, which will be only available from VASP6.0 or higher. The theory will be presented in a very condensed manner and for a more detailed description of the methods we refer the readers to Refs. [1] and [2].

## Algorithm

Fig. 1: On-the-fly machine learning force field generation scheme.

To obtain the machine-learning force field several structure datasets are required. A structure dataset defines the Bravais lattice and the atomic positions of the system and contains the total energy, the forces and the stress tensor calculated by first principles. Given these structure datasets, the machine identifies local configurations around an atom to learn what force field is appropriate. The local configuration measures the radial and angular distribution of neighboring atoms around this given site and is captured in so called descriptors.

The on-the-fly force field generation scheme is given by the following steps (a flowchart of the algorithm is shown in Fig. 1):

1. The machine predicts the energy, the forces and the stress tensor and their uncertainties for a given structure using the existing force field.
2. The machine decides whether to perform a first principles calculation (proceed with step 3); otherwise we skip to step 5.
3. Performing the first principles calculation, we obtain a new structure dataset which may improve the force field.
4. If the number of newly collected structures reaches a certain threshold or if the predicted uncertainty becomes too large, the machine learns an improved force field by including the new structure datasets and local configurations.
5. Atomic positions and velocities are updated by using the force field (if accurate enough) or the first principles calculation.
6. If the desired total number of ionic steps (NSW) is reached we are done; otherwise proceed with step 1.

## Descriptor

The potential energy ${\displaystyle U}$ of a structure with ${\displaystyle N_{{a}}}$ atoms is approximated as

${\displaystyle U=\sum \limits _{{i=1}}^{{N_{{{\mathrm {a}}}}}}U_{{i}}.}$

The local energies ${\displaystyle U_{{i}}}$ are functionals ${\displaystyle U_{{i}}=F[\rho _{{i}}({\mathbf {r}})]}$ of the probability density ${\displaystyle \rho _{{i}}}$ to find another atom ${\displaystyle j}$ at the position ${\displaystyle {\mathbf {r}}}$ around the atom ${\displaystyle i}$ within a cut-off radius ${\displaystyle R_{{{\mathrm {cut}}}}}$ defined as

${\displaystyle \rho _{{i}}\left({\mathbf {r}}\right)=\sum \limits _{{j=1}}^{{N_{{{\mathrm {a}}}}}}f_{{{\mathrm {cut}}}}\left(r_{{ij}}\right)g\left({\mathbf {r}}-{\mathbf {r}}_{{ij}}\right),\qquad \qquad \qquad \qquad r_{{ij}}=|{\mathbf {r}}_{{ij}}|=|{\mathbf {r}}_{{j}}-{\mathbf {r}}_{{i}}|.}$

Here ${\displaystyle f_{{{\mathrm {cut}}}}}$ is a cut-off function acting outside of ${\displaystyle R_{{{\mathrm {cut}}}}}$. In the Smooth Overlap of Atomic Positions[3] (SOAP) method the delta function ${\displaystyle g({\mathbf {r}})}$ is approximated as

${\displaystyle g\left({\mathbf {r}}\right)={\frac {1}{{\sqrt {2\sigma _{{{\mathrm {atom}}}}\pi }}}}{\mathrm {exp}}\left(-{\frac {|{\mathbf {r}}|^{{2}}}{2\sigma _{{{\mathrm {atom}}}}^{{2}}}}\right).}$

Fig. 2: Radial and angular descriptors.

Unfortunately ${\displaystyle \rho _{{i}}\left({\mathbf {r}}\right)}$ is not rotationally invariant. To deal with this problem intermediate functions or descriptors depending on ${\displaystyle \rho _{{i}}\left({\mathbf {r}}\right)}$ possessing rotational invariance are introduced:

This is the simplest descriptor which relies on the radial distribution function

${\displaystyle \rho _{{i}}^{{(2)}}\left(r\right)={\frac {1}{4\pi }}\int \rho _{{i}}\left(r{\hat {{\mathbf {r}}}}\right)d{\hat {{\mathbf {r}}}},}$

where ${\displaystyle {\hat {{\mathbf {r}}}}}$ denotes the unit vector the vector ${\displaystyle {\mathbf {r}}}$ between atoms ${\displaystyle i}$ and ${\displaystyle j}$ [see Fig. 2 (a)]. The Radial descriptor can also be regarded as a two-body descriptor.

### Angular descriptor

In most cases the radial descriptor is not enough to distinguish different probability densities ${\displaystyle \rho _{{i}}}$, since two different ${\displaystyle \rho _{{i}}}$ can yield the same ${\displaystyle \rho _{{i}}^{{(2)}}}$. To improve on this angular information between two radial descriptors is also incorporated within an angular descriptor

${\displaystyle \rho _{{i}}^{{(3)}}\left(r,s,\theta \right)=\iint \delta \left({\hat {{\mathbf {r}}}}\cdot {\hat {{\mathbf {s}}}}-{\mathrm {cos}}\theta \right)\rho _{{i}}\left(r{\hat {{\mathbf {r}}}}\right)\rho _{{i}}^{{*}}\left(s{\hat {{\mathbf {s}}}}\right)d{\hat {{\mathbf {r}}}}d{\hat {{\mathbf {s}}}},}$

where ${\displaystyle \theta }$ denotes the angle between two vectors ${\displaystyle {\mathbf {r}}_{{ij}}}$ and ${\displaystyle {\mathbf {r}}_{{ik}}}$ [see Fig. 2 (b)]. The function ${\displaystyle \rho _{{i}}^{{(3)}}}$ is also commonly referred to as angular distribution function and it is equivalent to the power spectrum used in the Gaussian Approximation Potential[4] (GAP) method.

### Basis set expansion

The atomic probability density can be also expanded in terms of basis functions

${\displaystyle \rho _{{i}}\left({\mathbf {r}}\right)=\sum \limits _{{l=1}}^{{L_{{{\mathrm {max}}}}}}\sum \limits _{{m=-l}}^{{l}}\sum \limits _{{n=1}}^{{N_{{{\mathrm {R}}}}^{{l}}}}c_{{nlm}}^{{i}}\chi _{{nl}}\left(r\right)Y_{{lm}}\left({\hat {{\mathbf {r}}}}\right),}$

where ${\displaystyle c_{{nlm}}^{{i}}}$, ${\displaystyle \chi _{{nl}}}$ and ${\displaystyle Y_{{lm}}}$ denote expansion coefficients, radial basis functions and spherical harmonics, respectively. The indices ${\displaystyle n}$, ${\displaystyle l}$ and ${\displaystyle m}$ denote main, angular and magnetic quantum numbers, respectively.

By using the above equation the radial descriptor and angular descriptor (power spectrum) can be written as

${\displaystyle \rho _{{i}}^{{(2)}}\left(r\right)={\frac {1}{{\sqrt {4\pi }}}}\sum \limits _{{n=1}}^{{N_{{{\mathrm {R}}}}^{{0}}}}c_{{n00}}^{{i}}\chi _{{nl}}\left(r\right),}$

and

${\displaystyle \rho _{{i}}^{{(3)}}\left(r,s,\theta \right)=\sum \limits _{{l=1}}^{{L_{{{\mathrm {max}}}}}}\sum \limits _{{n=1}}^{{N_{{{\mathrm {R}}}}^{{l}}}}\sum \limits _{{\nu =1}}^{{N_{{{\mathrm {R}}}}^{{l}}}}{\sqrt {{\frac {2l+1}{2}}}}p_{{n\nu l}}^{{i}}\chi _{{nl}}\left(r\right)\chi _{{\nu l}}\left(s\right)P_{{l}}\left({\mathrm {cos}}\theta \right),}$

${\displaystyle p_{{n\nu l}}^{{i}}={\sqrt {{\frac {8\pi ^{{2}}}{2l+1}}}}\sum \limits _{{m=-l}}^{{l}}c_{{nlm}}^{{i}}c_{{\nu lm}}^{{i*}},}$

where ${\displaystyle P_{{l}}}$ represent Legendre polynomials of order ${\displaystyle l}$.

In many cases ${\displaystyle \chi _{{nl}}}$ is multiplied with an angular filtering function[5] ${\displaystyle \eta }$, which can noticably reduce the necessary basis set size without losing accuracy in the calculations

${\displaystyle \eta _{{l,a_{{{\mathrm {FILT}}}}}}={\frac {1}{1+a_{{{\mathrm {FILT}}}}[l(l+1)]^{{2}}}}}$

where ${\displaystyle a_{{{\mathrm {FILT}}}}}$ is a paramater controling the extent of the filtering. A larger value for the parameter should provide more filtering.

## Potential energy fitting

It is convenient to express the local potential energy of atom ${\displaystyle i}$ in structure ${\displaystyle \alpha }$ in terms of linear coefficients ${\displaystyle w_{{i_{{\mathrm {B}}}}}}$ and a kernel ${\displaystyle K\left({\mathbf {X}}_{{i}},{\mathbf {X}}_{{i_{{\mathrm {B}}}}}\right)}$ as follows

${\displaystyle U_{{i}}^{{\alpha }}=\sum \limits _{{i_{{\mathrm {B}}}=1}}^{{N_{{\mathrm {B}}}}}w_{{i_{{\mathrm {B}}}}}K\left({\mathbf {X}}_{{i}}^{{\alpha }},{\mathbf {X}}_{{i_{{\mathrm {B}}}}}\right)}$

where ${\displaystyle N_{{\mathrm {B}}}}$ is the basis set size. The kernel ${\displaystyle K\left({\mathbf {X}}_{{i}},{\mathbf {X}}_{{i_{{\mathrm {B}}}}}\right)}$ measures the similarity between a local configuration from the training set ${\displaystyle \rho _{{i}}({\mathbf {r}})}$ and basis set ${\displaystyle \rho _{{i_{{\mathrm {B}}}}}({\mathbf {r}})}$. Using the radial and angular descriptors it is written as

${\displaystyle K\left({\mathbf {X}}_{{i}},{\mathbf {X}}_{{i_{{\mathrm {B}}}}}\right)=\beta ^{{(2)}}\left({\mathbf {X}}_{{i}}^{{(2)}}\cdot {\mathbf {X}}_{{i_{{{\mathrm {B}}}}}}^{{(2)}}\right)+\beta ^{{(3)}}\left({\mathbf {{\hat {X}}}}_{{i}}^{{(3)}}\cdot {\mathbf {{\hat {X}}}}_{{i_{{{\mathrm {B}}}}}}^{{(3)}}\right)^{{\zeta ^{{(3)}}}}.}$

Here the vectors ${\displaystyle {\mathbf {X}}_{{i}}^{{(2)}}}$ and ${\displaystyle {\mathbf {{\hat {X}}}}_{{i}}^{{(3)}}}$ contain all coefficients ${\displaystyle c_{{n00}}^{{i}}}$ and ${\displaystyle p_{{n\nu l}}^{{i}}}$, respectively. The notation ${\displaystyle {\mathbf {{\hat {X}}}}_{{i}}^{{(3)}}}$ indicates that it is a normalized vector of ${\displaystyle {\mathbf {X}}_{{i}}^{{(3)}}}$. The parameters ${\displaystyle \beta ^{{(2)}}}$ and ${\displaystyle \beta ^{{(3)}}}$ control the weighting of the radial and angular terms, respectively. The parameter ${\displaystyle \zeta ^{{(3)}}}$ controls the sharpness of the kernel and the order of the many-body interactions.

### Matrix vector form of linear equations

Similarly to the energy ${\displaystyle U_{{i}}^{{\alpha }}}$the forces and the stress tensor are also described as linear functions of the coefficients ${\displaystyle w_{{i_{{\mathrm {B}}}}}}$. All three are fitted simultaneously which leads to following matrix-vector form

${\displaystyle {\mathbf {Y}}={\mathbf {\Phi }}{\mathbf {w}}}$

where ${\displaystyle {\mathbf {Y}}}$ is a super vector consisting of the sub vectors ${\displaystyle \{{\mathbf {y}}^{{\alpha }}|\alpha =1,...,N_{{{\mathrm {st}}}}\}}$. Here each ${\displaystyle {\mathbf {y}}^{{\alpha }}}$ contains the first principle energies per atom, forces and stress tensors for each structure ${\displaystyle \alpha }$. ${\displaystyle N_{{\mathrm {st}}}}$ denotes the total number of structures. The size of ${\displaystyle {\mathbf {Y}}}$ is ${\displaystyle N_{{{\mathrm {st}}}}\times (1+3N_{{a}}^{{\alpha }}+6)}$.

The matrix ${\displaystyle {\mathbf {\Phi }}}$ is also called as design matrix[6]. The rows of this matrix are blocked for each structure ${\displaystyle \alpha }$, where the first line of each block consists of the kernel used to calculate the energy. The subsequent ${\displaystyle 3N_{{a}}^{{\alpha }}}$ lines consist of the derivatives of the kernel with respect to the atomic coordinates used to calculate the forces. The final 6 lines within each structure consist of the derivatives of the kernel with respect to the unit cell coordinates used to calculate the stress tensor components. The overall size of ${\displaystyle {\mathbf {Y}}}$ is ${\displaystyle N_{{{\mathrm {st}}}}\times (1+3N_{{a}}^{{\alpha }}+6)\times N_{{\mathrm {B}}}}$ looking like as follows

## Error estimation

We use two different estimators for the errors in the force field:

• Bayesian error estimation.
• Spilling factor[7].

### Bayesian error estimation

Ultimately for error prediction we want to get the maximized probability of observing a new structure ${\displaystyle {\mathbf {y}}}$ on basis of the training set ${\displaystyle {\mathbf {Y}}}$, which is denoted as ${\displaystyle p\left({\mathbf {y}}|{\mathbf {Y}}\right)}$. For this we need to get from the error of the linear fitting coefficients ${\displaystyle {\mathbf {w}}}$ in the reproduction of the training data ${\displaystyle p\left({\mathbf {w}}|{\mathbf {Y}}\right)}$ to ${\displaystyle p\left({\mathbf {y}}|{\mathbf {Y}}\right)}$ which is explained in the following.

First we obtain ${\displaystyle p\left({\mathbf {w}}|{\mathbf {Y}}\right)}$ from the Bayesian theorem

${\displaystyle p\left({\mathbf {w}}|{\mathbf {Y}}\right)={\frac {p\left({\mathbf {Y}}|{\mathbf {w}}\right)p\left({\mathbf {w}}\right)}{p\left({\mathbf {Y}}\right)}}}$,

where we assume multivariate Gaussian distributions ${\displaystyle {\mathcal {N}}}$ for the likelihood function

${\displaystyle p\left({\mathbf {Y}}|{\mathbf {w}}\right)={\mathcal {N}}\left({\mathbf {Y}}|{\mathbf {\Phi }}{\mathbf {w}},\sigma _{{{\mathrm {v}}}}^{{2}}{\mathbf {I}}\right)}$

and the (conjugate) prior

${\displaystyle p\left({\mathbf {w}}\right)={\mathcal {N}}\left({\mathbf {w}}|{\mathbf {0}},\sigma _{{{\mathrm {w}}}}^{{2}}{\mathbf {I}}\right).}$

The parameters ${\displaystyle \sigma _{{{\mathrm {w}}}}^{{2}}}$ and ${\displaystyle \sigma _{{{\mathrm {v}}}}^{{2}}}$ need to be optimized to balance the accuracy and robustness of the force field (see below).The ormalization is obtained by

${\displaystyle p\left({\mathbf {Y}}\right)=\int p\left({\mathbf {Y}}|{\mathbf {w}}\right)p\left({\mathbf {w}}\right)d{\mathbf {w}}.}$

Using the equations from above and the completing square method[6] ${\displaystyle p\left({\mathbf {w}}|{\mathbf {Y}}\right)}$ is obtained as follows

${\displaystyle p\left({\mathbf {w}}|{\mathbf {Y}}\right)={\mathcal {N}}\left({\mathbf {w}}|{\mathbf {{\bar w}}},{\mathbf {\Sigma }}\right).}$

Since the prior and the likelood function are described by multivariate Gaussian distributions the posterior ${\displaystyle {\mathcal {N}}\left({\mathbf {w}}|{\mathbf {{\bar w}}},{\mathbf {\Sigma }}\right)}$ describes a multivariate Gaussian written as

${\displaystyle {\mathcal {N}}\left({\mathbf {w}}|{\mathbf {{\bar w}}},{\mathbf {\Sigma }}\right)={\frac {1}{{\sqrt {\left(2\pi \right)^{{N_{{{\mathrm {B}}}}}}{\mathrm {det}}{\mathbf {\Sigma }}}}}}{\mathrm {exp}}\left[-{\frac {\left({\mathbf {w}}-{\mathbf {{\bar w}}}\right)^{{{\mathrm {T}}}}{\mathbf {\Sigma }}^{{-1}}\left({\mathbf {w}}-{\mathbf {{\bar w}}}\right)}{2}}\right]}$

where we use the following definitions for the center of the Guassian distribution ${\displaystyle {\mathbf {{\bar w}}}}$ and the covariance matrix ${\displaystyle {\mathbf {\Sigma }}}$

${\displaystyle {\mathbf {{\bar w}}}={\frac {1}{\sigma _{{{\mathrm {v}}}}^{{2}}}}{\mathbf {\Sigma }}{\mathbf {\Phi }}^{{{\mathrm {T}}}}{\mathbf {Y}},}$

${\displaystyle {\mathbf {\Sigma }}^{{-1}}={\frac {1}{\sigma _{{{\mathrm {w}}}}^{{2}}}}{\mathbf {I}}+{\frac {1}{\sigma _{{{\mathrm {v}}}}^{{2}}}}{\mathbf {\Phi }}^{{{\mathrm {T}}}}{\mathbf {\Phi }}.}$

By using the relation

${\displaystyle p\left({\mathbf {y}}|{\mathbf {Y}}\right)=\int p\left({\mathbf {y}}|{\mathbf {w}}\right)p\left({\mathbf {w}}|{\mathbf {Y}}\right)d{\mathbf {w}}}$

and the completing square method[6] the distribution of ${\displaystyle p\left({\mathbf {y}}|{\mathbf {Y}}\right)}$ is written as

${\displaystyle p\left({\mathbf {y}}|{\mathbf {Y}}\right)={\mathcal {N}}\left({\mathbf {\phi }}{\mathbf {{\bar w}}},{\mathbf {\sigma }}\right),}$

${\displaystyle {\mathbf {\sigma }}=\sigma _{{{\mathrm {v}}}}^{{2}}{\mathbf {I}}+{\mathbf {\phi }}^{{{\mathrm {T}}}}{\mathbf {\Sigma }}{\mathbf {\phi }}.}$

Here ${\displaystyle {\mathbf {\phi }}}$ is similar to ${\displaystyle {\mathbf {\Phi }}}$ but only for the new structure ${\displaystyle {\mathbf {y}}}$.

The mean vector ${\displaystyle {\mathbf {\phi }}{\mathbf {{\bar w}}}}$ contains the results of the predictions on the dimensionless energy per atom, forces and stress tensor. The diagonal elements of ${\displaystyle {\mathbf {\sigma }}}$, which correspond to the variances of the predicted results, are used as the uncertainty in the prediction.

Finally to get the best results and to prevent overfitting the parameters ${\displaystyle \sigma _{{{\mathrm {v}}}}^{{2}}}$ and ${\displaystyle \sigma _{{{\mathrm {w}}}}^{{2}}}$ have to be optimized. To achieve this, we use the evidence approximation[8][9][10] (also called as empirical bayes, 2 maximum likelihood or generalized maximum likelihood), which maximizes the evidence function (also called model evidence) defined as

${\displaystyle p\left({\mathbf {Y}}|\sigma _{{{\mathrm {v}}}}^{{2}},\sigma _{{{\mathrm {w}}}}^{{2}}\right)=\left({\frac {1}{{\sqrt {2\pi \sigma _{{{\mathrm {v}}}}^{{2}}}}}}\right)^{{M}}\left({\frac {1}{{\sqrt {2\pi \sigma _{{{\mathrm {w}}}}^{{2}}}}}}\right)^{{N_{{{\mathrm {B}}}}}}\int {\mathrm {exp}}\left[-E\left({\mathbf {w}}\right)\right]d{\mathbf {w}},}$

${\displaystyle E\left({\mathbf {w}}\right)={\frac {1}{2\sigma _{{{\mathrm {v}}}}^{{2}}}}||{\mathbf {\Phi }}{\mathbf {w}}-{\mathbf {Y}}||^{{2}}+{\frac {1}{2\sigma _{{{\mathrm {w}}}}^{{2}}}}||{\mathbf {w}}||^{{2}}.}$

The function ${\displaystyle p\left({\mathbf {Y}}|\sigma _{{{\mathrm {v}}}}^{{2}},\sigma _{{{\mathrm {w}}}}^{{2}}\right)}$ corresponds to the probability that the regression model with parameters ${\displaystyle \sigma _{{{\mathrm {v}}}}^{{2}}}$ and ${\displaystyle \sigma _{{{\mathrm {w}}}}^{{2}}}$ provides the reference data ${\displaystyle {\mathbf {Y}}}$. Hence the best fit is optimized by maximizing this probability. The maximization is carried out by simultaneously solving the following equations

${\displaystyle \sigma _{{{\mathrm {w}}}}^{{2}}={\frac {|{\mathbf {{\bar {w}}}}|^{{2}}}{\gamma }},}$

${\displaystyle \sigma _{{{\mathrm {v}}}}^{{2}}={\frac {|{\mathbf {T}}-{\mathbf {\phi }}{\mathbf {{\bar {w}}}}|^{{2}}}{M-\gamma }},}$

${\displaystyle \gamma =\sum \limits _{{k=1}}^{{N_{{{\mathrm {B}}}}}}{\frac {\lambda _{{k}}}{\lambda _{{k}}+1/\sigma _{{{\mathrm {w}}}}^{{2}}}}}$

where ${\displaystyle \lambda _{{k}}}$ are the eigenvalues of the matrix ${\displaystyle {\mathbf {\Phi }}^{{{\mathrm {T}}}}{\mathbf {\Phi }}/\sigma _{{{\mathrm {v}}}}^{{2}}}$.

### Spilling factor

The spilling factor is a measure of the density of a local reference configuration ${\displaystyle i_{{{\mathrm {B}}}}}$ near the new local configuration ${\displaystyle {{\mathbf {X}}}}$ given as

${\displaystyle s=1-{\frac {\sum \limits _{{i_{{{\mathrm {B}}}}=1}}^{{N_{{{\mathrm {B}}}}}}\sum \limits _{{i'_{{{\mathrm {B}}}}=1}}^{{N_{{{\mathrm {B}}}}}}K\left({\mathbf {X}},{\mathbf {X}}_{{i_{{{\mathrm {B}}}}}}\right)K^{{-1}}\left({\mathbf {X}}_{{i_{{{\mathrm {B}}}}}},{\mathbf {X}}_{{i'_{{{\mathrm {B}}}}}}\right)K\left({\mathbf {X}}_{{i'_{{{\mathrm {B}}}}}},{\mathbf {X}}\right)}{K\left({\mathbf {X}},{\mathbf {X}}\right)}}.}$

It should be noted that the equation is slightly modified from its orginial equation given by Miwa and Ohno[7] to make it applicable to a non-normalized similarity measure such as, e.g. the radial descriptor, too.