
# Machine learning force field: Theory

Here we present the theory behind on-the-fly machine learning force fields, which are available from VASP 6.0 onwards. The theory is presented in a very condensed manner; for a more detailed description of the methods we refer the reader to Ref. [1].

## Algorithm

Fig. 1: On-the-fly machine learning force field generation scheme.

Before presenting the algorithm we want to define the following terms:

• Structure datasets: A single structure dataset consists of the Bravais lattice, the atomic positions, the total energy, the forces and the stress tensor for one specific structure, calculated from first principles.
• Local configurations: The local configuration around each atom, mapped onto descriptors. Several structure datasets and local configurations are selected, and the machine-learning force field is fitted to those.

The on-the-fly force field generation scheme is given by the following steps (a flowchart of the algorithm is shown in Fig. 1):

1. The machine predicts the energy, forces and stress tensor, together with their uncertainties, for a given structure using the existing force field.
2. The machine decides whether or not to perform a first-principles calculation. If it decides not to, the algorithm skips steps 3 and 4.
3. The first-principles calculation is carried out and the data is stored as a candidate for a new reference dataset.
4. If the number of newly collected structures reaches a certain threshold, or if the uncertainty in the prediction becomes too large, the machine updates the set of reference structure datasets and local reference configurations and creates a new force field.
5. Atomic positions and velocities are updated before returning to step 1, or the calculation ends if the total number of ionic steps ${\displaystyle I=N_{\mathrm {MD} }}$ is reached. The update is done in the following way:
1. If the machine judges that the force field is not accurate enough, the first-principles energy, forces and stress tensor are used.
2. Otherwise the force field is used.
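The control flow above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not VASP code: the functions `predict` and `first_principles` are hypothetical stubs, and the threshold and retraining interval are made-up values.

```python
import random

random.seed(0)

N_MD = 20            # total number of ionic steps
THRESHOLD = 0.02     # uncertainty threshold (hypothetical units)
UPDATE_EVERY = 5     # retrain after this many new reference structures

def predict(structure, ff):
    """Stub: force-field prediction returning (forces, uncertainty)."""
    return 0.0, random.uniform(0.0, 0.05)

def first_principles(structure):
    """Stub standing in for a DFT single-point calculation."""
    return 0.0

new_refs = []        # candidate reference structures collected so far
ff = object()        # current force field (opaque stand-in)
structure = "initial"

for step in range(N_MD):
    forces, sigma = predict(structure, ff)            # step 1: predict + error
    if sigma > THRESHOLD:                             # step 2: decide
        ab_initio_forces = first_principles(structure)  # step 3: DFT
        new_refs.append(structure)                      # store candidate
        if len(new_refs) >= UPDATE_EVERY:               # step 4: retrain
            ff = object()                               # refit force field
            new_refs.clear()
        forces = ab_initio_forces                     # step 5.1: use DFT data
    # step 5.2 (otherwise): propagate with the force-field forces
    structure = f"step-{step + 1}"
```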

## Descriptor

The potential energy ${\displaystyle U}$ of a structure with ${\displaystyle N_{a}}$ atoms is approximated as

${\displaystyle U=\sum \limits _{i=1}^{N_{\mathrm {a} }}U_{i}.}$

The local energies ${\displaystyle U_{i}}$ are functionals ${\displaystyle U_{i}=F[\rho _{i}(\mathbf {r} )]}$ of the probability density ${\displaystyle \rho _{i}}$ to find another atom ${\displaystyle j}$ at the position ${\displaystyle \mathbf {r} }$ around the atom ${\displaystyle i}$ within a cut-off radius ${\displaystyle R_{\mathrm {cut} }}$ defined as

${\displaystyle \rho _{i}\left(\mathbf {r} \right)=\sum \limits _{j=1}^{N_{\mathrm {a} }}f_{\mathrm {cut} }\left(r_{ij}\right)g\left(\mathbf {r} -\mathbf {r} _{ij}\right),\qquad \qquad \qquad \qquad r_{ij}=|\mathbf {r} _{ij}|=|\mathbf {r} _{j}-\mathbf {r} _{i}|.}$

Here ${\displaystyle f_{\mathrm {cut} }}$ is a cut-off function that eliminates contributions from atoms outside of ${\displaystyle R_{\mathrm {cut} }}$. In the Smooth Overlap of Atomic Positions[2] (SOAP) method the delta function ${\displaystyle g(\mathbf {r} )}$ is approximated as

${\displaystyle g\left(\mathbf {r} \right)={\frac {1}{\sqrt {2\pi \sigma _{\mathrm {atom} }^{2}}}}\mathrm {exp} \left(-{\frac {|\mathbf {r} |^{2}}{2\sigma _{\mathrm {atom} }^{2}}}\right).}$
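As an illustration, the smoothed density ${\displaystyle \rho _{i}\left(\mathbf {r} \right)}$ can be evaluated directly on a point. The cosine form of the cut-off function and all parameter values below are illustrative assumptions, not necessarily the choices made in VASP.

```python
import numpy as np

R_CUT = 5.0          # cut-off radius (arbitrary length units)
SIGMA_ATOM = 0.5     # Gaussian broadening width

def f_cut(r):
    """Smooth cut-off that vanishes for r >= R_CUT (illustrative cosine form)."""
    return np.where(r < R_CUT, 0.5 * (np.cos(np.pi * r / R_CUT) + 1.0), 0.0)

def density(r, positions_j, position_i):
    """Smoothed probability density rho_i(r) around atom i."""
    r_ij = positions_j - position_i                  # vectors r_ij = r_j - r_i
    d_ij = np.linalg.norm(r_ij, axis=1)              # distances r_ij
    gauss = (np.exp(-np.sum((r - r_ij) ** 2, axis=1) / (2.0 * SIGMA_ATOM ** 2))
             / np.sqrt(2.0 * np.pi * SIGMA_ATOM ** 2))
    return np.sum(f_cut(d_ij) * gauss)

pos_i = np.zeros(3)
# the third neighbour lies outside R_CUT and contributes nothing
neighbours = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [6.0, 0.0, 0.0]])
print(density(np.array([1.0, 0.0, 0.0]), neighbours, pos_i))
```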

Fig. 2: Radial and angular descriptors.

Unfortunately ${\displaystyle \rho _{i}\left(\mathbf {r} \right)}$ is not rotationally invariant. To deal with this problem, intermediate functions, or descriptors, which depend on ${\displaystyle \rho _{i}\left(\mathbf {r} \right)}$ but possess rotational invariance are introduced:

### Radial descriptor

This is the simplest descriptor, which relies on the radial distribution function

${\displaystyle \rho _{i}^{(2)}\left(r\right)={\frac {1}{4\pi }}\int \rho _{i}\left(r{\hat {\mathbf {r} }}\right)d{\hat {\mathbf {r} }},}$

where ${\displaystyle {\hat {\mathbf {r} }}}$ denotes the unit vector along the vector ${\displaystyle \mathbf {r} }$ between atoms ${\displaystyle i}$ and ${\displaystyle j}$ [see Fig. 2 (a)]. The radial descriptor can also be regarded as a two-body descriptor.

### Angular descriptor

In most cases the radial descriptor is not enough to distinguish different probability densities ${\displaystyle \rho _{i}}$, since two different ${\displaystyle \rho _{i}}$ can yield the same ${\displaystyle \rho _{i}^{(2)}}$. To improve on this, angular information between two radial descriptors is also incorporated within an angular descriptor

${\displaystyle \rho _{i}^{(3)}\left(r,s,\theta \right)=\iint \delta \left({\hat {\mathbf {r} }}\cdot {\hat {\mathbf {s} }}-\mathrm {cos} \theta \right)\rho _{i}\left(r{\hat {\mathbf {r} }}\right)\rho _{i}^{*}\left(s{\hat {\mathbf {s} }}\right)d{\hat {\mathbf {r} }}d{\hat {\mathbf {s} }},}$

where ${\displaystyle \theta }$ denotes the angle between two vectors ${\displaystyle \mathbf {r} _{ij}}$ and ${\displaystyle \mathbf {r} _{ik}}$ [see Fig. 2 (b)]. The function ${\displaystyle \rho _{i}^{(3)}}$ is also commonly referred to as angular distribution function and it is equivalent to the power spectrum used in the Gaussian Approximation Potential[3] (GAP) method.

### Basis set expansion

The atomic probability density can also be expanded in terms of basis functions

${\displaystyle \rho _{i}\left(\mathbf {r} \right)=\sum \limits _{l=1}^{L_{\mathrm {max} }}\sum \limits _{m=-l}^{l}\sum \limits _{n=1}^{N_{\mathrm {R} }^{l}}c_{nlm}^{i}\chi _{nl}\left(r\right)Y_{lm}\left({\hat {\mathbf {r} }}\right),}$

where ${\displaystyle c_{nlm}^{i}}$, ${\displaystyle \chi _{nl}}$ and ${\displaystyle Y_{lm}}$ denote expansion coefficients, radial basis functions and spherical harmonics, respectively. The indices ${\displaystyle n}$, ${\displaystyle l}$ and ${\displaystyle m}$ denote main, angular and magnetic quantum numbers, respectively.

By using the above equation the radial descriptor and angular descriptor (power spectrum) can be written as

${\displaystyle \rho _{i}^{(2)}\left(r\right)={\frac {1}{\sqrt {4\pi }}}\sum \limits _{n=1}^{N_{\mathrm {R} }^{0}}c_{n00}^{i}\chi _{n0}\left(r\right),}$

and

${\displaystyle \rho _{i}^{(3)}\left(r,s,\theta \right)=\sum \limits _{l=1}^{L_{\mathrm {max} }}\sum \limits _{n=1}^{N_{\mathrm {R} }^{l}}\sum \limits _{\nu =1}^{N_{\mathrm {R} }^{l}}{\sqrt {\frac {2l+1}{2}}}p_{n\nu l}^{i}\chi _{nl}\left(r\right)\chi _{\nu l}\left(s\right)P_{l}\left(\mathrm {cos} \theta \right),}$

${\displaystyle p_{n\nu l}^{i}={\sqrt {\frac {8\pi ^{2}}{2l+1}}}\sum \limits _{m=-l}^{l}c_{nlm}^{i}c_{\nu lm}^{i*},}$

where ${\displaystyle P_{l}}$ represent Legendre polynomials of order ${\displaystyle l}$.
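The contraction defining ${\displaystyle p_{n\nu l}^{i}}$ is a small tensor operation over the expansion coefficients. A minimal NumPy sketch, using random stand-in coefficients and illustrative basis-set sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
L_MAX, N_R = 3, 4    # angular and radial basis-set sizes (illustrative)

# complex expansion coefficients c^i_{nlm}, stored as [n, l, m + l]
c = (rng.standard_normal((N_R, L_MAX + 1, 2 * L_MAX + 1))
     + 1j * rng.standard_normal((N_R, L_MAX + 1, 2 * L_MAX + 1)))

def power_spectrum(c):
    """p^i_{n nu l} = sqrt(8 pi^2 / (2l+1)) * sum_m c_{nlm} c*_{nu l m}."""
    n_r, l_max1, _ = c.shape
    p = np.zeros((n_r, n_r, l_max1), dtype=complex)
    for l in range(l_max1):
        cl = c[:, l, : 2 * l + 1]        # only |m| <= l entries are meaningful
        p[:, :, l] = np.sqrt(8 * np.pi ** 2 / (2 * l + 1)) * cl @ cl.conj().T
    return p

p = power_spectrum(c)
# p_{n nu l} is Hermitian in the radial indices (n, nu)
print(np.allclose(p, p.transpose(1, 0, 2).conj()))
```

Because each ${\displaystyle p_{n\nu l}^{i}}$ sums over all ${\displaystyle m}$, it is invariant under rotations of the local environment, which is the point of working with the power spectrum instead of the raw coefficients.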

## Potential energy fitting

It is convenient to express the local potential energy of atom ${\displaystyle i}$ in structure ${\displaystyle \alpha }$ in terms of linear coefficients ${\displaystyle w_{i_{\mathrm {B} }}}$ and a kernel ${\displaystyle K\left(\mathbf {X} _{i},\mathbf {X} _{i_{\mathrm {B} }}\right)}$ as follows

${\displaystyle U_{i}^{\alpha }=\sum \limits _{i_{\mathrm {B} }=1}^{N_{\mathrm {B} }}w_{i_{\mathrm {B} }}K\left(\mathbf {X} _{i}^{\alpha },\mathbf {X} _{i_{\mathrm {B} }}\right)}$

where ${\displaystyle N_{\mathrm {B} }}$ is the basis set size. The kernel ${\displaystyle K\left(\mathbf {X} _{i},\mathbf {X} _{i_{\mathrm {B} }}\right)}$ measures the similarity between a local configuration ${\displaystyle \rho _{i}(\mathbf {r} )}$ from the training set and a local reference configuration ${\displaystyle \rho _{i_{\mathrm {B} }}(\mathbf {r} )}$ from the basis set. Using the radial and angular descriptors it is written as

${\displaystyle K\left(\mathbf {X} _{i},\mathbf {X} _{i_{\mathrm {B} }}\right)=\beta ^{(2)}\left(\mathbf {X} _{i}^{(2)}\cdot \mathbf {X} _{i_{\mathrm {B} }}^{(2)}\right)+\beta ^{(3)}\left(\mathbf {\hat {X}} _{i}^{(3)}\cdot \mathbf {\hat {X}} _{i_{\mathrm {B} }}^{(3)}\right)^{\zeta ^{(3)}}.}$

Here the vectors ${\displaystyle \mathbf {X} _{i}^{(2)}}$ and ${\displaystyle \mathbf {\hat {X}} _{i}^{(3)}}$ contain all coefficients ${\displaystyle c_{n00}^{i}}$ and ${\displaystyle p_{n\nu l}^{i}}$, respectively. The notation ${\displaystyle \mathbf {\hat {X}} _{i}^{(3)}}$ indicates that it is a normalized vector of ${\displaystyle \mathbf {X} _{i}^{(3)}}$. The parameters ${\displaystyle \beta ^{(2)}}$ and ${\displaystyle \beta ^{(3)}}$ control the weighting of the radial and angular terms, respectively. The parameter ${\displaystyle \zeta ^{(3)}}$ controls the sharpness of the kernel and the order of the many-body interactions.
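A sketch of the kernel evaluation, assuming the descriptor vectors are already available; the weights and sharpness below are illustrative values, not recommended settings.

```python
import numpy as np

BETA2, BETA3, ZETA3 = 0.1, 0.9, 4   # illustrative weights and sharpness

def kernel(x2_i, x2_b, x3_i, x3_b):
    """Similarity kernel between two local configurations.

    x2_*: radial descriptor vectors (coefficients c_{n00});
    x3_*: angular descriptor vectors (power spectrum p_{n nu l}),
    normalized before the dot product and raised to the power zeta."""
    x3_i = x3_i / np.linalg.norm(x3_i)
    x3_b = x3_b / np.linalg.norm(x3_b)
    return BETA2 * np.dot(x2_i, x2_b) + BETA3 * np.dot(x3_i, x3_b) ** ZETA3

rng = np.random.default_rng(1)
x2, x3 = rng.standard_normal(8), rng.standard_normal(20)
# a configuration compared with itself: radial term |x2|^2, angular term 1
print(kernel(x2, x2, x3, x3))
```

Raising the normalized angular dot product to the power ${\displaystyle \zeta ^{(3)}}$ sharpens the kernel: products of the descriptor components appear up to order ${\displaystyle \zeta ^{(3)}}$, which raises the effective order of the many-body interactions.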

### Matrix vector form of linear equations

Similarly to the energy ${\displaystyle U_{i}^{\alpha }}$, the forces and the stress tensor are also described as linear functions of the coefficients ${\displaystyle w_{i_{\mathrm {B} }}}$. All three are fitted simultaneously, which leads to the following matrix-vector form

${\displaystyle \mathbf {Y} =\mathbf {\Phi } \mathbf {w} }$

where ${\displaystyle \mathbf {Y} }$ is a super vector consisting of the sub vectors ${\displaystyle \{\mathbf {y} ^{\alpha }|\alpha =1,...,N_{\mathrm {st} }\}}$. Each ${\displaystyle \mathbf {y} ^{\alpha }}$ contains the first-principles energy per atom, the forces and the stress tensor for structure ${\displaystyle \alpha }$, and ${\displaystyle N_{\mathrm {st} }}$ denotes the total number of structures. The size of ${\displaystyle \mathbf {Y} }$ is ${\displaystyle N_{\mathrm {st} }\times (1+3N_{a}^{\alpha }+6)}$.

The matrix ${\displaystyle \mathbf {\Phi } }$ is also called the design matrix[4]. The rows of this matrix are blocked for each structure ${\displaystyle \alpha }$: the first line of each block consists of the kernels used to calculate the energy. The subsequent ${\displaystyle 3N_{a}^{\alpha }}$ lines consist of the derivatives of the kernels with respect to the atomic coordinates, used to calculate the forces. The final 6 lines within each block consist of the derivatives of the kernels with respect to the unit cell coordinates, used to calculate the stress tensor components. The overall size of ${\displaystyle \mathbf {\Phi } }$ is ${\displaystyle N_{\mathrm {st} }\times (1+3N_{a}^{\alpha }+6)\times N_{\mathrm {B} }}$, and the matrix looks as follows

${\displaystyle \mathbf {\Phi } ={\begin{bmatrix}\sum _{i}{\frac {1}{N_{\mathrm {a} }^{\alpha =1}}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}{\frac {1}{N_{\mathrm {a} }^{\alpha =1}}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{x_{1}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{x_{1}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{y_{1}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{y_{1}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{z_{1}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{z_{1}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{x_{2}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{x_{2}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{y_{2}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{y_{2}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{z_{1}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{z_{2}}K\left(\mathbf {X} _{i}^{\alpha =1},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\ldots &\ldots &\dots &\dots &\dots \\\ldots &\ldots &\dots &\dots &\dots \\\ldots &\ldots &\dots &\dots &\dots \\\sum _{i}{\frac {1}{N_{\mathrm {a} }^{\alpha =2}}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}{\frac {1}{N_{\mathrm {a} }^{\alpha =2}}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{x_{1}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{x_{1}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{y_{1}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{y_{1}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{z_{1}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{z_{1}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{x_{2}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{x_{2}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{y_{2}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{y_{2}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\sum _{i}\nabla _{z_{2}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=1}\right)&\sum _{i}\nabla _{z_{2}}K\left(\mathbf {X} _{i}^{\alpha =2},\mathbf {X} _{i_{\mathrm {B} }=2}\right)&\dots &\dots &\dots \\\ldots &\ldots &\dots &\dots &\dots \\\ldots &\ldots &\dots &\dots &\dots \\\ldots &\ldots &\dots &\dots &\dots \end{bmatrix}}.}$
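Once ${\displaystyle \mathbf {\Phi } }$ and ${\displaystyle \mathbf {Y} }$ are assembled, the coefficients ${\displaystyle \mathbf {w} }$ follow from a regularized least-squares fit. The sketch below uses a random matrix as a stand-in for the design matrix; the regularization strength is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
n_rows, n_basis = 50, 8          # rows of Phi (energy/force/stress lines), basis size
Phi = rng.standard_normal((n_rows, n_basis))
w_true = rng.standard_normal(n_basis)
Y = Phi @ w_true + 0.01 * rng.standard_normal(n_rows)   # noisy observations

# ridge-regularized solution of Y = Phi w via the normal equations
lam = 1e-6
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_basis), Phi.T @ Y)
print(np.allclose(w, w_true, atol=0.05))
```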

## Error estimation

We use two different estimators for the errors in the force field:

• Bayesian error estimation.
• Spilling factor[5].

### Bayesian error estimation

Ultimately, for error prediction we want the maximized probability of observing a new structure ${\displaystyle \mathbf {y} }$ given the training set ${\displaystyle \mathbf {Y} }$, which is denoted as ${\displaystyle p\left(\mathbf {y} |\mathbf {Y} \right)}$. For this we need to get from ${\displaystyle p\left(\mathbf {w} |\mathbf {Y} \right)}$, the distribution of the linear fitting coefficients ${\displaystyle \mathbf {w} }$ given the training data, to ${\displaystyle p\left(\mathbf {y} |\mathbf {Y} \right)}$, as explained in the following.

First we obtain ${\displaystyle p\left(\mathbf {w} |\mathbf {Y} \right)}$ from Bayes' theorem

${\displaystyle p\left(\mathbf {w} |\mathbf {Y} \right)={\frac {p\left(\mathbf {Y} |\mathbf {w} \right)p\left(\mathbf {w} \right)}{p\left(\mathbf {Y} \right)}}}$,

where we assume multivariate Gaussian distributions ${\displaystyle {\mathcal {N}}}$ for the likelihood function

${\displaystyle p\left(\mathbf {Y} |\mathbf {w} \right)={\mathcal {N}}\left(\mathbf {Y} |\mathbf {\Phi } \mathbf {w} ,\sigma _{\mathrm {v} }^{2}\mathbf {I} \right)}$

and the (conjugate) prior

${\displaystyle p\left(\mathbf {w} \right)={\mathcal {N}}\left(\mathbf {w} |\mathbf {0} ,\sigma _{\mathrm {w} }^{2}\mathbf {I} \right).}$

The parameters ${\displaystyle \sigma _{\mathrm {w} }^{2}}$ and ${\displaystyle \sigma _{\mathrm {v} }^{2}}$ need to be optimized to balance the accuracy and robustness of the force field (see below). The normalization is obtained from

${\displaystyle p\left(\mathbf {Y} \right)=\int p\left(\mathbf {Y} |\mathbf {w} \right)p\left(\mathbf {w} \right)d\mathbf {w} .}$

Using the equations from above, ${\displaystyle p\left(\mathbf {w} |\mathbf {Y} \right)}$ is obtained as follows

${\displaystyle p\left(\mathbf {w} |\mathbf {Y} \right)={\mathcal {N}}\left(\mathbf {w} |\mathbf {\bar {w}} ,\mathbf {\Sigma } \right).}$

Since the prior and the likelihood function are described by multivariate Gaussian distributions, the posterior ${\displaystyle {\mathcal {N}}\left(\mathbf {w} |\mathbf {\bar {w}} ,\mathbf {\Sigma } \right)}$ is a multivariate Gaussian written as

${\displaystyle {\mathcal {N}}\left(\mathbf {w} |\mathbf {\bar {w}} ,\mathbf {\Sigma } \right)={\frac {1}{\sqrt {\left(2\pi \right)^{N_{\mathrm {B} }}\mathrm {det} \mathbf {\Sigma } }}}\times \mathrm {exp} \left[-{\frac {\left(\mathbf {w} -\mathbf {\bar {w}} \right)^{\mathrm {T} }\mathbf {\Sigma } ^{-1}\left(\mathbf {w} -\mathbf {\bar {w}} \right)}{2}}\right]}$

where we use the following definitions for the center of the Gaussian distribution ${\displaystyle \mathbf {\bar {w}} }$ and the covariance matrix ${\displaystyle \mathbf {\Sigma } }$

${\displaystyle \mathbf {\bar {w}} ={\frac {1}{\sigma _{\mathrm {v} }^{2}}}\mathbf {\Sigma } \mathbf {\Phi } ^{\mathrm {T} }\mathbf {Y} ,}$

${\displaystyle \mathbf {\Sigma } ^{-1}={\frac {1}{\sigma _{\mathrm {w} }^{2}}}\mathbf {I} +{\frac {1}{\sigma _{\mathrm {v} }^{2}}}\mathbf {\Phi } ^{\mathrm {T} }\mathbf {\Phi } .}$
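These two expressions can be evaluated directly. The sketch below uses random stand-in data for ${\displaystyle \mathbf {\Phi } }$ and ${\displaystyle \mathbf {Y} }$ and illustrative (not optimized) values for ${\displaystyle \sigma _{\mathrm {v} }^{2}}$ and ${\displaystyle \sigma _{\mathrm {w} }^{2}}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n_rows, n_basis = 40, 6
Phi = rng.standard_normal((n_rows, n_basis))
Y = Phi @ rng.standard_normal(n_basis) + 0.05 * rng.standard_normal(n_rows)

sigma_v2, sigma_w2 = 0.05 ** 2, 1.0   # noise and prior variances (illustrative)

# Sigma^{-1} = I / sigma_w^2 + Phi^T Phi / sigma_v^2
Sigma = np.linalg.inv(np.eye(n_basis) / sigma_w2 + Phi.T @ Phi / sigma_v2)
# wbar = (1 / sigma_v^2) Sigma Phi^T Y : the center of the posterior
w_bar = Sigma @ Phi.T @ Y / sigma_v2
print(w_bar.shape, Sigma.shape)
```

Note that ${\displaystyle \mathbf {\bar {w}} }$ coincides with the ridge-regression solution with regularization strength ${\displaystyle \sigma _{\mathrm {v} }^{2}/\sigma _{\mathrm {w} }^{2}}$, while ${\displaystyle \mathbf {\Sigma } }$ carries the uncertainty information used for the Bayesian error estimation.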