Here we present the theory for on-the-fly machine learning force fields. The theory is presented in a very condensed manner; for a more detailed description of the methods we refer the reader to Refs. [1] and [2].
Introduction
Molecular dynamics is one of the most important methods for the determination of dynamic properties. The quality of a molecular dynamics simulation depends very strongly on the accuracy of the method used to compute the interatomic interactions, but higher accuracy usually comes at the cost of increased computational demand. A very accurate method is ab initio molecular dynamics, where the interactions of atoms and electrons are calculated fully quantum mechanically from ab initio calculations (such as, e.g., DFT). Unfortunately, this method is very often limited to small simulation cells and short simulation times. One way to speed up these calculations enormously is to use force fields, which are parametrizations of the potential energy. These parametrizations can range from very simple functions with only empirical parameters to very complex functions parametrized using thousands of ab initio calculations. Usually, constructing accurate force fields manually is a very time-consuming task that requires tremendous expertise and know-how.
One way to greatly reduce the computational cost and the required human intervention is machine learning: in the prediction of the target property, the method automatically interpolates between known training systems that were previously calculated ab initio. In this way the generation of force fields is already significantly simplified compared to classical force fields, which need manual (or even empirical) adjustment of the parameters. Nevertheless, the problem remains of how to choose the proper (minimal) training data. One very efficient and automatic way to solve this is on-the-fly learning. Here an MD calculation is used for the learning: during the run of the MD calculation, ab initio data is picked out and added to the training data, and from the accumulated data a force field is continuously built up. At each step it is judged whether to perform an ab initio calculation and possibly add the data to the force field, or to use the force field for that step and skip learning; hence the name "on-the-fly" learning. The crucial point here is the probability model for the estimation of errors. This error model, and not only the force field, is built up from the newly sampled data. The more accurate the force field gets, the less sampling is needed and the more expensive ab initio steps are skipped. In this way not only the convergence of the force field can be controlled, but also a very widespread scan through phase space for the training structures can be performed. The crucial requirement for on-the-fly machine learning, which will be explained together with the rest of the methodology in the following subsections, is the ability to predict the errors of the force field on a newly sampled structure without having to perform an ab initio calculation on that structure.
Algorithm
Fig. 1: On-the-fly machine learning force field generation scheme.
To obtain the machine-learning force field, several structure datasets are required. A structure dataset defines the Bravais lattice and the atomic positions of the system and contains the total energy, the forces and the stress tensor calculated from first principles. Given these structure datasets, the machine identifies local configurations around each atom to learn what force field is appropriate. A local configuration measures the radial and angular distribution of neighboring atoms around a given site and is captured in so-called descriptors.
The on-the-fly force field generation scheme is given by the following steps (a flowchart of the algorithm is shown in Fig. 1; a schematic code sketch of the loop follows the list):
1. The machine predicts the energy, the forces and the stress tensor, as well as their uncertainties, for a given structure using the existing force field.
2. The machine decides whether to perform a first-principles calculation; if so, proceed with step 3, otherwise skip to step 5.
3. The first-principles calculation is performed, yielding a new structure dataset which may improve the force field.
4. If the number of newly collected structures reaches a certain threshold, or if the predicted uncertainty becomes too large, the machine learns an improved force field by including the new structure datasets and local configurations.
5. Atomic positions and velocities are updated using the force field (if accurate enough) or the first-principles calculation.
6. If the desired total number of ionic steps (NSW) is reached, the calculation stops; otherwise, proceed with step 1.
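The control flow of these steps can be illustrated with a short Python sketch. This is only a schematic of the loop, not the VASP implementation; `run_dft`, `retrain` and `md_step` are hypothetical stand-ins supplied by the caller, and the decision and retraining criteria are reduced to simple threshold tests.

```python
# Schematic of the on-the-fly loop; run_dft, retrain and md_step are
# hypothetical stand-ins, not VASP routines.
def on_the_fly_md(structure, nsw, threshold, batch, run_dft, retrain, md_step):
    model, new_data = None, []
    for step in range(nsw):
        if model is not None:
            forces, uncertainty = model.predict(structure)   # step 1
            need_dft = uncertainty > threshold               # step 2
        else:
            need_dft = True   # no force field yet: always do first principles
        if need_dft:
            dataset = run_dft(structure)                     # step 3
            new_data.append(dataset)
            if len(new_data) >= batch or model is None:      # step 4
                model = retrain(model, new_data)
                new_data = []
            forces = dataset.forces   # propagate with the ab initio forces
        structure = md_step(structure, forces)               # step 5
    return model                                             # step 6: NSW reached
```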
Descriptors
The potential energy $U$ of a structure with $N_a$ atoms is approximated as

$$U = \sum_{i=1}^{N_a} U_i.$$

The local energies $U_i$ are functionals of the probability density $\rho_i(\mathbf{r})$ to find another atom $j$ at the position $\mathbf{r}$ around the atom $i$ within a cut-off radius $R_{\mathrm{cut}}$, defined as

$$\rho_i(\mathbf{r}) = \sum_{j \neq i} f_{\mathrm{cut}}\!\left(|\mathbf{r}_{ij}|\right)\, \delta\!\left(\mathbf{r} - \mathbf{r}_{ij}\right), \qquad \mathbf{r}_{ij} = \mathbf{r}_j - \mathbf{r}_i.$$

Here $f_{\mathrm{cut}}$ is a cut-off function acting outside of $R_{\mathrm{cut}}$. In the Smooth Overlap of Atomic Positions[3] (SOAP) method the delta function $\delta$ is approximated as a normalized Gaussian of width $\sigma_{\mathrm{atom}}$,

$$\delta(\mathbf{r}) \approx \frac{1}{\left(\sqrt{2\pi}\,\sigma_{\mathrm{atom}}\right)^{3}} \exp\!\left(-\frac{|\mathbf{r}|^{2}}{2\sigma_{\mathrm{atom}}^{2}}\right).$$
Fig. 2: Radial and angular descriptors.
Unfortunately, $\rho_i(\mathbf{r})$ is not rotationally invariant. To deal with this problem, intermediate functions or descriptors depending on $\rho_i$ that possess rotational invariance are introduced:
Radial descriptor
This is the simplest descriptor, which relies on the radial distribution function

$$\rho_i^{(2)}(r) = \frac{1}{4\pi} \int \rho_i\!\left(r\hat{\mathbf{r}}\right) d\hat{\mathbf{r}},$$

where $\hat{\mathbf{r}}$ denotes the unit vector of the vector $\mathbf{r}_{ij}$ between atoms $i$ and $j$ [see Fig. 2 (a)]. The radial descriptor can also be regarded as a two-body descriptor.
Angular descriptor
In most cases the radial descriptor is not enough to distinguish different probability densities $\rho_i$, since two different $\rho_i$ can yield the same $\rho_i^{(2)}$. To improve on this, angular information between two radial descriptors is also incorporated within an angular descriptor

$$\rho_i^{(3)}(r, s, \theta) = \iint d\hat{\mathbf{r}}\, d\hat{\mathbf{s}}\; \delta\!\left(\hat{\mathbf{r}} \cdot \hat{\mathbf{s}} - \cos\theta\right) \rho_i\!\left(r\hat{\mathbf{r}}\right) \rho_i\!\left(s\hat{\mathbf{s}}\right),$$

where $\theta$ denotes the angle between the two vectors $\mathbf{r}_{ij}$ and $\mathbf{r}_{ik}$ [see Fig. 2 (b)]. The function $\rho_i^{(3)}$ is also commonly referred to as the angular distribution function, and it is equivalent to the power spectrum used in the Gaussian Approximation Potential[4] (GAP) method.
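To make the two descriptors concrete, the following numpy sketch builds simple histogram versions of the two-body and three-body fingerprints directly from the neighbor vectors $\mathbf{r}_{ij}$ of one atom. The cosine cutoff function and the bin counts are illustrative choices of this example, not the forms or defaults used by VASP.

```python
import numpy as np

def cutoff(r, r_cut):
    """Smooth cosine cutoff, zero at and beyond r_cut (illustrative form)."""
    return 0.5 * (np.cos(np.pi * r / r_cut) + 1.0) * (r < r_cut)

def radial_descriptor(vectors, r_cut, nbins=32):
    """Two-body fingerprint: cutoff-weighted histogram of neighbor distances."""
    r = np.linalg.norm(vectors, axis=1)
    hist, _ = np.histogram(r, bins=nbins, range=(0.0, r_cut),
                           weights=cutoff(r, r_cut))
    return hist

def angular_descriptor(vectors, r_cut, nbins=16):
    """Three-body fingerprint: histogram of angles between neighbor pairs."""
    r = np.linalg.norm(vectors, axis=1)
    unit = vectors / r[:, None]
    w = cutoff(r, r_cut)
    cosines, weights = [], []
    for j in range(len(vectors)):
        for k in range(j + 1, len(vectors)):
            cosines.append(np.dot(unit[j], unit[k]))  # cos(theta_jik)
            weights.append(w[j] * w[k])
    hist, _ = np.histogram(cosines, bins=nbins, range=(-1.0, 1.0),
                           weights=weights)
    return hist
```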
Basis set expansion
The atomic probability density can also be expanded in terms of basis functions

$$\rho_i(\mathbf{r}) = \sum_{n} \sum_{l=0}^{L_{\mathrm{max}}} \sum_{m=-l}^{l} c_{nlm}^{i}\, \chi_{nl}(r)\, Y_{lm}(\hat{\mathbf{r}}),$$

where $c_{nlm}^{i}$, $\chi_{nl}$ and $Y_{lm}$ denote expansion coefficients, radial basis functions and spherical harmonics, respectively. The indices $n$, $l$ and $m$ denote main, angular and magnetic quantum numbers, respectively.

By using the above equation the radial descriptor and the angular descriptor (power spectrum) can be written as

$$\rho_i^{(2)}(r) = \frac{1}{\sqrt{4\pi}} \sum_{n} c_{n00}^{i}\, \chi_{n0}(r)$$

and

$$\rho_i^{(3)}(r, s, \theta) = \sum_{l} \sum_{n,\nu} \frac{2l+1}{2}\, p_{n\nu l}^{i}\, \chi_{nl}(r)\, \chi_{\nu l}(s)\, P_{l}(\cos\theta), \qquad p_{n\nu l}^{i} = \sum_{m=-l}^{l} c_{nlm}^{i\,*}\, c_{\nu lm}^{i},$$

where $P_{l}$ represent Legendre polynomials of order $l$.
In many cases $\chi_{nl}$ is multiplied with an angular filtering function[5] $f_{l}$ (ML_FF_IAFILT2_MB), which can noticeably reduce the necessary basis set size without losing accuracy in the calculations:

$$\chi_{nl}(r) \rightarrow f_{l}\, \chi_{nl}(r), \qquad f_{l} = e^{-a_{\mathrm{FILT}}\, l(l+1)},$$

where $a_{\mathrm{FILT}}$ (ML_FF_AFILT2_MB) is a parameter controlling the extent of the filtering. A larger value of the parameter provides stronger filtering.
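As an illustration of the contraction over the magnetic quantum number, the sketch below forms the rotationally invariant power-spectrum coefficients $p_{n\nu l}^{i}$ from an array of expansion coefficients $c_{nlm}^{i}$; the storage layout with $m$ stored at index $l+m$ is an assumption of this example, not the VASP-internal layout.

```python
import numpy as np

def power_spectrum(c):
    """p[n, nu, l] = sum_m conj(c[n, l, m]) * c[nu, l, m].

    c: complex array of shape (nmax, lmax + 1, 2 * lmax + 1), with the
    magnetic quantum number m = -l..l stored at the last index l + m.
    """
    nmax, lmax1, _ = c.shape
    p = np.empty((nmax, nmax, lmax1), dtype=complex)
    for l in range(lmax1):
        cm = c[:, l, :2 * l + 1]          # the 2l + 1 valid m components
        p[:, :, l] = np.conj(cm) @ cm.T   # contraction over m
    return p
```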
Potential energy fitting
It is convenient to express the local potential energy of atom $i$ in structure $st$ in terms of linear coefficients $w_{i_B}$ and a kernel $K$ as follows:

$$U_i = \sum_{i_B=1}^{N_B} w_{i_B}\, K\!\left(\mathbf{x}_i, \mathbf{x}_{i_B}\right),$$

where $N_B$ is the basis set size. The kernel $K\!\left(\mathbf{x}_i, \mathbf{x}_{i_B}\right)$ measures the similarity between a local configuration $\mathbf{x}_i$ from the training set and a basis set configuration $\mathbf{x}_{i_B}$. Using the radial and angular descriptors it is written as

$$K\!\left(\mathbf{x}_i, \mathbf{x}_{i_B}\right) = \beta^{(1)} \left(\hat{\mathbf{c}}_i \cdot \hat{\mathbf{c}}_{i_B}\right)^{\zeta^{(1)}} + \beta^{(2)} \left(\hat{\mathbf{p}}_i \cdot \hat{\mathbf{p}}_{i_B}\right)^{\zeta^{(2)}}.$$

Here the vectors $\mathbf{c}_i$ and $\mathbf{p}_i$ contain all coefficients $c_{n00}^{i}$ and $p_{n\nu l}^{i}$, respectively. The notation $\hat{\mathbf{c}}$ indicates that it is a normalized vector, $\hat{\mathbf{c}} = \mathbf{c}/|\mathbf{c}|$. The parameters $\beta^{(1)}$ (ML_FF_W1_MB) and $\beta^{(2)}$ (ML_FF_W2_MB) control the weighting of the radial and angular terms, respectively. The parameters $\zeta^{(1)}$ (ML_FF_NHYP1_MB) and $\zeta^{(2)}$ (ML_FF_NHYP2_MB) control the sharpness of the kernel and the order of the many-body interactions.
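A minimal numpy sketch of this kernel is given below. The inputs are the flattened radial and angular descriptor vectors of the two configurations being compared; the parameter values passed in play the roles of ML_FF_W1_MB, ML_FF_W2_MB, ML_FF_NHYP1_MB and ML_FF_NHYP2_MB, but any concrete numbers are the caller's choice, not VASP defaults.

```python
import numpy as np

def kernel(c_i, p_i, c_b, p_b, beta1, beta2, zeta1, zeta2):
    """K = beta1 * (c_i . c_b)^zeta1 + beta2 * (p_i . p_b)^zeta2,
    with every descriptor vector normalized to unit length first."""
    def hat(v):
        return v / np.linalg.norm(v)
    radial = np.dot(hat(c_i), hat(c_b)) ** zeta1
    angular = np.dot(hat(p_i), hat(p_b)) ** zeta2
    return beta1 * radial + beta2 * angular
```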
Matrix vector form of linear equations
Similarly to the energy, the forces and the stress tensor are also described as linear functions of the coefficients $\mathbf{w}$. All three are fitted simultaneously, which leads to the following matrix-vector form:

$$\mathbf{Y} = \boldsymbol{\Phi} \mathbf{w},$$

where $\mathbf{Y}$ is a super vector consisting of the sub vectors $\mathbf{y}^{(st)}$. Each $\mathbf{y}^{(st)}$ contains the first-principles energy per atom, the forces and the stress tensor of structure $st$, and $N_{st}$ denotes the total number of structures. The size of $\mathbf{Y}$ is therefore

$$m = \sum_{st=1}^{N_{st}} \left(1 + 3N_a^{(st)} + 6\right).$$

The matrix $\boldsymbol{\Phi}$ is also called the design matrix[6]. The rows of this matrix are blocked for each structure $st$: the first line of each block consists of the kernels used to calculate the energy, the subsequent $3N_a^{(st)}$ lines consist of the derivatives of the kernels with respect to the atomic coordinates used to calculate the forces, and the final 6 lines of each block consist of the derivatives of the kernels with respect to the unit cell coordinates used to calculate the stress tensor components. The overall size of $\boldsymbol{\Phi}$ is $m \times N_B$, and schematically it looks as follows:

$$
\boldsymbol{\Phi} =
\begin{pmatrix}
\frac{1}{N_a^{(1)}} \sum_{i} K\big(\mathbf{x}_i^{(1)}, \mathbf{x}_{i_B}\big) \\[4pt]
-\dfrac{\partial}{\partial \mathbf{r}^{(1)}} \sum_{i} K\big(\mathbf{x}_i^{(1)}, \mathbf{x}_{i_B}\big) \\[4pt]
\dfrac{\partial}{\partial \boldsymbol{\eta}^{(1)}} \sum_{i} K\big(\mathbf{x}_i^{(1)}, \mathbf{x}_{i_B}\big) \\[2pt]
\vdots \\[2pt]
\frac{1}{N_a^{(N_{st})}} \sum_{i} K\big(\mathbf{x}_i^{(N_{st})}, \mathbf{x}_{i_B}\big) \\[4pt]
-\dfrac{\partial}{\partial \mathbf{r}^{(N_{st})}} \sum_{i} K\big(\mathbf{x}_i^{(N_{st})}, \mathbf{x}_{i_B}\big) \\[4pt]
\dfrac{\partial}{\partial \boldsymbol{\eta}^{(N_{st})}} \sum_{i} K\big(\mathbf{x}_i^{(N_{st})}, \mathbf{x}_{i_B}\big)
\end{pmatrix},
\qquad i_B = 1, \dots, N_B \ \text{(columns)},
$$

where $\mathbf{r}^{(st)}$ and $\boldsymbol{\eta}^{(st)}$ denote the atomic and unit cell coordinates of structure $st$.
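The bookkeeping behind $m$ can be sketched as follows; `datasets` is a hypothetical list of per-structure tuples (energy per atom, forces of shape $(N_a, 3)$, six stress components).

```python
import numpy as np

def assemble_super_vector(datasets):
    """Stack the reference data into Y: 1 energy row, 3*Na force rows and
    6 stress rows per structure, so len(Y) = sum(3 * Na + 7)."""
    blocks = []
    for energy_per_atom, forces, stress in datasets:
        blocks.append(np.atleast_1d(energy_per_atom))  # 1 row
        blocks.append(np.asarray(forces).ravel())      # 3 * Na rows
        blocks.append(np.asarray(stress).ravel())      # 6 rows
    return np.concatenate(blocks)
```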
Error estimation
We use two different estimators for the errors in the force field:
- Bayesian error estimation.
- Spilling factor[7].
Bayesian error estimation
Ultimately, for the error prediction we want to obtain the probability of observing a new structure $\tilde{\mathbf{y}}$ on the basis of the training set $\mathbf{Y}$, which is denoted as $p(\tilde{\mathbf{y}}|\mathbf{Y})$. For this we need to get from the error of the linear fitting coefficients $\mathbf{w}$ in the reproduction of the training data $\mathbf{Y}$, $p(\mathbf{w}|\mathbf{Y})$, to $p(\tilde{\mathbf{y}}|\mathbf{Y})$, which is explained in the following.

First we obtain $p(\mathbf{w}|\mathbf{Y})$ from Bayes' theorem

$$p(\mathbf{w}|\mathbf{Y}) = \frac{p(\mathbf{Y}|\mathbf{w})\, p(\mathbf{w})}{p(\mathbf{Y})},$$
where we assume multivariate Gaussian distributions for the likelihood function

$$p(\mathbf{Y}|\mathbf{w}) = \left(\frac{\beta}{2\pi}\right)^{m/2} \exp\!\left(-\frac{\beta}{2} \left|\mathbf{Y} - \boldsymbol{\Phi}\mathbf{w}\right|^{2}\right)$$

and the (conjugate) prior

$$p(\mathbf{w}) = \left(\frac{\alpha}{2\pi}\right)^{N_B/2} \exp\!\left(-\frac{\alpha}{2} \left|\mathbf{w}\right|^{2}\right).$$

The parameters $\alpha$ and $\beta$ need to be optimized to balance the accuracy and robustness of the force field (see below). The normalization is obtained by

$$p(\mathbf{Y}) = \int p(\mathbf{Y}|\mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}.$$
Using the equations from above and the completing-the-square method[6], $p(\mathbf{w}|\mathbf{Y})$ is obtained as follows. Since the prior and the likelihood function are described by multivariate Gaussian distributions, the posterior $p(\mathbf{w}|\mathbf{Y})$ also describes a multivariate Gaussian, written as

$$p(\mathbf{w}|\mathbf{Y}) = \mathcal{N}\!\left(\mathbf{w} \,\middle|\, \bar{\mathbf{w}}, \boldsymbol{\Sigma}\right),$$

where we use the following definitions for the center of the Gaussian distribution,

$$\bar{\mathbf{w}} = \beta\, \boldsymbol{\Sigma}\, \boldsymbol{\Phi}^{T} \mathbf{Y},$$

and the covariance matrix,

$$\boldsymbol{\Sigma} = \left(\alpha \mathbf{I} + \beta\, \boldsymbol{\Phi}^{T} \boldsymbol{\Phi}\right)^{-1}.$$

By using the relation

$$p(\tilde{\mathbf{y}}|\mathbf{Y}) = \int p(\tilde{\mathbf{y}}|\mathbf{w})\, p(\mathbf{w}|\mathbf{Y})\, d\mathbf{w}$$

and the completing-the-square method[6], the distribution of $\tilde{\mathbf{y}}$ is written as

$$p(\tilde{\mathbf{y}}|\mathbf{Y}) = \mathcal{N}\!\left(\tilde{\mathbf{y}} \,\middle|\, \tilde{\boldsymbol{\phi}}\bar{\mathbf{w}},\ \frac{1}{\beta}\mathbf{I} + \tilde{\boldsymbol{\phi}}\, \boldsymbol{\Sigma}\, \tilde{\boldsymbol{\phi}}^{T}\right).$$

Here $\tilde{\boldsymbol{\phi}}$ is analogous to $\boldsymbol{\Phi}$ but contains only the rows for the new structure. The mean vector $\tilde{\boldsymbol{\phi}}\bar{\mathbf{w}}$ contains the results of the predictions of the dimensionless energy per atom, the forces and the stress tensor. The diagonal elements of the covariance matrix $\frac{1}{\beta}\mathbf{I} + \tilde{\boldsymbol{\phi}}\boldsymbol{\Sigma}\tilde{\boldsymbol{\phi}}^{T}$, which correspond to the variances of the predicted results, are used as the uncertainty in the prediction.
Finally, to get the best results and to prevent overfitting, the parameters $\alpha$ and $\beta$ have to be optimized. To achieve this, we use the evidence approximation[8][9][10] (also called empirical Bayes, type 2 maximum likelihood or generalized maximum likelihood), which maximizes the evidence function (also called model evidence) defined as

$$p(\mathbf{Y}|\alpha, \beta) = \int p(\mathbf{Y}|\mathbf{w}, \beta)\, p(\mathbf{w}|\alpha)\, d\mathbf{w}.$$

The function $p(\mathbf{Y}|\alpha, \beta)$ corresponds to the probability that the regression model with parameters $\alpha$ and $\beta$ provides the reference data $\mathbf{Y}$. Hence the best fit is obtained by maximizing this probability. The maximization is carried out by simultaneously solving the following equations:

$$\alpha = \frac{\gamma}{\bar{\mathbf{w}}^{T}\bar{\mathbf{w}}}, \qquad \beta = \frac{m - \gamma}{\left|\mathbf{Y} - \boldsymbol{\Phi}\bar{\mathbf{w}}\right|^{2}}, \qquad \gamma = \sum_{i=1}^{N_B} \frac{\lambda_{i}}{\lambda_{i} + \alpha},$$

where $\lambda_{i}$ are the eigenvalues of the matrix $\beta\, \boldsymbol{\Phi}^{T} \boldsymbol{\Phi}$.
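All of the above fits into a few lines of numpy. The sketch below is a plain dense-matrix illustration of the regression and error estimation just described, not the VASP implementation: it computes the posterior mean and covariance, iterates the evidence equations for $\alpha$ and $\beta$, and returns Bayesian error bars for a new structure.

```python
import numpy as np

def fit_evidence(phi, y, alpha=1e-3, beta=1.0, n_iter=50):
    """Bayesian linear regression y ~ N(phi @ w, 1/beta), w ~ N(0, 1/alpha),
    with alpha and beta optimized by the evidence approximation."""
    m, nb = phi.shape
    gram = phi.T @ phi
    eig = np.linalg.eigvalsh(gram)              # eigenvalues of phi^T phi
    for _ in range(n_iter):
        sigma = np.linalg.inv(alpha * np.eye(nb) + beta * gram)
        w_bar = beta * sigma @ phi.T @ y
        lam = beta * eig                        # eigenvalues of beta * phi^T phi
        gamma = np.sum(lam / (lam + alpha))     # effective number of parameters
        alpha = gamma / (w_bar @ w_bar)
        beta = (m - gamma) / np.sum((y - phi @ w_bar) ** 2)
    # recompute posterior with the final alpha and beta
    sigma = np.linalg.inv(alpha * np.eye(nb) + beta * gram)
    w_bar = beta * sigma @ phi.T @ y
    return w_bar, sigma, alpha, beta

def predict(phi_new, w_bar, sigma, beta):
    """Predictive mean and one-sigma uncertainties for a new structure."""
    mean = phi_new @ w_bar
    cov = np.eye(phi_new.shape[0]) / beta + phi_new @ sigma @ phi_new.T
    return mean, np.sqrt(np.diag(cov))
```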
Spilling factor
The spilling factor is a measure of the density of the local reference configurations $\mathbf{x}_{i_B}$ near the new local configuration $\mathbf{x}^{*}$, given as

$$s = 1 - \frac{1}{K\!\left(\mathbf{x}^{*}, \mathbf{x}^{*}\right)} \sum_{i_B=1}^{N_B} \sum_{j_B=1}^{N_B} K\!\left(\mathbf{x}^{*}, \mathbf{x}_{i_B}\right) \left[\mathbf{K}^{-1}\right]_{i_B j_B} K\!\left(\mathbf{x}_{j_B}, \mathbf{x}^{*}\right),$$

where $\left[\mathbf{K}\right]_{i_B j_B} = K\!\left(\mathbf{x}_{i_B}, \mathbf{x}_{j_B}\right)$. It should be noted that this equation is slightly modified from the original equation given by Miwa and Ohno[7] to make it applicable also to non-normalized similarity measures such as, e.g., the radial descriptor.
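A direct transcription of this formula, with `K_bb` the kernel matrix between the $N_B$ reference configurations, `k_new` the kernel vector between the new configuration and the references, and `k_self` $= K(\mathbf{x}^{*}, \mathbf{x}^{*})$, could look as follows:

```python
import numpy as np

def spilling_factor(K_bb, k_new, k_self):
    """s = 1 - k_new^T K_bb^{-1} k_new / K(x*, x*); values close to 0 mean
    the new local configuration is well covered by the reference set."""
    return 1.0 - (k_new @ np.linalg.solve(K_bb, k_new)) / k_self
```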
References
1. R. Jinnouchi, J. Lahnsteiner, F. Karsai, G. Kresse and M. Bokdam, Phys. Rev. Lett. 122, 225701 (2019).
2. R. Jinnouchi, F. Karsai and G. Kresse, Phys. Rev. B 100, 014105 (2019).
3. A. P. Bartók, R. Kondor and G. Csányi, Phys. Rev. B 87, 184115 (2013).
4. A. P. Bartók, M. C. Payne, R. Kondor and G. Csányi, Phys. Rev. Lett. 104, 136403 (2010).
5. J. P. Boyd, Chebyshev and Fourier Spectral Methods (Dover Publications, New York, 2000).
6. C. M. Bishop, Pattern Recognition and Machine Learning (Springer, New York, 2006).
7. K. Miwa and H. Ohno, Phys. Rev. B 94, 184109 (2016).
8. S. F. Gull and J. Skilling, Maximum Entropy and Bayesian Methods, Fundam. Theor. Phys. (Springer, Dordrecht, 1989).
9. D. J. C. MacKay, Neural Computation 4, 415 (1992).
10. R. Jinnouchi and R. Asahi, J. Phys. Chem. Lett. 8, 4279 (2017).