
# Machine learning force field calculations: Basics

In general, to perform a machine-learning force field calculation, you need to set

ML_LMLFF = .TRUE.


in the INCAR file. Then, depending on the particular calculation, you need to set additional INCAR tags. The first few sections list the tags that a user will typically encounter. Most other input tags are set to defaults and should only be changed by experienced users in cases where the change is essential.

In the following, most tags are shown only for the angular descriptor (tags containing a 2). Almost every such tag has an analogous tag for the radial descriptor (tags containing a 1). The usage of these tags is the same for both descriptors.

## Type of machine learning calculation

In this section, we describe the modes in which machine learning calculations can be run in VASP and show example INCAR settings. A typical example of these modes in action is learning a force field for a material with two phases, A and B.

Initially, no force field exists for the material, so we choose a small to medium-sized supercell of phase A and generate a new force field from scratch. In this step, ab initio calculations are performed whenever necessary, improving the force field for this phase until it is sufficiently accurate. When applied to phase B, the force field learned on phase A may already contain useful information about the local configurations. Hence, one would perform a continuation run, and the machine automatically collects the necessary structure datasets from phase B to refine the force field. In many cases only a few such structure datasets are required, but this must still be verified for every case.

After the force field is sufficiently trained, one can use it to describe much larger cells. Hence, one can switch off learning for the larger cells and use only the force field, which is orders of magnitude faster than the ab initio calculation. If the sampled atomic environments are similar to the structure datasets used for learning, the force field is transferable for the same constituent elements, but one should still judge cautiously whether the force field can describe rare events in the larger cell.

### On-the-fly force field generation from scratch

To generate a new force field, one does not need any special input files. First, set up a molecular dynamics calculation as usual (see Molecular Dynamics) and add the machine-learning-related tags to the INCAR file. To start from scratch, add

ML_ISTART = 0


Running the calculation generates the main output files ML_LOGFILE, ML_ABN and ML_FFN. The latter two are required for restarting from an existing force field.

### Continuing on-the-fly learning from already existing force fields

To continue from a previous run, copy the following files

cp ML_ABN ML_AB
cp CONTCAR POSCAR


The file ML_AB contains the ab initio reference data. One can also start from a new POSCAR file. To proceed with learning and obtain an improved force field set

ML_ISTART = 1


in the INCAR file.

The continuation can cover a very different structure than before or even new elements.

### Force field calculations without learning

Once a sufficiently accurate force field has been generated, one can use it to predict properties. Copy the force field information (and possibly the structures)

cp ML_ABN ML_AB
cp ML_FFN ML_FF


The file ML_FFN holds the force field parameters. One can also use different POSCAR files, e.g., a larger supercell. In the INCAR file, select only force field based calculations by setting

ML_ISTART = 2
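
The file preparation for the three modes described above can be sketched in Python as follows. This is a minimal illustration, not part of VASP: the helper name `prepare_restart` and the table `RESTART_COPIES` are hypothetical, and the copies simply mirror the `cp` commands listed in the preceding sections.

```python
import shutil
from pathlib import Path

# Copies to perform before a restart, mirroring the cp commands above.
RESTART_COPIES = {
    1: [("ML_ABN", "ML_AB"), ("CONTCAR", "POSCAR")],  # continue learning
    2: [("ML_ABN", "ML_AB"), ("ML_FFN", "ML_FF")],    # force field only
}


def prepare_restart(run_dir, ml_istart):
    """Copy outputs of a finished run to the input names needed for ML_ISTART.

    Hypothetical helper: returns the list of files that were created.
    """
    run_dir = Path(run_dir)
    if ml_istart == 0:
        return []  # learning from scratch needs no special input files
    created = []
    for src, dst in RESTART_COPIES[ml_istart]:
        shutil.copyfile(run_dir / src, run_dir / dst)
        created.append(dst)
    return created
```

After calling the helper, the matching ML_ISTART value still has to be set in the INCAR file by hand. If a different POSCAR is wanted (for example a larger supercell), it can simply be placed in the directory afterwards.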


## Reference total energies

To obtain the force field, one needs a reference total energy. For ML_ISCALE_TOTEN=2, this reference energy is set to the average of the total energy of the training data. This is the default, and we advise using it unless there is a specific reason to do otherwise.

If needed, reference atomic calculations can be performed (see Calculation of atoms). One can then select the atomic energies as reference and supply reference energies for all atoms by setting the following variables in the INCAR file

ML_ISCALE_TOTEN=1
ML_EATOM_REF = E_at1 E_at2 ...


If the tag ML_EATOM_REF is not specified, default values of 0.0 eV/atom are assumed.

## Converging an MLFF calculation

If a very fine spatial resolution is required because of small distances or rapid spatial variations of the potential, the Gaussian broadening of the atomic density can be lowered by setting the parameters

ML_SION1
ML_SION2


Since more basis functions are required to describe the less smoothed density, the number of radial basis functions (ML_MRB1 and ML_MRB2) is increased automatically by the program unless it is set by the user. The number of basis functions is increased by the same ratio as the Gaussian broadening is decreased, and vice versa. The number of basis functions is changed in the same way if the cutoff radius of the descriptor

ML_RCUT1
ML_RCUT2


is changed.
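
The scaling rule above can be illustrated with a short Python sketch. It rests on assumptions that are not stated on this page: the function name `scaled_basis_count` is hypothetical, the count is taken to grow proportionally with the cutoff radius and inversely with the broadening, and the result is rounded to the nearest integer. The numbers in the usage note are illustrative and are not VASP defaults.

```python
def scaled_basis_count(n_ref, sion_ref, sion, rcut_ref=1.0, rcut=1.0):
    """Scale a reference number of radial basis functions.

    Assumption: the count is proportional to rcut / sion and is rounded to
    the nearest integer; the rounding used by VASP is not specified here.
    """
    if min(n_ref, sion_ref, sion, rcut_ref, rcut) <= 0:
        raise ValueError("all arguments must be positive")
    scale = (sion_ref / sion) * (rcut / rcut_ref)
    return max(1, round(n_ref * scale))
```

For example, halving the broadening from 0.5 to 0.25 at fixed cutoff doubles a reference count of 8 to 16, while doubling the broadening to 1.0 halves it to 4.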

## Weighting of energy, forces and stress

In many cases, the force field can be optimized to reproduce one of the target observables accurately by weighting that quantity more strongly. Of course, the other observables are then reproduced less well. Empirically, in many test cases and up to a certain weight ratio, the improvement in the more strongly weighted observable was much larger than the accuracy loss in the other observables. The optimum ratio depends on the material and the parameters of the force field, so it has to be determined separately for each case. The weights of the energy, forces and stress can be changed with ML_WTOTEN, ML_WTIFOR and ML_WTSIF, respectively. The default value is 1.0 for each. Since the input tags define the ratio of the weights, it suffices to raise the value for only one observable.

We advise using ML_WTOTEN $\geq$ 10 whenever energies are important.

## Caution: number of structures and basis functions

The maximum number of structure datasets ML_MCONF and of basis functions ML_MB constitute a memory bottleneck of the calculation, because the required arrays are allocated statically at the beginning of the calculation. Therefore, these input variables should not be set to overly large values initially. For ML_ISTART=0, the defaults ML_MCONF=1500 and ML_MB are used. For ML_ISTART=1 and 3, the defaults for both are set to the number of entries read from the ML_AB file plus 500. If at any point during the calculation either the number of structure datasets or the size of the basis set exceeds its respective maximum, the calculation stops automatically with an error message. One should then increase the corresponding value and restart the calculation.
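
As a small illustration of the default rule for ML_MCONF described above, the following Python sketch computes the limit for a given start mode. The function name `default_ml_mconf` is hypothetical, and only the cases stated on this page are covered.

```python
def default_ml_mconf(ml_istart, n_read_from_ml_ab=0):
    """Default maximum number of structure datasets (hypothetical helper).

    ML_ISTART=0 uses 1500; ML_ISTART=1 and 3 use the number of entries
    read from ML_AB plus 500. Other modes are not covered by this sketch.
    """
    if ml_istart == 0:
        return 1500
    if ml_istart in (1, 3):
        return n_read_from_ml_ab + 500
    raise ValueError("default not described for this ML_ISTART")
```

A continuation run that reads 1200 structures from ML_AB would thus start with a limit of 1700, leaving room for 500 new structures before the calculation stops and ML_MCONF has to be raised.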

## Other important input tags

In this section, we describe other important input tags for standard machine learning force field calculations. Typically a user does not need to tweak the default values.

### Angular momentum quantum numbers

ML_LMAX2
This tag specifies the maximum angular momentum quantum number of spherical harmonics used to expand atomic distributions.
ML_LAFILT2
This tag specifies whether angular momentum filtering is active or not. Activating the angular filtering (ML_LAFILT2=.TRUE.) with the filtering function from reference [1] can noticeably speed up the computation without losing too much accuracy. With angular filtering, the maximum angular momentum cut-off ML_LMAX2=6 can also be lowered to a value of 4, gaining further computational speed. The user is still advised to check the accuracy of the angular filtering for their application.
ML_IAFILT2
This tag selects the type of angular filtering. We advise to use the default (ML_IAFILT2=2).
ML_AFILT2
This tag sets the parameter of the filtering function from reference [1]. The default of ML_AFILT2=0.002 worked well in most tested applications, but we advise the user to check this parameter for their application.

### New structure dataset block

ML_MCONF_NEW

This tag specifies the number of structure datasets that are stored temporarily as candidates for the training data. The purpose is to group expensive operations into blocks that would otherwise be executed sequentially. In this way, better performance is obtained at the cost of a small memory overhead. The value ML_MCONF_NEW=5 was optimized empirically, but for other systems different choices might perform better.

## Example input for liquid Si

This is a sample of how the machine learning tags would be set in the INCAR file for a very basic calculation.

SYSTEM = Si_liquid
### Electronic structure part
PREC = FAST
ALGO = FAST
SIGMA = 0.1
ISPIN = 1
ISMEAR = 0
ENCUT = 325
NELM = 100
EDIFF = 1E-4
NELMIN = 6
LREAL = A
ISYM = -1

### MD part
IBRION = 0
ISIF = 2
NSW = 30000
POTIM = 1.0

### Output part
LWAVE = .FALSE.
LCHARG = .FALSE.

### Machine Learning part
### Major tags for machine learning
ML_LMLFF = .TRUE.
ML_ISTART = 0


## Important algorithms

This part describes important algorithms used in the machine learning force field method.

## Sampling of training data and local reference configurations

We employ a learning scheme where structures are only added to the list of training structures when local reference configurations are picked for atoms that have an error in the force higher than a given threshold. So in the following it is implied that whenever a new training structure is obtained, also local reference configurations from this structure are obtained.

Usually, the force field does not need to be retrained immediately every time a training structure with its corresponding local configurations is added. Instead, one can collect candidates and do the learning in a later step for all structures simultaneously, which saves significant computational cost. Of course, learning after every new configuration or after every block can give different results, but for block sizes that are not too large the difference should be small.

The tag ML_MCONF_NEW sets the block size for learning. If the Bayesian error of the force for any atom is above the threshold ML_CTIFOR but below ML_CDOUB $\times$ ML_CTIFOR, the structure is added to the list of new training structure candidates. Whenever the number of candidates equals ML_MCONF_NEW, the candidates are added to the training structures and the force field is updated. To avoid sampling structures that are too similar, ML_NMDINT sets the number of steps after which the next structure may be taken as a candidate. All ab initio calculations within this distance are skipped if the Bayesian error of the force on all atoms is below ML_CDOUB $\times$ ML_CTIFOR. If the error exceeds ML_CDOUB $\times$ ML_CTIFOR at any time, the candidates are immediately added to the list of training structures and the force field is updated. This acts like an emergency brake that prevents the force field from drifting too far from the ab initio trajectories.
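
The decision logic of this paragraph can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the VASP implementation: the class `BlockSampler` and its field names are hypothetical, structures are represented by their MD step index, and the structure that triggers the emergency update is assumed to be added to the training set together with the pending candidates.

```python
from dataclasses import dataclass, field


@dataclass
class BlockSampler:
    ctifor: float      # threshold, plays the role of ML_CTIFOR
    cdoub: float       # factor, plays the role of ML_CDOUB
    mconf_new: int     # block size, plays the role of ML_MCONF_NEW
    nmdint: int        # minimum step distance, plays the role of ML_NMDINT
    candidates: list = field(default_factory=list)
    training: list = field(default_factory=list)
    next_allowed: int = 0
    n_updates: int = 0

    def _update_force_field(self):
        # Move all pending candidates to the training set and "retrain".
        self.training.extend(self.candidates)
        self.candidates.clear()
        self.n_updates += 1

    def step(self, md_step, max_error):
        """Decide what happens at one MD step given the largest Bayesian
        force error over all atoms."""
        if max_error > self.cdoub * self.ctifor:
            # Emergency brake: retrain at once (assumed to include
            # the current structure).
            self.candidates.append(md_step)
            self._update_force_field()
            self.next_allowed = md_step + self.nmdint
            return "update_now"
        if max_error > self.ctifor and md_step >= self.next_allowed:
            self.candidates.append(md_step)
            self.next_allowed = md_step + self.nmdint
            if len(self.candidates) >= self.mconf_new:
                self._update_force_field()
                return "update_block"
            return "candidate"
        return "skip"  # no ab initio calculation at this step
```

The sketch keeps the two thresholds separate on purpose: errors between ML_CTIFOR and ML_CDOUB $\times$ ML_CTIFOR only fill the candidate block, whereas larger errors bypass both the block size and the step distance.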

## Threshold for error of forces

Training structures and their corresponding local configurations are only chosen if the error in the forces of any atom exceeds a chosen threshold. The initial threshold is set to the value provided by ML_CTIFOR (in units of eV/Angstrom). How the threshold is controlled afterwards is determined by ML_ICRITERIA. The following options are available:

• ML_ICRITERIA = 0: No update of initial value of ML_CTIFOR is done.
• ML_ICRITERIA = 1: Update of criteria using average of the Bayesian errors of the forces from history (see description of method below).
• ML_ICRITERIA = 2: Update of criteria using gliding average of Bayesian errors (probably more robust but not well tested).

Generally it is recommended to automatically update the threshold ML_CTIFOR during machine learning. Details on how and when the update is performed are controlled by ML_CSLOPE, ML_CSIG and ML_MHIS.

Description of ML_ICRITERIA=1:

ML_CTIFOR is updated using the average Bayesian error in the previous steps. Specifically, it is set to

ML_CTIFOR = (average of the stored Bayesian errors) * (1.0 + ML_CX).

The number of entries in the history of Bayesian errors is controlled by ML_MHIS. To prevent noisy data or an abrupt jump of the Bayesian error from causing issues, the standard error of the history must be below the threshold ML_CSIG for the update to take place. Furthermore, the slope of the stored data must be below the threshold ML_CSLOPE. In practice, the slope and the standard error are correlated at least to some extent: often the standard error is proportional to ML_MHIS/3 times the slope or somewhat larger. We recommend varying only ML_CSIG and keeping ML_CSLOPE fixed at its default value.
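
The update rule for ML_ICRITERIA=1 can be sketched in Python as follows. Several details are assumptions made for illustration only: the function names are hypothetical, the "standard error" is taken as the population standard deviation of the history, the slope is a least-squares fit against the entry index, both are compared unnormalized and the slope by absolute value, and no update happens before the history holds ML_MHIS entries. Whether VASP normalizes these quantities is not stated on this page.

```python
from statistics import fmean, pstdev


def history_slope(history):
    """Least-squares slope of the history against the entry index."""
    n = len(history)
    x_mean = (n - 1) / 2
    y_mean = fmean(history)
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(history))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def updated_ctifor(ctifor, history, mhis, cx, csig, cslope):
    """Return the new threshold, or the old one if the update is rejected."""
    if mhis < 2 or len(history) < mhis:
        return ctifor  # history not yet filled (assumption)
    recent = history[-mhis:]
    if pstdev(recent) >= csig:
        return ctifor  # history too noisy
    if abs(history_slope(recent)) >= cslope:
        return ctifor  # errors still drifting
    return fmean(recent) * (1.0 + cx)
```

With this choice of spread, a perfectly linear history of length n and slope s has a standard deviation of s·sqrt((n² − 1)/12), roughly s·n/3.5, which is in line with the rule of thumb quoted above that relates the standard error to ML_MHIS/3 times the slope.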

## References

1. J. P. Boyd, Chebyshev and Fourier Spectral Methods (Dover Publications, New York, 2000).