Liquid Si - MLFF: Difference between revisions

From VASP Wiki
 
(65 intermediate revisions by 4 users not shown)
Line 3: Line 3:
== Task ==
== Task ==


Generating a machine learning force field for liquid Si.
Generating a machine learning force field for liquid Si. For this tutorial, we expect that the user is already familiar with running [[Liquid Si - Standard MD|conventional ab initio molecular dynamic calculations]].


== Input ==
== Input ==
Line 9: Line 9:
=== {{TAG|POSCAR}} ===
=== {{TAG|POSCAR}} ===


*In this example we start from a 64 atom super cell of diamond-fcc Si (the same as in this example: {{TAG|Liquid Si - Standard MD}}:
In this example we start from a 64 atom super cell of diamond-fcc Si (the same as in {{TAG|Liquid Si - Standard MD}}):
  Si cubic diamond 2x2x2 super cell of conventional cell
  Si_CD_2x2x2
       5.43090000000000
       5.43090000000000
     2.00000000  0.00000000  0.00000000
     2.00000000  0.00000000  0.00000000
Line 85: Line 85:
=== {{TAG|KPOINTS}} ===
=== {{TAG|KPOINTS}} ===


*We will start with a single k point in this example:
We will start with a single k point in this example:
  K-Points
  K-Points
   0
   0
Line 97: Line 97:
  {{TAGBL|SIGMA}} = 0.1
  {{TAGBL|SIGMA}} = 0.1
  {{TAGBL|LREAL}} = Auto
  {{TAGBL|LREAL}} = Auto
{{TAGBL|ALGO}} = FAST
  {{TAGBL|ISYM}} = -1
  {{TAGBL|ISYM}} = -1
  {{TAGBL|NELM}} = 100
  {{TAGBL|NELM}} = 100
Line 115: Line 114:
  {{TAGBL|NSW}} = 10000
  {{TAGBL|NSW}} = 10000
  {{TAGBL|POTIM}} = 3.0
  {{TAGBL|POTIM}} = 3.0
 
  {{TAGBL|RANDOM_SEED}} =         88951986                0               0
#Parameters for initialization of wavefunctions and charge density
  {{TAGBL|INIWAV}} = 1
{{TAGBL|IWAVPR}} = 1
{{TAGBL|ISTART}} = 0
{{TAGBL|ICHARG}} = 0
   
   
  #Machine learning paramters
  #Machine learning paramters
  {{TAGBL|ML_FF_LMLFF}} = .TRUE.
  {{TAGBL|ML_LMLFF}} = .TRUE.
  {{TAGBL|ML_FF_ISTART}} = 0
  {{TAGBL|ML_ISTART}} = 0
{{TAGBL|ML_FF_NWRITE}} = 2
{{TAGBL|ML_FF_EATOM}} = -.70128086E+00
*The user should be familiar at this step how to run a basic molecular dynamics calculations. If not please go through the example here: {{TAG|Liquid Si - Standard MD}}.
*Machine learning is switched on by setting the following tag: {{TAG|ML_FF_LMLFF}}=''.TRUE.''.
*By setting the tag {{TAG|ML_FF_ISTART}} to zero learning from scratch is selected.
*The flag {{TAG|ML_FF_NWRITE}}=2 selects a more verbose output where the error on energies, forces and stress are output to the {{TAG|ML_LOGFILE}} file. This setting is very handy to check the accuracy of the force field.
*The tag {{TAG|ML_FF_EATOM}}=-.70128086E+00 sets the atomic reference energy for each species. How to obtain that energy is explained below.
*Since in the on the fly learning several hundreds or thousands of molecular dynamics steps can lie between two ab initio calculations the usual extrapolation of the wave functions and charge densities from the history of previous wave functions and charge densities {{TAG|IWAVPR}}=2 is not advisable, hence we will set {{TAG|IWAVPR}}=1 for which the charge densities are extrapolated from atomic charge densities.


== Calculation ==
== Calculation ==


=== Reference energy ===
Before the force field for liquid Si can be calculated, the atomic energy of a single Si atom in a large enough box has to be calculated.
For that the following steps have to be done:
*Create a new directory ''Si_ATOM'' by typing ''mkdir Si_ATOM'' and go to that directory ''cd Si_ATOM''.
*Create an {{TAG|INCAR}} file with the following parameters:
#Basic parameters
{{TAG|ISMEAR}} = 0
{{TAG|SIGMA}} = 0.1
{{TAG|LREAL}} = Auto
{{TAG|ALGO}} = FAST
{{TAG|ISYM}} = 0
{{TAG|NELM}} = 100
{{TAG|EDIFF}} = 1E-4
{{TAG|LWAVE}} = .FALSE.
{{TAG|LCHARG}} = .FALSE.
{{TAG|ISPIN}} = 2
*Create a {{TAG|POSCAR}} file with a single atom in a large enough box (the box should be orthorombic to have enough degrees of freedom for electronic relaxation):
Si atom
      1.00090000000000
    12.00000000  0.00000000  0.00000000
      0.00000000  12.01000000  0.00000000
      0.00000000  0.00000000  12.02000000
    Si
    1
Direct
    0.00000000  0.00000000  0.00000000
*Create a {{TAG|KPOINTS}} file with a single k point (or copy the one delivered with this example):
test
0 0 0
Gamma
  1 1 1
  0 0 0
*Copy the {{TAG|POTCAR}} from the previous directory to this directory (''cp ../POTCAR'').
*Run the calculation and look at the total energy in the {{TAG|OUTCAR}} file (or {{TAG|OSZICAR}}). AFter that switch back to the original directory. That energy will be used for the {{TAG|ML_FF_EATOM}} tag. If multiple atom types are present in the structure than this step has to be repeated for each atom type separately and the reference energies are provided as a list after the {{TAG|ML_FF_EATOM}} tag, where the ordering of the energies corresponds to the ordering of the elements in the {{TAG|POTCAR}} file.


=== Creating the liquid structure ===
=== Creating the liquid structure ===


We will start this example from a perfect super cell of crystalline silicon (fcc diamond structure) containing 64 atoms. The temperature is set to 2000 K so that when an MD is run the melting should occur relatively fast. We will do the melting with on-the-fly learning. This will greatly accelerate the melting since after some time most of the ab initio calculations are skipped and the very fast force field takes over. This calculation will be executed for 10000 steps with a step size of 3 fs.
Because we don't have a structure of liquid silicon readily available, we first create that structure by starting from a super cell of crystalline silicon with 64 atoms. The temperature is set to 2000 K so that the crystal melts rapidly in the MD run. To improve the simulation speed drastically, we utilize the on-the-fly machine learning. Most of the ab initio steps will be replaced by very fast force-field ones. Within 10000 steps equivalent to 30 ps, we have obtained a good starting position for the subsequent simulations in the {{TAG|CONTCAR}} file. You can copy the [[#Input|input files]] or [[#Download|download them]].


Please run now the calculation.
After running the calculation, we obtained a force field, but its initial trajectory might be tainted but the unreasonable starting position. Nevertheless, it is instructive to inspect the output to understand how to assess the accuracy of a force field, before refining it in subsequent calculations.
The main output files for the machine learning are
;{{TAG|ML_ABN}}: contains the ab initio structure datasets used for the learning. It will be needed for continuation runs as {{TAG|ML_AB}}.
;{{TAG|ML_FFN}}: contains the regression results (weights, parameters, etc.). It will be needed for continuation runs as {{TAG|ML_FF}}.
;{{TAG|ML_LOGFILE}}: logging the proceedings of the machine learning. This file consists of keywords that are nicely "grepable." The keywords are explained in the in the beginning of the file and upon "grepping". The status of each MD step is given by the keyword "STATUS". Please invoke the following command:
grep STATUS ML_LOGFILE
The output should look similar to the following:
# STATUS ###############################################################
# STATUS This line describes the overall status of each step.
# STATUS
# STATUS nstep ..... MD time step or input structure counter
# STATUS state ..... One-word description of step action
# STATUS            - "accurate"  (1) : Errors are low, force field is used
# STATUS            - "threshold" (2) : Errors exceeded threshold, structure is sampled from ab initio
# STATUS            - "learning"  (3) : Stored configurations are used for training force field
# STATUS            - "critical"  (4) : Errors are high, ab initio sampling and learning is enforced
# STATUS            - "predict"  (5) : Force field is used in prediction mode only, no error checking
# STATUS is ........ Integer representation of above one-word description (integer in parenthesis)
# STATUS doabin .... Perform ab initio calculation (T/F)
# STATUS iff ....... Force field available (T/F, False after startup hints to possible convergence problems)
# STATUS nsample ... Number of steps since last reference structure collection (sample = T)
# STATUS ngenff .... Number of steps since last force field generation (genff = T)
# STATUS ###############################################################
# STATUS            nstep    state is doabin    iff  nsample    ngenff
# STATUS                2        3  4      5      6        7        8
# STATUS ###############################################################
STATUS                  0 threshold  2      T      F        0        0
STATUS                  1 critical  4      T      F        0        1
STATUS                  2 critical  4      T      F        0        2
STATUS                  3 critical  4      T      T        0        1
STATUS                  4 critical  4      T      T        0        1
STATUS                  5 critical  4      T      T        0        1
      .                  .        .  .      .      .        .        .
      .                  .        .  .      .      .        .        .
      .                  .        .  .      .      .        .        .
STATUS              9997 accurate  1      F      T      945      996
STATUS              9998 accurate  1      F      T      946      997
STATUS              9999 accurate  1      F      T      947      998
STATUS              10000 learning  3      T      T      948      999


After running the calculation we should obtain a fairly good liquid in the {{TAG|CONTCAR}} file.  
Another important keyword is "ERR". For this instance we should type the following command:
grep ERR ML_LOGFILE
The output should look like the following:
# ERR ######################################################################
# ERR This line contains the RMSEs of the predictions with respect to ab initio results for the training data.
# ERR
# ERR nstep ......... MD time step or input structure counter
# ERR rmse_energy ... RMSE of energies (eV atom^-1)
# ERR rmse_force .... RMSE of forces (eV Angst^-1)
# ERR rmse_stress ... RMSE of stress (kB)
# ERR ######################################################################
# ERR              nstep      rmse_energy      rmse_force      rmse_stress
# ERR                  2                3                4                5
# ERR ######################################################################
ERR                    0  0.00000000E+00  0.00000000E+00  0.00000000E+00
ERR                    1  0.00000000E+00  0.00000000E+00  0.00000000E+00
ERR                    2  0.00000000E+00  0.00000000E+00  0.00000000E+00
ERR                    3  2.84605192E-05  9.82351889E-03  2.40003743E-02
ERR                    4  1.83193349E-05  1.06700600E-02  5.37606479E-02
ERR                    5  4.12132223E-05  1.34123085E-02  1.01588957E-01
ERR                    6  9.51627413E-05  1.90335214E-02  1.31959103E-01
.                      .                .                .                .
.                      .                .                .                .
.                      .                .                .                .
ERR                  9042  1.07159240E-02  2.41283323E-01  4.95695745E+00
ERR                  9052  1.07159240E-02  2.41283323E-01  4.95695745E+00
ERR                10000  1.07159240E-02  2.41283323E-01  4.95695745E+00


As a side effect we have also learned a force field, but with maybe a quite bad trajectory at the beginning. Usually it is more systematic to learn on the pure structures. In our case this means to use the {{TAG|CONTCAR}} file obtained after the melting. Also the force field was only learned using a single k point. To obtain a better accuracy we will use more k points. So after this step we will start the learning from scratch using the {{TAG|CONTCAR}} obtained so and using more accurate parameters.
This tag lists the errors on the energy, forces and stress of the force field compared to the ab initio results on the available training data. The second column shows the MD step. We see that the entry is not output at every MD step. The errors only change if the force field is updated, hence when an ab initio calculation is executed (it should correlate with the doabin column of the STATUS keyword). The other three columns show the errors on the energy (eV/atom), forces (ev/Angstrom) and stress (kB).


But before we do that we will take a look at the accuracy of the force obtained now.  
=== Structral properties of the force field ===
To examine the accuracy of structural properties, we compare the deviations between a 3 ps molecular dynamics run using the force field and a full ab initio calculation. For a meaningful comparison, it is best to start from the same initial structure. We will use the liquid structure, we obtained in the previous step and back it up
cp CONTCAR POSCAR.T2000_relaxed


The main output files for the machine learning are:
Now, we proceed with the force field calculation and set up the required files
*{{TAG|ML_LOGFILE}}: This contains the main output of the machine learning. Since we use {{TAG|ML_FF_NWRITE}}=2 the error on energy, forces and stress of the force field compared to ab initio is also written out on this file for every step.
cp POSCAR.T2000_relaxed {{TAGBL|POSCAR}}
*{{TAG|ML_ABNCAR}}: This contains the ab initio data used for the learning. It will be needed for continuation runs as {{TAG|ML_ABCAR}}.
cp {{TAGBL|ML_FFN}} {{TAGBL|ML_FF}}
*{{TAG|ML_FFNCAR}}: This contains the regression results (weights, parameters, etc.). It will be needed for continuation runs as {{TAG|ML_FFCAR}}.
To run a shorter simulation using only the force field, we change the following {{TAG|INCAR}} tags to
{{TAGBL|ML_ISTART}} = 2
{{TAGBL|NSW}} = 1000
After the calculation finished, we backup the history of the atomic positions
cp {{TAGBL|XDATCAR}} XDATCAR.MLFF_3ps
To analyze the pair correlation function, we use the PERL script ''pair_correlation_function.pl''
<div class="toccolours mw-customtoggle-script">'''Click to show/hide pair_correlation_function.pl'''</div>
<div class="mw-collapsible mw-collapsed" id="mw-customcollapsible-script">
<pre>
#!/usr/bin/perl


In the {{TAG|ML_LOGFILE}} we will look for the last entry on the error of energy, forces and stress, Bayesian error and spilling factor, which should look similar to this (since we run a molecular dynamics calculation in parallel on different computers actual results can deviate a little):
use strict;
====================================================================================================
use warnings;
      Information on error estimations
----------------------------------------------------------------------------------------------------
              Error in energy (eV atom^-1):    0.016994
              Error in force  (eV Angst^-1):    0.247664
                      Error in stress (kB):    5.368218
                Bayesian error (eV Angst-1):    0.064904    0.127195
                        Spilling factor (-):    0.000690    0.020000
====================================================================================================
The first entry of the Bayesian error is the estimated error from our model. The second entry is the error threshold. In our case it is newly determined during the calculations. The first and second entry for the spilling factor are the calculated spilling factor and the threshold, respectively. The threshold for the spilling factor is usually always kept constant during the calculations.


Next we will look at the accuracy of structural properties of the force field. For the we first run a 3000 fs molecular dynamics calculation with and without the force field starting from the new {{TAG|CONTCAR}} file. First we save that new {{TAG|CONTCAR}} file to
#configuration for which ensemble average is to be calculated
cp CONTCAR POSCAR.T2000_relaxed
my $confmin=1;            #starting index of configurations in XDATCAR file for pair correlation function
Now do the following steps to run the force field calculation:
my $confmax=20000;          #last index of configurations in XDATCAR file for pair correlation function
*Copy ''POSCAR.T2000_relaxed'' to {{TAG|POSCAR}}.
my $confskip=1;          #stepsize for configuration loop
*Copy {{TAG|ML_ABNCAR}} to {{TAG|ML_ABCAR}}.
my $species_1 = 1;        #species 1 for which pair correlation function is going to be calculated
*Copy {{TAG|ML_FFNCAR}} to {{TAG|ML_FFCAR}}.
my $species_2 = 1;        #species 2 for which pair correlation function is going to be calculated
*Change {{TAG|INCAR}} tags:
#setting radial grid
**Change {{TAG|ML_FF_ISTART}}=0 to {{TAG|ML_FF_ISTART}}=2: This will switch of learning and only use the machine-learned force field.
my $rmin=0.0;            #minimal value of radial grid
**Change {{TAG|NSW}}=10000 to {{TAG|NSW}}=1000.
my $rmax=10.0;            #maximum value of radial grid
*Run calculation.
my $nr=300;                #number of equidistant steps in radial grid
*Copy {{TAG|XDATCAR}} to ''XDATCAR.MLFF_3ps''.
my $dr=($rmax-$rmin)/$nr; #stepsize in radial grid
my $tol=0.0000000001;    #tolerance limit for r->0
my $z=0;                  #counter
my $numelem;              #number of elements
my @elements;            #number of atoms per element saved in list/array
my $lattscale;            #scaling factor for lattice
my @b;                    #Bravais matrix
my $nconf=0;              #number of configurations in XDATCAR file
my @cart;                #Cartesian coordinates for each atom and configuration
my $atmin_1=0;            #first index of species one
my $atmax_1;              #last index of species one
my $atmin_2=0;            #first index of species two
my $atmax_2;              #last index of species two
my @vol;                  #volume of cell (determinant of Bravais matrix)
my $pi=4*atan2(1, 1);    #constant pi
my $natom=0;              #total number of atoms in cell
my @pcf;                  #pair correlation function (list/array)
my $mult_x=1;            #periodic repetition of cells in x dimension
my $mult_y=1;            #periodic repetition of cells in y dimension
my $mult_z=1;            #periodic repetition of cells in z dimension
my @cart_super;          #Cartesian cells over multiple cells
my @vec_len;              #Length of lattice vectors in 3 spatial coordinates
#my $ensemble_type="NpT";  #Set Npt or NVT. Needs to be set since both have different XDATCAR file.
my $ensemble_type="NVT";  #Set Npt or NVT. Needs to be set since both have different XDATCAR file.
my $av_vol=0;              #Average volume in cell


Now do the following steps to run the ab initio reference calculation:
#reading in XDATCAR file
*Copy ''POSCAR.T2000_relaxed'' to {{TAG|POSCAR}}.
while (<>)
*Change {{TAG|ML_FF_LMLFF}}=''.TRUE.'' to {TAG|ML_FF_LMLFF}}=''.FALSE.'': This will turn off the machine learning completely.
{
*Run caclulation.
  chomp;
*Copy {{TAG|XDATCAR}} to ''XDATCAR.AI_3ps''.
  $_=~s/^/ /;
  my @help=split(/[\t,\s]+/);
  $z++;
  if ($z==2)
  {
      $lattscale = $help[1];
  }
  if ($z==3)
  {
      $b[$nconf+1][1][1]=$help[1]*$lattscale;
      $b[$nconf+1][1][2]=$help[2]*$lattscale;
      $b[$nconf+1][1][3]=$help[3]*$lattscale;
  }
  if ($z==4)
  {
      $b[$nconf+1][2][1]=$help[1]*$lattscale;
      $b[$nconf+1][2][2]=$help[2]*$lattscale;
      $b[$nconf+1][2][3]=$help[3]*$lattscale;
  }
  if ($z==5)
  {
      $b[$nconf+1][3][1]=$help[1]*$lattscale;
      $b[$nconf+1][3][2]=$help[2]*$lattscale;
      $b[$nconf+1][3][3]=$help[3]*$lattscale;
  }
  if ($z==7)
  {
      if ($nconf==0)
      {
        $numelem=@help-1;
        for (my $i=1;$i<=$numelem;$i++)
        {
            $elements[$i]=$help[$i];
            $natom=$natom+$help[$i];
        }
      }
  }
  if ($_=~m/Direct/)
  {
      $nconf=$nconf+1;
      #for NVT ensemble only one Bravais matrix exists, so it has to be copied
      if ($ensemble_type eq "NVT")
      {
        for (my $i=1;$i<=3;$i++)
        {
            for (my $j=1;$j<=3;$j++)
            {
              $b[$nconf][$i][$j]=$b[1][$i][$j];
            }
        }
      }
      for (my $i=1;$i<=$natom;$i++)
      {
        $_=<>;
        chomp;
        $_=~s/^/ /;
        my @helpat=split(/[\t,\s]+/);
        $cart[$nconf][$i][1]=$b[1][1][1]*$helpat[1]+$b[1][1][2]*$helpat[2]+$b[1][1][3]*$helpat[3];
        $cart[$nconf][$i][2]=$b[1][2][1]*$helpat[1]+$b[1][2][2]*$helpat[2]+$b[1][2][3]*$helpat[3];
        $cart[$nconf][$i][3]=$b[1][3][1]*$helpat[1]+$b[1][3][2]*$helpat[2]+$b[1][3][3]*$helpat[3];
      }
      if ($ensemble_type eq "NpT")
      {
        $z=0;
      }
  }
  last if eof;
}


To analyze the pair correlation function use the PERL script ''pair_correlation_function.pl'':
if ($confmin>$nconf)
#!/usr/bin/perl
{
  print "Error, confmin larger than number of configurations. Exiting...\n";
  exit;
}
if ($confmax>$nconf)
{
  $confmax=$nconf;
}
   
   
use strict;
for (my $i=1;$i<=$nconf;$i++)
use warnings;
{
  #calculate lattice vector lengths
#configuration for which ensemble average is to be calculated
  $vec_len[$i][1]=($b[$i][1][1]*$b[$i][1][1]+$b[$i][1][2]*$b[$i][1][2]+$b[$i][1][3]*$b[$i][1][3])**0.5;
my $confmin=1;           #starting index of configurations in XDATCAR file for pair correlation function
  $vec_len[$i][2]=($b[$i][2][1]*$b[$i][2][1]+$b[$i][2][2]*$b[$i][2][2]+$b[$i][2][3]*$b[$i][2][3])**0.5;
my $confmax=20000;           #last index of configurations in XDATCAR file for pair correlation function
  $vec_len[$i][3]=($b[$i][3][1]*$b[$i][3][1]+$b[$i][3][2]*$b[$i][3][2]+$b[$i][3][3]*$b[$i][3][3])**0.5;
my $confskip=1;          #stepsize for configuration loop
  #calculate volume of cell
my $species_1 = 1;        #species 1 for which pair correlation function is going to be calculated
  $vol[$i]=$b[$i][1][1]*$b[$i][2][2]*$b[$i][3][3]+$b[$i][1][2]*$b[$i][2][3]*$b[$i][3][1]+$b[$i][1][3]*$b[$i][2][1]*$b[$i][3][2]-$b[$i][3][1]*$b[$i][2][2]*$b[$i][1][3]-$b[$i][3][2]*$b[$i][2][3]*$b[$i][1][1]-$b[$i][3][3]*$b[$i][2][1]*$b[$i][1][2];
my $species_2 = 1;        #species 2 for which pair correlation function is going to be calculated
  $av_vol=$av_vol+$vol[$i];
#setting radial grid
}
my $rmin=0.0;             #minimal value of radial grid
$av_vol=$av_vol/$nconf;
my $rmax=10.0;            #maximum value of radial grid
 
my $nr=300;                #number of equidistant steps in radial grid
#choose species 1 for which pair correlation function is going to be calculated
my $dr=($rmax-$rmin)/$nr; #stepsize in radial grid
$atmin_1=1;
my $tol=0.0000000001;    #tolerance limit for r->0
if ($species_1>1)
 
{
my $z=0;                  #counter
  for (my $i=1;$i<$species_1;$i++)
my $numelem;              #number of elements
  {
my @elements;            #number of atoms per element saved in list/array
    $atmin_1=$atmin_1+$elements[$i];
my $lattscale;            #scaling factor for lattice
  }
my @b;                    #Bravais matrix
}
my $nconf=0;              #number of configurations in XDATCAR file
$atmax_1=$atmin_1+$elements[$species_1]-1;
my @cart;                #Cartesian coordinates for each atom and configuration
#choose species 2 to which paircorrelation function is calculated to
my $atmin_1=0;            #first index of species one
$atmin_2=1;
my $atmax_1;              #last index of species one
if ($species_2>1)
my $atmin_2=0;           #first index of species two
{
my $atmax_2;              #last index of species two
  for (my $i=1;$i<$species_2;$i++)
my @vol;                  #volume of cell (determinant of Bravais matrix)
  {
my $pi=4*atan2(1, 1);    #constant pi
    $atmin_2=$atmin_2+$elements[$i];
my $natom=0;              #total number of atoms in cell
  }
my @pcf;                  #pair correlation function (list/array)
}
my $mult_x=1;            #periodic repetition of cells in x dimension
$atmax_2=$atmin_2+$elements[$species_2]-1;
my $mult_y=1;            #periodic repetition of cells in y dimension
#initialize pair correlation function
my $mult_z=1;            #periodic repetition of cells in z dimension
for (my $i=0;$i<=($nr-1);$i++)
my @cart_super;          #Cartesian cells over multiple cells
{
my @vec_len;              #Length of lattice vectors in 3 spatial coordinates
  $pcf[$i]=0.0;
#my $ensemble_type="NpT";  #Set Npt or NVT. Needs to be set since both have different XDATCAR file.
}
my $ensemble_type="NVT";  #Set Npt or NVT. Needs to be set since both have different XDATCAR file.
# loop over configurations, make histogram of pair correlation function
my $av_vol=0;              #Average volume in cell
for (my $j=$confmin;$j<=$confmax;$j=$j+$confskip)
{
#reading in XDATCAR file
  for (my $k=$atmin_1;$k<=$atmax_1;$k++)
while (<>)
  {
{
       for (my $l=$atmin_2;$l<=$atmax_2;$l++)
    chomp;
    $_=~s/^/ /;
    my @help=split(/[\t,\s]+/);
    $z++;
    if ($z==2)
    {
      $lattscale = $help[1];
    }
    if ($z==3)
    {
      $b[$nconf+1][1][1]=$help[1]*$lattscale;
      $b[$nconf+1][1][2]=$help[2]*$lattscale;
      $b[$nconf+1][1][3]=$help[3]*$lattscale;
    }
    if ($z==4)
    {
      $b[$nconf+1][2][1]=$help[1]*$lattscale;
      $b[$nconf+1][2][2]=$help[2]*$lattscale;
      $b[$nconf+1][2][3]=$help[3]*$lattscale;
    }
    if ($z==5)
    {
      $b[$nconf+1][3][1]=$help[1]*$lattscale;
      $b[$nconf+1][3][2]=$help[2]*$lattscale;
      $b[$nconf+1][3][3]=$help[3]*$lattscale;
    }
    if ($z==7)
    {
       if ($nconf==0)
       {
       {
           $numelem=@help-1;
           if ($k==$l) {next};
           for (my $i=1;$i<=$numelem;$i++)
           for (my $g_x=-$mult_x;$g_x<=$mult_x;$g_x++)
           {
           {
             $elements[$i]=$help[$i];
             for (my $g_y=-$mult_y;$g_y<=$mult_y;$g_y++)
            $natom=$natom+$help[$i];
          }
      }
    }
    if ($_=~m/Direct/)
    {
      $nconf=$nconf+1;
      #for NVT ensemble only one Bravais matrix exists, so it has to be copied
      if ($ensemble_type eq "NVT")
      {
          for (my $i=1;$i<=3;$i++)
          {
            for (my $j=1;$j<=3;$j++)
             {
             {
                 $b[$nconf][$i][$j]=$b[1][$i][$j];
                 for (my $g_z=-$mult_y;$g_z<=$mult_z;$g_z++)
                {
                  my $at2_x=$cart[$j][$l][1]+$vec_len[$j][1]*$g_x;
                  my $at2_y=$cart[$j][$l][2]+$vec_len[$j][2]*$g_y;
                  my $at2_z=$cart[$j][$l][3]+$vec_len[$j][3]*$g_z;
                  my $dist=($cart[$j][$k][1]-$at2_x)**2.0+($cart[$j][$k][2]-$at2_y)**2.0+($cart[$j][$k][3]-$at2_z)**2.0;
                  $dist=$dist**0.5;
                  #determine integer multiple
                  my $zz=int(($dist-$rmin)/$dr+0.5);
                  if ($zz<$nr)
                  {
                      $pcf[$zz]=$pcf[$zz]+1.0;
                  }
                }
             }
             }
           }
           }
       }
       }
      for (my $i=1;$i<=$natom;$i++)
  }
      {
}
          $_=<>;
          chomp;
          $_=~s/^/ /;
          my @helpat=split(/[\t,\s]+/);
          $cart[$nconf][$i][1]=$b[1][1][1]*$helpat[1]+$b[1][1][2]*$helpat[2]+$b[1][1][3]*$helpat[3];
          $cart[$nconf][$i][2]=$b[1][2][1]*$helpat[1]+$b[1][2][2]*$helpat[2]+$b[1][2][3]*$helpat[3];
          $cart[$nconf][$i][3]=$b[1][3][1]*$helpat[1]+$b[1][3][2]*$helpat[2]+$b[1][3][3]*$helpat[3];
      }
      if ($ensemble_type eq "NpT")
      {
          $z=0;
      }
    }
    last if eof;
}
   
   
if ($confmin>$nconf)
#make ensemble average, rescale functions and print
{
for (my $i=0;$i<=($nr-1);$i++)
    print "Error, confmin larger than number of configurations. Exiting...\n";
{
    exit;
  my $r=$rmin+$i*$dr;
}
  if ($r<$tol)
if ($confmax>$nconf)
  {
{
      $pcf[$i]=0.0;
    $confmax=$nconf;
  }
}
  else
 
  {
for (my $i=1;$i<=$nconf;$i++)
      $pcf[$i]=$pcf[$i]*$av_vol/(4*$pi*$r*$r*$dr*(($confmax-$confmin)/$confskip)*($atmax_2-$atmin_2+1)*($atmax_1-$atmin_1+1));#*((2.0*$mult_x+1.0)*(2.0*$mult_y+1.0)*(2.0*$mult_z+1.0)));
{
  }
    #calculate lattice vector lengths
  print $r," ",$pcf[$i],"\n";
    $vec_len[$i][1]=($b[$i][1][1]*$b[$i][1][1]+$b[$i][1][2]*$b[$i][1][2]+$b[$i][1][3]*$b[$i][1][3])**0.5;
}
    $vec_len[$i][2]=($b[$i][2][1]*$b[$i][2][1]+$b[$i][2][2]*$b[$i][2][2]+$b[$i][2][3]*$b[$i][2][3])**0.5;
</pre>
    $vec_len[$i][3]=($b[$i][3][1]*$b[$i][3][1]+$b[$i][3][2]*$b[$i][3][2]+$b[$i][3][3]*$b[$i][3][3])**0.5;
</div>
    #calculate volume of cell
and process the previously saved {{TAG|XDATCAR}} files
    $vol[$i]=$b[$i][1][1]*$b[$i][2][2]*$b[$i][3][3]+$b[$i][1][2]*$b[$i][2][3]*$b[$i][3][1]+$b[$i][1][3]*$b[$i][2][1]*$b[$i][3][2]-$b[$i][3][1]*$b[$i][2][2]*$b[$i][1][3]-$b[$i][3][2]*$b[$i][2][3]*$b[$i][1][1]-$b[$i][3][3]*$b[$i][2][1]*$b[$i][1][2];
  perl pair_correlation_function.pl XDATCAR.MLFF_3ps > pair_MLFF_3ps.dat
    $av_vol=$av_vol+$vol[$i];
 
}
To save time the pair correlation function for 1000 ab initio MD steps is precalculated in the file ''pair_AI_3ps.dat''.
$av_vol=$av_vol/$nconf;
 
The interested user can of course calculate the results of the ab initio MD by rerunning the above steps while switching off machine learning via
#choose species 1 for which pair correlation function is going to be calculated
{{TAGBL|ML_LMLFF}} = .FALSE.
$atmin_1=1;
if ($species_1>1)
{
    for (my $i=1;$i<$species_1;$i++)
    {
      $atmin_1=$atmin_1+$elements[$i];
    }
}
$atmax_1=$atmin_1+$elements[$species_1]-1;
#choose species 2 to which paircorrelation function is calculated to
$atmin_2=1;
if ($species_2>1)
{
    for (my $i=1;$i<$species_2;$i++)
    {
      $atmin_2=$atmin_2+$elements[$i];
    }
}
$atmax_2=$atmin_2+$elements[$species_2]-1;
#initialize pair correlation function
for (my $i=0;$i<=($nr-1);$i++)
{
    $pcf[$i]=0.0;
}
# loop over configurations, make histogram of pair correlation function
for (my $j=$confmin;$j<=$confmax;$j=$j+$confskip)
{
    for (my $k=$atmin_1;$k<=$atmax_1;$k++)
    {
        for (my $l=$atmin_2;$l<=$atmax_2;$l++)
        {
          if ($k==$l) {next};
          for (my $g_x=-$mult_x;$g_x<=$mult_x;$g_x++)
          {
              for (my $g_y=-$mult_y;$g_y<=$mult_y;$g_y++)
              {
                for (my $g_z=-$mult_y;$g_z<=$mult_z;$g_z++)
                {
                    my $at2_x=$cart[$j][$l][1]+$vec_len[$j][1]*$g_x;
                    my $at2_y=$cart[$j][$l][2]+$vec_len[$j][2]*$g_y;
                    my $at2_z=$cart[$j][$l][3]+$vec_len[$j][3]*$g_z;
                    my $dist=($cart[$j][$k][1]-$at2_x)**2.0+($cart[$j][$k][2]-$at2_y)**2.0+($cart[$j][$k][3]-$at2_z)**2.0;
                    $dist=$dist**0.5;
                    #determine integer multiple
                    my $zz=int(($dist-$rmin)/$dr+0.5);
                    if ($zz<$nr)
                    {
                      $pcf[$zz]=$pcf[$zz]+1.0;
                    }
                }
              }
          }
        }
    }
}
 
#make ensemble average, rescale functions and print
for (my $i=0;$i<=($nr-1);$i++)
{
    my $r=$rmin+$i*$dr;
    if ($r<$tol)
    {
      $pcf[$i]=0.0;
    }
    else
    {
      $pcf[$i]=$pcf[$i]*$av_vol/(4*$pi*$r*$r*$dr*(($confmax-$confmin)/$confskip)*($atmax_2-$atmin_2+1)*($atmax_1-$atmin_1+1));#*((2.0*$mult_x+1.0)*(2.0*$mult_y+1.0)*(2.0*$mult_z+1.0)));
    }
    print $r," ",$pcf[$i],"\n";
  }


Obtain the pair correlation function from the previously saved {{TAG|XDATCAR}} files:
We can compare the pair correlation functions, e.g. with gnuplot using the following command
perl pair_correlation_function.pl XDATCAR.MLFF_3ps > pair_MLFF_3ps.dat
perl pair_correlation_function.pl XDATCAR.AI_3ps > pair_AI_3ps.dat
Plot the pair correlation functions using the following command:
  gnuplot -e "set terminal jpeg; set xlabel 'r(Ang)'; set ylabel 'PCF'; set style data lines; plot 'pair_MLFF_3ps.dat', 'pair_AI_3ps.dat' " > PC_MLFF_vs_AI_3ps.jpg
  gnuplot -e "set terminal jpeg; set xlabel 'r(Ang)'; set ylabel 'PCF'; set style data lines; plot 'pair_MLFF_3ps.dat', 'pair_AI_3ps.dat' " > PC_MLFF_vs_AI_3ps.jpg
The pair correlation functions obtained that way should look similar like that:
The pair correlation functions obtained that way should look similar to this figure


[[File:PC MLFF vs AI 3ps.jpg|400px]]
[[File:PC MLFF vs AI 3ps.jpg|400px]]


We see that pair correlation is quite well reproduced although the error on the force is a little bit large. Nevertheless we just used this learning to very create a liquid structure very fast, since creating a liquid from a perfect crystal can be a time consuming task.
We see that pair correlation is quite well reproduced although the error in the force of ~0.242 eV/<math>\AA</math> shown above is a little bit too large. This error is maybe too large for accurate production calculations (usually an accuracy of approximately 0.1 eV/<math>\AA</math> is targeted), but since the pair correlation function is well reproduced it is perfectly fine to use this on-the-fly force field in the time-consuming melting of the crystal.


=== Obtaining more accurate force field ===
=== Obtaining a more accurate force field ===
After learning how to get a good starting structure for the the liquid we will restart the learning but with better ab initio parameters using the {{TAG|CONTCAR}} from the previous subsection. It is important in most calculations to have quite accurate ab initio calculations since bad ab initio parameters can really limit the accuracy of the learning or even inhibit a proper learning.


We do the following steps to restart learning:
Including the melting phase in the force field may impact the accuracy of the force field. To improve it is usually advisable to learn on the pure structures, which in our case this means to use the {{TAG|CONTCAR}} file obtained after the melting. Furthermore, the force field was learned using only a single k point so that we can improve the accuracy of the reference data by including more k points. In most calculations, it is important to conduct accurate ab initio calculations since bad reference data can limit the accuracy or even inhibit the learning of a force field.
*Copy ''POSCAR.T2000_relaxed'' to {{TAG|POSCAR}}.
 
*Change {{TAG|INCAR}} tags:
We restart from the liquid structure obtained [[#Creating the liquid structure|before]]
**Set {{TAG|ALGO}}=''Normal''.
cp POSCAR.T2000_relaxed {{TAGBL|POSCAR}}
**Set {{TAG|ML_FF_LMLFF}}=''.FALSE.'' back to {{TAG|ML_FF_LMLFF}}=''.TRUE.''.
and change the following {{TAG|INCAR}} tags
**Set {{TAG|ML_FF_ISTART}}=2 back to {{TAG|ML_FF_ISTART}}=0.
{{TAGBL|ALGO}} = Normal
**Set {{TAG|NSW}}=1000 back to {{TAG|NSW}}=10000. We will learn again for 30 ps.
{{TAGBL|ML_LMLFF}} = .TRUE.
**Optionally add k point parallelization wth the {{TAG|KPAR}} tag (if you are not familiar with this tag leave it be).
{{TAGBL|ML_ISTART}} = 0
*Increase the k mesh in the {{TAG|KPOINTS}} file. The new file looks like:
{{TAGBL|NSW}} = 1000
  test
If you run have resources to run in parallel, you can reduce the computation time further by adding k point parallelization with the {{TAG|KPAR}} tag.
We use a denser k-point mesh in the {{TAG|KPOINTS}} file
  2x2x2 Gamma-centered mesh
  0 0 0
  0 0 0
  Gamma
  Gamma
   2 2 2
   2 2 2
   0 0 0
   0 0 0
*Run the calculation using the '''standard''' version of VASP (usually ''vasp_std'').
We will learn a new force field with 1000 MD steps (each of 3 fs). Keep in mind to run the calculation using the '''standard''' version of VASP (usually ''vasp_std'').
After running the calculation, we examine the error "ERR" in the {{TAG|ML_LOGFILE}} by typing:
grep "ERR" ML_LOGFILE
where the last entries should be close to
ERR                  886  5.98467749E-03  1.48190308E-01  2.38264786E+00
ERR                  908  5.98467749E-03  1.48190308E-01  2.38264786E+00
ERR                  925  5.98467749E-03  1.48190308E-01  2.38264786E+00
ERR                  959  5.98467749E-03  1.48190308E-01  2.38264786E+00
ERR                  980  5.98467749E-03  1.48190308E-01  2.38264786E+00
ERR                  990  5.99559653E-03  1.50261779E-01  2.40349561E+00
ERR                  1000  5.99559653E-03  1.50261779E-01  2.40349561E+00


After running the calculation we will again look at the error in the {{TAG|ML_LOGFILE}}, where the obtained values should be close to:
We immediately see that the errors for the forces are significantly lower than in the previous calculation with only one k point. This is due to the less noisy ab initio data which is easier to learn.
====================================================================================================
      Information on error estimations
----------------------------------------------------------------------------------------------------
              Error in energy (eV atom^-1):    0.006079
              Error in force  (eV Angst^-1):    0.165678
                      Error in stress (kB):    2.553148
                Bayesian error (eV Angst-1):    0.044695    0.101357
                        Spilling factor (-):    0.002285    0.020000
====================================================================================================


We immediately see that the errors are significantly lower than in the training with one k point.
To understand how the force field is learned, we inspect the {{TAG|ML_ABN}} file containing the training data. In the beginning of this file, you will find information about the number of reference structures for training
 
  1.0 Version
Next we try to look at how well the force field is learned and try to monitor how much training data is added to a structure after each learning cycle.
 
The {{TAG|ML_ABNCAR}} file contains the training data. The number of reference structures and the basis-set size are written at the beginning of this file. The values in this example at this step should be close to the following:
  **************************************************
  **************************************************
       The number of configurations
       The number of configurations
  --------------------------------------------------
  --------------------------------------------------
           55
           48
and
and the number of local reference configurations (size of the basis set)
  **************************************************
  **************************************************
       The numbers of basis sets per atom type
       The numbers of basis sets per atom type
  --------------------------------------------------
  --------------------------------------------------
         396
         382
for the training-set size and basis-set size, respectively.


Now we want to further continue the learning to see how much data is added and how the accuracy of the force field changes.
We will further continue the training a 1000 MD steps and see how the number of training structures and the number of local reference configurations change.  
Do the following steps for a continuation run with on-the-fly learning:
Do the following steps:
*Copy {{TAG|ML_ABNCAR}} to {{TAG|ML_ABCAR}}.
cp {{TAGBL|ML_ABN}} {{TAGBL|ML_AB}}
*Copy {{TAG|CONTCAR}} to {{TAG|POSCAR}}.
cp {{TAGBL|CONTCAR}} {{TAGBL|POSCAR}}
*Modify the {{TAG|INCAR}} file in the following way:
and set the following {{TAG|INCAR}} tag
**Change {{TAG|ML_FF_ISTART}}=0 to {{TAG|ML_FF_ISTART}}=1.
{{TAGBL|ML_ISTART}} = 1
**Add {{TAG|ML_FF_CTIFOR}}=''x'', where ''x'' is the last criterion for the threshold of the Bayesian error obtained from the {{TAG|ML_LOGFILE}} (from the entry above it would be 0.101357, but please use the value that you obtained in your calculation). By setting already a good guess for the threshold several ab initio steps can be skipped from the beginning which would be required to determine the Bayesian error criterion. '''Mind''': When continuation runs are performed where the crystal structure is changed, the last obtained error criterion for the Bayesian error can be too large and then many ab initio steps can be skipped wrongly (this can especially happen when going from first a liquid to a solid phase). In that case it is safer to use the default of {{TAG|ML_FF_CTIFOR}}=10^{-16}</math>}}.


After learning we again look at the last entry of the errors in the {{TAG|ML_LOGFILE}}, which should look like:
After running the calculation, we inspect the last instance of the errors in the {{TAG|ML_LOGFILE}} by typing:
  ====================================================================================================
  grep ERR ML_LOGFILE
      Information on error estimations
The last few lines should have values close to:
  ----------------------------------------------------------------------------------------------------
  ERR                  675  5.10937061E-03  1.46895065E-01  2.50094941E+00
              Error in energy (eV atom^-1):    0.006252
ERR                  861  5.10937061E-03  1.46895065E-01  2.50094941E+00
              Error in force (eV Angst^-1):    0.168549
  ERR                  924  5.10937061E-03  1.46895065E-01  2.50094941E+00
                      Error in stress (kB):    2.606332
ERR                  942  5.01460989E-03  1.47816836E-01  2.47329693E+00
                Bayesian error (eV Angst-1):    0.037569    0.101357
ERR                  1000  5.01460989E-03  1.47816836E-01  2.47329693E+00
                        Spilling factor (-):    0.001114    0.020000
We see that the accuracy has changed slightly. We also look at the ML_ABN file and the number of reference structures for training should increase compared to the run before
====================================================================================================
  1.0 Version
We see that the accuracy has not significantly changed. Also the threshold for the Bayesian error has not change (by typing ''grep "Bayesian error (eV Angst-1):" ML_LOGFILE we see that it was the same the whole time). By looking at the actual errors we see that that the threshold was overwritten only very few times. To see how often the threshold was overstepped type the following command on the command line:
grep "Bayesian error (eV Angst-1):" ML_LOGFILE |awk '{if ($5>0.101357) {print $5}}'
Consequently only very few structures were added to the ab initio data in the {{TAG|ML_ABNCAR}} file. The entry for the reference structures and basis sets should look like:
  **************************************************
  **************************************************
       The number of configurations
       The number of configurations
  --------------------------------------------------
  --------------------------------------------------
           65
           99
and
Also the number of local reference configurations (basis sets) increases compared to the previous calculation
  **************************************************
  **************************************************
       The numbers of basis sets per atom type
       The numbers of basis sets per atom type
  --------------------------------------------------
  --------------------------------------------------
         407
         665
This means that either the force field is already accurate enough or that the threshold is set too high and ab initio calculations are skipped because the force field is too accurate for that criterion.


Next we will test the learning with the default value for {{TAG|ML_FF_CTIFOR}} (which is practically zero), to see if the force field changes significantly. For that the following steps have to be done:
*Copy {{TAG|CONTCAR}} to {{TAG|POSCAR}}.
*Copy {{TAG|ML_ABNCAR}} to {{TAG|ML_ABCAR}}.
*Remove the {{TAG|ML_FF_ISTART}} line in the {{TAG|INCAR}} file.


After rerunning the calculation we see that many structures have been added in the {{TAG|ML_ABNCAR}} file:
Ideally, one should continue learning until no structures need to be added to the training data and basis set anymore. Very often this can take up to 100ps depending on the material and conditions. In practice, the prediction of the Bayesian error exhibits numerical inaccuracies so that an ab initio calculation is conducted from time to time even if the force field is accurate enough. So measuring only the decreasing frequency of addition of new data is not sufficient to know when to finish. One should also look at the accuracy of the force on the training data and more importantly on the accuracy on some test data that is outside of the training sets.
  **************************************************
      The number of configurations
--------------------------------------------------
        160
...
**************************************************
      The numbers of basis sets per atom type
--------------------------------------------------
        726
Nevertheless the accuracy of the force field did not improve significantly in the {{TAG|ML_LOGFILE}}:
====================================================================================================
      Information on error estimations
----------------------------------------------------------------------------------------------------
              Error in energy (eV atom^-1):    0.005950
              Error in force  (eV Angst^-1):    0.162064
                      Error in stress (kB):    2.628567
                Bayesian error (eV Angst-1):    0.029723    0.061348
                        Spilling factor (-):    0.000406    0.020000
====================================================================================================
 
Ideally one should continue learning until no structures need to be added to the training data and basis set any more. Practically this is hard to achieve, since the prediction of the Bayesian error has some inaccuracies and learning can be executed even if the force field is accurate enough. Nevertheless for practical calculations one has to stop the calculation at a given point. In this example we stop here because the accuracy on test structures compared to the ab initio data is not improving in the {{TAG|ML_LOGFILE}} by further addition of training data and this example is only a tutorial example. For high quality production calculations suited for publications we would advise to learn until almost no data is added to the training structures and basis set. In many cases this involves a length of 100 ps for learning.
 
'''User task''': The user should continue learning and observe the convergence of the number of reference structures. Most likely the maximum allowed basis set size {{TAG|ML_FF_MB_MB}} needs to be increased. The user should also look at the pair correlation function with respect to the ab initio data. Is it still ok?
 
=== Optimizing weights on energy, forces and stress ===
 
Next we learn how to refine an existing force after it has been learned. The input parameters {{TAG|ML_FF_WTOTEN}}, {{TAG|ML_FF_WTIFOR}} and {{TAG|ML_FF_WTSIF}} are the weights for energy, force and stress in the linear fitting, respectively. By default they are set to 1.0, so energy, force and stress are weighted equally. The weights are relative weights, this means it is sufficient to increase only the weight of interest, while keeping the other parameters at 1.0. If the weight for a component is increased the error of that component usually becomes lower with higher weight. Simultaneously the error on the other components gets worse, but at a different rate. For a given optimal value of the weight the overall error can be minimized, such that the component of interest is reproduced with much more accuracy. Usually this can be important for methods where the free energy is important (such as e.g. {{TAG|Interface pinning calculations}}, {{TAG|Metadynamics}} etc.) and this weighting can improve the error on energy significantly while only slightly making the errors on forces and stress worse.
 
To run the weight optimizations calculations the following steps are necessary:
*Copy {{TAG|INCAR}} to ''INCAR.run''. This will serve as a template since several short calculations will be executed with different weights for the energy.
*Copy {{TAG|CONTCAR}} to ''POSCAR.run''. For each calculation the same {{TAG|POSCAR}} file needs to be used as a starting point.
*Copy {{TAG|ML_ABNCAR}} to {{TAG|ML_ABCAR}}.
*Modify ''INCAR.run'' as follows:
**Set {{TAG|ML_FF_ISTART}}=1.
**Set {{TAG|NSW}}=1. We will only need a single new test structure.
**The {{TAG|ML_FF_WTOTEN}} test will be added by the runscript.
*Use the bash script ''run_param_opt.sh'' (or use your own) which contains the following lines:
#!/bin/bash
if [ -f error_vs_eweight.dat ]; then
    rm error_vs_eweight.dat
fi
touch error_vs_eweight.dat
for i in 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
do
    cat INCAR.run > INCAR
    echo "ML_FF_WTOTEN = $i" >> INCAR
    cp POSCAR.run POSCAR
    mpirun -np $np /fs/home/karsai/PROGRAMS/VASP/CLONES/MLFF/SHMEM/SHMEM_JULY_23_2019_RYOSUKE_latest/bin/vasp_std
    cp ML_LOGFILE ML_LOGFILE.$i
    awk -v var="$i" '/Error in energy \(eV atom\^-1\)\:/{energy=$6} /Error in force  \(eV Angst\^-1\)\:/ {force=$6} /Error in stress \(kB\)\:/{stress=$5} /Energy variance \(eV atom\^-1\)\:/{energy_var=$5}  /Force variance \(eV Angst\^-1\)\:/{force_var=$5} /Stress variance \(kB\)\:/ {stress_var=$4} END{print var,energy/energy_var+force/force_var+stress/stress_var}' ML_LOGFILE.$i >> error_vs_eweight.dat
done


== Download ==
== Download ==
[[Media:MLFF_Liquid_Si_tutorial.tgz| MLFF_Liquid_Si_tutorial.tgz]]


{{Template:Machine learning force field - Tutorial}}
{{Template:Machine learning force field - Tutorial}}


[[Category:Examples]]
[[Category:Examples]]

Latest revision as of 15:13, 15 April 2022

Task

Generating a machine learning force field for liquid Si. For this tutorial, we expect that the user is already familiar with running conventional ab initio molecular dynamic calculations.

Input

POSCAR

In this example we start from a 64 atom super cell of diamond-fcc Si (the same as in Liquid Si - Standard MD):

Si_CD_2x2x2
     5.43090000000000
    2.00000000   0.00000000   0.00000000
    0.00000000   2.00000000   0.00000000
    0.00000000   0.00000000   2.00000000
   Si
   64
Direct
   0.00000000   0.00000000   0.00000000
   0.50000000   0.00000000   0.00000000
   0.00000000   0.50000000   0.00000000
   0.50000000   0.50000000   0.00000000
   0.00000000   0.00000000   0.50000000
   0.50000000   0.00000000   0.50000000
   0.00000000   0.50000000   0.50000000
   0.50000000   0.50000000   0.50000000
   0.37500000   0.12500000   0.37500000
   0.87500000   0.12500000   0.37500000
   0.37500000   0.62500000   0.37500000
   0.87500000   0.62500000   0.37500000
   0.37500000   0.12500000   0.87500000
   0.87500000   0.12500000   0.87500000
   0.37500000   0.62500000   0.87500000
   0.87500000   0.62500000   0.87500000
   0.00000000   0.25000000   0.25000000
   0.50000000   0.25000000   0.25000000
   0.00000000   0.75000000   0.25000000
   0.50000000   0.75000000   0.25000000
   0.00000000   0.25000000   0.75000000
   0.50000000   0.25000000   0.75000000
   0.00000000   0.75000000   0.75000000
   0.50000000   0.75000000   0.75000000
   0.37500000   0.37500000   0.12500000
   0.87500000   0.37500000   0.12500000
   0.37500000   0.87500000   0.12500000
   0.87500000   0.87500000   0.12500000
   0.37500000   0.37500000   0.62500000
   0.87500000   0.37500000   0.62500000
   0.37500000   0.87500000   0.62500000
   0.87500000   0.87500000   0.62500000
   0.25000000   0.00000000   0.25000000
   0.75000000   0.00000000   0.25000000
   0.25000000   0.50000000   0.25000000
   0.75000000   0.50000000   0.25000000
   0.25000000   0.00000000   0.75000000
   0.75000000   0.00000000   0.75000000 
   0.25000000   0.50000000   0.75000000
   0.75000000   0.50000000   0.75000000
   0.12500000   0.12500000   0.12500000
   0.62500000   0.12500000   0.12500000
   0.12500000   0.62500000   0.12500000
   0.62500000   0.62500000   0.12500000
   0.12500000   0.12500000   0.62500000
   0.62500000   0.12500000   0.62500000
   0.12500000   0.62500000   0.62500000
   0.62500000   0.62500000   0.62500000
   0.25000000   0.25000000   0.00000000
   0.75000000   0.25000000   0.00000000
   0.25000000   0.75000000   0.00000000
   0.75000000   0.75000000   0.00000000
   0.25000000   0.25000000   0.50000000
   0.75000000   0.25000000   0.50000000
   0.25000000   0.75000000   0.50000000
   0.75000000   0.75000000   0.50000000
   0.12500000   0.37500000   0.37500000
   0.62500000   0.37500000   0.37500000
   0.12500000   0.87500000   0.37500000
   0.62500000   0.87500000   0.37500000
   0.12500000   0.37500000   0.87500000
   0.62500000   0.37500000   0.87500000
   0.12500000   0.87500000   0.87500000
   0.62500000   0.87500000   0.87500000

KPOINTS

We will start with a single k point in this example:

K-Points
 0
Gamma
 1  1  1
 0  0  0

INCAR

#Basic parameters
ISMEAR = 0
SIGMA = 0.1
LREAL = Auto
ISYM = -1
NELM = 100
EDIFF = 1E-4
LWAVE = .FALSE.
LCHARG = .FALSE.

#Parallelization of ab initio calculations
NCORE = 2

#MD
IBRION = 0
MDALGO = 2
ISIF = 2
SMASS = 1.0
TEBEG = 2000
NSW = 10000
POTIM = 3.0
RANDOM_SEED =          88951986                0                0

#Machine learning paramters
ML_LMLFF = .TRUE.
ML_ISTART = 0

Calculation

Creating the liquid structure

Because we don't have a structure of liquid silicon readily available, we first create that structure by starting from a super cell of crystalline silicon with 64 atoms. The temperature is set to 2000 K so that the crystal melts rapidly in the MD run. To improve the simulation speed drastically, we utilize the on-the-fly machine learning. Most of the ab initio steps will be replaced by very fast force-field ones. Within 10000 steps equivalent to 30 ps, we have obtained a good starting position for the subsequent simulations in the CONTCAR file. You can copy the input files or download them.

After running the calculation, we obtained a force field, but its initial trajectory might be tainted but the unreasonable starting position. Nevertheless, it is instructive to inspect the output to understand how to assess the accuracy of a force field, before refining it in subsequent calculations. The main output files for the machine learning are

ML_ABN
contains the ab initio structure datasets used for the learning. It will be needed for continuation runs as ML_AB.
ML_FFN
contains the regression results (weights, parameters, etc.). It will be needed for continuation runs as ML_FF.
ML_LOGFILE
logging the proceedings of the machine learning. This file consists of keywords that are nicely "grepable." The keywords are explained in the in the beginning of the file and upon "grepping". The status of each MD step is given by the keyword "STATUS". Please invoke the following command:
grep STATUS ML_LOGFILE

The output should look similar to the following:

# STATUS ###############################################################
# STATUS This line describes the overall status of each step.
# STATUS 
# STATUS nstep ..... MD time step or input structure counter
# STATUS state ..... One-word description of step action
# STATUS             - "accurate"  (1) : Errors are low, force field is used
# STATUS             - "threshold" (2) : Errors exceeded threshold, structure is sampled from ab initio
# STATUS             - "learning"  (3) : Stored configurations are used for training force field
# STATUS             - "critical"  (4) : Errors are high, ab initio sampling and learning is enforced
# STATUS             - "predict"   (5) : Force field is used in prediction mode only, no error checking
# STATUS is ........ Integer representation of above one-word description (integer in parenthesis)
# STATUS doabin .... Perform ab initio calculation (T/F)
# STATUS iff ....... Force field available (T/F, False after startup hints to possible convergence problems)
# STATUS nsample ... Number of steps since last reference structure collection (sample = T)
# STATUS ngenff .... Number of steps since last force field generation (genff = T)
# STATUS ###############################################################
# STATUS            nstep     state is doabin    iff   nsample    ngenff
# STATUS                2         3  4      5      6         7         8
# STATUS ###############################################################
STATUS                  0 threshold  2      T      F         0         0
STATUS                  1 critical   4      T      F         0         1
STATUS                  2 critical   4      T      F         0         2
STATUS                  3 critical   4      T      T         0         1
STATUS                  4 critical   4      T      T         0         1
STATUS                  5 critical   4      T      T         0         1
     .                  .        .   .      .      .         .         .
     .                  .        .   .      .      .         .         .
     .                  .        .   .      .      .         .         .
STATUS               9997 accurate   1      F      T       945       996
STATUS               9998 accurate   1      F      T       946       997
STATUS               9999 accurate   1      F      T       947       998
STATUS              10000 learning   3      T      T       948       999

Another important keyword is "ERR". For this instance we should type the following command:

grep ERR ML_LOGFILE

The output should look like the following:

# ERR ######################################################################
# ERR This line contains the RMSEs of the predictions with respect to ab initio results for the training data.
# ERR 
# ERR nstep ......... MD time step or input structure counter
# ERR rmse_energy ... RMSE of energies (eV atom^-1)
# ERR rmse_force .... RMSE of forces (eV Angst^-1)
# ERR rmse_stress ... RMSE of stress (kB)
# ERR ######################################################################
# ERR               nstep      rmse_energy       rmse_force      rmse_stress
# ERR                   2                3                4                5
# ERR ######################################################################
ERR                     0   0.00000000E+00   0.00000000E+00   0.00000000E+00
ERR                     1   0.00000000E+00   0.00000000E+00   0.00000000E+00
ERR                     2   0.00000000E+00   0.00000000E+00   0.00000000E+00
ERR                     3   2.84605192E-05   9.82351889E-03   2.40003743E-02
ERR                     4   1.83193349E-05   1.06700600E-02   5.37606479E-02
ERR                     5   4.12132223E-05   1.34123085E-02   1.01588957E-01
ERR                     6   9.51627413E-05   1.90335214E-02   1.31959103E-01
.                       .                .                .                .
.                       .                .                .                .
.                       .                .                .                .
ERR                  9042   1.07159240E-02   2.41283323E-01   4.95695745E+00
ERR                  9052   1.07159240E-02   2.41283323E-01   4.95695745E+00
ERR                 10000   1.07159240E-02   2.41283323E-01   4.95695745E+00

This tag lists the errors on the energy, forces and stress of the force field compared to the ab initio results on the available training data. The second column shows the MD step. We see that the entry is not output at every MD step. The errors only change if the force field is updated, hence when an ab initio calculation is executed (it should correlate with the doabin column of the STATUS keyword). The other three columns show the errors on the energy (eV/atom), forces (ev/Angstrom) and stress (kB).

Structral properties of the force field

To examine the accuracy of structural properties, we compare the deviations between a 3 ps molecular dynamics run using the force field and a full ab initio calculation. For a meaningful comparison, it is best to start from the same initial structure. We will use the liquid structure, we obtained in the previous step and back it up

cp CONTCAR POSCAR.T2000_relaxed

Now, we proceed with the force field calculation and set up the required files

cp POSCAR.T2000_relaxed POSCAR
cp ML_FFN ML_FF

To run a shorter simulation using only the force field, we change the following INCAR tags to

ML_ISTART = 2
NSW = 1000

After the calculation finished, we backup the history of the atomic positions

cp XDATCAR XDATCAR.MLFF_3ps

To analyze the pair correlation function, we use the PERL script pair_correlation_function.pl

Click to show/hide pair_correlation_function.pl
#!/usr/bin/perl

use strict;
use warnings;

#configuration for which ensemble average is to be calculated
my $confmin=1;            #starting index of configurations in XDATCAR file for pair correlation function
my $confmax=20000;           #last index of configurations in XDATCAR file for pair correlation function
my $confskip=1;           #stepsize for configuration loop
my $species_1 = 1;        #species 1 for which pair correlation function is going to be calculated
my $species_2 = 1;        #species 2 for which pair correlation function is going to be calculated
#setting radial grid 
my $rmin=0.0;             #minimal value of radial grid
my $rmax=10.0;            #maximum value of radial grid
my $nr=300;                #number of equidistant steps in radial grid
my $dr=($rmax-$rmin)/$nr; #stepsize in radial grid
my $tol=0.0000000001;     #tolerance limit for r->0 
 
my $z=0;                  #counter
my $numelem;              #number of elements
my @elements;             #number of atoms per element saved in list/array
my $lattscale;            #scaling factor for lattice
my @b;                    #Bravais matrix
my $nconf=0;              #number of configurations in XDATCAR file
my @cart;                 #Cartesian coordinates for each atom and configuration
my $atmin_1=0;            #first index of species one
my $atmax_1;              #last index of species one
my $atmin_2=0;            #first index of species two
my $atmax_2;              #last index of species two
my @vol;                  #volume of cell (determinant of Bravais matrix)
my $pi=4*atan2(1, 1);     #constant pi
my $natom=0;              #total number of atoms in cell
my @pcf;                  #pair correlation function (list/array)
my $mult_x=1;             #periodic repetition of cells in x dimension
my $mult_y=1;             #periodic repetition of cells in y dimension
my $mult_z=1;             #periodic repetition of cells in z dimension
my @cart_super;           #Cartesian cells over multiple cells
my @vec_len;              #Length of lattice vectors in 3 spatial coordinates
#my $ensemble_type="NpT";  #Set Npt or NVT. Needs to be set since both have different XDATCAR file.
my $ensemble_type="NVT";  #Set Npt or NVT. Needs to be set since both have different XDATCAR file.
my $av_vol=0;               #Average volume in cell

#reading in XDATCAR file
while (<>)
{
   chomp;
   $_=~s/^/ /;
   my @help=split(/[\t,\s]+/);
   $z++;
   if ($z==2)
   {
      $lattscale = $help[1];
   }
   if ($z==3)
   {
      $b[$nconf+1][1][1]=$help[1]*$lattscale;
      $b[$nconf+1][1][2]=$help[2]*$lattscale;
      $b[$nconf+1][1][3]=$help[3]*$lattscale;
   }
   if ($z==4)
   {
      $b[$nconf+1][2][1]=$help[1]*$lattscale;
      $b[$nconf+1][2][2]=$help[2]*$lattscale;
      $b[$nconf+1][2][3]=$help[3]*$lattscale;
   }
   if ($z==5)
   {
      $b[$nconf+1][3][1]=$help[1]*$lattscale;
      $b[$nconf+1][3][2]=$help[2]*$lattscale;
      $b[$nconf+1][3][3]=$help[3]*$lattscale;
   }
   if ($z==7)
   {
      if ($nconf==0)
      {
         $numelem=@help-1;
         for (my $i=1;$i<=$numelem;$i++)
         {
            $elements[$i]=$help[$i];
            $natom=$natom+$help[$i];
         }
      }
   }
   if ($_=~m/Direct/)
   {
      $nconf=$nconf+1;
      #for NVT ensemble only one Bravais matrix exists, so it has to be copied
      if ($ensemble_type eq "NVT")
      {
         for (my $i=1;$i<=3;$i++)
         { 
            for (my $j=1;$j<=3;$j++)
            {
               $b[$nconf][$i][$j]=$b[1][$i][$j];
            }
         }
      }
      for (my $i=1;$i<=$natom;$i++)
      {
         $_=<>;
         chomp;
         $_=~s/^/ /;
         my @helpat=split(/[\t,\s]+/);
         $cart[$nconf][$i][1]=$b[1][1][1]*$helpat[1]+$b[1][1][2]*$helpat[2]+$b[1][1][3]*$helpat[3];
         $cart[$nconf][$i][2]=$b[1][2][1]*$helpat[1]+$b[1][2][2]*$helpat[2]+$b[1][2][3]*$helpat[3];
         $cart[$nconf][$i][3]=$b[1][3][1]*$helpat[1]+$b[1][3][2]*$helpat[2]+$b[1][3][3]*$helpat[3];
      }
      if ($ensemble_type eq "NpT")
      {
         $z=0;
      }
   } 
   last if eof;
}

if ($confmin>$nconf)
{
   print "Error, confmin larger than number of configurations. Exiting...\n";
   exit;
}
if ($confmax>$nconf)
{
   $confmax=$nconf;
}
 
for (my $i=1;$i<=$nconf;$i++)
{
   #calculate lattice vector lengths
   $vec_len[$i][1]=($b[$i][1][1]*$b[$i][1][1]+$b[$i][1][2]*$b[$i][1][2]+$b[$i][1][3]*$b[$i][1][3])**0.5;
   $vec_len[$i][2]=($b[$i][2][1]*$b[$i][2][1]+$b[$i][2][2]*$b[$i][2][2]+$b[$i][2][3]*$b[$i][2][3])**0.5;
   $vec_len[$i][3]=($b[$i][3][1]*$b[$i][3][1]+$b[$i][3][2]*$b[$i][3][2]+$b[$i][3][3]*$b[$i][3][3])**0.5;
   #calculate volume of cell
   $vol[$i]=$b[$i][1][1]*$b[$i][2][2]*$b[$i][3][3]+$b[$i][1][2]*$b[$i][2][3]*$b[$i][3][1]+$b[$i][1][3]*$b[$i][2][1]*$b[$i][3][2]-$b[$i][3][1]*$b[$i][2][2]*$b[$i][1][3]-$b[$i][3][2]*$b[$i][2][3]*$b[$i][1][1]-$b[$i][3][3]*$b[$i][2][1]*$b[$i][1][2];
   $av_vol=$av_vol+$vol[$i];
}
$av_vol=$av_vol/$nconf;

#choose species 1 for which pair correlation function is going to be calculated
$atmin_1=1;
if ($species_1>1)
{
   for (my $i=1;$i<$species_1;$i++)
   {
     $atmin_1=$atmin_1+$elements[$i];
   }
}
$atmax_1=$atmin_1+$elements[$species_1]-1;
#choose species 2 to which paircorrelation function is calculated to
$atmin_2=1;
if ($species_2>1)
{
   for (my $i=1;$i<$species_2;$i++)
   {
     $atmin_2=$atmin_2+$elements[$i];
   }
}
$atmax_2=$atmin_2+$elements[$species_2]-1;
#initialize pair correlation function
for (my $i=0;$i<=($nr-1);$i++)
{
   $pcf[$i]=0.0;
}
# loop over configurations, make histogram of pair correlation function 
for (my $j=$confmin;$j<=$confmax;$j=$j+$confskip)
{
   for (my $k=$atmin_1;$k<=$atmax_1;$k++)
   {
       for (my $l=$atmin_2;$l<=$atmax_2;$l++)
       {
          if ($k==$l) {next};
          for (my $g_x=-$mult_x;$g_x<=$mult_x;$g_x++)
          {
             for (my $g_y=-$mult_y;$g_y<=$mult_y;$g_y++)
             {
                for (my $g_z=-$mult_y;$g_z<=$mult_z;$g_z++)
                {
                   my $at2_x=$cart[$j][$l][1]+$vec_len[$j][1]*$g_x;
                   my $at2_y=$cart[$j][$l][2]+$vec_len[$j][2]*$g_y;
                   my $at2_z=$cart[$j][$l][3]+$vec_len[$j][3]*$g_z;
                   my $dist=($cart[$j][$k][1]-$at2_x)**2.0+($cart[$j][$k][2]-$at2_y)**2.0+($cart[$j][$k][3]-$at2_z)**2.0;
                   $dist=$dist**0.5;
                   #determine integer multiple 
                   my $zz=int(($dist-$rmin)/$dr+0.5);
                   if ($zz<$nr)
                   {
                      $pcf[$zz]=$pcf[$zz]+1.0;
                   }
                }
             }
          }
       }
   }
}
 
#make ensemble average, rescale functions and print
for (my $i=0;$i<=($nr-1);$i++)
{
   my $r=$rmin+$i*$dr;
   if ($r<$tol)
   {
      $pcf[$i]=0.0;
   }
   else
   {
      $pcf[$i]=$pcf[$i]*$av_vol/(4*$pi*$r*$r*$dr*(($confmax-$confmin)/$confskip)*($atmax_2-$atmin_2+1)*($atmax_1-$atmin_1+1));#*((2.0*$mult_x+1.0)*(2.0*$mult_y+1.0)*(2.0*$mult_z+1.0)));
   }
   print $r," ",$pcf[$i],"\n";
}

and process the previously saved XDATCAR files

perl pair_correlation_function.pl XDATCAR.MLFF_3ps > pair_MLFF_3ps.dat

To save time the pair correlation function for 1000 ab initio MD steps is precalculated in the file pair_AI_3ps.dat.

The interested user can of course calculate the results of the ab initio MD by rerunning the above steps while switching off machine learning via

ML_LMLFF = .FALSE.

We can compare the pair correlation functions, e.g. with gnuplot using the following command

gnuplot -e "set terminal jpeg; set xlabel 'r(Ang)'; set ylabel 'PCF'; set style data lines; plot 'pair_MLFF_3ps.dat', 'pair_AI_3ps.dat' " > PC_MLFF_vs_AI_3ps.jpg

The pair correlation functions obtained that way should look similar to this figure

We see that pair correlation is quite well reproduced although the error in the force of ~0.242 eV/ shown above is a little bit too large. This error is maybe too large for accurate production calculations (usually an accuracy of approximately 0.1 eV/ is targeted), but since the pair correlation function is well reproduced it is perfectly fine to use this on-the-fly force field in the time-consuming melting of the crystal.

Obtaining a more accurate force field

Including the melting phase in the force field may impact the accuracy of the force field. To improve it is usually advisable to learn on the pure structures, which in our case this means to use the CONTCAR file obtained after the melting. Furthermore, the force field was learned using only a single k point so that we can improve the accuracy of the reference data by including more k points. In most calculations, it is important to conduct accurate ab initio calculations since bad reference data can limit the accuracy or even inhibit the learning of a force field.

We restart from the liquid structure obtained before

cp POSCAR.T2000_relaxed POSCAR

and change the following INCAR tags

ALGO = Normal
ML_LMLFF = .TRUE.
ML_ISTART = 0
NSW = 1000

If you run have resources to run in parallel, you can reduce the computation time further by adding k point parallelization with the KPAR tag. We use a denser k-point mesh in the KPOINTS file

2x2x2 Gamma-centered mesh
0 0 0
Gamma
 2 2 2
 0 0 0

We will learn a new force field with 1000 MD steps (each of 3 fs). Keep in mind to run the calculation using the standard version of VASP (usually vasp_std). After running the calculation, we examine the error "ERR" in the ML_LOGFILE by typing:

grep "ERR" ML_LOGFILE

where the last entries should be close to

ERR                   886   5.98467749E-03   1.48190308E-01   2.38264786E+00
ERR                   908   5.98467749E-03   1.48190308E-01   2.38264786E+00
ERR                   925   5.98467749E-03   1.48190308E-01   2.38264786E+00
ERR                   959   5.98467749E-03   1.48190308E-01   2.38264786E+00
ERR                   980   5.98467749E-03   1.48190308E-01   2.38264786E+00
ERR                   990   5.99559653E-03   1.50261779E-01   2.40349561E+00
ERR                  1000   5.99559653E-03   1.50261779E-01   2.40349561E+00

We immediately see that the errors for the forces are significantly lower than in the previous calculation with only one k point. This is due to the less noisy ab initio data which is easier to learn.

To understand how the force field is learned, we inspect the ML_ABN file containing the training data. In the beginning of this file, you will find information about the number of reference structures for training

 1.0 Version
**************************************************
     The number of configurations
--------------------------------------------------
         48

and the number of local reference configurations (size of the basis set)

**************************************************
     The numbers of basis sets per atom type
--------------------------------------------------
       382

We will further continue the training a 1000 MD steps and see how the number of training structures and the number of local reference configurations change. Do the following steps:

cp ML_ABN ML_AB
cp CONTCAR POSCAR

and set the following INCAR tag

ML_ISTART = 1

After running the calculation, we inspect the last instance of the errors in the ML_LOGFILE by typing:

grep ERR ML_LOGFILE

The last few lines should have values close to:

ERR                   675   5.10937061E-03   1.46895065E-01   2.50094941E+00
ERR                   861   5.10937061E-03   1.46895065E-01   2.50094941E+00
ERR                   924   5.10937061E-03   1.46895065E-01   2.50094941E+00
ERR                   942   5.01460989E-03   1.47816836E-01   2.47329693E+00
ERR                  1000   5.01460989E-03   1.47816836E-01   2.47329693E+00

We see that the accuracy has changed slightly. We also look at the ML_ABN file and the number of reference structures for training should increase compared to the run before

 1.0 Version
**************************************************
     The number of configurations
--------------------------------------------------
         99

Also the number of local reference configurations (basis sets) increases compared to the previous calculation

**************************************************
     The numbers of basis sets per atom type
--------------------------------------------------
       665


Ideally, one should continue learning until no structures need to be added to the training data and basis set anymore. Very often this can take up to 100ps depending on the material and conditions. In practice, the prediction of the Bayesian error exhibits numerical inaccuracies so that an ab initio calculation is conducted from time to time even if the force field is accurate enough. So measuring only the decreasing frequency of addition of new data is not sufficient to know when to finish. One should also look at the accuracy of the force on the training data and more importantly on the accuracy on some test data that is outside of the training sets.

Download

MLFF_Liquid_Si_tutorial.tgz