VASP 5.3.3 terminating

Posted: Mon Jun 10, 2013 3:38 am
by sjensen313
Hi Everyone,
I purchased my version of VASP 5.3.3 through Materials Design, but my question applies even when I submit jobs to VASP exclusively through the command line. My jobs are terminating prematurely, before the calculation finishes, and I think this might be related to a hangup. I am running Red Hat Linux on a 2-processor machine (12 cores total) with 64 GB of RAM.

When I initially had everything set up, the jobs ran and completed as expected. Within a few days, however, submitting a job produced the following error:
mpiexec_server.org: cannot connect to local mpd (/tmp/mpd2.console_user); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)

I believe this error is telling me that the mpd daemon is not running. I added mpdboot and mpd to my submission script and the problem went away, but about 3 minutes into the run the job terminated abruptly. In an effort to fix this, I added a walltime and put nohup before all the executables that I run. This allowed the job to run for 30-60 minutes, but it still terminates before the calculation is done. The trouble with this termination is that I don't see any errors in the error file or in OUTCAR.
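
Since neither the error file nor OUTCAR shows anything, one way to make the termination visible is to record each command's output and exit status in a separate log. The following is only a minimal sketch, assuming a POSIX shell; `run_logged` and `run.log` are hypothetical names, not part of VASP, PBS, or Intel MPI.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not a VASP/PBS/Intel MPI tool):
# run a command, append its stdout/stderr and exit status to a log file,
# and return the command's exit status to the caller.
run_logged() {
    log_file=$1
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] start: $*" >> "$log_file"
    "$@" >> "$log_file" 2>&1
    cmd_status=$?
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] exit status $cmd_status: $*" >> "$log_file"
    return "$cmd_status"
}

# Possible use inside the job script (commands and paths as in the post):
# run_logged run.log mpdtrace
# run_logged run.log mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel
```

In most shells, an exit status above 128 means the command was killed by a signal (128 plus the signal number, so 129 would correspond to SIGHUP), which could help confirm or rule out a hangup. Note that mpiexec may report its own status rather than that of the VASP processes.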

Here is the script I use to submit to a PBS job queue. The walltime line and the nohup mpdboot and mpd lines are the ones I added to try to address this early termination issue.

#PBS
#PBS -l walltime=168:00:00
#PBS -q mainq
#PBS -l nodes=1:ppn=1
#PBS -o VASP.out
#PBS -e stderror
cd /data1/opt/MD/2.0/TaskServer/Tasks/task00141
cp VASP_INCAR INCAR
cp VASP_KPOINTS KPOINTS
export PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/bin:$PATH
export LD_LIBRARY_PATH=/data1/opt/MD/Linux-x86_64/IntelMPI/lib:/data1/opt/MD/Linux-x86_64/IntelMPI/bin:/data1/opt/MD/Linux-x86_64/IntelMKL/lib
nohup mpdboot -n 1 -f ~/mpd.hosts -r ssh
nohup mpd &

/data1/opt/MD/Linux-x86_64/IntelMPI/bin/mpiexec -n 6 /data1/opt/MD/2.0/TaskServer/Tools/vasp5.3.3/Linux-x86_64/vasp_parallel
touch finished

Does anyone know why I need to add mpdboot and mpd to get this to work? I thought that they took care of the mpd daemon so I didn't have to. I am using mpiexec/run version 1.6.4 and mpd version 4.1. I know that the termination is not caused by a queue timeout, because I was able to submit day-long jobs before, and other people routinely run other (non-VASP) jobs that last for days.

Thanks so much for your help!
Stephen

Posted: Mon Jun 10, 2013 9:30 am
by alex
Hi Stephen,
you may be better off asking the MD guys and girls. It looks more like some Windows/MPI-related stuff, and they probably know best where the traps are hidden.

Cheers,

alex