AGRIF & XIOS : buffer size error?

Hi all,

I am trying to run an Agrif simulation while using XIOS for output.

First of all, this config runs without problems when XIOS is disabled. When XIOS is activated, it fails either during timestepping or after timestepping (while XIOS writes the .nc files).

The output shows:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 2697 RUNNING AT XXX
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

I am running XIOS in detached mode (see an extract of the iodef below):

  <context id="xios">
    <variable_definition>
      <variable_group id="buffer">
        <variable id="optimal_buffer_size" type="string">performance</variable>
        <variable id="buffer_size_factor" type="double">8</variable>
      </variable_group>

      <variable_group id="parameters" >
        <variable id="using_server" type="bool">true</variable>
        <variable id="info_level" type="int">100</variable>
        <variable id="print_file" type="bool">true</variable>
      </variable_group>
    </variable_definition>
  </context>

Since sufficiently reducing the number of fields written to the output files, or shortening the simulation time, lets the run complete successfully, it appears to me that I am hitting a buffer size error.

This is why I tried increasing the buffer_size_factor in iodef.xml, but without success. Increasing the number of dedicated XIOS servers did not seem to solve the problem either, and neither did using secondary servers (<variable id="using_server2" type="bool">true</variable>)…
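For completeness, the secondary-server attempt looked roughly like the fragment below. `using_server2` and `ratio_server2` are XIOS variables, but the ratio value here is only an illustrative guess, not a tested setting:

```xml
<context id="xios">
  <variable_definition>
    <variable_group id="parameters">
      <variable id="using_server" type="bool">true</variable>
      <!-- enable a second level of servers dedicated to file writing -->
      <variable id="using_server2" type="bool">true</variable>
      <!-- share of the server pool assigned to level-2 servers,
           as a percentage (illustrative value, adjust to your setup) -->
      <variable id="ratio_server2" type="int">50</variable>
    </variable_group>
  </variable_definition>
</context>
```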

Any suggestion, or working XIOS/AGRIF config files to compare with, will be much appreciated!

Thanks in advance,

Enzo

Hi,

For those who might face the same problem, the game changer in my case was to manually distribute the croco compute CPUs and the XIOS server CPUs across the machine. The idea is that the XIOS servers require a large amount of RAM to gather all the data, so if all the XIOS servers end up on the same node, that node may hit its memory limit even though the machine as a whole has memory to spare.

One can switch from:
mpirun -np 100 ./croco croco.in : -n 12 xios_server.exe

to:
mpirun -np 25 ./croco croco.in : -n 3 xios_server.exe : -np 25 ./croco croco.in : -n 3 xios_server.exe : -np 25 ./croco croco.in : -n 3 xios_server.exe : -np 25 ./croco croco.in : -n 3 xios_server.exe

This is a way to distribute the memory requirements on a machine with 28 cores per node: each group of 25 croco ranks plus 3 XIOS servers fills exactly one node.
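For larger or differently sized runs, the interleaved command line can be generated rather than typed by hand. A minimal sketch, using the group sizes from the example above (adjust `NGROUPS`, `CROCO_PER_GROUP`, and `XIOS_PER_GROUP` to your node size):

```shell
#!/bin/sh
# Build an interleaved mpirun command that places one group of croco
# ranks plus its XIOS servers per node, spreading the servers' memory
# load across the machine.
NGROUPS=4            # number of nodes / groups
CROCO_PER_GROUP=25   # 100 croco ranks in total
XIOS_PER_GROUP=3     # 12 xios servers in total

CMD="mpirun"
SEP=""
i=0
while [ "$i" -lt "$NGROUPS" ]; do
  CMD="$CMD$SEP -np $CROCO_PER_GROUP ./croco croco.in : -n $XIOS_PER_GROUP xios_server.exe"
  SEP=" :"
  i=$((i + 1))
done
echo "$CMD"
```

Running the script prints the same command line as the hand-written version above.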
