Hello CROCO community,
I am running an interannual simulation that consistently crashes at the same timestep, regardless of the processor configuration I use. The model runs fine for 70,919 timesteps (to 2012-12-19 23:56:00) and then terminates with MPI/memory errors.
2012-12-19 23:40:00
70917 333.99444 3.684811804E-03 2.3211180E+01 2.3214865E+01 3.9092766E+15 0
2012-12-19 23:48:00
70918 334.00000 3.684760715E-03 2.3211185E+01 2.3214869E+01 3.9092767E+15 0
2012-12-19 23:56:00
70919 334.00556 3.684711058E-03 2.3211189E+01 2.3214873E+01 3.9092768E+15 0
[1763663303.000948] [cn1412:391219:0] mm_xpmem.c:137 UCX ERROR failed to attach xpmem apid 0x120005f833 offset 0x11fed000 length 12288: Cannot allocate memory
[1763663303.000971] [cn1412:391219:0] ucp_rkey.c:897 UCX ERROR failed to unpack remote key from remote md[6]: Input/output error
free(): double free detected in tcache 2
Program received signal SIGABRT: Process abort signal.
What I’ve tried:
- Different tile configurations (varying NP_XI and NP_ETA)
- Different numbers of nodes and processors (tested with 96, 64, and 48 processors)
- All configurations crash at nearly the same timestep.
System information:
- MPI implementation: Intel MPI with libfabric/UCX
- The error occurs in MPIDI_OFI_progress during PMPI_Waitall
Questions:
1. Are there any known issues with UCX/XPMEM in CROCO that I should be aware of?
2. What diagnostic steps would you recommend to identify whether this is a physics issue versus an MPI/system issue?
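For the physics-versus-MPI question, one check that can be made from the log alone is whether the diagnostic columns drift or become non-finite just before the crash. Below is a minimal Python sketch of that check. The column names (step, time in days, kinetic, potential, total, volume) are an assumption, since the log does not label them, and the helper functions are hypothetical, not part of CROCO.

```python
import math

# Assumed column order for the 7-token diagnostic rows; the log itself
# does not label these columns, so the names are illustrative only.
FIELDS = ("step", "time_days", "kinetic", "potential", "total", "volume")

SAMPLE_LOG = """\
2012-12-19 23:40:00
70917 333.99444 3.684811804E-03 2.3211180E+01 2.3214865E+01 3.9092766E+15 0
2012-12-19 23:48:00
70918 334.00000 3.684760715E-03 2.3211185E+01 2.3214869E+01 3.9092767E+15 0
2012-12-19 23:56:00
70919 334.00556 3.684711058E-03 2.3211189E+01 2.3214873E+01 3.9092768E+15 0
free(): double free detected in tcache 2
"""


def parse_diagnostics(text):
    """Return one dict per diagnostic row, skipping dates and error lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # A diagnostic row has 7 tokens and starts with an integer step.
        if len(parts) != 7 or not parts[0].isdigit():
            continue
        try:
            values = [float(p) for p in parts[1:6]]
        except ValueError:
            continue
        rows.append(dict(zip(FIELDS, [int(parts[0])] + values)))
    return rows


def all_finite(rows):
    """True if no diagnostic value is NaN or infinite."""
    return all(math.isfinite(r[f]) for r in rows for f in FIELDS[1:])


def max_relative_step_change(rows, field):
    """Largest relative change of `field` between consecutive rows."""
    changes = [
        abs(b[field] - a[field]) / abs(a[field])
        for a, b in zip(rows, rows[1:])
        if a[field] != 0
    ]
    return max(changes, default=0.0)


if __name__ == "__main__":
    rows = parse_diagnostics(SAMPLE_LOG)
    print("last step parsed:", rows[-1]["step"])
    print("all values finite:", all_finite(rows))
    for field in ("kinetic", "potential", "total", "volume"):
        print(field, "max relative change:", max_relative_step_change(rows, field))
```

If the values stay finite and change smoothly up to the last printed step, as they appear to in the excerpt above, a numerical blow-up looks less likely than a system, MPI, or I/O event at that step, although this alone does not rule out a physics problem.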
Any suggestions would be greatly appreciated!