GPU Requirements for CROCO

Hi,

I have a couple of questions on GPU requirements for CROCO:

@cnguyen @nguc @cmazoyer

  1. Is it possible to run CROCO on GPU with FP32 (single precision) selected? How can we select FP32 over FP64?
  2. Commercial and workstation grade Nvidia cards have non-ECC VRAM. There are several software packages to catch memory errors for non-ECC GPU memory. Can CROCO utilise non-ECC error mitigation so it can be run on workstation GPUs?

I ask these questions because if the above can be addressed, CROCO on GPU can have great performance gains for a much wider user base.

For example, Oceananigans.jl provides both of the above options (an FP32 option and software error correction for non-ECC memory).

  3. Optionally, it would be great if other GPUs were supported in the future (e.g. Radeon through ROCm), since Nvidia prices have skyrocketed!

Regards

Konstantinos

Hi (Konstantinos_K),

1. To try FP32, change

#define DBLEPREC

to

/* #define DBLEPREC */

and, in jobcomp, remove -r8 from FFLAGS1.

Please share the compilation log.

Thank you!
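The two edits above can also be scripted. This is only a minimal sketch: it assumes the DBLEPREC define sits on its own line in a header whose path you pass in (the thread does not say which file holds it) and that -r8 appears as a separate word on the FFLAGS1 assignment in jobcomp. The function name use_fp32 is hypothetical.

```shell
# Hypothetical helper: switch a CROCO build from FP64 to FP32.
#   $1 = header containing "#define DBLEPREC" (which file is an assumption)
#   $2 = jobcomp script
# A backup with a .bak suffix is written next to each edited file.
use_fp32() {
    header=$1
    jobcomp=$2
    # Wrap the active define in a C comment: /* #define DBLEPREC */
    sed -i.bak -E \
        's|^([[:space:]]*#[[:space:]]*define[[:space:]]+DBLEPREC)[[:space:]]*$|/* \1 */|' \
        "$header"
    # Drop -r8, but only on lines that assign FFLAGS1
    sed -i.bak -E '/FFLAGS1=/ s/(^|[[:space:]"])-r8([[:space:]"]|$)/\1\2/' "$jobcomp"
}
```

Example call, with placeholder paths: use_fp32 path/to/header.h jobcomp. Diff each file against its .bak copy before recompiling.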

Thank you for the reply. I will try this weekend and report back here.

Konstantinos

@swapnil I tried both with #define DBLEPREC and with /* #define DBLEPREC */ but I get the same error:

OPERATING SYSTEM IS: Linux
file namelist_pisces exists in Run directory
Mustang namelist directory MUSTANG_NAMELIST exists
Checking COMPILEAGRIF…
Checking COMPILEMPI…
Checking COMPILEXIOS…
Checking COMPILEOASIS…
Checking COMPILEOMP…
Checking COMPILEOPENACC…
cpp -traditional -DLinux -DXLF -P -I/usr/include -ICROCOFILES/AGRIF_INC mpc.F > mpc_.f
nvfortran -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel -o mpc mpc_.f
cpp -traditional -DLinux -DXLF -P -I/usr/include -ICROCOFILES/AGRIF_INC cppcheck.F | ./mpc > cppcheck_.f
FORTRAN STOP
nvfortran -c -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel cppcheck_.f -o cppcheck.o
nvfortran -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel -o cppcheck cppcheck.o
cat cppdefs.h cppdefs_dev.h > mergcpp.txt
./cppcheck

This is CPPCHECK: Creating new version of check_switches1.F.

FORTRAN STOP
cpp -traditional -DLinux -DXLF -P -I/usr/include -ICROCOFILES/AGRIF_INC checkkwds.F | ./mpc > checkkwds_.f
FORTRAN STOP
nvfortran -c -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel checkkwds_.f -o checkkwds.o
nvfortran -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel -o checkkwds checkkwds.o
rm -f setup_kwds.F
./checkkwds

This is CHECKKWDS: Creating new version of “setup_kwds.F”.

FORTRAN STOP
cpp -traditional -DLinux -DXLF -P -I/usr/include -ICROCOFILES/AGRIF_INC cross_matrix.F | ./mpc > cross_matrix_.f
FORTRAN STOP
nvfortran -c -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel cross_matrix_.f -o cross_matrix.o
nvfortran -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel -o cross_matrix cross_matrix.o
./cross_matrix *.F90 *.F

This is CROSS_MATRIX: Creating new version of Make.depend.

FORTRAN STOP
cpp -traditional -DLinux -DXLF -P -I/usr/include -ICROCOFILES/AGRIF_INC srcscheck.F | ./mpc > srcscheck_.f
FORTRAN STOP
nvfortran -c -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel srcscheck_.f -o srcscheck.o
nvfortran -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel -o srcscheck srcscheck.o
rm -f check_srcs.F
cpp -traditional -DLinux -DXLF -P -I/usr/include -ICROCOFILES/AGRIF_INC insert_node.F > insert_node_.f1
python3 ./change_loops.py insert_node_.f1 insert_node_.tmp
cat insert_node_.tmp | ./mpc > insert_node_.f && \rm insert_node_.tmp
FORTRAN STOP
nvfortran -c -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel insert_node_.f -o insert_node.o
cpp -traditional -DLinux -DXLF -P -I/usr/include -ICROCOFILES/AGRIF_INC lenstr.F > lenstr_.f1
python3 ./change_loops.py lenstr_.f1 lenstr_.tmp
cat lenstr_.tmp | ./mpc > lenstr_.f && \rm lenstr_.tmp
FORTRAN STOP
nvfortran -c -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel lenstr_.f -o lenstr.o
cpp -traditional -DLinux -DXLF -P -I/usr/include -ICROCOFILES/AGRIF_INC partit.F > partit..f
nvfortran -c -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel partit..f -o partit.o
nvfortran -g -fast -r8 -i4 -mcmodel=medium -Mbackslash -I/usr/include -acc -Minfo=accel -o partit partit.o insert_node.o lenstr.o -L/usr/lib/x86_64-linux-gnu -lnetcdff -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -Wl,-z,now -lnetcdf -lnetcdf -lm
nvfortran-Error-Unknown switch: -flto=auto
nvfortran-Error-Unknown switch: -ffat-lto-objects
nvfortran-Error-Unknown switch: -flto=auto
gmake: *** [Makefile:260: partit] Error 1

Hi @Konstantinos_K,

This error is related to the netCDF link flags.

Please set the netCDF path directly in jobcomp: NETCDFLIB="-L/usr/lib/x86_64-linux-gnu -lnetcdff -lnetcdf -lm"

Then recompile and share the log (I hope the -flto=auto and -ffat-lto-objects errors are gone).

Thank you! I am waiting for your log. Sorry for the delay!
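In the log, the rejected switches sit between -lnetcdff and -lnetcdf on the link line, so they arrive together with the netCDF flags. If you would rather keep an automatically generated flag string than hard-code the path, one option is to filter out the GCC-only switches. A minimal sketch, assuming the string comes from something like nf-config --flibs (an assumption; the thread does not show how NETCDFLIB was originally set). The function name strip_lto_flags is hypothetical.

```shell
# Hypothetical helper: drop GCC-only LTO switches from a link-flag string
# and keep every other flag in its original order.
strip_lto_flags() {
    out=""
    for flag in $1; do
        case "$flag" in
            -flto|-flto=*|-ffat-lto-objects) ;;   # rejected by nvfortran
            *) out="${out:+$out }$flag" ;;
        esac
    done
    printf '%s\n' "$out"
}
```

In jobcomp this could be used as NETCDFLIB=$(strip_lto_flags "$(nf-config --flibs)"). The remaining -Wl,... options are passed through untouched.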

Okay, I set the netcdf path to the following in jobcomp and it worked:

NETCDFLIB="-L/usr/lib/x86_64-linux-gnu -lnetcdff -lnetcdf -lnetcdf -lm"

I have some interesting results.

  1. Compilation: It completes successfully both with #define DBLEPREC and -r8 (FP64) and without them (FP32). I attach the compilation logs from both cases.
  2. Run: I tested the Kelvin-Helmholtz test case in 3D mode, once with both OPENACC and MPI, and once with OPENACC only and nvhpc-nompi.

# define KH_INST

CPP options:

# undef  KH_INSTY
# define  KH_INST3D
# define OPENACC
# define MPI             (and second time with # undef MPI)
# define NBQ

Results for FP64:

13900H CPU only, with MPI (14 cores): run time 72 min

Laptop RTX 2000 Ada (similar to the RTX 4060 mobile but at a lower frequency), with MPI and OPENACC: run time 709 min (extrapolated)

RTX 2000 Ada, with OPENACC only and nvhpc-nompi: run time 65 min

Results for FP32:

When compiled with /* #define DBLEPREC */ and no -r8 flag, it compiles but the run fails (logs attached).

When compiled with /* #define DBLEPREC */ but with the -r8 flag kept, it compiles and runs with the same run time and debug log as FP64 (65 min), but the netCDF history file is corrupted.

In essence, I only managed to run on the GPU in FP64 mode. The small mobile GPU already beats the 14-core CPU in nvhpc-nompi mode (65 min vs 72 min). If FP32 turns out to be at least 10x faster, as the gap between FP64 and FP32 TFLOPS suggests, a high-end workstation GPU such as the RTX 5000 Ada could be up to 50x faster in FP32 than a workstation CPU!
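For reference, the ratios between the FP64 timings reported above can be worked out with a short awk call. This is only arithmetic on the three measured run times; the helper name ratio is hypothetical.

```shell
# Ratio of two run times in minutes, rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 72 65     # CPU with MPI vs GPU with OPENACC only  -> 1.11
ratio 709 65    # GPU with MPI + OPENACC vs GPU with OPENACC only -> 10.91
```

So in FP64 the GPU-only run is about 1.1 times faster than the 14-core CPU run, while adding MPI on top of OPENACC on the same GPU is roughly 11 times slower.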

I hope we can make the FP32 mode work to test this.

Regards

Konstantinos

GPU compile FP64 log.txt (238.4 KB)

GPU compile FP32 log.txt (239.5 KB)

Failed FP32 run log.txt (23.2 KB)

Hi (Konstantinos_K),

The failure is probably because WENO_Z is active. Look at how Eps is set in these two places and try the following:

master/OCEAN/step3d_t.F#L1314-L1319
master/OCEAN/step3d_t.F#L1201-L1209

#ifdef WENO_Z
# ifdef DBLEPREC
      Eps = 1.e-40   ! F64
# else
      Eps = 1.e-14   ! F32
# endif
#endif

or try:

Eps = EPSILON(Eps)**2   ! no DBLEPREC guard

Then, for the initial profile and perturbation, try replacing tanh with dtanh:

master/OCEAN/ana_initial.F#L440-L445

u(i,j,k,1)  = -du*tanh((z_r(1,1,k)-zu0)/hu0)
u(i,j,k,1)  = u(i,j,k,1) + 0.01*du *               ! add perturbation
     & (-2.*tanh((z_r(1,1,k)-zu0)/hup0))*
     & (1. -tanh((z_r(1,1,k)-zu0)/hup0)**2.)
     & *sin(2.*2.*pi/xl*xp(i,j))
     & *(1.+eps3D*(sin( 2.*4.*pi/el*yr(i,j))))

to

u(i,j,k,1)  = -du*dtanh((z_r(1,1,k)-zu0)/hu0)
u(i,j,k,1)  = u(i,j,k,1) + 0.01*du *                ! add perturbation
     & (-2.*dtanh((z_r(1,1,k)-zu0)/hup0))*
     & (1. -dtanh((z_r(1,1,k)-zu0)/hup0)**2.)
     & *sin(2.*2.*pi/xl*xp(i,j))
     & *(1.+eps3D*(sin( 2.*4.*pi/el*yr(i,j))))

master/OCEAN/ana_initial.F#L471-L474

wz(i,j,k,1) = -0.01*du*hup0*
     & (1.-tanh((z_w(1,1,k)-zu0)/hup0)**2)
     & *2.*2.*pi/xl*cos(2.*2.*pi/xl*xp(i,j))
     & *(1.+eps3D*(sin( 2.*4.*pi/el*yp(i,j))))

to

wz(i,j,k,1) = -0.01*du*hup0*
     & (1.-dtanh((z_w(1,1,k)-zu0)/hup0)**2)
     & *2.*2.*pi/xl*cos(2.*2.*pi/xl*xp(i,j))
     & *(1.+eps3D*(sin( 2.*4.*pi/el*yp(i,j))))

I know there are still many problems, but please check whether the run gets past the second time step without NaN.

Please share the log ( :grinning_face: BBL etc. are not implemented yet!).
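A quick way to see where NaN first shows up in a run log is sketched below. It assumes only that the log prints the text NaN (in any letter case) as a separate token when a diagnostic blows up; the function name first_nan is hypothetical.

```shell
# Hypothetical helper: print the first log line that contains NaN as a
# separate token (case-insensitive), prefixed with its line number.
# Exit status is 0 if such a line exists and 1 if the log is clean.
first_nan() {
    awk 'tolower($0) ~ /(^|[^a-z])nan([^a-z]|$)/ {
             print NR ": " $0; found = 1; exit
         }
         END { exit !found }' "$1"
}
```

Run it as first_nan run.log, where run.log is a placeholder for the saved model output. The reported line number shows whether the failure happens before or after the second time step.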

Thank you!

Hi,

Thank you for the suggestions. I have some free time tomorrow and I will try to test and report back.

Regards

Konstantinos


Hi, (Konstantinos_K)

You said the netCDF history file is corrupted. Here is what I suspect
(alternatively, run ncdump on the file and check for any messages related to the corruption, or share the output).
Please check the following calls, because the variable is W, not R:

        call fillvalue3d(work,ncidhis,hisW,
     &      vname(1,indxW),record,w3dvar,type)

to

        call fillvalue3d_w(work,ncidhis,hisW,
     &      vname(1,indxW),record,w3dvar,type)

and

         call fillvalue3d(work ,ncidavg,avgW,
     &        vname(1,indxW), record,w3dvar,type)

to

         call fillvalue3d_w(work ,ncidavg,avgW,
     &        vname(1,indxW), record,w3dvar,type)
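For the ncdump check mentioned above, a minimal sketch follows. It only tests whether ncdump -h can parse the file header, so it catches structural corruption but says nothing about the values stored in the file. The function name check_nc_header is hypothetical, and the history file name is whatever your run produces.

```shell
# Hypothetical helper: check that a netCDF file has a readable header.
# Returns 0 if ncdump can parse it, 1 if it cannot,
# and 2 if ncdump is not installed.
check_nc_header() {
    if ! command -v ncdump >/dev/null 2>&1; then
        echo "ncdump not found" >&2
        return 2
    fi
    if ncdump -h "$1" >/dev/null 2>&1; then
        echo "header OK: $1"
    else
        echo "cannot read header: $1" >&2
        return 1
    fi
}
```

If the header is readable but the contents still look wrong, the full ncdump output for the W variable is the next thing to inspect.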

Also, please check the KH_INST3D settings in param.h
(line 42; I don't know how they give 1 m resolution).
Thank you! :+1:

Best regards,