Handling CMEMS Data

Author

Clyde Blanco

The Copernicus Marine Service (CMEMS) offers a wide range of environmental data covering the different aspects of the ocean (i.e., physical, biogeochemical, and sea ice), at both regional and global scales. The service is free, and anyone with a registered account can access the data. You can create a new account here. In this section, you will learn how to download data from CMEMS through R, process it, and couple it to the catch data.

📋Prerequisite: You need to have a CMEMS account to follow this section.

Downloading NetCDF Files

There are several ways to extract data from CMEMS. You can browse their data store and manually download the data you need: select a product and then click the download icon () just beside the product title. The advantage of this option is that you can interactively and easily subset the data you want. However, it has a limit of approximately 2 GB per download request. The other option, the Copernicus Marine Toolbox, has no such limit (technically, it is limited only by the available storage on your computer). If you have a large catch dataset that spans multiple years, the second option is more suitable. Although we will be running the code in R, this part requires Python.

📋Prerequisite: Both R and Python should be installed on your computer.

First, you need to install the Copernicus Marine Toolbox, which can be done through the command line. You can run the code below from R.

system("python -m pip install copernicusmarine")

Depending on your knowledge of the target species you are working with, you can choose different variables from CMEMS. In our case, we selected the following variables: bottom temperature, currents (northward and eastward components), pH, dissolved oxygen (DO), sea surface height, salinity, net primary production, and significant wave height. Since our target species are demersal, we wanted to use environmental parameter values at seafloor level. However, only temperature is available at this depth; the other variables are provided at multiple depth levels. The process for extracting seafloor-level values is outlined in the next subsection. To optimize memory usage, we downloaded pH, DO, and salinity at multiple depths, while obtaining the remaining variables at a single depth (surface or seafloor level, as appropriate). Refer to the code below to download CMEMS data. The variable and datasetID values can be found on the download page of each CMEMS product. Note that setting the maximum depth to the deepest level available in CMEMS would only inflate the file size; choose the deepest level relevant to your study area instead.

library(tibble)

USERNAME <- "YourUsername"                             # specify your CMEMS username                                                           
PASSWORD <- "YourPassword"                             # specify your CMEMS password

out_dir <- paste0("\"",                                # specify the folder to save the NetCDF files
                  getwd(),                                                     
                  "/Data/CMEMS",
                  "\"") 
                  
lon <- c(min(catch_data$Longitude),
         max(catch_data$Longitude))                    # min and max Longitude of your data
lat <- c(min(catch_data$Latitude),
         max(catch_data$Latitude))                     # min and max Latitude of your data

date_min <- min(catch_data$Date) - 60                  # specify the earliest date
date_max <- max(catch_data$Date)                       # specify the latest date


CMEMS <- tribble(                                      # create a table that will contain the dataset info

  ~variable,    ~common_name,    ~out_name,           ~depth,            ~z_levels,     ~datasetID,
  "bottomT",    "bottomT",       "bottomT",           c("0.","0."),       "one",         "cmems_mod_nws_phy-bottomt_my_7km-2D_P1D-m",    # seafloor temperature
  "zos",        "ssh",           "ssh",               c("0.","0."),       "one",         "cmems_mod_nws_phy-ssh_my_7km-2D_P1D-m",        # sea surface height above geoid
  "nppv",       "pp",            "pp",                c("0.","0."),       "one",         "cmems_mod_nws_bgc-pp_my_7km-3D_P1D-m",         # net primary production
  "uo",         "current",       "current_u",         c("0.","0."),       "one",         "cmems_mod_nws_phy-uv_my_7km-3D_P1D-m",         # eastward component of water current
  "vo",         "current",       "current_v",         c("0.","0."),       "one",         "cmems_mod_nws_phy-uv_my_7km-3D_P1D-m",         # northward component of water current
  "so",         "salinity",      "salinity",          c("0.","1000."),    "multi",       "cmems_mod_nws_phy-s_my_7km-3D_P1D-m",          # salinity
  "o2",         "DO",            "DO",                c("0.","1000."),    "multi",       "cmems_mod_nws_bgc-o2_my_7km-3D_P1D-m",         # dissolved oxygen
  "ph",         "pH",            "pH",                c("0.","1000."),    "multi",       "cmems_mod_nws_bgc-ph_my_7km-3D_P1D-m",         # pH
  "VHM0",       "wave",          "wave",              c("0.","0."),       "one",         "MetO-NWS-WAV-RAN"                              # significant wave height
  
)


for(i in 1:nrow(CMEMS)){                              # use a loop to download all data                                                   
  
  command <- paste0("copernicusmarine subset --username ", USERNAME,
                    " --password ", PASSWORD,
                    " -i ", CMEMS[i,]$datasetID,
                    " -x ", lon[1],                        # minimum longitude
                    " -X ",lon[2],                         # maximum longitude    
                    " -y ", lat[1]," -Y ",lat[2],          # latitude
                    " -z ", unlist(CMEMS[i,]$depth)[1],    # minimum depth
                    " -Z ", unlist(CMEMS[i,]$depth)[2],    # maximum depth
                    " -t ", date_min,                      # minimum date                    
                    " -T ", date_max,                      # maximum date
                    " -v ", CMEMS[i,]$variable,            # variable to download
                    " -o ", out_dir,                       # folder to save the downloaded file
                    " -f ", CMEMS[i,]$out_name,            # output filename
                    ".nc")
  
  system(command, input="Y")  
  
}
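To see what the loop actually sends to the shell, here is the command assembled for the first table row, with dummy credentials and illustrative coordinates (everything in this sketch is a stand-in, not real account details):

```r
# Illustrative only: dummy credentials and rounded coordinates
USERNAME <- "jdoe"
PASSWORD <- "secret"

command <- paste0("copernicusmarine subset --username ", USERNAME,
                  " --password ", PASSWORD,
                  " -i ", "cmems_mod_nws_phy-bottomt_my_7km-2D_P1D-m",
                  " -x ", -15, " -X ", 13,        # longitude bounds
                  " -y ", 43,  " -Y ", 62,        # latitude bounds
                  " -z ", "0.", " -Z ", "0.",     # depth bounds
                  " -t ", "2023-09-07",           # start date
                  " -T ", "2023-11-06",           # end date
                  " -v ", "bottomT",              # variable
                  " -o ", "\"./Data/CMEMS\"",     # output folder
                  " -f ", "bottomT", ".nc")       # output filename
```

Printing command before calling system() is a quick way to check a request; note how paste0() glues the -f value and ".nc" together into bottomT.nc.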

You might have noticed that the minimum date we specified is 60 days before the earliest record in the catch data. This allows us to test whether lagged values, or the change in values, of the environmental variables can be good predictors in the model. Moreover, we set the maximum depth at 1000 m to capture the deepest level in our study area, because it is not possible to specify the maximum depth per location during data download.
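The date arithmetic above relies on R's Date class, where subtracting a number shifts the date by that many days. A minimal, hypothetical example (catch_dates stands in for catch_data$Date):

```r
catch_dates <- as.Date(c("2023-09-10", "2023-10-01"))   # stand-in for catch_data$Date
lag_dates   <- catch_dates - 60                         # each record's 60-day lagged date
range_min   <- min(catch_dates) - 60                    # earliest date to request from CMEMS
lag_dates                                               # "2023-07-12" "2023-08-02"
```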

Extracting Seafloor Values

Extracting seafloor-level values from the downloaded NetCDF files relies on how each layer is named and time-stamped. Let's open a NetCDF file and check its attributes.

library(terra)

DO_raster <- rast("./Data/CMEMS/DO.nc")

DO_raster
class       : SpatRaster 
dimensions  : 285, 252, 1159  (nrow, ncol, nlyr)
resolution  : 0.11111, 0.06667  (x, y)
extent      : -14.94449, 13.05523, 42.96681, 61.96776  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 
source      : DO.nc 
varname     : o2 (Mole Concentration of Dissolved Oxygen in Sea Water) 
names       : o2_depth=0_1, o2_depth=3_1, o2_de~=10_1, o2_de~=15_1, o2_de~=20_1, o2_de~=30_1, ... 
unit        :     mmol m-3,     mmol m-3,    mmol m-3,    mmol m-3,    mmol m-3,    mmol m-3, ... 
time        : 2023-09-07 to 2023-11-06 UTC 

The attributes above are from a NetCDF file containing dissolved oxygen values. Each layer has 285 by 252 cells, and there are 1159 layers. The resolution of each cell is 0.11111° by 0.06667°. The data range from 2023-09-07 to 2023-11-06 UTC. Let's subset the data to a single day and check the names of the layers.

date <- as.POSIXct("2023-09-07", tz="UTC")

subset_raster <- DO_raster[[terra::time(DO_raster)==date]] 

plot(subset_raster)

As you can see, the layer names follow the format variable_depth_numbering (e.g., o2_depth=0_1). The first part is the variable name, the middle part is the depth level in meters, and the last part is the order of the layers based on the time step, in this case daily; layers from the same time step share the same number. To extract the seafloor values, the raster can be divided into separate daily rasters and converted into tabular data, as shown in the code below.
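The depth can be pulled out of such a layer name with a lookbehind regular expression; the pipeline below does this with stringr::str_extract(), and the base-R equivalent looks like:

```r
nm <- c("o2_depth=0_1", "o2_depth=1000_1")                         # example layer names
# "(?<=_depth=)" matches right after "_depth=", "[^_]+" grabs the depth digits
depths <- regmatches(nm, regexpr("(?<=_depth=)[^_]+", nm, perl = TRUE))
depths                                                             # "0" "1000"
```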

library(dplyr)
library(tidyr)
library(stringr)
library(kableExtra)

time <- unique(terra::time(subset_raster))                                # get timestamp
extent <- terra::ext(subset_raster)                                       # get extent of the raster
x <- terra::as.data.frame(subset_raster, xy = TRUE, na.rm=FALSE)          # convert raster into a dataframe, values in each depth are separated in columns
x y o2_depth=0_1 o2_depth=3_1 o2_depth=10_1 o2_depth=15_1 o2_depth=20_1 o2_depth=30_1 o2_depth=50_1 o2_depth=75_1 o2_depth=100_1 o2_depth=125_1 o2_depth=150_1 o2_depth=200_1 o2_depth=250_1 o2_depth=300_1 o2_depth=400_1 o2_depth=500_1 o2_depth=600_1 o2_depth=750_1 o2_depth=1000_1
-14.88894 61.93443 247.79 247.82 247.83 247.66 247.33 246.90 232.25 205.30 200.62 198.27 201.07 203.29 202.82 200.23 197.48 193.45 189.51 184.04 178.62
-14.77783 61.93443 245.95 245.97 245.96 245.83 245.59 245.16 230.79 201.87 195.80 196.08 198.95 204.13 203.53 202.49 200.40 201.73 191.59 183.50 171.19
-14.66672 61.93443 246.64 246.67 246.71 246.70 246.68 246.30 224.00 192.92 191.72 191.84 193.77 201.97 203.19 201.83 201.71 203.68 193.39 183.03 165.03
-14.55561 61.93443 246.67 246.70 246.72 246.70 246.68 246.40 234.41 213.42 201.61 202.78 205.50 210.09 209.73 209.88 209.66 208.28 202.76 192.44 176.61
-14.44450 61.93443 245.66 245.68 245.69 245.67 245.63 245.54 230.98 216.54 214.81 217.22 218.28 219.88 218.29 216.67 214.30 212.17 209.00 198.67 178.64
-14.33339 61.93443 245.11 245.13 245.14 245.11 245.03 245.01 230.77 213.56 213.74 215.08 215.99 218.44 219.34 219.59 218.88 217.78 214.43 198.26 181.08
depth_levels <- x %>%                                                     # get names of the columns that contain depth values
  select(-any_of(c("x","X","y","Y"))) %>% 
  colnames()

x <-  x %>%
  dplyr::filter(dplyr::if_any(.cols = !!depth_levels,
                              .fns = ~ !is.na(.x))) %>%                   # remove rows that have NAs across depth levels (i.e., these are cells that are on land)
  dplyr::rename_with(., ~ stringr::str_extract(.x,"(?<=_depth=)[^_]+"),
                     tidyselect::contains("_depth"))                      # rename the depth columns with just numbers
x y 0 3 10 15 20 30 50 75 100 125 150 200 250 300 400 500 600 750 1000
-14.88894 61.93443 247.79 247.82 247.83 247.66 247.33 246.90 232.25 205.30 200.62 198.27 201.07 203.29 202.82 200.23 197.48 193.45 189.51 184.04 178.62
-14.77783 61.93443 245.95 245.97 245.96 245.83 245.59 245.16 230.79 201.87 195.80 196.08 198.95 204.13 203.53 202.49 200.40 201.73 191.59 183.50 171.19
-14.66672 61.93443 246.64 246.67 246.71 246.70 246.68 246.30 224.00 192.92 191.72 191.84 193.77 201.97 203.19 201.83 201.71 203.68 193.39 183.03 165.03
-14.55561 61.93443 246.67 246.70 246.72 246.70 246.68 246.40 234.41 213.42 201.61 202.78 205.50 210.09 209.73 209.88 209.66 208.28 202.76 192.44 176.61
-14.44450 61.93443 245.66 245.68 245.69 245.67 245.63 245.54 230.98 216.54 214.81 217.22 218.28 219.88 218.29 216.67 214.30 212.17 209.00 198.67 178.64
-14.33339 61.93443 245.11 245.13 245.14 245.11 245.03 245.01 230.77 213.56 213.74 215.08 215.99 218.44 219.34 219.59 218.88 217.78 214.43 198.26 181.08
x <- x %>%
   tidyr::pivot_longer(cols = -c(x,y),      # pivot into long format, all parameter values in one column                                  
                      names_to = "depth",
                      values_to = "z")
x y depth z
-14.88894 61.93443 0 247.79
-14.88894 61.93443 3 247.82
-14.88894 61.93443 10 247.83
-14.88894 61.93443 15 247.66
-14.88894 61.93443 20 247.33
-14.88894 61.93443 30 246.90
x <- x %>%
  dplyr::mutate(depth = as.numeric(depth)) %>%                 
  dplyr::group_by(x,y) %>%                            # group by grid cell
  tidyr::drop_na() %>%                                # remove NAs
  dplyr::arrange(desc(depth), .by_group=TRUE) %>%     # arrange depth from deepest to shallowest
  dplyr::slice_head() %>%                             # only extract the row with the deepest level
  dplyr::select(-depth) 
x y z
-14.88894 43.00015 111.25
-14.88894 43.06682 118.13
-14.88894 43.13349 121.47
-14.88894 43.20016 117.86
-14.88894 43.26683 119.71
-14.88894 43.33350 115.99
x <- terra::rast(x, type="xyz", crs="epsg:4326")    # convert xyz dataframe into a SpatRaster object
terra::time(x) <- time                              # label the DateTime based from the original SpatRaster
names(x) <- time
x <- terra::extend(x,extent)                        # extend extent based on the original SpatRaster

plot(x)
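The core of the pipeline — keep, for every grid cell, the value at the deepest non-NA level — can be reduced to a base-R toy example (df is hypothetical data, not CMEMS output):

```r
df <- data.frame(x     = c(1, 1, 1, 2, 2, 2),
                 y     = 0,
                 depth = c(0, 10, 20, 0, 10, 20),
                 z     = c(5, 6, NA, 7, 8, 9))       # NA: level below the seafloor
df <- df[!is.na(df$z), ]                             # drop levels with no data
df <- df[order(df$x, df$y, -df$depth), ]             # deepest level first per cell
deepest <- df[!duplicated(df[c("x", "y")]), c("x", "y", "z")]
deepest$z                                            # 6 9
```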

Now we can combine all the steps above into one function and apply it to the downloaded NetCDF files.

ValueAtDeepest <- function(x){                                            # x  is a SpatRaster object (several depth levels in each time step)
   time <- unique(terra::time(x))                                          # get unique time DateTime
   extent <- terra::ext(x)                                                 # get extent of the SpatRaster
   x = terra::as.data.frame(x, xy = TRUE, na.rm=FALSE)                     # convert SpatRaster into a data frame, values in each depth are separated in columns
   
   depth_levels <- x %>%                                                   # get names of the columns that contain depth values
     select(-any_of(c("x","X","y","Y"))) %>% 
     colnames()
        
   x <-  x %>%
     dplyr::filter(dplyr::if_any(.cols = !!depth_levels,
                                 .fns = ~ !is.na(.x))) %>%                 # remove rows that have NAs across depth levels
     dplyr::rename_with(., ~ stringr::str_extract(.x,"(?<=_depth=)[^_]+"),
                        tidyselect::contains("_depth")) %>%
     tidyr::pivot_longer(cols = -c(x,y),                                   # pivot into long format, all parameter values in one column                                  
                         names_to = "depth",
                         values_to = "z") %>%
     dplyr::mutate(depth = as.numeric(depth))
        
     gc()
        
     x <-  x %>%
       dplyr::group_by(x,y) %>%                                             # group by grid cell
       tidyr::drop_na() %>%                                                 # remove NAs
       dplyr::arrange(desc(depth), .by_group=TRUE) %>%                      # arrange depth from deepest to shallowest
       dplyr::slice_head() %>%                                              # only extract the row with the deepest level
       dplyr::select(-depth) 
        
     x <- terra::rast(x, type="xyz", crs="epsg:4326")                       # convert xyz dataframe into a SpatRaster object
     terra::time(x) <- time                                                 # label the DateTime based from the original SpatRaster
     names(x) <- time
     x <- terra::extend(x,extent)                                           # extend extent based on the original SpatRaster
     
     return(x)
}

The function above works on individual time steps, so running it inside a plain for loop would be slow because the processing is sequential. You can use parallel processing instead to speed things up. Since we have data spanning 2006 to 2024, we created a nested loop, with the outer for loop working through the different NetCDF files and the inner for loop breaking the task down by year. Parallel processing is implemented with foreach, nested within the inner for loop. The function is also defined inside the foreach call rather than outside the loops, because objects in the Global Environment are sometimes not transferred to the individual parallel workers.
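The workflow below uses foreach with a doParallel backend; the underlying pattern (create a cluster, dispatch work, collect results, stop the cluster) is the same as in base R's parallel package, shown here as a minimal sketch:

```r
library(parallel)

cl  <- makeCluster(2)                       # start two worker processes
res <- parLapply(cl, 1:4, function(k) k^2)  # dispatch one task per element
stopCluster(cl)                             # always release the workers

unlist(res)                                 # 1 4 9 16
```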

library(terra)
library(lubridate)
library(doParallel)
library(foreach)

folder <- c("Salinity","Oxygen","pH")                                           # specify the folder names

for(i in 1:length(folder)){                                                     # outer loop working through each folder
  
  start <- Sys.time()                                                           # time tracking is used to get an overview on the processing time of each NetCDF file
  
  path <- unlist(list.files(path = paste0("./Data/Raw/Parameters/",folder[i]),  # extract the path of the NetCDF files
                            pattern='.nc$', all.files=TRUE, full.names=TRUE))
  
  raster <- rast(path)                                                          # import NetCDF as SpatRaster
  
  year <- as.character(terra::time(raster)) %>% substr(.,1,4) %>% unique()
  rastList <- vector(mode="list", length=length(year))                          # create a list that will contain yearly processed rasters 
  
  for(j in 1:length(year)){                                                     # inner loop that processes the raster each year
    
    cat(sprintf("\033[33;1m%-1s\033[0m %-12s \033[33mExtracting seafloor values from\033[0m\033[36;1m %-s %-s \033[0m\033[33mfile...\033[0m\r",
                "i","[Processing]",basename(path),year[j]))
    
    date_min = ymd(paste0(year[j],"0101"), tz="UTC")                                                        
    date_max = ymd(paste0(year[j],"1231"), tz="UTC")                        
    dateRange <- seq(date_min,date_max, by = "day")                             # generate a range of dates because foreach works on a daily time step
    
    dates <- unique(terra::time(raster[[terra::time(raster) %in% dateRange]]))  # this vector is needed because the NetCDF files sometimes contain missing dates
    
    varname <- paste0("bottom_",unique(varnames(raster)))                       # specify variable name to be used in writing the processed NetCDF file
    longname <- paste0("Seafloor ",tolower(unique(longnames(raster))))          # specify long variable name to be used in writing the processed NetCDF file
    unit <- unique(units(raster))                                               # specify variable unit to be used in writing the processed NetCDF file
    
    n_cores <- 8                                                                # specify the number of parallel workers
    cl <- makeCluster(n_cores)
    registerDoParallel(cl)
    
    
    processed <- foreach(k = 1:length(dates), .export=c("path"), .packages = c("terra","Rcpp","dplyr","stringr","tidyr","tidyselect")) %dopar% {
      
      ValueAtDeepest <- function(x){                                                  
        
        time <- unique(terra::time(x))                                                
        extent <- terra::ext(x)                                                       
        x = terra::as.data.frame(x, xy = TRUE, na.rm=FALSE)                           
        
        depth_levels <- x %>%
          select(-any_of(c("x","X","y","Y"))) %>% 
          colnames()
        
        x <-  x %>%
          dplyr::filter(dplyr::if_any(.cols = !!depth_levels,
                                      .fns = ~ !is.na(.x))) %>%                                
          dplyr::rename_with(., ~ stringr::str_extract(.x,"(?<=_depth=)[^_]+"),
                             tidyselect::contains("_depth")) %>%
          tidyr::pivot_longer(cols = -c(x,y),                                                                          
                              names_to = "depth",
                              values_to = "z") %>%
          dplyr::mutate(depth = as.numeric(depth))
        
        gc()
        
        x <-  x %>%
          dplyr::group_by(x,y) %>%                                                    
          tidyr::drop_na() %>%                                                        
          dplyr::arrange(desc(depth), .by_group=TRUE) %>%                             
          dplyr::slice_head() %>%                                                     
          dplyr::select(-depth) 
        
        x <- terra::rast(x, type="xyz", crs="epsg:4326")                             
        terra::time(x) <- time                                                       
        names(x) <- time
        x <- terra::extend(x,extent)                                                  
        
        return(x)
        
      }
      
      raster <- rast(path)                                              
      
      subRaster <- raster[[terra::time(raster)==dates[k]]]                      # subset raster per day
      
      bottom_values <- ValueAtDeepest(subRaster) %>%                            # apply the custom function
        terra::focal(x = ., w = 3, fun="mean",                       
                     na.policy="only", na.rm=TRUE) %>%                          # calculate "moving window" values to fill gaps near the coast
        terra::wrap()                                                           # pack processed SpatRaster so it can be imported back to the Global Environment
      
      return(bottom_values)
      
    }
    
    stopCluster(cl)                                                             # stop parallel backend
    
    processed <- lapply(processed,unwrap)                                       # unwrap packaged SpatRaster in the Global Environment
    rastList[[j]] <- rast(processed)                                            # append SpatRaster into a list
    
  }
  
  final_raster <- rast(rastList)                                                # convert list of SpatRasters into one SpatRaster object
  
  
  terra::writeCDF(final_raster,                                                 # write processed SpatRaster into NetCDF file
                  overwrite = TRUE,
                  varname = varname,
                  longname = longname,
                  unit = unit,
                  filename = paste0("./Data/Processed/Parameters/processed_",
                                    basename(path)))
  
  end <- Sys.time()
  timeDiff <- round(difftime(end,start, units="hours"), digits=4)
  
  cat(sprintf("\033[32;1m%-1s\033[0m %-12s %-s seafloor values extracted. Elapsed time is %-s hours. \033[0m\n",
              "✔","[Completed]",basename(path),timeDiff))                       # print processing update
  
  file.remove(path)                                                             # delete raw file to save storage space
}
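Inside the loop, terra::focal() with na.policy = "only" fills NA cells (e.g., gaps near the coast) with the mean of their 3 × 3 neighbourhood while leaving existing values untouched. A base-R toy illustration of that behaviour (fill_na_3x3 is a hypothetical helper, not part of terra):

```r
m <- matrix(c(1, NA, 3,
              4,  5, 6,
              7,  8, 9), nrow = 3, byrow = TRUE)

fill_na_3x3 <- function(m) {                          # fill NAs with the 3x3 neighbourhood mean
  out <- m
  for (i in seq_len(nrow(m))) for (j in seq_len(ncol(m))) {
    if (is.na(m[i, j])) {
      ri <- max(1, i - 1):min(nrow(m), i + 1)         # rows of the neighbourhood
      cj <- max(1, j - 1):min(ncol(m), j + 1)         # columns of the neighbourhood
      out[i, j] <- mean(m[ri, cj], na.rm = TRUE)      # non-NA cells stay as they are
    }
  }
  out
}

fill_na_3x3(m)[1, 2]                                  # mean of 1, 3, 4, 5, 6 = 3.8
```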

Appending Variables to Catch Data

text here