8.1 Application Packaging

Author

Anthony Van De Sompele and Clyde Blanco

Published

March 26, 2025


At the end of the Machine Learning section, we saved the trained model as an RDS file, which contains the tidymodels workflow and its associated metadata. To generate predictions, we need to create an R script that will download and process the input data, pass it through the model, and save the predictions as NetCDF files. By combining this script with the model, we can package them together for deployment, allowing the process to run autonomously in any environment. This can be achieved using Docker for seamless, isolated execution.

Make sure you have the following directory structure before creating the Docker image. The custom-base-r folder will be used to build the base image, while fish_suitability will be used to build the final image. A small helper sketch for creating these folders from R follows the tree.

Docker_Project
├── custom-base-r
│   ├── install_packages.R
│   └── Dockerfile
└── fish_suitability
    ├── Data
    │   ├── Processed
    │   │   └── Parameters
    │   │       └── bathymetry_mean.tif
    │   └── Raw
    │       └── Parameters
    ├── Models
    │   └── workflow_with_metadata.rds
    ├── Scripts
    │   └── get_model_predictions.R
    ├── Outputs
    ├── Dockerfile
    └── requirements.txt
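If you prefer to set up this layout programmatically, the folders can be created from R. This is only a convenience sketch (not part of the original workflow) and assumes Docker_Project is the working directory:

dirs <- c("custom-base-r",
          "fish_suitability/Data/Processed/Parameters",
          "fish_suitability/Data/Raw/Parameters",
          "fish_suitability/Models",
          "fish_suitability/Scripts",
          "fish_suitability/Outputs")
invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))     # create the nested folders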

Get Model Predictions R Script

To retrieve predictions from the model, we must first download the required input data. Since our model was trained using the reanalysis ocean physics and biogeochemistry products from CMEMS, we will use the corresponding forecast variables. The process will be similar to the code used in the Handling CMEMS Data section. Although not strictly necessary, it’s helpful to insert print statements between code blocks in the script. These can provide progress updates during the execution of the Docker container or assist in debugging if the execution fails.

Create CMEMS Data Directory

In the code below, we create a directory (lookup table) of the variables to download from CMEMS, based on the initial set of variables we used for model fitting. As you may have noticed, there are two versions of each variable: NWS and IBI. These refer to the Northwest Shelf (NWS) and Iberian-Biscay-Irish (IBI) products, respectively. We include both products because the spatial coverage of either one alone is insufficient. We also specify the minimum date for each variable, which depends on the lag days selected for that variable, and indicate whether we need surface, seafloor, or multilevel (to be converted to seafloor) data. Listing all of this information ensures that the code can be reused even if the model’s selected variables change during feature selection.

cat("Downloading input data from CMEMS.\n")

USERNAME <- "YourUsername"                                                      # Specify CMEMS username                                                           
PASSWORD <- "YourPassword"                                                      # Specify CMEMS password


out_dir <- paste0("\"",                                                         # Specify the folder to save the NetCDF files
                  getwd(),                                                     
                  "/Data/Raw/Parameters",
                  "\"")  

# Time coverage
date_max <- Sys.Date() + 10                                                     # Specify dates for forecasting

library(dplyr)

CMEMS <- tribble(
  ~variable,    ~common_name,    ~out_name,       ~lon,          ~lat,         ~depth,            ~z_levels,   ~date_min,       ~datasetID,
  "bottomT",    "bottomT",       "bottomT_NWS",   c(-15, 13),    c(43, 62),    c("0.5","0.5"),    "one",      Sys.Date()-14,   "cmems_mod_nws_phy_anfc_0.027deg-3D_P1D-m", 
  "zos",        "ssh",           "ssh_NWS",       c(-15, 13),    c(43, 62),    c("0.5","0.5"),    "one",      Sys.Date()-2,    "cmems_mod_nws_phy_anfc_0.027deg-3D_P1D-m",
  "nppv",       "pp",            "pp_NWS",        c(-15, 13),    c(43, 62),    c("0.5","0.5"),    "one",      Sys.Date()-60,   "cmems_mod_nws_bgc_anfc_0.027deg-3D_P1D-m",
  "uo",         "current",       "current_u_NWS", c(-15, 13),    c(43, 62),    c("0.5","0.5"),    "one",      Sys.Date()-2,    "cmems_mod_nws_phy_anfc_0.027deg-3D_P1D-m",
  "vo",         "current",       "current_v_NWS", c(-15, 13),    c(43, 62),    c("0.5","0.5"),    "one",      Sys.Date()-2,    "cmems_mod_nws_phy_anfc_0.027deg-3D_P1D-m",
  "so",         "salinity",      "salinity_NWS",  c(-15, 13),    c(43, 62),    c("0.","1000."),   "multi",    Sys.Date()-14,   "cmems_mod_nws_phy_anfc_0.027deg-3D_P1D-m",   
  "o2",         "DO",            "DO_NWS",        c(-15, 13),    c(43, 62),    c("0.","1000."),   "multi",    Sys.Date()-14,   "cmems_mod_nws_bgc_anfc_0.027deg-3D_P1D-m",
  "ph",         "pH",            "pH_NWS",        c(-15, 13),    c(43, 62),    c("0.","1000."),   "multi",    Sys.Date()-14,   "cmems_mod_nws_bgc_anfc_0.027deg-3D_P1D-m",
  "VHM0",       "wave_mean",     "wave_NWS",      c(-15, 13),    c(43, 62),    c("0.5","0.5"),    "one",      Sys.Date()-2,    "cmems_mod_nws_wav_anfc_0.027deg_PT1H-i",
  "bottomT",    "bottomT",       "bottomT_IBI",   c(-9.9,-0.4),  c(42.5,47),   c("0.5","0.5"),    "one",      Sys.Date()-14,   "cmems_mod_ibi_phy_anfc_0.027deg-3D_P1D-m", 
  "zos",        "ssh",           "ssh_IBI",       c(-9.9,-0.4),  c(42.5,47),   c("0.5","0.5"),    "one",      Sys.Date()-2,    "cmems_mod_ibi_phy_anfc_0.027deg-3D_P1D-m", 
  "nppv",       "pp",            "pp_IBI",        c(-9.9,-0.4),  c(42.5,47),   c("0.5","0.5"),    "one",      Sys.Date()-60,   "cmems_mod_ibi_bgc_anfc_0.027deg-3D_P1D-m",
  "uo",         "current",       "current_u_IBI", c(-9.9,-0.4),  c(42.5,47),   c("0.5","0.5"),    "one",      Sys.Date()-2,    "cmems_mod_ibi_phy_anfc_0.027deg-3D_P1D-m", 
  "vo",         "current",       "current_v_IBI", c(-9.9,-0.4),  c(42.5,47),   c("0.5","0.5"),    "one",      Sys.Date()-2,    "cmems_mod_ibi_phy_anfc_0.027deg-3D_P1D-m", 
  "so",         "salinity",      "salinity_IBI",  c(-9.9,-0.4),  c(42.5,47),   c("0.","1000."),   "multi",    Sys.Date()-14,   "cmems_mod_ibi_phy_anfc_0.027deg-3D_P1D-m",    
  "o2",         "DO",            "DO_IBI",        c(-9.9,-0.4),  c(42.5,47),   c("0.","1000."),   "multi",    Sys.Date()-14,   "cmems_mod_ibi_bgc_anfc_0.027deg-3D_P1D-m",
  "ph",         "pH",            "pH_IBI",        c(-9.9,-0.4),  c(42.5,47),   c("0.","1000."),   "multi",    Sys.Date()-14,   "cmems_mod_ibi_bgc_anfc_0.027deg-3D_P1D-m",
  "VHM0",       "wave_mean",     "wave_IBI",      c(-9.9,-0.4),  c(42.5,47),   c("0.5","0.5"),    "one",      Sys.Date()-2,    "cmems_mod_ibi_wav_anfc_0.027deg_PT1H-i"
)

Download CMEMS Data

Remember that we have also saved a list of predictors in the model’s metadata (see this subsection). We can extract this list from the metadata and use it to filter the CMEMS directory, ensuring that only the necessary variables are downloaded. This approach prevents the retrieval of unnecessary CMEMS data, optimizing storage and processing efficiency.

trained_workflow <- readRDS("./Models/workflow_with_metadata.rds")  # read trained model and its metadata

predictors <- unlist(trained_workflow[["metadata"]][["error"]]$Predictors)     # retrieve the predictors of the model

CMEMS_var <- predictors %>%                                                     # remove prefix labels (e.g., lag1_, lag2_, lag7_)
  sub("^[^_]*[0-9]+_", "",.) 

if("current" %in% CMEMS_var){                                                   # add current_u and current_v to the list if current is one of the predictors
  CMEMS_var <- c(CMEMS_var,"current_u","current_v")
}

CMEMS_var <- unique(CMEMS_var)
CMEMS_var <- CMEMS_var[!grepl("time_trend|depth",CMEMS_var)]                    # remove non-CMEMS variables

CMEMS <- CMEMS %>%
  filter(common_name %in% CMEMS_var)


for(i in 1:nrow(CMEMS)){                                                  
  
  command <- paste0("copernicusmarine subset",
                    " --username ",USERNAME,
                    " --password ",PASSWORD,
                    " -i ",CMEMS[i,]$datasetID,
                    " -x ",unlist(CMEMS[i,]$lon)[1]," -X ",unlist(CMEMS[i,]$lon)[2],
                    " -y ",unlist(CMEMS[i,]$lat)[1]," -Y ",unlist(CMEMS[i,]$lat)[2],
                    " -z ",unlist(CMEMS[i,]$depth)[1]," -Z ",unlist(CMEMS[i,]$depth)[2],
                    " -t ",CMEMS[i,]$date_min," -T ",date_max,
                    " -v ",CMEMS[i,]$variable,
                    " -o ",out_dir,
                    " -f ",CMEMS[i,]$out_name,".nc")
  
  system(command, input="Y")  
}
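Because the download loop shells out to the copernicusmarine toolbox, a failed request will not stop the R script on its own. A minimal sanity check like the one below (an optional addition, not part of the original script) reports any expected NetCDF file that is missing before processing begins:

downloaded <- list.files("./Data/Raw/Parameters", pattern = "\\.nc$")           # files actually downloaded
missing <- setdiff(paste0(CMEMS$out_name, ".nc"), downloaded)                   # compare against expected file names
if(length(missing) > 0){
  cat("Missing CMEMS files:", paste(missing, collapse = ", "), "\n")
}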

Process CMEMS Data

Next, we need to process the downloaded CMEMS data following the steps outlined in this subsection. We need a template raster that will be converted to spatial points. These points are used to extract values from the CMEMS data, and the resulting dataframe is passed to the model. Because the points are the centroids of the template raster, the predictions are evenly spaced and can later be written back to a raster. Here, we use a bathymetry raster as the template.

cat("Processing CMEM input data.\n")

library(terra)
library(lubridate)
library(dplyr)
library(stringr)
library(tidyr)
library(magrittr)


#create function to process the CMEMS data
ValueAtDeepest <- function(x){                                                  # x  is a SpatRaster object (several depth levels in each time step)
  
  time <- unique(terra::time(x))                                                # get unique time DateTime
  extent <- terra::ext(x)                                                       # get extent of the SpatRaster
  x = terra::as.data.frame(x, xy = TRUE, na.rm=FALSE)                           # convert SpatRaster into a data frame, values in each depth are separated in columns
  
  x <-  x %>%
    tidyr::drop_na(contains("_depth=0")) %>%                                    # remove rows that have NA at the surface (i.e., NA at the surface means no values at any depth level)
    dplyr::rename_with(., ~ str_extract(.x,"(?<=\\=)\\d+"),
                       contains("_depth")) %>%
    tidyr::pivot_longer(cols = -c(x,y),                                         # pivot into long format, all parameter values in one column                                  
                        names_to = "depth",
                        values_to = "z") %>%
    dplyr::mutate(depth = as.numeric(depth)) %>%                 
    dplyr::group_by(x,y) %>%                                                    # group by grid cell
    tidyr::drop_na() %>%                                                        # remove NAs
    dplyr::arrange(desc(depth), .by_group=TRUE) %>%                             # arrange depth from deepest to shallowest
    dplyr::slice_head() %>%                                                     # only extract the row with the deepest level
    dplyr::select(-depth) 
  
  x <- terra::rast(x, type="xyz", crs="epsg:4326")                              # convert xyz dataframe into a SpatRaster object
  terra::time(x) <- time                                                        # label the DateTime based from the original SpatRaster
  names(x) <- time
  x <- terra::extend(x,extent)                                                  # extend extent based on the original SpatRaster
  
  return(x)
  
}


meanDepth <- rast("./Data/Processed/Parameters/bathymetry_mean.tif")            # import depth raster
names(meanDepth) <- "depth"

template_data <- as.data.frame(meanDepth, xy=TRUE)                              # convert depth SpatRaster to a dataframe. Each centroid of the raster becomes a point
template_data <- template_data %>%                                              # convert dataframe to SpatVector
  vect(geom=c("x","y"), crs="epsg:4326")


pathList <- list.files(path = "./Data/Raw/Parameters",                          # list file paths of downloaded CMEMS data
                       pattern='.nc$', all.files=TRUE, full.names=TRUE)

names <- sub("\\.nc","",basename(pathList))                                     # extract names
names(pathList) <- names                                                        # create names for the list elements

rastList <- vector(mode="list", length=length(pathList))                        # create list that will contain the processed raster
names(rastList) <- names                                                        # each element corresponds to each processed environmental variable

one_level <- CMEMS %>%                                                          # create a vector for surface variables
  filter(z_levels=="one") %>% 
  pull(out_name)


for(i in 1:length(pathList)){
  
  cat(i,"/",length(pathList),": Processing ", names(pathList[i]),"\n", sep="")
  
  var <- names(rastList[i])
  
  if(var %in% one_level){
    if(grepl("wave",var)){                                                      
      raster <- rast(pathList[i])
      terra::time(raster) <- as.POSIXct(substring(terra::time(raster),1,10),    # remove time part but retain the date
                                        tz="UTC")
      dates <- unique(terra::time(raster))
      temp_list <- vector(mode="list", length=length(dates))
      
      for(j in 1:length(dates)){
        temp_list[[j]] <- mean(raster[[terra::time(raster)==dates[j]]])         # compute mean wave height per day (raw data is hourly)
        names(temp_list[[j]]) <- gsub("_","_mean_",names[i])
        terra::time(temp_list[[j]]) <- dates[j]
        }
      
      rastList[[i]] <- rast(temp_list)
      
    } else {
      rastList[[var]] <- rast(pathList[[var]])                                  # open other surface variables as SpatRasters    
      names(rastList[[var]]) <- rep(var, times=terra::nlyr(rastList[[var]]))
      } 
    
  } else{                                                                       # extract seafloor values for multi-level variables
    raster <- rast(pathList[i])
    dates <- unique(terra::time(raster))
    temp_list <- vector(mode="list", length=length(dates))
    
    for(j in 1:length(dates)){
      subRaster <- raster[[terra::time(raster)==dates[j]]]
      check <- as.data.frame(subRaster)
      if(nrow(check)==0){break}
      temp_list[[j]] <- ValueAtDeepest(subRaster)
      names(temp_list[[j]]) <- names[i]
    }
    rastList[[i]] <- rast(temp_list)
  }
}

Build Input Dataframes

Now that the CMEMS forecast data has been processed, we need to construct the daily input dataframes for the model to generate predictions. The number of lagged days for each predictor can be adjusted based on the selection used during model training.

In the final stages of the code, new variables will be created as described in the Create New Predictors subsection under Data Exploration. The coalesce() function will consolidate similar variables into a single column (e.g., merging bottomT_NWS and bottomT_IBI). Additionally, the repeated if statements in the latter part of the pipeline ensure that new predictors are generated only if they are present in the extracted dataframe.

cat("Building input dataframe.\n")

rm(list=setdiff(ls(),c("trained_workflow","predictors","template_data",
                       "rastList")))

date_min <- as.POSIXct(as.character(Sys.Date()),tz="UTC")                       # set reference date to today
date_max <- lapply(rastList,terra::time) %>%                                    # set the latest forecast date based on the shortest available CMEMS forecast
  lapply(., max) %>%
  do.call("c",.) %>%
  min()

dates_seq <- seq(date_min, date_max, by = "day")                                # specify prediction dates
df_list <- vector(mode="list", length=length(dates_seq))                        # create list that will contain the input dataframe per day


for(i in 1:length(dates_seq)){                                                  # create input dataframe for each forecast date looping through the prediction dates
  
  newdata <- template_data                                                      # this template data comes from the bathymetry raster
  cat("Date:",as.character(dates_seq[i]),"\n")
  
  for(j in 1:length(rastList)){
    
    raster <- rastList[[j]]
    date <- dates_seq[i]
    cat("    Extracting ",unique(names(raster))," values.\n",sep="")
    
    subRaster <- raster[[terra::time(raster)==date]]                            
    newdata <- terra::extract(subRaster,newdata,bind=TRUE)
    
    if(grepl(x=unique(names(raster)),                                           # if the variables being processed are current, wave or ssh, then extract the following day lags: 1 and 2
             pattern = "current_u|current_v|wave_mean|ssh")){
      
      date <- date - lubridate::days(1)
      subRaster <- raster[[terra::time(raster)==date]]
      names(subRaster) <- paste0("lag1_",names(subRaster))
      newdata <- terra::extract(subRaster,newdata,bind=TRUE)
      
      gc()
      
      date <- date - lubridate::days(1)
      subRaster <- raster[[terra::time(raster)==date]]
      names(subRaster) <- paste0("lag2_",names(subRaster))
      newdata <- terra::extract(subRaster,newdata,bind=TRUE)
      
      gc()
      
    } else if(grepl(x=unique(names(raster)),                                    # if the variable being processed is pp, then extract the following day lags: 30 and 60
                    pattern = "pp")){
      
      date <- date - lubridate::days(30)
      subRaster <- raster[[terra::time(raster)==date]]
      names(subRaster) <- paste0("lag30_",names(subRaster))
      newdata <- terra::extract(subRaster,newdata,bind=TRUE)
      
      gc()
      
      date <- date - lubridate::days(30)
      subRaster <- raster[[terra::time(raster)==date]]
      names(subRaster) <- paste0("lag60_",names(subRaster))
      newdata <- terra::extract(subRaster,newdata,bind=TRUE)
      
      gc()
      
    } else{                                                                     # if the variables being processed are salinity, bottomT, pH or DO, then extract the following day lags: 7 and 14
      
      date <- date - lubridate::days(7)
      subRaster <- raster[[terra::time(raster)==date]]
      names(subRaster) <- paste0("lag7_",names(subRaster))
      newdata <- terra::extract(subRaster,newdata,bind=TRUE)
      
      gc()
      
      date <- date - lubridate::days(7)
      subRaster <- raster[[terra::time(raster)==date]]
      names(subRaster) <- paste0("lag14_",names(subRaster))
      newdata <- terra::extract(subRaster,newdata,bind=TRUE)
      
      gc()
    }
    
  }
  
  rm(list=c("raster","subRaster"))
  
  newdata <- as.data.frame(newdata, geom="XY") %>%                    
    mutate(Date = dates_seq[i],
           time_trend = as.numeric(Date - min(as.POSIXct("2006-01-11 00:00:00",tz="UTC")))/1e9,
           Longitude = x,
           Latitude = y,
           across(ends_with("_NWS"),                                                
                  ~ coalesce(., get(sub("_NWS", "_IBI", cur_column()))),              
                  .names = "{gsub('_NWS','', .col)}")) %>%
    select(-ends_with(c("_NWS", "_IBI"))) %>%
    {if(any(str_detect(predictors,"current"))) mutate(.,current = sqrt(current_u^2 + current_v^2),  
                                                      lag1_current = sqrt(lag1_current_u^2 + lag1_current_v^2),
                                                      lag2_current = sqrt(lag2_current_u^2 + lag2_current_v^2),
                                                      delta1_current = current - lag1_current,
                                                      delta2_current = current - lag2_current,
                                                      delta1_current_u = current_u - lag1_current_u,
                                                      delta2_current_u = current_u - lag2_current_u,
                                                      delta1_current_v = current_v - lag1_current_v,
                                                      delta2_current_v = current_v - lag2_current_v) else .} %>%
    {if(any(str_detect(predictors,"ssh"))) mutate(.,delta1_ssh = ssh - lag1_ssh,
                                                  delta2_ssh = ssh - lag2_ssh) else .} %>%
    {if(any(str_detect(predictors,"wave_mean"))) mutate(.,delta1_wave_mean = wave_mean - lag1_wave_mean,
                                                        delta2_wave_mean = wave_mean - lag2_wave_mean) else .} %>%
    {if(any(str_detect(predictors,"DO"))) mutate(.,delta7_DO = DO - lag7_DO,
                                                 delta14_DO = DO - lag14_DO) else .} %>%
    {if(any(str_detect(predictors,"pH"))) mutate(.,delta7_pH = pH - lag7_pH,
                                                 delta14_pH = pH - lag14_pH) else .} %>%
    {if(any(str_detect(predictors,"pp"))) mutate(.,delta30_pp = pp - lag30_pp,
                                                 delta60_pp = pp - lag60_pp) else .} %>%
    {if(any(str_detect(predictors,"salinity"))) mutate(.,delta7_salinity = salinity - lag7_salinity,
                                                       delta14_salinity = salinity - lag14_salinity) else .} %>%
    {if(any(str_detect(predictors,"bottomT"))) mutate(.,delta7_bottomT = bottomT -lag7_bottomT,
                                                      delta14_bottomT = bottomT - lag14_bottomT) else .} %>%
    select(Date,Longitude,Latitude,any_of(predictors)) %>%
    drop_na() 
  
  df_list[[i]] <- newdata
  
  gc()
  
}
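Before passing the dataframes to the model, it can be useful to confirm that each daily dataframe contains rows and that the expected predictors survived the extraction and drop_na() steps. This is an optional check, not part of the original script:

sapply(df_list, nrow)                                                           # number of prediction points per forecast day
all(predictors %in% names(df_list[[1]]))                                        # TRUE if every model predictor is present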

Save Predictions into a NetCDF File

Finally, we can use the dataframes created from the forecasted CMEMS data as input for the model. Each day of predictions will be saved as a separate layer in a NetCDF file. Since we have enriched the model with metadata, we can also include this metadata as additional information when writing the NetCDF file, ensuring better documentation and traceability of the model predictions.

cat("Getting model predictions.\n")

library(workflows)
library(ranger)
library(terra)


error <- trained_workflow[["metadata"]][["model_metrics"]] %>%
    filter(.metric=="pr_auc" & test_type=="test_time") 

FAO_code <- "SOL"
common_name <- "Common Sole"
scientific_name <- "Solea solea"
title <- paste0("title=Daily Predictions of Fishing Suitability of ",common_name," (",scientific_name,")")
institution <- paste0("institution=",trained_workflow[["metadata"]][["institute"]])
source <- paste0("source=",trained_workflow[["metadata"]][["model"]])
comment <- paste0("comment=",trained_workflow[["metadata"]][["training_data"]]," ",
                    trained_workflow[["metadata"]][["notes"]]," ",
                    error$test_type_info,", yielding an ",tolower(error$.metric_info), " of ",
                    round(error$.estimate,3),". The model performance was evaluated on ",
                    format(trained_workflow[["metadata"]][["creation_date"]],"%d-%b-%Y"),".")
contact <- paste0("contact=",trained_workflow[["metadata"]][["contact"]])
library <- paste0("library=",trained_workflow[["metadata"]][["library"]])
  
model <- trained_workflow[["workflow"]]
  
pred_list <- vector(mode="list", length=length(dates_seq))                    #list that will contain the raster predictions
  
for(i in 1:length(df_list)){
    
    df <- df_list[[i]]
    
    if(nrow(df)==0){next}
    
    predictions <- predict(model,df) %>%                            
      select(.pred) %>%        
      cbind(df,.) %>%
      mutate(.pred = .pred*100) %>%
      rename(fishing_suitability = .pred,
             x = Longitude,
             y = Latitude) %>%
      select(x,y,fishing_suitability) %>%
      rast(.,type="xyz", crs="EPSG:4326")                                       #convert to raster
    
    terra::time(predictions) <- unique(df$Date)
    
    pred_list[[i]] <- predictions
}

raster <- rast(pred_list)                                                       #bind all rasters into one object

terra::writeCDF(raster,                                                         #export as NetCDF file
                overwrite = TRUE,
                varname = paste0(FAO_code,"_predictions"),
                longname = paste0("Predicted Fishing Suitability of ",scientific_name),
                unit = "%",
                atts = c(title,institution,source,comment,contact,library),
                filename = paste0("./Outputs/",FAO_code,"_predictions_",Sys.Date(),".nc"))

We can then put all of the code above into one R script. Let’s name it get_model_predictions.R and save it in the Scripts folder as indicated here.
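After the script has run, the exported file can be read back with terra to confirm that the layers and time stamps were written as expected. This is an optional check (not part of the script itself), run in the same session so that FAO_code is still defined:

check <- terra::rast(paste0("./Outputs/", FAO_code, "_predictions_", Sys.Date(), ".nc"))  # open the exported NetCDF
check                                                                           # print the layer summary
terra::time(check)                                                              # prediction dates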

Containerize the Model into a Docker Image

We have mentioned Docker a couple of times in this section, but what is Docker? Docker is an open-source platform that allows developers to build, package, and run applications in lightweight, portable containers. These containers include everything an application needs to run, such as libraries, dependencies, and configuration files, ensuring that it works consistently across different environments.

How does it work? Developers write code and define dependencies in a Dockerfile. Then, Docker builds an image based on the Dockerfile. The image is then stored in a registry like Docker Hub or a private repository. Lastly, containers are created from the image and run on any Docker-compatible system.
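As a generic illustration of that lifecycle (the image and registry names below are placeholders, not the ones used later in this section), the command sequence looks roughly like this:

docker build -t my-image .                       # build an image from the Dockerfile in the current folder
docker tag my-image myregistry/my-image:1.0      # tag the image for a registry
docker push myregistry/my-image:1.0              # push it to Docker Hub or a private registry
docker run --rm myregistry/my-image:1.0          # create and run a container from the image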

In our use case, we are using a Windows VM running Windows Server 2022 Datacenter. On this virtual machine, specific installations are required for Docker to run. This is well documented in this link. The two important installations are the following.

  • Install the Windows Server 2022 containers feature

  • Install Hyper-V in Windows Server 2022

Also, Docker Desktop for Windows needs to be installed on this server. The process also requires the installation of the Windows Subsystem for Linux (WSL). Once the server is correctly set up, we can move on to the next part, which is creating the Docker images and running the container.

Creating the Docker Image

Since the model runs in an R environment, we first need to create an image based on R-base. We are using the rocker/r-base Docker image.

Create the Image custom-base-r

We first need to create a Dockerfile. This is a file that contains a set of instructions for building a Docker image: it defines the steps to create a custom image by specifying the base image, dependencies, configuration, and commands that need to be executed within the image. The Dockerfile is like a blueprint or recipe for creating a Docker container environment. To create one, open Notepad++ and paste the following code. Then save the file with the name Dockerfile, without a file extension. Make sure the file type is set to “All types”.

FROM rocker/r-base:latest

RUN apt-get update

RUN apt-get -y install build-essential xml2 openssl nano

RUN apt-get -y install libfontconfig1-dev 

RUN apt-get -y install libgdal-dev libgeos-dev libproj-dev libharfbuzz-dev libfribidi-dev

RUN apt-get update

COPY install_packages.R /install_packages.R

RUN Rscript /install_packages.R


We also need a separate R script to install the R packages required for get_model_predictions.R to work. Save it as install_packages.R directly in the custom-base-r folder, as shown in the directory structure above.

options(repos = "https://cloud.r-project.org")                                  # set a CRAN mirror

# Base packages (stats, graphics, grDevices, datasets, methods, base, splines,
# parallel, grid, compiler, tools) ship with R and do not need to be installed.
packages <- c("ranger", "lightgbm", "workflows", "magrittr", "tidyr", "stringr",
              "dplyr", "lubridate", "terra", "tidyselect", "listenv", "purrr",
              "lattice", "parsnip", "colorspace", "vctrs", "generics", "utf8",
              "survival", "prodlim", "rlang", "pillar", "glue", "withr",
              "lifecycle", "lava", "timeDate", "munsell", "gtable", "future",
              "recipes", "codetools", "class", "fansi", "Rcpp", "scales",
              "ipred", "jsonlite", "parallelly", "dials", "ggplot2", "digest",
              "stringi", "bonsai", "ncdf4", "DiceDesign", "hardhat", "cli",
              "tibble", "future.apply", "pkgconfig", "ellipsis", "MASS",
              "Matrix", "data.table", "timechange", "gower", "rstudioapi",
              "R6", "globals", "rpart", "nnet")

install.packages(packages)

install.packages("xgboost", dependencies = TRUE)


After setting up the directories and creating all necessary files, we can proceed with building the base Docker image. This image will be derived from rocker/r-base, an official R environment maintained by the Rocker Project. In the Command Prompt, change the directory to the custom-base-r folder using cd, then run the docker command. The docker build command creates a Docker image from the specified Dockerfile. The -t custom-base-r option assigns the name custom-base-r to the newly created image. The . at the end specifies the build context, which is the current directory. This directory should contain the Dockerfile and any other necessary files.

cd "C:\Path\to\your\Docker_Project\custom-base-r"

docker build -t custom-base-r .

Build the Final Docker Image

Once the modified R base image has been built, we can create the final image for deployment. Make sure that there is another Dockerfile in the fish_suitability folder with the following code.

FROM custom-base-r

############################

##  PYTHON from pyenv

############################

RUN apt-get update

RUN apt-get install -y git

RUN apt-get update

RUN git clone https://github.com/pyenv/pyenv.git /.pyenv

ENV PYENV_ROOT="/.pyenv"

ENV PATH="/.pyenv/bin:${PATH}"

ARG PYTHON_VERSION=3.11

ENV PYTHON_VERSION ${PYTHON_VERSION}

RUN pyenv install ${PYTHON_VERSION}

RUN git clone https://github.com/pyenv/pyenv-virtualenv.git $PYENV_ROOT/plugins/pyenv-virtualenv

RUN pyenv virtualenv ${PYTHON_VERSION} base && \
    pyenv global base

ENV PATH=".pyenv/versions/base/bin:$PATH"

############################

##  Model requirements

############################

COPY Scripts /Scripts

COPY Models /Models

COPY Data /Data

COPY Outputs /Outputs

############################

##  Execution

############################

COPY requirements.txt /requirements.txt

RUN pip install -r /requirements.txt

CMD Rscript /Scripts/get_model_predictions.R
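
The requirements.txt file lists the Python packages that pip installs inside the image. Its exact contents are not shown here, but because get_model_predictions.R calls the copernicusmarine command-line tool, the file would at a minimum contain a line like the following:

copernicusmarine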


Then navigate to the fish_suitability folder in the Command Prompt and run the following docker command.

cd "C:\Path\to\your\Docker_Project\fish_suitability"

docker build -t fish_suitability .


You can double-check that the image was created by executing the following command. The fish_suitability image should be listed under the REPOSITORY column.

docker images


We can now run a Docker container from the newly created Docker image. In order to retrieve the prediction output files (i.e., the NetCDF files), we need to mount a local directory where these files will be saved.

docker run -it --rm -v /path/to/model/outputs:/Outputs fish_suitability

  • docker run - This command is used to run a new container from a Docker image.
  • -i - Keeps the standard input (stdin) open so you can interact with the container.
  • -t - Allocates a pseudo-terminal (TTY), making it possible to interact with the container via the command line.
  • --rm - This flag removes the container automatically after it stops.
  • -v - This flag mounts a volume, which means sharing a folder between your local machine and the Docker container.
  • /path/to/model/outputs - This is a directory on your local machine.
  • /Outputs - This is the directory inside the container where the local folder will be accessible.
  • fish_suitability - This is the Docker image being used to create the container.
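For example, on the Windows Server VM described earlier, the command might look like the following (the path is a placeholder for your own output folder):

docker run -it --rm -v "C:\Path\to\your\Docker_Project\fish_suitability\Outputs:/Outputs" fish_suitability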