Leverage Sagemaker to Host an ML Model for Salesforce

Table of Contents:

  1. Introduction
  2. Data Used
  3. Evaluating algorithms locally using R
  4. Training and Deploying a model using AWS Sagemaker
  5. Invoking the Model from Salesforce
  6. Final Thoughts

Introduction:

In this article, I will explore training and hosting a machine learning model in AWS Sagemaker. The need for this came up during my Machine Learning course. I found that I could train models on my desktop with fairly high accuracy using common, well-regarded algorithms like Random Forest or Boosting. The natural next step was to see how to use these models in the applications I use regularly. I spent a lot of time exploring hosting options, including packages like Plumber (in R) and Flask (Python). There are plenty of options available. However, spinning up my own machines to host the models involved a lot of system architecting to ensure the stability and performance of the end system. You can ease the load a bit using things like Elastic Beanstalk or Kubernetes; I did have success deploying an NLP model for Sentiment Analysis using the AWS Elastic Kubernetes Service. For code and details, check out my page here: https://vvrrao.home.blog/hosting-a-custom-nlp-model-on-aws-elastic-kubernetes-service/

An alternative is AWS Sagemaker, which provides the hosting and publishing mechanism if you can provide it the algorithm. It also comes with built-in algorithms for things like XGBoost and Random Forest (via external toolkits), making life a bit easier. IT IS NOT PERFECT: I could not get a 100% replication of what I trained on my local machine using the prebuilt algorithms on Sagemaker. However, I got something close enough, which, I suppose, highlights both the pros and cons of Sagemaker.

Below, I will FIRST train my model OUTSIDE Sagemaker using the caret package in R. I will then retrain the model using the Sagemaker-provided XGBoost algorithm and publish the endpoint as a REST API using the AWS API Gateway.

Finally, to demonstrate the usability, I will configure Salesforce to make an outbound REST call to consume the endpoint and make predictions on Salesforce hosted data.

Sagemaker vs Einstein

If the last line gave you pause, you are probably not alone. Salesforce has its own Machine Learning/AI product suite (Einstein). Why would it need to integrate with Sagemaker?

For a couple of reasons. Einstein is a bit of a black box, in the sense that there is not a lot of clarity on how its modeling takes place behind the scenes. I touched upon this in my post on training a model using Einstein Language. Sagemaker allows you to control the algorithm, or build your own, letting you tailor the model as you see fit. A company might also have its own Data Science team building models for use across multiple applications, and it often makes sense to leverage their existing models.

Data Used:

For the exercise I will use the Weight Lifting Exercises dataset provided at http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har.

The following publications detail the results of the study. Many thanks to the authors for providing this data for training purposes.

Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence – SBIA 2012. Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

The Weight Lifting dataset involved attaching accelerometers to various participants, monitoring them as they exercised, and measuring how well they performed the exercise. The quality of the movement was captured in the variable classe, which takes a value of A-E. This is not my dataset, so I will let the authors describe it in their own words:

Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).

Source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har.

The data may be downloaded here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

There is also a challenge dataset of 20 measurements which you can test your model against:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The answers to the challenge are: B A B A A E D B A A B C B A E E A B B B

Evaluating Algorithms locally using R:

My initial thought was to explore potential algorithms using R. I used the caret package for data cleansing, setup, and model fitting.

A few general principles: up front, I split the data and reserved 30% of it for validation. That data was set aside initially and played no role in training the model. I calculated the accuracy of each model by testing against this data set and calculating the percentage of correct predictions.
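That accuracy calculation is just the fraction of matching predictions, or equivalently the sum of the confusion matrix diagonal divided by its total. As a quick illustration, here is a Python sketch of the same computation the R code later does with table() and diag() (the tiny label vectors are made up for the example):

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs into a labels x labels matrix."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

actual    = ["A", "A", "B", "B", "C"]
predicted = ["A", "B", "B", "B", "C"]

cm = confusion_matrix(actual, predicted, labels=["A", "B", "C"])
print(cm)                           # diagonal holds the correct predictions
print(accuracy(actual, predicted))  # 4 of 5 correct -> 0.8
```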

You can train on your local machine. However, Sagemaker also provides an R-kernel which is an option if you are looking for a more powerful machine to run the model on.

The markdown files are on my github at the following locations:

https://github.com/vvr-rao/Sagemaker-XGBoost/blob/main/RandomForest.md

https://github.com/vvr-rao/Sagemaker-XGBoost/blob/main/XGB.md

The libraries I used were as follows:

library(caret)
library(e1071)
library(xgboost)

Initial Data Exploration and Cleansing

The data set consists of 19622 records with 160 columns. Not all of the columns are useful and we will need to clean them up. I loaded the data into rawTrainData.

rawTestData consists of 20 records which the exercise tries to predict. As I mentioned earlier – the correct predictions are B A B A A E D B A A B C B A E E A B B B. All my models predicted this so I will not dwell on it.

train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
rawTrainData <- read.csv(url(train_url))
rawTestData <- read.csv(url(test_url))
dim(rawTrainData)
dim(rawTestData)

summary(rawTrainData)

Running summary(rawTrainData) gives a wealth of information which I will not duplicate here. Many of the columns have very low variance, and many are entirely NA. My first steps were to clean them up.

a) The following command gives the variance of each column

apply(rawTrainData, 2, var)

b) The following bit of code removes columns with Near Zero Variance

nearZeroVarCols <- nearZeroVar(rawTrainData)
training <- rawTrainData[,-nearZeroVarCols]
testing <- rawTestData[,-nearZeroVarCols]

c) Data with only NA can be removed as follows

training <- training[, colSums(is.na(training)) == 0] 
testing <- testing[, colSums(is.na(testing)) == 0] 

d) I also saw that the first 5 columns were just the name of the person doing the exercise and the timestamps of the recordings. I didn’t see them affecting the model, so I dropped them.

training <-training[,-c(1:5)]
testing <- testing[,-c(1:5)]
dim(training)

All of this left me with a data set which was 19622×54.

Setting Up the Training Data Set

As I mentioned earlier, I pulled out 30% of the data and set it aside as a Validation set (subValidation). This data was not involved in training of the model and serves as a measure of accuracy of the model.

subSamples <- createDataPartition(y=training$classe, p=0.70, list=FALSE)
subTraining <- training[subSamples, ] 
subValidation <- training[-subSamples, ]

First Model: Decision Tree

My first model was built as a basic Decision Tree. A decision tree is super fast to train but not the most accurate. No harm in trying though…

mod_DT <- train(classe ~ ., data = subTraining, method="rpart")
pred_DT <-  predict(mod_DT, subValidation)
confMat_DT <- table(subValidation$classe,pred_DT)
accuracy_DT <- sum(diag(confMat_DT))/sum(confMat_DT)
confMat_DT
accuracy_DT

Above, I trained the model using the subTraining dataset. My target is classe. I checked accuracy by using the trained model to predict against subValidation. Below is the output. Not very impressive: an accuracy of only 53%. The Confusion Matrix shows how poorly the predictions did against the actual data.

Confusion Matrix
       A    B    C    D    E
  A 1534   27  108    0    5
  B  458  418  263    0    0
  C  483   34  509    0    0
  D  389  163  373    0   39
  E  125   73  226    0  658
 Accuracy: 0.52999150382328

Second Model: Random Forest

Next up was the Random Forest algorithm.

mod_RF <- train(classe ~ ., data = subTraining, method = "rf", ntree = 100)
pred_RF <- predict(mod_RF, subValidation)
confMat_RF <- table(subValidation$classe,pred_RF)
accuracy_RF <- sum(diag(confMat_RF))/sum(confMat_RF)
confMat_RF
accuracy_RF

This took a LOT longer to run but gave me a much higher accuracy of over 99%.

 Confusion Matrix
       A    B    C    D    E
  A 1674    0    0    0    0
  B    1 1137    1    0    0
  C    0    2 1024    0    0
  D    0    0    5  958    1
  E    0    1    0    2 1079
Accuracy: 0.997790994052676

Here is the summary of the model

Random Forest 

13737 samples
   53 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.9910534  0.9886858
  27    0.9954755  0.9942782
  53    0.9914457  0.9891823

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 27.

Third Model: XGBoost

Next up. XGBoost.

tune_grid <- expand.grid(nrounds = 100,
                        max_depth = 6,
                        eta = 0.05,
                        gamma = 0.01,
                        colsample_bytree = 1,
                        min_child_weight = 0.5,
                        subsample = 0.5)
trctrl <- trainControl(method = "cv", number = 5)
mod_XGB <- train(classe ~ ., data = subTraining, method = "xgbTree",
                trControl=trctrl,
                tuneGrid = tune_grid,
                tuneLength = 10)

pred_XGB <- predict(mod_XGB, subValidation)
confMat_XGB <- table(subValidation$classe,pred_XGB)
accuracy_XGB <- sum(diag(confMat_XGB))/sum(confMat_XGB)
confMat_XGB
accuracy_XGB

I did some tuning by tweaking the parameters, though not as much as I had expected to need. Note that I used 5-fold cross-validation here, something I will come back to when I get to training XGBoost in Sagemaker. I got a pretty high accuracy here as well – 99%.

Below is the Confusion Matrix and how well my predictions stacked up against actual values.

Confusion Matrix
       A    B    C    D    E
  A 1672    1    0    0    1
  B    5 1121   13    0    0
  C    0    3 1021    2    0
  D    0    0   11  949    4
  E    0    2    1    7 1072
Accuracy: 0.991503823279524

Here’s the output of the model:

eXtreme Gradient Boosting 

13737 samples
   53 predictor
    5 classes: 'A', 'B', 'C', 'D', 'E' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 10990, 10989, 10989, 10990, 10990 
Resampling results:

  Accuracy   Kappa    
  0.9925748  0.9906079

Tuning parameter 'nrounds' was held constant at a value of 100
Tuning parameter 'min_child_weight' was held constant at a value of 0.5
Tuning parameter 'subsample' was held constant at a value of 0.5

Training and Deploying a model using AWS Sagemaker:

At this point I had a couple of models I was happy with and turned my attention to deploying them. I spent some time playing with the Plumber package in R. Technically, you could deploy any of the above models using Plumber but you would be stuck with having to manage a lot of infrastructure. Hunting for options for a quick deploy led me to Sagemaker.

Sagemaker will train and deploy your model for you and manage your infrastructure, building in failover and redundancy. XGBoost is available as one of the core Sagemaker algorithms and Random Forest is available as part of the scikit-learn package.

A key reason to use a Random Forest model is if you expect a few features to dominate the model. I did not see that in my EDA, so I decided to go with XGBoost, which was 99% accurate. (Generally, a 99% accuracy raises the risk of overfitting, but given the nature of the data I felt that was not the case here.) My plan was to replicate the parameters from my local model in Sagemaker, retrain the model, and see where it took me.

I used the same data cleansing as before.

I used the R-kernel in Sagemaker to cleanse the data, fit the model and deploy.

My code is on my github at the following location:

https://github.com/vvr-rao/Sagemaker-XGBoost/blob/main/SagemakerXGBoostModelTrainAndDeploy.ipynb

Explanation of code

When you create a notebook instance, you provision an EC2 instance in the background with your required libraries installed and running. Sagemaker also creates a default S3 bucket (or you can create one separately). This stores your data files and the output of the training job.

The first few lines of code are housekeeping items: load libraries, then get the default bucket and session for future use.

Part 1 – Setting up the datasets

library(reticulate)
sagemaker <- import('sagemaker')

session <- sagemaker$Session() 

bucket <- session$default_bucket()
#creates a default bucket of format sagemaker-<aws-region-name>-<aws account number>

role_arn <- sagemaker$get_execution_role()

#load my train and test data
#The training and test data for this was provided at the following location: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
rawTrainData <- read.csv(url(train_url))
rawTestData <- read.csv(url(test_url))
dim(rawTrainData)
dim(rawTestData)

#save the downloaded files locally and in S3
library(readr)
write_csv(rawTrainData, 'pml-training.csv', col_names = FALSE) 
write_csv(rawTestData, 'pml-testing.csv', col_names = FALSE)
session$upload_data(path = 'pml-training.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')
session$upload_data(path = 'pml-testing.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')

I did the same data cleansing as before, so I will not duplicate all of it here (the code is available on my github). However, there were a couple of extra steps I needed to take:

  1. Sagemaker expects the variable you are predicting to be in the first column of the data set.
  2. For a multiclass classification problem, the target needs to be an integer starting at 0, so I had to recode my classe variable from A,B,C,D,E to 0,1,2,3,4. (NOTE: I tried recoding to 1,2,3,4,5, which threw an error when I tried to fit the model.)

#We're predicting classe. Move it to the first column (since Sagemaker requires it) and change from A,B,C,D,E to 0,1,2,3,4
library(tidyverse)
training1 <- training %>% relocate(classe)

training1$classe<-recode(training1$classe, 'A'=0, 'B'=1, 'C'=2, 'D'=3, 'E'=4)
training1$classe  <- as.factor(training1$classe)

Of course, classe is a factor, so I set it up as one.

As before, I set aside 30% of my data for validation. This will not be used in training the model. That dataset is called subValidation

#create Training and Validation sets
subSamples <- createDataPartition(y=training1$classe, p=0.70, list=FALSE)
subTraining <- training1[subSamples, ] 
subValidation <- training1[-subSamples, ]## Going to keep this set aside. Will not be used to build the model

I then divided subTraining (my training dataset) further. This is not something I needed to do while training on my local machine; however, the Sagemaker version of XGBoost requires both a training set and a validation set. I did a 70-30 split.

#will only use subTraining to train the model
subSamples1 <- createDataPartition(y=subTraining$classe, p=0.70, list=FALSE)
modTraining <- subTraining[subSamples1, ] 
modValidation <- subTraining[-subSamples1, ]

Finally, more housekeeping: writing the datasets out in the format Sagemaker needs and uploading them to S3.

write_csv(modTraining, 'clean_train1.csv', col_names = FALSE) 
write_csv(modValidation, 'clean_valid1.csv', col_names = FALSE)

s3_train <- session$upload_data(path = 'clean_train1.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')
s3_valid <- session$upload_data(path = 'clean_valid1.csv', 
                                bucket = bucket, 
                                key_prefix = 'data')

s3_train_input <- sagemaker$inputs$TrainingInput(s3_data = s3_train, content_type = 'csv') 
s3_valid_input <- sagemaker$inputs$TrainingInput(s3_data = s3_valid, content_type = 'csv')

Part 2 – Training the model

The first step in training the model is actually retrieving the algorithm's container image from the Amazon Elastic Container Registry. There are multiple versions of the image. You can get the latest one using this:

xgboost_container <- sagemaker$amazon$amazon_estimator$get_image_uri(session$boto_session$region_name,
                          'xgboost', 
                         repo_version='latest')

However, I found that there are subtle differences in syntax between the versions which can cause the code to break. Therefore, I retrieved a specific version. I’m hoping this means that, if I need to revisit the code a few years down the line, it will still work without issue.

xgboost_container <- sagemaker$amazon$amazon_estimator$get_image_uri(session$boto_session$region_name,
                          'xgboost', 
                         repo_version='1.2-1')

You need to specify an output directory for the model on S3. This is important, since you will reference it when you use and deploy the model.

Sagemaker spins up a separate machine to run the model fitting (another EC2 instance in the background – love those). You also need to specify the kind of machine you want, so pick something as big and fast as your data needs. The fitting time was always under 30 minutes for me across multiple runs, which I felt was reasonable.

s3_output <- paste0('s3://', bucket, '/output')
estimator <- sagemaker$estimator$Estimator(image_uri = xgboost_container,
                                     role = role_arn,
                                     train_instance_count = 1L,
                                     train_instance_type = 'ml.m5.large',
                                     train_volume_size = 30L,
                                     train_max_run = 3600L,
                                     input_mode = 'File',
                                     output_path = s3_output,
                                     output_kms_key = NULL,
                                     base_job_name = NULL,
                                     sagemaker_session = NULL)

Finally, hyperparameters. I tried to duplicate the parameters I used earlier when training my model with the caret package. The one thing I was unable to replicate in Sagemaker was the k-fold cross-validation (remember I did 5-fold cross-validation in caret); I could not find any setting for it.

estimator$set_hyperparameters(
        max_depth = 6L,
        eta = 0.05,
        gamma = 0.01,
        min_child_weight = 0.5,
        subsample = 0.5,
        objective = "multi:softmax", ##since this is a multiclass
        num_class = 5L, ## required for multi:softmax
        num_round = 100L,
        colsample_bytree = 1L )

The objective = “multi:softmax” is needed since this is a multiclass classification; I found that if I left it out, Sagemaker ignored that the target was a factor. An alternative is “multi:softprob”, which returns a list of probabilities over all the classes and is possibly more informative (softmax gives just the class with the highest probability).
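If you do switch to multi:softprob, the endpoint hands back one probability per class instead of a single label, and you have to take the argmax yourself. A hypothetical Python sketch of that post-processing (the response string here is invented for illustration, and the exact wire format is an assumption – inspect what your endpoint actually returns):

```python
# Hypothetical softprob response for two observations, five classes each:
# comma-separated probabilities, NUM_CLASS values per observation.
raw = "0.02,0.90,0.03,0.03,0.02,0.70,0.10,0.10,0.05,0.05"

NUM_CLASS = 5
LABELS = ["A", "B", "C", "D", "E"]  # our classe encoding 0..4

probs = [float(x) for x in raw.split(",")]
rows = [probs[i:i + NUM_CLASS] for i in range(0, len(probs), NUM_CLASS)]

# multi:softmax would have returned only the winning class; recover it here.
predictions = [LABELS[row.index(max(row))] for row in rows]
print(predictions)  # ['B', 'A']
```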

All that is left is to specify the input data (which we created earlier) and fit the model:

job_name <- paste('sagemaker-train-xgboost', format(Sys.time(), '%H-%M-%S'), sep = '-')

input_data <- list('train' = s3_train_input,
                   'validation' = s3_valid_input)

estimator$fit(inputs = input_data,
              job_name = job_name)

Part 3 – Create an Endpoint and Test

Once the model is fitted, you will need to deploy it. This creates an endpoint on yet another EC2 instance. You can have more than one instance for redundancy. This was the main reason I was using Sagemaker – to have a tool which could host the model for me and take care of the infrastructure requirements. You could build the hosting yourself and publish as an API using something like Plumber but that would require a lot of effort in provisioning machines for the hosting and webservers to publish.

serializer <- sagemaker$serializers$CSVSerializer(content_type='text/csv')
model_endpoint <- estimator$deploy(initial_instance_count = 1L,
                                   instance_type = 'ml.t2.medium',serializer=serializer)

A bit further, I will show how to expose this endpoint as a REST API and invoke it from POSTMAN and Salesforce.

Once deployed, we test the model. As before, we test against the dataset we set aside earlier – subValidation. Accuracy is measured by how the predictions match up to the actual data, and the confusion matrix shows how well the model did.

#note: 5885 is just the count of rows in the validation set
val1 <- subValidation[,-c(1)] #take out the classe from the validation set
head(val1)
valset <- as.matrix(val1 [1:5885, ])
predictions2 <- model_endpoint$predict(valset)

predictions2 <- str_split(predictions2, pattern = ',', simplify = TRUE)
predictions2 <- as.numeric(predictions2)
output2 <- cbind(predicted_classe = as.integer(predictions2), subValidation[1:5885, ])

confMat_Val <- table(output2$predicted_classe, output2$classe)
accuracy_Val <- sum(diag(confMat_Val))/sum(confMat_Val)

Confusion Matrix

       0    1    2    3    4
  0 1672   13    0    0    0
  1    0 1110    7    1    1
  2    0   16 1018   16    0
  3    2    0    1  943    4
  4    0    0    0    4 1077
Accuracy: 0.988954970263382

My accuracy is around 99% as well. The exact number is different from what I got training locally but not by much.

Part 4 – Deploy as a REST API

Finally, I deploy the endpoint as a REST API. This will be published using API Gateway.

The method of exposing a Sagemaker endpoint over the API Gateway involves creating a Lambda function and then exposing the Lambda to the Gateway. My plan was to create a REST API accepting the inputs in the body as a POST request and returning the predicted classe.

I found this article very helpful and followed the instructions provided there: https://aws.amazon.com/blogs/machine-learning/call-an-amazon-sagemaker-model-endpoint-using-amazon-api-gateway-and-aws-lambda/

Below is my Lambda code (in Python 3.8). AWS enforces rights using Roles and Policies, so you need to make sure that the Lambda function has a Role with a Policy allowing it to call your Sagemaker endpoint.

import os
import io
import boto3
import json
import csv


ENDPOINT_NAME = os.environ['ENDPOINT_NAME']
runtime= boto3.client('runtime.sagemaker')

def lambda_handler(event, context):
        
    data = json.loads(json.dumps(event))
    payload = data['data']
    print(payload)
    
    response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType='text/csv',
                                       Body=payload)
    print(response)
    result = json.loads(response['Body'].read().decode())

    classe = ''
    pred = int(result)
    if pred == 0:
         classe = 'A'
    elif pred == 1:
         classe = 'B'
    elif pred == 2:
         classe = 'C'
    elif pred == 3:
         classe = 'D'
    elif pred == 4:
         classe = 'E'
    return classe
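As an aside, the if/elif ladder at the end of the Lambda is just a lookup table, and the same 0-4 to A-E mapping can be written as an index into a label string. This is purely a style choice, not how the Lambda above is written:

```python
def pred_to_classe(pred: int) -> str:
    """Map the model's integer output (0-4) back to the original A-E labels."""
    return "ABCDE"[pred]

print(pred_to_classe(0))  # A
print(pred_to_classe(4))  # E
```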

The Lambda expects an input in the format of:

{"data":"478,0.96,3.93,-87.6,5,-0.03,0,-0.02,-14,1,43,50,635,-301,0,0,0,17,-4.37,1.61,-0.72,-23,108,-124,738,2,34,53.11895381,-56.56030919,-74.15238363,28,0.37,0.11,-0.48,-146,138,-184,-527,248,146,148,18.8,127,33,0.05,-0.06,0.02,-115,236,-184,-584,621,647"}
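Constructing that body is just a matter of joining the feature values with commas inside a JSON wrapper. A short Python sketch (the feature values here are placeholders, not real sensor readings; in practice you send all the features, in the column order the model was trained on):

```python
import json

# Placeholder feature vector -- substitute the real accelerometer readings.
features = [478, 0.96, 3.93, -87.6, 5, -0.03]

payload = json.dumps({"data": ",".join(str(f) for f in features)})
print(payload)
# {"data": "478,0.96,3.93,-87.6,5,-0.03"}
```

You could then POST this string as the request body from Postman, curl, or any HTTP client.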

You need to expose the above Lambda function via API Gateway: create a new REST API, create a resource, and then add a POST method with an integration type of Lambda, referencing your Lambda function.

You deploy by creating a stage, which gives you a publicly available endpoint. You should be able to test this out fairly easily using Postman or curl.

Essentially, at this point you have a trained model, hosted in a cloud environment and accessible via a REST API.

In the next section, I will go over invoking this from Salesforce.com. Of course, the model may be invoked by any tool which can do REST calls.

Invoking the Model from Salesforce:

In my mind, I visualized the Salesforce application supporting a high-tech gym that tracks its clients. I assumed each session would be recorded, and somebody would hit a button to get an idea of whether the exercise was performed correctly. In my mind, the button is called ‘GetClasse’ and would call my REST API with the trained model behind it.

The way to do this in Salesforce (assuming you use the Lightning UI) is to create an action on the object, with a flow behind it. The flow calls an Apex class, which retrieves the data stored in the application, formats it, and makes a REST callout. It then stores the response back in Salesforce in real time.

a) The Apex class would look something like this. Note the @InvocableMethod(label=’Invoke’) – you need that to invoke the code from a Flow.

public class InvokeAPI {
  @InvocableMethod(label='Invoke')
  public static list<String> getClasse(list<ID> ids) {
    list<ID> recids = ids;
    list<string> resp = new list<string>();
    list<Test__c> recs = [SELECT Pitch__c, Roll__c, Yaw__c, Total_Accel__c, Classe__c FROM Test__c WHERE Id IN :recids];

    system.debug('Ids received');
    system.debug(recids);
    list<Test__c> forupdate = new list<Test__c>();
   for (Test__c t : recs) {
      
        
    String endPointURL = '<API ENDPOINT>';

    Httprequest request = new HttpRequest();
    Http http = new Http();


    
    string body = '{\"data\":\"';
    body = body + t.Pitch__c + ',';
    body = body + t.Roll__c + ',';
    body = body + t.Yaw__c + ',';
    body = body + t.Total_Accel__c;
    body = body + '\"}';
    
    
    system.debug(body);
    
    request.setMethod('POST');
    request.setEndpoint(endPointURL);
    request.setHeader('Content-Type', 'text/plain');
    
    request.setTimeout(120000); 
    request.setBody(body);          
          
    //Making call to external REST API
    HttpResponse response = http.send(request);  

    System.debug('responseBody: '+response.getBody());
    
    String temp = response.getBody().replace('"','');
    t.Classe__c = temp;
    forupdate.add(t);
    resp.add(temp);
    }
    update forupdate;
    
    return resp;
  }
}

b) Next, you would need to create a Flow.

The flow has 2 steps. The first – Get Records – gets the current record.

The second – Apex Action – just passes the Id to the Apex class we created earlier.

c) Once we create a flow, we need to head to the object we want to invoke the flow from and create an Action to invoke the flow

d) Once done, we just need to add the Action to the Page layout under the Salesforce Mobile and Lightning Experience Actions section

e) You also need to whitelist the API Gateway endpoint. This is done in Setup -> Remote Site settings and tells Salesforce that you plan to make outbound calls to the endpoint and to let them through.

If all goes well, and your endpoint is working, you should now be able to enter data into Salesforce and, on the click of a button, get a predicted classe populated by your model.

Final Thoughts:

And that is pretty much it. Hopefully, this serves as a blueprint for training and hosting a model in Sagemaker and invoking it. Boosting algorithms are fairly popular, so XGBoost should see plenty of use. There are other algorithms available as well, which makes Sagemaker a very powerful tool.

As I said earlier, I could not get a 100% replication of what I trained on my local machine using the prebuilt algorithms on Sagemaker, so it is not perfect. Alternatives, if you still want to use the cloud, might be to spin up EC2 instances with Load Balancing running Plumber or Flask, perhaps using something like Beanstalk to ease some of the work. As I mentioned at the start, I did have success using AWS EKS to host an NLP model in a Flask app. A writeup on how to do that is here: https://vvrrao.home.blog/hosting-a-custom-nlp-model-on-aws-elastic-kubernetes-service/

References:

A very special thanks to the creators of the dataset.

Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence – SBIA 2012. Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.