## Introduction
This note summarizes different ways to deploy a Hugging Face model on Amazon SageMaker:

- SageMaker Model
- HuggingFaceModel
- HuggingFace LLM DLC
- JumpStartModel
## SageMaker Model
A model in SageMaker consists of an image URI and model data.
```python
from sagemaker import Session
from sagemaker.model import Model
from sagemaker import image_uris
from sagemaker import model_uris
import json
```
Let's retrieve an image URI. First, specify the model ID and version:
```python
model_id = "huggingface-text2text-flan-t5-xxl"
model_version = "*"
```
then retrieve the image URI:
```python
image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type="ml.g5.12xlarge",
)
```
Next, retrieve the model data:
```python
model_uri = model_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    model_scope="inference",
)
```
Now we can create a Model object. Note that role and endpoint_name must be defined beforehand:
```python
model_inference = Model(
    image_uri=image_uri,
    model_data=model_uri,
    role=role,
    predictor_cls=None,
    name=endpoint_name,
    env={
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "TS_DEFAULT_WORKERS_PER_MODEL": "1",
    },
)
```
Let's deploy the model to create an endpoint. See the last section for how to invoke the endpoint.
```python
model_inference.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name=endpoint_name,
)
```
## HuggingFaceModel
Another method is to use a HuggingFaceModel object. With this method, the model data can be loaded from the Hugging Face Hub after the endpoint is created:
```python
hub = {
    'HF_MODEL_ID': 'sentence-transformers/all-MiniLM-L6-v2',
    'HF_TASK': 'feature-extraction',
}
```
Let's create a HuggingFaceModel object:
```python
from sagemaker.huggingface import HuggingFaceModel

hf_model = HuggingFaceModel(
    role=None,
    # model data can be loaded after the endpoint is created
    # model_data=model_data,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env=hub,
)
```
Then we can call the deploy method on the HuggingFaceModel to create an endpoint, as sketched below.
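A minimal sketch of the deploy call; the instance type and count here are assumptions, so pick values that fit your model:

```python
# a minimal sketch; instance type and count are assumptions
predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```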
## HuggingFace LLM DLC
The third method is to use get_huggingface_llm_image_uri, which returns the HuggingFace LLM Deep Learning Container (DLC) image:
```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# get the image URI from AWS ECR;
# the same HuggingFace LLM DLC is used for different models
llm_image_uri = get_huggingface_llm_image_uri(
    session=session,
    version="0.9.3",
    backend="huggingface",
)
```
Let's specify a model:
```python
hf_model_id = "OpenAssistant/pythia-12b-sft-v8-7k-steps"  # model id from huggingface.co/models
use_quantization = False  # whether to use quantization
instance_type = "ml.g5.12xlarge"  # instance type to use for deployment
number_of_gpu = 4  # number of GPUs to use for inference and tensor parallelism
health_check_timeout = 300  # increase the health check timeout to 5 minutes for downloading the model
```
Now create a HuggingFaceModel:
```python
llm_model = HuggingFaceModel(
    role=None,
    image_uri=llm_image_uri,
    env={
        'HF_MODEL_ID': hf_model_id,
        'HF_MODEL_QUANTIZE': json.dumps(use_quantization),
        'SM_NUM_GPUS': json.dumps(number_of_gpu),
    },
)
```
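The model can then be deployed using the configuration variables defined above. This is a minimal sketch assuming the default deploy arguments are acceptable; llm_predictor is a name introduced here for illustration:

```python
# a minimal sketch using the variables defined above
llm_predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    # give the container time to download the model before health checks fail
    container_startup_health_check_timeout=health_check_timeout,
)
```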
## JumpStartModel and Gradio
Finally, the most convenient way is to use a SageMaker JumpStartModel:
```python
from sagemaker.jumpstart.model import JumpStartModel
```
Let's deploy a meta-textgeneration-llama-2-7b-f model:
```python
model_id, model_version = "meta-textgeneration-llama-2-7b-f", "*"
model = JumpStartModel(
    model_id=model_id,
    model_version=model_version,
    role=role,
)
# deployment takes about 10 minutes
predictor = model.deploy()
```
After the endpoint is deployed, let's use Gradio to build a simple chatbot:
```python
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("## Chat with Amazon SageMaker")
    with gr.Column():
        chatbot = gr.Chatbot()
        with gr.Row():
            with gr.Column():
                message = gr.Textbox(label="Chat Message Box", placeholder="Chat Message Box", show_label=False)
            with gr.Column():
                with gr.Row():
                    submit = gr.Button("Submit")
                    clear = gr.Button("Clear")

    def respond(message, chat_history):
        # convert the chat message to the Llama 2 chat format
        prompt = [[{"role": "user", "content": message}]]
        # generation parameters (example values)
        parameters = {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}
        # send the request to the endpoint
        llm_response = predictor.predict(
            {"inputs": prompt, "parameters": parameters},
            custom_attributes="accept_eula=true",
        )
        # extract the generated text from the response
        parsed_response = llm_response[0]["generation"]["content"]
        chat_history.append((message, parsed_response))
        return "", chat_history

    submit.click(respond, [message, chatbot], [message, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(share=True)
```
## Invoke Endpoint
To invoke an endpoint, we can create a predictor.
```python
from sagemaker.huggingface.model import HuggingFacePredictor
from sagemaker import Session

session = Session()
```
Define the model and endpoint names:
```python
model_id = "huggingface-text2text-flan-t5-xxl"
model_version = "*"
endpoint_name = "hugging-face-llm-demo"  # endpoint names may not contain underscores
endpoint_name_deployed = "huggingface-pytorch-inference-2023-09-19-05-00-49-267"
```
Create a HuggingFacePredictor:
```python
predictor = HuggingFacePredictor(
    endpoint_name=endpoint_name_deployed,
    sagemaker_session=session,
)
```
Then invoke the endpoint:
```python
response = predictor.predict(
    {"inputs": "Today is a sunny day and I'll get some ice cream."}
)
```
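You can also pass generation parameters along with the input. The parameter names below follow the Hugging Face generate API and are an assumption here, so adjust them to your model:

```python
# a sketch with generation parameters; the parameter names follow the
# Hugging Face generate API and may need adjusting for your model
response = predictor.predict({
    "inputs": "Translate to German: Today is a sunny day.",
    "parameters": {"max_new_tokens": 50, "temperature": 0.7},
})
print(response)
```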