## Introduction
This note summarizes different ways to deploy a Hugging Face model on Amazon SageMaker:

- SageMaker Model
- HuggingFaceModel
- HuggingFace LLM DLC
- JumpStartModel
## SageMaker Model
A model in SageMaker consists of an image URI and model data.
```python
from sagemaker import Session
from sagemaker.model import Model
from sagemaker import image_uris
from sagemaker import model_uris
import json
```
Let's retrieve an image URI. First, specify the model ID and version:
```python
model_id = "huggingface-text2text-flan-t5-xxl"
model_version = "*"
```
then retrieve the image URI:
```python
image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type="ml.g5.12xlarge",
)
```
Next, retrieve the model data:
```python
model_uri = model_uris.retrieve(
    model_id=model_id,
    model_version=model_version,
    model_scope="inference",
)
```
Now we can create a Model object. Note that role and endpoint_name must be defined beforehand:
```python
model_inference = Model(
    image_uri=image_uri,
    model_data=model_uri,
    role=role,
    predictor_cls=None,
    name=endpoint_name,
    env={
        "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
        "TS_DEFAULT_WORKERS_PER_MODEL": "1",
    },
)
```
Let's deploy the model to create an endpoint. See the last section for how to invoke the endpoint.
```python
model_inference.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name=endpoint_name,
)
```
## HuggingFaceModel
Another method is to use a HuggingFaceModel object. With this method, the model data can be loaded from the Hugging Face Hub after the endpoint is created:
```python
hub = {
    'HF_MODEL_ID': 'sentence-transformers/all-MiniLM-L6-v2',
    'HF_TASK': 'feature-extraction',
}
```
Let's create a HuggingFaceModel object:
```python
from sagemaker.huggingface import HuggingFaceModel

hf_model = HuggingFaceModel(
    role=None,
    # model data can be loaded after the endpoint is created
    # model_data=model_data,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env=hub,
)
```
Then we can call the deploy method on the HuggingFaceModel to create an endpoint, as sketched below.
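A minimal sketch of the deploy call; the instance type and count here are assumptions, so pick values that fit your model:

```python
# a minimal sketch; instance type and count are assumptions
predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```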
## HuggingFace LLM DLC
The third method is to use get_huggingface_llm_image_uri, which returns the HuggingFace LLM Deep Learning Container (DLC) image:
```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# get the image URI from AWS ECR;
# the same HuggingFace LLM DLC is used for different models
llm_image_uri = get_huggingface_llm_image_uri(
    session=session,
    version="0.9.3",
    backend="huggingface",
)
```
Let's specify a model:
```python
hf_model_id = "OpenAssistant/pythia-12b-sft-v8-7k-steps"  # model id from huggingface.co/models
use_quantization = False  # whether to use quantization
instance_type = "ml.g5.12xlarge"  # instance type to use for deployment
number_of_gpu = 4  # number of GPUs to use for inference and tensor parallelism
health_check_timeout = 300  # increase the health check timeout to 5 minutes for downloading the model
```
Now create a HuggingFaceModel:
```python
llm_model = HuggingFaceModel(
    role=None,
    image_uri=llm_image_uri,
    env={
        'HF_MODEL_ID': hf_model_id,
        'HF_MODEL_QUANTIZE': json.dumps(use_quantization),
        'SM_NUM_GPUS': json.dumps(number_of_gpu),
    },
)
```
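The model can then be deployed using the configuration variables defined above. This is a minimal sketch assuming the default deploy arguments are acceptable; llm_predictor is a name introduced here for illustration:

```python
# a minimal sketch using the variables defined above
llm_predictor = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    # give the container time to download the model before health checks fail
    container_startup_health_check_timeout=health_check_timeout,
)
```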
## JumpStartModel and Gradio
Finally, the most convenient way is to use a SageMaker JumpStartModel:
```python
from sagemaker.jumpstart.model import JumpStartModel
```
Let's deploy a meta-textgeneration-llama-2-7b-f model:
```python
model_id, model_version = "meta-textgeneration-llama-2-7b-f", "*"
model = JumpStartModel(
    model_id=model_id,
    model_version=model_version,
    role=role,
)
# deployment takes about 10 minutes
predictor = model.deploy()
```
After the endpoint is deployed, let's use Gradio to build a simple chatbot:
```python
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("## Chat with Amazon SageMaker")
    with gr.Column():
        chatbot = gr.Chatbot()
        with gr.Row():
            with gr.Column():
                message = gr.Textbox(label="Chat Message Box", placeholder="Chat Message Box", show_label=False)
            with gr.Column():
                with gr.Row():
                    submit = gr.Button("Submit")
                    clear = gr.Button("Clear")

    def respond(message, chat_history):
        # convert the chat message to the Llama 2 chat format
        prompt = [[{"role": "user", "content": message}]]
        # generation parameters (example values)
        parameters = {"max_new_tokens": 256, "top_p": 0.9, "temperature": 0.6}
        # send the request to the endpoint
        llm_response = predictor.predict(
            {"inputs": prompt, "parameters": parameters},
            custom_attributes="accept_eula=true",
        )
        # extract the generated text from the response
        parsed_response = llm_response[0]["generation"]["content"]
        chat_history.append((message, parsed_response))
        return "", chat_history

    submit.click(respond, [message, chatbot], [message, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(share=True)
```
## Invoke Endpoint
To invoke an endpoint, we can create a predictor.
```python
from sagemaker.huggingface.model import HuggingFacePredictor
from sagemaker import Session

session = Session()
```
Define the model and endpoint names:
```python
model_id = "huggingface-text2text-flan-t5-xxl"
model_version = "*"
endpoint_name = "hugging-face-llm-demo"  # endpoint names may not contain underscores
endpoint_name_deployed = "huggingface-pytorch-inference-2023-09-19-05-00-49-267"
```
Create a HuggingFacePredictor:
```python
predictor = HuggingFacePredictor(
    endpoint_name=endpoint_name_deployed,
    sagemaker_session=session,
)
```
Then invoke the endpoint:
```python
response = predictor.predict(
    {"inputs": "Today is a sunny day and I'll get some ice cream."}
)
```
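You can also pass generation parameters along with the input. The parameter names below follow the Hugging Face generate API and are an assumption here, so adjust them to your model:

```python
# a sketch with generation parameters; the parameter names follow the
# Hugging Face generate API and may need adjusting for your model
response = predictor.predict({
    "inputs": "Translate to German: Today is a sunny day.",
    "parameters": {"max_new_tokens": 50, "temperature": 0.7},
})
print(response)
```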