As part of laying down the application architecture for LLM applications, one key focus area is LLM deployment. Closely related to deployment is the LLM hosting strategy: different hosting options need to be evaluated against criteria such as cost, and the most appropriate option selected. In this blog, we will learn about the hosting options for different kinds of LLMs and the related strategies.
LLM Hosting Cost Depends on the Type of LLM Needed
The cost of LLM hosting depends on the type of LLM we need for our application.
LLM Hosting Cost for Proprietary Models
If we need to use a proprietary model such as GPT-4 or Claude-2, our LLM hosting cost would primarily be the API cost. We do not need to host such models ourselves; they are hosted on the servers of the LLM API providers, which expose REST APIs that can be called from the application. The API cost depends on the number of tokens processed as part of each API request. Recall that the number of tokens includes both the input tokens (the text you send to the model) and the output tokens (the text the model generates in response). For example, if you send a request with a prompt that contains 100 tokens and the model generates a response with 150 tokens, the total number of tokens processed would be 250.
LLM Hosting Cost = f(API Cost)
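As a back-of-the-envelope illustration, the sketch below estimates a monthly API bill from expected traffic. The per-token prices are placeholders, not actual published rates; check your provider's pricing page for current figures.

```python
# Rough API-cost estimator for a proprietary, API-hosted LLM.
# The per-token prices below are illustrative placeholders, NOT real rates.
PRICE_PER_INPUT_TOKEN = 0.00003   # assumed $/input token
PRICE_PER_OUTPUT_TOKEN = 0.00006  # assumed $/output token

def monthly_api_cost(requests_per_month: int,
                     avg_input_tokens: int,
                     avg_output_tokens: int) -> float:
    """Estimate monthly spend: (input + output token cost) per request * volume."""
    cost_per_request = (avg_input_tokens * PRICE_PER_INPUT_TOKEN
                        + avg_output_tokens * PRICE_PER_OUTPUT_TOKEN)
    return requests_per_month * cost_per_request

# e.g., 1M requests/month, 100 input tokens and 150 output tokens per request
print(f"${monthly_api_cost(1_000_000, 100, 150):,.2f} per month")
```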
There are several reasons why we might prefer a proprietary, API-hosted model over an open-source one. Some of the important ones:
- The LLM provider handles all maintenance and updates of the models.
- There are no upfront setup or compute infrastructure costs on our side.
- We can pick a model with the desired performance and accuracy.
LLM Hosting Cost for Open Source Models
If we want to use an open-source LLM such as Llama, the model needs to be downloaded and hosted on our own infrastructure. The cost depends on the size of the model.
Let's understand this with the help of the Llama-13B model, which can be downloaded from the Hugging Face website (see the Hugging Face Llama page). Downloading the model means downloading the model weights as a set of files. Here are the steps needed to download the model:
- Fill out the MetaAI request form. The link to this form can also be found on the Hugging Face Llama page.
- Follow the instructions on the Hugging Face Llama page to convert the weights into the Hugging Face Transformers format.
- Once done, the Llama model and tokenizer can be loaded on the local server, as shown in the sketch below.
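A minimal loading sketch, assuming the conversion step above produced a local directory of Hugging Face-format weights (the path below is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Local directory produced by the weight-conversion script (assumed path).
model_path = "./llama-13b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path)
# float16 halves memory versus float32; device_map="auto" spreads the
# model across available GPUs (requires the `accelerate` package).
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```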
The hosting cost would comprise the following:
- Storage cost: For a model with 13 billion parameters, the storage requirement could be around 60GB. Assuming cloud storage at $0.02 per GB per month, the storage cost would be approximately $1.20 per month.
- Compute cost: We need servers with sufficient computational power to host and run the model, which typically means GPU instances. Suppose we choose VMs with NVIDIA A100 GPUs, which are capable of running such large language models. An on-demand A100 GPU instance in the cloud (e.g., AWS EC2 P4d instances deployed in EC2 UltraClusters) can cost between $30 and $40 per hour. Assuming the model runs 24/7, the monthly cost for one GPU instance would be around $30 * 24 * 30 = $21,600. If you reserve an AWS EC2 p4d instance for a year, the hourly price drops to $19.22, which brings the cost down to roughly $14,000 per month. Check the price details on the AWS p4d page.
- On the Azure cloud, you can use ND A100 v4-series virtual machines (VMs), designed for high-end deep learning training and tightly coupled scale-up and scale-out high-performance computing (HPC) workloads. HPC workloads are complex, data-intensive tasks that are divided up and run simultaneously on multiple computers working together. The ND A100 v4 series starts with a single VM and eight NVIDIA Ampere A100 40GB Tensor Core GPUs.
- On Google Cloud, there are several accelerator-optimized VM options: A3 machine types come with NVIDIA H100 80GB GPUs, A2 machine types with NVIDIA A100 GPUs, and G2 machine types with NVIDIA L4 GPUs.
- Operational costs: In addition to compute and storage, there are costs for data transfer, additional infrastructure (CPU, memory, networking), and maintenance. Budgeting a further 20% of the compute cost for these overheads adds roughly $4,320 per month.
Based on the rough estimates above, the total monthly cost of hosting a Llama-13B model on your cloud server comes to approximately:
- Storage: $1.20
- Compute: $21,600 (for 1 p4d instance of 8 GPUs and 320GB A100 memory)
- Operational Overheads: $4,320
- Total: Approximately $25,921.20 per month
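The same estimate in a few lines of code, so the assumptions are easy to tweak:

```python
# Monthly cost estimate for self-hosting Llama-13B (figures from above).
storage_gb, storage_price = 60, 0.02   # $/GB-month cloud storage
gpu_hourly, hours = 30.0, 24 * 30      # on-demand p4d rate, running 24/7
overhead_rate = 0.20                   # operational overhead as share of compute

storage = storage_gb * storage_price   # $1.20
compute = gpu_hourly * hours           # $21,600.00
overhead = compute * overhead_rate     # $4,320.00

print(f"Total: ${storage + compute + overhead:,.2f} per month")  # $25,921.20
```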
Using the Llama-13B model would thus involve downloading and hosting it on the company's own infrastructure, at a monthly cost of around $25,921.20. This cost primarily comprises GPU compute, storage, and operational overheads.
In comparison, using a proprietary model like GPT-4 hosted by OpenAI would eliminate the need for local infrastructure and maintenance. Instead, the company would pay based on the number of tokens processed, which could be more cost-effective and simpler to manage depending on usage patterns and specific requirements.
Hosting Cost for LLM Trained In-house
When we train an LLM in-house, the approach differs somewhat in terms of library usage from open-source models, where the Hugging Face Transformers library can be used off the shelf.
We will need a powerful computing infrastructure with large GPUs, such as NVIDIA A100s, and a distributed training environment, given the size of the model and dataset. Frameworks such as TensorFlow or PyTorch can be used to train the LLM. Once the LLM is trained, we need to export the model in a format suitable for deployment; common formats include ONNX, TensorFlow SavedModel, and native PyTorch checkpoints.
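As one example of the export step, the sketch below traces a PyTorch model to ONNX. A tiny stand-in model is used so the snippet runs on its own; the vocabulary size and sequence length are assumptions, and real LLM exports usually need extra care (e.g., past key/value caches and opset selection).

```python
import torch
import torch.nn as nn

# Tiny stand-in for a trained LLM: token ids -> logits. In practice this
# would be the model produced by your training run.
class ToyLM(nn.Module):
    def __init__(self, vocab_size: int = 32000, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids):
        return self.head(self.embed(input_ids))

model = ToyLM().eval()
example_ids = torch.randint(0, 32000, (1, 128), dtype=torch.long)

# Trace the model and write an ONNX file suitable for deployment.
torch.onnx.export(
    model,
    (example_ids,),
    "llm.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
)
```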
Now we are ready for LLM deployment. We can host the LLM on on-premises servers or on cloud platforms (e.g., AWS, GCP, Azure), and use serving frameworks such as TensorFlow Serving or TorchServe to expose the model to applications.
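Once served, applications call the model over HTTP. A minimal client sketch, assuming the model has been deployed behind TorchServe under the hypothetical name "my-llm" (TorchServe's inference API listens on port 8080 by default):

```python
import requests

# Hypothetical endpoint: a TorchServe deployment registered as "my-llm".
response = requests.post(
    "http://localhost:8080/predictions/my-llm",
    data="Summarize the quarterly report in two sentences.",
)
print(response.text)
```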
Suppose we trained an LLM from scratch with 5B parameters. For hosting such LLMs, we would typically use instances with powerful GPUs. AWS offers various GPU instances, but a common choice for deep learning tasks is the p4d.24xlarge or p4de.24xlarge instance, which comes with NVIDIA A100 GPUs. Here's a cost estimate based on the p4d instance.
- Instance Type: p4d.24xlarge (You could use Azure ND A100 v4-series or Google's A3 / A2 / G2 accelerator-optimized VMs).
- GPUs: 8 NVIDIA A100
- vCPUs: 96
- GPU Memory: 320 GB (8 x 40 GB A100)
- Cost: Approximately $32 per hour on-demand on AWS (as of June 2024)
Monthly Cost Calculation:
- Hourly Rate: $32
- Usage: 24 hours/day * 30 days/month = 720 hours/month
- Monthly Cost: 720 hours * $32/hour = $23,040
Storage Costs
You will also need storage for your model and data. Let’s assume you need around 100 GB of storage for the model weights, data, and other necessary files.
- Storage Type: Amazon EBS General Purpose SSD (gp3)
- Cost: $0.10 per GB-month
Monthly Storage Cost:
- Storage Cost: 100 GB * $0.10/GB-month = $10.00
Data Transfer Costs
Data transfer costs depend on the amount of data being transferred in and out of AWS. For simplicity, let’s assume a modest data transfer of 1 TB per month.
- Cost: $0.09 per GB for data transfer out beyond the free tier
Monthly Data Transfer Cost:
- Data Transfer Cost: 1024 GB * $0.09/GB = $92.16
Total Monthly Cost
- Compute Cost: $23,040
- Storage Cost: $10.00
- Data Transfer Cost: $92.16
- Total Monthly Cost: $23,142.16
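Putting the pieces together, here is a small, reusable estimator. The rates are the assumed figures from the walkthrough above, not live cloud pricing:

```python
def estimate_monthly_cost(gpu_hourly: float,
                          storage_gb: float,
                          storage_price: float,
                          transfer_gb: float,
                          transfer_price: float,
                          hours: int = 24 * 30) -> float:
    """Monthly hosting estimate: compute + storage + data transfer."""
    return (gpu_hourly * hours
            + storage_gb * storage_price
            + transfer_gb * transfer_price)

# Figures from the walkthrough above (assumed rates, not live pricing).
total = estimate_monthly_cost(
    gpu_hourly=32.0,      # p4d.24xlarge on-demand
    storage_gb=100,       # EBS gp3 volume
    storage_price=0.10,   # $/GB-month
    transfer_gb=1024,     # 1 TB out per month
    transfer_price=0.09,  # $/GB
)
print(f"${total:,.2f} per month")  # $23,142.16
```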