Contact us if you'd like to request a custom solution or instance type.

Amazon Web Services

Microsoft Azure

Google Cloud Platform

CPU · GPU · INF2

East US us-east

Accelerator | GPUs · GPU memory | vCPUs · RAM | Price
Nvidia T4 | 1 GPU · 16 GB | 3 vCPUs · 15 GB | $0.5/h
Nvidia L4 | 1 GPU · 24 GB | 7 vCPUs · 30 GB | $0.8/h
Nvidia A10G | 1 GPU · 24 GB | 6 vCPUs · 30 GB | $1/h
Nvidia L40S | 1 GPU · 48 GB | 7 vCPUs · 30 GB | $1.8/h
Nvidia A100 | 1 GPU · 80 GB | 11 vCPUs · 145 GB | $2.5/h
Nvidia T4 | 4 GPUs · 64 GB | 46 vCPUs · 192 GB | $3/h
Nvidia L4 | 4 GPUs · 96 GB | 47 vCPUs · 185 GB | $3.8/h
Nvidia A10G | 4 GPUs · 96 GB | 46 vCPUs · 186 GB | $5/h
Nvidia A100 | 2 GPUs · 160 GB | 22 vCPUs · 290 GB | $5/h
Nvidia H200 | 1 GPU · 141 GB | 23 vCPUs · 256 GB | $5/h
Nvidia L40S | 4 GPUs · 192 GB | 47 vCPUs · 380 GB | $8.3/h
Nvidia A100 | 4 GPUs · 320 GB | 44 vCPUs · 580 GB | $10/h
Nvidia H200 | 2 GPUs · 282 GB | 46 vCPUs · 512 GB | $10/h
Nvidia A100 | 8 GPUs · 640 GB | 88 vCPUs · 1160 GB | $20/h
Nvidia H200 | 4 GPUs · 564 GB | 92 vCPUs · 1024 GB | $20/h
Nvidia L40S | 8 GPUs · 384 GB | 190 vCPUs · 1532 GB | $23.5/h
Nvidia H200 | 8 GPUs · 1128 GB | 184 vCPUs · 2048 GB | $40/h
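
As a rough illustration of how the hourly prices above add up, the sketch below multiplies an hourly rate by a replica count and a number of running hours. The function name, the 730-hour month, and the assumption that each replica is billed at the listed instance rate are illustrative only; billing granularity is not stated here.

```python
def estimate_cost(hourly_rate_usd: float, replicas: int, hours: float) -> float:
    """Return an estimated cost in USD for `replicas` instances running for `hours`."""
    if replicas < 0 or hours < 0:
        raise ValueError("replicas and hours must be non-negative")
    # Assumption: every replica is billed at the listed per-instance hourly rate.
    return round(hourly_rate_usd * replicas * hours, 2)


# One Nvidia A100 instance (1 GPU, $2.5/h) running for 8 hours.
a100_workday = estimate_cost(2.5, replicas=1, hours=8)

# Two replicas of the 1-GPU Nvidia L4 instance ($0.8/h) for an assumed 730-hour month.
l4_month = estimate_cost(0.8, replicas=2, hours=730)
```

Replicas that are scaled to zero would contribute zero hours in this estimate.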

Protected · Public · HF Restricted · AWS Private

The Endpoint is available from the Internet, and secured with TLS/SSL.
Only you can access it, using a Hugging Face Token generated from your personal account.
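
A minimal sketch of a token-authenticated call, assuming the token is sent as a Bearer credential in the `Authorization` header. The URL is a placeholder, and the `{"inputs": ...}` payload shape is an assumption that depends on the selected task.

```python
import json
import urllib.request


def build_request(endpoint_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build an HTTPS POST request that carries the token as a Bearer credential."""
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def call_endpoint(req: urllib.request.Request) -> object:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Usage (placeholder URL and token; replace with your own values):
# req = build_request("https://<your-endpoint-url>", "<your-token>", {"inputs": "Hello"})
# print(call_endpoint(req))
```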

Automatic Scale-to-Zero

Endpoints scaled to 0 replicas are not billed. They may take some time to scale back up once they start receiving requests again.
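
Because a scaled-down Endpoint needs time to come back, a client may want to retry early requests. The sketch below assumes the cold start surfaces as a retryable HTTP status (502 or 503 are used here as assumptions, not confirmed values); `send` is a hypothetical callable that performs one request and returns `(status, body)`.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    retryable: frozenset = frozenset({502, 503}),  # assumed cold-start statuses
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send` until it returns a non-retryable status or attempts run out."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if status not in retryable:
            break
        # Exponential backoff: base, 2x base, 4x base, ...
        sleep(base_delay_s * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

The `sleep` parameter is injectable so the backoff can be exercised without real waiting.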

Number of replicas

Automatically scale the number of replicas between Min and Max based on compute usage. Min is always 0 if Scale-to-Zero is active.
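
The Min/Max rule can be sketched as follows. The names `ReplicaBounds`, `resolve_bounds`, and `clamp_replicas` are hypothetical, and the requirement that Max be at least 1 is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaBounds:
    min_replicas: int
    max_replicas: int


def resolve_bounds(min_replicas: int, max_replicas: int, scale_to_zero: bool) -> ReplicaBounds:
    """Apply the rule that Min is forced to 0 when Scale-to-Zero is active."""
    effective_min = 0 if scale_to_zero else min_replicas
    # Assumption: Max must be at least 1 and never below the effective Min.
    if effective_min < 0 or max_replicas < max(effective_min, 1):
        raise ValueError("invalid replica bounds")
    return ReplicaBounds(effective_min, max_replicas)


def clamp_replicas(desired: int, bounds: ReplicaBounds) -> int:
    """Keep a desired replica count within [Min, Max]."""
    return max(bounds.min_replicas, min(desired, bounds.max_replicas))
```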

Min Max


Autoscaling Strategy

Control what type of trigger will cause your Endpoint to scale up.

Hardware Usage · Pending Requests

Hardware Utilization Threshold (%)

A scale-up event is triggered if the average hardware utilization (%) exceeds this threshold for more than 20 seconds.
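
One way to read this rule is sketched below: utilization samples must stay above the threshold continuously for more than 20 seconds. How the platform actually averages and samples utilization is not described here, so the sample format and the function `first_scale_up_time` are assumptions.

```python
from typing import Iterable, Optional, Tuple


def first_scale_up_time(
    samples: Iterable[Tuple[float, float]],  # (timestamp in seconds, utilization in %)
    threshold_pct: float,
    hold_s: float = 20.0,
) -> Optional[float]:
    """Return the first timestamp at which utilization has stayed above the
    threshold for more than `hold_s` seconds, or None if that never happens."""
    above_since: Optional[float] = None
    for t, util in samples:
        if util > threshold_pct:
            if above_since is None:
                above_since = t  # start of the current above-threshold run
            if t - above_since > hold_s:
                return t
        else:
            above_since = None  # any dip resets the run
    return None
```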

Container Type

The Default container is the easiest way to deploy Endpoints and is very flexible thanks to custom Inference Handlers. You can also select a container optimized for Text Generation Inference, or link your own Custom container.
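
A minimal sketch of a custom Inference Handler, assuming the convention of a `handler.py` file in the model repository that exposes an `EndpointHandler` class. The upper-casing "inference" is a stand-in; a real handler would load and run a model from `path`.

```python
from typing import Any, Dict, List


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` is assumed to point at the repository files inside the container.
        # A real handler would load a model or pipeline here.
        self.path = path

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # Assumed request shape: {"inputs": ..., "parameters": {...}}.
        inputs = data.get("inputs", "")
        parameters = data.get("parameters", {})
        # Placeholder inference: upper-case the input text.
        return [{"output": str(inputs).upper(), "parameters": parameters}]
```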

Quantization

Quantization can reduce the model size and improve latency, with little degradation in model accuracy.
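
To give a feel for the size reduction, the sketch below estimates weight memory for a nominal 70-billion-parameter model at different precisions. It counts weights only and ignores the KV cache, activations, and runtime overhead, so real requirements are higher; the function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 70e9  # nominal parameter count, an assumption for illustration

fp16_gb = weight_memory_gb(params, 16)  # 16-bit weights
int8_gb = weight_memory_gb(params, 8)   # 8-bit quantized weights
int4_gb = weight_memory_gb(params, 4)   # 4-bit quantized weights
```

Under these assumptions the weights take about 140 GB at 16 bits, 70 GB at 8 bits, and 35 GB at 4 bits.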

Max Input Length (per Query)

Increasing this value can increase the amount of RAM required. Some models can only handle sequences up to a fixed maximum length.

optional

Max Number of Tokens (per Query)

The larger this value, the more memory each request will consume and the less effective batching can be.

optional

Max Batch Prefill Tokens

Maximum number of prefill tokens processed per batch during continuous batching. Adjusting this value can be useful because the prefill operation is memory-intensive and compute-bound.

optional

Max Batch Total Tokens

Total number of tokens that a batch can hold across all of its queries; waiting queries are added to the batch only while this budget allows. A value of 1000 can fit 10 queries of 100 tokens each or a single query of 1000 tokens.

optional
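
The arithmetic behind the Max Batch Total Tokens example can be sketched as a simple budget check. The helper names are hypothetical, and the real scheduler may account for tokens differently.

```python
from typing import Sequence


def max_queries_in_batch(max_batch_total_tokens: int, tokens_per_query: int) -> int:
    """How many equally sized queries fit within the batch token budget."""
    if tokens_per_query <= 0:
        raise ValueError("tokens_per_query must be positive")
    return max_batch_total_tokens // tokens_per_query


def fits_in_batch(query_token_counts: Sequence[int], max_batch_total_tokens: int) -> bool:
    """Check whether a mix of queries stays within the batch token budget."""
    return sum(query_token_counts) <= max_batch_total_tokens


ten_small = max_queries_in_batch(1000, 100)   # queries of 100 tokens each
one_large = max_queries_in_batch(1000, 1000)  # a single 1000-token query
```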

Default Env

Environment variables that will be provided to your container during deployment.

Key

Value

Secret Env

Same as Default Env, but people with access to this Endpoint will not be able to read these values after creation.

Key

Value
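
Inside the container, both kinds of variables would typically be read from the process environment. The sketch below assumes that behavior; the helper `read_setting` and any variable names passed to it are hypothetical.

```python
import os
from typing import Optional


def read_setting(name: str, default: Optional[str] = None) -> str:
    """Read a configuration value from the environment, failing loudly if absent."""
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"environment variable {name!r} is not set")
    return value


# Usage inside the container (hypothetical key):
# api_key = read_setting("MY_SECRET_KEY")
```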

Commit Revision

Specify a revision commit hash for the Hugging Face repository.

optional

Task

Select a supported task with the right inputs and outputs for your model pipeline, or define a custom task.

Container Arguments

Arguments passed to the container entrypoint.

optional

Container Command

Command executed in the container.

optional

Download Pattern

Specify what file type(s) should be downloaded from the repository.

Llama-3.1-70B-Instruct