How to Use Hugging Face for Model Deployment: Step by Step

I've been deploying machine learning models with Hugging Face for over two years now, and I can confidently say it's one of the most streamlined platforms for getting models into production. Whether you're deploying a fine-tuned BERT for sentiment analysis or a custom Whisper model for speech recognition, Hugging Face’s Inference Endpoints and Spaces make the process remarkably smooth. In this tutorial, I'll walk you through the exact steps I use to deploy models—from setting up the environment to handling production traffic.

Prerequisites

Before we dive in, make sure you have:

A Hugging Face account (free tier works for testing)
Python 3.8+ installed
huggingface_hub, transformers, and torch installed (pip install huggingface_hub transformers torch); Steps 5.2 and 7 additionally use optimum[onnxruntime] and gradio
A trained or fine-tuned model ready (I'll use a DistilBERT sentiment model as an example)

Step 1: Prepare Your Model for Deployment

The first step is ensuring your model is compatible with Hugging Face's deployment infrastructure. I always start by saving my model and tokenizer with save_pretrained, which writes the weights, config, and tokenizer files in the standard transformers layout that Inference Endpoints loads directly.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your fine-tuned model (replace with your own)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Save locally
model.save_pretrained("./my-sentiment-model")
tokenizer.save_pretrained("./my-sentiment-model")

Pro Tip: Always test your model locally before uploading. I've wasted hours debugging deployment issues that were actually model loading errors. Run a quick inference:

inputs = tokenizer("This movie is fantastic!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.argmax().item())  # Should output 1 (positive)
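
The deployed endpoint reports a label and a score rather than raw logits. As a rough, dependency-free sketch of that mapping, the snippet below applies a softmax to a pair of logits. The logit values and the ID2LABEL table are made up for illustration; the real mapping lives in the model's config under id2label.

```python
import math

# Hypothetical label table; the real one comes from the model config (id2label).
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def logits_to_prediction(logits):
    """Convert raw logits into a {'label', 'score'} dict using a stable softmax."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

prediction = logits_to_prediction([-4.2, 4.5])  # made-up logits
print(prediction["label"], round(prediction["score"], 4))
```

If the label printed here disagrees with what you expect from the argmax check above, the label table is the first thing to inspect.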

Step 2: Upload Your Model to the Hugging Face Hub

Now, push your model to the Hugging Face Hub. This is where the magic happens—the Hub acts as both a registry and a distribution channel.

from huggingface_hub import HfApi

api = HfApi()
api.create_repo(repo_id="your-username/my-sentiment-model", exist_ok=True)
api.upload_folder(
    folder_path="./my-sentiment-model",
    repo_id="your-username/my-sentiment-model",
    repo_type="model"
)

Common Pitfall: A 401 error usually means you aren't logged in, or your token lacks write access. Run huggingface-cli login and paste a write-scoped access token from huggingface.co/settings/tokens.
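
In CI or on a server there is no interactive login, so the token usually arrives through the HF_TOKEN environment variable, which huggingface_hub also reads. A minimal sketch of failing early when no token is available; build_auth_headers is a hypothetical helper, and it does not read the token file that huggingface-cli login writes.

```python
import os

def build_auth_headers(token=None):
    """Return an Authorization header, preferring an explicit token over HF_TOKEN."""
    token = token or os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("No token found: run `huggingface-cli login` or set HF_TOKEN.")
    return {"Authorization": f"Bearer {token}"}

headers = build_auth_headers("hf_example_token")  # placeholder, not a real token
print(headers)
```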

Screenshot: Uploading model files via Python API

Step 3: Create a Model Card (Optional but Recommended)

A good model card helps others (and your future self) understand what the model does. I always include:

Model description
Intended use cases
Training data summary
Evaluation metrics

You can create this directly on the Hub UI or programmatically:

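from huggingface_hub import ModelCard, ModelCardData

# Metadata must be a ModelCardData object; a plain dict will not work here
card_data = ModelCardData(
    license="mit",
    language="en",
    tags=["sentiment-analysis", "distilbert"],
)

# from_template requires Jinja2 (pip install Jinja2)
card = ModelCard.from_template(
    card_data,
    # template_path="path/to/custom_template.md",  # Optional; omit to use the default template
)
card.push_to_hub("your-username/my-sentiment-model")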
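
Under the hood, a model card is a README.md whose metadata sits in a YAML block between two --- lines, followed by ordinary Markdown. The sketch below renders that layout by hand to show what ends up in the repo. render_model_card is a hypothetical helper that handles only flat strings and lists and does no YAML escaping, so prefer ModelCardData for real cards.

```python
def render_model_card(metadata, body):
    """Render README.md text: YAML front matter between '---' lines, then Markdown."""
    lines = ["---"]
    for key, value in metadata.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"- {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + body.strip() + "\n"

card_text = render_model_card(
    {"license": "mit", "language": "en", "tags": ["sentiment-analysis", "distilbert"]},
    "# my-sentiment-model\n\nDistilBERT fine-tuned for binary sentiment classification.",
)
print(card_text)
```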

Step 4: Deploy with Inference Endpoints

This is where deployment gets production-ready. Inference Endpoints auto-scale and handle load balancing. Here's how I set one up:

Go to huggingface.co/inference-endpoints
Click "New endpoint"
Select your model (your-username/my-sentiment-model)
Choose instance type (I start with cpu.small for testing)
Set scaling limits (min: 0, max: 2 for cost efficiency)

Screenshot: Inference Endpoint configuration page

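Pro Tip: When you create endpoints through the API, set the accelerator field to gpu to request GPU instances. A small T4 instance (gpu.t4.small in my setup) is great for real-time inference with transformer models. Instance names differ between cloud vendors and change over time, so check the current instance catalog before hard-coding one.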
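
The same configuration can be scripted instead of clicked. A minimal sketch, assuming huggingface_hub's create_inference_endpoint function: endpoint_config and create_endpoint are hypothetical helpers, and the vendor, region, instance_size, and instance_type values are assumptions that need checking against the current catalog.

```python
def endpoint_config(repo_id, *, gpu=False, min_replica=0, max_replica=2):
    """Build keyword arguments for huggingface_hub.create_inference_endpoint."""
    if not 0 <= min_replica <= max_replica:
        raise ValueError("need 0 <= min_replica <= max_replica")
    return {
        "name": repo_id.split("/")[-1],
        "repository": repo_id,
        "framework": "pytorch",
        "task": "text-classification",
        "accelerator": "gpu" if gpu else "cpu",
        # Assumed values: confirm them against the instances currently on offer.
        "vendor": "aws",
        "region": "us-east-1",
        "instance_size": "x1",
        "instance_type": "nvidia-t4" if gpu else "intel-icl",
        "min_replica": min_replica,
        "max_replica": max_replica,
        "type": "protected",
    }

def create_endpoint(repo_id, **options):
    """Create the endpoint, wait until it is running, and return its URL."""
    from huggingface_hub import create_inference_endpoint  # imported lazily

    endpoint = create_inference_endpoint(**endpoint_config(repo_id, **options))
    endpoint.wait()  # blocks until the endpoint is deployed
    return endpoint.url

config = endpoint_config("your-username/my-sentiment-model")
print(config["accelerator"], config["min_replica"], config["max_replica"])
```

Only endpoint_config runs here. Calling create_endpoint needs a logged-in account with billing enabled and starts a paid resource.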

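Once the endpoint is running, its overview page shows a dedicated URL of the form https://<endpoint-id>.<region>.<vendor>.endpoints.huggingface.cloud. (URLs under api-inference.huggingface.co/models/ belong to the shared serverless Inference API, not to your endpoint.) Test it with:

import requests

# Copy the URL from the endpoint's overview page; this one is a placeholder
API_URL = "https://YOUR-ENDPOINT-ID.us-east-1.aws.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

response = requests.post(API_URL, headers=headers, json={"inputs": "I love this product!"}, timeout=30)
response.raise_for_status()  # Raise on 4xx/5xx instead of printing an error body
print(response.json())
# Example output: [{'label': 'POSITIVE', 'score': 0.9998}]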
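
With the minimum replica count at 0, the first request after an idle period can fail while the endpoint wakes up. A sketch of retrying with exponential backoff follows. The set of retryable status codes is an assumption, call_with_retry is a hypothetical helper, and send stands in for a thin wrapper around requests.post so the example runs without a network.

```python
import time

RETRYABLE = {502, 503}  # assumed statuses while a scaled-to-zero endpoint starts

def call_with_retry(send, payload, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send(payload) -> (status, body), retrying retryable statuses with backoff."""
    for attempt in range(attempts):
        status, body = send(payload)
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status, body  # still failing after the last attempt

# Stand-in for the HTTP call: fails twice, then succeeds.
responses = iter([(503, None), (503, None), (200, [{"label": "POSITIVE", "score": 0.99}])])
delays = []
status, body = call_with_retry(
    lambda payload: next(responses),
    {"inputs": "I love this product!"},
    sleep=delays.append,  # record the delays instead of actually sleeping
)
print(status, delays)
```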

Step 5: Optimize for Production

After initial deployment, I always optimize. Here are my go-to strategies:

5.1 Enable Batching

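In the endpoint settings, set max_batch_size to 8 or 16 where the serving container supports it. Batching groups concurrent requests into a single forward pass, which raises throughput at the cost of slightly higher latency per request.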
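
Batching also helps on the client side: sending several texts per request avoids one HTTP round trip per input. A minimal sketch, assuming the endpoint accepts a list under "inputs" (the standard transformers pipelines do); batched is a hypothetical helper.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [f"review {i}" for i in range(20)]
payloads = [{"inputs": chunk} for chunk in batched(texts, 8)]
print([len(payload["inputs"]) for payload in payloads])
```

Each payload can then be posted to the endpoint exactly like the single-text request in Step 4.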

5.2 Use ONNX Runtime

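Convert your model to ONNX for faster inference. I typically see a 2-3x speedup, though the gain depends on the model and hardware, so measure it on your own workload. This requires pip install optimum[onnxruntime]: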

from optimum.onnxruntime import ORTModelForSequenceClassification

model = ORTModelForSequenceClassification.from_pretrained("your-username/my-sentiment-model", export=True)
model.save_pretrained("./onnx-model")
# Upload ./onnx-model with api.upload_folder, as in Step 2 (include the tokenizer files)
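
Before switching the endpoint to the ONNX version, it is worth measuring the speedup rather than assuming it. A rough timing harness is sketched below. slow_model and fast_model are stand-ins that only sleep, so swap in your real PyTorch and ONNX pipelines.

```python
import statistics
import time

def median_latency_ms(predict, sample, runs=10, warmup=3):
    """Return the median wall-clock latency of predict(sample) in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Stand-ins for the two pipelines; the sleep times are arbitrary.
def slow_model(text):
    time.sleep(0.02)
    return "POSITIVE"

def fast_model(text):
    time.sleep(0.002)
    return "POSITIVE"

baseline = median_latency_ms(slow_model, "This movie is fantastic!")
candidate = median_latency_ms(fast_model, "This movie is fantastic!")
print(f"speedup: {baseline / candidate:.1f}x")
```

The median is used instead of the mean so that a single slow run does not skew the comparison.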

5.3 Set Up Caching

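For models with deterministic outputs (like classification), enable response caching in the endpoint settings where available. A cache hit skips the model entirely, so repeated queries return much faster (around 80% lower latency in my case); the overall benefit depends on how often inputs repeat.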
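
The same idea works on the client side. A minimal sketch using functools.lru_cache follows; query_endpoint is a stand-in for the real HTTP call and returns canned results so the example runs offline.

```python
from functools import lru_cache

calls = {"count": 0}

def query_endpoint(text):
    """Stand-in for the HTTP call to the endpoint; counts how often it runs."""
    calls["count"] += 1
    return ("POSITIVE", 0.99) if "love" in text else ("NEGATIVE", 0.98)

@lru_cache(maxsize=1024)
def cached_predict(text):
    """Return the cached prediction for text, calling the endpoint only on a miss."""
    return query_endpoint(text)

for text in ["I love this product!", "I love this product!", "Terrible."]:
    cached_predict(text)

print(calls["count"], cached_predict.cache_info().hits)
```

Call cached_predict.cache_clear() after deploying a new model revision, otherwise stale predictions are served from the cache.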

Step 6: Monitor and Scale

Hugging Face provides built-in monitoring. I always check these metrics:

P99 latency: Should be under 500ms for real-time apps
Error rate: Keep below 1%
CPU/GPU utilization: Scale up if consistently above 80%
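
The thresholds above are easy to check from your own request logs as well. A small sketch with made-up samples: p99 uses the nearest-rank definition, and error_rate counts 5xx responses. Both function names are hypothetical.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latencies."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer form of ceil(0.99 * n)
    return ordered[rank - 1]

def error_rate(status_codes):
    """Fraction of responses with a 5xx status."""
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)

latencies = [120] * 97 + [450, 900, 950]  # made-up samples, in milliseconds
statuses = [200] * 198 + [503, 500]       # made-up status codes

alerts = []
if p99(latencies) > 500:
    alerts.append("p99 latency above 500 ms")
if error_rate(statuses) > 0.01:
    alerts.append("error rate above 1%")

print(p99(latencies), error_rate(statuses), alerts)
```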

Screenshot: Monitoring dashboard showing latency and error rates

Common Pitfall: Don't set min_replicas too high. I once left it at 5 and got a $200 bill for a weekend of idle endpoints. Start with 0 and let auto-scaling handle traffic, accepting that the first request after an idle period has to wait for a cold start.
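
The arithmetic behind that bill is simple: replicas times hours times the hourly rate. The sketch below works backwards from the $200 figure, assuming the weekend lasted 48 hours. The resulting rate is implied by the anecdote, not an official price.

```python
def idle_cost(replicas, hours, hourly_rate):
    """Cost of keeping `replicas` instances running for `hours` at `hourly_rate` each."""
    return replicas * hours * hourly_rate

weekend_hours = 48                        # assumed: Saturday plus Sunday
implied_rate = 200 / (5 * weekend_hours)  # rate implied by the $200 anecdote

print(f"implied rate: ${implied_rate:.2f} per replica-hour")
for replicas in (5, 1, 0):
    print(f"min replicas = {replicas}: ${idle_cost(replicas, weekend_hours, implied_rate):.2f}")
```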

Step 7: Alternative Deployment with Spaces

For smaller projects or demos, I use Hugging Face Spaces. It's simpler but less scalable:

Go to huggingface.co/spaces
Create a new Space (choose Gradio or Streamlit)
Add a requirements.txt with your dependencies (here, transformers and torch)
Write a simple inference script:

import gradio as gr
from transformers import pipeline

model = pipeline("sentiment-analysis", model="your-username/my-sentiment-model")

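def predict(text):
    # The "label" output expects a {label: confidence} mapping, not the raw pipeline dict
    result = model(text)[0]
    return {result["label"]: result["score"]}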

gr.Interface(fn=predict, inputs="text", outputs="label").launch()

Conclusion

Deploying models with Hugging Face has transformed how I work. Here are my key takeaways:

Always test locally first – It saves hours of debugging.
Use Inference Endpoints for production – They handle scaling, load balancing, and monitoring out of the box.
Optimize with ONNX and batching – This can cut costs substantially (up to 50% in my experience) while improving throughput; benchmark on your own workload to confirm.
Monitor aggressively – Set up alerts for latency and error rate spikes.
Start small, scale smart – Use min_replicas=0 and auto-scaling to avoid surprise bills.

The Hugging Face ecosystem eliminates most of the DevOps headaches associated with model deployment. In my experience, what used to take a week with Kubernetes and custom APIs now takes a few hours. Give it a try with your next model—I think you'll be amazed at how seamless it feels.
