How to Use Hugging Face for Model Deployment: Step by Step
I've been deploying machine learning models with Hugging Face for over two years now, and I can confidently say it's one of the most streamlined platforms for getting models into production. Whether you're deploying a fine-tuned BERT for sentiment analysis or a custom Whisper model for speech recognition, Hugging Face’s Inference Endpoints and Spaces make the process remarkably smooth. In this tutorial, I'll walk you through the exact steps I use to deploy models—from setting up the environment to handling production traffic.
Prerequisites
Before we dive in, make sure you have:
- A Hugging Face account (free tier works for testing)
- Python 3.8+ installed
- huggingface_hub and transformers libraries installed (pip install huggingface_hub transformers)
- A trained or fine-tuned model ready (I'll use a DistilBERT sentiment model as an example)
Step 1: Prepare Your Model for Deployment
The first step is ensuring your model is compatible with Hugging Face's deployment infrastructure. I always start by saving my model in the transformers format—this guarantees it works seamlessly with Inference Endpoints.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load your fine-tuned model (replace with your own)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# Save locally
model.save_pretrained("./my-sentiment-model")
tokenizer.save_pretrained("./my-sentiment-model")
Pro Tip: Always test your model locally before uploading. I've wasted hours debugging deployment issues that were actually model loading errors. Run a quick inference:
inputs = tokenizer("This movie is fantastic!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.argmax().item()) # Should output 1 (positive)
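The argmax only gives a class index. As a minimal sketch, assuming a two-class model whose label map is {0: NEGATIVE, 1: POSITIVE}, raw logits could be turned into a label with a confidence, which is closer to what a deployed endpoint returns. The helper names softmax and describe are illustrative and not part of transformers, and the logits in the example are made up.

```python
import math

# Assumed label map; with a real model, read it from model.config.id2label.
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits, id2label=ID2LABEL):
    """Return the top label and its probability for a single example."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Made-up logits for illustration, not real model output.
print(describe([-4.0, 4.5]))
```

With the model loaded above, the call would look like describe(outputs.logits[0].tolist(), model.config.id2label).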
Step 2: Upload Your Model to the Hugging Face Hub
Now, push your model to the Hugging Face Hub. This is where the magic happens—the Hub acts as both a registry and a distribution channel.
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id="your-username/my-sentiment-model", exist_ok=True)
api.upload_folder(
    folder_path="./my-sentiment-model",
    repo_id="your-username/my-sentiment-model",
    repo_type="model",
)
Common Pitfall: If you get a 401 error, you haven't logged in. Run huggingface-cli login and paste your access token from huggingface.co/settings/tokens.
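To catch the most common upload mistakes before any network call, a few pre-flight checks can wrap the upload. This is a sketch under stated assumptions: the repo ID pattern is a rough approximation rather than the Hub's exact validation rules, the list of required files is my own choice, and the helper names are hypothetical. The HfApi calls are imported lazily so the checks can run without the library installed.

```python
import os
import re

# Rough "namespace/name" shape check; an approximation, not the Hub's real rules.
REPO_ID_PATTERN = re.compile(r"^[A-Za-z0-9][\w.-]*/[A-Za-z0-9][\w.-]*$")


def looks_like_repo_id(repo_id):
    """True if repo_id has the 'namespace/name' shape."""
    return bool(REPO_ID_PATTERN.match(repo_id))


def missing_files(folder, required=("config.json",)):
    """Return the required file names that are absent from folder."""
    return [name for name in required
            if not os.path.isfile(os.path.join(folder, name))]


def upload_model(folder, repo_id):
    """Validate inputs locally, confirm the login works, then upload."""
    if not looks_like_repo_id(repo_id):
        raise ValueError(f"Expected 'namespace/name', got {repo_id!r}")
    missing = missing_files(folder)
    if missing:
        raise FileNotFoundError(f"{folder} is missing: {missing}")

    from huggingface_hub import HfApi  # lazy import: only needed for the upload

    api = HfApi()
    api.whoami()  # fails early if no valid token is configured
    api.create_repo(repo_id=repo_id, exist_ok=True)
    api.upload_folder(folder_path=folder, repo_id=repo_id, repo_type="model")
```

Calling upload_model("./my-sentiment-model", "your-username/my-sentiment-model") would then fail fast on a bad folder or a missing login instead of partway through the upload.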

Step 3: Create a Model Card (Optional but Recommended)
A good model card helps others (and your future self) understand what the model does. I always include:
- Model description
- Intended use cases
- Training data summary
- Evaluation metrics
You can create this directly on the Hub UI or programmatically:
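from huggingface_hub import ModelCard, ModelCardData
# from_template expects a ModelCardData object, not a plain dict (it also needs Jinja2 installed)
card_data = ModelCardData(
    license="mit",
    language="en",
    tags=["sentiment-analysis", "distilbert"],
)
card = ModelCard.from_template(
    card_data,
    # template_path="path/to/custom_template.md",  # Optional: omit to use the default template
)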
card.push_to_hub("your-username/my-sentiment-model")
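For the written part of the card, the four items I listed can be rendered from plain strings. The sketch below uses a hypothetical helper, render_model_card_text, and placeholder values; the metric is deliberately left as a TODO rather than an invented number.

```python
def render_model_card_text(description, intended_use, training_data, metrics):
    """Build the markdown body of a model card from four short inputs."""
    lines = [
        "## Model description", "", description, "",
        "## Intended use", "", intended_use, "",
        "## Training data", "", training_data, "",
        "## Evaluation", "",
        "| Metric | Value |",
        "| --- | --- |",
    ]
    # One table row per metric, in the order given.
    lines += [f"| {name} | {value} |" for name, value in metrics.items()]
    return "\n".join(lines) + "\n"


# Placeholder content; replace every value with details of your own model.
text = render_model_card_text(
    description="DistilBERT fine-tuned for binary sentiment classification.",
    intended_use="Scoring short English product or movie reviews.",
    training_data="SST-2 (see the base model's card for details).",
    metrics={"accuracy": "TODO: fill in from your own evaluation"},
)
print(text)
```

One option is to assign the resulting string to the card's text attribute before pushing, though it is worth checking the rendered card on the Hub afterwards.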
Step 4: Deploy with Inference Endpoints
This is where deployment gets production-ready. Inference Endpoints auto-scale and handle load balancing. Here's how I set one up:
- Go to huggingface.co/inference-endpoints
- Click "New endpoint"
- Select your model (your-username/my-sentiment-model)
- Choose instance type (I start with cpu.small for testing)
- Set scaling limits (min: 0, max: 2 for cost efficiency)

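Pro Tip: When you create endpoints through the API, set the accelerator field to request a GPU instance. A small T4 instance (gpu.t4.small in my setup; exact instance names depend on the cloud vendor and the current catalog) works well for real-time inference with small transformer models.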
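The same setup can be scripted with create_inference_endpoint from huggingface_hub. Treat the following as a sketch: the vendor, region, instance_size, and instance_type values are placeholders I chose for illustration and must be checked against the current instance catalog, and endpoint_settings is a hypothetical helper of mine. The library is imported lazily, so the settings can be built and inspected without it.

```python
def endpoint_settings(repository, *, use_gpu=False, min_replica=0, max_replica=2):
    """Collect keyword arguments for create_inference_endpoint.

    Vendor, region, and instance names are PLACEHOLDERS; look up the real
    values in the Inference Endpoints catalog before using them.
    """
    if min_replica < 0 or max_replica < 1 or max_replica < min_replica:
        raise ValueError("need 0 <= min_replica <= max_replica and max_replica >= 1")
    return {
        "repository": repository,
        "framework": "pytorch",
        "task": "text-classification",
        "accelerator": "gpu" if use_gpu else "cpu",
        "vendor": "aws",            # placeholder
        "region": "us-east-1",      # placeholder
        "instance_size": "x1",      # placeholder
        "instance_type": "nvidia-t4" if use_gpu else "intel-icl",  # placeholder
        "min_replica": min_replica,
        "max_replica": max_replica,
    }


def create_endpoint(name, settings):
    """Create the endpoint, block until it is running, and return its URL."""
    from huggingface_hub import create_inference_endpoint  # lazy import

    endpoint = create_inference_endpoint(name, **settings)
    endpoint.wait()  # polls until the endpoint is deployed
    return endpoint.url


print(endpoint_settings("your-username/my-sentiment-model"))
```

Keeping the settings in a plain dict makes it easy to review the scaling limits in code review before anything billable is created.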
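Once created, the endpoint's overview page shows its URL. A dedicated endpoint gets its own hostname; an address of the form https://api-inference.huggingface.co/models/your-username/my-sentiment-model belongs to the shared serverless Inference API, not to the endpoint you just created. Test it with:
import requests
# Paste the URL shown on your endpoint's overview page
API_URL = "https://YOUR-ENDPOINT-URL"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}
response = requests.post(API_URL, headers=headers, json={"inputs": "I love this product!"})
response.raise_for_status()  # Surface 4xx/5xx errors instead of printing an error body
print(response.json())
# Example output (scores will vary): [{'label': 'POSITIVE', 'score': 0.9998}]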
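With min replicas set to 0, the first request after an idle period can fail while a replica starts up. Here is a minimal retry sketch with exponential backoff. I am assuming that a cold start shows up as a 502 or 503 status, which you should confirm against your own endpoint. The names query_with_retry and send_with_requests are hypothetical, and the URL and token are placeholders.

```python
import time

# Assumption: these statuses mean "replica still starting". Verify for your endpoint.
RETRYABLE_STATUSES = {502, 503}


def query_with_retry(send, payload, retries=5, delay=2.0, sleep=time.sleep):
    """Call send(payload) until it returns a non-retryable status.

    send is any callable returning (status_code, parsed_body).
    Waits delay, 2*delay, 4*delay, ... seconds between attempts.
    """
    status, body = None, None
    for attempt in range(retries + 1):
        status, body = send(payload)
        if status not in RETRYABLE_STATUSES:
            break
        if attempt < retries:
            sleep(delay * (2 ** attempt))
    return status, body


def send_with_requests(payload, url="https://YOUR-ENDPOINT-URL", token="YOUR_HF_TOKEN"):
    """Hypothetical transport using requests; url and token are placeholders."""
    import requests  # lazy import so the retry logic has no hard dependency

    response = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
        timeout=30,
    )
    try:
        body = response.json()
    except ValueError:  # body was not JSON (for example an HTML error page)
        body = response.text
    return response.status_code, body
```

Usage would be query_with_retry(send_with_requests, {"inputs": "I love this product!"}). Passing the transport in as a function keeps the backoff logic easy to test without a live endpoint.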
Step 5: Optimize for Production
After initial deployment, I always optimize. Here are my go-to strategies:
5.1 Enable Batching
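In the endpoint settings, set max_batch_size to 8 or 16 if your serving container exposes that option. Batching can substantially improve throughput for concurrent requests, at the cost of a little extra latency per request while a batch fills.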
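Batching can also be done on the client side by sending several texts per request. The sketch below assumes the endpoint accepts a list under the "inputs" key, which is worth verifying for your handler; chunked and build_payloads are illustrative helper names.

```python
def chunked(items, size):
    """Yield consecutive slices of items, each with at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_payloads(texts, batch_size=8):
    """Group texts into request bodies of the form {"inputs": [...]}."""
    return [{"inputs": batch} for batch in chunked(list(texts), batch_size)]


# 20 texts with a batch size of 8 become three requests: 8 + 8 + 4.
payloads = build_payloads([f"review {i}" for i in range(20)], batch_size=8)
print([len(p["inputs"]) for p in payloads])
```

Each payload can then be posted with the same requests.post call used for single inputs.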
5.2 Use ONNX Runtime
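Convert your model to ONNX for faster inference. I typically see a 2-3x speedup, but the gain depends on the model and hardware, so measure it on your own setup:
# Requires: pip install optimum[onnxruntime]
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
# export=True converts the PyTorch weights to ONNX while loading
model = ORTModelForSequenceClassification.from_pretrained("your-username/my-sentiment-model", export=True)
model.save_pretrained("./onnx-model")
# Save the tokenizer alongside so the ONNX folder is self-contained
AutoTokenizer.from_pretrained("your-username/my-sentiment-model").save_pretrained("./onnx-model")
# Upload the ONNX version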
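Before swapping models, it helps to confirm the speedup on your own hardware. This is a minimal timing sketch using only the standard library; median_latency_ms and speedup are hypothetical helper names, and the run and warm-up counts are arbitrary defaults.

```python
import statistics
import time


def median_latency_ms(fn, runs=20, warmup=3):
    """Median wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Stand-in workload so the sketch runs on its own; replace it with
# lambda: model(**inputs) for the PyTorch and ONNX models in turn.
print(median_latency_ms(lambda: sum(range(1000)), runs=5, warmup=1))
```

Comparing medians rather than means keeps a single slow outlier from skewing the result.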
5.3 Set Up Caching
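For models with deterministic outputs (like classification), enable response caching in the endpoint settings where available, or cache in front of the endpoint, so repeated queries skip inference. In my deployments this cut latency for repeated queries by around 80%, though the benefit depends on how often inputs repeat.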
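If you would rather cache in front of the endpoint, a small in-process cache is enough to illustrate the idea. This is a sketch, not a production cache: ResponseCache is a hypothetical class, it evicts the oldest entry when full, and it treats inputs that differ only in whitespace as identical, which is an assumption that may not suit every model.

```python
import hashlib


class ResponseCache:
    """Wrap a predict(text) callable and reuse results for repeated inputs."""

    def __init__(self, predict, max_entries=1024):
        self._predict = predict
        self._max_entries = max_entries
        self._store = {}  # dicts keep insertion order, so the first key is the oldest
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text):
        # Assumption: collapsing whitespace does not change the prediction.
        normalized = " ".join(text.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def __call__(self, text):
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._predict(text)
        if len(self._store) >= self._max_entries:
            self._store.pop(next(iter(self._store)))  # evict the oldest entry
        self._store[key] = result
        return result


# Stand-in predictor so the sketch runs on its own.
cached = ResponseCache(lambda text: {"label": "POSITIVE", "text_length": len(text)})
cached("I love this product!")
cached("I love this product!")
print(cached.hits, cached.misses)
```

The hit and miss counters give a quick way to check whether your traffic repeats often enough for caching to pay off.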
Step 6: Monitor and Scale
Hugging Face provides built-in monitoring. I always check these metrics:
- P99 latency: Should be under 500ms for real-time apps
- Error rate: Keep below 1%
- CPU/GPU utilization: Scale up if consistently above 80%
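The first two thresholds can also be checked from your own request logs. The sketch below uses a nearest-rank percentile and counts only 5xx responses as errors; both are my assumptions, and health_report is a hypothetical helper.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def health_report(latencies_ms, statuses, p99_limit_ms=500, error_limit=0.01):
    """Compare P99 latency and error rate against the thresholds above."""
    p99 = percentile(latencies_ms, 99)
    # Assumption: only server-side (5xx) responses count as errors.
    error_rate = sum(1 for s in statuses if s >= 500) / len(statuses)
    return {
        "p99_ms": p99,
        "error_rate": error_rate,
        "p99_ok": p99 < p99_limit_ms,
        "errors_ok": error_rate < error_limit,
    }


# Made-up samples: 99 fast requests and one slow one, all successful.
print(health_report([100] * 99 + [900], [200] * 100))
```

Running this over a sliding window of recent requests gives a simple basis for the alerts on latency and error rate.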

Common Pitfall: Don't set min_replicas too high. I once left it at 5 and got a $200 bill for a weekend of idle endpoints. Start with 0 and let auto-scaling handle traffic.
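The arithmetic behind that bill is worth running before choosing a minimum replica count. In this sketch I assume the weekend was 48 hours; the hourly rate is derived from my $200 figure rather than taken from any price list, and both function names are illustrative.

```python
def idle_cost(replicas, hourly_rate_usd, hours):
    """Cost of keeping replicas running for hours, busy or not."""
    return replicas * hourly_rate_usd * hours


def implied_hourly_rate(total_usd, replicas, hours):
    """Per-replica hourly rate implied by a bill."""
    return total_usd / (replicas * hours)


# 5 replicas over an assumed 48-hour weekend is 240 replica-hours.
rate = implied_hourly_rate(200, replicas=5, hours=48)
print(round(rate, 2))                    # per-replica hourly rate in USD
print(idle_cost(0, rate, 48))            # min replicas of 0: nothing billed while idle
print(round(idle_cost(1, rate, 48), 2))  # one always-on replica, same weekend
```

Plugging in the actual hourly price of your instance type shows what each always-on replica costs per idle day.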
Step 7: Alternative Deployment with Spaces
For smaller projects or demos, I use Hugging Face Spaces. It's simpler but less scalable:
- Go to huggingface.co/spaces
- Create a new Space (choose Gradio or Streamlit)
- Add a requirements.txt with your dependencies
- Write a simple inference script:
import gradio as gr
from transformers import pipeline
model = pipeline("sentiment-analysis", model="your-username/my-sentiment-model")
def predict(text):
    result = model(text)[0]
    # The "label" output expects a {class name: confidence} mapping
    return {result["label"]: result["score"]}
gr.Interface(fn=predict, inputs="text", outputs="label").launch()
Conclusion
Deploying models with Hugging Face has transformed how I work. Here are my key takeaways:
- Always test locally first – It saves hours of debugging.
- Use Inference Endpoints for production – They handle scaling, load balancing, and monitoring out of the box.
- Optimize with ONNX and batching – This can cut costs by as much as 50% while improving performance, though the gain depends on your model and traffic.
- Monitor aggressively – Set up alerts for latency and error rate spikes.
- Start small, scale smart – Use min_replicas=0 and auto-scaling to avoid surprise bills.
The Hugging Face ecosystem eliminates most of the DevOps headaches associated with model deployment. In my experience, what used to take a week with Kubernetes and custom APIs now takes a few hours. Give it a try with your next model—I think you'll be amazed at how seamless it feels.