Deploying a quantized 70B open model on Parel
Hosted models (gpt-5.4, claude-opus-4-7, qwen3-max) cover most needs. Sometimes
you specifically need a 70B+ open model: for data residency, for a
fine-tuned checkpoint, or for a quantization variant of a recent open release that
isn't on Instant API yet. This guide deploys
TheBloke/Llama-3.3-70B-Instruct-AWQ via Parel BYOM (bring your own model) in 8-12 minutes.
The same flow applies to Qwen3-72B-AWQ, DeepSeek-V3.5-AWQ, or your own fine-tune.
The byom-DEPLOY_ID chat endpoint is ready in 8-12 minutes for a 70B AWQ model.
Parel routes the deployment to RunPod / Vast / Modal automatically based on
capacity and shuts it down automatically when idle. The endpoint is OpenAI SDK
compatible.
When BYOM, when hosted?
Hosted (Instant API) is enough
You need a general-purpose model. qwen3-max,
llama-3.3-70b and gpt-5.4 are all there; no setup,
no idle cost, billed per token. Preferred for POCs and most production.
BYOM is required
You have a fine-tuned checkpoint, you use a gated model (Llama-3 access), you need fixed capacity (no idle shutdown risk), or you need to control AWQ/GPTQ/FP8 quantization yourself.
Popular open 70B+ models for BYOM:
| Model | Size | Description | Min GPU |
|---|---|---|---|
| TheBloke/Llama-3.3-70B-Instruct-AWQ | 35 GB | Llama-3.3 70B 4-bit AWQ quantized | RTX A6000 48GB |
| Qwen/Qwen3-72B-Instruct-AWQ | 38 GB | Qwen3 flagship, AWQ | RTX A6000 48GB |
| deepseek-ai/DeepSeek-V3.5-AWQ | 180 GB | DeepSeek V3.5 MoE, AWQ | A100 4×80GB (TP=4) |
| meta-llama/Llama-3.3-70B-Instruct | 140 GB | Full fp16, max quality | A100 2×80GB (TP=2) |
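The sizes in the table follow from parameter count times bits per weight. The sketch below reproduces that arithmetic, assuming decimal gigabytes and ignoring quantization metadata (scales, zero points) and the KV cache, so real checkpoints come out slightly larger. The function names and the 20% headroom factor are illustrative assumptions, not part of Parel's API.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight size in decimal GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(weight_gb: float, vram_gb: float, headroom: float = 1.2) -> bool:
    """True if the weights plus an assumed 20% headroom fit in the given VRAM."""
    return weight_gb * headroom <= vram_gb


fp16_gb = estimate_weight_gb(70, 16)  # 140.0
awq_gb = estimate_weight_gb(70, 4)    # 35.0

print(fits_in_vram(awq_gb, 48))   # True: 4-bit 70B fits a 48 GB card
print(fits_in_vram(fp16_gb, 80))  # False: fp16 70B needs more than one 80 GB GPU
```

The headroom value is a rough stand-in for activations and KV cache; long contexts or large batch sizes can need considerably more.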
1. Preview: pre-flight before deploy
The preview API queries Hugging Face for the model's metadata
(config.json, safetensors.index.json) without
creating any pod. It returns the compatible GPU tiers, an estimated ETA per provider, and the hourly cost per tier.
# 1) Pre-flight check (does NOT deploy)
curl -X POST https://api.parel.cloud/v1/deployments/preview \
-H "Authorization: Bearer $PAREL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"huggingface_id": "TheBloke/Llama-3.3-70B-Instruct-AWQ"
}'
# Response
{
"validator": "ok",
"model_size_gb": 35.4,
"architecture": "llama",
"weight_format": "awq",
"engine": "vllm",
"vllm_image": "vllm/vllm-openai:v0.10.1.1",
"compatible_tiers": ["rtx_a6000_48gb", "a100_80gb", "h100_80gb"],
"estimated_eta_seconds": {"runpod": 540, "vastai": 720, "modal": 410},
"estimated_hourly_usd": {"rtx_a6000_48gb": 0.68, "a100_80gb": 1.85, "h100_80gb": 3.20}
}
If tier_capacity_exceeded is returned, see the "Oversized model"
section. If weight_format_unsupported is returned, the model is
in GGUF or MLX format; BYOM currently supports only safetensors / bin / pytorch.
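The preview response is enough to pick a tier and sanity-check a budget before deploying. The sketch below parses the example response shown above (trimmed to the fields it uses), selects the cheapest compatible tier, and estimates how long a budget_limit_usd would last. It assumes the budget is consumed only by the hourly GPU price; the helper names are hypothetical, and the shape of error responses is not shown in this guide, so only validator == "ok" is treated as success.

```python
import json

# Preview response from the example above, trimmed to the fields used here.
PREVIEW_JSON = """
{
  "validator": "ok",
  "model_size_gb": 35.4,
  "compatible_tiers": ["rtx_a6000_48gb", "a100_80gb", "h100_80gb"],
  "estimated_eta_seconds": {"runpod": 540, "vastai": 720, "modal": 410},
  "estimated_hourly_usd": {"rtx_a6000_48gb": 0.68, "a100_80gb": 1.85, "h100_80gb": 3.20}
}
"""


def cheapest_tier(preview: dict) -> tuple[str, float]:
    """Return (tier, hourly price) for the lowest-priced compatible tier."""
    prices = preview["estimated_hourly_usd"]
    tier = min(preview["compatible_tiers"], key=lambda t: prices[t])
    return tier, prices[tier]


def budget_runway_hours(budget_limit_usd: float, hourly_usd: float) -> float:
    """Hours of runtime a budget covers, assuming only hourly GPU billing."""
    return budget_limit_usd / hourly_usd


def always_on_monthly_usd(hourly_usd: float, days: int = 30) -> float:
    """Cost of keeping the endpoint up around the clock for `days` days."""
    return hourly_usd * 24 * days


preview = json.loads(PREVIEW_JSON)
if preview["validator"] != "ok":
    raise SystemExit(f"preview failed: {preview['validator']}")

tier, price = cheapest_tier(preview)
print(tier, price)                                  # rtx_a6000_48gb 0.68
print(round(budget_runway_hours(10.00, price), 1))  # 14.7
print(round(always_on_monthly_usd(price), 2))       # 489.6
```

With the example numbers, a 10 USD budget covers roughly 14.7 hours on the cheapest tier, while an always-on endpoint on the same tier costs about 490 USD per 30 days.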
2. Deploy + status poll
The deploy call enqueues an SQS task; a Lambda worker then creates the pod on the
provider. The API returns 202 immediately, and you poll for status.
idle_timeout_minutes and budget_limit_usd are
required; the deploy is rejected without them.
# 2) Deploy — idle and budget required
curl -X POST https://api.parel.cloud/v1/deployments \
-H "Authorization: Bearer $PAREL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"huggingface_id": "TheBloke/Llama-3.3-70B-Instruct-AWQ",
"tier": "rtx_a6000_48gb",
"idle_timeout_minutes": 15,
"budget_limit_usd": 10.00,
"name": "llama-70b-poc"
}'
# Response
# {"deployment_id": "d2k7x9", "status": "creating", "poll_url": "..."}
# 3) Poll status (becomes "running" in 8-12 minutes for 70B AWQ)
parel deployments status d2k7x9
# Or via HTTP
curl https://api.parel.cloud/v1/deployments/d2k7x9 \
-H "Authorization: Bearer $PAREL_API_KEY"
# creating -> pulling_image -> downloading_weights -> starting -> running
70B AWQ takes ~8-12 minutes. The ETA is in the preview response; the actual time
is usually within ±30% of it. If status returns error, the provider pools are out
of capacity: Parel tried 3 providers and all failed. Retry in 5-15 minutes.
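The polling step can be scripted with a small loop that stops on running, on error, or once the preview ETA plus 30% has passed. This is a minimal sketch: fetch_status is a caller-supplied function (for example, one that GETs /v1/deployments/DEPLOY_ID and returns the status field), and the "timeout" return value is a local sentinel, not a Parel status.

```python
import time
from typing import Callable

# Status progression as listed above; "error" is the failure state.
STATES = ["creating", "pulling_image", "downloading_weights", "starting", "running"]


def wait_until_running(
    fetch_status: Callable[[], str],
    eta_seconds: int,
    poll_interval: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until 'running' or 'error', or until ETA + 30% has elapsed."""
    deadline = clock() + eta_seconds * 130 // 100  # 540 s -> 702 s
    while True:
        status = fetch_status()
        if status in ("running", "error"):
            return status
        if clock() >= deadline:
            return "timeout"  # local sentinel, not an API status
        sleep(poll_interval)


# Dry run against a fake status source that walks through every state.
fake = iter(STATES)
print(wait_until_running(lambda: next(fake), eta_seconds=540, sleep=lambda s: None))
# running
```

Injecting sleep and clock keeps the loop testable without waiting; in real use the defaults apply and only fetch_status needs to be supplied.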
3. Smoke test
Once status is running the model is callable via the OpenAI Chat
Completions API as byom-DEPLOY_ID. Same code as for any Instant
model:
# 4) Smoke test the running endpoint
curl https://api.parel.cloud/v1/chat/completions \
-H "Authorization: Bearer $PAREL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "byom-d2k7x9",
"messages": [
{"role": "user", "content": "Explain JWT in two sentences."}
],
"max_tokens": 256,
"temperature": 0.3
}'
# 5) ALWAYS delete when done
parel deployments delete d2k7x9
Test the response quality. AWQ quantization typically loses 1-2% on standard benchmarks but cuts weight memory by about 4× relative to fp16. For most production workloads the trade-off is worth it.
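For scripted quality checks, the same smoke-test request can be built in Python. The sketch below uses only the standard library and mirrors the curl call; build_chat_request, extract_reply, and send are hypothetical helper names, and the response parsing assumes the standard OpenAI chat completion shape (choices[0].message.content).

```python
import json
import urllib.request

BASE_URL = "https://api.parel.cloud/v1"


def build_chat_request(deployment_id: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build the POST request for a BYOM deployment's chat endpoint."""
    body = {
        "model": f"byom-{deployment_id}",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.3,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style chat completion."""
    return response_body["choices"][0]["message"]["content"]


def send(req: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply (needs network access)."""
    with urllib.request.urlopen(req, timeout=60) as resp:
        return extract_reply(json.load(resp))
```

A call would look like send(build_chat_request("d2k7x9", "Explain JWT in two sentences.", os.environ["PAREL_API_KEY"])), with os imported. Running a fixed set of prompts this way against both the BYOM endpoint and a hosted model gives a side-by-side quality comparison.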
Oversized model: 200B+ doesn't fit
For a 200B+ MoE model (DeepSeek-V3, Qwen2.5-Max), preview returns
tier_capacity_exceeded. Three strategies:
# Model doesn't fit on a single GPU?
# If preview returns tier_capacity_exceeded, three options:
#
# 1) Find an AWQ / GPTQ / FP8 quantized variant
# meta-llama/Llama-3.3-70B-Instruct (140GB)
# -> TheBloke/Llama-3.3-70B-Instruct-AWQ (35GB) -> rtx_a6000_48gb fits
#
# 2) Multi-GPU tier (TP=2 or TP=4)
# rtx3090_2x_48gb -> 2x 24GB combined, $0.42/hr
# a100_4x_320gb -> 4x 80GB combined, $7.40/hr
#
# 3) Larger single GPU (A100 80GB / H100 80GB)
Decision
BYOM ship
Output quality is acceptable for production and latency is within the POC budget.
For production, set idle_timeout_minutes=0 and a higher
budget_limit_usd. The endpoint stays up and is billed hourly.
Back to Instant
The model is already in Instant API (e.g. llama-3.3-70b). BYOM adds operational overhead with no quality/cost win. Switch back to Instant; reserve BYOM for your fine-tune.
Back to a hosted showcase
The open model can't clear the bar (tool use, long context, code editing).
claude-opus-4-7 or gpt-5.4 is the real answer. A
"no" from a BYOM POC is also a valuable result.