BYOM · Hands-on

Deploying a quantized 70B open model on Parel

Hosted models (gpt-5.4, claude-opus-4-7, qwen3-max) cover most needs. Sometimes you specifically need a 70B+ open model: data control (data residency), a fine-tuned checkpoint, or a quantization variant of a recent open release that isn't on Instant API yet. In this guide we deploy TheBloke/Llama-3.3-70B-Instruct-AWQ via Parel BYOM in 8-12 minutes. The same flow applies to Qwen3-72B-AWQ, DeepSeek-V3.5-AWQ or your own fine-tune.

Time: 8 min read · 30 min to apply
Role: Backend, ML eng.
Cost: $0.68-$3.20/hr
Send the Hugging Face ID to Parel's preview API and get back compatible GPU tiers and prices. Confirm the deploy, and for a 70B AWQ model a byom-DEPLOY_ID chat endpoint is ready in 8-12 minutes. Parel routes to RunPod / Vast / Modal automatically based on capacity, shuts the deployment down when idle, and exposes an OpenAI SDK compatible endpoint.

When BYOM, when hosted?

Hosted (Instant API) is enough

You need a general-purpose model. qwen3-max, llama-3.3-70b and gpt-5.4 are all there, with no setup and no idle cost, billed per token. Preferred for POCs and most production.

BYOM is required

You have a fine-tuned checkpoint, you use a gated model (Llama-3 access), you need fixed capacity (no idle shutdown risk), or you need to control AWQ/GPTQ/FP8 quantization yourself.

Popular open 70B+ models for BYOM:

| Model | Size | Description | Min GPU |
| --- | --- | --- | --- |
| TheBloke/Llama-3.3-70B-Instruct-AWQ | 35 GB | Llama-3.3 70B, 4-bit AWQ quantized | RTX A6000 48GB |
| Qwen/Qwen3-72B-Instruct-AWQ | 38 GB | Qwen3 flagship, AWQ | RTX A6000 48GB |
| deepseek-ai/DeepSeek-V3.5-AWQ | 180 GB | DeepSeek V3.5 MoE, AWQ | 4× A100 80GB (TP=4) |
| meta-llama/Llama-3.3-70B-Instruct | 140 GB | Full fp16, max quality | 2× A100 80GB (TP=2) |
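The sizes in the table follow from parameter count times bits per weight. A rough sketch of that arithmetic, assuming the weights dominate the footprint and ignoring KV cache, activations, and quantization metadata such as scales and zero points (the function name is illustrative, not part of Parel):

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 10^9 bytes)."""
    # params * bits / 8 gives bytes; with params in billions the result is in GB.
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    print(weight_size_gb(70, 16))  # fp16: 140.0
    print(weight_size_gb(70, 4))   # 4-bit AWQ: 35.0
```

The real footprint is somewhat higher than this estimate, which is why a 35 GB checkpoint is paired with a 48 GB card rather than a 40 GB one.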

1. Preview: pre-flight before deploy

The preview API fetches the model's metadata (config.json, safetensors.index.json) from Hugging Face without creating a pod. It returns the compatible GPU tiers, an ETA per provider, and an estimated hourly cost per tier.

POST /v1/deployments/preview
# 1) Pre-flight check (does NOT deploy)
curl -X POST https://api.parel.cloud/v1/deployments/preview \
  -H "Authorization: Bearer $PAREL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "huggingface_id": "TheBloke/Llama-3.3-70B-Instruct-AWQ"
  }'
preview response
{
  "validator": "ok",
  "model_size_gb": 35.4,
  "architecture": "llama",
  "weight_format": "awq",
  "engine": "vllm",
  "vllm_image": "vllm/vllm-openai:v0.10.1.1",
  "compatible_tiers": ["rtx_a6000_48gb", "a100_80gb", "h100_80gb"],
  "estimated_eta_seconds": {"runpod": 540, "vastai": 720, "modal": 410},
  "estimated_hourly_usd": {"rtx_a6000_48gb": 0.68, "a100_80gb": 1.85, "h100_80gb": 3.20}
}

If tier_capacity_exceeded is returned, see the "Oversized model" section. If weight_format_unsupported is returned, the model is in GGUF or MLX format; BYOM currently supports only safetensors / bin / pytorch.
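One way to act on the preview response is to pick the cheapest compatible tier and note which provider has the shortest ETA. The sketch below runs on a trimmed copy of the sample response shown above. The field names come from that sample; the helper names are illustrative, and a real response may carry more fields.

```python
import json

SAMPLE_PREVIEW = json.loads("""
{
  "validator": "ok",
  "model_size_gb": 35.4,
  "compatible_tiers": ["rtx_a6000_48gb", "a100_80gb", "h100_80gb"],
  "estimated_eta_seconds": {"runpod": 540, "vastai": 720, "modal": 410},
  "estimated_hourly_usd": {"rtx_a6000_48gb": 0.68, "a100_80gb": 1.85, "h100_80gb": 3.20}
}
""")


def cheapest_tier(preview: dict) -> tuple[str, float]:
    """Return (tier, hourly_usd) for the lowest-priced compatible tier."""
    prices = preview["estimated_hourly_usd"]
    tier = min(preview["compatible_tiers"], key=lambda t: prices[t])
    return tier, prices[tier]


def fastest_provider(preview: dict) -> tuple[str, int]:
    """Return (provider, eta_seconds) for the shortest estimated ETA."""
    etas = preview["estimated_eta_seconds"]
    provider = min(etas, key=etas.get)
    return provider, etas[provider]


print(cheapest_tier(SAMPLE_PREVIEW))     # ('rtx_a6000_48gb', 0.68)
print(fastest_provider(SAMPLE_PREVIEW))  # ('modal', 410)
```

Since Parel routes to a provider automatically, the fastest-provider value is informational only; it is mainly useful for setting a polling timeout.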

2. Deploy + status poll

The deploy call enqueues an SQS task, and a Lambda worker creates the pod on the provider. The API returns 202 immediately, and you poll for status. Both idle_timeout_minutes and budget_limit_usd are required; a deploy request without them is rejected.

POST /v1/deployments
# 2) Deploy (idle_timeout_minutes and budget_limit_usd are required)
curl -X POST https://api.parel.cloud/v1/deployments \
  -H "Authorization: Bearer $PAREL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "huggingface_id": "TheBloke/Llama-3.3-70B-Instruct-AWQ",
    "tier": "rtx_a6000_48gb",
    "idle_timeout_minutes": 15,
    "budget_limit_usd": 10.00,
    "name": "llama-70b-poc"
  }'

# Response
# {"deployment_id": "d2k7x9", "status": "creating", "poll_url": "..."}
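Because a deploy without idle_timeout_minutes or budget_limit_usd is rejected, it can help to check the payload client-side and to see how long a budget lasts at a given hourly price. A minimal sketch: the helper names are hypothetical, and it assumes the budget is spent purely at the flat hourly rate from the preview response.

```python
REQUIRED_FIELDS = ("idle_timeout_minutes", "budget_limit_usd")


def missing_required(payload: dict) -> list[str]:
    """List the required deploy fields that are absent from the payload."""
    return [field for field in REQUIRED_FIELDS if field not in payload]


def budget_runway_hours(budget_limit_usd: float, hourly_usd: float) -> float:
    """Hours of runtime the budget covers at a flat hourly price (assumption)."""
    return budget_limit_usd / hourly_usd


payload = {
    "huggingface_id": "TheBloke/Llama-3.3-70B-Instruct-AWQ",
    "tier": "rtx_a6000_48gb",
    "idle_timeout_minutes": 15,
    "budget_limit_usd": 10.00,
    "name": "llama-70b-poc",
}

print(missing_required(payload))                   # []
print(round(budget_runway_hours(10.00, 0.68), 1))  # 14.7
```

At $0.68/hr, the $10 budget in the example covers a little under 15 hours of uptime, which is ample for a POC.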
status polling
# 3) Poll status (becomes "running" in 8-12 minutes for 70B AWQ)
parel deployments status d2k7x9

# Or via HTTP
curl https://api.parel.cloud/v1/deployments/d2k7x9 \
  -H "Authorization: Bearer $PAREL_API_KEY"

# creating -> pulling_image -> downloading_weights -> starting -> running

A 70B AWQ model takes roughly 8-12 minutes. The ETA comes from the preview response, and the actual time is typically within ±30% of it. A status of error means the provider pool is out of capacity: Parel tried all three providers and each one failed. Retry in 5-15 minutes.
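The polling step can be sketched as a loop that stops on running or error and gives up once the ETA plus 30% slack has passed. The status fetcher is injected so the loop can be exercised without network access. In real use it would GET /v1/deployments/DEPLOY_ID and return the status field; that the GET response carries such a field is an assumption based on the deploy response above. The function name and poll interval are illustrative.

```python
import time
from typing import Callable

TERMINAL_STATES = {"running", "error"}


def wait_until_ready(
    fetch_status: Callable[[], str],
    eta_seconds: float,
    poll_interval: float = 15.0,
    slack: float = 1.3,  # the guide quotes ETAs as accurate to within 30%
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status is terminal; raise TimeoutError past ETA * slack."""
    deadline = clock() + eta_seconds * slack
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {eta_seconds * slack:.0f}s")
        sleep(poll_interval)


# Dry run with a scripted status sequence instead of real HTTP calls.
scripted = iter(["creating", "pulling_image", "downloading_weights", "starting", "running"])
print(wait_until_ready(lambda: next(scripted), eta_seconds=540, sleep=lambda _s: None))  # running
```

Returning "error" rather than raising leaves the retry decision to the caller, which fits the advice to retry after 5-15 minutes.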

3. Smoke test

Once status is running the model is callable via the OpenAI Chat Completions API as byom-DEPLOY_ID. Same code as for any Instant model:

chat smoke + delete
# 4) Smoke test the running endpoint
curl https://api.parel.cloud/v1/chat/completions \
  -H "Authorization: Bearer $PAREL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "byom-d2k7x9",
    "messages": [
      {"role": "user", "content": "Explain JWT in two sentences."}
    ],
    "max_tokens": 256,
    "temperature": 0.3
  }'

# 5) ALWAYS delete when done
parel deployments delete d2k7x9

Test the response quality. 4-bit AWQ quantization typically loses 1-2% on standard benchmarks but cuts weight memory by roughly 4× relative to fp16. For most production workloads the trade-off is worth it.
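To check response quality on more than one prompt, the curl smoke test can be scripted. The sketch below builds the same request with Python's standard library. The response parsing assumes the standard OpenAI Chat Completions shape (choices[0].message.content), which the guide's compatibility claim implies but does not show; the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "https://api.parel.cloud/v1"


def build_chat_request(deployment_id: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a Chat Completions request for a BYOM deployment."""
    body = {
        "model": f"byom-{deployment_id}",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "temperature": 0.3,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def first_reply(request: urllib.request.Request) -> str:
    """Send the request and return the first choice's text (needs network access)."""
    with urllib.request.urlopen(request, timeout=60) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]


req = build_chat_request("d2k7x9", "Explain JWT in two sentences.", "YOUR_API_KEY")
print(req.get_method(), req.full_url)  # POST https://api.parel.cloud/v1/chat/completions
```

Calling first_reply(req) with a real key sends the request. Looping over a small prompt set and comparing the answers with a hosted model gives a quick read on what the quantization costs.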

Oversized model: 200B+ doesn't fit

For a 200B+ MoE model (DeepSeek-V3, Qwen2.5-Max), preview returns tier_capacity_exceeded. Three strategies:

oversized strategies
# Model doesn't fit on a single GPU?
# If preview returns tier_capacity_exceeded, three options:
#
# 1) Find an AWQ / GPTQ / FP8 quantized variant
#    meta-llama/Llama-3.3-70B-Instruct (140GB)
#         -> TheBloke/Llama-3.3-70B-Instruct-AWQ (35GB) -> rtx_a6000_48gb fits
#
# 2) Multi-GPU tier (TP=2 or TP=4)
#    rtx3090_2x_48gb -> 2x 24GB combined, $0.42/hr
#    a100_4x_320gb   -> 4x 80GB combined, $7.40/hr
#
# 3) Larger single GPU (A100 80GB / H100 80GB)
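As a rough guide to which option applies, the sketch below picks the cheapest tier whose combined VRAM covers the weights plus some headroom. The tier names and prices are the ones quoted in this guide. The 20% headroom for KV cache and runtime overhead is an assumption, and the preview API remains the authority on what actually fits.

```python
from typing import Optional

# (combined VRAM in GB, hourly USD) for the tiers and prices quoted in this guide.
TIERS = {
    "rtx3090_2x_48gb": (48, 0.42),
    "rtx_a6000_48gb": (48, 0.68),
    "a100_80gb": (80, 1.85),
    "h100_80gb": (80, 3.20),
    "a100_4x_320gb": (320, 7.40),
}


def cheapest_fitting_tier(model_size_gb: float, headroom: float = 1.2) -> Optional[str]:
    """Cheapest tier whose VRAM covers weights * headroom, or None if none fits."""
    needed = model_size_gb * headroom  # headroom factor is an assumption
    fitting = [(price, name) for name, (vram, price) in TIERS.items() if vram >= needed]
    return min(fitting)[1] if fitting else None


print(cheapest_fitting_tier(35.4))  # rtx3090_2x_48gb
print(cheapest_fitting_tier(140))   # a100_4x_320gb
print(cheapest_fitting_tier(400))   # None
```

A None result corresponds to the tier_capacity_exceeded case: no listed tier is large enough, so a smaller quantized variant is the only option left.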

Decision

BYOM ship

Output quality is acceptable for production and latency is within the POC budget. For production, set idle_timeout_minutes to 0 and raise budget_limit_usd. The endpoint then stays up and is billed hourly.

Back to Instant

The model is already in Instant API (e.g. llama-3.3-70b). BYOM adds operational overhead with no quality/cost win. Switch back to Instant; reserve BYOM for your fine-tune.

Back to a hosted flagship

The open model can't clear the bar (tool use, long context, code editing). claude-opus-4-7 or gpt-5.4 is the real answer. A "no" from a BYOM POC is also a valuable result.
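For the BYOM ship path, it is worth putting a number on always-on hourly billing before committing. A back-of-the-envelope sketch, assuming a flat hourly price and a 30-day month (the function name is illustrative):

```python
def monthly_cost_usd(hourly_usd: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Cost of keeping a deployment up for a month at a flat hourly price."""
    return round(hourly_usd * hours_per_day * days, 2)


print(monthly_cost_usd(0.68))                   # 489.6
print(monthly_cost_usd(3.20))                   # 2304.0
print(monthly_cost_usd(0.68, hours_per_day=8))  # 163.2
```

Comparing these figures with expected per-token spend on Instant API is one way to choose between the first two outcomes.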