How to run an AI POC with Parel: a 1-day hands-on guide
This page is hands-on, not conceptual. With your Parel API key, the ready CSV below and the Python script you'll copy-paste, you'll have a working POC that compares three models side by side in about an hour. By the end you'll hold a decision table you can present to your manager.
To adapt this to your own use case (summarization, extraction, FAQ, QA), only the prompt and the test set change. The steps stay the same.
Step 0: What do you need?
Everything required for the POC:
- Parel account + API key. Generate one at app.parel.cloud/api-keys (format parel_pk_...). New users get a $1 promo credit on signup.
- $3 prepaid is enough. 50 tickets × 3 models = 150 requests totalling about $0.30. The Parel Compare UI adds ~$0.05 if you use it.
- Python 3.10+ and the openai package. Parel is OpenAI SDK compatible; no extra library needed.
- A 50-example test set. An example CSV is in this guide; or build one from your own data (5-10 minutes of work).
# Python 3.10+ and the openai package are enough
pip install openai
# Set your Parel API key as an env var
export PAREL_API_KEY="parel_pk_xxxxxxxxxxxx"
Step 1: Prepare the test set
The test set is the foundation of the POC. The CSV below is an example: either expand these 6 rows to 50 (sampling real tickets from your system) or rebuild a 50-row CSV in the same format for your own task. Balance three difficulty levels: easy (category is obvious), medium (close to two categories) and hard (sarcastic tone, missing info, mixed topics).
ticket_text,expected_category,priority
"I was charged twice for the same order.",billing,high
"My API key returns 401 in production.",technical,high
"How does invoicing work if we upgrade to Pro?",sales,medium
"I want to close my account, what is the process?",cancellation,low
"My webhooks are randomly returning 502.",bug,high
"Thanks team, you've been very helpful.",other,low
Every row needs a correct label (the expected_category column). That's a 30-minute job, but it's what makes the POC trustworthy.
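Before spending credit, it can help to sanity-check the CSV. The sketch below is one possible check, not something Parel provides: validate_test_set and ALLOWED are hypothetical names, the allowed categories are taken from the example rows and from the prompt in step 3, and the inline sample (with a deliberate typo in the last label) stands in for your file.

```python
import csv
import io
from collections import Counter

# Categories used by the example CSV and the classification prompt
ALLOWED = {"billing", "technical", "sales", "bug", "cancellation", "other"}

def validate_test_set(csv_text, expected_rows=50):
    """Return (problems, label_counts) for a test-set CSV given as text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for i, row in enumerate(rows, start=2):  # line 1 is the header
        if not (row.get("ticket_text") or "").strip():
            problems.append(f"line {i}: empty ticket_text")
        if row.get("expected_category") not in ALLOWED:
            problems.append(f"line {i}: unknown category {row.get('expected_category')!r}")
    counts = Counter(row.get("expected_category") for row in rows)
    return problems, counts

# Small inline sample; the last label is misspelled on purpose
sample = '''ticket_text,expected_category,priority
"I was charged twice for the same order.",billing,high
"My API key returns 401 in production.",technical,high
"Thanks team, you've been very helpful.",othr,low
'''
problems, counts = validate_test_set(sample, expected_rows=3)
print(problems)
print(counts)
```

For your real file, pass open("support_50.csv", encoding="utf-8").read() with the default expected_rows=50. The label counts also show at a glance whether one category dominates the set.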
Step 2: Smoke test (verify connectivity)
Before running the full runner, verify your connection to Parel with a single request. This step takes 30 seconds and confirms your API key works and you have credit.
# Single request to verify the connection
curl https://api.parel.cloud/v1/chat/completions \
-H "Authorization: Bearer $PAREL_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3-max",
"messages": [
{"role": "user", "content": "One sentence: which model are you?"}
],
"max_tokens": 64
}'
# Expected: 200 + JSON (choices[0].message.content)
# 401 = API key empty or wrong. 404 = wrong model name.
If you get 200 with a model name and an English sentence in the response,
you're ready. 401 means $PAREL_API_KEY is empty or wrong; 404
means a wrong model name (e.g. qwen3.max instead of
qwen3-max).
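If you prefer to script the check, the sketch below builds the same request with Python's standard library. build_smoke_request, explain_status and run_smoke_test are hypothetical helper names; the URL, model and payload are copied from the curl command, and the status hints only restate the ones above.

```python
import json
import os
import urllib.error
import urllib.request

URL = "https://api.parel.cloud/v1/chat/completions"

def build_smoke_request(api_key, model="qwen3-max"):
    """Build (but do not send) the same POST request as the curl command."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": "One sentence: which model are you?"}],
        "max_tokens": 64,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )

def explain_status(status):
    """Map the status codes discussed in this step to a short hint."""
    hints = {
        200: "ready",
        401: "PAREL_API_KEY is empty or wrong",
        404: "wrong model name",
    }
    return hints.get(status, f"unexpected status {status}")

def run_smoke_test():
    """Send the request and return a hint; this performs a real network call."""
    request = build_smoke_request(os.environ.get("PAREL_API_KEY", ""))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return explain_status(response.status)
    except urllib.error.HTTPError as err:
        return explain_status(err.code)
```

After exporting the key, print(run_smoke_test()) gives a one-line verdict. Any status other than the three listed is reported as unexpected rather than interpreted.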
Step 3: The Python runner that tests three models side by side
Save the file below as run_eval.py, save the CSV from step 1 as support_50.csv in the same folder, and run it. The script sends the 50 tickets to each of the 3 models sequentially and measures accuracy, p95 latency and token usage for each. Notice that we test models from three different providers with a single API key: this is Parel's most practical advantage.
# run_eval.py: run 50 tickets through 3 models side by side
import csv, json, os, time, statistics
from openai import OpenAI

client = OpenAI(
    # Read the key from the environment; Python does not expand "${...}" inside strings
    api_key=os.environ["PAREL_API_KEY"],
    base_url="https://api.parel.cloud/v1",
)

# 3 models with different characters
MODELS = [
    "gpt-4o-mini",      # cheap + fast
    "qwen3-max",        # open-source reference
    "claude-opus-4-7",  # strong reasoning
]

PROMPT = """Classify the following support ticket into exactly one of:
billing, technical, sales, bug, cancellation, other
Return JSON only:
{"category": "...", "confidence": 0.0}
Ticket:
"""

def classify(model, ticket):
    """Send one ticket to one model; return (parsed JSON, latency in ms, token usage)."""
    started = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Return valid JSON only."},
            {"role": "user", "content": PROMPT + ticket},
        ],
        temperature=0,
    )
    latency_ms = int((time.perf_counter() - started) * 1000)
    out = json.loads(response.choices[0].message.content)
    return out, latency_ms, response.usage

# Load the test set once so every model sees the same rows
with open("support_50.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

results = {}
for model in MODELS:
    correct, latencies, total_tokens = 0, [], 0
    for row in rows:
        try:
            out, latency, usage = classify(model, row["ticket_text"])
            correct += out["category"] == row["expected_category"]
            latencies.append(latency)
            total_tokens += usage.total_tokens
        except Exception as e:
            # A failed request or an unparseable reply counts as a wrong answer
            print(f"{model} failed on row: {e}")
    results[model] = {
        "accuracy": correct / len(rows),
        # Last of the 19 cut points = 95th percentile; needs at least 2 samples
        "p95_ms": int(statistics.quantiles(latencies, n=20)[-1]) if len(latencies) >= 2 else None,
        "total_tokens": total_tokens,
    }
print(json.dumps(results, indent=2))
Run it:
python run_eval.py
- 50 × 3 = 150 requests, takes 3-5 minutes
- Output: accuracy, p95_ms, total_tokens per model
During the run, Parel routes each request to the right provider (OpenAI, DashScope, Anthropic) automatically. Even though we use the OpenAI SDK in the code, we're calling Qwen and Claude as well — switching providers didn't require a single code change.
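One practical wrinkle: the runner passes the model's reply straight to json.loads, so a reply wrapped in a Markdown code fence or padded with extra text is counted as a failure. The sketch below is one way to make parsing more forgiving. parse_category is a hypothetical helper, not part of the OpenAI SDK or Parel, and the category list mirrors the prompt.

```python
import json
import re

ALLOWED = {"billing", "technical", "sales", "bug", "cancellation", "other"}

def parse_category(raw):
    """Extract the category from a model reply, or return None if unusable."""
    # Take the span from the first "{" to the last "}" so fences or stray prose are ignored
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Normalize case and whitespace before checking against the allowed labels
    category = str(data.get("category", "")).strip().lower()
    return category if category in ALLOWED else None

fence = "`" * 3  # a Markdown code fence, built this way to keep the example readable
wrapped = fence + 'json\n{"category": "Billing", "confidence": 0.9}\n' + fence
print(parse_category(wrapped))
print(parse_category("Sorry, I cannot classify this."))
```

To use it, classify would return parse_category(response.choices[0].message.content) and the runner would compare that value with expected_category. Note that lowercasing the label is a design choice: it forgives "Billing" vs "billing", which you may or may not want to count as correct.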
Step 4: Read the results
The output looks roughly like the table below (your numbers will differ; the shape is the same):
| Model | Accuracy | p95 latency | Estimated $/1K |
|---|---|---|---|
| gpt-4o-mini | 88% | 720 ms | $0.18 |
| qwen3-max | 92% | 1.4 s | $0.42 |
| claude-opus-4-7 | 94% | 2.1 s | $1.85 |
Compare accuracy against cost. In this example, the most expensive model (claude-opus-4-7) has the highest accuracy but is 10× more expensive than gpt-4o-mini for only +6 points. Financially, gpt-4o-mini is likely your "ship" candidate.
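To make the trade-off concrete, the sketch below computes how much each extra accuracy point costs relative to the cheapest model. It uses only the illustrative numbers from the table above, in whatever unit the "$/1K" column denotes. marginal_cost_per_point is a hypothetical helper.

```python
# Illustrative numbers copied from the example table (not measurements)
TABLE = {
    "gpt-4o-mini": {"accuracy": 88, "cost": 0.18},
    "qwen3-max": {"accuracy": 92, "cost": 0.42},
    "claude-opus-4-7": {"accuracy": 94, "cost": 1.85},
}

def marginal_cost_per_point(table, baseline):
    """Extra cost per extra accuracy point, relative to the baseline model."""
    base = table[baseline]
    out = {}
    for model, row in table.items():
        gained = row["accuracy"] - base["accuracy"]
        if model == baseline or gained <= 0:
            continue  # no accuracy gain to pay for
        out[model] = round((row["cost"] - base["cost"]) / gained, 3)
    return out

per_point = marginal_cost_per_point(TABLE, "gpt-4o-mini")
print(per_point)
```

With these example numbers, each extra point costs about $0.06 with qwen3-max and about $0.28 with claude-opus-4-7, which is why the last two points are the expensive ones.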
Compute cost with one formula: cost = (total_tokens / 1000) × price per 1K tokens. If a 50-ticket run consumes ~8K tokens on average, that's about $0.04 with gpt-4o-mini; for 100K tickets per month, ~$80.
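The arithmetic above, as a small sketch. The per-1K-token price is not stated in this guide; the value below is back-computed from the example ($0.04 for 8K tokens) and is an assumption, so substitute the real price for your model. cost_usd and monthly_cost_usd are hypothetical helpers. Providers often price input and output tokens differently, which this single-rate formula ignores.

```python
def cost_usd(total_tokens, price_per_1k_tokens):
    """cost = (total_tokens / 1000) × price per 1K tokens."""
    return total_tokens / 1000 * price_per_1k_tokens

def monthly_cost_usd(tokens_per_run, tickets_per_run, tickets_per_month, price_per_1k_tokens):
    """Scale the token usage measured in one eval run to a monthly ticket volume."""
    tokens_per_ticket = tokens_per_run / tickets_per_run
    return cost_usd(tokens_per_ticket * tickets_per_month, price_per_1k_tokens)

# Assumed price, back-computed from the example: $0.04 for 8K tokens
PRICE_PER_1K = 0.04 / 8

run_cost = cost_usd(8_000, PRICE_PER_1K)
month_cost = monthly_cost_usd(8_000, 50, 100_000, PRICE_PER_1K)
print(f"run: ${run_cost:.2f}, month: ${month_cost:.2f}")
```

Feed it the total_tokens value from your own run_eval.py output instead of 8_000 to get a projection for each model.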
Step 5: Decision — ship / iterate / stop
You now have a table you can present to a manager. The decision isn't based on a single metric; it's the combination of quality, cost, latency and error impact:
Ship
One model passes the quality threshold, latency fits the budget, cost is understood. Start a 5-10% pilot, keep the old rule-based system as a control group. Re-run the eval set weekly in production to detect drift.
Iterate
Accuracy is just below the threshold. Try in order: add few-shot examples to the prompt, tighten the output schema (required fields), expand the test set (especially hard cases). Switching models is the last step; it's rarely needed.
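As an illustration of the first lever, few-shot examples can be placed between the instructions and the new ticket. The sketch below shows one possible layout. build_few_shot_prompt is a hypothetical helper, the prompt text follows the one in run_eval.py, and the two examples come from the sample CSV in step 1. Keep any ticket used as a few-shot example out of the test set, otherwise accuracy is inflated.

```python
import json

BASE_PROMPT = (
    "Classify the following support ticket into exactly one of:\n"
    "billing, technical, sales, bug, cancellation, other\n"
    "Return JSON only:\n"
    '{"category": "...", "confidence": 0.0}\n'
)

def build_few_shot_prompt(examples, ticket):
    """Insert labeled examples between the instructions and the new ticket."""
    parts = [BASE_PROMPT, "Examples:"]
    for text, category in examples:
        parts.append(f"Ticket: {text}")
        # Show the exact output shape the model is expected to produce
        parts.append(json.dumps({"category": category, "confidence": 1.0}))
    parts.append(f"Ticket: {ticket}")
    return "\n".join(parts)

EXAMPLES = [
    ("I was charged twice for the same order.", "billing"),
    ("My webhooks are randomly returning 502.", "bug"),
]
prompt = build_few_shot_prompt(EXAMPLES, "How does invoicing work if we upgrade to Pro?")
print(prompt)
```

Picking examples from the categories the models confuse most (the "medium" and "hard" rows) tends to be a better use of prompt space than examples of obvious cases.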
Stop
No model passes the threshold, the error impact is too high (a wrong decision is hard to reverse) or business value is unclear. Splitting the use case or returning to a rule-based solution is also a valuable POC outcome. Knowing where AI doesn't fit is a win.
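Writing the rule down before you look at the results keeps the decision honest. The sketch below is only an illustration: decide is hypothetical, and every threshold (85% accuracy, 1500 ms p95, a 3-point near-miss margin) is an assumed placeholder to replace with your own. The cost field is not produced by run_eval.py; here it is filled in by hand from the example table. Error impact is reduced to a single boolean.

```python
def decide(results, min_accuracy=0.85, max_p95_ms=1500, near_miss=0.03,
           error_impact_acceptable=True):
    """Return ("ship", model), ("iterate", model) or ("stop", None).

    results maps model name -> {"accuracy": float, "p95_ms": int, "cost": float}.
    All thresholds are placeholder assumptions.
    """
    if not error_impact_acceptable or not results:
        return "stop", None
    fast_enough = {m: r for m, r in results.items() if r["p95_ms"] <= max_p95_ms}
    passing = [m for m, r in fast_enough.items() if r["accuracy"] >= min_accuracy]
    if passing:
        # Among models that clear both bars, prefer the cheapest
        return "ship", min(passing, key=lambda m: fast_enough[m]["cost"])
    close = [m for m, r in fast_enough.items() if r["accuracy"] >= min_accuracy - near_miss]
    if close:
        # Just below the bar: iterate on the most accurate candidate
        return "iterate", max(close, key=lambda m: fast_enough[m]["accuracy"])
    return "stop", None

# Illustrative values from the example table, not measurements
example = {
    "gpt-4o-mini": {"accuracy": 0.88, "p95_ms": 720, "cost": 0.18},
    "qwen3-max": {"accuracy": 0.92, "p95_ms": 1400, "cost": 0.42},
    "claude-opus-4-7": {"accuracy": 0.94, "p95_ms": 2100, "cost": 1.85},
}
print(decide(example))
```

With the example numbers and the assumed defaults, the rule picks gpt-4o-mini to ship. Raising min_accuracy to 0.94 turns the outcome into "iterate" on qwen3-max, because claude-opus-4-7 exceeds the assumed latency budget.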
Bonus: same POC, no code
If you'd rather not install Python, the Parel Compare UI does the same job. Upload the CSV, pick the models, click Run, see the same table after 2-3 minutes. The code version is more reproducible; the UI version is faster for PMs and non-devs.
# Or, no code: Parel Compare UI
# https://app.parel.cloud/compare
#
# 1. Click "New run"
# 2. Upload your CSV (input + expected columns)
# 3. Pick 3 models (gemini-3-flash, qwen3-max, gpt-5.4)
# 4. "Run" → ~2-3 minutes later: quality + latency + cost table
What's next
POC done. The next playbooks take you a step further:
- Auto-classifying support tickets with AI and routing to the right team: the production-ready long version of this guide — hybrid routing, critical-ticket escalation, drift monitoring.
- Deploying a Hugging Face model: if an open-source model wins, run it on your own GPU via BYOM.
- Connect Claude Code to Parel: for developers who want to use Parel models from inside their IDE.