Multi-threading Triton Server for Batch Inference
Configuration changes
{
    "max_batch_size": 1024,
    "dynamic_batching": {
        "max_queue_delay_microseconds": 100000
    }
}
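To get a feel for how these two settings interact, here is a toy model in Python. It is not Triton's scheduler, only a simplified sketch that assumes a batch is dispatched either when it reaches `max_batch_size` or when a new request arrives after the oldest queued request has already waited longer than the queue delay. The function name and the arrival times are made up for illustration. Note that 100000 microseconds is 100 ms.

```python
def simulate_dynamic_batching(arrivals_us, max_batch_size, max_queue_delay_us):
    """Group request arrival times (in microseconds) into batches (toy model)."""
    batches = []
    current = []
    for t in sorted(arrivals_us):
        # Close the open batch if its oldest request has waited past the delay.
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it is full.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Hypothetical arrival times: three requests close together, two more later.
arrivals = [0, 10_000, 50_000, 120_000, 130_000]
batches = simulate_dynamic_batching(arrivals, max_batch_size=1024,
                                    max_queue_delay_us=100_000)
print([len(b) for b in batches])  # [3, 2]
```

Under this model, a longer queue delay gives requests from different client threads more time to land in the same batch, at the cost of extra latency for the first request in each batch.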
Change how input_ids are padded and concatenated
In web_server.py, line 111, change the input_ids padding logic so that padding uses token ID 3 (</s>).
- input_ids = tokenizer(self.prompt, add_special_tokens=True, return_tensors="pt", padding=False)
+ input_ids = tokenizer(self.prompt, add_special_tokens=True, return_tensors="pt", padding='</s>')
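The reason padding matters is that requests can only be stacked into one batch tensor when every `input_ids` row has the same length. The sketch below shows that idea with plain Python lists. It assumes the pad ID is 3 (`</s>`) as stated above and pads on the right, which is an assumption; `pad_batch` and the sample IDs are hypothetical. If `tokenizer` is a Hugging Face tokenizer (also an assumption, the post does not say), `padding` usually takes a strategy such as `True` or `'max_length'`, and the pad token is configured separately through `tokenizer.pad_token`.

```python
PAD_ID = 3  # ID of </s>, as stated in the post


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token ID lists to a common length and build attention masks."""
    if not sequences:
        return [], []
    max_len = max(len(seq) for seq in sequences)
    # Append pad_id until every row reaches max_len.
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding.
    masks = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, masks


# Hypothetical token IDs for two prompts of different lengths.
padded, masks = pad_batch([[101, 7, 8, 9], [101, 5]])
print(padded)  # [[101, 7, 8, 9], [101, 5, 3, 3]]
print(masks)   # [[1, 1, 1, 1], [1, 1, 0, 0]]
```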
Simulate multi-threaded batch requests from the client
In bench.py, change the multi-threaded request section (lines 152-159):
with ThreadPoolExecutor(max_workers=worker_count) as executor:
    for i in range(0, len(texts), worker_count):
        futures = []
        for j in range(worker_count):
            if i+j < len(texts):
                futures.append(executor.submit(request, texts[i+j]))
        concurrent.futures.wait(futures)
    executor.shutdown()
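The snippet above submits requests in groups of `worker_count`. As a point of comparison, the following self-contained sketch hands all texts to the pool at once with `executor.map`, so a free worker picks up the next text immediately. This is a variant for illustration, not the code from bench.py: `request` here is a hypothetical stand-in that sleeps instead of calling the server, and `run_concurrently` is a made-up helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def request(text):
    """Hypothetical stand-in for bench.py's request(); no server is contacted."""
    time.sleep(0.01)  # pretend to wait for the inference response
    return f"echo:{text}"


def run_concurrently(texts, worker_count):
    """Send every text through a thread pool and return results in input order."""
    with ThreadPoolExecutor(max_workers=worker_count) as executor:
        # map() keeps at most worker_count calls in flight and preserves order.
        return list(executor.map(request, texts))


results = run_concurrently(["a", "b", "c", "d", "e"], worker_count=2)
print(results)  # ['echo:a', 'echo:b', 'echo:c', 'echo:d', 'echo:e']
```

Leaving the `with` block already waits for outstanding work and shuts the pool down, so no explicit `shutdown()` call is needed in this variant.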
Check whether batching succeeded
# Use grep to show the log lines that contain batch_size, plus 3 lines of context before and after each match
grep -C 3 "batch_size" <log_file>
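If you want to use grep to find lines that match a pattern and also show the three lines before and after each match, you can use the -A (after), -B (before), and -C (context, the number of lines to show on both sides) options.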
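For environments where grep is not available, the same context filtering can be approximated in Python. This is a rough sketch, not a grep replacement: it does plain substring matching instead of regular expressions and does not print grep's `--` separators between groups of matches. `grep_context` is a hypothetical helper, and the sample lines are synthetic placeholders, not real server log output.

```python
def grep_context(lines, pattern, context=3):
    """Return lines containing pattern, plus `context` lines on each side."""
    keep = set()
    for i, line in enumerate(lines):
        if pattern in line:
            start = max(0, i - context)
            stop = min(len(lines), i + context + 1)
            keep.update(range(start, stop))
    # Emit kept lines in their original order, without duplicates.
    return [lines[i] for i in sorted(keep)]


# Synthetic placeholder lines; only line 5 contains the pattern.
sample = [f"line {i}" for i in range(10)]
sample[5] = "line 5 batch_size"
for line in grep_context(sample, "batch_size", context=3):
    print(line)  # prints "line 2" through "line 8"
```

To run it on a real log, read the file with `open(path).read().splitlines()` and pass the resulting list as `lines`.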
This post is licensed under CC BY 4.0 by the author.