The frame rate in the official benchmark does not match the FPS I get with hrt_model_exec perf --thread_num 8; the official figure is a bit higher than my own measurement. What is the difference between these two measurement methods?
perf.txt (1.8 KB)
The official benchmark figures are measured under the following conditions:
1. The X5 is in its optimal state: CPU is 8x A55 @ 1.8 GHz with all cores on the performance governor, and the BPU is 1x Bayes-e @ 1 GHz, for about 10 TOPS of int8-equivalent compute.
2. Single-thread latency is the latency of a single frame on a single thread and a single BPU core, i.e. the most ideal case for the BPU running one inference task.
3. The 8-thread extreme frame rate is measured with 8 threads feeding tasks to the X5's BPU at the same time, in order to test the BPU's peak performance.
You can try power-cycling the board and then running the following command several times:
hrt_model_exec perf --thread_num 8 --model_file resnet18_224x224_nv12.bin
Measured frame rate on our side: 440+.
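To reproduce both rows of the benchmark table on your own board, the two measurements can be taken back to back. A minimal sketch, assuming the same resnet18_224x224_nv12.bin is in the current directory and using only the options already shown in this thread:

# Single-thread latency: one frame, one thread, one BPU core (best case per task)
hrt_model_exec perf --thread_num 1 --model_file resnet18_224x224_nv12.bin

# 8-thread extreme frame rate: saturate the BPU to probe its peak throughput
hrt_model_exec perf --thread_num 8 --model_file resnet18_224x224_nv12.bin

If the model contains any of the range-limited operators listed in the tool's warning, pass real input data via --input_file so the numbers are measured on valid inputs.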
root@ubuntu:~# for cpu in /sys/devices/system/cpu/cpu[0-9]; do
echo performance | sudo tee $cpu/cpufreq/scaling_governor
done
performance
performance
performance
performance
performance
performance
performance
performance
root@ubuntu:~# sudo hrut_somstatus
=====================1=====================
temperature-->
DDR : 51.6 (C)
BPU : 51.0 (C)
CPU : 51.1 (C)
cpu frequency-->
-e min(M) cur(M) max(M)
-e cpu0: 300 1500 1500
-e cpu1: 300 1500 1500
-e cpu2: 300 1500 1500
-e cpu3: 300 1500 1500
-e cpu4: 300 1500 1500
-e cpu5: 300 1500 1500
-e cpu6: 300 1500 1500
-e cpu7: 300 1500 1500
bpu status information---->
-e min(M) cur(M) max(M) ratio
-e bpu0: 500 1000 1000 0
ddr frequency information---->
-e min(M) cur(M) max(M)
-e ddr: 266 4266 4266
GPU gc8000 frequency information---->
-e min(M) cur(M) max(M)
-e gc8000: 200 1000 1000
root@ubuntu:~# hrt_model_exec perf --thread_num 8 --model_file resnet18_224x224_nv12.bin
I0000 00:00:00.000000 3928 vlog_is_on.cc:197] RAW: Set VLOG level for "*" to 3
[BPU_PLAT]BPU Platform Version(1.3.6)!
[HBRT] set log level as 0. version = 3.15.55.0
[DNN] Runtime version = 1.24.5_(3.15.55 HBRT)
[A][DNN][packed_model.cpp:247]Model [HorizonRT] The model builder version = 1.24.3
Load model to DDR cost 373.435ms.
[Warning]: These operators have range limitations on input data:
[Div, Tan, Acos, Asin, Sqrt, Gather, GatherElements, GatherND, GridSample, Log, Onehot, PsroiPooling, Range, ReverseSequence, RoiPooling, RoiAlign, ScatterElements, ScatterND, Slice, Tile, Topk, Upsample].
Please make sure that these operators are not in your model, when no input data is provided to the tool.
[Suggestion]: Using --input_file command to specify perf input data, which can appoint valid input data.
I1106 10:06:05.641021 3928 function_util.cpp:323] get model handle success
I1106 10:06:05.641122 3928 function_util.cpp:656] get model input count success
I1106 10:06:05.641259 3928 function_util.cpp:687] prepare input tensor success!
I1106 10:06:05.641291 3928 function_util.cpp:697] get model output count success
Frame count: 200, Thread Average: 19.406038 ms, thread max latency: 19.771999 ms, thread min latency: 4.463000 ms, FPS: 404.072235
Running condition:
Thread number is: 8
Frame count is: 200
Program run time: 495.127000 ms
Perf result:
Frame totally latency is: 3881.207764 ms
Average latency is: 19.406038 ms
Frame rate is: 403.936768 FPS
root@ubuntu:~# hrt_model_exec perf --thread_num 8 --model_file resnet18_224x224_nv12.bin
I0000 00:00:00.000000 3956 vlog_is_on.cc:197] RAW: Set VLOG level for "*" to 3
[BPU_PLAT]BPU Platform Version(1.3.6)!
[HBRT] set log level as 0. version = 3.15.55.0
[DNN] Runtime version = 1.24.5_(3.15.55 HBRT)
[A][DNN][packed_model.cpp:247]Model [HorizonRT] The model builder version = 1.24.3
Load model to DDR cost 329.313ms.
[Warning]: These operators have range limitations on input data:
[Div, Tan, Acos, Asin, Sqrt, Gather, GatherElements, GatherND, GridSample, Log, Onehot, PsroiPooling, Range, ReverseSequence, RoiPooling, RoiAlign, ScatterElements, ScatterND, Slice, Tile, Topk, Upsample].
Please make sure that these operators are not in your model, when no input data is provided to the tool.
[Suggestion]: Using --input_file command to specify perf input data, which can appoint valid input data.
I1106 10:06:09.468606 3956 function_util.cpp:323] get model handle success
I1106 10:06:09.468712 3956 function_util.cpp:656] get model input count success
I1106 10:06:09.468842 3956 function_util.cpp:687] prepare input tensor success!
I1106 10:06:09.468874 3956 function_util.cpp:697] get model output count success
Frame count: 200, Thread Average: 19.377138 ms, thread max latency: 19.739000 ms, thread min latency: 3.846000 ms, FPS: 404.544647
Running condition:
Thread number is: 8
Frame count is: 200
Program run time: 494.544000 ms
Perf result:
Frame totally latency is: 3875.427490 ms
Average latency is: 19.377138 ms
Frame rate is: 404.412954 FPS
root@ubuntu:~# hrt_model_exec perf --thread_num 8 --model_file resnet18_224x224_nv12.bin
I0000 00:00:00.000000 3984 vlog_is_on.cc:197] RAW: Set VLOG level for "*" to 3
[BPU_PLAT]BPU Platform Version(1.3.6)!
[HBRT] set log level as 0. version = 3.15.55.0
[DNN] Runtime version = 1.24.5_(3.15.55 HBRT)
[A][DNN][packed_model.cpp:247]Model [HorizonRT] The model builder version = 1.24.3
Load model to DDR cost 328.165ms.
[Warning]: These operators have range limitations on input data:
[Div, Tan, Acos, Asin, Sqrt, Gather, GatherElements, GatherND, GridSample, Log, Onehot, PsroiPooling, Range, ReverseSequence, RoiPooling, RoiAlign, ScatterElements, ScatterND, Slice, Tile, Topk, Upsample].
Please make sure that these operators are not in your model, when no input data is provided to the tool.
[Suggestion]: Using --input_file command to specify perf input data, which can appoint valid input data.
I1106 10:06:12.768787 3984 function_util.cpp:323] get model handle success
I1106 10:06:12.768891 3984 function_util.cpp:656] get model input count success
I1106 10:06:12.769011 3984 function_util.cpp:687] prepare input tensor success!
I1106 10:06:12.769066 3984 function_util.cpp:697] get model output count success
Frame count: 200, Thread Average: 19.399530 ms, thread max latency: 19.737000 ms, thread min latency: 4.158000 ms, FPS: 404.205353
Running condition:
Thread number is: 8
Frame count is: 200
Program run time: 494.951000 ms
Perf result:
Frame totally latency is: 3879.906250 ms
Average latency is: 19.399530 ms
Frame rate is: 404.080404 FPS
My CPUs are all on the performance governor, and the BPU, DDR, and GPU are all at their maximum frequencies. The model I used was also converted with the official SDK, and I re-tested after power-cycling, but the frame rate is still below 440. Is something misconfigured, or is this a normal difference?
Did you convert the model yourself? If so, this is overall in line with expectations.
Why would a self-converted model have a somewhat lower frame rate than the official benchmark?
A few common reasons:
Your own trained model may not be structurally identical to the pre-trained model used directly by the official benchmark; even with the same network architecture (e.g. ResNet, YOLO), details such as layer configuration and feature-map sizes can differ.
If the calibration dataset used for quantization differs significantly from the training data distribution, the quantization parameters may not be well optimized.
Insufficient or low-quality calibration data also degrades quantization quality.
Newer toolchain versions may sacrifice some performance to support more model types and operators, and operator fusion strategies can differ between versions.
One way to separate board-setup effects from conversion effects is sketched below.
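A rough sketch of that comparison: benchmark the reference model and your own conversion back to back under identical conditions. The file names official_resnet18_224x224_nv12.bin and my_resnet18_224x224_nv12.bin are placeholders for whichever pre-converted model the published benchmark uses and for your own .bin; only options already shown in this thread are used.

# Same thread count and board state, two different model binaries
hrt_model_exec perf --thread_num 8 --model_file official_resnet18_224x224_nv12.bin
hrt_model_exec perf --thread_num 8 --model_file my_resnet18_224x224_nv12.bin

If both runs report similar FPS, the gap to the published 440+ comes from the board or measurement conditions; if your own .bin is clearly slower, the gap comes from the conversion itself (calibration data, toolchain version).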
Understood. How large a deviation is considered a reasonable range?
A difference of 10-20% is within the normal range; in your case the gap is roughly (440 - 404) / 440 ≈ 8%, well inside that range.

