I want to test the visual inference performance of Qwen2.5-VL-3B-Instruct on the RDK S100P. Following the model conversion method at https://github.com/AXERA-TECH/Qwen2.5-VL-3B-Instruct.axera/tree/main/model_convert, I exported the vision part of Qwen to ONNX and then compiled it directly into an hbm file for the vision model with:

hb_compile --march nash-m --model ./Qwen2.5-VL-3B-Instruct_vision.onnx --input-shape hidden_states 1x3x1024x392

Next, I converted the language model to GGUF using llama.cpp's model conversion, and modified the llama.cpp build following https://github.com/zixi01chen/llama.cpp_vlm_bpu. I then ran inference with:

./llama.cpp/build/bin/llama-intern2vl-bpu-cli -m ./llama.cpp/GGUF-BPU-models/Qwen2.5-VL-3B-Instruct-GGUF-BPU/Qwen2.5-VL-3B-q4_k_m.gguf --mmproj ./llama.cpp/GGUF-BPU-models/Qwen2.5-VL-3B-Instruct-GGUF-BPU/Qwen2.5-VL-3B-Instruct_vision.hbm --image ./img/image2.jpg -p "Describe this image" --temp 0.1 --threads 4

The run fails with the following output:
[BPU][[BPU_MONITOR]][281473187122336][INFO]BPULib verison(2, 1, 2)[0d3f195]!
[DNN] HBTL_EXT_DNN log level:6
[DNN]: 3.7.3_(4.2.11 HBRT)
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 6400
llama_init_from_model: n_ctx_per_seq = 6400
llama_init_from_model: n_batch = 2048
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (6400) < n_ctx_train (128000) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 6400, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 225.00 MiB
llama_init_from_model: KV self size = 225.00 MiB, K (f16): 112.50 MiB, V (f16): 112.50 MiB
llama_init_from_model: CPU output buffer size = 0.58 MiB
llama_init_from_model: CPU compute buffer size = 304.75 MiB
llama_init_from_model: graph nodes = 1266
llama_init_from_model: graph splits = 1
Model input shape: [1, 3, 1024, 392]
[E][9393][02-06][16:09:05:755][dnn_task.cpp:304][llama-intern2vl-bpu-cli][DNN] [Task] input index 0 's sys mem size is not enough, required: 5111808, given: 4816896
[E][9393][02-06][16:09:05:755][dnn_task.cpp:248][llama-intern2vl-bpu-cli][DNN] [Task] Validate input[0] failed!
[E][9393][02-06][16:09:05:755][dnn_task.cpp:182][llama-intern2vl-bpu-cli][DNN] [Task] invalid input
[E][9393][02-06][16:09:05:755][hb_ucp.cpp:74][llama-intern2vl-bpu-cli][UCP] taskHandle is null pointer
[E][9393][02-06][16:09:05:755][hb_ucp.cpp:89][llama-intern2vl-bpu-cli][UCP] taskHandle is null pointer
[E][9393][02-06][16:09:05:755][hb_ucp.cpp:120][llama-intern2vl-bpu-cli][UCP] taskHandle is null pointer
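Is it possible to deploy Qwen2.5-VL to the board this way, with the vision part running on the BPU and the language part on the CPU? Or is there another way to achieve this?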
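
A note on the two sizes in the "sys mem size is not enough" error, in case it helps narrow things down. The "given" value 4816896 is exactly 1 x 3 x 1024 x 392 float32 values packed densely. The "required" value 5111808 matches the same tensor if each 392-element row is padded to 416 elements (1664 bytes, a multiple of 128). So one possible reading is that the compiled model expects a row-aligned input buffer while the calling code allocates a dense one. The float32 element type and the 128-byte row alignment are assumptions inferred from the numbers alone, not taken from the toolchain documentation, and the helper names below are hypothetical. A minimal sketch of the arithmetic:

```python
from math import prod


def align_up(n: int, alignment: int) -> int:
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def dense_size(shape, itemsize: int) -> int:
    """Bytes needed when the tensor is packed with no padding."""
    return prod(shape) * itemsize


def row_padded_size(shape, itemsize: int, row_alignment: int) -> int:
    """Bytes needed when each innermost row is padded to row_alignment bytes.

    Assumption: only the last dimension is padded; outer dimensions are dense.
    """
    row_bytes = align_up(shape[-1] * itemsize, row_alignment)
    return prod(shape[:-1]) * row_bytes


shape = (1, 3, 1024, 392)   # input shape reported in the log
itemsize = 4                # assumed float32
row_alignment = 128         # assumed alignment, inferred from the error message

given = dense_size(shape, itemsize)
required = row_padded_size(shape, itemsize, row_alignment)

print("dense bytes:     ", given)
print("row-padded bytes:", required)
```

If this reading is right, the input buffer would have to be allocated with the size and strides the runtime reports for that input, rather than with width x height x channels x 4. Whether that is the actual cause here would need to be checked against the tensor properties of the compiled hbm file.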