Error when training BEV

Hello, please describe the problem you encountered in detail.

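1. Hardware source: purchased J5 chip

2. Current system image version: docker_openexplorer_ubuntu_20_j5_gpu_v1.1.40_py38

3. Current OpenExplorer (天工开物) version: horizon_j5_open_explorer_v1.1.40_py38_20230210

4. Problem location: running the BEV training command fails

5. Demo/case under development: bev_release_package-1.6.16

6. Solution needed: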

Under ddk/package/host/, I ran bash install.sh and then the following training command: python3 tools/train.py --config configs/bev/bev_mt_lss.py --stage float

The following error occurred:

2023-03-03 17:25:12,057 INFO [logger.py:147] Node[0] ==================================================BEGIN FLOAT STAGE==================================================

2023-03-03 17:25:12,090 INFO [thread_init.py:38] Node[1] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,

2023-03-03 17:25:12,108 INFO [thread_init.py:38] Node[3] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,

2023-03-03 17:25:12,108 INFO [thread_init.py:38] Node[2] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,

2023-03-03 17:25:12,111 INFO [thread_init.py:38] Node[0] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,

2023-03-03 17:25:12,143 ERROR [ddp_trainer.py:363] Node[1] Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 359, in _with_exception
    fn(*args)
  File "/open_explorer/bev_release_package/tools/train.py", line 185, in train_entrance
    trainer = build_from_registry(trainer)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 213, in _impl
    _raise_invalid_type_error(object_type)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 75, in _raise_invalid_type_error
    raise TypeError(
TypeError: LSSTransformer has not registered in any of registry ['HAT_OBJECT_REGISTRY'] and is not a class, which is not allowed

2023-03-03 17:25:12,157 ERROR [ddp_trainer.py:363] Node[0] Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 359, in _with_exception
    fn(*args)
  File "/open_explorer/bev_release_package/tools/train.py", line 185, in train_entrance
    trainer = build_from_registry(trainer)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 213, in _impl
    _raise_invalid_type_error(object_type)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 75, in _raise_invalid_type_error
    raise TypeError(
TypeError: LSSTransformer has not registered in any of registry ['HAT_OBJECT_REGISTRY'] and is not a class, which is not allowed

2023-03-03 17:25:12,157 ERROR [ddp_trainer.py:363] Node[3] Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 359, in _with_exception
    fn(*args)
  File "/open_explorer/bev_release_package/tools/train.py", line 185, in train_entrance
    trainer = build_from_registry(trainer)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 213, in _impl
    _raise_invalid_type_error(object_type)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 75, in _raise_invalid_type_error
    raise TypeError(
TypeError: LSSTransformer has not registered in any of registry ['HAT_OBJECT_REGISTRY'] and is not a class, which is not allowed

2023-03-03 17:25:12,158 ERROR [ddp_trainer.py:363] Node[2] Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 359, in _with_exception
    fn(*args)
  File "/open_explorer/bev_release_package/tools/train.py", line 185, in train_entrance
    trainer = build_from_registry(trainer)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items()))  # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 213, in _impl
    _raise_invalid_type_error(object_type)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 75, in _raise_invalid_type_error
    raise TypeError(
TypeError: LSSTransformer has not registered in any of registry ['HAT_OBJECT_REGISTRY'] and is not a class, which is not allowed

ERROR:__main__:launch trainer failed! process 0 terminated with exit code 1

Traceback (most recent call last):
  File "tools/train.py", line 277, in <module>
    train(
  File "tools/train.py", line 272, in train
    raise e
  File "tools/train.py", line 255, in train
    launch(
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 328, in launch
    mp.spawn(
  File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 139, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1

Is there a solution to this problem?

Hello, would it be possible to share the bev_release_package-1.6.16 package?

See step 3 (adding environment variables) of section 3.1.1, Environment Deployment, in the documentation: https://developer.horizon.ai/forumDetail/143772473308124163
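When an environment-variable step has been skipped, it can help to check which copy of a package Python actually resolves. The traceback in the question shows hat being loaded from /root/.local/lib/python3.8/site-packages. Whether that is the intended copy is not confirmed in this thread, so treat the following only as a rough diagnostic sketch. The helper names locate_module and pythonpath_entries are made up for illustration; importlib.util.find_spec is the standard-library call doing the work.

```python
import importlib.util
import os
import sys


def locate_module(name):
    """Return the file Python would load top-level module `name` from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # For a regular package this is the path of its __init__.py.
    return spec.origin


def pythonpath_entries():
    """Return the non-empty entries of the PYTHONPATH environment variable."""
    raw = os.environ.get("PYTHONPATH", "")
    return [entry for entry in raw.split(os.pathsep) if entry]


if __name__ == "__main__":
    # "hat" and "torch" are the package names that appear in the traceback.
    for pkg in ("hat", "torch"):
        print(pkg, "->", locate_module(pkg))
    print("PYTHONPATH entries:", pythonpath_entries())
    print("First sys.path entries:", sys.path[:5])
```

Running this inside the docker container before and after the documented environment setup should show whether the resolved paths change.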

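Hello, the docker environment is already sufficient to run BEV, so there is no need to run bash install.sh (install.sh is the deployment script for a local environment). Please set up the environment by following the tutorial in the documentation, then run the float training command again.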
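For readers wondering why a misconfigured environment surfaces as "has not registered in any of registry", here is a toy sketch of a registry-based config builder. It is not HAT's implementation; Registry, build_from_config, ToyBackbone and ToyViewTransformer are all hypothetical names. It only illustrates the general pattern: a class named by string in a config can be built only if the module that registers it has been imported.

```python
class Registry:
    """A toy name-to-class registry (illustrative only, not HAT's code)."""

    def __init__(self, name):
        self.name = name
        self._classes = {}

    def register(self, cls):
        # Used as a decorator, so it runs when the defining module is imported.
        self._classes[cls.__name__] = cls
        return cls

    def get(self, key):
        return self._classes.get(key)


REGISTRY = Registry("TOY_REGISTRY")


def build_from_config(cfg):
    """Recursively turn dicts that carry a "type" key into objects."""
    if isinstance(cfg, dict):
        built = {key: build_from_config(value) for key, value in cfg.items()}
        if "type" not in built:
            return built
        obj_type = built.pop("type")
        if isinstance(obj_type, str):
            cls = REGISTRY.get(obj_type)
            if cls is None:
                raise TypeError(
                    f"{obj_type} is not registered in {REGISTRY.name}"
                )
        else:
            cls = obj_type
        return cls(**built)
    if isinstance(cfg, (list, tuple)):
        return type(cfg)(build_from_config(value) for value in cfg)
    return cfg


@REGISTRY.register
class ToyBackbone:
    def __init__(self, depth):
        self.depth = depth


class ToyViewTransformer:
    """Defined but never registered, like a module that was never imported."""


if __name__ == "__main__":
    print(build_from_config({"type": "ToyBackbone", "depth": 18}).depth)
    try:
        build_from_config({"model": {"view": {"type": "ToyViewTransformer"}}})
    except TypeError as err:
        print("TypeError:", err)
```

In this sketch the nested lookup fails in the same way as in the log: the name is a plain string, nothing registered it, and the builder has no class to instantiate.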

Hello, bev_release_package is currently available only through a project engagement and cannot be shared directly for now.

Solved. The environment was not configured properly.

Is there any way to obtain it for personal study and research?

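For now you can read this post: https://developer.horizon.ai/forumDetail/146177165367615505. The code is not available yet; we suggest you keep following updates, as it will be opened up in the future.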

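Hello, starting from J5 OE1.1.57, BEV is merged into the OE package, so individuals can obtain it as well. We recommend using the current latest version, J5 OE1.1.60, directly. Download link: https://developer.horizon.ai/forumDetail/118363912788935318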