**Deployment environment:** Docker
**Image version:** docker_open_explorer_ubuntu_20_xj3_gpu_v2.6.6.tar.gz
**Toolchain version:** horizon_xj3_open_explorer_v2.6.6_py38_20240717.tar.gz
I followed the tutorial step by step (PS: this mscoco is my own dataset; only the folder names match coco2017):
```shell
python3 tools/datasets/mscoco_packer.py --src-data-dir /data/horizon_x3/data/mscoco/train2017 --target-data-dir /data/horizon_x3/data/mscoco --split-name train --pack-type lmdb
python3 tools/datasets/mscoco_packer.py --src-data-dir /data/horizon_x3/data/mscoco/val2017 --target-data-dir /data/horizon_x3/data/mscoco --split-name val --pack-type lmdb
python3 tools/train.py --stage float --config configs/detection/fcos/fcos_efficientnetb0_mscoco.py
```
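As a side check on the dataset itself, I also verified the annotation file really declares 3 categories. This is just a generic COCO-format sketch (the path is my assumption based on the coco2017-style layout, not a toolchain utility):

```python
import json

def count_categories(ann_path):
    """Count the entries in the `categories` list of a COCO-style annotation file."""
    with open(ann_path) as f:
        return len(json.load(f)["categories"])

# Hypothetical path following the coco2017-style layout above; for a
# 3-class dataset this should return 3:
# count_categories("/data/horizon_x3/data/mscoco/annotations/instances_train2017.json")
```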
I made the following changes to fcos_efficientnetb0_mscoco.py (the full file is attached):
```python
task_name = "fcos_efficientnetb0_mscoco"
num_classes = 3  # changed to my own 3 classes
batch_size_per_gpu = 16
device_ids = [0]
ckpt_dir = "./tmp_models/%s" % task_name
cudnn_benchmark = True
seed = None
log_rank_zero_only = True
bn_kwargs = {}
march = March.BERNOULLI2
convert_mode = "fx"
```
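Given the 80-class shape error in the log below, one quick sanity check I can think of is scanning the config text for any remaining hard-coded COCO class count that my edit missed. This is a generic sketch of my own, not a HAT tool:

```python
import re

def find_hardcoded_class_counts(config_text, stale=80):
    """Return (line_no, line) pairs where a stale class count (e.g. a bare 80) appears."""
    hits = []
    for i, line in enumerate(config_text.splitlines(), 1):
        # \b keeps 80 from matching inside unrelated numbers like 800 or 8080
        if re.search(rf"\b{stale}\b", line):
            hits.append((i, line.strip()))
    return hits

# Usage idea:
# with open("configs/detection/fcos/fcos_efficientnetb0_mscoco.py") as f:
#     print(find_hardcoded_class_counts(f.read()))
```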
---All subsequent ./tmp_data/mscoco paths were likewise changed to my own paths---

Training log:
2025-01-29 22:15:37,761 INFO [logger.py:199] Node[0] ==================================================BEGIN FLOAT STAGE==================================================
2025-01-29 22:15:37,780 INFO [logger.py:199] Node[0] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,
2025-01-29 22:15:39,994 WARNING: wrap usage has been changed, please pass necessary args
2025-01-29 22:15:40,759 INFO [logger.py:199] Node[0] building bifpn cell 0
2025-01-29 22:15:40,760 INFO [logger.py:199] Node[0] fnode 0 : {'inputs_offsets': [3, 4], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,761 INFO [logger.py:199] Node[0] fnode 1 : {'inputs_offsets': [2, 5], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,763 INFO [logger.py:199] Node[0] fnode 2 : {'inputs_offsets': [1, 6], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,764 INFO [logger.py:199] Node[0] fnode 3 : {'inputs_offsets': [0, 7], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,766 INFO [logger.py:199] Node[0] fnode 4 : {'inputs_offsets': [1, 7, 8], 'sampling': ['keep', 'keep', 'down'], 'upsample_type': ['function', 'function', 'function']}
2025-01-29 22:15:40,768 INFO [logger.py:199] Node[0] fnode 5 : {'inputs_offsets': [2, 6, 9], 'sampling': ['keep', 'keep', 'down'], 'upsample_type': ['function', 'function', 'function']}
2025-01-29 22:15:40,771 INFO [logger.py:199] Node[0] fnode 6 : {'inputs_offsets': [3, 5, 10], 'sampling': ['keep', 'keep', 'down'], 'upsample_type': ['function', 'function', 'function']}
2025-01-29 22:15:40,772 INFO [logger.py:199] Node[0] fnode 7 : {'inputs_offsets': [4, 11], 'sampling': ['keep', 'down'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,773 INFO [logger.py:199] Node[0] building bifpn cell 1
2025-01-29 22:15:40,774 INFO [logger.py:199] Node[0] fnode 0 : {'inputs_offsets': [3, 4], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,775 INFO [logger.py:199] Node[0] fnode 1 : {'inputs_offsets': [2, 5], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,776 INFO [logger.py:199] Node[0] fnode 2 : {'inputs_offsets': [1, 6], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,777 INFO [logger.py:199] Node[0] fnode 3 : {'inputs_offsets': [0, 7], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,778 INFO [logger.py:199] Node[0] fnode 4 : {'inputs_offsets': [1, 7, 8], 'sampling': ['keep', 'keep', 'down'], 'upsample_type': ['function', 'function', 'function']}
2025-01-29 22:15:40,780 INFO [logger.py:199] Node[0] fnode 5 : {'inputs_offsets': [2, 6, 9], 'sampling': ['keep', 'keep', 'down'], 'upsample_type': ['function', 'function', 'function']}
2025-01-29 22:15:40,781 INFO [logger.py:199] Node[0] fnode 6 : {'inputs_offsets': [3, 5, 10], 'sampling': ['keep', 'keep', 'down'], 'upsample_type': ['function', 'function', 'function']}
2025-01-29 22:15:40,782 INFO [logger.py:199] Node[0] fnode 7 : {'inputs_offsets': [4, 11], 'sampling': ['keep', 'down'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,783 INFO [logger.py:199] Node[0] building bifpn cell 2
2025-01-29 22:15:40,784 INFO [logger.py:199] Node[0] fnode 0 : {'inputs_offsets': [3, 4], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,785 INFO [logger.py:199] Node[0] fnode 1 : {'inputs_offsets': [2, 5], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,786 INFO [logger.py:199] Node[0] fnode 2 : {'inputs_offsets': [1, 6], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,787 INFO [logger.py:199] Node[0] fnode 3 : {'inputs_offsets': [0, 7], 'sampling': ['keep', 'up'], 'upsample_type': ['function', 'function']}
2025-01-29 22:15:40,788 INFO [logger.py:199] Node[0] fnode 4 : {'inputs_offsets': [1, 7, 8], 'sampling': ['keep', 'keep', 'down'], 'upsample_type': ['function', 'function', 'function']}
2025-01-29 22:15:40,789 INFO [logger.py:199] Node[0] fnode 5 : {'inputs_offsets': [2, 6, 9], 'sampling': ['keep', 'keep', 'down'], 'upsample_type': ['function', 'function', 'function']}
2025-01-29 22:15:40,791 INFO [logger.py:199] Node[0] fnode 6 : {'inputs_offsets': [3, 5, 10], 'sampling': ['keep', 'keep', 'down'], 'upsample_type': ['function', 'function', 'function']}
2025-01-29 22:15:40,792 INFO [logger.py:199] Node[0] fnode 7 : {'inputs_offsets': [4, 11], 'sampling': ['keep', 'down'], 'upsample_type': ['function', 'function']}
/root/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py:554: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 6, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
loading annotations into memory...
Done (t=0.36s)
creating index...
index created!
2025-01-29 22:15:42,942 WARNING [ddp_trainer.py:61] Node[0] Detected that `torch==1.13.0+cu116`, and `hat.models.SyncBatchNorm `will be used to replace `torch.nn.SyncBatchNorm`, which will be faster during training.
NCCL version 2.14.3+cuda11.6
2025-01-29 22:15:43,055 WARNING [request.py:27] Node[0] Train strategy upload failed ! Cannot get USER_TOKEN/JOB_ID/CLUSTER from env.
2025-01-29 22:15:43,056 INFO [loop_base.py:480] Node[0] Start DistributedDataParallelTrainer loop from epoch 0, num_epochs=300
2025-01-29 22:15:43,192 INFO [monitor.py:143] Node[0] Epoch[0] Begin ==================================================
2025-01-29 22:15:43,192 INFO [lr_updater.py:204] Node[0] Epoch[0] Step[0] GlobalStep[0] lr=0.140000
`aidisdk` dependency is not available.
`aidisdk` dependency is not available.
`aidisdk` dependency is not available.
`aidisdk` dependency is not available.
`aidisdk` dependency is not available.
`aidisdk` dependency is not available.
`aidisdk` dependency is not available.
`aidisdk` dependency is not available.
2025-01-29 22:15:55,126 ERROR [ddp_trainer.py:463] Node[0] Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/hat/engine/ddp_trainer.py", line 457, in _with_exception
fn(*args)
File "/open_explorer/ddk/samples/ai_toolchain/horizon_model_train_sample/scripts/tools/train.py", line 186, in train_entrance
trainer.fit()
File "/usr/local/lib/python3.8/dist-packages/hat/engine/loop_base.py", line 546, in fit
self.batch_processor(
File "/usr/local/lib/python3.8/dist-packages/hat/utils/deterministic.py", line 235, in wrapper
result = func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/hat/engine/processors/processor.py", line 734, in __call__
model_outs = model(*_as_list(batch_i))
File "/root/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/root/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/root/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/hat/models/structures/detectors/fcos.py", line 92, in forward
return self._post_process(data, preds)
File "/usr/local/lib/python3.8/dist-packages/hat/models/structures/detectors/fcos.py", line 72, in _post_process
cls_targets, reg_targets, centerness_targets = self.targets(
File "/root/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/hat/models/task_modules/fcos/target.py", line 809, in forward
flatten_cls_scores = [
File "/usr/local/lib/python3.8/dist-packages/hat/models/task_modules/fcos/target.py", line 810, in <listcomp>
cls_score.permute(0, 2, 3, 1).reshape(
RuntimeError: shape '[16, -1, 80]' is invalid for input of size 196608
ERROR:__main__:train failed! process 0 terminated with exit code 1
Traceback (most recent call last):
File "tools/train.py", line 287, in <module>
raise e
File "tools/train.py", line 273, in <module>
train(
File "tools/train.py", line 254, in train
launch(
File "/usr/local/lib/python3.8/dist-packages/hat/engine/ddp_trainer.py", line 426, in launch
mp.spawn(
File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 149, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
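Working through the failing reshape myself (my own arithmetic, please correct me if wrong): the tensor has 196608 elements, which does not divide evenly for 80 classes but fits 3 classes on a 64x64 feature map exactly. So the head seems to emit 3 class channels while the FCOS target module still reshapes with 80, as if a second class-count setting (e.g. in the target/post-process part of the config) was not picked up:

```python
# Sanity-check the failing reshape: cls_score.permute(0, 2, 3, 1).reshape(16, -1, 80)
numel, batch = 196608, 16

# reshape to [16, -1, 80] requires divisibility by 16 * 80 = 1280:
print(numel % (batch * 80))   # -> 768, not divisible: hence the RuntimeError

# with num_classes = 3 everything lines up with a 64x64 feature map:
print(numel // (batch * 3))   # -> 4096 == 64 * 64
```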
My questions:
1. Does the repeated `aidisdk` dependency is not available. message affect anything later in the pipeline?
2. It looks like my change to the number of classes did not fully take effect: RuntimeError: shape '[16, -1, 80]' is invalid for input of size 196608
3. Also, are official pretrained weights for the improved FCOS available for fine-tuning, or does training load pretrained weights by default?

Any pointers would be much appreciated.