RDK X5: quantized fixed-point SegFormer segmentation model has very low accuracy and produces no usable segmentation

Questions
1. Does the D-Robotics toolchain provide an accuracy-alignment tool to verify that the calibration dataset preprocessing matches the preprocessing used on the original training data?
2. Alternatively, is there anything wrong with my quantization/conversion workflow below?
3. Are there any lessons learned or best practices you can share specifically for quantizing transformer architectures (not necessarily segmentation models) on the D-Robotics BPU?

Toolchain documentation reference
Environment Setup | RDK DOC

Toolchain Docker image
openexplorer/ai_toolchain_ubuntu_20_x5_cpu

Model
A pretrained PyTorch .bin segmentation model: nvidia/segformer-b0-finetuned-ade-512-512 · Hugging Face

Calibration images
100 images containing grass, selected from the COCO2017 dataset; example below

Calibration dataset preparation

1. SegFormer preprocessing config preprocessor.json

{
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "SegformerFeatureExtractor",
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "reduce_labels": true,
  "resample": 2,
  "size": 512
}
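For reference, the config above boils down to standard ImageNet normalization; a minimal hand-rolled equivalent (a sketch, assuming `resample: 2` means PIL bilinear and `size: 512` means a 512x512 resize) looks like this:

```python
import numpy as np
from PIL import Image

def preprocess(img: Image.Image) -> np.ndarray:
    """Replicate preprocessor.json by hand: resize to 512x512 (bilinear),
    scale pixels to [0, 1], then normalize with ImageNet mean/std.
    Returns a (1, 3, 512, 512) float32 NCHW tensor."""
    img = img.convert("RGB").resize((512, 512), Image.BILINEAR)
    arr = np.asarray(img, dtype=np.float32) / 255.0           # HWC, [0, 1]
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    arr = (arr - mean) / std                                  # per-channel normalize
    return np.transpose(arr, (2, 0, 1))[None, ...]            # -> NCHW + batch dim
```

Values after this normalization fall roughly in [-2.2, 2.7], which matches the -2 to 2 range mentioned later in the thread.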

2. Script that generates the preprocessed calibration dataset according to the config above:

from pathlib import Path
import numpy as np
from PIL import Image
from transformers import SegformerImageProcessor

# required parameters
MODEL_DIR = r"nvidia/segformer-b0-finetuned-ade-512-512"    # directory containing preprocessor_config.json
SRC       = r"./grass_img"     # an image file or a directory of images
DST       = r"./calibrate_float512" # single image -> .bin file; batch -> output directory
LAYOUT    = "NCHW"                          # or "NHWC"

EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".webp", ".tif", ".tiff"}

def is_img(p: Path):
    return p.suffix.lower() in EXTS

def process_one(processor, img_path: Path, layout: str):
    img = Image.open(img_path).convert("RGB")  # PIL image; HWC once converted to numpy
    arr = processor(images=img, return_tensors="np")["pixel_values"]  # (1, 3, 512, 512) float32
    if layout.upper() == "NHWC":
        arr = np.transpose(arr, (0, 2, 3, 1))  # -> (1, H, W, C)
    arr = arr.astype(np.float32, copy=False)
    return np.ascontiguousarray(arr)

def save_bin(arr: np.ndarray, out_path: Path):
    out_path.parent.mkdir(parents=True, exist_ok=True)
    arr.tofile(out_path)  # raw little-endian float32 binary

def main():
    processor = SegformerImageProcessor.from_pretrained(MODEL_DIR)
    src, dst = Path(SRC), Path(DST)

    if src.is_file():  # single image
        out = dst if dst.suffix.lower() == ".bin" else dst.with_suffix(".bin")
        arr = process_one(processor, src, LAYOUT)
        save_bin(arr, out)
        print(f"Saved {out} shape={arr.shape} dtype={arr.dtype}")
    else:  # batch
        files = [p for p in src.rglob("*") if p.is_file() and is_img(p)]
        dst.mkdir(parents=True, exist_ok=True)
        for p in files:
            arr = process_one(processor, p, LAYOUT)
            save_bin(arr, dst / f"{p.stem}.bin")
        print(f"Saved {len(files)} files to {dst.resolve()}")

if __name__ == "__main__":
    main()
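A quick sanity check on the dumped files can catch layout/dtype mistakes early. A sketch (assuming the NCHW float32 layout the script above writes):

```python
import numpy as np

def check_calib_bin(path: str, shape=(1, 3, 512, 512)) -> np.ndarray:
    """Reload a raw float32 calibration .bin and verify element count and
    value range. ImageNet-normalized inputs should lie well within [-10, 10]."""
    arr = np.fromfile(path, dtype=np.float32)
    assert arr.size == np.prod(shape), f"element count mismatch: {arr.size}"
    arr = arr.reshape(shape)
    assert -10 < arr.min() and arr.max() < 10, "values look unnormalized"
    return arr
```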

Quantization config file

# Copyright (c) 2020 Horizon Robotics.All Rights Reserved.
#
# The material in this file is confidential and contains trade secrets
# of Horizon Robotics Inc. This is proprietary information owned by
# Horizon Robotics Inc. No part of this work may be disclosed,
# reproduced, copied, transmitted, or used in any way for any purpose,
# without the express written permission of Horizon Robotics Inc.

# model conversion related parameters
model_parameters:
  # floating-point ONNX model file
  onnx_model: "./segformer_512.onnx"

  # the applicable BPU architecture
  march: "bayes-e"

  # specifies whether or not to dump the intermediate results of all layers
  # during conversion; if set to True, the intermediate results of all layers
  # will be dumped
  layer_out_dump: False

  # the directory in which model conversion results are stored
  working_dir: './model_convert/board_model/'

  # name prefix of the model files generated by conversion for dev board execution
  output_model_file_prefix: 'segformer_b0_512x512'

  #remove_node_type: "Reshape; Cast"

# model input related parameters;
# use ';' to separate multiple input nodes, or None for the default setting
input_parameters:

  # (Optional) node name of the model input; it must match the name in the
  # model file, otherwise an error is reported; the node name from the model
  # file is used when left blank
  input_name: ""

  # the data format passed to the network at runtime,
  # available options: nv12/rgb/bgr/yuv444/gray/featuremap
  input_type_rt: 'rgb'

  # the data layout passed to the network at runtime, available options: NHWC/NCHW;
  # if input_type_rt is configured as nv12, this parameter is not needed
  input_layout_rt: 'NCHW'

  # the data format used in network training,
  # available options: rgb/bgr/gray/featuremap/yuv444
  input_type_train: 'rgb'

  # the data layout used in network training, available options: NHWC/NCHW
  input_layout_train: 'NCHW'

  # (Optional) the network input size, separated by 'x'; the input size from
  # the model file is used if left blank, otherwise this value overrides it
  input_shape: '1x3x512x512'

  # preprocessing method applied to the network input, available options:
  # 'no_preprocess' - no preprocessing
  # 'data_mean' - subtract the channel mean (mean_value)
  # 'data_scale' - multiply image pixels by the data_scale ratio
  # 'data_mean_and_scale' - subtract the channel mean, then multiply by the scale ratio
  #  norm_type: 'data_mean_and_scale'
  norm_type: 'no_preprocess'

  # the mean value subtracted from the image;
  # per-channel values must be separated by spaces
  # mean_value: 123.675 116.28 103.53
  mean_value: 0 0 0

  # the scale ratio of image preprocessing;
  # per-channel values must be separated by spaces
  # scale_value: 4.3668 4.4643 4.4444
  scale_value: 1 1 1

# model calibration parameters
calibration_parameters:

  # the directory where reference images for calibration are stored;
  # image formats include JPEG, BMP, etc.; images should come from typical
  # application scenarios, usually 20~100 picked from the test set, and should
  # avoid atypical cases such as overexposed, saturated, blurry, pure-black or
  # pure-white images; use ';' to separate multiple input nodes
  cal_data_dir: '../calibrate_precess/calibrate_float512'

  # storage type of the calibration binary files, available options: float32, uint8
  cal_data_type: 'float32'

  # if the input image size differs from the model's training size and
  # preprocess_on is True, the default preprocessing (skimage resize) is used
  # to resize or crop input images to the specified size; otherwise the user
  # must resize images to the training size in advance
  preprocess_on: False

  # the calibration algorithm, available options: default, mix, kl, max, load;
  # 'default' usually meets the requirements; if results are unsatisfactory,
  # try 'mix' first, then 'kl' or 'max'; when exporting a QAT model, set this
  # to 'load'; see the PTQ Principle And Steps section of the user manual
  calibration_type: 'default'

  # parameter of the 'max' calibration method, adjusting its intercept point;
  # only valid when calibration_type is 'max'; range: 0.0 - 1.0;
  # typical options: 0.99999/0.99995/0.99990/0.99950/0.99900
  max_percentile: 0.99999


# compiler related parameters
compiler_parameters:

  # compilation strategy, two available optimization modes: 'bandwidth' and 'latency';
  # 'bandwidth' optimizes DDR access bandwidth,
  # 'latency' optimizes inference time
  compile_mode: 'latency'

  # setting debug to True enables the compiler's debug mode, which dumps
  # performance simulation information such as frame rate and DDR bandwidth usage
  debug: True

  # number of cores used in model compilation; single core by default;
  # uncomment the line below to compile a dual-core model
  # core_num: 2

  # optimization level, range O0~O3;
  # O0 does no optimization: fastest compilation, lowest optimization;
  # O1-O3: higher levels yield faster compiled models but longer compile times;
  # O2 is recommended for fastest verification
  optimize_level: 'O3'
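If the BPU were later asked to do the normalization instead (`norm_type: 'data_mean_and_scale'` with 0-255 pixel input), the values would have to reproduce the SegFormer preprocessing. A sketch of the arithmetic, assuming the toolchain computes `(x - mean_value) * scale_value` on 0-255 pixels (worth verifying against the PTQ manual):

```python
import numpy as np

# ImageNet stats from preprocessor.json (defined on [0, 1] pixels)
image_mean = np.array([0.485, 0.456, 0.406])
image_std = np.array([0.229, 0.224, 0.225])

# On 0-255 input:  (x/255 - m) / s  ==  (x - 255*m) * (1 / (255*s))
mean_value = 255.0 * image_mean           # ~[123.675, 116.28, 103.53]
scale_value = 1.0 / (255.0 * image_std)   # ~[0.01712, 0.01751, 0.01743]

print("mean_value:", np.round(mean_value, 3))
print("scale_value:", np.round(scale_value, 5))
```

Note that the commented-out `scale_value: 4.3668 4.4643 4.4444` in the config equals `1/std` and only matches pixels already scaled to [0, 1]; paired with `mean_value: 123.675 116.28 103.53` (which assumes 0-255 pixels) the two would be mutually inconsistent.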

Board-side code for reading the quantized model's output

sunrise@ubuntu:~$ python3

import bpu_infer_lib

inf = bpu_infer_lib.Infer(True)

inf.load_model("/userdata/segformer_b0_512x512.bin")

import numpy as np

input_img=np.fromfile("/userdata/000000000872.bin",dtype=np.int8)

inf.read_input(input_img,0)

inf.forward()

inf.get_output()

out_data=inf.outputs[0].data  

The segmentation class values in out_data are all negative; no usable segmentation is produced.
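One detail worth double-checking in the snippet above (my observation, not confirmed in the thread): the input `.bin` was written as float32, but `np.fromfile` here reads it with `dtype=np.int8`, which reinterprets the raw bytes rather than quantizing the values, and yields four times as many garbage elements. Whether the compiled model actually expects float32 or a quantized input depends on its input properties, but a dtype mismatch alone can corrupt the output. A minimal demonstration:

```python
import numpy as np

x = np.array([1.0, -2.0, 0.5], dtype=np.float32)
raw = x.tobytes()

wrong = np.frombuffer(raw, dtype=np.int8)     # reinterprets the float32 bytes
right = np.frombuffer(raw, dtype=np.float32)  # faithful round trip

print(wrong.size, right.size)  # 12 vs 3: four int8 values per float32
```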

Checks already done
1. The segmentation result of original_float.onnx (an intermediate artifact of the conversion) is fine, and its output values are normal.

2. The config file parameters have been double-checked.

3. The cosine similarity reported for the quantized model is below 0.8.
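For reference, the cosine similarity metric used to compare float and quantized outputs can be reproduced on flattened tensors like this (a minimal sketch; values near 1.0 indicate good quantization fidelity, so the < 0.8 reported above is far off):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two output tensors, flattened to vectors."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```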

*As a new user I can only attach one image per post; let me know if you need any additional information.

Hi. Running into all kinds of hard-to-control numerical issues during algorithm development is normal; this is a field where experience accumulates slowly. The toolchain provides complete workflow documentation as well as debug tooling and procedures for your reference.

PTQ workflow details: 6.1. PTQ Conversion Principles and Workflow — Horizon Open Explorer

Accuracy tuning: 8.2. PTQ Model Accuracy Tuning — Horizon Open Explorer

Performance tuning: 8.1. Model Performance Tuning — Horizon Open Explorer

Accuracy debug tool: 6.2.12. Accuracy Debug Tool — Horizon Open Explorer

Runtime programming guide: 9. Embedded Application Development (Runtime) Manual — Horizon Open Explorer

If you have gone through every procedure described in the toolchain manual and the results still fall short of expectations, it means the model and its weights are inherently hard to quantize. In particular, overfit models are prone to outlier values that exceed the representational capacity of quantization.

Suggestions for bringing up a new model

  1. Almost every new model needs a pipeline check to pin down the pre- and post-processing; the problem is usually not quantization accuracy itself.

  2. Write a program that runs the original float ONNX with ONNXRuntime to establish a pre/post-processing baseline.

  3. Set both input types (train and rt) to featuremap with NCHW layout, and set norm_type to no_preprocess. The compiled quantized model and .bin model then require exactly the same input data, i.e. the same preprocessing, as the float ONNX. Prepare the calibration data and compile the .bin model on this all-featuremap basis. Since featuremap input cannot be run through the board's Python interface and requires C/C++, during debugging it is easier to run the quantized ONNX with HB_ONNXRuntime on the development machine: compiled on the all-featuremap basis, the quantized ONNX shares the float ONNX's preprocessing exactly.

  4. If accuracy on the all-featuremap basis still falls short, consult the manual and compile with all-int16 to establish the accuracy ceiling.

  5. Once the all-featuremap pipeline works, then try nv12, rgb, or other configurations that let the BPU accelerate the preprocessing.
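On point 2, the float-ONNX baseline also pins down the postprocessing. For segformer-b0-finetuned-ade-512-512 the output is, if I recall correctly, class logits at 1/4 resolution, shape (1, 150, 128, 128) for a 512x512 input. A minimal numpy postprocessing sketch (nearest-neighbor upsampling for simplicity; the Hugging Face pipeline interpolates logits bilinearly):

```python
import numpy as np

def logits_to_mask(logits: np.ndarray, out_hw=(512, 512)) -> np.ndarray:
    """Turn SegFormer logits (1, C, h, w) into an (H, W) class-index mask:
    upsample by nearest neighbor, then argmax over the class axis."""
    _, c, h, w = logits.shape
    ry, rx = out_hw[0] // h, out_hw[1] // w
    up = logits.repeat(ry, axis=2).repeat(rx, axis=3)  # (1, C, H, W)
    return up[0].argmax(axis=0).astype(np.uint8)       # (H, W) class ids
```

Running this on the float ONNX output gives the reference mask that the quantized model's postprocessed output should be compared against.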

Thank you very much!

  1. Regarding "during debugging it is easier to run the quantized ONNX with HB_ONNXRuntime on the development machine ... the quantized ONNX shares the float ONNX's preprocessing exactly": when feeding the quantized ONNX, should the preprocessed input be converted to int8, or kept as float32?
    *The featuremap input of the float ONNX model, after preprocessing, has values roughly in the range -2 to 2 (dtype float32).

  2. Regarding "write a program that runs the original float ONNX with ONNXRuntime to establish a pre/post-processing baseline": does establishing the baseline mean qualitatively checking the float ONNX model's output?

1. Keep it FP32.
2. Yes. For the final quantization, first run an all-INT16 pass; once inference looks right, switch a subset of operators to INT8 for speed.

Thank you very much! Following your advice, plus a few fixes to my earlier steps, I now get good visualized segmentation results when running the converted quantized ONNX with HB_ONNXRuntime on the development machine (using featuremap input throughout)!

  1. Does this result imply that the corresponding converted .bin model will also segment well on the board?

  2. How do I use the toolchain to validate the on-board .bin model and run inference with it? Your first reply mentioned that featuremap input cannot be run through bpu_infer_lib.

  3. For all-int16 compilation, the only method I found in the docs is listing every node name of the model in the node_info parameter. Is there a more concise toolchain configuration you would recommend?

  4. One more confirmation: does "featuremap" here mean the RGB image after loading and preprocessing, saved as a .bin file? That is how I generated my calibration .bin files used as input to the conversion.

1 & 2. Yes, but you need to write the runtime in C++.

  3. No.

  4. Dump the data at the point where the float ONNX model receives it, i.e. at the very last moment of preprocessing, and use that as the calibration data.

Thanks.
1. In production, the deployed model receives camera data or JPEG data on the board. Do you recommend inserting a preprocessing node during toolchain conversion? What should input_type_rt be set to?

2. If the toolchain input is set to RGB and a preprocessing node is inserted, should the mean and scale used by norm_type be computed for int8 or uint8 data? (They need to match the original ONNX model's preprocessing.)

3. If a preprocessing node is inserted, should the calibration set then be saved as .bin or .npy without preprocessing, i.e. read with cv2 and saved directly? As int8 or uint8?

1, 2, 3: No particular recommendation or preference; this entirely depends on your own use case. The tricks for configuring BPU-side preprocessing were all described above: start from all-featuremap. Other configuration options are not guaranteed to fully take effect, so for now only single nv12 input or pure featuremap input is recommended.

Quick question: what frame rate do you get after deploying the model on the RDK X5?