TVM Series (2): How to Compile and Optimize Models with TVMC

Author: stormQ

Created: Sunday, 05. September 2021 10:32AM

Last Modified: Sunday, 05. September 2021 01:46PM

Abstract

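This post uses a simple example to show how to compile, run, and optimize a model with TVMC on Linux, and compares the performance of the tuned and untuned models. The goal is to get an initial understanding of how the deep learning compiler framework TVM and its command-line tool TVMC work, as a starting point for studying them in more depth.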

How to Compile and Run a Model

step 1: Obtain the model

Run the following command to download the convolutional neural network ResNet-50 v2 (an ONNX model):

$ wget https://github.com/onnx/models/raw/master/vision/classification/resnet/model/resnet50-v2-7.onnx

step 2: Compile the model

1) Install dependencies

Compiling an ONNX model with TVMC requires the onnx package. Install it with:

$ sudo pip3 install onnx

In addition, at the time of writing protobuf 3.17.3 is also required. If the installed protobuf is older than that, upgrade it with:

$ sudo pip3 install --upgrade protobuf
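To check whether the installed protobuf already meets the 3.17.3 requirement before upgrading, the standard library's importlib.metadata (Python 3.8+) is enough. This is a minimal sketch: the version_at_least helper is hypothetical and only compares the leading numeric part of each version component, so pre-release suffixes such as rc1 are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def _numeric_parts(v):
    """'3.17.3' -> (3, 17, 3); stops at the first component without a leading digit."""
    parts = []
    for piece in v.split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)


def version_at_least(installed, required):
    """True if installed >= required, comparing numeric components left to right."""
    a, b = _numeric_parts(installed), _numeric_parts(required)
    width = max(len(a), len(b))
    # Pad with zeros so that "3.17" compares equal to "3.17.0"
    return a + (0,) * (width - len(a)) >= b + (0,) * (width - len(b))


if __name__ == "__main__":
    try:
        installed = version("protobuf")
    except PackageNotFoundError:
        print("protobuf is not installed")
    else:
        status = "OK" if version_at_least(installed, "3.17.3") else "upgrade needed"
        print(f"protobuf {installed}: {status}")
```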

2) Obtain the tophub package for LLVM (optional)

If the target is llvm and compiling the ONNX model prints a warning like the following (excerpt):

WARNING:root:Failed to download tophub package for llvm: <urlopen error [Errno 111] Connection refused>

then download the tophub package manually and copy it into the ~/.tvm/ directory:

$ git clone https://github.com/uwsampl/tvm-distro
$ mkdir -p ~/.tvm
$ rm -rf ~/.tvm/tophub
$ cp -r ./tvm-distro/tophub/ ~/.tvm/
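To confirm that the copy worked, the log files in ~/.tvm/tophub can be grouped by target and version. This sketch assumes the <target>_v<version>.log naming visible in the directory listing; the tophub_versions helper is hypothetical, and it only reports what is on disk, not which version a given TVM build actually requests.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: "<target>_v<version>.log", e.g. "llvm_v0.04.log"
TOPHUB_NAME = re.compile(r"^(?P<target>.+)_v(?P<version>\d+(?:\.\d+)*)\.log$")


def tophub_versions(tophub_dir):
    """Map each target name to the sorted list of versions found in tophub_dir."""
    found = defaultdict(list)
    for path in Path(tophub_dir).glob("*.log"):
        m = TOPHUB_NAME.match(path.name)
        if m:
            found[m.group("target")].append(m.group("version"))
    # Sort numerically so that 0.10 comes after 0.09
    return {
        target: sorted(versions, key=lambda v: tuple(int(x) for x in v.split(".")))
        for target, versions in found.items()
    }


if __name__ == "__main__":
    tophub = Path.home() / ".tvm" / "tophub"
    if tophub.is_dir():
        for target, versions in sorted(tophub_versions(tophub).items()):
            print(target, versions)
    else:
        print(f"{tophub} does not exist")
```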

After running these commands, ~/.tvm/tophub contains the following (excerpt):

$ ll ~/.tvm/tophub/
total 11636
drwxrwxr-x 2 xxq xxq   4096 Sep  5 09:16 ./
drwxrwxr-x 3 xxq xxq   4096 Sep  5 09:16 ../
-rw-rw-r-- 1 xxq xxq  69140 Sep  5 09:16 amd_apu_v0.01.log
...
-rw-rw-r-- 1 xxq xxq 754094 Sep  5 09:16 arm_cpu_v0.08.log
-rw-rw-r-- 1 xxq xxq 120379 Sep  5 09:16 cuda_v0.01.log
...
-rw-rw-r-- 1 xxq xxq 490618 Sep  5 09:16 cuda_v0.10.log
-rw-rw-r-- 1 xxq xxq  40851 Sep  5 09:16 intel_graphics_v0.01.log
-rw-rw-r-- 1 xxq xxq  36829 Sep  5 09:16 intel_graphics_v0.02.log
-rw-rw-r-- 1 xxq xxq  19539 Sep  5 09:16 llvm_v0.01.log
...
-rw-rw-r-- 1 xxq xxq  17353 Sep  5 09:16 llvm_v0.04.log
-rw-rw-r-- 1 xxq xxq  76499 Sep  5 09:16 mali_v0.01.log
...
-rw-rw-r-- 1 xxq xxq 318180 Sep  5 09:16 mali_v0.06.log
-rw-rw-r-- 1 xxq xxq  62637 Sep  5 09:16 opencl_v0.01.log
...
-rw-rw-r-- 1 xxq xxq  58578 Sep  5 09:16 opencl_v0.04.log
-rw-rw-r-- 1 xxq xxq 116229 Sep  5 09:16 rocm_v0.01.log
...
-rw-rw-r-- 1 xxq xxq 131072 Sep  5 09:16 rocm_v0.05.log
-rw-rw-r-- 1 xxq xxq  54525 Sep  5 09:16 vta_v0.01.log
...
-rw-rw-r-- 1 xxq xxq  63738 Sep  5 09:16 vta_v0.10.log

3) Compile the ONNX model

$ tvmc compile --target "llvm" --output resnet50-v2-7-tvm.tar resnet50-v2-7.onnx

On success, this produces a tar file named resnet50-v2-7-tvm.tar. Its contents are:

$ mkdir model && tar -xvf resnet50-v2-7-tvm.tar -C model && ll model
mod.so
mod.json
mod.params
total 100524
drwxrwxr-x 2 xxq xxq      4096 Sep  5 09:44 ./
drwxrwxr-x 5 xxq xxq      4096 Sep  5 09:44 ../
-rw-rw-r-- 1 xxq xxq     87509 Sep  5 09:28 mod.json
-rw-rw-r-- 1 xxq xxq 102125470 Sep  5 09:28 mod.params
-rwxrwxr-x 1 xxq xxq    709960 Sep  5 09:28 mod.so*

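As shown above, three files are actually generated:

mod.json, a text representation (in JSON format) of the TVM Relay computation graph
mod.params, the parameters of the pre-trained model
mod.so, the model itself, a C++ shared library that can be loaded by the TVM runtime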
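To check what tvmc compile packed into the archive without extracting it, Python's standard tarfile module is enough. A minimal sketch; the list_package helper and the list_package.py file name in the usage comment are hypothetical.

```python
import tarfile


def list_package(path):
    """Return (name, size in bytes) for each regular file in a tvmc package archive."""
    with tarfile.open(path) as tar:
        return [(m.name, m.size) for m in tar.getmembers() if m.isfile()]


if __name__ == "__main__":
    import sys

    # Usage: python3 list_package.py resnet50-v2-7-tvm.tar
    if len(sys.argv) > 1:
        for name, size in list_package(sys.argv[1]):
            print(f"{size:>12}  {name}")
```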

step 3: Run the model

TVMC uses NumPy's .npz format for both its input and output data.
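Since both files are ordinary NumPy archives, they can be written and inspected directly with numpy.savez and numpy.load. Each array is stored under a name: preprocess.py stores the input under data, and postprocess.py reads output_0 from the predictions. A minimal sketch with hypothetical helper names, using an in-memory buffer in place of a real file:

```python
import io

import numpy as np


def write_npz(file, **arrays):
    """Store each keyword argument as a named array; file may be a path or a binary file object."""
    np.savez(file, **arrays)


def read_npz(file):
    """Load every named array from an .npz archive into a plain dict."""
    with np.load(file) as data:
        return {name: data[name] for name in data.files}


if __name__ == "__main__":
    buf = io.BytesIO()  # in-memory stand-in for a file such as imagenet_cat.npz
    write_npz(buf, data=np.zeros((1, 3, 224, 224), dtype="float32"))
    buf.seek(0)
    for name, arr in read_npz(buf).items():
        print(name, arr.shape, arr.dtype)  # data (1, 3, 224, 224) float32
```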

1) Preprocess the input data

Downloading the input image through the preprocess.py script provided on the TVM website was too slow, so download the image manually instead:

$ wget https://s3.amazonaws.com/model-server/inputs/kitten.jpg

Once the download finishes, run the following command to generate the TVMC input file imagenet_cat.npz:

$ python3 preprocess.py

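Note: the script opens ./kitten.jpg relative to the current working directory, so before running the command above, put preprocess.py and kitten.jpg in the same directory and run it from there.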

The complete modified preprocess.py:

# preprocess.py
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np

# img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
# img_path = download_testdata(img_url, "imagenet_cat.png", module="data")
img_path = "./kitten.jpg"

# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# ONNX expects NCHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))

# Normalize according to ImageNet
imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_stddev = np.array([0.229, 0.224, 0.225])
norm_img_data = np.zeros(img_data.shape).astype("float32")
for i in range(img_data.shape[0]):
    norm_img_data[i, :, :] = (img_data[i, :, :] / 255 - imagenet_mean[i]) / imagenet_stddev[i]

# Add batch dimension
img_data = np.expand_dims(norm_img_data, axis=0)

# Save to .npz (outputs imagenet_cat.npz)
np.savez("imagenet_cat", data=img_data)
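The per-channel loop above computes (pixel / 255 - mean[c]) / std[c] for each channel c. The same normalization can be written with NumPy broadcasting, applying the mean and standard deviation along the channel axis before the HWC-to-CHW transpose. A sketch with a hypothetical function name; it should match the script's output up to float32 rounding:

```python
import numpy as np

# ImageNet channel statistics, as used in preprocess.py
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype="float32")
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype="float32")


def normalize_hwc_to_nchw(img_hwc):
    """Turn an HxWx3 image into a normalized 1x3xHxW float32 batch."""
    x = np.asarray(img_hwc, dtype="float32") / 255.0
    x = (x - IMAGENET_MEAN) / IMAGENET_STD  # broadcasts over H and W
    x = np.transpose(x, (2, 0, 1))  # HWC -> CHW
    return np.expand_dims(x, axis=0).astype("float32")  # add batch dim -> NCHW
```

With the resized kitten image, normalize_hwc_to_nchw(np.asarray(resized_image)) should give the same 1x3x224x224 array that the script saves under data.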

2) Run the model

$ tvmc run --inputs imagenet_cat.npz --output predictions.npz resnet50-v2-7-tvm.tar

On success, this command writes predictions.npz, which contains the model's output tensor in NumPy format.

3) Postprocess the output data

$ python3 postprocess.py

The output is:

class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262

The complete postprocess.py (taken from the TVM website):

# postprocess.py
import os.path
import numpy as np

from scipy.special import softmax

from tvm.contrib.download import download_testdata

# Download a list of labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")

with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

output_file = "predictions.npz"

# Open the output and read the output tensor
if os.path.exists(output_file):
    with np.load(output_file) as data:
        scores = softmax(data["output_0"])
        scores = np.squeeze(scores)
        ranks = np.argsort(scores)[::-1]

        for rank in ranks[0:5]:
            print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
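postprocess.py relies on scipy.special.softmax. If SciPy is not available, the same ranking can be reproduced with plain NumPy. A sketch; top_k and its signature are hypothetical. Subtracting the largest logit before exponentiating is a standard trick to avoid overflow and does not change the resulting probabilities.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def top_k(logits, labels, k=5):
    """Return [(label, probability), ...] for the k most likely classes.

    logits: shape (1, num_classes) or (num_classes,), like output_0.
    """
    scores = np.squeeze(softmax(np.asarray(logits, dtype="float64")))
    ranks = np.argsort(scores)[::-1][:k]  # indices sorted by descending probability
    return [(labels[i], float(scores[i])) for i in ranks]
```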

How to Optimize the Model

step 1: Tune the model

1) Install dependencies

By default, TVMC tunes with the xgb tuner, whose search is guided by a cost model from the Python package xgboost. Install it with:

$ sudo pip3 install xgboost

Note: the --tuner option selects the search algorithm used for tuning. To see which algorithms are available:

$ tvmc tune --help | grep tuner
                 [--tuner {ga,gridsearch,random,xgb,xgb_knob,xgb-rank}]
  --tuner {ga,gridsearch,random,xgb,xgb_knob,xgb-rank}
                        type of tuner to use when tuning with autotvm.

You also need to install the following additional dependencies:

$ sudo pip3 install tornado cloudpickle

2) Tune the model

$ tvmc tune --target "llvm" --output resnet50-v2-7-autotuner_records.json resnet50-v2-7.onnx

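On success, the command writes a tuning records file, resnet50-v2-7-autotuner_records.json. This file can be passed to the TVM compiler (tvmc compile --tuning-records) to generate high-performance code for the model on the given target, and also to the tuner (tvmc tune --tuning-records) to tune the model further.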

The tuning output is as follows:

[Task  1/25]  Current/Best:   10.35/  16.23 GFLOPS | Progress: (40/40) | 57.99 s Done.
[Task  2/25]  Current/Best:    8.30/  16.96 GFLOPS | Progress: (40/40) | 61.38 s Done.
[Task  3/25]  Current/Best:   44.45/  85.12 GFLOPS | Progress: (40/40) | 59.75 s Done.
[Task  4/25]  Current/Best:   35.55/  78.44 GFLOPS | Progress: (40/40) | 63.15 s Done.
[Task  5/25]  Current/Best:   75.24/  91.38 GFLOPS | Progress: (40/40) | 60.47 s Done.
[Task  6/25]  Current/Best:   51.89/  76.82 GFLOPS | Progress: (40/40) | 64.67 s Done.
[Task  7/25]  Current/Best:   29.40/  76.10 GFLOPS | Progress: (40/40) | 56.23 s Done.
[Task  8/25]  Current/Best:   83.11/  83.11 GFLOPS | Progress: (40/40) | 58.80 s Done.
[Task  9/25]  Current/Best:   69.41/  83.54 GFLOPS | Progress: (40/40) | 60.52 s Done.
[Task 10/25]  Current/Best:   13.51/  85.03 GFLOPS | Progress: (40/40) | 59.52 s Done.
[Task 11/25]  Current/Best:   72.05/  83.87 GFLOPS | Progress: (40/40) | 60.31 s Done.
[Task 12/25]  Current/Best:   29.50/  92.50 GFLOPS | Progress: (40/40) | 57.95 s Done.
[Task 13/25]  Current/Best:   40.43/  83.64 GFLOPS | Progress: (40/40) | 57.84 s Done.
[Task 14/25]  Current/Best:   29.07/  86.04 GFLOPS | Progress: (40/40) | 58.77 s Done.
[Task 15/25]  Current/Best:   43.92/  97.89 GFLOPS | Progress: (40/40) | 57.89 s Done.
[Task 16/25]  Current/Best:   30.39/  88.20 GFLOPS | Progress: (40/40) | 57.58 s Done.
[Task 17/25]  Current/Best:   27.61/  71.96 GFLOPS | Progress: (40/40) | 59.85 s Done.
[Task 18/25]  Current/Best:   43.94/  77.22 GFLOPS | Progress: (40/40) | 58.30 s Done.
[Task 19/25]  Current/Best:   26.52/ 102.31 GFLOPS | Progress: (40/40) | 61.32 s Done.
[Task 20/25]  Current/Best:   38.46/  65.55 GFLOPS | Progress: (40/40) | 59.96 s Done.
[Task 21/25]  Current/Best:   42.46/  91.49 GFLOPS | Progress: (40/40) | 57.80 s Done.
[Task 22/25]  Current/Best:   42.69/  97.00 GFLOPS | Progress: (40/40) | 60.31 s Done.
[Task 23/25]  Current/Best:   54.49/  95.70 GFLOPS | Progress: (40/40) | 58.52 s Done.
[Task 24/25]  Current/Best:   66.17/  95.56 GFLOPS | Progress: (40/40) | 61.63 s Done.
[Task 25/25]  Current/Best:   74.24/  74.24 GFLOPS | Progress: (40/40) | 62.14 s Done.
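The per-task lines above can also be summarized programmatically, for example to see which task reached the highest GFLOPS or how long tuning took overall. A sketch that assumes exactly the line layout shown above; DONE_LINE and summarize are hypothetical names. Only lines ending in "s Done" are counted, so each task is counted once even if a captured log also contains intermediate progress updates.

```python
import re

# One finished task, in the layout printed above, e.g.
# [Task  1/25]  Current/Best:   10.35/  16.23 GFLOPS | Progress: (40/40) | 57.99 s Done.
DONE_LINE = re.compile(
    r"\[Task\s+(\d+)/(\d+)\]\s+Current/Best:\s+([\d.]+)/\s*([\d.]+) GFLOPS"
    r"[^\[\r\n]*?\|\s*([\d.]+) s Done"
)


def summarize(log_text):
    """Return (number of finished tasks, {task: best GFLOPS}, total seconds)."""
    best, total_seconds = {}, 0.0
    for m in DONE_LINE.finditer(log_text):
        task = int(m.group(1))
        best[task] = float(m.group(4))
        total_seconds += float(m.group(5))
    return len(best), best, total_seconds


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            n, best, total = summarize(f.read())
        peak = max(best, key=best.get) if best else None
        print(f"{n} tasks, {total / 60:.1f} min in total, highest best GFLOPS in task {peak}")
```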

step 2: Build and run the tuned model

1) Build the tuned model

Compile the tuned model using the tuning records in resnet50-v2-7-autotuner_records.json:

$ tvmc compile --target "llvm" --tuning-records resnet50-v2-7-autotuner_records.json --output resnet50-v2-7-tvm_autotuned.tar resnet50-v2-7.onnx

On success, this produces the tuned model resnet50-v2-7-tvm_autotuned.tar.
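It can be useful to look at what the records file contains. Assuming it uses AutoTVM's JSON-lines log layout (one JSON object per line, with an "input" list whose second element is the task name and third the task arguments, and a "result" list whose first two elements are the measured costs in seconds and an error code), the best measured cost for each tuning task can be extracted as below. This is a sketch with a hypothetical helper; TVM's own tvm.autotvm.record module (for example load_from_file) is the authoritative reader for this format.

```python
import json
from collections import defaultdict


def best_per_task(lines):
    """Return {(task_name, args_json): best mean cost in seconds}.

    Assumed record layout (one JSON object per line):
    {"input": [target, task_name, args, kwargs], "result": [costs, error_no, ...], ...}
    """
    best = defaultdict(lambda: float("inf"))
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        _target, task_name, args, _kwargs = record["input"]
        costs, error_no = record["result"][0], record["result"][1]
        if error_no != 0 or not costs:
            continue  # skip failed measurements
        # The same operator name appears with different shapes, so key on args too
        key = (task_name, json.dumps(args))
        best[key] = min(best[key], sum(costs) / len(costs))
    return dict(best)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            results = best_per_task(f)
        print(f"{len(results)} tasks")
        for (name, _args), seconds in sorted(results.items()):
            print(f"{name:30s} {seconds * 1e3:.3f} ms")
```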

2) Run the tuned model

$ tvmc run --inputs imagenet_cat.npz --output predictions.npz resnet50-v2-7-tvm_autotuned.tar

3) Postprocess the output data

$ python3 postprocess.py

The output is:

class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356379
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262

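As the results show, tuning does not change the model's predictions: the top-5 classes are identical, and the probabilities differ only in the sixth decimal place (0.356378 vs 0.356379 for tiger cat).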
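Both tvmc run commands above write to the same predictions.npz, so comparing the raw output tensors means saving one of them under another name first, e.g. by passing a different file to --output (the name predictions_autotuned.npz is only an illustration). Because the probabilities differ slightly, a tolerance-based check is more appropriate than exact equality. A sketch; the function name and the tolerances are illustrative choices:

```python
import numpy as np


def same_predictions(a, b, rtol=1e-4, atol=1e-5, k=5):
    """True if two output tensors agree within tolerance and rank the same top-k classes."""
    a, b = np.squeeze(np.asarray(a)), np.squeeze(np.asarray(b))
    same_topk = np.array_equal(np.argsort(a)[::-1][:k], np.argsort(b)[::-1][:k])
    return bool(np.allclose(a, b, rtol=rtol, atol=atol)) and bool(same_topk)
```

For example: same_predictions(np.load("predictions.npz")["output_0"], np.load("predictions_autotuned.npz")["output_0"]).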

step 3: Compare the performance of the tuned and untuned models

1) Measure the performance of the tuned model

$ tvmc run \
--inputs imagenet_cat.npz \
--output predictions.npz  \
--print-time \
--repeat 100 \
resnet50-v2-7-tvm_autotuned.tar

The output is:

Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms) 
  103.7514     102.3751     116.3797     101.6071      2.9816

2) Measure the performance of the untuned model

$ tvmc run \
--inputs imagenet_cat.npz \
--output predictions.npz  \
--print-time \
--repeat 100 \
resnet50-v2-7-tvm.tar

The output is:

Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms) 
  123.2050     118.7625     196.0487     116.2879     12.6273

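Comparing the two results: on the author's machine, the tuned model runs on average about 1.19 times as fast as the untuned one (mean 123.2050 ms / 103.7514 ms ≈ 1.1875).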
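The speedup is simply the ratio of the untuned to the tuned latency. Computing it for several statistics from the two summaries above shows that the median ratio (about 1.16) is somewhat lower than the mean ratio, because the untuned runs include a few slow outliers (max 196.0487 ms, std 12.6273 ms). A small sketch; the dictionary layout and the speedup helper are illustrative:

```python
# Figures copied from the two "Execution time summary" tables above (milliseconds)
untuned = {"mean": 123.2050, "median": 118.7625, "max": 196.0487, "min": 116.2879, "std": 12.6273}
tuned = {"mean": 103.7514, "median": 102.3751, "max": 116.3797, "min": 101.6071, "std": 2.9816}


def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized run is (ratio of latencies)."""
    return baseline_ms / optimized_ms


for stat in ("mean", "median", "min"):
    print(f"{stat:>6}: {speedup(untuned[stat], tuned[stat]):.3f}x")
```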
