Introduction

先月初めくらいに仕事で YOLOv2 (You Only Look Once v2) の検証をしていた矢先、突如現れた YOLOv3。
検証したくとも忙しいのと、自宅は750 Ti、会社で自由に使えるGPUマシンも750 Tiと検証するには、いささか物足りない状態でした。
が、今月頭に自宅の開発機を一新。

Intel i7-8700 3.20GHz
32GB RAM
Nvidia GeForce GTX 1080

を搭載し、一気に環境がグレードアップ。
GWに入ってもモチベーションが維持できていたため検証してみました。

Try out!!

と言っても、GPUが乗っているマシンはWindows。別にUbuntu入れても良いんですが、デュアルブートにしたいので放置。
なので、YOLOを実装している深層学習フレームワーク Darknet のWindows版を試すことに。

仕事でも上のリポジトリにはお世話になりました。
今回もこれを使います。
使い方は上に書いてあるとおり。

OpenCV 3.4.0
cuDNN 7.0
CUDA 9.1

を用意して、環境変数追加、C++のプロジェクトファイルの書き換えだけです。
なので省略!! 肝心の比較。
テスト画像はお馴染みの自転車と犬と車 (768x576)。
ちなみに公式ページと結果が異なるのは、Windows版であるからだとおもいますが、劇的に違うわけでは無いので無視。

YOLOV2

D:\Works\Local\darknet\build\darknet\x64> .\darknet.exe detect cfg\yolov2.cfg yolov2.weights data\dog.jpg
layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   416 x 416 x   3   ->   416 x 416 x  32
    1 max          2 x 2 / 2   416 x 416 x  32   ->   208 x 208 x  32
    2 conv     64  3 x 3 / 1   208 x 208 x  32   ->   208 x 208 x  64
    3 max          2 x 2 / 2   208 x 208 x  64   ->   104 x 104 x  64
    4 conv    128  3 x 3 / 1   104 x 104 x  64   ->   104 x 104 x 128
    5 conv     64  1 x 1 / 1   104 x 104 x 128   ->   104 x 104 x  64
    6 conv    128  3 x 3 / 1   104 x 104 x  64   ->   104 x 104 x 128
    7 max          2 x 2 / 2   104 x 104 x 128   ->    52 x  52 x 128
    8 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
    9 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
   10 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
   11 max          2 x 2 / 2    52 x  52 x 256   ->    26 x  26 x 256
   12 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   13 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   14 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   15 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   16 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   17 max          2 x 2 / 2    26 x  26 x 512   ->    13 x  13 x 512
   18 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   19 conv    512  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 512
   20 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   21 conv    512  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 512
   22 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   23 conv   1024  3 x 3 / 1    13 x  13 x1024   ->    13 x  13 x1024
   24 conv   1024  3 x 3 / 1    13 x  13 x1024   ->    13 x  13 x1024
   25 route  16
   26 conv     64  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x  64
   27 reorg              / 2    26 x  26 x  64   ->    13 x  13 x 256
   28 route  27 24
   29 conv   1024  3 x 3 / 1    13 x  13 x1280   ->    13 x  13 x1024
   30 conv    425  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 425
   31 detection
mask_scale: Using default '1.000000'
Loading weights from yolov2.weights...
 seen 32
Done!
data\dog.jpg: Predicted in 0.012998 seconds.
pottedplant: 24%
bicycle: 25%
dog: 82%
bicycle: 82%
truck: 74%

YOLOV3

PS D:\Works\Local\darknet\build\darknet\x64> .\darknet.exe detect cfg\yolov3.cfg yolov3.weights data\dog.jpg
layer     filters    size              input                output
    0 conv     32  3 x 3 / 1   416 x 416 x   3   ->   416 x 416 x  32
    1 conv     64  3 x 3 / 2   416 x 416 x  32   ->   208 x 208 x  64
    2 conv     32  1 x 1 / 1   208 x 208 x  64   ->   208 x 208 x  32
    3 conv     64  3 x 3 / 1   208 x 208 x  32   ->   208 x 208 x  64
    4 Shortcut Layer: 1
    5 conv    128  3 x 3 / 2   208 x 208 x  64   ->   104 x 104 x 128
    6 conv     64  1 x 1 / 1   104 x 104 x 128   ->   104 x 104 x  64
    7 conv    128  3 x 3 / 1   104 x 104 x  64   ->   104 x 104 x 128
    8 Shortcut Layer: 5
    9 conv     64  1 x 1 / 1   104 x 104 x 128   ->   104 x 104 x  64
   10 conv    128  3 x 3 / 1   104 x 104 x  64   ->   104 x 104 x 128
   11 Shortcut Layer: 8
   12 conv    256  3 x 3 / 2   104 x 104 x 128   ->    52 x  52 x 256
   13 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
   14 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
   15 Shortcut Layer: 12
   16 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
   17 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
   18 Shortcut Layer: 15
   19 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
   20 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
   21 Shortcut Layer: 18
   22 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
   23 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
   24 Shortcut Layer: 21
   25 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
   26 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
   27 Shortcut Layer: 24
   28 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
   29 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
   30 Shortcut Layer: 27
   31 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
   32 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
   33 Shortcut Layer: 30
   34 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
   35 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
   36 Shortcut Layer: 33
   37 conv    512  3 x 3 / 2    52 x  52 x 256   ->    26 x  26 x 512
   38 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   39 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   40 Shortcut Layer: 37
   41 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   42 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   43 Shortcut Layer: 40
   44 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   45 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   46 Shortcut Layer: 43
   47 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   48 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   49 Shortcut Layer: 46
   50 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   51 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   52 Shortcut Layer: 49
   53 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   54 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   55 Shortcut Layer: 52
   56 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   57 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   58 Shortcut Layer: 55
   59 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   60 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   61 Shortcut Layer: 58
   62 conv   1024  3 x 3 / 2    26 x  26 x 512   ->    13 x  13 x1024
   63 conv    512  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 512
   64 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   65 Shortcut Layer: 62
   66 conv    512  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 512
   67 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   68 Shortcut Layer: 65
   69 conv    512  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 512
   70 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   71 Shortcut Layer: 68
   72 conv    512  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 512
   73 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   74 Shortcut Layer: 71
   75 conv    512  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 512
   76 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   77 conv    512  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 512
   78 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   79 conv    512  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 512
   80 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   81 conv    255  1 x 1 / 1    13 x  13 x1024   ->    13 x  13 x 255
   82 detection
   83 route  79
   84 conv    256  1 x 1 / 1    13 x  13 x 512   ->    13 x  13 x 256
   85 upsample            2x    13 x  13 x 256   ->    26 x  26 x 256
   86 route  85 61
   87 conv    256  1 x 1 / 1    26 x  26 x 768   ->    26 x  26 x 256
   88 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   89 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   90 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   91 conv    256  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 256
   92 conv    512  3 x 3 / 1    26 x  26 x 256   ->    26 x  26 x 512
   93 conv    255  1 x 1 / 1    26 x  26 x 512   ->    26 x  26 x 255
   94 detection
   95 route  91
   96 conv    128  1 x 1 / 1    26 x  26 x 256   ->    26 x  26 x 128
   97 upsample            2x    26 x  26 x 128   ->    52 x  52 x 128
   98 route  97 36
   99 conv    128  1 x 1 / 1    52 x  52 x 384   ->    52 x  52 x 128
  100 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
  101 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
  102 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
  103 conv    128  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 128
  104 conv    256  3 x 3 / 1    52 x  52 x 128   ->    52 x  52 x 256
  105 conv    255  1 x 1 / 1    52 x  52 x 256   ->    52 x  52 x 255
  106 detection
Loading weights from yolov3.weights...
 seen 64
Done!
data\dog.jpg: Predicted in 0.027030 seconds.
dog: 99%
car: 25%
bicycle: 99%
truck: 93%

Tiny YOLO

D:\Works\Local\darknet\build\darknet\x64> .\darknet.exe detect .\cfg\yolov2-tiny.cfg .\yolov2-tiny.weights .\data\dog
.jpg
layer     filters    size              input                output
    0 conv     16  3 x 3 / 1   416 x 416 x   3   ->   416 x 416 x  16
    1 max          2 x 2 / 2   416 x 416 x  16   ->   208 x 208 x  16
    2 conv     32  3 x 3 / 1   208 x 208 x  16   ->   208 x 208 x  32
    3 max          2 x 2 / 2   208 x 208 x  32   ->   104 x 104 x  32
    4 conv     64  3 x 3 / 1   104 x 104 x  32   ->   104 x 104 x  64
    5 max          2 x 2 / 2   104 x 104 x  64   ->    52 x  52 x  64
    6 conv    128  3 x 3 / 1    52 x  52 x  64   ->    52 x  52 x 128
    7 max          2 x 2 / 2    52 x  52 x 128   ->    26 x  26 x 128
    8 conv    256  3 x 3 / 1    26 x  26 x 128   ->    26 x  26 x 256
    9 max          2 x 2 / 2    26 x  26 x 256   ->    13 x  13 x 256
   10 conv    512  3 x 3 / 1    13 x  13 x 256   ->    13 x  13 x 512
   11 max          2 x 2 / 1    13 x  13 x 512   ->    13 x  13 x 512
   12 conv   1024  3 x 3 / 1    13 x  13 x 512   ->    13 x  13 x1024
   13 conv    512  3 x 3 / 1    13 x  13 x1024   ->    13 x  13 x 512
   14 conv    425  1 x 1 / 1    13 x  13 x 512   ->    13 x  13 x 425
   15 detection
mask_scale: Using default '1.000000'
Loading weights from .\yolov2-tiny.weights...
 seen 64
Done!
.\data\dog.jpg: Predicted in 0.005972 seconds.
dog: 82%
car: 24%
bicycle: 26%
car: 30%
bicycle: 59%
person: 26%
person: 28%
car: 41%
car: 73%
Compare!!
GPUの性能が良いからなのか、速度が恐ろしいことになっています。
リアルタイムいけるやん(震え声)
編集
 	YOLOV3	YOLOV2	TINY YOLO
Speed	0.027030 sec (27ms)	0.012998 sec (12ms)	0.005972 sec (5ms)
Layer Num	106	31	15
Detected Object	dog: 99%	dog: 82%	dog: 82%
bicycle: 99%	bicycle: 82%	car: 73%
truck: 93%	truck: 74%	bicycle: 59%
car: 25%	bicycle: 25%	car: 41%
pottedplant: 24%	car: 30%
person: 28%
person: 26%
bicycle: 26%
car: 24%

Compare!!

GPUの性能が良いからなのか、速度が恐ろしいことになっています。
リアルタイムいけるやん(震え声)

|項目|YOLOv3|YOLOv2|Tiny YOLO|
|:—|:—|:—|:—|:—|
|Speed|0.027030 sec (27ms)|0.012998 sec (12ms)|0.005972 sec (5ms)|
|Layer Num|106|31|15|
|Detected Object|dog: 99%|dog: 82%|dog: 82%|
||bicycle: 99%|bicycle: 82%|car: 73%|
||truck: 93%|truck: 74%|bicycle: 59%|
||car: 25%|bicycle: 25%|car: 41%|
|||pottedplant: 24%|car: 30%|
||||person: 28%|
||||person: 26%|
||||bicycle: 26%|
||||car: 24%|

画像の結果は下記。

YOLOv3

YOLOv2

Tiny YOLO

層が3倍になったからといって単純に速度が1/3という訳では無いですが、精度向上しても、これだけの速度低下なのは凄いことです。
YOLOv3の出力を見ると、Shortcut Layer という単語が頻出していますが、これはResidual Networkですかね。

論文を見ると確かにそう書いてあります。

We use a new network for performing feature extraction. Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff.
我々は特徴抽出を行う新しいネットワークを使用する。
我々の新しいネットワークは、YOLOv2、Darknet-19、そして、あの最新式のResidualネットワークで使われているネットワークのハイブリットなアプローチだ。

こいつは凄い(小並感)
ところでYOLOv4は何時ですかね(すっとぼけ)

A certain engineer "COMPLEX"

開発メモその112 YOLOv3をWindowsで試す

Introduction

Try out!!

YOLOV2

YOLOV3

Tiny YOLO

Compare!!

A certain engineer "COMPLEX"

開発メモ その112 YOLOv3をWindowsで試す

Introduction

Try out!!

YOLOV2

YOLOV3

Tiny YOLO

Compare!!

開発メモその112 YOLOv3をWindowsで試す