On Dakota's parallel computing features - いわて駐在研究日誌

Finally back. I looked into Dakota's parallel computing features, so here is a summary.

Overview and brief summary

The "simulator" that Dakota launches can of course be either serial or parallel (for example, a parallel OF run), so here I looked into the parallelism of Dakota itself.

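As far as I understood from reading Chapter 17, Parallel Computing, of the User's Manual, Dakota itself provides:

launching multiple simulations from within a single dakota process (several response-function evaluations at once)
launching multiple dakota processes via MPI (for multi-node use?)

and you can pick between them, and vary the degree of parallelism, depending on what you want to parallelize and how. Starting with the easiest to understand, Single-Level Parallel, the options are: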

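①asynchronous local

Within a single node, response-function evaluations for independent sets of design variables are launched concurrently to speed up the run.

②message passing with single simulation per node

dakota is launched on multiple nodes via MPI, and each dakota process launches one response-function evaluation simulation. The master node gathers the results.

③Hybrid

A combination of ① and ②.

Those are the three options.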

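Of these, ① or ③ is probably the better choice when each individual simulation is cheap, while ② or ③ may be better when every single evaluation is expensive, as with an OF calculation.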

Note that fork is reportedly the recommended interface for launching the parallel processes.
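To make option ① concrete for myself, here is a minimal sketch of the idea in plain Python. It is only an analogy, not how Dakota is implemented: the function names evaluate and evaluate_concurrently are made up for illustration, a thread pool stands in for Dakota forking simulator processes, and the two formulas are the ones from objective.py further down.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


def evaluate(point):
    """Stand-in for one simulation run: return both responses for (x1, x2)."""
    x1, x2 = point
    return 2.0 * sqrt(x1), x1 - x1 * x2 + 5.0


def evaluate_concurrently(points, concurrency=4):
    """Evaluate independent design points with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # map() keeps the input order, so results line up with `points`.
        return list(pool.map(evaluate, points))


if __name__ == "__main__":
    points = [(1.0, 1.0), (2.0, 1.5), (4.0, 2.0)]
    for point, responses in zip(points, evaluate_concurrently(points)):
        print(point, responses)
```

The point is simply that the design points do not depend on each other, so a fixed number of them can be in flight at once and the results collected afterwards.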

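[Example with DACE]

I look at a parallel run with DACE, where the response functions are evaluated independently for each set of design variables; this seems the easiest case to understand.

Example ①: DACE tutorial using Latin hypercube sampling (asynchronous evaluation_concurrency)

Problem: evaluating two objective functions of two design variables, implemented in Python (objective.py)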

objective.py

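#!/usr/bin/env python3
"""Two objective functions of two design variables, run from the command line."""
from sys import argv
from math import sqrt


def f1(x1, x2):
    # First objective; depends on x1 only.
    return 2.0*sqrt(x1)


def f2(x1, x2):
    # Second objective.
    return x1-x1*x2+5.0


if __name__ == '__main__':
    # Usage: objective.py x1 x2
    x1 = float(argv[1])
    x2 = float(argv[2])
    # Print both objective values on one line, separated by a space.
    print(f1(x1, x2), f2(x1, x2))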

dakota_lhs_par.in

# Dakota Input File: dakota_lhs_par.in

environment
# graphics
tabular_graphics_data
    tabular_graphics_file = 'objective.dat'

method
dace oa_lhs
     seed = 5
     samples = 121
model
single

variables
continuous_design = 2
    initial_point     2.0      1.5
    lower_bounds      1.0      1.0
    upper_bounds      4.0      2.0
    descriptors       'x1'     "x2"

interface
fork
    asynchronous evaluation_concurrency = 4
analysis_driver = 'simulator_script'
parameters_file = 'params.in'
results_file = 'results.out'
work_directory directory_tag
    copy_files = 'templatedir/*'
named 'workdir' file_save directory_save
    aprepro

responses
response_functions = 2
no_gradients
no_hessians

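Specifying evaluation_concurrency, as in the "asynchronous evaluation_concurrency = 4" line of the interface block above, makes Dakota run several sample evaluations at the same time. Here the run took 24.2 s in serial and 7.5 s with evaluation_concurrency = 4.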

$ dakota dakota_lhs_par.in

[snip]

<<<<< Function evaluation summary: 121 total (121 new, 0 duplicate)

Simple Correlation Matrix among all inputs and outputs:
                       x1           x2 response_fn_1 response_fn_2
          x1 1.00000e+00
          x2 1.37441e-02 1.00000e+00
response_fn_1 9.96397e-01 1.52353e-02 1.00000e+00
response_fn_2 -4.95228e-01 -8.29560e-01 -4.94665e-01 1.00000e+00

Partial Correlation Matrix between input and output:
             response_fn_1 response_fn_2
          x1 9.96398e-01 -8.66506e-01
          x2 1.81689e-02 -9.47129e-01

Simple Rank Correlation Matrix among all inputs and outputs:
                       x1           x2 response_fn_1 response_fn_2
          x1 1.00000e+00
          x2 1.41173e-02 1.00000e+00
response_fn_1 1.00000e+00 1.41173e-02 1.00000e+00
response_fn_2 -4.56767e-01 -8.63711e-01 -4.56767e-01 1.00000e+00

Partial Rank Correlation Matrix between input and output:
             response_fn_1 response_fn_2
          x1 1.00000e+00 -8.82201e-01
          x2 1.16427e-10 -9.63760e-01

<<<<< Iterator dace completed.
<<<<< Environment execution completed.
DAKOTA execution time in seconds:
Total CPU        =       0.09 [parent =   0.095986, child = -0.005986]
Total wall clock =    7.45185
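As a quick sanity check on those timings, the speedup and parallel efficiency can be worked out as below. The 24.2 s serial figure is the one I measured, the 7.45185 s figure is the "Total wall clock" from the log, and the helper name speedup_and_efficiency is just something I made up for this note.

```python
def speedup_and_efficiency(serial_s, parallel_s, workers):
    """Return (speedup, efficiency) for a run using `workers` concurrent jobs."""
    speedup = serial_s / parallel_s
    return speedup, speedup / workers


if __name__ == "__main__":
    # 24.2 s serial vs. 7.45185 s with evaluation_concurrency = 4
    speedup, efficiency = speedup_and_efficiency(24.2, 7.45185, 4)
    print(f"speedup    = {speedup:.2f}x")
    print(f"efficiency = {efficiency:.0%}")
```

That comes out at roughly 3.2x, or about 81% efficiency on four concurrent evaluations, so some overhead remains on top of the evaluations themselves.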

Note: there is also an analysis_concurrency keyword. It limits how many analysis drivers are run at the same time within an evaluation.

When asynchronous execution is enabled and each evaluation involves multiple analysis drivers, then the default behavior is to launch all drivers simultaneously. The analysis_concurrency keyword can be used to limit the number of concurrently run drivers.
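The input file above hands each evaluation to simulator_script through params.in and results.out, but the script itself is not shown in this post. Purely as a hypothetical stand-in, a Python driver could look something like the sketch below. The assumptions are mine: that the parameters file contains aprepro-style "{ name = value }" lines (because the input asks for aprepro), that the driver receives the parameters and results file names as its two arguments, and that the results file holds one value per line followed by an optional label. The labels f1 and f2 and all function names are invented for illustration.

```python
import re
import sys
from math import sqrt

# Matches aprepro-style lines such as "{ x1 = 2.0 }" (assumed format).
APREPRO_LINE = re.compile(r"\{\s*(\S+)\s*=\s*(\S+)\s*\}")


def read_params(text):
    """Collect `{ name = value }` pairs from the parameters file text."""
    return {name: value for name, value in APREPRO_LINE.findall(text)}


def format_results(x1, x2):
    """Format both responses, one per line: value first, then a label."""
    f1 = 2.0 * sqrt(x1)
    f2 = x1 - x1 * x2 + 5.0
    return f"{f1:.10e} f1\n{f2:.10e} f2\n"


def main(params_path, results_path):
    """Read the design variables, evaluate, and write the results file."""
    with open(params_path) as fh:
        params = read_params(fh.read())
    with open(results_path, "w") as fh:
        fh.write(format_results(float(params["x1"]), float(params["x2"])))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Expected call: <driver> <parameters file> <results file>
    main(sys.argv[1], sys.argv[2])
```

The real simulator_script may well work differently, for example by filling in a template from templatedir and then calling objective.py.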

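Example ②: DACE tutorial using Latin hypercube sampling (mpirun on single node)

Problem: evaluating two objective functions of two design variables, implemented in Python (objective.py), same as above

dakota_lhs.in (note that evaluation_concurrency is not specified)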

# Dakota Input File: dakota_lhs.in
environment
# graphics
tabular_graphics_data
    tabular_graphics_file = 'objective.dat'

method
dace oa_lhs
     seed = 5
     samples = 100

model
single

variables
continuous_design = 2
    initial_point     2.0      1.5
    lower_bounds      1.0      1.0
    upper_bounds      4.0      2.0
    descriptors       'x1'     "x2"

interface
fork

###    asynchronous evaluation_concurrency = 4
analysis_driver = 'simulator_script'
parameters_file = 'params.in'
results_file = 'results.out'
work_directory directory_tag
    copy_files = 'templatedir/*'
named 'workdir' file_save directory_save
    aprepro

responses
response_functions = 2
no_gradients
no_hessians

$ mpirun -np 4 dakota dakota_lhs.in > dakota_lhs_openmpi.out

→ Four dakota processes are launched. In the previous example there was only one.

[snip]

<<<<< Function evaluation summary: 121 total (121 new, 0 duplicate)

Simple Correlation Matrix among all inputs and outputs:
                       x1           x2 response_fn_1 response_fn_2
          x1 1.00000e+00
          x2 1.37441e-02 1.00000e+00
response_fn_1 9.96397e-01 1.52353e-02 1.00000e+00
response_fn_2 -4.95228e-01 -8.29560e-01 -4.94665e-01 1.00000e+00

Partial Correlation Matrix between input and output:
             response_fn_1 response_fn_2
          x1 9.96398e-01 -8.66506e-01
          x2 1.81689e-02 -9.47129e-01

Simple Rank Correlation Matrix among all inputs and outputs:
                       x1           x2 response_fn_1 response_fn_2
          x1 1.00000e+00
          x2 1.41173e-02 1.00000e+00
response_fn_1 1.00000e+00 1.41173e-02 1.00000e+00
response_fn_2 -4.56767e-01 -8.63711e-01 -4.56767e-01 1.00000e+00

Partial Rank Correlation Matrix between input and output:
             response_fn_1 response_fn_2
          x1 1.00000e+00 -8.82201e-01
          x2 1.16427e-10 -9.63760e-01

<<<<< Iterator dace completed.
<<<<< Environment execution completed.
DAKOTA master processor execution time in seconds:
Total CPU        =       0.21 [parent   =   0.206969, child =   0.003031]
Total wall clock =    7.38266 [MPI_Init = 9.05991e-06, run   =    7.38266]

Note: the run time is about 7.4 s, almost the same as with evaluation_concurrency specified.
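One way to picture why four MPI processes land at about the same time as four concurrent local evaluations is to split the 121 evaluations evenly across four workers. The sketch below is only an illustration of a static round-robin split. It is not taken from Dakota, whose scheduler may hand out work differently (for instance dynamically, or with one process reserved as master), and partition_round_robin is a name I made up.

```python
def partition_round_robin(n_evals, n_workers):
    """Assign evaluation IDs 1..n_evals to workers in round-robin order."""
    return [list(range(w + 1, n_evals + 1, n_workers)) for w in range(n_workers)]


if __name__ == "__main__":
    for rank, ids in enumerate(partition_round_robin(121, 4)):
        print(f"worker {rank}: {len(ids)} evaluations, first IDs {ids[:3]}")
```

With this split the busiest worker gets 31 of the 121 evaluations, which is the same per-worker load you would expect from keeping four local evaluations in flight.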

So it looks like MPI is the way to go for multi-node runs, and evaluation_concurrency is enough on a single node.

→ Next: look into examples of gradient-based and non-gradient-based optimization.