Reference sites: http://hwengineer.blogspot.com http://leearmykr.blogspot.com/
Procedure
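OS environment setup
● Disable SELinux
◎ Check the current state
[root@ac922 ~]# getenforce
Enforcing
-> "Enforcing" means SELinux is enabled and its security policy is being enforced.
◎ Disable SELinux temporarily
[root@ac922 ~]# setenforce 0
[root@ac922 ~]# getenforce
Permissive
-> setenforce 0 switches the running system to Permissive mode (violations are logged, not blocked). This is temporary: SELinux returns to Enforcing after the OS reboots.
◎ Keep SELinux from being enabled at boot
[root@ac922 ~]# vi /etc/selinux/config
Change SELINUX=enforcing to SELINUX=disabled and save. -> SELinux is not used from the next boot onward.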
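The manual vi edit can also be scripted. This is a minimal sketch, assuming the stock `SELINUX=` line in `/etc/selinux/config`. The function name `disable_selinux_config` is hypothetical and not part of the original notes. It rewrites only the `SELINUX=` line and keeps a `.bak` copy. The `SELINUXTYPE=` line does not match the pattern, so it is left alone.

```shell
#!/usr/bin/env bash
# Hypothetical helper: set SELINUX=disabled in an SELinux config file.
# Default path is /etc/selinux/config; pass another path to try it on a copy.
disable_selinux_config() {
  local conf="${1:-/etc/selinux/config}"
  # Replace the whole SELINUX=... line. SELINUXTYPE=... is untouched because
  # "SELINUXTYPE=" does not start with "SELINUX=". GNU sed writes $conf.bak.
  sed -i.bak 's/^SELINUX=.*/SELINUX=disabled/' "$conf"
}
```

Run `disable_selinux_config` as root and reboot. After the reboot, `getenforce` prints `Disabled`.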
● Disable the firewall (iptables)
◎ Check the firewall service status
[root@ac922 ~]# systemctl status firewalld.service
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2018-02-22 15:40:27 KST; 19h ago
     Docs: man:firewalld(1)
 Main PID: 2012 (firewalld)
   CGroup: /system.slice/firewalld.service
           └─2012 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
-> If it shows active (running), firewalld.service has to be brought down.
◎ Disable the firewall service at boot
[root@ac922 ~]# systemctl disable firewalld.service
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
-> The firewall (iptables) service no longer starts when the OS reboots.
◎ Shut down the running firewall service
[root@ac922 ~]# systemctl stop firewalld.service -> stops the daemon that is running now (disable only affects the next boot)
[root@ac922 ~]# iptables -F -> clear the firewall rules
[root@ac922 ~]# iptables -L -> check the rules currently applied
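The firewall steps above can be combined into one function. This is a minimal sketch and is not part of the original procedure. The function `disable_firewall` and the `SYSTEMCTL`/`IPTABLES` override variables are hypothetical names; the overrides allow a dry run with `echo`. The function stops the running daemon before disabling it, because `systemctl disable` alone leaves the current instance running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: stop and disable firewalld, then flush iptables rules.
# SYSTEMCTL / IPTABLES (hypothetical overrides) default to the real commands;
# set them to "echo" to print the steps instead of running them.
disable_firewall() {
  local systemctl_cmd="${SYSTEMCTL:-systemctl}"
  local iptables_cmd="${IPTABLES:-iptables}"
  # Stop the daemon that is running now.
  $systemctl_cmd stop firewalld.service || return 1
  # Remove it from the boot targets so it stays off after a reboot.
  $systemctl_cmd disable firewalld.service || return 1
  # Flush all rules in the filter table, then list what is left.
  $iptables_cmd -F || return 1
  $iptables_cmd -L
}
```

`SYSTEMCTL=echo IPTABLES=echo disable_firewall` prints the four steps without running them. Running `disable_firewall` as root applies them.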
● Setting up a local repository without Internet access
◎ Local repository from the installation CD-ROM
[root@ac922 ~]# mkdir /cdrom
[root@ac922 ~]# mount -t iso9660 /dev/sr0 /cdrom -> mount the OS installation CD-ROM on /cdrom
[root@ac922 ~]# cd /etc/yum.repos.d/
[root@ac922 yum.repos.d]# vi local.repo -> create a repository definition for the mounted media
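The notes do not record the contents of `local.repo`. Below is a minimal sketch of a definition that would fit, assuming the RHEL 7 media is mounted on `/cdrom` with its `repodata/` at the top level. The section name `local-repository` matches the repo id shown in the `yum update` output below. The `name`, `gpgcheck` and `gpgkey` values are assumptions, and `write_local_repo` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a yum .repo file for installation media.
#   $1 = output file (default /etc/yum.repos.d/local.repo)
#   $2 = media mount point (default /cdrom)
write_local_repo() {
  local out="${1:-/etc/yum.repos.d/local.repo}"
  local media="${2:-/cdrom}"
  # Unquoted EOF so that $media is expanded into the file.
  cat > "$out" <<EOF
[local-repository]
name=RHEL 7 installation media ($media)
baseurl=file://$media
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
EOF
}
```

Run `write_local_repo` as root. Then run `yum clean all` and `yum repolist` to confirm that `local-repository` is listed.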
◎ Update the YUM repository metadata
[root@ac922 yum.repos.d]# yum update
Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
This system is not registered with an entitlement server. You can use subscription-manager to register.
local-repository                                   | 4.1 kB  00:00:00
(1/2): local-repository/group_gz                   | 129 kB  00:00:00
(2/2): local-repository/primary_db                 | 3.2 MB  00:00:00
No packages marked for update
-> YUM reads the metadata of the new local repository.
● Install Red Hat EPEL
- Extra Packages for Enterprise Linux repository configuration
- Install epel-release when you need extra packages beyond the ones Red Hat (CentOS) provides by default.
[root@ac922 ~]# wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
[root@ac922 ~]# rpm -ihv epel-release-latest-7.noarch.rpm
[root@ac922 ~]# yum update
[root@ac922 ~]# yum install nmon
Loaded plugins: langpacks, product-id, search-disabled-repos, subscription-manager
This system is not registered with an entitlement server. You can use subscription-manager to register.
Resolving Dependencies
--> Running transaction check
---> Package nmon.ppc64le 0:16g-3.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

===========================================================================================================================
 Package              Arch                 Version                  Repository           Size
===========================================================================================================================
Installing:
 nmon                 ppc64le              16g-3.el7                epel                 70 k

Transaction Summary
===========================================================================================================================
Install  1 Package

Total download size: 70 k
Installed size: 199 k
Is this ok [y/d/N]: y
-> installs the nmon package as a test
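Before installing from EPEL, you can confirm that it is actually enabled by checking the `yum repolist` output. This sketch assumes the usual layout: the first column is the repo id, optionally followed by `/<arch>` and optionally preceded by a status marker such as `!`. `repo_enabled` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `yum repolist` output on stdin and succeed (exit 0)
# if a repo whose id is $1 (ignoring any "/<arch>" suffix) is listed.
repo_enabled() {
  awk -v id="$1" '
    {
      name = $1
      sub(/^[!*]/, "", name)    # drop a leading status marker
      sub(/\/.*/, "", name)     # epel/ppc64le -> epel
      if (name == id) found = 1
    }
    END { exit !found }'
}
```

Example usage: `yum repolist | repo_enabled epel && echo "EPEL enabled"`.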
● Enable the Red Hat subscription
-> If no subscription is available yet, request a 30-day evaluation at redhat.com and register the system with it (subscription-manager register, then subscription-manager attach --auto). The subscription can then be used for 30 days.
### Operating System and Repository Setup
1. Enable 'optional' and 'extra' repo channels
$ sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
$ sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
2. Install packages needed for the installation
$ sudo yum -y install wget nano bzip2
3. Enable EPEL repo (already installed above; skip)
$ wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
$ sudo rpm -ihv epel-release-latest-7.noarch.rpm
4. Load the latest kernel
$ sudo yum update kernel kernel-tools kernel-tools-libs kernel-bootwrapper
$ sudo reboot  # This reboot may be deferred until after the NVIDIA steps below.
Or do a full update:
$ sudo yum update
$ sudo reboot  # This reboot may be deferred until after the NVIDIA steps below.
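Before step 2, it can help to check which of the two channels are still missing. This sketch assumes `subscription-manager repos --list-enabled` prints one `Repo ID:` line per enabled repository. `missing_repos` and `REQUIRED_REPOS` are hypothetical names.

```shell
#!/usr/bin/env bash
# Channels from step 1 above.
REQUIRED_REPOS="rhel-7-for-power-9-optional-rpms rhel-7-for-power-9-extras-rpms"

# Hypothetical helper: read `subscription-manager repos --list-enabled` output
# on stdin and print every required repo id that is not listed as enabled.
missing_repos() {
  local enabled repo
  # Keep only the value after "Repo ID:", trimming trailing blanks.
  enabled=$(awk -F':[[:space:]]*' '/^Repo ID:/ { v = $2; sub(/[[:space:]]+$/, "", v); print v }')
  for repo in $REQUIRED_REPOS; do
    printf '%s\n' "$enabled" | grep -qxF -- "$repo" || printf '%s\n' "$repo"
  done
}
```

Example usage: `for r in $(sudo subscription-manager repos --list-enabled | missing_repos); do sudo subscription-manager repos --enable="$r"; done`.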
● NVIDIA driver setup for POWER9
### NVIDIA Components
Before installing the NVIDIA components, the udev Memory Auto-Onlining Rule must be disabled for the CUDA driver to function properly. To disable it:
1. Edit the /lib/udev/rules.d/40-redhat.rules file.
$ sudo nano /lib/udev/rules.d/40-redhat.rules
2. Comment out the following line and save the change:
SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"
3. Reboot the system for the changes to take effect.
$ sudo reboot
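Step 2 can be done non-interactively with `sed`. In this minimal sketch, `disable_memory_autoonline` is a hypothetical helper. It assumes the rule line starts exactly with `SUBSYSTEM=="memory", ACTION=="add"` as quoted above. A line that is already commented out no longer matches, so running the helper twice is harmless.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefix "#" to the udev memory auto-onlining rule.
# Default file is /lib/udev/rules.d/40-redhat.rules; a .bak copy is kept.
disable_memory_autoonline() {
  local rules="${1:-/lib/udev/rules.d/40-redhat.rules}"
  # Only lines that start with the memory/add match are commented out;
  # other rules in the file (for example the CPU onlining rule) stay active.
  sed -i.bak '/^SUBSYSTEM=="memory", ACTION=="add"/ s/^/#/' "$rules"
}
```

Files in `/lib/udev/rules.d` belong to packages and may be replaced on update. A same-named copy in `/etc/udev/rules.d` takes precedence, so editing such a copy instead is an option worth considering.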
The Deep Learning packages require CUDA, cuDNN, and GPU driver packages from NVIDIA.
The required and recommended versions of these components are:
| Component | Required | Recommended |
|--------------|----------|-------------|
| CUDA Toolkit | 9.1 | 9.1.85 |
| cuDNN | 7.0.5 | 7.0.5 |
| GPU Driver | 387.36 | 387.36 |
1. Download and install NVIDIA CUDA 9.1 from [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)
   - Select *Operating System:* **Linux**
   - Select *Architecture:* **ppc64le**
   - Select *Distribution:* **RHEL**
   - Select *Version:* **7**
   - Select the *Installer Type* that best fits your needs
   - Follow the **Linux POWER9** installation instructions in the *CUDA Quick Start Guide* (linked from [https://developer.nvidia.com/cuda-downloads](https://developer.nvidia.com/cuda-downloads)), including the steps describing how to set up the CUDA development environment by updating `PATH` and `LD_LIBRARY_PATH`.
2. Download NVIDIA cuDNN 7.0.5 for CUDA 9.1 from [https://developer.nvidia.com/cudnn](https://developer.nvidia.com/cudnn) (registration in NVIDIA's Accelerated Computing Developer Program is required)
   - cuDNN v7.0.5 Library for Linux (Power8/Power9)
3. Install the cuDNN v7.0 packages
$ sudo tar -C /usr/local --no-same-owner -xzvf cudnn-9.1-linux-ppc64le-v7.0.5.tgz
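The sketch below covers the `PATH`/`LD_LIBRARY_PATH` setup from the Quick Start Guide and a check of the installed cuDNN version. It makes three assumptions:

- CUDA lives in `/usr/local/cuda-9.1`.
- After the `tar -C /usr/local` step, the cuDNN header is reachable as `/usr/local/cuda/include/cudnn.h`.
- The header defines `CUDNN_MAJOR`, `CUDNN_MINOR` and `CUDNN_PATCHLEVEL`.

`cuda_env` and `cudnn_version` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Hypothetical helper: put a CUDA install prefix on PATH and LD_LIBRARY_PATH.
# ${VAR:+:$VAR} avoids a trailing ":" when the variable was empty.
cuda_env() {
  local prefix="${1:-/usr/local/cuda-9.1}"
  export PATH="$prefix/bin${PATH:+:$PATH}"
  export LD_LIBRARY_PATH="$prefix/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Hypothetical helper: print MAJOR.MINOR.PATCHLEVEL from a cudnn.h header.
# Exits non-zero if no CUDNN_MAJOR define is found.
cudnn_version() {
  local header="${1:-/usr/local/cuda/include/cudnn.h}"
  awk '$1 == "#define" && $2 == "CUDNN_MAJOR"      { ma = $3 }
       $1 == "#define" && $2 == "CUDNN_MINOR"      { mi = $3 }
       $1 == "#define" && $2 == "CUDNN_PATCHLEVEL" { pa = $3 }
       END { if (ma == "") exit 1; print ma "." mi "." pa }' "$header"
}
```

`cuda_env` can be called from `.bashrc`. When the cuDNN 7.0.5 package from the table is installed, `cudnn_version` is expected to print `7.0.5`.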
A number of the Deep Learning frameworks require Anaconda. Anaconda is a platform-agnostic data science distribution with a collection of 1,000+ open source packages with free community support.
Download and install Anaconda. The installer asks you to accept the license agreement, choose an install location (default `$HOME/anaconda2`), and allow it to modify the `PATH` environment variable (via `.bashrc`).
$ wget https://repo.continuum.io/archive/Anaconda2-5.0.0-Linux-ppc64le.sh
$ bash Anaconda2-5.0.0-Linux-ppc64le.sh
$ source ~/.bashrc
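For scripted installs, the Anaconda installer's batch mode can replace the interactive prompts. This sketch makes three assumptions:

- The installer supports `-b` (batch mode, which implies accepting the license).
- The installer supports `-p <prefix>`.
- Batch mode leaves `.bashrc` alone, so the function extends `PATH` itself.

`install_anaconda` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: unattended Anaconda install plus PATH setup.
#   $1 = installer script, $2 = install prefix (default $HOME/anaconda2)
install_anaconda() {
  local installer="${1:-Anaconda2-5.0.0-Linux-ppc64le.sh}"
  local prefix="${2:-$HOME/anaconda2}"
  # Skip the installer when conda is already present under the prefix.
  if [ ! -x "$prefix/bin/conda" ]; then
    bash "$installer" -b -p "$prefix" || return 1
  fi
  # Prepend $prefix/bin to PATH only once.
  case ":$PATH:" in
    *":$prefix/bin:"*) ;;
    *) export PATH="$prefix/bin:$PATH" ;;
  esac
}
```

The same `export PATH=...` line can be added to `.bashrc` by hand if later login shells should see Anaconda.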
● Installing the Deep Learning Frameworks (PowerAI installation)
### IBM Spectrum MPI Install
Download IBM Spectrum MPI from the ESP download site.
1. Install the rpms
$ sudo rpm -ihv ibm_smpi_lic_s-10.02.00*.ppc64le.rpm ibm_smpi-10.02.00*.ppc64le.rpm

### Software Repository Setup
IBM TensorFlow ESP for Power AC922 Deep Learning packages are distributed in an rpm file that is available from the ESP download site. Installing the rpm creates an installation repository on the local machine.
1. Install the repository package:
$ sudo rpm -ihv mldl-repo-local*.rpm

### Installing all frameworks at once
All the Deep Learning frameworks can be installed at once using the `power-mldl-esp` meta-package:
$ sudo yum install power-mldl-esp

### Installing frameworks individually
-> SKIP this if you already installed power-mldl-esp above!
The Deep Learning frameworks can be installed individually if preferred. The framework packages are:
- `tensorflow` - Google TensorFlow, v1.4.0
- `tensorboard` - Web Applications for inspecting TensorFlow runs and graphs, v0.4.0rc3
- `ddl-tensorflow` - Distributed Deep Learning custom operator for TensorFlow
Each can be installed with:
$ sudo yum install <framework>-cuda9.1

### Accept the License Agreement
Read the license agreements and accept the terms and conditions before using Spectrum MPI or any of the frameworks.
$ sudo IBM_SPECTRUM_MPI_LICENSE_ACCEPT=no /opt/ibm/spectrum_mpi/lap_se/bin/accept_spectrum_mpi_license.sh
$ sudo /opt/DL/license/bin/accept-powerai-license.sh
After reading the license agreements, future installs may be automated to silently accept them.
$ sudo IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes /opt/ibm/spectrum_mpi/lap_se/bin/accept_spectrum_mpi_license.sh
$ sudo IBM_POWERAI_LICENSE_ACCEPT=yes /opt/DL/license/bin/accept-powerai-license.sh
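After `yum install power-mldl-esp`, a quick check can confirm that each framework left its activation script in place. This sketch assumes every package follows the `/opt/DL/<framework>/bin/<framework>-activate` layout described under General Setup. `check_frameworks` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report, for each framework package listed above,
# whether <root>/<fw>/bin/<fw>-activate exists. Returns 1 if any is missing.
check_frameworks() {
  local root="${1:-/opt/DL}"
  local fw missing=0
  for fw in tensorflow tensorboard ddl-tensorflow; do
    if [ -f "$root/$fw/bin/$fw-activate" ]; then
      echo "ok      $fw"
    else
      echo "MISSING $fw"
      missing=1
    fi
  done
  return "$missing"
}
```

Run `check_frameworks` with no argument on the installed system. A `MISSING` line points to a package that did not install.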
● OS tuning for the PowerAI environment
## Tuning Recommendations
Recommended settings for optimal Deep Learning performance on the IBM Power System AC922 are:
- Enable Performance Governor
$ sudo yum install kernel-tools
$ sudo cpupower -c all frequency-set -g performance
- Enable GPU persistence mode
$ sudo systemctl enable nvidia-persistenced
$ sudo systemctl start nvidia-persistenced
- For TensorFlow, set the SMT mode
$ sudo ppc64_cpu --smt=4
- For TensorFlow with DDL, set the SMT mode
$ sudo ppc64_cpu --smt=2

## Getting Started with MLDL Frameworks
### General Setup
Most of the PowerAI packages install outside the normal system search paths (to `/opt/DL/...`), so each framework package provides a shell script to simplify environment setup (e.g. `PATH`, `LD_LIBRARY_PATH`, `PYTHONPATH`). We recommend that users update their shell rc file (e.g. `.bashrc`) to source the desired setup scripts. For example:
$ source /opt/DL/<framework>/bin/<framework>-activate
Each framework also provides a test script to verify basic function:
$ <framework>-test

### Note about dependencies
A number of the PowerAI frameworks (for example, TensorFlow and TensorBoard) have their dependencies satisfied via Anaconda packages. The `<framework>-activate` script checks that these dependencies are installed and fails if they are not. For these frameworks, the `/opt/DL/<framework>/bin/install_dependencies` script must be run before activation to install the required packages. For example:
$ source /opt/DL/tensorflow/bin/tensorflow-activate
Missing dependencies ['backports.weakref', 'mock', 'protobuf']
Run "/opt/DL/tensorflow/bin/install_dependencies" to resolve this problem.
$ /opt/DL/tensorflow/bin/install_dependencies
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /home/rhel/anaconda2:

The following NEW packages will be INSTALLED:

    backports.weakref: 1.0rc1-py27_0
    libprotobuf:       3.4.0-hd26fab5_0
    mock:              2.0.0-py27_0
    pbr:               1.10.0-py27_0
    protobuf:          3.4.0-py27h7448ec6_0

Proceed ([y]/n)? y

libprotobuf-3. 100% |###############################| Time: 0:00:02   2.04 MB/s
backports.weak 100% |###############################| Time: 0:00:00  12.83 MB/s
protobuf-3.4.0 100% |###############################| Time: 0:00:00   2.20 MB/s
pbr-1.10.0-py2 100% |###############################| Time: 0:00:00   3.35 MB/s
mock-2.0.0-py2 100% |###############################| Time: 0:00:00   3.26 MB/s
$ source /opt/DL/tensorflow/bin/tensorflow-activate
$
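The tuning recommendations at the start of this section can be grouped into one function. This sketch uses hypothetical names: `recommended_smt`, `apply_powerai_tuning`, and a `RUN` override that defaults to `sudo` and can be set to `echo` for a dry run. It assumes `kernel-tools` is already installed for `cpupower`. Governor and SMT settings made this way apply to the running system, so they may need to be reapplied after a reboot.

```shell
#!/usr/bin/env bash
# Hypothetical helper: SMT mode recommended above (2 with DDL, otherwise 4).
recommended_smt() {
  if [ "${1:-}" = "ddl" ]; then echo 2; else echo 4; fi
}

# Hypothetical helper: apply the AC922 tuning steps.
#   $1 = "ddl" when TensorFlow runs with DDL, anything else otherwise.
#   RUN (hypothetical override) defaults to sudo; RUN=echo prints the steps.
apply_powerai_tuning() {
  local run="${RUN:-sudo}"
  local smt
  smt=$(recommended_smt "${1:-}")
  $run cpupower -c all frequency-set -g performance || return 1
  $run systemctl enable nvidia-persistenced || return 1
  $run systemctl start nvidia-persistenced || return 1
  $run ppc64_cpu --smt="$smt"
}
```

`RUN=echo apply_powerai_tuning ddl` prints the four commands ending in `ppc64_cpu --smt=2`.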
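The activate / install_dependencies sequence in the transcript can be wrapped so that the dependency installer runs only when activation fails. In this sketch, `activate_framework` is a hypothetical helper. Its second argument (default `/opt/DL`) exists only so it can be tried against a scratch directory. `install_dependencies` may still ask `Proceed ([y]/n)?` as in the transcript.

```shell
#!/usr/bin/env bash
# Hypothetical helper: source <root>/<fw>/bin/<fw>-activate; if that fails,
# run <root>/<fw>/bin/install_dependencies once and try activating again.
activate_framework() {
  local fw="$1"
  local root="${2:-/opt/DL}"
  local bin="$root/$fw/bin"
  # Source (not execute) so PATH/PYTHONPATH changes reach the current shell.
  if . "$bin/$fw-activate"; then
    return 0
  fi
  echo "activation of $fw failed; running $bin/install_dependencies" >&2
  "$bin/install_dependencies" || return 1
  . "$bin/$fw-activate"
}
```

Example usage: `activate_framework tensorflow && tensorflow-test`.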
● Getting Started with Tensorflow
The TensorFlow homepage ([https://www.tensorflow.org/](https://www.tensorflow.org/)) has a variety of information, including Tutorials, How Tos, and a Getting Started guide.
Additional tutorials and examples are available from the community, for example:
- [https://github.com/nlintz/TensorFlow-Tutorials](https://github.com/nlintz/TensorFlow-Tutorials)
- [https://github.com/aymericdamien/TensorFlow-Examples](https://github.com/aymericdamien/TensorFlow-Examples)

#### Distributed Deep Learning (DDL) Custom Operator for TensorFlow
IBM TensorFlow ESP for Power AC922 includes a Technology Preview of the IBM PowerAI Distributed Deep Learning (DDL) custom operator for TensorFlow. The DDL custom operator uses IBM Spectrum MPI and NCCL to provide high-speed communications for distributed TensorFlow.
The DDL custom operator can be found in the `ddl-tensorflow` package. For more information about DDL and about the TensorFlow operator, see:
- `/opt/DL/ddl-doc/doc/README.md`
- `/opt/DL/ddl-tensorflow/doc/README.md`
- `/opt/DL/ddl-tensorflow/doc/README-API.md`
The DDL TensorFlow operator makes it easy to enable models for distribution. The package includes examples of models enabled with DDL, including TensorFlow High Performance and Slim models:
$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
$ ddl-tensorflow-install-samples <somedir>
The Slim model examples are based on a specific commit of the TensorFlow models repo with a small adjustment. If you prefer to work from an upstream clone rather than the packaged examples:
$ git clone https://github.com/tensorflow/models.git
$ cd models
$ git checkout 11883ec6461afe961def44221486053a59f90a1b
$ git revert fc7342bf047ec5fc7a707202adaf108661bd373d
$ cp /opt/DL/ddl-tensorflow/examples/slim/train_image_classifier.py slim/

#### Additional TensorFlow Features
The PowerAI TensorFlow packages include TensorBoard. See: [https://www.tensorflow.org/get_started/summaries_and_tensorboard](https://www.tensorflow.org/get_started/summaries_and_tensorboard)
The TensorFlow 1.4.0 package includes support for additional features:
- HDFS
- NCCL
- experimental XLA JIT compilation (see [https://www.tensorflow.org/performance/xla/](https://www.tensorflow.org/performance/xla/))

## Uninstalling MLDL Frameworks
The MLDL Framework packages can be uninstalled individually the same way they were installed. To uninstall all MLDL packages and the repo used to install them, run:
$ sudo yum remove powerai-license
$ sudo yum remove mldl-repo-local-esp
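The function below gives a quick way to confirm that the PowerAI TensorFlow build can see the GPUs. This sketch assumes two things: TensorFlow 1.x's `tensorflow.python.client.device_lib.list_local_devices()` is available, and after activation `python` is the Anaconda interpreter. `tf_gpu_count` and the `TF_ACTIVATE`/`PYTHON` overrides are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: activate PowerAI TensorFlow and print how many devices
# of type "GPU" TensorFlow 1.x reports. TF_ACTIVATE / PYTHON are hypothetical
# overrides for the activate script and the interpreter.
tf_gpu_count() {
  local activate="${TF_ACTIVATE:-/opt/DL/tensorflow/bin/tensorflow-activate}"
  local python_cmd="${PYTHON:-python}"
  . "$activate" || return 1
  # One device_type per line (CPU, GPU, ...); grep -c counts the GPU lines.
  "$python_cmd" -c '
from tensorflow.python.client import device_lib
for d in device_lib.list_local_devices():
    print(d.device_type)
' | grep -c '^GPU$'
}
```

A result of 0 points back to the driver / CUDA setup or to the nvidia-smi note further down.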
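● Installing Docker / nvidia-docker
Reference -> http://hwengineer.blogspot.com/2018/02/ppc64le-docker-nvidia-docker-repository.html
1. On Red Hat
# systemctl enable docker.service
# systemctl start docker.service -> enable only takes effect at the next boot, so start the service now
# docker images -> check that Docker is running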
///////////////////////////////////////////////////////////////////////////////
[nvidia-docker below]
-> Update the nvidia-docker repository
# yum update
# yum install nvidia-docker
# nvidia-docker -> check that nvidia-docker is installed
# /usr/bin/nvidia-docker-plugin & -> run the plugin in the background; nvidia-docker needs it
-> To keep it running in the background after a reboot, register it in rc.local:
# vi /etc/rc.d/rc.local -> add the line /usr/bin/nvidia-docker-plugin &
# chmod +x /etc/rc.d/rc.local -> rc.local must be executable
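Registering the plugin in rc.local can be made idempotent, so that rerunning the setup does not add duplicate lines. This sketch uses a hypothetical helper, `add_to_rc_local`. On RHEL 7, rc.local runs at boot only when it is executable, so the function also sets the execute bit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append a command line to rc.local once, then make the
# file executable.  $1 = line to add, $2 = file (default /etc/rc.d/rc.local)
add_to_rc_local() {
  local line="$1"
  local rc="${2:-/etc/rc.d/rc.local}"
  touch "$rc" || return 1
  # -x: whole-line match, -F: fixed string (so "&" and "." are literal).
  grep -qxF -- "$line" "$rc" || printf '%s\n' "$line" >> "$rc"
  chmod +x "$rc"
}
```

Example usage: `add_to_rc_local '/usr/bin/nvidia-docker-plugin &'`.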
● Reference for installing TensorFlow 1.5
http://hwengineer.blogspot.com/2018/04/ac922-redhat-74-python-36-tensorflow.html
● If nvidia-smi reports "unknown" errors on the AC922, check these additional steps and settings
https://hwengineer.blogspot.com/2018/04/ac922-cuda-91.html
● Without Internet access -> DISABLE Red Hat subscription-manager repository management!
Disabling the Subscription-Manager Repository
When a system is registered using Subscription-Manager, the rhsmcertd process creates a special yum repository, redhat.repo. As "Enabling Supplementary and Optional Repositories" describes, as the system adds subscriptions, the product channels are added to the redhat.repo file.
Maintaining a redhat.repo file may not be desirable in some environments. It can interfere with content management operations if that repository is not the one actually used for subscriptions, such as on a disconnected system or a system using a local content mirror.
This default redhat.repo repository can be disabled by editing the Subscription-Manager configuration and setting the manage_repos value to zero (0).
[root@server ~]# subscription-manager config --rhsm.manage_repos=0
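To verify that the setting took effect, read the value back from the Subscription-Manager configuration file. This sketch assumes the file is `/etc/rhsm/rhsm.conf` and that `manage_repos` sits in its `[rhsm]` section, which is the section the `--rhsm.` prefix refers to. `manage_repos_value` is a hypothetical helper. `subscription-manager config --list` shows the same information interactively.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the manage_repos value from the [rhsm] section
# of an rhsm.conf file (default /etc/rhsm/rhsm.conf).
manage_repos_value() {
  local conf="${1:-/etc/rhsm/rhsm.conf}"
  awk '
    /^\[/ { in_rhsm = ($0 ~ /^\[rhsm\][[:space:]]*$/) }   # track current section
    in_rhsm && /^[[:space:]]*manage_repos[[:space:]]*=/ {
      v = $0
      sub(/^[^=]*=[[:space:]]*/, "", v)                   # drop "manage_repos = "
      sub(/[[:space:]]+$/, "", v)
      print v
      exit
    }' "$conf"
}
```

After `subscription-manager config --rhsm.manage_repos=0`, `manage_repos_value` is expected to print `0`.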
///////////// Test notes below ///////////////
newwell installation notes
- Extra Packages for Enterprise Linux (EPEL)
-> register the extra-package repository
Install the repository package for Red Hat 7
# wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# rpm -ivh epel-release-latest-7.noarch.rpm
# yum update
# yum repolist
- cuda 9.1
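# rpm -ivh cuda-repo-rhel7-9-1-local-*.ppc64le.rpm -> use the RHEL 7 / ppc64le local-repo rpm for CUDA 9.1; an Ubuntu .deb such as cuda-repo-ubuntu1604-9-0-local_9.0.176-1_ppc64el.deb (which is also CUDA 9.0) cannot be installed with rpm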
# yum install cuda
- cudnn 7.0 for cuda 9.1
# mv <downloaded cuDNN file> cudnn-9.1-linux-ppc64le-v7.solitairetheme8 -> rename the archive downloaded from developer.nvidia.com; despite the .solitairetheme8 extension it is a tar archive that tar -xf can unpack
# tar -xf cudnn-9.1-linux-ppc64le-v7.solitairetheme8
# cd cuda/targets/ppc64le-linux
# cp -rp include/ /usr/local/cuda-9.1/targets/ppc64le-linux/
# cp -rp lib/ /usr/local/cuda-9.1/targets/ppc64le-linux/
- Install Bazel 0.9.0
# unzip bazel-0.9.0-dist.zip
# yum install java-1.8.0-openjdk-devel -> JDK 8 is needed to build Bazel
# ./compile.sh
# cp -p output/bazel /usr/local/bin/
- Install protobuf
# yum install autoconf automake libtool
# git clone https://github.com/google/protobuf
- Tensorflow 1.4
# yum install -y git patch python-pip python-wheel numpy
# git clone --recurse-submodules https://github.com/tensorflow/tensorflow
# cd tensorflow
# git checkout r1.4 -> the TensorFlow 1.4 release branch (master builds the latest development code, not 1.4)
# ./configure
# bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
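The notes stop at `bazel build`. In the TensorFlow source-build procedure, that build produces a `build_pip_package` script that writes a wheel, and the wheel is then installed with pip. The sketch below covers those two follow-up steps. It assumes the default `bazel-bin` output path and a wheel file name starting with `tensorflow`. `package_tf_wheel` is a hypothetical helper, and `/tmp/tensorflow_pkg` is only an example output directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run build_pip_package from a TensorFlow source tree and
# print the path of the newest wheel it wrote.
#   $1 = TensorFlow source directory, $2 = wheel output directory
package_tf_wheel() {
  local src="${1:-.}"
  local out="${2:-/tmp/tensorflow_pkg}"
  "$src/bazel-bin/tensorflow/tools/pip_package/build_pip_package" "$out" || return 1
  # ls -t sorts by modification time, newest first.
  ls -t "$out"/tensorflow*.whl 2>/dev/null | head -n1
}
```

Example usage: `whl=$(package_tf_wheel ~/tensorflow /tmp/tensorflow_pkg) && sudo pip install "$whl"`.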