opafindgood(8)        Master map: IFSFFCLIRG (Man Page)        opafindgood(8)

opafindgood

Checks for hosts that can be pinged, can be accessed via SSH, and are
active on the Intel(R) Omni-Path Fabric. Produces a list of good hosts
that meet all three criteria. Typically used to identify good hosts for
further testing and benchmarking during initial cluster staging and
startup.

The resulting good file lists each good host exactly once and can be
used as input to create mpi_hosts files for running mpi_apps and the
HFI-SW cable test. The files alive, running, active, good, and bad are
created in the selected directory; each lists the hosts in that
category.

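As a quick check after a run, the result files can be counted from the shell. The following is only a sketch: report_host_files is a hypothetical helper, not part of the FastFabric tools, and it assumes each file holds one host per line in the directory given with -d (/etc/opa if none is given).

```shell
# Hypothetical helper: count the hosts recorded in each result file.
# Assumption: one host per line; blank lines are ignored.
report_host_files() {
    # $1: directory holding the result files (assumed default: /etc/opa)
    dir=${1:-/etc/opa}
    for f in alive running active good bad; do
        if [ -f "$dir/$f" ]; then
            # grep -c exits non-zero on zero matches, so tolerate that
            n=$(grep -c . "$dir/$f" || true)
            printf '%s: %s\n' "$f" "$n"
        else
            printf '%s: missing\n' "$f"
        fi
    done
}
```

For example, report_host_files /etc/opa prints one line per file; a file skipped by -R or -A would simply show as missing in this sketch.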
This command assumes the Node Description for each host is based on the
hostname -s output in conjunction with an optional hfi1_# suffix. When
using a /etc/opa/hosts file that lists the hostnames, this assumption
may not be correct.

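To see what Node Description this assumption implies for a host, the string can be built by hand. This is a minimal sketch: expected_node_desc is a hypothetical helper, and the "host hfi1_N" layout is taken from the sample punchlist entry on this page rather than from the tool's source.

```shell
# Hypothetical helper: build the Node Description opafindgood is
# assumed to expect for a host.
expected_node_desc() {
    # $1: short hostname, as printed by `hostname -s`
    # $2: optional HFI index; when given, the hfi1_# suffix is appended
    if [ -n "${2:-}" ]; then
        printf '%s hfi1_%s\n' "$1" "$2"
    else
        printf '%s\n' "$1"
    fi
}
```

For example, expected_node_desc "$(hostname -s)" 0 prints the local short hostname followed by hfi1_0. If the names in /etc/opa/hosts differ from that short hostname, the active test may not match them.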
This command automatically generates the file
FF_RESULT_DIR/punchlist.csv, which provides a concise summary of the bad
hosts found. The file can be imported directly into Excel as a *.csv
file. Alternatively, it can be cut and pasted into Excel, and the
Data/Text to Columns toolbar can be used to separate the information
into multiple columns at the semicolons.

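Because the fields are separated by semicolons, the file can also be summarized without a spreadsheet. The sketch below assumes the three-field layout seen in the sample on this page (timestamp;item;reason); summarize_punchlist is a hypothetical helper name.

```shell
# Hypothetical helper: count punchlist entries per failure reason.
# Assumption: each record is "timestamp;item;reason" on a single line.
summarize_punchlist() {
    # $1: path to a punchlist.csv file
    awk -F';' 'NF >= 3 { count[$3]++ }
               END { for (r in count) printf "%s;%d\n", r, count[r] }' "$1" |
        LC_ALL=C sort
}
```

For example, summarize_punchlist "$FF_RESULT_DIR/punchlist.csv" prints one "reason;count" line per distinct reason. Lines with fewer than three fields are skipped.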
A sample generated output is:

    # opafindgood
    3 hosts will be checked
    2 hosts are pingable (alive)
    2 hosts are ssh'able (running)
    2 total hosts have FIs active on one or more fabrics (active)
    No Quarantine Node Records Returned
    1 hosts are alive, running, active (good)
    2 hosts are bad (bad)
    Bad hosts have been added to /root/punchlist.csv
    # cat /root/punchlist.csv
    2015/10/04 11:33:22;phs1fnivd13u07n1 hfi1_0 p1 phs1swivd13u06 p16;Link errors
    2015/10/07 10:21:05;phs1swivd13u06;Switch not found in SA DB
    2015/10/09 14:36:48;phs1fnivd13u07n4;Doesn't ping
    2015/10/09 14:36:48;phs1fnivd13u07n3;No active port

Each run generates one line for each failing host, and each host is
reported exactly once per run. A host that does not ping is therefore
NOT also listed as can't ssh or No active port. Ports can be active for
hosts that do not ping, especially if Ethernet host names are used for
the ping test. However, a failed ping often implies other fundamental
issues, such as PXE boot problems or an inability to access DNS or DHCP
to get the proper host name and IP address. Reporting further failures
for hosts that do not ping is therefore typically of limited value.

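The report-once rule amounts to recording only the first failed check for each host. The sketch below illustrates that idea and is not the tool's actual logic: first_failure is a hypothetical helper, the order ping, SSH, active port is an assumption based on the order of the sample output, and the reason strings are taken from this page.

```shell
# Hypothetical helper: report only the first failing criterion.
# Assumed check order: ping, then SSH, then active port.
first_failure() {
    # $1, $2, $3: "yes" or "no" for pingable, ssh'able, active port
    if [ "$1" != yes ]; then
        echo "Doesn't ping"
    elif [ "$2" != yes ]; then
        echo "Can't ssh"
    elif [ "$3" != yes ]; then
        echo "No active port"
    else
        echo "good"
    fi
}
```

For example, first_failure no no yes prints only Doesn't ping, even though the SSH check failed as well.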
Note that opafindgood queries the SA for NodeDescriptions to determine
hosts with active ports. As such, ports may be active for hosts that
cannot be accessed via SSH or pinged.

By default, opafindgood checks for and reports nodes that are
quarantined for security reasons. To skip this, use the -Q option.

opafindgood [-R|-A|-Q] [-d dir] [-f hostfile] [-h 'hosts']
            [-t portsfile] [-p ports] [-T timelimit]

--help      Produces full help text.

-R          Skips the running test (SSH). Recommended if password-less
            SSH is not set up.

-A          Skips the active test. Recommended if Intel(R) Omni-Path
            Fabric software or fabric is not up.

-Q          Skips the quarantine test. Recommended if Intel(R) Omni-Path
            Fabric software or fabric is not up.

-d dir      Specifies the directory in which to create alive, active,
            running, good, and bad files. Default is /etc/opa directory.

-f hostfile
            Specifies the file with hosts in cluster. Default is
            /etc/opa/hosts file.

-h hosts    Specifies the list of hosts to ping.

-t portsfile
            Specifies the file with list of local HFI ports used to
            access fabric(s) for analysis. Default is /etc/opa/ports
            file.

-p ports    Specifies the list of local HFI ports used to access
            fabric(s) for analysis. Default is first active port. The
            first HFI in the system is 1. The first port on an HFI is
            1. Uses the format hfi:port, for example:

            0:0    First active port in system.
            0:y    Port y within system.
            x:0    First active port on HFI x.
            x:y    HFI x, port y.

-T timelimit
            Specifies the time limit in seconds for host to respond to
            SSH. Default = 20 seconds.

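The hfi:port forms accepted by -p can be decoded mechanically, following the table given for that option. The following is a sketch only: describe_port_spec is a hypothetical helper, and the rule that both fields must be plain digits is an assumption, not documented behavior.

```shell
# Hypothetical helper: explain a single hfi:port specifier.
# Assumption: both fields must be non-empty strings of digits.
describe_port_spec() {
    spec=$1
    case $spec in
        *:*) ;;
        *) echo "invalid: $spec"; return 1 ;;
    esac
    hfi=${spec%%:*}     # text before the first colon
    port=${spec#*:}     # text after the first colon
    case $hfi in ''|*[!0-9]*) echo "invalid: $spec"; return 1 ;; esac
    case $port in ''|*[!0-9]*) echo "invalid: $spec"; return 1 ;; esac
    if [ "$hfi" -eq 0 ] && [ "$port" -eq 0 ]; then
        echo "first active port in system"
    elif [ "$hfi" -eq 0 ]; then
        echo "port $port within system"
    elif [ "$port" -eq 0 ]; then
        echo "first active port on HFI $hfi"
    else
        echo "HFI $hfi, port $port"
    fi
}
```

For example, a list such as '1:1 1:2 2:1 2:2' can be checked one entry at a time with: for s in 1:1 1:2 2:1 2:2; do describe_port_spec "$s"; done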
The following environment variables are also used by this command:

HOSTS       List of hosts, used if -h option not supplied.

HOSTS_FILE
            File containing list of hosts, used in absence of -f and -h.

PORTS       List of ports, used in absence of -t and -p.

PORTS_FILE
            File containing list of ports, used in absence of -t and -p.

FF_MAX_PARALLEL
            Maximum concurrent operations.

opafindgood
opafindgood -f allhosts
opafindgood -h 'arwen elrond'
HOSTS='arwen elrond' opafindgood
HOSTS_FILE=allhosts opafindgood
opafindgood -p '1:1 1:2 2:1 2:2'

Copyright(C) 2015-2018 Intel Corporation                  opafindgood(8)