RESPERF(1)                                                          RESPERF(1)

NAME
       resperf - test the resolution performance of a caching DNS server

SYNOPSIS
       resperf [ -d datafile ] [ -s server_addr ] [ -p port ] [ -b bufsize ]
               [ -f family ] [ -e ] [ -D ] [ -y name:secret ] [ -A ] [ -h ]
               [ -i interval ] [ -m max_qps ] [ -P plot_data_file ]
               [ -r rampup_time ] [ -L max_loss ]

DESCRIPTION
       resperf is a companion tool to dnsperf. dnsperf was primarily designed
       for benchmarking authoritative servers, and it does not work well with
       caching servers that are talking to the live Internet. One reason for
       this is that dnsperf uses a "self-pacing" approach, which is based on
       the assumption that you can keep the server 100% busy simply by
       sending it a small burst of back-to-back queries to fill up network
       buffers, and then sending a new query whenever you get a response
       back. This approach works well for authoritative servers that process
       queries in order, one at a time; it also works reasonably well for a
       caching server in a closed laboratory environment talking to a
       simulated Internet that's all on the same LAN. Unfortunately, it does
       not work well with a caching server talking to the actual Internet,
       which may need to work on thousands of queries in parallel to achieve
       its maximum throughput. There have been numerous attempts to use
       dnsperf (or its predecessor, queryperf) for benchmarking live caching
       servers, usually with poor results. Therefore, a separate tool
       designed specifically for caching servers is needed.

HOW RESPERF WORKS
       Unlike the "self-pacing" approach of dnsperf, resperf works by sending
       DNS queries at a controlled, steadily increasing rate. By default,
       resperf will send traffic for 60 seconds, linearly increasing the
       amount of traffic from zero to 100,000 queries per second.

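       Because the ramp is linear, the scheduled query rate at any instant is
       simply the maximum rate scaled by elapsed time. A minimal sketch of
       that arithmetic, using the documented defaults (100,000 qps maximum,
       60-second ramp):

       ```shell
       # Scheduled query rate t seconds into the ramp: max_qps * t / rampup_time.
       # With the defaults (-m 100000, -r 60), 15 seconds into the test:
       awk -v t=15 -v max_qps=100000 -v ramp=60 'BEGIN { print max_qps * t / ramp }'
       # prints 25000
       ```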
       During the test, resperf listens for responses from the server and
       keeps track of response rates, failure rates, and latencies. It will
       also continue listening for responses for an additional 40 seconds
       after it has stopped sending traffic, so that there is time for the
       server to respond to the last queries sent. This time period was
       chosen to be longer than the overall query timeout of both Nominum
       CNS and current versions of BIND.

       If the test is successful, the query rate will at some point exceed
       the capacity of the server and queries will be dropped, causing the
       response rate to stop growing or even decrease as the query rate
       increases.

       The result of the test is a set of measurements of the query rate,
       response rate, failure response rate, and average query latency as
       functions of time. These are written to a file in a tabular format
       for plotting using gnuplot or a similar plotting tool. The server's
       maximum throughput can be determined from the plot either as the
       highest response rate on the plot, or alternatively as the response
       rate at the point where a significant number of queries begin to be
       dropped.

WHAT YOU WILL NEED
       Benchmarking a live caching server is serious business. A fast
       caching server like Nominum CNS running on an Opteron server,
       resolving a mix of cacheable and non-cacheable queries typical of ISP
       customer traffic, is capable of resolving more than 50,000 queries
       per second. In the process, it will send more than 20,000 queries per
       second to authoritative servers on the Internet, and receive
       responses to most of them. Assuming an average request size of 50
       bytes and a response size of 100 bytes, this amounts to some 8 Mbps
       of outgoing and 16 Mbps of incoming traffic. If your Internet
       connection can't handle the bandwidth, you will end up measuring the
       speed of the connection, not the server, and may saturate the
       connection, causing a degradation in service for other users.

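       The bandwidth figures above follow directly from the stated
       assumptions; a quick sketch of the arithmetic:

       ```shell
       # 20,000 upstream queries/s, ~50-byte requests and ~100-byte responses:
       awk 'BEGIN {
           qps = 20000
           printf "%g Mbps outgoing\n", qps * 50  * 8 / 1e6   # requests
           printf "%g Mbps incoming\n", qps * 100 * 8 / 1e6   # responses
       }'
       # prints:
       # 8 Mbps outgoing
       # 16 Mbps incoming
       ```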
       Make sure there is no stateful firewall between the server and the
       Internet, because most of them can't handle the amount of UDP traffic
       the test will generate and will end up dropping packets, skewing the
       test results. Some will even lock up or crash.

       You should run resperf on a machine separate from the server under
       test, on the same LAN. Preferably, this should be a Gigabit Ethernet
       network. The machine running resperf should be at least as fast as
       the machine being tested; otherwise, it may end up being the
       bottleneck.

       There should be no other applications running on the machine running
       resperf. Performance testing at the traffic levels involved is
       essentially a hard real-time application; consider that at a query
       rate of 100,000 queries per second, if resperf gets delayed by just
       1/100 of a second, 1000 incoming UDP packets will arrive in the
       meantime, which is more than most operating systems will buffer.

       Because the granularity of the timers provided by operating systems
       is typically too coarse to accurately schedule packet transmissions
       at sub-millisecond intervals, resperf will busy-wait between packet
       transmissions, constantly polling for responses. Therefore, it is
       normal for resperf to consume 100% CPU during the whole test run,
       even during periods where query rates are relatively low.

       You will also need a set of test queries in the dnsperf file format.
       See the dnsperf man page for instructions on how to construct this
       query file. To make the test as realistic as possible, the queries
       should be derived from recorded production client DNS traffic,
       without removing duplicate queries or applying any other filtering.
       With the default settings, resperf will use up to 3 million queries
       in each test run.

       If the caching server to be tested has a configurable limit on the
       number of simultaneous resolutions, like the "max-recursive-clients"
       statement in Nominum CNS or the "recursive-clients" option in BIND 9,
       you will probably have to increase it. As a starting point, we
       recommend a value of 10000 for Nominum CNS and 100000 for BIND 9.
       Should the limit be reached, it will show up in the plots as an
       increase in the number of failure responses.

       For maximum realism, you could "prime" the cache of the server by
       having it resolve typical query traffic for some period of time
       before the test, so that the cache is not empty when the test starts.
       However, experience has shown that this is not really necessary, as
       the server will reach a reasonable cache hit rate (70% or more)
       during the test even when starting with an empty cache. If you do
       prime the cache, make sure not to use the same set of queries as in
       the actual test, since that would make the server answer almost all
       queries from the cache and yield inflated performance numbers.

RUNNING THE TEST
       When running resperf, you will need to specify at least the server IP
       address and the query data file. A typical invocation will look like

              resperf -s 10.0.0.2 -d queryfile

       With default settings, the test run will take at most 100 seconds (60
       seconds of ramping up traffic and then 40 seconds of waiting for
       responses), but in practice, the 60-second traffic phase will usually
       be cut short. To be precise, resperf can transition from the
       traffic-sending phase to the waiting-for-responses phase in three
       different ways:

       ·  Running for the full allotted time and successfully reaching the
          maximum query rate (by default, 60 seconds and 100,000 qps,
          respectively). Since this is a very high query rate, this will
          rarely happen (with today's hardware); one of the other two
          conditions listed below will usually occur first.

       ·  Exceeding 65,536 outstanding queries. This often happens as a
          result of (successfully) exceeding the capacity of the server
          being tested, causing the excess queries to be dropped. The limit
          of 65,536 queries comes from the number of possible values for the
          ID field in the DNS packet. Resperf needs to allocate a unique ID
          for each outstanding query, and is therefore unable to send
          further queries if the set of possible IDs is exhausted.

       ·  Finding itself unable to send queries fast enough. Resperf will
          notice if it is falling behind in its scheduled query
          transmissions, and if this backlog reaches 1000 queries, it will
          print a message like "Fell behind by 1000 queries" (or whatever
          the actual number is at the time) and stop sending traffic.

       Regardless of which of the above conditions caused the
       traffic-sending phase of the test to end, you should examine the
       resulting plots to make sure the server's response rate is flattening
       out toward the end of the test. If it is not, then you are not
       loading the server enough. If you are getting the "Fell behind"
       message, make sure that the machine running resperf is fast enough
       and has no other applications running.

       You should also monitor the CPU usage of the server under test. It
       should reach close to 100% CPU at the point of maximum traffic; if it
       does not, you most likely have a bottleneck in some other part of
       your test setup, for example, your external Internet connection.

       As resperf runs, some status messages and summary statistics will be
       written to standard output, and the table of plot data is written to
       the file resperf.gnuplot in the current directory (or some other file
       name given with the -P command line option).

THE PLOT DATA FILE
       For purposes of generating the plot data file, the test run is
       divided into time intervals of 0.5 seconds (or some other length of
       time specified with the -i command line option). Each line in the
       plot data file corresponds to one such interval, and contains the
       following values as floating-point numbers:

       Time   The midpoint of this time interval, in seconds since the
              beginning of the run

       Target queries per second
              The number of queries per second scheduled to be sent in this
              time interval

       Actual queries per second
              The number of queries per second actually sent in this time
              interval

       Responses per second
              The number of responses received corresponding to queries sent
              in this time interval, divided by the length of the interval

       Failures per second
              The number of responses received corresponding to queries sent
              in this time interval and having an RCODE other than NOERROR
              or NXDOMAIN, divided by the length of the interval

       Average latency
              The average time between sending the query and receiving a
              response, for queries sent in this time interval

       Note that the measurements for any given query are always applied to
       the time interval when the query was sent, not the one when the
       response (if any) was received. This makes it easy to compare the
       query and response rates; for example, if no queries are dropped, the
       query and response curves will be identical. As another example, if
       the plot shows 10% failure responses at t=5 seconds, this means that
       10% of the queries sent at t=5 seconds eventually failed, not that
       10% of the responses received at t=5 seconds were failures.

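       If you are not using resperf-report (described below), the plot data
       file can also be plotted by hand. A minimal sketch, assuming the
       default output file name and the column layout listed above (the
       script itself is an illustration, not something shipped with
       resperf):

       ```shell
       # Write a minimal gnuplot script for the default resperf.gnuplot layout:
       # column 1 = time, 3 = actual qps, 4 = responses/s, 5 = failures/s.
       cat > plot.gp <<'EOF'
       set xlabel "Time (seconds)"
       set ylabel "Rate (per second)"
       plot "resperf.gnuplot" using 1:3 title "Queries sent" with lines, \
            "resperf.gnuplot" using 1:4 title "Responses"    with lines, \
            "resperf.gnuplot" using 1:5 title "Failures"     with lines
       EOF
       # Then render it with: gnuplot -persist plot.gp
       ```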
PLOTTING THE RESULTS
       Resperf comes with a shell script, resperf-report, which will run
       resperf with its output redirected to a file and then automatically
       generate an illustrated report in HTML format. Command line arguments
       given to resperf-report will be passed on unchanged to resperf.

       You need to have gnuplot installed, because resperf-report uses it to
       generate the plots. Make sure your version of gnuplot supports the
       "gif" terminal driver.

       The report will be stored with a unique file name based on the
       current date and time, e.g., 20060812-1550.html. The GIF images of
       the plots and other auxiliary files will be stored in separate files
       beginning with the same date-time string. If you need to copy the
       report to a separate machine for viewing, make sure to copy the .gif
       files along with the .html file (or simply copy all the files, e.g.,
       using scp 20060812-1550.* host:directory/).

       For example, to benchmark a server running on 10.0.0.2, you could run

              resperf-report -s 10.0.0.2 -d queryfile

       and then open the resulting .html file in a web browser.

INTERPRETING THE RESULTS
       The summary statistics printed on standard output at the end of the
       test include the server's measured maximum throughput. By default,
       this is simply the highest point on the response rate plot, without
       regard to the number of queries being dropped or failing at that
       point.

       You can also make resperf report the throughput at the point in the
       test where the percentage of queries dropped exceeds a given limit
       (or the maximum as above if the limit is never exceeded). This can be
       a more realistic indication of how much the server can be loaded
       while still providing an acceptable level of service. This is done
       using the -L command line option; for example, specifying -L 10 makes
       resperf report the highest throughput reached before the server
       starts dropping more than 10% of the queries.

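       A similar cutoff can be computed by hand from the plot data file.
       A sketch under the column layout described in "THE PLOT DATA FILE";
       the sample numbers are made up, and computing loss as the gap between
       queries sent and responses received is an illustration of the idea,
       not necessarily resperf's exact internal calculation:

       ```shell
       # Hypothetical plot data: time, target qps, actual qps, responses/s,
       # failures/s, average latency.
       cat > sample.gnuplot <<'EOF'
       0.25  833.3  833.3  830.0  1.2 0.013
       0.75 2500.0 2500.0 2480.0  3.5 0.015
       1.25 4166.7 4160.0 3700.0 10.0 0.021
       1.75 5833.3 5800.0 3600.0 60.0 0.040
       EOF

       # Highest response rate seen before loss, (sent - answered) / sent,
       # exceeds 10% (the equivalent of -L 10):
       awk '{ loss = ($3 > 0) ? 100 * ($3 - $4) / $3 : 0
              if (loss > 10) exit
              if ($4 > max) max = $4 }
            END { print max }' sample.gnuplot
       # prints 2480
       ```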
       There is no corresponding way of automatically constraining results
       based on the number of failed queries, because unlike dropped
       queries, resolution failures will occur even when the server is not
       overloaded, and the number of such failures is heavily dependent on
       the query data and network conditions. Therefore, the plots should be
       manually inspected to ensure that there is not an abnormal number of
       failures.

OPTIONS
       -d datafile
              Specifies the input data file. If not specified, resperf will
              read from standard input.

       -s server_addr
              Specifies the name or address of the server to which requests
              will be sent. The default is the loopback address, 127.0.0.1.

       -p port
              Sets the port on which the DNS packets are sent. If not
              specified, the standard DNS port (53) is used.

       -b bufsize
              Sets the size of the socket's send and receive buffers, in
              kilobytes. If not specified, the default value is 32
              kilobytes.

       -f family
              Specifies the address family used for sending DNS packets. The
              possible values are "inet", "inet6", or "any". If "any" (the
              default value) is specified, resperf will use whichever
              address family is appropriate for the server it is sending
              packets to.

       -e     Enables EDNS0 [RFC2671] by adding an OPT record to all packets
              sent.

       -D     Sets the DO (DNSSEC OK) bit [RFC3225] in all packets sent.
              This also enables EDNS0, which is required for DNSSEC.

       -y name:secret
              Adds a TSIG record [RFC2845] to all packets sent, using the
              specified TSIG key name and secret, where the secret is
              expressed as a base-64 encoded string.

       -A     Reports the command line arguments passed to resperf to
              standard output.

       -h     Prints a usage statement and exits.

       -i interval
              Specifies the time interval between data points in the plot
              file. The default is 0.5 seconds.

       -m max_qps
              Specifies the target maximum query rate (in queries per
              second). This should be higher than the expected maximum
              throughput of the server being tested. Traffic will be ramped
              up linearly until this rate is reached, or until one of the
              other conditions described in the section "Running the test"
              occurs. The default is 100000 queries per second.

       -P plot_data_file
              Specifies the name of the plot data file. The default is
              resperf.gnuplot.

       -r rampup_time
              Specifies the length of time over which traffic will be ramped
              up. The default is 60 seconds.

       -L max_loss
              Specifies the maximum acceptable query loss percentage for
              purposes of determining the maximum throughput value. The
              default is 100%, meaning that resperf will measure the maximum
              throughput without regard to query loss.

AUTHOR
       Nominum, Inc.

SEE ALSO
       dnsperf(1)

resperf                        December 4, 2007                     RESPERF(1)