resperf(1)                  General Commands Manual                 resperf(1)

NAME
resperf - test the resolution performance of a caching DNS server

SYNOPSIS
resperf-report [-a local_addr] [-d datafile] [-s server_addr] [-p port]
[-x local_port] [-t timeout] [-b bufsize] [-f family] [-e] [-D]
[-y [alg:]name:secret] [-h] [-i interval] [-m max_qps] [-r rampup_time]
[-c constant_traffic_time] [-L max_loss] [-C clients]
[-q max_outstanding]

resperf [-a local_addr] [-d datafile] [-s server_addr] [-p port]
[-x local_port] [-t timeout] [-b bufsize] [-f family] [-e] [-D]
[-y [alg:]name:secret] [-h] [-i interval] [-m max_qps]
[-P plot_data_file] [-r rampup_time] [-c constant_traffic_time]
[-L max_loss] [-C clients] [-q max_outstanding]

DESCRIPTION
resperf is a companion tool to dnsperf. dnsperf was primarily designed
for benchmarking authoritative servers, and it does not work well with
caching servers that are talking to the live Internet. One reason for
this is that dnsperf uses a "self-pacing" approach, which is based on
the assumption that you can keep the server 100% busy simply by sending
it a small burst of back-to-back queries to fill up network buffers,
and then send a new query whenever you get a response back. This
approach works well for authoritative servers that process queries in
order and one at a time; it also works pretty well for a caching server
in a closed laboratory environment talking to a simulated Internet
that's all on the same LAN. Unfortunately, it does not work well with a
caching server talking to the actual Internet, which may need to work
on thousands of queries in parallel to achieve its maximum throughput.
There have been numerous attempts to use dnsperf (or its predecessor,
queryperf) for benchmarking live caching servers, usually with poor
results. Therefore, a separate tool designed specifically for caching
servers is needed.

How resperf works
Unlike the "self-pacing" approach of dnsperf, resperf works by sending
DNS queries at a controlled, steadily increasing rate. By default,
resperf will send traffic for 60 seconds, linearly increasing the
amount of traffic from zero to 100,000 queries per second.

During the test, resperf listens for responses from the server and
keeps track of response rates, failure rates, and latencies. It will
also continue listening for responses for an additional 40 seconds
after it has stopped sending traffic, so that there is time for the
server to respond to the last queries sent. This time period was chosen
to be longer than the overall query timeout of both Nominum CacheServe
and current versions of BIND.

If the test is successful, the query rate will at some point exceed the
capacity of the server and queries will be dropped, causing the
response rate to stop growing or even decrease as the query rate
increases.

The result of the test is a set of measurements of the query rate,
response rate, failure response rate, and average query latency as
functions of time.

What you will need
Benchmarking a live caching server is serious business. A fast caching
server like Nominum CacheServe, resolving a mix of cacheable and
non-cacheable queries typical of ISP customer traffic, is capable of
resolving well over 1,000,000 queries per second. In the process, it
will send more than 40,000 queries per second to authoritative servers
on the Internet, and receive responses to most of them. Assuming an
average request size of 50 bytes and a response size of 150 bytes, this
amounts to some 16 Mbit/s of outgoing and 48 Mbit/s of incoming
traffic. If your Internet connection can't handle the bandwidth, you
will end up measuring the speed of the connection, not the server, and
may saturate the connection, causing a degradation in service for other
users.

Make sure there is no stateful firewall between the server and the
Internet, because most of them can't handle the amount of UDP traffic
the test will generate and will end up dropping packets, skewing the
test results. Some will even lock up or crash.

You should run resperf on a machine separate from the server under
test, on the same LAN. Preferably, this should be a Gigabit Ethernet
network. The machine running resperf should be at least as fast as the
machine being tested; otherwise, it may end up being the bottleneck.

There should be no other applications running on the machine running
resperf. Performance testing at the traffic levels involved is
essentially a hard real-time application - consider the fact that at a
query rate of 100,000 queries per second, if resperf gets delayed by
just 1/100 of a second, 1000 incoming UDP packets will arrive in the
meantime. This is more than most operating systems will buffer, which
means packets will be dropped.

Because the granularity of the timers provided by operating systems is
typically too coarse to accurately schedule packet transmissions at
sub-millisecond intervals, resperf will busy-wait between packet
transmissions, constantly polling for responses in the meantime.
Therefore, it is normal for resperf to consume 100% CPU during the
whole test run, even during periods where query rates are relatively
low.
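
The busy-wait approach can be sketched roughly like this (a simplified
illustration with assumed helper callbacks, not resperf's C source):
instead of sleeping until the next scheduled transmission, the loop
spins, polling the socket with a zero timeout so responses are
processed as soon as they arrive.

```python
import select
import time

def pacing_loop(sock, next_send_time, send_one, handle_response):
    """Busy-wait between scheduled transmissions, draining responses.
    next_send_time() returns the next scheduled send time on the
    time.monotonic() clock, or None when the schedule is exhausted."""
    while True:
        deadline = next_send_time()
        if deadline is None:
            break
        while time.monotonic() < deadline:
            # Zero-timeout poll: returns immediately, so the loop
            # spins (100% CPU) rather than relying on coarse OS timers.
            readable, _, _ = select.select([sock], [], [], 0)
            if readable:
                handle_response(sock.recv(4096))
        send_one(sock)
```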

You will also need a set of test queries in the dnsperf file format.
See the dnsperf man page for instructions on how to construct this
query file. To make the test as realistic as possible, the queries
should be derived from recorded production client DNS traffic, without
removing duplicate queries or other filtering. With the default
settings, resperf will use up to 3 million queries in each test run.

If the caching server to be tested has a configurable limit on the
number of simultaneous resolutions, like the max-recursive-clients
statement in Nominum CacheServe or the recursive-clients option in
BIND 9, you will probably have to increase it. As a starting point, we
recommend a value of 10000 for Nominum CacheServe and 100000 for BIND
9. Should the limit be reached, it will show up in the plots as an
increase in the number of failure responses.

The server being tested should be restarted at the beginning of each
test to make sure it is starting with an empty cache. If the cache
already contains data from a previous test run that used the same set
of queries, almost all queries will be answered from the cache,
yielding inflated performance numbers.

To use the resperf-report script, you need to have gnuplot installed.
Make sure your installed version of gnuplot supports the png terminal
driver. If your gnuplot doesn't support png but does support gif, you
can change the line saying terminal=png in the resperf-report script to
terminal=gif.

Running the test
Resperf is typically invoked via the resperf-report script, which will
run resperf with its output redirected to a file and then automatically
generate an illustrated report in HTML format. Command line arguments
given to resperf-report will be passed on unchanged to resperf.

When running resperf-report, you will need to specify at least the
server IP address and the query data file. A typical invocation will
look like

  resperf-report -s 10.0.0.2 -d queryfile

With default settings, the test run will take at most 100 seconds (60
seconds of ramping up traffic and then 40 seconds of waiting for
responses), but in practice, the 60-second traffic phase will usually
be cut short. To be precise, resperf can transition from the
traffic-sending phase to the waiting-for-responses phase in three
different ways:

· Running for the full allotted time and successfully reaching the
  maximum query rate (by default, 60 seconds and 100,000 qps,
  respectively). Since this is a very high query rate, this will
  rarely happen (with today's hardware); one of the other two
  conditions listed below will usually occur first.

· Exceeding 65,536 outstanding queries. This often happens as a result
  of (successfully) exceeding the capacity of the server being tested,
  causing the excess queries to be dropped. The limit of 65,536
  queries comes from the number of possible values for the ID field in
  the DNS packet. Resperf needs to allocate a unique ID for each
  outstanding query, and is therefore unable to send further queries
  if the set of possible IDs is exhausted.

· When resperf finds itself unable to send queries fast enough.
  Resperf will notice if it is falling behind in its scheduled query
  transmissions, and if this backlog reaches 1000 queries, it will
  print a message like "Fell behind by 1000 queries" (or whatever the
  actual number is at the time) and stop sending traffic.

Regardless of which of the above conditions caused the traffic-sending
phase of the test to end, you should examine the resulting plots to
make sure the server's response rate is flattening out toward the end
of the test. If it is not, then you are not loading the server enough.
If you are getting the "Fell behind" message, make sure that the
machine running resperf is fast enough and has no other applications
running.

You should also monitor the CPU usage of the server under test. It
should reach close to 100% CPU at the point of maximum traffic; if it
does not, you most likely have a bottleneck in some other part of your
test setup, for example, your external Internet connection.

The report generated by resperf-report will be stored with a unique
file name based on the current date and time, e.g., 20060812-1550.html.
The PNG images of the plots and other auxiliary files will be stored in
separate files beginning with the same date-time string. To view the
report, simply open the .html file in a web browser.

If you need to copy the report to a separate machine for viewing, make
sure to copy the .png files along with the .html file (or simply copy
all the files, e.g., using scp 20060812-1550.* host:directory/).

Interpreting the report
The .html file produced by resperf-report consists of two sections. The
first section, "Resperf output", contains output from the resperf
program such as progress messages, a summary of the command line
arguments, and summary statistics. The second section, "Plots",
contains two plots generated by gnuplot: "Query/response/failure rate"
and "Latency".

The "Query/response/failure rate" plot contains three graphs. The
"Queries sent per second" graph shows the amount of traffic being sent
to the server; this should be very close to a straight diagonal line,
reflecting the linear ramp-up of traffic.

The "Total responses received per second" graph shows how many of the
queries received a response from the server. All responses are counted,
whether successful (NOERROR or NXDOMAIN) or not (e.g., SERVFAIL).

The "Failure responses received per second" graph shows how many of the
queries received a failure response. A response is considered to be a
failure if its RCODE is neither NOERROR nor NXDOMAIN.

By visually inspecting the graphs, you can get an idea of how the
server behaves under increasing load. The "Total responses received per
second" graph will initially closely follow the "Queries sent per
second" graph (often rendering it invisible in the plot as the two
graphs are plotted on top of one another), but when the load exceeds
the server's capacity, the "Total responses received per second" graph
may diverge from the "Queries sent per second" graph and flatten out,
indicating that some of the queries are being dropped.

The "Failure responses received per second" graph will normally show a
roughly linear ramp close to the bottom of the plot with some random
fluctuation, since typical query traffic will contain some small
percentage of failing queries randomly interspersed with the successful
ones. As the total traffic increases, the number of failures will
increase proportionally.

If the "Failure responses received per second" graph turns sharply
upwards, this can be another indication that the load has exceeded the
server's capacity. This will happen if the server reacts to overload by
sending SERVFAIL responses rather than by dropping queries. Since
Nominum CacheServe and BIND 9 will both respond with SERVFAIL when they
exceed their max-recursive-clients or recursive-clients limit,
respectively, a sudden increase in the number of failures could mean
that the limit needs to be increased.

The "Latency" plot contains a single graph marked "Average latency".
This shows how the latency varies during the course of the test.
Typically, the latency graph will exhibit a downwards trend because the
cache hit rate improves as ever more responses are cached during the
test, and the latency for a cache hit is much smaller than for a cache
miss. The latency graph is provided as an aid in determining the point
where the server gets overloaded, which can be seen as a sharp upwards
turn in the graph. The latency graph is not intended for making
absolute latency measurements or comparisons between servers; the
latencies shown in the graph are not representative of production
latencies due to the initially empty cache and the deliberate
overloading of the server towards the end of the test.

Note that all measurements are displayed on the plot at the horizontal
position corresponding to the point in time when the query was sent,
not when the response (if any) was received. This makes it easy to
compare the query and response rates; for example, if no queries are
dropped, the query and response graphs will be identical. As another
example, if the plot shows 10% failure responses at t=5 seconds, this
means that 10% of the queries sent at t=5 seconds eventually failed,
not that 10% of the responses received at t=5 seconds were failures.

Determining the server's maximum throughput
Often, the goal of running resperf is to determine the server's maximum
throughput, in other words, the number of queries per second it is
capable of handling. This is not always an easy task, because as a
server is driven into overload, the service it provides may deteriorate
gradually, and this deterioration can manifest itself as queries being
dropped, as an increase in the number of SERVFAIL responses, or as an
increase in latency. The maximum throughput may be defined as the
highest level of traffic at which the server still provides an
acceptable level of service, but that means you first need to decide
what an acceptable level of service means in terms of packet drop
percentage, SERVFAIL percentage, and latency.

The summary statistics in the "Resperf output" section of the report
contain a "Maximum throughput" value which by default is determined
from the maximum rate at which the server was able to return responses,
without regard to the number of queries being dropped or failing at
that point. This method of throughput measurement has the advantage of
simplicity, but it may or may not be appropriate for your needs; the
reported value should always be validated by a visual inspection of the
graphs to ensure that service has not already deteriorated unacceptably
before the maximum response rate is reached. It may also be helpful to
look at the "Lost at that point" value in the summary statistics; this
indicates the percentage of the queries that was being dropped at the
point in the test when the maximum throughput was reached.

Alternatively, you can make resperf report the throughput at the point
in the test where the percentage of queries dropped exceeds a given
limit (or the maximum as above if the limit is never exceeded). This
can be a more realistic indication of how much the server can be loaded
while still providing an acceptable level of service. This is done
using the -L command line option; for example, specifying -L 10 makes
resperf report the highest throughput reached before the server starts
dropping more than 10% of the queries.
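
The -L rule can be illustrated with a small sketch (invented sample
numbers, not resperf code): scan the per-interval (queries sent,
responses received) rates in time order, and report the highest
response rate seen before the loss percentage first exceeds the limit.

```python
def max_throughput(samples, max_loss_pct=100.0):
    """samples: list of (queries_sent_per_sec, responses_per_sec)
    pairs in time order. Returns the highest response rate observed
    before query loss first exceeds max_loss_pct percent."""
    best = 0.0
    for sent, received in samples:
        if sent <= 0:
            continue
        loss_pct = 100.0 * (sent - received) / sent
        if loss_pct > max_loss_pct:
            break  # service already degraded past the limit
        best = max(best, received)
    return best

# Hypothetical ramp where the server tops out around 90,000 responses/s:
data = [(50_000, 50_000), (80_000, 78_000), (100_000, 90_000), (120_000, 92_000)]
unconstrained = max_throughput(data)     # default: loss is ignored
with_limit = max_throughput(data, 10.0)  # like -L 10
```

With these made-up numbers, the default rule reports 92,000
responses/s (reached at about 23% loss), while -L 10 reports 90,000.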

There is no corresponding way of automatically constraining results
based on the number of failed queries, because unlike dropped queries,
resolution failures will occur even when the server is not overloaded,
and the number of such failures is heavily dependent on the query data
and network conditions. Therefore, the plots should be manually
inspected to ensure that there is not an abnormal number of failures.

Generating a constant traffic load
In addition to ramping up traffic linearly, resperf also has the
capability to send a constant stream of traffic. This can be useful
when using resperf for tasks other than performance measurement; for
example, it can be used to "soak test" a server by subjecting it to a
sustained load for an extended period of time.

To generate a constant traffic load, use the -c command line option,
together with the -m option which specifies the desired constant query
rate. For example, to send 10000 queries per second for an hour, use
-m 10000 -c 3600. This will include the usual 60-second gradual
ramp-up of traffic at the beginning, which may be useful to avoid
initially overwhelming a server that is starting with an empty cache.
To start the onslaught of traffic instantly, use -m 10000 -c 3600 -r 0.

To be precise, resperf will do a linear ramp-up of traffic from 0 to -m
queries per second over a period of -r seconds, followed by a plateau
of steady traffic at -m queries per second lasting for -c seconds,
followed by waiting for responses for an extra 40 seconds. Either the
ramp-up or the plateau can be suppressed by supplying a duration of
zero seconds with -r 0 and -c 0, respectively. The latter is the
default.
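
For planning a run, the phase description above translates directly
into rough totals (illustrative arithmetic only): the ramp averages
half of -m, the plateau runs at the full -m rate, and 40 seconds of
waiting are appended.

```python
def run_totals(max_qps, rampup_s, constant_s, wait_s=40):
    """Approximate number of queries sent and wall-clock duration of
    a resperf run with -m max_qps -r rampup_s -c constant_s."""
    queries = max_qps * rampup_s // 2 + max_qps * constant_s
    duration = rampup_s + constant_s + wait_s
    return queries, duration

# The soak-test example above, -m 10000 -c 3600, with the default ramp:
queries, duration = run_totals(10_000, 60, 3_600)
# about 36.3 million input queries over a 3700-second run
```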

Sending traffic at high rates for hours on end will of course require
very large amounts of input data. Also, a long-running test will
generate a large amount of plot data, which is kept in memory for the
duration of the test. To reduce the memory usage and the size of the
plot file, consider increasing the interval between measurements from
the default of 0.5 seconds using the -i option in long-running tests.

When using resperf for long-running tests, it is important that the
traffic rate specified using the -m option is one that both resperf
itself and the server under test can sustain. Otherwise, the test is
likely to be cut short as a result of either running out of query IDs
(because of large numbers of dropped queries) or of resperf falling
behind its transmission schedule.

OPTIONS
Because the resperf-report script passes its command line options
directly to the resperf program, they both accept the same set of
options, with one exception: resperf-report automatically adds an
appropriate -P to the resperf command line, and therefore does not
itself take a -P option.

-d datafile
       Specifies the input data file. If not specified, resperf will
       read from standard input.

-s server_addr
       Specifies the name or address of the server to which requests
       will be sent. The default is the loopback address, 127.0.0.1.

-p port
       Sets the port on which the DNS packets are sent. If not
       specified, the standard DNS port (53) is used.

-a local_addr
       Specifies the local address from which to send requests. The
       default is the wildcard address.

-x local_port
       Specifies the local port from which to send requests. The
       default is the wildcard port (0).

       If acting as multiple clients and the wildcard port is used,
       each client will use a different random port. If a port is
       specified, the clients will use a range of ports starting with
       the specified one.

-t timeout
       Specifies the request timeout value, in seconds. resperf will
       no longer wait for a response to a particular request after
       this many seconds have elapsed. The default is 45 seconds.

       resperf times out unanswered requests in order to reclaim query
       IDs so that the query ID space will not be exhausted in a
       long-running test, such as when "soak testing" a server for a
       day with -m 10000 -c 86400. The timeouts and the ability to
       tune them are of little use in the more typical use case of a
       performance test lasting only a minute or two.

       The default timeout of 45 seconds was chosen to be longer than
       the query timeout of current caching servers. Note that this is
       longer than the corresponding default in dnsperf, because
       caching servers can take many orders of magnitude longer to
       answer a query than authoritative servers do.

       If a short timeout is used, there is a possibility that resperf
       will receive a response after the corresponding request has
       timed out; in this case, a message like "Warning: Received a
       response with an unexpected id: 141" will be printed.

-b bufsize
       Sets the size of the socket's send and receive buffers, in
       kilobytes. If not specified, the operating system's default is
       used.

-f family
       Specifies the address family used for sending DNS packets. The
       possible values are "inet", "inet6", or "any". If "any" (the
       default value) is specified, resperf will use whichever address
       family is appropriate for the server it is sending packets to.

-e     Enables EDNS0 [RFC2671], by adding an OPT record to all packets
       sent.

-D     Sets the DO (DNSSEC OK) bit [RFC3225] in all packets sent. This
       also enables EDNS0, which is required for DNSSEC.

-y [alg:]name:secret
       Adds a TSIG record [RFC2845] to all packets sent, using the
       specified TSIG key algorithm, name and secret, where the
       algorithm defaults to hmac-md5 and the secret is expressed as a
       base-64 encoded string.

-h     Prints a usage statement and exits.

-i interval
       Specifies the time interval between data points in the plot
       file. The default is 0.5 seconds.

-m max_qps
       Specifies the target maximum query rate (in queries per
       second). This should be higher than the expected maximum
       throughput of the server being tested. Traffic will be ramped
       up at a linearly increasing rate until this value is reached,
       or until one of the other conditions described in the section
       "Running the test" occurs. The default is 100000 queries per
       second.

-P plot_data_file
       Specifies the name of the plot data file. The default is
       resperf.gnuplot.

-r rampup_time
       Specifies the length of time over which traffic will be ramped
       up. The default is 60 seconds.

-c constant_traffic_time
       Specifies the length of time for which traffic will be sent at
       a constant rate following the initial ramp-up. The default is 0
       seconds, meaning no sending of traffic at a constant rate will
       be done.

-L max_loss
       Specifies the maximum acceptable query loss percentage for
       purposes of determining the maximum throughput value. The
       default is 100%, meaning that resperf will measure the maximum
       throughput without regard to query loss.

-C clients
       Acts as multiple clients. Requests are sent from multiple
       sockets. The default is to act as 1 client.

-q max_outstanding
       Sets the maximum number of outstanding requests. resperf will
       stop ramping up traffic when this many queries are outstanding.
       The default is 64k, and the limit is 64k per client.

THE PLOT DATA FILE
The plot data file is written by the resperf program and contains the
data to be plotted using gnuplot. When running resperf via the
resperf-report script, there is no need for the user to deal with this
file directly, but its format and contents are documented here for
completeness and in case you wish to run resperf directly and use its
output for purposes other than viewing it with gnuplot.

The first line of the file is a comment identifying the fields. It may
be recognized as a comment by its leading hash sign (#).

Subsequent lines contain the actual plot data. For purposes of
generating the plot data file, the test run is divided into time
intervals of 0.5 seconds (or some other length of time specified with
the -i command line option). Each line corresponds to one such
interval, and contains the following values as floating-point numbers:

Time
       The midpoint of this time interval, in seconds since the
       beginning of the run

Target queries per second
       The number of queries per second scheduled to be sent in this
       time interval

Actual queries per second
       The number of queries per second actually sent in this time
       interval

Responses per second
       The number of responses received corresponding to queries sent
       in this time interval, divided by the length of the interval

Failures per second
       The number of responses received corresponding to queries sent
       in this time interval and having an RCODE other than NOERROR or
       NXDOMAIN, divided by the length of the interval

Average latency
       The average time between sending the query and receiving a
       response, for queries sent in this time interval
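
If you run resperf directly and want to post-process the plot data
yourself, the documented layout makes parsing straightforward. The
sketch below (an assumed workflow, not part of resperf itself) reads
the file into a list of dicts keyed by the field names above.

```python
def read_plot_data(path):
    """Parse a resperf plot data file: skip the leading '#' comment
    line, then map each whitespace-separated row to the documented
    columns as floats."""
    fields = ["time", "target_qps", "actual_qps",
              "responses_per_sec", "failures_per_sec", "avg_latency"]
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            rows.append(dict(zip(fields, map(float, line.split()))))
    return rows
```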

SEE ALSO
dnsperf(1)

AUTHOR
Nominum, Inc.

Maintained by DNS-OARC

https://www.dns-oarc.net/

For issues and feature requests please use:

https://github.com/DNS-OARC/dnsperf/issues

For questions and help please use:

admin@dns-oarc.net

resperf 2.2.1                                                       resperf(1)