resperf(1) General Commands Manual resperf(1)

NAME
6 resperf - test the resolution performance of a caching DNS server
7
SYNOPSIS
9 resperf-report [-a local_addr] [-d datafile] [-R] [-M mode]
10 [-s server_addr] [-p port] [-x local_port] [-t timeout] [-b bufsize]
11 [-f family] [-e] [-D] [-y [alg:]name:secret] [-h] [-i interval]
12 [-m max_qps] [-r rampup_time] [-c constant_traffic_time] [-L max_loss]
13 [-C clients] [-q max_outstanding] [-F fall_behind] [-v] [-W]
14 [-O option=value]
15
16 resperf [-a local_addr] [-d datafile] [-R] [-M mode] [-s server_addr]
17 [-p port] [-x local_port] [-t timeout] [-b bufsize] [-f family] [-e]
18 [-D] [-y [alg:]name:secret] [-h] [-i interval] [-m max_qps]
19 [-P plot_data_file] [-r rampup_time] [-c constant_traffic_time]
20 [-L max_loss] [-C clients] [-q max_outstanding] [-F fall_behind] [-v]
21 [-W] [-O option=value]
22
DESCRIPTION
24 resperf is a companion tool to dnsperf. dnsperf was primarily designed
25 for benchmarking authoritative servers, and it does not work well with
26 caching servers that are talking to the live Internet. One reason for
27 this is that dnsperf uses a "self-pacing" approach, which is based on
28 the assumption that you can keep the server 100% busy simply by sending
29 it a small burst of back-to-back queries to fill up network buffers,
30 and then send a new query whenever you get a response back. This ap‐
31 proach works well for authoritative servers that process queries in or‐
32 der and one at a time; it also works pretty well for a caching server
33 in a closed laboratory environment talking to a simulated Internet
34 that's all on the same LAN. Unfortunately, it does not work well with
35 a caching server talking to the actual Internet, which may need to work
36 on thousands of queries in parallel to achieve its maximum throughput.
37 There have been numerous attempts to use dnsperf (or its predecessor,
38 queryperf) for benchmarking live caching servers, usually with poor re‐
39 sults. Therefore, a separate tool designed specifically for caching
40 servers is needed.
41
42 How resperf works
43 Unlike the "self-pacing" approach of dnsperf, resperf works by sending
44 DNS queries at a controlled, steadily increasing rate. By default,
45 resperf will send traffic for 60 seconds, linearly increasing the
46 amount of traffic from zero to 100,000 queries per second (or max_qps).
47
48 During the test, resperf listens for responses from the server and
49 keeps track of response rates, failure rates, and latencies. It will
50 also continue listening for responses for an additional 40 seconds af‐
51 ter it has stopped sending traffic, so that there is time for the serv‐
52 er to respond to the last queries sent. This time period was chosen to
53 be longer than the overall query timeout of both Nominum CacheServe and
54 current versions of BIND.
55
56 If the test is successful, the query rate will at some point exceed the
57 capacity of the server and queries will be dropped, causing the re‐
58 sponse rate to stop growing or even decrease as the query rate increas‐
59 es.
60
61 The result of the test is a set of measurements of the query rate, re‐
62 sponse rate, failure response rate, and average query latency as func‐
63 tions of time.
64
65 What you will need
66 Benchmarking a live caching server is serious business. A fast caching
67 server like Nominum CacheServe, resolving a mix of cacheable and non-
68 cacheable queries typical of ISP customer traffic, is capable of re‐
69 solving well over 1,000,000 queries per second. In the process, it
70 will send more than 40,000 queries per second to authoritative servers
71 on the Internet, and receive responses to most of them. Assuming an
72 average request size of 50 bytes and a response size of 150 bytes, this
73 amounts to some 1216 Mbps of outgoing and 448 Mbps of incoming traffic.
74 If your Internet connection can't handle the bandwidth, you will end up
75 measuring the speed of the connection, not the server, and may saturate
76 the connection causing a degradation in service for other users.
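
The bandwidth figures above can be reproduced with a little arithmetic. The sketch below assumes that the totals add the client-facing traffic to the traffic exchanged with authoritative servers; this reading is inferred from the numbers, and the variable and function names are illustrative only.

```python
def mbps(packets_per_second, bytes_per_packet):
    """Convert a packet rate and packet size to megabits per second."""
    return packets_per_second * bytes_per_packet * 8 / 1_000_000

client_qps = 1_000_000   # queries per second arriving from clients
upstream_qps = 40_000    # queries per second sent to authoritative servers
request_bytes = 50       # assumed average request size
response_bytes = 150     # assumed average response size

# Outgoing from the server: responses to clients plus upstream queries.
outgoing = mbps(client_qps, response_bytes) + mbps(upstream_qps, request_bytes)
# Incoming to the server: client queries plus upstream responses.
incoming = mbps(client_qps, request_bytes) + mbps(upstream_qps, response_bytes)

print(f"outgoing: {outgoing:.0f} Mbps, incoming: {incoming:.0f} Mbps")
```

Under the same assumption, the share exchanged with authoritative servers alone is 16 Mbps outgoing and 48 Mbps incoming.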
77
78 Make sure there is no stateful firewall between the server and the In‐
79 ternet, because most of them can't handle the amount of UDP traffic the
80 test will generate and will end up dropping packets, skewing the test
81 results. Some will even lock up or crash.
82
83 You should run resperf on a machine separate from the server under
84 test, on the same LAN. Preferably, this should be a Gigabit Ethernet
85 network. The machine running resperf should be at least as fast as the
86 machine being tested; otherwise, it may end up being the bottleneck.
87
88 There should be no other applications running on the machine running
89 resperf. Performance testing at the traffic levels involved is essen‐
90 tially a hard real-time application - consider the fact that at a query
91 rate of 100,000 queries per second, if resperf gets delayed by just
92 1/100 of a second, 1000 incoming UDP packets will arrive in the mean‐
93 time. This is more than most operating systems will buffer, which
94 means packets will be dropped.
95
96 Because the granularity of the timers provided by operating systems is
97 typically too coarse to accurately schedule packet transmissions at
98 sub-millisecond intervals, resperf will busy-wait between packet trans‐
99 missions, constantly polling for responses in the meantime. Therefore,
100 it is normal for resperf to consume 100% CPU during the whole test run,
101 even during periods where query rates are relatively low.
102
103 You will also need a set of test queries in the dnsperf file format.
104 See the dnsperf man page for instructions on how to construct this
105 query file. To make the test as realistic as possible, the queries
106 should be derived from recorded production client DNS traffic, without
107 removing duplicate queries or other filtering. With the default set‐
108 tings, resperf will use up to 3 million queries in each test run.
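
The 3 million figure is the area under the default linear ramp. A minimal sketch of the calculation, assuming the run is not cut short; the function name queries_needed is hypothetical and not part of resperf:

```python
def queries_needed(max_qps=100_000, rampup_time=60, constant_time=0):
    """Upper bound on the number of queries a full run will send."""
    ramp = max_qps * rampup_time / 2    # triangle under the linear ramp
    plateau = max_qps * constant_time   # rectangle for constant-rate traffic
    return int(ramp + plateau)

print(queries_needed())                   # default settings
print(queries_needed(10_000, 60, 3600))   # -m 10000 -c 3600
```

This can help in sizing the query file for longer runs that add a constant-traffic phase.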
109
110 If the caching server to be tested has a configurable limit on the num‐
111 ber of simultaneous resolutions, like the max-recursive-clients state‐
112 ment in Nominum CacheServe or the recursive-clients option in BIND 9,
113 you will probably have to increase it. As a starting point, we recom‐
114 mend a value of 10000 for Nominum CacheServe and 100000 for BIND 9.
115 Should the limit be reached, it will show up in the plots as an in‐
116 crease in the number of failure responses.
117
118 The server being tested should be restarted at the beginning of each
119 test to make sure it is starting with an empty cache. If the cache al‐
120 ready contains data from a previous test run that used the same set of
121 queries, almost all queries will be answered from the cache, yielding
122 inflated performance numbers.
123
124 To use the resperf-report script, you need to have gnuplot installed.
125 Make sure your installed version of gnuplot supports the png terminal
126 driver. If your gnuplot doesn't support png but does support gif, you
127 can change the line saying terminal=png in the resperf-report script to
128 terminal=gif.
129
130 Running the test
131 resperf is typically invoked via the resperf-report script, which will
132 run resperf with its output redirected to a file and then automatically
133 generate an illustrated report in HTML format. Command line arguments
134 given to resperf-report will be passed on unchanged to resperf.
135
136 When running resperf-report, you will need to specify at least the
137 server IP address and the query data file. A typical invocation will
138 look like
139
140 resperf-report -s 10.0.0.2 -d queryfile
141
142 With default settings, the test run will take at most 100 seconds (60
143 seconds of ramping up traffic and then 40 seconds of waiting for re‐
144 sponses), but in practice, the 60-second traffic phase will usually be
145 cut short. To be precise, resperf can transition from the traffic-
146 sending phase to the waiting-for-responses phase in three different
147 ways:
148
149 • Running for the full allotted time and successfully reaching the max‐
150 imum query rate (by default, 60 seconds and 100,000 qps, respective‐
151 ly). Since this is a very high query rate, this will rarely happen
152 (with today's hardware); one of the other two conditions listed below
153 will usually occur first.
154
155 • Exceeding 65,536 outstanding queries. This often happens as a result
156 of (successfully) exceeding the capacity of the server being tested,
157 causing the excess queries to be dropped. The limit of 65,536
158 queries comes from the number of possible values for the ID field in
159 the DNS packet. resperf needs to allocate a unique ID for each out‐
160 standing query, and is therefore unable to send further queries if
161 the set of possible IDs is exhausted.
162
163 • When resperf finds itself unable to send queries fast enough. res‐
164 perf will notice if it is falling behind in its scheduled query
165 transmissions, and if this backlog reaches 1000 queries, it will
166 print a message like "Fell behind by 1000 queries" (or whatever the
167 actual number is at the time) and stop sending traffic.
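
The three conditions above can be summarized in a short sketch. This is an illustration of the logic as described here, not resperf's actual code; the function name, the order of the checks, and the exact comparisons are assumptions.

```python
MAX_OUTSTANDING = 2 ** 16   # 65,536 possible values of the 16-bit DNS ID field
FALL_BEHIND = 1_000         # default backlog limit

def traffic_phase_end(elapsed, outstanding, backlog,
                      rampup_time=60.0, constant_time=0.0):
    """Return the reason the traffic-sending phase ends, or None to keep sending."""
    if outstanding >= MAX_OUTSTANDING:
        return "out of query IDs"
    if backlog >= FALL_BEHIND:
        return f"Fell behind by {backlog} queries"
    if elapsed >= rampup_time + constant_time:
        return "ran for the full allotted time"
    return None

print(traffic_phase_end(elapsed=12.5, outstanding=65_536, backlog=0))
```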
168
169 Regardless of which of the above conditions caused the traffic-sending
170 phase of the test to end, you should examine the resulting plots to
171 make sure the server's response rate is flattening out toward the end
172 of the test. If it is not, then you are not loading the server enough.
173 If you are getting the "Fell behind" message, make sure that the ma‐
174 chine running resperf is fast enough and has no other applications run‐
175 ning.
176
177 You should also monitor the CPU usage of the server under test. It
178 should reach close to 100% CPU at the point of maximum traffic; if it
179 does not, you most likely have a bottleneck in some other part of your
180 test setup, for example, your external Internet connection.
181
182 The report generated by resperf-report will be stored with a unique
183 file name based on the current date and time, e.g., 20060812-1550.html.
184 The PNG images of the plots and other auxiliary files will be stored in
185 separate files beginning with the same date-time string. To view the
186 report, simply open the .html file in a web browser.
187
188 If you need to copy the report to a separate machine for viewing, make
189 sure to copy the .png files along with the .html file (or simply copy
190 all the files, e.g., using scp 20060812-1550.* host:directory/).
191
192 Interpreting the report
193 The .html file produced by resperf-report consists of two sections.
194 The first section, "Resperf output", contains output from the resperf
195 program such as progress messages, a summary of the command line argu‐
196 ments, and summary statistics. The second section, "Plots", contains
197 two plots generated by gnuplot: "Query/response/failure rate" and "La‐
198 tency".
199
200 The "Query/response/failure rate" plot contains three graphs. The
201 "Queries sent per second" graph shows the amount of traffic being sent
202 to the server; this should be very close to a straight diagonal line,
203 reflecting the linear ramp-up of traffic.
204
205 The "Total responses received per second" graph shows how many of the
206 queries received a response from the server. All responses are count‐
207 ed, whether successful (NOERROR or NXDOMAIN) or not (e.g., SERVFAIL).
208
209 The "Failure responses received per second" graph shows how many of the
210 queries received a failure response. A response is considered to be a
211 failure if its RCODE is neither NOERROR nor NXDOMAIN.
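
The failure rule can be written as a one-line predicate. The sketch below uses the standard RCODE numbers from the DNS specification; the helper name is_failure is illustrative.

```python
# Standard DNS RCODE values (RFC 1035).
NOERROR = 0
SERVFAIL = 2
NXDOMAIN = 3
REFUSED = 5

def is_failure(rcode):
    """A response counts as a failure unless it is NOERROR or NXDOMAIN."""
    return rcode not in (NOERROR, NXDOMAIN)

responses = [NOERROR, NXDOMAIN, SERVFAIL, NOERROR, REFUSED]
print(sum(is_failure(r) for r in responses), "of", len(responses), "are failures")
```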
212
213 By visually inspecting the graphs, you can get an idea of how the serv‐
214 er behaves under increasing load. The "Total responses received per
215 second" graph will initially closely follow the "Queries sent per sec‐
216 ond" graph (often rendering it invisible in the plot as the two graphs
217 are plotted on top of one another), but when the load exceeds the serv‐
218 er's capacity, the "Total responses received per second" graph may di‐
219 verge from the "Queries sent per second" graph and flatten out, indi‐
220 cating that some of the queries are being dropped.
221
222 The "Failure responses received per second" graph will normally show a
223 roughly linear ramp close to the bottom of the plot with some random
224 fluctuation, since typical query traffic will contain some small per‐
225 centage of failing queries randomly interspersed with the successful
226 ones. As the total traffic increases, the number of failures will in‐
227 crease proportionally.
228
229 If the "Failure responses received per second" graph turns sharply up‐
230 wards, this can be another indication that the load has exceeded the
231 server's capacity. This will happen if the server reacts to overload
232 by sending SERVFAIL responses rather than by dropping queries. Since
233 Nominum CacheServe and BIND 9 will both respond with SERVFAIL when they
234 exceed their max-recursive-clients or recursive-clients limit, respec‐
235 tively, a sudden increase in the number of failures could mean that the
236 limit needs to be increased.
237
238 The "Latency" plot contains a single graph marked "Average latency".
239 This shows how the latency varies during the course of the test. Typi‐
240 cally, the latency graph will exhibit a downwards trend because the
241 cache hit rate improves as ever more responses are cached during the
242 test, and the latency for a cache hit is much smaller than for a cache
243 miss. The latency graph is provided as an aid in determining the point
244 where the server gets overloaded, which can be seen as a sharp upwards
245 turn in the graph. The latency graph is not intended for making abso‐
246 lute latency measurements or comparisons between servers; the latencies
247 shown in the graph are not representative of production latencies due
248 to the initially empty cache and the deliberate overloading of the
249 server towards the end of the test.
250
Note that all measurements are displayed on the plot at the horizontal
position corresponding to the point in time when the query was sent,
not when the response (if any) was received. This makes it easy to
compare the query and response rates; for example, if no queries are
dropped, the query and response graphs will be identical. As another
example, if the plot shows 10% failure responses at t=5 seconds, this
means that 10% of the queries sent at t=5 seconds eventually failed,
not that 10% of the responses received at t=5 seconds were failures.
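
Attributing results to the send time can be sketched as follows. This is an illustration of the bookkeeping idea only, not resperf's implementation; the input format (pairs of send time and RCODE, with None for a dropped query) is an assumption made for the example.

```python
from collections import defaultdict

def failure_share_by_send_time(queries, interval=0.5):
    """Map each interval start to the fraction of queries sent in it that failed.

    queries: iterable of (send_time, rcode); rcode is None if no response arrived.
    """
    sent = defaultdict(int)
    failed = defaultdict(int)
    for send_time, rcode in queries:
        # The bucket is chosen by when the query was sent, not when it was answered.
        bucket = int(send_time // interval) * interval
        sent[bucket] += 1
        if rcode is not None and rcode not in (0, 3):   # not NOERROR/NXDOMAIN
            failed[bucket] += 1
    return {b: failed[b] / sent[b] for b in sorted(sent)}

sample = [(5.1, 0)] * 9 + [(5.1, 2)] + [(6.2, None)]
print(failure_share_by_send_time(sample))
```

In the sample, one of the ten queries sent in the interval starting at t=5 seconds eventually received SERVFAIL, so that interval shows 10% failures regardless of when the SERVFAIL arrived.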
259
260 Determining the server's maximum throughput
261 Often, the goal of running resperf is to determine the server's maximum
262 throughput, in other words, the number of queries per second it is ca‐
263 pable of handling. This is not always an easy task, because as a serv‐
264 er is driven into overload, the service it provides may deteriorate
265 gradually, and this deterioration can manifest itself either as queries
266 being dropped, as an increase in the number of SERVFAIL responses, or
267 an increase in latency. The maximum throughput may be defined as the
268 highest level of traffic at which the server still provides an accept‐
269 able level of service, but that means you first need to decide what an
270 acceptable level of service means in terms of packet drop percentage,
271 SERVFAIL percentage, and latency.
272
The summary statistics in the "Resperf output" section of the report
contain a "Maximum throughput" value which by default is determined
from the maximum rate at which the server was able to return responses,
without regard to the number of queries being dropped or failing at
that point. This method of throughput measurement has the advantage of
simplicity, but it may or may not be appropriate for your needs; the
reported value should always be validated by a visual inspection of the
graphs to ensure that service has not already deteriorated unacceptably
before the maximum response rate is reached. It may also be helpful to
look at the "Lost at that point" value in the summary statistics; this
indicates the percentage of queries that were being dropped at the
point in the test when the maximum throughput was reached.
285
286 Alternatively, you can make resperf report the throughput at the point
287 in the test where the percentage of queries dropped exceeds a given
288 limit (or the maximum as above if the limit is never exceeded). This
289 can be a more realistic indication of how much the server can be loaded
290 while still providing an acceptable level of service. This is done us‐
291 ing the -L command line option; for example, specifying -L 10 makes
292 resperf report the highest throughput reached before the server starts
293 dropping more than 10% of the queries.
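
The effect of a loss limit can be approximated from per-interval data. The following is a rough sketch under stated assumptions, not the algorithm resperf itself uses: it takes one (queries per second, responses per second) pair per interval and stops at the first interval whose loss exceeds the limit.

```python
def max_throughput(rows, max_loss=100.0):
    """Highest response rate seen before loss first exceeds max_loss percent.

    rows: list of (queries_per_second, responses_per_second), in time order.
    """
    best = 0.0
    for sent, responses in rows:
        if sent > 0:
            loss = 100.0 * (sent - responses) / sent
            if loss > max_loss:
                break   # the limit is exceeded; ignore everything after it
        best = max(best, responses)
    return best

rows = [(1000, 1000), (2000, 1990), (3000, 2600), (4000, 2800)]
print(max_throughput(rows))               # no loss limit
print(max_throughput(rows, max_loss=10))  # comparable in spirit to -L 10
```

With these made-up numbers, the unconstrained maximum is 2800 responses per second, while the 10% limit gives 1990 because the third interval already loses about 13% of its queries.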
294
There is no corresponding way of automatically constraining results
based on the number of failed queries, because unlike dropped queries,
resolution failures will occur even when the server is not overloaded,
and the number of such failures is heavily dependent on the query data
and network conditions. Therefore, the plots should be manually
inspected to ensure that there is not an abnormal number of failures.
302
304 In addition to ramping up traffic linearly, resperf also has the capa‐
305 bility to send a constant stream of traffic. This can be useful when
306 using resperf for tasks other than performance measurement; for exam‐
307 ple, it can be used to "soak test" a server by subjecting it to a sus‐
308 tained load for an extended period of time.
309
To generate a constant traffic load, use the -c command line option
together with the -m option, which specifies the desired constant query
rate. For example, to send 10000 queries per second for an hour, use
-m 10000 -c 3600. This will include the usual 60-second gradual ramp-up
of traffic at the beginning, which may be useful to avoid initially
overwhelming a server that is starting with an empty cache. To start
the onslaught of traffic instantly, use -m 10000 -c 3600 -r 0.
317
318 To be precise, resperf will do a linear ramp-up of traffic from 0 to -m
319 queries per second over a period of -r seconds, followed by a plateau
320 of steady traffic at -m queries per second lasting for -c seconds, fol‐
321 lowed by waiting for responses for an extra 40 seconds. Either the
322 ramp-up or the plateau can be suppressed by supplying a duration of ze‐
323 ro seconds with -r 0 and -c 0, respectively. The latter is the de‐
324 fault.
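
The schedule described above can be expressed as a small function of time. This is a sketch of the documented behavior only; the function names are hypothetical, and the parameters mirror the -m, -r and -c options with their defaults.

```python
def target_qps(t, max_qps=100_000, rampup_time=60.0, constant_time=0.0):
    """Scheduled query rate t seconds into the run."""
    if t < 0:
        return 0.0
    if t < rampup_time:
        return max_qps * t / rampup_time    # linear ramp-up from zero
    if t < rampup_time + constant_time:
        return float(max_qps)               # plateau of steady traffic
    return 0.0                              # only waiting for responses

def total_duration(rampup_time=60.0, constant_time=0.0, wait_time=40.0):
    """Longest possible run: ramp-up, plateau, then waiting for responses."""
    return rampup_time + constant_time + wait_time

print(target_qps(30))                      # halfway through the default ramp
print(target_qps(0, 10_000, 0, 3600))      # -m 10000 -c 3600 -r 0
print(total_duration(60, 3600))            # -m 10000 -c 3600
```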
325
326 Sending traffic at high rates for hours on end will of course require
327 very large amounts of input data. Also, a long-running test will gen‐
328 erate a large amount of plot data, which is kept in memory for the du‐
329 ration of the test. To reduce the memory usage and the size of the
330 plot file, consider increasing the interval between measurements from
331 the default of 0.5 seconds using the -i option in long-running tests.
332
When using resperf for long-running tests, it is important that the
traffic rate specified using the -m option is one that both resperf
itself and the server under test can sustain. Otherwise, the test is
likely to be cut short as a result of either running out of query IDs
(because of large numbers of dropped queries) or of resperf falling
behind its transmission schedule.
339
340 Using DNS-over-HTTPS
When using DNS-over-HTTPS, you must set -O doh-uri=... to a URI that
works with the server you are sending to. Also note that the value for
maximum outstanding queries (-q) will be used to control the maximum
number of concurrent streams within the HTTP/2 connection.
345
OPTIONS
Because the resperf-report script passes its command line options
directly to the resperf program, they both accept the same set of
options, with one exception: resperf-report automatically adds an
appropriate -P to the resperf command line, and therefore does not
itself take a -P option.
352
353 -d datafile
354 Specifies the input data file. If not specified, resperf will
355 read from standard input.
356
357 -R
Reopen the datafile if it runs out of data before the testing is
completed. This allows for long-running tests on a very small and
simple query datafile.
361
362 -M mode
363 Specifies the transport mode to use, "udp", "tcp", "dot" or
364 "doh". Default is "udp".
365
366 -s server_addr
367 Specifies the name or address of the server to which requests
368 will be sent. The default is the loopback address, 127.0.0.1.
369
370 -p port
371 Sets the port on which the DNS packets are sent. If not speci‐
372 fied, the standard DNS port (udp/tcp 53, DoT 853, DoH 443) is
373 used.
374
375 -a local_addr
376 Specifies the local address from which to send requests. The
377 default is the wildcard address.
378
379 -x local_port
380 Specifies the local port from which to send requests. The de‐
381 fault is the wildcard port (0).
382
383 If acting as multiple clients and the wildcard port is used,
384 each client will use a different random port. If a port is
385 specified, the clients will use a range of ports starting with
386 the specified one.
387
388 -t timeout
389 Specifies the request timeout value, in seconds. resperf will
390 no longer wait for a response to a particular request after this
391 many seconds have elapsed. The default is 45 seconds.
392
resperf times out unanswered requests in order to reclaim query
IDs so that the query ID space will not be exhausted in a long-running
test, such as when "soak testing" a server for a day with
-m 10000 -c 86400. The timeouts and the ability to tune them are of
little use in the more typical use case of a performance test lasting
only a minute or two.
399
400 The default timeout of 45 seconds was chosen to be longer than
401 the query timeout of current caching servers. Note that this is
402 longer than the corresponding default in dnsperf, because
403 caching servers can take many orders of magnitude longer to an‐
404 swer a query than authoritative servers do.
405
406 If a short timeout is used, there is a possibility that resperf
407 will receive a response after the corresponding request has
408 timed out; in this case, a message like Warning: Received a re‐
409 sponse with an unexpected id: 141 will be printed.
410
411 -b bufsize
412 Sets the size of the socket's send and receive buffers, in kilo‐
413 bytes. If not specified, the operating system's default is
414 used.
415
416 -f family
417 Specifies the address family used for sending DNS packets. The
418 possible values are "inet", "inet6", or "any". If "any" (the
419 default value) is specified, resperf will use whichever address
420 family is appropriate for the server it is sending packets to.
421
422 -e
423 Enables EDNS0 [RFC2671], by adding an OPT record to all packets
424 sent.
425
426 -D
427 Sets the DO (DNSSEC OK) bit [RFC3225] in all packets sent. This
428 also enables EDNS0, which is required for DNSSEC.
429
430 -y [alg:]name:secret
431 Add a TSIG record [RFC2845] to all packets sent, using the spec‐
432 ified TSIG key algorithm, name and secret, where the algorithm
433 defaults to hmac-md5 and the secret is expressed as a base-64
434 encoded string.
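
The structure of the -y argument can be illustrated with a small parser. This sketch is not taken from resperf; the function name parse_tsig, the key name and the secret are made up for the example, and the algorithm string is not validated.

```python
import base64

def parse_tsig(spec):
    """Split '[alg:]name:secret' into (algorithm, name, secret_bytes)."""
    parts = spec.split(":")
    if len(parts) == 2:
        algorithm = "hmac-md5"          # documented default algorithm
        name, secret = parts
    elif len(parts) == 3:
        algorithm, name, secret = parts
    else:
        raise ValueError("expected [alg:]name:secret")
    # The secret is a base-64 encoded string; decode it to raw bytes.
    return algorithm, name, base64.b64decode(secret, validate=True)

print(parse_tsig("mykey:c2VjcmV0"))
print(parse_tsig("hmac-sha256:mykey:c2VjcmV0"))
```

Splitting on colons is safe here because the base-64 alphabet does not contain a colon.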
435
436 -h
437 Print a usage statement and exit.
438
439 -i interval
440 Specifies the time interval between data points in the plot
441 file. The default is 0.5 seconds.
442
443 -m max_qps
444 Specifies the target maximum query rate (in queries per second).
445 This should be higher than the expected maximum throughput of
446 the server being tested. Traffic will be ramped up at a linear‐
447 ly increasing rate until this value is reached, or until one of
448 the other conditions described in the section "Running the test"
449 occurs. The default is 100000 queries per second.
450
451 -P plot_data_file
452 Specifies the name of the plot data file. The default is res‐
453 perf.gnuplot.
454
455 -r rampup_time
456 Specifies the length of time over which traffic will be ramped
457 up. The default is 60 seconds.
458
459 -c constant_traffic_time
460 Specifies the length of time for which traffic will be sent at a
461 constant rate following the initial ramp-up. The default is 0
462 seconds, meaning no sending of traffic at a constant rate will
463 be done.
464
465 -L max_loss
466 Specifies the maximum acceptable query loss percentage for pur‐
467 poses of determining the maximum throughput value. The default
468 is 100%, meaning that resperf will measure the maximum through‐
469 put without regard to query loss.
470
471 -C clients
472 Act as multiple clients. Requests are sent from multiple sock‐
473 ets. The default is to act as 1 client.
474
475 -q max_outstanding
476 Sets the maximum number of outstanding requests. resperf will
477 stop ramping up traffic when this many queries are outstanding.
478 The default is 64k, and the limit is 64k per client.
479
480 -F fall_behind
Sets the maximum number of queries by which resperf may fall behind
its transmission schedule. resperf will stop sending traffic when
the backlog of queries that should have been sent reaches this
number; this limit can be relatively easy to hit if max_qps is set
too high. The default is 1000, and setting it to zero (0) disables
the check.
486
487 -v
488 Enables verbose mode to report about network readiness and con‐
489 gestion.
490
491 -W
Log warnings and errors to standard output instead of standard
error, making it easier for scripts, tests and automation to
capture all output.
495
496 -O option=value
Sets an extended long option that controls various aspects of
testing or of the protocol modules. See EXTENDED OPTIONS in
dnsperf(1) for the list of available options.
500
502 The plot data file is written by the resperf program and contains the
503 data to be plotted using gnuplot. When running resperf via the res‐
504 perf-report script, there is no need for the user to deal with this
505 file directly, but its format and contents are documented here for com‐
506 pleteness and in case you wish to run resperf directly and use its out‐
507 put for purposes other than viewing it with gnuplot.
508
509 The first line of the file is a comment identifying the fields. It may
510 be recognized as a comment by its leading hash sign (#).
511
512 Subsequent lines contain the actual plot data. For purposes of gener‐
513 ating the plot data file, the test run is divided into time intervals
514 of 0.5 seconds (or some other length of time specified with the -i com‐
515 mand line option). Each line corresponds to one such interval, and
516 contains the following values as floating-point numbers:
517
518 Time
519 The midpoint of this time interval, in seconds since the begin‐
520 ning of the run
521
522 Target queries per second
523 The number of queries per second scheduled to be sent in this
524 time interval
525
526 Actual queries per second
527 The number of queries per second actually sent in this time in‐
528 terval
529
530 Responses per second
531 The number of responses received corresponding to queries sent
532 in this time interval, divided by the length of the interval
533
534 Failures per second
535 The number of responses received corresponding to queries sent
536 in this time interval and having an RCODE other than NOERROR or
537 NXDOMAIN, divided by the length of the interval
538
539 Average latency
540 The average time between sending the query and receiving a re‐
541 sponse, for queries sent in this time interval
542
543 Connections
The number of connections made, including reconnections, during
this time interval. This is only relevant to connection-oriented
protocols, such as TCP and DoT.
547
548 Average connection latency
The average time between starting to connect and having the
connection ready for sending queries, for this time interval.
This is only relevant to connection-oriented protocols, such as
TCP and DoT.
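
For those who want to process the plot data file outside gnuplot, a minimal reader might look like the sketch below. It assumes the values are separated by whitespace and appear in the order listed above; the separator is not stated here, and the class name, field names and sample line are invented for illustration.

```python
from typing import NamedTuple

class PlotRow(NamedTuple):
    time: float
    target_qps: float
    actual_qps: float
    responses_per_sec: float
    failures_per_sec: float
    avg_latency: float
    connections: float
    avg_conn_latency: float

def parse_plot_data(lines):
    """Parse plot data lines, skipping blank lines and '#' comments."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        values = [float(v) for v in line.split()]
        rows.append(PlotRow(*values[:8]))   # use the eight documented fields
    return rows

sample = [
    "# made-up header comment",
    "0.250000 416.666667 416.000000 416.000000 4.000000 0.012345 0.000000 0.000000",
]
for row in parse_plot_data(sample):
    print(row.time, row.actual_qps, row.responses_per_sec)
```

To read a real file, pass an open file object, for example parse_plot_data(open("resperf.gnuplot")).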
553
554
SEE ALSO
556 dnsperf(1)
557
AUTHOR
559 Nominum, Inc.
560
561 Maintained by DNS-OARC
562
563 https://www.dns-oarc.net/
564
566 For issues and feature requests please use:
567
568 https://github.com/DNS-OARC/dnsperf/issues
569
For questions and help please use:
571
572 admin@dns-oarc.net
573
resperf 2.8.0 resperf(1)