resperf(1)

resperf(1)                  General Commands Manual                 resperf(1)

NAME

       resperf - test the resolution performance of a caching DNS server

SYNOPSIS

       resperf-report [-a local_addr] [-d datafile] [-M mode] [-s server_addr]
       [-p port] [-x local_port] [-t timeout] [-b bufsize] [-f family] [-e]
       [-D] [-y [alg:]name:secret] [-h] [-i interval] [-m max_qps]
       [-r rampup_time] [-c constant_traffic_time] [-L max_loss] [-C clients]
       [-q max_outstanding] [-v]

       resperf [-a local_addr] [-d datafile] [-M mode] [-s server_addr]
       [-p port] [-x local_port] [-t timeout] [-b bufsize] [-f family] [-e]
       [-D] [-y [alg:]name:secret] [-h] [-i interval] [-m max_qps]
       [-P plot_data_file] [-r rampup_time] [-c constant_traffic_time]
       [-L max_loss] [-C clients] [-q max_outstanding] [-v]

DESCRIPTION

       resperf is a companion tool to dnsperf. dnsperf was primarily designed
       for benchmarking authoritative servers, and it does not work well with
       caching servers that are talking to the live Internet. One reason for
       this is that dnsperf uses a "self-pacing" approach, which is based on
       the assumption that you can keep the server 100% busy simply by sending
       it a small burst of back-to-back queries to fill up network buffers,
       and then sending a new query whenever you get a response back. This
       approach works well for authoritative servers that process queries in
       order and one at a time; it also works pretty well for a caching server
       in a closed laboratory environment talking to a simulated Internet
       that's all on the same LAN. Unfortunately, it does not work well with a
       caching server talking to the actual Internet, which may need to work
       on thousands of queries in parallel to achieve its maximum throughput.
       There have been numerous attempts to use dnsperf (or its predecessor,
       queryperf) for benchmarking live caching servers, usually with poor
       results. Therefore, a separate tool designed specifically for caching
       servers is needed.

   How resperf works
       Unlike the "self-pacing" approach of dnsperf, resperf works by sending
       DNS queries at a controlled, steadily increasing rate. By default,
       resperf will send traffic for 60 seconds, linearly increasing the
       amount of traffic from zero to 100,000 queries per second.

       During the test, resperf listens for responses from the server and
       keeps track of response rates, failure rates, and latencies. It will
       also continue listening for responses for an additional 40 seconds
       after it has stopped sending traffic, so that there is time for the
       server to respond to the last queries sent. This time period was chosen
       to be longer than the overall query timeout of both Nominum CacheServe
       and current versions of BIND.

       If the test is successful, the query rate will at some point exceed the
       capacity of the server and queries will be dropped, causing the
       response rate to stop growing or even decrease as the query rate
       increases.

       The result of the test is a set of measurements of the query rate,
       response rate, failure response rate, and average query latency as
       functions of time.

   What you will need
       Benchmarking a live caching server is serious business. A fast caching
       server like Nominum CacheServe, resolving a mix of cacheable and
       non-cacheable queries typical of ISP customer traffic, is capable of
       resolving well over 1,000,000 queries per second. In the process, it
       will send more than 40,000 queries per second to authoritative servers
       on the Internet, and receive responses to most of them. Assuming an
       average request size of 50 bytes and a response size of 150 bytes, this
       amounts to some 1216 Mbps of outgoing and 448 Mbps of incoming traffic.
       If your Internet connection can't handle the bandwidth, you will end up
       measuring the speed of the connection, not the server, and may saturate
       the connection, causing a degradation in service for other users.

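       The bandwidth figures above can be reproduced with simple arithmetic.
       The following Python sketch is illustrative only: it assumes that the
       50-byte and 150-byte averages apply to client-side traffic as well as
       to traffic toward authoritative servers, and the function name is
       invented for this example.

```python
def server_traffic_mbps(client_qps, upstream_qps, query_bytes=50, response_bytes=150):
    """Estimate traffic at the caching server in Mbps (10**6 bits per second).

    Assumption: client-facing and Internet-facing packets share the same
    average sizes. Returns (total_out, total_in, internet_out, internet_in).
    """
    def mbps(packets_per_second, size_bytes):
        return packets_per_second * size_bytes * 8 / 1_000_000

    internet_out = mbps(upstream_qps, query_bytes)     # queries to authoritative servers
    internet_in = mbps(upstream_qps, response_bytes)   # responses from authoritative servers
    client_out = mbps(client_qps, response_bytes)      # responses sent to clients
    client_in = mbps(client_qps, query_bytes)          # queries received from clients
    return (client_out + internet_out, client_in + internet_in,
            internet_out, internet_in)


total_out, total_in, inet_out, inet_in = server_traffic_mbps(1_000_000, 40_000)
print(f"server total:  {total_out:.0f} Mbps out, {total_in:.0f} Mbps in")
print(f"Internet link: {inet_out:.0f} Mbps out, {inet_in:.0f} Mbps in")
```

       Under these assumptions, only about 16 Mbps outgoing and 48 Mbps
       incoming crosses the Internet connection; the remainder flows between
       the clients (here, resperf) and the server.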
       Make sure there is no stateful firewall between the server and the
       Internet, because most of them can't handle the amount of UDP traffic
       the test will generate and will end up dropping packets, skewing the
       test results. Some will even lock up or crash.

       You should run resperf on a machine separate from the server under
       test, on the same LAN. Preferably, this should be a Gigabit Ethernet
       network. The machine running resperf should be at least as fast as the
       machine being tested; otherwise, it may end up being the bottleneck.

       There should be no other applications running on the machine running
       resperf. Performance testing at the traffic levels involved is
       essentially a hard real-time application - consider the fact that at a
       query rate of 100,000 queries per second, if resperf gets delayed by
       just 1/100 of a second, 1000 incoming UDP packets will arrive in the
       meantime. This is more than most operating systems will buffer, which
       means packets will be dropped.

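       The effect of a short stall can be estimated as follows. This is a
       rough sketch, not a statement about any particular operating system:
       the 150-byte response size is borrowed from the bandwidth estimate
       earlier in this section, per-packet kernel overhead is ignored, and
       both function names are hypothetical.

```python
def packets_during_stall(qps, stall_ms):
    """Responses that arrive while the receiver is stalled for stall_ms milliseconds."""
    return qps * stall_ms / 1000


def buffer_bytes_needed(qps, stall_ms, response_bytes=150):
    """Payload bytes that must be buffered to ride out the stall without loss.

    response_bytes is an assumed average; real socket buffers also carry
    per-packet bookkeeping overhead, so the true requirement is higher.
    """
    return packets_during_stall(qps, stall_ms) * response_bytes


# A stall of 1/100 of a second (10 ms) at 100,000 queries per second.
print(packets_during_stall(100_000, 10))
print(buffer_bytes_needed(100_000, 10))
```

       If the estimate exceeds the operating system's default socket buffer
       size, the -b option described under OPTIONS may be worth trying.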
       Because the granularity of the timers provided by operating systems is
       typically too coarse to accurately schedule packet transmissions at
       sub-millisecond intervals, resperf will busy-wait between packet
       transmissions, constantly polling for responses in the meantime.
       Therefore, it is normal for resperf to consume 100% CPU during the
       whole test run, even during periods where query rates are relatively
       low.

       You will also need a set of test queries in the dnsperf file format.
       See the dnsperf man page for instructions on how to construct this
       query file. To make the test as realistic as possible, the queries
       should be derived from recorded production client DNS traffic, without
       removing duplicate queries or other filtering. With the default
       settings, resperf will use up to 3 million queries in each test run.

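       The figure of 3 million queries follows from the shape of the default
       ramp: the number of queries sent is the area under a straight line
       rising from zero to the maximum rate. A minimal sketch (the function
       name is invented for illustration):

```python
def queries_in_ramp(max_qps, rampup_seconds):
    """Queries sent during a linear ramp from 0 to max_qps.

    The query rate over time forms a triangle, so the total is half of
    base times height.
    """
    return max_qps * rampup_seconds / 2


# Default settings: ramp to 100,000 qps over 60 seconds.
print(queries_in_ramp(100_000, 60))
```

       This is an upper bound; a run that is cut short uses fewer queries.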
       If the caching server to be tested has a configurable limit on the
       number of simultaneous resolutions, like the max-recursive-clients
       statement in Nominum CacheServe or the recursive-clients option in
       BIND 9, you will probably have to increase it. As a starting point, we
       recommend a value of 10000 for Nominum CacheServe and 100000 for
       BIND 9. Should the limit be reached, it will show up in the plots as an
       increase in the number of failure responses.

       The server being tested should be restarted at the beginning of each
       test to make sure it is starting with an empty cache. If the cache
       already contains data from a previous test run that used the same set
       of queries, almost all queries will be answered from the cache,
       yielding inflated performance numbers.

       To use the resperf-report script, you need to have gnuplot installed.
       Make sure your installed version of gnuplot supports the png terminal
       driver. If your gnuplot doesn't support png but does support gif, you
       can change the line saying terminal=png in the resperf-report script to
       terminal=gif.

   Running the test
       Resperf is typically invoked via the resperf-report script, which will
       run resperf with its output redirected to a file and then automatically
       generate an illustrated report in HTML format. Command line arguments
       given to resperf-report will be passed on unchanged to resperf.

       When running resperf-report, you will need to specify at least the
       server IP address and the query data file. A typical invocation will
       look like

              resperf-report -s 10.0.0.2 -d queryfile

       With default settings, the test run will take at most 100 seconds (60
       seconds of ramping up traffic and then 40 seconds of waiting for
       responses), but in practice, the 60-second traffic phase will usually
       be cut short. To be precise, resperf can transition from the
       traffic-sending phase to the waiting-for-responses phase in three
       different ways:

       · Running for the full allotted time and successfully reaching the
         maximum query rate (by default, 60 seconds and 100,000 qps,
         respectively). Since this is a very high query rate, this will rarely
         happen (with today's hardware); one of the other two conditions
         listed below will usually occur first.

       · Exceeding 65,536 outstanding queries. This often happens as a result
         of (successfully) exceeding the capacity of the server being tested,
         causing the excess queries to be dropped. The limit of 65,536 queries
         comes from the number of possible values for the ID field in the DNS
         packet. Resperf needs to allocate a unique ID for each outstanding
         query, and is therefore unable to send further queries if the set of
         possible IDs is exhausted.

       · Falling behind its transmission schedule. Resperf will notice if it
         is falling behind in its scheduled query transmissions, and if this
         backlog reaches 1000 queries, it will print a message like "Fell
         behind by 1000 queries" (or whatever the actual number is at the
         time) and stop sending traffic.

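       To get a feel for how soon the second condition is reached, consider
       the following simplified model. It is only a sketch and does not
       describe resperf's internals: it assumes the server answers at most a
       fixed number of queries per second, that every query beyond that is
       dropped and keeps its ID until the traffic phase ends, and that IDs
       held by queries still awaiting a normal response are negligible. The
       function name and the capacity figure are hypothetical.

```python
import math


def traffic_phase_end(capacity_qps, max_qps=100_000, rampup=60.0,
                      max_outstanding=65_536):
    """Estimate when the traffic-sending phase ends, in seconds.

    During the ramp the query rate is slope * t. Once it passes
    capacity_qps, the excess is dropped, and the number of dropped (still
    outstanding) queries grows as slope * dt**2 / 2, where dt is the time
    since overload began.
    """
    slope = max_qps / rampup                 # increase in qps per second
    t_overload = capacity_qps / slope        # moment the rate passes capacity
    if t_overload >= rampup:
        return rampup                        # capacity never exceeded
    dt = math.sqrt(2 * max_outstanding / slope)
    return min(rampup, t_overload + dt)


# Hypothetical server that saturates at 50,000 qps, default resperf settings.
print(round(traffic_phase_end(50_000), 2))
```

       With the default ramp, this model predicts that the query IDs run out
       a little under 9 seconds after the server's capacity is first
       exceeded, which is consistent with the traffic phase usually being cut
       short.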
       Regardless of which of the above conditions caused the traffic-sending
       phase of the test to end, you should examine the resulting plots to
       make sure the server's response rate is flattening out toward the end
       of the test. If it is not, then you are not loading the server enough.
       If you are getting the "Fell behind" message, make sure that the
       machine running resperf is fast enough and has no other applications
       running.

       You should also monitor the CPU usage of the server under test. It
       should reach close to 100% CPU at the point of maximum traffic; if it
       does not, you most likely have a bottleneck in some other part of your
       test setup, for example, your external Internet connection.

       The report generated by resperf-report will be stored with a unique
       file name based on the current date and time, e.g., 20060812-1550.html.
       The PNG images of the plots and other auxiliary files will be stored in
       separate files beginning with the same date-time string. To view the
       report, simply open the .html file in a web browser.

       If you need to copy the report to a separate machine for viewing, make
       sure to copy the .png files along with the .html file (or simply copy
       all the files, e.g., using scp 20060812-1550.* host:directory/).

   Interpreting the report
       The .html file produced by resperf-report consists of two sections. The
       first section, "Resperf output", contains output from the resperf
       program such as progress messages, a summary of the command line
       arguments, and summary statistics. The second section, "Plots",
       contains two plots generated by gnuplot: "Query/response/failure rate"
       and "Latency".

       The "Query/response/failure rate" plot contains three graphs. The
       "Queries sent per second" graph shows the amount of traffic being sent
       to the server; this should be very close to a straight diagonal line,
       reflecting the linear ramp-up of traffic.

       The "Total responses received per second" graph shows how many of the
       queries received a response from the server. All responses are counted,
       whether successful (NOERROR or NXDOMAIN) or not (e.g., SERVFAIL).

       The "Failure responses received per second" graph shows how many of the
       queries received a failure response. A response is considered to be a
       failure if its RCODE is neither NOERROR nor NXDOMAIN.

       By visually inspecting the graphs, you can get an idea of how the
       server behaves under increasing load. The "Total responses received per
       second" graph will initially closely follow the "Queries sent per
       second" graph (often rendering it invisible in the plot as the two
       graphs are plotted on top of one another), but when the load exceeds
       the server's capacity, the "Total responses received per second" graph
       may diverge from the "Queries sent per second" graph and flatten out,
       indicating that some of the queries are being dropped.

       The "Failure responses received per second" graph will normally show a
       roughly linear ramp close to the bottom of the plot with some random
       fluctuation, since typical query traffic will contain some small
       percentage of failing queries randomly interspersed with the successful
       ones. As the total traffic increases, the number of failures will
       increase proportionally.

       If the "Failure responses received per second" graph turns sharply
       upwards, this can be another indication that the load has exceeded the
       server's capacity. This will happen if the server reacts to overload by
       sending SERVFAIL responses rather than by dropping queries. Since
       Nominum CacheServe and BIND 9 will both respond with SERVFAIL when they
       exceed their max-recursive-clients or recursive-clients limit,
       respectively, a sudden increase in the number of failures could mean
       that the limit needs to be increased.

       The "Latency" plot contains a single graph marked "Average latency".
       This shows how the latency varies during the course of the test.
       Typically, the latency graph will exhibit a downwards trend because the
       cache hit rate improves as ever more responses are cached during the
       test, and the latency for a cache hit is much smaller than for a cache
       miss. The latency graph is provided as an aid in determining the point
       where the server gets overloaded, which can be seen as a sharp upwards
       turn in the graph. The latency graph is not intended for making
       absolute latency measurements or comparisons between servers; the
       latencies shown in the graph are not representative of production
       latencies due to the initially empty cache and the deliberate
       overloading of the server towards the end of the test.

       Note that all measurements are displayed on the plot at the horizontal
       position corresponding to the point in time when the query was sent,
       not when the response (if any) was received. This makes it easy to
       compare the query and response rates; for example, if no queries are
       dropped, the query and response graphs will be identical. As another
       example, if the plot shows 10% failure responses at t=5 seconds, this
       means that 10% of the queries sent at t=5 seconds eventually failed,
       not that 10% of the responses received at t=5 seconds were failures.

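       The attribution of measurements to the send time can be illustrated
       with a small sketch. This is not resperf's implementation; the record
       layout and function name are assumptions made for the example. The
       RCODE values 0 (NOERROR), 2 (SERVFAIL) and 3 (NXDOMAIN) are the
       standard DNS ones.

```python
from collections import defaultdict

NOERROR, NXDOMAIN = 0, 3   # standard DNS RCODE values


def bucket_by_send_time(queries, interval=0.5):
    """Group per-query outcomes into intervals keyed by when the query was SENT.

    queries: iterable of (send_time_seconds, rcode), where rcode is None if
    no response was ever received. The arrival time of the response is
    deliberately not part of the record: it plays no role in the bucketing.
    """
    buckets = defaultdict(lambda: {"sent": 0, "responses": 0, "failures": 0})
    for send_time, rcode in queries:
        bucket = buckets[int(send_time // interval)]
        bucket["sent"] += 1
        if rcode is not None:
            bucket["responses"] += 1
            if rcode not in (NOERROR, NXDOMAIN):
                bucket["failures"] += 1
    return dict(buckets)


# Hypothetical outcomes: (send time, RCODE or None for a dropped query).
sample = [(5.1, 0), (5.2, 2), (5.3, None), (5.6, 3)]
for index, counts in sorted(bucket_by_send_time(sample).items()):
    print(index * 0.5, counts)
```

       A SERVFAIL that arrives many seconds late is still counted in the
       interval in which its query was sent.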
   Determining the server's maximum throughput
       Often, the goal of running resperf is to determine the server's maximum
       throughput, in other words, the number of queries per second it is
       capable of handling. This is not always an easy task, because as a
       server is driven into overload, the service it provides may deteriorate
       gradually, and this deterioration can manifest itself as queries being
       dropped, as an increase in the number of SERVFAIL responses, or as an
       increase in latency. The maximum throughput may be defined as the
       highest level of traffic at which the server still provides an
       acceptable level of service, but that means you first need to decide
       what an acceptable level of service means in terms of packet drop
       percentage, SERVFAIL percentage, and latency.

       The summary statistics in the "Resperf output" section of the report
       contain a "Maximum throughput" value, which by default is determined
       from the maximum rate at which the server was able to return responses,
       without regard to the number of queries being dropped or failing at
       that point. This method of throughput measurement has the advantage of
       simplicity, but it may or may not be appropriate for your needs; the
       reported value should always be validated by a visual inspection of the
       graphs to ensure that service has not already deteriorated unacceptably
       before the maximum response rate is reached. It may also be helpful to
       look at the "Lost at that point" value in the summary statistics; this
       indicates the percentage of the queries that were being dropped at the
       point in the test when the maximum throughput was reached.

       Alternatively, you can make resperf report the throughput at the point
       in the test where the percentage of queries dropped exceeds a given
       limit (or the maximum as above if the limit is never exceeded). This
       can be a more realistic indication of how much the server can be loaded
       while still providing an acceptable level of service. This is done
       using the -L command line option; for example, specifying -L 10 makes
       resperf report the highest throughput reached before the server starts
       dropping more than 10% of the queries.

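       One plausible reading of this rule, applied to per-interval
       measurements, is sketched below. It is an interpretation for
       illustration, not resperf's actual algorithm; the row layout follows
       the fields listed in THE PLOT DATA FILE, and the sample numbers are
       made up.

```python
def max_throughput(rows, max_loss_percent=100.0):
    """Highest response rate seen before query loss first exceeds the limit.

    rows: tuples of (time, target_qps, actual_qps, responses_per_sec,
    failures_per_sec, avg_latency), in time order.
    """
    best = 0.0
    for _time, _target, actual, responses, _failures, _latency in rows:
        if actual > 0:
            loss = 100.0 * (actual - responses) / actual
            if loss > max_loss_percent:
                break                      # limit exceeded: stop looking
        best = max(best, responses)
    return best


# Made-up measurements with loss of 0%, 5%, 20% and 37.5%.
rows = [
    (0.25, 1000, 1000, 1000, 10, 0.05),
    (0.75, 2000, 2000, 1900, 20, 0.06),
    (1.25, 3000, 3000, 2400, 30, 0.20),
    (1.75, 4000, 4000, 2500, 40, 0.50),
]
print(max_throughput(rows))                        # no loss limit
print(max_throughput(rows, max_loss_percent=10))   # comparable to -L 10
```

       In this made-up data the unconstrained maximum is 2500 responses per
       second, but with a 10% loss limit the reported value drops to 1900.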
       There is no corresponding way of automatically constraining results
       based on the number of failed queries, because unlike dropped queries,
       resolution failures will occur even when the server is not overloaded,
       and the number of such failures is heavily dependent on the query data
       and network conditions. Therefore, the plots should be manually
       inspected to ensure that there is not an abnormal number of failures.

GENERATING CONSTANT TRAFFIC

       In addition to ramping up traffic linearly, resperf also has the
       capability to send a constant stream of traffic. This can be useful
       when using resperf for tasks other than performance measurement; for
       example, it can be used to "soak test" a server by subjecting it to a
       sustained load for an extended period of time.

       To generate a constant traffic load, use the -c command line option,
       together with the -m option, which specifies the desired constant
       query rate. For example, to send 10000 queries per second for an hour,
       use -m 10000 -c 3600. This will include the usual 60-second gradual
       ramp-up of traffic at the beginning, which may be useful to avoid
       initially overwhelming a server that is starting with an empty cache.
       To start the onslaught of traffic instantly, use -m 10000 -c 3600 -r 0.

       To be precise, resperf will do a linear ramp-up of traffic from 0 to -m
       queries per second over a period of -r seconds, followed by a plateau
       of steady traffic at -m queries per second lasting for -c seconds,
       followed by waiting for responses for an extra 40 seconds. Either the
       ramp-up or the plateau can be suppressed by supplying a duration of
       zero seconds with -r 0 and -c 0, respectively. The latter is the
       default.

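       The schedule described above can be written as a piecewise function of
       time. The sketch below only restates that description; the function
       name is invented, and the parameters mirror the -m, -r and -c options
       with their documented defaults.

```python
def target_qps(t, max_qps=100_000, rampup=60.0, constant=0.0):
    """Scheduled query rate t seconds into the run.

    Linear ramp from 0 to max_qps over `rampup` seconds, then a plateau at
    max_qps for `constant` seconds, then no more queries (resperf only
    waits for responses after that).
    """
    if t < 0:
        return 0.0
    if t < rampup:
        return max_qps * t / rampup
    if t < rampup + constant:
        return float(max_qps)
    return 0.0


print(target_qps(30))                                   # halfway up the default ramp
print(target_qps(0, max_qps=10_000, rampup=0, constant=3600))    # -m 10000 -c 3600 -r 0
print(target_qps(1830, max_qps=10_000, rampup=60, constant=3600))  # on the plateau
```

       With -r 0 the first branch of the ramp is never taken, so the rate
       starts at the plateau value immediately.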
       Sending traffic at high rates for hours on end will of course require
       very large amounts of input data. Also, a long-running test will
       generate a large amount of plot data, which is kept in memory for the
       duration of the test. To reduce the memory usage and the size of the
       plot file, consider increasing the interval between measurements from
       the default of 0.5 seconds using the -i option in long-running tests.

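       Both quantities are easy to estimate in advance. The sketch below
       assumes one plot data point per interval of the traffic-sending phase,
       which is an assumption about the output rather than a documented
       guarantee; the function names are made up.

```python
import math


def queries_needed(max_qps, rampup, constant):
    """Queries sent over the ramp (a triangle) plus the plateau (a rectangle)."""
    return max_qps * rampup / 2 + max_qps * constant


def plot_points(rampup, constant, interval=0.5):
    """Data points produced, assuming one per interval while traffic is sent."""
    return math.ceil((rampup + constant) / interval)


# One-hour soak test: -m 10000 -c 3600 with the default 60-second ramp.
print(queries_needed(10_000, 60, 3600))
print(plot_points(60, 3600))               # default -i 0.5
print(plot_points(60, 3600, interval=10))  # with -i 10
```

       For this example, roughly 36.3 million queries are sent, and raising
       the interval from 0.5 to 10 seconds cuts the number of data points
       from 7320 to 366.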
       When using resperf for long-running tests, it is important that the
       traffic rate specified using the -m option is one that both resperf
       itself and the server under test can sustain. Otherwise, the test is
       likely to be cut short as a result of either running out of query IDs
       (because of large numbers of dropped queries) or of resperf falling
       behind its transmission schedule.

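       A back-of-the-envelope bound on the sustainable drop rate follows from
       the size of the query ID space and the request timeout (see the -t
       option). The sketch below is a simplified steady-state model, not a
       description of resperf's code: it assumes each dropped query holds its
       ID for the full timeout and ignores IDs held by queries that will be
       answered normally. The function name is hypothetical.

```python
def max_sustainable_drop_qps(max_outstanding=65_536, timeout=45.0):
    """Drop rate at which timed-out IDs are reclaimed as fast as they are lost.

    In steady state, the IDs tied up by dropped queries number
    drop_qps * timeout; this must stay below max_outstanding.
    """
    return max_outstanding / timeout


limit = max_sustainable_drop_qps()
print(round(limit, 1))                     # drops per second with defaults
print(round(100 * limit / 10_000, 1))      # as a percentage of -m 10000
print(max_sustainable_drop_qps(timeout=5)) # a shorter, hypothetical -t 5
```

       With the defaults, this model suggests that a sustained loss of more
       than about 1456 queries per second would eventually exhaust the query
       IDs.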

OPTIONS

       Because the resperf-report script passes its command line options
       directly to the resperf program, they both accept the same set of
       options, with one exception: resperf-report automatically adds an
       appropriate -P to the resperf command line, and therefore does not
       itself take a -P option.

       -d datafile
              Specifies the input data file. If not specified, resperf will
              read from standard input.

       -M mode
              Specifies the transport mode to use, "udp", "tcp" or "tls".
              Default is "udp".

       -s server_addr
              Specifies the name or address of the server to which requests
              will be sent. The default is the loopback address, 127.0.0.1.

       -p port
              Sets the port on which the DNS packets are sent. If not
              specified, the standard DNS port (udp/tcp 53, tls 853) is used.

       -a local_addr
              Specifies the local address from which to send requests. The
              default is the wildcard address.

       -x local_port
              Specifies the local port from which to send requests. The
              default is the wildcard port (0).

              If acting as multiple clients and the wildcard port is used,
              each client will use a different random port. If a port is
              specified, the clients will use a range of ports starting with
              the specified one.

       -t timeout
              Specifies the request timeout value, in seconds. resperf will no
              longer wait for a response to a particular request after this
              many seconds have elapsed. The default is 45 seconds.

              resperf times out unanswered requests in order to reclaim query
              IDs so that the query ID space will not be exhausted in a
              long-running test, such as when "soak testing" a server for a
              day with -m 10000 -c 86400. The timeouts and the ability to tune
              them are of little use in the more typical use case of a
              performance test lasting only a minute or two.

              The default timeout of 45 seconds was chosen to be longer than
              the query timeout of current caching servers. Note that this is
              longer than the corresponding default in dnsperf, because
              caching servers can take many orders of magnitude longer to
              answer a query than authoritative servers do.

              If a short timeout is used, there is a possibility that resperf
              will receive a response after the corresponding request has
              timed out; in this case, a message like Warning: Received a
              response with an unexpected id: 141 will be printed.

       -b bufsize
              Sets the size of the socket's send and receive buffers, in
              kilobytes. If not specified, the operating system's default is
              used.

       -f family
              Specifies the address family used for sending DNS packets. The
              possible values are "inet", "inet6", or "any". If "any" (the
              default value) is specified, resperf will use whichever address
              family is appropriate for the server it is sending packets to.

       -e
              Enables EDNS0 [RFC2671], by adding an OPT record to all packets
              sent.

       -D
              Sets the DO (DNSSEC OK) bit [RFC3225] in all packets sent. This
              also enables EDNS0, which is required for DNSSEC.

       -y [alg:]name:secret
              Add a TSIG record [RFC2845] to all packets sent, using the
              specified TSIG key algorithm, name and secret, where the
              algorithm defaults to hmac-md5 and the secret is expressed as a
              base-64 encoded string.

       -h
              Print a usage statement and exit.

       -i interval
              Specifies the time interval between data points in the plot
              file. The default is 0.5 seconds.

       -m max_qps
              Specifies the target maximum query rate (in queries per second).
              This should be higher than the expected maximum throughput of
              the server being tested. Traffic will be ramped up at a
              linearly increasing rate until this value is reached, or until
              one of the other conditions described in the section "Running
              the test" occurs. The default is 100000 queries per second.

       -P plot_data_file
              Specifies the name of the plot data file. The default is
              resperf.gnuplot.

       -r rampup_time
              Specifies the length of time over which traffic will be ramped
              up. The default is 60 seconds.

       -c constant_traffic_time
              Specifies the length of time for which traffic will be sent at a
              constant rate following the initial ramp-up. The default is 0
              seconds, meaning no sending of traffic at a constant rate will
              be done.

       -L max_loss
              Specifies the maximum acceptable query loss percentage for
              purposes of determining the maximum throughput value. The
              default is 100%, meaning that resperf will measure the maximum
              throughput without regard to query loss.

       -C clients
              Act as multiple clients. Requests are sent from multiple
              sockets. The default is to act as 1 client.

       -q max_outstanding
              Sets the maximum number of outstanding requests. resperf will
              stop ramping up traffic when this many queries are outstanding.
              The default is 64k, and the limit is 64k per client.

       -v
              Enables verbose mode to report about network readiness and
              congestion.

THE PLOT DATA FILE

       The plot data file is written by the resperf program and contains the
       data to be plotted using gnuplot. When running resperf via the
       resperf-report script, there is no need for the user to deal with this
       file directly, but its format and contents are documented here for
       completeness and in case you wish to run resperf directly and use its
       output for purposes other than viewing it with gnuplot.

       The first line of the file is a comment identifying the fields. It may
       be recognized as a comment by its leading hash sign (#).

       Subsequent lines contain the actual plot data. For purposes of
       generating the plot data file, the test run is divided into time
       intervals of 0.5 seconds (or some other length of time specified with
       the -i command line option). Each line corresponds to one such
       interval, and contains the following values as floating-point numbers:

       Time
              The midpoint of this time interval, in seconds since the
              beginning of the run

       Target queries per second
              The number of queries per second scheduled to be sent in this
              time interval

       Actual queries per second
              The number of queries per second actually sent in this time
              interval

       Responses per second
              The number of responses received corresponding to queries sent
              in this time interval, divided by the length of the interval

       Failures per second
              The number of responses received corresponding to queries sent
              in this time interval and having an RCODE other than NOERROR or
              NXDOMAIN, divided by the length of the interval

       Average latency
              The average time between sending the query and receiving a
              response, for queries sent in this time interval

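       For readers who want to process the plot data themselves, a minimal
       parser might look like the sketch below. It assumes the values on each
       line are separated by whitespace and appear in the order listed above;
       the field names, the sample header line and the sample numbers are
       invented for illustration. Any extra trailing columns are ignored.

```python
from typing import NamedTuple


class PlotRow(NamedTuple):
    time: float
    target_qps: float
    actual_qps: float
    responses_per_sec: float
    failures_per_sec: float
    avg_latency: float


def parse_plot_data(lines):
    """Parse plot data lines into PlotRow tuples, skipping '#' comments."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [float(value) for value in line.split()]
        rows.append(PlotRow(*fields[:6]))   # keep the six documented fields
    return rows


# Made-up sample in the assumed layout.
SAMPLE = """\
# time target_qps actual_qps responses_per_sec failures_per_sec avg_latency
0.250000 416.7 416.0 416.0 4.0 0.052
0.750000 1250.0 1250.0 1248.0 12.0 0.047
"""

for row in parse_plot_data(SAMPLE.splitlines()):
    print(row.time, row.actual_qps, row.responses_per_sec)
```

       To read a real file, pass an open file object instead of the sample,
       for example the default resperf.gnuplot written by resperf.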

AUTHOR

       Nominum, Inc.

       Maintained by DNS-OARC

              https://www.dns-oarc.net/

BUGS

       For issues and feature requests please use:

              https://github.com/DNS-OARC/dnsperf/issues

       For questions and help, please use:

              admin@dns-oarc.net

resperf                              2.3.2                          resperf(1)