resperf(1)                  General Commands Manual                 resperf(1)



NAME
       resperf - test the resolution performance of a caching DNS server

SYNOPSIS
       resperf-report [-a local_addr] [-d datafile] [-s server_addr] [-p port]
       [-x local_port] [-t timeout] [-b bufsize] [-f family] [-e] [-D]
       [-y [alg:]name:secret] [-h] [-i interval] [-m max_qps] [-r rampup_time]
       [-c constant_traffic_time] [-L max_loss] [-C clients]
       [-q max_outstanding]

       resperf [-a local_addr] [-d datafile] [-s server_addr] [-p port]
       [-x local_port] [-t timeout] [-b bufsize] [-f family] [-e] [-D]
       [-y [alg:]name:secret] [-h] [-i interval] [-m max_qps]
       [-P plot_data_file] [-r rampup_time] [-c constant_traffic_time]
       [-L max_loss] [-C clients] [-q max_outstanding]

DESCRIPTION
       resperf is a companion tool to dnsperf. dnsperf was primarily
       designed for benchmarking authoritative servers, and it does not work
       well with caching servers that are talking to the live Internet. One
       reason for this is that dnsperf uses a "self-pacing" approach, which
       is based on the assumption that you can keep the server 100% busy
       simply by sending it a small burst of back-to-back queries to fill up
       network buffers, and then sending a new query whenever you get a
       response back. This approach works well for authoritative servers
       that process queries in order and one at a time; it also works pretty
       well for a caching server in a closed laboratory environment talking
       to a simulated Internet that's all on the same LAN. Unfortunately, it
       does not work well with a caching server talking to the actual
       Internet, which may need to work on thousands of queries in parallel
       to achieve its maximum throughput. There have been numerous attempts
       to use dnsperf (or its predecessor, queryperf) for benchmarking live
       caching servers, usually with poor results. Therefore, a separate
       tool designed specifically for caching servers is needed.

   How resperf works
       Unlike the "self-pacing" approach of dnsperf, resperf works by
       sending DNS queries at a controlled, steadily increasing rate. By
       default, resperf will send traffic for 60 seconds, linearly
       increasing the amount of traffic from zero to 100,000 queries per
       second.

       During the test, resperf listens for responses from the server and
       keeps track of response rates, failure rates, and latencies. It will
       also continue listening for responses for an additional 40 seconds
       after it has stopped sending traffic, so that there is time for the
       server to respond to the last queries sent. This time period was
       chosen to be longer than the overall query timeout of both Nominum
       CacheServe and current versions of BIND.

       If the test is successful, the query rate will at some point exceed
       the capacity of the server and queries will be dropped, causing the
       response rate to stop growing or even decrease as the query rate
       increases.

       The result of the test is a set of measurements of the query rate,
       response rate, failure response rate, and average query latency as
       functions of time.

   What you will need
       Benchmarking a live caching server is serious business. A fast
       caching server like Nominum CacheServe, resolving a mix of cacheable
       and non-cacheable queries typical of ISP customer traffic, is capable
       of resolving well over 1,000,000 queries per second. In the process,
       it will send more than 40,000 queries per second to authoritative
       servers on the Internet, and receive responses to most of them.
       Assuming an average request size of 50 bytes and a response size of
       150 bytes, this amounts to some 1216 Mbps of outgoing and 448 Mbps of
       incoming traffic. If your Internet connection can't handle the
       bandwidth, you will end up measuring the speed of the connection, not
       the server, and may saturate the connection, causing a degradation in
       service for other users.

       Make sure there is no stateful firewall between the server and the
       Internet, because most of them can't handle the amount of UDP traffic
       the test will generate and will end up dropping packets, skewing the
       test results. Some will even lock up or crash.

       You should run resperf on a machine separate from the server under
       test, on the same LAN. Preferably, this should be a Gigabit Ethernet
       network. The machine running resperf should be at least as fast as
       the machine being tested; otherwise, it may end up being the
       bottleneck.

       There should be no other applications running on the machine running
       resperf. Performance testing at the traffic levels involved is
       essentially a hard real-time application - consider the fact that at
       a query rate of 100,000 queries per second, if resperf gets delayed
       by just 1/100 of a second, 1000 incoming UDP packets will arrive in
       the meantime. This is more than most operating systems will buffer,
       which means packets will be dropped.

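       To reduce the risk of such drops, you can enlarge the socket buffers
       with the -b option and, where necessary, raise the operating system's
       socket buffer limits first. A minimal sketch (the sysctl names are
       Linux-specific and the sizes are illustrative only):

              # Raise the kernel's maximum socket buffer sizes (Linux)
              sysctl -w net.core.rmem_max=16777216
              sysctl -w net.core.wmem_max=16777216
              # Ask resperf for 4096 kB send and receive buffers
              resperf-report -s 10.0.0.2 -d queryfile -b 4096
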
       Because the granularity of the timers provided by operating systems
       is typically too coarse to accurately schedule packet transmissions
       at sub-millisecond intervals, resperf will busy-wait between packet
       transmissions, constantly polling for responses in the meantime.
       Therefore, it is normal for resperf to consume 100% CPU during the
       whole test run, even during periods where query rates are relatively
       low.

       You will also need a set of test queries in the dnsperf file format.
       See the dnsperf man page for instructions on how to construct this
       query file. To make the test as realistic as possible, the queries
       should be derived from recorded production client DNS traffic,
       without removing duplicate queries or other filtering. With the
       default settings, resperf will use up to 3 million queries in each
       test run.

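       A dnsperf-format query file is a plain text file with one query per
       line, each line giving a domain name and a query type; see dnsperf(1)
       for the full format. A minimal example (the names are placeholders):

              www.example.com A
              example.com MX
              mail.example.net AAAA
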
       If the caching server to be tested has a configurable limit on the
       number of simultaneous resolutions, like the max-recursive-clients
       statement in Nominum CacheServe or the recursive-clients option in
       BIND 9, you will probably have to increase it. As a starting point,
       we recommend a value of 10000 for Nominum CacheServe and 100000 for
       BIND 9. Should the limit be reached, it will show up in the plots as
       an increase in the number of failure responses.

       The server being tested should be restarted at the beginning of each
       test to make sure it is starting with an empty cache. If the cache
       already contains data from a previous test run that used the same set
       of queries, almost all queries will be answered from the cache,
       yielding inflated performance numbers.

       To use the resperf-report script, you need to have gnuplot installed.
       Make sure your installed version of gnuplot supports the png terminal
       driver. If your gnuplot doesn't support png but does support gif, you
       can change the line saying terminal=png in the resperf-report script
       to terminal=gif.

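       For example, you could check the available terminals and, if
       necessary, patch the script as follows (a sketch; the script's
       install path and the GNU sed -i option are assumptions about your
       system):

              # Check whether the installed gnuplot knows the png terminal
              gnuplot -e "set terminal" 2>&1 | grep -w png
              # Fall back to gif output if png is unavailable
              sed -i 's/terminal=png/terminal=gif/' /usr/local/bin/resperf-report
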
   Running the test
       Resperf is typically invoked via the resperf-report script, which
       will run resperf with its output redirected to a file and then
       automatically generate an illustrated report in HTML format. Command
       line arguments given to resperf-report will be passed on unchanged to
       resperf.

       When running resperf-report, you will need to specify at least the
       server IP address and the query data file. A typical invocation will
       look like

              resperf-report -s 10.0.0.2 -d queryfile

       With default settings, the test run will take at most 100 seconds (60
       seconds of ramping up traffic and then 40 seconds of waiting for
       responses), but in practice, the 60-second traffic phase will usually
       be cut short. To be precise, resperf can transition from the
       traffic-sending phase to the waiting-for-responses phase in three
       different ways:

       · Running for the full allotted time and successfully reaching the
         maximum query rate (by default, 60 seconds and 100,000 qps,
         respectively). Since this is a very high query rate, this will
         rarely happen (with today's hardware); one of the other two
         conditions listed below will usually occur first.

       · Exceeding 65,536 outstanding queries. This often happens as a
         result of (successfully) exceeding the capacity of the server being
         tested, causing the excess queries to be dropped. The limit of
         65,536 queries comes from the number of possible values for the ID
         field in the DNS packet. Resperf needs to allocate a unique ID for
         each outstanding query, and is therefore unable to send further
         queries if the set of possible IDs is exhausted.

       · Finding itself unable to send queries fast enough. Resperf will
         notice if it is falling behind in its scheduled query
         transmissions, and if this backlog reaches 1000 queries, it will
         print a message like "Fell behind by 1000 queries" (or whatever the
         actual number is at the time) and stop sending traffic.

       Regardless of which of the above conditions caused the
       traffic-sending phase of the test to end, you should examine the
       resulting plots to make sure the server's response rate is flattening
       out toward the end of the test. If it is not, then you are not
       loading the server enough. If you are getting the "Fell behind"
       message, make sure that the machine running resperf is fast enough
       and has no other applications running.

       You should also monitor the CPU usage of the server under test. It
       should reach close to 100% CPU at the point of maximum traffic; if it
       does not, you most likely have a bottleneck in some other part of
       your test setup, for example, your external Internet connection.

       The report generated by resperf-report will be stored with a unique
       file name based on the current date and time, e.g.,
       20060812-1550.html. The PNG images of the plots and other auxiliary
       files will be stored in separate files beginning with the same
       date-time string. To view the report, simply open the .html file in a
       web browser.

       If you need to copy the report to a separate machine for viewing,
       make sure to copy the .png files along with the .html file (or simply
       copy all the files, e.g., using scp 20060812-1550.* host:directory/).

   Interpreting the report
       The .html file produced by resperf-report consists of two sections.
       The first section, "Resperf output", contains output from the resperf
       program such as progress messages, a summary of the command line
       arguments, and summary statistics. The second section, "Plots",
       contains two plots generated by gnuplot: "Query/response/failure
       rate" and "Latency".

       The "Query/response/failure rate" plot contains three graphs. The
       "Queries sent per second" graph shows the amount of traffic being
       sent to the server; this should be very close to a straight diagonal
       line, reflecting the linear ramp-up of traffic.

       The "Total responses received per second" graph shows how many of the
       queries received a response from the server. All responses are
       counted, whether successful (NOERROR or NXDOMAIN) or not (e.g.,
       SERVFAIL).

       The "Failure responses received per second" graph shows how many of
       the queries received a failure response. A response is considered to
       be a failure if its RCODE is neither NOERROR nor NXDOMAIN.

       By visually inspecting the graphs, you can get an idea of how the
       server behaves under increasing load. The "Total responses received
       per second" graph will initially closely follow the "Queries sent per
       second" graph (often rendering it invisible in the plot as the two
       graphs are plotted on top of one another), but when the load exceeds
       the server's capacity, the "Total responses received per second"
       graph may diverge from the "Queries sent per second" graph and
       flatten out, indicating that some of the queries are being dropped.

       The "Failure responses received per second" graph will normally show
       a roughly linear ramp close to the bottom of the plot with some
       random fluctuation, since typical query traffic will contain some
       small percentage of failing queries randomly interspersed with the
       successful ones. As the total traffic increases, the number of
       failures will increase proportionally.

       If the "Failure responses received per second" graph turns sharply
       upwards, this can be another indication that the load has exceeded
       the server's capacity. This will happen if the server reacts to
       overload by sending SERVFAIL responses rather than by dropping
       queries. Since Nominum CacheServe and BIND 9 will both respond with
       SERVFAIL when they exceed their max-recursive-clients or
       recursive-clients limit, respectively, a sudden increase in the
       number of failures could mean that the limit needs to be increased.

       The "Latency" plot contains a single graph marked "Average latency".
       This shows how the latency varies during the course of the test.
       Typically, the latency graph will exhibit a downwards trend because
       the cache hit rate improves as ever more responses are cached during
       the test, and the latency for a cache hit is much smaller than for a
       cache miss. The latency graph is provided as an aid in determining
       the point where the server gets overloaded, which can be seen as a
       sharp upwards turn in the graph. The latency graph is not intended
       for making absolute latency measurements or comparisons between
       servers; the latencies shown in the graph are not representative of
       production latencies due to the initially empty cache and the
       deliberate overloading of the server towards the end of the test.

       Note that all measurements are displayed on the plot at the
       horizontal position corresponding to the point in time when the query
       was sent, not when the response (if any) was received. This makes it
       easy to compare the query and response rates; for example, if no
       queries are dropped, the query and response graphs will be identical.
       As another example, if the plot shows 10% failure responses at t=5
       seconds, this means that 10% of the queries sent at t=5 seconds
       eventually failed, not that 10% of the responses received at t=5
       seconds were failures.

   Determining the server's maximum throughput
       Often, the goal of running resperf is to determine the server's
       maximum throughput, in other words, the number of queries per second
       it is capable of handling. This is not always an easy task, because
       as a server is driven into overload, the service it provides may
       deteriorate gradually, and this deterioration can manifest itself as
       queries being dropped, as an increase in the number of SERVFAIL
       responses, or as an increase in latency. The maximum throughput may
       be defined as the highest level of traffic at which the server still
       provides an acceptable level of service, but that means you first
       need to decide what an acceptable level of service means in terms of
       packet drop percentage, SERVFAIL percentage, and latency.

       The summary statistics in the "Resperf output" section of the report
       contain a "Maximum throughput" value which by default is determined
       from the maximum rate at which the server was able to return
       responses, without regard to the number of queries being dropped or
       failing at that point. This method of throughput measurement has the
       advantage of simplicity, but it may or may not be appropriate for
       your needs; the reported value should always be validated by a visual
       inspection of the graphs to ensure that service has not already
       deteriorated unacceptably before the maximum response rate is
       reached. It may also be helpful to look at the "Lost at that point"
       value in the summary statistics; this indicates the percentage of the
       queries that was being dropped at the point in the test when the
       maximum throughput was reached.

       Alternatively, you can make resperf report the throughput at the
       point in the test where the percentage of queries dropped exceeds a
       given limit (or the maximum as above if the limit is never exceeded).
       This can be a more realistic indication of how much the server can be
       loaded while still providing an acceptable level of service. This is
       done using the -L command line option; for example, specifying -L 10
       makes resperf report the highest throughput reached before the server
       starts dropping more than 10% of the queries.

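       For example, reusing the placeholder server address and query file
       from the earlier example:

              resperf-report -s 10.0.0.2 -d queryfile -L 10
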
       There is no corresponding way of automatically constraining results
       based on the number of failed queries, because unlike dropped
       queries, resolution failures will occur even when the server is not
       overloaded, and the number of such failures is heavily dependent on
       the query data and network conditions. Therefore, the plots should be
       manually inspected to ensure that there is not an abnormal number of
       failures.

GENERATING CONSTANT TRAFFIC
       In addition to ramping up traffic linearly, resperf also has the
       capability to send a constant stream of traffic. This can be useful
       when using resperf for tasks other than performance measurement; for
       example, it can be used to "soak test" a server by subjecting it to a
       sustained load for an extended period of time.

       To generate a constant traffic load, use the -c command line option,
       together with the -m option which specifies the desired constant
       query rate. For example, to send 10000 queries per second for an
       hour, use -m 10000 -c 3600. This will include the usual 60-second
       gradual ramp-up of traffic at the beginning, which may be useful to
       avoid initially overwhelming a server that is starting with an empty
       cache. To start the onslaught of traffic instantly, use -m 10000 -c
       3600 -r 0.

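       Putting it together, a one-hour soak test might be started like this
       (again with the placeholder server address and query file):

              resperf-report -s 10.0.0.2 -d queryfile -m 10000 -c 3600
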
       To be precise, resperf will do a linear ramp-up of traffic from 0 to
       -m queries per second over a period of -r seconds, followed by a
       plateau of steady traffic at -m queries per second lasting for -c
       seconds, followed by waiting for responses for an extra 40 seconds.
       Either the ramp-up or the plateau can be suppressed by supplying a
       duration of zero seconds with -r 0 and -c 0, respectively. The latter
       is the default.

       Sending traffic at high rates for hours on end will of course require
       very large amounts of input data. Also, a long-running test will
       generate a large amount of plot data, which is kept in memory for the
       duration of the test. To reduce the memory usage and the size of the
       plot file, consider increasing the interval between measurements from
       the default of 0.5 seconds using the -i option in long-running tests.

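       For example, a day-long soak test recording one data point every 5
       seconds (the values are illustrative):

              resperf-report -s 10.0.0.2 -d queryfile -m 10000 -c 86400 -i 5
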
       When using resperf for long-running tests, it is important that the
       traffic rate specified using the -m option is one that both resperf
       itself and the server under test can sustain. Otherwise, the test is
       likely to be cut short as a result of either running out of query IDs
       (because of large numbers of dropped queries) or of resperf falling
       behind its transmission schedule.

OPTIONS
       Because the resperf-report script passes its command line options
       directly to the resperf program, they both accept the same set of
       options, with one exception: resperf-report automatically adds an
       appropriate -P to the resperf command line, and therefore does not
       itself take a -P option.

       -d datafile
              Specifies the input data file. If not specified, resperf will
              read from standard input.

       -s server_addr
              Specifies the name or address of the server to which requests
              will be sent. The default is the loopback address, 127.0.0.1.

       -p port
              Sets the port on which the DNS packets are sent. If not
              specified, the standard DNS port (53) is used.

       -a local_addr
              Specifies the local address from which to send requests. The
              default is the wildcard address.

       -x local_port
              Specifies the local port from which to send requests. The
              default is the wildcard port (0).

              If acting as multiple clients and the wildcard port is used,
              each client will use a different random port. If a port is
              specified, the clients will use a range of ports starting with
              the specified one.

       -t timeout
              Specifies the request timeout value, in seconds. resperf will
              no longer wait for a response to a particular request after
              this many seconds have elapsed. The default is 45 seconds.

              resperf times out unanswered requests in order to reclaim
              query IDs so that the query ID space will not be exhausted in
              a long-running test, such as when "soak testing" a server for
              a day with -m 10000 -c 86400. The timeouts and the ability to
              tune them are of little use in the more typical use case of a
              performance test lasting only a minute or two.

              The default timeout of 45 seconds was chosen to be longer than
              the query timeout of current caching servers. Note that this
              is longer than the corresponding default in dnsperf, because
              caching servers can take many orders of magnitude longer to
              answer a query than authoritative servers do.

              If a short timeout is used, there is a possibility that
              resperf will receive a response after the corresponding
              request has timed out; in this case, a message like "Warning:
              Received a response with an unexpected id: 141" will be
              printed.

       -b bufsize
              Sets the size of the socket's send and receive buffers, in
              kilobytes. If not specified, the operating system's default is
              used.

       -f family
              Specifies the address family used for sending DNS packets. The
              possible values are "inet", "inet6", or "any". If "any" (the
              default value) is specified, resperf will use whichever
              address family is appropriate for the server it is sending
              packets to.

       -e
              Enables EDNS0 [RFC2671], by adding an OPT record to all
              packets sent.

       -D
              Sets the DO (DNSSEC OK) bit [RFC3225] in all packets sent.
              This also enables EDNS0, which is required for DNSSEC.

       -y [alg:]name:secret
              Adds a TSIG record [RFC2845] to all packets sent, using the
              specified TSIG key algorithm, name and secret, where the
              algorithm defaults to hmac-md5 and the secret is expressed as
              a base-64 encoded string.

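              For example (the key name and secret below are illustrative
              placeholders, and the hmac-sha256 algorithm is assumed to be
              supported by your build):

                     resperf -s 10.0.0.2 -d queryfile -y hmac-sha256:test.key:c2VjcmV0
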
       -h
              Print a usage statement and exit.

       -i interval
              Specifies the time interval between data points in the plot
              file. The default is 0.5 seconds.

       -m max_qps
              Specifies the target maximum query rate (in queries per
              second). This should be higher than the expected maximum
              throughput of the server being tested. Traffic will be ramped
              up at a linearly increasing rate until this value is reached,
              or until one of the other conditions described in the section
              "Running the test" occurs. The default is 100000 queries per
              second.

       -P plot_data_file
              Specifies the name of the plot data file. The default is
              resperf.gnuplot.

       -r rampup_time
              Specifies the length of time over which traffic will be ramped
              up. The default is 60 seconds.

       -c constant_traffic_time
              Specifies the length of time for which traffic will be sent at
              a constant rate following the initial ramp-up. The default is
              0 seconds, meaning that no traffic will be sent at a constant
              rate.

       -L max_loss
              Specifies the maximum acceptable query loss percentage for
              purposes of determining the maximum throughput value. The
              default is 100%, meaning that resperf will measure the maximum
              throughput without regard to query loss.

       -C clients
              Act as multiple clients. Requests are sent from multiple
              sockets. The default is to act as 1 client.

       -q max_outstanding
              Sets the maximum number of outstanding requests. resperf will
              stop ramping up traffic when this many queries are
              outstanding. The default is 64k, and the limit is 64k per
              client.

THE PLOT DATA FILE
       The plot data file is written by the resperf program and contains the
       data to be plotted using gnuplot. When running resperf via the
       resperf-report script, there is no need for the user to deal with
       this file directly, but its format and contents are documented here
       for completeness and in case you wish to run resperf directly and use
       its output for purposes other than viewing it with gnuplot.

       The first line of the file is a comment identifying the fields. It
       may be recognized as a comment by its leading hash sign (#).

       Subsequent lines contain the actual plot data. For purposes of
       generating the plot data file, the test run is divided into time
       intervals of 0.5 seconds (or some other length of time specified with
       the -i command line option). Each line corresponds to one such
       interval, and contains the following values as floating-point
       numbers:

       Time
              The midpoint of this time interval, in seconds since the
              beginning of the run

       Target queries per second
              The number of queries per second scheduled to be sent in this
              time interval

       Actual queries per second
              The number of queries per second actually sent in this time
              interval

       Responses per second
              The number of responses received corresponding to queries sent
              in this time interval, divided by the length of the interval

       Failures per second
              The number of responses received corresponding to queries sent
              in this time interval and having an RCODE other than NOERROR
              or NXDOMAIN, divided by the length of the interval

       Average latency
              The average time between sending the query and receiving a
              response, for queries sent in this time interval
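
       As a sketch of how the plot data file might be consumed directly,
       the following plots query and response rates with gnuplot (the
       column numbers follow the field order above, and the file name
       assumes the default -P value):

              gnuplot -persist -e "plot 'resperf.gnuplot' using 1:3 \
                  with lines title 'Queries sent/s', \
                  'resperf.gnuplot' using 1:4 with lines title 'Responses/s'"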

SEE ALSO
       dnsperf(1)

AUTHOR
       Nominum, Inc.

       Maintained by DNS-OARC

              https://www.dns-oarc.net/

BUGS
       For issues and feature requests please use:

              https://github.com/DNS-OARC/dnsperf/issues

       For questions and help please use:

              admin@dns-oarc.net

resperf                              2.2.1                          resperf(1)