RESPERF(1)                                                          RESPERF(1)

NAME

       resperf - test the resolution performance of a caching DNS server

SYNOPSIS

       resperf [ -d datafile ] [ -s server_addr ] [ -p port ] [ -b bufsize ]
               [ -f family ] [ -e ] [ -D ] [ -y name:secret ] [ -A ] [ -h ]
               [ -i interval ] [ -m max_qps ] [ -P plot_data_file ]
               [ -r rampup_time ] [ -L max_loss ]

DESCRIPTION

       resperf is a companion tool to dnsperf. dnsperf was primarily designed
       for benchmarking authoritative servers, and it does not work well with
       caching servers that are talking to the live Internet. One reason for
       this is that dnsperf uses a "self-pacing" approach, which is based on
       the assumption that you can keep the server 100% busy simply by
       sending it a small burst of back-to-back queries to fill up network
       buffers, and then sending a new query whenever you get a response
       back. This approach works well for authoritative servers that process
       queries in order and one at a time; it also works pretty well for a
       caching server in a closed laboratory environment talking to a
       simulated Internet that's all on the same LAN. Unfortunately, it does
       not work well with a caching server talking to the actual Internet,
       which may need to work on thousands of queries in parallel to achieve
       its maximum throughput. There have been numerous attempts to use
       dnsperf (or its predecessor, queryperf) for benchmarking live caching
       servers, usually with poor results. Therefore, a separate tool
       designed specifically for caching servers is needed.

   HOW RESPERF WORKS
       Unlike the "self-pacing" approach of dnsperf, resperf works by sending
       DNS queries at a controlled, steadily increasing rate. By default,
       resperf will send traffic for 60 seconds, linearly increasing the
       amount of traffic from zero to 100,000 queries per second.

       During the test, resperf listens for responses from the server and
       keeps track of response rates, failure rates, and latencies. It will
       also continue listening for responses for an additional 40 seconds
       after it has stopped sending traffic, so that there is time for the
       server to respond to the last queries sent. This time period was
       chosen to be longer than the overall query timeout of both Nominum CNS
       and current versions of BIND.

       If the test is successful, the query rate will at some point exceed
       the capacity of the server and queries will be dropped, causing the
       response rate to stop growing or even decrease as the query rate
       increases.

       The result of the test is a set of measurements of the query rate,
       response rate, failure response rate, and average query latency as
       functions of time. These are written to a file in a tabular format for
       plotting using gnuplot or some similar plotting tool. The server's
       maximum throughput can be determined from the plot either as the
       highest response rate on the plot, or alternatively as the response
       rate at the point where a significant number of queries begin to be
       dropped.

   WHAT YOU WILL NEED
       Benchmarking a live caching server is serious business. A fast caching
       server like Nominum CNS running on an Opteron server, resolving a mix
       of cacheable and non-cacheable queries typical of ISP customer
       traffic, is capable of resolving more than 50,000 queries per second.
       In the process, it will send more than 20,000 queries per second to
       authoritative servers on the Internet, and receive responses to most
       of them. Assuming an average request size of 50 bytes and a response
       size of 100 bytes, this amounts to some 8 Mbps of outgoing and 16 Mbps
       of incoming traffic. If your Internet connection can't handle the
       bandwidth, you will end up measuring the speed of the connection, not
       the server, and may saturate the connection causing a degradation in
       service for other users.

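       As a quick sanity check, the bandwidth figures follow directly from
       the numbers above:

              20,000 queries/s   x  50 bytes  x  8 bits/byte  =  8 Mbps out
              20,000 responses/s x 100 bytes  x  8 bits/byte  = 16 Mbps in
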
       Make sure there is no stateful firewall between the server and the
       Internet, because most of them can't handle the amount of UDP traffic
       the test will generate and will end up dropping packets, skewing the
       test results. Some will even lock up or crash.

       You should run resperf on a machine separate from the server under
       test, on the same LAN. Preferably, this should be a Gigabit Ethernet
       network. The machine running resperf should be at least as fast as the
       machine being tested; otherwise, it may end up being the bottleneck.

       There should be no other applications running on the machine running
       resperf. Performance testing at the traffic levels involved is
       essentially a hard real-time application - consider the fact that at a
       query rate of 100,000 queries per second, if resperf gets delayed by
       just 1/100 of a second, 1000 incoming UDP packets will arrive in the
       meantime, which is more than most operating systems will buffer.

       Because the granularity of the timers provided by operating systems is
       typically too coarse to accurately schedule packet transmissions at
       sub-millisecond intervals, resperf will busy-wait between packet
       transmissions, constantly polling for responses. Therefore, it is
       normal for resperf to consume 100% CPU during the whole test run, even
       during periods where query rates are relatively low.

       You will also need a set of test queries in the dnsperf file format.
       See the dnsperf man page for instructions on how to construct this
       query file. To make the test as realistic as possible, the queries
       should be derived from recorded production client DNS traffic, without
       removing duplicate queries or other filtering. With the default
       settings, resperf will use up to 3 million queries in each test run.

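       For reference, each line of a dnsperf-format query file contains a
       domain name followed by a query type. A few illustrative entries
       (hypothetical names, with a duplicate deliberately left in) might look
       like:

              www.example.com A
              example.net MX
              mail.example.org AAAA
              www.example.com A
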
       If the caching server to be tested has a configurable limit on the
       number of simultaneous resolutions, like the "max-recursive-clients"
       statement in Nominum CNS or the "recursive-clients" option in BIND 9,
       you will probably have to increase it. As a starting point, we
       recommend a value of 10000 for Nominum CNS and 100000 for BIND 9.
       Should the limit be reached, it will show up in the plots as an
       increase in the number of failure responses.

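       For BIND 9, a minimal sketch of the relevant named.conf fragment
       (using the starting value suggested above) could look like this:

              options {
                      // Raise the limit on concurrent recursive lookups so
                      // that it does not distort the benchmark results.
                      recursive-clients 100000;
              };
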
       For maximum realism, you could "prime" the cache of the server by
       having it resolve typical query traffic for some period of time before
       the test, so that the cache is not empty when the test starts.
       However, experience has shown that this is not really necessary, as
       the server will reach a reasonable cache hit rate (70% or more) during
       the test even when starting with an empty cache. If you do prime the
       cache, make sure not to use the same set of queries as in the actual
       test, since that would make the server answer almost all queries from
       the cache and yield inflated performance numbers.

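       If you do want to prime the cache, one possible approach (assuming
       dnsperf's -l option to limit the run time, and a warm-up query file
       distinct from the file used in the actual test) is to replay the
       warm-up traffic for a few minutes beforehand:

              dnsperf -s 10.0.0.2 -d warmup-queryfile -l 300
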
   RUNNING THE TEST
       When running resperf, you will need to specify at least the server IP
       address and the query data file. A typical invocation will look like

              resperf -s 10.0.0.2 -d queryfile

       With default settings, the test run will take at most 100 seconds (60
       seconds of ramping up traffic and then 40 seconds of waiting for
       responses), but in practice, the 60-second traffic phase will usually
       be cut short. To be precise, resperf can transition from the traffic-
       sending phase to the waiting-for-responses phase in three different
       ways:

       · Running for the full allotted time and successfully reaching the
         maximum query rate (by default, 60 seconds and 100,000 qps,
         respectively). Since this is a very high query rate, this will
         rarely happen (with today's hardware); one of the other two
         conditions listed below will usually occur first.

       · Exceeding 65,536 outstanding queries. This often happens as a result
         of (successfully) exceeding the capacity of the server being tested,
         causing the excess queries to be dropped. The limit of 65,536
         queries comes from the number of possible values for the ID field in
         the DNS packet. Resperf needs to allocate a unique ID for each
         outstanding query, and is therefore unable to send further queries
         if the set of possible IDs is exhausted.

       · Finding itself unable to send queries fast enough. Resperf will
         notice if it is falling behind in its scheduled query transmissions,
         and if this backlog reaches 1000 queries, it will print a message
         like "Fell behind by 1000 queries" (or whatever the actual number is
         at the time) and stop sending traffic.

       Regardless of which of the above conditions caused the traffic-sending
       phase of the test to end, you should examine the resulting plots to
       make sure the server's response rate is flattening out toward the end
       of the test. If it is not, then you are not loading the server enough.
       If you are getting the "Fell behind" message, make sure that the
       machine running resperf is fast enough and has no other applications
       running.

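       In particular, if the traffic phase runs for the full allotted time
       without the response rate leveling off, you can increase the load by
       raising the target query rate and, if desired, lengthening the ramp-up
       period, for example:

              resperf -s 10.0.0.2 -d queryfile -m 200000 -r 120
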
       You should also monitor the CPU usage of the server under test. It
       should reach close to 100% CPU at the point of maximum traffic; if it
       does not, you most likely have a bottleneck in some other part of your
       test setup, for example, your external Internet connection.

       As resperf runs, some status messages and summary statistics will be
       written to standard output, and the table of plot data is written to
       the file resperf.gnuplot in the current directory (or some other file
       name given with the -P command line option).

   THE PLOT DATA FILE
       For purposes of generating the plot data file, the test run is divided
       into time intervals of 0.5 seconds (or some other length of time
       specified with the -i command line option). Each line in the plot data
       file corresponds to one such interval, and contains the following
       values as floating-point numbers:

       Time   The midpoint of this time interval, in seconds since the
              beginning of the run

       Target queries per second
              The number of queries per second scheduled to be sent in this
              time interval

       Actual queries per second
              The number of queries per second actually sent in this time
              interval

       Responses per second
              The number of responses received corresponding to queries sent
              in this time interval, divided by the length of the interval

       Failures per second
              The number of responses received corresponding to queries sent
              in this time interval and having an RCODE other than NOERROR or
              NXDOMAIN, divided by the length of the interval

       Average latency
              The average time between sending the query and receiving a
              response, for queries sent in this time interval

       Note that the measurements for any given query are always applied to
       the time interval when the query was sent, not the one when the
       response (if any) was received. This makes it easy to compare the
       query and response rates; for example, if no queries are dropped, the
       query and response curves will be identical. As another example, if
       the plot shows 10% failure responses at t=5 seconds, this means that
       10% of the queries sent at t=5 seconds eventually failed, not that 10%
       of the responses received at t=5 seconds were failures.

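       If you want to plot the data by hand rather than using the resperf-
       report script described in the next section, a minimal gnuplot script
       along the following lines should work. The column numbers refer to the
       fields listed above, the input file name assumes the default -P
       setting, and the png terminal is assumed to be available in your
       gnuplot build:

              # Plot the actual query rate, response rate, and failure rate
              # (columns 3, 4, and 5) against time (column 1).
              set terminal png
              set output "resperf.png"
              set xlabel "Time (seconds)"
              set ylabel "Rate (per second)"
              plot "resperf.gnuplot" using 1:3 title "Queries sent" with lines, \
                   "resperf.gnuplot" using 1:4 title "Responses" with lines, \
                   "resperf.gnuplot" using 1:5 title "Failures" with lines
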
   PLOTTING THE RESULTS
       Resperf comes with a shell script, resperf-report, which will run
       resperf with its output redirected to a file and then automatically
       generate an illustrated report in HTML format. Command line arguments
       given to resperf-report will be passed on unchanged to resperf.

       You need to have gnuplot installed, because resperf-report uses it to
       generate the plots. Make sure your version of gnuplot supports the
       "gif" terminal driver.

       The report will be stored with a unique file name based on the current
       date and time, e.g., 20060812-1550.html. The GIF images of the plots
       and other auxiliary files will be stored in separate files beginning
       with the same date-time string. If you need to copy the report to a
       separate machine for viewing, make sure to copy the .gif files along
       with the .html file (or simply copy all the files, e.g., using scp
       20060812-1550.* host:directory/).

       For example, to benchmark a server running on 10.0.0.2, you could run

              resperf-report -s 10.0.0.2 -d queryfile

       and then open the resulting .html file in a web browser.

   INTERPRETING THE RESULTS
       The summary statistics printed on standard output at the end of the
       test include the server's measured maximum throughput. By default,
       this is simply the highest point on the response rate plot, without
       regard to the number of queries being dropped or failing at that
       point.

       You can also make resperf report the throughput at the point in the
       test where the percentage of queries dropped exceeds a given limit (or
       the maximum as above if the limit is never exceeded). This can be a
       more realistic indication of how much the server can be loaded while
       still providing an acceptable level of service. This is done using the
       -L command line option; for example, specifying -L 10 makes resperf
       report the highest throughput reached before the server starts
       dropping more than 10% of the queries.

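       For example, to generate a report of the highest throughput reached
       before more than 10% of the queries are dropped:

              resperf-report -s 10.0.0.2 -d queryfile -L 10
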
       There is no corresponding way of automatically constraining results
       based on the number of failed queries, because unlike dropped queries,
       resolution failures will occur even when the server is not overloaded,
       and the number of such failures is heavily dependent on the query data
       and network conditions. Therefore, the plots should be manually
       inspected to ensure that there is not an abnormal number of failures.

OPTIONS

       -d datafile
              Specifies the input data file. If not specified, resperf will
              read from standard input.

       -s server_addr
              Specifies the name or address of the server to which requests
              will be sent. The default is the loopback address, 127.0.0.1.

       -p port
              Sets the port on which the DNS packets are sent. If not
              specified, the standard DNS port (53) is used.

       -b bufsize
              Sets the size of the socket's send and receive buffers, in
              kilobytes. If not specified, the default value is 32k.

       -f family
              Specifies the address family used for sending DNS packets. The
              possible values are "inet", "inet6", or "any". If "any" (the
              default value) is specified, resperf will use whichever address
              family is appropriate for the server it is sending packets to.

       -e     Enables EDNS0 [RFC2671] by adding an OPT record to all packets
              sent.

       -D     Sets the DO (DNSSEC OK) bit [RFC3225] in all packets sent. This
              also enables EDNS0, which is required for DNSSEC.

       -y name:secret
              Adds a TSIG record [RFC2845] to all packets sent, using the
              specified TSIG key name and secret, where the secret is
              expressed as a base-64 encoded string.

       -A     Reports the command line arguments passed to resperf to
              standard output.

       -h     Prints a usage statement and exits.

       -i interval
              Specifies the time interval between data points in the plot
              file. The default is 0.5 seconds.

       -m max_qps
              Specifies the target maximum query rate (in queries per
              second). This should be higher than the expected maximum
              throughput of the server being tested. Traffic will be ramped
              up at a linearly increasing rate until this value is reached,
              or until one of the other conditions described in the section
              "Running the test" occurs. The default is 100000 queries per
              second.

       -P plot_data_file
              Specifies the name of the plot data file. The default is
              resperf.gnuplot.

       -r rampup_time
              Specifies the length of time over which traffic will be ramped
              up. The default is 60 seconds.

       -L max_loss
              Specifies the maximum acceptable query loss percentage for
              purposes of determining the maximum throughput value. The
              default is 100%, meaning that resperf will measure the maximum
              throughput without regard to query loss.

AUTHOR

       Nominum, Inc.

SEE ALSO

       dnsperf(1)

resperf                        December 4, 2007                     RESPERF(1)