fastavro(1)

FASTAVRO(1)                        fastavro                        FASTAVRO(1)

NAME

       fastavro - fastavro Documentation

       The current Python avro package is dog slow.

       On a test case of about 10K records, it takes about 14 seconds to
       iterate over all of them. In comparison, the Java Avro SDK does it in
       about 1.9 seconds.

       fastavro is an alternative implementation that is much faster. It
       iterates over the same 10K records in 2.9 seconds, and if you use it
       with PyPy it’ll do it in 1.5 seconds (to be fair, the Java benchmark
       is doing some extra JSON encoding/decoding).

       If the optional C extension (generated by Cython) is available, then
       fastavro will be even faster. For the same 10K records it’ll run in
       about 1.7 seconds.

SUPPORTED FEATURES

       · File Writer

       · File Reader (iterating via records or blocks)

       · Schemaless Writer

       · Schemaless Reader

       · JSON Writer

       · JSON Reader

       · Codecs (Snappy, Deflate, Zstandard, Bzip2)

       · Schema resolution

       · Aliases

       · Logical Types

MISSING FEATURES

       · Anything involving Avro’s RPC features

       · Parsing schemas into the canonical form

       · Schema fingerprinting

EXAMPLE

          from fastavro import writer, reader, parse_schema

          schema = {
              'doc': 'A weather reading.',
              'name': 'Weather',
              'namespace': 'test',
              'type': 'record',
              'fields': [
                  {'name': 'station', 'type': 'string'},
                  {'name': 'time', 'type': 'long'},
                  {'name': 'temp', 'type': 'int'},
              ],
          }
          parsed_schema = parse_schema(schema)

          # 'records' can be an iterable (including generator)
          records = [
              {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
              {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
              {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
              {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
          ]

          # Writing
          with open('weather.avro', 'wb') as out:
              writer(out, parsed_schema, records)

          # Reading
          with open('weather.avro', 'rb') as fo:
              for record in reader(fo):
                  print(record)

DOCUMENTATION

   fastavro.read
       class reader(fo, reader_schema=None, return_record_name=False)
              Iterator over records in an avro file.

              Parameters

                     · fo (file-like) – Input stream

                     · reader_schema (dict, optional) – Reader schema

              Example:

                 from fastavro import reader
                 with open('some-file.avro', 'rb') as fo:
                     avro_reader = reader(fo)
                     for record in avro_reader:
                         process_record(record)

              metadata
                     Key-value pairs in the header metadata

              codec  The codec used when writing

              writer_schema
                     The schema used when writing

              reader_schema
                     The schema used when reading (if provided)

       class block_reader(fo, reader_schema=None, return_record_name=False)
              Iterator over Block objects in an avro file.

              Parameters

                     · fo (file-like) – Input stream

                     · reader_schema (dict, optional) – Reader schema

              Example:

                 from fastavro import block_reader
                 with open('some-file.avro', 'rb') as fo:
                     avro_reader = block_reader(fo)
                     for block in avro_reader:
                         process_block(block)

              metadata
                     Key-value pairs in the header metadata

              codec  The codec used when writing

              writer_schema
                     The schema used when writing

              reader_schema
                     The schema used when reading (if provided)

       class Block(bytes_, num_records, codec, reader_schema, writer_schema,
       offset, size, return_record_name=False)
              An avro block. Will yield records when iterated over

              num_records
                     Number of records in the block

              writer_schema
                     The schema used when writing

              reader_schema
                     The schema used when reading (if provided)

              offset Offset of the block from the beginning of the avro file

              size   Size of the block in bytes

       schemaless_reader(fo, writer_schema, reader_schema=None,
       return_record_name=False)
              Reads a single record written using the schemaless_writer()

              Parameters

                     · fo (file-like) – Input stream

                     · writer_schema (dict) – Schema used when calling
                       schemaless_writer

                     · reader_schema (dict, optional) – If the schema has
                       changed since being written then the new schema can be
                       given to allow for schema migration

              Example:

                 parsed_schema = fastavro.parse_schema(schema)
                 with open('file.avro', 'rb') as fp:
                     record = fastavro.schemaless_reader(fp, parsed_schema)

              Note: The schemaless_reader can only read a single record.

       is_avro(path_or_buffer)
              Return True if path (or buffer) points to an Avro file.

              Parameters
                     path_or_buffer (path to file or file-like object) – Path
                     to file

   fastavro.write
       writer(fo, schema, records, codec='null', sync_interval=16000,
       metadata=None, validator=None, sync_marker=None)
              Write records to fo (stream) according to schema

              Parameters

                     · fo (file-like) – Output stream

                     · schema (dict) – Writer schema

                     · records (iterable) – Records to write. This is commonly
                       a list of the dictionary representation of the records,
                       but it can be any iterable

                     · codec (string, optional) – Compression codec, can be
                       ‘null’, ‘deflate’ or ‘snappy’ (if installed)

                     · sync_interval (int, optional) – Size of sync interval

                     · metadata (dict, optional) – Header metadata

                     · validator (None, True or a function) – Validator
                       function. If None (the default) - no validation. If
                       True then fastavro.validation.validate will be used. If
                       it’s a function, it should have the same signature as
                       fastavro.writer.validate and raise an exception on
                       error.

                     · sync_marker (bytes, optional) – A byte string used as
                       the avro sync marker. If not provided, a random byte
                       string will be used.

              Example:

                 from fastavro import writer, parse_schema

                 schema = {
                     'doc': 'A weather reading.',
                     'name': 'Weather',
                     'namespace': 'test',
                     'type': 'record',
                     'fields': [
                         {'name': 'station', 'type': 'string'},
                         {'name': 'time', 'type': 'long'},
                         {'name': 'temp', 'type': 'int'},
                     ],
                 }
                 parsed_schema = parse_schema(schema)

                 records = [
                     {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
                     {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
                     {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
                     {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
                 ]

                 with open('weather.avro', 'wb') as out:
                     writer(out, parsed_schema, records)

              Given an existing avro file, it’s possible to append to it by
              re-opening the file in a+b mode. If the file is only opened in
              ab mode, we aren’t able to read some of the existing header
              information and an error will be raised. For example:

                 # Write initial records
                 with open('weather.avro', 'wb') as out:
                     writer(out, parsed_schema, records)

                 # Write some more records
                 with open('weather.avro', 'a+b') as out:
                     writer(out, parsed_schema, more_records)

       schemaless_writer(fo, schema, record)
              Write a single record without the schema or header information

              Parameters

                     · fo (file-like) – Output file

                     · schema (dict) – Schema

                     · record (dict) – Record to write

              Example:

                 parsed_schema = fastavro.parse_schema(schema)
                 with open('file.avro', 'wb') as fp:
                     fastavro.schemaless_writer(fp, parsed_schema, record)

              Note: The schemaless_writer can only write a single record.

   fastavro.json_read
       json_reader(fo, schema)
              Iterator over records in an avro json file.

              Parameters

                     · fo (file-like) – Input stream

                     · schema (dict) – Reader schema

              Example:

                 from fastavro import json_reader

                 schema = {
                     'doc': 'A weather reading.',
                     'name': 'Weather',
                     'namespace': 'test',
                     'type': 'record',
                     'fields': [
                         {'name': 'station', 'type': 'string'},
                         {'name': 'time', 'type': 'long'},
                         {'name': 'temp', 'type': 'int'},
                     ]
                 }

                 with open('some-file', 'r') as fo:
                     avro_reader = json_reader(fo, schema)
                     for record in avro_reader:
                         print(record)

   fastavro.json_write
       json_writer(fo, schema, records)
              Write records to fo (stream) according to schema

              Parameters

                     · fo (file-like) – Output stream

                     · schema (dict) – Writer schema

                     · records (iterable) – Records to write. This is commonly
                       a list of the dictionary representation of the records,
                       but it can be any iterable

              Example:

                 from fastavro import json_writer, parse_schema

                 schema = {
                     'doc': 'A weather reading.',
                     'name': 'Weather',
                     'namespace': 'test',
                     'type': 'record',
                     'fields': [
                         {'name': 'station', 'type': 'string'},
                         {'name': 'time', 'type': 'long'},
                         {'name': 'temp', 'type': 'int'},
                     ],
                 }
                 parsed_schema = parse_schema(schema)

                 records = [
                     {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
                     {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
                     {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
                     {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
                 ]

                 with open('some-file', 'w') as out:
                     json_writer(out, parsed_schema, records)

   fastavro.schema
       parse_schema(schema, _write_hint=True, _force=False)
              Returns a parsed avro schema

              It is not necessary to call parse_schema but doing so and saving
              the parsed schema for use later will make future operations
              faster as the schema will not need to be reparsed.

              Parameters

                     · schema (dict) – Input schema

                     · _write_hint (bool) – Internal API argument specifying
                       whether or not the __fastavro_parsed marker should be
                       added to the schema

                     · _force (bool) – Internal API argument. If True, the
                       schema will always be parsed even if it has been parsed
                       and has the __fastavro_parsed marker

              Example:

                 from fastavro import parse_schema
                 from fastavro import writer

                 parsed_schema = parse_schema(original_schema)
                 with open('weather.avro', 'wb') as out:
                     writer(out, parsed_schema, records)

   fastavro.validation
       validate(datum, schema, field=None, raise_errors=True)
              Determine if a python datum is an instance of a schema.

              Parameters

                     · datum (Any) – Data being validated

                     · schema (dict) – Schema

                     · field (str, optional) – Record field being validated

                     · raise_errors (bool, optional) – If true, errors are
                       raised for invalid data. If false, a simple True
                       (valid) or False (invalid) result is returned

              Example:

                 from fastavro.validation import validate
                 schema = {...}
                 record = {...}
                 validate(record, schema)

       validate_many(records, schema, raise_errors=True)
              Validate a list of data!

              Parameters

                     · records (iterable) – List of records to validate

                     · schema (dict) – Schema

                     · raise_errors (bool, optional) – If true, errors are
                       raised for invalid data. If false, a simple True
                       (valid) or False (invalid) result is returned

              Example:

                 from fastavro.validation import validate_many
                 schema = {...}
                 records = [{...}, {...}, ...]
                 validate_many(records, schema)

   fastavro command line script
       A command line script is installed with the library that can be used to
       dump the contents of avro file(s) to the standard output.

       Usage:

          usage: fastavro [-h] [--schema] [--codecs] [--version] [-p] [file [file ...]]

          iter over avro file, emit records as JSON

          positional arguments:
            file          file(s) to parse

          optional arguments:
            -h, --help    show this help message and exit
            --schema      dump schema instead of records
            --codecs      print supported codecs
            --version     show program's version number and exit
            -p, --pretty  pretty print json

   Examples
       Read an avro file:

          $ fastavro weather.avro

          {"temp": 0, "station": "011990-99999", "time": -619524000000}
          {"temp": 22, "station": "011990-99999", "time": -619506000000}
          {"temp": -11, "station": "011990-99999", "time": -619484400000}
          {"temp": 111, "station": "012650-99999", "time": -655531200000}
          {"temp": 78, "station": "012650-99999", "time": -655509600000}

       Show the schema:

          $ fastavro --schema weather.avro

          {
           "type": "record",
           "namespace": "test",
           "doc": "A weather reading.",
           "fields": [
            {
             "type": "string",
             "name": "station"
            },
            {
             "type": "long",
             "name": "time"
            },
            {
             "type": "int",
             "name": "temp"
            }
           ],
           "name": "Weather"
          }

AUTHOR

       Miki Tebeka

COPYRIGHT

       2019, Miki Tebeka



0.22.7                           Dec 09, 2019                      FASTAVRO(1)