FASTAVRO(1)                         fastavro                        FASTAVRO(1)

NAME
   fastavro - fastavro Documentation

   The current Python avro package is dog slow.

   On a test case of about 10K records, it takes about 14sec to iterate over
   all of them. In comparison the JAVA avro SDK does it in about 1.9sec.

   fastavro is an alternative implementation that is much faster. It iterates
   over the same 10K records in 2.9sec, and if you use it with PyPy it’ll do
   it in 1.5sec (to be fair, the JAVA benchmark is doing some extra JSON
   encoding/decoding).

   If the optional C extension (generated by Cython) is available, then
   fastavro will be even faster. For the same 10K records it’ll run in about
   1.7sec.
SUPPORTED FEATURES
   · File Writer

   · File Reader (iterating via records or blocks)

   · Schemaless Writer

   · Schemaless Reader

   · JSON Writer

   · JSON Reader

   · Codecs (Snappy, Deflate, Zstandard, Bzip2, LZ4, XZ)

   · Schema resolution

   · Aliases

   · Logical Types

MISSING FEATURES
   · Anything involving Avro’s RPC features

   · Parsing schemas into the canonical form

   · Schema fingerprinting
EXAMPLE

      from fastavro import writer, reader, parse_schema

      schema = {
          'doc': 'A weather reading.',
          'name': 'Weather',
          'namespace': 'test',
          'type': 'record',
          'fields': [
              {'name': 'station', 'type': 'string'},
              {'name': 'time', 'type': 'long'},
              {'name': 'temp', 'type': 'int'},
          ],
      }
      parsed_schema = parse_schema(schema)

      # 'records' can be any iterable (including a generator)
      records = [
          {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
          {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
          {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
          {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
      ]

      # Writing
      with open('weather.avro', 'wb') as out:
          writer(out, parsed_schema, records)

      # Reading
      with open('weather.avro', 'rb') as fo:
          for record in reader(fo):
              print(record)
DOCUMENTATION
   fastavro.read
   class reader(fo, reader_schema=None, return_record_name=False)
          Iterator over records in an avro file.

          Parameters

                 · fo (file-like) – Input stream

                 · reader_schema (dict, optional) – Reader schema

                 · return_record_name (bool, optional) – If true, when reading
                   a union of records, the result will be a tuple where the
                   first value is the name of the record and the second value
                   is the record itself

          Example:

              from fastavro import reader
              with open('some-file.avro', 'rb') as fo:
                  avro_reader = reader(fo)
                  for record in avro_reader:
                      process_record(record)

          The fo argument is a file-like object, so another common usage is
          with an io.BytesIO object:

              from io import BytesIO
              from fastavro import writer, reader

              fo = BytesIO()
              writer(fo, schema, records)
              fo.seek(0)
              for record in reader(fo):
                  process_record(record)

          metadata
                 Key-value pairs in the header metadata

          codec  The codec used when writing

          writer_schema
                 The schema used when writing

          reader_schema
                 The schema used when reading (if provided)
   class block_reader(fo, reader_schema=None, return_record_name=False)
          Iterator over Blocks in an avro file.

          Parameters

                 · fo (file-like) – Input stream

                 · reader_schema (dict, optional) – Reader schema

                 · return_record_name (bool, optional) – If true, when reading
                   a union of records, the result will be a tuple where the
                   first value is the name of the record and the second value
                   is the record itself

          Example:

              from fastavro import block_reader
              with open('some-file.avro', 'rb') as fo:
                  avro_reader = block_reader(fo)
                  for block in avro_reader:
                      process_block(block)

          metadata
                 Key-value pairs in the header metadata

          codec  The codec used when writing

          writer_schema
                 The schema used when writing

          reader_schema
                 The schema used when reading (if provided)

   class Block(bytes_, num_records, codec, reader_schema, writer_schema,
   offset, size, return_record_name=False)
          An avro block. Will yield records when iterated over.

          num_records
                 Number of records in the block

          writer_schema
                 The schema used when writing

          reader_schema
                 The schema used when reading (if provided)

          offset Offset of the block from the beginning of the avro file

          size   Size of the block in bytes
   schemaless_reader(fo, writer_schema, reader_schema=None,
   return_record_name=False)
          Reads a single record written using the schemaless_writer()

          Parameters

                 · fo (file-like) – Input stream

                 · writer_schema (dict) – Schema used when calling
                   schemaless_writer

                 · reader_schema (dict, optional) – If the schema has changed
                   since being written then the new schema can be given to
                   allow for schema migration

                 · return_record_name (bool, optional) – If true, when reading
                   a union of records, the result will be a tuple where the
                   first value is the name of the record and the second value
                   is the record itself

          Example:

              parsed_schema = fastavro.parse_schema(schema)
              with open('file.avro', 'rb') as fp:
                  record = fastavro.schemaless_reader(fp, parsed_schema)

          Note: The schemaless_reader can only read a single record.

   is_avro(path_or_buffer)
          Return True if path (or buffer) points to an Avro file.

          Parameters
                 path_or_buffer (path to file or file-like object) – Path to a
                 file or a file-like object
   fastavro.write
   writer(fo, schema, records, codec='null', sync_interval=16000,
   metadata=None, validator=None, sync_marker=None,
   codec_compression_level=None)
          Write records to fo (stream) according to schema

          Parameters

                 · fo (file-like) – Output stream

                 · schema (dict) – Writer schema

                 · records (iterable) – Records to write. This is commonly a
                   list of the dictionary representation of the records, but
                   it can be any iterable

                 · codec (string, optional) – Compression codec, can be
                   ‘null’, ‘deflate’ or ‘snappy’ (if installed)

                 · sync_interval (int, optional) – Size of sync interval

                 · metadata (dict, optional) – Header metadata

                 · validator (None, True or a function) – Validator function.
                   If None (the default), no validation is done. If True,
                   fastavro.validation.validate will be used. If it’s a
                   function, it should have the same signature as
                   fastavro.validation.validate and raise an exception on
                   error.

                 · sync_marker (bytes, optional) – A byte string used as the
                   avro sync marker. If not provided, a random byte string
                   will be used.

                 · codec_compression_level (int, optional) – Compression level
                   to use with the specified codec (if the codec supports it)
          Example:

              from fastavro import writer, parse_schema

              schema = {
                  'doc': 'A weather reading.',
                  'name': 'Weather',
                  'namespace': 'test',
                  'type': 'record',
                  'fields': [
                      {'name': 'station', 'type': 'string'},
                      {'name': 'time', 'type': 'long'},
                      {'name': 'temp', 'type': 'int'},
                  ],
              }
              parsed_schema = parse_schema(schema)

              records = [
                  {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
                  {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
                  {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
                  {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
              ]

              with open('weather.avro', 'wb') as out:
                  writer(out, parsed_schema, records)

          The fo argument is a file-like object, so another common usage is
          with an io.BytesIO object:

              from io import BytesIO
              from fastavro import writer

              fo = BytesIO()
              writer(fo, schema, records)

          Given an existing avro file, it’s possible to append to it by
          re-opening the file in a+b mode. If the file is only opened in ab
          mode, we aren’t able to read some of the existing header
          information and an error will be raised. For example:

              # Write initial records
              with open('weather.avro', 'wb') as out:
                  writer(out, parsed_schema, records)

              # Write some more records
              with open('weather.avro', 'a+b') as out:
                  writer(out, parsed_schema, more_records)
   schemaless_writer(fo, schema, record)
          Write a single record without the schema or header information

          Parameters

                 · fo (file-like) – Output file

                 · schema (dict) – Schema

                 · record (dict) – Record to write

          Example:

              parsed_schema = fastavro.parse_schema(schema)
              with open('file.avro', 'wb') as fp:
                  fastavro.schemaless_writer(fp, parsed_schema, record)

          Note: The schemaless_writer can only write a single record.
   Using the tuple notation to specify which branch of a union to take
   Since this library uses plain dictionaries to represent a record, it is
   possible for that dictionary to fit the definition of two different
   records.

   For example, given a dictionary like this:

      {"name": "My Name"}

   It would be valid against both of these records:

      child_schema = {
          "name": "Child",
          "type": "record",
          "fields": [
              {"name": "name", "type": "string"},
              {"name": "favorite_color", "type": ["null", "string"]},
          ]
      }

      pet_schema = {
          "name": "Pet",
          "type": "record",
          "fields": [
              {"name": "name", "type": "string"},
              {"name": "favorite_toy", "type": ["null", "string"]},
          ]
      }

   This becomes a problem when a schema contains a union of these two similar
   records, as it is not clear which record the dictionary represents. For
   example, if you used the previous dictionary with the following schema, it
   wouldn’t be clear if the record should be serialized as a Child or a Pet:
      household_schema = {
          "name": "Household",
          "type": "record",
          "fields": [
              {"name": "address", "type": "string"},
              {
                  "name": "family_members",
                  "type": {
                      "type": "array", "items": [
                          {
                              "name": "Child",
                              "type": "record",
                              "fields": [
                                  {"name": "name", "type": "string"},
                                  {"name": "favorite_color", "type": ["null", "string"]},
                              ]
                          }, {
                              "name": "Pet",
                              "type": "record",
                              "fields": [
                                  {"name": "name", "type": "string"},
                                  {"name": "favorite_toy", "type": ["null", "string"]},
                              ]
                          }
                      ]
                  }
              },
          ]
      }

   To resolve this, you can use a tuple notation where the first value of the
   tuple is the fully namespaced record name and the second value is the
   dictionary. For example:

      records = [
          {
              "address": "123 Drive Street",
              "family_members": [
                  ("Child", {"name": "Son"}),
                  ("Child", {"name": "Daughter"}),
                  ("Pet", {"name": "Dog"}),
              ]
          }
      ]
   fastavro.json_read
   json_reader(fo, schema)
          Iterator over records in an Avro JSON file.

          Parameters

                 · fo (file-like) – Input stream

                 · schema (dict) – Reader schema

          Example:

              from fastavro import json_reader

              schema = {
                  'doc': 'A weather reading.',
                  'name': 'Weather',
                  'namespace': 'test',
                  'type': 'record',
                  'fields': [
                      {'name': 'station', 'type': 'string'},
                      {'name': 'time', 'type': 'long'},
                      {'name': 'temp', 'type': 'int'},
                  ]
              }

              with open('some-file', 'r') as fo:
                  avro_reader = json_reader(fo, schema)
                  for record in avro_reader:
                      print(record)
   fastavro.json_write
   json_writer(fo, schema, records)
          Write records to fo (stream) according to schema

          Parameters

                 · fo (file-like) – Output stream

                 · schema (dict) – Writer schema

                 · records (iterable) – Records to write. This is commonly a
                   list of the dictionary representation of the records, but
                   it can be any iterable

          Example:

              from fastavro import json_writer, parse_schema

              schema = {
                  'doc': 'A weather reading.',
                  'name': 'Weather',
                  'namespace': 'test',
                  'type': 'record',
                  'fields': [
                      {'name': 'station', 'type': 'string'},
                      {'name': 'time', 'type': 'long'},
                      {'name': 'temp', 'type': 'int'},
                  ],
              }
              parsed_schema = parse_schema(schema)

              records = [
                  {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
                  {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
                  {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
                  {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
              ]

              with open('some-file', 'w') as out:
                  json_writer(out, parsed_schema, records)
   fastavro.schema
   parse_schema(schema, expand=False, _write_hint=True, _force=False)
          Returns a parsed avro schema

          It is not necessary to call parse_schema, but doing so and saving
          the parsed schema for later use will make future operations faster,
          as the schema will not need to be reparsed.

          Parameters

                 · schema (dict) – Input schema

                 · expand (bool) –

                   NOTE: This option should be considered a keyword-only
                   argument and may get enforced as such when Python 2 support
                   is dropped.

                   If true, named schemas will be fully expanded to their true
                   schemas rather than being represented as just the name.
                   This format should be considered output only and not passed
                   in to other reader/writer functions, as it does not conform
                   to the avro specification and will likely cause an
                   exception

                 · _write_hint (bool) – Internal API argument specifying
                   whether or not the __fastavro_parsed marker should be added
                   to the schema

                 · _force (bool) – Internal API argument. If True, the schema
                   will always be parsed even if it has been parsed and has
                   the __fastavro_parsed marker

          Example:

              from fastavro import parse_schema
              from fastavro import writer

              parsed_schema = parse_schema(original_schema)
              with open('weather.avro', 'wb') as out:
                  writer(out, parsed_schema, records)
   fullname(schema)
          Returns the fullname of a schema

          Parameters
                 schema (dict) – Input schema

          Example:

              from fastavro.schema import fullname

              schema = {
                  'doc': 'A weather reading.',
                  'name': 'Weather',
                  'namespace': 'test',
                  'type': 'record',
                  'fields': [
                      {'name': 'station', 'type': 'string'},
                      {'name': 'time', 'type': 'long'},
                      {'name': 'temp', 'type': 'int'},
                  ],
              }

              fname = fullname(schema)
              assert fname == "test.Weather"
   expand_schema(schema)
          Returns a schema where all named types are expanded to their real
          schema

          NOTE: The output of this function produces a schema that can include
          multiple definitions of the same named type (as per design) which
          are not valid per the avro specification. Therefore, the output of
          this should not be passed to the normal writer/reader functions as
          it will likely result in an error.

          Parameters
                 schema (dict) – Input schema

          Example:

              from fastavro.schema import expand_schema

              original_schema = {
                  "name": "MasterSchema",
                  "namespace": "com.namespace.master",
                  "type": "record",
                  "fields": [{
                      "name": "field_1",
                      "type": {
                          "name": "Dependency",
                          "namespace": "com.namespace.dependencies",
                          "type": "record",
                          "fields": [
                              {"name": "sub_field_1", "type": "string"}
                          ]
                      }
                  }, {
                      "name": "field_2",
                      "type": "com.namespace.dependencies.Dependency"
                  }]
              }

              expanded_schema = expand_schema(original_schema)

              assert expanded_schema == {
                  "name": "com.namespace.master.MasterSchema",
                  "type": "record",
                  "fields": [{
                      "name": "field_1",
                      "type": {
                          "name": "com.namespace.dependencies.Dependency",
                          "type": "record",
                          "fields": [
                              {"name": "sub_field_1", "type": "string"}
                          ]
                      }
                  }, {
                      "name": "field_2",
                      "type": {
                          "name": "com.namespace.dependencies.Dependency",
                          "type": "record",
                          "fields": [
                              {"name": "sub_field_1", "type": "string"}
                          ]
                      }
                  }]
              }
   fastavro.validation
   validate(datum, schema, field=None, raise_errors=True)
          Determine if a python datum is an instance of a schema.

          Parameters

                 · datum (Any) – Data being validated

                 · schema (dict) – Schema

                 · field (str, optional) – Record field being validated

                 · raise_errors (bool, optional) – If true, errors are raised
                   for invalid data. If false, a simple True (valid) or False
                   (invalid) result is returned

          Example:

              from fastavro.validation import validate
              schema = {...}
              record = {...}
              validate(record, schema)
   validate_many(records, schema, raise_errors=True)
          Validate a list of data!

          Parameters

                 · records (iterable) – List of records to validate

                 · schema (dict) – Schema

                 · raise_errors (bool, optional) – If true, errors are raised
                   for invalid data. If false, a simple True (valid) or False
                   (invalid) result is returned

          Example:

              from fastavro.validation import validate_many
              schema = {...}
              records = [{...}, {...}, ...]
              validate_many(records, schema)
   fastavro command line script
   A command line script is installed with the library that can be used to
   dump the contents of avro file(s) to the standard output.

   Usage:

      usage: fastavro [-h] [--schema] [--codecs] [--version] [-p] [file [file ...]]

      iter over avro file, emit records as JSON

      positional arguments:
        file           file(s) to parse

      optional arguments:
        -h, --help     show this help message and exit
        --schema       dump schema instead of records
        --codecs       print supported codecs
        --version      show program's version number and exit
        -p, --pretty   pretty print json
   Examples
   Read an avro file:

      $ fastavro weather.avro

      {"temp": 0, "station": "011990-99999", "time": -619524000000}
      {"temp": 22, "station": "011990-99999", "time": -619506000000}
      {"temp": -11, "station": "011990-99999", "time": -619484400000}
      {"temp": 111, "station": "012650-99999", "time": -655531200000}
      {"temp": 78, "station": "012650-99999", "time": -655509600000}

   Show the schema:

      $ fastavro --schema weather.avro

      {
          "type": "record",
          "namespace": "test",
          "doc": "A weather reading.",
          "fields": [
              {
                  "type": "string",
                  "name": "station"
              },
              {
                  "type": "long",
                  "name": "time"
              },
              {
                  "type": "int",
                  "name": "temp"
              }
          ],
          "name": "Weather"
      }
AUTHOR
   Miki Tebeka

COPYRIGHT
   2020, Miki Tebeka

0.23.3                           Jul 29, 2020                       FASTAVRO(1)