FASTAVRO(1)                         fastavro                        FASTAVRO(1)

NAME
   fastavro - fastavro Documentation

   The current Python avro package is dog slow.

   On a test case of about 10K records, it takes about 14sec to iterate over
   all of them. In comparison the JAVA avro SDK does it in about 1.9sec.

   fastavro is an alternative implementation that is much faster. It iterates
   over the same 10K records in 2.9sec, and if you use it with PyPy it’ll do
   it in 1.5sec (to be fair, the JAVA benchmark is doing some extra JSON
   encoding/decoding).

   If the optional C extension (generated by Cython) is available, then
   fastavro will be even faster. For the same 10K records it’ll run in about
   1.7sec.
SUPPORTED FEATURES
   · File Writer

   · File Reader (iterating via records or blocks)

   · Schemaless Writer

   · Schemaless Reader

   · JSON Writer

   · JSON Reader

   · Codecs (Snappy, Deflate, Zstandard, Bzip2, LZ4, XZ)

   · Schema resolution

   · Aliases

   · Logical Types

MISSING FEATURES
   · Anything involving Avro’s RPC features

   · Parsing schemas into the canonical form

   · Schema fingerprinting
EXAMPLE

      from fastavro import writer, reader, parse_schema

      schema = {
          'doc': 'A weather reading.',
          'name': 'Weather',
          'namespace': 'test',
          'type': 'record',
          'fields': [
              {'name': 'station', 'type': 'string'},
              {'name': 'time', 'type': 'long'},
              {'name': 'temp', 'type': 'int'},
          ],
      }
      parsed_schema = parse_schema(schema)

      # 'records' can be any iterable (including a generator)
      records = [
          {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
          {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
          {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
          {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
      ]

      # Writing
      with open('weather.avro', 'wb') as out:
          writer(out, parsed_schema, records)

      # Reading
      with open('weather.avro', 'rb') as fo:
          for record in reader(fo):
              print(record)
DOCUMENTATION
   fastavro.read
   class reader(fo, reader_schema=None, return_record_name=False)
          Iterator over records in an avro file.

          Parameters

                 · fo (file-like) – Input stream

                 · reader_schema (dict, optional) – Reader schema

                 · return_record_name (bool, optional) – If true, when reading
                   a union of records, the result will be a tuple where the
                   first value is the name of the record and the second value
                   is the record itself

          Example:

              from fastavro import reader
              with open('some-file.avro', 'rb') as fo:
                  avro_reader = reader(fo)
                  for record in avro_reader:
                      process_record(record)

          The fo argument is a file-like object, so another common usage is
          with an io.BytesIO object:

              from io import BytesIO
              from fastavro import writer, reader

              fo = BytesIO()
              writer(fo, schema, records)
              fo.seek(0)
              for record in reader(fo):
                  process_record(record)

          metadata
                 Key-value pairs in the header metadata

          codec  The codec used when writing

          writer_schema
                 The schema used when writing

          reader_schema
                 The schema used when reading (if provided)
   class block_reader(fo, reader_schema=None, return_record_name=False)
          Iterator over Blocks in an avro file.

          Parameters

                 · fo (file-like) – Input stream

                 · reader_schema (dict, optional) – Reader schema

                 · return_record_name (bool, optional) – If true, when reading
                   a union of records, the result will be a tuple where the
                   first value is the name of the record and the second value
                   is the record itself

          Example:

              from fastavro import block_reader
              with open('some-file.avro', 'rb') as fo:
                  avro_reader = block_reader(fo)
                  for block in avro_reader:
                      process_block(block)

          metadata
                 Key-value pairs in the header metadata

          codec  The codec used when writing

          writer_schema
                 The schema used when writing

          reader_schema
                 The schema used when reading (if provided)

   class Block(bytes_, num_records, codec, reader_schema, writer_schema,
   offset, size, return_record_name=False)
          An avro block. Will yield records when iterated over.

          num_records
                 Number of records in the block

          writer_schema
                 The schema used when writing

          reader_schema
                 The schema used when reading (if provided)

          offset Offset of the block from the beginning of the avro file

          size   Size of the block in bytes
   schemaless_reader(fo, writer_schema, reader_schema=None,
   return_record_name=False)
          Reads a single record written using the schemaless_writer()

          Parameters

                 · fo (file-like) – Input stream

                 · writer_schema (dict) – Schema used when calling
                   schemaless_writer

                 · reader_schema (dict, optional) – If the schema has changed
                   since being written then the new schema can be given to
                   allow for schema migration

                 · return_record_name (bool, optional) – If true, when reading
                   a union of records, the result will be a tuple where the
                   first value is the name of the record and the second value
                   is the record itself

          Example:

              parsed_schema = fastavro.parse_schema(schema)
              with open('file.avro', 'rb') as fp:
                  record = fastavro.schemaless_reader(fp, parsed_schema)

          Note: The schemaless_reader can only read a single record.

   is_avro(path_or_buffer)
          Return True if path (or buffer) points to an Avro file.

          Parameters
                 path_or_buffer (path to file or file-like object) – Path to a
                 file or a file-like object
   fastavro.write
   writer(fo, schema, records, codec='null', sync_interval=16000,
   metadata=None, validator=None, sync_marker=None,
   codec_compression_level=None)
          Write records to fo (stream) according to schema

          Parameters

                 · fo (file-like) – Output stream

                 · schema (dict) – Writer schema

                 · records (iterable) – Records to write. This is commonly a
                   list of the dictionary representation of the records, but
                   it can be any iterable

                 · codec (string, optional) – Compression codec, can be
                   ‘null’, ‘deflate’ or ‘snappy’ (if installed)

                 · sync_interval (int, optional) – Size of sync interval

                 · metadata (dict, optional) – Header metadata

                 · validator (None, True or a function) – Validator function.
                   If None (the default), no validation is done. If True,
                   fastavro.validation.validate will be used. If it’s a
                   function, it should have the same signature as
                   fastavro.validation.validate and raise an exception on
                   error.

                 · sync_marker (bytes, optional) – A byte string used as the
                   avro sync marker. If not provided, a random byte string
                   will be used.

                 · codec_compression_level (int, optional) – Compression level
                   to use with the specified codec (if the codec supports it)
          Example:

              from fastavro import writer, parse_schema

              schema = {
                  'doc': 'A weather reading.',
                  'name': 'Weather',
                  'namespace': 'test',
                  'type': 'record',
                  'fields': [
                      {'name': 'station', 'type': 'string'},
                      {'name': 'time', 'type': 'long'},
                      {'name': 'temp', 'type': 'int'},
                  ],
              }
              parsed_schema = parse_schema(schema)

              records = [
                  {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
                  {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
                  {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
                  {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
              ]

              with open('weather.avro', 'wb') as out:
                  writer(out, parsed_schema, records)

          The fo argument is a file-like object, so another common usage is
          with an io.BytesIO object:

              from io import BytesIO
              from fastavro import writer

              fo = BytesIO()
              writer(fo, schema, records)

          Given an existing avro file, it’s possible to append to it by
          re-opening the file in a+b mode. If the file is only opened in ab
          mode, we aren’t able to read some of the existing header
          information and an error will be raised. For example:

              # Write initial records
              with open('weather.avro', 'wb') as out:
                  writer(out, parsed_schema, records)

              # Write some more records
              with open('weather.avro', 'a+b') as out:
                  writer(out, parsed_schema, more_records)
   schemaless_writer(fo, schema, record)
          Write a single record without the schema or header information

          Parameters

                 · fo (file-like) – Output file

                 · schema (dict) – Schema

                 · record (dict) – Record to write

          Example:

              parsed_schema = fastavro.parse_schema(schema)
              with open('file.avro', 'wb') as fp:
                  fastavro.schemaless_writer(fp, parsed_schema, record)

          Note: The schemaless_writer can only write a single record.
   Using the tuple notation to specify which branch of a union to take
   Since this library uses plain dictionaries to represent a record, it is
   possible for that dictionary to fit the definition of two different
   records.

   For example, given a dictionary like this:

      {"name": "My Name"}

   It would be valid against both of these records:

      child_schema = {
          "name": "Child",
          "type": "record",
          "fields": [
              {"name": "name", "type": "string"},
              {"name": "favorite_color", "type": ["null", "string"]},
          ]
      }

      pet_schema = {
          "name": "Pet",
          "type": "record",
          "fields": [
              {"name": "name", "type": "string"},
              {"name": "favorite_toy", "type": ["null", "string"]},
          ]
      }

   This becomes a problem when a schema contains a union of these two similar
   records, as it is not clear which record the dictionary represents. For
   example, if you used the previous dictionary with the following schema, it
   wouldn’t be clear if the record should be serialized as a Child or a Pet:
      household_schema = {
          "name": "Household",
          "type": "record",
          "fields": [
              {"name": "address", "type": "string"},
              {
                  "name": "family_members",
                  "type": {
                      "type": "array", "items": [
                          {
                              "name": "Child",
                              "type": "record",
                              "fields": [
                                  {"name": "name", "type": "string"},
                                  {"name": "favorite_color", "type": ["null", "string"]},
                              ]
                          }, {
                              "name": "Pet",
                              "type": "record",
                              "fields": [
                                  {"name": "name", "type": "string"},
                                  {"name": "favorite_toy", "type": ["null", "string"]},
                              ]
                          }
                      ]
                  }
              },
          ]
      }

   To resolve this, you can use a tuple notation where the first value of the
   tuple is the fully namespaced record name and the second value is the
   dictionary. For example:

      records = [
          {
              "address": "123 Drive Street",
              "family_members": [
                  ("Child", {"name": "Son"}),
                  ("Child", {"name": "Daughter"}),
                  ("Pet", {"name": "Dog"}),
              ]
          }
      ]
   fastavro.json_read
   json_reader(fo, schema)
          Iterator over records in an Avro JSON file.

          Parameters

                 · fo (file-like) – Input stream

                 · schema (dict) – Reader schema

          Example:

              from fastavro import json_reader

              schema = {
                  'doc': 'A weather reading.',
                  'name': 'Weather',
                  'namespace': 'test',
                  'type': 'record',
                  'fields': [
                      {'name': 'station', 'type': 'string'},
                      {'name': 'time', 'type': 'long'},
                      {'name': 'temp', 'type': 'int'},
                  ]
              }

              with open('some-file', 'r') as fo:
                  avro_reader = json_reader(fo, schema)
                  for record in avro_reader:
                      print(record)
   fastavro.json_write
   json_writer(fo, schema, records)
          Write records to fo (stream) according to schema

          Parameters

                 · fo (file-like) – Output stream

                 · schema (dict) – Writer schema

                 · records (iterable) – Records to write. This is commonly a
                   list of the dictionary representation of the records, but
                   it can be any iterable

          Example:

              from fastavro import json_writer, parse_schema

              schema = {
                  'doc': 'A weather reading.',
                  'name': 'Weather',
                  'namespace': 'test',
                  'type': 'record',
                  'fields': [
                      {'name': 'station', 'type': 'string'},
                      {'name': 'time', 'type': 'long'},
                      {'name': 'temp', 'type': 'int'},
                  ],
              }
              parsed_schema = parse_schema(schema)

              records = [
                  {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
                  {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
                  {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
                  {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
              ]

              with open('some-file', 'w') as out:
                  json_writer(out, parsed_schema, records)
   fastavro.schema
   parse_schema(schema, expand=False, _write_hint=True, _force=False)
          Returns a parsed avro schema

          It is not necessary to call parse_schema, but doing so and saving
          the parsed schema for later use will make future operations faster,
          as the schema will not need to be reparsed.

          Parameters

                 · schema (dict) – Input schema

                 · expand (bool) –

                   NOTE: This option should be considered a keyword-only
                   argument and may get enforced as such when Python 2 support
                   is dropped.

                   If true, named schemas will be fully expanded to their true
                   schemas rather than being represented as just the name.
                   This format should be considered output only and not passed
                   in to other reader/writer functions, as it does not conform
                   to the avro specification and will likely cause an
                   exception

                 · _write_hint (bool) – Internal API argument specifying
                   whether or not the __fastavro_parsed marker should be added
                   to the schema

                 · _force (bool) – Internal API argument. If True, the schema
                   will always be parsed even if it has been parsed and has
                   the __fastavro_parsed marker

          Example:

              from fastavro import parse_schema
              from fastavro import writer

              parsed_schema = parse_schema(original_schema)
              with open('weather.avro', 'wb') as out:
                  writer(out, parsed_schema, records)
   fullname(schema)
          Returns the fullname of a schema

          Parameters
                 schema (dict) – Input schema

          Example:

              from fastavro.schema import fullname

              schema = {
                  'doc': 'A weather reading.',
                  'name': 'Weather',
                  'namespace': 'test',
                  'type': 'record',
                  'fields': [
                      {'name': 'station', 'type': 'string'},
                      {'name': 'time', 'type': 'long'},
                      {'name': 'temp', 'type': 'int'},
                  ],
              }

              fname = fullname(schema)
              assert fname == "test.Weather"
   expand_schema(schema)
          Returns a schema where all named types are expanded to their real
          schema

          NOTE: The output of this function produces a schema that can include
          multiple definitions of the same named type (as per design) which
          are not valid per the avro specification. Therefore, the output of
          this should not be passed to the normal writer/reader functions as
          it will likely result in an error.

          Parameters
                 schema (dict) – Input schema

          Example:

              from fastavro.schema import expand_schema

              original_schema = {
                  "name": "MasterSchema",
                  "namespace": "com.namespace.master",
                  "type": "record",
                  "fields": [{
                      "name": "field_1",
                      "type": {
                          "name": "Dependency",
                          "namespace": "com.namespace.dependencies",
                          "type": "record",
                          "fields": [
                              {"name": "sub_field_1", "type": "string"}
                          ]
                      }
                  }, {
                      "name": "field_2",
                      "type": "com.namespace.dependencies.Dependency"
                  }]
              }

              expanded_schema = expand_schema(original_schema)

              assert expanded_schema == {
                  "name": "com.namespace.master.MasterSchema",
                  "type": "record",
                  "fields": [{
                      "name": "field_1",
                      "type": {
                          "name": "com.namespace.dependencies.Dependency",
                          "type": "record",
                          "fields": [
                              {"name": "sub_field_1", "type": "string"}
                          ]
                      }
                  }, {
                      "name": "field_2",
                      "type": {
                          "name": "com.namespace.dependencies.Dependency",
                          "type": "record",
                          "fields": [
                              {"name": "sub_field_1", "type": "string"}
                          ]
                      }
                  }]
              }
   fastavro.validation
   validate(datum, schema, field=None, raise_errors=True)
          Determine if a python datum is an instance of a schema.

          Parameters

                 · datum (Any) – Data being validated

                 · schema (dict) – Schema

                 · field (str, optional) – Record field being validated

                 · raise_errors (bool, optional) – If true, errors are raised
                   for invalid data. If false, a simple True (valid) or False
                   (invalid) result is returned

          Example:

              from fastavro.validation import validate
              schema = {...}
              record = {...}
              validate(record, schema)
   validate_many(records, schema, raise_errors=True)
          Validate a list of data!

          Parameters

                 · records (iterable) – List of records to validate

                 · schema (dict) – Schema

                 · raise_errors (bool, optional) – If true, errors are raised
                   for invalid data. If false, a simple True (valid) or False
                   (invalid) result is returned

          Example:

              from fastavro.validation import validate_many
              schema = {...}
              records = [{...}, {...}, ...]
              validate_many(records, schema)
   fastavro command line script
   A command line script is installed with the library that can be used to
   dump the contents of avro file(s) to the standard output.

   Usage:

      usage: fastavro [-h] [--schema] [--codecs] [--version] [-p] [file [file ...]]

      iter over avro file, emit records as JSON

      positional arguments:
        file           file(s) to parse

      optional arguments:
        -h, --help     show this help message and exit
        --schema       dump schema instead of records
        --codecs       print supported codecs
        --version      show program's version number and exit
        -p, --pretty   pretty print json
   Examples
   Read an avro file:

      $ fastavro weather.avro

      {"temp": 0, "station": "011990-99999", "time": -619524000000}
      {"temp": 22, "station": "011990-99999", "time": -619506000000}
      {"temp": -11, "station": "011990-99999", "time": -619484400000}
      {"temp": 111, "station": "012650-99999", "time": -655531200000}
      {"temp": 78, "station": "012650-99999", "time": -655509600000}

   Show the schema:

      $ fastavro --schema weather.avro

      {
          "type": "record",
          "namespace": "test",
          "doc": "A weather reading.",
          "fields": [
              {
                  "type": "string",
                  "name": "station"
              },
              {
                  "type": "long",
                  "name": "time"
              },
              {
                  "type": "int",
                  "name": "temp"
              }
          ],
          "name": "Weather"
      }
AUTHOR
   Miki Tebeka

COPYRIGHT
   2020, Miki Tebeka

0.23.3                           Jul 29, 2020                       FASTAVRO(1)