FASTAVRO(1) fastavro FASTAVRO(1)

NAME
fastavro - fastavro Documentation

The current Python avro package is dog slow.

On a test case of about 10K records, it takes about 14sec to iterate
over all of them. In comparison, the Java avro SDK does it in about
1.9sec.

fastavro is an alternative implementation that is much faster. It
iterates over the same 10K records in 2.9sec, and if you use it with
PyPy it’ll do it in 1.5sec (to be fair, the Java benchmark is doing
some extra JSON encoding/decoding).

If the optional C extension (generated by Cython) is available, then
fastavro will be even faster. For the same 10K records it’ll run in
about 1.7sec.

SUPPORTED FEATURES
· File Writer

· File Reader (iterating via records or blocks)

· Schemaless Writer

· Schemaless Reader

· JSON Writer

· JSON Reader

· Codecs (Snappy, Deflate, Zstandard, Bzip2)

· Schema resolution

· Aliases

· Logical Types

MISSING FEATURES
· Anything involving Avro’s RPC features

· Parsing schemas into the canonical form

· Schema fingerprinting

EXAMPLE
    from fastavro import writer, reader, parse_schema

    schema = {
        'doc': 'A weather reading.',
        'name': 'Weather',
        'namespace': 'test',
        'type': 'record',
        'fields': [
            {'name': 'station', 'type': 'string'},
            {'name': 'time', 'type': 'long'},
            {'name': 'temp', 'type': 'int'},
        ],
    }
    parsed_schema = parse_schema(schema)

    # 'records' can be an iterable (including generator)
    records = [
        {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
        {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
        {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
        {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
    ]

    # Writing
    with open('weather.avro', 'wb') as out:
        writer(out, parsed_schema, records)

    # Reading
    with open('weather.avro', 'rb') as fo:
        for record in reader(fo):
            print(record)

DOCUMENTATION
fastavro.read
class reader(fo, reader_schema=None, return_record_name=False)
    Iterator over records in an avro file.

    Parameters

        · fo (file-like) – Input stream

        · reader_schema (dict, optional) – Reader schema

    Example:

        from fastavro import reader
        with open('some-file.avro', 'rb') as fo:
            avro_reader = reader(fo)
            for record in avro_reader:
                process_record(record)

    metadata
        Key-value pairs in the header metadata

    codec
        The codec used when writing

    writer_schema
        The schema used when writing

    reader_schema
        The schema used when reading (if provided)

class block_reader(fo, reader_schema=None, return_record_name=False)
    Iterator over Block in an avro file.

    Parameters

        · fo (file-like) – Input stream

        · reader_schema (dict, optional) – Reader schema

    Example:

        from fastavro import block_reader
        with open('some-file.avro', 'rb') as fo:
            avro_reader = block_reader(fo)
            for block in avro_reader:
                process_block(block)

    metadata
        Key-value pairs in the header metadata

    codec
        The codec used when writing

    writer_schema
        The schema used when writing

    reader_schema
        The schema used when reading (if provided)

class Block(bytes_, num_records, codec, reader_schema, writer_schema,
            offset, size, return_record_name=False)
    An avro block. Will yield records when iterated over

    num_records
        Number of records in the block

    writer_schema
        The schema used when writing

    reader_schema
        The schema used when reading (if provided)

    offset
        Offset of the block from the beginning of the avro file

    size
        Size of the block in bytes

schemaless_reader(fo, writer_schema, reader_schema=None,
                  return_record_name=False)
    Reads a single record written using the schemaless_writer()

    Parameters

        · fo (file-like) – Input stream

        · writer_schema (dict) – Schema used when calling
          schemaless_writer

        · reader_schema (dict, optional) – If the schema has
          changed since being written then the new schema can be
          given to allow for schema migration

    Example:

        parsed_schema = fastavro.parse_schema(schema)
        with open('file.avro', 'rb') as fp:
            record = fastavro.schemaless_reader(fp, parsed_schema)

    Note: The schemaless_reader can only read a single record.

is_avro(path_or_buffer)
    Return True if path (or buffer) points to an Avro file.

    Parameters
        path_or_buffer (path to file or file-like object) – Path
        to file

fastavro.write
writer(fo, schema, records, codec='null', sync_interval=16000,
       metadata=None, validator=None, sync_marker=None)
    Write records to fo (stream) according to schema

    Parameters

        · fo (file-like) – Output stream

        · schema (dict) – Writer schema

        · records (iterable) – Records to write. This is commonly
          a list of the dictionary representation of the records,
          but it can be any iterable

        · codec (string, optional) – Compression codec, can be
          ‘null’, ‘deflate’ or ‘snappy’ (if installed)

        · sync_interval (int, optional) – Size of sync interval

        · metadata (dict, optional) – Header metadata

        · validator (None, True or a function) – Validator
          function. If None (the default), no validation is done.
          If True, then fastavro.validation.validate will be used.
          If it’s a function, it should have the same signature as
          fastavro.validation.validate and raise an exception on
          error.

        · sync_marker (bytes, optional) – A byte string used as
          the avro sync marker. If not provided, a random byte
          string will be used.

    Example:

        from fastavro import writer, parse_schema

        schema = {
            'doc': 'A weather reading.',
            'name': 'Weather',
            'namespace': 'test',
            'type': 'record',
            'fields': [
                {'name': 'station', 'type': 'string'},
                {'name': 'time', 'type': 'long'},
                {'name': 'temp', 'type': 'int'},
            ],
        }
        parsed_schema = parse_schema(schema)

        records = [
            {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
            {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
            {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
            {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
        ]

        with open('weather.avro', 'wb') as out:
            writer(out, parsed_schema, records)

    Given an existing avro file, it’s possible to append to it by
    re-opening the file in a+b mode. If the file is only opened in
    ab mode, we aren’t able to read some of the existing header
    information and an error will be raised. For example:

        # Write initial records
        with open('weather.avro', 'wb') as out:
            writer(out, parsed_schema, records)

        # Write some more records
        with open('weather.avro', 'a+b') as out:
            writer(out, parsed_schema, more_records)

schemaless_writer(fo, schema, record)
    Write a single record without the schema or header information

    Parameters

        · fo (file-like) – Output file

        · schema (dict) – Schema

        · record (dict) – Record to write

    Example:

        parsed_schema = fastavro.parse_schema(schema)
        with open('file.avro', 'wb') as fp:
            fastavro.schemaless_writer(fp, parsed_schema, record)

    Note: The schemaless_writer can only write a single record.

fastavro.json_read
json_reader(fo, schema)
    Iterator over records in an avro json file.

    Parameters

        · fo (file-like) – Input stream

        · schema (dict) – Reader schema

    Example:

        from fastavro import json_reader

        schema = {
            'doc': 'A weather reading.',
            'name': 'Weather',
            'namespace': 'test',
            'type': 'record',
            'fields': [
                {'name': 'station', 'type': 'string'},
                {'name': 'time', 'type': 'long'},
                {'name': 'temp', 'type': 'int'},
            ]
        }

        with open('some-file', 'r') as fo:
            avro_reader = json_reader(fo, schema)
            for record in avro_reader:
                print(record)

fastavro.json_write
json_writer(fo, schema, records)
    Write records to fo (stream) according to schema

    Parameters

        · fo (file-like) – Output stream

        · schema (dict) – Writer schema

        · records (iterable) – Records to write. This is commonly
          a list of the dictionary representation of the records,
          but it can be any iterable

    Example:

        from fastavro import json_writer, parse_schema

        schema = {
            'doc': 'A weather reading.',
            'name': 'Weather',
            'namespace': 'test',
            'type': 'record',
            'fields': [
                {'name': 'station', 'type': 'string'},
                {'name': 'time', 'type': 'long'},
                {'name': 'temp', 'type': 'int'},
            ],
        }
        parsed_schema = parse_schema(schema)

        records = [
            {u'station': u'011990-99999', u'temp': 0, u'time': 1433269388},
            {u'station': u'011990-99999', u'temp': 22, u'time': 1433270389},
            {u'station': u'011990-99999', u'temp': -11, u'time': 1433273379},
            {u'station': u'012650-99999', u'temp': 111, u'time': 1433275478},
        ]

        with open('some-file', 'w') as out:
            json_writer(out, parsed_schema, records)

fastavro.schema
parse_schema(schema, _write_hint=True, _force=False)
    Returns a parsed avro schema

    It is not necessary to call parse_schema but doing so and saving
    the parsed schema for use later will make future operations
    faster as the schema will not need to be reparsed.

    Parameters

        · schema (dict) – Input schema

        · _write_hint (bool) – Internal API argument specifying
          whether or not the __fastavro_parsed marker should be
          added to the schema

        · _force (bool) – Internal API argument. If True, the
          schema will always be parsed even if it has been parsed
          and has the __fastavro_parsed marker

    Example:

        from fastavro import parse_schema
        from fastavro import writer

        parsed_schema = parse_schema(original_schema)
        with open('weather.avro', 'wb') as out:
            writer(out, parsed_schema, records)

fastavro.validation
validate(datum, schema, field=None, raise_errors=True)
    Determine if a python datum is an instance of a schema.

    Parameters

        · datum (Any) – Data being validated

        · schema (dict) – Schema

        · field (str, optional) – Record field being validated

        · raise_errors (bool, optional) – If true, errors are
          raised for invalid data. If false, a simple True
          (valid) or False (invalid) result is returned

    Example:

        from fastavro.validation import validate
        schema = {...}
        record = {...}
        validate(record, schema)

validate_many(records, schema, raise_errors=True)
    Validate a list of data!

    Parameters

        · records (iterable) – List of records to validate

        · schema (dict) – Schema

        · raise_errors (bool, optional) – If true, errors are
          raised for invalid data. If false, a simple True
          (valid) or False (invalid) result is returned

    Example:

        from fastavro.validation import validate_many
        schema = {...}
        records = [{...}, {...}, ...]
        validate_many(records, schema)

fastavro command line script
A command line script is installed with the library that can be used to
dump the contents of avro file(s) to the standard output.

Usage:

    usage: fastavro [-h] [--schema] [--codecs] [--version] [-p] [file [file ...]]

    iter over avro file, emit records as JSON

    positional arguments:
      file          file(s) to parse

    optional arguments:
      -h, --help    show this help message and exit
      --schema      dump schema instead of records
      --codecs      print supported codecs
      --version     show program's version number and exit
      -p, --pretty  pretty print json

Examples
Read an avro file:

    $ fastavro weather.avro

    {"temp": 0, "station": "011990-99999", "time": -619524000000}
    {"temp": 22, "station": "011990-99999", "time": -619506000000}
    {"temp": -11, "station": "011990-99999", "time": -619484400000}
    {"temp": 111, "station": "012650-99999", "time": -655531200000}
    {"temp": 78, "station": "012650-99999", "time": -655509600000}

Show the schema:

    $ fastavro --schema weather.avro

    {
        "type": "record",
        "namespace": "test",
        "doc": "A weather reading.",
        "fields": [
            {
                "type": "string",
                "name": "station"
            },
            {
                "type": "long",
                "name": "time"
            },
            {
                "type": "int",
                "name": "temp"
            }
        ],
        "name": "Weather"
    }

AUTHOR
Miki Tebeka

COPYRIGHT
2019, Miki Tebeka

0.22.7 Dec 09, 2019 FASTAVRO(1)