PARALLEL_EXAMPLES(7)                 parallel                PARALLEL_EXAMPLES(7)
2
3
4
6 EXAMPLE: Working as xargs -n1. Argument appending
7 GNU parallel can work similar to xargs -n1.
8
9 To compress all html files using gzip run:
10
11 find . -name '*.html' | parallel gzip --best
12
13 If the file names may contain a newline use -0. Substitute FOO BAR with
14 FUBAR in all files in this dir and subdirs:
15
16 find . -type f -print0 | \
17 parallel -q0 perl -i -pe 's/FOO BAR/FUBAR/g'
18
19 Note -q is needed because of the space in 'FOO BAR'.
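
An alternative sketch is to skip -q and instead quote the whole command, so the
shell that runs each job re-parses the quotes itself:

  find . -type f -print0 | \
    parallel -0 "perl -i -pe 's/FOO BAR/FUBAR/g'"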
20
21 EXAMPLE: Simple network scanner
22 prips can generate IP-addresses from CIDR notation. With GNU parallel
23 you can build a simple network scanner to see which addresses respond
24 to ping:
25
26 prips 130.229.16.0/20 | \
27 parallel --timeout 2 -j0 \
28 'ping -c 1 {} >/dev/null && echo {}' 2>/dev/null
29
30 EXAMPLE: Reading arguments from command line
31 GNU parallel can take the arguments from command line instead of stdin
32 (standard input). To compress all html files in the current dir using
33 gzip run:
34
35 parallel gzip --best ::: *.html
36
37 To convert *.wav to *.mp3 using LAME running one process per CPU run:
38
39 parallel lame {} -o {.}.mp3 ::: *.wav
40
41 EXAMPLE: Inserting multiple arguments
42 When moving a lot of files like this: mv *.log destdir you will
43 sometimes get the error:
44
45 bash: /bin/mv: Argument list too long
46
47 because there are too many files. You can instead do:
48
49 ls | grep -E '\.log$' | parallel mv {} destdir
50
This will run mv for each file. It can be done faster if mv gets as
many arguments as will fit on the line:
53
54 ls | grep -E '\.log$' | parallel -m mv {} destdir
55
56 In many shells you can also use printf:
57
58 printf '%s\0' *.log | parallel -0 -m mv {} destdir
59
60 EXAMPLE: Context replace
61 To remove the files pict0000.jpg .. pict9999.jpg you could do:
62
63 seq -w 0 9999 | parallel rm pict{}.jpg
64
65 You could also do:
66
67 seq -w 0 9999 | perl -pe 's/(.*)/pict$1.jpg/' | parallel -m rm
68
The first will run rm 10000 times, while the second will only run rm as
many times as needed to keep the command line short enough to avoid
Argument list too long (it typically runs 1-2 times).
72
73 You could also run:
74
75 seq -w 0 9999 | parallel -X rm pict{}.jpg
76
77 This will also only run rm as many times needed to keep the command
78 line length short enough.
79
80 EXAMPLE: Compute intensive jobs and substitution
81 If ImageMagick is installed this will generate a thumbnail of a jpg
82 file:
83
84 convert -geometry 120 foo.jpg thumb_foo.jpg
85
86 This will run with number-of-cpus jobs in parallel for all jpg files in
87 a directory:
88
89 ls *.jpg | parallel convert -geometry 120 {} thumb_{}
90
91 To do it recursively use find:
92
93 find . -name '*.jpg' | \
94 parallel convert -geometry 120 {} {}_thumb.jpg
95
96 Notice how the argument has to start with {} as {} will include path
97 (e.g. running convert -geometry 120 ./foo/bar.jpg thumb_./foo/bar.jpg
98 would clearly be wrong). The command will generate files like
99 ./foo/bar.jpg_thumb.jpg.
100
101 Use {.} to avoid the extra .jpg in the file name. This command will
102 make files like ./foo/bar_thumb.jpg:
103
104 find . -name '*.jpg' | \
105 parallel convert -geometry 120 {} {.}_thumb.jpg
106
107 EXAMPLE: Substitution and redirection
108 This will generate an uncompressed version of .gz-files next to the
109 .gz-file:
110
111 parallel zcat {} ">"{.} ::: *.gz
112
113 Quoting of > is necessary to postpone the redirection. Another solution
114 is to quote the whole command:
115
116 parallel "zcat {} >{.}" ::: *.gz
117
118 Other special shell characters (such as * ; $ > < | >> <<) also need
119 to be put in quotes, as they may otherwise be interpreted by the shell
120 and not given to GNU parallel.
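
As an illustrative sketch (the file names are hypothetical), the quoted pipe and
redirection below become part of each job instead of being interpreted by the
shell that starts GNU parallel:

  parallel 'sort {} | uniq -c > {}.counts' ::: *.txt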
121
122 EXAMPLE: Composed commands
123 A job can consist of several commands. This will print the number of
124 files in each directory:
125
126 ls | parallel 'echo -n {}" "; ls {}|wc -l'
127
128 To put the output in a file called <name>.dir:
129
130 ls | parallel '(echo -n {}" "; ls {}|wc -l) >{}.dir'
131
132 Even small shell scripts can be run by GNU parallel:
133
134 find . | parallel 'a={}; name=${a##*/};' \
135 'upper=$(echo "$name" | tr "[:lower:]" "[:upper:]");'\
136 'echo "$name - $upper"'
137
138 ls | parallel 'mv {} "$(echo {} | tr "[:upper:]" "[:lower:]")"'
139
140 Given a list of URLs, list all URLs that fail to download. Print the
141 line number and the URL.
142
143 cat urlfile | parallel "wget {} 2>/dev/null || grep -n {} urlfile"
144
145 Create a mirror directory with the same file names except all files and
146 symlinks are empty files.
147
148 cp -rs /the/source/dir mirror_dir
149 find mirror_dir -type l | parallel -m rm {} '&&' touch {}
150
Find the files in a list that do not exist:
152
153 cat file_list | parallel 'if [ ! -e {} ] ; then echo {}; fi'
154
155 EXAMPLE: Composed command with perl replacement string
You have a bunch of files. You want them sorted into dirs. The dir of
each file should be named after the first letter of the file name.
158
159 parallel 'mkdir -p {=s/(.).*/$1/=}; mv {} {=s/(.).*/$1/=}' ::: *
160
161 EXAMPLE: Composed command with multiple input sources
162 You have a dir with files named as 24 hours in 5 minute intervals:
00:00, 00:05, 00:10 .. 23:55. You want to find the missing files:
164
165 parallel [ -f {1}:{2} ] "||" echo {1}:{2} does not exist \
166 ::: {00..23} ::: {00..55..5}
167
168 EXAMPLE: Calling Bash functions
169 If the composed command is longer than a line, it becomes hard to read.
170 In Bash you can use functions. Just remember to export -f the function.
171
172 doit() {
173 echo Doing it for $1
174 sleep 2
175 echo Done with $1
176 }
177 export -f doit
178 parallel doit ::: 1 2 3
179
180 doubleit() {
181 echo Doing it for $1 $2
182 sleep 2
183 echo Done with $1 $2
184 }
185 export -f doubleit
186 parallel doubleit ::: 1 2 3 ::: a b
187
188 To do this on remote servers you need to transfer the function using
189 --env:
190
191 parallel --env doit -S server doit ::: 1 2 3
192 parallel --env doubleit -S server doubleit ::: 1 2 3 ::: a b
193
194 If your environment (aliases, variables, and functions) is small you
195 can copy the full environment without having to export -f anything. See
196 env_parallel.
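
A minimal sketch of that approach in Bash (assuming env_parallel is installed;
its helper must be sourced first, see env_parallel(1)):

  . "$(which env_parallel.bash)"   # activate env_parallel in this shell
  myvar="copied automatically"
  doit() { echo "$myvar - doing it for $1"; }
  env_parallel doit ::: 1 2 3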
197
198 EXAMPLE: Function tester
199 To test a program with different parameters:
200
201 tester() {
202 if (eval "$@") >&/dev/null; then
203 perl -e 'printf "\033[30;102m[ OK ]\033[0m @ARGV\n"' "$@"
204 else
205 perl -e 'printf "\033[30;101m[FAIL]\033[0m @ARGV\n"' "$@"
206 fi
207 }
208 export -f tester
209 parallel tester my_program ::: arg1 arg2
210 parallel tester exit ::: 1 0 2 0
211
212 If my_program fails a red FAIL will be printed followed by the failing
213 command; otherwise a green OK will be printed followed by the command.
214
EXAMPLE: Continuously show the latest line of output
216 It can be useful to monitor the output of running jobs.
217
This shows the most recent output line until a job finishes, after
which the output of the job is printed in full:
220
221 parallel '{} | tee >(cat >&3)' ::: 'command 1' 'command 2' \
222 3> >(perl -ne '$|=1;chomp;printf"%.'$COLUMNS's\r",$_." "x100')
223
224 EXAMPLE: Log rotate
225 Log rotation renames a logfile to an extension with a higher number:
226 log.1 becomes log.2, log.2 becomes log.3, and so on. The oldest log is
227 removed. To avoid overwriting files the process starts backwards from
228 the high number to the low number. This will keep 10 old versions of
229 the log:
230
231 seq 9 -1 1 | parallel -j1 mv log.{} log.'{= $_++ =}'
232 mv log log.1
233
234 EXAMPLE: Removing file extension when processing files
235 When processing files removing the file extension using {.} is often
236 useful.
237
238 Create a directory for each zip-file and unzip it in that dir:
239
240 parallel 'mkdir {.}; cd {.}; unzip ../{}' ::: *.zip
241
242 Recompress all .gz files in current directory using bzip2 running 1 job
243 per CPU in parallel:
244
245 parallel "zcat {} | bzip2 >{.}.bz2 && rm {}" ::: *.gz
246
247 Convert all WAV files to MP3 using LAME:
248
249 find sounddir -type f -name '*.wav' | parallel lame {} -o {.}.mp3
250
Put all converted files in the same directory:
252
253 find sounddir -type f -name '*.wav' | \
254 parallel lame {} -o mydir/{/.}.mp3
255
256 EXAMPLE: Replacing parts of file names
257 If you deal with paired end reads, you will have files like
258 barcode1_R1.fq.gz, barcode1_R2.fq.gz, barcode2_R1.fq.gz, and
259 barcode2_R2.fq.gz.
260
261 You want barcodeN_R1 to be processed with barcodeN_R2.
262
263 parallel --plus myprocess {} {/_R1.fq.gz/_R2.fq.gz} ::: *_R1.fq.gz
264
265 If the barcode does not contain '_R1', you can do:
266
267 parallel --plus myprocess {} {/_R1/_R2} ::: *_R1.fq.gz
268
269 EXAMPLE: Removing strings from the argument
If you have a directory with tar.gz files and want them extracted into the
corresponding dir (e.g. foo.tar.gz will be extracted into the dir foo) you
can do:
273
274 parallel --plus 'mkdir {..}; tar -C {..} -xf {}' ::: *.tar.gz
275
276 If you want to remove a different ending, you can use {%string}:
277
278 parallel --plus echo {%_demo} ::: mycode_demo keep_demo_here
279
You can also remove a starting string with {#string}:
281
282 parallel --plus echo {#demo_} ::: demo_mycode keep_demo_here
283
284 To remove a string anywhere you can use regular expressions with
285 {/regexp/replacement} and leave the replacement empty:
286
287 parallel --plus echo {/demo_/} ::: demo_mycode remove_demo_here
288
289 EXAMPLE: Download 24 images for each of the past 30 days
290 Let us assume a website stores images like:
291
292 https://www.example.com/path/to/YYYYMMDD_##.jpg
293
294 where YYYYMMDD is the date and ## is the number 01-24. This will
295 download images for the past 30 days:
296
297 getit() {
298 date=$(date -d "today -$1 days" +%Y%m%d)
299 num=$2
300 echo wget https://www.example.com/path/to/${date}_${num}.jpg
301 }
302 export -f getit
303
304 parallel getit ::: $(seq 30) ::: $(seq -w 24)
305
306 $(date -d "today -$1 days" +%Y%m%d) will give the dates in YYYYMMDD
307 with $1 days subtracted.
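
A quick way to check the date arithmetic on its own (the value printed depends
on the day you run it):

  date -d "today -7 days" +%Y%m%d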
308
309 EXAMPLE: Download world map from NASA
310 NASA provides tiles to download on earthdata.nasa.gov. Download tiles
311 for Blue Marble world map and create a 10240x20480 map.
312
313 base=https://map1a.vis.earthdata.nasa.gov/wmts-geo/wmts.cgi
314 service="SERVICE=WMTS&REQUEST=GetTile&VERSION=1.0.0"
315 layer="LAYER=BlueMarble_ShadedRelief_Bathymetry"
316 set="STYLE=&TILEMATRIXSET=EPSG4326_500m&TILEMATRIX=5"
317 tile="TILEROW={1}&TILECOL={2}"
318 format="FORMAT=image%2Fjpeg"
319 url="$base?$service&$layer&$set&$tile&$format"
320
321 parallel -j0 -q wget "$url" -O {1}_{2}.jpg ::: {0..19} ::: {0..39}
322 parallel eval convert +append {}_{0..39}.jpg line{}.jpg ::: {0..19}
323 convert -append line{0..19}.jpg world.jpg
324
325 EXAMPLE: Download Apollo-11 images from NASA using jq
326 Search NASA using their API to get JSON for images related to 'apollo
11' and have 'moon landing' in the description.
328
329 The search query returns JSON containing URLs to JSON containing
330 collections of pictures. One of the pictures in each of these
collections is large.
332
333 wget is used to get the JSON for the search query. jq is then used to
334 extract the URLs of the collections. parallel then calls wget to get
335 each collection, which is passed to jq to extract the URLs of all
images. grep picks out the large images, and parallel finally uses
337 wget to fetch the images.
338
339 base="https://images-api.nasa.gov/search"
340 q="q=apollo 11"
341 description="description=moon landing"
342 media_type="media_type=image"
343 wget -O - "$base?$q&$description&$media_type" |
344 jq -r .collection.items[].href |
345 parallel wget -O - |
346 jq -r .[] |
347 grep large |
348 parallel wget
349
350 EXAMPLE: Download video playlist in parallel
youtube-dl is an excellent tool to download videos. It cannot, however,
download videos in parallel. This takes a playlist and downloads 10
videos in parallel.
354
355 url='youtu.be/watch?v=0wOf2Fgi3DE&list=UU_cznB5YZZmvAmeq7Y3EriQ'
356 export url
357 youtube-dl --flat-playlist "https://$url" |
358 parallel --tagstring {#} --lb -j10 \
359 youtube-dl --playlist-start {#} --playlist-end {#} '"https://$url"'
360
361 EXAMPLE: Prepend last modified date (ISO8601) to file name
362 parallel mv {} '{= $a=pQ($_); $b=$_;' \
363 '$_=qx{date -r "$a" +%FT%T}; chomp; $_="$_ $b" =}' ::: *
364
365 {= and =} mark a perl expression. pQ perl-quotes the string. date
366 +%FT%T is the date in ISO8601 with time.
367
368 EXAMPLE: Save output in ISO8601 dirs
369 Save output from ps aux every second into dirs named
370 yyyy-mm-ddThh:mm:ss+zz:zz.
371
372 seq 1000 | parallel -N0 -j1 --delay 1 \
373 --results '{= $_=`date -Isec`; chomp=}/' ps aux
374
375 EXAMPLE: Digital clock with "blinking" :
The : in a digital clock blinks. To make every other line have a ':'
and the rest a ' ', a perl expression is used to look at the 3rd input
source. If its value modulo 2 is 1, ':' is used; otherwise ' ' is used:
379
380 parallel -k echo {1}'{=3 $_=$_%2?":":" "=}'{2}{3} \
381 ::: {0..12} ::: {0..5} ::: {0..9}
382
383 EXAMPLE: Aggregating content of files
384 This:
385
386 parallel --header : echo x{X}y{Y}z{Z} \> x{X}y{Y}z{Z} \
387 ::: X {1..5} ::: Y {01..10} ::: Z {1..5}
388
389 will generate the files x1y01z1 .. x5y10z5. If you want to aggregate
390 the output grouping on x and z you can do this:
391
392 parallel eval 'cat {=s/y01/y*/=} > {=s/y01//=}' ::: *y01*
393
394 For all values of x and z it runs commands like:
395
396 cat x1y*z1 > x1z1
397
398 So you end up with x1z1 .. x5z5 each containing the content of all
399 values of y.
400
401 EXAMPLE: Breadth first parallel web crawler/mirrorer
The script below will crawl and mirror a URL in parallel. It first
downloads pages that are 1 click down, then 2 clicks down, then 3,
instead of the normal depth-first order, where the first link on each
page is fetched first.
406
407 Run like this:
408
409 PARALLEL=-j100 ./parallel-crawl http://gatt.org.yeslab.org/
410
411 Remove the wget part if you only want a web crawler.
412
413 It works by fetching a page from a list of URLs and looking for links
414 in that page that are within the same starting URL and that have not
415 already been seen. These links are added to a new queue. When all the
pages from the list are done, the new queue is moved to the list of URLs
and the process starts over until no unseen links are found.
418
419 #!/bin/bash
420
421 # E.g. http://gatt.org.yeslab.org/
422 URL=$1
423 # Stay inside the start dir
424 BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
425 URLLIST=$(mktemp urllist.XXXX)
426 URLLIST2=$(mktemp urllist.XXXX)
427 SEEN=$(mktemp seen.XXXX)
428
429 # Spider to get the URLs
430 echo $URL >$URLLIST
431 cp $URLLIST $SEEN
432
433 while [ -s $URLLIST ] ; do
434 cat $URLLIST |
435 parallel lynx -listonly -image_links -dump {} \; \
436 wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
437 perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and
438 do { $seen{$1}++ or print }' |
439 grep -F $BASEURL |
440 grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
441 mv $URLLIST2 $URLLIST
442 done
443
444 rm -f $URLLIST $URLLIST2 $SEEN
445
446 EXAMPLE: Process files from a tar file while unpacking
447 If the files to be processed are in a tar file then unpacking one file
448 and processing it immediately may be faster than first unpacking all
449 files.
450
451 tar xvf foo.tgz | perl -ne 'print $l;$l=$_;END{print $l}' | \
452 parallel echo
453
The Perl one-liner delays the output by one line: tar prints a file name
before the file is fully unpacked, so by waiting until tar has printed
the next name we know the previous file is complete before handing it to
GNU parallel.
456
457 EXAMPLE: Rewriting a for-loop and a while-read-loop
458 for-loops like this:
459
460 (for x in `cat list` ; do
461 do_something $x
462 done) | process_output
463
464 and while-read-loops like this:
465
466 cat list | (while read x ; do
467 do_something $x
468 done) | process_output
469
470 can be written like this:
471
472 cat list | parallel do_something | process_output
473
For example: Find which host name in a list has the IP address 1.2.3.4:
475
476 cat hosts.txt | parallel -P 100 host | grep 1.2.3.4
477
If the processing requires more steps, a for-loop like this:
479
480 (for x in `cat list` ; do
481 no_extension=${x%.*};
482 do_step1 $x scale $no_extension.jpg
483 do_step2 <$x $no_extension
484 done) | process_output
485
486 and while-loops like this:
487
488 cat list | (while read x ; do
489 no_extension=${x%.*};
490 do_step1 $x scale $no_extension.jpg
491 do_step2 <$x $no_extension
492 done) | process_output
493
494 can be written like this:
495
496 cat list | parallel "do_step1 {} scale {.}.jpg ; do_step2 <{} {.}" |\
497 process_output
498
499 If the body of the loop is bigger, it improves readability to use a
500 function:
501
502 (for x in `cat list` ; do
503 do_something $x
504 [... 100 lines that do something with $x ...]
505 done) | process_output
506
507 cat list | (while read x ; do
508 do_something $x
509 [... 100 lines that do something with $x ...]
510 done) | process_output
511
512 can both be rewritten as:
513
514 doit() {
515 x=$1
516 do_something $x
517 [... 100 lines that do something with $x ...]
518 }
519 export -f doit
520 cat list | parallel doit
521
522 EXAMPLE: Rewriting nested for-loops
523 Nested for-loops like this:
524
525 (for x in `cat xlist` ; do
526 for y in `cat ylist` ; do
527 do_something $x $y
528 done
529 done) | process_output
530
531 can be written like this:
532
533 parallel do_something {1} {2} :::: xlist ylist | process_output
534
535 Nested for-loops like this:
536
537 (for colour in red green blue ; do
538 for size in S M L XL XXL ; do
539 echo $colour $size
540 done
541 done) | sort
542
543 can be written like this:
544
545 parallel echo {1} {2} ::: red green blue ::: S M L XL XXL | sort
546
547 EXAMPLE: Finding the lowest difference between files
548 diff is good for finding differences in text files. diff | wc -l gives
549 an indication of the size of the difference. To find the differences
550 between all files in the current dir do:
551
552 parallel --tag 'diff {1} {2} | wc -l' ::: * ::: * | sort -nk3
553
554 This way it is possible to see if some files are closer to other files.
555
556 EXAMPLE: for-loops with column names
557 When doing multiple nested for-loops it can be easier to keep track of
the loop variable if it is named instead of just having a number. Use
--header : to let the first argument be a named alias for the
560 positional replacement string:
561
562 parallel --header : echo {colour} {size} \
563 ::: colour red green blue ::: size S M L XL XXL
564
565 This also works if the input file is a file with columns:
566
567 cat addressbook.tsv | \
568 parallel --colsep '\t' --header : echo {Name} {E-mail address}
569
570 EXAMPLE: All combinations in a list
571 GNU parallel makes all combinations when given two lists.
572
573 To make all combinations in a single list with unique values, you
574 repeat the list and use replacement string {choose_k}:
575
576 parallel --plus echo {choose_k} ::: A B C D ::: A B C D
577
578 parallel --plus echo 2{2choose_k} 1{1choose_k} ::: A B C D ::: A B C D
579
580 {choose_k} works for any number of input sources:
581
582 parallel --plus echo {choose_k} ::: A B C D ::: A B C D ::: A B C D
583
584 Where {choose_k} does not care about order, {uniq} cares about order.
585 It simply skips jobs where values from different input sources are the
586 same:
587
588 parallel --plus echo {uniq} ::: A B C ::: A B C ::: A B C
589 parallel --plus echo {1uniq}+{2uniq}+{3uniq} \
590 ::: A B C ::: A B C ::: A B C
591
The behaviour of {choose_k} is undefined if the input values of each
593 source are different.
594
595 EXAMPLE: From a to b and b to c
596 Assume you have input like:
597
598 aardvark
599 babble
600 cab
601 dab
602 each
603
604 and want to run combinations like:
605
606 aardvark babble
607 babble cab
608 cab dab
609 dab each
610
611 If the input is in the file in.txt:
612
613 parallel echo {1} - {2} ::::+ <(head -n -1 in.txt) <(tail -n +2 in.txt)
614
615 If the input is in the array $a here are two solutions:
616
617 seq $((${#a[@]}-1)) | \
618 env_parallel --env a echo '${a[{=$_--=}]} - ${a[{}]}'
619 parallel echo {1} - {2} ::: "${a[@]::${#a[@]}-1}" :::+ "${a[@]:1}"
620
621 EXAMPLE: Count the differences between all files in a dir
622 Using --results the results are saved in /tmp/diffcount*.
623
624 parallel --results /tmp/diffcount "diff -U 0 {1} {2} | \
625 tail -n +3 |grep -v '^@'|wc -l" ::: * ::: *
626
627 To see the difference between file A and file B look at the file
628 '/tmp/diffcount/1/A/2/B'.
629
630 EXAMPLE: Speeding up fast jobs
631 Starting a job on the local machine takes around 3-10 ms. This can be a
632 big overhead if the job takes very few ms to run. Often you can group
633 small jobs together using -X which will make the overhead less
634 significant. Compare the speed of these:
635
636 seq -w 0 9999 | parallel touch pict{}.jpg
637 seq -w 0 9999 | parallel -X touch pict{}.jpg
638
639 If your program cannot take multiple arguments, then you can use GNU
640 parallel to spawn multiple GNU parallels:
641
642 seq -w 0 9999999 | \
643 parallel -j10 -q -I,, --pipe parallel -j0 touch pict{}.jpg
644
645 If -j0 normally spawns 252 jobs, then the above will try to spawn 2520
646 jobs. On a normal GNU/Linux system you can spawn 32000 jobs using this
647 technique with no problems. To raise the 32000 jobs limit raise
648 /proc/sys/kernel/pid_max to 4194303.
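
A sketch of how to raise the limit (run as root; the setting lasts until
reboot unless made persistent via sysctl configuration):

  echo 4194303 > /proc/sys/kernel/pid_max
  # or equivalently:
  sysctl -w kernel.pid_max=4194303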
649
650 If you do not need GNU parallel to have control over each job (so no
651 need for --retries or --joblog or similar), then it can be even faster
652 if you can generate the command lines and pipe those to a shell. So if
653 you can do this:
654
655 mygenerator | sh
656
657 Then that can be parallelized like this:
658
659 mygenerator | parallel --pipe --block 10M sh
660
661 E.g.
662
663 mygenerator() {
664 seq 10000000 | perl -pe 'print "echo This is fast job number "';
665 }
666 mygenerator | parallel --pipe --block 10M sh
667
The overhead is then 100000 times smaller, namely around 100 nanoseconds per
669 job.
670
671 EXAMPLE: Using shell variables
672 When using shell variables you need to quote them correctly as they may
673 otherwise be interpreted by the shell.
674
675 Notice the difference between:
676
677 ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar)
678 parallel echo ::: ${ARR[@]} # This is probably not what you want
679
680 and:
681
682 ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar)
683 parallel echo ::: "${ARR[@]}"
684
When using variables that contain special characters (e.g. space) in the
actual command, you can quote them using '"$VAR"' or using "'s
687 and -q:
688
689 VAR="My brother's 12\" records are worth <\$\$\$>"
690 parallel -q echo "$VAR" ::: '!'
691 export VAR
692 parallel echo '"$VAR"' ::: '!'
693
694 If $VAR does not contain ' then "'$VAR'" will also work (and does not
695 need export):
696
697 VAR="My 12\" records are worth <\$\$\$>"
698 parallel echo "'$VAR'" ::: '!'
699
700 If you use them in a function you just quote as you normally would do:
701
702 VAR="My brother's 12\" records are worth <\$\$\$>"
703 export VAR
704 myfunc() { echo "$VAR" "$1"; }
705 export -f myfunc
706 parallel myfunc ::: '!'
707
708 EXAMPLE: Group output lines
709 When running jobs that output data, you often do not want the output of
710 multiple jobs to run together. GNU parallel defaults to grouping the
711 output of each job, so the output is printed when the job finishes. If
712 you want full lines to be printed while the job is running you can use
--line-buffer. If you only want to see the most recent line of each job,
you can use --latest-line. If you want output to be printed as soon as
possible you can use -u.
715
716 Compare the output of:
717
718 parallel wget --progress=dot --limit-rate=100k \
719 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
720 ::: {12..16}
721 parallel --line-buffer wget --progress=dot --limit-rate=100k \
722 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
723 ::: {12..16}
724 parallel --latest-line wget --progress=dot --limit-rate=100k \
725 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
726 ::: {12..16}
727 parallel -u wget --progress=dot --limit-rate=100k \
728 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
729 ::: {12..16}
730
731 EXAMPLE: Tag output lines
732 GNU parallel groups the output lines, but it can be hard to see where
733 the different jobs begin. --tag prepends the argument to make that more
734 visible:
735
736 parallel --tag wget --limit-rate=100k \
737 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
738 ::: {12..16}
739
740 --tag works with --line-buffer but not with -u:
741
742 parallel --tag --line-buffer wget --limit-rate=100k \
743 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
744 ::: {12..16}
745
746 Check the uptime of the servers in ~/.parallel/sshloginfile:
747
748 parallel --tag -S .. --nonall uptime
749
750 EXAMPLE: Colorize output
751 Give each job a new color. Most terminals support ANSI colors with the
752 escape code "\033[30;3Xm" where 0 <= X <= 7:
753
754 seq 10 | \
755 parallel --tagstring '\033[30;3{=$_=++$::color%8=}m' seq {}
756 parallel --rpl '{color} $_="\033[30;3".(++$::color%8)."m"' \
757 --tagstring {color} seq {} ::: {1..10}
758
759 To get rid of the initial \t (which comes from --tagstring):
760
761 ... | perl -pe 's/\t//'
762
763 EXAMPLE: Keep order of output same as order of input
764 Normally the output of a job will be printed as soon as it completes.
765 Sometimes you want the order of the output to remain the same as the
order of the input. This is often important if the output is used as
767 input for another system. -k will make sure the order of output will be
768 in the same order as input even if later jobs end before earlier jobs.
769
770 Append a string to every line in a text file:
771
772 cat textfile | parallel -k echo {} append_string
773
774 If you remove -k some of the lines may come out in the wrong order.
775
776 Another example is traceroute:
777
778 parallel traceroute ::: qubes-os.org debian.org freenetproject.org
779
780 will give traceroute of qubes-os.org, debian.org and
781 freenetproject.org, but it will be sorted according to which job
782 completed first.
783
784 To keep the order the same as input run:
785
786 parallel -k traceroute ::: qubes-os.org debian.org freenetproject.org
787
788 This will make sure the traceroute to qubes-os.org will be printed
789 first.
790
791 A bit more complex example is downloading a huge file in chunks in
792 parallel: Some internet connections will deliver more data if you
793 download files in parallel. For downloading files in parallel see:
794 "EXAMPLE: Download 10 images for each of the past 30 days". But if you
795 are downloading a big file you can download the file in chunks in
796 parallel.
797
798 To download byte 10000000-19999999 you can use curl:
799
800 curl -r 10000000-19999999 https://example.com/the/big/file >file.part
801
802 To download a 1 GB file we need 100 10MB chunks downloaded and combined
803 in the correct order.
804
805 seq 0 99 | parallel -k curl -r \
806 {}0000000-{}9999999 https://example.com/the/big/file > file
807
808 EXAMPLE: Parallel grep
809 grep -r greps recursively through directories. GNU parallel can often
810 speed this up.
811
812 find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
813
This will run 1.5 jobs per CPU and give up to 1000 arguments to grep.
815
816 There are situations where the above will be slower than grep -r:
817
818 • If data is already in RAM. The overhead of starting jobs and
819 buffering output may outweigh the benefit of running in parallel.
820
821 • If the files are big. If a file cannot be read in a single seek, the
822 disk may start thrashing.
823
824 The speedup is caused by two factors:
825
826 • On rotating harddisks small files often require a seek for each file.
827 By searching for more files in parallel, the arm may pass another
828 wanted file on its way.
829
• NVMe drives often perform better by having multiple commands running
831 in parallel.
832
833 EXAMPLE: Grepping n lines for m regular expressions.
834 The simplest solution to grep a big file for a lot of regexps is:
835
836 grep -f regexps.txt bigfile
837
838 Or if the regexps are fixed strings:
839
840 grep -F -f regexps.txt bigfile
841
842 There are 3 limiting factors: CPU, RAM, and disk I/O.
843
844 RAM is easy to measure: If the grep process takes up most of your free
845 memory (e.g. when running top), then RAM is a limiting factor.
846
847 CPU is also easy to measure: If the grep takes >90% CPU in top, then
848 the CPU is a limiting factor, and parallelization will speed this up.
849
850 It is harder to see if disk I/O is the limiting factor, and depending
851 on the disk system it may be faster or slower to parallelize. The only
852 way to know for certain is to test and measure.
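
A rough way to measure is to time a serial and a parallel run on the same
input (a sketch; the 100M block size is only a starting point):

  time grep -f regexps.txt bigfile > /dev/null
  time parallel --pipe-part --block 100M -a bigfile \
    grep -f regexps.txt > /dev/null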
853
854 Limiting factor: RAM
855
856 The normal grep -f regexps.txt bigfile works no matter the size of
857 bigfile, but if regexps.txt is so big it cannot fit into memory, then
858 you need to split this.
859
860 grep -F takes around 100 bytes of RAM and grep takes about 500 bytes of
861 RAM per 1 byte of regexp. So if regexps.txt is 1% of your RAM, then it
862 may be too big.
863
864 If you can convert your regexps into fixed strings do that. E.g. if the
lines you are looking for in bigfile all look like:
866
867 ID1 foo bar baz Identifier1 quux
868 fubar ID2 foo bar baz Identifier2
869
870 then your regexps.txt can be converted from:
871
872 ID1.*Identifier1
873 ID2.*Identifier2
874
875 into:
876
877 ID1 foo bar baz Identifier1
878 ID2 foo bar baz Identifier2
879
880 This way you can use grep -F which takes around 80% less memory and is
881 much faster.
882
883 If it still does not fit in memory you can do this:
884
885 parallel --pipe-part -a regexps.txt --block 1M grep -F -f - -n bigfile | \
886 sort -un | perl -pe 's/^\d+://'
887
888 The 1M should be your free memory divided by the number of CPU threads
889 and divided by 200 for grep -F and by 1000 for normal grep. On
890 GNU/Linux you can do:
891
892 free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
893 END { print sum }' /proc/meminfo)
894 percpu=$((free / 200 / $(parallel --number-of-threads)))k
895
896 parallel --pipe-part -a regexps.txt --block $percpu --compress \
897 grep -F -f - -n bigfile | \
898 sort -un | perl -pe 's/^\d+://'
899
900 If you can live with duplicated lines and wrong order, it is faster to
901 do:
902
903 parallel --pipe-part -a regexps.txt --block $percpu --compress \
904 grep -F -f - bigfile
905
906 Limiting factor: CPU
907
908 If the CPU is the limiting factor parallelization should be done on the
909 regexps:
910
911 cat regexps.txt | parallel --pipe -L1000 --round-robin --compress \
912 grep -f - -n bigfile | \
913 sort -un | perl -pe 's/^\d+://'
914
915 The command will start one grep per CPU and read bigfile one time per
916 CPU, but as that is done in parallel, all reads except the first will
917 be cached in RAM. Depending on the size of regexps.txt it may be faster
918 to use --block 10m instead of -L1000.
919
920 Some storage systems perform better when reading multiple chunks in
921 parallel. This is true for some RAID systems and for some network file
922 systems. To parallelize the reading of bigfile:
923
924 parallel --pipe-part --block 100M -a bigfile -k --compress \
925 grep -f regexps.txt
926
927 This will split bigfile into 100MB chunks and run grep on each of these
928 chunks. To parallelize both reading of bigfile and regexps.txt combine
929 the two using --cat:
930
931 parallel --pipe-part --block 100M -a bigfile --cat cat regexps.txt \
932 \| parallel --pipe -L1000 --round-robin grep -f - {}
933
934 If a line matches multiple regexps, the line may be duplicated.
935
936 Bigger problem
937
938 If the problem is too big to be solved by this, you are probably ready
939 for Lucene.
940
941 EXAMPLE: Using remote computers
942 To run commands on a remote computer SSH needs to be set up and you
943 must be able to login without entering a password (The commands ssh-
944 copy-id, ssh-agent, and sshpass may help you do that).
945
946 If you need to login to a whole cluster, you typically do not want to
947 accept the host key for every host. You want to accept them the first
948 time and be warned if they are ever changed. To do that:
949
950 # Add the servers to the sshloginfile
951 (echo servera; echo serverb) > .parallel/my_cluster
952 # Make sure .ssh/config exist
953 touch .ssh/config
954 cp .ssh/config .ssh/config.backup
955 # Disable StrictHostKeyChecking temporarily
956 (echo 'Host *'; echo StrictHostKeyChecking no) >> .ssh/config
957 parallel --slf my_cluster --nonall true
958 # Remove the disabling of StrictHostKeyChecking
959 mv .ssh/config.backup .ssh/config
960
961 The servers in .parallel/my_cluster are now added in .ssh/known_hosts.
962
963 To run echo on server.example.com:
964
965 seq 10 | parallel --sshlogin server.example.com echo
966
967 To run commands on more than one remote computer run:
968
969 seq 10 | parallel --sshlogin s1.example.com,s2.example.net echo
970
971 Or:
972
973 seq 10 | parallel --sshlogin server.example.com \
974 --sshlogin server2.example.net echo
975
976 If the login username is foo on server2.example.net use:
977
978 seq 10 | parallel --sshlogin server.example.com \
979 --sshlogin foo@server2.example.net echo
980
981 If your list of hosts is server1-88.example.net with login foo:
982
983 seq 10 | parallel -Sfoo@server{1..88}.example.net echo
984
985 To distribute the commands to a list of computers, make a file
986 mycomputers with all the computers:
987
988 server.example.com
989 foo@server2.example.com
990 server3.example.com
991
992 Then run:
993
994 seq 10 | parallel --sshloginfile mycomputers echo
995
996 To include the local computer add the special sshlogin ':' to the list:
997
998 server.example.com
999 foo@server2.example.com
1000 server3.example.com
1001 :
1002
1003 GNU parallel will try to determine the number of CPUs on each of the
1004 remote computers, and run one job per CPU - even if the remote
1005 computers do not have the same number of CPUs.
1006
1007 If the number of CPUs on the remote computers is not identified
1008 correctly the number of CPUs can be added in front. Here the computer
1009 has 8 CPUs.
1010
1011 seq 10 | parallel --sshlogin 8/server.example.com echo
1012
1013 EXAMPLE: Transferring of files
1014 To recompress gzipped files with bzip2 using a remote computer run:
1015
1016 find logs/ -name '*.gz' | \
1017 parallel --sshlogin server.example.com \
1018 --transfer "zcat {} | bzip2 -9 >{.}.bz2"
1019
1020 This will list the .gz-files in the logs directory and all directories
1021 below. Then it will transfer the files to server.example.com to the
1022 corresponding directory in $HOME/logs. On server.example.com the file
1023 will be recompressed using zcat and bzip2 resulting in the
1024 corresponding file with .gz replaced with .bz2.
1025
1026 If you want the resulting bz2-file to be transferred back to the local
1027 computer add --return {.}.bz2:
1028
1029 find logs/ -name '*.gz' | \
1030 parallel --sshlogin server.example.com \
1031 --transfer --return {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1032
1033 After the recompressing is done the .bz2-file is transferred back to
1034 the local computer and put next to the original .gz-file.
1035
1036 If you want to delete the transferred files on the remote computer add
1037 --cleanup. This will remove both the file transferred to the remote
1038 computer and the files transferred from the remote computer:
1039
1040 find logs/ -name '*.gz' | \
1041 parallel --sshlogin server.example.com \
1042 --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"
1043
If you want to run on several computers, add the computers to --sshlogin
1045 either using ',' or multiple --sshlogin:
1046
1047 find logs/ -name '*.gz' | \
1048 parallel --sshlogin server.example.com,server2.example.com \
1049 --sshlogin server3.example.com \
1050 --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"
1051
1052 You can add the local computer using --sshlogin :. This will disable
1053 the removing and transferring for the local computer only:
1054
1055 find logs/ -name '*.gz' | \
1056 parallel --sshlogin server.example.com,server2.example.com \
1057 --sshlogin server3.example.com \
1058 --sshlogin : \
1059 --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"
1060
1061 Often --transfer, --return and --cleanup are used together. They can be
1062 shortened to --trc:
1063
1064 find logs/ -name '*.gz' | \
1065 parallel --sshlogin server.example.com,server2.example.com \
1066 --sshlogin server3.example.com \
1067 --sshlogin : \
1068 --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1069
1070 With the file mycomputers containing the list of computers it becomes:
1071
1072 find logs/ -name '*.gz' | parallel --sshloginfile mycomputers \
1073 --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1074
1075 If the file ~/.parallel/sshloginfile contains the list of computers the
1076 special short hand -S .. can be used:
1077
1078 find logs/ -name '*.gz' | parallel -S .. \
1079 --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1080
1081 EXAMPLE: Advanced file transfer
1082 Assume you have files in in/*, want them processed on server, and
1083 transferred back into /other/dir:
1084
1085 parallel -S server --trc /other/dir/./{/}.out \
1086 cp {/} {/}.out ::: in/./*
1087
1088 EXAMPLE: Distributing work to local and remote computers
1089 Convert *.mp3 to *.ogg running one process per CPU on local computer
1090 and server2:
1091
1092 parallel --trc {.}.ogg -S server2,: \
1093 'mpg321 -w - {} | oggenc -q0 - -o {.}.ogg' ::: *.mp3
1094
1095 EXAMPLE: Running the same command on remote computers
1096 To run the command uptime on remote computers you can do:
1097
1098 parallel --tag --nonall -S server1,server2 uptime
1099
1100 --nonall reads no arguments. If you have a list of jobs you want to run
1101 on each computer you can do:
1102
1103 parallel --tag --onall -S server1,server2 echo ::: 1 2 3
1104
1105 Remove --tag if you do not want the sshlogin added before the output.
1106
1107 If you have a lot of hosts use '-j0' to access more hosts in parallel.
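
For example (assuming the hosts are listed in ~/.parallel/sshloginfile so
the shorthand -S .. applies):

  parallel -j0 --tag --nonall -S .. uptime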
1108
1109 EXAMPLE: Running 'sudo' on remote computers
1110 Put the password into passwordfile then run:
1111
1112 parallel --ssh 'cat passwordfile | ssh' --nonall \
1113 -S user@server1,user@server2 sudo -S ls -l /root
1114
1115 EXAMPLE: Using remote computers behind NAT wall
1116 If the workers are behind a NAT wall, you need some trickery to get to
1117 them.
1118
1119 If you can ssh to a jumphost, and reach the workers from there, then
1120 the obvious solution would be this, but it does not work:
1121
1122 parallel --ssh 'ssh jumphost ssh' -S host1 echo ::: DOES NOT WORK
1123
It does not work because the command is dequoted by ssh twice, whereas
1125 GNU parallel only expects it to be dequoted once.
1126
1127 You can use a bash function and have GNU parallel quote the command:
1128
1129 jumpssh() { ssh -A jumphost ssh $(parallel --shellquote ::: "$@"); }
1130 export -f jumpssh
1131 parallel --ssh jumpssh -S host1 echo ::: this works
1132
1133 Or you can instead put this in ~/.ssh/config:
1134
1135 Host host1 host2 host3
1136 ProxyCommand ssh jumphost.domain nc -w 1 %h 22
1137
1138 It requires nc(netcat) to be installed on jumphost. With this you can
1139 simply:
1140
1141 parallel -S host1,host2,host3 echo ::: This does work
1142
1143 No jumphost, but port forwards
1144
1145 If there is no jumphost but each server has port 22 forwarded from the
1146 firewall (e.g. the firewall's port 22001 = port 22 on host1, 22002 =
1147 host2, 22003 = host3) then you can use ~/.ssh/config:
1148
1149 Host host1.v
1150 Port 22001
1151 Host host2.v
1152 Port 22002
1153 Host host3.v
1154 Port 22003
1155 Host *.v
1156 Hostname firewall
1157
1158 And then use host{1..3}.v as normal hosts:
1159
1160 parallel -S host1.v,host2.v,host3.v echo ::: a b c
1161
1162 No jumphost, no port forwards
1163
1164 If ports cannot be forwarded, you need some sort of VPN to traverse the
NAT-wall. TOR is one option for that, as it is very easy to get
1166 working.
1167
You need to install TOR and set up a hidden service. In torrc put:
1169
1170 HiddenServiceDir /var/lib/tor/hidden_service/
1171 HiddenServicePort 22 127.0.0.1:22
1172
1173 Then start TOR: /etc/init.d/tor restart
1174
1175 The TOR hostname is now in /var/lib/tor/hidden_service/hostname and is
1176 something similar to izjafdceobowklhz.onion. Now you simply prepend
1177 torsocks to ssh:
1178
1179 parallel --ssh 'torsocks ssh' -S izjafdceobowklhz.onion \
1180 -S zfcdaeiojoklbwhz.onion,auclucjzobowklhi.onion echo ::: a b c
1181
1182 If not all hosts are accessible through TOR:
1183
1184 parallel -S 'torsocks ssh izjafdceobowklhz.onion,host2,host3' \
1185 echo ::: a b c
1186
1187 See more ssh tricks on
1188 https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Proxies_and_Jump_Hosts
1189
1190 EXAMPLE: Use sshpass with ssh
1191 If you cannot use passwordless login, you may be able to use sshpass:
1192
1193 seq 10 | parallel -S user-with-password:MyPassword@server echo
1194
1195 or:
1196
1197 export SSHPASS='MyPa$$w0rd'
1198 seq 10 | parallel -S user-with-password:@server echo
1199
1200 EXAMPLE: Use outrun instead of ssh
1201 outrun lets you run a command on a remote server. outrun sets up a
1202 connection to access files at the source server, and automatically
1203 transfers files. outrun must be installed on the remote system.
1204
1205 You can use outrun in an sshlogin this way:
1206
1207 parallel -S 'outrun user@server' command
1208
1209 or:
1210
1211 parallel --ssh outrun -S server command
1212
1213 EXAMPLE: Slurm cluster
1214 The Slurm Workload Manager is used in many clusters.
1215
1216 Here is a simple example of using GNU parallel to call srun:
1217
1218 #!/bin/bash
1219
1220 #SBATCH --time 00:02:00
1221 #SBATCH --ntasks=4
1222 #SBATCH --job-name GnuParallelDemo
1223 #SBATCH --output gnuparallel.out
1224
1225 module purge
1226 module load gnu_parallel
1227
1228 my_parallel="parallel --delay .2 -j $SLURM_NTASKS"
1229 my_srun="srun --export=all --exclusive -n1"
1230 my_srun="$my_srun --cpus-per-task=1 --cpu-bind=cores"
1231 $my_parallel "$my_srun" echo This is job {} ::: {1..20}
1232
1233 EXAMPLE: Parallelizing rsync
1234 rsync is a great tool, but sometimes it will not fill up the available
1235 bandwidth. Running multiple rsync in parallel can fix this.
1236
1237 cd src-dir
1238 find . -type f |
1239 parallel -j10 -X rsync -zR -Ha ./{} fooserver:/dest-dir/
1240
1241 Adjust -j10 until you find the optimal number.
1242
1243 rsync -R will create the needed subdirectories, so all files are not
1244 put into a single dir. The ./ is needed so the resulting command looks
1245 similar to:
1246
1247 rsync -zR ././sub/dir/file fooserver:/dest-dir/
1248
1249 The /./ is what rsync -R works on.
1250
1251 If you are unable to push data, but need to pull them and the files are
1252 called digits.png (e.g. 000000.png) you might be able to do:
1253
1254 seq -w 0 99 | parallel rsync -Havessh fooserver:src/*{}.png destdir/
1255
1256 EXAMPLE: Use multiple inputs in one command
1257 Copy files like foo.es.ext to foo.ext:
1258
1259 ls *.es.* | perl -pe 'print; s/\.es//' | parallel -N2 cp {1} {2}
1260
1261 The perl command spits out 2 lines for each input. GNU parallel takes 2
1262 inputs (using -N2) and replaces {1} and {2} with the inputs.
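
To see the generated commands without running them, --dry-run can be used
(the file names are hypothetical):

  ls *.es.* | perl -pe 'print; s/\.es//' | \
    parallel --dry-run -N2 cp {1} {2}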
1263
1264 Count in binary:
1265
1266 parallel -k echo ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1
1267
1268 Print the number on the opposing sides of a six sided die:
1269
1270 parallel --link -a <(seq 6) -a <(seq 6 -1 1) echo
1271 parallel --link echo :::: <(seq 6) <(seq 6 -1 1)
1272
1273 Convert files from all subdirs to PNG-files with consecutive numbers
1274 (useful for making input PNG's for ffmpeg):
1275
1276 parallel --link -a <(find . -type f | sort) \
1277 -a <(seq $(find . -type f|wc -l)) convert {1} {2}.png
1278
1279 Alternative version:
1280
1281 find . -type f | sort | parallel convert {} {#}.png
1282
1283 EXAMPLE: Use a table as input
1284 Content of table_file.tsv:
1285
1286 foo<TAB>bar
1287 baz <TAB> quux
1288
1289 To run:
1290
1291 cmd -o bar -i foo
1292 cmd -o quux -i baz
1293
1294 you can run:
1295
1296 parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1}
1297
1298 Note: The default for GNU parallel is to remove the spaces around the
1299 columns. To keep the spaces:
1300
1301 parallel -a table_file.tsv --trim n --colsep '\t' cmd -o {2} -i {1}
1302
1303 EXAMPLE: Output to database
1304 GNU parallel can output to a database table and a CSV-file:
1305
1306 dburl=csv:///%2Ftmp%2Fmydir
1307 dbtableurl=$dburl/mytable.csv
1308 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1309
1310 It is rather slow and takes up a lot of CPU time because GNU parallel
1311 parses the whole CSV file for each update.
1312
A better approach is to use an SQLite database and then convert that to
1314 CSV:
1315
1316 dburl=sqlite3:///%2Ftmp%2Fmy.sqlite
1317 dbtableurl=$dburl/mytable
1318 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1319 sql $dburl '.headers on' '.mode csv' 'SELECT * FROM mytable;'
1320
1321 This takes around a second per job.
1322
1323 If you have access to a real database system, such as PostgreSQL, it is
1324 even faster:
1325
1326 dburl=pg://user:pass@host/mydb
1327 dbtableurl=$dburl/mytable
1328 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1329 sql $dburl \
1330 "COPY (SELECT * FROM mytable) TO stdout DELIMITER ',' CSV HEADER;"
1331
1332 Or MySQL:
1333
1334 dburl=mysql://user:pass@host/mydb
1335 dbtableurl=$dburl/mytable
1336 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1337 sql -p -B $dburl "SELECT * FROM mytable;" > mytable.tsv
1338 perl -pe 's/"/""/g; s/\t/","/g; s/^/"/; s/$/"/;
1339 %s=("\\" => "\\", "t" => "\t", "n" => "\n");
1340 s/\\([\\tn])/$s{$1}/g;' mytable.tsv
1341
1342 EXAMPLE: Output to CSV-file for R
1343 If you have no need for the advanced job distribution control that a
1344 database provides, but you simply want output into a CSV file that you
1345 can read into R or LibreCalc, then you can use --results:
1346
1347 parallel --results my.csv seq ::: 10 20 30
1348 R
1349 > mydf <- read.csv("my.csv");
1350 > print(mydf[2,])
1351 > write(as.character(mydf[2,c("Stdout")]),'')
1352
1353 EXAMPLE: Use XML as input
1354 The show Aflyttet on Radio 24syv publishes an RSS feed with their audio
1355 podcasts on: http://arkiv.radio24syv.dk/audiopodcast/channel/4466232
1356
1357 Using xpath you can extract the URLs for 2019 and download them using
1358 GNU parallel:
1359
1360 wget -O - http://arkiv.radio24syv.dk/audiopodcast/channel/4466232 | \
1361 xpath -e "//pubDate[contains(text(),'2019')]/../enclosure/@url" | \
1362 parallel -u wget '{= s/ url="//; s/"//; =}'
1363
1364 EXAMPLE: Run the same command 10 times
1365 If you want to run the same command with the same arguments 10 times in
1366 parallel you can do:
1367
1368 seq 10 | parallel -n0 my_command my_args
1369
1370 EXAMPLE: Working as cat | sh. Resource inexpensive jobs and evaluation
1371 GNU parallel can work similar to cat | sh.
1372
1373 A resource inexpensive job is a job that takes very little CPU, disk
1374 I/O and network I/O. Ping is an example of a resource inexpensive job.
1375 wget is too - if the webpages are small.
1376
1377 The content of the file jobs_to_run:
1378
1379 ping -c 1 10.0.0.1
1380 wget http://example.com/status.cgi?ip=10.0.0.1
1381 ping -c 1 10.0.0.2
1382 wget http://example.com/status.cgi?ip=10.0.0.2
1383 ...
1384 ping -c 1 10.0.0.255
1385 wget http://example.com/status.cgi?ip=10.0.0.255
1386
1387 To run 100 processes simultaneously do:
1388
1389 parallel -j 100 < jobs_to_run
1390
As no command is given, the lines themselves are evaluated by the shell.
1392
1393 EXAMPLE: Call program with FASTA sequence
1394 FASTA files have the format:
1395
1396 >Sequence name1
1397 sequence
1398 sequence continued
1399 >Sequence name2
1400 sequence
1401 sequence continued
1402 more sequence
1403
1404 To call myprog with the sequence as argument run:
1405
1406 cat file.fasta |
1407 parallel --pipe -N1 --recstart '>' --rrs \
1408 'read a; echo Name: "$a"; myprog $(tr -d "\n")'
1409
1410 EXAMPLE: Call program with interleaved FASTQ records
1411 FASTQ files have the format:
1412
1413 @M10991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
1414 CTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAAGG
1415 +
1416 #8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF
1417
1418 Interleaved FASTQ starts with a line like these:
1419
1420 @HWUSI-EAS100R:6:73:941:1973#0/1
1421 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
1422 @EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1
1423
where '/1' and ' 1:' determine that this is read 1.
1425
1426 This will cut big.fq into one chunk per CPU thread and pass it on stdin
1427 (standard input) to the program fastq-reader:
1428
1429 parallel --pipe-part -a big.fq --block -1 --regexp \
1430 --recend '\n' --recstart '@.*(/1| 1:.*)\n[A-Za-z\n\.~]' \
1431 fastq-reader
1432
1433 EXAMPLE: Processing a big file using more CPUs
1434 To process a big file or some output you can use --pipe to split up the
1435 data into blocks and pipe the blocks into the processing program.
1436
1437 If the program is gzip -9 you can do:
1438
1439 cat bigfile | parallel --pipe --recend '' -k gzip -9 > bigfile.gz
1440
1441 This will split bigfile into blocks of 1 MB and pass that to gzip -9 in
1442 parallel. One gzip will be run per CPU. The output of gzip -9 will be
1443 kept in order and saved to bigfile.gz
1444
1445 gzip works fine if the output is appended, but some processing does not
1446 work like that - for example sorting. For this GNU parallel can put the
1447 output of each command into a file. This will sort a big file in
1448 parallel:
1449
1450 cat bigfile | parallel --pipe --files sort |\
1451 parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
1452
1453 Here bigfile is split into blocks of around 1MB, each block ending in
1454 '\n' (which is the default for --recend). Each block is passed to sort
1455 and the output from sort is saved into files. These files are passed to
1456 the second parallel that runs sort -m on the files before it removes
1457 the files. The output is saved to bigfile.sort.
1458
1459 GNU parallel's --pipe maxes out at around 100 MB/s because every byte
1460 has to be copied through GNU parallel. But if bigfile is a real
1461 (seekable) file GNU parallel can by-pass the copying and send the parts
1462 directly to the program:
1463
1464 parallel --pipe-part --block 100m -a bigfile --files sort |\
1465 parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
1466
1467 EXAMPLE: Grouping input lines
1468 When processing with --pipe you may have lines grouped by a value. Here
1469 is my.csv:
1470
1471 Transaction Customer Item
1472 1 a 53
1473 2 b 65
1474 3 b 82
1475 4 c 96
1476 5 c 67
1477 6 c 13
1478 7 d 90
1479 8 d 43
1480 9 d 91
1481 10 d 84
1482 11 e 72
1483 12 e 102
1484 13 e 63
1485 14 e 56
1486 15 e 74
1487
1488 Let us assume you want GNU parallel to process each customer. In other
1489 words: You want all the transactions for a single customer to be
1490 treated as a single record.
1491
1492 To do this we preprocess the data with a program that inserts a record
1493 separator before each customer (column 2 = $F[1]). Here we first make a
1494 50 character random string, which we then use as the separator:
1495
1496 sep=`perl -e 'print map { ("a".."z","A".."Z")[rand(52)] } (1..50);'`
1497 cat my.csv | \
1498 perl -ape '$F[1] ne $l and print "'$sep'"; $l = $F[1]' | \
1499 parallel --recend $sep --rrs --pipe -N1 wc
1500
If your program can process multiple customers, replace -N1 with a
reasonable --blocksize.
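
A sketch of that variant (the 50k block size is only an illustrative value;
records are still kept whole because of --recend):

  cat my.csv | \
    perl -ape '$F[1] ne $l and print "'$sep'"; $l = $F[1]' | \
    parallel --recend $sep --rrs --pipe --block 50k wc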
1503
1504 EXAMPLE: Running more than 250 jobs workaround
If you need to run a massive number of jobs in parallel, then you will
1506 likely hit the filehandle limit which is often around 250 jobs. If you
1507 are super user you can raise the limit in /etc/security/limits.conf but
1508 you can also use this workaround. The filehandle limit is per process.
1509 That means that if you just spawn more GNU parallels then each of them
1510 can run 250 jobs. This will spawn up to 2500 jobs:
1511
1512 cat myinput |\
1513 parallel --pipe -N 50 --round-robin -j50 parallel -j50 your_prg
1514
1515 This will spawn up to 62500 jobs (use with caution - you need 64 GB RAM
1516 to do this, and you may need to increase /proc/sys/kernel/pid_max):
1517
1518 cat myinput |\
1519 parallel --pipe -N 250 --round-robin -j250 parallel -j250 your_prg
1520
1521 EXAMPLE: Working as mutex and counting semaphore
1522 The command sem is an alias for parallel --semaphore.
1523
1524 A counting semaphore will allow a given number of jobs to be started in
the background. When that number of jobs is running in the background,
GNU sem will wait for one of them to complete before starting another
1527 command. sem --wait will wait for all jobs to complete.
1528
1529 Run 10 jobs concurrently in the background:
1530
1531 for i in *.log ; do
1532 echo $i
1533 sem -j10 gzip $i ";" echo done
1534 done
1535 sem --wait
1536
1537 A mutex is a counting semaphore allowing only one job to run. This will
edit the file myfile and prepend lines with the numbers 1 to 3 to the
top of the file.
1540
1541 seq 3 | parallel sem sed -i -e '1i{}' myfile
1542
As myfile can be very big it is important that only one process edits
the file at a time.
1545
1546 Name the semaphore to have multiple different semaphores active at the
1547 same time:
1548
1549 seq 3 | parallel sem --id mymutex sed -i -e '1i{}' myfile
1550
1551 EXAMPLE: Mutex for a script
1552 Assume a script is called from cron or from a web service, but only one
1553 instance can be run at a time. With sem and --shebang-wrap the script
1554 can be made to wait for other instances to finish. Here in bash:
1555
1556 #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /bin/bash
1557
1558 echo This will run
1559 sleep 5
1560 echo exclusively
1561
1562 Here perl:
1563
1564 #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/perl
1565
1566 print "This will run ";
1567 sleep 5;
1568 print "exclusively\n";
1569
1570 Here python:
1571
1572 #!/usr/local/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/python
1573
1574 import time
1575 print "This will run ";
1576 time.sleep(5)
1577 print "exclusively";
1578
1579 EXAMPLE: Start editor with file names from stdin (standard input)
1580 You can use GNU parallel to start interactive programs like emacs or
1581 vi:
1582
1583 cat filelist | parallel --tty -X emacs
1584 cat filelist | parallel --tty -X vi
1585
1586 If there are more files than will fit on a single command line, the
1587 editor will be started again with the remaining files.
1588
1589 EXAMPLE: Running sudo
1590 sudo requires a password to run a command as root. It caches the
1591 access, so you only need to enter the password again if you have not
1592 used sudo for a while.
1593
1594 The command:
1595
1596 parallel sudo echo ::: This is a bad idea
1597
1598 is no good, as you would be prompted for the sudo password for each of
1599 the jobs. Instead do:
1600
1601 sudo parallel echo ::: This is a good idea
1602
1603 This way you only have to enter the sudo password once.
1604
1605 EXAMPLE: Run ping in parallel
1606 ping prints out statistics when killed with CTRL-C.
1607
1608 Unfortunately, CTRL-C will also normally kill GNU parallel.
1609
1610 But by using --open-tty and ignoring SIGINT you can get the wanted
1611 effect:
1612
1613 parallel -j0 --open-tty --lb --tag ping '{= $SIG{INT}=sub {} =}' \
1614 ::: 1.1.1.1 8.8.8.8 9.9.9.9 21.21.21.21 80.80.80.80 88.88.88.88
1615
1616 --open-tty will make the pings receive SIGINT (from CTRL-C). CTRL-C
1617 will not kill GNU parallel, so that will only exit after ping is done.
1618
1619 EXAMPLE: GNU Parallel as queue system/batch manager
1620 GNU parallel can work as a simple job queue system or batch manager.
1621 The idea is to put the jobs into a file and have GNU parallel read from
1622 that continuously. As GNU parallel will stop at end of file we use tail
1623 to continue reading:
1624
1625 true >jobqueue; tail -n+0 -f jobqueue | parallel
1626
1627 To submit your jobs to the queue:
1628
1629 echo my_command my_arg >> jobqueue
1630
1631 You can of course use -S to distribute the jobs to remote computers:
1632
1633 true >jobqueue; tail -n+0 -f jobqueue | parallel -S ..
1634
Output will only be printed when the next input is read after a job has
finished: so you need to submit a job after the first has finished to
see the output from the first job.
1638
1639 If you keep this running for a long time, jobqueue will grow. A way of
1640 removing the jobs already run is by making GNU parallel stop when it
1641 hits a special value and then restart. To use --eof to make GNU
1642 parallel exit, tail also needs to be forced to exit:
1643
1644 true >jobqueue;
1645 while true; do
1646 tail -n+0 -f jobqueue |
1647 (parallel -E StOpHeRe -S ..; echo GNU Parallel is now done;
1648 perl -e 'while(<>){/StOpHeRe/ and last};print <>' jobqueue > j2;
1649 (seq 1000 >> jobqueue &);
1650 echo Done appending dummy data forcing tail to exit)
1651 echo tail exited;
1652 mv j2 jobqueue
1653 done
1654
1655 In some cases you can run on more CPUs and computers during the night:
1656
1657 # Day time
1658 echo 50% > jobfile
1659 cp day_server_list ~/.parallel/sshloginfile
1660 # Night time
1661 echo 100% > jobfile
1662 cp night_server_list ~/.parallel/sshloginfile
1663 tail -n+0 -f jobqueue | parallel --jobs jobfile -S ..
1664
1665 GNU parallel discovers if jobfile or ~/.parallel/sshloginfile changes.
1666
1667 EXAMPLE: GNU Parallel as dir processor
If you have a dir in which users drop files that need to be processed
you can do this on GNU/Linux (if you know what inotifywait is called on
other platforms, file a bug report):
1671
1672 inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\
1673 parallel -u echo
1674
1675 This will run the command echo on each file put into my_dir or subdirs
1676 of my_dir.
1677
1678 You can of course use -S to distribute the jobs to remote computers:
1679
1680 inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\
1681 parallel -S .. -u echo
1682
1683 If the files to be processed are in a tar file then unpacking one file
1684 and processing it immediately may be faster than first unpacking all
1685 files. Set up the dir processor as above and unpack into the dir.
1686
1687 Using GNU parallel as dir processor has the same limitations as using
1688 GNU parallel as queue system/batch manager.
1689
1690 EXAMPLE: Locate the missing package
1691 If you have downloaded source and tried compiling it, you may have
1692 seen:
1693
1694 $ ./configure
1695 [...]
1696 checking for something.h... no
1697 configure: error: "libsomething not found"
1698
1699 Often it is not obvious which package you should install to get that
1700 file. Debian has `apt-file` to search for a file. `tracefile` from
1701 https://gitlab.com/ole.tange/tangetools can tell which files a program
1702 tried to access. In this case we are interested in one of the last
1703 files:
1704
1705 $ tracefile -un ./configure | tail | parallel -j0 apt-file search
1706
AUTHOR
When using GNU parallel for a publication please cite:
1709
1710 O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login:
1711 The USENIX Magazine, February 2011:42-47.
1712
1713 This helps funding further development; and it won't cost you a cent.
1714 If you pay 10000 EUR you should feel free to use GNU Parallel without
1715 citing.
1716
1717 Copyright (C) 2007-10-18 Ole Tange, http://ole.tange.dk
1718
1719 Copyright (C) 2008-2010 Ole Tange, http://ole.tange.dk
1720
1721 Copyright (C) 2010-2023 Ole Tange, http://ole.tange.dk and Free
1722 Software Foundation, Inc.
1723
Parts of the manual concerning xargs compatibility are inspired by the
1725 manual of xargs from GNU findutils 4.4.2.
1726
LICENSE
This program is free software; you can redistribute it and/or modify it
1729 under the terms of the GNU General Public License as published by the
1730 Free Software Foundation; either version 3 of the License, or at your
1731 option any later version.
1732
1733 This program is distributed in the hope that it will be useful, but
1734 WITHOUT ANY WARRANTY; without even the implied warranty of
1735 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
1736 General Public License for more details.
1737
1738 You should have received a copy of the GNU General Public License along
1739 with this program. If not, see <https://www.gnu.org/licenses/>.
1740
1741 Documentation license I
1742 Permission is granted to copy, distribute and/or modify this
1743 documentation under the terms of the GNU Free Documentation License,
1744 Version 1.3 or any later version published by the Free Software
1745 Foundation; with no Invariant Sections, with no Front-Cover Texts, and
1746 with no Back-Cover Texts. A copy of the license is included in the
1747 file LICENSES/GFDL-1.3-or-later.txt.
1748
1749 Documentation license II
1750 You are free:
1751
1752 to Share to copy, distribute and transmit the work
1753
1754 to Remix to adapt the work
1755
1756 Under the following conditions:
1757
1758 Attribution
1759 You must attribute the work in the manner specified by the
1760 author or licensor (but not in any way that suggests that they
1761 endorse you or your use of the work).
1762
1763 Share Alike
1764 If you alter, transform, or build upon this work, you may
1765 distribute the resulting work only under the same, similar or
1766 a compatible license.
1767
1768 With the understanding that:
1769
1770 Waiver Any of the above conditions can be waived if you get
1771 permission from the copyright holder.
1772
1773 Public Domain
1774 Where the work or any of its elements is in the public domain
1775 under applicable law, that status is in no way affected by the
1776 license.
1777
1778 Other Rights
1779 In no way are any of the following rights affected by the
1780 license:
1781
1782 • Your fair dealing or fair use rights, or other applicable
1783 copyright exceptions and limitations;
1784
1785 • The author's moral rights;
1786
1787 • Rights other persons may have either in the work itself or
1788 in how the work is used, such as publicity or privacy
1789 rights.
1790
1791 Notice For any reuse or distribution, you must make clear to others
1792 the license terms of this work.
1793
1794 A copy of the full license is included in the file as
1795 LICENCES/CC-BY-SA-4.0.txt
1796
SEE ALSO
parallel(1), parallel_tutorial(7), env_parallel(1), parset(1),
1799 parsort(1), parallel_alternatives(7), parallel_design(7), niceload(1),
1800 sql(1), ssh(1), ssh-agent(1), sshpass(1), ssh-copy-id(1), rsync(1)
1801
1802
1803
20230722                        2023-07-28                PARALLEL_EXAMPLES(7)