PARALLEL_EXAMPLES(7)                 parallel                PARALLEL_EXAMPLES(7)
2
3
4
6 EXAMPLE: Working as xargs -n1. Argument appending
GNU parallel can work similarly to xargs -n1.
8
9 To compress all html files using gzip run:
10
11 find . -name '*.html' | parallel gzip --best
12
13 If the file names may contain a newline use -0. Substitute FOO BAR with
14 FUBAR in all files in this dir and subdirs:
15
16 find . -type f -print0 | \
17 parallel -q0 perl -i -pe 's/FOO BAR/FUBAR/g'
18
19 Note -q is needed because of the space in 'FOO BAR'.
20
21 EXAMPLE: Simple network scanner
22 prips can generate IP-addresses from CIDR notation. With GNU parallel
23 you can build a simple network scanner to see which addresses respond
24 to ping:
25
26 prips 130.229.16.0/20 | \
27 parallel --timeout 2 -j0 \
28 'ping -c 1 {} >/dev/null && echo {}' 2>/dev/null
29
30 EXAMPLE: Reading arguments from command line
GNU parallel can take the arguments from the command line instead of stdin
32 (standard input). To compress all html files in the current dir using
33 gzip run:
34
35 parallel gzip --best ::: *.html
36
37 To convert *.wav to *.mp3 using LAME running one process per CPU run:
38
39 parallel lame {} -o {.}.mp3 ::: *.wav
40
41 EXAMPLE: Inserting multiple arguments
42 When moving a lot of files like this: mv *.log destdir you will
43 sometimes get the error:
44
45 bash: /bin/mv: Argument list too long
46
47 because there are too many files. You can instead do:
48
49 ls | grep -E '\.log$' | parallel mv {} destdir
50
This will run mv for each file. It can be done faster if mv gets as
many arguments as will fit on the line:
53
54 ls | grep -E '\.log$' | parallel -m mv {} destdir
55
56 In many shells you can also use printf:
57
58 printf '%s\0' *.log | parallel -0 -m mv {} destdir
59
60 EXAMPLE: Context replace
61 To remove the files pict0000.jpg .. pict9999.jpg you could do:
62
63 seq -w 0 9999 | parallel rm pict{}.jpg
64
65 You could also do:
66
67 seq -w 0 9999 | perl -pe 's/(.*)/pict$1.jpg/' | parallel -m rm
68
The first will run rm 10000 times, while the last will only run rm as
many times as needed to keep the command line length short enough to
avoid Argument list too long (it typically runs 1-2 times).
72
73 You could also run:
74
75 seq -w 0 9999 | parallel -X rm pict{}.jpg
76
This will also only run rm as many times as needed to keep the command
line length short enough.
79
80 EXAMPLE: Compute intensive jobs and substitution
81 If ImageMagick is installed this will generate a thumbnail of a jpg
82 file:
83
84 convert -geometry 120 foo.jpg thumb_foo.jpg
85
86 This will run with number-of-cpus jobs in parallel for all jpg files in
87 a directory:
88
89 ls *.jpg | parallel convert -geometry 120 {} thumb_{}
90
91 To do it recursively use find:
92
93 find . -name '*.jpg' | \
94 parallel convert -geometry 120 {} {}_thumb.jpg
95
Notice how the argument has to start with {} as {} will include the
path (e.g. running convert -geometry 120 ./foo/bar.jpg thumb_./foo/bar.jpg
would clearly be wrong). The command will generate files like
./foo/bar.jpg_thumb.jpg.
100
101 Use {.} to avoid the extra .jpg in the file name. This command will
102 make files like ./foo/bar_thumb.jpg:
103
104 find . -name '*.jpg' | \
105 parallel convert -geometry 120 {} {.}_thumb.jpg
106
107 EXAMPLE: Substitution and redirection
108 This will generate an uncompressed version of .gz-files next to the
109 .gz-file:
110
111 parallel zcat {} ">"{.} ::: *.gz
112
113 Quoting of > is necessary to postpone the redirection. Another solution
114 is to quote the whole command:
115
116 parallel "zcat {} >{.}" ::: *.gz
117
118 Other special shell characters (such as * ; $ > < | >> <<) also need
119 to be put in quotes, as they may otherwise be interpreted by the shell
120 and not given to GNU parallel.
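
For example, the quoted '|' and '>' below are executed by the shell
that runs each job, not by the shell that starts GNU parallel (a
minimal sketch; *.txt is a placeholder for your own files):

  parallel 'cat {} | wc -l > {}.count' ::: *.txt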
121
122 EXAMPLE: Composed commands
123 A job can consist of several commands. This will print the number of
124 files in each directory:
125
126 ls | parallel 'echo -n {}" "; ls {}|wc -l'
127
128 To put the output in a file called <name>.dir:
129
130 ls | parallel '(echo -n {}" "; ls {}|wc -l) >{}.dir'
131
132 Even small shell scripts can be run by GNU parallel:
133
134 find . | parallel 'a={}; name=${a##*/};' \
135 'upper=$(echo "$name" | tr "[:lower:]" "[:upper:]");'\
136 'echo "$name - $upper"'
137
138 ls | parallel 'mv {} "$(echo {} | tr "[:upper:]" "[:lower:]")"'
139
140 Given a list of URLs, list all URLs that fail to download. Print the
141 line number and the URL.
142
143 cat urlfile | parallel "wget {} 2>/dev/null || grep -n {} urlfile"
144
145 Create a mirror directory with the same filenames except all files and
146 symlinks are empty files.
147
148 cp -rs /the/source/dir mirror_dir
149 find mirror_dir -type l | parallel -m rm {} '&&' touch {}
150
Find the files in a list that do not exist:
152
153 cat file_list | parallel 'if [ ! -e {} ] ; then echo {}; fi'
154
155 EXAMPLE: Composed command with perl replacement string
You have a bunch of files. You want them sorted into dirs. The dir of
each file should be named after the first letter of the file name.
158
159 parallel 'mkdir -p {=s/(.).*/$1/=}; mv {} {=s/(.).*/$1/=}' ::: *
160
161 EXAMPLE: Composed command with multiple input sources
162 You have a dir with files named as 24 hours in 5 minute intervals:
00:00, 00:05, 00:10 .. 23:55. You want to find the missing files:
164
165 parallel [ -f {1}:{2} ] "||" echo {1}:{2} does not exist \
166 ::: {00..23} ::: {00..55..5}
167
168 EXAMPLE: Calling Bash functions
169 If the composed command is longer than a line, it becomes hard to read.
170 In Bash you can use functions. Just remember to export -f the function.
171
172 doit() {
173 echo Doing it for $1
174 sleep 2
175 echo Done with $1
176 }
177 export -f doit
178 parallel doit ::: 1 2 3
179
180 doubleit() {
181 echo Doing it for $1 $2
182 sleep 2
183 echo Done with $1 $2
184 }
185 export -f doubleit
186 parallel doubleit ::: 1 2 3 ::: a b
187
188 To do this on remote servers you need to transfer the function using
189 --env:
190
191 parallel --env doit -S server doit ::: 1 2 3
192 parallel --env doubleit -S server doubleit ::: 1 2 3 ::: a b
193
194 If your environment (aliases, variables, and functions) is small you
195 can copy the full environment without having to export -f anything. See
196 env_parallel.
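
A minimal sketch of that approach (assuming env_parallel is installed
for your shell and sourced first; bash shown):

  # Source once per shell (other shells have their own env_parallel file)
  . $(which env_parallel.bash)
  myvar="copied without export"
  myfunc() { echo "$myvar: $1"; }
  # env_parallel copies aliases, variables, and functions automatically
  env_parallel myfunc ::: 1 2 3
  env_parallel -S server myfunc ::: 1 2 3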
197
198 EXAMPLE: Function tester
199 To test a program with different parameters:
200
201 tester() {
202 if (eval "$@") >&/dev/null; then
203 perl -e 'printf "\033[30;102m[ OK ]\033[0m @ARGV\n"' "$@"
204 else
205 perl -e 'printf "\033[30;101m[FAIL]\033[0m @ARGV\n"' "$@"
206 fi
207 }
208 export -f tester
209 parallel tester my_program ::: arg1 arg2
210 parallel tester exit ::: 1 0 2 0
211
212 If my_program fails a red FAIL will be printed followed by the failing
213 command; otherwise a green OK will be printed followed by the command.
214
EXAMPLE: Continuously show the latest line of output
216 It can be useful to monitor the output of running jobs.
217
This shows the most recent output line until a job finishes, after
which the output of the job is printed in full:
220
221 parallel '{} | tee >(cat >&3)' ::: 'command 1' 'command 2' \
222 3> >(perl -ne '$|=1;chomp;printf"%.'$COLUMNS's\r",$_." "x100')
223
224 EXAMPLE: Log rotate
225 Log rotation renames a logfile to an extension with a higher number:
226 log.1 becomes log.2, log.2 becomes log.3, and so on. The oldest log is
227 removed. To avoid overwriting files the process starts backwards from
228 the high number to the low number. This will keep 10 old versions of
229 the log:
230
231 seq 9 -1 1 | parallel -j1 mv log.{} log.'{= $_++ =}'
232 mv log log.1
233
234 EXAMPLE: Removing file extension when processing files
235 When processing files removing the file extension using {.} is often
236 useful.
237
238 Create a directory for each zip-file and unzip it in that dir:
239
240 parallel 'mkdir {.}; cd {.}; unzip ../{}' ::: *.zip
241
242 Recompress all .gz files in current directory using bzip2 running 1 job
243 per CPU in parallel:
244
245 parallel "zcat {} | bzip2 >{.}.bz2 && rm {}" ::: *.gz
246
247 Convert all WAV files to MP3 using LAME:
248
249 find sounddir -type f -name '*.wav' | parallel lame {} -o {.}.mp3
250
Put all converted files in the same directory:
252
253 find sounddir -type f -name '*.wav' | \
254 parallel lame {} -o mydir/{/.}.mp3
255
256 EXAMPLE: Removing strings from the argument
If you have a directory with tar.gz files and want them extracted in
the corresponding dir (e.g. foo.tar.gz will be extracted in the dir
foo) you can do:
260
261 parallel --plus 'mkdir {..}; tar -C {..} -xf {}' ::: *.tar.gz
262
263 If you want to remove a different ending, you can use {%string}:
264
265 parallel --plus echo {%_demo} ::: mycode_demo keep_demo_here
266
You can also remove a starting string with {#string}:
268
269 parallel --plus echo {#demo_} ::: demo_mycode keep_demo_here
270
271 To remove a string anywhere you can use regular expressions with
272 {/regexp/replacement} and leave the replacement empty:
273
274 parallel --plus echo {/demo_/} ::: demo_mycode remove_demo_here
275
276 EXAMPLE: Download 24 images for each of the past 30 days
277 Let us assume a website stores images like:
278
279 https://www.example.com/path/to/YYYYMMDD_##.jpg
280
281 where YYYYMMDD is the date and ## is the number 01-24. This will
282 download images for the past 30 days:
283
284 getit() {
285 date=$(date -d "today -$1 days" +%Y%m%d)
286 num=$2
287 echo wget https://www.example.com/path/to/${date}_${num}.jpg
288 }
289 export -f getit
290
291 parallel getit ::: $(seq 30) ::: $(seq -w 24)
292
293 $(date -d "today -$1 days" +%Y%m%d) will give the dates in YYYYMMDD
294 with $1 days subtracted.
295
296 EXAMPLE: Download world map from NASA
297 NASA provides tiles to download on earthdata.nasa.gov. Download tiles
298 for Blue Marble world map and create a 10240x20480 map.
299
300 base=https://map1a.vis.earthdata.nasa.gov/wmts-geo/wmts.cgi
301 service="SERVICE=WMTS&REQUEST=GetTile&VERSION=1.0.0"
302 layer="LAYER=BlueMarble_ShadedRelief_Bathymetry"
303 set="STYLE=&TILEMATRIXSET=EPSG4326_500m&TILEMATRIX=5"
304 tile="TILEROW={1}&TILECOL={2}"
305 format="FORMAT=image%2Fjpeg"
306 url="$base?$service&$layer&$set&$tile&$format"
307
308 parallel -j0 -q wget "$url" -O {1}_{2}.jpg ::: {0..19} ::: {0..39}
309 parallel eval convert +append {}_{0..39}.jpg line{}.jpg ::: {0..19}
310 convert -append line{0..19}.jpg world.jpg
311
312 EXAMPLE: Download Apollo-11 images from NASA using jq
Search NASA using their API to get JSON for images that are related
to 'apollo 11' and have 'moon landing' in the description.
315
316 The search query returns JSON containing URLs to JSON containing
317 collections of pictures. One of the pictures in each of these
collections is large.
319
320 wget is used to get the JSON for the search query. jq is then used to
321 extract the URLs of the collections. parallel then calls wget to get
322 each collection, which is passed to jq to extract the URLs of all
images. grep selects the large images, and parallel finally uses
324 wget to fetch the images.
325
326 base="https://images-api.nasa.gov/search"
327 q="q=apollo 11"
328 description="description=moon landing"
329 media_type="media_type=image"
330 wget -O - "$base?$q&$description&$media_type" |
331 jq -r .collection.items[].href |
332 parallel wget -O - |
333 jq -r .[] |
334 grep large |
335 parallel wget
336
337 EXAMPLE: Download video playlist in parallel
youtube-dl is an excellent tool to download videos. It cannot,
however, download videos in parallel. This takes a playlist and
downloads 10 videos in parallel.
341
342 url='youtu.be/watch?v=0wOf2Fgi3DE&list=UU_cznB5YZZmvAmeq7Y3EriQ'
343 export url
344 youtube-dl --flat-playlist "https://$url" |
345 parallel --tagstring {#} --lb -j10 \
346 youtube-dl --playlist-start {#} --playlist-end {#} '"https://$url"'
347
348 EXAMPLE: Prepend last modified date (ISO8601) to file name
349 parallel mv {} '{= $a=pQ($_); $b=$_;' \
350 '$_=qx{date -r "$a" +%FT%T}; chomp; $_="$_ $b" =}' ::: *
351
352 {= and =} mark a perl expression. pQ perl-quotes the string. date
353 +%FT%T is the date in ISO8601 with time.
354
355 EXAMPLE: Save output in ISO8601 dirs
356 Save output from ps aux every second into dirs named
357 yyyy-mm-ddThh:mm:ss+zz:zz.
358
359 seq 1000 | parallel -N0 -j1 --delay 1 \
360 --results '{= $_=`date -Isec`; chomp=}/' ps aux
361
362 EXAMPLE: Digital clock with "blinking" :
The : in a digital clock blinks. To make every other line have a ':'
and the rest a ' ', a perl expression is used to look at the 3rd input
source. If the value modulo 2 is 1, use ":", otherwise use " ":
366
367 parallel -k echo {1}'{=3 $_=$_%2?":":" "=}'{2}{3} \
368 ::: {0..12} ::: {0..5} ::: {0..9}
369
370 EXAMPLE: Aggregating content of files
371 This:
372
373 parallel --header : echo x{X}y{Y}z{Z} \> x{X}y{Y}z{Z} \
374 ::: X {1..5} ::: Y {01..10} ::: Z {1..5}
375
376 will generate the files x1y01z1 .. x5y10z5. If you want to aggregate
377 the output grouping on x and z you can do this:
378
379 parallel eval 'cat {=s/y01/y*/=} > {=s/y01//=}' ::: *y01*
380
381 For all values of x and z it runs commands like:
382
383 cat x1y*z1 > x1z1
384
385 So you end up with x1z1 .. x5z5 each containing the content of all
386 values of y.
387
388 EXAMPLE: Breadth first parallel web crawler/mirrorer
The script below will crawl and mirror a URL in parallel. It first
downloads the pages that are 1 click down, then 2 clicks down, then
3; instead of the normal depth first, where the first link on each
page is fetched first.
393
394 Run like this:
395
396 PARALLEL=-j100 ./parallel-crawl http://gatt.org.yeslab.org/
397
398 Remove the wget part if you only want a web crawler.
399
400 It works by fetching a page from a list of URLs and looking for links
401 in that page that are within the same starting URL and that have not
402 already been seen. These links are added to a new queue. When all the
pages from the list are done, the new queue is moved to the list of URLs
404 and the process is started over until no unseen links are found.
405
406 #!/bin/bash
407
408 # E.g. http://gatt.org.yeslab.org/
409 URL=$1
410 # Stay inside the start dir
411 BASEURL=$(echo $URL | perl -pe 's:#.*::; s:(//.*/)[^/]*:$1:')
412 URLLIST=$(mktemp urllist.XXXX)
413 URLLIST2=$(mktemp urllist.XXXX)
414 SEEN=$(mktemp seen.XXXX)
415
416 # Spider to get the URLs
417 echo $URL >$URLLIST
418 cp $URLLIST $SEEN
419
420 while [ -s $URLLIST ] ; do
421 cat $URLLIST |
422 parallel lynx -listonly -image_links -dump {} \; \
423 wget -qm -l1 -Q1 {} \; echo Spidered: {} \>\&2 |
424 perl -ne 's/#.*//; s/\s+\d+.\s(\S+)$/$1/ and
425 do { $seen{$1}++ or print }' |
426 grep -F $BASEURL |
427 grep -v -x -F -f $SEEN | tee -a $SEEN > $URLLIST2
428 mv $URLLIST2 $URLLIST
429 done
430
431 rm -f $URLLIST $URLLIST2 $SEEN
432
433 EXAMPLE: Process files from a tar file while unpacking
434 If the files to be processed are in a tar file then unpacking one file
435 and processing it immediately may be faster than first unpacking all
436 files.
437
438 tar xvf foo.tgz | perl -ne 'print $l;$l=$_;END{print $l}' | \
439 parallel echo
440
441 The Perl one-liner is needed to make sure the file is complete before
442 handing it to GNU parallel.
443
444 EXAMPLE: Rewriting a for-loop and a while-read-loop
445 for-loops like this:
446
447 (for x in `cat list` ; do
448 do_something $x
449 done) | process_output
450
451 and while-read-loops like this:
452
453 cat list | (while read x ; do
454 do_something $x
455 done) | process_output
456
457 can be written like this:
458
459 cat list | parallel do_something | process_output
460
For example: Find which host name in a list has IP address 1.2.3.4:
462
463 cat hosts.txt | parallel -P 100 host | grep 1.2.3.4
464
If the processing requires more steps, a for-loop like this:
466
467 (for x in `cat list` ; do
468 no_extension=${x%.*};
469 do_step1 $x scale $no_extension.jpg
470 do_step2 <$x $no_extension
471 done) | process_output
472
473 and while-loops like this:
474
475 cat list | (while read x ; do
476 no_extension=${x%.*};
477 do_step1 $x scale $no_extension.jpg
478 do_step2 <$x $no_extension
479 done) | process_output
480
481 can be written like this:
482
483 cat list | parallel "do_step1 {} scale {.}.jpg ; do_step2 <{} {.}" |\
484 process_output
485
486 If the body of the loop is bigger, it improves readability to use a
487 function:
488
489 (for x in `cat list` ; do
490 do_something $x
491 [... 100 lines that do something with $x ...]
492 done) | process_output
493
494 cat list | (while read x ; do
495 do_something $x
496 [... 100 lines that do something with $x ...]
497 done) | process_output
498
499 can both be rewritten as:
500
501 doit() {
502 x=$1
503 do_something $x
504 [... 100 lines that do something with $x ...]
505 }
506 export -f doit
507 cat list | parallel doit
508
509 EXAMPLE: Rewriting nested for-loops
510 Nested for-loops like this:
511
512 (for x in `cat xlist` ; do
513 for y in `cat ylist` ; do
514 do_something $x $y
515 done
516 done) | process_output
517
518 can be written like this:
519
520 parallel do_something {1} {2} :::: xlist ylist | process_output
521
522 Nested for-loops like this:
523
524 (for colour in red green blue ; do
525 for size in S M L XL XXL ; do
526 echo $colour $size
527 done
528 done) | sort
529
530 can be written like this:
531
532 parallel echo {1} {2} ::: red green blue ::: S M L XL XXL | sort
533
534 EXAMPLE: Finding the lowest difference between files
535 diff is good for finding differences in text files. diff | wc -l gives
536 an indication of the size of the difference. To find the differences
537 between all files in the current dir do:
538
539 parallel --tag 'diff {1} {2} | wc -l' ::: * ::: * | sort -nk3
540
541 This way it is possible to see if some files are closer to other files.
542
543 EXAMPLE: for-loops with column names
When doing multiple nested for-loops it can be easier to keep track
of the loop variable if it is named instead of just having a number.
Use --header : to let the first argument be a named alias for the
positional replacement string:
548
549 parallel --header : echo {colour} {size} \
550 ::: colour red green blue ::: size S M L XL XXL
551
552 This also works if the input file is a file with columns:
553
554 cat addressbook.tsv | \
555 parallel --colsep '\t' --header : echo {Name} {E-mail address}
556
557 EXAMPLE: All combinations in a list
558 GNU parallel makes all combinations when given two lists.
559
560 To make all combinations in a single list with unique values, you
561 repeat the list and use replacement string {choose_k}:
562
563 parallel --plus echo {choose_k} ::: A B C D ::: A B C D
564
565 parallel --plus echo 2{2choose_k} 1{1choose_k} ::: A B C D ::: A B C D
566
567 {choose_k} works for any number of input sources:
568
569 parallel --plus echo {choose_k} ::: A B C D ::: A B C D ::: A B C D
570
571 Where {choose_k} does not care about order, {uniq} cares about order.
572 It simply skips jobs where values from different input sources are the
573 same:
574
575 parallel --plus echo {uniq} ::: A B C ::: A B C ::: A B C
576 parallel --plus echo {1uniq}+{2uniq}+{3uniq} ::: A B C ::: A B C ::: A B C
577
578 EXAMPLE: From a to b and b to c
579 Assume you have input like:
580
581 aardvark
582 babble
583 cab
584 dab
585 each
586
587 and want to run combinations like:
588
589 aardvark babble
590 babble cab
591 cab dab
592 dab each
593
594 If the input is in the file in.txt:
595
596 parallel echo {1} - {2} ::::+ <(head -n -1 in.txt) <(tail -n +2 in.txt)
597
598 If the input is in the array $a here are two solutions:
599
600 seq $((${#a[@]}-1)) | \
601 env_parallel --env a echo '${a[{=$_--=}]} - ${a[{}]}'
602 parallel echo {1} - {2} ::: "${a[@]::${#a[@]}-1}" :::+ "${a[@]:1}"
603
604 EXAMPLE: Count the differences between all files in a dir
605 Using --results the results are saved in /tmp/diffcount*.
606
607 parallel --results /tmp/diffcount "diff -U 0 {1} {2} | \
608 tail -n +3 |grep -v '^@'|wc -l" ::: * ::: *
609
610 To see the difference between file A and file B look at the file
611 '/tmp/diffcount/1/A/2/B'.
612
613 EXAMPLE: Speeding up fast jobs
614 Starting a job on the local machine takes around 3-10 ms. This can be a
615 big overhead if the job takes very few ms to run. Often you can group
616 small jobs together using -X which will make the overhead less
617 significant. Compare the speed of these:
618
619 seq -w 0 9999 | parallel touch pict{}.jpg
620 seq -w 0 9999 | parallel -X touch pict{}.jpg
621
622 If your program cannot take multiple arguments, then you can use GNU
623 parallel to spawn multiple GNU parallels:
624
625 seq -w 0 9999999 | \
626 parallel -j10 -q -I,, --pipe parallel -j0 touch pict{}.jpg
627
628 If -j0 normally spawns 252 jobs, then the above will try to spawn 2520
629 jobs. On a normal GNU/Linux system you can spawn 32000 jobs using this
630 technique with no problems. To raise the 32000 jobs limit raise
631 /proc/sys/kernel/pid_max to 4194303.
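
To raise it you can, as root, write the new value (a sketch; the
setting lasts until the next reboot):

  echo 4194303 > /proc/sys/kernel/pid_max
  # or equivalently:
  sysctl -w kernel.pid_max=4194303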
632
633 If you do not need GNU parallel to have control over each job (so no
634 need for --retries or --joblog or similar), then it can be even faster
635 if you can generate the command lines and pipe those to a shell. So if
636 you can do this:
637
638 mygenerator | sh
639
640 Then that can be parallelized like this:
641
642 mygenerator | parallel --pipe --block 10M sh
643
644 E.g.
645
646 mygenerator() {
647 seq 10000000 | perl -pe 'print "echo This is fast job number "';
648 }
649 mygenerator | parallel --pipe --block 10M sh
650
The overhead is 100000 times smaller, namely around 100 nanoseconds
per job.
653
654 EXAMPLE: Using shell variables
655 When using shell variables you need to quote them correctly as they may
656 otherwise be interpreted by the shell.
657
658 Notice the difference between:
659
660 ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar)
661 parallel echo ::: ${ARR[@]} # This is probably not what you want
662
663 and:
664
665 ARR=("My brother's 12\" records are worth <\$\$\$>"'!' Foo Bar)
666 parallel echo ::: "${ARR[@]}"
667
When using variables in the actual command that contain special
characters (e.g. space) you can quote them using '"$VAR"' or using "'s
and -q:
671
672 VAR="My brother's 12\" records are worth <\$\$\$>"
673 parallel -q echo "$VAR" ::: '!'
674 export VAR
675 parallel echo '"$VAR"' ::: '!'
676
677 If $VAR does not contain ' then "'$VAR'" will also work (and does not
678 need export):
679
680 VAR="My 12\" records are worth <\$\$\$>"
681 parallel echo "'$VAR'" ::: '!'
682
683 If you use them in a function you just quote as you normally would do:
684
685 VAR="My brother's 12\" records are worth <\$\$\$>"
686 export VAR
687 myfunc() { echo "$VAR" "$1"; }
688 export -f myfunc
689 parallel myfunc ::: '!'
690
691 EXAMPLE: Group output lines
When running jobs that output data, you often do not want the output
of multiple jobs to be mixed together. GNU parallel defaults to
grouping the
694 output of each job, so the output is printed when the job finishes. If
695 you want full lines to be printed while the job is running you can use
696 --line-buffer. If you want output to be printed as soon as possible you
697 can use -u.
698
699 Compare the output of:
700
701 parallel wget --progress=dot --limit-rate=100k \
702 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
703 ::: {12..16}
704 parallel --line-buffer wget --progress=dot --limit-rate=100k \
705 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
706 ::: {12..16}
707 parallel --latest-line wget --progress=dot --limit-rate=100k \
708 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
709 ::: {12..16}
710 parallel -u wget --progress=dot --limit-rate=100k \
711 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
712 ::: {12..16}
713
714 EXAMPLE: Tag output lines
715 GNU parallel groups the output lines, but it can be hard to see where
716 the different jobs begin. --tag prepends the argument to make that more
717 visible:
718
719 parallel --tag wget --limit-rate=100k \
720 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
721 ::: {12..16}
722
723 --tag works with --line-buffer but not with -u:
724
725 parallel --tag --line-buffer wget --limit-rate=100k \
726 https://ftpmirror.gnu.org/parallel/parallel-20{}0822.tar.bz2 \
727 ::: {12..16}
728
729 Check the uptime of the servers in ~/.parallel/sshloginfile:
730
731 parallel --tag -S .. --nonall uptime
732
733 EXAMPLE: Colorize output
734 Give each job a new color. Most terminals support ANSI colors with the
735 escape code "\033[30;3Xm" where 0 <= X <= 7:
736
737 seq 10 | \
738 parallel --tagstring '\033[30;3{=$_=++$::color%8=}m' seq {}
739 parallel --rpl '{color} $_="\033[30;3".(++$::color%8)."m"' \
740 --tagstring {color} seq {} ::: {1..10}
741
742 To get rid of the initial \t (which comes from --tagstring):
743
744 ... | perl -pe 's/\t//'
745
746 EXAMPLE: Keep order of output same as order of input
747 Normally the output of a job will be printed as soon as it completes.
748 Sometimes you want the order of the output to remain the same as the
749 order of the input. This is often important, if the output is used as
750 input for another system. -k will make sure the order of output will be
751 in the same order as input even if later jobs end before earlier jobs.
752
753 Append a string to every line in a text file:
754
755 cat textfile | parallel -k echo {} append_string
756
757 If you remove -k some of the lines may come out in the wrong order.
758
759 Another example is traceroute:
760
761 parallel traceroute ::: qubes-os.org debian.org freenetproject.org
762
763 will give traceroute of qubes-os.org, debian.org and
764 freenetproject.org, but it will be sorted according to which job
765 completed first.
766
767 To keep the order the same as input run:
768
769 parallel -k traceroute ::: qubes-os.org debian.org freenetproject.org
770
771 This will make sure the traceroute to qubes-os.org will be printed
772 first.
773
774 A bit more complex example is downloading a huge file in chunks in
775 parallel: Some internet connections will deliver more data if you
776 download files in parallel. For downloading files in parallel see:
777 "EXAMPLE: Download 10 images for each of the past 30 days". But if you
778 are downloading a big file you can download the file in chunks in
779 parallel.
780
781 To download byte 10000000-19999999 you can use curl:
782
783 curl -r 10000000-19999999 https://example.com/the/big/file >file.part
784
785 To download a 1 GB file we need 100 10MB chunks downloaded and combined
786 in the correct order.
787
788 seq 0 99 | parallel -k curl -r \
789 {}0000000-{}9999999 https://example.com/the/big/file > file
790
791 EXAMPLE: Parallel grep
792 grep -r greps recursively through directories. GNU parallel can often
793 speed this up.
794
795 find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
796
This will run 1.5 jobs per CPU, and give 1000 arguments to grep.
798
799 There are situations where the above will be slower than grep -r:
800
801 • If data is already in RAM. The overhead of starting jobs and
802 buffering output may outweigh the benefit of running in parallel.
803
804 • If the files are big. If a file cannot be read in a single seek, the
805 disk may start thrashing.
806
807 The speedup is caused by two factors:
808
809 • On rotating harddisks small files often require a seek for each file.
810 By searching for more files in parallel, the arm may pass another
811 wanted file on its way.
812
• NVMe drives often perform better by having multiple commands running
in parallel.
815
816 EXAMPLE: Grepping n lines for m regular expressions.
817 The simplest solution to grep a big file for a lot of regexps is:
818
819 grep -f regexps.txt bigfile
820
821 Or if the regexps are fixed strings:
822
823 grep -F -f regexps.txt bigfile
824
825 There are 3 limiting factors: CPU, RAM, and disk I/O.
826
827 RAM is easy to measure: If the grep process takes up most of your free
828 memory (e.g. when running top), then RAM is a limiting factor.
829
830 CPU is also easy to measure: If the grep takes >90% CPU in top, then
831 the CPU is a limiting factor, and parallelization will speed this up.
832
833 It is harder to see if disk I/O is the limiting factor, and depending
834 on the disk system it may be faster or slower to parallelize. The only
835 way to know for certain is to test and measure.
836
837 Limiting factor: RAM
838
839 The normal grep -f regexps.txt bigfile works no matter the size of
840 bigfile, but if regexps.txt is so big it cannot fit into memory, then
841 you need to split this.
842
843 grep -F takes around 100 bytes of RAM and grep takes about 500 bytes of
844 RAM per 1 byte of regexp. So if regexps.txt is 1% of your RAM, then it
845 may be too big.
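
A back-of-the-envelope check based on these numbers could look like
this (a sketch for GNU/Linux; 500 is the factor for normal grep, use
100 for grep -F):

  free_kb=$(awk '/MemFree/ {print $2}' /proc/meminfo)
  regexp_kb=$(du -k regexps.txt | awk '{print $1}')
  echo "Estimated grep RAM: $((regexp_kb * 500)) KB; free: $free_kb KB"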
846
If you can convert your regexps into fixed strings do that. E.g. if
the lines you are looking for in bigfile all look like:
849
850 ID1 foo bar baz Identifier1 quux
851 fubar ID2 foo bar baz Identifier2
852
853 then your regexps.txt can be converted from:
854
855 ID1.*Identifier1
856 ID2.*Identifier2
857
858 into:
859
860 ID1 foo bar baz Identifier1
861 ID2 foo bar baz Identifier2
862
863 This way you can use grep -F which takes around 80% less memory and is
864 much faster.
865
866 If it still does not fit in memory you can do this:
867
868 parallel --pipe-part -a regexps.txt --block 1M grep -F -f - -n bigfile | \
869 sort -un | perl -pe 's/^\d+://'
870
871 The 1M should be your free memory divided by the number of CPU threads
872 and divided by 200 for grep -F and by 1000 for normal grep. On
873 GNU/Linux you can do:
874
875 free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
876 END { print sum }' /proc/meminfo)
877 percpu=$((free / 200 / $(parallel --number-of-threads)))k
878
879 parallel --pipe-part -a regexps.txt --block $percpu --compress \
880 grep -F -f - -n bigfile | \
881 sort -un | perl -pe 's/^\d+://'
882
883 If you can live with duplicated lines and wrong order, it is faster to
884 do:
885
886 parallel --pipe-part -a regexps.txt --block $percpu --compress \
887 grep -F -f - bigfile
888
889 Limiting factor: CPU
890
891 If the CPU is the limiting factor parallelization should be done on the
892 regexps:
893
894 cat regexps.txt | parallel --pipe -L1000 --round-robin --compress \
895 grep -f - -n bigfile | \
896 sort -un | perl -pe 's/^\d+://'
897
898 The command will start one grep per CPU and read bigfile one time per
899 CPU, but as that is done in parallel, all reads except the first will
900 be cached in RAM. Depending on the size of regexps.txt it may be faster
901 to use --block 10m instead of -L1000.
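
The --block 10m variant is the same pipeline with only the chunking of
regexps.txt changed (a sketch):

  cat regexps.txt | parallel --pipe --block 10m --round-robin --compress \
    grep -f - -n bigfile | \
    sort -un | perl -pe 's/^\d+://'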
902
903 Some storage systems perform better when reading multiple chunks in
904 parallel. This is true for some RAID systems and for some network file
905 systems. To parallelize the reading of bigfile:
906
907 parallel --pipe-part --block 100M -a bigfile -k --compress \
908 grep -f regexps.txt
909
910 This will split bigfile into 100MB chunks and run grep on each of these
911 chunks. To parallelize both reading of bigfile and regexps.txt combine
912 the two using --cat:
913
914 parallel --pipe-part --block 100M -a bigfile --cat cat regexps.txt \
915 \| parallel --pipe -L1000 --round-robin grep -f - {}
916
917 If a line matches multiple regexps, the line may be duplicated.
918
919 Bigger problem
920
921 If the problem is too big to be solved by this, you are probably ready
922 for Lucene.
923
924 EXAMPLE: Using remote computers
To run commands on a remote computer SSH needs to be set up and you
must be able to log in without entering a password (the commands
ssh-copy-id, ssh-agent, and sshpass may help you do that).
928
If you need to log in to a whole cluster, you typically do not want to
930 accept the host key for every host. You want to accept them the first
931 time and be warned if they are ever changed. To do that:
932
933 # Add the servers to the sshloginfile
934 (echo servera; echo serverb) > .parallel/my_cluster
# Make sure .ssh/config exists
936 touch .ssh/config
937 cp .ssh/config .ssh/config.backup
938 # Disable StrictHostKeyChecking temporarily
939 (echo 'Host *'; echo StrictHostKeyChecking no) >> .ssh/config
940 parallel --slf my_cluster --nonall true
941 # Remove the disabling of StrictHostKeyChecking
942 mv .ssh/config.backup .ssh/config
943
The servers in .parallel/my_cluster are now added to .ssh/known_hosts.
945
946 To run echo on server.example.com:
947
948 seq 10 | parallel --sshlogin server.example.com echo
949
950 To run commands on more than one remote computer run:
951
952 seq 10 | parallel --sshlogin s1.example.com,s2.example.net echo
953
954 Or:
955
956 seq 10 | parallel --sshlogin server.example.com \
957 --sshlogin server2.example.net echo
958
959 If the login username is foo on server2.example.net use:
960
961 seq 10 | parallel --sshlogin server.example.com \
962 --sshlogin foo@server2.example.net echo
963
964 If your list of hosts is server1-88.example.net with login foo:
965
966 seq 10 | parallel -Sfoo@server{1..88}.example.net echo
967
968 To distribute the commands to a list of computers, make a file
969 mycomputers with all the computers:
970
971 server.example.com
972 foo@server2.example.com
973 server3.example.com
974
975 Then run:
976
977 seq 10 | parallel --sshloginfile mycomputers echo
978
979 To include the local computer add the special sshlogin ':' to the list:
980
981 server.example.com
982 foo@server2.example.com
983 server3.example.com
984 :
985
986 GNU parallel will try to determine the number of CPUs on each of the
987 remote computers, and run one job per CPU - even if the remote
988 computers do not have the same number of CPUs.
989
990 If the number of CPUs on the remote computers is not identified
991 correctly the number of CPUs can be added in front. Here the computer
992 has 8 CPUs.
993
994 seq 10 | parallel --sshlogin 8/server.example.com echo
995
996 EXAMPLE: Transferring of files
997 To recompress gzipped files with bzip2 using a remote computer run:
998
999 find logs/ -name '*.gz' | \
1000 parallel --sshlogin server.example.com \
1001 --transfer "zcat {} | bzip2 -9 >{.}.bz2"
1002
1003 This will list the .gz-files in the logs directory and all directories
below. Then it will transfer the files to the corresponding directory
in $HOME/logs on server.example.com. There the file
1006 will be recompressed using zcat and bzip2 resulting in the
1007 corresponding file with .gz replaced with .bz2.
1008
1009 If you want the resulting bz2-file to be transferred back to the local
1010 computer add --return {.}.bz2:
1011
1012 find logs/ -name '*.gz' | \
1013 parallel --sshlogin server.example.com \
1014 --transfer --return {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1015
1016 After the recompressing is done the .bz2-file is transferred back to
1017 the local computer and put next to the original .gz-file.
1018
1019 If you want to delete the transferred files on the remote computer add
1020 --cleanup. This will remove both the file transferred to the remote
1021 computer and the files transferred from the remote computer:
1022
1023 find logs/ -name '*.gz' | \
1024 parallel --sshlogin server.example.com \
1025 --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"
1026
If you want to run on several computers, add the computers to
--sshlogin either using ',' or multiple --sshlogin:
1029
1030 find logs/ -name '*.gz' | \
1031 parallel --sshlogin server.example.com,server2.example.com \
1032 --sshlogin server3.example.com \
1033 --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"
1034
1035 You can add the local computer using --sshlogin :. This will disable
1036 the removing and transferring for the local computer only:
1037
1038 find logs/ -name '*.gz' | \
1039 parallel --sshlogin server.example.com,server2.example.com \
1040 --sshlogin server3.example.com \
1041 --sshlogin : \
1042 --transfer --return {.}.bz2 --cleanup "zcat {} | bzip2 -9 >{.}.bz2"
1043
1044 Often --transfer, --return and --cleanup are used together. They can be
1045 shortened to --trc:
1046
1047 find logs/ -name '*.gz' | \
1048 parallel --sshlogin server.example.com,server2.example.com \
1049 --sshlogin server3.example.com \
1050 --sshlogin : \
1051 --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1052
1053 With the file mycomputers containing the list of computers it becomes:
1054
1055 find logs/ -name '*.gz' | parallel --sshloginfile mycomputers \
1056 --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1057
1058 If the file ~/.parallel/sshloginfile contains the list of computers the
special shorthand -S .. can be used:
1060
1061 find logs/ -name '*.gz' | parallel -S .. \
1062 --trc {.}.bz2 "zcat {} | bzip2 -9 >{.}.bz2"
1063
1064 EXAMPLE: Advanced file transfer
1065 Assume you have files in in/*, want them processed on server, and
1066 transferred back into /other/dir:
1067
1068 parallel -S server --trc /other/dir/./{/}.out \
1069 cp {/} {/}.out ::: in/./*
1070
1071 EXAMPLE: Distributing work to local and remote computers
1072 Convert *.mp3 to *.ogg running one process per CPU on local computer
1073 and server2:
1074
1075 parallel --trc {.}.ogg -S server2,: \
1076 'mpg321 -w - {} | oggenc -q0 - -o {.}.ogg' ::: *.mp3
1077
1078 EXAMPLE: Running the same command on remote computers
1079 To run the command uptime on remote computers you can do:
1080
1081 parallel --tag --nonall -S server1,server2 uptime
1082
1083 --nonall reads no arguments. If you have a list of jobs you want to run
1084 on each computer you can do:
1085
1086 parallel --tag --onall -S server1,server2 echo ::: 1 2 3
1087
1088 Remove --tag if you do not want the sshlogin added before the output.
1089
1090 If you have a lot of hosts use '-j0' to access more hosts in parallel.
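
For example (a sketch using the servers in ~/.parallel/sshloginfile):

  parallel -j0 --tag --nonall -S .. uptime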
1091
1092 EXAMPLE: Running 'sudo' on remote computers
1093 Put the password into passwordfile then run:
1094
1095 parallel --ssh 'cat passwordfile | ssh' --nonall \
1096 -S user@server1,user@server2 sudo -S ls -l /root
1097
1098 EXAMPLE: Using remote computers behind NAT wall
1099 If the workers are behind a NAT wall, you need some trickery to get to
1100 them.
1101
1102 If you can ssh to a jumphost, and reach the workers from there, then
1103 the obvious solution would be this, but it does not work:
1104
1105 parallel --ssh 'ssh jumphost ssh' -S host1 echo ::: DOES NOT WORK
1106
It does not work because the command is dequoted by ssh twice,
whereas GNU parallel only expects it to be dequoted once.
1109
1110 You can use a bash function and have GNU parallel quote the command:
1111
1112 jumpssh() { ssh -A jumphost ssh $(parallel --shellquote ::: "$@"); }
1113 export -f jumpssh
1114 parallel --ssh jumpssh -S host1 echo ::: this works
1115
1116 Or you can instead put this in ~/.ssh/config:
1117
1118 Host host1 host2 host3
1119 ProxyCommand ssh jumphost.domain nc -w 1 %h 22
1120
It requires nc (netcat) to be installed on the jumphost. With this you
can simply run:
1123
1124 parallel -S host1,host2,host3 echo ::: This does work
1125
1126 No jumphost, but port forwards
1127
1128 If there is no jumphost but each server has port 22 forwarded from the
1129 firewall (e.g. the firewall's port 22001 = port 22 on host1, 22002 =
1130 host2, 22003 = host3) then you can use ~/.ssh/config:
1131
1132 Host host1.v
1133 Port 22001
1134 Host host2.v
1135 Port 22002
1136 Host host3.v
1137 Port 22003
1138 Host *.v
1139 Hostname firewall
1140
1141 And then use host{1..3}.v as normal hosts:
1142
1143 parallel -S host1.v,host2.v,host3.v echo ::: a b c
1144
1145 No jumphost, no port forwards
1146
If ports cannot be forwarded, you need some sort of VPN to traverse
the NAT-wall. TOR is one option for that, as it is very easy to get
working.
1150
You need to install TOR and set up a hidden service. In torrc put:
1152
1153 HiddenServiceDir /var/lib/tor/hidden_service/
1154 HiddenServicePort 22 127.0.0.1:22
1155
1156 Then start TOR: /etc/init.d/tor restart
1157
1158 The TOR hostname is now in /var/lib/tor/hidden_service/hostname and is
1159 something similar to izjafdceobowklhz.onion. Now you simply prepend
1160 torsocks to ssh:
1161
1162 parallel --ssh 'torsocks ssh' -S izjafdceobowklhz.onion \
1163 -S zfcdaeiojoklbwhz.onion,auclucjzobowklhi.onion echo ::: a b c
1164
1165 If not all hosts are accessible through TOR:
1166
1167 parallel -S 'torsocks ssh izjafdceobowklhz.onion,host2,host3' \
1168 echo ::: a b c
1169
1170 See more ssh tricks on
1171 https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Proxies_and_Jump_Hosts
1172
1173 EXAMPLE: Use sshpass with ssh
1174 If you cannot use passwordless login, you may be able to use sshpass:
1175
1176 seq 10 | parallel -S user-with-password:MyPassword@server echo
1177
1178 or:
1179
1180 export SSHPASS='MyPa$$w0rd'
1181 seq 10 | parallel -S user-with-password:@server echo
1182
1183 EXAMPLE: Use outrun instead of ssh
1184 outrun lets you run a command on a remote server. outrun sets up a
1185 connection to access files at the source server, and automatically
1186 transfers files. outrun must be installed on the remote system.
1187
1188 You can use outrun in an sshlogin this way:
1189
1190 parallel -S 'outrun user@server' command
1191
1192 or:
1193
1194 parallel --ssh outrun -S server command
1195
1196 EXAMPLE: Slurm cluster
1197 The Slurm Workload Manager is used in many clusters.
1198
1199 Here is a simple example of using GNU parallel to call srun:
1200
1201 #!/bin/bash
1202
1203 #SBATCH --time 00:02:00
1204 #SBATCH --ntasks=4
1205 #SBATCH --job-name GnuParallelDemo
1206 #SBATCH --output gnuparallel.out
1207
1208 module purge
1209 module load gnu_parallel
1210
1211 my_parallel="parallel --delay .2 -j $SLURM_NTASKS"
1212 my_srun="srun --export=all --exclusive -n1 --cpus-per-task=1 --cpu-bind=cores"
1213 $my_parallel "$my_srun" echo This is job {} ::: {1..20}
1214
1215 EXAMPLE: Parallelizing rsync
1216 rsync is a great tool, but sometimes it will not fill up the available
1217 bandwidth. Running multiple rsync in parallel can fix this.
1218
1219 cd src-dir
1220 find . -type f |
1221 parallel -j10 -X rsync -zR -Ha ./{} fooserver:/dest-dir/
1222
1223 Adjust -j10 until you find the optimal number.
1224
1225 rsync -R will create the needed subdirectories, so all files are not
1226 put into a single dir. The ./ is needed so the resulting command looks
1227 similar to:
1228
1229 rsync -zR ././sub/dir/file fooserver:/dest-dir/
1230
1231 The /./ is what rsync -R works on.
1232
1233 If you are unable to push data, but need to pull them and the files are
1234 called digits.png (e.g. 000000.png) you might be able to do:
1235
1236 seq -w 0 99 | parallel rsync -Havessh fooserver:src/*{}.png destdir/
1237
1238 EXAMPLE: Use multiple inputs in one command
1239 Copy files like foo.es.ext to foo.ext:
1240
1241 ls *.es.* | perl -pe 'print; s/\.es//' | parallel -N2 cp {1} {2}
1242
1243 The perl command spits out 2 lines for each input. GNU parallel takes 2
1244 inputs (using -N2) and replaces {1} and {2} with the inputs.
1245
1246 Count in binary:
1247
1248 parallel -k echo ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1 ::: 0 1
1249
1250 Print the number on the opposing sides of a six sided die:
1251
1252 parallel --link -a <(seq 6) -a <(seq 6 -1 1) echo
1253 parallel --link echo :::: <(seq 6) <(seq 6 -1 1)
1254
1255 Convert files from all subdirs to PNG-files with consecutive numbers
1256 (useful for making input PNG's for ffmpeg):
1257
1258 parallel --link -a <(find . -type f | sort) \
1259 -a <(seq $(find . -type f|wc -l)) convert {1} {2}.png
1260
1261 Alternative version:
1262
1263 find . -type f | sort | parallel convert {} {#}.png
1264
1265 EXAMPLE: Use a table as input
1266 Content of table_file.tsv:
1267
1268 foo<TAB>bar
1269 baz <TAB> quux
1270
1271 To run:
1272
1273 cmd -o bar -i foo
1274 cmd -o quux -i baz
1275
1276 you can run:
1277
1278 parallel -a table_file.tsv --colsep '\t' cmd -o {2} -i {1}
1279
1280 Note: The default for GNU parallel is to remove the spaces around the
1281 columns. To keep the spaces:
1282
1283 parallel -a table_file.tsv --trim n --colsep '\t' cmd -o {2} -i {1}
1284
1285 EXAMPLE: Output to database
1286 GNU parallel can output to a database table and a CSV-file:
1287
1288 dburl=csv:///%2Ftmp%2Fmydir
1289 dbtableurl=$dburl/mytable.csv
1290 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1291
1292 It is rather slow and takes up a lot of CPU time because GNU parallel
1293 parses the whole CSV file for each update.
1294
A better approach is to use an SQLite database and then convert that
to CSV:
1297
1298 dburl=sqlite3:///%2Ftmp%2Fmy.sqlite
1299 dbtableurl=$dburl/mytable
1300 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1301 sql $dburl '.headers on' '.mode csv' 'SELECT * FROM mytable;'
1302
1303 This takes around a second per job.
1304
1305 If you have access to a real database system, such as PostgreSQL, it is
1306 even faster:
1307
1308 dburl=pg://user:pass@host/mydb
1309 dbtableurl=$dburl/mytable
1310 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1311 sql $dburl \
1312 "COPY (SELECT * FROM mytable) TO stdout DELIMITER ',' CSV HEADER;"
1313
1314 Or MySQL:
1315
1316 dburl=mysql://user:pass@host/mydb
1317 dbtableurl=$dburl/mytable
1318 parallel --sqlandworker $dbtableurl seq ::: {1..10}
1319 sql -p -B $dburl "SELECT * FROM mytable;" > mytable.tsv
1320 perl -pe 's/"/""/g; s/\t/","/g; s/^/"/; s/$/"/;
1321 %s=("\\" => "\\", "t" => "\t", "n" => "\n");
1322 s/\\([\\tn])/$s{$1}/g;' mytable.tsv
1323
1324 EXAMPLE: Output to CSV-file for R
1325 If you have no need for the advanced job distribution control that a
1326 database provides, but you simply want output into a CSV file that you
1327 can read into R or LibreCalc, then you can use --results:
1328
1329 parallel --results my.csv seq ::: 10 20 30
1330 R
1331 > mydf <- read.csv("my.csv");
1332 > print(mydf[2,])
1333 > write(as.character(mydf[2,c("Stdout")]),'')
1334
1335 EXAMPLE: Use XML as input
1336 The show Aflyttet on Radio 24syv publishes an RSS feed with their audio
1337 podcasts on: http://arkiv.radio24syv.dk/audiopodcast/channel/4466232
1338
1339 Using xpath you can extract the URLs for 2019 and download them using
1340 GNU parallel:
1341
1342 wget -O - http://arkiv.radio24syv.dk/audiopodcast/channel/4466232 | \
1343 xpath -e "//pubDate[contains(text(),'2019')]/../enclosure/@url" | \
1344 parallel -u wget '{= s/ url="//; s/"//; =}'
1345
1346 EXAMPLE: Run the same command 10 times
1347 If you want to run the same command with the same arguments 10 times in
1348 parallel you can do:
1349
1350 seq 10 | parallel -n0 my_command my_args
1351
1352 EXAMPLE: Working as cat | sh. Resource inexpensive jobs and evaluation
GNU parallel can work similarly to cat | sh.
1354
1355 A resource inexpensive job is a job that takes very little CPU, disk
1356 I/O and network I/O. Ping is an example of a resource inexpensive job.
1357 wget is too - if the webpages are small.
1358
1359 The content of the file jobs_to_run:
1360
1361 ping -c 1 10.0.0.1
1362 wget http://example.com/status.cgi?ip=10.0.0.1
1363 ping -c 1 10.0.0.2
1364 wget http://example.com/status.cgi?ip=10.0.0.2
1365 ...
1366 ping -c 1 10.0.0.255
1367 wget http://example.com/status.cgi?ip=10.0.0.255
1368
1369 To run 100 processes simultaneously do:
1370
1371 parallel -j 100 < jobs_to_run
1372
As there is no command, the jobs will be evaluated by the shell.
1374
1375 EXAMPLE: Call program with FASTA sequence
1376 FASTA files have the format:
1377
1378 >Sequence name1
1379 sequence
1380 sequence continued
1381 >Sequence name2
1382 sequence
1383 sequence continued
1384 more sequence
1385
1386 To call myprog with the sequence as argument run:
1387
1388 cat file.fasta |
1389 parallel --pipe -N1 --recstart '>' --rrs \
1390 'read a; echo Name: "$a"; myprog $(tr -d "\n")'
1391
1392 EXAMPLE: Call program with interleaved FASTQ records
1393 FASTQ files have the format:
1394
1395 @M10991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
1396 CTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAAGG
1397 +
1398 #8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF
1399
1400 Interleaved FASTQ starts with a line like these:
1401
1402 @HWUSI-EAS100R:6:73:941:1973#0/1
1403 @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
1404 @EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1
1405
where '/1' and ' 1:' determine that this is read 1.
1407
1408 This will cut big.fq into one chunk per CPU thread and pass it on stdin
1409 (standard input) to the program fastq-reader:
1410
1411 parallel --pipe-part -a big.fq --block -1 --regexp \
1412 --recend '\n' --recstart '@.*(/1| 1:.*)\n[A-Za-z\n\.~]' \
1413 fastq-reader
1414
1415 EXAMPLE: Processing a big file using more CPUs
1416 To process a big file or some output you can use --pipe to split up the
1417 data into blocks and pipe the blocks into the processing program.
1418
1419 If the program is gzip -9 you can do:
1420
1421 cat bigfile | parallel --pipe --recend '' -k gzip -9 > bigfile.gz
1422
1423 This will split bigfile into blocks of 1 MB and pass that to gzip -9 in
1424 parallel. One gzip will be run per CPU. The output of gzip -9 will be
1425 kept in order and saved to bigfile.gz
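
To convince yourself that the concatenated output is still valid gzip
data, you can compare the round trip (a sketch; relies on gzip
accepting concatenated members):

  zcat bigfile.gz | cmp - bigfile && echo identical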
1426
1427 gzip works fine if the output is appended, but some processing does not
1428 work like that - for example sorting. For this GNU parallel can put the
1429 output of each command into a file. This will sort a big file in
1430 parallel:
1431
1432 cat bigfile | parallel --pipe --files sort |\
1433 parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
1434
1435 Here bigfile is split into blocks of around 1MB, each block ending in
1436 '\n' (which is the default for --recend). Each block is passed to sort
1437 and the output from sort is saved into files. These files are passed to
1438 the second parallel that runs sort -m on the files before it removes
1439 the files. The output is saved to bigfile.sort.
1440
1441 GNU parallel's --pipe maxes out at around 100 MB/s because every byte
1442 has to be copied through GNU parallel. But if bigfile is a real
(seekable) file GNU parallel can bypass the copying and send the parts
1444 directly to the program:
1445
1446 parallel --pipe-part --block 100m -a bigfile --files sort |\
1447 parallel -Xj1 sort -m {} ';' rm {} >bigfile.sort
1448
1449 EXAMPLE: Grouping input lines
1450 When processing with --pipe you may have lines grouped by a value. Here
1451 is my.csv:
1452
1453 Transaction Customer Item
1454 1 a 53
1455 2 b 65
1456 3 b 82
1457 4 c 96
1458 5 c 67
1459 6 c 13
1460 7 d 90
1461 8 d 43
1462 9 d 91
1463 10 d 84
1464 11 e 72
1465 12 e 102
1466 13 e 63
1467 14 e 56
1468 15 e 74
1469
1470 Let us assume you want GNU parallel to process each customer. In other
1471 words: You want all the transactions for a single customer to be
1472 treated as a single record.
1473
1474 To do this we preprocess the data with a program that inserts a record
1475 separator before each customer (column 2 = $F[1]). Here we first make a
1476 50 character random string, which we then use as the separator:
1477
1478 sep=`perl -e 'print map { ("a".."z","A".."Z")[rand(52)] } (1..50);'`
1479 cat my.csv | \
1480 perl -ape '$F[1] ne $l and print "'$sep'"; $l = $F[1]' | \
1481 parallel --recend $sep --rrs --pipe -N1 wc
1482
1483 If your program can process multiple customers replace -N1 with a
1484 reasonable --blocksize.
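
For example, something like this (a sketch reusing $sep from above and
assuming wc can handle several customers per block):

  cat my.csv | \
    perl -ape '$F[1] ne $l and print "'$sep'"; $l = $F[1]' | \
    parallel --recend $sep --rrs --pipe --block 10m wc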
1485
1486 EXAMPLE: Running more than 250 jobs workaround
If you need to run a massive number of jobs in parallel, then you will
likely hit the filehandle limit which is often around 250 jobs. If you
are super user you can raise the limit in /etc/security/limits.conf
(see the sketch at the end of this example) but you can also use this
workaround. The filehandle limit is per process. That means that if
you just spawn more GNU parallels then each of them can run 250 jobs.
This will spawn up to 2500 jobs:
1493
1494 cat myinput |\
1495 parallel --pipe -N 50 --round-robin -j50 parallel -j50 your_prg
1496
1497 This will spawn up to 62500 jobs (use with caution - you need 64 GB RAM
1498 to do this, and you may need to increase /proc/sys/kernel/pid_max):
1499
1500 cat myinput |\
1501 parallel --pipe -N 250 --round-robin -j250 parallel -j250 your_prg
1502
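If you prefer raising the filehandle limit instead, an entry in
/etc/security/limits.conf could look like this (a sketch; pick values
that suit your site and log in again for them to take effect):

  *    soft    nofile    65535
  *    hard    nofile    65535
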
1503 EXAMPLE: Working as mutex and counting semaphore
1504 The command sem is an alias for parallel --semaphore.
1505
A counting semaphore will allow a given number of jobs to be started
in the background. When that number of jobs is running in the
background, GNU sem will wait for one of these to complete before
starting another command. sem --wait will wait for all jobs to
complete.
1510
1511 Run 10 jobs concurrently in the background:
1512
1513 for i in *.log ; do
1514 echo $i
1515 sem -j10 gzip $i ";" echo done
1516 done
1517 sem --wait
1518
A mutex is a counting semaphore allowing only one job to run. This
will edit the file myfile and prepend it with lines containing the
numbers 1 to 3.
1522
1523 seq 3 | parallel sem sed -i -e '1i{}' myfile
1524
1525 As myfile can be very big it is important only one process edits the
1526 file at the same time.
1527
1528 Name the semaphore to have multiple different semaphores active at the
1529 same time:
1530
1531 seq 3 | parallel sem --id mymutex sed -i -e '1i{}' myfile
1532
1533 EXAMPLE: Mutex for a script
1534 Assume a script is called from cron or from a web service, but only one
1535 instance can be run at a time. With sem and --shebang-wrap the script
1536 can be made to wait for other instances to finish. Here in bash:
1537
1538 #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /bin/bash
1539
1540 echo This will run
1541 sleep 5
1542 echo exclusively
1543
1544 Here perl:
1545
1546 #!/usr/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/perl
1547
1548 print "This will run ";
1549 sleep 5;
1550 print "exclusively\n";
1551
1552 Here python:
1553
1554 #!/usr/local/bin/sem --shebang-wrap -u --id $0 --fg /usr/bin/python
1555
1556 import time
1557 print "This will run ";
1558 time.sleep(5)
1559 print "exclusively";
1560
1561 EXAMPLE: Start editor with filenames from stdin (standard input)
1562 You can use GNU parallel to start interactive programs like emacs or
1563 vi:
1564
1565 cat filelist | parallel --tty -X emacs
1566 cat filelist | parallel --tty -X vi
1567
1568 If there are more files than will fit on a single command line, the
1569 editor will be started again with the remaining files.
1570
1571 EXAMPLE: Running sudo
1572 sudo requires a password to run a command as root. It caches the
1573 access, so you only need to enter the password again if you have not
1574 used sudo for a while.
1575
1576 The command:
1577
1578 parallel sudo echo ::: This is a bad idea
1579
1580 is no good, as you would be prompted for the sudo password for each of
1581 the jobs. Instead do:
1582
1583 sudo parallel echo ::: This is a good idea
1584
1585 This way you only have to enter the sudo password once.
1586
1587 EXAMPLE: Run ping in parallel
1588 ping prints out statistics when killed with CTRL-C.
1589
1590 Unfortunately, CTRL-C will also normally kill GNU parallel.
1591
1592 But by using --open-tty and ignoring SIGINT you can get the wanted
1593 effect:
1594
1595 parallel -j0 --open-tty --lb --tag ping '{= $SIG{INT}=sub {} =}' \
1596 ::: 1.1.1.1 8.8.8.8 9.9.9.9 21.21.21.21 80.80.80.80 88.88.88.88
1597
1598 --open-tty will make the pings receive SIGINT (from CTRL-C). CTRL-C
1599 will not kill GNU parallel, so that will only exit after ping is done.
1600
1601 EXAMPLE: GNU Parallel as queue system/batch manager
1602 GNU parallel can work as a simple job queue system or batch manager.
1603 The idea is to put the jobs into a file and have GNU parallel read from
1604 that continuously. As GNU parallel will stop at end of file we use tail
1605 to continue reading:
1606
1607 true >jobqueue; tail -n+0 -f jobqueue | parallel
1608
1609 To submit your jobs to the queue:
1610
1611 echo my_command my_arg >> jobqueue
1612
1613 You can of course use -S to distribute the jobs to remote computers:
1614
1615 true >jobqueue; tail -n+0 -f jobqueue | parallel -S ..
1616
Output will only be printed when the next input is read after a job
has finished: so you need to submit a job after the first has finished
to see the output from the first job.
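
For example (a sketch):

  echo echo first  >> jobqueue   # runs, but its output is held back
  echo echo second >> jobqueue   # reading this releases the output of 'first'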
1620
1621 If you keep this running for a long time, jobqueue will grow. A way of
1622 removing the jobs already run is by making GNU parallel stop when it
1623 hits a special value and then restart. To use --eof to make GNU
1624 parallel exit, tail also needs to be forced to exit:
1625
1626 true >jobqueue;
1627 while true; do
1628 tail -n+0 -f jobqueue |
1629 (parallel -E StOpHeRe -S ..; echo GNU Parallel is now done;
1630 perl -e 'while(<>){/StOpHeRe/ and last};print <>' jobqueue > j2;
1631 (seq 1000 >> jobqueue &);
1632 echo Done appending dummy data forcing tail to exit)
1633 echo tail exited;
1634 mv j2 jobqueue
1635 done
1636
1637 In some cases you can run on more CPUs and computers during the night:
1638
1639 # Day time
1640 echo 50% > jobfile
1641 cp day_server_list ~/.parallel/sshloginfile
1642 # Night time
1643 echo 100% > jobfile
1644 cp night_server_list ~/.parallel/sshloginfile
1645 tail -n+0 -f jobqueue | parallel --jobs jobfile -S ..
1646
1647 GNU parallel discovers if jobfile or ~/.parallel/sshloginfile changes.
1648
1649 EXAMPLE: GNU Parallel as dir processor
If you have a dir in which users drop files that need to be processed
you can do this on GNU/Linux (if you know what inotifywait is called
on other platforms, file a bug report):
1653
1654 inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\
1655 parallel -u echo
1656
1657 This will run the command echo on each file put into my_dir or subdirs
1658 of my_dir.
1659
1660 You can of course use -S to distribute the jobs to remote computers:
1661
1662 inotifywait -qmre MOVED_TO -e CLOSE_WRITE --format %w%f my_dir |\
1663 parallel -S .. -u echo
1664
1665 If the files to be processed are in a tar file then unpacking one file
1666 and processing it immediately may be faster than first unpacking all
1667 files. Set up the dir processor as above and unpack into the dir.
1668
1669 Using GNU parallel as dir processor has the same limitations as using
1670 GNU parallel as queue system/batch manager.
1671
1672 EXAMPLE: Locate the missing package
1673 If you have downloaded source and tried compiling it, you may have
1674 seen:
1675
1676 $ ./configure
1677 [...]
1678 checking for something.h... no
1679 configure: error: "libsomething not found"
1680
1681 Often it is not obvious which package you should install to get that
1682 file. Debian has `apt-file` to search for a file. `tracefile` from
1683 https://gitlab.com/ole.tange/tangetools can tell which files a program
1684 tried to access. In this case we are interested in one of the last
1685 files:
1686
1687 $ tracefile -un ./configure | tail | parallel -j0 apt-file search
1688
AUTHOR
When using GNU parallel for a publication please cite:
1691
1692 O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login:
1693 The USENIX Magazine, February 2011:42-47.
1694
1695 This helps funding further development; and it won't cost you a cent.
1696 If you pay 10000 EUR you should feel free to use GNU Parallel without
1697 citing.
1698
1699 Copyright (C) 2007-10-18 Ole Tange, http://ole.tange.dk
1700
1701 Copyright (C) 2008-2010 Ole Tange, http://ole.tange.dk
1702
1703 Copyright (C) 2010-2022 Ole Tange, http://ole.tange.dk and Free
1704 Software Foundation, Inc.
1705
Parts of the manual concerning xargs compatibility are inspired by the
manual of xargs from GNU findutils 4.4.2.
1708
LICENSE
This program is free software; you can redistribute it and/or modify it
1711 under the terms of the GNU General Public License as published by the
1712 Free Software Foundation; either version 3 of the License, or at your
1713 option any later version.
1714
1715 This program is distributed in the hope that it will be useful, but
1716 WITHOUT ANY WARRANTY; without even the implied warranty of
1717 MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
1718 General Public License for more details.
1719
1720 You should have received a copy of the GNU General Public License along
1721 with this program. If not, see <https://www.gnu.org/licenses/>.
1722
1723 Documentation license I
1724 Permission is granted to copy, distribute and/or modify this
1725 documentation under the terms of the GNU Free Documentation License,
1726 Version 1.3 or any later version published by the Free Software
1727 Foundation; with no Invariant Sections, with no Front-Cover Texts, and
1728 with no Back-Cover Texts. A copy of the license is included in the
1729 file LICENSES/GFDL-1.3-or-later.txt.
1730
1731 Documentation license II
1732 You are free:
1733
1734 to Share to copy, distribute and transmit the work
1735
1736 to Remix to adapt the work
1737
1738 Under the following conditions:
1739
1740 Attribution
1741 You must attribute the work in the manner specified by the
1742 author or licensor (but not in any way that suggests that they
1743 endorse you or your use of the work).
1744
1745 Share Alike
1746 If you alter, transform, or build upon this work, you may
1747 distribute the resulting work only under the same, similar or
1748 a compatible license.
1749
1750 With the understanding that:
1751
1752 Waiver Any of the above conditions can be waived if you get
1753 permission from the copyright holder.
1754
1755 Public Domain
1756 Where the work or any of its elements is in the public domain
1757 under applicable law, that status is in no way affected by the
1758 license.
1759
1760 Other Rights
1761 In no way are any of the following rights affected by the
1762 license:
1763
1764 • Your fair dealing or fair use rights, or other applicable
1765 copyright exceptions and limitations;
1766
1767 • The author's moral rights;
1768
1769 • Rights other persons may have either in the work itself or
1770 in how the work is used, such as publicity or privacy
1771 rights.
1772
1773 Notice For any reuse or distribution, you must make clear to others
1774 the license terms of this work.
1775
1776 A copy of the full license is included in the file as
1777 LICENCES/CC-BY-SA-4.0.txt
1778
SEE ALSO
parallel(1), parallel_tutorial(7), env_parallel(1), parset(1),
1781 parsort(1), parallel_alternatives(7), parallel_design(7), niceload(1),
1782 sql(1), ssh(1), ssh-agent(1), sshpass(1), ssh-copy-id(1), rsync(1)
1783
1784
1785
20221022                          2022-11-06                PARALLEL_EXAMPLES(7)