PERLEBCDIC(1)          Perl Programmers Reference Guide          PERLEBCDIC(1)

NAME

       perlebcdic - Considerations for running Perl on EBCDIC platforms

DESCRIPTION

       An exploration of some of the issues facing Perl programmers on EBCDIC
       based computers.  We do not cover localization, internationalization,
       or multi-byte character set issues other than some discussion of UTF-8
       and UTF-EBCDIC.

       Portions that are still incomplete are marked with XXX.

COMMON CHARACTER CODE SETS

       ASCII

       The American Standard Code for Information Interchange is a set of
       integers running from 0 to 127 (decimal) that imply character
       interpretation by the display and other systems of computers.  The
       range 0..127 can be covered by the bits of a 7-bit binary number,
       hence the set is sometimes referred to as "7-bit ASCII".  ASCII was
       described by the American National Standards Institute document ANSI
       X3.4-1986.  It was also described by ISO 646:1991 (with localization
       for currency symbols).  The full ASCII set is given in the table below
       as the first 128 elements.  Languages that can be written adequately
       with the characters in ASCII include English, Hawaiian, Indonesian,
       Swahili and some Native American languages.

       There are many character sets that extend the range of integers from
       0..2**7-1 up to 2**8-1, or 8-bit bytes (octets if you prefer).  One
       common one is the ISO 8859-1 character set.

       ISO 8859

       The ISO 8859-$n are a collection of character code sets from the
       International Organization for Standardization (ISO), each of which
       adds characters to the ASCII set that are typically found in European
       languages, many of which are based on the Roman, or Latin, alphabet.

       Latin 1 (ISO 8859-1)

       A particular 8-bit extension to ASCII that includes grave and acute
       accented Latin characters.  Languages that can employ ISO 8859-1
       include all the languages covered by ASCII as well as Afrikaans,
       Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian,
       Portuguese, Spanish, and Swedish.  Dutch is covered albeit without the
       ij ligature.  French is covered too but without the oe ligature.
       German can use ISO 8859-1 but must do so without German-style quotation
       marks.  This set is based on Western European extensions to ASCII and
       is commonly encountered in world wide web work.  In IBM character code
       set identification terminology ISO 8859-1 is also known as CCSID 819
       (or sometimes 0819 or even 00819).

       EBCDIC

       The Extended Binary Coded Decimal Interchange Code refers to a large
       collection of slightly different single- and multi-byte coded character
       sets that are different from ASCII or ISO 8859-1 and typically run on
       host computers.  The EBCDIC encodings derive from 8-bit byte extensions
       of Hollerith punched card encodings.  The layout on the cards was such
       that high bits were set for the upper and lower case alphabet
       characters [a-z] and [A-Z], but there were gaps within each Latin
       alphabet range.

       Some IBM EBCDIC character sets may be known by character code set
       identification numbers (CCSID numbers) or code page numbers.  Leading
       zero digits in CCSID numbers within this document are insignificant.
       E.g. CCSID 0037 may be referred to as 37 in places.

       13 variant characters

       Among IBM EBCDIC character code sets there are 13 characters that are
       often mapped to different integer values.  Those characters are known
       as the 13 "variant" characters and are:

           \ [ ] { } ^ ~ ! # | $ @ `
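
       The variation can be made concrete with a minimal sketch that looks up
       the 13 characters in three EBCDIC code sets.  It assumes an ASCII-based
       Perl with the Encode module, and that Encode's "cp37", "cp1047" and
       "posix-bc" encodings correspond to the code sets discussed in this
       document.  The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The 13 variant characters listed above.
my @variant = ('\\', '[', ']', '{', '}', '^', '~', '!', '#', '|', '$', '@', '`');

# Encoding names as known to the Encode module (an assumption of this sketch).
my @encodings = ('cp37', 'cp1047', 'posix-bc');

# Map each character to its byte value in each EBCDIC code set.
my %code_point;
for my $char (@variant) {
    for my $enc (@encodings) {
        $code_point{$char}{$enc} = ord encode($enc, $char);
    }
}

# Print one row per character, one labelled column per code set.
for my $char (@variant) {
    printf "%s  %s\n", $char,
        join '  ', map { sprintf '%-8s %3d', $_, $code_point{$char}{$_} } @encodings;
}
```

       Each row shows one character, so the columns that disagree between
       code sets can be read off directly.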

       0037

       Character code set ID 0037 is a mapping of the ASCII plus Latin-1
       characters (i.e. ISO 8859-1) to an EBCDIC set.  0037 is used in North
       American English locales on the OS/400 operating system that runs on
       AS/400 computers.  CCSID 37 differs from ISO 8859-1 in 237 places; in
       other words, they agree on only 19 code point values.

       1047

       Character code set ID 1047 is also a mapping of the ASCII plus Latin-1
       characters (i.e. ISO 8859-1) to an EBCDIC set.  1047 is used under Unix
       System Services for OS/390 or z/OS, and OpenEdition for VM/ESA.  CCSID
       1047 differs from CCSID 0037 in eight places.

       POSIX-BC

       The EBCDIC code page in use on Siemens' BS2000 system is distinct from
       1047 and 0037.  It is identified below as the POSIX-BC set.

       Unicode code points versus EBCDIC code points

       In Unicode terminology a code point is the number assigned to a
       character: for example, in EBCDIC the character "A" is usually assigned
       the number 193.  In Unicode the character "A" is assigned the number
       65.  This causes a problem with the semantics of the pack/unpack "U"
       template, which is supposed to pack Unicode code points to characters
       and unpack them back to numbers.  The problem is: which code points to
       use for code points less than 256?  (For 256 and over there's no
       problem: Unicode code points are used.)  In EBCDIC, for the low 256
       the EBCDIC code points are used.  This means that the equivalences

               pack("U", ord($character)) eq $character
               unpack("U", $character) == ord $character

       will hold.  (If Unicode code points were applied consistently over all
       the possible code points, pack("U",ord("A")) would in EBCDIC equal A
       with acute, i.e. chr(101), and unpack("U", "A") would equal 65, the
       EBCDIC code point of non-breaking space, rather than 193, i.e.
       ord "A".)
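
       The two equivalences can be checked mechanically.  The following is a
       minimal sketch, not part of the original recipes, that loops over the
       native code points 0..255 and counts any failures.  It assumes a
       reasonably recent Perl; the variable names are illustrative.

```perl
use strict;
use warnings;

# Check both equivalences for every native code point below 256.
my $failures = 0;
for my $ord (0 .. 255) {
    my $character = chr $ord;
    $failures++ unless pack("U", ord($character)) eq $character;
    $failures++ unless unpack("U", $character) == ord $character;
}

print $failures
    ? "$failures mismatches\n"
    : "equivalences hold for 0..255\n";
```

       Because the loop only uses native code points, the same script is
       meant to be usable unchanged on ASCII-based and EBCDIC-based
       platforms.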

       Remaining Perl Unicode problems in EBCDIC

       ·   Many of the remaining problems seem to be related to
           case-insensitive matching: for example, "/[\x{131}]/" (LATIN SMALL
           LETTER DOTLESS I) does not match "I" case-insensitively, as it
           should under Unicode.  (The match succeeds on ASCII-derived
           platforms.)

       ·   The extensions Unicode::Collate and Unicode::Normalize are not
           supported under EBCDIC; likewise for the encoding pragma.

       Unicode and UTF

       UTF stands for Unicode Transformation Format.  UTF-8 is a Unicode
       conforming representation of the Unicode standard that looks very much
       like ASCII.  UTF-EBCDIC is an attempt to represent Unicode characters
       in an EBCDIC-transparent manner.
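
       As a rough illustration of what "looks very much like ASCII" means,
       code points below 128 encode in UTF-8 as one identical byte, while
       higher code points become multi-byte sequences.  The sketch below
       assumes an ASCII-based platform and the Encode module; utf8_bytes() is
       a hypothetical helper written for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 bytes of a code point as a dotted decimal string,
# in the same "194.160" notation used by the UTF-8 table column.
sub utf8_bytes {
    my ($code_point) = @_;
    return join '.', unpack 'C*', encode('UTF-8', chr $code_point);
}

printf "%3d -> %s\n", $_, utf8_bytes($_) for 65, 127, 128, 160, 228;
```

       The dotted notation matches the UTF-8 column of the single octet
       table, so individual rows can be spot-checked this way.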

       Using Encode

       Starting from Perl 5.8 you can use the new standard module Encode to
       translate from EBCDIC to Latin-1 code points:

               use Encode 'from_to';

               my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );

               # $a is in EBCDIC code points
               from_to($a, $ebcdic{ord '^'}, 'latin1');
               # $a is ISO 8859-1 code points

       and from Latin-1 code points to EBCDIC code points:

               use Encode 'from_to';

               my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );

               # $a is ISO 8859-1 code points
               from_to($a, 'latin1', $ebcdic{ord '^'});
               # $a is in EBCDIC code points
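
       The fragments above can be tried out as a small self-contained script.
       This is a minimal sketch that assumes an ASCII-based platform, so the
       Latin-1 side is the native one; the variable names and the fixed
       choice of "cp37" are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $text = "Perl";                      # ISO 8859-1 octets

# from_to() converts in place and returns the new length in octets,
# or undef on failure.
my $length = from_to($text, 'latin1', 'cp37');
defined $length or die "conversion to cp37 failed";

my @ebcdic = unpack 'C*', $text;        # byte values in CCSID 0037
print "cp37 bytes: @ebcdic\n";

# Converting back restores the original octets.
from_to($text, 'cp37', 'latin1');
print "round trip: $text\n";
```

       The byte values printed for "P", "e", "r" and "l" can be compared with
       the 0037 column of the single octet table.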

       For doing I/O it is suggested that you use the autotranslating features
       of PerlIO; see perluniintro.

       Since version 5.8 Perl uses the new PerlIO I/O library.  This enables
       you to use different encodings per I/O channel.  For example you may
       use

           use Encode;
           open($f, ">:encoding(ascii)", "test.ascii") or die "test.ascii: $!";
           print $f "Hello World!\n";
           open($f, ">:encoding(cp37)", "test.ebcdic") or die "test.ebcdic: $!";
           print $f "Hello World!\n";
           open($f, ">:encoding(latin1)", "test.latin1") or die "test.latin1: $!";
           print $f "Hello World!\n";
           open($f, ">:encoding(utf8)", "test.utf8") or die "test.utf8: $!";
           print $f "Hello World!\n";

       to get four files containing "Hello World!\n" in ASCII, CP 37 EBCDIC,
       ISO 8859-1 (Latin-1) (in this example identical to ASCII), and
       UTF-EBCDIC (in this example identical to normal EBCDIC), respectively.
       See the documentation of Encode::PerlIO for details.

       As the PerlIO layer uses raw I/O (bytes) internally, all this totally
       ignores things like the type of your filesystem (ASCII or EBCDIC).
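
       To see what an encoding layer actually writes without creating files,
       the output can be sent to an in-memory filehandle.  This is a minimal
       sketch assuming an ASCII-based platform; the $buffer and @bytes names
       are illustrative only.

```perl
use strict;
use warnings;

# Write through the cp37 layer into a scalar instead of a file.
my $buffer = '';
open(my $out, '>:encoding(cp37)', \$buffer)
    or die "Cannot open in-memory file: $!";
print {$out} "Hello";
close($out) or die "Cannot close in-memory file: $!";

# $buffer now holds the raw octets produced by the layer.
my @bytes = unpack 'C*', $buffer;
print "@bytes\n";
```

       The octets in $buffer are what would have landed in "test.ebcdic" for
       the same text.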

SINGLE OCTET TABLES

       The following tables list the ASCII and Latin 1 ordered sets including
       the subsets: C0 controls (0x00..0x1f), ASCII graphics (0x20..0x7e),
       delete (0x7f), C1 controls (0x80..0x9f), and Latin-1 (a.k.a. ISO
       8859-1) (0xa0..0xff).  In the table, non-printing control character
       names as well as the Latin 1 extensions to ASCII have been labelled
       with character names roughly corresponding to The Unicode Standard,
       Version 3.0, albeit with substitutions such as s/LATIN// and
       s/VULGAR// in all cases, s/CAPITAL LETTER// in some cases, and
       s/SMALL LETTER ([A-Z])/\l$1/ in some other cases (the "charnames"
       pragma names unfortunately do not list explicit names for the C0 or
       C1 control characters).  The "names" of the C1 control set (128..159
       in ISO 8859-1) listed here are somewhat arbitrary.  The differences
       between the 0037 and 1047 sets are flagged with ***.  The differences
       between the 1047 and POSIX-BC sets are flagged with ###.  All ord()
       numbers listed are decimal.  If you would rather see this table
       listing octal values then run the table (that is, the pod version of
       this document, since this recipe may not work with a
       pod2_other_format translation) through:

       recipe 0

           perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
            -e '{printf("%s%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5)}' perlebcdic.pod

       If you want to retain the UTF-x code points then in script form you
       might want to write:

       recipe 1

           open(my $fh, '<', 'perlebcdic.pod') or die "Could not open perlebcdic.pod: $!";
           while (<$fh>) {
               if (/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/)  {
                   if ($7 ne '' && $9 ne '') {
                       printf("%s%-9o%-9o%-9o%-9o%-3o.%-5o%-3o.%o\n",$1,$2,$3,$4,$5,$6,$7,$8,$9);
                   }
                   elsif ($7 ne '') {
                       printf("%s%-9o%-9o%-9o%-9o%-3o.%-5o%o\n",$1,$2,$3,$4,$5,$6,$7,$8);
                   }
                   else {
                       printf("%s%-9o%-9o%-9o%-9o%-9o%o\n",$1,$2,$3,$4,$5,$6,$8);
                   }
               }
           }
           close($fh);

       If you would rather see this table listing hexadecimal values then run
       the table through:

       recipe 2

           perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
            -e '{printf("%s%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5)}' perlebcdic.pod

       Or, in order to retain the UTF-x code points in hexadecimal:

       recipe 3

           open(my $fh, '<', 'perlebcdic.pod') or die "Could not open perlebcdic.pod: $!";
           while (<$fh>) {
               if (/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/)  {
                   if ($7 ne '' && $9 ne '') {
                       printf("%s%-9X%-9X%-9X%-9X%-2X.%-6X%-2X.%X\n",$1,$2,$3,$4,$5,$6,$7,$8,$9);
                   }
                   elsif ($7 ne '') {
                       printf("%s%-9X%-9X%-9X%-9X%-2X.%-6X%X\n",$1,$2,$3,$4,$5,$6,$7,$8);
                   }
                   else {
                       printf("%s%-9X%-9X%-9X%-9X%-9X%X\n",$1,$2,$3,$4,$5,$6,$8);
                   }
               }
           }
           close($fh);

                                                                            incomp-  incomp-
                                        8859-1                              lete     lete
           chr                          0819     0037     1047     POSIX-BC UTF-8    UTF-EBCDIC
           ------------------------------------------------------------------------------------
           <NULL>                       0        0        0        0        0        0
           <START OF HEADING>           1        1        1        1        1        1
           <START OF TEXT>              2        2        2        2        2        2
           <END OF TEXT>                3        3        3        3        3        3
           <END OF TRANSMISSION>        4        55       55       55       4        55
           <ENQUIRY>                    5        45       45       45       5        45
           <ACKNOWLEDGE>                6        46       46       46       6        46
           <BELL>                       7        47       47       47       7        47
           <BACKSPACE>                  8        22       22       22       8        22
           <HORIZONTAL TABULATION>      9        5        5        5        9        5
           <LINE FEED>                  10       37       21       21       10       21       ***
           <VERTICAL TABULATION>        11       11       11       11       11       11
           <FORM FEED>                  12       12       12       12       12       12
           <CARRIAGE RETURN>            13       13       13       13       13       13
           <SHIFT OUT>                  14       14       14       14       14       14
           <SHIFT IN>                   15       15       15       15       15       15
           <DATA LINK ESCAPE>           16       16       16       16       16       16
           <DEVICE CONTROL ONE>         17       17       17       17       17       17
           <DEVICE CONTROL TWO>         18       18       18       18       18       18
           <DEVICE CONTROL THREE>       19       19       19       19       19       19
           <DEVICE CONTROL FOUR>        20       60       60       60       20       60
           <NEGATIVE ACKNOWLEDGE>       21       61       61       61       21       61
           <SYNCHRONOUS IDLE>           22       50       50       50       22       50
           <END OF TRANSMISSION BLOCK>  23       38       38       38       23       38
           <CANCEL>                     24       24       24       24       24       24
           <END OF MEDIUM>              25       25       25       25       25       25
           <SUBSTITUTE>                 26       63       63       63       26       63
           <ESCAPE>                     27       39       39       39       27       39
           <FILE SEPARATOR>             28       28       28       28       28       28
           <GROUP SEPARATOR>            29       29       29       29       29       29
           <RECORD SEPARATOR>           30       30       30       30       30       30
           <UNIT SEPARATOR>             31       31       31       31       31       31
           <SPACE>                      32       64       64       64       32       64
           !                            33       90       90       90       33       90
           "                            34       127      127      127      34       127
           #                            35       123      123      123      35       123
           $                            36       91       91       91       36       91
           %                            37       108      108      108      37       108
           &                            38       80       80       80       38       80
           '                            39       125      125      125      39       125
           (                            40       77       77       77       40       77
           )                            41       93       93       93       41       93
           *                            42       92       92       92       42       92
           +                            43       78       78       78       43       78
           ,                            44       107      107      107      44       107
           -                            45       96       96       96       45       96
           .                            46       75       75       75       46       75
           /                            47       97       97       97       47       97
           0                            48       240      240      240      48       240
           1                            49       241      241      241      49       241
           2                            50       242      242      242      50       242
           3                            51       243      243      243      51       243
           4                            52       244      244      244      52       244
           5                            53       245      245      245      53       245
           6                            54       246      246      246      54       246
           7                            55       247      247      247      55       247
           8                            56       248      248      248      56       248
           9                            57       249      249      249      57       249
           :                            58       122      122      122      58       122
           ;                            59       94       94       94       59       94
           <                            60       76       76       76       60       76
           =                            61       126      126      126      61       126
           >                            62       110      110      110      62       110
           ?                            63       111      111      111      63       111
           @                            64       124      124      124      64       124
           A                            65       193      193      193      65       193
           B                            66       194      194      194      66       194
           C                            67       195      195      195      67       195
           D                            68       196      196      196      68       196
           E                            69       197      197      197      69       197
           F                            70       198      198      198      70       198
           G                            71       199      199      199      71       199
           H                            72       200      200      200      72       200
           I                            73       201      201      201      73       201
           J                            74       209      209      209      74       209
           K                            75       210      210      210      75       210
           L                            76       211      211      211      76       211
           M                            77       212      212      212      77       212
           N                            78       213      213      213      78       213
           O                            79       214      214      214      79       214
           P                            80       215      215      215      80       215
           Q                            81       216      216      216      81       216
           R                            82       217      217      217      82       217
           S                            83       226      226      226      83       226
           T                            84       227      227      227      84       227
           U                            85       228      228      228      85       228
           V                            86       229      229      229      86       229
           W                            87       230      230      230      87       230
           X                            88       231      231      231      88       231
           Y                            89       232      232      232      89       232
           Z                            90       233      233      233      90       233
           [                            91       186      173      187      91       173      *** ###
           \                            92       224      224      188      92       224      ###
           ]                            93       187      189      189      93       189      ***
           ^                            94       176      95       106      94       95       *** ###
           _                            95       109      109      109      95       109
           `                            96       121      121      74       96       121      ###
           a                            97       129      129      129      97       129
           b                            98       130      130      130      98       130
           c                            99       131      131      131      99       131
           d                            100      132      132      132      100      132
           e                            101      133      133      133      101      133
           f                            102      134      134      134      102      134
           g                            103      135      135      135      103      135
           h                            104      136      136      136      104      136
           i                            105      137      137      137      105      137
           j                            106      145      145      145      106      145
           k                            107      146      146      146      107      146
           l                            108      147      147      147      108      147
           m                            109      148      148      148      109      148
           n                            110      149      149      149      110      149
           o                            111      150      150      150      111      150
           p                            112      151      151      151      112      151
           q                            113      152      152      152      113      152
           r                            114      153      153      153      114      153
           s                            115      162      162      162      115      162
           t                            116      163      163      163      116      163
           u                            117      164      164      164      117      164
           v                            118      165      165      165      118      165
           w                            119      166      166      166      119      166
           x                            120      167      167      167      120      167
           y                            121      168      168      168      121      168
           z                            122      169      169      169      122      169
           {                            123      192      192      251      123      192      ###
           |                            124      79       79       79       124      79
           }                            125      208      208      253      125      208      ###
           ~                            126      161      161      255      126      161      ###
           <DELETE>                     127      7        7        7        127      7
           <C1 0>                       128      32       32       32       194.128  32
           <C1 1>                       129      33       33       33       194.129  33
           <C1 2>                       130      34       34       34       194.130  34
           <C1 3>                       131      35       35       35       194.131  35
           <C1 4>                       132      36       36       36       194.132  36
           <C1 5>                       133      21       37       37       194.133  37       ***
           <C1 6>                       134      6        6        6        194.134  6
           <C1 7>                       135      23       23       23       194.135  23
           <C1 8>                       136      40       40       40       194.136  40
           <C1 9>                       137      41       41       41       194.137  41
           <C1 10>                      138      42       42       42       194.138  42
           <C1 11>                      139      43       43       43       194.139  43
           <C1 12>                      140      44       44       44       194.140  44
           <C1 13>                      141      9        9        9        194.141  9
           <C1 14>                      142      10       10       10       194.142  10
           <C1 15>                      143      27       27       27       194.143  27
           <C1 16>                      144      48       48       48       194.144  48
           <C1 17>                      145      49       49       49       194.145  49
           <C1 18>                      146      26       26       26       194.146  26
           <C1 19>                      147      51       51       51       194.147  51
           <C1 20>                      148      52       52       52       194.148  52
           <C1 21>                      149      53       53       53       194.149  53
           <C1 22>                      150      54       54       54       194.150  54
           <C1 23>                      151      8        8        8        194.151  8
           <C1 24>                      152      56       56       56       194.152  56
           <C1 25>                      153      57       57       57       194.153  57
           <C1 26>                      154      58       58       58       194.154  58
           <C1 27>                      155      59       59       59       194.155  59
           <C1 28>                      156      4        4        4        194.156  4
           <C1 29>                      157      20       20       20       194.157  20
           <C1 30>                      158      62       62       62       194.158  62
           <C1 31>                      159      255      255      95       194.159  255      ###
           <NON-BREAKING SPACE>         160      65       65       65       194.160  128.65
           <INVERTED EXCLAMATION MARK>  161      170      170      170      194.161  128.66
           <CENT SIGN>                  162      74       74       176      194.162  128.67   ###
           <POUND SIGN>                 163      177      177      177      194.163  128.68
           <CURRENCY SIGN>              164      159      159      159      194.164  128.69
           <YEN SIGN>                   165      178      178      178      194.165  128.70
           <BROKEN BAR>                 166      106      106      208      194.166  128.71   ###
           <SECTION SIGN>               167      181      181      181      194.167  128.72
           <DIAERESIS>                  168      189      187      121      194.168  128.73   *** ###
           <COPYRIGHT SIGN>             169      180      180      180      194.169  128.74
           <FEMININE ORDINAL INDICATOR> 170      154      154      154      194.170  128.81
           <LEFT POINTING GUILLEMET>    171      138      138      138      194.171  128.82
           <NOT SIGN>                   172      95       176      186      194.172  128.83   *** ###
           <SOFT HYPHEN>                173      202      202      202      194.173  128.84
           <REGISTERED TRADE MARK SIGN> 174      175      175      175      194.174  128.85
           <MACRON>                     175      188      188      161      194.175  128.86   ###
           <DEGREE SIGN>                176      144      144      144      194.176  128.87
           <PLUS-OR-MINUS SIGN>         177      143      143      143      194.177  128.88
           <SUPERSCRIPT TWO>            178      234      234      234      194.178  128.89
           <SUPERSCRIPT THREE>          179      250      250      250      194.179  128.98
           <ACUTE ACCENT>               180      190      190      190      194.180  128.99
           <MICRO SIGN>                 181      160      160      160      194.181  128.100
           <PARAGRAPH SIGN>             182      182      182      182      194.182  128.101
           <MIDDLE DOT>                 183      179      179      179      194.183  128.102
           <CEDILLA>                    184      157      157      157      194.184  128.103
           <SUPERSCRIPT ONE>            185      218      218      218      194.185  128.104
           <MASC. ORDINAL INDICATOR>    186      155      155      155      194.186  128.105
           <RIGHT POINTING GUILLEMET>   187      139      139      139      194.187  128.106
           <FRACTION ONE QUARTER>       188      183      183      183      194.188  128.112
           <FRACTION ONE HALF>          189      184      184      184      194.189  128.113
           <FRACTION THREE QUARTERS>    190      185      185      185      194.190  128.114
           <INVERTED QUESTION MARK>     191      171      171      171      194.191  128.115
           <A WITH GRAVE>               192      100      100      100      195.128  138.65
           <A WITH ACUTE>               193      101      101      101      195.129  138.66
           <A WITH CIRCUMFLEX>          194      98       98       98       195.130  138.67
           <A WITH TILDE>               195      102      102      102      195.131  138.68
           <A WITH DIAERESIS>           196      99       99       99       195.132  138.69
           <A WITH RING ABOVE>          197      103      103      103      195.133  138.70
           <CAPITAL LIGATURE AE>        198      158      158      158      195.134  138.71
           <C WITH CEDILLA>             199      104      104      104      195.135  138.72
           <E WITH GRAVE>               200      116      116      116      195.136  138.73
           <E WITH ACUTE>               201      113      113      113      195.137  138.74
           <E WITH CIRCUMFLEX>          202      114      114      114      195.138  138.81
           <E WITH DIAERESIS>           203      115      115      115      195.139  138.82
           <I WITH GRAVE>               204      120      120      120      195.140  138.83
           <I WITH ACUTE>               205      117      117      117      195.141  138.84
           <I WITH CIRCUMFLEX>          206      118      118      118      195.142  138.85
           <I WITH DIAERESIS>           207      119      119      119      195.143  138.86
           <CAPITAL LETTER ETH>         208      172      172      172      195.144  138.87
           <N WITH TILDE>               209      105      105      105      195.145  138.88
           <O WITH GRAVE>               210      237      237      237      195.146  138.89
           <O WITH ACUTE>               211      238      238      238      195.147  138.98
           <O WITH CIRCUMFLEX>          212      235      235      235      195.148  138.99
           <O WITH TILDE>               213      239      239      239      195.149  138.100
           <O WITH DIAERESIS>           214      236      236      236      195.150  138.101
           <MULTIPLICATION SIGN>        215      191      191      191      195.151  138.102
           <O WITH STROKE>              216      128      128      128      195.152  138.103
           <U WITH GRAVE>               217      253      253      224      195.153  138.104  ###
           <U WITH ACUTE>               218      254      254      254      195.154  138.105
           <U WITH CIRCUMFLEX>          219      251      251      221      195.155  138.106  ###
           <U WITH DIAERESIS>           220      252      252      252      195.156  138.112
           <Y WITH ACUTE>               221      173      186      173      195.157  138.113  *** ###
           <CAPITAL LETTER THORN>       222      174      174      174      195.158  138.114
           <SMALL LETTER SHARP S>       223      89       89       89       195.159  138.115
           <a WITH GRAVE>               224      68       68       68       195.160  139.65
           <a WITH ACUTE>               225      69       69       69       195.161  139.66
           <a WITH CIRCUMFLEX>          226      66       66       66       195.162  139.67
           <a WITH TILDE>               227      70       70       70       195.163  139.68
           <a WITH DIAERESIS>           228      67       67       67       195.164  139.69
488           <a WITH RING ABOVE>          229      71       71       71       195.165  139.70
489           <SMALL LIGATURE ae>          230      156      156      156      195.166  139.71
490           <c WITH CEDILLA>             231      72       72       72       195.167  139.72
491           <e WITH GRAVE>               232      84       84       84       195.168  139.73
492           <e WITH ACUTE>               233      81       81       81       195.169  139.74
493           <e WITH CIRCUMFLEX>          234      82       82       82       195.170  139.81
494           <e WITH DIAERESIS>           235      83       83       83       195.171  139.82
495           <i WITH GRAVE>               236      88       88       88       195.172  139.83
496           <i WITH ACUTE>               237      85       85       85       195.173  139.84
497           <i WITH CIRCUMFLEX>          238      86       86       86       195.174  139.85
498           <i WITH DIAERESIS>           239      87       87       87       195.175  139.86
499           <SMALL LETTER eth>           240      140      140      140      195.176  139.87
500           <n WITH TILDE>               241      73       73       73       195.177  139.88
501           <o WITH GRAVE>               242      205      205      205      195.178  139.89
502           <o WITH ACUTE>               243      206      206      206      195.179  139.98
503           <o WITH CIRCUMFLEX>          244      203      203      203      195.180  139.99
504           <o WITH TILDE>               245      207      207      207      195.181  139.100
505           <o WITH DIAERESIS>           246      204      204      204      195.182  139.101
506           <DIVISION SIGN>              247      225      225      225      195.183  139.102
507           <o WITH STROKE>              248      112      112      112      195.184  139.103
508           <u WITH GRAVE>               249      221      221      192      195.185  139.104  ###
509           <u WITH ACUTE>               250      222      222      222      195.186  139.105
510           <u WITH CIRCUMFLEX>          251      219      219      219      195.187  139.106
511           <u WITH DIAERESIS>           252      220      220      220      195.188  139.112
512           <y WITH ACUTE>               253      141      141      141      195.189  139.113
513           <SMALL LETTER thorn>         254      142      142      142      195.190  139.114
514           <y WITH DIAERESIS>           255      223      223      223      195.191  139.115
515
       To see the above table in CCSID 0037 order instead of ASCII + Latin-1
       order, run the table through:
518
519       recipe 4
520
521           perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
522            -e '{push(@l,$_)}' \
523            -e 'END{print map{$_->[0]}' \
524            -e '          sort{$a->[1] <=> $b->[1]}' \
525            -e '          map{[$_,substr($_,42,3)]}@l;}' perlebcdic.pod
526
       If you would rather see it in CCSID 1047 order then change the number 42
       in the last line to 51, like this:
529
530       recipe 5
531
532           perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
533            -e '{push(@l,$_)}' \
534            -e 'END{print map{$_->[0]}' \
535            -e '          sort{$a->[1] <=> $b->[1]}' \
536            -e '          map{[$_,substr($_,51,3)]}@l;}' perlebcdic.pod
537
       If you would rather see it in POSIX-BC order then change the number 51
       in the last line to 60, like this:
540
541       recipe 6
542
543           perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
544            -e '{push(@l,$_)}' \
545            -e 'END{print map{$_->[0]}' \
546            -e '          sort{$a->[1] <=> $b->[1]}' \
547            -e '          map{[$_,substr($_,60,3)]}@l;}' perlebcdic.pod
548
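       Recipes 4 through 6 differ only in the substr() offset.  A minimal
       sketch of one script that takes the column as a parameter follows.  The
       sub names and the sample rows are illustrative assumptions, and the row
       layout (a name in angle brackets followed by whitespace separated
       numbers) is assumed to match the table above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the value in numeric column $col of a row.
# Columns count from 0 (0 = 8859-1, 1 = 0037, 2 = 1047, 3 = POSIX-BC).
sub column_value {
    my ($row, $col) = @_;
    (my $rest = $row) =~ s/^\s*<[^>]*>\s*//;   # drop the <NAME> field
    my @num = split ' ', $rest;
    return $num[$col];
}

# Hypothetical helper: sort table rows numerically by one column.
sub sort_rows_by_column {
    my ($col, @rows) = @_;
    my @keyed = map  { [ $_, column_value($_, $col) ] }
                grep { /^\s*<[^>]*>\s+\d/ } @rows;
    return map { $_->[0] } sort { $a->[1] <=> $b->[1] } @keyed;
}

my @sample = (
    "    <A WITH GRAVE>               192      100      100      100      195.128  138.65\n",
    "    <SMALL LETTER SHARP S>       223      89       89       89       195.159  138.115\n",
    "    <a WITH GRAVE>               224      68       68       68       195.160  139.65\n",
);

# Print the sample rows in CCSID 0037 order.
print sort_rows_by_column(1, @sample);
```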

IDENTIFYING CHARACTER CODE SETS

       To determine the character set you are running under from within perl,
       you can test the return value of ord() or chr() for one or more
       character values.  For example:
553
554           $is_ascii  = "A" eq chr(65);
555           $is_ebcdic = "A" eq chr(193);
556
557       Also, "\t" is a "HORIZONTAL TABULATION" character so that:
558
559           $is_ascii  = ord("\t") == 9;
560           $is_ebcdic = ord("\t") == 5;
561
562       To distinguish EBCDIC code pages try looking at one or more of the
563       characters that differ between them.  For example:
564
565           $is_ebcdic_37   = "\n" eq chr(37);
566           $is_ebcdic_1047 = "\n" eq chr(21);
567
       Or better still, choose a character that is encoded differently in each
       of the code sets, e.g.:
570
571           $is_ascii           = ord('[') == 91;
572           $is_ebcdic_37       = ord('[') == 186;
573           $is_ebcdic_1047     = ord('[') == 173;
574           $is_ebcdic_POSIX_BC = ord('[') == 187;
575
576       However, it would be unwise to write tests such as:
577
578           $is_ascii = "\r" ne chr(13);  #  WRONG
579           $is_ascii = "\n" ne chr(10);  #  ILL ADVISED
580
       The first of these fails to distinguish most ASCII machines from a
       CCSID 0037, a 1047, or a POSIX-BC EBCDIC machine, since "\r" eq chr(13)
       under all of those coded character sets.  Note too that because "\n" is
       chr(13) and "\r" is chr(10) on the Macintosh (which is an ASCII
       machine), the second $is_ascii test will lead to trouble there.
587
588       To determine whether or not perl was built under an EBCDIC code page
589       you can use the Config module like so:
590
591           use Config;
592           $is_ebcdic = $Config{'ebcdic'} eq 'define';
593
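       The ord('[') tests above can be folded into one lookup.  This is only a
       sketch: the sub name and the returned labels are assumptions made for
       illustration, and the four code numbers are the ones listed above.

```perl
use strict;
use warnings;

# Hypothetical helper: name the coded character set perl is running
# under, keyed on the ord('[') values given in the text.
sub code_set_name {
    my %by_bracket = (
         91 => 'ascii',
        186 => 'ebcdic-0037',
        173 => 'ebcdic-1047',
        187 => 'ebcdic-posix-bc',
    );
    my $name = $by_bracket{ ord('[') };
    return defined $name ? $name : 'unknown';
}

print code_set_name(), "\n";
```

       A lookup keeps the four cases in one place, and an unexpected value is
       reported as 'unknown' rather than silently matching nothing.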

CONVERSIONS

595       tr///
596
       To convert a string of characters from one character set to another,
       all that is needed is a simple list of numbers, such as those in the
       right-hand columns of the above table, along with perl's tr///
       operator.  The data in the table are in ASCII order, hence the EBCDIC
       columns provide easy-to-use ASCII to EBCDIC operations that are also
       easily reversed.

       For example, to convert ASCII to code page 037 take the output of the
       second column from the output of recipe 0 (modified to add \\
       characters) and use it in tr/// like so:
607
608           $cp_037 =
609           '\000\001\002\003\234\011\206\177\227\215\216\013\014\015\016\017' .
610           '\020\021\022\023\235\205\010\207\030\031\222\217\034\035\036\037' .
611           '\200\201\202\203\204\012\027\033\210\211\212\213\214\005\006\007' .
612           '\220\221\026\223\224\225\226\004\230\231\232\233\024\025\236\032' .
613           '\040\240\342\344\340\341\343\345\347\361\242\056\074\050\053\174' .
614           '\046\351\352\353\350\355\356\357\354\337\041\044\052\051\073\254' .
615           '\055\057\302\304\300\301\303\305\307\321\246\054\045\137\076\077' .
616           '\370\311\312\313\310\315\316\317\314\140\072\043\100\047\075\042' .
617           '\330\141\142\143\144\145\146\147\150\151\253\273\360\375\376\261' .
618           '\260\152\153\154\155\156\157\160\161\162\252\272\346\270\306\244' .
619           '\265\176\163\164\165\166\167\170\171\172\241\277\320\335\336\256' .
620           '\136\243\245\267\251\247\266\274\275\276\133\135\257\250\264\327' .
621           '\173\101\102\103\104\105\106\107\110\111\255\364\366\362\363\365' .
622           '\175\112\113\114\115\116\117\120\121\122\271\373\374\371\372\377' .
623           '\134\367\123\124\125\126\127\130\131\132\262\324\326\322\323\325' .
624           '\060\061\062\063\064\065\066\067\070\071\263\333\334\331\332\237' ;
625
626           my $ebcdic_string = $ascii_string;
627           eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';
628
629       To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
630       arguments like so:
631
632           my $ascii_string = $ebcdic_string;
633           eval '$ascii_string =~ tr/\000-\377/' . $cp_037 . '/';
634
635       Similarly one could take the output of the third column from recipe 0
636       to obtain a $cp_1047 table.  The fourth column of the output from
637       recipe 0 could provide a $cp_posix_bc table suitable for transcoding as
638       well.
639
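       The step of adding backslashes to a list of numbers and wrapping the
       result in an eval'd tr/// can be scripted.  The following is a sketch
       under stated assumptions: octal_escapes() and make_transcoder() are
       hypothetical names, and the reversed table is a stand-in used only so
       that the example is short, not a real code page.

```perl
use strict;
use warnings;

# Format a list of code numbers as the '\ooo' escapes that tr/// accepts;
# (0, 1, 156) becomes the 12-character string '\000\001\234'.
sub octal_escapes {
    return join '', map { sprintf '\\%03o', $_ } @_;
}

# Hypothetical helper: given 256 source codes listed in target order
# (the layout of $cp_037 above), return a sub that transcodes a string.
sub make_transcoder {
    my @source_codes = @_;
    die "need exactly 256 codes\n" unless @source_codes == 256;
    my $from = octal_escapes(@source_codes);
    my $code = eval "sub { (my \$s = shift) =~ tr/$from/\\000-\\377/; \$s }";
    die $@ if $@;
    return $code;
}

# Stand-in table for demonstration only: maps code n to code 255 - n.
my $flip = make_transcoder(reverse 0 .. 255);
printf "%vd\n", $flip->("\000\001\377");   # prints 255.254.0
```

       Compiling the tr/// once and reusing the returned sub avoids running
       eval for every string that is converted.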
640       iconv
641
642       XPG operability often implies the presence of an iconv utility avail‐
643       able from the shell or from the C library.  Consult your system's docu‐
644       mentation for information on iconv.
645
646       On OS/390 or z/OS see the iconv(1) manpage.  One way to invoke the
647       iconv shell utility from within perl would be to:
648
649           # OS/390 or z/OS example
           $ascii_data = `echo '$ebcdic_data' | iconv -f IBM-1047 -t ISO8859-1`;
651
652       or the inverse map:
653
654           # OS/390 or z/OS example
           $ebcdic_data = `echo '$ascii_data' | iconv -f ISO8859-1 -t IBM-1047`;
656
657       For other perl based conversion options see the Convert::* modules on
658       CPAN.
659
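       On perls that ship the Encode module the conversion can also be done
       without a shell.  A minimal sketch, assuming the encoding names cp37,
       cp1047 and posix-bc supplied by Encode::EBCDIC.  These tables may not
       agree with perl's own EBCDIC mapping for every control character
       (newline in particular), so compare them with the table above before
       relying on them:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = 'Hello [world]';

for my $name (qw(cp37 cp1047 posix-bc)) {
    # Convert to the EBCDIC code page and back again.
    my $octets     = encode($name, $text);
    my $round_trip = decode($name, $octets);
    die "round trip failed for $name\n" unless $round_trip eq $text;

    # Show the code of '[' under each code page; compare with the
    # ord('[') values in IDENTIFYING CHARACTER CODE SETS.
    printf "%-8s '[' => %d\n", $name, ord(encode($name, '['));
}
```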
660       C RTL
661
662       The OS/390 and z/OS C run time libraries provide _atoe() and _etoa()
663       functions.
664

OPERATOR DIFFERENCES

666       The ".." range operator treats certain character ranges with care on
667       EBCDIC machines.  For example the following array will have twenty six
668       elements on either an EBCDIC machine or an ASCII machine:
669
670           @alphabet = ('A'..'Z');   #  $#alphabet == 25
671
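       The care taken by ".." is needed because the EBCDIC alphabet is not
       contiguous.  The sketch below counts the code points lying between
       neighbouring letters; the choice of letter pairs is illustrative.  On
       ASCII every count is 0, while under the EBCDIC code pages in the table
       above I..J and R..S are separated by 7 and 8 other code points
       respectively.

```perl
use strict;
use warnings;

my @alphabet = ('A' .. 'Z');
print scalar(@alphabet), " letters\n";

# Report how many code points sit between neighbouring letters.
for my $pair (['H', 'I'], ['I', 'J'], ['R', 'S']) {
    my ($lo, $hi) = @$pair;
    printf "%s..%s: %d code points in between\n",
        $lo, $hi, ord($hi) - ord($lo) - 1;
}
```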
       The bitwise operators such as & ^ | may return different results when
673       operating on string or character data in a perl program running on an
674       EBCDIC machine than when run on an ASCII machine.  Here is an example
675       adapted from the one in perlop:
676
677           # EBCDIC-based examples
678           print "j p \n" ^ " a h";                      # prints "JAPH\n"
           print "JA" | "  ph\n";                        # prints "japh\n"
680           print "JAPH\nJunk" & "\277\277\277\277\277";  # prints "japh\n";
681           print 'p N$' ^ " E<H\n";                      # prints "Perl\n";
682
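       For contrast, here are the same four expressions as they are usually
       written for ASCII machines.  This is a sketch: the guard on ord('A')
       is an addition made here so that the ASCII-specific operands are only
       used where they apply, and uc()/lc() are shown as a way to change case
       that does not depend on the code page.

```perl
use strict;
use warnings;

# ASCII-based examples; the string operands are code-page specific.
if (ord('A') == 65) {
    print "j p \n" ^ " a h";            # "JAPH\n" on ASCII
    print "JA" | "  ph\n";              # "japh\n" on ASCII
    print "japh\nJunk" & '_____';       # "JAPH\n" on ASCII
    print 'p N$' ^ " E<H\n";            # "Perl\n" on ASCII
}

# Case changes that do not depend on the code page.
print uc("japh"), "\n";
print lc("JAPH"), "\n";
```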
       An interesting property of the 32 C0 control characters in the ASCII
       table is that they can "literally" be constructed as control characters
       in perl, e.g. (chr(0) eq "\c@"), (chr(1) eq "\cA"), and so on.  Perl
       on EBCDIC machines has been ported to take "\c@" to chr(0) and "\cA" to
       chr(1) as well, but the thirty-three characters that result depend on
       which code page you are using.  The table below uses the character
       names from the previous table but with substitutions such as
       s/START OF/S.O./; s/END OF /E.O./; s/TRANSMISSION/TRANS./;
       s/TABULATION/TAB./; s/VERTICAL/VERT./; s/HORIZONTAL/HORIZ./;
       s/DEVICE CONTROL/D.C./; s/SEPARATOR/SEP./;
       s/NEGATIVE ACKNOWLEDGE/NEG. ACK./;.  The POSIX-BC and 1047 sets are
       identical throughout this range and differ from the 0037 set at only
       one spot (21 decimal).  Note that the "LINE FEED" character may be
       generated by "\cJ" on ASCII machines but by "\cU" on 1047 or POSIX-BC
       machines and cannot be generated as a "\c.letter." control character
       on 0037 machines.  Note also that "\c\\" maps to two characters, not
       one.
699
700           chr   ord  8859-1               0037                1047 && POSIX-BC
701           ------------------------------------------------------------------------
702           "\c?" 127  <DELETE>             "                   "              ***><
703           "\c@"   0  <NULL>               <NULL>              <NULL>         ***><
704           "\cA"   1  <S.O. HEADING>       <S.O. HEADING>      <S.O. HEADING>
705           "\cB"   2  <S.O. TEXT>          <S.O. TEXT>         <S.O. TEXT>
706           "\cC"   3  <E.O. TEXT>          <E.O. TEXT>         <E.O. TEXT>
707           "\cD"   4  <E.O. TRANS.>        <C1 28>             <C1 28>
708           "\cE"   5  <ENQUIRY>            <HORIZ. TAB.>       <HORIZ. TAB.>
709           "\cF"   6  <ACKNOWLEDGE>        <C1 6>              <C1 6>
710           "\cG"   7  <BELL>               <DELETE>            <DELETE>
711           "\cH"   8  <BACKSPACE>          <C1 23>             <C1 23>
712           "\cI"   9  <HORIZ. TAB.>        <C1 13>             <C1 13>
713           "\cJ"  10  <LINE FEED>          <C1 14>             <C1 14>
714           "\cK"  11  <VERT. TAB.>         <VERT. TAB.>        <VERT. TAB.>
715           "\cL"  12  <FORM FEED>          <FORM FEED>         <FORM FEED>
716           "\cM"  13  <CARRIAGE RETURN>    <CARRIAGE RETURN>   <CARRIAGE RETURN>
717           "\cN"  14  <SHIFT OUT>          <SHIFT OUT>         <SHIFT OUT>
718           "\cO"  15  <SHIFT IN>           <SHIFT IN>          <SHIFT IN>
719           "\cP"  16  <DATA LINK ESCAPE>   <DATA LINK ESCAPE>  <DATA LINK ESCAPE>
720           "\cQ"  17  <D.C. ONE>           <D.C. ONE>          <D.C. ONE>
721           "\cR"  18  <D.C. TWO>           <D.C. TWO>          <D.C. TWO>
722           "\cS"  19  <D.C. THREE>         <D.C. THREE>        <D.C. THREE>
723           "\cT"  20  <D.C. FOUR>          <C1 29>             <C1 29>
724           "\cU"  21  <NEG. ACK.>          <C1 5>              <LINE FEED>    ***
725           "\cV"  22  <SYNCHRONOUS IDLE>   <BACKSPACE>         <BACKSPACE>
726           "\cW"  23  <E.O. TRANS. BLOCK>  <C1 7>              <C1 7>
727           "\cX"  24  <CANCEL>             <CANCEL>            <CANCEL>
728           "\cY"  25  <E.O. MEDIUM>        <E.O. MEDIUM>       <E.O. MEDIUM>
729           "\cZ"  26  <SUBSTITUTE>         <C1 18>             <C1 18>
730           "\c["  27  <ESCAPE>             <C1 15>             <C1 15>
731           "\c\\" 28  <FILE SEP.>\         <FILE SEP.>\        <FILE SEP.>\
732           "\c]"  29  <GROUP SEP.>         <GROUP SEP.>        <GROUP SEP.>
733           "\c^"  30  <RECORD SEP.>        <RECORD SEP.>       <RECORD SEP.>  ***><
734           "\c_"  31  <UNIT SEP.>          <UNIT SEP.>         <UNIT SEP.>    ***><
735

FUNCTION DIFFERENCES

737       chr()   chr() must be given an EBCDIC code number argument to yield a
738               desired character return value on an EBCDIC machine.  For exam‐
739               ple:
740
741                   $CAPITAL_LETTER_A = chr(193);
742
743       ord()   ord() will return EBCDIC code number values on an EBCDIC
744               machine.  For example:
745
746                   $the_number_193 = ord("A");
747
748       pack()  The c and C templates for pack() are dependent upon character
749               set encoding.  Examples of usage on EBCDIC include:
750
751                   $foo = pack("CCCC",193,194,195,196);
752                   # $foo eq "ABCD"
753                   $foo = pack("C4",193,194,195,196);
754                   # same thing
755
756                   $foo = pack("ccxxcc",193,194,195,196);
757                   # $foo eq "AB\0\0CD"
758
759       print() One must be careful with scalars and strings that are passed to
760               print that contain ASCII encodings.  One common place for this
761               to occur is in the output of the MIME type header for CGI
762               script writing.  For example, many perl programming guides rec‐
763               ommend something similar to:
764
765                   print "Content-type:\ttext/html\015\012\015\012";
766                   # this may be wrong on EBCDIC
767
768               Under the IBM OS/390 USS Web Server or WebSphere on z/OS for
769               example you should instead write that as:
770
771                   print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et alia
772
773               That is because the translation from EBCDIC to ASCII is done by
774               the web server in this case (such code will not be appropriate
775               for the Macintosh however).  Consult your web server's documen‐
776               tation for further details.
777
778       printf()
779               The formats that can convert characters to numbers and vice
780               versa will be different from their ASCII counterparts when exe‐
781               cuted on an EBCDIC machine.  Examples include:
782
783                   printf("%c%c%c",193,194,195);  # prints ABC
784
785       sort()  EBCDIC sort results may differ from ASCII sort results espe‐
786               cially for mixed case strings.  This is discussed in more
787               detail below.
788
789       sprintf()
790               See the discussion of printf() above.  An example of the use of
791               sprintf would be:
792
793                   $CAPITAL_LETTER_A = sprintf("%c",193);
794
795       unpack()
796               See the discussion of pack() above.
797
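       The examples in this section hard-code EBCDIC numbers.  Where the same
       source must run on both kinds of machine, the numbers can be derived
       instead.  A minimal sketch, assuming a perl recent enough to provide
       utf8::unicode_to_native() and utf8::native_to_unicode(); the variable
       names are illustrative:

```perl
use strict;
use warnings;

# Derive code numbers from character literals where possible, so the
# same source works under ASCII and EBCDIC.
my $code_of_A = ord('A');           # 65 on ASCII, 193 on EBCDIC
my $letter_A  = chr($code_of_A);    # always "A"

# When the input really is an ASCII/Latin-1 number (for example from a
# network protocol), translate it to the native code first.
my $native = utf8::unicode_to_native(65);
my $also_A = chr($native);                          # "A" on either machine
my $latin1 = utf8::native_to_unicode(ord('A'));     # 65 on either machine

printf "%s %s %d\n", $letter_A, $also_A, $latin1;
```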

REGULAR EXPRESSION DIFFERENCES

       As of perl 5.005_03 letter range regular expressions such as [A-Z]
       and [a-z] have been specially coded to not pick up gap characters.
       For example, characters such as ô "o WITH CIRCUMFLEX" that lie between
       I and J would not be matched by the regular expression range "/[H-K]/".
       This works in the other direction, too, if either of the range end
       points is explicitly numeric: "[\x89-\x91]" will match "\x8e", even
       though "\x89" is "i", "\x91" is "j", and "\x8e" is a gap character
       from the alphabetic viewpoint.
807
808       If you do want to match the alphabet gap characters in a single octet
809       regular expression try matching the hex or octal code such as "/\313/"
810       on EBCDIC or "/\364/" on ASCII machines to have your regular expression
811       match "o WITH CIRCUMFLEX".
812
813       Another construct to be wary of is the inappropriate use of hex or
814       octal constants in regular expressions.  Consider the following set of
815       subs:
816
817           sub is_c0 {
818               my $char = substr(shift,0,1);
819               $char =~ /[\000-\037]/;
820           }
821
822           sub is_print_ascii {
823               my $char = substr(shift,0,1);
824               $char =~ /[\040-\176]/;
825           }
826
827           sub is_delete {
828               my $char = substr(shift,0,1);
829               $char eq "\177";
830           }
831
832           sub is_c1 {
833               my $char = substr(shift,0,1);
834               $char =~ /[\200-\237]/;
835           }
836
837           sub is_latin_1 {
838               my $char = substr(shift,0,1);
839               $char =~ /[\240-\377]/;
840           }
841
842       The above would be adequate if the concern was only with numeric code
843       points.  However, the concern may be with characters rather than code
844       points and on an EBCDIC machine it may be desirable for constructs such
845       as "if (is_print_ascii("A")) {print "A is a printable character\n";}"
846       to print out the expected message.  One way to represent the above col‐
847       lection of character classification subs that is capable of working
848       across the four coded character sets discussed in this document is as
849       follows:
850
851           sub Is_c0 {
852               my $char = substr(shift,0,1);
853               if (ord('^')==94)  { # ascii
854                   return $char =~ /[\000-\037]/;
855               }
856               if (ord('^')==176) { # 37
857                   return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
858               }
               if (ord('^')==95 || ord('^')==106) { # 1047 || posix-bc
860                   return $char =~ /[\000-\003\067\055-\057\026\005\025\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
861               }
862           }
863
864           sub Is_print_ascii {
865               my $char = substr(shift,0,1);
               $char =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;
867           }
868
869           sub Is_delete {
870               my $char = substr(shift,0,1);
871               if (ord('^')==94)  { # ascii
872                   return $char eq "\177";
873               }
874               else  {              # ebcdic
875                   return $char eq "\007";
876               }
877           }
878
           sub Is_c1 {
               my $char = substr(shift,0,1);
               if (ord('^')==94)  { # ascii
                   return $char =~ /[\200-\237]/;
               }
               # In the EBCDIC classes below, \004 is the code that C1
               # character 156 maps to (see the "\cD" row of the table above).
               if (ord('^')==176) { # 37
                   return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
               }
               if (ord('^')==95)  { # 1047
                   return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
               }
               if (ord('^')==106) { # posix-bc
                   return $char =~
                     /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\137]/;
               }
           }
895
896           sub Is_latin_1 {
897               my $char = substr(shift,0,1);
898               if (ord('^')==94)  { # ascii
899                   return $char =~ /[\240-\377]/;
900               }
901               if (ord('^')==176) { # 37
902                   return $char =~
903                     /[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/;
904               }
905               if (ord('^')==95)  { # 1047
906                   return $char =~
907                     /[\101\252\112\261\237\262\152\265\273\264\232\212\260\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\272\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/;
908               }
909               if (ord('^')==106) { # posix-bc
910                   return $char =~
911                     /[\101\252\260\261\237\262\320\265\171\264\232\212\272\312\257\241\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\340\376\335\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\300\336\333\334\215\216\337]/;
912               }
913           }
914
       Note however that only the "Is_print_ascii()" sub is really independent
       of coded character set.  Another way to write "Is_latin_1()" would be
       to use the characters in the range explicitly:
918
919           sub Is_latin_1 {
920               my $char = substr(shift,0,1);
921               $char =~ /[ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
922           }
923
       That form, however, may run into trouble in network transit (due to the
       presence of 8 bit characters) or on non ISO-Latin character sets.
926
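       The per-code-page character classes can be avoided by classifying on
       the ISO 8859-1 number of the character instead of its native code.  The
       following is a sketch, not the method used above: it assumes a perl
       that provides utf8::native_to_unicode(), and latin1_code() is a
       hypothetical helper name.

```perl
use strict;
use warnings;

# Return the ISO 8859-1 number of the first character of a string.
sub latin1_code {
    my $char = substr(shift, 0, 1);
    return utf8::native_to_unicode(ord $char);
}

sub Is_c0      { my $n = latin1_code(shift); return $n <= 31 }
sub Is_delete  { return latin1_code(shift) == 127 }
sub Is_c1      { my $n = latin1_code(shift); return $n >= 128 && $n <= 159 }
sub Is_latin_1 { my $n = latin1_code(shift); return $n >= 160 && $n <= 255 }

print "tab is a C0 control\n"   if     Is_c0("\t");
print "A is not a C0 control\n" unless Is_c0("A");
```

       The numeric comparisons are the same on every platform; only the
       translation from native code to ISO 8859-1 number differs.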

SOCKETS

928       Most socket programming assumes ASCII character encodings in network
929       byte order.  Exceptions can include CGI script writing under a host web
930       server where the server may take care of translation for you.  Most
931       host web servers convert EBCDIC data to ISO-8859-1 or Unicode on out‐
932       put.
933

SORTING

       One big difference between ASCII based character sets and EBCDIC ones
       is the relative positions of upper and lower case letters, and of the
       letters compared to the digits.  If sorted on an ASCII based machine the
       two letter abbreviation for a physician comes before the two letter
       abbreviation for drive, that is:
940
941           @sorted = sort(qw(Dr. dr.));  # @sorted holds ('Dr.','dr.') on ASCII,
942                                         # but ('dr.','Dr.') on EBCDIC
943
       The property of lower case before uppercase letters in EBCDIC is even
       carried to the Latin 1 EBCDIC pages such as 0037 and 1047.  An example
       would be that Ë "E WITH DIAERESIS" (203) comes before ë "e WITH
       DIAERESIS" (235) on an ASCII machine, but the latter (83) comes before
       the former (115) on an EBCDIC machine.  (Astute readers will note that
       the upper case version of ß "SMALL LETTER SHARP S" is simply "SS" and
       that the upper case version of ÿ "y WITH DIAERESIS" is not in the
       0..255 range but it is at U+0178 in Unicode, or "\x{178}" in a Unicode
       enabled Perl).
953
954       The sort order will cause differences between results obtained on ASCII
955       machines versus EBCDIC machines.  What follows are some suggestions on
956       how to deal with these differences.
957
958       Ignore ASCII vs. EBCDIC sort differences.
959
960       This is the least computationally expensive strategy.  It may require
961       some user education.
962
963       MONO CASE then sort data.
964
       In order to minimize the expense of mono casing mixed case text try to
966       "tr///" towards the character set case most employed within the data.
967       If the data are primarily UPPERCASE non Latin 1 then apply
968       tr/[a-z]/[A-Z]/ then sort().  If the data are primarily lowercase non
969       Latin 1 then apply tr/[A-Z]/[a-z]/ before sorting.  If the data are
970       primarily UPPERCASE and include Latin-1 characters then apply:
971
972           tr/[a-z]/[A-Z]/;
973           tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
974           s/ß/SS/g;
975
       then sort().  Do note however that such Latin-1 manipulation does not
       address the ÿ "y WITH DIAERESIS" character that will remain at code
       point 255 on ASCII machines, but 223 on most EBCDIC machines where it
       will sort to a place less than the EBCDIC numerals.  With a Unicode
       enabled Perl you might try:

           tr/ÿ/\x{178}/;
983
984       The strategy of mono casing data before sorting does not preserve the
985       case of the data and may not be acceptable for that reason.
986
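       If losing the case of the data is the objection, the mono cased form
       can be used as a sort key only.  A minimal sketch for non Latin 1 data;
       the sub name is an assumption, and strings that differ only in case
       still fall back to the native order, so that pair may come out
       differently on ASCII and EBCDIC:

```perl
use strict;
use warnings;

# Sort on an upper-cased key but return the original strings, so the
# case of the data is preserved.  Ties are broken by the native order.
sub sort_ignoring_case {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] }
           map  { [ uc($_), $_ ] } @_;
}

my @sorted = sort_ignoring_case(qw(dr. Zebra apple Dr. Banana));
print "@sorted\n";
```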
       Convert, sort data, then reconvert.
988
989       This is the most expensive proposition that does not employ a network
990       connection.
991
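       One way to sketch this strategy is to sort on a converted copy of each
       string and return the originals, which makes the re-conversion step
       unnecessary.  The example assumes single octet data and a perl that
       provides utf8::native_to_unicode(); both sub names are hypothetical.

```perl
use strict;
use warnings;

# Build a sort key holding the ISO 8859-1 number of each character, so
# that cmp compares in ASCII + Latin-1 order on any platform.
sub latin1_key {
    my $string = shift;
    return pack 'C*', map { utf8::native_to_unicode($_) } unpack 'C*', $string;
}

# Sort the original strings by their converted keys.
sub sort_in_ascii_order {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ latin1_key($_), $_ ] } @_;
}

my @sorted = sort_in_ascii_order(qw(dr. Dr.));
print "@sorted\n";   # intended to give "Dr. dr." on ASCII and EBCDIC alike
```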
992       Perform sorting on one type of machine only.
993
994       This strategy can employ a network connection.  As such it would be
995       computationally expensive.
996

TRANSFORMATION FORMATS

998       There are a variety of ways of transforming data with an intra charac‐
999       ter set mapping that serve a variety of purposes.  Sorting was dis‐
1000       cussed in the previous section and a few of the other more popular map‐
1001       ping techniques are discussed next.
1002
1003       URL decoding and encoding
1004
1005       Note that some URLs have hexadecimal ASCII code points in them in an
1006       attempt to overcome character or protocol limitation issues.  For exam‐
1007       ple the tilde character is not on every keyboard hence a URL of the
1008       form:
1009
1010           http://www.pvhp.com/~pvhp/
1011
1012       may also be expressed as either of:
1013
1014           http://www.pvhp.com/%7Epvhp/
1015
1016           http://www.pvhp.com/%7epvhp/
1017
1018       where 7E is the hexadecimal ASCII code point for '~'.  Here is an exam‐
1019       ple of decoding such a URL under CCSID 1047:
1020
1021           $url = 'http://www.pvhp.com/%7Epvhp/';
1022           # this array assumes code page 1047
1023           my @a2e_1047 = (
1024                 0,  1,  2,  3, 55, 45, 46, 47, 22,  5, 21, 11, 12, 13, 14, 15,
1025                16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31,
1026                64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97,
1027               240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111,
1028               124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214,
1029               215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109,
1030               121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150,
1031               151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161,  7,
1032                32, 33, 34, 35, 36, 37,  6, 23, 40, 41, 42, 43, 44,  9, 10, 27,
1033                48, 49, 26, 51, 52, 53, 54,  8, 56, 57, 58, 59,  4, 20, 62,255,
1034                65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188,
1035               144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171,
1036               100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119,
1037               172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89,
1038                68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87,
1039               140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223
1040           );
           $url =~ s/%([0-9a-fA-F]{2})/pack("C",$a2e_1047[hex($1)])/ge;
1042
1043       Conversely, here is a partial solution for the task of encoding such a
1044       URL under the 1047 code page:
1045
           $url = 'http://www.pvhp.com/~pvhp/';
           # this array assumes code page 1047
           my @e2a_1047 = (
                 0,  1,  2,  3,156,  9,134,127,151,141,142, 11, 12, 13, 14, 15,
                16, 17, 18, 19,157, 10,  8,135, 24, 25,146,143, 28, 29, 30, 31,
               128,129,130,131,132,133, 23, 27,136,137,138,139,140,  5,  6,  7,
               144,145, 22,147,148,149,150,  4,152,153,154,155, 20, 21,158, 26,
                32,160,226,228,224,225,227,229,231,241,162, 46, 60, 40, 43,124,
                38,233,234,235,232,237,238,239,236,223, 33, 36, 42, 41, 59, 94,
                45, 47,194,196,192,193,195,197,199,209,166, 44, 37, 95, 62, 63,
               248,201,202,203,200,205,206,207,204, 96, 58, 35, 64, 39, 61, 34,
               216, 97, 98, 99,100,101,102,103,104,105,171,187,240,253,254,177,
               176,106,107,108,109,110,111,112,113,114,170,186,230,184,198,164,
               181,126,115,116,117,118,119,120,121,122,161,191,208, 91,222,174,
               172,163,165,183,169,167,182,188,189,190,221,168,175, 93,180,215,
               123, 65, 66, 67, 68, 69, 70, 71, 72, 73,173,244,246,242,243,245,
               125, 74, 75, 76, 77, 78, 79, 80, 81, 82,185,251,252,249,250,255,
                92,247, 83, 84, 85, 86, 87, 88, 89, 90,178,212,214,210,211,213,
                48, 49, 50, 51, 52, 53, 54, 55, 56, 57,179,219,220,217,218,159
           );
           # The following regular expression does not address the
           # mappings for: ('.' => '%2E', '/' => '%2F', ':' => '%3A')
           $url =~ s/([\t "#%&\(\),;<=>\?\@\[\\\]^`{|}~])/sprintf("%%%02X",$e2a_1047[ord($1)])/ge;
1069
1070       where a more complete solution would split the URL into components and
1071       apply a full s/// substitution only to the appropriate parts.
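
       As a rough illustration of that idea, the sketch below separates the
       "scheme://host" prefix from the path and encodes only the path.  The
       name encode_url_path and the choice of which characters to leave
       unencoded are assumptions made for this example; query strings and
       fragments are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: percent-encode the path part of a simple
# scheme://host/path URL.  $e2a refers to an EBCDIC-to-ASCII table such
# as @e2a_1047; an identity map (0 .. 255) suits an ASCII machine.
sub encode_url_path {
    my ($url, $e2a) = @_;
    # Split off "scheme://host"; everything after it is treated as path.
    my ($prefix, $path) = $url =~ m{^([A-Za-z][A-Za-z0-9+.\-]*://[^/]*)(.*)$}
        or return $url;    # not in the assumed shape, leave untouched
    # Encode everything except letters, digits, "-", ".", "_" and "/".
    $path =~ s{([^A-Za-z0-9\-._/])}
              {sprintf("%%%02X", $e2a->[ord($1)])}ge;
    return $prefix . $path;
}

my @identity = (0 .. 255);    # stands in for @e2a on an ASCII machine
print encode_url_path('http://www.pvhp.com/~pvhp/', \@identity), "\n";
```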
1072
1073       In the remaining examples a @e2a or @a2e array may be employed but the
1074       assignment will not be shown explicitly.  For code page 1047 you could
1075       use the @a2e_1047 or @e2a_1047 arrays just shown.
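
       When only one of the two tables is at hand, the other can be derived
       from it, assuming the table is a one-to-one mapping of 0..255 onto
       itself.  The helper name invert_table and the toy table below are
       illustrative only; a real script would pass @e2a_1047 or a similar
       table.

```perl
use strict;
use warnings;

# Hypothetical helper: invert a one-to-one translation table.
sub invert_table {
    my @table = @_;
    my @inverse;
    @inverse[@table] = (0 .. $#table);    # inverse[table[i]] = i
    return @inverse;
}

# Toy stand-in for a real EBCDIC-to-ASCII table, so the sketch runs anywhere.
my @toy_e2a = reverse(0 .. 255);
my @toy_a2e = invert_table(@toy_e2a);

my $ok = 1;
for my $i (0 .. 255) {
    $ok = 0 unless $toy_a2e[ $toy_e2a[$i] ] == $i;
}
print $ok ? "round trip ok\n" : "tables are not inverses\n";
```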
1076
1077       uu encoding and decoding
1078
1079       The "u" template to pack() or unpack() will render EBCDIC data in
1080       EBCDIC characters equivalent to their ASCII counterparts.  For example,
1081       the following will print "Yes indeed\n" on either an ASCII or EBCDIC
1082       computer:
1083
           $all_byte_chrs = '';
           for (0..255) { $all_byte_chrs .= chr($_); }
           $uuencode_byte_chrs = pack('u', $all_byte_chrs);
           ($uu = <<'ENDOFHEREDOC') =~ s/^\s*//gm;
           M``$"`P0%!@<("0H+#`T.#Q`1$A,4%187&!D:&QP='A\@(2(C)"4F)R@I*BLL
           M+2XO,#$R,S0U-C<X.3H[/#T^/T!!0D-$149'2$E*2TQ-3D]045)35%565UA9
           M6EM<75Y?8&%B8V1E9F=H:6IK;&UN;W!Q<G-T=79W>'EZ>WQ]?G^`@8*#A(6&
           MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S
           MM+6VM[BYNKN\O;Z_P,'"P\3%QL?(R<K+S,W.S]#1TM/4U=;7V-G:V]S=WM_@
           ?X>+CY.7FY^CIZNOL[>[O\/'R\_3U]O?X^?K[_/W^_P``
           ENDOFHEREDOC
           if ($uuencode_byte_chrs eq $uu) {
               print "Yes ";
           }
           $uudecode_byte_chrs = unpack('u', $uuencode_byte_chrs);
           if ($uudecode_byte_chrs eq $all_byte_chrs) {
               print "indeed\n";
           }
1102
1103       Here is a very spartan uudecoder that will work on EBCDIC provided that
1104       the @e2a array is filled in appropriately:
1105
           #!/usr/local/bin/perl
           @e2a = ( # this must be filled in
                  );
           # skip ahead to the "begin MODE FILE" header line
           $_ = <> until ($mode,$file) = /^begin\s*(\d*)\s*(\S*)/;
           open(OUT, "> $file") if $file ne "";
           while(<>) {
               last if /^end/;
               # uuencoded data lines never contain lowercase letters
               next if /[a-z]/;
               # the first character encodes the line's byte count;
               # skip lines whose length is inconsistent with it
               next unless int(((($e2a[ord()] - 32 ) & 077) + 2) / 3) ==
                   int(length() / 4);
               print OUT unpack("u", $_);
           }
           close(OUT);
           chmod oct($mode), $file;
1120
1121       Quoted-Printable encoding and decoding
1122
       On ASCII-encoded machines, characters outside the printable set can be
       encoded using:

           # This QP encoder works on ASCII only
           $qp_string =~ s/([=\x00-\x1F\x80-\xFF])/sprintf("=%02X",ord($1))/ge;

       Whereas a QP encoder that works on both ASCII and EBCDIC machines would
       look somewhat like the following (where the contents of the EBCDIC
       branch @e2a array are omitted for brevity):

           if (ord('A') == 65) {    # ASCII
               $delete = "\x7F";    # ASCII
               @e2a = (0 .. 255);   # ASCII to ASCII identity map
           }
           else {                   # EBCDIC
               $delete = "\x07";    # EBCDIC
               @e2a = (             # EBCDIC to ASCII map (fill in as shown above)
                      );
           }
           $qp_string =~
             s/([^ !"\#\$%&'()*+,\-.\/0-9:;<>?\@A-Z[\\\]^_`a-z{|}~$delete])/sprintf("=%02X",$e2a[ord($1)])/ge;

       (although in production code the substitutions might be done in the
       EBCDIC branch with the @e2a array and separately in the ASCII branch
       without the expense of the identity map).

       Such QP strings can be decoded with:

           # This QP decoder is limited to ASCII only
           $string =~ s/=[\n\r]+$//;    # drop a trailing soft line break first
           $string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr hex $1/ge;

       Whereas a QP decoder that works on both ASCII and EBCDIC machines would
       look somewhat like the following (where the @a2e array is omitted for
       brevity):

           $string =~ s/=[\n\r]+$//;    # drop a trailing soft line break first
           $string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr $a2e[hex $1]/ge;
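
       The two directions can be exercised together as a round trip.  This
       is a sketch under stated assumptions: the names qp_encode and
       qp_decode are hypothetical, identity tables stand in for @e2a and
       @a2e so that it runs on an ASCII machine, and the safe characters are
       listed literally so that the set follows the character set of the
       script itself.  Unlike the encoder shown earlier, it also encodes the
       delete character.

```perl
use strict;
use warnings;

# Characters passed through unencoded, written as literals.
my %qp_safe = map { $_ => 1 } split //,
      q{ !"#$%&'()*+,-./0123456789:;<>?@}
    . q{ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`}
    . q{abcdefghijklmnopqrstuvwxyz{|}~};

# Hypothetical helper: encode every character outside the safe set.
sub qp_encode {
    my ($string, $e2a) = @_;
    $string =~ s/(.)/$qp_safe{$1} ? $1 : sprintf("=%02X", $e2a->[ord($1)])/gse;
    return $string;
}

# Hypothetical helper: undo qp_encode.
sub qp_decode {
    my ($string, $a2e) = @_;
    $string =~ s/=[\n\r]+$//;    # trailing soft line break, if any
    $string =~ s/=([0-9A-Fa-f]{2})/chr($a2e->[hex($1)])/ge;
    return $string;
}

my @identity = (0 .. 255);       # stands in for @e2a and @a2e on ASCII
my $encoded  = qp_encode("a=b\n", \@identity);
my $decoded  = qp_decode($encoded, \@identity);
print "round trip ok\n" if $decoded eq "a=b\n";
```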
1160
1161       Caesarian ciphers
1162
       The practice of shifting an alphabet one or more characters for
       encipherment dates back thousands of years and was explicitly detailed
       by Gaius Julius Caesar in his Gallic Wars text.  A single alphabet
       shift is sometimes referred to as a rotation, and the shift amount is
       given as a number $n after the string 'rot', as in "rot$n".  Rot0 and
       rot26 would designate identity maps on the 26 letter English version
       of the Latin alphabet.  Rot13 has the interesting property that
       applying it twice in succession yields the identity map (thus rot13 is
       its own non-trivial inverse in the group of 26 alphabet rotations).
       Hence the following is a rot13 encoder and decoder that will work on
       ASCII and EBCDIC machines:
1174
           #!/usr/local/bin/perl

           while(<>){
               tr/n-za-mN-ZA-M/a-zA-Z/;
               print;
           }

       In one-liner form:

           perl -ne 'tr/n-za-mN-ZA-M/a-zA-Z/;print'
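
       For shift amounts other than 13, a lookup built from the alphabet
       written out literally avoids any arithmetic on code points, which
       would go wrong on EBCDIC because the letters are not contiguous
       there.  The routine name rot_n is hypothetical and the code is a
       sketch rather than a tuned implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: rotate the 26 English letters by $n places,
# leaving all other characters untouched.
sub rot_n {
    my ($text, $n) = @_;
    my $lower = 'abcdefghijklmnopqrstuvwxyz';
    my $upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $n %= 26;    # rot0 and rot26 are both the identity
    my %map;
    @map{ split //, $lower . $upper } =
        split //, substr($lower, $n) . substr($lower, 0, $n)
                . substr($upper, $n) . substr($upper, 0, $n);
    $text =~ s/([A-Za-z])/$map{$1}/g;
    return $text;
}

print rot_n("Hello, World!", 13), "\n";    # prints "Uryyb, Jbeyq!"
```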
1185

Hashing order and checksums

1187       To the extent that it is possible to write code that depends on hashing
1188       order there may be differences between hashes as stored on an ASCII
1189       based machine and hashes stored on an EBCDIC based machine.  XXX
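
       One way to sidestep such differences, offered here only as a sketch,
       is to avoid relying on internal hash order at all and to compute
       orderings and checksums over ASCII code points.  The helper names
       below are hypothetical, and the identity table stands in for @e2a so
       that the example runs on an ASCII machine.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a native string to its ASCII code points.
# $e2a refers to an EBCDIC-to-ASCII table; pass [0 .. 255] on ASCII.
sub to_ascii_bytes {
    my ($string, $e2a) = @_;
    return pack('C*', map { $e2a->[ord($_)] } split //, $string);
}

# Return hash keys in ASCII order instead of internal hash order.
sub keys_in_ascii_order {
    my ($hash, $e2a) = @_;
    return sort { to_ascii_bytes($a, $e2a) cmp to_ascii_bytes($b, $e2a) }
           keys %$hash;
}

# 16-bit additive checksum computed over the ASCII code points.
sub ascii_checksum {
    my ($string, $e2a) = @_;
    return unpack('%16C*', to_ascii_bytes($string, $e2a));
}

my @identity = (0 .. 255);    # stands in for @e2a on an ASCII machine
my %h = (b => 1, A => 2, a => 3);
print join(' ', keys_in_ascii_order(\%h, \@identity)), "\n";
print ascii_checksum('ABC', \@identity), "\n";
```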
1190

I18N AND L10N

       Internationalization (I18N) and localization (L10N) are supported at
       least in principle even on EBCDIC machines.  The details are system
       dependent and discussed under the "OS ISSUES" section below.
1196

MULTI OCTET CHARACTER SETS

1198       Perl may work with an internal UTF-EBCDIC encoding form for wide char‐
1199       acters on EBCDIC platforms in a manner analogous to the way that it
1200       works with the UTF-8 internal encoding form on ASCII based platforms.
1201
1202       Legacy multi byte EBCDIC code pages XXX.
1203

OS ISSUES

1205       There may be a few system dependent issues of concern to EBCDIC Perl
1206       programmers.
1207
1208       OS/400
1209
       PASE    The PASE environment is a runtime environment for OS/400 that
               can run executables built for PowerPC AIX in OS/400, see
               perlos400.  PASE is ASCII-based, not EBCDIC-based like the ILE.
1213
1214       IFS access
1215               XXX.
1216
1217       OS/390, z/OS
1218
       Perl runs under Unix System Services or USS.
1220
       chcp    chcp is supported as a shell utility for displaying and
               changing one's code page.  See also chcp(1).
1223
1224       dataset access
               For sequential data set access try:

                   my @ds_records = `cat //DSNAME`;

               or:

                   my @ds_records = `cat //'HLQ.DSNAME'`;

               See also the OS390::Stdio module on CPAN.
1234
1235       OS/390, z/OS iconv
1236               iconv is supported as both a shell utility and a C RTL routine.
1237               See also the iconv(1) and iconv(3) manual pages.
1238
1239       locales On OS/390 or z/OS see locale for information on locales.  The
1240               L10N files are in /usr/nls/locale.  $Config{d_setlocale} is
1241               'define' on OS/390 or z/OS.
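
               A script can test that setting at run time before relying on
               locale support.  The following is a small sketch; the
               variable names are illustrative only.

```perl
use strict;
use warnings;
use Config;

# True when this perl was built with setlocale() support.
my $has_setlocale = (defined $Config{d_setlocale}
                     && $Config{d_setlocale} eq 'define') ? 1 : 0;

if ($has_setlocale) {
    require POSIX;
    # With a single argument, setlocale() queries the current setting.
    my $ctype = POSIX::setlocale(POSIX::LC_CTYPE());
    print "LC_CTYPE is ", (defined $ctype ? $ctype : "(unknown)"), "\n";
}
else {
    print "setlocale() is not available in this perl\n";
}
```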
1242
1243       VM/ESA?
1244
1245       XXX.
1246
1247       POSIX-BC?
1248
1249       XXX.
1250

BUGS

1252       This pod document contains literal Latin 1 characters and may encounter
1253       translation difficulties.  In particular one popular nroff implementa‐
1254       tion was known to strip accented characters to their unaccented coun‐
1255       terparts while attempting to view this document through the pod2man
       program (for example, you may see a plain "y" rather than one with a
       diaeresis as in ÿ).  Another nroff truncated the resultant manpage at
1258       the first occurrence of 8 bit characters.
1259
1260       Not all shells will allow multiple "-e" string arguments to perl to be
1261       concatenated together properly as recipes 0, 2, 4, 5, and 6 might seem
1262       to imply.
1263

SEE ALSO

1265       perllocale, perlfunc, perlunicode, utf8.
1266

REFERENCES

1268       http://anubis.dkuug.dk/i18n/charmaps
1269
1270       http://www.unicode.org/
1271
1272       http://www.unicode.org/unicode/reports/tr16/
1273
1274       http://www.wps.com/texts/codes/ ASCII: American Standard Code for
1275       Information Infiltration Tom Jennings, September 1999.
1276
1277       The Unicode Standard, Version 3.0 The Unicode Consortium, Lisa Moore
1278       ed., ISBN 0-201-61633-5, Addison Wesley Developers Press, February
1279       2000.
1280
1281       CDRA: IBM - Character Data Representation Architecture - Reference and
1282       Registry, IBM SC09-2190-00, December 1996.
1283
1284       "Demystifying Character Sets", Andrea Vine, Multilingual Computing &
1285       Technology, #26 Vol. 10 Issue 4, August/September 1999; ISSN 1523-0309;
1286       Multilingual Computing Inc. Sandpoint ID, USA.
1287
1288       Codes, Ciphers, and Other Cryptic and Clandestine Communication Fred B.
1289       Wrixon, ISBN 1-57912-040-7, Black Dog & Leventhal Publishers, 1998.
1290
1291       http://www.bobbemer.com/P-BIT.HTM IBM - EBCDIC and the P-bit; The big‐
1292       gest Computer Goof Ever Robert Bemer.
1293

HISTORY

1295       15 April 2001: added UTF-8 and UTF-EBCDIC to main table, pvhp.
1296

AUTHOR

1298       Peter Prymmer pvhp@best.com wrote this in 1999 and 2000 with CCSID 0819
1299       and 0037 help from Chris Leach and Andre Pirard A.Pirard@ulg.ac.be as
1300       well as POSIX-BC help from Thomas Dorner Thomas.Dorner@start.de.
1301       Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and Joe
1302       Smith.  Trademarks, registered trademarks, service marks and registered
1303       service marks used in this document are the property of their respec‐
1304       tive owners.
1305
1306
1307
1308perl v5.8.8                       2006-01-07                     PERLEBCDIC(1)