PERLEBCDIC(1)          Perl Programmers Reference Guide          PERLEBCDIC(1)

NAME

       perlebcdic - Considerations for running Perl on EBCDIC platforms

DESCRIPTION

       An exploration of some of the issues facing Perl programmers on EBCDIC
       based computers.  We do not cover localization, internationalization,
       or multi-byte character set issues other than some discussion of UTF-8
       and UTF-EBCDIC.

       Portions that are still incomplete are marked with XXX.

       Perl used to work on EBCDIC machines, but there are now areas of the
       code where it doesn't.  If you want to use Perl on an EBCDIC machine,
       please let us know by sending mail to perlbug@perl.org.

COMMON CHARACTER CODE SETS

   ASCII
       The American Standard Code for Information Interchange (ASCII or US-
       ASCII) is a set of integers running from 0 to 127 (decimal) that imply
       character interpretation by the display and other systems of computers.
       The range 0..127 can be covered by setting the bits in a 7-bit binary
       number, hence the set is sometimes referred to as "7-bit ASCII".  ASCII
       was described by the American National Standards Institute document
       ANSI X3.4-1986.  It was also described by ISO 646:1991 (with
       localization for currency symbols).  The full ASCII set is given in the
       table below as the first 128 elements.  Languages that can be written
       adequately with the characters in ASCII include English, Hawaiian,
       Indonesian, Swahili and some Native American languages.

       There are many character sets that extend the range of integers from
       0..2**7-1 up to 2**8-1, or 8 bit bytes (octets if you prefer).  One
       common one is the ISO 8859-1 character set.

   ISO 8859
       The ISO 8859-$n are a collection of character code sets from the
       International Organization for Standardization (ISO), each of which
       adds characters to the ASCII set that are typically found in European
       languages, many of which are based on the Roman, or Latin, alphabet.

   Latin 1 (ISO 8859-1)
       A particular 8-bit extension to ASCII that includes grave and acute
       accented Latin characters.  Languages that can employ ISO 8859-1
       include all the languages covered by ASCII as well as Afrikaans,
       Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian,
       Portuguese, Spanish, and Swedish.  Dutch is covered albeit without the
       ij ligature.  French is covered too but without the oe ligature.
       German can use ISO 8859-1 but must do so without German-style quotation
       marks.  This set is based on Western European extensions to ASCII and
       is commonly encountered in world wide web work.  In IBM character code
       set identification terminology ISO 8859-1 is also known as CCSID 819
       (or sometimes 0819 or even 00819).

   EBCDIC
       The Extended Binary Coded Decimal Interchange Code refers to a large
       collection of single- and multi-byte coded character sets that are
       different from ASCII or ISO 8859-1 and are all slightly different from
       each other; they typically run on host computers.  The EBCDIC encodings
       derive from 8-bit byte extensions of Hollerith punched card encodings.
       The layout on the cards was such that high bits were set for the upper
       and lower case alphabet characters [a-z] and [A-Z], but there were gaps
       within each Latin alphabet range.
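
       The gaps within the alphabet ranges can be seen from any platform by
       asking Encode for the CCSID 0037 byte of neighbouring letters.  The
       following is only a sketch: cp37_ord() is a hypothetical helper
       written for this illustration, and it assumes that Encode's "cp37"
       encoding is available in your perl.

```perl
use strict;
use warnings;
use Encode qw(encode);

# cp37_ord() is a hypothetical helper, not part of Perl or Encode.
# It returns the CCSID 0037 byte value of a single character.
sub cp37_ord {
    my ($char) = @_;
    return ord(encode('cp37', $char));
}

# Collect the distance between letters that are adjacent in the alphabet.
my @gaps;
for my $pair (['i', 'j'], ['r', 's'], ['I', 'J'], ['R', 'S']) {
    my ($first, $second) = @$pair;
    my $step = cp37_ord($second) - cp37_ord($first);
    push @gaps, sprintf('%s..%s: %d', $first, $second, $step);
}
print join(', ', @gaps), "\n";
```

       A step larger than 1 marks a gap, which is why code that assumes the
       letters are contiguous is not portable to EBCDIC.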

       Some IBM EBCDIC character sets may be known by character code set
       identification numbers (CCSID numbers) or code page numbers.

       Perl can be compiled on platforms that run any of three commonly used
       EBCDIC character sets, listed below.

   The 13 variant characters
       Among IBM EBCDIC character code sets there are 13 characters that are
       often mapped to different integer values.  Those characters are known
       as the 13 "variant" characters and are:

           \ [ ] { } ^ ~ ! # | $ @ `

       When Perl is compiled for a platform, it looks at some of these
       characters to guess which EBCDIC character set the platform uses, and
       adapts itself accordingly to that platform.  If the platform uses a
       character set that is not one of the three Perl knows about, Perl will
       either fail to compile, or mistakenly and silently choose one of the
       three.  They are:

   0037
       Character code set ID 0037 is a mapping of the ASCII plus Latin-1
       characters (i.e. ISO 8859-1) to an EBCDIC set.  0037 is used in North
       American English locales on the OS/400 operating system that runs on
       AS/400 computers.  CCSID 0037 differs from ISO 8859-1 in 237 places, in
       other words they agree on only 19 code point values.

   1047
       Character code set ID 1047 is also a mapping of the ASCII plus Latin-1
       characters (i.e. ISO 8859-1) to an EBCDIC set.  1047 is used under Unix
       System Services for OS/390 or z/OS, and OpenEdition for VM/ESA.  CCSID
       1047 differs from CCSID 0037 in eight places.

   POSIX-BC
       The EBCDIC code page in use on Siemens' BS2000 system is distinct from
       1047 and 0037.  It is identified below as the POSIX-BC set.
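
       The same idea of guessing the code set from a variant character can
       be sketched at run time.  The example below is not how the Perl build
       does its detection; guess_code_set() is a hypothetical helper, and the
       values it looks up are the ord('^') entries from the single octet
       table in this document.

```perl
use strict;
use warnings;

# Values of ord('^') per code set, taken from the single octet table.
my %code_set_for_caret = (
     94 => 'ASCII or ISO 8859-1',
    176 => 'CCSID 0037',
     95 => 'CCSID 1047',
    106 => 'POSIX-BC',
);

# guess_code_set() is a hypothetical helper, not part of Perl.
# It maps the native ordinal of '^' to a code set name.
sub guess_code_set {
    my ($caret_ord) = @_;
    return $code_set_for_caret{$caret_ord} // 'unknown';
}

print "This perl appears to use: ", guess_code_set(ord('^')), "\n";
```

       A single character is enough to tell these four sets apart, but a
       platform using some other code set could still be misidentified.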

   Unicode code points versus EBCDIC code points
       In Unicode terminology a code point is the number assigned to a
       character: for example, in EBCDIC the character "A" is usually assigned
       the number 193.  In Unicode the character "A" is assigned the number
       65.  This causes a problem with the semantics of the pack/unpack "U"
       template, which is supposed to pack Unicode code points to characters
       and back to numbers.  The problem is: which code points to use for code
       points less than 256?  (For 256 and over there's no problem: Unicode
       code points are used.)  In EBCDIC, for the low 256 the EBCDIC code
       points are used.  This means that the equivalences

           pack("U", ord($character)) eq $character
           unpack("U", $character) == ord $character

       will hold.  (If Unicode code points were applied consistently over all
       the possible code points, pack("U",ord("A")) would in EBCDIC equal A
       with acute or chr(101), and unpack("U", "A") would equal 65, or non-
       breaking space, not 193, or ord "A".)
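
       The two equivalences can be checked with a short loop.  This is only
       a sketch: it has been reasoned through for ASCII platforms, where the
       result is trivially true, and the choice of test characters is
       arbitrary.  Whether it holds on a given EBCDIC perl depends on that
       perl implementing "U" as described above.

```perl
use strict;
use warnings;

# Count how many of the two equivalences fail for a few sample characters.
my $failures = 0;
for my $character ('A', 'z', '0', '^', ' ') {
    my $native = ord($character);
    $failures++ unless pack("U", $native) eq $character;
    $failures++ unless unpack("U", $character) == $native;
}
print $failures ? "equivalences broken\n" : "equivalences hold\n";
```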

   Remaining Perl Unicode problems in EBCDIC
       ·   Many of the remaining problems seem to be related to case-
           insensitive matching.

       ·   The extensions Unicode::Collate and Unicode::Normalize are not
           supported under EBCDIC, likewise for the encoding pragma.

   Unicode and UTF
       UTF stands for "Unicode Transformation Format".  UTF-8 is an encoding
       of Unicode into a sequence of 8-bit byte chunks, based on ASCII and
       Latin-1.  The length of a sequence required to represent a Unicode code
       point depends on the ordinal number of that code point, with larger
       numbers requiring more bytes.  UTF-EBCDIC is like UTF-8, but based on
       EBCDIC.

       You may see the term "invariant" character or code point.  This simply
       means that the character has the same numeric value when encoded as
       when not.  (Note that this is a very different concept from "The 13
       variant characters" mentioned above.)  For example, the ordinal value
       of 'A' is 193 in most EBCDIC code pages, and also is 193 when encoded
       in UTF-EBCDIC.  All variant code points occupy at least two bytes when
       encoded.  In UTF-8, the code points corresponding to the lowest 128
       ordinal numbers (0 - 127: the ASCII characters) are invariant.  In UTF-
       EBCDIC, there are 160 invariant characters.  (If you care, the EBCDIC
       invariants are those characters which have ASCII equivalents, plus
       those that correspond to the C1 controls (80..9f on ASCII platforms).)

       A string encoded in UTF-EBCDIC may be longer (but never shorter) than
       one encoded in UTF-8.
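
       The relation between a code point's ordinal and its encoded length
       can be observed for UTF-8 with Encode.  The sketch below assumes an
       ASCII platform; utf8_length() is a hypothetical helper and the sample
       code points are arbitrary.  The invariant code points (below 128) take
       one byte, and everything else takes two or more.

```perl
use strict;
use warnings;
use Encode qw(encode);

# utf8_length() is a hypothetical helper, not part of Perl or Encode.
# It returns the number of bytes needed to store one code point in UTF-8.
sub utf8_length {
    my ($code_point) = @_;
    return length(encode('UTF-8', chr($code_point)));
}

for my $code_point (0x41, 0x7F, 0x80, 0xE9, 0x20AC) {
    printf "U+%04X needs %d byte(s)\n", $code_point, utf8_length($code_point);
}
```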

   Using Encode
       Starting from Perl 5.8 you can use the standard Encode module to
       translate from EBCDIC to Latin-1 code points.  Encode knows about more
       EBCDIC character sets than Perl can currently be compiled to run on.

          use Encode 'from_to';

          my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );

          # $a is in EBCDIC code points
          from_to($a, $ebcdic{ord '^'}, 'latin1');
          # $a is ISO 8859-1 code points

       and from Latin-1 code points to EBCDIC code points

          use Encode 'from_to';

          my %ebcdic = ( 176 => 'cp37', 95 => 'cp1047', 106 => 'posix-bc' );

          # $a is ISO 8859-1 code points
          from_to($a, 'latin1', $ebcdic{ord '^'});
          # $a is in EBCDIC code points
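
       Because Encode carries its own tables, the EBCDIC encodings can also
       be tried out on a non-EBCDIC machine.  The following sketch assumes
       that the "cp37", "cp1047" and "posix-bc" encodings named above are
       installed with your Encode; the sample text is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = 'Hello';

# Encode the characters into CCSID 0037 bytes, then list the byte values.
my $ebcdic_bytes = encode('cp37', $text);
print join(' ', unpack('C*', $ebcdic_bytes)), "\n";

# Decoding the bytes recovers the original characters.
my $round_trip = decode('cp37', $ebcdic_bytes);
print $round_trip eq $text ? "round trip ok\n" : "round trip failed\n";

# The caret is a variant character: its byte differs per code set.
for my $name ('cp37', 'cp1047', 'posix-bc') {
    printf "%-8s ^ => %d\n", $name, ord(encode($name, '^'));
}
```

       The byte values printed should agree with the corresponding columns
       of the single octet table in this document.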

       For doing I/O it is suggested that you use the autotranslating features
       of PerlIO, see perluniintro.

       Since version 5.8 Perl uses the new PerlIO I/O library.  This enables
       you to use different encodings per IO channel.  For example you may use

           use Encode;
           open(my $f, ">:encoding(ascii)", "test.ascii") or die $!;
           print $f "Hello World!\n";
           open($f, ">:encoding(cp37)", "test.ebcdic") or die $!;
           print $f "Hello World!\n";
           open($f, ">:encoding(latin1)", "test.latin1") or die $!;
           print $f "Hello World!\n";
           open($f, ">:encoding(utf8)", "test.utf8") or die $!;
           print $f "Hello World!\n";

       to get four files containing "Hello World!\n" in ASCII, CP 0037 EBCDIC,
       ISO 8859-1 (Latin-1) (in this example identical to ASCII since only
       ASCII characters were printed), and UTF-EBCDIC (in this example
       identical to normal EBCDIC since only characters that don't differ
       between EBCDIC and UTF-EBCDIC were printed).  See the documentation of
       Encode::PerlIO for details.

       As the PerlIO layer uses raw IO (bytes) internally, all this totally
       ignores things like the type of your filesystem (ASCII or EBCDIC).
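
       That the layer deals in plain bytes can be seen by writing to an in-
       memory buffer rather than to a file and then inspecting the buffer.
       This is a sketch only; it assumes a perl with in-memory filehandles
       and the "cp37" encoding, and the variable names are illustrative.

```perl
use strict;
use warnings;

my $message = "Hello World!\n";

# Write through the cp37 layer into an in-memory buffer instead of a file.
my $buffer = '';
open(my $out, '>:encoding(cp37)', \$buffer) or die "Cannot open buffer: $!";
print {$out} $message;
close($out) or die "Cannot close buffer: $!";

# $buffer now holds raw CCSID 0037 bytes; read them back through the layer.
open(my $in, '<:encoding(cp37)', \$buffer) or die "Cannot reopen buffer: $!";
my $read_back = do { local $/; <$in> };
close($in);

printf "%d bytes written, first byte %d\n", length($buffer), ord($buffer);
print $read_back eq $message ? "read back ok\n" : "read back differs\n";
```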

SINGLE OCTET TABLES

       The following tables list the ASCII and Latin 1 ordered sets including
       the subsets: C0 controls (00..1f), ASCII graphics (20..7e), delete
       (7f), C1 controls (80..9f), and Latin-1 (a.k.a. ISO 8859-1) (a0..ff),
       with these ranges given in hexadecimal.  In the
       table non-printing control character names as well as the Latin 1
       extensions to ASCII have been labelled with character names roughly
       corresponding to The Unicode Standard, Version 3.0 albeit with
       substitutions such as s/LATIN// and s/VULGAR// in all cases, s/CAPITAL
       LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/ in some other
       cases.  The "names" of the controls listed here are the Unicode Version
       1 names, except for the few that don't have names, in which case the
       names in the Wikipedia article were used
       (<http://en.wikipedia.org/wiki/C0_and_C1_control_codes>).  The
       differences between the 0037 and 1047 sets are flagged with ***.  The
       differences between the 1047 and POSIX-BC sets are flagged with ###.
       All ord() numbers listed are decimal.  If you would rather see this
       table listing octal values then run the table (that is, the pod version
       of this document since this recipe may not work with a
       pod2_other_format translation) through:

       recipe 0

           perl -ne 'if(/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
            -e '{printf("%s%-9.03o%-9.03o%-9.03o%.03o\n",$1,$2,$3,$4,$5)}' \
            perlebcdic.pod

       If you want to retain the UTF-x code points then in script form you
       might want to write:

       recipe 1

        open(my $fh, '<', 'perlebcdic.pod')
            or die "Could not open perlebcdic.pod: $!";
        while (<$fh>) {
            if (/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/)
            {
                if ($7 ne '' && $9 ne '') {
                    printf(
                       "%s%-9.03o%-9.03o%-9.03o%-9.03o%-3o.%-5o%-3o.%.03o\n",
                                                   $1,$2,$3,$4,$5,$6,$7,$8,$9);
                }
                elsif ($7 ne '') {
                    printf("%s%-9.03o%-9.03o%-9.03o%-9.03o%-3o.%-5o%.03o\n",
                                                  $1,$2,$3,$4,$5,$6,$7,$8);
                }
                else {
                    printf("%s%-9.03o%-9.03o%-9.03o%-9.03o%-9.03o%.03o\n",
                                                       $1,$2,$3,$4,$5,$6,$8);
                }
            }
        }

       If you would rather see this table listing hexadecimal values then run
       the table through:

       recipe 2

           perl -ne 'if(/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
            -e '{printf("%s%-9.02X%-9.02X%-9.02X%.02X\n",$1,$2,$3,$4,$5)}' \
            perlebcdic.pod

       Or, in order to retain the UTF-x code points in hexadecimal:

       recipe 3

        open(my $fh, '<', 'perlebcdic.pod')
            or die "Could not open perlebcdic.pod: $!";
        while (<$fh>) {
            if (/(.{43})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\.?(\d*)\s+(\d+)\.?(\d*)/)
            {
                if ($7 ne '' && $9 ne '') {
                    printf(
                       "%s%-9.02X%-9.02X%-9.02X%-9.02X%-2X.%-6.02X%02X.%02X\n",
                                                  $1,$2,$3,$4,$5,$6,$7,$8,$9);
                }
                elsif ($7 ne '') {
                    printf("%s%-9.02X%-9.02X%-9.02X%-9.02X%-2X.%-6.02X%02X\n",
                                                     $1,$2,$3,$4,$5,$6,$7,$8);
                }
                else {
                    printf("%s%-9.02X%-9.02X%-9.02X%-9.02X%-9.02X%02X\n",
                                                         $1,$2,$3,$4,$5,$6,$8);
                }
            }
        }

                                             ISO 8859-1  CCSID    CCSID                    CCSID 1047
        chr                                  CCSID 0819  0037     1047    POSIX-BC  UTF-8  UTF-EBCDIC
        ----------------------------------------------------------------------------------------------
        <NULL>                                    0        0        0        0        0        0
        <START OF HEADING>                        1        1        1        1        1        1
        <START OF TEXT>                           2        2        2        2        2        2
        <END OF TEXT>                             3        3        3        3        3        3
        <END OF TRANSMISSION>                     4        55       55       55       4        55
        <ENQUIRY>                                 5        45       45       45       5        45
        <ACKNOWLEDGE>                             6        46       46       46       6        46
        <BELL>                                    7        47       47       47       7        47
        <BACKSPACE>                               8        22       22       22       8        22
        <HORIZONTAL TABULATION>                   9        5        5        5        9        5
        <LINE FEED>                               10       37       21       21       10       21       ***
        <VERTICAL TABULATION>                     11       11       11       11       11       11
        <FORM FEED>                               12       12       12       12       12       12
        <CARRIAGE RETURN>                         13       13       13       13       13       13
        <SHIFT OUT>                               14       14       14       14       14       14
        <SHIFT IN>                                15       15       15       15       15       15
        <DATA LINK ESCAPE>                        16       16       16       16       16       16
        <DEVICE CONTROL ONE>                      17       17       17       17       17       17
        <DEVICE CONTROL TWO>                      18       18       18       18       18       18
        <DEVICE CONTROL THREE>                    19       19       19       19       19       19
        <DEVICE CONTROL FOUR>                     20       60       60       60       20       60
        <NEGATIVE ACKNOWLEDGE>                    21       61       61       61       21       61
        <SYNCHRONOUS IDLE>                        22       50       50       50       22       50
        <END OF TRANSMISSION BLOCK>               23       38       38       38       23       38
        <CANCEL>                                  24       24       24       24       24       24
        <END OF MEDIUM>                           25       25       25       25       25       25
        <SUBSTITUTE>                              26       63       63       63       26       63
        <ESCAPE>                                  27       39       39       39       27       39
        <FILE SEPARATOR>                          28       28       28       28       28       28
        <GROUP SEPARATOR>                         29       29       29       29       29       29
        <RECORD SEPARATOR>                        30       30       30       30       30       30
        <UNIT SEPARATOR>                          31       31       31       31       31       31
        <SPACE>                                   32       64       64       64       32       64
        !                                         33       90       90       90       33       90
        "                                         34       127      127      127      34       127
        #                                         35       123      123      123      35       123
        $                                         36       91       91       91       36       91
        %                                         37       108      108      108      37       108
        &                                         38       80       80       80       38       80
        '                                         39       125      125      125      39       125
        (                                         40       77       77       77       40       77
        )                                         41       93       93       93       41       93
        *                                         42       92       92       92       42       92
        +                                         43       78       78       78       43       78
        ,                                         44       107      107      107      44       107
        -                                         45       96       96       96       45       96
        .                                         46       75       75       75       46       75
        /                                         47       97       97       97       47       97
        0                                         48       240      240      240      48       240
        1                                         49       241      241      241      49       241
        2                                         50       242      242      242      50       242
        3                                         51       243      243      243      51       243
        4                                         52       244      244      244      52       244
        5                                         53       245      245      245      53       245
        6                                         54       246      246      246      54       246
        7                                         55       247      247      247      55       247
        8                                         56       248      248      248      56       248
        9                                         57       249      249      249      57       249
        :                                         58       122      122      122      58       122
        ;                                         59       94       94       94       59       94
        <                                         60       76       76       76       60       76
        =                                         61       126      126      126      61       126
        >                                         62       110      110      110      62       110
        ?                                         63       111      111      111      63       111
        @                                         64       124      124      124      64       124
        A                                         65       193      193      193      65       193
        B                                         66       194      194      194      66       194
        C                                         67       195      195      195      67       195
        D                                         68       196      196      196      68       196
        E                                         69       197      197      197      69       197
        F                                         70       198      198      198      70       198
        G                                         71       199      199      199      71       199
        H                                         72       200      200      200      72       200
        I                                         73       201      201      201      73       201
        J                                         74       209      209      209      74       209
        K                                         75       210      210      210      75       210
        L                                         76       211      211      211      76       211
        M                                         77       212      212      212      77       212
        N                                         78       213      213      213      78       213
        O                                         79       214      214      214      79       214
        P                                         80       215      215      215      80       215
        Q                                         81       216      216      216      81       216
        R                                         82       217      217      217      82       217
        S                                         83       226      226      226      83       226
        T                                         84       227      227      227      84       227
        U                                         85       228      228      228      85       228
        V                                         86       229      229      229      86       229
        W                                         87       230      230      230      87       230
        X                                         88       231      231      231      88       231
        Y                                         89       232      232      232      89       232
        Z                                         90       233      233      233      90       233
        [                                         91       186      173      187      91       173      *** ###
        \                                         92       224      224      188      92       224      ###
        ]                                         93       187      189      189      93       189      ***
        ^                                         94       176      95       106      94       95       *** ###
        _                                         95       109      109      109      95       109
        `                                         96       121      121      74       96       121      ###
        a                                         97       129      129      129      97       129
        b                                         98       130      130      130      98       130
        c                                         99       131      131      131      99       131
        d                                         100      132      132      132      100      132
        e                                         101      133      133      133      101      133
        f                                         102      134      134      134      102      134
        g                                         103      135      135      135      103      135
        h                                         104      136      136      136      104      136
        i                                         105      137      137      137      105      137
        j                                         106      145      145      145      106      145
        k                                         107      146      146      146      107      146
        l                                         108      147      147      147      108      147
        m                                         109      148      148      148      109      148
        n                                         110      149      149      149      110      149
        o                                         111      150      150      150      111      150
        p                                         112      151      151      151      112      151
        q                                         113      152      152      152      113      152
        r                                         114      153      153      153      114      153
        s                                         115      162      162      162      115      162
        t                                         116      163      163      163      116      163
        u                                         117      164      164      164      117      164
        v                                         118      165      165      165      118      165
        w                                         119      166      166      166      119      166
        x                                         120      167      167      167      120      167
        y                                         121      168      168      168      121      168
        z                                         122      169      169      169      122      169
        {                                         123      192      192      251      123      192      ###
        |                                         124      79       79       79       124      79
        }                                         125      208      208      253      125      208      ###
        ~                                         126      161      161      255      126      161      ###
        <DELETE>                                  127      7        7        7        127      7
        <PADDING CHARACTER>                       128      32       32       32       194.128  32
        <HIGH OCTET PRESET>                       129      33       33       33       194.129  33
        <BREAK PERMITTED HERE>                    130      34       34       34       194.130  34
        <NO BREAK HERE>                           131      35       35       35       194.131  35
        <INDEX>                                   132      36       36       36       194.132  36
        <NEXT LINE>                               133      21       37       37       194.133  37       ***
        <START OF SELECTED AREA>                  134      6        6        6        194.134  6
        <END OF SELECTED AREA>                    135      23       23       23       194.135  23
        <CHARACTER TABULATION SET>                136      40       40       40       194.136  40
        <CHARACTER TABULATION WITH JUSTIFICATION> 137      41       41       41       194.137  41
        <LINE TABULATION SET>                     138      42       42       42       194.138  42
        <PARTIAL LINE FORWARD>                    139      43       43       43       194.139  43
        <PARTIAL LINE BACKWARD>                   140      44       44       44       194.140  44
        <REVERSE LINE FEED>                       141      9        9        9        194.141  9
        <SINGLE SHIFT TWO>                        142      10       10       10       194.142  10
        <SINGLE SHIFT THREE>                      143      27       27       27       194.143  27
        <DEVICE CONTROL STRING>                   144      48       48       48       194.144  48
        <PRIVATE USE ONE>                         145      49       49       49       194.145  49
        <PRIVATE USE TWO>                         146      26       26       26       194.146  26
        <SET TRANSMIT STATE>                      147      51       51       51       194.147  51
        <CANCEL CHARACTER>                        148      52       52       52       194.148  52
        <MESSAGE WAITING>                         149      53       53       53       194.149  53
        <START OF GUARDED AREA>                   150      54       54       54       194.150  54
        <END OF GUARDED AREA>                     151      8        8        8        194.151  8
        <START OF STRING>                         152      56       56       56       194.152  56
        <SINGLE GRAPHIC CHARACTER INTRODUCER>     153      57       57       57       194.153  57
        <SINGLE CHARACTER INTRODUCER>             154      58       58       58       194.154  58
        <CONTROL SEQUENCE INTRODUCER>             155      59       59       59       194.155  59
        <STRING TERMINATOR>                       156      4        4        4        194.156  4
        <OPERATING SYSTEM COMMAND>                157      20       20       20       194.157  20
        <PRIVACY MESSAGE>                         158      62       62       62       194.158  62
        <APPLICATION PROGRAM COMMAND>             159      255      255      95       194.159  255      ###
        <NON-BREAKING SPACE>                      160      65       65       65       194.160  128.65
        <INVERTED EXCLAMATION MARK>               161      170      170      170      194.161  128.66
        <CENT SIGN>                               162      74       74       176      194.162  128.67   ###
        <POUND SIGN>                              163      177      177      177      194.163  128.68
        <CURRENCY SIGN>                           164      159      159      159      194.164  128.69
        <YEN SIGN>                                165      178      178      178      194.165  128.70
        <BROKEN BAR>                              166      106      106      208      194.166  128.71   ###
        <SECTION SIGN>                            167      181      181      181      194.167  128.72
        <DIAERESIS>                               168      189      187      121      194.168  128.73   *** ###
        <COPYRIGHT SIGN>                          169      180      180      180      194.169  128.74
        <FEMININE ORDINAL INDICATOR>              170      154      154      154      194.170  128.81
        <LEFT POINTING GUILLEMET>                 171      138      138      138      194.171  128.82
        <NOT SIGN>                                172      95       176      186      194.172  128.83   *** ###
        <SOFT HYPHEN>                             173      202      202      202      194.173  128.84
        <REGISTERED TRADE MARK SIGN>              174      175      175      175      194.174  128.85
        <MACRON>                                  175      188      188      161      194.175  128.86   ###
        <DEGREE SIGN>                             176      144      144      144      194.176  128.87
        <PLUS-OR-MINUS SIGN>                      177      143      143      143      194.177  128.88
        <SUPERSCRIPT TWO>                         178      234      234      234      194.178  128.89
        <SUPERSCRIPT THREE>                       179      250      250      250      194.179  128.98
        <ACUTE ACCENT>                            180      190      190      190      194.180  128.99
        <MICRO SIGN>                              181      160      160      160      194.181  128.100
        <PARAGRAPH SIGN>                          182      182      182      182      194.182  128.101
        <MIDDLE DOT>                              183      179      179      179      194.183  128.102
        <CEDILLA>                                 184      157      157      157      194.184  128.103
        <SUPERSCRIPT ONE>                         185      218      218      218      194.185  128.104
        <MASC. ORDINAL INDICATOR>                 186      155      155      155      194.186  128.105
477        <RIGHT POINTING GUILLEMET>                187      139      139      139      194.187  128.106
478        <FRACTION ONE QUARTER>                    188      183      183      183      194.188  128.112
479        <FRACTION ONE HALF>                       189      184      184      184      194.189  128.113
480        <FRACTION THREE QUARTERS>                 190      185      185      185      194.190  128.114
481        <INVERTED QUESTION MARK>                  191      171      171      171      194.191  128.115
482        <A WITH GRAVE>                            192      100      100      100      195.128  138.65
483        <A WITH ACUTE>                            193      101      101      101      195.129  138.66
484        <A WITH CIRCUMFLEX>                       194      98       98       98       195.130  138.67
485        <A WITH TILDE>                            195      102      102      102      195.131  138.68
486        <A WITH DIAERESIS>                        196      99       99       99       195.132  138.69
487        <A WITH RING ABOVE>                       197      103      103      103      195.133  138.70
488        <CAPITAL LIGATURE AE>                     198      158      158      158      195.134  138.71
489        <C WITH CEDILLA>                          199      104      104      104      195.135  138.72
490        <E WITH GRAVE>                            200      116      116      116      195.136  138.73
491        <E WITH ACUTE>                            201      113      113      113      195.137  138.74
492        <E WITH CIRCUMFLEX>                       202      114      114      114      195.138  138.81
493        <E WITH DIAERESIS>                        203      115      115      115      195.139  138.82
494        <I WITH GRAVE>                            204      120      120      120      195.140  138.83
495        <I WITH ACUTE>                            205      117      117      117      195.141  138.84
496        <I WITH CIRCUMFLEX>                       206      118      118      118      195.142  138.85
497        <I WITH DIAERESIS>                        207      119      119      119      195.143  138.86
498        <CAPITAL LETTER ETH>                      208      172      172      172      195.144  138.87
499        <N WITH TILDE>                            209      105      105      105      195.145  138.88
500        <O WITH GRAVE>                            210      237      237      237      195.146  138.89
501        <O WITH ACUTE>                            211      238      238      238      195.147  138.98
502        <O WITH CIRCUMFLEX>                       212      235      235      235      195.148  138.99
503        <O WITH TILDE>                            213      239      239      239      195.149  138.100
504        <O WITH DIAERESIS>                        214      236      236      236      195.150  138.101
505        <MULTIPLICATION SIGN>                     215      191      191      191      195.151  138.102
506        <O WITH STROKE>                           216      128      128      128      195.152  138.103
507        <U WITH GRAVE>                            217      253      253      224      195.153  138.104  ###
508        <U WITH ACUTE>                            218      254      254      254      195.154  138.105
509        <U WITH CIRCUMFLEX>                       219      251      251      221      195.155  138.106  ###
510        <U WITH DIAERESIS>                        220      252      252      252      195.156  138.112
511        <Y WITH ACUTE>                            221      173      186      173      195.157  138.113  *** ###
512        <CAPITAL LETTER THORN>                    222      174      174      174      195.158  138.114
513        <SMALL LETTER SHARP S>                    223      89       89       89       195.159  138.115
514        <a WITH GRAVE>                            224      68       68       68       195.160  139.65
515        <a WITH ACUTE>                            225      69       69       69       195.161  139.66
516        <a WITH CIRCUMFLEX>                       226      66       66       66       195.162  139.67
517        <a WITH TILDE>                            227      70       70       70       195.163  139.68
518        <a WITH DIAERESIS>                        228      67       67       67       195.164  139.69
519        <a WITH RING ABOVE>                       229      71       71       71       195.165  139.70
520        <SMALL LIGATURE ae>                       230      156      156      156      195.166  139.71
521        <c WITH CEDILLA>                          231      72       72       72       195.167  139.72
522        <e WITH GRAVE>                            232      84       84       84       195.168  139.73
523        <e WITH ACUTE>                            233      81       81       81       195.169  139.74
524        <e WITH CIRCUMFLEX>                       234      82       82       82       195.170  139.81
525        <e WITH DIAERESIS>                        235      83       83       83       195.171  139.82
526        <i WITH GRAVE>                            236      88       88       88       195.172  139.83
527        <i WITH ACUTE>                            237      85       85       85       195.173  139.84
528        <i WITH CIRCUMFLEX>                       238      86       86       86       195.174  139.85
529        <i WITH DIAERESIS>                        239      87       87       87       195.175  139.86
530        <SMALL LETTER eth>                        240      140      140      140      195.176  139.87
531        <n WITH TILDE>                            241      73       73       73       195.177  139.88
532        <o WITH GRAVE>                            242      205      205      205      195.178  139.89
533        <o WITH ACUTE>                            243      206      206      206      195.179  139.98
534        <o WITH CIRCUMFLEX>                       244      203      203      203      195.180  139.99
535        <o WITH TILDE>                            245      207      207      207      195.181  139.100
536        <o WITH DIAERESIS>                        246      204      204      204      195.182  139.101
537        <DIVISION SIGN>                           247      225      225      225      195.183  139.102
538        <o WITH STROKE>                           248      112      112      112      195.184  139.103
539        <u WITH GRAVE>                            249      221      221      192      195.185  139.104  ###
540        <u WITH ACUTE>                            250      222      222      222      195.186  139.105
541        <u WITH CIRCUMFLEX>                       251      219      219      219      195.187  139.106
542        <u WITH DIAERESIS>                        252      220      220      220      195.188  139.112
543        <y WITH ACUTE>                            253      141      141      141      195.189  139.113
544        <SMALL LETTER thorn>                      254      142      142      142      195.190  139.114
545        <y WITH DIAERESIS>                        255      223      223      223      195.191  139.115
546
       To see the above table in CCSID 0037 order instead of ASCII + Latin-1
       order, run the table through:
549
550       recipe 4
551
        perl \
           -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)' \
           -e '{push(@l,$_)}' \
           -e 'END{print map{$_->[0]}' \
           -e '          sort{$a->[1] <=> $b->[1]}' \
           -e '          map{[$_,substr($_,52,3)]}@l;}' perlebcdic.pod
558
559       If you would rather see it in CCSID 1047 order then change the number
560       52 in the last line to 61, like this:
561
562       recipe 5
563
        perl \
           -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)' \
           -e '{push(@l,$_)}' \
           -e 'END{print map{$_->[0]}' \
           -e '          sort{$a->[1] <=> $b->[1]}' \
           -e '          map{[$_,substr($_,61,3)]}@l;}' perlebcdic.pod
570
571       If you would rather see it in POSIX-BC order then change the number 61
572       in the last line to 70, like this:
573
574       recipe 6
575
        perl \
           -ne 'if(/.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)' \
           -e '{push(@l,$_)}' \
           -e 'END{print map{$_->[0]}' \
           -e '          sort{$a->[1] <=> $b->[1]}' \
           -e '          map{[$_,substr($_,70,3)]}@l;}' perlebcdic.pod
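
       Recipes 4 through 6 differ only in the offset handed to substr().  As
       a rough sketch, the shared logic could be written once with the offset
       as a parameter.  The sub name sort_table_rows and its argument order
       are illustrative only and are not part of perl or of the recipes above:

```perl
use strict;
use warnings;

# Hypothetical helper: return the table rows found in @lines, ordered by
# the code number at character offset $offset (52 for CCSID 0037, 61 for
# CCSID 1047, 70 for POSIX-BC, as in recipes 4, 5 and 6).
sub sort_table_rows {
    my ($offset, @lines) = @_;

    # Same row filter as the recipes: a 43 character name field followed
    # by four numeric columns.
    my @rows = grep {
        /.{43}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/
    } @lines;

    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] }
           map  {
               # Keep only the digits so the comparison is purely numeric.
               my ($n) = substr($_, $offset, 3) =~ /(\d+)/;
               [ $_, defined $n ? $n : 0 ]
           } @rows;
}
```

       With a filehandle $fh opened on perlebcdic.pod, something like
       "print sort_table_rows(61, <$fh>)" should give the same ordering as
       recipe 5.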
582

IDENTIFYING CHARACTER CODE SETS

       To determine from within perl which character set you are running
       under, one could test the return value of ord() or chr() for one or
       more characters.  For example:
587
588           $is_ascii  = "A" eq chr(65);
589           $is_ebcdic = "A" eq chr(193);
590
591       Also, "\t" is a "HORIZONTAL TABULATION" character so that:
592
593           $is_ascii  = ord("\t") == 9;
594           $is_ebcdic = ord("\t") == 5;
595
596       To distinguish EBCDIC code pages try looking at one or more of the
597       characters that differ between them.  For example:
598
599           $is_ebcdic_37   = "\n" eq chr(37);
600           $is_ebcdic_1047 = "\n" eq chr(21);
601
       Better still, choose a character that is encoded differently in each
       of the code sets, e.g.:
604
605           $is_ascii           = ord('[') == 91;
606           $is_ebcdic_37       = ord('[') == 186;
607           $is_ebcdic_1047     = ord('[') == 173;
608           $is_ebcdic_POSIX_BC = ord('[') == 187;
609
610       However, it would be unwise to write tests such as:
611
612           $is_ascii = "\r" ne chr(13);  #  WRONG
613           $is_ascii = "\n" ne chr(10);  #  ILL ADVISED
614
615       Obviously the first of these will fail to distinguish most ASCII
616       platforms from either a CCSID 0037, a 1047, or a POSIX-BC EBCDIC
617       platform since "\r" eq chr(13) under all of those coded character sets.
618       But note too that because "\n" is chr(13) and "\r" is chr(10) on the
619       Macintosh (which is an ASCII platform) the second $is_ascii test will
620       lead to trouble there.
621
622       To determine whether or not perl was built under an EBCDIC code page
623       you can use the Config module like so:
624
625           use Config;
626           $is_ebcdic = $Config{'ebcdic'} eq 'define';
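
       The individual tests above can be gathered into one lookup.  The
       following is only a sketch: the sub name native_charset and the
       returned labels are invented for illustration, and the code numbers
       are the ones listed for '[' earlier in this section:

```perl
use strict;
use warnings;
use Config;

# Code number of '[' in each coded character set discussed here.
my %CHARSET_BY_LBRACKET = (
     91 => 'ascii',
    186 => 'cp037',
    173 => 'cp1047',
    187 => 'posix-bc',
);

# Hypothetical helper: name the coded character set perl is running under.
sub native_charset {
    my $name = $CHARSET_BY_LBRACKET{ ord('[') };
    return $name if defined $name;

    # Fall back on the build-time setting for code pages not listed above.
    return (defined $Config{ebcdic} && $Config{ebcdic} eq 'define')
        ? 'unknown-ebcdic'
        : 'unknown';
}

print "running under ", native_charset(), "\n";
```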
627

CONVERSIONS

629   tr///
630       In order to convert a string of characters from one character set to
631       another a simple list of numbers, such as in the right columns in the
632       above table, along with perl's tr/// operator is all that is needed.
633       The data in the table are in ASCII/Latin1 order, hence the EBCDIC
634       columns provide easy-to-use ASCII/Latin1 to EBCDIC operations that are
635       also easily reversed.
636
637       For example, to convert ASCII/Latin1 to code page 037 take the output
638       of the second numbers column from the output of recipe 2 (modified to
639       add '\' characters) and use it in tr/// like so:
640
641           $cp_037 =
642           '\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F' .
643           '\x10\x11\x12\x13\x3C\x3D\x32\x26\x18\x19\x3F\x27\x1C\x1D\x1E\x1F' .
644           '\x40\x5A\x7F\x7B\x5B\x6C\x50\x7D\x4D\x5D\x5C\x4E\x6B\x60\x4B\x61' .
645           '\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\x7A\x5E\x4C\x7E\x6E\x6F' .
646           '\x7C\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xD1\xD2\xD3\xD4\xD5\xD6' .
647           '\xD7\xD8\xD9\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xBA\xE0\xBB\xB0\x6D' .
648           '\x79\x81\x82\x83\x84\x85\x86\x87\x88\x89\x91\x92\x93\x94\x95\x96' .
649           '\x97\x98\x99\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xC0\x4F\xD0\xA1\x07' .
650           '\x20\x21\x22\x23\x24\x15\x06\x17\x28\x29\x2A\x2B\x2C\x09\x0A\x1B' .
651           '\x30\x31\x1A\x33\x34\x35\x36\x08\x38\x39\x3A\x3B\x04\x14\x3E\xFF' .
652           '\x41\xAA\x4A\xB1\x9F\xB2\x6A\xB5\xBD\xB4\x9A\x8A\x5F\xCA\xAF\xBC' .
653           '\x90\x8F\xEA\xFA\xBE\xA0\xB6\xB3\x9D\xDA\x9B\x8B\xB7\xB8\xB9\xAB' .
654           '\x64\x65\x62\x66\x63\x67\x9E\x68\x74\x71\x72\x73\x78\x75\x76\x77' .
655           '\xAC\x69\xED\xEE\xEB\xEF\xEC\xBF\x80\xFD\xFE\xFB\xFC\xAD\xAE\x59' .
656           '\x44\x45\x42\x46\x43\x47\x9C\x48\x54\x51\x52\x53\x58\x55\x56\x57' .
657           '\x8C\x49\xCD\xCE\xCB\xCF\xCC\xE1\x70\xDD\xDE\xDB\xDC\x8D\x8E\xDF';
658
659           my $ebcdic_string = $ascii_string;
660           eval '$ebcdic_string =~ tr/\000-\377/' . $cp_037 . '/';
661
662       To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
663       arguments like so:
664
665           my $ascii_string = $ebcdic_string;
666           eval '$ascii_string =~ tr/' . $cp_037 . '/\000-\377/';
667
668       Similarly one could take the output of the third numbers column from
669       recipe 2 to obtain a $cp_1047 table.  The fourth numbers column of the
670       output from recipe 2 could provide a $cp_posix_bc table suitable for
671       transcoding as well.
672
673       If you wanted to see the inverse tables, you would first have to sort
674       on the desired numbers column as in recipes 4, 5 or 6, then take the
675       output of the first numbers column.
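
       The inverse table can also be computed rather than read off a sorted
       listing.  The sketch below copies the code numbers of $cp_037 into an
       array, inverts it, and converts byte by byte without a string eval.
       The names @a2e, @e2a, latin1_to_cp037 and cp037_to_latin1 are
       illustrative.  Like the tr/// approach, it works on code numbers, so
       the input is assumed to hold ASCII/Latin-1 encoded data:

```perl
use strict;
use warnings;

# ASCII/Latin-1 to CCSID 0037 code numbers, copied from $cp_037 above.
my @a2e = map { hex $_ } qw(
    00 01 02 03 37 2D 2E 2F 16 05 25 0B 0C 0D 0E 0F
    10 11 12 13 3C 3D 32 26 18 19 3F 27 1C 1D 1E 1F
    40 5A 7F 7B 5B 6C 50 7D 4D 5D 5C 4E 6B 60 4B 61
    F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 7A 5E 4C 7E 6E 6F
    7C C1 C2 C3 C4 C5 C6 C7 C8 C9 D1 D2 D3 D4 D5 D6
    D7 D8 D9 E2 E3 E4 E5 E6 E7 E8 E9 BA E0 BB B0 6D
    79 81 82 83 84 85 86 87 88 89 91 92 93 94 95 96
    97 98 99 A2 A3 A4 A5 A6 A7 A8 A9 C0 4F D0 A1 07
    20 21 22 23 24 15 06 17 28 29 2A 2B 2C 09 0A 1B
    30 31 1A 33 34 35 36 08 38 39 3A 3B 04 14 3E FF
    41 AA 4A B1 9F B2 6A B5 BD B4 9A 8A 5F CA AF BC
    90 8F EA FA BE A0 B6 B3 9D DA 9B 8B B7 B8 B9 AB
    64 65 62 66 63 67 9E 68 74 71 72 73 78 75 76 77
    AC 69 ED EE EB EF EC BF 80 FD FE FB FC AD AE 59
    44 45 42 46 43 47 9C 48 54 51 52 53 58 55 56 57
    8C 49 CD CE CB CF CC E1 70 DD DE DB DC 8D 8E DF
);
die "table must have 256 entries\n" unless @a2e == 256;

# Invert the table: each EBCDIC code number becomes an index.
my @e2a;
$e2a[ $a2e[$_] ] = $_ for 0 .. 255;

# Map every character of the argument through a lookup table.
sub latin1_to_cp037 {
    my ($s) = @_;
    return join '', map { chr($a2e[ ord($_) ]) } split //, $s;
}

sub cp037_to_latin1 {
    my ($s) = @_;
    return join '', map { chr($e2a[ ord($_) ]) } split //, $s;
}
```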
676
677   iconv
678       XPG operability often implies the presence of an iconv utility
679       available from the shell or from the C library.  Consult your system's
680       documentation for information on iconv.
681
682       On OS/390 or z/OS see the iconv(1) manpage.  One way to invoke the
683       iconv shell utility from within perl would be to:
684
685           # OS/390 or z/OS example
           $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`;
687
688       or the inverse map:
689
690           # OS/390 or z/OS example
           $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`;
692
693       For other perl-based conversion options see the Convert::* modules on
694       CPAN.
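
       Where the Encode module is installed (it ships with perl 5.8 and
       later), transcoding can also be done without a shell, which avoids
       trouble when the data itself contains a single quote.  A minimal
       sketch, assuming that Encode and its Encode::EBCDIC tables are
       available; "cp37", "cp1047" and "posix-bc" are the encoding names that
       module provides:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "Hello";

# Encode a perl string into CCSID 1047 bytes, then decode it again.
my $ebcdic_data = encode('cp1047', $text);
my $round_trip  = decode('cp1047', $ebcdic_data);

# Show the resulting code numbers in hex.
print join(' ', map { sprintf '%02X', ord($_) } split //, $ebcdic_data), "\n";
# prints "C8 85 93 93 96"
```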
695
696   C RTL
697       The OS/390 and z/OS C run-time libraries provide _atoe() and _etoa()
698       functions.
699

OPERATOR DIFFERENCES

701       The ".." range operator treats certain character ranges with care on
702       EBCDIC platforms.  For example the following array will have twenty six
703       elements on either an EBCDIC platform or an ASCII platform:
704
705           @alphabet = ('A'..'Z');   #  $#alphabet == 25
706
707       The bitwise operators such as & ^ | may return different results when
708       operating on string or character data in a perl program running on an
709       EBCDIC platform than when run on an ASCII platform.  Here is an example
710       adapted from the one in perlop:
711
712           # EBCDIC-based examples
713           print "j p \n" ^ " a h";                      # prints "JAPH\n"
714           print "JA" | "  ph\n";                        # prints "japh\n"
715           print "JAPH\nJunk" & "\277\277\277\277\277";  # prints "japh\n";
716           print 'p N$' ^ " E<H\n";                      # prints "Perl\n";
717
       An interesting property of the 32 C0 control characters in the ASCII
       table is that they can "literally" be constructed as control characters
       in perl, e.g. (chr(0) eq "\c@"), (chr(1) eq "\cA"), and so on.  Perl on
       EBCDIC platforms has been ported to take "\c@" to chr(0) and "\cA" to
       chr(1), etc. as well, but the thirty three characters that result
       depend on which code page you are using.  The table below uses the
       standard acronyms for the controls.  The POSIX-BC and 1047 sets are
       identical throughout this range and differ from the 0037 set at only
       one spot (21 decimal).  Note that the "LINE FEED" character may be
       generated by "\cJ" on ASCII platforms but by "\cU" on 1047 or POSIX-BC
       platforms and cannot be generated as a "\c.letter." control character
       on 0037 platforms.  Note also that "\c\" cannot be the final element in
       a string or regex, as it will absorb the terminator.  But "\c\X" is a
       "FILE SEPARATOR" concatenated with X for all X.
732
733        chr   ord   8859-1    0037    1047 && POSIX-BC
734        -----------------------------------------------------------------------
735        \c?   127   <DEL>       "            "
736        \c@     0   <NUL>     <NUL>        <NUL>
737        \cA     1   <SOH>     <SOH>        <SOH>
738        \cB     2   <STX>     <STX>        <STX>
739        \cC     3   <ETX>     <ETX>        <ETX>
740        \cD     4   <EOT>     <ST>         <ST>
741        \cE     5   <ENQ>     <HT>         <HT>
742        \cF     6   <ACK>     <SSA>        <SSA>
743        \cG     7   <BEL>     <DEL>        <DEL>
744        \cH     8   <BS>      <EPA>        <EPA>
745        \cI     9   <HT>      <RI>         <RI>
746        \cJ    10   <LF>      <SS2>        <SS2>
747        \cK    11   <VT>      <VT>         <VT>
748        \cL    12   <FF>      <FF>         <FF>
749        \cM    13   <CR>      <CR>         <CR>
750        \cN    14   <SO>      <SO>         <SO>
751        \cO    15   <SI>      <SI>         <SI>
752        \cP    16   <DLE>     <DLE>        <DLE>
753        \cQ    17   <DC1>     <DC1>        <DC1>
754        \cR    18   <DC2>     <DC2>        <DC2>
755        \cS    19   <DC3>     <DC3>        <DC3>
756        \cT    20   <DC4>     <OSC>        <OSC>
757        \cU    21   <NAK>     <NEL>        <LF>              ***
758        \cV    22   <SYN>     <BS>         <BS>
759        \cW    23   <ETB>     <ESA>        <ESA>
760        \cX    24   <CAN>     <CAN>        <CAN>
761        \cY    25   <EOM>     <EOM>        <EOM>
762        \cZ    26   <SUB>     <PU2>        <PU2>
763        \c[    27   <ESC>     <SS3>        <SS3>
764        \c\X   28   <FS>X     <FS>X        <FS>X
765        \c]    29   <GS>      <GS>         <GS>
766        \c^    30   <RS>      <RS>         <RS>
767        \c_    31   <US>      <US>         <US>
768

FUNCTION DIFFERENCES

770       chr()   chr() must be given an EBCDIC code number argument to yield a
771               desired character return value on an EBCDIC platform.  For
772               example:
773
774                   $CAPITAL_LETTER_A = chr(193);
775
776       ord()   ord() will return EBCDIC code number values on an EBCDIC
777               platform.  For example:
778
779                   $the_number_193 = ord("A");
780
781       pack()  The c and C templates for pack() are dependent upon character
782               set encoding.  Examples of usage on EBCDIC include:
783
784                   $foo = pack("CCCC",193,194,195,196);
785                   # $foo eq "ABCD"
786                   $foo = pack("C4",193,194,195,196);
787                   # same thing
788
789                   $foo = pack("ccxxcc",193,194,195,196);
790                   # $foo eq "AB\0\0CD"
791
792       print() One must be careful with scalars and strings that are passed to
793               print that contain ASCII encodings.  One common place for this
794               to occur is in the output of the MIME type header for CGI
795               script writing.  For example, many perl programming guides
796               recommend something similar to:
797
798                   print "Content-type:\ttext/html\015\012\015\012";
799                   # this may be wrong on EBCDIC
800
801               Under the IBM OS/390 USS Web Server or WebSphere on z/OS for
802               example you should instead write that as:
803
804                   print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et al
805
806               That is because the translation from EBCDIC to ASCII is done by
807               the web server in this case (such code will not be appropriate
808               for the Macintosh however).  Consult your web server's
809               documentation for further details.
810
811       printf()
812               The formats that can convert characters to numbers and vice
813               versa will be different from their ASCII counterparts when
814               executed on an EBCDIC platform.  Examples include:
815
816                   printf("%c%c%c",193,194,195);  # prints ABC
817
818       sort()  EBCDIC sort results may differ from ASCII sort results
819               especially for mixed case strings.  This is discussed in more
820               detail below.
821
822       sprintf()
823               See the discussion of printf() above.  An example of the use of
824               sprintf would be:
825
826                   $CAPITAL_LETTER_A = sprintf("%c",193);
827
828       unpack()
829               See the discussion of pack() above.
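
       Hard-coded code numbers such as 193 tie a program to EBCDIC.  Where
       one source file must run on both kinds of platform, one option is to
       write the ASCII/Latin-1 number and let perl translate it.  A minimal
       sketch, assuming a perl recent enough to provide
       utf8::unicode_to_native() and utf8::native_to_unicode() (see the utf8
       manpage); the wrapper names latin1_chr and latin1_ord are
       illustrative:

```perl
use strict;
use warnings;

# chr() that takes an ASCII/Latin-1 code number on any platform.
sub latin1_chr {
    my ($code) = @_;
    return chr(utf8::unicode_to_native($code));
}

# ord() that returns an ASCII/Latin-1 code number on any platform.
sub latin1_ord {
    my ($char) = @_;
    return utf8::native_to_unicode(ord($char));
}

my $capital_letter_a = latin1_chr(65);    # "A" under ASCII and EBCDIC alike
my $the_number_65    = latin1_ord("A");   # 65, even where ord("A") is 193
```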
830

REGULAR EXPRESSION DIFFERENCES

       As of perl 5.005_03 the letter range regular expressions such as [A-Z]
       and [a-z] have been especially coded to not pick up gap characters.
       For example, characters such as ô "o WITH CIRCUMFLEX" that lie between
       I and J would not be matched by the regular expression range "/[H-K]/".
       This works in the other direction, too, if either of the range end
       points is explicitly numeric: "[\x89-\x91]" will match "\x8e", even
       though "\x89" is "i" and "\x91" is "j", and "\x8e" is a gap character
       from the alphabetic viewpoint.
840
841       If you do want to match the alphabet gap characters in a single octet
842       regular expression try matching the hex or octal code such as "/\313/"
843       on EBCDIC or "/\364/" on ASCII platforms to have your regular
844       expression match "o WITH CIRCUMFLEX".
845
846       Another construct to be wary of is the inappropriate use of hex or
847       octal constants in regular expressions.  Consider the following set of
848       subs:
849
850           sub is_c0 {
851               my $char = substr(shift,0,1);
852               $char =~ /[\000-\037]/;
853           }
854
855           sub is_print_ascii {
856               my $char = substr(shift,0,1);
857               $char =~ /[\040-\176]/;
858           }
859
860           sub is_delete {
861               my $char = substr(shift,0,1);
862               $char eq "\177";
863           }
864
865           sub is_c1 {
866               my $char = substr(shift,0,1);
867               $char =~ /[\200-\237]/;
868           }
869
870           sub is_latin_1 {
871               my $char = substr(shift,0,1);
872               $char =~ /[\240-\377]/;
873           }
874
875       The above would be adequate if the concern was only with numeric code
876       points.  However, the concern may be with characters rather than code
877       points and on an EBCDIC platform it may be desirable for constructs
878       such as "if (is_print_ascii("A")) {print "A is a printable
879       character\n";}" to print out the expected message.  One way to
880       represent the above collection of character classification subs that is
881       capable of working across the four coded character sets discussed in
882       this document is as follows:
883
884           sub Is_c0 {
885               my $char = substr(shift,0,1);
886               if (ord('^')==94)  { # ascii
887                   return $char =~ /[\000-\037]/;
888               }
889               if (ord('^')==176) { # 0037
890                   return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
891               }
892               if (ord('^')==95 || ord('^')==106) { # 1047 || posix-bc
893                   return $char =~ /[\000-\003\067\055-\057\026\005\025\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
894               }
895           }
896
897           sub Is_print_ascii {
898               my $char = substr(shift,0,1);
899               $char =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;
900           }
901
902           sub Is_delete {
903               my $char = substr(shift,0,1);
904               if (ord('^')==94)  { # ascii
905                   return $char eq "\177";
906               }
907               else  {              # ebcdic
908                   return $char eq "\007";
909               }
910           }
911
           sub Is_c1 {
               my $char = substr(shift,0,1);
               if (ord('^')==94)  { # ascii
                   return $char =~ /[\200-\237]/;
               }
               if (ord('^')==176) { # 0037
                   return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
               }
               if (ord('^')==95)  { # 1047
                   return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
               }
               if (ord('^')==106) { # posix-bc
                   return $char =~
                     /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\137]/;
               }
           }
928
929           sub Is_latin_1 {
930               my $char = substr(shift,0,1);
931               if (ord('^')==94)  { # ascii
932                   return $char =~ /[\240-\377]/;
933               }
934               if (ord('^')==176) { # 0037
935                   return $char =~
936                     /[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/;
937               }
938               if (ord('^')==95)  { # 1047
939                   return $char =~
940                     /[\101\252\112\261\237\262\152\265\273\264\232\212\260\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\272\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/;
941               }
942               if (ord('^')==106) { # posix-bc
943                   return $char =~
944                     /[\101\252\260\261\237\262\320\265\171\264\232\212\272\312\257\241\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\340\376\335\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\300\336\333\334\215\216\337]/;
945               }
946           }
947
       Note however that only the "Is_print_ascii()" sub is really independent
       of coded character set.  Another way to write "Is_latin_1()" would be
       to use the characters in the range explicitly:
951
           sub Is_latin_1 {
               my $char = substr(shift,0,1);
               $char =~ /[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
           }
956
       That form, however, may run into trouble in network transit (due to
       the presence of 8-bit characters) or on non-ISO-Latin character sets.
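
       The classification subs above test ord('^') on every call.  One
       possible variation picks the pattern once, when the program starts,
       and reuses it.  The sketch below does so for the C0 controls only; it
       reuses the octal lists from "Is_c0()", and the names %C0_CLASS, $C0_RE
       and Is_c0_cached are illustrative:

```perl
use strict;
use warnings;

# Octal lists taken from Is_c0 above; 0037 and 1047 differ in one entry.
my $c0_0037 = '\000-\003\067\055-\057\026\005\045\013-\023'
            . '\074\075\062\046\030\031\077\047\034-\037';
my $c0_1047 = '\000-\003\067\055-\057\026\005\025\013-\023'
            . '\074\075\062\046\030\031\077\047\034-\037';

# Patterns keyed by the native code number of '^'.
my %C0_CLASS = (
     94 => qr/[\000-\037]/,    # ascii
    176 => qr/[$c0_0037]/,     # 0037
     95 => qr/[$c0_1047]/,     # 1047
    106 => qr/[$c0_1047]/,     # posix-bc, same as 1047 in this range
);

# Choose the pattern for the current platform exactly once.
my $C0_RE = $C0_CLASS{ ord('^') }
    or die "unrecognized coded character set\n";

sub Is_c0_cached {
    my $char = substr(shift, 0, 1);
    return $char =~ $C0_RE;
}
```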
959

SOCKETS

961       Most socket programming assumes ASCII character encodings in network
962       byte order.  Exceptions can include CGI script writing under a host web
963       server where the server may take care of translation for you.  Most
964       host web servers convert EBCDIC data to ISO-8859-1 or Unicode on
965       output.
966

SORTING

       One big difference between ASCII-based character sets and EBCDIC ones
       is the relative positions of upper and lower case letters, and of the
       letters compared to the digits.  If sorted on an ASCII-based platform
       the two-letter abbreviation for a physician comes before the two-letter
       abbreviation for drive; that is:
973
974        @sorted = sort(qw(Dr. dr.));  # @sorted holds ('Dr.','dr.') on ASCII,
975                                         # but ('dr.','Dr.') on EBCDIC
976
       The property of lowercase before uppercase letters in EBCDIC is even
       carried to the Latin 1 EBCDIC pages such as 0037 and 1047.  An example
       would be that Ë "E WITH DIAERESIS" (203) comes before ë "e WITH
       DIAERESIS" (235) on an ASCII platform, but the latter (83) comes before
       the former (115) on an EBCDIC platform.  (Astute readers will note that
       the uppercase version of ß "SMALL LETTER SHARP S" is simply "SS" and
       that the upper case version of ÿ "y WITH DIAERESIS" is not in the
       0..255 range but it is at U+0178 in Unicode, or "\x{178}" in a Unicode
       enabled Perl).
986
987       The sort order will cause differences between results obtained on ASCII
988       platforms versus EBCDIC platforms.  What follows are some suggestions
989       on how to deal with these differences.
990
991   Ignore ASCII vs. EBCDIC sort differences.
992       This is the least computationally expensive strategy.  It may require
993       some user education.
994
995   MONO CASE then sort data.
       In order to minimize the expense of mono casing mixed-case text, try to
       "tr///" towards the character set case most employed within the data.
       If the data are primarily UPPERCASE non-Latin-1 then apply
       tr/[a-z]/[A-Z]/ then sort().  If the data are primarily lowercase
       non-Latin-1 then apply tr/[A-Z]/[a-z]/ before sorting.  If the data are
       primarily UPPERCASE and include Latin-1 characters then apply:

           tr/[a-z]/[A-Z]/;
           tr/[àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ]/[ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ]/;
           s/ß/SS/g;
1006
       then sort().  Do note however that such Latin-1 manipulation does not
       address the ÿ "y WITH DIAERESIS" character that will remain at code
       point 255 on ASCII platforms, but 223 on most EBCDIC platforms where it
       will sort to a place less than the EBCDIC numerals.  With a
       Unicode-enabled Perl you might try:

           tr/ÿ/\x{178}/;
1014
1015       The strategy of mono casing data before sorting does not preserve the
1016       case of the data and may not be acceptable for that reason.
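
       Where the loss of case is the only objection, one alternative is to
       fold case inside the comparison rather than in the data.  The
       following is a minimal sketch and not one of the original recipes:
       the sample list is hypothetical, and strings that differ only in
       case would still be ordered in a platform-dependent way.

```perl
use strict;
use warnings;

# Hypothetical sample data; the stored strings are never modified.
my @words = qw(cherry Banana apple);

# Compare upper-cased copies so that the relative position of upper and
# lower case letters in the native code page does not affect the result.
my @sorted = sort { uc($a) cmp uc($b) } @words;

print "@sorted\n";
```

       Only unaccented letters are handled reliably this way; the Latin-1
       caveats described above still apply.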
1017
   Convert, sort data, then reconvert.
1019       This is the most expensive proposition that does not employ a network
1020       connection.
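
       One way to picture this strategy is to sort on converted keys while
       keeping the original strings.  The sketch below is only an
       illustration: the names ascii_key and ascii_sort are hypothetical,
       and the identity @e2a map is an assumption that is valid on ASCII
       platforms only.  On EBCDIC a real table for the code page in use
       would have to be supplied.

```perl
use strict;
use warnings;

# ASSUMPTION: @e2a maps native code points to ASCII/Latin-1 code points.
# The identity map below is only correct on an ASCII platform; on EBCDIC
# substitute a table for your code page.
my @e2a = (0 .. 255);

# Build a sort key whose bytes are the ASCII/Latin-1 code points.
sub ascii_key {
    my ($string) = @_;
    return pack 'C*', map { $e2a[$_] } unpack 'C*', $string;
}

# Sort on the converted keys but return the original strings.
sub ascii_sort {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ ascii_key($_), $_ ] } @_;
}

my @sorted = ascii_sort(qw(dr. Dr.));
print "@sorted\n";
```

       Because the keys are computed once per element, the conversion cost
       grows linearly with the amount of data rather than with the number
       of comparisons.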
1021
1022   Perform sorting on one type of platform only.
1023       This strategy can employ a network connection.  As such it would be
1024       computationally expensive.
1025

TRANSFORMATION FORMATS

1027       There are a variety of ways of transforming data with an intra
1028       character set mapping that serve a variety of purposes.  Sorting was
1029       discussed in the previous section and a few of the other more popular
1030       mapping techniques are discussed next.
1031
1032   URL decoding and encoding
1033       Note that some URLs have hexadecimal ASCII code points in them in an
1034       attempt to overcome character or protocol limitation issues.  For
1035       example the tilde character is not on every keyboard hence a URL of the
1036       form:
1037
1038           http://www.pvhp.com/~pvhp/
1039
1040       may also be expressed as either of:
1041
1042           http://www.pvhp.com/%7Epvhp/
1043
1044           http://www.pvhp.com/%7epvhp/
1045
1046       where 7E is the hexadecimal ASCII code point for '~'.  Here is an
1047       example of decoding such a URL under CCSID 1047:
1048
           my $url = 'http://www.pvhp.com/%7Epvhp/';
           # this array assumes code page 1047
           my @a2e_1047 = (
                 0,  1,  2,  3, 55, 45, 46, 47, 22,  5, 21, 11, 12, 13, 14, 15,
                16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31,
                64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97,
               240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111,
               124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214,
               215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109,
               121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150,
               151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161,  7,
                32, 33, 34, 35, 36, 37,  6, 23, 40, 41, 42, 43, 44,  9, 10, 27,
                48, 49, 26, 51, 52, 53, 54,  8, 56, 57, 58, 59,  4, 20, 62,255,
                65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188,
               144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171,
               100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119,
               172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89,
                68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87,
               140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223
           );
           # "C" packs an unsigned char; the table holds values above 127
           $url =~ s/%([0-9a-fA-F]{2})/pack("C",$a2e_1047[hex($1)])/ge;
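
       The same substitution can be wrapped so that one script runs on
       either kind of platform.  This is a hedged sketch rather than a
       tested recipe: url_decode is a hypothetical name, and the empty
       list in the EBCDIC branch is a placeholder that must be replaced by
       the ASCII to EBCDIC table for the code page in use.

```perl
use strict;
use warnings;

# ASSUMPTION: on EBCDIC, replace the empty list with the ASCII to EBCDIC
# table for your code page.  On ASCII the identity map is sufficient.
my @a2e = ord('A') == 65 ? (0 .. 255) : ();

sub url_decode {
    my ($url) = @_;
    # "C" packs an unsigned 8-bit value, so code points above 127 are safe.
    $url =~ s/%([0-9a-fA-F]{2})/pack("C", $a2e[hex($1)])/ge;
    return $url;
}

print url_decode('http://www.pvhp.com/%7Epvhp/'), "\n";
```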
1070
1071       Conversely, here is a partial solution for the task of encoding such a
1072       URL under the 1047 code page:
1073
           my $url = 'http://www.pvhp.com/~pvhp/';
           # this array assumes code page 1047
           my @e2a_1047 = (
                 0,  1,  2,  3,156,  9,134,127,151,141,142, 11, 12, 13, 14, 15,
                16, 17, 18, 19,157, 10,  8,135, 24, 25,146,143, 28, 29, 30, 31,
               128,129,130,131,132,133, 23, 27,136,137,138,139,140,  5,  6,  7,
               144,145, 22,147,148,149,150,  4,152,153,154,155, 20, 21,158, 26,
                32,160,226,228,224,225,227,229,231,241,162, 46, 60, 40, 43,124,
                38,233,234,235,232,237,238,239,236,223, 33, 36, 42, 41, 59, 94,
                45, 47,194,196,192,193,195,197,199,209,166, 44, 37, 95, 62, 63,
               248,201,202,203,200,205,206,207,204, 96, 58, 35, 64, 39, 61, 34,
               216, 97, 98, 99,100,101,102,103,104,105,171,187,240,253,254,177,
               176,106,107,108,109,110,111,112,113,114,170,186,230,184,198,164,
               181,126,115,116,117,118,119,120,121,122,161,191,208, 91,222,174,
               172,163,165,183,169,167,182,188,189,190,221,168,175, 93,180,215,
               123, 65, 66, 67, 68, 69, 70, 71, 72, 73,173,244,246,242,243,245,
               125, 74, 75, 76, 77, 78, 79, 80, 81, 82,185,251,252,249,250,255,
                92,247, 83, 84, 85, 86, 87, 88, 89, 90,178,212,214,210,211,213,
                48, 49, 50, 51, 52, 53, 54, 55, 56, 57,179,219,220,217,218,159
           );
           # The following regular expression does not address the
           # mappings for: ('.' => '%2E', '/' => '%2F', ':' => '%3A')
           $url =~ s/([\t "#%&\(\),;<=>\?\@\[\\\]^`{|}~])/sprintf("%%%02X",$e2a_1047[ord($1)])/ge;
1097
1098       where a more complete solution would split the URL into components and
1099       apply a full s/// substitution only to the appropriate parts.
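
       A rough sketch of that component-wise approach follows.  It is an
       illustration under stated assumptions, not a complete URL encoder:
       encode_segment and encode_url_path are hypothetical names, the
       identity @e2a map is valid on ASCII platforms only, and only the
       "scheme://host" prefix and the path are recognized (query strings
       and fragments are not treated specially).  The tilde is encoded
       here to match the example above.

```perl
use strict;
use warnings;

# ASSUMPTION: identity map, valid on ASCII only.  On EBCDIC supply the
# EBCDIC to ASCII table for your code page instead.
my @e2a = (0 .. 255);

# Percent-encode everything in one path segment except letters, digits,
# hyphen, period, and underscore.
sub encode_segment {
    my ($segment) = @_;
    $segment =~ s/([^A-Za-z0-9\-._])/sprintf("%%%02X", $e2a[ord($1)])/ge;
    return $segment;
}

# Leave "scheme://host" untouched and encode each path segment, so that
# the "/" separators survive.
sub encode_url_path {
    my ($url) = @_;
    my ($prefix, $path) = $url =~ m{^([A-Za-z][A-Za-z0-9+.\-]*://[^/]*)(.*)$}s
        or return $url;
    return $prefix . join('/', map { encode_segment($_) } split m{/}, $path, -1);
}

print encode_url_path('http://www.pvhp.com/~pvhp/my file.html'), "\n";
```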
1100
1101       In the remaining examples a @e2a or @a2e array may be employed but the
1102       assignment will not be shown explicitly.  For code page 1047 you could
1103       use the @a2e_1047 or @e2a_1047 arrays just shown.
1104
1105   uu encoding and decoding
1106       The "u" template to pack() or unpack() will render EBCDIC data in
1107       EBCDIC characters equivalent to their ASCII counterparts.  For example,
1108       the following will print "Yes indeed\n" on either an ASCII or EBCDIC
1109       computer:
1110
1111           $all_byte_chrs = '';
1112           for (0..255) { $all_byte_chrs .= chr($_); }
1113           $uuencode_byte_chrs = pack('u', $all_byte_chrs);
1114           ($uu = <<'ENDOFHEREDOC') =~ s/^\s*//gm;
1115           M``$"`P0%!@<("0H+#`T.#Q`1$A,4%187&!D:&QP='A\@(2(C)"4F)R@I*BLL
1116           M+2XO,#$R,S0U-C<X.3H[/#T^/T!!0D-$149'2$E*2TQ-3D]045)35%565UA9
1117           M6EM<75Y?8&%B8V1E9F=H:6IK;&UN;W!Q<G-T=79W>'EZ>WQ]?G^`@8*#A(6&
1118           MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S
1119           MM+6VM[BYNKN\O;Z_P,'"P\3%QL?(R<K+S,W.S]#1TM/4U=;7V-G:V]S=WM_@
1120           ?X>+CY.7FY^CIZNOL[>[O\/'R\_3U]O?X^?K[_/W^_P``
1121           ENDOFHEREDOC
1122           if ($uuencode_byte_chrs eq $uu) {
1123               print "Yes ";
1124           }
1125           $uudecode_byte_chrs = unpack('u', $uuencode_byte_chrs);
1126           if ($uudecode_byte_chrs eq $all_byte_chrs) {
1127               print "indeed\n";
1128           }
1129
1130       Here is a very spartan uudecoder that will work on EBCDIC provided that
1131       the @e2a array is filled in appropriately:
1132
           #!/usr/local/bin/perl
           @e2a = ( # this must be filled in
                  );
           # skip ahead to the "begin <mode> <file>" header line
           $_ = <> until ($mode,$file) = /^begin\s*(\d*)\s*(\S*)/;
           open(OUT, "> $file") if $file ne "";
           while(<>) {
               last if /^end/;
               next if /[a-z]/;
               # the first character encodes the byte count of the line;
               # skip lines whose length does not agree with it
               next unless int(((($e2a[ord()] - 32 ) & 077) + 2) / 3) ==
                   int(length() / 4);
               print OUT unpack("u", $_);
           }
           close(OUT);
           chmod oct($mode), $file;
1147
1148   Quoted-Printable encoding and decoding
       On ASCII-encoded platforms it is possible to encode characters outside
       of the printable set using:

           # This QP encoder works on ASCII only
           $qp_string =~ s/([=\x00-\x1F\x80-\xFF])/sprintf("=%02X",ord($1))/ge;
1154
1155       Whereas a QP encoder that works on both ASCII and EBCDIC platforms
1156       would look somewhat like the following (where the EBCDIC branch @e2a
1157       array is omitted for brevity):
1158
           if (ord('A') == 65) {    # ASCII
               $delete = "\x7F";    # ASCII
               @e2a = (0 .. 255);   # ASCII to ASCII identity map
           }
           else {                   # EBCDIC
               $delete = "\x07";    # EBCDIC
               @e2a = ();           # fill in the EBCDIC to ASCII map (as shown above)
           }
           $qp_string =~
             s/([^ !"\#\$%&'()*+,\-.\/0-9:;<>?\@A-Z[\\\]^_`a-z{|}~$delete])/sprintf("=%02X",$e2a[ord($1)])/ge;
1169
1170       (although in production code the substitutions might be done in the
1171       EBCDIC branch with the @e2a array and separately in the ASCII branch
1172       without the expense of the identity map).
1173
1174       Such QP strings can be decoded with:
1175
           # This QP decoder is limited to ASCII only
           $string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr hex $1/ge;
           $string =~ s/=[\n\r]+$//;

       Whereas a QP decoder that works on both ASCII and EBCDIC platforms
       would look somewhat like the following (where the @a2e array is omitted
       for brevity):

           $string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr $a2e[hex $1]/ge;
           $string =~ s/=[\n\r]+$//;
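
       To see the encoder and decoder working together, the following
       sketch performs a round trip.  It is illustrative only: qp_encode,
       qp_decode and %safe are hypothetical names, the identity maps are an
       assumption that holds on ASCII platforms only, and a lookup hash is
       used in place of the character class shown above.  Unlike that
       character class, this sketch also encodes the delete character.

```perl
use strict;
use warnings;

# ASSUMPTION: identity maps, valid on ASCII platforms only.  On EBCDIC
# fill in @e2a and @a2e for your code page.
my @e2a = (0 .. 255);
my @a2e = (0 .. 255);

# Characters passed through unencoded: space and the printable ASCII
# characters except "=".
my %safe = map { $_ => 1 }
    ( split(//, q( !"#$%&'()*+,-./0123456789:;<>?@[\]^_`{|}~)),
      'A' .. 'Z', 'a' .. 'z' );

sub qp_encode {
    my ($string) = @_;
    $string =~ s/(.)/$safe{$1} ? $1 : sprintf("=%02X", $e2a[ord $1])/gse;
    return $string;
}

sub qp_decode {
    my ($string) = @_;
    $string =~ s/=([0-9A-Fa-f]{2})/chr $a2e[hex $1]/ge;
    return $string;
}

my $plain   = "caf\xE9 = 1";
my $encoded = qp_encode($plain);
print "round trip ok\n" if qp_decode($encoded) eq $plain;
```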
1186
1187   Caesarean ciphers
       The practice of shifting an alphabet one or more characters for
       encipherment dates back thousands of years and was explicitly detailed
       by Gaius Julius Caesar in his Gallic Wars text.  A single alphabet
       shift is sometimes referred to as a rotation and the shift amount is
       given as a number $n after the string 'rot', that is, "rot$n".  Rot0
       and rot26 would designate identity maps on the 26-letter English
       version of the Latin alphabet.  Rot13 has the interesting property
       that applying it twice in succession yields the identity map (thus
       rot13 is its own non-trivial inverse in the group of 26 alphabet
       rotations).  Hence the following is a rot13 encoder and decoder that
       will work on ASCII and EBCDIC platforms:
1199
           #!/usr/local/bin/perl

           while(<>){
               tr/n-za-mN-ZA-M/a-zA-Z/;
               print;
           }

       In one-liner form:

           perl -ne 'tr/n-za-mN-ZA-M/a-zA-Z/;print'
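
       The general rot$n case can be sketched without any code point
       arithmetic by indexing into letter lists, which sidesteps the gaps
       between letters in EBCDIC.  The function name rot is hypothetical
       and the code is a minimal illustration rather than a tuned
       implementation.

```perl
use strict;
use warnings;

# Rotate the 26 unaccented letters by $n places, leaving every other
# character alone.  Letters are looked up by position in a list, so no
# assumption is made about their code points being contiguous.
sub rot {
    my ($n, $text) = @_;
    my @lower = ('a' .. 'z');
    my @upper = ('A' .. 'Z');
    my %map;
    for my $i (0 .. 25) {
        $map{ $lower[$i] } = $lower[ ($i + $n) % 26 ];
        $map{ $upper[$i] } = $upper[ ($i + $n) % 26 ];
    }
    $text =~ s/([A-Za-z])/$map{$1}/g;
    return $text;
}

print rot(13, 'Hello'), "\n";
print rot(13, rot(13, 'Hello')), "\n";
```

       Applying rot(13, ...) twice returns the input, and rot(26, ...)
       leaves it unchanged, in line with the identity maps described
       above.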
1210

Hashing order and checksums

1212       To the extent that it is possible to write code that depends on hashing
1213       order there may be differences between hashes as stored on an ASCII-
1214       based platform and hashes stored on an EBCDIC-based platform.  XXX
1215

I18N AND L10N

       Internationalization (I18N) and localization (L10N) are supported at
       least in principle even on EBCDIC platforms.  The details are
       system-dependent and are discussed in the "OS ISSUES" section below.
1221

MULTI-OCTET CHARACTER SETS

1223       Perl may work with an internal UTF-EBCDIC encoding form for wide
1224       characters on EBCDIC platforms in a manner analogous to the way that it
1225       works with the UTF-8 internal encoding form on ASCII based platforms.
1226
1227       Legacy multi byte EBCDIC code pages XXX.
1228

OS ISSUES

1230       There may be a few system-dependent issues of concern to EBCDIC Perl
1231       programmers.
1232
1233   OS/400
       PASE    The PASE environment is a runtime environment for OS/400 that
               can run executables built for PowerPC AIX in OS/400; see
               perlos400.  PASE is ASCII-based, unlike ILE, which is
               EBCDIC-based.
1237
1238       IFS access
1239               XXX.
1240
1241   OS/390, z/OS
       Perl runs under Unix System Services (USS).
1243
1244       chcp    chcp is supported as a shell utility for displaying and
1245               changing one's code page.  See also chcp(1).
1246
1247       dataset access
1248               For sequential data set access try:
1249
1250                   my @ds_records = `cat //DSNAME`;
1251
1252               or:
1253
1254                   my @ds_records = `cat //'HLQ.DSNAME'`;
1255
1256               See also the OS390::Stdio module on CPAN.
1257
1258       OS/390, z/OS iconv
1259               iconv is supported as both a shell utility and a C RTL routine.
1260               See also the iconv(1) and iconv(3) manual pages.
1261
1262       locales On OS/390 or z/OS see locale for information on locales.  The
1263               L10N files are in /usr/nls/locale.  $Config{d_setlocale} is
1264               'define' on OS/390 or z/OS.
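
       Whether the running perl was built with setlocale() support can be
       checked at run time through the Config module.  The following is a
       small sketch; the variable name $has_setlocale is hypothetical.

```perl
use strict;
use warnings;
use Config;

# $Config{d_setlocale} is 'define' when perl was built with setlocale().
my $has_setlocale = (($Config{d_setlocale} || '') eq 'define') ? 1 : 0;

print "setlocale available: ", ($has_setlocale ? "yes" : "no"), "\n";
```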
1265
1266   VM/ESA?
1267       XXX.
1268
1269   POSIX-BC?
1270       XXX.
1271

BUGS

       This pod document contains literal Latin 1 characters and may encounter
       translation difficulties.  In particular one popular nroff
       implementation was known to strip accented characters to their
       unaccented counterparts while attempting to view this document through
       the pod2man program (for example, you may see a plain "y" rather than
       one with a diaeresis as in ÿ).  Another nroff truncated the resultant
       manpage at the first occurrence of 8 bit characters.
1280
1281       Not all shells will allow multiple "-e" string arguments to perl to be
1282       concatenated together properly as recipes 0, 2, 4, 5, and 6 might seem
1283       to imply.
1284

SEE ALSO

1286       perllocale, perlfunc, perlunicode, utf8.
1287

REFERENCES

1289       <http://anubis.dkuug.dk/i18n/charmaps>
1290
1291       <http://www.unicode.org/>
1292
1293       <http://www.unicode.org/unicode/reports/tr16/>
1294
1295       <http://www.wps.com/projects/codes/> ASCII: American Standard Code for
1296       Information Infiltration Tom Jennings, September 1999.
1297
1298       The Unicode Standard, Version 3.0 The Unicode Consortium, Lisa Moore
1299       ed., ISBN 0-201-61633-5, Addison Wesley Developers Press, February
1300       2000.
1301
1302       CDRA: IBM - Character Data Representation Architecture - Reference and
1303       Registry, IBM SC09-2190-00, December 1996.
1304
1305       "Demystifying Character Sets", Andrea Vine, Multilingual Computing &
1306       Technology, #26 Vol. 10 Issue 4, August/September 1999; ISSN 1523-0309;
1307       Multilingual Computing Inc. Sandpoint ID, USA.
1308
1309       Codes, Ciphers, and Other Cryptic and Clandestine Communication Fred B.
1310       Wrixon, ISBN 1-57912-040-7, Black Dog & Leventhal Publishers, 1998.
1311
       <http://www.bobbemer.com/P-BIT.HTM> IBM - EBCDIC and the P-bit; The
       biggest Computer Goof Ever, Robert Bemer.
1315

HISTORY

1317       15 April 2001: added UTF-8 and UTF-EBCDIC to main table, pvhp.
1318

AUTHOR

1320       Peter Prymmer pvhp@best.com wrote this in 1999 and 2000 with CCSID 0819
1321       and 0037 help from Chris Leach and Andre Pirard A.Pirard@ulg.ac.be as
1322       well as POSIX-BC help from Thomas Dorner Thomas.Dorner@start.de.
1323       Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and Joe
1324       Smith.  Trademarks, registered trademarks, service marks and registered
1325       service marks used in this document are the property of their
1326       respective owners.
1327
1328
1329
1330perl v5.16.3                      2013-03-04                     PERLEBCDIC(1)