1QTextCodec(3qt) QTextCodec(3qt)
2
3
4
6 QTextCodec - Conversion between text encodings
7
Almost all the functions in this class are reentrant when Qt is built
with thread support. The exceptions are ~QTextCodec(), setCodecForTr(),
setCodecForCStrings(), and QTextCodec().
12
13 #include <qtextcodec.h>
14
15 Inherited by QBig5Codec, QBig5hkscsCodec, QEucJpCodec, QEucKrCodec,
16 QGb18030Codec, QJisCodec, QHebrewCodec, QSjisCodec, and QTsciiCodec.
17
18 Public Members
19 virtual ~QTextCodec ()
20 virtual const char * name () const = 0
21 virtual const char * mimeName () const
22 virtual int mibEnum () const = 0
23 virtual QTextDecoder * makeDecoder () const
24 virtual QTextEncoder * makeEncoder () const
25 virtual QString toUnicode ( const char * chars, int len ) const
26 virtual QCString fromUnicode ( const QString & uc, int & lenInOut )
27 const
28 QCString fromUnicode ( const QString & uc ) const
29 QString toUnicode ( const QByteArray & a, int len ) const
30 QString toUnicode ( const QByteArray & a ) const
31 QString toUnicode ( const QCString & a, int len ) const
32 QString toUnicode ( const QCString & a ) const
33 QString toUnicode ( const char * chars ) const
34 virtual bool canEncode ( QChar ch ) const
35 virtual bool canEncode ( const QString & s ) const
36 virtual int heuristicContentMatch ( const char * chars, int len ) const
37 = 0
38 virtual int heuristicNameMatch ( const char * hint ) const
39
40 Static Public Members
41 QTextCodec * loadCharmap ( QIODevice * iod )
42 QTextCodec * loadCharmapFile ( QString filename )
43 QTextCodec * codecForMib ( int mib )
44 QTextCodec * codecForName ( const char * name, int accuracy = 0 )
45 QTextCodec * codecForContent ( const char * chars, int len )
46 QTextCodec * codecForIndex ( int i )
47 QTextCodec * codecForLocale ()
48 void setCodecForLocale ( QTextCodec * c )
49 QTextCodec * codecForTr ()
50 void setCodecForTr ( QTextCodec * c )
51 QTextCodec * codecForCStrings ()
52 void setCodecForCStrings ( QTextCodec * c )
53 void deleteAllCodecs ()
54 const char * locale ()
55
56 Protected Members
57 QTextCodec ()
58
59 Static Protected Members
60 int simpleHeuristicNameMatch ( const char * name, const char * hint )
61
63 The QTextCodec class provides conversion between text encodings.
64
65 Qt uses Unicode to store, draw and manipulate strings. In many
66 situations you may wish to deal with data that uses a different
67 encoding. For example, most Japanese documents are still stored in
68 Shift-JIS or ISO2022, while Russian users often have their documents in
69 KOI8-R or CP1251.
70
71 Qt provides a set of QTextCodec classes to help with converting non-
72 Unicode formats to and from Unicode. You can also create your own codec
73 classes (see later).
74
75 The supported encodings are:
76
77 Latin1
78
79 Big5 -- Chinese
80
81 Big5-HKSCS -- Chinese
82
83 eucJP -- Japanese
84
85 eucKR -- Korean
86
87 GB2312 -- Chinese
88
89 GBK -- Chinese
90
91 GB18030 -- Chinese
92
93 JIS7 -- Japanese
94
95 Shift-JIS -- Japanese
96
97 TSCII -- Tamil
98
99 utf8 -- Unicode, 8-bit
100
101 utf16 -- Unicode
102
103 KOI8-R -- Russian
104
105 KOI8-U -- Ukrainian
106
107 ISO8859-1 -- Western
108
109 ISO8859-2 -- Central European
110
ISO8859-3 -- South European
112
113 ISO8859-4 -- Baltic
114
115 ISO8859-5 -- Cyrillic
116
117 ISO8859-6 -- Arabic
118
119 ISO8859-7 -- Greek
120
121 ISO8859-8 -- Hebrew, visually ordered
122
123 ISO8859-8-i -- Hebrew, logically ordered
124
125 ISO8859-9 -- Turkish
126
127 ISO8859-10
128
129 ISO8859-13
130
131 ISO8859-14
132
133 ISO8859-15 -- Western
134
135 IBM 850
136
137 IBM 866
138
139 CP874
140
141 CP1250 -- Central European
142
143 CP1251 -- Cyrillic
144
145 CP1252 -- Western
146
147 CP1253 -- Greek
148
149 CP1254 -- Turkish
150
151 CP1255 -- Hebrew
152
153 CP1256 -- Arabic
154
155 CP1257 -- Baltic
156
157 CP1258
158
159 Apple Roman
160
161 TIS-620 -- Thai
162
163 QTextCodecs can be used as follows to convert some locally encoded
164 string to Unicode. Suppose you have some string encoded in Russian
165 KOI8-R encoding, and want to convert it to Unicode. The simple way to
166 do this is:
167
    QCString locallyEncoded = "...";   // text to convert
    QTextCodec *codec = QTextCodec::codecForName("KOI8-R");   // get the codec for KOI8-R
    QString unicodeString = codec->toUnicode( locallyEncoded );
171
172 After this, unicodeString holds the text converted to Unicode.
173 Converting a string from Unicode to the local encoding is just as easy:
174
    QString unicodeString = "...";   // any Unicode text
    QTextCodec *codec = QTextCodec::codecForName("KOI8-R");   // get the codec for KOI8-R
    QCString locallyEncoded = codec->fromUnicode( unicodeString );
178
179 Some care must be taken when trying to convert the data in chunks, for
180 example, when receiving it over a network. In such cases it is possible
181 that a multi-byte character will be split over two chunks. At best this
182 might result in the loss of a character and at worst cause the entire
183 conversion to fail.
184
185 The approach to use in these situations is to create a QTextDecoder
186 object for the codec and use this QTextDecoder for the whole decoding
187 process, as shown below:
188
    QTextCodec *codec = QTextCodec::codecForName( "Shift-JIS" );
    QTextDecoder *decoder = codec->makeDecoder();   // keeps state between chunks
    QString unicodeString;
    while( receiving_data ) {
        QByteArray chunk = new_data;
        unicodeString += decoder->toUnicode( chunk.data(), chunk.size() );
    }
    delete decoder;   // the caller owns the decoder
196
197 The QTextDecoder object maintains state between chunks and therefore
198 works correctly even if a multi-byte character is split between chunks.
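
The effect of keeping state between chunks can be illustrated without
Qt. The following is a minimal sketch in standard C++, assuming UTF-8
input; the class name ChunkedUtf8Decoder and its members are
hypothetical and are not part of Qt. It remembers an unfinished
multi-byte sequence so that a character split across two calls is still
decoded exactly once. It does not reject overlong or otherwise malformed
sequences.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for a stateful decoder; this is NOT Qt's QTextDecoder.
// It decodes UTF-8 and remembers an unfinished sequence between calls.
class ChunkedUtf8Decoder {
public:
    // Decode len bytes and return the code points completed by this chunk.
    std::u32string toUnicode(const char* chars, int len) {
        std::u32string out;
        for (int i = 0; i < len; ++i) {
            const std::uint8_t b = static_cast<std::uint8_t>(chars[i]);
            if (pending_ > 0 && (b & 0xC0) == 0x80) {
                // Continuation byte: append its six payload bits.
                value_ = (value_ << 6) | (b & 0x3F);
                if (--pending_ == 0)
                    out.push_back(value_);
                continue;
            }
            if (pending_ > 0) {
                // The sequence was cut short by a non-continuation byte.
                out.push_back(0xFFFD);
                pending_ = 0;
            }
            if (b < 0x80) {
                out.push_back(b);                 // plain ASCII
            } else if ((b & 0xE0) == 0xC0) {
                value_ = b & 0x1F; pending_ = 1;  // two-byte sequence
            } else if ((b & 0xF0) == 0xE0) {
                value_ = b & 0x0F; pending_ = 2;  // three-byte sequence
            } else if ((b & 0xF8) == 0xF0) {
                value_ = b & 0x07; pending_ = 3;  // four-byte sequence
            } else {
                out.push_back(0xFFFD);            // stray or invalid byte
            }
        }
        return out;
    }

private:
    char32_t value_ = 0;  // bits collected for the current code point
    int pending_ = 0;     // continuation bytes still expected
};
```

If the two bytes of one character arrive in separate calls, the first
call returns nothing for that character and the second call completes
it, which is the property a per-stream decoder object provides.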
199
201 Support for new text encodings can be added to Qt by creating
202 QTextCodec subclasses.
203
204 Built-in codecs can be overridden by custom codecs since more recently
205 created QTextCodec objects take precedence over earlier ones.
206
207 You may find it more convenient to make your codec class available as a
208 plugin; see the plugin documentation for more details.
209
The pure virtual functions describe the codec to the system. Qt then
uses the codec wherever it is required: for the different text file
formats supported by QTextStream and, under X11, for locale-specific
character input and output.
214
215 To add support for another 8-bit encoding to Qt, make a subclass of
216 QTextCodec and implement at least the following methods:
217
218 const char* name() const
219 Return the official name for the encoding.
220
221 int mibEnum() const
222 Return the MIB enum for the encoding if it is listed in the IANA
223 character-sets encoding file.
224
225 If the encoding is multi-byte then it will have "state"; that is, the
226 interpretation of some bytes will be dependent on some preceding bytes.
227 For such encodings, you must implement:
228
229 QTextDecoder* makeDecoder() const
230 Return a QTextDecoder that remembers incomplete multi-byte sequence
231 prefixes or other required state.
232
233 If the encoding does not require state, you should implement:
234
235 QString toUnicode(const char* chars, int len) const
236 Converts len characters from chars to Unicode.
237
238 The base QTextCodec class has default implementations of the above two
239 functions, but they are mutually recursive, so you must re-implement at
240 least one of them, or both for improved efficiency.
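
The mutual recursion between the two defaults can be sketched in
standard C++ as follows. All class names here (SketchCodec,
SketchDecoder, ForwardingDecoder, Latin1Sketch) are hypothetical
stand-ins and not Qt classes; the sketch only illustrates why overriding
one of the two functions is enough to break the cycle.

```cpp
#include <string>

// Hypothetical decoder interface (stands in for QTextDecoder).
class SketchDecoder {
public:
    virtual ~SketchDecoder() {}
    virtual std::u32string toUnicode(const char* chars, int len) = 0;
};

// Hypothetical codec base class (stands in for QTextCodec).
class SketchCodec {
public:
    virtual ~SketchCodec() {}

    // Default: hand out a decoder that forwards to this->toUnicode().
    virtual SketchDecoder* makeDecoder() const;

    // Default: create a decoder and let it do the work.
    virtual std::u32string toUnicode(const char* chars, int len) const {
        SketchDecoder* d = makeDecoder();
        std::u32string result = d->toUnicode(chars, len);
        delete d;
        return result;
    }
};

// The stateless decoder returned by the default makeDecoder().
class ForwardingDecoder : public SketchDecoder {
public:
    explicit ForwardingDecoder(const SketchCodec* c) : codec_(c) {}
    std::u32string toUnicode(const char* chars, int len) override {
        return codec_->toUnicode(chars, len);
    }
private:
    const SketchCodec* codec_;
};

inline SketchDecoder* SketchCodec::makeDecoder() const {
    return new ForwardingDecoder(this);
}

// A stateless codec overrides only toUnicode(); that breaks the recursion.
class Latin1Sketch : public SketchCodec {
public:
    std::u32string toUnicode(const char* chars, int len) const override {
        std::u32string out;
        for (int i = 0; i < len; ++i)
            out.push_back(static_cast<unsigned char>(chars[i]));
        return out;
    }
};
```

A subclass that overrode neither function would recurse without end on
the first call, which is why at least one override is required.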
241
242 For conversion from Unicode to 8-bit encodings, it is rarely necessary
243 to maintain state. However, two functions similar to the two above are
244 used for encoding:
245
246 QTextEncoder* makeEncoder() const
247 Return a QTextEncoder.
248
249 QCString fromUnicode(const QString& uc, int& lenInOut ) const
250 Converts lenInOut characters (of type QChar) from the start of the
251 string uc, returning a QCString result, and also returning the length
252 of the result in lenInOut.
253
254 Again, these are mutually recursive so only one needs to be
255 implemented, or both if greater efficiency is possible.
256
257 Finally, you must implement:
258
259 int heuristicContentMatch(const char* chars, int len) const
260 Gives a value indicating how likely it is that len characters from
261 chars are in the encoding.
262
263 A good model for this function is the
264 QWindowsLocalCodec::heuristicContentMatch function found in the Qt
265 sources.
266
267 A QTextCodec subclass might have improved performance if you also re-
268 implement:
269
270 bool canEncode( QChar ) const
271 Test if a Unicode character can be encoded.
272
273 bool canEncode( const QString& ) const
274 Test if a string of Unicode characters can be encoded.
275
276 int heuristicNameMatch(const char* hint) const
277 Test if a possibly non-standard name is referring to the codec.
278
279 Codecs can also be created as plugins.
280
281 See also Internationalization with Qt.
282
Warning: This function is not reentrant.
286
287 Constructs a QTextCodec, and gives it the highest precedence. The
288 QTextCodec should always be constructed on the heap (i.e. with new). Qt
289 takes ownership and will delete it when the application terminates.
290
Warning: This function is not reentrant.
293
294 Destroys the QTextCodec. Note that you should not delete codecs
295 yourself: once created they become Qt's responsibility.
296
298 Returns TRUE if the Unicode character ch can be fully encoded with this
299 codec; otherwise returns FALSE. The default implementation tests if the
300 result of toUnicode(fromUnicode(ch)) is the original ch. Subclasses may
301 be able to improve the efficiency.
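
The round-trip test described above can be sketched in standard C++. The
helper names (encodeLatin1, decodeLatin1, canEncodeLatin1) are
hypothetical, and replacing unencodable characters with '?' is an
assumption of this sketch rather than documented Qt behaviour.

```cpp
#include <string>

// Hypothetical Latin-1 encoder: characters above U+00FF become '?'.
inline std::string encodeLatin1(const std::u32string& uc) {
    std::string out;
    for (char32_t ch : uc)
        out.push_back(ch <= 0xFF
                          ? static_cast<char>(static_cast<unsigned char>(ch))
                          : '?');
    return out;
}

// Hypothetical Latin-1 decoder: every byte maps to the same code point.
inline std::u32string decodeLatin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes)
        out.push_back(b);
    return out;
}

// A character is encodable if encoding and decoding it gives it back.
inline bool canEncodeLatin1(char32_t ch) {
    const std::u32string original(1, ch);
    return decodeLatin1(encodeLatin1(original)) == original;
}
```

A subclass that knows its repertoire can usually answer with a range
check instead, which avoids the two conversions.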
302
304 This is an overloaded member function, provided for convenience. It
305 behaves essentially like the above function.
306
307 s contains the string being tested for encode-ability.
308
310 Returns the codec used by QString to convert to and from const char*
311 and QCStrings. If this function returns 0 (the default), QString
312 assumes Latin-1.
313
314 See also setCodecForCStrings().
315
QTextCodec * QTextCodec::codecForContent ( const char * chars, int len )
[static]
318 Searches all installed QTextCodec objects, returning the one which most
319 recognizes the given content. May return 0.
320
321 Note that this is often a poor choice, since character encodings often
322 use most of the available character sequences, and so only by
323 linguistic analysis could a true match be made.
324
325 chars contains the string to check, and len contains the number of
326 characters in the string to use.
327
328 See also heuristicContentMatch().
329
330 Example: qwerty/qwerty.cpp.
331
333 Returns the QTextCodec i positions from the most recently inserted
334 codec, or 0 if there is no such QTextCodec. Thus, codecForIndex(0)
335 returns the most recently created QTextCodec.
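
The "most recent first" ordering can be pictured with a small registry
in standard C++. This is only a sketch: RegisteredCodec and
CodecRegistrySketch are hypothetical names, and the lookup uses exact
name comparison rather than Qt's heuristic scoring.

```cpp
#include <string>
#include <vector>

// Hypothetical record for a registered codec.
struct RegisteredCodec {
    std::string name;
    int id;
};

// Hypothetical registry; index 0 is always the newest entry.
class CodecRegistrySketch {
public:
    void add(const RegisteredCodec& c) { codecs_.push_back(c); }

    // Returns the entry i positions from the newest one, or 0 if none.
    const RegisteredCodec* forIndex(int i) const {
        const int n = static_cast<int>(codecs_.size());
        return (i >= 0 && i < n) ? &codecs_[n - 1 - i] : 0;
    }

    // Newer entries are examined first, so they shadow older ones.
    const RegisteredCodec* forName(const std::string& name) const {
        for (int i = 0; const RegisteredCodec* c = forIndex(i); ++i)
            if (c->name == name)
                return c;
        return 0;
    }

private:
    std::vector<RegisteredCodec> codecs_;
};
```

Registering a second entry under an existing name makes lookups return
the newer one, which mirrors how a custom codec can take precedence over
a built-in one.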
336
337 Example: qwerty/qwerty.cpp.
338
340 Returns a pointer to the codec most suitable for this locale.
341
342 Example: qwerty/qwerty.cpp.
343
345 Returns the QTextCodec which matches the MIBenum mib.
346
QTextCodec * QTextCodec::codecForName ( const char * name, int accuracy = 0 )
[static]
349 Searches all installed QTextCodec objects and returns the one which
350 best matches name; the match is case-insensitive. Returns 0 if no
351 codec's heuristicNameMatch() reports a match better than accuracy, or
352 if name is a null string.
353
354 See also heuristicNameMatch().
355
357 Returns the codec used by QObject::tr() on its argument. If this
358 function returns 0 (the default), tr() assumes Latin-1.
359
360 See also setCodecForTr().
361
363 Deletes all the created codecs.
364
365 Warning: Do not call this function.
366
367 QApplication calls this function just before exiting to delete any
368 QTextCodec objects that may be lying around. Since various other
369 classes hold pointers to QTextCodec objects, it is not safe to call
370 this function earlier.
371
372 If you are using the utility classes (like QString) but not using
373 QApplication, calling this function at the very end of your application
374 may be helpful for chasing down memory leaks by eliminating any
375 QTextCodec objects.
376
QCString QTextCodec::fromUnicode ( const QString & uc, int & lenInOut ) const
[virtual]
QTextCodec subclasses must reimplement either this function or
makeEncoder(). It converts the first lenInOut characters (QChars, not
bytes) of uc from Unicode to the encoding of the subclass and returns
the result as a QCString. If lenInOut is negative or too large, the
length of uc is used instead.

On return, lenInOut is set to the length of the result in bytes.
386
387 The default implementation makes an encoder with makeEncoder() and
388 converts the input with that. Note that the default makeEncoder()
389 implementation makes an encoder that simply calls this function, hence
390 subclasses must reimplement one function or the other to avoid infinite
391 recursion.
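
The in/out length convention can be sketched in standard C++ as follows.
The function name fromUnicodeSketch is hypothetical, Latin-1 is used
only because it keeps the example short, and replacing unencodable
characters with '?' is an assumption of the sketch.

```cpp
#include <string>

// Hypothetical Latin-1 encoder that follows the lenInOut convention:
// on entry lenInOut is a character count, on return it is a byte count.
inline std::string fromUnicodeSketch(const std::u16string& uc, int& lenInOut) {
    int n = lenInOut;
    if (n < 0 || n > static_cast<int>(uc.size()))
        n = static_cast<int>(uc.size());   // negative or too large: use all of uc

    std::string out;
    for (int i = 0; i < n; ++i) {
        const char16_t ch = uc[i];
        out.push_back(ch <= 0xFF
                          ? static_cast<char>(static_cast<unsigned char>(ch))
                          : '?');
    }
    lenInOut = static_cast<int>(out.size());   // length of the result in bytes
    return out;
}
```

For a single-byte encoding the two counts coincide; for a multi-byte
encoding the value written back to lenInOut would generally be larger
than the number of characters consumed.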
392
393 Reimplemented in QHebrewCodec.
394
396 This is an overloaded member function, provided for convenience. It
397 behaves essentially like the above function.
398
uc is the Unicode source string.
400
int QTextCodec::heuristicContentMatch ( const char * chars, int len ) const
[pure virtual]
403 QTextCodec subclasses must reimplement this function. It examines the
404 first len bytes of chars and returns a value indicating how likely it
405 is that the string is a prefix of text encoded in the encoding of the
406 subclass. A negative return value indicates that the text is detectably
407 not in the encoding (e.g. it contains characters undefined in the
408 encoding). A return value of 0 indicates that the text should be
409 decoded with this codec rather than as ASCII, but there is no
410 particular evidence. The value should range up to len. Thus, most
411 decoders will return -1, 0, or -len.
412
413 The characters are not null terminated.
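
A content heuristic for a simple single-byte encoding might look like
the sketch below, written in standard C++. The function name
heuristicContentMatchSketch is hypothetical, and treating the bytes
0x80 to 0x9F as evidence against a Latin-1-like encoding is an
assumption of the sketch, not a description of any Qt codec.

```cpp
// Hypothetical content heuristic; not Qt's implementation.
// Returns -1 when a byte is treated as undefined in the encoding,
// otherwise the number of bytes examined, so the score ranges up to len.
inline int heuristicContentMatchSketch(const char* chars, int len) {
    if (!chars || len < 1)
        return -1;
    for (int i = 0; i < len; ++i) {
        // The input is not null terminated, so only len bounds the loop.
        const unsigned char c = static_cast<unsigned char>(chars[i]);
        // Assumption: 0x80..0x9F (C1 controls) do not occur in such text.
        if (c >= 0x80 && c <= 0x9F)
            return -1;
    }
    return len;
}
```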
414
415 See also codecForContent().
416
418 Returns a value indicating how likely it is that this decoder is
419 appropriate for decoding some format that has the given name. The name
420 is compared with the hint.
421
422 A good match returns a positive number around the length of the string.
423 A bad match is negative.
424
425 The default implementation calls simpleHeuristicNameMatch() with the
426 name of the codec.
427
429 Reads a POSIX2 charmap definition from iod. The parser recognizes the
430 following lines:
431
    <code_set_name> name
    <escape_char> character
    % alias alias
    CHARMAP
    <token> /xhexbyte <Uunicode> ...
    <token> /ddecbyte <Uunicode> ...
    <token> /octbyte <Uunicode> ...
    <token> /any/any... <Uunicode> ...
    END CHARMAP
437
438 The resulting QTextCodec is returned (and also added to the global list
439 of codecs). The name() of the result is taken from the code_set_name.
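
To make the mapping-line syntax concrete, here is a minimal sketch in
standard C++ that parses one single-byte line of the hexadecimal form.
CharmapEntry and parseCharmapLine are hypothetical names; the sketch
assumes '/' is the escape character and ignores the decimal, octal and
multi-byte forms. It is not the parser Qt uses.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical result of parsing one mapping line.
struct CharmapEntry {
    bool ok;                // false if the line is not a /x mapping line
    unsigned byteValue;     // the encoded byte
    unsigned long unicode;  // the Unicode code point
};

// Parse a line such as "<A> /x41 <U0041>".
inline CharmapEntry parseCharmapLine(const std::string& line) {
    CharmapEntry e = {false, 0, 0};

    // Locate the hexadecimal byte, written as /xHH.
    const std::string::size_type slash = line.find("/x");
    if (slash == std::string::npos)
        return e;

    // Locate the Unicode value, written as <UHHHH>, after the byte.
    const std::string::size_type uni = line.find("<U", slash);
    if (uni == std::string::npos)
        return e;

    // strtoul stops at the first character that is not a hex digit.
    e.byteValue = static_cast<unsigned>(
        std::strtoul(line.c_str() + slash + 2, 0, 16));
    e.unicode = std::strtoul(line.c_str() + uni + 2, 0, 16);
    e.ok = true;
    return e;
}
```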
440
441 Note that a codec constructed in this way uses much more memory and is
442 slower than a hand-written QTextCodec subclass, since tables in code
443 are kept in memory shared by all Qt applications.
444
445 See also loadCharmapFile().
446
447 Example: qwerty/qwerty.cpp.
448
450 A convenience function for loadCharmap() that loads the charmap
451 definition from the file filename.
452
454 Returns a string representing the current language and sublanguage,
455 e.g. "pt" for Portuguese, or "pt_br" for Portuguese/Brazil.
456
457 Example: i18n/main.cpp.
458
460 Creates a QTextDecoder which stores enough state to decode chunks of
461 char* data to create chunks of Unicode data. The default implementation
462 creates a stateless decoder, which is only sufficient for the simplest
463 encodings where each byte corresponds to exactly one Unicode character.
464
465 The caller is responsible for deleting the returned object.
466
Creates a QTextEncoder which stores enough state to encode chunks of
Unicode data as char* data. The default implementation creates a
stateless encoder, which is only sufficient for the simplest encodings
where each Unicode character corresponds to exactly one byte.
472
473 The caller is responsible for deleting the returned object.
474
476 Subclasses of QTextCodec must reimplement this function. It returns the
477 MIBenum (see the IANA character-sets encoding file for more
478 information). It is important that each QTextCodec subclass returns the
479 correct unique value for this function.
480
481 Reimplemented in QEucJpCodec.
482
484 Returns the preferred mime name of the encoding as defined in the IANA
485 character-sets encoding file.
486
487 Reimplemented in QEucJpCodec, QEucKrCodec, QJisCodec, QHebrewCodec, and
488 QSjisCodec.
489
491 QTextCodec subclasses must reimplement this function. It returns the
492 name of the encoding supported by the subclass. When choosing a name
493 for an encoding, consider these points:
494
On X11, heuristicNameMatch( const char * hint ) is used to test if the
QTextCodec can convert between Unicode and the encoding of a font with
encoding hint, such as "iso8859-1" for Latin-1 fonts or "koi8-r" for
Russian KOI8 fonts. The default algorithm of heuristicNameMatch() uses
name().
500
501 Some applications may use this function to present encodings to the end
502 user.
503
504 Example: qwerty/qwerty.cpp.
505
Warning: This function is not reentrant.
508
509 Sets the codec used by QString to convert to and from const char* and
510 QCStrings. If c is 0 (the default), QString assumes Latin-1.
511
Warning: Some codecs do not preserve the characters in the ASCII range
(0x00 to 0x7f). For example, the Japanese Shift-JIS encoding maps the
backslash character (0x5c) to the Yen character. This leads to
unexpected results when using the backslash character to escape
characters in strings used in e.g. regular expressions. Use
QString::fromLatin1() to preserve characters in the ASCII range when
needed.
519
520 See also codecForCStrings() and setCodecForTr().
521
523 Set the codec to c; this will be returned by codecForLocale(). This
524 might be needed for some applications that want to use their own
525 mechanism for setting the locale.
526
527 See also codecForLocale().
528
Warning: This function is not reentrant.
531
532 Sets the codec used by QObject::tr() on its argument to c. If c is 0
533 (the default), tr() assumes Latin-1.
534
535 If the literal quoted text in the program is not in the Latin-1
536 encoding, this function can be used to set the appropriate encoding.
537 For example, software developed by Korean programmers might use eucKR
538 for all the text in the program, in which case the main() function
539 might look like this:
540
    int main(int argc, char** argv)
    {
        QApplication app(argc, argv);
        // ... install any additional codecs ...
        QTextCodec::setCodecForTr( QTextCodec::codecForName("eucKR") );
        // ...
    }
548
549 Note that this is not the way to select the encoding that the user has
550 chosen. For example, to convert an application containing literal
551 English strings to Korean, all that is needed is for the English
552 strings to be passed through tr() and for translation files to be
553 loaded. For details of internationalization, see the Qt
554 internationalization documentation.
555
556 See also codecForTr() and setCodecForCStrings().
557
int QTextCodec::simpleHeuristicNameMatch ( const char * name, const char *
hint ) [static protected]
560 A simple utility function for heuristicNameMatch(): it does some very
561 minor character-skipping so that almost-exact matches score high. name
562 is the text we're matching and hint is used for the comparison.
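
The idea of a forgiving name comparison can be sketched in standard C++
as below. The function nameMatchSketch is hypothetical and is not Qt's
algorithm; it assumes that case and non-alphanumeric characters should
be ignored, returns the number of matched characters on success, and
returns -1 on a mismatch.

```cpp
#include <cctype>

// Hypothetical name matcher; a loose sketch, not Qt's algorithm.
// "ISO 8859-1" and "iso8859-1" are treated as the same name.
inline int nameMatchSketch(const char* name, const char* hint) {
    if (!name || !hint)
        return -1;
    int score = 0;
    while (true) {
        // Skip separators such as spaces, hyphens and underscores.
        while (*name && !std::isalnum(static_cast<unsigned char>(*name)))
            ++name;
        while (*hint && !std::isalnum(static_cast<unsigned char>(*hint)))
            ++hint;
        if (!*name || !*hint)
            break;
        // Compare the next significant characters without regard to case.
        if (std::tolower(static_cast<unsigned char>(*name)) !=
            std::tolower(static_cast<unsigned char>(*hint)))
            return -1;
        ++name;
        ++hint;
        ++score;
    }
    // Both strings must be fully consumed for a match.
    return (*name || *hint) ? -1 : score;
}
```

A good match therefore scores close to the length of the name, and a
bad match is negative, in line with the convention described for
heuristicNameMatch().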
563
565 QTextCodec subclasses must reimplement this function or makeDecoder().
566 It converts the first len characters of chars to Unicode.
567
568 The default implementation makes a decoder with makeDecoder() and
569 converts the input with that. Note that the default makeDecoder()
570 implementation makes a decoder that simply calls this function, hence
571 subclasses must reimplement one function or the other to avoid infinite
572 recursion.
573
575 This is an overloaded member function, provided for convenience. It
576 behaves essentially like the above function.
577
578 a contains the source characters; len contains the number of characters
579 in a to use.
580
582 This is an overloaded member function, provided for convenience. It
583 behaves essentially like the above function.
584
585 a contains the source characters.
586
588 This is an overloaded member function, provided for convenience. It
589 behaves essentially like the above function.
590
591 a contains the source characters; len contains the number of characters
592 in a to use.
593
595 This is an overloaded member function, provided for convenience. It
596 behaves essentially like the above function.
597
598 a contains the source characters.
599
601 This is an overloaded member function, provided for convenience. It
602 behaves essentially like the above function.
603
604 chars contains the source characters.
605
606
608 http://doc.trolltech.com/qtextcodec.html
609 http://www.trolltech.com/faq/tech.html
610
612 Copyright 1992-2007 Trolltech ASA, http://www.trolltech.com. See the
613 license file included in the distribution for a complete license
614 statement.
615
617 Generated automatically from the source code.
618
620 If you find a bug in Qt, please report it as described in
621 http://doc.trolltech.com/bughowto.html. Good bug reports help us to
622 help you. Thank you.
623
624 The definitive Qt documentation is provided in HTML format; it is
625 located at $QTDIR/doc/html and can be read using Qt Assistant or with a
626 web browser. This man page is provided as a convenience for those users
627 who prefer man pages, although this format is not officially supported
628 by Trolltech.
629
630 If you find errors in this manual page, please report them to qt-
631 bugs@trolltech.com. Please include the name of the manual page
632 (qtextcodec.3qt) and the Qt version (3.3.8).
633
634
635
636Trolltech AS 2 February 2007 QTextCodec(3qt)