org.exist.util
Class UTF8

java.lang.Object
  extended by org.exist.util.UTF8

public class UTF8
extends Object

This class provides static utility methods for UTF-8 encoding and decoding.

UTF-8 is ASCII-transparent: characters in the seven-bit ASCII range are encoded unchanged. It can also represent characters beyond that range, including those of Unicode, ISO-8859, ISO-10646, etc.

We do not use an ISO UCS code signature, and we do not use a Java Data I/O-style strlen prefix.
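
To make the difference in framing concrete, the following is a minimal sketch that uses only standard JDK classes, not this class. It contrasts plain UTF-8 bytes with the output of DataOutputStream.writeUTF, which prepends a two-byte length. The class and method names (Utf8FramingSketch, plainUtf8, dataOutputUtf) are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of org.exist.util.
class Utf8FramingSketch {

    /** Plain UTF-8 bytes as produced by the JDK: no signature and no length prefix. */
    public static byte[] plainUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Java Data I/O framing: a 2-byte length followed by modified UTF-8. */
    public static byte[] dataOutputUtf(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            // Writing to an in-memory stream is not expected to fail.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(plainUtf8("A").length);      // ASCII stays a single byte
        System.out.println(dataOutputUtf("A").length);  // length prefix plus payload
        System.out.println(plainUtf8("\u00e9").length); // non-ASCII needs more bytes
    }
}
```

The ASCII letter "A" stays a single byte (0x41) in the plain form, which is what ASCII transparency means in practice.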

Author:
John Pritchard (john@syntelos.org)

Constructor Summary
UTF8()
           
 
Method Summary
static XMLString decode(byte[] code)
          Decodes UTF-8 input, terminating at a null character (value 0x0).
static XMLString decode(byte[] code, int off, int many)
           
static XMLString decode(byte[] code, int off, int many, XMLString xs)
          Decodes UTF-8 input, terminating at a null character (value 0x0).
static byte[] encode(char[] str)
          Encode string in UTF-8.
static byte[] encode(char[] str, int start, int length, byte[] bytbuf, int offset)
          Encode string in UTF-8.
static byte[] encode(String s)
          Encode string in UTF-8.
static byte[] encode(String str, byte[] bytbuf, int offset)
           
static byte[] encode(String str, int start, int length, byte[] bytbuf, int offset)
          Encode string in UTF-8.
static int encoded(char[] str, int start, int len)
          Returns the length of the string encoded in UTF-8.
static int encoded(String str)
          Returns the length of the string encoded in UTF-8.
static int getUTF8Encoding(char in, char in2, byte[] out)
          Static method to generate the UTF-8 representation of a Unicode character.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

UTF8

public UTF8()
Method Detail

decode

public static final XMLString decode(byte[] code)
Decodes UTF-8 input, terminating at a null character (value 0x0).

Throws:
IllegalStateException - Bad format.

decode

public static final XMLString decode(byte[] code,
                                     int off,
                                     int many)

decode

public static final XMLString decode(byte[] code,
                                     int off,
                                     int many,
                                     XMLString xs)
Decodes UTF-8 input, terminating at a null character (value 0x0).

Throws:
IllegalStateException - Bad format.
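
The null-terminated behaviour can be sketched with the standard JDK decoder. This is an illustration only, not the implementation in this class. It assumes that off is a starting index and many is a byte count (the source does not document either parameter), and it returns a String rather than an XMLString. The class name NullTerminatedDecodeSketch is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of org.exist.util.
class NullTerminatedDecodeSketch {

    /**
     * Decodes at most many bytes of code starting at off,
     * stopping early at the first 0x0 byte.
     */
    public static String decode(byte[] code, int off, int many) {
        int end = off;
        int limit = off + many;
        // In UTF-8 a 0x00 byte only ever encodes U+0000, so a byte scan is safe.
        while (end < limit && code[end] != 0) {
            end++;
        }
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(code, off, end - off))
                    .toString();
        } catch (CharacterCodingException e) {
            // Mirrors the documented "Bad format" failure mode.
            throw new IllegalStateException("Bad format", e);
        }
    }
}
```

Bytes after the first 0x0 are ignored, so a buffer that is larger than its content decodes cleanly as long as the unused tail is zero-filled.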

encode

public static final byte[] encode(char[] str)
Encode string in UTF-8.


encode

public static final byte[] encode(char[] str,
                                  int start,
                                  int length,
                                  byte[] bytbuf,
                                  int offset)
Encode string in UTF-8. Warning: the size of bytbuf is not checked. Use encoded() to determine the size needed.


encode

public static final byte[] encode(String str,
                                  byte[] bytbuf,
                                  int offset)

encode

public static final byte[] encode(String str,
                                  int start,
                                  int length,
                                  byte[] bytbuf,
                                  int offset)
Encode string in UTF-8. Warning: the size of bytbuf is not checked. Use encoded() to determine the size needed.


encode

public static final byte[] encode(String s)
Encode string in UTF-8.


encoded

public static final int encoded(String str)
Returns the length of the string encoded in UTF-8.


encoded

public static final int encoded(char[] str,
                                int start,
                                int len)
Returns the length of the string encoded in UTF-8.
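
As a rough guide to what such a length computation involves, here is a minimal standalone sketch. It is not the code of this class. It assumes that a valid surrogate pair counts as one four-byte sequence and that an unpaired surrogate counts as three bytes; the class may handle that case differently. The names Utf8LengthSketch and encodedLength are hypothetical.

```java
// Hypothetical illustration class; not part of org.exist.util.
class Utf8LengthSketch {

    /** Returns the number of bytes needed to encode str in UTF-8. */
    public static int encodedLength(CharSequence str) {
        int bytes = 0;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c < 0x80) {
                bytes += 1;                 // ASCII range
            } else if (c < 0x800) {
                bytes += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < str.length()
                    && Character.isLowSurrogate(str.charAt(i + 1))) {
                bytes += 4;                 // surrogate pair forms one code point
                i++;                        // skip the low surrogate
            } else {
                bytes += 3;                 // rest of the BMP (assumption: also unpaired surrogates)
            }
        }
        return bytes;
    }
}
```

A caller that supplies its own buffer to one of the encode methods would need at least offset plus this many bytes, since the buffer size is not checked.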


getUTF8Encoding

public static int getUTF8Encoding(char in,
                                  char in2,
                                  byte[] out)
Static method to generate the UTF-8 representation of a Unicode character. This particular code is taken from Saxon (see http://saxon.sf.net).

Parameters:
in - the Unicode character, or the high half of a surrogate pair
in2 - the low half of a surrogate pair (ignored unless the first argument is in the range for a surrogate pair)
out - an array of at least 4 bytes to hold the UTF-8 representation.
Returns:
the number of bytes in the UTF-8 representation
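
The contract described by these parameters can be sketched as follows. This is a standalone illustration of the standard UTF-8 bit layout, not the Saxon-derived code itself. It assumes that in2 is consulted only when in is a high surrogate and in2 is a low surrogate. The names Utf8CodePointSketch and toUtf8 are hypothetical.

```java
// Hypothetical illustration class; not part of org.exist.util.
class Utf8CodePointSketch {

    /** Writes the UTF-8 bytes for one character into out and returns the byte count. */
    public static int toUtf8(char in, char in2, byte[] out) {
        int cp = in;
        if (Character.isHighSurrogate(in) && Character.isLowSurrogate(in2)) {
            cp = Character.toCodePoint(in, in2); // combine the surrogate pair
        }
        if (cp < 0x80) {
            out[0] = (byte) cp;
            return 1;
        }
        if (cp < 0x800) {
            out[0] = (byte) (0xC0 | (cp >> 6));
            out[1] = (byte) (0x80 | (cp & 0x3F));
            return 2;
        }
        if (cp < 0x10000) {
            out[0] = (byte) (0xE0 | (cp >> 12));
            out[1] = (byte) (0x80 | ((cp >> 6) & 0x3F));
            out[2] = (byte) (0x80 | (cp & 0x3F));
            return 3;
        }
        out[0] = (byte) (0xF0 | (cp >> 18));
        out[1] = (byte) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (byte) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (byte) (0x80 | (cp & 0x3F));
        return 4;
    }
}
```

The four-byte case is why out must hold at least 4 bytes: a surrogate pair always encodes to a four-byte sequence.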


Copyright (C) Wolfgang Meier. All rights reserved.