[Java] Specifying the character set that is used

Notes on how the default character set used by Java (the virtual machine) is determined.

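The conclusion first:

1. Specification at the API level
2. Specification via the system property file.encoding
3. The LANG environment variable of the user running Java (if LC_ALL or LC_CTYPE is set, it takes precedence over LANG)

The character set is decided in this order (1 has the highest priority).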
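The priority order above can be sketched in a few lines. This is only an illustration: CharsetResolution, resolve, and encode are hypothetical names, not part of the JDK. The sketch assumes that "API level" means passing a Charset explicitly, and that everything else falls back to Charset.defaultCharset(), which the JVM derives from file.encoding and, in turn, from the locale.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetResolution {
  // Hypothetical helper: an explicitly passed charset wins (level 1);
  // otherwise fall back to the JVM default (levels 2 and 3).
  static Charset resolve(Charset explicit) {
    return explicit != null ? explicit : Charset.defaultCharset();
  }

  // Encodes the string with whichever charset resolve() selects.
  static byte[] encode(String s, Charset explicit) {
    return s.getBytes(resolve(explicit));
  }

  public static void main(String[] args) {
    System.out.println("default  : " + resolve(null));
    System.out.println("explicit : " + resolve(StandardCharsets.UTF_8));
  }
}
```

Passing the charset explicitly, as in encode(s, StandardCharsets.UTF_8), keeps the result independent of file.encoding and LANG.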

A quick check of this behavior. The sample code is as follows.

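class Test {
  public static void main(String[] args) throws Exception {
    String s = "あ"; // U+3042 HIRAGANA LETTER A
    byte[] b;

    if (args.length == 0) {
      // No argument: encode with the JVM's default charset.
      b = s.getBytes();
    } else if (args.length == 1) {
      // One argument: encode with the charset named on the command line.
      b = s.getBytes(args[0]);
    } else {
      System.err.println("usage: java Test [charset]");
      System.exit(-1);
      return;
    }

    // Print every byte as two hex digits on a single line.
    for (byte bb : b) {
      System.out.print(String.format("%02x", bb));
    }
    System.out.println();
  }
}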

First, with LANG set to ja_JP.UTF-8, running the sample code above with no options yields a byte array encoded in UTF-8.

[testuser@centos62 work]$ java Test
e38182

Next, with the executing user's LANG set to ja_JP.EUC-JP, the byte array comes out encoded in EUC-JP.

[testuser@centos62 work]$ export LANG=ja_JP.EUC-JP
[testuser@centos62 work]$ java Test
a4a2

This time, setting the system property file.encoding to SJIS yields a byte array encoded in SJIS (Shift_JIS), even though LANG is still ja_JP.EUC-JP.

[testuser@centos62 work]$ echo $LANG
ja_JP.EUC-JP
[testuser@centos62 work]$ java -Dfile.encoding=SJIS Test
82a0
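The -D option works here because the property is given when the JVM starts. As a minimal sketch (StartupOnly is a hypothetical class name), the following checks that changing file.encoding from inside a running program does not change the default charset. It assumes the default charset is cached after its first lookup, which is how the JDK implementations I am aware of behave.

```java
import java.nio.charset.Charset;

class StartupOnly {
  // Returns true if changing file.encoding at runtime left the
  // default charset unchanged. The property is restored afterwards.
  static boolean defaultSurvivesPropertyChange() {
    Charset before = Charset.defaultCharset(); // first lookup caches the value
    String saved = System.getProperty("file.encoding");
    String other = "UTF-8".equalsIgnoreCase(before.name()) ? "US-ASCII" : "UTF-8";
    try {
      System.setProperty("file.encoding", other);
      return before.equals(Charset.defaultCharset());
    } finally {
      if (saved != null) {
        System.setProperty("file.encoding", saved);
      } else {
        System.clearProperty("file.encoding");
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("default unchanged: " + defaultSurvivesPropertyChange());
  }
}
```

In other words, level 2 of the priority list is effectively a startup-time setting; code that needs a different encoding at runtime has to use level 1.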

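Finally, specifying UTF-16 at the API level (here, the argument to String#getBytes) yields a byte array encoded in UTF-16, regardless of LANG and file.encoding. The leading feff is the byte order mark that Java's UTF-16 encoder writes.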

[testuser@centos62 work]$ echo $LANG
ja_JP.EUC-JP
[testuser@centos62 work]$ java -Dfile.encoding=SJIS Test
82a0
[testuser@centos62 work]$ java -Dfile.encoding=SJIS Test UTF-16
feff3042
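The four results above can also be reproduced in a single run by naming each charset explicitly, which takes LANG and file.encoding out of the picture. A minimal sketch follows; EncodingTable and its methods are hypothetical names. It assumes the runtime ships the extended charsets (EUC-JP and Shift_JIS), so it checks Charset.isSupported before using them.

```java
import java.nio.charset.Charset;

class EncodingTable {
  // Formats a byte array as lowercase hex, two digits per byte.
  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  // Encodes s with the named charset and returns the bytes as hex.
  static String encodeHex(String s, String charsetName) {
    return hex(s.getBytes(Charset.forName(charsetName)));
  }

  public static void main(String[] args) {
    String s = "\u3042"; // the same character as in Test
    for (String name : new String[] {"UTF-8", "EUC-JP", "Shift_JIS", "UTF-16"}) {
      if (Charset.isSupported(name)) {
        System.out.println(name + " : " + encodeHex(s, name));
      } else {
        System.out.println(name + " : not available in this runtime");
      }
    }
  }
}
```

On a JDK that includes all four charsets, the printed values should match the transcripts: e38182, a4a2, 82a0, and feff3042.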

Note: for reference, when none of 1, 2, or 3 is set, an encoding called "ANSI_X3.4-1968" appears to be used.

class Test2 {
  public static void main(String[] args) {
    System.out.println(System.getProperty("file.encoding"));
  }
}

[testuser@centos62 work]$ unset LANG
[testuser@centos62 work]$ echo $LANG

[testuser@centos62 work]$ java Test2
ANSI_X3.4-1968

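"ANSI_X3.4-1968" does not appear under that name in the list below. It is the IANA-registered name for US-ASCII, which the list does include.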

・Supported Encodings
https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html
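One way to check how the JVM itself treats the name "ANSI_X3.4-1968" is to look it up with Charset.forName and print the canonical name. A minimal sketch (AsciiAlias and canonicalName are hypothetical names):

```java
import java.nio.charset.Charset;

class AsciiAlias {
  // Resolves a charset name or alias to the JVM's canonical name for it.
  static String canonicalName(String name) {
    return Charset.forName(name).name();
  }

  public static void main(String[] args) {
    System.out.println(canonicalName("ANSI_X3.4-1968")); // prints US-ASCII
    // The alias set of US-ASCII shows the other names the JVM accepts.
    System.out.println(Charset.forName("US-ASCII").aliases());
  }
}
```

So with no locale set, Java falls back to plain ASCII, and a character such as "あ" cannot be represented in the default encoding.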

That's all.

[ Environment ]
CentOS 6.2
Java SE 8