Handling GBK-Encoded Files in Hadoop

Wherever the Hadoop source code deals with text encoding, UTF-8 is hard-coded. In practice, however, input and output files often need to be GBK-encoded.

If the input files are GBK, convert each Text as it is read in the mapper or reducer by calling transformTextToUTF8(text, "GBK"), so that everything downstream works with UTF-8 data.

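import java.nio.charset.Charset;
import org.apache.hadoop.io.Text;

// Decode the raw bytes held by a Text using the given encoding and
// return a new Text, which stores the resulting string as UTF-8.
public static Text transformTextToUTF8(Text text, String encoding) {
    // Only the first getLength() bytes of the backing array are valid.
    // Charset.forName fails fast on an unknown encoding name, instead of
    // printing a stack trace and then passing null to new Text(...).
    String value = new String(text.getBytes(), 0, text.getLength(), Charset.forName(encoding));
    return new Text(value);
}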
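
The reason the helper passes text.getLength() rather than decoding the whole array is that Text.getBytes() returns the backing buffer, of which only the first getLength() bytes are valid. The sketch below illustrates this with plain JDK classes only; the class name GbkDecodeSketch and the simulated buffer are made up for illustration, and it assumes the running JDK ships the GBK charset.

```java
import java.nio.charset.Charset;

final class GbkDecodeSketch {
    // Decode only the first validLength bytes of buffer with the named charset.
    static String decode(byte[] buffer, int validLength, String encoding) {
        return new String(buffer, 0, validLength, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character string "Chinese" written in Chinese.
        byte[] gbk = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));

        // Simulate a reused buffer that still holds bytes from a longer record.
        byte[] buffer = new byte[gbk.length + 2];
        System.arraycopy(gbk, 0, buffer, 0, gbk.length);
        buffer[gbk.length] = 'x';
        buffer[gbk.length + 1] = 'y';

        // Respecting the valid length yields just the current record.
        System.out.println(decode(buffer, gbk.length, "GBK"));
        // Decoding the whole buffer drags in the stale trailing bytes.
        System.out.println(decode(buffer, buffer.length, "GBK"));
    }
}
```

Decoding with the valid length returns only the current record; decoding the full buffer also returns the leftover "xy".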

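If the output files must be GBK, write your own version of TextOutputFormat: declare public class GBKFileOutputFormat<K, V> extends FileOutputFormat<K, V>, copy in the source of TextOutputFormat, and change the hard-coded utf-8 encoding to GBK. Note that the copied record writer writes the raw bytes of Text keys and values directly, and Text stores UTF-8, so that branch also has to encode via toString() and GBK rather than passing the bytes through. Finally, in the job's run method, set job.setOutputFormatClass(GBKFileOutputFormat.class);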
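
The core of such an output format is the record writer that turns each key/value pair into one encoded line. The following is a minimal sketch of that logic using only JDK classes, not the actual Hadoop source: the class name GbkLineWriter, its constructor, and its methods are hypothetical, and in a real GBKFileOutputFormat this code would sit inside the RecordWriter returned by getRecordWriter.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

final class GbkLineWriter {
    private static final Charset GBK = Charset.forName("GBK");
    private static final byte[] NEWLINE = "\n".getBytes(GBK);

    private final DataOutputStream out;
    private final byte[] separator;

    GbkLineWriter(OutputStream out, String keyValueSeparator) {
        this.out = new DataOutputStream(out);
        this.separator = keyValueSeparator.getBytes(GBK);
    }

    // Write "key<separator>value\n" in GBK; a null key or value is skipped,
    // and nothing is written when both are null.
    void write(Object key, Object value) throws IOException {
        boolean hasKey = key != null;
        boolean hasValue = value != null;
        if (!hasKey && !hasValue) {
            return;
        }
        if (hasKey) {
            // Always go through toString() so the text is re-encoded as GBK.
            out.write(key.toString().getBytes(GBK));
        }
        if (hasKey && hasValue) {
            out.write(separator);
        }
        if (hasValue) {
            out.write(value.toString().getBytes(GBK));
        }
        out.write(NEWLINE);
    }

    void close() throws IOException {
        out.close();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GbkLineWriter writer = new GbkLineWriter(sink, "\t");
        writer.write("\u4e2d\u6587", 1);
        writer.close();
        // Two GBK characters (4 bytes) + tab + "1" + newline.
        System.out.println(sink.size()); // prints 7
    }
}
```

Encoding through toString() for every object is a deliberate choice here: it costs an extra string conversion per record but guarantees that no UTF-8 bytes slip into the GBK output.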
