
Hadoop sampler with multiple input paths, sampling only one file (MultipleInputs + getSample(conf.getInputFormat()))

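I thought my work on the sampler was finished, but I have now hit a new problem. My job has two input files, and the design calls for sampling only the larger one for now (later, both files will be sampled separately). With a single input file, the code that uses the sampler is: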

Path input = new Path(args[0].toString());
input = input.makeQualified(input.getFileSystem(conf));

InputSampler.IntervalSampler<Text, NullWritable> sampler = new InputSampler.IntervalSampler<Text, NullWritable>(0.4, 5);

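// Arguments: sampling frequency 0.4, and at most 5 input splits sampled.
// The number of partitions is not set by this constructor.

// Prototype: K[] getSample(InputFormat<K,V> inf, JobConf job)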

String skewuri_out = args[2] + "/sample_list"; // holds the sampling result, not the partitioning result
FileSystem fs = FileSystem.get(URI.create(skewuri_out), conf);
FSDataOutputStream fs_out = fs.create(new Path(skewuri_out));

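final InputFormat inf = conf.getInputFormat(); // the InputFormat configured on the JobConf
// The sampled keys. The left-hand side has to be Object[]; switching it to IntWritable[] did not work.
// I am not sure why (possibly because inf is a raw InputFormat, so the K[] return type erases to Object[]).
Object[] p = sampler.getSample(inf, conf);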
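To make the 0.4 argument concrete, here is a minimal sketch of the keep-or-skip rule that, as far as I understand it, an interval sampler applies: a key is kept whenever the fraction kept so far is below the frequency. It uses plain Java lists instead of Hadoop splits and record readers, so the class name IntervalSampleSketch and its sample method are hypothetical, and the limit on the number of splits is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the sampling rule; not the Hadoop class itself.
final class IntervalSampleSketch {

    // Keep a key whenever kept/records is still below the requested frequency.
    static List<String> sample(List<String> keys, double freq) {
        List<String> samples = new ArrayList<>();
        long records = 0; // keys seen so far
        long kept = 0;    // keys added to the sample so far
        for (String key : keys) {
            ++records;
            if ((double) kept / records < freq) {
                ++kept;
                samples.add(key);
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            keys.add("k" + i);
        }
        // With freq = 0.4, roughly 4 of the 10 keys end up in the sample.
        System.out.println(sample(keys, 0.4));
    }
}
```

The sample is spread evenly over the input rather than taken from the front, which is why this kind of sampler suits input that is already sorted.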

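The trouble starts once there are two inputs. Suppose I write two Mapper classes, Map1class and Map2class, and each one processes the data from a different input path. For now both inputs are declared to have the same format, so this can be set up with MultipleInputs: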

MultipleInputs.addInputPath(conf, new Path(args[0]), Definemyself.class,Map1class.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), Definemyself.class,Map2class.class);

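// Definemyself.class is my own class. It extends FileInputFormat and implements the WritableComparable interface.

// Extending FileInputFormat is needed for sampling. Implementing WritableComparable is because, for the join, I want the data serialized as one unit. I cannot explain this serialization very well myself; think of it as something like a struct in C: the data is handled as a whole and can be printed with toString().

The declaration is: public class Definemyself extends FileInputFormat<Text,Text> implements WritableComparable{...}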
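The "struct in C" idea can be shown with a small sketch. It mirrors the write/readFields/compareTo shape of the WritableComparable contract using only java.io, so it runs without Hadoop on the classpath. The class JoinRecordSketch, its two fields, and the roundTrip helper are hypothetical names made up for this illustration; in a real job this contract normally sits on the key or value class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record: two fields that travel together as one unit.
final class JoinRecordSketch implements Comparable<JoinRecordSketch> {
    private String key = "";
    private String payload = "";

    JoinRecordSketch() { }

    JoinRecordSketch(String key, String payload) {
        this.key = key;
        this.payload = payload;
    }

    // Serialize every field, in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(key);
        out.writeUTF(payload);
    }

    // Read the fields back in exactly the order they were written.
    void readFields(DataInput in) throws IOException {
        key = in.readUTF();
        payload = in.readUTF();
    }

    @Override
    public int compareTo(JoinRecordSketch other) {
        int byKey = key.compareTo(other.key);
        return byKey != 0 ? byKey : payload.compareTo(other.payload);
    }

    @Override
    public String toString() {
        return key + "\t" + payload;
    }

    // Write the record to bytes and rebuild a fresh copy from them.
    static JoinRecordSketch roundTrip(JoinRecordSketch record) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            record.write(new DataOutputStream(bytes));
            JoinRecordSketch copy = new JoinRecordSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling JoinRecordSketch.roundTrip(new JoinRecordSketch("u42", "row")) gives back an equal record, which is all "serialized as one unit" means here: the fields are written and read together, in the same order.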

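The sampling problem has been bothering me since last night. Last week I dreamed about sampling, and this week I am still dreaming about sampling. At noon I went out to eat with my husband so that we could talk it through properly. My reasoning is that since the system provides MultipleInputs, and JobConf also lets you call getInputFormat(), there must be a way to use the two together. Otherwise the design would contradict itself, and only a fool would build a system like that.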
