In Hadoop, sampling is implemented by the org.apache.hadoop.mapred.lib.InputSampler class.
InputSampler provides three sampling strategies: SplitSampler, RandomSampler, and IntervalSampler.
SplitSampler, RandomSampler, and IntervalSampler are all static nested classes of InputSampler, and each of them implements InputSampler's nested Sampler interface:
public interface Sampler<K, V> {
  K[] getSample(InputFormat<K, V> inf, JobConf job) throws IOException;
}
The getSample method draws a sample of keys based on the job configuration and the input format; each of the three sampler classes provides its own implementation.
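To make the contract concrete, here is a minimal sketch of a custom sampler. FirstKeysSampler is a hypothetical name of my own (it is not a Hadoop class), written against the same old org.apache.hadoop.mapred API as the rest of this article; it simply collects the first n keys of the first split:

import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.InputSampler;

// Hypothetical sampler: takes the first n keys of the first split.
public class FirstKeysSampler<K, V> implements InputSampler.Sampler<K, V> {
  private final int n;

  public FirstKeysSampler(int n) {
    this.n = n;
  }

  @SuppressWarnings("unchecked") // toArray() cannot preserve K; same idiom as Hadoop's own samplers
  public K[] getSample(InputFormat<K, V> inf, JobConf job) throws IOException {
    InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
    ArrayList<K> samples = new ArrayList<K>(n);
    RecordReader<K, V> reader = inf.getRecordReader(splits[0], job, Reporter.NULL);
    K key = reader.createKey();
    V value = reader.createValue();
    while (samples.size() < n && reader.next(key, value)) {
      samples.add(key);
      key = reader.createKey(); // next() fills the key in place, so allocate a fresh one
    }
    reader.close();
    return (K[]) samples.toArray();
  }
}

Any class written this way can be passed wherever InputSampler expects a Sampler.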
RandomSampler draws keys at random from the input data and is a general-purpose sampler. The RandomSampler class has three fields: freq (the probability with which a key is selected), numSamples (the total number of samples to collect over all sampled splits), and maxSplitsSampled (the maximum number of splits to scan).
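To see the three parameters in context, here is a hedged usage sketch (the partition file path is a placeholder of mine, and the job's input format and paths are omitted): a RandomSampler is handed to InputSampler.writePartitionFile, which runs the sampler, sorts the sampled keys, and writes split points for TotalOrderPartitioner:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class SamplerUsage {
  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf();
    job.setPartitionerClass(TotalOrderPartitioner.class);
    // freq = 0.1: each key has a 10% chance of being picked;
    // numSamples = 10000: stop growing the sample at 10000 keys;
    // maxSplitsSampled = 10: scan at most 10 splits.
    InputSampler.RandomSampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
    TotalOrderPartitioner.setPartitionFile(job, new Path("/tmp/partitions")); // hypothetical path
    // Runs the sampler, sorts the sampled keys, and writes
    // numReduceTasks - 1 split points for TotalOrderPartitioner.
    InputSampler.writePartitionFile(job, sampler);
  }
}

This is the usual way to obtain a total order across reducers: TotalOrderPartitioner routes each key to the reducer whose key range contains it, using the boundaries derived from the sample.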
The implementation of getSample in RandomSampler is as follows:
public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
  InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
  ArrayList<K> samples = new ArrayList<K>(numSamples);
  int splitsToSample = Math.min(maxSplitsSampled, splits.length);

  Random r = new Random();
  long seed = r.nextLong();
  r.setSeed(seed);
  LOG.debug("seed: " + seed);
  // shuffle splits
  for (int i = 0; i < splits.length; ++i) {
    InputSplit tmp = splits[i];
    int j = r.nextInt(splits.length);
    splits[i] = splits[j];
    splits[j] = tmp;
  }
  // our target rate is in terms of the maximum number of sample splits,
  // but we accept the possibility of sampling additional splits to hit
  // the target sample keyset
  for (int i = 0; i < splitsToSample ||
                 (i < splits.length && samples.size() < numSamples); ++i) {
    RecordReader<K,V> reader = inf.getRecordReader(splits[i], job,
        Reporter.NULL);
    K key = reader.createKey();
    V value = reader.createValue();
    while (reader.next(key, value)) {
      if (r.nextDouble() <= freq) {
        if (samples.size() < numSamples) {
          samples.add(key);
        } else {
          // When exceeding the maximum number of samples, replace a
          // random element with this one, then adjust the frequency
          // to reflect the possibility of existing elements being
          // pushed out
          int ind = r.nextInt(numSamples);
          // note: nextInt(numSamples) returns a value in [0, numSamples),
          // so this check is always true
          if (ind != numSamples) {
            samples.set(ind, key);
          }
          freq *= (numSamples - 1) / (double) numSamples;
        }
        // next() reuses the key object, so allocate a fresh one now that
        // the current one has been stored in samples
        key = reader.createKey();
      }
    }
    reader.close();
  }
  return (K[]) samples.toArray();
}
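The else branch above deserves a closer look: once the sample list is full, each newly accepted key overwrites a random existing slot, and freq is scaled by (numSamples - 1) / numSamples, so keys seen later in the stream are accepted less and less often and cannot crowd out the earlier samples. Here is a self-contained sketch of just that technique (my own demo, not Hadoop code), applied to a plain stream of integers:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class FreqDecayDemo {
  public static void main(String[] args) {
    int numSamples = 5;   // reservoir capacity
    double freq = 0.5;    // initial acceptance probability
    Random r = new Random(42);
    List<Integer> samples = new ArrayList<Integer>(numSamples);

    for (int key = 0; key < 1000; key++) {
      if (r.nextDouble() <= freq) {
        if (samples.size() < numSamples) {
          samples.add(key);
        } else {
          // replace a random element, then decay the acceptance rate
          samples.set(r.nextInt(numSamples), key);
          freq *= (numSamples - 1) / (double) numSamples;
        }
      }
    }
    System.out.println("samples = " + samples + ", final freq = " + freq);
  }
}

After k replacements the acceptance probability has dropped to freq * ((numSamples - 1) / numSamples)^k, which is what keeps the sample roughly balanced across the whole input rather than biased toward the end.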