1. Development environment
=========
Hadoop:
Hadoop 1.1.2
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1440782
Compiled by hortonfo on Thu Jan 31 02:08:44 UTC 2013
From source with checksum c720ddcf4b926991de7467d253a79b8b
java:
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.5) (6b27-1.12.5-0Ubuntu0.12.04.1)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
os:
Distributor ID: Ubuntu
Description: Ubuntu 12.04.2 LTS
Release: 12.04
Codename: precise
eclipse:
Eclipse Platform
Version: 3.7.2
Build id: I20110613-1736
2. Preparing the data
=========
To run a map-reduce job, we first need some data.
For this example I generated 10 million exam-score records for 1,000 students: student IDs run from 1 to 1000, and scores range from 0 to 100.
A small PHP script gets this done:
<?php
// Generate 10 million tab-separated "<student id>\t<score>" records.
$nRows = 10000000;
$fileData = fopen("Score.data", "a+");
if (!$fileData)
    die("Can't open file!\n");
for ($i = 0; $i < $nRows; $i++)
{
    $nNumber = rand(1, 1000);   // student id: 1 - 1000
    $nScore  = rand(0, 100);    // score: 0 - 100
    $strLine = $nNumber."\t".$nScore."\n";
    fputs($fileData, $strLine);
}
fclose($fileData);
?>
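Since the map-reduce job itself will be written in Java, the same generator can also be sketched directly in Java. This is only a minimal sketch; the class and method names are mine, not from the original:

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Random;

// Sketch of the same score generator in Java (class name is hypothetical).
// Writes nRows tab-separated "<student id>\t<score>" lines, appending to the
// file like the PHP script's "a+" mode does. Sticks to Java 6 constructs to
// match the OpenJDK 1.6 environment above.
public class ScoreGenerator {
    public static void generate(String path, int nRows) throws IOException {
        Random rand = new Random();
        BufferedWriter out = new BufferedWriter(new FileWriter(path, true));
        try {
            for (int i = 0; i < nRows; i++) {
                int number = 1 + rand.nextInt(1000); // student id: 1 - 1000
                int score = rand.nextInt(101);       // score: 0 - 100
                out.write(number + "\t" + score + "\n");
            }
        } finally {
            out.close();
        }
    }
}
```

Calling ScoreGenerator.generate("Score.data", 10000000) reproduces the same data file.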
3. Configuring Eclipse
=========
Writing map-reduce jobs for Hadoop naturally means Java, and Java naturally means Eclipse.
There is a Hadoop map-reduce plugin for Eclipse, but the versions it supports are far too old by now, so with a recent Eclipse you have to add the Hadoop jars to your project yourself.
On Ubuntu, the Hadoop packages are installed under
/usr/share/hadoop
That path holds a series of jars (including those in its lib subdirectory) to import into your project. The one you generally need is hadoop-core-1.1.2.jar, so import it in essentially every project.
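With hadoop-core-1.1.2.jar on the build path, a Mapper for this data mainly has to split each tab-separated line into its two fields. As a minimal sketch of that parsing step (the ScoreRecord class is my own illustration, not part of Hadoop or the original; the Mapper subclass that would wrap it needs the Hadoop jars above):

```java
// Parses one line of Score.data ("<student id>\t<score>") into its two fields.
// A Hadoop Mapper's map() for this data set would call something like parse()
// on each input value; this class itself needs no Hadoop dependency.
public class ScoreRecord {
    public final int studentId; // 1 - 1000
    public final int score;     // 0 - 100

    public ScoreRecord(int studentId, int score) {
        this.studentId = studentId;
        this.score = score;
    }

    public static ScoreRecord parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            throw new IllegalArgumentException("Malformed record: " + line);
        }
        return new ScoreRecord(Integer.parseInt(fields[0]),
                               Integer.parseInt(fields[1]));
    }
}
```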