Today I tried writing a word-count program in C that runs on Hadoop. The steps were as follows:
1. Write the map and reduce programs
mapper.c
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 2048
#define DELIMS " \t\r\n"   /* whitespace that separates words */

int main(void)
{
    char buffer[BUF_SIZE];

    /* Hadoop Streaming feeds the input to stdin, one line at a time. */
    while (fgets(buffer, sizeof buffer, stdin))
    {
        /* Splitting on '\n' drops the trailing newline; splitting on
           '\t' keeps tabs out of the key, since tab separates key and value. */
        char *query = strtok(buffer, DELIMS);
        while (query)
        {
            printf("%s\t1\n", query);   /* emit <word, 1> */
            query = strtok(NULL, DELIMS);
        }
    }
    return 0;
}
reducer.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFFER_SIZE 1024
#define DELIM "\t"

int main(void)
{
    char str_last_key[BUFFER_SIZE] = "";
    char str_line[BUFFER_SIZE];
    int count = 0;

    /* The mapper output arrives sorted by key, so equal keys are adjacent. */
    while (fgets(str_line, sizeof str_line, stdin))
    {
        char *str_cur_key = strtok(str_line, DELIM);
        char *str_cur_num = strtok(NULL, DELIM);

        if (!str_cur_key || !str_cur_num)   /* skip blank or malformed lines */
            continue;

        if (str_last_key[0] != '\0' && strcmp(str_cur_key, str_last_key) != 0)
        {
            /* Key changed: output the total for the previous key. */
            printf("%s\t%d\n", str_last_key, count);
            count = 0;
        }
        count += atoi(str_cur_num);   /* atoi ignores the trailing newline */
        strcpy(str_last_key, str_cur_key);
    }

    /* Flush the last key; print nothing if the input was empty. */
    if (str_last_key[0] != '\0')
        printf("%s\t%d\n", str_last_key, count);
    return 0;
}
2. Compile
gcc mapper.c -o mapper
gcc reducer.c -o reducer
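3. Run
(1) After starting Hadoop, put the file whose words are to be counted into the input directory: bin/hadoop fs -put <file to count> input
(2) Use the jar under contrib/streaming/ to invoke the mapper and reducer built above: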
bin/hadoop jar /home/huangkq/Desktop/hadoop/contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper /home/huangkq/Desktop/hadoop2/mapper -reducer /home/huangkq/Desktop/hadoop2/reducer -input input -output c_output -jobconf mapred.reduce.tasks=2
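Note: hadoop-streaming-0.20.203.0.jar is the Hadoop Streaming utility. It works like a pipe: it writes each input line to the mapper's standard input, reads the tab-separated key/value lines from the mapper's standard output, sorts them by key, and feeds them to the reducer's standard input. With mapred.reduce.tasks=2 the keys are divided between two reducer processes, so the output directory holds two part files.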
(3) View the results: bin/hadoop fs -cat c_output/*