Cluster setups are all much alike; here is mine in brief:
Master node: Ubuntu 14.04 LTS x64. The other two nodes run in VMs on CentOS 6.4 x64.
JVM: JDK 1.7.0_80
Hadoop: I tried both 2.7.1 and 2.7.2
The problem:
HDFS starts normally and all the daemons come up. jps shows:
Master node: SecondaryNameNode and NameNode
Slave nodes: DataNode
But hdfs dfsadmin -report shows only 1 live datanode, and reports taken at different times alternate between them: one moment it is datanode1, the next it is datanode2.
For example:
hadoop@hadoop:modules$ hdfs dfsadmin -report
Configured Capacity: 16488800256 (15.36 GB)
Present Capacity: 13008093184 (12.11 GB)
DFS Remaining: 13008068608 (12.11 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (1):

Name: 192.168.2.3:50010 (hadoop)
Hostname: hadoop1
Decommission Status : Normal
Configured Capacity: 16488800256 (15.36 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 3480969216 (3.24 GB)
DFS Remaining: 13007806464 (12.11 GB)
DFS Used%: 0.00%
DFS Remaining%: 78.89%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon May 09 17:30:08 CST 2016

Running report again:
hadoop@hadoop:modules$ hdfs dfsadmin -report
Configured Capacity: 16488800256 (15.36 GB)
Present Capacity: 13008007168 (12.11 GB)
DFS Remaining: 13007982592 (12.11 GB)
DFS Used: 24576 (24 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (1):

Name: 192.168.2.3:50010 (hadoop)
Hostname: hadoop2
Decommission Status : Normal
Configured Capacity: 16488800256 (15.36 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 3480793088 (3.24 GB)
DFS Remaining: 13007982592 (12.11 GB)
DFS Used%: 0.00%
DFS Remaining%: 78.89%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Mon May 09 17:34:06 CST 2016

Strange. Note, too, that both reports list the very same address, 192.168.2.3:50010, even though the hostnames differ. The web UI on port 50070 shows the same thing: only 1 live datanode, and refreshing the page flips which one it is, just as above.
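When repeating this check it helps to script it. A minimal sketch that pulls the live-node count out of the report text; the saved sample below stands in for real hdfs dfsadmin -report output, since only the count line matters:

```shell
# Extract the live-datanode count from a dfsadmin report.
# The sample text stands in for real `hdfs dfsadmin -report` output.
report='Live datanodes (1):

Name: 192.168.2.3:50010 (hadoop)
Hostname: hadoop1'

live=$(printf '%s\n' "$report" | sed -n 's/^Live datanodes (\([0-9]*\)).*/\1/p')
echo "live datanodes: $live"   # prints: live datanodes: 1
```

On the cluster itself you would pipe the command directly, e.g. hdfs dfsadmin -report | sed -n 's/^Live datanodes (\([0-9]*\)).*/\1/p', and expect it to print 2 for this two-slave setup.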
Actually, I did not start by looking at the report. I first ran dfs -mkdir /test and then tried to put a file, which failed with an IO transfer exception. The key part of it:
hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: could only be replicated to 0 nodes, instead of 1
That puzzled me. I checked the firewall and SELinux on every machine, then mutual ping, then ssh connections between nodes; all of it was fine.
Everything looked normal, so why the transfer exception?
An excerpt from the datanode log:
2016-05-10 01:29:54,148 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1877985316-192.168.2.3-1462786104060 (Datanode Uuid c31e3853-b15e-46d8-abd0-ac1d1ed4572b) service to hadoop/192.168.2.3:9000 successfully registered with NN
2016-05-10 01:29:54,151 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x44419c23fe, containing 1 storage report(s), of which we sent 1. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 2 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-05-10 01:29:54,152 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1877985316-192.168.2.3-1462786104060
2016-05-10 01:29:57,150 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from hadoop/192.168.2.3:9000 with active state
2016-05-10 01:29:57,153 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1877985316-192.168.2.3-1462786104060 (Datanode Uuid c31e3853-b15e-46d8-abd0-ac1d1ed4572b) service to hadoop/192.168.2.3:9000 beginning handshake with NN

These lines keep repeating, once every 2-3 seconds, and the other node shows the same thing.
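The repeating DNA_REGISTER / handshake pair is the telltale sign here: the NameNode keeps asking the datanode to register all over again every few seconds, instead of registering it once at startup. A quick way to confirm the loop is to count those lines; a sketch, with a short sample standing in for the real log file under the Hadoop logs directory:

```shell
# Count how often the NameNode asked this datanode to re-register.
# Sample lines stand in for the real datanode log file.
log='2016-05-10 01:29:57,150 INFO DataNode: DatanodeCommand action : DNA_REGISTER from hadoop/192.168.2.3:9000 with active state
2016-05-10 01:29:57,153 INFO DataNode: service to hadoop/192.168.2.3:9000 beginning handshake with NN
2016-05-10 01:30:00,148 INFO DataNode: DatanodeCommand action : DNA_REGISTER from hadoop/192.168.2.3:9000 with active state'

registers=$(printf '%s\n' "$log" | grep -c 'DNA_REGISTER')
echo "re-register requests: $registers"   # prints: re-register requests: 2
```

On a real node you would run the grep against the datanode log file itself; a count that keeps growing while the cluster sits idle means the registration loop above is happening.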
I also watched the VM's network activity light in the corner of the screen: it blinked roughly once a second, meaning the master and slave nodes were exchanging traffic every second. At first I paid no attention, assuming it was just heartbeats.
Normally, as I understand it, that light should only blink when a request is actually being made.
I then combed through the official docs and other online resources and questions, tried everything suggested, and nothing worked. Maddening.
I made small tweaks to my config files, but as long as the essentials are configured correctly, minor changes make little difference. These are my config files:
core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/modules/hadoop-2.7.2/data/tmp</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
</configuration>

hdfs-site.xml
<configuration>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>hadoop:50090</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/modules/hadoop-2.7.2/data/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/modules/hadoop-2.7.2/data/dfs/data</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
Only the HDFS file system is involved here, so I won't go into mapred-site.xml and yarn-site.xml.
After countless namenode formats and deletions of the data and /tmp directories, nothing helped. I began to suspect the Hadoop version, so I switched from the original 2.7.1 to 2.7.2: same result. Then I suspected Ubuntu itself. I moved over to my Win7 machine, ran the three nodes in VMs there, and uploading a file succeeded without a hitch. So what exactly is wrong on Ubuntu?
Everything kept pointing back to transfer between the nodes, so I started to wonder about the VM networking. I tried several DNS servers, and switched the gateway from the VMware NAT adapter's xx.xx.xx.1 address on Ubuntu to my router's xx.xx.xx.1; no change. Every node could ping itself, the gateway, the other nodes, and the internet, all smoothly, so that wasn't the problem either.
One last thing to try: I had been using VMware's NAT connection mode, so I changed it to bridged mode, configured an external DNS, then reformatted the namenode and started HDFS. I checked the datanode report and tested a file upload: everything worked. Finally fixed.
Putting a file now succeeds.
The web UI shows both live nodes, no longer just one as before.
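For comparison, after the switch to bridged mode the same count check should come back as 2, with each datanode under its own address. A sketch of the expected shape; the 192.168.2.4 address below is a placeholder for the second node, not taken from the original reports:

```shell
# Expected shape of the report after the fix: two entries, two addresses.
report='Live datanodes (2):

Name: 192.168.2.3:50010 (hadoop1)
Name: 192.168.2.4:50010 (hadoop2)'   # second address is illustrative

live=$(printf '%s\n' "$report" | sed -n 's/^Live datanodes (\([0-9]*\)).*/\1/p')
addrs=$(printf '%s\n' "$report" | grep -c '^Name: ')
echo "$live live, $addrs entries"   # prints: 2 live, 2 entries
```

The point of counting the Name: lines as well is that before the fix both datanodes showed up under the single address 192.168.2.3, just never at the same time.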
My takeaways:
If HDFS starts and the daemons all come up normally, but the live datanode count is wrong and put fails with an IO error, the cause is almost always one of the following:
1. A mistake in the configured slaves file or hdfs-site.xml.
2. An ssh or transfer problem between nodes.
3. A network problem: gateway, DNS, IP configuration, or the hosts file.
4. If all of the above check out, test a different connection mode, for example switching the VM between NAT and bridged networking.
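For the hosts-file item, every node should carry the same name-to-address mapping for the whole cluster. A sketch of the layout; only 192.168.2.3 appears in the reports above, so the other addresses are placeholders to be replaced with each node's real one:

```
192.168.2.3   hadoop    # master: NameNode, SecondaryNameNode
192.168.2.x   hadoop1   # slave: DataNode (use the node's real address)
192.168.2.x   hadoop2   # slave: DataNode (use the node's real address)
```

In bridged mode each VM gets its own address on the LAN, so each hostname really can map to a distinct entry.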
That is where to start. With an uncommon problem like this, you can only try things one at a time.
As for why VMware's NAT connection mode under Ubuntu causes this single-channel transfer behavior, I don't really understand it myself; if you do, please fill in the details.
After all, only Hadoop's transfers were affected; everything else in the VMs, such as ping, internet access, file transfers, and ssh logins, all worked fine.