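A SUSE kernel bug: update_group_power: cpu_power
Introduction:
Five of our production servers recently went down one after another within three days. The cause was traced to a kernel bug in SUSE 11 SP1.
System error messages:
The system messages log showed the following errors: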
Jun 10 14:00:07 sharedbpro kernel: [ 282.962529] update_group_power: cpu_power = 3925366004
Jun 10 14:00:07 sharedbpro kernel: [ 282.962559] update_group_power: cpu_power = 3925397578
Jun 10 14:00:07 sharedbpro kernel: [ 282.965803] update_group_power: cpu_power = 3928638515
Jun 10 14:00:07 sharedbpro kernel: [ 282.966201] update_group_power: cpu_power = 3929034454
Jun 10 14:00:07 sharedbpro kernel: [ 282.966369] update_group_power: cpu_power = 3929206061
Jun 10 14:00:07 sharedbpro kernel: [ 282.966397] update_group_power: cpu_power = 3929235611
Jun 10 14:00:07 sharedbpro kernel: [ 282.966507] update_group_power: cpu_power = 3929344069
Jun 10 14:00:07 sharedbpro kernel: [ 282.966535] update_group_power: cpu_power = 3929373135
Jun 10 14:00:07 sharedbpro kernel: [ 282.969804] update_group_power: cpu_power = 3932639635
Jun 10 14:00:07 sharedbpro kernel: [ 282.970188] update_group_power: cpu_power = 3933021527
Jun 10 14:00:07 sharedbpro kernel: [ 282.970353] update_group_power: cpu_power = 3933189985
Jun 10 14:00:07 sharedbpro kernel: [ 282.970381] update_group_power: cpu_power = 3933218987
Jun 10 14:00:07 sharedbpro kernel: [ 282.970490] update_group_power: cpu_power = 3933327365
Jun 10 14:00:07 sharedbpro kernel: [ 282.970518] update_group_power: cpu_power = 3933356585
Jun 10 14:00:07 sharedbpro kernel: [ 282.973789] update_group_power: cpu_power = 3936624686
Jun 10 14:00:07 sharedbpro kernel: [ 282.974194] update_group_power: cpu_power = 3937026810
Jun 10 14:00:07 sharedbpro kernel: [ 282.974360] update_group_power: cpu_power = 3937196506
Jun 10 14:00:07 sharedbpro kernel: [ 282.974388] update_group_power: cpu_power = 3937226236
Jun 10 14:00:07 sharedbpro kernel: [ 282.974496] update_group_power: cpu_power = 3937333589
Jun 10 14:00:07 sharedbpro kernel: [ 282.974525] update_group_power: cpu_power = 3937363466
Jun 10 14:00:07 sharedbpro kernel: [ 282.977789] update_group_power: cpu_power = 3940624812
Jun 10 14:00:07 sharedbpro kernel: [ 282.978185] update_group_power: cpu_power = 3941017715
Jun 10 14:00:07 sharedbpro kernel: [ 282.978351] update_group_power: cpu_power = 3941187161
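Symptoms:
Errors of the form "update_group_power: cpu_power = xxxxxxxx" appear in the system log. They usually continue for more than 10 minutes without a break and fill the log page after page.
After a certain amount of time the system goes down. When I first logged in through iLO, the console showed a black screen and the system was hung. I rebooted the system right away, started the database, and confirmed that everything was back to normal.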
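To spot this symptom quickly on other hosts, the log can be scanned for the message and summarized. The following is only a minimal sketch, not a vendor tool: the function name summarize, the year argument (syslog timestamps carry no year), and the returned dictionary keys are all assumptions made for illustration. It also assumes English month abbreviations in the log.

```python
import re
from datetime import datetime

# Matches lines such as:
# Jun 10 14:00:07 sharedbpro kernel: [ 282.962529] update_group_power: cpu_power = 3925366004
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:"
    r".*update_group_power: cpu_power = (?P<value>\d+)"
)


def summarize(lines, year=2000):
    """Summarize update_group_power messages found in an iterable of log lines.

    `year` is a placeholder (hypothetical parameter) because classic syslog
    timestamps do not include a year. Returns None when nothing matches.
    """
    stamps = []
    values = []
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            continue  # unrelated log line
        stamp = datetime.strptime(
            f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S"
        )
        stamps.append(stamp)
        values.append(int(match.group("value")))
    if not values:
        return None
    return {
        "count": len(values),
        "first_seen": min(stamps),
        "last_seen": max(stamps),
        "duration_seconds": (max(stamps) - min(stamps)).total_seconds(),
        "min_value": min(values),
        "max_value": max(values),
    }


# Example use (the path is the usual SUSE location, adjust as needed):
# with open("/var/log/messages", errors="replace") as handle:
#     print(summarize(handle))
```

A large count together with a duration of several minutes would correspond to the continuous flood of messages described above.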
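Solution:
According to the vendor's assessment, this behavior is confirmed to be a bug.
The fix is to update the kernel to a stable version, either SP2 or the latest SP1 kernel, or to update the entire system to SP2.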
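Before planning the update it helps to know which hosts are still on SP1. The sketch below assumes the host has an /etc/SuSE-release file containing "VERSION = ..." and "PATCHLEVEL = ..." lines, as SLES 11 does; the function names parse_suse_release and is_sp1 are hypothetical. It checks only the service pack level, not the kernel build, so an SP1 host that already runs the latest SP1 kernel is still reported as SP1.

```python
import re


def parse_suse_release(text):
    """Return (version, patchlevel) parsed from /etc/SuSE-release style text.

    Returns None when no VERSION line is present. A missing PATCHLEVEL
    line is treated as patchlevel 0.
    """
    fields = dict(
        re.findall(r"^(VERSION|PATCHLEVEL)\s*=\s*(\d+)\s*$", text, re.MULTILINE)
    )
    if "VERSION" not in fields:
        return None
    return int(fields["VERSION"]), int(fields.get("PATCHLEVEL", 0))


def is_sp1(text):
    """True when the release text describes SUSE 11 SP1."""
    return parse_suse_release(text) == (11, 1)


# Example use:
# with open("/etc/SuSE-release") as handle:
#     print(is_sp1(handle.read()))
```

Hosts reported as SP1 are the candidates for the kernel or service pack update described above. The running kernel version itself (for example from uname -r) still has to be compared against the vendor's fixed version, which is not given here.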
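Tip:
鹵肉 would like to stress one point here: as operations DBAs we should put the business first. Restore the application first, then investigate the root cause. That said, any essential information gathering that can be done quickly (within a minute or two) is still worth doing before the restart.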