A Linux Network Outage Caused by an Accidental Operation

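1. Fault Description

A user reported that a production system was unreachable. A colleague investigated immediately and confirmed that the system could not be accessed and that the server did not respond to ping.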

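2. Fault Handling

Because the server could not be pinged from any client, we had to go to the machine room. On inspection, the server hardware showed no alarms and the system had not rebooted. After logging in and running ifconfig, we found that the IP address was gone (eth0 did not exist). We then opened the NIC configuration directory /etc/sysconfig/network-scripts and found that ifcfg-eth0 was missing; only the earlier backup ifcfg-eth0.bak and a file named ifcfg-peth0 were present. Following the principle of restoring service first and analyzing the fault afterward, we recreated ifcfg-eth0 by copying the backup file and restarted the network service, which resolved the outage.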
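The restore step described above can be sketched as a small shell function. This is a minimal sketch, not the exact commands used during the incident: the function name restore_ifcfg and its two parameters are assumptions made for illustration, and the function deliberately refuses to overwrite an existing file.

```shell
# Hypothetical helper: recreate a missing ifcfg file from its .bak copy.
restore_ifcfg() {
  dir=$1      # directory holding the ifcfg files, e.g. /etc/sysconfig/network-scripts
  iface=$2    # interface name, e.g. eth0
  target="$dir/ifcfg-$iface"
  backup="$target.bak"
  if [ -e "$target" ]; then
    echo "refusing to overwrite existing $target" >&2
    return 1
  fi
  if [ ! -f "$backup" ]; then
    echo "no backup found at $backup" >&2
    return 1
  fi
  cp -p "$backup" "$target"   # -p preserves mode, ownership and timestamps
}
```

On the affected server this would be followed by a restart of the network service, for example: restore_ifcfg /etc/sysconfig/network-scripts eth0 && service network restart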

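3. Fault Analysis

3.1 We learned that when the fault occurred, a colleague was logged in to the system checking the security baseline configuration, but he insisted that he had not run rm or mv on the NIC file. The output of history confirmed this: he had run neither rm nor mv, only chkconfig --list. However, he had then accidentally pasted the output he meant to copy back into the terminal, so every line of it was executed as a command. The history is as follows: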

883 chkconfig --list
884 NetworkManager 0:off 1:off 2:off 3:off 4:off 5:off 6:off
885 PowerIscsi 0:off 1:off 2:off 3:on 4:off 5:on 6:off
886 PowerMig 0:off 1:off 2:off 3:on 4:off 5:on 6:off
887 PowerMigRecoverAll 0:off 1:off 2:off 3:on 4:off 5:on 6:off
888 acpid 0:off 1:off 2:on 3:on 4:on 5:on 6:off
889 anacron 0:off 1:off 2:on 3:on 4:on 5:on 6:off
890 atd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
891 auditd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
892 autofs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
893 avahi-daemon 0:off 1:off 2:off 3:on 4:on 5:on 6:off
894 avahi-dnsconfd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
895 bluetooth 0:off 1:off 2:on 3:on 4:on 5:on 6:off
896 capi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
897 conman 0:off 1:off 2:off 3:off 4:off 5:off 6:off
898 coremail 0:off 1:off 2:on 3:on 4:on 5:on 6:off
899 cpuspeed 0:off 1:on 2:on 3:on 4:on 5:on 6:off
900 crond 0:off 1:off 2:on 3:on 4:on 5:on 6:off
901 cups 0:off 1:off 2:on 3:on 4:on 5:on 6:off
902 dnsmasq 0:off 1:off 2:off 3:off 4:off 5:off 6:off
903 dund 0:off 1:off 2:off 3:off 4:off 5:off 6:off
904 ebtables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
905 firstboot 0:off 1:off 2:off 3:on 4:off 5:on 6:off
906 gpm 0:off 1:off 2:on 3:on 4:on 5:on 6:off
907 haldaemon 0:off 1:off 2:off 3:on 4:on 5:on 6:off
908 hidd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
909 hplip 0:off 1:off 2:on 3:on 4:on 5:on 6:off
910 httpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
911 ip6tables 0:off 1:off 2:on 3:on 4:on 5:on 6:off
912 ipmi 0:off 1:off 2:off 3:off 4:off 5:off 6:off
913 iptables 0:off 1:off 2:off 3:off 4:off 5:off 6:off
914 irda 0:off 1:off 2:off 3:off 4:off 5:off 6:off
915 irqbalance 0:off 1:off 2:on 3:on 4:on 5:on 6:off
916 iscsi 0:off 1:off 2:off 3:on 4:on 5:on 6:off
917 iscsid 0:off 1:off 2:off 3:on 4:on 5:on 6:off
918 isdn 0:off 1:off 2:on 3:on 4:on 5:on 6:off
919 kdump 0:off 1:off 2:off 3:off 4:off 5:off 6:off
920 kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off
921 libvirt-guests 0:off 1:off 2:off 3:on 4:on 5:on 6:off
922 libvirtd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
923 lvm2-monitor 0:off 1:on 2:on 3:on 4:on 5:on 6:off
924 mcstrans 0:off 1:off 2:on 3:on 4:on 5:on 6:off
925 mdmonitor 0:off 1:off 2:on 3:on 4:on 5:on 6:off
926 mdmpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
927 messagebus 0:off 1:off 2:off 3:on 4:on 5:on 6:off
928 microcode_ctl 0:off 1:off 2:on 3:on 4:on 5:on 6:off
929 multipathd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
930 named 0:off 1:off 2:off 3:off 4:off 5:off 6:off
931 netbackup 0:off 1:off 2:on 3:on 4:off 5:on 6:off
932 netconsole 0:off 1:off 2:off 3:off 4:off 5:off 6:off
933 netfs 0:off 1:off 2:off 3:on 4:on 5:on 6:off
934 netplugd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
935 network 0:off 1:off 2:on 3:on 4:on 5:on 6:off
936 nfs 0:off 1:off 2:off 3:off 4:off 5:off 6:off
937 nfslock 0:off 1:off 2:off 3:on 4:on 5:on 6:off
938 nscd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
939 ntpd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
940 pand 0:off 1:off 2:off 3:off 4:off 5:off 6:off
941 pcscd 0:off 1:off 2:on 3:on 4:on 5:on 6:off
942 portmap 0:off 1:off 2:off 3:on 4:on 5:on 6:off
943 psacct 0:off 1:off 2:off 3:off 4:off 5:off 6:off
944 rawdevices 0:off 1:off 2:off 3:on 4:on 5:on 6:off
945 rdisc 0:off 1:off 2:off 3:off 4:off 5:off 6:off
946 readahead_early 0:off 1:off 2:on 3:on 4:on 5:on 6:off
947 readahead_later 0:off 1:off 2:off 3:off 4:off 5:on 6:off
948 restorecond 0:off 1:off 2:on 3:on 4:on 5:on 6:off
949 rhnsd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
950 rpcgssd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
951 rpcidmapd 0:off 1:off 2:off 3:on 4:on 5:on 6:off
952 rpcsvcgssd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
953 saslauthd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
954 sendmail 0:off 1:off 2:off 3:off 4:off 5:off 6:off

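On the surface, nothing in this history looks abnormal.

3.2 The system log messages, however, contained an entry reporting that ifcfg-eth0 had been removed, and its timestamp matched the time of the colleague's mistake: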

Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: removed /etc/sysconfig/network-scripts/ifcfg-eth0.
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-peth0 ...
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: read connection 'System peth0'
Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: updating /etc/sysconfig/network-scripts/ifcfg-peth0
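A log search of this kind can be sketched as follows. The function name removed_ifcfg_events is an assumption made for illustration, and the pattern is based only on the log lines quoted in this section; other NetworkManager versions may word the message differently.

```shell
# Hypothetical helper: list log entries in which the ifcfg-rh plugin
# reports that a configuration file was removed.
removed_ifcfg_events() {
  logfile=${1:-/var/log/messages}   # default path is the usual RHEL location
  grep -E 'ifcfg-rh:[[:space:]]*removed[[:space:]]' "$logfile"
}
```

Running removed_ifcfg_events with no argument searches /var/log/messages; a rotated log can be passed as the first argument.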

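If the colleague never deleted or moved the file, why does the log record the NIC file being removed? Had the server been compromised, or was there some other cause?

3.3 We checked the secure log and the output of last and found no logins from unexpected IP addresses, so we set the intrusion theory aside for the moment and concentrated on finding which of the accidentally executed lines had caused the NIC file to disappear.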
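The check of the secure log can be sketched like this. The function name accepted_login_sources is an assumption made for illustration; the sketch relies on sshd's usual "Accepted <method> for <user> from <ip> port <n>" message and would miss logins recorded in another format.

```shell
# Hypothetical helper: count successful sshd logins per source address.
accepted_login_sources() {
  logfile=${1:-/var/log/secure}   # default path is the usual RHEL location
  awk '/sshd/ && /Accepted/ {
         # the source address is the field that follows the word "from"
         for (i = 1; i < NF; i++)
           if ($i == "from") { print $(i + 1); break }
       }' "$logfile" | sort | uniq -c | sort -rn
}
```

Any address in the output that does not belong to a known administrator workstation deserves a closer look; last -n 20 gives a complementary view from the login records.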

3.4 We rechecked the history from 3.1. It still showed nothing obviously wrong, and every line was simply output from chkconfig --list, which left us puzzled.

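3.5 When chasing a problem, read the logs. The only option left was to comb through the messages log for clues. In the log from 3.2, the line

Mar 21 09:46:50 localhost nm-system-settings: ifcfg-rh: parsing /etc/sysconfig/network-scripts/ifcfg-peth0 ...

stood out: the NIC file ifcfg-peth0 looked suspicious. A file with that name should be related to Xen virtualization, yet this system was not using Xen.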

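3.6 Logging in confirmed that although the system did not use virtualization, Xen had been installed when the system was first built, the kernel-xen kernel was loaded, and the xend service was running: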

1) [root@ ~]# uname -r
2.6.18-238.el5xen
2) # /etc/init.d/xend status
xend is running
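The kernel check can be turned into a scriptable test. This is a rough sketch: the function name is_xen_kernel is an assumption, and matching the substring "xen" is only a heuristic based on the RHEL 5 release string shown above (2.6.18-238.el5xen).

```shell
# Hypothetical helper: succeed if a kernel release string looks like a Xen kernel.
is_xen_kernel() {
  case $1 in
    *xen*) return 0 ;;   # e.g. 2.6.18-238.el5xen
    *)     return 1 ;;
  esac
}
```

A typical call would be: is_xen_kernel "$(uname -r)" && echo "Xen kernel loaded"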

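3.7 The creation/modification time of ifcfg-peth0 matched the time of the colleague's mistake, which again suggested that this file was connected to the fault: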

# find . -type f -mtime 2 | xargs ls -l
-rw-r--r-- 1 root root 303 Mar 21 09:46 ./etc/modprobe.conf
-rw-r--r-- 1 root root 23116 Mar 21 09:46 ./etc/sysconfig/hwconf
-rw-r--r-- 1 root root 122 Mar 21 09:46 ./etc/sysconfig/network-scripts/ifcfg-peth0
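Note that -mtime 2 only matches files last modified between 48 and 72 hours ago, so the command above works only when it is run on the right day. When the time of the incident is known, a window-based search is more precise. The following is a minimal sketch using reference files and -newer; the function name files_modified_between and the temporary file template are assumptions made for illustration.

```shell
# Hypothetical helper: list regular files under a directory whose mtime
# falls after START and not after END.
# START and END use the touch -t format: [[CC]YY]MMDDhhmm[.SS]
files_modified_between() {
  dir=$1; start=$2; end=$3
  ref_start=$(mktemp /tmp/mtime_ref.XXXXXX) || return 1
  ref_end=$(mktemp /tmp/mtime_ref.XXXXXX) || { rm -f "$ref_start"; return 1; }
  touch -t "$start" "$ref_start"   # reference file stamped with the window start
  touch -t "$end" "$ref_end"       # reference file stamped with the window end
  find "$dir" -type f -newer "$ref_start" ! -newer "$ref_end"
  rm -f "$ref_start" "$ref_end"
}
```

For the incident above, a call such as files_modified_between /etc 03210900 03211000 would list everything under /etc changed between 09:00 and 10:00 on March 21 of the current year.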

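3.8 To make it easier to investigate and reproduce the fault, we built a test environment matching the production system: RHEL 5.6 with Xen virtualization installed.

3.8.1 As on the production system, we created a backup copy named ifcfg-eth0.bak;

3.8.2 We then executed the entries from the colleague's history one at a time. When we reached "kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off", the problem reproduced: ifcfg-eth0 was deleted, ifcfg-peth0 was generated, and the network went down, exactly as on the production system.

[Figure: screenshot of the reproduction on the Xen test system]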
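The line-by-line replay can be automated along these lines. This is a sketch for a disposable test machine only, since it executes every line of the input file; the function name replay_until_missing and its parameters are assumptions made for illustration.

```shell
# Hypothetical helper: run each line of CMDS as a command and stop at the
# first one after which WATCHED no longer exists.
# Only for a throwaway test system: every line really is executed.
replay_until_missing() {
  watched=$1   # file to watch, e.g. /etc/sysconfig/network-scripts/ifcfg-eth0
  cmds=$2      # text file with one command line per row
  while IFS= read -r line <&3; do
    [ -n "$line" ] || continue
    # run the line in a child shell; its output and exit status are ignored
    sh -c "$line" </dev/null >/dev/null 2>&1
    if [ ! -e "$watched" ]; then
      printf 'file disappeared after: %s\n' "$line"
      return 0
    fi
  done 3<"$cmds"
  return 1     # every line ran and the file is still there
}
```

The command file is read on descriptor 3 so that a replayed command cannot consume the remaining lines from standard input.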

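3.9 We built a second test environment: RHEL 5.6 without Xen virtualization. Running the same commands as in 3.8.2 did not reproduce the problem.

[Figure: screenshot of the test on the system without Xen]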

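4. Root Cause

Reproducing the problem led to this conclusion: the system had the Xen virtualization environment installed, and among the lines the colleague accidentally executed was "kudzu 0:off 1:off 2:off 3:on 4:on 5:on 6:off". With both conditions met, ifcfg-eth0 was deleted and the network went down.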
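The underlying mechanism is that the shell treats the first word of every pasted line as a command name and the rest as its arguments, so any service name that is also an executable gets run. The following minimal sketch shows which lines of a pasted block would actually execute something; the function name commands_that_would_run is an assumption made for illustration.

```shell
# Hypothetical helper: read text on stdin and print the first word of every
# line that the current shell would resolve to a command.
commands_that_would_run() {
  while read -r name rest; do
    [ -n "$name" ] || continue
    # command -v succeeds for executables on PATH, builtins, functions and aliases
    if command -v "$name" >/dev/null 2>&1; then
      printf '%s\n' "$name"
    fi
  done
}
```

Piping saved output through it, for example chkconfig --list | commands_that_would_run, lists the service names that double as commands for the current user; on a system like the one in this incident, kudzu would be among them.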

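5. Background

kudzu is Red Hat's hardware detection and configuration tool. Based on information found online, our current understanding of why running it can delete a NIC configuration file is that this is either a bug triggered under specific conditions (Xen virtualization installed) or a consequence of how the tool is designed to work. The fact that /etc/sysconfig/hwconf and /etc/modprobe.conf were modified at the same moment (see 3.7) is consistent with kudzu having reconfigured the hardware at that time.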

Appendix:

1. Introduction to kudzu: http://blog.csdn.net/huyangg/article/details/7189743

2. kudzu-related bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=206910
https://bugzilla.redhat.com/show_bug.cgi?id=229579
http://linux.bigresource.com/Red-Hat-Prevent-kudzu-from-changing-ifcfg-ethX-file–wi29JYmpf.html
