A NameNode outage caused by a long GC pause

1) Symptom: both read and write operations against HDFS were failing, and dozens of jobs reported errors like the following:

FAILED: RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category WRITE is not supported in state standby

or:

FAILED: RuntimeException org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby

2) Careful review of the NameNode log, together with analysis of the source code, showed that the NameNode went through an unusually long full GC (96 s) at that moment. The pause caused its write to the JournalNodes to time out (the default timeout is 20 s), and the NameNode process exited. The JournalNode log confirms this: no further requests arrived from this NameNode afterwards, because the process had already died.
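The quorum-write mechanism described above can be sketched as follows. This is a simplified model, not Hadoop's actual QuorumJournalManager classes: the writer waits a bounded time for a majority of journal nodes to acknowledge an edit batch, so a JVM pause longer than that bound makes the wait fail even though the journal nodes themselves are healthy.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simplified sketch (hypothetical class, not Hadoop's real code) of a
// quorum write with a fixed timeout: success requires a strict majority
// of journal nodes to acknowledge within timeoutMs.
public class QuorumWriteSketch {
    static boolean sendEdits(int journals, int acked, long timeoutMs) {
        // Majority of `journals` must acknowledge.
        CountDownLatch quorum = new CountDownLatch(journals / 2 + 1);
        ExecutorService pool = Executors.newFixedThreadPool(journals);
        for (int i = 0; i < acked; i++) {
            pool.submit(quorum::countDown); // these journals respond in time
        }
        // Journals beyond `acked` never respond within the window.
        try {
            return quorum.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // 3 of 5 journals ack quickly: quorum reached.
        System.out.println(sendEdits(5, 3, 1000)); // true
        // Only 2 of 5 ack before the timeout: the write fails,
        // which FSEditLog treats as fatal for a required journal.
        System.out.println(sendEdits(5, 2, 200));  // false
    }
}
```

In the incident above the journal nodes all answered, but the 96 s GC pause meant the NameNode only observed the responses long after its 20 s window had expired.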

GC log:

2014-05-25T05:59:53.578+0800: 2478553.316: [Full GC [PSYoungGen: 2784K->0K(4179328K)] [PSOldGen: 28566944K->9208608K(28573696K)] 28569728K->9208608K(32753024K) [PSPermGen: 41405K->41405K(524288K)], 95.6706740 secs] [Times: user=95.62 sys=0.00, real=95.67 secs]
Heap
 PSYoungGen      total 4179328K, used 87037K [0x00007f70e5000000, 0x00007f71e5000000, 0x00007f71e5000000)
  eden space 4164800K, 2% used [0x00007f70e5000000,0x00007f70ea4ff590,0x00007f71e3330000)
  from space 14528K, 0% used [0x00007f71e3330000,0x00007f71e3330000,0x00007f71e4160000)
  to   space 14336K, 0% used [0x00007f71e4200000,0x00007f71e4200000,0x00007f71e5000000)
 PSOldGen        total 28573696K, used 9208608K [0x00007f6a15000000, 0x00007f70e5000000, 0x00007f70e5000000)
  object space 28573696K, 32% used [0x00007f6a15000000,0x00007f6c470c8078,0x00007f70e5000000)
 PSPermGen       total 524288K, used 41466K [0x00007f69f5000000, 0x00007f6a15000000, 0x00007f6a15000000)
  object space 524288K, 7% used [0x00007f69f5000000,0x00007f69f787e8f8,0x00007f6a15000000)
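The number that matters in the line above is `real=95.67 secs`, the wall-clock stop-the-world time, which is what should be compared against the 20 s journal-write timeout. A small sketch (a hypothetical monitoring helper, not part of Hadoop) of extracting it from a HotSpot GC log line:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pull the wall-clock pause out of a HotSpot
// "[Times: user=... sys=..., real=... secs]" GC log fragment, so an
// alert can fire when a pause exceeds the journal-write timeout.
public class GcPauseCheck {
    static final Pattern REAL = Pattern.compile("real=([0-9.]+) ?secs");

    static double realSeconds(String gcLine) {
        Matcher m = REAL.matcher(gcLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no real= field in: " + gcLine);
        }
        return Double.parseDouble(m.group(1));
    }

    public static void main(String[] args) {
        String line = "[Times: user=95.62 sys=0.00, real=95.67 secs]";
        double pauseSecs = realSeconds(line);
        // Longer than the default 20 s quorum-write timeout.
        System.out.println(pauseSecs > 20.0); // true
    }
}
```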

JournalNode log:

2014-05-25 05:58:45,432 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_inprogress_0000000001665741687 -> /home/vipshop/hard_disk/nn_shared_disk/bipcluster/current/edits_0000000001665741687-0000000001665777062
2014-05-25 06:44:54,299 ERROR org.apache.hadoop.hdfs.qjournal.server.JournalNode: RECEIVED SIGNAL 15: SIGTERM
2014-05-25 06:44:54,302 INFO org.apache.hadoop.hdfs.qjournal.server.JournalNode: SHUTDOWN_MSG:

NameNode error log:

2014-05-25 06:01:29,258 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 96275 ms (timeout=20000 ms) for a response for sendEdits. No responses yet.
2014-05-25 06:01:29,259 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 96145ms
GC pool 'PSMarkSweep' had collection(s): count=1 time=95670ms
GC pool 'PSScavenge' had collection(s): count=1 time=592ms
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96281ms to send a batch of 2 edits (358 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96283ms to send a batch of 2 edits (358 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 20019 Total time for transactions (ms): 436 Number of transactions batched in Syncs: 9441 Number of syncs: 5160 SyncTimes (ms): 416239305
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96275ms to send a batch of 13 edits (1638 bytes) to remote journal xxx:8485
2014-05-25 06:01:29,259 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [xxx:8485, xxx:8485, xxx:8485, xxx:8485, xxx:8485], stream=QuorumOutputStream starting at txid 1665777063))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:490)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:350)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:53)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:486)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:581)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:1879)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1845)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:439)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:207)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44942)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1752)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1748)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1438)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1746)
2014-05-25 06:01:29,259 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96276ms to send a batch of 13 edits (1638 bytes) to remote journal 10.201.202.21:8485
2014-05-25 06:01:29,263 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /tmp/hive-hdfs/hive_2014-05-25_05-59-16_296_5866222407565920996/_task_tmp.-ext-10002/_tmp.001379_0. BP-180494678-xxx-1366627257763 blk_-3717787322078480343_437088977{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[xxx:50010|RBW], ReplicaUnderConstruction[xxx:50010|RBW], ReplicaUnderConstruction[xxx:50010|RBW]]}
2014-05-25 06:01:29,263 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault: Not able to place enough replicas, still in need of 2 to reach 3. For more information, please enable DEBUG log level on org.apache.commons.logging.impl.Log4JLogger
2014-05-25 06:01:29,263 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 1665777063
……
2014-05-25 06:01:29,268 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Took 96285ms to send a batch of 13 edits (1638 bytes) to remote journal xxx:8485
……
2014-05-25 06:01:29,324 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-05-25 06:01:29,364 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
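The JvmPauseMonitor WARN in the log above comes from a simple technique: a daemon thread sleeps for a short, fixed interval and treats any extra elapsed time as a JVM or host pause (typically GC). A sketch of the idea (simplified and hypothetical, not Hadoop's actual org.apache.hadoop.util.JvmPauseMonitor):

```java
// Sketch of the pause-detection idea behind JvmPauseMonitor: sleep for
// SLEEP_MS, then compare wall-clock time actually elapsed; the excess
// is time the JVM was stopped (e.g. by a full GC).
public class PauseMonitorSketch {
    static final long SLEEP_MS = 500; // poll interval (assumed default)

    // Pure helper: extra time beyond the requested sleep is the pause.
    static long detectedPauseMs(long beforeMs, long afterMs) {
        return (afterMs - beforeMs) - SLEEP_MS;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = System.currentTimeMillis();
        Thread.sleep(SLEEP_MS);
        long pause = detectedPauseMs(before, System.currentTimeMillis());
        // On an idle JVM the detected pause is near zero; in the incident
        // above, a 500 ms sleep returned ~96 s later.
        System.out.println("detected pause: " + pause + " ms");
    }
}
```

This explains the "pause of approximately 96145ms" figure: the monitor thread's 500 ms sleep returned roughly 96.6 s after it started.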

When the write to the JournalNodes timed out, the terminate method of the ExitUtil class was invoked, which ends the current NameNode process. This is deliberate: a NameNode that cannot durably log edits to a required journal must stop rather than continue, so that the standby can take over with a consistent edit log.
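The fatal path can be sketched as follows. This is a simplified, hypothetical model, not Hadoop's real FSEditLog/ExitUtil code; the real ExitUtil.terminate ultimately calls System.exit (the sketch throws instead, so the behaviour is observable):

```java
// Simplified model of the "flush failed for required journal" path:
// a failed flush on a required journal is not retried -- the process
// terminates itself with a nonzero status.
public class RequiredJournalSketch {
    // Stand-in for the JVM actually exiting; the real ExitUtil.terminate
    // calls System.exit(status) after logging "Exiting with status ...".
    static class ExitException extends RuntimeException {
        final int status;
        ExitException(int status, String msg) { super(msg); this.status = status; }
    }

    static void terminate(int status, String msg) {
        throw new ExitException(status, msg);
    }

    // Mirrors the decision made in logSync: a required journal that
    // fails to flush is fatal for the whole NameNode.
    static void logSync(boolean flushSucceeded) {
        if (!flushSucceeded) {
            terminate(1, "flush failed for required journal");
        }
    }

    public static void main(String[] args) {
        logSync(true); // normal case: returns quietly
        try {
            logSync(false); // the timed-out quorum write
        } catch (ExitException e) {
            System.out.println("exiting with status " + e.status);
        }
    }
}
```

The "Exiting with status 1" line in the NameNode log above is exactly this path being taken.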
