搜索推荐有1个job,1000多个map,200个reduce,运行到最后只剩1个reduce(10.39.6.130上)的时候,出现以下异常,致使job失败:
2014⑴2-04 15:49:04,297 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 12 segments left of total size: 11503294914 bytes 2014⑴2-04 15:49:04,314 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 2014⑴2-04 15:49:04,394 INFO [main] org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.lzo_deflate] 2014⑴2-04 16:02:26,889 WARN [ResponseProcessor for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086 java.io.IOException: Bad response ERROR for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086 from datanode 10.39.5.193:50010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819) 2014⑴2-04 16:02:26,889 WARN [ResponseProcessor for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223 java.io.IOException: Bad response ERROR for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223 from datanode 10.39.1.90:50010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819) 2014⑴2-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/user-r-00158 block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086 in pipeline 10.39.6.130:50010, 10.39.5.185:50010, 10.39.5.193:50010: bad datanode 10.39.5.193:50010 2014⑴2-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/exposure-r-00158 block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223 in pipeline 10.39.6.130:50010, 10.39.1.89:50010, 10.39.1.90:50010: bad datanode 10.39.1.90:50010 java.io.EOFException: Premature EOF: no length prefix available at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987) at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796) 2014⑴2-04 16:05:23,743 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2734) at java.util.Vector.ensureCapacityHelper(Vector.java:226) at java.util.Vector.add(Vector.java:728) at rec.CommonUtil.pack_Treeset(CommonUtil.java:395) at rec.ConvertExposure10$MyReducer.collect_exposure(ConvertExposure10.java:259) at rec.ConvertExposure10$MyReducer.reduce(ConvertExposure10.java:329) at rec.ConvertExposure10$MyReducer.reduce(ConvertExposure10.java:234) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
2014⑴2-04 16:02:26,889 WARN [ResponseProcessor for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086 java.io.IOException: Bad response ERROR for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086 from datanode 10.39.5.193:50010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819) 2014⑴2-04 16:02:26,889 WARN [ResponseProcessor for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223 java.io.IOException: Bad response ERROR for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223 from datanode 10.39.1.90:50010 at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:819) 2014⑴2-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/user-r-00158 block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086 in pipeline 10.39.6.130:50010, 10.39.5.185:50010, 10.39.5.193:50010: bad datanode 10.39.5.193:50010 2014⑴2-04 16:02:26,891 WARN [DataStreamer for file /dw_ext/recmd/mds6/mds_filter_relation_10/20141203/_temporary/1/_temporary/attempt_1415948652989_195149_r_000158_3/exposure-r-00158 block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223] org.apache.hadoop.hdfs.DFSClient: Error Recovery for block BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394153869_320473223 in pipeline 10.39.6.130:50010, 10.39.1.89:50010, 10.39.1.90:50010: bad datanode 10.39.1.90:50010 java.io.EOFException: Premature EOF: no length prefix available at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987) at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176) at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:796)
2014⑴2-04 16:00:57,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.193:50010 remote=/10.39.5.185:58225] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at java.io.DataInputStream.read(DataInputStream.java:132) at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) at java.lang.Thread.run(Thread.java:662) 2014⑴2-04 16:00:57,424 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086, type=LAST_IN_PIPELINE, downstreams=0:[]: Thread is interrupted.
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.193:50010 remote=/10.39.5.185:58225]
java.io.EOFException: Premature EOF: no length prefix available at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1987) at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1083) at java.lang.Thread.run(Thread.java:662) 2014⑴2-04 16:00:58,008 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086 java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.185:50010 remote=/10.39.6.130:59083] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at java.io.DataInputStream.read(DataInputStream.java:132) at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) at java.lang.Thread.run(Thread.java:662) 2014⑴2-04 16:00:58,009 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086, type=HAS_DOWNSTREAM_IN_PIPELINE: Thread is interrupted.
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.5.185:50010 remote=/10.39.6.130:59083]
2014⑴2-04 16:00:59,987 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP⑴386326728⑴0.39.2.131⑴382089338395:blk_1394149732_320469086 java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.6.130:50010 remote=/10.39.6.130:45259] at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read1(BufferedInputStream.java:258) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at java.io.DataInputStream.read(DataInputStream.java:132) at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134) at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:446) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:702) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:739) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) at java.lang.Thread.run(Thread.java:662)
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.39.6.130:50010 remote=/10.39.6.130:45259]
在10.39.6.130上,根据任务id:attempt_1415948652989_195149_r_000158_3 找到进程id 31050,查看内存使用情况:
果然,JVM老年代被占满,不断履行FGC,直接stop the world,致使JVM没法对外提供服务,致使dfsclient挂起,1直不能往pipeline上面的节点写Packet,直到socket超时。
根据reduce任务最后的日志,也从侧面验证了这类观点:
Error running child : java.lang.OutOfMemoryError: Java heap space
既然是OOM致使的job失败,那是甚么对象致使的内存泄漏呢:履行:
总结:对该job出现的异常,1般在以下几种情况下产生:
1、写数据块的DN出现问题,不能写入,就像之前出现的DN由于本地读问题致使xceivers(每一个DN用于并发数据传输处理最大线程数)到达4096,耗尽了所有的线程,没法对新发起的输出写入做出相应。
2、网络出现异常,DN节点进或出的带宽被耗尽,致使数据写不出去或写不进来,这类情况可以视察ganglia看节点带宽使用情况,这类情况1般比较少。当该job出现问题的时候,也怀疑过是带宽被耗尽的问题,查看了1下相干节点ganglia带宽使用情况,最大in/out 85M/s左右,最后排除是带宽问题。
3、dfsclient出现问题,长时间没反应,致使已发起的socket超时。由于dfsclient情况比较复杂,出现问题的情况比较多,比如本问就是由于reduce出现内存溢出,jvm不断进行FGC,致使dfsclient挂起,终究socket出现超时。