There is a Hive table whose column delimiter is a colon (:). One of its columns, utime, holds a Unix timestamp that needs to be converted to a weekday and written to a new table.
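For reference, the timestamp-to-weekday conversion itself is a single call in Python; the sample value below is arbitrary:

```python
import datetime

# isoweekday(): Monday = 1 ... Sunday = 7
# (fromtimestamp() interprets the epoch value in the machine's local timezone)
print datetime.datetime.fromtimestamp(978903186).isoweekday()
```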

Use a Python script as the pipeline's Transform. The code of weekday.py is simple:

```python
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    uid, mid, rating, utime = line.split(':')
    weekday = datetime.datetime.fromtimestamp(float(utime)).isoweekday()
    print '\t'.join([uid, mid, rating, str(weekday)])
```

The HQL query is just as simple:

```sql
select transform(uid, mid, rating, utime)
using 'python weekday.py'
as (uid, mid, rating, weekday)
from rating;
```

But the job errors out as soon as Stage-1 finishes!

Troubleshooting:

1. The log Hive prints is not much help:

```
INFO exec.Task: 2015-07-07 16:34:57,938 Stage-1 map = 0%, reduce = 0%
INFO exec.Task: 2015-07-07 16:35:30,262 Stage-1 map = 100%, reduce = 0%
ERROR exec.Task: Ended Job = job_1431587697935_0210 with errors
ERROR operation.Operation: Error running hive query: org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 20001 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. An error occurred while reading or writing to your custom script. It may have crashed with an error.
	at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:315)
	at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:156)
	at org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
	at org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
```

2. Not ready to give up, I turned on DEBUG logging in Hive and looked again. Since I had been connecting through Beeline all along, the log was identical to the one in step 1; nothing gained. Then it occurred to me to try the Hive CLI instead, and that finally produced something meaningful:

```
Task with the most failures(4):
-----
Task ID:
  task_1431587697935_0210_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"uid":11,"mid":2791,"rating":4,"utime":"978903186"}
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:172)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"uid":11,"mid":2791,"rating":4,"utime":"978903186"}
	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:518)
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:163)
	... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20001]: An error occurred while reading or writing to your custom script. It may have crashed with an error.
	at org.apache.hadoop.hive.ql.exec.ScriptOperator.process(ScriptOperator.java:456)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
	at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:508)
Caused by: java.io.IOException: Broken pipe
	at java.io.FileOutputStream.writeBytes(Native Method)
	at java.io.FileOutputStream.write(FileOutputStream.java:345)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
	at java.io.DataOutputStream.write(DataOutputStream.java:107)
	at org.apache.hadoop.hive.ql.exec.TextRecordWriter.write(TextRecordWriter.java:53)
	at org.apache.hadoop.hive.ql.exec.ScriptOperator.process(ScriptOperator.java:425)
```

3. The suspicious row shown in that message made me suspect bad data that failed during processing. But when I copied the offending row into a separate table and processed it alone, everything worked fine. OK, keep going.

4. Before continuing I needed to understand: what is java.io.IOException: Broken pipe? It happens when one end is writing into a pipe while the other end has already quit, so the data in the pipe can never be picked up and the write fails. Here that means: while Hive's streaming layer was feeding input to weekday.py, weekday.py terminated abnormally; when the next chunk of data was ready, there was no weekday.py left to receive it, hence the Broken pipe. (A minimal sketch reproducing this follows the list.)

5. With Broken pipe understood and the earlier error messages in hand, I concluded the problem had to be in weekday.py. Since the failure happened inside MapReduce, the next place to look was the task's stderr on Yarn. Through the ResourceManager I opened the logs of the corresponding application and found in stderr:

```
Traceback (most recent call last):
  File "weekday_mapper.py", line 5, in <module>
    uid,mid,rating,utime=line.split(':')
ValueError: need more than 1 value to unpack
```

6. Judging from the Python error, some row's delimiter (:) must be off, so the split returns fewer than the expected 4 values (uid, mid, rating, utime). Yet every way I checked the data, the format looked fine. As a stopgap I wrapped the processing in an exception handler. With that in place the job no longer failed, but the Select returned 0 rows:

```python
import sys
import datetime

for line in sys.stdin:
    try:
        line = line.strip()
        uid, mid, rating, utime = line.split(':')
        weekday = datetime.datetime.fromtimestamp(float(utime)).isoweekday()
        print '\t'.join([uid, mid, rating, str(weekday)])
    except Exception, ex:
        pass
```

7. The problem was now pinned down to how the script handles the data. I pulled the table's data file straight off HDFS and ran it through the script by hand, and it worked perfectly:

```
hdfs dfs -cat /user/hive/warehouse/test.db/t/000000_0 | python /tmp/weekday_mapper.py
```
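As promised in step 4, the Broken pipe is easy to reproduce outside Hive. A minimal Python 2 sketch (no Hive involved; the child process here is a hypothetical stand-in for a crashing weekday.py):

```python
import subprocess

# Stand-in for weekday.py: read one line, then exit with an error,
# much like the unpacking failure did.
child = subprocess.Popen(
    ['python', '-c', 'import sys; sys.stdin.readline(); sys.exit(1)'],
    stdin=subprocess.PIPE)

try:
    for i in xrange(1000000):
        # Once the child has exited, the read end of the pipe is gone and the
        # next write fails with EPIPE -- the "Broken pipe" Hive reported.
        child.stdin.write('line %d\n' % i)
except IOError, ex:
    print 'writer got:', ex        # e.g. [Errno 32] Broken pipe
finally:
    child.wait()
```

Note that it is the writer that gets the exception, which matches the stack trace above: Hive's TextRecordWriter (the writing side) throws, even though the root cause is the reader, the script, dying.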

Finally, I began to suspect that the format Transform feeds to the script might not match the table's own definition at all, and checked the official documentation:

> By default, columns will be transformed to STRING and delimited by TAB before feeding to the user script.
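That settles it: whatever delimiter the underlying table uses, Transform hands the script TAB-separated STRING fields on stdin. A quick check with the failing row from the task log above:

```python
# The failing row from the task log, as Hive's Transform actually sends it:
row = ['11', '2791', '4', '978903186']
fed = '\t'.join(row)                # columns cast to STRING, joined by TAB

print fed.split(':')                # -> ['11\t2791\t4\t978903186']: one element,
                                    #    so unpacking into four names raises
                                    #    "need more than 1 value to unpack"
print fed.split('\t')               # -> ['11', '2791', '4', '978903186']
```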

So I changed `uid,mid,rating,utime=line.split(':')` in the script to `uid,mid,rating,utime=line.split('\t')` and ran it once more. Success!
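For the record, here is the final working weekday.py as I reconstruct it. The stderr logging is my own addition, not part of the original fix: a Transform script's stderr ends up in the Yarn task logs, so bad rows surface there instead of vanishing behind a bare `pass`:

```python
import sys
import datetime

for line in sys.stdin:
    try:
        line = line.strip()
        # Transform feeds TAB-delimited fields, regardless of the table's delimiter.
        uid, mid, rating, utime = line.split('\t')
        weekday = datetime.datetime.fromtimestamp(float(utime)).isoweekday()
        print '\t'.join([uid, mid, rating, str(weekday)])
    except Exception, ex:
        # Log the raw line to stderr (visible in the task logs) rather than
        # silently dropping it -- this would have exposed the bug immediately.
        sys.stderr.write('skipping bad line %r: %s\n' % (line, ex))
```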

Summary

1. Fundamentals matter. Only when they form a coherent system in your mind can you draw on them at will. A long road lies ahead!

2. An experience-based guess can help a great deal, but it can also outsmart itself. So take the logs seriously, and treat them as the ground truth for reproducing what actually happened.
