Using the DirectPipelineRunner, I'd like to run my pipeline locally for debugging purposes. I'm using SDK 1.9.0 with Java 8.

My pipeline reads a table from BigQuery, transforms some fields, and writes back to BigQuery.

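For context, the wiring looks roughly like this (a minimal sketch rather than my actual code; the table names, schema, and transform body are placeholders):

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class DebugPipeline {
  public static void main(String[] args) {
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation().as(DataflowPipelineOptions.class);
    // Swapping this for DataflowPipelineRunner.class is the only change
    // between the local run and the GCP run.
    options.setRunner(DirectPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(BigQueryIO.Read.named("ReadInput")
            .from("my-project:my_dataset.input_table"))    // placeholder table
     .apply(ParDo.of(new DoFn<TableRow, TableRow>() {
        @Override
        public void processElement(ProcessContext c) {
          TableRow row = c.element().clone();
          // ... transform some fields here ...
          c.output(row);
        }
      }))
     .apply(BigQueryIO.Write.named("WriteOutput")
        .to("my-project:my_dataset.output_table")          // placeholder table
        .withSchema(new TableSchema())                     // placeholder schema
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));
    p.run();
  }
}
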
Running on GCP, i.e. using the DataflowPipelineRunner, works absolutely fine. However, when I use the DirectPipelineRunner, it just keeps spitting out the following log info and does nothing else:

19:45:05,470 21866 [main] INFO  com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner - Executing pipeline using the DirectPipelineRunner.
19:45:18,594 34990 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_c88ee6741e434aabbf50e73d4e6733d1-extract found.
19:45:27,344 43740 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_012dca76d75e461480fe75897b5fa7ba-extract found.
19:45:38,150 54546 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_3548a0ee373a417e8e7570ae90aef78d-extract found.
19:45:47,912 64308 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_db0b957250ef41279a639bdc113c5493-extract found.
19:45:56,685 73081 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_3773e0643ec14475aaa140bcf46ea7af-extract found.
19:46:45,958 122354 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_27af9a1163944cb19e520242de98d899-extract found.
19:46:55,766 132162 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_5473e6702b3544118c7da8877c900f7a-extract found.
19:47:04,015 140411 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_40f47d35aa154708a6fc684c8ffb0ba4-extract found.
19:47:11,913 148309 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_6dce34301c97498884d7344b85a1b07e-extract found.
19:47:35,809 172205 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_4f7c26d372974095a24ac58b547c13d6-extract found.
19:47:45,136 181532 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_a7c33e75bfdb41a6990dd66810a0d44a-extract found.
19:47:55,802 192198 [main] INFO  com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl - No BigQuery job with job id beam_job_a1d7422ca42a4b1d96205bf8c6dada9d-extract found.

The log message is coming from here:

@VisibleForTesting
public Job getJob(JobReference jobRef, Sleeper sleeper, BackOff backoff)
    throws IOException, InterruptedException {
  String jobId = jobRef.getJobId();
  Exception lastException;
  do {
    try {
      return client.jobs().get(jobRef.getProjectId(), jobId).execute();
    } catch (GoogleJsonResponseException e) {
      if (errorExtractor.itemNotFound(e)) {
        LOG.info("No BigQuery job with job id {} found.", jobId);
        return null;
      }
      // ... (remainder elided; the method records the exception and
      // retries according to the supplied Sleeper/BackOff)

Eventually, the JVM runs out of memory:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.toLowerCase(String.java:2647)
at com.google.api.client.json.JsonParser.parseValue(JsonParser.java:847)
at com.google.api.client.json.JsonParser.parse(JsonParser.java:472)
at com.google.api.client.json.JsonParser.parseValue(JsonParser.java:781)
at com.google.api.client.json.JsonParser.parseArray(JsonParser.java:648)
at com.google.api.client.json.JsonParser.parseValue(JsonParser.java:740)
at com.google.api.client.json.JsonParser.parse(JsonParser.java:472)
at com.google.api.client.json.JsonParser.parseValue(JsonParser.java:781)
at com.google.api.client.json.JsonParser.parseArray(JsonParser.java:648)
at com.google.api.client.json.JsonParser.parseValue(JsonParser.java:740)
at com.google.api.client.json.JsonParser.parse(JsonParser.java:472)
at com.google.api.client.json.JsonParser.parseValue(JsonParser.java:781)
at com.google.api.client.json.JsonParser.parse(JsonParser.java:382)
at com.google.api.client.json.JsonParser.parse(JsonParser.java:355)
at com.google.api.client.json.JsonObjectParser.parseAndClose(JsonObjectParser.java:87)
at com.google.api.client.json.JsonObjectParser.parseAndClose(JsonObjectParser.java:81)
at com.google.api.client.http.HttpResponse.parseAs(HttpResponse.java:459)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.dataflow.sdk.util.BigQueryTableRowIterator.executeWithBackOff(BigQueryTableRowIterator.java:497)
at com.google.cloud.dataflow.sdk.util.BigQueryTableRowIterator.advance(BigQueryTableRowIterator.java:180)
at com.google.cloud.dataflow.sdk.util.BigQueryServicesImpl$BigQueryJsonReaderImpl.advance(BigQueryServicesImpl.java:555)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$BigQuerySourceBase$BigQueryReader.advance(BigQueryIO.java:1331)
at com.google.cloud.dataflow.sdk.io.Read$Bounded$1.evaluateReadHelper(Read.java:180)
at com.google.cloud.dataflow.sdk.io.Read$Bounded$1.evaluate(Read.java:168)
at com.google.cloud.dataflow.sdk.io.Read$Bounded$1.evaluate(Read.java:164)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner$Evaluator.visitTransform(DirectPipelineRunner.java:858)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:221)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:217)
at com.google.cloud.dataflow.sdk.runners.TransformTreeNode.visit(TransformTreeNode.java:217)
at com.google.cloud.dataflow.sdk.runners.TransformHierarchy.visit(TransformHierarchy.java:103)

The table in BigQuery only has 100 rows (it's just for debugging).

What is the problem here?

1 Answer

I believe the BigQuery message is a red herring; the stack trace of the OOM indicates that data is being read directly from the table, not via an export job.

The DirectPipelineRunner is not at all optimized for memory utilization; try using the newer InProcessPipelineRunner. Additionally, it may be worth using standard Java heap profiling tools to see where the memory is being used.

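For example, assuming your options are wired up as in the question, the switch should just be (sketch; if I recall the package correctly):

import com.google.cloud.dataflow.sdk.runners.inprocess.InProcessPipelineRunner;

options.setRunner(InProcessPipelineRunner.class);

For the heap profiling, running the JVM with -XX:+HeapDumpOnOutOfMemoryError and opening the resulting .hprof in a profiler, or taking a quick live histogram with jmap -histo <pid>, should show what is accumulating.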