These metrics are based on simple batch pipelines. They are intended to compare performance between I/O connectors, and are not necessarily representative of real-world pipelines. Dataflow pipeline performance is complex, and is a function of VM type, the data being processed, the performance of external sources and sinks, and user code. The metrics are based on running the Java SDK and aren't representative of the performance characteristics of other language SDKs. For more information, see [Beam IO Performance](https://beam.apache.org/performance/).
Best practices
--------------
- For new pipelines, use the [`BigtableIO`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.html) connector, not `CloudBigtableIO`. A minimal read sketch follows this list.

- Create separate [app profiles](/bigtable/docs/app-profiles) for each type of pipeline. App profiles enable better metrics for differentiating traffic between pipelines, both for support and for tracking usage. The read sketch after this list sets one.

- Monitor the Bigtable nodes. If you experience performance bottlenecks, check whether resources such as CPU utilization are constrained within Bigtable. For more information, see [Monitoring](/bigtable/docs/monitoring-instance).

- In general, the default timeouts are well tuned for most pipelines. If a streaming pipeline appears to get stuck reading from Bigtable, try calling [`withAttemptTimeout`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.Read.html#withAttemptTimeout-org.joda.time.Duration-) to adjust the attempt timeout, as in the timeout sketch after this list.

- Consider enabling [Bigtable autoscaling](/bigtable/docs/autoscaling), or resize the Bigtable cluster to scale with the size of your Dataflow jobs.

- Consider setting [`maxNumWorkers`](/dataflow/docs/reference/pipeline-options#resource_utilization) on the Dataflow job to limit load on the Bigtable cluster; see the worker-cap sketch after this list.

- If significant processing is done on a Bigtable element before a shuffle, calls to Bigtable might time out. In that case, you can call [`withMaxBufferElementCount`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.Read.html#withMaxBufferElementCount-java.lang.Integer-) to buffer elements. This method converts the read operation from streaming to paginated, which avoids the issue; see the buffering sketch after this list.

- If you use a single Bigtable cluster for both streaming and batch pipelines, and performance degrades on the Bigtable side, consider setting up replication on the cluster. Then separate the batch and streaming pipelines so that they read from different replicas. For more information, see [Replication overview](/bigtable/docs/replication-overview).
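The following is a minimal sketch of a batch read with the `BigtableIO` connector. The project, instance, table, and app profile IDs (`my-project`, `my-instance`, `my-table`, `dataflow-batch`) are hypothetical placeholders; the app profile ID illustrates the one-profile-per-pipeline-type practice above.

```java
import com.google.bigtable.v2.Row;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BigtableReadExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    // Read rows with BigtableIO (not CloudBigtableIO). The IDs below are
    // placeholders; "dataflow-batch" stands in for an app profile created
    // specifically for this pipeline type.
    PCollection<Row> rows =
        pipeline.apply(
            "ReadFromBigtable",
            BigtableIO.read()
                .withProjectId("my-project")
                .withInstanceId("my-instance")
                .withTableId("my-table")
                .withAppProfileId("dataflow-batch"));

    // Downstream processing; here, just extract each row key as a string.
    rows.apply(
        "ExtractRowKeys",
        MapElements.into(TypeDescriptors.strings())
            .via((Row row) -> row.getKey().toStringUtf8()));

    pipeline.run().waitUntilFinish();
  }
}
```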
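A timeout sketch, reusing the same placeholder IDs. The 60-second value is illustrative, not a recommendation; start from the defaults and adjust only if reads stall. A companion `withOperationTimeout` setting for bounding the whole operation across retries is documented alongside `withAttemptTimeout` in the `BigtableIO.Read` javadoc.

```java
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.joda.time.Duration;

// Give each individual read attempt up to 60 seconds (illustrative value)
// before it is retried, instead of the default attempt timeout.
BigtableIO.Read read =
    BigtableIO.read()
        .withProjectId("my-project")    // placeholder
        .withInstanceId("my-instance")  // placeholder
        .withTableId("my-table")        // placeholder
        .withAttemptTimeout(Duration.standardSeconds(60));
```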
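A worker-cap sketch. The same cap can be passed on the command line as `--maxNumWorkers=10`; the value 10 is illustrative, and this fragment belongs inside a `main(String[] args)` method like the read sketch above.

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

// Parse standard options, then view them as Dataflow runner options.
DataflowPipelineOptions options =
    PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);

// Cap autoscaling at 10 workers (illustrative) so the job cannot
// overwhelm the Bigtable cluster with read traffic.
options.setMaxNumWorkers(10);
```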
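A buffering sketch for the shuffle-timeout case, again with placeholder IDs and an illustrative buffer size.

```java
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;

// Buffer at most 10,000 elements (illustrative), converting the read
// from a long-lived stream into paginated requests so that slow
// per-element processing before a shuffle cannot time out the
// Bigtable call.
BigtableIO.Read bufferedRead =
    BigtableIO.read()
        .withProjectId("my-project")    // placeholder
        .withInstanceId("my-instance")  // placeholder
        .withTableId("my-table")        // placeholder
        .withMaxBufferElementCount(10_000);
```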
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-09-04(UTC)"],[[["\u003cp\u003eUse the Apache Beam Bigtable I/O connector to read data from Bigtable to Dataflow, considering Google-provided Dataflow templates as an alternative depending on your specific use case.\u003c/p\u003e\n"],["\u003cp\u003eParallelism in reading Bigtable data is governed by the number of nodes in the Bigtable cluster, with each node managing key ranges.\u003c/p\u003e\n"],["\u003cp\u003ePerformance metrics for Bigtable read operations on one \u003ccode\u003ee2-standard2\u003c/code\u003e worker using Apache Beam SDK 2.48.0 for Java, show a throughput of 180 MBps or 170,000 elements per second for 100M records, 1 kB, and 1 column, noting that real-world pipeline performance may vary.\u003c/p\u003e\n"],["\u003cp\u003eFor new pipelines, use the \u003ccode\u003eBigtableIO\u003c/code\u003e connector instead of \u003ccode\u003eCloudBigtableIO\u003c/code\u003e, and create separate app profiles for each pipeline type for better traffic differentiation and tracking.\u003c/p\u003e\n"],["\u003cp\u003eBest practices for pipeline optimization include monitoring Bigtable node resources, adjusting timeouts as needed, considering Bigtable autoscaling or resizing, and potentially using replication to separate batch and streaming pipelines for improved performance.\u003c/p\u003e\n"]]],[],null,["# Read from Bigtable to Dataflow\n\nTo read data from Bigtable to Dataflow, use the\nApache Beam [Bigtable I/O connector](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigtable/package-summary.html).\n| **Note:** Depending on your scenario, consider using one of the [Google-provided Dataflow templates](/dataflow/docs/guides/templates/provided-templates). Several of these read from Bigtable.\n\nParallelism\n-----------\n\nParallelism is controlled by the number of\n[nodes](/bigtable/docs/instances-clusters-nodes#nodes) in the\nBigtable cluster. Each node manages one or more key ranges,\nalthough key ranges can move between nodes as part of\n[load balancing](/bigtable/docs/overview#load-balancing). For more information,\nsee [Reads and performance](/bigtable/docs/reads#performance) in the\nBigtable documentation.\n\nYou are charged for the number of nodes in your instance's clusters. See\n[Bigtable pricing](/bigtable/pricing).\n\nPerformance\n-----------\n\nThe following table shows performance metrics for Bigtable read\noperations. The workloads were run on one `e2-standard2` worker, using the\nApache Beam SDK 2.48.0 for Java. They did not use Runner v2.\n\n\nThese metrics are based on simple batch pipelines. They are intended to compare performance\nbetween I/O connectors, and are not necessarily representative of real-world pipelines.\nDataflow pipeline performance is complex, and is a function of VM type, the data\nbeing processed, the performance of external sources and sinks, and user code. Metrics are based\non running the Java SDK, and aren't representative of the performance characteristics of other\nlanguage SDKs. 
For more information, see [Beam IO\nPerformance](https://beam.apache.org/performance/).\n\n\u003cbr /\u003e\n\nBest practices\n--------------\n\n- For new pipelines, use the [`BigtableIO`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.html) connector, not\n `CloudBigtableIO`.\n\n- Create separate [app profiles](/bigtable/docs/app-profiles) for each type of\n pipeline. App profiles enable better metrics for differentiating traffic\n between pipelines, both for support and for tracking usage.\n\n- Monitor the Bigtable nodes. If you experience performance\n bottlenecks, check whether resources such as CPU utilization are constrained\n within Bigtable. For more information, see\n [Monitoring](/bigtable/docs/monitoring-instance).\n\n- In general, the default timeouts are well tuned for most pipelines. If a\n streaming pipeline appears to get stuck reading from Bigtable,\n try calling [`withAttemptTimeout`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.Read.html#withAttemptTimeout-org.joda.time.Duration-) to adjust the attempt\n timeout.\n\n- Consider enabling\n [Bigtable autoscaling](/bigtable/docs/autoscaling), or resize\n the Bigtable cluster to scale with the size of your\n Dataflow jobs.\n\n- Consider setting\n [`maxNumWorkers`](/dataflow/docs/reference/pipeline-options#resource_utilization)\n on the Dataflow job to limit load on the\n Bigtable cluster.\n\n- If significant processing is done on a Bigtable element before\n a shuffle, calls to Bigtable might time out. In that case, you\n can call [`withMaxBufferElementCount`](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.Read.html#withMaxBufferElementCount-java.lang.Integer-) to buffer\n elements. This method converts the read operation from streaming to paginated,\n which avoids the issue.\n\n- If you use a single Bigtable cluster for both streaming and\n batch pipelines, and the performance degrades on the Bigtable\n side, consider setting up replication on the cluster. Then separate the batch\n and streaming pipelines, so that they read from different replicas. For more\n information, see [Replication overview](/bigtable/docs/replication-overview).\n\nWhat's next\n-----------\n\n- Read the [Bigtable I/O connector](https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigtable/package-summary.html) documentation.\n- See the list of [Google-provided templates](/dataflow/docs/guides/templates/provided-templates)."]]