The Hive-BigQuery connector implements the Hive Storage Handler API to allow Hive workloads to integrate with BigQuery and BigLake tables. The Hive execution engine handles compute operations, such as aggregates and joins, and the connector manages interactions with data stored in BigQuery or in BigLake-connected Cloud Storage buckets.
The following diagram illustrates how the Hive-BigQuery connector fits between the compute and data layers.
Use cases
Here are some of the ways the Hive-BigQuery connector can help you in common data-driven scenarios:
Data migration. Suppose you plan to move your Hive data warehouse to BigQuery, then incrementally translate your Hive queries into the BigQuery SQL dialect. You expect the migration to take a significant amount of time because of the size of your data warehouse and the large number of connected applications, and you need to ensure continuity during the migration operations. Here's the workflow:
You move your data to BigQuery.
Using the connector, you access and run your original Hive queries while you gradually translate them into the BigQuery ANSI-compliant SQL dialect.
After completing the migration and translation, you retire Hive.
Hive and BigQuery workflows. Suppose you plan to use Hive for some tasks, and BigQuery for workloads that benefit from its features, such as BigQuery BI Engine or BigQuery ML. You use the connector to join Hive tables to your BigQuery tables.
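For example, you could declare a Hive table that the connector maps to a BigQuery table and then join it with a native Hive table in a single query. The following is a minimal sketch: the project, dataset, table, and column names are hypothetical placeholders, and the storage handler class and the 'bq.table' property follow the same pattern as the partitioned-table example later on this page.

```sql
-- Declare a Hive table that the connector maps to an existing BigQuery table.
-- The project, dataset, table, and column names are hypothetical placeholders.
CREATE TABLE orders_bq (order_id BIGINT, customer_id BIGINT, amount DOUBLE)
STORED BY 'com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler'
TBLPROPERTIES (
  'bq.table'='myproject.mydataset.orders'
);

-- Join the BigQuery-backed table with a native Hive table in one query;
-- the Hive engine runs the join while the connector reads the BigQuery side.
SELECT c.customer_name, SUM(o.amount) AS total_spend
FROM customers_hive c
JOIN orders_bq o ON o.customer_id = c.customer_id
GROUP BY c.customer_name;
```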
Reliance on an open source software (OSS) stack. To avoid vendor lock-in, you use a full OSS stack for your data warehouse. Here's your data plan:
You migrate your data in its original OSS format, such as Avro, Parquet, or ORC, to Cloud Storage buckets using a BigLake connection.
You continue to use Hive to execute and process your Hive SQL dialect queries.
You use the connector as needed to connect to BigQuery to benefit from the following features:
Metadata caching for query performance
Data loss prevention
Column-level access control
Dynamic data masking for security and governance at scale
Features
You can use the Hive-BigQuery connector to work with your BigQuery data and accomplish the following tasks:
Run queries with MapReduce and Tez execution engines.
Create and delete BigQuery tables from Hive.
Join BigQuery and BigLake tables with Hive tables.
Perform fast reads from BigQuery tables using Storage Read API streams and the Apache Arrow format.
Write data to BigQuery using the following methods:
Direct writes using the BigQuery Storage Write API in pending mode. Use this method for workloads that require low write latency, such as near-real-time dashboards with short refresh time windows.
Indirect writes by staging temporary Avro files to Cloud Storage and then loading the files into a destination table using the Load Job API. This method is less expensive than the direct method because BigQuery load jobs don't accrue charges. Because this method is slower, it is best suited to workloads that aren't time critical.
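As a rough illustration of choosing between the two write paths, the following sketch sets the write method at the Hive session level. The property names ('bq.write.method', 'bq.temp.gcs.path') and their values are assumptions based on the connector's README and may differ between connector versions, and the bucket path and table names are hypothetical placeholders; verify the properties against the README before relying on them.

```sql
-- Assumption: 'bq.write.method' selects the write path ('direct' or 'indirect');
-- verify the exact property names against the connector README.

-- Direct writes through the Storage Write API (lowest write latency):
SET bq.write.method=direct;

-- Indirect writes that stage temporary Avro files in Cloud Storage before a
-- load job; 'bq.temp.gcs.path' and the bucket below are hypothetical placeholders:
SET bq.write.method=indirect;
SET bq.temp.gcs.path=gs://my-staging-bucket/hive-bq-tmp;

-- Either way, a regular Hive INSERT then writes to the BigQuery-backed table:
INSERT INTO my_hive_table SELECT * FROM my_source_table;
```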
Access BigQuery time-partitioned and clustered tables. The following example defines the relation between a Hive table and a table that is partitioned and clustered in BigQuery.
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-08-27(UTC)"],[[["\u003cp\u003eThe Hive-BigQuery connector enables Apache Hive workloads to interact with data in BigQuery and BigLake tables, allowing for data storage in either BigQuery or open-source formats on Cloud Storage.\u003c/p\u003e\n"],["\u003cp\u003eThis connector is beneficial for migrating from Hive to BigQuery, utilizing both Hive and BigQuery in tandem, or maintaining an entirely open-source data warehouse stack.\u003c/p\u003e\n"],["\u003cp\u003eUsing the Hive Storage Handler API, the connector manages data interactions, while Hive handles compute operations, like aggregates and joins, offering integration between the two platforms.\u003c/p\u003e\n"],["\u003cp\u003eThe connector supports direct writes to BigQuery for low-latency needs or indirect writes via temporary Avro files for cost-effective, non-time-critical operations.\u003c/p\u003e\n"],["\u003cp\u003eFeatures of the Hive-BigQuery connector include running queries with MapReduce and Tez engines, creating/deleting BigQuery tables from Hive, joining BigQuery and Hive tables, fast reads using the Storage Read API, column pruning, and predicate pushdowns for performance optimization.\u003c/p\u003e\n"]]],[],null,["The open source\n[Hive-BigQuery connector](https://github.com/GoogleCloudDataproc/hive-bigquery-connector)\nlets your [Apache Hive](https://hive.apache.org/)\nworkloads read and write data from and to [BigQuery](/bigquery) and\n[BigLake](/biglake) tables. You can store data in\nBigQuery storage or in open source data formats on\nCloud Storage.\n| Use the connector to work with Hive and BigQuery together or to migrate your data warehouse from Hive to BigQuery.\n\nThe Hive-BigQuery connector implements the\n[Hive Storage Handler API](https://cwiki.apache.org/confluence/display/Hive/StorageHandlers)\nto allow Hive workloads to integrate with BigQuery and BigLake\ntables. The Hive execution engine handles compute operations, such\nas aggregates and joins, and the connector manages interactions with\ndata stored in BigQuery or in BigLake-connected\nCloud Storage buckets.\n\nThe following diagram illustrates how Hive-BigQuery connector\nfits between the compute and data layers.\n\nUse cases\n\nHere are some of the ways the Hive-BigQuery connector can help you in\ncommon data-driven scenarios:\n\n- Data migration. You plan to move your Hive data warehouse to BigQuery,\n then incrementally translate your Hive queries into BigQuery SQL dialect.\n You expect the migration to take a significant amount of time due to the size\n of your data warehouse and the large number of connected applications, and\n you need to ensure continuity during the migration operations. Here's the\n workflow:\n\n 1. You move your data to BigQuery\n 2. Using the connector, you access and run your original Hive queries while you gradually translate the Hive queries to BigQuery ANSI-compliant SQL dialect.\n 3. After completing the migration and translation, you retire Hive.\n- Hive and BigQuery workflows. 
You plan to use\n Hive for some tasks, and BigQuery for workloads that benefit\n from its features, such as [BigQuery BI Engine](/bigquery/docs/bi-engine-intro) or\n [BigQuery ML](/bigquery/docs/bqml-introduction). You use\n the connector to join Hive tables to your BigQuery tables.\n\n- Reliance on an open source software (OSS) stack. To avoid vendor lock-in,\n you use a full OSS stack for your data warehouse. Here's your data plan:\n\n 1. You migrate your data in its original OSS format, such as Avro, Parquet, or\n ORC, to Cloud Storage buckets using a BigLake connection.\n\n 2. You continue to use Hive to execute and process your Hive SQL dialect queries.\n\n 3. You use the connector as needed to connect to BigQuery\n to benefit from the following features:\n\n - [Metadata caching](/bigquery/docs/biglake-intro#metadata_caching_for_performance) for query performance\n - [Data loss prevention](/bigquery/docs/scan-with-dlp)\n - [Column-level access control](/bigquery/docs/column-level-security-intro)\n - [Dynamic data masking](/bigquery/docs/column-data-masking-intro) for security and governance at scale.\n\nFeatures\n\nYou can use the Hive-BigQuery connector to work with your\nBigQuery data and accomplish the following tasks:\n\n- Run queries with MapReduce and Tez execution engines.\n- Create and delete BigQuery tables from Hive.\n- Join BigQuery and BigLake tables with Hive tables.\n- Perform fast reads from BigQuery tables using the [Storage Read API](/bigquery/docs/reference/storage) streams and the [Apache Arrow](https://arrow.apache.org/) format\n- Write data to BigQuery using the following methods:\n - Direct writes using the BigQuery [Storage Write API in pending mode](/bigquery/docs/write-api-batch). Use this method for workloads that require low write latency, such as near-real-time dashboards with short refresh time windows.\n - Indirect writes by staging temporary Avro files to Cloud Storage, and then loading the files into a destination table using the [Load Job API](/bigquery/docs/batch-loading-data). This method is less expensive than the direct method, since BigQuery load jobs don't accrue charges. Since this method is slower, and finds its best use in workloads that aren't time critical\n- Access BigQuery [time-partitioned](/bigquery/docs/partitioned-tables)\n and [clustered](/bigquery/docs/clustered-tables) tables. The following example\n defines the relation between a Hive table and a table\n that is partitioned and clustered in BigQuery.\n\n ```sql\n CREATE TABLE my_hive_table (int_val BIGINT, text STRING, ts TIMESTAMP)\n STORED BY 'com.google.cloud.hive.bigquery.connector.BigQueryStorageHandler'\n TBLPROPERTIES (\n 'bq.table'='myproject.mydataset.mytable',\n 'bq.time.partition.field'='ts',\n 'bq.time.partition.type'='MONTH',\n 'bq.clustered.fields'='int_val,text'\n );\n ```\n- Prune columns to avoid retrieving unnecessary columns from the data layer.\n\n- Use predicate pushdowns to pre-filter data rows at the BigQuery storage\n layer. 
This technique can significantly improve overall query performance by\n reducing the amount of data traversing the network.\n\n- Automatically convert Hive data types to BigQuery data types.\n\n- Read BigQuery [views](/bigquery/docs/views-intro) and\n [table snapshots](/bigquery/docs/table-snapshots-intro).\n\n- Integrate with Spark SQL.\n\n- Integrate with Apache Pig and HCatalog.\n\nGet started\n\nSee the instructions to\n[install and configure the Hive-BigQuery connector on a Hive cluster](https://github.com/GoogleCloudDataproc/hive-bigquery-connector/blob/main/README.md)."]]