BigQuery는 여러 가지 데이터 모델링 방법에서 사용 가능하며 일반적으로 여러 데이터 모델 방법론에서 고성능을 제공합니다. 성능을 위해 데이터 모델을 더 세부적으로 조정할 때 고려할 수 있는 한 가지 방법은 데이터 비정규화입니다. 비정규화에서는 단일 테이블에 데이터 열을 추가하여 테이블 조인을 줄이거나 삭제합니다.
권장사항: 중첩 및 반복 필드를 사용하여 데이터 스토리지를 비정규화하고 쿼리 성능을 높이세요.
비정규화는 이전에 정규화된 관계형 데이터 세트의 읽기 성능을 높이기 위한 일반적인 전략입니다. BigQuery에서 데이터를 비정규화할 때는 중첩되고 반복되는 입력란을 사용하는 것이 좋습니다. 이 전략은 관계가 계층적이며 부모-자식 관계와 같이 자주 쿼리되는 경우 사용하는 것이 가장 좋습니다.
현대적인 시스템에서는 정규화된 데이터 사용에 따른 스토리지 비용 절감 효과가 크지 않습니다. 비정규화된 데이터 사용은 스토리지 비용을 높이지만 이를 상쇄하는 성능상의 이점을 제공합니다. 조인에는 데이터 조정(통신 대역폭)이 필요합니다. 비정규화는 데이터를 개별 슬롯에 로컬화하므로 병렬 실행이 가능합니다.
데이터의 비정규화와 동시에 관계를 유지해야 한다면 데이터를 완전히 병합하는 대신 중첩 및 반복 필드를 사용하면 됩니다. 관계형 데이터가 완전히 병합되면 네트워크 통신(무작위 섞기)이 쿼리 성능에 악영향을 줄 수 있습니다.
예를 들어 중첩 및 반복 필드를 사용하지 않고 주문 스키마를 비정규화하면(일대다 관계가 존재하는 경우) order_id처럼 필드별로 그룹화해야 할 수도 있습니다. 무작위 섞기가 동원되기 때문에 데이터 그룹화는 중첩 및 반복 필드를 이용한 데이터 비정규화보다 효과가 떨어집니다.
중첩 및 반복 필드로 데이터를 비정규화해도 성능이 향상되지 않을 수 있습니다. 예를 들어 별표 스키마는 일반적으로 분석에 최적화된 스키마이므로 추가로 비정규화하려고 해도 성능이 크게 달라지지 않을 수 있습니다.
중첩 및 반복 필드 사용
BigQuery는 완전한 병합 비정규화를 요구하지 않습니다. 중첩 및 반복 필드를 이용해 관계를 유지할 수 있습니다.
데이터 중첩(STRUCT)
데이터를 중첩하면 외부 항목을 인라인으로 표현할 수 있습니다.
중첩 데이터를 쿼리하면 조인을 사용하는 구문과 유사한 '점' 구문을 사용하여 리프 필드를 참조합니다.
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-09-04(UTC)"],[[["\u003cp\u003eDenormalizing data using nested and repeated fields in BigQuery can significantly improve query performance by reducing the need for table joins and data shuffling.\u003c/p\u003e\n"],["\u003cp\u003eNesting data with \u003ccode\u003eSTRUCT\u003c/code\u003e types allows for representing foreign entities inline, while repeated data with \u003ccode\u003eARRAY\u003c/code\u003e types preserves one-to-many relationships without high cardinality, both improving query performance.\u003c/p\u003e\n"],["\u003cp\u003eUsing nested and repeated fields can reduce computation overhead from operations like \u003ccode\u003eGROUP BY\u003c/code\u003e, which can occur when data is completely flattened and not using nested fields.\u003c/p\u003e\n"],["\u003cp\u003eNested and repeated data, represented as an \u003ccode\u003eARRAY\u003c/code\u003e of \u003ccode\u003eSTRUCT\u003c/code\u003es, offers a way to maintain data relationships and avoid the performance penalties associated with extensive network communication when data is completely flattened.\u003c/p\u003e\n"],["\u003cp\u003eThe provided example demonstrated that using nested fields can result in a query taking less time to run, as well as processing less data, compared to the version that does not use them.\u003c/p\u003e\n"]]],[],null,["# Use nested and repeated fields\n==============================\n\nBigQuery can be used with many different data modelling methods, and\ngenerally provides high performance across many data model methodologies. To further\ntune a data model for performance, one method you might consider is data denormalization,\nwhich means adding columns of data to a single table to reduce or remove table joins.\n\n**Best practice:** Use nested and repeated fields to denormalize data\nstorage and increase query performance.\n\nDenormalization is a common strategy for increasing read performance for\nrelational datasets that were previously normalized. The recommended way to\ndenormalize data in BigQuery is to use nested and repeated\nfields. It's best to use this strategy when the relationships are hierarchical\nand frequently queried together, such as in parent-child relationships.\n\nThe storage savings from using normalized data has less of an effect in modern\nsystems. Increases in storage costs are worth the performance gains of using\ndenormalized data. Joins require data coordination (communication\nbandwidth). Denormalization localizes the data to individual\n[slots](/bigquery/docs/slots), so that execution can be done in parallel.\n\nTo maintain relationships while denormalizing your data, you can use nested\nand repeated fields instead of completely flattening your data. When relational\ndata is completely flattened, network communication (shuffling) can negatively\nimpact query performance.\n\nFor example, denormalizing an orders schema without using nested and repeated\nfields might require you to group the data by a field like `order_id`\n(when there is a one-to-many relationship). Because of the shuffling involved,\ngrouping the data is less effective than denormalizing the data by using\nnested and repeated fields.\n\nIn some circumstances, denormalizing your data and using nested and repeated\nfields doesn't result in increased performance. For example, star schemas are\ntypically optimized schemas for analytics, and as a result, performance might\nnot be significantly different if you attempt to denormalize further.\n\nUsing nested and repeated fields\n--------------------------------\n\nBigQuery doesn't require a completely flat denormalization. You\ncan use nested and repeated fields to maintain relationships.\n\n- Nesting data (`STRUCT`)\n\n - Nesting data lets you represent foreign entities inline.\n - Querying nested data uses \"dot\" syntax to reference leaf fields, which is similar to the syntax using a join.\n - Nested data is represented as a [`STRUCT` type](/bigquery/docs/reference/standard-sql/data-types#struct_type) in GoogleSQL.\n- Repeated data (`ARRAY`)\n\n - Creating a field of type `RECORD` with the mode set to `REPEATED` lets you preserve a one-to-many relationship inline (so long as the relationship isn't high cardinality).\n - With repeated data, shuffling is not necessary.\n - Repeated data is represented as an `ARRAY`. You can use an [`ARRAY` function](/bigquery/docs/reference/standard-sql/array_functions) in GoogleSQL when you query the repeated data.\n- Nested and repeated data (`ARRAY` of `STRUCT`s)\n\n - Nesting and repetition complement each other.\n - For example, in a table of transaction records, you could include an array of line item `STRUCT`s.\n\nFor more information, see\n[Specify nested and repeated columns in table schemas](/bigquery/docs/nested-repeated).\n\nFor more information about denormalizing data, see\n[Denormalization](/bigquery/docs/migration/schema-data-overview#denormalization).\n\nExample\n-------\n\nConsider an `Orders` table with a row for each line item sold:\n\nIf you wanted to analyze data from this table, you would need to use a\n`GROUP BY` clause, similar to the following: \n\n```googlesql\nSELECT COUNT (Item_Name)\nFROM Orders\nGROUP BY Order_Id;\n```\n\nThe `GROUP BY` clause involves additional computation overhead, but this can be\navoided by nesting repeated data. You can avoid using a `GROUP BY` clause\nby creating a table with one order per row, where the order line items are in a\nnested field:\n\nIn BigQuery, you typically specify a nested schema as an `ARRAY`\nof `STRUCT` objects. You use the\n[`UNNEST` operator](/bigquery/docs/reference/standard-sql/query-syntax#unnest_operator)\nto [flatten the nested data](/bigquery/docs/arrays#flattening_arrays),\nas shown in the following query: \n\n```googlesql\nSELECT *\nFROM UNNEST(\n [\n STRUCT('001' AS Order_Id, ['A1', 'B1'] AS Item_Name),\n STRUCT('002' AS Order_Id, ['A1', 'C1'] AS Item_Name)\n ]\n);\n```\n\nThis query yields results similar to the following:\n\nIf this data wasn't nested, you could potentially have several rows for\neach order, one for each item sold in that order, which would result in a\nlarge table and an expensive `GROUP BY` operation.\n\nExercise\n--------\n\nYou can see the performance difference in queries that use nested fields as\ncompared to those that don't by following the steps in this section.\n\n1. Create a table based on the `bigquery-public-data.stackoverflow.comments`\n public dataset:\n\n ```googlesql\n CREATE OR REPLACE TABLE `\u003cvar translate=\"no\"\u003ePROJECT\u003c/var\u003e.\u003cvar translate=\"no\"\u003eDATASET\u003c/var\u003e.stackoverflow`\n AS (\n SELECT\n user_id,\n post_id,\n creation_date\n FROM\n `bigquery-public-data.stackoverflow.comments`\n );\n ```\n2. Using the `stackoverflow` table, run the following query to see the\n earliest comment for each user:\n\n ```googlesql\n SELECT\n user_id,\n ARRAY_AGG(STRUCT(post_id, creation_date AS earliest_comment) ORDER BY creation_date ASC LIMIT 1)[OFFSET(0)].*\n FROM\n `\u003cvar translate=\"no\"\u003ePROJECT\u003c/var\u003e.\u003cvar translate=\"no\"\u003eDATASET\u003c/var\u003e.stackoverflow`\n GROUP BY user_id\n ORDER BY user_id ASC;\n ```\n\n This query takes about 25 seconds to run and processes 1.88 GB of data.\n3. Create a second table with identical data that creates a `comments` field\n using a `STRUCT` type to store the `post_id` and `creation_date` data, instead\n of two individual fields:\n\n ```googlesql\n CREATE OR REPLACE TABLE `\u003cvar translate=\"no\"\u003ePROJECT\u003c/var\u003e.\u003cvar translate=\"no\"\u003eDATASET\u003c/var\u003e.stackoverflow_nested`\n AS (\n SELECT\n user_id,\n ARRAY_AGG(STRUCT(post_id, creation_date) ORDER BY creation_date ASC) AS comments\n FROM\n `bigquery-public-data.stackoverflow.comments`\n GROUP BY user_id\n );\n ```\n4. Using the `stackoverflow_nested` table, run the following query\n to see the earliest comment for each user:\n\n ```googlesql\n SELECT\n user_id,\n (SELECT AS STRUCT post_id, creation_date as earliest_comment FROM UNNEST(comments) ORDER BY creation_date ASC LIMIT 1).*\n FROM\n `\u003cvar translate=\"no\"\u003ePROJECT\u003c/var\u003e.\u003cvar translate=\"no\"\u003eDATASET\u003c/var\u003e.stackoverflow_nested`\n ORDER BY user_id ASC;\n ```\n\n This query takes about 10 seconds to run and processes 1.28 GB of data.\n5. [Delete](/bigquery/docs/samples/bigquery-delete-table) the `stackoverflow`\n and `stackoverflow_nested` tables when you are finished with them."]]