# Parallel processing for JOIN operations

This page explains performance tuning for `JOIN` operations in Cloud Data Fusion.

`JOIN` operations can be the most expensive part of a pipeline. Like everything
else in a pipeline, operations are executed in parallel. The first step of a
`JOIN` is to shuffle data so that every record with the same `JOIN` key is sent
to the same executor. After all of the data is shuffled, it's joined, and the
output continues through the pipeline.

Example of parallel processing in `JOIN` operations
---------------------------------------------------

For example, suppose you perform a `JOIN` operation on datasets called
`Purchases` and `Items`. Each purchase record contains an item name and the
number purchased. Each item record contains the item name and the price of that
item. A `JOIN` is performed on the item name to calculate the total price of
each purchase. When the data is joined, data is shuffled across the cluster
such that records with the same key end up on the same executor.

When the `JOIN` keys are fairly evenly distributed, `JOIN` operations perform
well because they can be executed in parallel.

Like any shuffle, data skew negatively impacts performance. In the preceding
example, eggs are purchased much more frequently than chicken or milk, which
means the executor joining the egg purchases does more work than the other
executors. If you notice that a `JOIN` is skewed, there are two ways to improve
performance.

### Automatically split up skewed partitions

With adaptive query execution, severe skew is handled automatically: as soon as
a `JOIN` produces some partitions that are much larger than others, they are
split into smaller ones. To confirm that you have adaptive query execution
enabled, see [Autotuning](https://cdap.atlassian.net/wiki/spaces/DOCS/pages/818544644/Parallel+Processing#Autotuning).
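The settings that control this behavior are standard Spark 3.x properties. The
following is a minimal, illustrative PySpark sketch; in Cloud Data Fusion you
would typically supply these values as engine properties in the pipeline
configuration rather than creating a session in code.

```python
from pyspark.sql import SparkSession

# Illustrative session setup only. The two properties below are standard
# Spark 3.x settings; in Cloud Data Fusion they would be passed through
# as engine properties rather than set in code.
spark = (
    SparkSession.builder
    .appName("adaptive-skew-join")
    # Turn on adaptive query execution.
    .config("spark.sql.adaptive.enabled", "true")
    # Let Spark detect skewed JOIN partitions and split them automatically.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```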
### Use an in-memory `JOIN`

An in-memory `JOIN` can be performed if one side of the `JOIN` is small enough
to fit in memory. In this situation, the small dataset is loaded into memory
and is then broadcast to every executor. The large dataset isn't shuffled at
all, which removes the uneven partitions that are generated when shuffling on
the `JOIN` key.

In the previous example, the items dataset is first loaded into the memory of
the Spark driver. It is then broadcast to each executor. Executors can now join
the data without shuffling any of the purchases dataset.

This approach requires you to give both the Spark driver and executors enough
memory to store the broadcast dataset. By default, Spark reserves slightly less
than 30% of its memory for storing this type of data. When using in-memory
`JOIN`s, multiply the size of the dataset by four and set that as the executor
and driver memory. For example, if the items dataset is 1 GB in size, you would
need to set the executor and driver memory to at least 4 GB. Datasets larger
than 8 GB cannot be loaded into memory.
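To make the mechanics concrete, here is a minimal PySpark sketch of a broadcast
(in-memory) `JOIN` on the `Purchases` and `Items` example. The paths and column
names (`item_name`, `num_purchased`, `price`) are illustrative assumptions, not
part of the original example's schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("in-memory-join").getOrCreate()

# Hypothetical inputs: `purchases` is large, `items` is small enough to
# fit in memory. Paths and column names are illustrative assumptions.
purchases = spark.read.parquet("gs://example-bucket/purchases")
items = spark.read.parquet("gs://example-bucket/items")

# broadcast() hints that `items` should be loaded into memory and copied
# to every executor, so the large `purchases` dataset is never shuffled.
joined = purchases.join(broadcast(items), on="item_name")

# Total price of each purchase, as in the example above.
totals = joined.withColumn("total_price", col("num_purchased") * col("price"))
```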
Key distribution
----------------

When both sides of the `JOIN` are too large to fit in memory, a different
technique can be used to break up each `JOIN` key into multiple keys and
increase the level of parallelism. This technique can be applied to
`INNER JOIN` and `LEFT OUTER JOIN` operations. It cannot be used for
`FULL OUTER JOIN` operations.

In this approach, the skewed side is salted with a new integer column
containing a random number from 1 to `N`. The unskewed side is exploded, with
each existing row generating `N` new rows. A new column is added to the
exploded side, populated with each number from 1 to `N`. A normal `JOIN` is
then performed, except the new column is added as part of the `JOIN` key. In
this way, all the data that used to go to a single partition is now spread out
to up to `N` different partitions.

In the preceding example, the distribution factor `N` is set to `3`. The
original datasets are shown on the left. The salted and exploded versions of
the dataset are shown in the middle. The shuffled data is shown on the right,
with three different executors joining egg purchases, instead of one.

Greater parallelism is achieved by increasing the distribution factor. However,
this comes at the cost of exploding one side of the `JOIN`, resulting in more
data shuffled across the cluster. Because of this, the benefit diminishes as
the distribution factor increases. In most situations, set it to 20 or less.
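The following is a minimal PySpark sketch of the same salt-and-explode idea,
assuming the skewed `purchases` and unskewed `items` datasets from the earlier
example; the paths and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("key-distribution").getOrCreate()

N = 3  # distribution factor

purchases = spark.read.parquet("gs://example-bucket/purchases")  # skewed side
items = spark.read.parquet("gs://example-bucket/items")          # unskewed side

# Salt the skewed side: tag each row with a random integer from 1 to N.
salted = purchases.withColumn("salt", (F.floor(F.rand() * N) + 1).cast("int"))

# Explode the unskewed side: replicate each row once per salt value 1..N.
exploded = items.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(1, N + 1)]))
)

# Join on the original key plus the salt, so each hot key is spread across
# up to N partitions, then drop the helper column.
joined = salted.join(exploded, on=["item_name", "salt"]).drop("salt")
```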
[[["容易理解","easyToUnderstand","thumb-up"],["確實解決了我的問題","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["難以理解","hardToUnderstand","thumb-down"],["資訊或程式碼範例有誤","incorrectInformationOrSampleCode","thumb-down"],["缺少我需要的資訊/範例","missingTheInformationSamplesINeed","thumb-down"],["翻譯問題","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["上次更新時間:2025-09-04 (世界標準時間)。"],[[["\u003cp\u003e\u003ccode\u003eJOIN\u003c/code\u003e operations in Cloud Data Fusion can be performance-intensive due to the data shuffling required to group records with the same \u003ccode\u003eJOIN\u003c/code\u003e key on the same executor.\u003c/p\u003e\n"],["\u003cp\u003eUneven distribution of \u003ccode\u003eJOIN\u003c/code\u003e keys, known as data skew, can negatively impact \u003ccode\u003eJOIN\u003c/code\u003e performance, but adaptive query execution can automatically split these large, skewed partitions to improve performance.\u003c/p\u003e\n"],["\u003cp\u003eFor smaller datasets, an in-memory \u003ccode\u003eJOIN\u003c/code\u003e can be used where the smaller dataset is loaded into memory and broadcast to all executors, avoiding the need for shuffling the larger dataset and significantly improving performance.\u003c/p\u003e\n"],["\u003cp\u003eWhen both datasets are too large for in-memory \u003ccode\u003eJOIN\u003c/code\u003e, a technique called key distribution can be applied to \u003ccode\u003eINNER JOIN\u003c/code\u003e and \u003ccode\u003eLEFT OUTER JOIN\u003c/code\u003e operations to break up each \u003ccode\u003eJOIN\u003c/code\u003e key into multiple keys and increase parallelism by salting and exploding the datasets.\u003c/p\u003e\n"]]],[],null,["# Parallel processing for JOIN operations\n\nThis page explains performance tuning for `JOIN` operations in Cloud Data Fusion.\n\n`JOIN` operations can be the most expensive part of a pipeline. Like everything\nelse in a pipeline, operations are executed in parallel. The first step of a\n`JOIN` is to shuffle data so that every record with the same `JOIN` key is sent\nto the same executor. After all of the data is shuffled, it's joined, and the\noutput continues through the pipeline.\n\nExample of parallel processing in `JOIN` operations\n---------------------------------------------------\n\nFor example, suppose you perform a `JOIN` operation on datasets called\n`Purchases` and `Items`. Each purchase record contains an item name and number\npurchased. Each item record contains the item name and the price of that item. A\n`JOIN` is performed on the item name to calculate the total price of each\npurchase. When the data is joined, data is shuffled across the cluster such that\nrecords with the same ID end up on the same executor.\n\nWhen the `JOIN` keys are fairly evenly distributed, `JOIN` operations perform\nwell because they can be executed in parallel.\n\nLike any shuffle, data skew negatively impacts performance. In the preceding\nexample, eggs are purchased much more frequently than chicken or milk, which\nmeans the executor joining the egg purchases does more work than the other\nexecutors. If you notice that a `JOIN` is skewed, there are two ways to improve\nperformance.\n\n### Automatically split up skewed partitions\n\nWith adaptive query execution, really heavy skews will be handled automatically.\nAs soon as a `JOIN` produces some partitions much larger than others, they are\nsplit into smaller ones. 
To confirm you have adaptive query execution enabled,\nsee [Autotuning](https://cdap.atlassian.net/wiki/spaces/DOCS/pages/818544644/Parallel+Processing#Autotuning).\n\n### Use an in-memory `JOIN`\n\nAn in-memory `JOIN` can be performed if one side of the `JOIN` is small enough\nto fit in memory. In this situation, the small dataset is loaded into memory,\nand then gets broadcasted to every executor. The large dataset isn't shuffled at\nall, removing the uneven partitions that are generated when shuffling on the\n`JOIN` key.\n\nIn the previous example, the items dataset is first loaded into memory of the\nSpark driver. It is then broadcast to each executor. Executors can now join the\ndata without shuffling any of the purchase dataset.\n\nThis approach requires you to give enough memory to both the Spark driver and\nexecutors to allow them to store the broadcast dataset in memory. By default,\nSpark reserves slightly less than 30% of its memory for storing this type of\ndata. When using in-memory `JOIN`s, multiply the size of the dataset by four and\nset that as the executor and driver memory. For example, if the items dataset\nwas 1 GB in size, we would need to set the executor and driver memory to at\nleast 4 GB. Datasets larger than 8 GB cannot be loaded into memory.\n\nKey distribution\n----------------\n\nWhen both sides of the `JOIN` are too large to fit in memory, a different\ntechnique can be used to break up each `JOIN` key into multiple keys to increase\nthe level of parallelism. This technique can be applied to `INNER JOIN` and\n`LEFT OUTER JOIN`operations. It cannot be used for `FULL OUTER JOIN`\noperations.\n\nIn this approach, the skewed side is salted with a new integer column with a\nrandom number from 1 to N. The unskewed side is exploded, with each existing row\ngenerating `N` new rows. A new column is added to the exploded side, populated\nwith each number from 1 to N. A normal `JOIN` is then performed, except the new\ncolumn is added as part of the `JOIN` key. In this way, all the data that used\nto go to a single partition is now spread out to up to `N` different partitions.\n\nIn the preceding example, the distribution factor `N` is set to `3`. The\noriginal datasets are shown on the left. The salted and exploded versions of the\ndataset are shown in the middle. The shuffled data is shown on the right, with\nthree different executors joining egg purchases, instead of one.\n\nGreater parallelism is achieved by increasing distributions. However, this comes\nat the cost of exploding one side of the `JOIN`, resulting in more data shuffled\nacross the cluster. Because of this, the benefit diminishes as distribution\nincreases. In most situations, set it to 20 or less.\n\nWhat's next\n-----------\n\n- Learn more about [parallel processing](/data-fusion/docs/concepts/parallel-processing) in Cloud Data Fusion."]]