이 페이지에서는 코사인 거리, 유클리드 거리, 내적 벡터 함수를 사용하여 K-최근접 이웃을 찾아 Spanner에서 벡터 유사성 검색을 수행하는 방법을 설명합니다. 이 정보는 GoogleSQL 언어 데이터베이스와 PostgreSQL 언어 데이터베이스 모두에 적용됩니다. 이 페이지를 읽기 전에 다음 개념을 이해해야 합니다.
내적: 각도에 해당 벡터 크기의 곱을 곱하여 코사인을 계산합니다. 데이터 세트의 모든 벡터 임베딩이 정규화된 경우 DOT_PRODUCT()를 거리 함수로 사용할 수 있습니다.
K-최근접 이웃(KNN): 분류 또는 회귀 문제를 해결하는 데 사용되는 지도 머신러닝 알고리즘입니다.
유사성 검색이나 검색 증강 생성과 같은 사용 사례에 벡터 거리 함수를 사용하여 K-최근접 이웃(KNN) 벡터 검색을 수행할 수 있습니다. Spanner는 벡터 임베딩에서 작동하는 COSINE_DISTANCE(), EUCLIDEAN_DISTANCE(), DOT_PRODUCT() 함수를 지원하므로 입력 임베딩의 KNN을 찾을 수 있습니다.
예를 들어 운영 Spanner 데이터를 벡터 임베딩으로 생성 및 저장한 후 이러한 벡터 임베딩을 쿼리에서 입력 매개변수로 제공하여 N차원 공간에서 가장 가까운 벡터를 찾아 의미론적으로 유사하거나 관련된 항목을 검색할 수 있습니다.
세 거리 함수 모두 array<> 유형인 vector1 및 vector2 인수를 사용하며 동일한 차원으로 구성되고 길이가 같아야 합니다. 이러한 함수에 대한 자세한 내용은 다음을 참조하세요.
사용하는 WHERE 절 필터가 테이블의 기본 키에 포함되지 않은 경우 보조 색인을 만들어 색인 전용 스캔으로 작업을 가속화할 수 있습니다.
GoogleSQL
CREATEINDEXDocsByAuthorONDocuments(Author)STORING(DocEmbedding);SELECTAuthor,DocId,DocEmbeddingFROMDocumentsWHEREAuthor="Mark Twain"ORDERBYEUCLIDEAN_DISTANCE(DocEmbedding,<embeddingsfor"book about the time traveling American">)LIMIT5;
PostgreSQL
CREATEINDEXDocsByAuthorONDocuments(Author)INCLUDE(DocEmbedding);SELECTAuthor,DocId,DocEmbeddingFROMDocumentsWHEREAuthor="Mark Twain"ORDERBYspanner.euclidean_distance(DocEmbedding,<embeddingsfor"that book about the time traveling American">)LIMIT5;
[[["이해하기 쉬움","easyToUnderstand","thumb-up"],["문제가 해결됨","solvedMyProblem","thumb-up"],["기타","otherUp","thumb-up"]],[["이해하기 어려움","hardToUnderstand","thumb-down"],["잘못된 정보 또는 샘플 코드","incorrectInformationOrSampleCode","thumb-down"],["필요한 정보/샘플이 없음","missingTheInformationSamplesINeed","thumb-down"],["번역 문제","translationIssue","thumb-down"],["기타","otherDown","thumb-down"]],["최종 업데이트: 2025-06-03(UTC)"],[],[],null,["# Perform vector similarity search in Spanner by finding the K-nearest neighbors\n\n| **Note:** This feature is available with the Spanner Enterprise edition and Enterprise Plus edition. For more information, see the [Spanner editions overview](/spanner/docs/editions-overview).\n\n\u003cbr /\u003e\n\nThis page describes how to perform a vector similarity search in\nSpanner by using the cosine distance, Euclidean distance, and dot\nproduct vector functions to find K-nearest neighbors. This information\napplies to both GoogleSQL-dialect databases and PostgreSQL-dialect databases. Before you read this page,\nit's important that you understand the following concepts:\n\n- [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance): measures the shortest distance between two vectors.\n- [Cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity#Cosine_distance): measures the cosine of the angle between two vectors.\n- [Dot product](https://mathworld.wolfram.com/DotProduct.html): calculates the cosine of the angle multiplied by the product of corresponding vector magnitudes. If you know that all the vector embeddings in your dataset are normalized, then you can use `DOT_PRODUCT()` as a distance function.\n- [K-nearest neighbors (KNN)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm): a supervised machine learning algorithm used to solve classification or regression problems.\n\nYou can use vector distance functions to perform K-nearest neighbors (KNN)\nvector search for use cases like similarity search or retrieval-augmented\ngeneration. Spanner supports the `COSINE_DISTANCE()`,\n`EUCLIDEAN_DISTANCE()`, and `DOT_PRODUCT()` functions, which operate on vector\nembeddings, allowing you to find the KNN of the input embedding.\n\nFor example, after you [generate and save your operational Spanner\ndata as vector embeddings](/spanner/docs/ml-tutorial-embeddings), you can then\nprovide these vector embeddings as an input parameter in your query to find the\nnearest vectors in N-dimensional space to search for semantically similar or\nrelated items.\n\nAll three distance functions take the arguments `vector1` and `vector2`, which\nare of the type `array\u003c\u003e`, and must consist of the same dimensions and have the\nsame length. For more details about these functions, see:\n\n- [`COSINE_DISTANCE()` in GoogleSQL](/spanner/docs/reference/standard-sql/mathematical_functions#cosine_distance)\n- [`EUCLIDEAN_DISTANCE()` in GoogleSQL](/spanner/docs/reference/standard-sql/mathematical_functions#euclidean_distance)\n- [`DOT_PRODUCT()` in GoogleSQL](/spanner/docs/reference/standard-sql/mathematical_functions#dot_product)\n- [Mathematical functions in PostgreSQL](/spanner/docs/reference/postgresql/functions-and-operators#mathematical) (`spanner.cosine_distance()`, `spanner.euclidean_distance()`, and `spanner.dot_product()`)\n- [Choose among vector distance functions to measure vector embeddings similarity](/spanner/docs/choose-vector-distance-function).\n\nExamples\n--------\n\nThe following examples show KNN search, KNN search over partitioned data, and\nusing a secondary index with KNN.\n\nThe examples all use `EUCLIDEAN_DISTANCE()`. You can also use\n`COSINE_DISTANCE()`. In addition, if all the vector embeddings in your dataset\nare normalized, you can use `DOT_PRODUCT()` as a distance function.\n\n### Example 1: KNN search\n\nConsider a `Documents` table that has a column (`DocEmbedding`) of precomputed\ntext embeddings from the `DocContents` bytes column. \n\n### GoogleSQL\n\n CREATE TABLE Documents (\n UserId INT64 NOT NULL,\n DocId INT64 NOT NULL,\n Author STRING(1024),\n DocContents BYTES(MAX),\n DocEmbedding ARRAY\u003cFLOAT32\u003e\n ) PRIMARY KEY (UserId, DocId);\n\n### PostgreSQL\n\n CREATE TABLE Documents (\n UserId bigint NOT NULL,\n DocId bigint NOT NULL,\n Author varchar(1024),\n DocContents bytea,\n DocEmbedding float4[],\n PRIMARY KEY (UserId, DocId)\n );\n\nAssuming that an input embedding for \"baseball, but not professional baseball\"\nis the array `[0.3, 0.3, 0.7, 0.7]`, you can find the top five nearest documents\nthat match, with the following query: \n\n### GoogleSQL\n\n SELECT DocId, DocEmbedding FROM Documents\n ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,\n ARRAY\u003cFLOAT32\u003e[0.3, 0.3, 0.7, 0.8])\n LIMIT 5;\n\n### PostgreSQL\n\n SELECT DocId, DocEmbedding FROM Documents\n ORDER BY spanner.euclidean_distance(DocEmbedding,\n '{0.3, 0.3, 0.7, 0.8}'::float4[])\n LIMIT 5;\n\nThe expected results of this example: \n\n Documents\n +---------------------------+-----------------+\n | DocId | DocEmbedding |\n +---------------------------+-----------------+\n | 24 | [8, ...] |\n +---------------------------+-----------------+\n | 25 | [6, ...] |\n +---------------------------+-----------------+\n | 26 | [3.2, ...] |\n +---------------------------+-----------------+\n | 27 | [38, ...] |\n +---------------------------+-----------------+\n | 14229 | [1.6, ...] |\n +---------------------------+-----------------+\n\n### Example 2: KNN search over partitioned data\n\nThe query in the previous example can be modified by adding conditions to the\n`WHERE` clause to limit the vector search to a subset of your data. One common\napplication of this is to search over partitioned data, such as rows that belong\nto a specific `UserId`. \n\n### GoogleSQL\n\n SELECT UserId, DocId, DocEmbedding FROM Documents\n WHERE UserId=18\n ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,\n ARRAY\u003cFLOAT32\u003e[0.3, 0.3, 0.7, 0.8])\n LIMIT 5;\n\n### PostgreSQL\n\n SELECT UserId, DocId, DocEmbedding FROM Documents\n WHERE UserId=18\n ORDER BY spanner.euclidean_distance(DocEmbedding,\n '{0.3, 0.3, 0.7, 0.8}'::float4[])\n LIMIT 5;\n\nThe expected results of this example: \n\n Documents\n +-----------+-----------------+-----------------+\n | UserId | DocId | DocEmbedding |\n +-----------+-----------------+-----------------+\n | 18 | 234 | [12, ...] |\n +-----------+-----------------+-----------------+\n | 18 | 12 | [1.6, ...] |\n +-----------+-----------------+-----------------+\n | 18 | 321 | [22, ...] |\n +-----------+-----------------+-----------------+\n | 18 | 432 | [3, ...] |\n +-----------+-----------------+-----------------+\n\n### Example 3: KNN search over secondary index ranges\n\nIf the `WHERE` clause filter you're using isn't part of the table's primary key,\nthen you can create a secondary index to accelerate the operation with an\n[index-only scan](/spanner/docs/secondary-indexes#storing-clause). \n\n### GoogleSQL\n\n CREATE INDEX DocsByAuthor\n ON Documents(Author)\n STORING (DocEmbedding);\n\n SELECT Author, DocId, DocEmbedding FROM Documents\n WHERE Author=\"Mark Twain\"\n ORDER BY EUCLIDEAN_DISTANCE(DocEmbedding,\n \u003cembeddings for \"book about the time traveling American\"\u003e)\n LIMIT 5;\n\n### PostgreSQL\n\n CREATE INDEX DocsByAuthor\n ON Documents(Author)\n INCLUDE (DocEmbedding);\n\n SELECT Author, DocId, DocEmbedding FROM Documents\n WHERE Author=\"Mark Twain\"\n ORDER BY spanner.euclidean_distance(DocEmbedding,\n \u003cembeddings for \"that book about the time traveling American\"\u003e)\n LIMIT 5;\n\nThe expected results of this example: \n\n Documents\n +------------+-----------------+-----------------+\n | Author | DocId | DocEmbedding |\n +------------+-----------------+-----------------+\n | Mark Twain | 234 | [12, ...] |\n +------------+-----------------+-----------------+\n | Mark Twain | 12 | [1.6, ...] |\n +------------+-----------------+-----------------+\n | Mark Twain | 321 | [22, ...] |\n +------------+-----------------+-----------------+\n | Mark Twain | 432 | [3, ...] |\n +------------+-----------------+-----------------+\n | Mark Twain | 375 | [9, ...] |\n +------------+-----------------+-----------------+\n\nWhat's next\n-----------\n\n- Learn more about the [GoogleSQL `COSINE_DISTANCE()`, `EUCLIDEAN_DISTANCE()`, `DOT_PRODUCT()`](/spanner/docs/reference/standard-sql/mathematical_functions) functions.\n\n- Learn more about the [PostgreSQL `spanner.cosine_distance()`, `spanner.euclidean_distance()`, `spanner.dot_product()`](/spanner/docs/reference/postgresql/functions-and-operators#mathematical) functions.\n\n- Learn more about how to [Choose among vector distance functions to measure vector embeddings similarity](/spanner/docs/choose-vector-distance-function)."]]