SEARCH_NGRAMS 会查找与搜索查询共享 N 元语法的所有候选专辑。例如,“California”的三字母 N 元语法包括 [cal, ali,
lif, ifo, for, orn, rni, nia],“Kaliphorn”的三字母 N 元语法包括 [kal, ali, lip,
iph, pho, hor, orn]。这些数据集中的共享 N 元语法为 [ali,
orn]。默认情况下,SEARCH_NGRAMS 会与至少有两个共享 N 元语法的文档匹配,因此“Kaliphorn”与“California”匹配。
SCORE_NGRAMS 按相似度对匹配项进行排名。两个字符串的相似度定义为不同共享 N 元语法与不同非共享 N 元语法的比率:
使用 TOKENIZE_SUBSTRING(/spanner/docs/reference/standard-sql/search_functions#tokenize_substring) 或 TOKENIZE_NGRAMS 函数指定 N 元语法的最小和最大大小。我们不建议使用单个字符 N 元语法,因为它们可能会与大量文档匹配。另一方面,长 N 元语法会导致 SEARCH_NGRAMS 忽略拼写错误的短单词。
SEARCH_NGRAMS 必须匹配的最小 N 元语法数量(使用 SEARCH_NGRAMS 中的 min_ngrams 和 min_ngrams_percent 参数进行设置)。数字越大,查询速度通常越快,但召回率越低。
[[["易于理解","easyToUnderstand","thumb-up"],["解决了我的问题","solvedMyProblem","thumb-up"],["其他","otherUp","thumb-up"]],[["很难理解","hardToUnderstand","thumb-down"],["信息或示例代码不正确","incorrectInformationOrSampleCode","thumb-down"],["没有我需要的信息/示例","missingTheInformationSamplesINeed","thumb-down"],["翻译问题","translationIssue","thumb-down"],["其他","otherDown","thumb-down"]],["最后更新时间 (UTC):2025-09-05。"],[],[],null,["# Find approximate matches with fuzzy search\n\n\u003cbr /\u003e\n\n\n| **Note:** This feature is available with the Spanner Enterprise edition and Enterprise Plus edition. For more information, see the [Spanner editions overview](/spanner/docs/editions-overview).\n\n\u003cbr /\u003e\n\nThis page describes how to use a fuzzy search as part of a\n[full-text search](/spanner/docs/full-text-search).\n\nIn addition to performing exact token searches using the\n[`SEARCH`](/spanner/docs/reference/standard-sql/search_functions#search_fulltext)\nand\n[`SEARCH_SUBSTRING`](/spanner/docs/reference/standard-sql/search_functions#search_substring)\nfunctions, Spanner also supports approximate (or fuzzy) searches. Fuzzy\nsearches find matching documents despite small differences between the query and\nthe document.\n\nSpanner supports the following types of fuzzy search:\n\n- N-grams-based approximate search\n- Phonetic search using [Soundex](https://en.wikipedia.org/wiki/Soundex)\n\nUse an n-grams-based approximate search\n---------------------------------------\n\nN-grams-based fuzzy search relies on the same\nsubstring tokenization that a\n[substring search](/spanner/docs/full-text-search/substring-search)\nrequires. The configuration of the tokenizer is important as it affects\nsearch quality and performance. The following example shows how to create a\nquery with misspelled or differently spelled words to find approximate matches\nin the search index.\n\n**Schema** \n\n### GoogleSQL\n\n CREATE TABLE Albums (\n AlbumId STRING(MAX) NOT NULL,\n AlbumTitle STRING(MAX),\n AlbumTitle_Tokens TOKENLIST AS (\n TOKENIZE_SUBSTRING(AlbumTitle, ngram_size_min=\u003e2, ngram_size_max=\u003e3,\n relative_search_types=\u003e[\"word_prefix\", \"word_suffix\"])) HIDDEN\n ) PRIMARY KEY(AlbumId);\n\n CREATE SEARCH INDEX AlbumsIndex\n ON Albums(AlbumTitle_Tokens)\n STORING (AlbumTitle);\n\n### PostgreSQL\n\nThis example uses\n[`spanner.tokenize_substring`](/spanner/docs/reference/postgresql/functions-and-operators#search_functions). \n\n CREATE TABLE albums (\n albumid character varying NOT NULL,\n albumtitle character varying,\n albumtitle_tokens spanner.tokenlist GENERATED ALWAYS AS (\n spanner.tokenize_substring(albumtitle, ngram_size_min=\u003e2, ngram_size_max=\u003e3,\n relative_search_types=\u003e'{word_prefix, word_suffix}'::text[])) VIRTUAL HIDDEN,\n PRIMARY KEY(albumid));\n\n CREATE SEARCH INDEX albumsindex\n ON albums(albumtitle_tokens)\n INCLUDE (albumtitle);\n\n**Query**\n\nThe following query finds the albums with titles that are the closest to\n\"Hatel Kaliphorn\", such as \"Hotel California\". \n\n### GoogleSQL\n\n SELECT AlbumId\n FROM Albums\n WHERE SEARCH_NGRAMS(AlbumTitle_Tokens, \"Hatel Kaliphorn\")\n ORDER BY SCORE_NGRAMS(AlbumTitle_Tokens, \"Hatel Kaliphorn\") DESC\n LIMIT 10\n\n### PostgreSQL\n\nThis examples uses\n[`spanner.score_ngrams`](/spanner/docs/reference/postgresql/functions-and-operators#search_functions)\nand\n[`spanner.search_ngrams`](/spanner/docs/reference/postgresql/functions-and-operators#search_functions). \n\n SELECT albumid\n FROM albums\n WHERE spanner.search_ngrams(albumtitle_tokens, 'Hatel Kaliphorn')\n ORDER BY spanner.score_ngrams(albumtitle_tokens, 'Hatel Kaliphorn') DESC\n LIMIT 10\n\n### Optimize performance and recall for an n-grams-based approximate search\n\nThe sample query in the previous section searches in two phases, using two\ndifferent functions:\n\n1. [`SEARCH_NGRAMS`](/spanner/docs/reference/postgresql/functions-and-operators#search_functions) finds all candidate albums that have shared n-grams with the search query. For example, three-character n-grams for \"California\" include `[cal, ali,\n lif, ifo, for, orn, rni, nia]` and for \"Kaliphorn\" include `[kal, ali, lip,\n iph, pho, hor, orn]`. The shared n-grams in these data sets are `[ali,\n orn]`. By default, `SEARCH_NGRAMS` matches all documents with at least two shared n-grams, therefore \"Kaliphorn\" matches \"California\".\n2. [`SCORE_NGRAMS`](/spanner/docs/reference/standard-sql/search_functions#score_ngrams) ranks matches by similarity. The similarity of two strings is defined as a ratio of distinct shared n-grams to distinct non-shared n-grams:\n\n$$ \\\\frac{shared\\\\_ngrams}{total\\\\_ngrams_{index} + total\\\\_ngrams_{query} - shared\\\\_ngrams} $$\n\nUsually the search query is the same across both the `SEARCH_NGRAMS` and\n`SCORE_NGRAMS` functions. The recommended way to do this is to use the argument\nwith\n[query parameters](/spanner/docs/reference/standard-sql/lexical#query_parameters)\nrather than with string literals, and specify the same query parameter in the\n`SEARCH_NGRAMS` and `SCORE_NGRAMS` functions.\n\nSpanner has three configuration arguments that can be used with\n`SEARCH_NGRAMS`:\n\n- The minimum and maximum sizes for n-grams are specified with the `TOKENIZE_SUBSTRING`(/spanner/docs/reference/standard-sql/search_functions#tokenize_substring) or [`TOKENIZE_NGRAMS`](/spanner/docs/reference/standard-sql/search_functions#tokenize_ngrams) functions. We don't recommend one character n-grams because they could match a very large number of documents. On the other hand, long n-grams cause `SEARCH_NGRAMS` to miss short misspelled words.\n- The minimum number of n-grams that `SEARCH_NGRAMS` must match (set with the `min_ngrams` and `min_ngrams_percent` arguments in `SEARCH_NGRAMS`). Higher numbers typically make the query faster, but reduce recall.\n\nIn order to achieve a good balance between performance and recall, you can\nconfigure these arguments to fit the specific query and workload.\n\nWe also recommend including an inner `LIMIT` to avoid creating very expensive\nqueries when a combination of popular n-grams is encountered. \n\n### GoogleSQL\n\n SELECT AlbumId\n FROM (\n SELECT AlbumId,\n SCORE_NGRAMS(AlbumTitle_Tokens, @p) AS score\n FROM Albums\n WHERE SEARCH_NGRAMS(AlbumTitle_Tokens, @p)\n LIMIT 10000 # inner limit\n )\n ORDER BY score DESC\n LIMIT 10 # outer limit\n\n### PostgreSQL\n\nThis example uses query parameter `$1` which is bound to 'Hatel Kaliphorn'. \n\n SELECT albumid\n FROM\n (\n SELECT albumid, spanner.score_ngrams(albumtitle_tokens, $1) AS score\n FROM albums\n WHERE spanner.search_ngrams(albumtitle_tokens, $1)\n LIMIT 10000\n ) AS inner_query\n ORDER BY inner_query.score DESC\n LIMIT 10\n\n### N-grams-based fuzzy search versus enhanced query mode\n\nAlongside n-grams-based fuzzy search, the\n[enhanced query mode](/spanner/docs/full-text-search/query-overview#enhanced_query_mode)\nalso handles some misspelled words. Thus, there is some overlap between the two\nfeatures. The following table summarizes the differences:\n\nPerform a phonetic search with Soundex\n--------------------------------------\n\nSpanner provides the\n[`SOUNDEX`](/spanner/docs/reference/standard-sql/string_functions#soundex)\nfunction for finding words that are spelled differently, but sound the same. For\nexample, `SOUNDEX(\"steven\")`, `SOUNDEX(\"stephen\")` and`SOUNDEX(\"stefan\")` are\nall \"s315\", while `SOUNDEX(\"stella\")` is \"s340\". `SOUNDEX` is case sensitive and\nonly works for Latin-based alphabets.\n\nPhonetic search with `SOUNDEX` can be implemented with a generated column and a\nsearch index as shown in the following example: \n\n### GoogleSQL\n\n CREATE TABLE Singers (\n SingerId INT64,\n AlbumTitle STRING(MAX),\n AlbumTitle_Tokens TOKENLIST AS (TOKENIZE_FULLTEXT(AlbumTitle)) HIDDEN,\n Name STRING(MAX),\n NameSoundex STRING(MAX) AS (LOWER(SOUNDEX(Name))),\n NameSoundex_Tokens TOKENLIST AS (TOKEN(NameSoundex)) HIDDEN\n ) PRIMARY KEY(SingerId);\n\n CREATE SEARCH INDEX SingersPhoneticIndex ON Singers(AlbumTitle_Tokens, NameSoundex_Tokens);\n\n### PostgreSQL\n\nThis example uses\n[`spanner.soundex`](/spanner/docs/reference/postgresql/functions-and-operators#string_functions). \n\n CREATE TABLE singers (\n singerid bigint,\n albumtitle character varying,\n albumtitle_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.tokenize_fulltext(albumtitle)) VIRTUAL HIDDEN,\n name character varying,\n namesoundex character varying GENERATED ALWAYS AS (lower(spanner.soundex(name))) VIRTUAL,\n namesoundex_tokens spanner.tokenlist GENERATED ALWAYS AS (spanner.token(lower(spanner.soundex(name))) VIRTUAL HIDDEN,\n PRIMARY KEY(singerid));\n\n CREATE SEARCH INDEX singersphoneticindex ON singers(albumtitle_tokens, namesoundex_tokens);\n\nThe following query matches \"stefan\" to \"Steven\" on `SOUNDEX`, along with\n`AlbumTitle` containing \"cat\": \n\n### GoogleSQL\n\n SELECT SingerId\n FROM Singers\n WHERE NameSoundex = LOWER(SOUNDEX(\"stefan\")) AND SEARCH(AlbumTitle_Tokens, \"cat\")\n\n### PostgreSQL\n\n SELECT singerid\n FROM singers\n WHERE namesoundex = lower(spanner.soundex('stefan')) AND spanner.search(albumtitle_tokens, 'cat')\n\n\u003cbr /\u003e\n\nWhat's next\n-----------\n\n- Learn about [tokenization and Spanner tokenizers](/spanner/docs/full-text-search/tokenization).\n- Learn about [search indexes](/spanner/docs/full-text-search/search-indexes).\n- Learn about [full-text search queries](/spanner/docs/full-text-search/query-overview)."]]