分析语法

大多数 Natural Language 方法都会分析给定文本讲的是什么,但 analyzeSyntax 方法会检查语言本身的结构。语法分析将给定文本分解为一系列句子和令牌(通常是单词),并提供这些令牌的相关语言信息。请参阅词法与依存关系树以了解语言分析的详细信息,并参阅语言支持以查看 Natural Language API 可以分析语法的语言列表。

本部分演示在文档中检测语法的几种方法。 您必须针对每个文档分别提交请求。

分析字符串中的语法

下面的示例说明如何对直接发送至 Natural Language API 的文本字符串执行语法分析:

协议

如需分析文档中的语法,请按照下面示例中所示,向 documents:analyzeSyntax REST 方法发出 POST 请求,并提供相应的请求正文。

该示例使用 gcloud auth application-default print-access-token 命令获取通过 Google Cloud Platform Cloud SDK 为项目设置的服务帐号的访问令牌。如需了解有关安装 Cloud SDK 以及使用服务帐号设置项目的说明,请参阅快速入门

curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'encodingType': 'UTF8',
  'document': {
    'type': 'PLAIN_TEXT',
    'content': 'Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.  Sundar Pichai said in his keynote that users love their new Android phones.'
  }
}" "https://language.googleapis.com/v1/documents:analyzeSyntax"

如果您不指定 document.language,则系统将自动检测语言。有关 Natural Language API 支持哪些语言的信息,请参阅语言支持。如需详细了解如何配置请求正文,请参阅 Document 参考文档。

如果请求成功,服务器将返回一个 200 OK HTTP 状态代码以及 JSON 格式的响应:

{
  "sentences": [
    {
      "text": {
        "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
        "beginOffset": 0
      }
    },
    {
      "text": {
        "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
        "beginOffset": 105
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Google",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 7,
        "label": "NSUBJ"
      },
      "lemma": "Google"
    },
    ...
    {
      "text": {
        "content": ".",
        "beginOffset": 179
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "NUMBER_UNKNOWN",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 20,
        "label": "P"
      },
      "lemma": "."
    }
  ],
  "language": "en"
}

tokens 数组包含代表检测到的句子令牌的 Token 对象,其中包括词性令牌和在句子中的位置等信息。

gcloud

如需查看完整的详细信息,请参阅 analyze-syntax 命令。

如需执行语法分析,请使用 gcloud 命令行工具并使用 --content 标志来标识要分析的内容:

gcloud ml language analyze-syntax --content="Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.  Sundar Pichai said in his keynote that users love their new Android phones."

如果请求成功,则服务器返回 JSON 格式的响应:

{
  "sentences": [
    {
      "text": {
        "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
        "beginOffset": 0
      }
    },
    {
      "text": {
        "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
        "beginOffset": 105
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Google",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 7,
        "label": "NSUBJ"
      },
      "lemma": "Google"
    },
    ...
    {
      "text": {
        "content": ".",
        "beginOffset": 179
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "NUMBER_UNKNOWN",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 20,
        "label": "P"
      },
      "lemma": "."
    }
  ],
  "language": "en"
}

tokens 数组包含代表检测到的句子令牌的 Token 对象,其中包括词性令牌和在句子中的位置等信息。

Go


func analyzeSyntax(ctx context.Context, client *language.Client, text string) (*languagepb.AnnotateTextResponse, error) {
	return client.AnnotateText(ctx, &languagepb.AnnotateTextRequest{
		Document: &languagepb.Document{
			Source: &languagepb.Document_Content{
				Content: text,
			},
			Type: languagepb.Document_PLAIN_TEXT,
		},
		Features: &languagepb.AnnotateTextRequest_Features{
			ExtractSyntax: true,
		},
		EncodingType: languagepb.EncodingType_UTF8,
	})
}

Java

// Instantiate the Language client com.google.cloud.language.v1.LanguageServiceClient
try (LanguageServiceClient language = LanguageServiceClient.create()) {
  Document doc = Document.newBuilder().setContent(text).setType(Type.PLAIN_TEXT).build();
  AnalyzeSyntaxRequest request =
      AnalyzeSyntaxRequest.newBuilder()
          .setDocument(doc)
          .setEncodingType(EncodingType.UTF16)
          .build();
  // analyze the syntax in the given text
  AnalyzeSyntaxResponse response = language.analyzeSyntax(request);
  // print the response
  for (Token token : response.getTokensList()) {
    System.out.printf("\tText: %s\n", token.getText().getContent());
    System.out.printf("\tBeginOffset: %d\n", token.getText().getBeginOffset());
    System.out.printf("Lemma: %s\n", token.getLemma());
    System.out.printf("PartOfSpeechTag: %s\n", token.getPartOfSpeech().getTag());
    System.out.printf("\tAspect: %s\n", token.getPartOfSpeech().getAspect());
    System.out.printf("\tCase: %s\n", token.getPartOfSpeech().getCase());
    System.out.printf("\tForm: %s\n", token.getPartOfSpeech().getForm());
    System.out.printf("\tGender: %s\n", token.getPartOfSpeech().getGender());
    System.out.printf("\tMood: %s\n", token.getPartOfSpeech().getMood());
    System.out.printf("\tNumber: %s\n", token.getPartOfSpeech().getNumber());
    System.out.printf("\tPerson: %s\n", token.getPartOfSpeech().getPerson());
    System.out.printf("\tProper: %s\n", token.getPartOfSpeech().getProper());
    System.out.printf("\tReciprocity: %s\n", token.getPartOfSpeech().getReciprocity());
    System.out.printf("\tTense: %s\n", token.getPartOfSpeech().getTense());
    System.out.printf("\tVoice: %s\n", token.getPartOfSpeech().getVoice());
    System.out.println("DependencyEdge");
    System.out.printf("\tHeadTokenIndex: %d\n", token.getDependencyEdge().getHeadTokenIndex());
    System.out.printf("\tLabel: %s\n\n", token.getDependencyEdge().getLabel());
  }
  return response.getTokensList();
}

Node.js

// Imports the Google Cloud client library
const language = require('@google-cloud/language');

// Creates a client
const client = new language.LanguageServiceClient();

/**
 * TODO(developer): Uncomment the following line to run this code.
 */
// const text = 'Your text to analyze, e.g. Hello, world!';

// Prepares a document, representing the provided text
const document = {
  content: text,
  type: 'PLAIN_TEXT',
};

// Need to specify an encodingType to receive word offsets
const encodingType = 'UTF8';

// Detects the sentiment of the document
const [syntax] = await client.analyzeSyntax({document, encodingType});

console.log('Tokens:');
syntax.tokens.forEach(part => {
  console.log(`${part.partOfSpeech.tag}: ${part.text.content}`);
  console.log('Morphology:', part.partOfSpeech);
});

Python

from google.cloud import language_v1

def sample_analyze_syntax(text_content):
    """
    Analyzing Syntax in a String

    Args:
      text_content The text content to analyze
    """

    client = language_v1.LanguageServiceClient()

    # text_content = 'This is a short sentence.'

    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}

    # Available values: NONE, UTF8, UTF16, UTF32
    encoding_type = language_v1.EncodingType.UTF8

    response = client.analyze_syntax(request = {'document': document, 'encoding_type': encoding_type})
    # Loop through tokens returned from the API
    for token in response.tokens:
        # Get the text content of this token. Usually a word or punctuation.
        text = token.text
        print(u"Token text: {}".format(text.content))
        print(
            u"Location of this token in overall document: {}".format(text.begin_offset)
        )
        # Get the part of speech information for this token.
        # Part of speech is defined in:
        # http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf
        part_of_speech = token.part_of_speech
        # Get the tag, e.g. NOUN, ADJ for Adjective, et al.
        print(
            u"Part of Speech tag: {}".format(
                language_v1.PartOfSpeech.Tag(part_of_speech.tag).name
            )
        )
        # Get the voice, e.g. ACTIVE or PASSIVE
        print(u"Voice: {}".format(language_v1.PartOfSpeech.Voice(part_of_speech.voice).name))
        # Get the tense, e.g. PAST, FUTURE, PRESENT, et al.
        print(u"Tense: {}".format(language_v1.PartOfSpeech.Tense(part_of_speech.tense).name))
        # See API reference for additional Part of Speech information available
        # Get the lemma of the token. Wikipedia lemma description
        # https://en.wikipedia.org/wiki/Lemma_(morphology)
        print(u"Lemma: {}".format(token.lemma))
        # Get the dependency tree parse information for this token.
        # For more information on dependency labels:
        # http://www.aclweb.org/anthology/P13-2017
        dependency_edge = token.dependency_edge
        print(u"Head token index: {}".format(dependency_edge.head_token_index))
        print(
            u"Label: {}".format(language_v1.DependencyEdge.Label(dependency_edge.label).name)
        )

    # Get the language of the text, which will be the same as
    # the language specified in the request or, if not specified,
    # the automatically-detected language.
    print(u"Language of the text: {}".format(response.language))

其他语言

C#: 请按照客户端库页面上的 C# 设置说明操作,然后访问 .NET 版 Natural Language 参考文档。

PHP: 请按照客户端库页面上的 PHP 设置说明操作,然后访问 PHP 版 Natural Language 参考文档。

Ruby: 请按照客户端库页面上的 Ruby 设置说明操作,然后访问 Ruby 版 Natural Language 参考文档。

分析 Cloud Storage 中的语法

为您方便起见,Natural Language API 可以直接对位于 Cloud Storage 的文件执行语法分析,而无需在请求正文中发送文件内容。

以下示例介绍如何对 Cloud Storage 中的文件执行语法分析。

协议

如需分析 Cloud Storage 中存储的文档的语法,请向 documents:analyzeSyntax REST 方法发出 POST 请求,并提供带有文档路径的相应请求正文,如以下示例所示。

curl -X POST \
     -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) \
     -H "Content-Type: application/json; charset=utf-8" \
     --data "{
  'encodingType': 'UTF8',
  'document': {
    'type': 'PLAIN_TEXT',
    'gcsContentUri': 'gs://<bucket-name>/<object-name>'
  }
}" "https://language.googleapis.com/v1/documents:analyzeSyntax"

如果您不指定 document.language,则系统将自动检测语言。有关 Natural Language API 支持哪些语言的信息,请参阅语言支持。如需详细了解如何配置请求正文,请参阅 Document 参考文档。

如果请求成功,服务器将返回一个 200 OK HTTP 状态代码以及 JSON 格式的响应:

{
  "sentences": [
    {
      "text": {
        "content": "Hello, world!",
        "beginOffset": 0
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Hello",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "X",
        // ...
      },
      "dependencyEdge": {
        "headTokenIndex": 2,
        "label": "DISCOURSE"
      },
      "lemma": "Hello"
    },
    {
      "text": {
        "content": ",",
        "beginOffset": 5
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        // ...
      },
      "dependencyEdge": {
        "headTokenIndex": 2,
        "label": "P"
      },
      "lemma": ","
    },
    // ...
  ],
  "language": "en"
}

tokens 数组包含代表检测到的句子令牌的 Token 对象,其中包括词性令牌和在句子中的位置等信息。

gcloud

如需查看完整的详细信息,请参阅 analyze-syntax 命令。

如需对 Cloud Storage 中的文件执行语法分析,请使用 gcloud 命令行工具并使用 --content-file 标志来标识包含待分析内容的文件路径:

gcloud ml language analyze-syntax --content-file=gs://YOUR_BUCKET_NAME/YOUR_FILE_NAME

如果请求成功,则服务器返回 JSON 格式的响应:

{
  "sentences": [
    {
      "text": {
        "content": "Hello, world!",
        "beginOffset": 0
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Hello",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "X",
        // ...
      },
      "dependencyEdge": {
        "headTokenIndex": 2,
        "label": "DISCOURSE"
      },
      "lemma": "Hello"
    },
    {
      "text": {
        "content": ",",
        "beginOffset": 5
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        // ...
      },
      "dependencyEdge": {
        "headTokenIndex": 2,
        "label": "P"
      },
      "lemma": ","
    },
    // ...
  ],
  "language": "en"
}

tokens 数组包含代表检测到的句子令牌的 Token 对象,其中包括词性令牌和在句子中的位置等信息。

Go


func analyzeSyntaxFromGCS(ctx context.Context, gcsURI string) (*languagepb.AnnotateTextResponse, error) {
	return client.AnnotateText(ctx, &languagepb.AnnotateTextRequest{
		Document: &languagepb.Document{
			Source: &languagepb.Document_GcsContentUri{
				GcsContentUri: gcsURI,
			},
			Type: languagepb.Document_PLAIN_TEXT,
		},
		Features: &languagepb.AnnotateTextRequest_Features{
			ExtractSyntax: true,
		},
		EncodingType: languagepb.EncodingType_UTF8,
	})
}

Java

// Instantiate the Language client com.google.cloud.language.v1.LanguageServiceClient
try (LanguageServiceClient language = LanguageServiceClient.create()) {
  Document doc =
      Document.newBuilder().setGcsContentUri(gcsUri).setType(Type.PLAIN_TEXT).build();
  AnalyzeSyntaxRequest request =
      AnalyzeSyntaxRequest.newBuilder()
          .setDocument(doc)
          .setEncodingType(EncodingType.UTF16)
          .build();
  // analyze the syntax in the given text
  AnalyzeSyntaxResponse response = language.analyzeSyntax(request);
  // print the response
  for (Token token : response.getTokensList()) {
    System.out.printf("\tText: %s\n", token.getText().getContent());
    System.out.printf("\tBeginOffset: %d\n", token.getText().getBeginOffset());
    System.out.printf("Lemma: %s\n", token.getLemma());
    System.out.printf("PartOfSpeechTag: %s\n", token.getPartOfSpeech().getTag());
    System.out.printf("\tAspect: %s\n", token.getPartOfSpeech().getAspect());
    System.out.printf("\tCase: %s\n", token.getPartOfSpeech().getCase());
    System.out.printf("\tForm: %s\n", token.getPartOfSpeech().getForm());
    System.out.printf("\tGender: %s\n", token.getPartOfSpeech().getGender());
    System.out.printf("\tMood: %s\n", token.getPartOfSpeech().getMood());
    System.out.printf("\tNumber: %s\n", token.getPartOfSpeech().getNumber());
    System.out.printf("\tPerson: %s\n", token.getPartOfSpeech().getPerson());
    System.out.printf("\tProper: %s\n", token.getPartOfSpeech().getProper());
    System.out.printf("\tReciprocity: %s\n", token.getPartOfSpeech().getReciprocity());
    System.out.printf("\tTense: %s\n", token.getPartOfSpeech().getTense());
    System.out.printf("\tVoice: %s\n", token.getPartOfSpeech().getVoice());
    System.out.println("DependencyEdge");
    System.out.printf("\tHeadTokenIndex: %d\n", token.getDependencyEdge().getHeadTokenIndex());
    System.out.printf("\tLabel: %s\n\n", token.getDependencyEdge().getLabel());
  }

  return response.getTokensList();
}

Node.js

// Imports the Google Cloud client library
const language = require('@google-cloud/language');

// Creates a client
const client = new language.LanguageServiceClient();

/**
 * TODO(developer): Uncomment the following lines to run this code
 */
// const bucketName = 'Your bucket name, e.g. my-bucket';
// const fileName = 'Your file name, e.g. my-file.txt';

// Prepares a document, representing a text file in Cloud Storage
const document = {
  gcsContentUri: `gs://${bucketName}/${fileName}`,
  type: 'PLAIN_TEXT',
};

// Need to specify an encodingType to receive word offsets
const encodingType = 'UTF8';

// Detects the sentiment of the document
const [syntax] = await client.analyzeSyntax({document, encodingType});

console.log('Parts of speech:');
syntax.tokens.forEach(part => {
  console.log(`${part.partOfSpeech.tag}: ${part.text.content}`);
  console.log('Morphology:', part.partOfSpeech);
});

Python

from google.cloud import language_v1

def sample_analyze_syntax(gcs_content_uri):
    """
    Analyzing Syntax in text file stored in Cloud Storage

    Args:
      gcs_content_uri Google Cloud Storage URI where the file content is located.
      e.g. gs://[Your Bucket]/[Path to File]
    """

    client = language_v1.LanguageServiceClient()

    # gcs_content_uri = 'gs://cloud-samples-data/language/syntax-sentence.txt'

    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"gcs_content_uri": gcs_content_uri, "type_": type_, "language": language}

    # Available values: NONE, UTF8, UTF16, UTF32
    encoding_type = language_v1.EncodingType.UTF8

    response = client.analyze_syntax(request = {'document': document, 'encoding_type': encoding_type})
    # Loop through tokens returned from the API
    for token in response.tokens:
        # Get the text content of this token. Usually a word or punctuation.
        text = token.text
        print(u"Token text: {}".format(text.content))
        print(
            u"Location of this token in overall document: {}".format(text.begin_offset)
        )
        # Get the part of speech information for this token.
        # Part of speech is defined in:
        # http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf
        part_of_speech = token.part_of_speech
        # Get the tag, e.g. NOUN, ADJ for Adjective, et al.
        print(
            u"Part of Speech tag: {}".format(
                language_v1.PartOfSpeech.Tag(part_of_speech.tag).name
            )
        )
        # Get the voice, e.g. ACTIVE or PASSIVE
        print(u"Voice: {}".format(language_v1.PartOfSpeech.Voice(part_of_speech.voice).name))
        # Get the tense, e.g. PAST, FUTURE, PRESENT, et al.
        print(u"Tense: {}".format(language_v1.PartOfSpeech.Tense(part_of_speech.tense).name))
        # See API reference for additional Part of Speech information available
        # Get the lemma of the token. Wikipedia lemma description
        # https://en.wikipedia.org/wiki/Lemma_(morphology)
        print(u"Lemma: {}".format(token.lemma))
        # Get the dependency tree parse information for this token.
        # For more information on dependency labels:
        # http://www.aclweb.org/anthology/P13-2017
        dependency_edge = token.dependency_edge
        print(u"Head token index: {}".format(dependency_edge.head_token_index))
        print(
            u"Label: {}".format(language_v1.DependencyEdge.Label(dependency_edge.label).name)
        )

    # Get the language of the text, which will be the same as
    # the language specified in the request or, if not specified,
    # the automatically-detected language.
    print(u"Language of the text: {}".format(response.language))

其他语言

C#: 请按照客户端库页面上的 C# 设置说明操作,然后访问 .NET 版 Natural Language 参考文档。

PHP: 请按照客户端库页面上的 PHP 设置说明操作,然后访问 PHP 版 Natural Language 参考文档。

Ruby: 请按照客户端库页面上的 Ruby 设置说明操作,然后访问 Ruby 版 Natural Language 参考文档。