Morphology & Dependency Trees

Morphology is the study of the internal structure of words and how they are formed and modified. Morphology focuses on how the components within a word (stems, root words, prefixes, suffixes, etc.) are arranged or modified to create different meanings. The Natural Language API uses morphological analysis to infer grammatical information about the words provided to it.

Note that morphology is distinct from syntax (though they may influence each other). For example, English expresses the future tense by adding the word "will" before a verb, as in the sentence "I will get my umbrella." Morphologically speaking, however, neither "will" nor "get" by itself indicates a future tense, as the words themselves have not changed; instead, the future tense is denoted through syntax rules. Many other languages modify the verb directly to create future tense forms.
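To make the distinction concrete, the sketch below looks for the English "will" future in token data shaped like a syntactic analysis response. The sample tokens are hand-written for illustration, not captured API output. The helper names `make_token` and `find_will_futures`, the nesting of `headTokenIndex` and `label` under a `dependencyEdge` object, and the AUX label on "will" are all assumptions.

```python
def make_token(content, lemma, tag, head, label):
    """Build a token dict in the assumed shape of a syntactic analysis response."""
    return {
        "text": {"content": content},
        "lemma": lemma,
        "partOfSpeech": {"tag": tag},
        "dependencyEdge": {"headTokenIndex": head, "label": label},
    }


# Hand-written, illustrative tokens for "I will get my umbrella."
SAMPLE_TOKENS = [
    make_token("I", "I", "PRON", 2, "NSUBJ"),
    make_token("will", "will", "VERB", 2, "AUX"),
    make_token("get", "get", "VERB", 2, "ROOT"),
    make_token("my", "my", "PRON", 4, "POSS"),
    make_token("umbrella", "umbrella", "NOUN", 2, "DOBJ"),
    make_token(".", ".", "PUNCT", 2, "P"),
]


def find_will_futures(tokens):
    """Return the verbs that have "will" attached to them as an auxiliary."""
    futures = []
    for token in tokens:
        edge = token["dependencyEdge"]
        if token["lemma"].lower() == "will" and edge["label"] == "AUX":
            head = tokens[edge["headTokenIndex"]]
            futures.append(head["text"]["content"])
    return futures


print(find_will_futures(SAMPLE_TOKENS))  # ['get']
```

Neither token carries a future tense value of its own here; the future reading comes only from the relationship between the two tokens, which is the syntactic (not morphological) signal described above.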

Within the Natural Language API, context via grammar can affect the morphological analysis of words/tokens, but only if the morphological analysis itself is confined to a single token. (Proper nouns, however, can be recognized across word boundaries.)
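Because proper names can span several tokens, one common follow-up step is to join adjacent tokens that are marked as proper. The sketch below assumes each token exposes a `proper` value of PROPER or NOT_PROPER inside its `partOfSpeech` object; the sample tokens are hand-written, and `proper_names` is a hypothetical helper name.

```python
def proper_names(tokens):
    """Join runs of adjacent tokens whose proper attribute is PROPER."""
    names, run = [], []
    for token in tokens:
        if token["partOfSpeech"].get("proper") == "PROPER":
            run.append(token["text"]["content"])
        elif run:
            # A non-proper token closes the current run.
            names.append(" ".join(run))
            run = []
    if run:
        names.append(" ".join(run))
    return names


# Hand-written, illustrative tokens for "We visited Wrigley Field."
sample = [
    {"text": {"content": "We"}, "partOfSpeech": {"tag": "PRON", "proper": "NOT_PROPER"}},
    {"text": {"content": "visited"}, "partOfSpeech": {"tag": "VERB", "proper": "NOT_PROPER"}},
    {"text": {"content": "Wrigley"}, "partOfSpeech": {"tag": "NOUN", "proper": "PROPER"}},
    {"text": {"content": "Field"}, "partOfSpeech": {"tag": "NOUN", "proper": "PROPER"}},
    {"text": {"content": "."}, "partOfSpeech": {"tag": "PUNCT", "proper": "NOT_PROPER"}},
]

print(proper_names(sample))  # ['Wrigley Field']
```

This simple grouping merges two distinct names that happen to be adjacent, so it is a starting point rather than a replacement for entity analysis.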

Morphology information is returned in the partOfSpeech field of each token in the syntactic analysis response. Additionally, the syntactic relationship between words is returned in each token's dependencyEdge field.

Parts of Speech

Within a syntactic request, part-of-speech and morphological information are returned within the response's partOfSpeech field. The partOfSpeech field contains a set of sub-fields with Part-of-Speech (POS) information as well as more explicit morphological information. These subfields are listed below.

Morphology varies greatly between languages. Languages such as Spanish, where word endings often change to alter meaning, will exhibit more morphological features; languages such as English, which rely more on word placement and syntax, will exhibit fewer. For example, English nouns have lost most distinct morphological cases: most nouns do not change form to indicate case, and only personal pronouns still distinguish nominative, genitive, and accusative forms. As a result, morphological analysis depends heavily on the source language and on which morphological features are supported for that language.

  • tag denotes the part of speech using a coarse-grained POS tag (NOUN, VERB, etc.), and provides top-level surface syntax information. POS tags are helpful if you want to create patterns and/or reduce ambiguity for subsequent language analysis (for example, “train” tagged as a NOUN versus a VERB).

  • number denotes a word's grammatical number, indicating its count distinction. In English, for example, the suffix "s" usually distinguishes the plural form of a noun from the singular. Some languages, such as Arabic, also have a dual number. This field may contain the following values:

    • SINGULAR denotes one quantity.
    • PLURAL denotes more than one quantity.
    • DUAL denotes precisely two quantities.
  • person denotes a word's grammatical person indicating a speaker's relationship to an event. In English, person is most often used on pronouns to distinguish between speakers (first person), those spoken to (second person), and others (third person). This field may contain the following values:

    • FIRST person denotes the first person (the speaker).
    • SECOND person denotes the second person (the person spoken to).
    • THIRD person denotes an "other" person outside of the conversation.
    • REFLEXIVE_PERSON denotes use of a reflexive pronoun.
  • gender denotes a noun's grammatical gender. This field may contain the following values:

    • The FEMININE grammatical gender
    • The MASCULINE grammatical gender
    • The NEUTER grammatical gender
  • case denotes a word's grammatical case and its relationship to its containing sentence. Note that English does not exhibit many explicit morphological cases, as the information normally conveyed through cases is typically indicated by word order. This field may contain the following values:

    • The ACCUSATIVE case indicates the direct object of a transitive verb.
    • The ADVERBIAL case indicates an adverbial form of an adjective. Note that English uses separate words to distinguish adverbs ("well") and adjectives ("good") rather than using an explicit adverbial case.
    • The COMPLEMENTIVE case (Chinese) indicates a word necessary to complete the meaning of a potential, descriptive, or resultative expression using a conjunctive particle.
    • The DATIVE case indicates an indirect object, typically the recipient of something being given. In English, the dative case is obviated through use of the preposition "to" as in the phrase "He gave the ball to Bobby."
    • The GENITIVE case indicates possession. Note that in English, the "'s" clitic is used to denote this usage instead of through a strict genitive case.
    • The INSTRUMENTAL case indicates that a noun is the instrument by which a subject completes an action. In English, the instrumental case is obviated through use of the preposition "with" as in the phrase "He hit him with a baseball bat."
    • The LOCATIVE case indicates that a word denotes a location. In English, the locative case is obviated through use of prepositions such as "in", "on", etc. as in the phrase "The cow is in the barn."
    • The NOMINATIVE case indicates the subject of a verb. In English, the subject of a verb is instead indicated through word order.
    • The OBLIQUE case indicates a word's use as an object to either a verb or preposition.
    • The PARTITIVE case indicates a word's "partialness" or lack of specific identity.
    • The PREPOSITIONAL case indicates the object of a preposition.
    • The REFLEXIVE_CASE indicates the identity of an object of a verb to its subject. Most languages do not use a reflexive case, as this usage is indicated through use of special reflexive pronouns instead (such as "himself", "myself", etc.).
    • The RELATIVE_CASE (Chinese) indicates the complementizer of a relative clause connecting a noun with a verb or adjective. Examples: 工作 [的] 地方 (work [] place :: "place [where I] work"). 便宜 的 餐馆 (inexpensive [] restaurants :: restaurants [that are] inexpensive).
    • The VOCATIVE case indicates a noun being used to address someone or something, usually when spoken to.
  • tense denotes a verb's grammatical tense, which indicates the verb's reference to a position in time. Note that tense is distinct from aspect, which also deals with a verb's relationship to time, but focuses on the characteristics of that time flow, rather than its position. The IMPERFECT and PLUPERFECT tenses in many languages more accurately refer to specific combinations of tense and aspect. This field may contain the following values:

    • CONDITIONAL_TENSE is an alternate term for the more prevalent morphological term of "conditional mood." (See CONDITIONAL_MOOD below.)
    • FUTURE denotes an action taking place in the future. Note that in English, the future tense is most often denoted by adding the word "will" to a verb phrase.
    • PAST denotes an action taking place in the past.
    • PRESENT denotes an action taking place in the present.
    • IMPERFECT denotes an action taking place in the past that was not completed at that tense's frame of reference. Note that in English, the imperfect is most often expressed by combining a past tense form of "to be" with a present participle, as in "I was walking."
    • PLUPERFECT denotes an action that has taken place in the past, and was also completed at that tense's frame of reference. For example, "I had walked" takes place in the past, but was also complete during the past tense's frame of reference.
  • aspect denotes a verb's grammatical aspect, its expression of time flow. Unlike tense, which focuses on a verb's position within time, aspect focuses on the characteristics of that time flow where it occurs. This field may contain the following values:

    • The PERFECTIVE aspect denotes an event that is "completed" either because it has completely happened in the past or will completely happen in the future.
    • The IMPERFECTIVE aspect denotes an event that is incomplete, either because it is continuous or because it is repeated.
    • The PROGRESSIVE aspect denotes an event that is continuous. A progressive aspect is generally treated as a special case of the more general imperfective aspect (which also covers repetition).

  • mood denotes a verb's grammatical mood, which indicates attitude about an underlying action. This field may contain the following values:

    • CONDITIONAL_MOOD indicates an action which is contingent. Note that in English, verb forms are not conditional; instead, conditional behavior is noted through use of the word "would" combined with the verb's infinitive.
    • IMPERATIVE indicates a command or request through the second person.
    • INDICATIVE indicates a statement of fact, more generally known as a "realis mood."
    • INTERROGATIVE indicates a question.
    • JUSSIVE indicates a command or request through either the first or third person. English does not have a jussive mood, though exhortations that begin with a real or implied "Let us" convey this jussive mood.
    • SUBJUNCTIVE indicates a quality of uncertainty related to an action, also known as an "irrealis" mood (contrasted with the "realis" indicative mood). English does not have a specific subjunctive mood; instead, words such as "want", "wish", "hope", etc. convey the import of the subjunctive mood.
  • voice denotes a verb's grammatical voice, the relationship between an action and a subject and/or object. This field may contain the following values:

    • ACTIVE voice indicates an action whose subject is performing the action.
    • CAUSATIVE voice indicates an action that the subject causes another party to perform. In English, no direct causative voice exists; instead, such causation is indicated through use of a verb such as "make", as in "Mom made me go to school."
    • PASSIVE voice indicates an action whose effect is being performed on the subject. In many cases, a passive "agent" is unspoken or unknown.
  • reciprocity denotes a word's (typically a pronoun's) reciprocity, indicating the pronoun refers to a noun phrase elsewhere within the sentence. This field may contain the following values:

    • RECIPROCAL indicates the pronoun is reciprocal.
    • NON_RECIPROCAL indicates the pronoun is not reciprocal.
  • proper denotes whether a noun is part of a proper name. Note that many proper names consist of several words; if such a phrase is detected as a proper name, each of its tokens will be marked as proper as well. (For example, both "Wrigley" and "Field" in the proper name "Wrigley Field" will have their proper attribute set to PROPER.) This field may contain the following values:

    • PROPER denotes that the token is part of a proper name.
    • NOT_PROPER denotes that the token is not part of a proper name.
  • form denotes additional morphological forms that don't neatly fit into the previous set of common forms (tense, mood, person, etc.). Most of these forms are specific to particular languages. This field may contain the following values:

    • ADNOMIAL (Korean/Japanese) indicates a word ending (Korean) or verb (Japanese) that modifies a noun phrase. Examples: 밥을 먹는 사람 [someone who eats rice] and 書く人 [someone who writes].
    • AUXILIARY (Korean) indicates a word ending that connects two adjacent main and auxiliary predicates: 밥을 먹게 하다 [make (someone) eat].
    • COMPLEMENTIZER (Korean) indicates a word ending that connects two or more different clauses: 밥을 먹고 물을 마신다 [(I) eat rice and drink water].
    • FINAL_ENDING (Korean/Japanese) indicates a word ending that finalizes the clause or sentence coming at the end of the clause or sentence. Examples: 밥을 먹는다 [(I) eat rice] and 手紙を書く [write a letter].
    • GERUND (Korean/Japanese) indicates a word ending that nominalizes verbs or adjectives (Korean): 밥 먹기 [eating rice], or that connects verbs with various auxiliary verbs (Japanese): 書きたい [want to write].
    • REALIS (Japanese) indicates conditional and subjunctive forms with a conjunctive particle “ば”: 書けば [if (I) write].
    • IRREALIS (Japanese) indicates a form that connects verbs with negative, passive, or causative auxiliary verbs: 書かない [do not write], 書かれる [to be written], 書かせる [make (someone) write].
    • ORDER (Japanese) indicates a command verb, similar to the imperative: 書け! [write!]
    • SPECIFIC (Japanese) indicates special forms that cannot be covered by the six categories above. The most common use of this form is a derivation of a noun from an adjective by adding a suffix to the form: かわいさ [cuteness].
    • SHORT (Russian) indicates a short-form adjective or participle.
    • LONG (Russian) indicates a long-form adjective or participle, as distinct from the above SHORT form.

Note that the Natural Language API provides morphological information on a per-token basis (not per phrase). Morphological constructs that cross word boundaries may not be supported.
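Since most sub-fields do not apply to any given token, it is often convenient to keep only the features that carry information. The sketch below assumes that unset sub-fields are reported with placeholder values ending in UNKNOWN (for example NUMBER_UNKNOWN). The sample partOfSpeech object is hand-written for the English word "walked", and `known_features` is a hypothetical helper name.

```python
def known_features(part_of_speech):
    """Return only the sub-fields whose value is not an *UNKNOWN placeholder."""
    return {
        name: value
        for name, value in part_of_speech.items()
        if not value.endswith("UNKNOWN")
    }


# Hand-written, illustrative partOfSpeech object for the token "walked".
walked = {
    "tag": "VERB",
    "aspect": "ASPECT_UNKNOWN",
    "case": "CASE_UNKNOWN",
    "form": "FORM_UNKNOWN",
    "gender": "GENDER_UNKNOWN",
    "mood": "INDICATIVE",
    "number": "NUMBER_UNKNOWN",
    "person": "PERSON_UNKNOWN",
    "proper": "PROPER_UNKNOWN",
    "reciprocity": "RECIPROCITY_UNKNOWN",
    "tense": "PAST",
    "voice": "VOICE_UNKNOWN",
}

print(known_features(walked))
```

For a language with richer morphology, the filtered result would typically contain more entries (such as gender, case, or person) for the same kind of token.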

Dependency trees

For each sentence within the text provided to the Natural Language API for syntactic analysis, the API constructs a dependency tree, describing the syntactic structure of that sentence. Generally, when analyzing this dependency graph, you will want to iterate over each sentence's constituent tokens.

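The following sentence from John F. Kennedy's Inaugural speech serves as an example: "Ask not what your country can do for you, ask what you can do for your country."

[Diagram: dependency tree of the example sentence]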

Note that the dependency tree includes a ROOT element, which corresponds to the main verb in the sentence.

Each token's label field (of type Label) describes the syntactic relationship between that token and the token referenced by its headTokenIndex.
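A small sketch of how these two fields can be read together follows. It assumes that `headTokenIndex` and `label` are nested under a `dependencyEdge` object on each token. The three tokens are hand-written for the sentence "Dogs bark loudly", and `describe_edges` is a hypothetical helper name.

```python
def describe_edges(tokens):
    """Render each token's relationship to its head as 'dependent --LABEL--> head'."""
    lines = []
    for token in tokens:
        edge = token["dependencyEdge"]
        head = tokens[edge["headTokenIndex"]]
        lines.append(
            f'{token["text"]["content"]} --{edge["label"]}--> {head["text"]["content"]}'
        )
    return lines


# Hand-written, illustrative tokens for "Dogs bark loudly".
sample = [
    {"text": {"content": "Dogs"}, "dependencyEdge": {"headTokenIndex": 1, "label": "NSUBJ"}},
    {"text": {"content": "bark"}, "dependencyEdge": {"headTokenIndex": 1, "label": "ROOT"}},
    {"text": {"content": "loudly"}, "dependencyEdge": {"headTokenIndex": 1, "label": "ADVMOD"}},
]

for line in describe_edges(sample):
    print(line)
# Dogs --NSUBJ--> bark
# bark --ROOT--> bark
# loudly --ADVMOD--> bark
```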

In the above example, headTokenIndex = 0 for "do" and the second clause's "ask", indicating that these words modify the first clause's ROOT word ("Ask"). The token's label value specifies the type of relationship. For example, "country" has an NSUBJ (noun subject) relationship to "do" in the first clause, while "you" has that same relationship to "do" in the second clause.
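Queries such as "which token is the subject of this verb?" are easier once the head references are inverted into a map from each head to its dependents. The sketch below does this under the same assumption that each token carries a `dependencyEdge` object. The tokens are hand-written for "The cow eats grass", and both function names are hypothetical.

```python
from collections import defaultdict


def dependents_by_head(tokens):
    """Map each head token index to the indices of the tokens that depend on it."""
    children = defaultdict(list)
    for index, token in enumerate(tokens):
        head = token["dependencyEdge"]["headTokenIndex"]
        if head != index:  # skip the ROOT token's reference to itself
            children[head].append(index)
    return children


def dependents_with_label(tokens, head_index, label):
    """Return the text of the dependents of head_index that have the given label."""
    children = dependents_by_head(tokens)
    return [
        tokens[i]["text"]["content"]
        for i in children[head_index]
        if tokens[i]["dependencyEdge"]["label"] == label
    ]


# Hand-written, illustrative tokens for "The cow eats grass".
sample = [
    {"text": {"content": "The"}, "dependencyEdge": {"headTokenIndex": 1, "label": "DET"}},
    {"text": {"content": "cow"}, "dependencyEdge": {"headTokenIndex": 2, "label": "NSUBJ"}},
    {"text": {"content": "eats"}, "dependencyEdge": {"headTokenIndex": 2, "label": "ROOT"}},
    {"text": {"content": "grass"}, "dependencyEdge": {"headTokenIndex": 2, "label": "DOBJ"}},
]

print(dependents_with_label(sample, 2, "NSUBJ"))  # ['cow']
```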

Note that although parse trees do not cross sentence boundaries, headTokenIndex is an index into the token list of the entire document, not just the current sentence. For the ROOT word "Ask", the headTokenIndex is its own index.
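The two points above (document-wide indices and a self-referencing ROOT) can be sketched as follows. The tokens are hand-written for the two-sentence text "Dogs bark. Cats meow.", the `dependencyEdge` nesting is an assumption, and `root_indices` and `sentence_relative` are hypothetical helper names.

```python
def root_indices(tokens):
    """Return the document-wide indices of tokens whose head is themselves."""
    return [
        index
        for index, token in enumerate(tokens)
        if token["dependencyEdge"]["headTokenIndex"] == index
    ]


def sentence_relative(token_index, first_token_index):
    """Convert a document-wide token index into an index within its sentence."""
    return token_index - first_token_index


# Hand-written, illustrative tokens for "Dogs bark. Cats meow."
sample = [
    {"text": {"content": "Dogs"}, "dependencyEdge": {"headTokenIndex": 1, "label": "NSUBJ"}},
    {"text": {"content": "bark"}, "dependencyEdge": {"headTokenIndex": 1, "label": "ROOT"}},
    {"text": {"content": "."}, "dependencyEdge": {"headTokenIndex": 1, "label": "P"}},
    {"text": {"content": "Cats"}, "dependencyEdge": {"headTokenIndex": 4, "label": "NSUBJ"}},
    {"text": {"content": "meow"}, "dependencyEdge": {"headTokenIndex": 4, "label": "ROOT"}},
    {"text": {"content": "."}, "dependencyEdge": {"headTokenIndex": 4, "label": "P"}},
]

print(root_indices(sample))  # [1, 4]

# The second sentence starts at document-wide token index 3, so its ROOT
# ("meow", index 4) is the second token of that sentence.
print(sentence_relative(4, 3))  # 1
```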

Sentences and tokens within the Natural Language API are indexed using zero-based offset values within the text as a whole. The following Python sketch shows a common pattern for iterating over the syntactic analysis response, where sentences and tokens are the lists of the same names in that response:

def tokens_by_sentence(sentences, tokens):
    """Yield (sentence, tokens_in_sentence) pairs, in document order."""
    index = 0  # position in the document-wide token list
    for sentence in sentences:
        content = sentence['text']['content']
        sentence_begin = sentence['text']['beginOffset']
        sentence_end = sentence_begin + len(content) - 1
        sentence_tokens = []
        while (index < len(tokens)
               and tokens[index]['text']['beginOffset'] <= sentence_end):
            # This token is in this sentence
            sentence_tokens.append(tokens[index])
            index += 1
        yield sentence, sentence_tokens
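The pattern above measures a sentence with len(content), which in Python counts Unicode code points. The sketch below assumes that the unit of beginOffset follows the request's encodingType (bytes for UTF8, 16-bit code units for UTF16, code points for UTF32). It shows one way to measure a span in the matching unit; `span_length` is a hypothetical helper name.

```python
def span_length(content, encoding_type="UTF32"):
    """Return the length of content in the unit implied by encoding_type."""
    if encoding_type == "UTF8":
        return len(content.encode("utf-8"))  # bytes
    if encoding_type == "UTF16":
        return len(content.encode("utf-16-le")) // 2  # 16-bit code units
    if encoding_type == "UTF32":
        return len(content)  # Unicode code points
    raise ValueError(f"unsupported encoding type: {encoding_type}")


word = "caf\u00e9"  # "café" with a precomposed é
print(span_length(word, "UTF8"))   # 5
print(span_length(word, "UTF16"))  # 4
print(span_length(word, "UTF32"))  # 4
```

With non-ASCII text and UTF8 offsets, replacing len(content) with a call like this keeps the computed end of each sentence aligned with the token offsets.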

For more information about dependency trees, consult the Universal Dependency Treebank project. In addition, Universal Dependency Annotation for Multilingual Processing contains background information on the methodology used to interpret such a dependency tree.
