Document AI API's text recognition feature (OCR) is able to detect a wide variety of languages and can detect multiple languages within a single document.
Languages detected by the Document AI API are returned in the detectedLanguages field of the Document object.
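As a minimal sketch of reading that field, the following works on a Document that has already been converted to a plain dictionary. It assumes the JSON shape in which each entry of pages carries a detectedLanguages list whose items have languageCode and confidence keys; the sample data and the helper name summarize_detected_languages are illustrative only.

```python
from collections import defaultdict


def summarize_detected_languages(document: dict) -> dict:
    """Return the highest confidence seen for each language code across pages.

    Assumes (not confirmed by this page) that the Document JSON has the shape
    pages[].detectedLanguages[] with "languageCode" and "confidence" keys.
    """
    best = defaultdict(float)
    for page in document.get("pages", []):
        for lang in page.get("detectedLanguages", []):
            code = lang.get("languageCode")
            if code:
                best[code] = max(best[code], lang.get("confidence", 0.0))
    return dict(best)


# Hypothetical sample data, shaped like the assumed JSON output.
sample = {
    "pages": [
        {"detectedLanguages": [{"languageCode": "en", "confidence": 0.97}]},
        {"detectedLanguages": [
            {"languageCode": "en", "confidence": 0.80},
            {"languageCode": "fr", "confidence": 0.15},
        ]},
    ]
}

print(summarize_detected_languages(sample))
```

Keeping the maximum confidence per code is one possible design choice; averaging across pages would be equally reasonable, depending on how the result is used.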
Each language code parameter is typically a BCP-47 identifier. It can take the form language-region, where language is the primary language and the optional region identifies a particular dialect (usually by country). A code can also carry a script subtag instead of, or in addition to, a region. For example, Chinese can be represented as Simplified Chinese as written in the People's Republic of China (zh-Hans) or Traditional Chinese as written in Taiwan (zh-Hant).
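To make the structure of these codes concrete, here is a deliberately simplified sketch that splits a tag into its language, script, and region parts. It is not a full BCP-47 parser: variants, extensions, and private-use subtags are ignored, and the names LanguageTag and parse_language_tag are hypothetical.

```python
from typing import NamedTuple, Optional


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_language_tag(tag: str) -> LanguageTag:
    """Split a tag such as 'zh-Hant-HK' into language, script, and region.

    Simplified: a 4-letter subtag is treated as a script, a 2-letter or
    3-digit subtag as a region, and anything else is ignored.
    """
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if script is None and region is None and len(part) == 4 and part.isalpha():
            script = part.title()
        elif region is None and (
            (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit())
        ):
            region = part.upper()
    return LanguageTag(language, script, region)


for code in ("zh-Hans", "zh-Hant-HK", "en-GB", "es-419"):
    print(code, parse_language_tag(code))
```

For production use, a dedicated BCP-47 library is a safer choice than hand-rolled parsing.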
Document OCR processor language support
Currently, English is the only language supported for Document OCR functionality.
There are three levels of language support in OCR functionality:
- Supported languages are those we prioritize and regularly evaluate performance against.
- Experimental languages are those under active development but not yet regularly evaluated.
- Mapped languages are those supported by mapping them to another language code or to a general character recognizer. For example, "en-GB" is supported, but it is not treated any differently than "en" for the purposes of recognizing text. We make a best effort to return the correct mapped language code in the Entity locale field, but mapped languages are more likely than fully supported or experimentally supported languages to be misidentified as a similar language.
Supported languages
The following languages are prioritized and regularly evaluated.
Language | Language (English name) | languageHints code | Script / notes |
---|---|---|---|
Afrikaans | Afrikaans | af | Latn |
shqip | Albanian | sq | Latn |
العربية | Arabic | ar | Arab; Modern Standard |
Հայ | Armenian | hy | Armn |
беларускі | Belarusian | be | Cyrl |
বাংলা | Bengali | bn | Beng |
български | Bulgarian | bg | Cyrl |
Català | Catalan | ca | Latn |
普通话 | Chinese | zh | Hans/Hant |
Hrvatski | Croatian | hr | Latn |
Čeština | Czech | cs | Latn |
Dansk | Danish | da | Latn |
Nederlands | Dutch | nl | Latn |
English | English | en | Latn; American |
Eesti keel | Estonian | et | Latn |
Filipino | Filipino | fil (or tl) | Latn |
Suomi | Finnish | fi | Latn |
Français | French | fr | Latn; European |
Deutsch | German | de | Latn |
Ελληνικά | Greek | el | Grek |
ગુજરાતી | Gujarati | gu | Gujr |
עברית | Hebrew | iw | Hebr |
हिन्दी | Hindi | hi | Deva |
Magyar | Hungarian | hu | Latn |
Íslenska | Icelandic | is | Latn |
Bahasa Indonesia | Indonesian | id | Latn |
Italiano | Italian | it | Latn |
日本語 | Japanese | ja | Jpan |
ಕನ್ನಡ | Kannada | kn | Knda |
ភាសាខ្មែរ | Khmer | km | Khmr |
한국어 | Korean | ko | Kore |
ລາວ | Lao | lo | Laoo |
Latviešu | Latvian | lv | Latn |
Lietuvių | Lithuanian | lt | Latn |
Македонски | Macedonian | mk | Cyrl |
Bahasa Melayu | Malay | ms | Latn |
മലയാളം | Malayalam | ml | Mlym |
मराठी | Marathi | mr | Deva |
नेपाली | Nepali | ne | Deva |
Norsk | Norwegian | no | Latn; Bokmål |
فارسی | Persian | fa | Arab |
Polski | Polish | pl | Latn |
Português | Portuguese | pt | Latn; Brazilian |
ਪੰਜਾਬੀ | Punjabi | pa | Guru; Gurmukhi |
Română | Romanian | ro | Latn |
Русский | Russian | ru | Cyrl |
Русский (старая орфография) | Russian | ru-PETR1708 | Cyrl; Old Orthography |
Српски | Serbian | sr | Cyrl & Latn |
Српски (латиница) | Serbian | sr-Latn | Latn |
Slovenčina | Slovak | sk | Latn |
Slovenščina | Slovenian | sl | Latn |
Español | Spanish | es | Latn; European |
Svenska | Swedish | sv | Latn |
தமிழ் | Tamil | ta | Taml |
తెలుగు | Telugu | te | Telu |
ไทย | Thai | th | Thai |
Türkçe | Turkish | tr | Latn |
Українська | Ukrainian | uk | Cyrl |
Tiếng Việt | Vietnamese | vi | Latn |
Yiddish | Yiddish | yi | Hebr |
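The languageHints codes in the table above are the values passed as OCR language hints. The sketch below only builds a request fragment as a plain dictionary and does not call any service. The nesting processOptions, ocrConfig, hints, languageHints is an assumption about the REST request shape and should be checked against the API reference; build_process_options is a hypothetical helper.

```python
import json


def build_process_options(language_hints):
    """Build a request fragment carrying OCR language hints.

    The key names below are assumed, not confirmed by this page.
    Blank entries and duplicates are dropped; order is preserved.
    """
    hints = []
    for code in language_hints:
        code = code.strip()
        if code and code not in hints:
            hints.append(code)
    return {"processOptions": {"ocrConfig": {"hints": {"languageHints": hints}}}}


print(json.dumps(build_process_options(["en", "fr", "en"]), indent=2))
```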
Experimental languages
The following languages are under active development and are not yet regularly evaluated.
Language | Language (English name) | languageHints code | Script / notes |
---|---|---|---|
አማርኛ | Amharic | am | Ethi |
Αρχαία ελληνικά | Ancient Greek | grc | Grek |
অসমীয়া | Assamese | as | Beng |
Azərbaycan | Azerbaijani | az | Latn |
Azərbaycan (qədim yazı) | Azerbaijani | az-Cyrl | Cyrl; old orthography |
Euskara | Basque | eu | Latn |
Bosanski | Bosnian | bs | Latn |
မြန်မာ | Burmese | my | Mymr |
Cebuano | Cebuano | ceb | Latn |
ᏣᎳᎩ ᎦᏬᏂᎯᏍᏗ | Cherokee | chr | Cher |
dhivehi, dhivehi-bas | Dhivehi | dv | Thaa |
རྫོང་ཁ | Dzongkha | dz | Tibt |
Esperanto | Esperanto | eo | Latn |
Galego | Galician | gl | Latn |
ქართული | Georgian | ka | Geor |
Kreyòl Ayisyen | Haitian Creole | ht | Latn |
Gaeilge | Irish | ga | Latn |
Jawa | Javanese | jv | Latn |
Қазақ | Kazakh | kk | Cyrl |
Kirghiz | Kirghiz | ky | Cyrl |
Latine | Latin | la | Latn |
Malti | Maltese | mt | Latn |
Монгол | Mongolian | mn | Cyrl |
ଓଡ଼ିଆ | Oriya | or | Orya |
پښتو | Pashto | ps | Arab |
संस्कृतम् | Sanskrit | sa | Deva |
සිංහල | Sinhala | si | Sinh |
Swahili | Swahili | sw | Latn |
leššānā Suryāyā | Syriac | syr | Syrc |
བོད་སྐད་ | Tibetan | bo | Tibt |
ትግርኛ | Tigrinya | ti | Ethi |
اردو | Urdu | ur | Arab |
oʻzbekcha | Uzbek | uz | Latn; Latin |
oʻzbekcha | Uzbek | uz-Cyrl | Cyrl; old orthography |
Cymraeg | Welsh | cy | Latn |
IsiZulu | Zulu | zu | Latn |
Mapped languages
The following languages are mapped to another language code or mapped to a general character recognizer.
Language | Language (English name) | languageHints code | Script / notes | Mapped to |
---|---|---|---|---|
بهسا اچيه | Acehnese | ace | Latn | Latin script model |
Lwo | Acholi | ach | Latn | Latin script model |
Dangme | Adangme | ada | Latn | Latin script model |
Akan | Akan | ak | Latn | Latin script model |
Anicinâbemowin | Algonquian | alg | Latn | Latin script model |
Mapudungu | Araucanian/Mapuche | arn | Latn | Latin script model |
Asturianu | Asturian | ast | Latn | Latin script model |
Dene | Athabaskan | ath | Latn | Latin script model |
Aymar aru | Aymara | ay | Latn | Latin script model |
Bhāṣa Bali | Balinese | ban | Latn | Latin script model |
Bamanankan | Bambara | bm | Latn | Latin script model |
Narrow Bantu | Bantu | bnt | Latn | Latin script model |
башҡорт теле | Bashkir | ba | Cyrl | Cyrillic script model |
Toba–Batak | Batak | btk | Latn | Latin script model |
Chibemba | Bemba | bem | Latn | Latin script model |
Bikol Naga | Bikol | bik | Latn | Latin script model |
Bichelamar | Bislama | bi | Latn | Latin script model |
Brezhoneg | Breton | br | Latn | Latin script model |
нохчийн мотт / noxçiyn mott | Chechen | ce | Cyrl | Cyrillic script model |
汉语 | Chinese | zh-Hans | Hans; Simplified; Mandarin | zh |
漢語 | Chinese | zh-Hant | Hant; Traditional; Mandarin | zh |
普通話 | Chinese | zh-Hant-HK | Hant; Mandarin; Hong Kong | zh |
Chahta' | Choctaw | cho | Latn | Latin script model |
Чӑвашла | Chuvash | cv | Cyrl | Cyrillic script model |
Cree–Montagnais–Naskapi | Cree | cr | Latn | Latin script model |
Mvskoke | Creek | mus | Latn | Latin script model |
qırımtatar tili, къырымтатар тили | Crimean Tatar | crh | Latn | Cyrillic script model |
Dakhótiyapi, Dakȟótiyapi | Dakota | dak | Latn | Latin script model |
Douala | Duala | dua | Latn | Latin script model |
Ikɔ Efik | Efik | efi | Latn | Latin script model |
English (British) | English | en-GB | Latn; British | en |
Èʋegbe | Ewe | ee | Latn | Latin script model |
føroyskt mál | Faroese | fo | Latn | Latin script model |
Na Vosa Vakaviti | Fijian | fj | Latn | Latin script model |
fɔ̀ngbè | Fon | fon | Latn | Latin script model |
Français canadien | French | fr-CA | Latn; Canadian | fr |
Fulani, Fulah, Peul | Fulah | ff | Latn | Latin script model |
Gã | Ga | gaa | Latn | Latin script model |
Luganda | Ganda | lg | Latn | Latin script model |
Basa Gayo | Gayo | gay | Latn | Latin script model |
Kiribati | Gilbertese | gil | Latn | Latin script model |
Gothic | Gothic | got | Latn | Latin script model |
Guaraní | Guarani | gn | Latn | Latin script model |
Harshen/Halshen Hausa هَرْشَن هَوْسَ | Hausa | ha | Latn | Latin script model |
ʻŌlelo Hawaiʻi | Hawaiian | haw | Latn | Latin script model |
Otjiherero | Herero | hz | Latn | Latin script model |
Ilonggo | Hiligaynon | hil | Latn | Latin script model |
Jaku Iban | Iban | iba | Latn | Latin script model |
Asụsụ Igbo | Igbo | ig | Latn | Latin script model |
Ilokano | Iloko | ilo | Latn | Latin script model |
Taqbaylit | Kabyle | kab | Latn | Latin script model |
Jingpho | Kachin | kac | Latn | Latin script model |
Kalaallisut | Kalaallisut | kl | Latn | Latin script model |
Kikamba | Kamba | kam | Latn | Latin script model |
Kanuri | Kanuri | kr | Latn | Latin script model |
Qaraqalpaq tili, Қарақалпақ тили, قاراقالپاق تىلى | Kara-Kalpak | kaa | Cyrl/Latn | Cyrillic script model |
Ka Ktien Khasi | Khasi | kha | Latn | Latin script model |
Gĩkũyũ | Kikuyu | ki | Latn | Latin script model |
Kinyarwanda | Kinyarwanda | rw | Latn | Latin script model |
коми кыв | Komi | kv | Cyrl | Cyrillic script model |
Kikongo | Kongo | kg | Latn | Latin script model |
Kosraean | Kosraean | kos | Latn | Latin script model |
Oshikwanyama | Kuanyama | kj | Latn | Latin script model |
Ngala | Lingala | ln | Latn | Latin script model |
Plattdütsch, Plattdeutsch, Nedersaksisch | Low German | nds | Latn | Latin script model |
siLozi | Lozi | loz | Latn | Latin script model |
Kiluba | Luba-Katanga | lu | Latn | Latin script model |
Dholuo | Luo | luo | Latn | Latin script model |
Madhura, Basa Mathura, بَهاسَ مَدورا | Madurese | mad | Latn | Latin script model |
Malagasy | Malagasy | mg | Latn | Latin script model |
Mandinka, لغة مندنكا | Mandingo | man | Latn | Latin script model |
Gaelg, Gailck | Manx | gv | Latn | Latin script model |
Te reo Māori | Maori | mi | Latn | Latin script model |
Ebon | Marshallese | mh | Latn | Latin script model |
Mɛnde yia | Mende | men | Latn | Latin script model |
Middle English | Middle English | enm | Latn | Latin script model |
Mittelhochdeutsch | Middle High German | gmh | Latn | Latin script model |
Baso Minangkabau, باسو مينڠكاباو | Minangkabau | min | Latn | Latin script model |
Kanienʼkéha | Mohawk | moh | Latn | Latin script model |
Nkundu | Mongo | lol | Latn | Latin script model |
Nāhuatl | Nahuatl | nah | Latn | Latin script model |
Diné bizaad | Navajo | nv | Latn | Latin script model |
Ndonga | Ndonga | ng | Latn | Latin script model |
ko e vagahau Niuē | Niuean | niu | Latn | Latin script model |
Zimbabwe Ndebele | North Ndebele | nd | Latn | Latin script model |
Sesotho sa Leboa | Northern Sotho | nso | Latn | Latin script model |
Chichewa, Chinyanja | Nyanja | ny | Latn | Latin script model |
Runyankore | Nyankole | nyn | Latn | Latin script model |
Chitonga | Nyasa Tonga | tog | Latn | Latin script model |
Appolo | Nzima | nzi | Latn | Latin script model |
Occitan, lenga d'òc, provençal | Occitan | oc | Latn | Latin script model |
Anishinaabemowin, ᐊᓂᔑᓈᐯᒧᐎᓐ | Ojibwa | oj | Latn | Latin script model |
Ænglisc, Englisc, Anglisc | Old English | ang | Latn | Latin script model |
Franceis, François, Romanz | Old French | fro | Latn | Latin script model |
Diutisk, Althochdeutsch | Old High German | goh | Latn | Latin script model |
Dǫnsk tunga | Old Norse | non | Latn | Latin script model |
Occitan ancian | Old Provencal | pro | Latn | Latin script model |
ирон ӕвзаг | Ossetic | os | Cyrl | Cyrillic script model |
Kapampangan | Pampanga | pam | Latn | Latin script model |
Salitan Pangasinan | Pangasinan | pag | Latn | Latin script model |
Papiamentu | Papiamento | pap | Latn | Latin script model |
Português (Portugal) | Portuguese | pt-PT | Latn; European | pt |
Kechua / Runa Simi | Quechua | qu | Latn | Latin script model |
Rumantsch | Romansh | rm | Latn | Latin script model |
Romani čhib | Romany | rom | Latn | Latin script model |
Ikirundi | Rundi | rn | Latn | Latin script model |
Sakha | Sakha | sah | Cyrl | Cyrillic script model |
Gagana faʻa Sāmoa | Samoan | sm | Latn | Latin script model |
yângâ tî sängö | Sango | sg | Latn | Latin script model |
(Braid) Scots, Lallans, Doric | Scots | sco | Latn | Latin script model |
Gàidhlig | Scottish Gaelic | gd | Latn | Latin script model |
chiShona | Shona | sn | Latn | Latin script model |
Songhay | Songhai | son | Latn | Latin script model |
Sesotho | Southern Sotho | st | Latn | Latin script model |
Español (Latinoamérica) | Spanish | es-419 | Latn; Latin American | es |
ᮘᮞ ᮞᮥᮔ᮪ᮓ , Basa Sunda | Sundanese | su | Latn | Latin script model |
siSwati | Swati | ss | Latn | Latin script model |
Reo Tahiti | Tahitian | ty | Latn | Latin script model |
тоҷикӣ | Tajik | tg | Cyrl | Cyrillic script model |
татар теле | Tatar | tt | Cyrl/Latn | Cyrillic script model |
KʌThemnɛ | Temne | tem | Latn | Latin script model |
lea faka-Tonga | Tongan | to | Latn | Latin script model |
Xitsonga | Tsonga | ts | Latn | Latin script model |
Setswana | Tswana | tn | Latn | Latin script model |
Türkmençe | Turkmen | tk | Latn | Cyrillic script model |
удмурт кыл | Udmurt | udm | Cyrl | Cyrillic script model |
Tshivenḓa | Venda | ve | Latn | Latin script model |
Vod | Votic | vot | Cyrl/Latn | Cyrillic script model |
Frysk | Western Frisian | fy | Latn | Latin script model |
Wolof | Wolof | wo | Latn | Latin script model |
isiXhosa | Xhosa | xh | Latn | Latin script model |
Èdè Yorùbá | Yoruba | yo | Latn | Latin script model |
Diidxazá | Zapotec | zap | Latn | Latin script model |
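The three support levels and the "Mapped to" column can be combined into a simple lookup. The sketch below uses only a small excerpt of the tables above, so most codes are missing from it; the set and function names are hypothetical, and a real lookup would need the full lists.

```python
# Small excerpts copied from the tables above, for illustration only.
SUPPORTED = {"en", "fr", "pt", "es", "zh", "de", "ja"}
EXPERIMENTAL = {"am", "eu", "cy", "zu"}
MAPPED = {
    "en-GB": "en",
    "fr-CA": "fr",
    "pt-PT": "pt",
    "es-419": "es",
    "zh-Hans": "zh",
    "zh-Hant": "zh",
    "br": "Latin script model",
    "tg": "Cyrillic script model",
}


def support_level(code: str):
    """Return (level, effective recognizer) for a languageHints code.

    Codes outside the excerpt are reported as ("unknown", None).
    """
    if code in SUPPORTED:
        return ("supported", code)
    if code in EXPERIMENTAL:
        return ("experimental", code)
    if code in MAPPED:
        return ("mapped", MAPPED[code])
    return ("unknown", None)


for code in ("en", "en-GB", "cy", "tg", "xx"):
    print(code, support_level(code))
```

A lookup like this can help decide when to treat a returned language code with extra caution, since mapped languages are more likely to be misidentified as a similar language.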
Other processor language support
Currently, English is the only language supported for all non-Document OCR processor functionality.