The sections below highlight the features and capabilities of the Google Video Intelligence API.
Supported video formats
The Video Intelligence API supports common video formats, including
.AVI, and the formats decodable by
Label detection
Label detection annotates a video with labels (tags) for entities that are detected in a video or video segments and returns the following:
- A list of video segment annotations where an entity is detected.
- A list of frame annotations where an entity is detected.
- If specified in the request, a list of shots where an entity is detected. For details, see Shot change detection.
For example, for a video of a train at a crossing, the Video Intelligence API returns labels such as "train", "transportation", "railroad crossing", and so on. Each label includes a time segment with the time offset (timestamp) of the entity's appearance, measured from the beginning of the video. Each annotation also contains additional information, including an entity ID that you can use to find more information about the entity in the Google Knowledge Graph Search API.
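As a rough illustration of the response shape described above, the following Python sketch walks a label detection result and lists each label with its entity ID and time segment. It assumes the JSON form of the response, with field names such as annotationResults, segmentLabelAnnotations, startTimeOffset, and endTimeOffset, and with offsets encoded as strings like "12.5s". The sample data and entity IDs are invented for illustration only.

```python
def parse_offset(offset):
    """Convert a duration string such as "12.5s" into seconds (float)."""
    return float(offset.rstrip("s"))


def summarize_labels(response):
    """Return (description, entity_id, start_s, end_s) for each labeled segment."""
    rows = []
    for result in response.get("annotationResults", []):
        for label in result.get("segmentLabelAnnotations", []):
            entity = label["entity"]
            for item in label.get("segments", []):
                span = item["segment"]
                rows.append((
                    entity["description"],
                    entity["entityId"],
                    parse_offset(span.get("startTimeOffset", "0s")),
                    parse_offset(span.get("endTimeOffset", "0s")),
                ))
    return rows


# Invented sample data; the entity IDs are placeholders, not real Knowledge Graph IDs.
sample_label_response = {
    "annotationResults": [{
        "segmentLabelAnnotations": [
            {
                "entity": {"entityId": "/m/example_train", "description": "train"},
                "segments": [{
                    "segment": {"startTimeOffset": "0s", "endTimeOffset": "12.5s"},
                    "confidence": 0.98,
                }],
            },
            {
                "entity": {"entityId": "/m/example_crossing",
                           "description": "railroad crossing"},
                "segments": [{
                    "segment": {"startTimeOffset": "3s", "endTimeOffset": "9.25s"},
                    "confidence": 0.87,
                }],
            },
        ]
    }]
}

for description, entity_id, start, end in summarize_labels(sample_label_response):
    print(f"{description} ({entity_id}): {start:.2f}s to {end:.2f}s")
```

The entity ID in each row is the value you would pass to the Knowledge Graph Search API to look up more detail about the entity.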
Each entity returned can also include associated category entities in the categoryEntities field. For example, the “Terrier” entity label has a category of “Dog”. Category entities have a hierarchy. For example, the “Dog” category is a child of the “Mammal” category in the hierarchy. For a list of the common category entities that the Video Intelligence API uses, see
Shot change detection
By default, the Video Intelligence API examines a video or video segments by frame, that is, by each complete picture in the series that forms the video. You can also have the Video Intelligence API annotate a video or video segment according to each shot (scene) that it detects in the input video.
Shot change detection annotates a video with video segments that are selected based on content transition (scenes) as opposed to the individual frames. For example, a golf video following two players across the golf course with some panning to the woods for background may produce two shots: "players" and "woods," giving the developer access to the most relevant video segments showing the players for highlights.
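To make the shot-based view concrete, here is a minimal Python sketch that builds a request body asking for shot change detection and then reads the resulting shot boundaries. It assumes the REST JSON form of the API: the feature name SHOT_CHANGE_DETECTION, the labelDetectionMode value SHOT_MODE, and the shotAnnotations field are taken from that representation and should be checked against the API reference. The bucket path and the sample response are hypothetical.

```python
def parse_offset(offset):
    """Convert a duration string such as "8.5s" into seconds (float)."""
    return float(offset.rstrip("s"))


def build_shot_request(input_uri, label_shots=False):
    """Build a request body that asks for shot change detection.

    When label_shots is True, label detection is also requested in shot mode,
    so labels are reported per shot rather than per frame.
    """
    body = {"inputUri": input_uri, "features": ["SHOT_CHANGE_DETECTION"]}
    if label_shots:
        body["features"].append("LABEL_DETECTION")
        body["videoContext"] = {
            "labelDetectionConfig": {"labelDetectionMode": "SHOT_MODE"}
        }
    return body


def shot_spans(response):
    """Return (start_s, end_s, duration_s) for each detected shot."""
    spans = []
    for result in response.get("annotationResults", []):
        for shot in result.get("shotAnnotations", []):
            start = parse_offset(shot.get("startTimeOffset", "0s"))
            end = parse_offset(shot.get("endTimeOffset", "0s"))
            spans.append((start, end, end - start))
    return spans


# Invented sample: two shots, as in the golf example (players, then woods).
sample_shot_response = {
    "annotationResults": [{
        "shotAnnotations": [
            {"startTimeOffset": "0s", "endTimeOffset": "8.5s"},
            {"startTimeOffset": "8.5s", "endTimeOffset": "12s"},
        ]
    }]
}

request_body = build_shot_request("gs://example-bucket/golf.mp4", label_shots=True)
for start, end, duration in shot_spans(sample_shot_response):
    print(f"shot from {start}s to {end}s lasts {duration}s")
```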
Explicit content detection
Explicit content detection detects adult content within a video. Adult content is content generally appropriate only for viewers 18 years of age and older, including but not limited to nudity, sexual activities, and pornography (including cartoons or anime).
Explicit content detection annotates a video with explicit content annotations (tags) for entities that are detected in the video or video segments provided. The response returns a video frame timestamp where the explicit content is detected.
For an example, see Analyzing Videos for Explicit Content.
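As a sketch of how the per-frame timestamps might be consumed, the following Python example collects the timestamps of frames whose likelihood rating meets a chosen threshold. The field names (explicitAnnotation, frames, timeOffset, pornographyLikelihood) and the likelihood scale are assumptions based on the JSON form of the response; the sample frames are invented.

```python
# Likelihood values in increasing order of severity (assumed enum names).
LIKELIHOOD_ORDER = [
    "LIKELIHOOD_UNSPECIFIED",
    "VERY_UNLIKELY",
    "UNLIKELY",
    "POSSIBLE",
    "LIKELY",
    "VERY_LIKELY",
]


def parse_offset(offset):
    """Convert a duration string such as "1.5s" into seconds (float)."""
    return float(offset.rstrip("s"))


def flagged_timestamps(response, threshold="LIKELY"):
    """Return timestamps (seconds) of frames rated at or above the threshold."""
    minimum = LIKELIHOOD_ORDER.index(threshold)
    flagged = []
    for result in response.get("annotationResults", []):
        frames = result.get("explicitAnnotation", {}).get("frames", [])
        for frame in frames:
            level = frame.get("pornographyLikelihood", "LIKELIHOOD_UNSPECIFIED")
            if LIKELIHOOD_ORDER.index(level) >= minimum:
                flagged.append(parse_offset(frame.get("timeOffset", "0s")))
    return flagged


# Invented sample frames for illustration.
sample_explicit_response = {
    "annotationResults": [{
        "explicitAnnotation": {
            "frames": [
                {"timeOffset": "0.5s", "pornographyLikelihood": "VERY_UNLIKELY"},
                {"timeOffset": "1.5s", "pornographyLikelihood": "LIKELY"},
                {"timeOffset": "2.5s", "pornographyLikelihood": "VERY_LIKELY"},
                {"timeOffset": "3.5s", "pornographyLikelihood": "POSSIBLE"},
            ]
        }
    }]
}

print(flagged_timestamps(sample_explicit_response))
```

Lowering the threshold (for example to "POSSIBLE") trades fewer missed frames for more false positives; the right setting depends on the application.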
If no region is specified, the region is determined based on the video file location.
Speech transcription
Speech transcription transcribes spoken audio in a video or video segment into text and returns blocks of text for each portion of transcribed audio.
You can use the following features when transcribing speech:
- Alternative words: Use the maxAlternatives option to specify the maximum number of alternative transcriptions to include in the response. This value can be an integer from 1 to 30. The default is 1. The API returns multiple transcriptions in descending order of confidence. Alternative transcriptions do not include word-level entries.
- Profanity filtering: Use the filterProfanity option to filter out known profanities in transcriptions. Matched words are replaced with the leading character of the word followed by asterisks. The default is false.
- Transcription hints: Use the speechContexts option to provide common or unusual phrases that occur in your audio. The transcription service uses those phrases to produce more accurate transcriptions. You provide a transcription hint as a SpeechContext object.
- Audio track selection: Use the audioTracks option to specify which track to transcribe from multi-track audio. This value can be an integer from 0 to 2. The default is 0.
- Automatic punctuation: Use the enableAutomaticPunctuation option to include punctuation in the transcribed text. The default is false.
- Multiple speakers: Use the enableSpeakerDiarization option to identify different speakers in a video. In the response, each recognized word includes a speakerTag field that identifies the speaker to whom the word is attributed.
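The options above can be gathered into a single configuration object. The sketch below builds one in Python and enforces the documented ranges for maxAlternatives and audioTracks. It makes several assumptions that are not stated in the text: that a languageCode field is also required, that audioTracks is sent as a list of track indices, and that a SpeechContext is represented as an object with a phrases list. The helper name build_speech_config is hypothetical.

```python
def build_speech_config(language_code, max_alternatives=1, filter_profanity=False,
                        phrases=None, audio_track=0, punctuation=False,
                        diarization=False):
    """Build a speech transcription config dict, validating documented ranges."""
    if not 1 <= max_alternatives <= 30:
        raise ValueError("max_alternatives must be an integer from 1 to 30")
    if not 0 <= audio_track <= 2:
        raise ValueError("audio_track must be an integer from 0 to 2")
    config = {
        "languageCode": language_code,           # assumed required field
        "maxAlternatives": max_alternatives,
        "filterProfanity": filter_profanity,
        "audioTracks": [audio_track],            # assumed to be a list of indices
        "enableAutomaticPunctuation": punctuation,
        "enableSpeakerDiarization": diarization,
    }
    if phrases:
        # Assumed SpeechContext shape: an object holding a list of phrases.
        config["speechContexts"] = [{"phrases": list(phrases)}]
    return config


speech_config = build_speech_config(
    "en-US",
    max_alternatives=3,
    phrases=["railroad crossing", "Video Intelligence"],
    punctuation=True,
    diarization=True,
)
print(speech_config)
```

Validating the ranges locally surfaces mistakes before a request is sent; the service remains the authority on which values it accepts.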
For best results, provide audio recorded at a sampling rate of 16,000 Hz or greater.
For an example, see Speech Transcription.
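When speaker diarization is enabled, the per-word speakerTag values can be used to rebuild what each speaker said. The following Python sketch groups words by speaker, reading only the first (highest-confidence) alternative of each transcription. The field names speechTranscriptions, alternatives, words, and word are assumptions based on the JSON form of the response, and the sample words are invented.

```python
def transcript_by_speaker(response):
    """Return a dict mapping each speaker tag to the words attributed to it."""
    speakers = {}
    for result in response.get("annotationResults", []):
        for transcription in result.get("speechTranscriptions", []):
            alternatives = transcription.get("alternatives", [])
            if not alternatives:
                continue
            # Alternatives are ordered by descending confidence; use the first.
            for word in alternatives[0].get("words", []):
                tag = word.get("speakerTag", 0)
                speakers.setdefault(tag, []).append(word["word"])
    return {tag: " ".join(words) for tag, words in speakers.items()}


# Invented sample data for illustration.
sample_speech_response = {
    "annotationResults": [{
        "speechTranscriptions": [{
            "alternatives": [{
                "transcript": "Hello there Hi again",
                "confidence": 0.9,
                "words": [
                    {"word": "Hello", "speakerTag": 1},
                    {"word": "there", "speakerTag": 1},
                    {"word": "Hi", "speakerTag": 2},
                    {"word": "again", "speakerTag": 1},
                ],
            }]
        }]
    }]
}

for tag, text in transcript_by_speaker(sample_speech_response).items():
    print(f"speaker {tag}: {text}")
```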
Object tracking
Object tracking tracks multiple objects detected in an input video or video segments and returns labels (tags) for the detected entities along with the location of each entity in the frame. For example, a video of vehicles crossing a traffic signal may produce labels such as “car”, “truck”, “bike”, “tires”, “lights”, “window”, and so on. Each label includes a series of bounding boxes showing the location of the object in the frame. Each bounding box also has an associated time segment with a time offset (timestamp) that indicates the duration offset from the beginning of the video. The annotation also contains additional entity information, including an entity ID that you can use to find more information about the entity in the Google Knowledge Graph Search API.
Object tracking differs from label detection in that label detection provides labels for the entire frame (without bounding boxes), while object tracking detects individual objects and provides a label along with a bounding box that describes the location in the frame for each object.
For an example, see Object Tracking.
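To show how a series of bounding boxes might be used, the Python sketch below converts each tracked object's boxes into pixel coordinates for a frame of a given size. It assumes that boxes are reported in normalized form (values between 0 and 1) under a normalizedBoundingBox field with left, top, right, and bottom keys, and that tracked objects appear under objectAnnotations; these names come from the JSON form of the response and should be verified against the API reference. The sample track is invented.

```python
def parse_offset(offset):
    """Convert a duration string such as "0.5s" into seconds (float)."""
    return float(offset.rstrip("s"))


def pixel_boxes(response, width, height):
    """Return [(label, [(time_s, left, top, right, bottom), ...]), ...] in pixels."""
    tracks = []
    for result in response.get("annotationResults", []):
        for obj in result.get("objectAnnotations", []):
            boxes = []
            for frame in obj.get("frames", []):
                box = frame.get("normalizedBoundingBox", {})
                boxes.append((
                    parse_offset(frame.get("timeOffset", "0s")),
                    round(box.get("left", 0.0) * width),
                    round(box.get("top", 0.0) * height),
                    round(box.get("right", 0.0) * width),
                    round(box.get("bottom", 0.0) * height),
                ))
            tracks.append((obj["entity"]["description"], boxes))
    return tracks


# Invented sample: one car tracked across two frames.
sample_object_response = {
    "annotationResults": [{
        "objectAnnotations": [{
            "entity": {"entityId": "/m/example_car", "description": "car"},
            "frames": [
                {"timeOffset": "0s",
                 "normalizedBoundingBox": {"left": 0.25, "top": 0.5,
                                           "right": 0.75, "bottom": 1.0}},
                {"timeOffset": "0.5s",
                 "normalizedBoundingBox": {"left": 0.5, "top": 0.5,
                                           "right": 1.0, "bottom": 1.0}},
            ],
        }]
    }]
}

for label, boxes in pixel_boxes(sample_object_response, 1920, 1080):
    for time_s, left, top, right, bottom in boxes:
        print(f"{label} at {time_s}s: ({left}, {top}) to ({right}, {bottom})")
```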
Text detection
Text detection performs Optical Character Recognition (OCR) to detect visible text in the frames of a video or video segments and returns the detected text along with information about where the text was detected in the video.
Text detection is available for all of the languages supported by the Cloud Vision API.
For an example, see Text Detection.
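As one possible use of the detected text and its location in time, the following Python sketch searches a text detection result for a keyword and reports when the matching text is visible. The field names textAnnotations, text, segments, and segment are assumptions based on the JSON form of the response; the sample annotations are invented.

```python
def parse_offset(offset):
    """Convert a duration string such as "6.5s" into seconds (float)."""
    return float(offset.rstrip("s"))


def find_text(response, keyword):
    """Return (text, start_s, end_s) for each segment whose text contains keyword."""
    matches = []
    needle = keyword.lower()
    for result in response.get("annotationResults", []):
        for annotation in result.get("textAnnotations", []):
            text = annotation.get("text", "")
            if needle not in text.lower():
                continue  # Case-insensitive substring match only.
            for item in annotation.get("segments", []):
                span = item["segment"]
                matches.append((
                    text,
                    parse_offset(span.get("startTimeOffset", "0s")),
                    parse_offset(span.get("endTimeOffset", "0s")),
                ))
    return matches


# Invented sample annotations for illustration.
sample_text_response = {
    "annotationResults": [{
        "textAnnotations": [
            {"text": "RAILROAD CROSSING",
             "segments": [{"segment": {"startTimeOffset": "2s",
                                       "endTimeOffset": "6.5s"},
                           "confidence": 0.95}]},
            {"text": "STOP",
             "segments": [{"segment": {"startTimeOffset": "10s",
                                       "endTimeOffset": "11s"},
                           "confidence": 0.99}]},
        ]
    }]
}

for text, start, end in find_text(sample_text_response, "crossing"):
    print(f'"{text}" visible from {start}s to {end}s')
```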