OCR & Text Detection¶
One of the most powerful features of instatollm is automatic text recognition from video frames.
The visual.text_on_screen field lists the text the model was able to read on screen.
What gets detected¶
Menus and food¶
Travel and locations¶
"text_on_screen": [
"Madeira Island Travel Guide",
"Experiences to NOT miss:",
"- Sunrise at Pico Ruivo",
"- Dolphin and whale watching",
"Where to stay:",
"- First-timers: Funchal or Caniço"
]
Products and brands¶
Social handles and URLs¶
Subtitles and captions¶
"text_on_screen": [
"This is the most important tip",
"DO NOT skip this step",
"Results after 30 days"
]
Use cases¶
Extract restaurant recommendations¶
result = analyze_reel("https://www.instagram.com/reel/...")
# Keep the on-screen text items that mention a venue keyword
restaurants = [
    text for text in result["visual"]["text_on_screen"]
    if any(kw in text.lower() for kw in ["restaurant", "café", "bar", "grill"])
]
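The substring check above also fires on words that merely contain a keyword (for example "bar" inside "Barcelona"). A minimal sketch of a whole-word variant follows; the sample list, `VENUE_PATTERN`, and `find_venue_lines` are hypothetical names used for illustration, not part of instatollm.

```python
import re

# Hypothetical sample standing in for result["visual"]["text_on_screen"].
sample_text = [
    "Best restaurant in Funchal",
    "Barcelona day trip",
    "Rooftop bar with sunset views",
    "Café do Teatro",
]

# \b anchors each keyword to word boundaries, so "bar" does not match "Barcelona".
VENUE_PATTERN = re.compile(r"\b(restaurant|café|cafe|bar|grill)\b", re.IGNORECASE)


def find_venue_lines(text_items):
    """Return the items that contain a venue keyword as a whole word."""
    return [text for text in text_items if VENUE_PATTERN.search(text)]


venues = find_venue_lines(sample_text)
```

Keyword matching of either kind only finds items that name the venue type; a restaurant shown on screen by name alone will still need the LLM-based approach described further down.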
Find all URLs and social handles¶
import re
text_items = result["visual"]["text_on_screen"]
urls = [t for t in text_items if re.search(r'https?://|\.com|\.net', t)]
handles = [t for t in text_items if t.startswith("@")]
hashtags = [t for t in text_items if t.startswith("#")]
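The filters above return whole text items and only catch handles or hashtags at the very start of an item. As a rough sketch, the tokens themselves can be pulled out with `re.findall`; the patterns, the sample list, and `extract_tokens` below are illustrative assumptions and will not cover every handle or URL format.

```python
import re

# Hypothetical sample standing in for result["visual"]["text_on_screen"].
sample_text = [
    "Follow @madeira.travel for more",
    "Book at https://example.com/tours",
    "#madeira #travelguide",
    "Sunrise at Pico Ruivo",
]

# The lookbehind skips e-mail addresses; the final character class avoids a trailing dot.
HANDLE_RE = re.compile(r"(?<![\w@])@[A-Za-z0-9_](?:[A-Za-z0-9_.]*[A-Za-z0-9_])?")
HASHTAG_RE = re.compile(r"(?<!\w)#\w+")
URL_RE = re.compile(r"https?://\S+")


def extract_tokens(text_items):
    """Collect handles, hashtags, and URLs found anywhere inside the items."""
    found = {"handles": [], "hashtags": [], "urls": []}
    for text in text_items:
        found["handles"].extend(HANDLE_RE.findall(text))
        found["hashtags"].extend(HASHTAG_RE.findall(text))
        found["urls"].extend(URL_RE.findall(text))
    return found


tokens = extract_tokens(sample_text)
```

Note that `URL_RE` only matches URLs with an explicit scheme; bare domains such as "example.com" would still need the looser `\.com|\.net` check shown above.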
Extract travel recommendations¶
# For travel guide reels, text_on_screen often contains the full list
# of recommendations as shown in the video
all_tips = "\n".join(result["visual"]["text_on_screen"])
# Pass to LLM
prompt = f"""
Extract all location recommendations from this travel guide:
{all_tips}
Format as a JSON list with: name, category (restaurant/hotel/activity/viewpoint).
"""
Price extraction¶
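import re
# Match common price patterns: symbol before or after the number,
# optional thousands groups, optional two-digit decimal part
price_pattern = re.compile(
    r'[\$€£¥₽]\s?\d+(?:[.,]\d{3})*(?:[.,]\d{2})?'
    r'|\d+(?:[.,]\d{3})*(?:[.,]\d{2})?\s?[\$€£¥₽]'
)
prices = []
for text in result["visual"]["text_on_screen"]:
    prices.extend(price_pattern.findall(text))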
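The matched prices are still raw strings such as "$12.50" or "15,00 £". A rough sketch of turning them into a symbol and a numeric amount is shown below; `parse_price` is a hypothetical helper, and it assumes that a separator followed by exactly two trailing digits is the decimal mark while any other separator is a thousands separator.

```python
import re

CURRENCY_SYMBOLS = "$€£¥₽"


def parse_price(raw):
    """Split a matched price string into (symbol, amount as float)."""
    # Assumes raw came from the price regex, so a currency symbol is present.
    symbol = next(ch for ch in raw if ch in CURRENCY_SYMBOLS)
    digits = raw.replace(symbol, "").strip()
    # A separator followed by exactly two final digits is read as the decimal part.
    match = re.fullmatch(r"([\d.,]*\d)[.,](\d{2})", digits)
    if match:
        whole, cents = match.group(1), match.group(2)
    else:
        whole, cents = digits, "0"
    # Remaining separators are treated as thousands separators and removed.
    whole = whole.replace(",", "").replace(".", "")
    return symbol, float(f"{whole}.{cents}")


parsed = [parse_price(p) for p in ["$12.50", "€ 9", "15,00 £", "$1,299.00"]]
```

The symbol is kept as-is rather than mapped to a currency code, since "$" and "¥" are each used by more than one currency.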
How it works¶
OCR in instatollm is performed by Gemini AI as part of the video analysis. Unlike traditional OCR tools that process static images, Gemini reads the entire video stream and detects text across all frames — including:
- Animated text that fades in/out
- Text that appears briefly on screen
- Overlaid graphics and infographics
- Subtitles and auto-generated captions
- Background signs and labels
The text is returned as a flat array of strings — each item is one distinct text element. Order roughly follows the sequence in which text appears in the video.
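Because the array is flat and roughly ordered, list-style reels tend to place a header line directly before its bullet items, as in the Madeira example earlier on this page. A minimal sketch of regrouping such output follows; `group_by_header` and the "untitled" bucket are hypothetical, and the heuristic (a line ending in a colon starts a new section) is an assumption that will not hold for every reel.

```python
def group_by_header(text_items):
    """Group a flat list of on-screen strings under the header that precedes them."""
    sections = {}
    current = "untitled"  # hypothetical bucket for items seen before any header
    for text in text_items:
        stripped = text.strip()
        if stripped.endswith(":"):
            # A line ending in ":" is treated as a section header.
            current = stripped.rstrip(":").strip()
            sections.setdefault(current, [])
        else:
            # Drop leading bullet characters before storing the item.
            sections.setdefault(current, []).append(stripped.lstrip("-• ").strip())
    return sections


sample_text = [
    "Madeira Island Travel Guide",
    "Experiences to NOT miss:",
    "- Sunrise at Pico Ruivo",
    "- Dolphin and whale watching",
    "Where to stay:",
    "- First-timers: Funchal or Caniço",
]
sections = group_by_header(sample_text)
```

Since the ordering is only approximate, it is worth spot-checking the grouped output against the video for reels where several text elements appear at once.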
Tips¶
- For list-heavy reels (travel guides, top-10s), text_on_screen often contains the entire list exactly as shown, so no further parsing is needed
- Combine text_on_screen with audio.transcript for full coverage: some information is spoken but not shown, and vice versa
- Very fast, small text (like scrolling credits) may be missed
- Text in non-Latin scripts is generally supported (Russian, Arabic, Chinese, etc.)
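The tip about combining text_on_screen with audio.transcript can be sketched as a small helper that merges both sources into one labelled block, for example before passing it to an LLM. This assumes the transcript is available as a string at `result["audio"]["transcript"]`; `build_context`, the section labels, and the sample data are hypothetical.

```python
def build_context(result):
    """Merge on-screen text and the spoken transcript into one labelled string."""
    # Both fields are read defensively in case one of them is missing or empty.
    on_screen = result.get("visual", {}).get("text_on_screen") or []
    transcript = result.get("audio", {}).get("transcript") or ""
    parts = []
    if on_screen:
        parts.append("ON-SCREEN TEXT:\n" + "\n".join(on_screen))
    if transcript:
        parts.append("SPOKEN TRANSCRIPT:\n" + transcript.strip())
    return "\n\n".join(parts)


# Hypothetical result, shaped like the fields described on this page.
sample_result = {
    "visual": {"text_on_screen": ["DO NOT skip this step", "Results after 30 days"]},
    "audio": {"transcript": "Apply it every morning before breakfast."},
}
context = build_context(sample_result)
```

Keeping the two sources under separate labels lets a downstream prompt distinguish what was shown from what was said.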