OCR & Text Detection¶

instatollm automatically recognizes text in video frames. The visual.text_on_screen field lists the legible text detected on screen.

What gets detected¶

Menus and food¶

"text_on_screen": [
  "Carbonara - €12",
  "Cacio e pepe - €11",
  "Gricia - €10",
  "@trattoria_roma"
]

Travel and locations¶

"text_on_screen": [
  "Madeira Island Travel Guide",
  "Experiences to NOT miss:",
  "- Sunrise at Pico Ruivo",
  "- Dolphin and whale watching",
  "Where to stay:",
  "- First-timers: Funchal or Caniço"
]

Products and brands¶

"text_on_screen": [
  "iPhone 16 Pro",
  "Available from $999",
  "apple.com/iphone",
  "#ad #sponsored"
]

Social handles and links¶

"text_on_screen": [
  "@username",
  "follow for more",
  "link in bio",
  "youtube.com/channel/..."
]

Subtitles and captions¶

"text_on_screen": [
  "This is the most important tip",
  "DO NOT skip this step",
  "Results after 30 days"
]
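
All of the examples above share one shape: a list of strings. A small accessor can make downstream code more robust when the field is missing or empty. This is only a sketch: it assumes the result is a plain dict with text_on_screen nested under visual, as in the use cases on this page, and the helper name get_screen_text is hypothetical rather than part of instatollm.

```python
from typing import Any


def get_screen_text(result: dict[str, Any]) -> list[str]:
    """Return on-screen text items, or an empty list if the field is absent.

    Hypothetical helper: assumes `result` is a dict shaped like the
    examples above, with `text_on_screen` nested under `visual`.
    """
    items = (result.get("visual") or {}).get("text_on_screen") or []
    # Keep only non-empty strings, stripped of surrounding whitespace
    return [item.strip() for item in items if isinstance(item, str) and item.strip()]


sample = {"visual": {"text_on_screen": ["Carbonara - €12", "  @trattoria_roma ", ""]}}
print(get_screen_text(sample))
print(get_screen_text({}))
```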

Use cases¶

Extract restaurant recommendations¶

result = analyze_reel("https://www.instagram.com/reel/...")

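# Keep on-screen lines that mention a venue keyword. This is a simple
# substring heuristic: "bar" also matches "barbecue", so review the results.
restaurants = [
    text for text in result["visual"]["text_on_screen"]
    if any(kw in text.lower() for kw in ["restaurant", "café", "bar", "grill"])
]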
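
A plain substring check can over-match, since "bar" is also found in "barbecue" and "Barcelona". One possible refinement, sketched here, is to match keywords as whole words with an optional plural "s". The names VENUE_KEYWORDS and find_venues are hypothetical, and the sample strings are made up for illustration.

```python
import re

# Hypothetical keyword list; extend it for your own content
VENUE_KEYWORDS = ["restaurant", "café", "bar", "grill"]
# \b anchors each keyword to word boundaries, so "bar" does not match "barbecue";
# the optional "s" still lets plurals such as "restaurants" through
VENUE_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, VENUE_KEYWORDS)) + r")s?\b",
    re.IGNORECASE,
)


def find_venues(text_items):
    """Return the items that contain a venue keyword as a whole word."""
    return [text for text in text_items if VENUE_RE.search(text)]


sample = ["Best barbecue tips", "Rooftop Bar Lisboa", "Café Central", "link in bio"]
print(find_venues(sample))
```

Keyword matching remains a heuristic: a venue whose on-screen name contains none of the keywords will still be missed.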

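Extract links, handles, and hashtags¶

import re

text_items = result["visual"]["text_on_screen"]

# Lines that look like links (explicit scheme or a common TLD)
urls = [t for t in text_items if re.search(r'https?://|\.com|\.net', t)]
# Pull every @handle and #hashtag, even when several share one line ("#ad #sponsored")
handles = [h for t in text_items for h in re.findall(r'(?<!\w)@[\w.]+', t)]
hashtags = [h for t in text_items for h in re.findall(r'(?<!\w)#\w+', t)]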

Extract travel recommendations¶

# For travel guide reels, text_on_screen often contains the full list
# of recommendations as shown in the video
all_tips = "\n".join(result["visual"]["text_on_screen"])

# Build a prompt to send to an LLM
prompt = f"""
Extract all location recommendations from this travel guide:

{all_tips}

Format as a JSON list with: name, category (restaurant/hotel/activity/viewpoint).
"""
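
Whichever LLM receives the prompt, its reply is free text and may wrap the JSON in prose or code fences. The sketch below shows one way to pull out and validate the list. The function name parse_recommendations and the sample reply are hypothetical; only the name and category fields and the four category values come from the prompt above.

```python
import json

ALLOWED_CATEGORIES = {"restaurant", "hotel", "activity", "viewpoint"}


def parse_recommendations(reply: str) -> list[dict]:
    """Parse a model reply expected to hold a JSON list of {name, category}."""
    # Take the outermost [...] span in case the model adds prose or code fences
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    cleaned = []
    for entry in data:
        if not isinstance(entry, dict):
            continue
        name = str(entry.get("name", "")).strip()
        category = str(entry.get("category", "")).strip().lower()
        # Drop entries with no name or with a category outside the prompt's list
        if name and category in ALLOWED_CATEGORIES:
            cleaned.append({"name": name, "category": category})
    return cleaned


# Made-up reply text, for illustration only
reply = 'Here you go:\n[{"name": "Pico Ruivo", "category": "Viewpoint"}, {"name": "", "category": "hotel"}]'
print(parse_recommendations(reply))
```

Returning an empty list on malformed output keeps the caller simple; a stricter pipeline might raise instead so that bad replies are not silently ignored.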

Price extraction¶

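import re

# Currency symbol before or after the amount. Allows thousands separators
# ("$1,299") and an optional decimal part ("€12.50").
amount = r'\d+(?:[.,]\d{3})*(?:[.,]\d{1,2})?'
price_re = re.compile(rf'[$€£¥₽]\s?{amount}|{amount}\s?[$€£¥₽]')

prices = []
for text in result["visual"]["text_on_screen"]:
    prices.extend(price_re.findall(text))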
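
The matched prices are still raw strings such as "€12" or "9,50 €". If numeric values are needed, they could be normalized along the lines of the sketch below. The function normalize_price and the symbol-to-code mapping are hypothetical; "$" and "¥" in particular are used by several currencies, so the codes chosen here are assumptions.

```python
import re
from decimal import Decimal

# Hypothetical mapping from symbol to ISO 4217 code; "$" and "¥" are ambiguous
# across countries, so these choices are assumptions
SYMBOL_TO_CODE = {"$": "USD", "€": "EUR", "£": "GBP", "¥": "JPY", "₽": "RUB"}


def normalize_price(raw: str):
    """Turn a matched price string such as "€12" or "9,50 €" into (code, amount)."""
    symbol = next((ch for ch in raw if ch in SYMBOL_TO_CODE), None)
    digits = re.sub(r"[^\d.,]", "", raw)
    if symbol is None or not digits:
        return None
    # A final separator followed by 1-2 digits is read as the decimal point;
    # every other separator is assumed to group thousands
    match = re.fullmatch(r"(.*?)(?:[.,](\d{1,2}))?", digits)
    whole = re.sub(r"[.,]", "", match.group(1))
    fraction = match.group(2) or "0"
    if not whole:
        return None
    return SYMBOL_TO_CODE[symbol], Decimal(f"{whole}.{fraction}")


for raw in ["€12", "$1,299.99", "9,50 €"]:
    print(raw, "->", normalize_price(raw))
```

Decimal is used rather than float so that amounts like 9.50 are not subject to binary rounding.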

How it works¶

OCR in instatollm is performed by Gemini AI as part of the video analysis. Unlike traditional OCR tools that process static images, Gemini reads the entire video stream and detects text across all frames, including:

Animated text that fades in/out
Text that appears briefly on screen
Overlaid graphics and infographics
Subtitles and auto-generated captions
Background signs and labels

The text is returned as a flat array of strings, where each item is one distinct text element. The order roughly follows the sequence in which the text appears in the video.
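
Because the array is flat and roughly chronological, it can be post-processed with ordinary list operations. Whether repeated elements (for example, a caption that reappears later) are merged is not specified here, so the sketch below assumes they may repeat and removes duplicates while keeping first-seen order. The function name dedupe_in_order is hypothetical.

```python
def dedupe_in_order(text_items):
    """Drop repeated items, ignoring case and surrounding whitespace,
    while keeping the order in which items first appeared."""
    seen = set()
    unique = []
    for item in text_items:
        key = item.strip().casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(item.strip())
    return unique


sample = ["follow for more", "Link in bio", "link in bio ", "follow for more"]
print(dedupe_in_order(sample))
```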

Tips¶

For list-heavy reels (travel guides, top-10s), text_on_screen often contains the entire list as shown, so it may need no further parsing
Combine text_on_screen with audio.transcript for full coverage: some information is spoken but not shown, and vice versa
Very fast, small text (like scrolling credits) may be missed
Text in non-Latin scripts is generally supported (Russian, Arabic, Chinese, etc.)
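
The tip about combining both sources could be sketched as follows. The code assumes the result is a dict and that audio.transcript is a single string; neither shape is confirmed on this page. The function name build_context and the sample transcript are made up for illustration.

```python
def build_context(result: dict) -> str:
    """Merge on-screen text and the spoken transcript into one labeled block."""
    screen = (result.get("visual") or {}).get("text_on_screen") or []
    # Assumption: audio.transcript is a single string
    transcript = (result.get("audio") or {}).get("transcript") or ""
    sections = []
    if screen:
        sections.append("ON-SCREEN TEXT:\n" + "\n".join(screen))
    if transcript.strip():
        sections.append("SPOKEN TRANSCRIPT:\n" + transcript.strip())
    return "\n\n".join(sections)


# Made-up sample; the transcript text is illustrative only
sample = {
    "visual": {"text_on_screen": ["Where to stay:", "- First-timers: Funchal or Caniço"]},
    "audio": {"transcript": " Book early in summer. "},
}
print(build_context(sample))
```

Labeling the two sections lets a downstream prompt tell shown text apart from spoken text, which helps when the two disagree.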