OCR & Text Detection

One of the most useful features of instatollm is automatic text recognition from video frames. The visual.text_on_screen field holds the text that was detected on screen.


What gets detected

Menus and prices

"text_on_screen": [
  "Carbonara - €12",
  "Cacio e pepe - €11",
  "Gricia - €10",
  "@trattoria_roma"
]

Travel and locations

"text_on_screen": [
  "Madeira Island Travel Guide",
  "Experiences to NOT miss:",
  "- Sunrise at Pico Ruivo",
  "- Dolphin and whale watching",
  "Where to stay:",
  "- First-timers: Funchal or Caniço"
]

Products and brands

"text_on_screen": [
  "iPhone 16 Pro",
  "Available from $999",
  "apple.com/iphone",
  "#ad #sponsored"
]

Social handles and URLs

"text_on_screen": [
  "@username",
  "follow for more",
  "link in bio",
  "youtube.com/channel/..."
]

Subtitles and captions

"text_on_screen": [
  "This is the most important tip",
  "DO NOT skip this step",
  "Results after 30 days"
]

Use cases

Extract restaurant recommendations

result = analyze_reel("https://www.instagram.com/reel/...")

# Keyword filter: keeps on-screen items that mention a venue type
restaurants = [
    text for text in result["visual"]["text_on_screen"]
    if any(kw in text.lower() for kw in ["restaurant", "café", "bar", "grill"])
]

Find all URLs and social handles

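import re

text_items = result["visual"]["text_on_screen"]

# Items that look like URLs (explicit scheme or a common TLD)
urls = [t for t in text_items if re.search(r'https?://|\.(?:com|net)\b', t)]
# Handles and hashtags can sit anywhere in an item, e.g. "#ad #sponsored"
handles = [h for t in text_items for h in re.findall(r'(?<!\w)@[\w.]*\w', t)]
hashtags = [h for t in text_items for h in re.findall(r'(?<!\w)#\w+', t)]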

Extract travel recommendations

# For travel guide reels, text_on_screen often contains the full list
# of recommendations as shown in the video
all_tips = "\n".join(result["visual"]["text_on_screen"])

# Pass to LLM
prompt = f"""
Extract all location recommendations from this travel guide:

{all_tips}

Format as a JSON list with: name, category (restaurant/hotel/activity/viewpoint).
"""
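
The prompt above asks for JSON, but model replies often arrive wrapped in a code fence or with a sentence around them. Below is a minimal sketch of a tolerant parser for such a reply. `parse_recommendations` is a hypothetical helper, not part of instatollm, and the sample reply is made up; the LLM call itself is left out because it depends on your client.

```python
import json
import re


def parse_recommendations(reply: str) -> list[dict]:
    """Pull the first JSON array out of an LLM reply (hypothetical helper)."""
    # Grab the span from the first "[" to the last "]", ignoring prose or fences around it
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if match is None:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only entries that carry the two keys the prompt asked for
    return [
        item for item in data
        if isinstance(item, dict) and "name" in item and "category" in item
    ]


# Made-up reply text, for illustration only
sample_reply = 'Sure: [{"name": "Pico Ruivo", "category": "viewpoint"}]'
recommendations = parse_recommendations(sample_reply)
```

Returning an empty list on malformed output keeps the caller simple; you may prefer to raise instead if a failed parse should stop the pipeline.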

Price extraction

import re

prices = []
for text in result["visual"]["text_on_screen"]:
    # Match common price patterns
    found = re.findall(r'[\$€£¥₽]\s?\d+(?:[.,]\d{2})?|\d+(?:[.,]\d{2})?\s?[\$€£¥₽]', text)
    prices.extend(found)
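
The matches collected above are still strings such as "€12" or "12,50 €". If you need numbers, a small normalization step can split each match into a currency symbol and an amount. This is a sketch under assumptions: `parse_price` is a hypothetical helper, it treats a comma as a decimal separator, and it does not handle thousands separators (which the pattern above does not capture either).

```python
# Same currency symbols as the pattern above
SYMBOLS = "$€£¥₽"


def parse_price(raw: str) -> tuple[str, float]:
    """Split a matched price like '€12' or '12,50 €' into (symbol, amount)."""
    # The match is expected to contain exactly one currency symbol
    symbol = next(ch for ch in raw if ch in SYMBOLS)
    digits = raw.replace(symbol, "").strip()
    # Assumption: a comma is a decimal separator, as in "12,50"
    amount = float(digits.replace(",", "."))
    return symbol, amount


# Example matches of the kind the pattern above produces
matched = ["€12", "$999", "12,50 €"]
parsed = [parse_price(p) for p in matched]
```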

How it works

OCR in instatollm is performed by Gemini AI as part of the video analysis. Unlike traditional OCR tools, which process static images, Gemini reads the entire video stream and detects text across all frames, including:

  • Animated text that fades in/out
  • Text that appears briefly on screen
  • Overlaid graphics and infographics
  • Subtitles and auto-generated captions
  • Background signs and labels

The text is returned as a flat array of strings, where each item is one distinct text element. The order roughly follows the sequence in which the text appears in the video.
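
Because the array is flat and roughly ordered, list-style reels like the Madeira example can often be regrouped into sections with a simple pass. The sketch below assumes that a heading is any item ending in ":" and that the following items belong to it; `group_by_heading` is a hypothetical helper, and the heuristic will not fit every reel.

```python
def group_by_heading(items: list[str]) -> dict[str, list[str]]:
    """Group a flat text_on_screen list into sections (hypothetical helper).

    Assumes a heading is any item ending in ":" and that later items belong
    to it until the next heading. Items before the first heading are stored
    under the empty-string key.
    """
    sections: dict[str, list[str]] = {}
    current = ""
    for item in items:
        text = item.strip()
        if text.endswith(":"):
            # Start a new section, dropping the trailing colon
            current = text[:-1].strip()
            sections.setdefault(current, [])
        else:
            # Drop a leading "- " bullet marker if present (Python 3.9+)
            entry = text.removeprefix("- ").strip()
            sections.setdefault(current, []).append(entry)
    return sections


items = [
    "Madeira Island Travel Guide",
    "Experiences to NOT miss:",
    "- Sunrise at Pico Ruivo",
    "- Dolphin and whale watching",
    "Where to stay:",
    "- First-timers: Funchal or Caniço",
]
sections = group_by_heading(items)
```

Since the order is only approximate, treat the grouping as a starting point and check it against the video when accuracy matters.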


Tips

  • For list-heavy reels (travel guides, top-10s), text_on_screen often contains the entire list as shown, so little or no further parsing is needed
  • Combine text_on_screen with audio.transcript for full coverage: some information is spoken but not shown, and vice versa
  • Very small or fast-moving text (like scrolling credits) may be missed
  • Text in non-Latin scripts is generally supported (Russian, Arabic, Chinese, etc.)
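
To illustrate combining text_on_screen with audio.transcript, here is a minimal sketch. It assumes the transcript is available as a plain string at result["audio"]["transcript"]; that shape is an assumption, as are the helper names `build_context` and `screen_only` and the sample data.

```python
def build_context(result: dict) -> str:
    """Merge on-screen text and the transcript into one labeled block of text.

    Assumes result["audio"]["transcript"] is a plain string; missing or empty
    fields are skipped.
    """
    on_screen = result.get("visual", {}).get("text_on_screen") or []
    transcript = result.get("audio", {}).get("transcript") or ""
    parts = []
    if on_screen:
        parts.append("ON-SCREEN TEXT:\n" + "\n".join(on_screen))
    if transcript:
        parts.append("TRANSCRIPT:\n" + transcript)
    return "\n\n".join(parts)


def screen_only(result: dict) -> list[str]:
    """Return on-screen items that are never spoken (case-insensitive match)."""
    on_screen = result.get("visual", {}).get("text_on_screen") or []
    transcript = (result.get("audio", {}).get("transcript") or "").lower()
    return [t for t in on_screen if t.lower() not in transcript]


# Made-up sample in the assumed shape, for illustration only
sample = {
    "visual": {"text_on_screen": ["Pico Ruivo", "@madeira_guide"]},
    "audio": {"transcript": "Start with sunrise at Pico Ruivo."},
}
context = build_context(sample)
shown_not_spoken = screen_only(sample)
```

The combined string can be passed to an LLM in one prompt, while `screen_only` highlights details such as handles that appear only as overlays.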