CASE STUDY

Real-Time AI Room Detection for Live Video Inspection

Industry: Property / Real Estate Inspection
Region: International
Timeline: Full-cycle engagement
Team: Trembit dedicated engineering team
Model: CLIP ViT-B/32 (zero-shot)
Interface: Gradio
Compute: FP16 GPU
Language: Python

The Problem

A property inspection company that conducts audits for insurance underwriters and real estate transactions needed automated room detection that could confirm, in real time, that every room in a property had been visually documented while the inspector walked through it with a webcam. Their workflow was entirely manual: an inspector recorded footage of each room and later reviewed the video to verify coverage. At properties with dozens of rooms, inspectors routinely missed rooms, discovered the gaps only in post-processing, and had to make costly return visits. Object detection solutions identified objects within rooms rather than classifying the room type itself, and manual tagging was error-prone. The company needed a system that watches the live feed, classifies each room as the inspector enters it, maintains a running checklist with confidence scores, and alerts the inspector to unvisited expected rooms before they leave, all in a browser, on existing laptops, with no specialized hardware.

Why Building Real-Time Room Detection for Property Inspections Is Hard

Room classification from live video combines the open-vocabulary challenges of scene recognition with the real-time constraints of a walking inspection — where the camera is constantly moving and the system must decide from imperfect frames:

  • Room classification is a scene-level problem, not an object-level problem — recognizing a kitchen requires understanding the arrangement of countertops, appliances, and fixtures rather than identifying individual objects; most vision models are trained for object detection, not holistic scene classification
  • Zero-shot recognition without property-specific training data — every property differs; collecting labeled images for every possible room appearance is infeasible, so the system must classify rooms it has never seen by understanding what a room type looks like conceptually
  • Confidence scoring that is meaningful for inspection workflows — raw model confidence does not directly translate to inspection reliability; thresholds must balance sensitivity (do not miss rooms) with precision (do not log false detections)
  • Real-time processing from a moving webcam — the inspector is walking and turning, producing motion blur and partial views; the model must provide feedback while they are still in the room, not ten seconds later
  • Complete coverage verification against a room manifest — the system must track which rooms were visited, compare against the expected list, and identify gaps before the inspector leaves, maintaining state across the whole session
  • Deployment simplicity for non-technical inspectors — it must run in a standard browser with a standard webcam, require no installation beyond a URL, and give feedback inspectors can act on immediately

What We Did

1. Zero-Shot Room Classification with CLIP

  • Built the room classification engine using CLIP ViT-B/32 — computing similarity between video frame embeddings and text embeddings for room descriptions, enabling recognition of any room type describable in natural language without training data
  • Implemented multi-label room classification returning confidence scores for every room category, because a frame at a kitchen/dining boundary legitimately contains evidence of both
  • Developed prompt engineering and prompt ensembling for room descriptions — averaging embeddings from multiple prompts per room type to handle real-world visual diversity, with FP16 GPU inference for real-time speed on laptop hardware
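
The scoring step described above can be sketched in plain Python. This is a minimal illustration, not the project's code: it assumes the frame and prompt embeddings have already been produced by CLIP's image and text encoders, uses toy 3-dimensional vectors in their place, and assumes a softmax over scaled cosine similarities (with a hypothetical `scale` of 100) as the way per-room confidences are derived. The function names are illustrative.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ensemble_embedding(prompt_embeddings):
    """Average several normalized prompt embeddings into one room-type embedding."""
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def room_scores(frame_embedding, room_embeddings, scale=100.0):
    """Return a confidence for every room type for a single frame."""
    frame = normalize(frame_embedding)
    logits = {
        room: scale * sum(f * t for f, t in zip(frame, emb))
        for room, emb in room_embeddings.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {room: math.exp(v - top) for room, v in logits.items()}
    total = sum(exps.values())
    return {room: e / total for room, e in exps.items()}

# Toy vectors standing in for real CLIP embeddings (assumption for illustration).
rooms = {
    "kitchen": ensemble_embedding([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]),
    "bathroom": ensemble_embedding([[0.0, 0.9, 0.1], [0.1, 0.8, 0.2]]),
}
scores = room_scores([0.85, 0.15, 0.05], rooms)
print(scores)
```

Because the ensembled room embeddings are computed once per room type, only the frame embedding and a handful of dot products are needed per frame at inference time.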

2. Confidence Scoring & Detection Logic

  • Designed configurable per-room-type confidence thresholds, since room types differ in visual distinctiveness (bathrooms are distinctive; office and bedroom can look similar)
  • Implemented temporal smoothing that aggregates confidence across consecutive frames — triggering a detection only when sustained confidence exceeds the threshold, reducing false detections from momentary glances
  • Developed a detection state machine that tracks the lifecycle of a room visit (enter, present, exit), logging entry timestamp, peak confidence, duration, and a representative frame per visit
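
One way to picture the visit lifecycle is a small state machine driven by already-smoothed confidences. The sketch below is an assumption about how such a tracker could look, not the delivered implementation: the class names, the threshold values, and the rule that a visit closes as soon as its room stops being the confident top class are all hypothetical, and the representative frame is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RoomVisit:
    room: str
    entered_at: float
    peak_confidence: float
    exited_at: Optional[float] = None

    @property
    def duration(self):
        """Seconds spent in the room, or None while the visit is still open."""
        return None if self.exited_at is None else self.exited_at - self.entered_at

class VisitTracker:
    """Tracks enter / present / exit transitions from smoothed confidences."""

    def __init__(self, thresholds, default_threshold=0.6):
        self.thresholds = thresholds          # per-room-type thresholds
        self.default_threshold = default_threshold
        self.current = None                   # open RoomVisit, or None between rooms
        self.visits = []                      # completed visits, in order

    def update(self, timestamp, smoothed_scores):
        room, conf = max(smoothed_scores.items(), key=lambda kv: kv[1])
        confident = conf >= self.thresholds.get(room, self.default_threshold)

        if self.current and (not confident or room != self.current.room):
            # Exit: the open visit is no longer supported by the scores.
            self.current.exited_at = timestamp
            self.visits.append(self.current)
            self.current = None
        if confident:
            if self.current is None:
                # Enter: start a new visit for the confident room.
                self.current = RoomVisit(room, timestamp, conf)
            else:
                # Present: same room, keep the highest confidence seen.
                self.current.peak_confidence = max(self.current.peak_confidence, conf)

# Hypothetical thresholds: a distinctive room gets a lower bar than an ambiguous one.
tracker = VisitTracker({"bathroom": 0.5, "bedroom": 0.7})
tracker.update(0.0, {"bathroom": 0.55, "bedroom": 0.20})  # enter bathroom
tracker.update(1.0, {"bathroom": 0.80, "bedroom": 0.10})  # still present
tracker.update(2.0, {"bathroom": 0.30, "bedroom": 0.40})  # exit, nothing is confident
print(tracker.visits)
```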

3. Coverage Tracking & Real-Time Feedback

  • Engineered the coverage tracker — a live dashboard of which room types were visited, how many times, and which expected rooms remain unvisited, updating as the inspector walks through
  • Implemented coverage gap alerting that compares detected rooms against the property's expected list and highlights missed areas before the inspector leaves
  • Built automatic frame capture and annotation — saving the highest-confidence frame per detected room with classification overlays, creating a visual inspection record for the audit report
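
The coverage logic above reduces to comparing visit counts against the expected list and remembering the best frame per room type. The following is a hedged sketch under stated assumptions: the `CoverageTracker` class, the string frame identifiers, and the idea that a manifest may list a room type more than once are illustrative choices, not details confirmed by the project.

```python
from collections import Counter

class CoverageTracker:
    """Compares visited room types against the expected list for a property."""

    def __init__(self, expected_rooms):
        self.expected = Counter(expected_rooms)  # room type -> expected count
        self.visited = Counter()                 # room type -> visits so far
        self.best_frames = {}                    # room type -> (confidence, frame_id)

    def record_visit(self, room, peak_confidence, frame_id):
        self.visited[room] += 1
        best = self.best_frames.get(room)
        if best is None or peak_confidence > best[0]:
            # Keep only the highest-confidence frame per room type.
            self.best_frames[room] = (peak_confidence, frame_id)

    def missing(self):
        """Expected rooms not yet visited often enough, with the shortfall."""
        # Counter subtraction keeps only positive counts, so fully covered
        # or over-visited room types drop out of the result.
        return dict(self.expected - self.visited)

coverage = CoverageTracker(["kitchen", "bathroom", "bathroom", "laundry room"])
coverage.record_visit("kitchen", 0.91, "frame_0042")
coverage.record_visit("bathroom", 0.74, "frame_0107")
coverage.record_visit("kitchen", 0.88, "frame_0150")
print(coverage.missing())
```

Calling `missing()` at any point during the walkthrough yields the list a gap alert would show before the inspector leaves.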

4. Gradio Interface & Deployment

  • Built the Gradio web interface — browser-accessible, connecting to the inspector's webcam, displaying the live feed with classification overlays and the coverage dashboard, requiring no installation
  • Implemented configurable room manifests so coordinators define the expected room list per property before inspection, letting the tracker identify property-specific gaps
  • Deployed as a lightweight, self-contained application bundling the CLIP model, Gradio server, and detection logic — running on the inspector's laptop with no cloud connectivity required during inspection
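
A per-property room manifest could be as simple as a small JSON document that a coordinator prepares before the inspection. The format below is purely hypothetical (the field names `property_id`, `rooms`, `type`, `count`, and `threshold` are assumptions for illustration), as is the loader that expands it into an expected room list and optional per-room thresholds.

```python
import json

def load_manifest(text):
    """Parse a manifest and return (property_id, expected_rooms, thresholds)."""
    data = json.loads(text)
    expected, thresholds = [], {}
    for entry in data["rooms"]:
        room = entry["type"]
        count = int(entry.get("count", 1))
        if count < 1:
            raise ValueError(f"count must be at least 1 for {room!r}")
        expected.extend([room] * count)
        if "threshold" in entry:
            threshold = float(entry["threshold"])
            if not 0.0 < threshold <= 1.0:
                raise ValueError(f"threshold must be in (0, 1] for {room!r}")
            thresholds[room] = threshold
    return data["property_id"], expected, thresholds

# Hypothetical manifest for one property.
MANIFEST = """
{
  "property_id": "example-property",
  "rooms": [
    {"type": "kitchen"},
    {"type": "bathroom", "count": 2, "threshold": 0.5},
    {"type": "laundry room"}
  ]
}
"""

property_id, expected, thresholds = load_manifest(MANIFEST)
print(property_id, expected, thresholds)
```

Validating the manifest up front means a malformed entry is caught at the coordinator's desk rather than during the walkthrough.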

Key Results

Zero-shot: CLIP ViT-B/32 classifies rooms without property-specific training data
Real-time: FP16 GPU inference with live classification overlays during the walkthrough
Manifest-driven: compares detected rooms against the expected list and alerts on gaps on-site
Browser-based: Gradio interface, standard webcam, no installation
Audit-ready: detection logs, confidence scores, timestamps, and captured frames per visit

In Their Words

Trembit built us a room detection system that tells our inspectors in real time which rooms they have covered and which they have missed — before they leave the property. We used to send inspectors back for return visits. Now the coverage dashboard catches that while they are still on-site.
Operations Director, property inspection company
Their proactive team gets things done as if it were their own project.
Trembit client

What We Learned

In a zero-shot vision system, the text side is where the tuning happens

Single-word room labels gave mediocre accuracy (~60%). Switching to descriptive prompts ("a photograph of a residential kitchen with countertops, cabinets, and cooking appliances") raised accuracy dramatically without any change to the pipeline. Averaging multiple prompts per room type improved robustness further. We spent more time iterating on prompt descriptions than on the image pipeline, and that is where the accuracy gains came from.

Temporal smoothing is the difference between a prototype that demos well and a system inspectors trust

Single-frame classification produced a flickering experience, with confidence jumping between room types as the inspector turned. Adding a rolling window that requires sustained confidence eliminated the flicker and made the system feel authoritative. Raw accuracy did not change, but perceived reliability went from "unstable prototype" to "tool I trust during an audit."
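
To make the flicker concrete, the sketch below compares per-frame labels with labels taken from a rolling window. It is an illustration under assumptions: the window is a simple mean over the last few frames, and the `window` and `threshold` values and the toy scores are hypothetical rather than the values used in the project.

```python
from collections import deque

def smooth_labels(frame_scores, window=3, threshold=0.6):
    """Return one label per frame: the room whose mean confidence over the
    last `window` frames is highest and at least `threshold`, else None."""
    recent = deque(maxlen=window)
    labels = []
    for scores in frame_scores:
        recent.append(scores)
        if len(recent) < window:
            labels.append(None)  # not enough evidence yet
            continue
        rooms = set().union(*recent)  # every room type seen in the window
        means = {r: sum(s.get(r, 0.0) for s in recent) / window for r in rooms}
        room, conf = max(means.items(), key=lambda kv: kv[1])
        labels.append(room if conf >= threshold else None)
    return labels

# Toy per-frame confidences; the second frame is a momentary glance elsewhere.
frames = [
    {"kitchen": 0.80, "dining room": 0.20},
    {"kitchen": 0.30, "dining room": 0.70},
    {"kitchen": 0.90, "dining room": 0.10},
    {"kitchen": 0.85, "dining room": 0.15},
]

raw = [max(s, key=s.get) for s in frames]
smoothed = smooth_labels(frames)
print(raw)       # per-frame labels flip to "dining room" on the glance
print(smoothed)  # windowed labels stay on "kitchen" once the window fills
```

The trade-off is latency: a longer window suppresses more flicker but delays the first detection after the inspector enters a room.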

The end-of-inspection gap alert is the feature that actually saves money

We built live overlays, confidence scoring, smoothing, and logging — but the feature with the biggest business impact was the simplest: "you have not visited the laundry room" before the inspector leaves. Return visits were the company's biggest operational cost. The model is the enabler; the business value comes from workflow integration. Start from the workflow problem and work backward to what the model needs to do.

Need Real-Time Room Detection?

Book a 30-minute architecture session — we'll discuss your inspection workflow requirements and the detection decisions that matter most. No pitch deck. Just engineering clarity.
