Natural Language Computer Vision
Bridge the gap between human language and computer vision. Ask questions about images and videos in plain English, and get intelligent analysis powered by state-of-the-art models.
Key Features
Everything you need to connect language models with computer vision
Natural Language Interface
Ask questions like "Count all red cars" or "Find people wearing yellow"
Multi-Modal Support
Works seamlessly with both images and videos for comprehensive analysis
YOLO-World v2
Fast, accurate object detection without predefined classes
ByteTracker Integration
Advanced multi-object tracking with boundary crossing detection
Advanced Analytics
Object counting, speed estimation, spatial relationships, temporal tracking
Visual Output
Generates annotated images/videos with detection highlights and tracking trails
Web Interface
Includes a Flask web app for easy interaction and testing
Extensible
Easy to add new models and capabilities with flexible configuration
See It In Action
Simple, intuitive API that lets you ask questions in plain English
import langvio

# Create a pipeline
pipeline = langvio.create_pipeline()

# Analyze an image
result = pipeline.process(
    query="Count how many people are wearing red shirts",
    media_path="street_scene.jpg"
)

print(result['explanation'])
# Output: "I found 3 people wearing red shirts in the image.
# Two are located in the center-left area, and one is on the right side."

# View the annotated result
print(f"Annotated image saved to: {result['output_path']}")

Installation
Get started with Langvio in seconds
pip install langvio
Core installation for basic functionality
Environment Setup
Configure your API keys
# For OpenAI
OPENAI_API_KEY=your_openai_api_key_here
# For Google Gemini
GOOGLE_API_KEY=your_google_api_key_here
Langvio automatically loads these environment variables!
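Prefer setting the keys in code? A minimal sketch using the standard os.environ (the placeholder values are yours to replace; exporting the variables in your shell or a .env file works just as well):

import os
import langvio

# Set the keys before creating a pipeline; skip this if they are already
# exported in your shell or stored in a .env file.
os.environ.setdefault("OPENAI_API_KEY", "your_openai_api_key_here")
os.environ.setdefault("GOOGLE_API_KEY", "your_google_api_key_here")

pipeline = langvio.create_pipeline()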
Examples
Explore what you can build with Langvio
"Track the elephant in this video"
Animal tracking and movement analysis
Object Detection

"Detect cars in this image?"
Count specific objects

"Detect how many people are wearing helmets in this image?"
Count and identify objects
Scene Analysis
"How many cyclists are in this image?"
Activity and context analysis

"How many people are by the pool?"
Multi-person scene analysis
Video Tracking
"Track the lion in this video"
Animal tracking and behavior analysis
"How many people are in this scene?"
People counting and tracking
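Video queries use the same process() call shown in the image example above. A minimal sketch (the file name street_scene.mp4 is just a placeholder):

import langvio

pipeline = langvio.create_pipeline()

# Track and count people across the frames of a video
result = pipeline.process(
    query="How many people are in this scene?",
    media_path="street_scene.mp4"
)

print(result['explanation'])
print(f"Annotated video saved to: {result['output_path']}")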
Architecture
A modular pipeline connecting language understanding with visual analysis
LLM Processor
Parses queries and generates explanations
Vision Processor
Detects objects using YOLO-World v2
ByteTracker
Multi-object tracking for video analysis
Media Processor
Creates visualizations and handles I/O
Pipeline
Orchestrates the entire workflow
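To make the flow concrete, here is an illustrative, self-contained sketch of how these stages hand data to one another. Every name in it is a hypothetical stand-in, not Langvio's internal API:

# Illustrative stand-ins only; the real components live inside Langvio.
def parse_query(query):
    # LLM Processor: turn a plain-English question into a structured task
    return {"target": "person", "task": "count"}

def detect_objects(media_path, target):
    # Vision Processor: open-vocabulary detection with YOLO-World v2
    return [{"label": target, "box": (120, 40, 220, 310)}]

def annotate(media_path, detections):
    # Media Processor: draw boxes/trails and write the annotated file
    return media_path.replace(".", "_annotated.", 1)

def run_pipeline(query, media_path):
    # Pipeline: orchestrate parse -> detect -> annotate -> explain
    task = parse_query(query)
    detections = detect_objects(media_path, task["target"])
    output_path = annotate(media_path, detections)
    explanation = f"Found {len(detections)} {task['target']}(s)."
    return {"explanation": explanation, "output_path": output_path}

print(run_pipeline("How many people are in this scene?", "street_scene.jpg"))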
Supported Models
Choose from a variety of vision and language models
Vision Models
YOLO-World v2 variants
yolo_world_v2_s (Speed): Fastest inference
yolo_world_v2_m (Default): Balanced performance
yolo_world_v2_l (Accuracy): Higher accuracy
yolo_world_v2_x (Best): Maximum accuracy
Language Models
LLM providers for explanations
OpenAI (OPENAI_API_KEY)
Google Gemini (GOOGLE_API_KEY)
ByteTracker Capabilities
Advanced multi-object tracking for video analysis
Multi-Object Tracking
Track multiple objects with unique IDs
Boundary Crossing
Detect entry/exit events
Speed Estimation
Calculate object velocities
Track Persistence
Maintain identity through occlusions
Kalman Filter
Smooth motion prediction
IoU Association
Accurate detection matching
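For background on the IoU association step, here is a small standalone sketch of how intersection-over-union scores the match between a detection and an existing track (illustrative, not Langvio's internal code):

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union in [0, 1]
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A track keeps its ID when its best-overlapping detection clears a threshold
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.14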