Natural Language Computer Vision
Bridge the gap between human language and computer vision. Ask questions about images and videos in plain English, and get intelligent analysis powered by state-of-the-art models.
Key Features
Everything you need to connect language models with computer vision
Natural Language Interface
Ask questions like "Count all red cars" or "Find people wearing yellow"
Multi-Modal Support
Works seamlessly with both images and videos for comprehensive analysis
YOLO-World v2
Fast, accurate object detection without predefined classes
ByteTracker Integration
Advanced multi-object tracking with boundary crossing detection
Advanced Analytics
Object counting, speed estimation, spatial relationships, temporal tracking
Visual Output
Generates annotated images/videos with detection highlights and tracking trails
Web Interface
Includes a Flask web app for easy interaction and testing
Extensible
Easy to add new models and capabilities with flexible configuration
See It In Action
Simple, intuitive API that lets you ask questions in plain English
import langvio

# Create a pipeline
pipeline = langvio.create_pipeline()

# Analyze an image
result = pipeline.process(
    query="Count how many people are wearing red shirts",
    media_path="street_scene.jpg"
)

print(result['explanation'])
# Output: "I found 3 people wearing red shirts in the image.
# Two are located in the center-left area, and one is on the right side."

# View the annotated result
print(f"Annotated image saved to: {result['output_path']}")

Installation
Get started with Langvio in seconds
pip install langvio
Core installation for basic functionality
Environment Setup
Configure your API keys
# For OpenAI
OPENAI_API_KEY=your_openai_api_key_here
# For Google Gemini
GOOGLE_API_KEY=your_google_api_key_here
Langvio automatically loads these environment variables!
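Prefer setting the keys in code? A minimal sketch using the standard os.environ (the placeholder values are yours to replace; exporting the variables in your shell or a .env file works just as well):

import os
import langvio

# Set the keys before creating a pipeline; skip this if they are already
# exported in your shell or stored in a .env file.
os.environ.setdefault("OPENAI_API_KEY", "your_openai_api_key_here")
os.environ.setdefault("GOOGLE_API_KEY", "your_google_api_key_here")

pipeline = langvio.create_pipeline()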
Examples
Explore what you can build with Langvio
"Track the elephant in this video"
Animal tracking and movement analysis
Object Detection

"Detect cars in this image?"
Count specific objects

"Detect how many people are wearing helmets in this image?"
Count and identify objects
Scene Analysis
"How many cyclists are in this image?"
Activity and context analysis

"How many people are by the pool?"
Multi-person scene analysis
Video Tracking
"Track the lion in this video"
Animal tracking and behavior analysis
"How many people are in this scene?"
People counting and tracking
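Video queries use the same process() call shown in the image example above. A minimal sketch (the file name street_scene.mp4 is just a placeholder):

import langvio

pipeline = langvio.create_pipeline()

# Track and count people across the frames of a video
result = pipeline.process(
    query="How many people are in this scene?",
    media_path="street_scene.mp4"
)

print(result['explanation'])
print(f"Annotated video saved to: {result['output_path']}")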
Architecture
A modular pipeline connecting language understanding with visual analysis
LLM Processor
Parses queries and generates explanations
Vision Processor
Detects objects using YOLO-World v2
ByteTracker
Multi-object tracking for video analysis
Media Processor
Creates visualizations and handles I/O
Pipeline
Orchestrates the entire workflow
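To make the flow concrete, here is an illustrative, self-contained sketch of how these stages hand data to one another. Every name in it is a hypothetical stand-in, not Langvio's internal API:

# Illustrative stand-ins only; the real components live inside Langvio.
def parse_query(query):
    # LLM Processor: turn a plain-English question into a structured task
    return {"target": "person", "task": "count"}

def detect_objects(media_path, target):
    # Vision Processor: open-vocabulary detection with YOLO-World v2
    return [{"label": target, "box": (120, 40, 220, 310)}]

def annotate(media_path, detections):
    # Media Processor: draw boxes/trails and write the annotated file
    return media_path.replace(".", "_annotated.", 1)

def run_pipeline(query, media_path):
    # Pipeline: orchestrate parse -> detect -> annotate -> explain
    task = parse_query(query)
    detections = detect_objects(media_path, task["target"])
    output_path = annotate(media_path, detections)
    explanation = f"Found {len(detections)} {task['target']}(s)."
    return {"explanation": explanation, "output_path": output_path}

print(run_pipeline("How many people are in this scene?", "street_scene.jpg"))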
Supported Models
Choose from a variety of vision and language models
Vision Models
YOLO-World v2 variants
yolo_world_v2_s (Speed): Fastest inference
yolo_world_v2_m (Default): Balanced performance
yolo_world_v2_l (Accuracy): Higher accuracy
yolo_world_v2_x (Best): Maximum accuracy
Language Models
LLM providers for explanations
OpenAI (OPENAI_API_KEY)
Google Gemini (GOOGLE_API_KEY)
ByteTracker Capabilities
Advanced multi-object tracking for video analysis
Multi-Object Tracking
Track multiple objects with unique IDs
Boundary Crossing
Detect entry/exit events
Speed Estimation
Calculate object velocities
Track Persistence
Maintain identity through occlusions
Kalman Filter
Smooth motion prediction
IoU Association
Accurate detection matching
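For background on the IoU association step, here is a small standalone sketch of how intersection-over-union scores the match between a detection and an existing track (illustrative, not Langvio's internal code):

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union in [0, 1]
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A track keeps its ID when its best-overlapping detection clears a threshold
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.14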