
How We Benchmarked Vision-Language Models (VLMs) for Real-World City Applications


Bridging the Gap in Real-World Vision AI


While Vision-Language Models (VLMs) have advanced rapidly in general-purpose tasks, their performance in real-world city applications remains underexplored. Most existing benchmarks focus on clean, static imagery or curated datasets detached from the complex, dynamic nature of urban environments. 

 

To address this gap, we co-developed a new benchmark in collaboration with the Kaohsiung City Government, grounded in actual municipal operations and decision-making workflows.  

 

In cities like Kaohsiung, where AI is increasingly used for frontline tasks such as traffic monitoring, public safety, and disaster response, models must do more than recognize objects. They must interpret context, resolve ambiguities, and respond with precision across diverse lighting, weather, and infrastructure conditions. 

 

The Challenge of City Scenes


Urban scenarios are complex and ambiguous. A traffic camera may capture pedestrians, vehicles, signage, and environmental anomalies all at once. Models must differentiate between multiple plausible interpretations of the same scene: for example, is a power pole tilted dangerously, or does it only appear tilted because of the camera position?

 

We identified several key challenges that are often overlooked in lab settings: 

  • Ambiguity in scene composition 

  • Infrastructure obstruction 

  • Environmental variables such as nighttime rain, fog, and glare 

  • Confusion between visually similar but behaviorally different situations 

 

In smart city settings, these aren’t edge cases—they’re daily realities. 

Urban scenarios are inherently complex and visually ambiguous.
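
To make this concrete, a benchmark item probing that kind of ambiguity might look like the sketch below. This is a minimal, hypothetical example in Python; the field names, tags, and answer options are illustrative assumptions, not the released question format.

    # A hypothetical benchmark item probing scene ambiguity.
    # Field names and values are illustrative; the released
    # question bank may use a different schema.
    tilted_pole_item = {
        "image_id": "kh_cam_example_001",          # hypothetical frame ID
        "conditions": ["night", "rain", "glare"],  # environment tags
        "question": (
            "Is the power pole in this frame structurally tilted, "
            "or does it only appear tilted due to the camera angle?"
        ),
        "options": [
            "A. Structurally tilted; flag for inspection",
            "B. Apparent tilt caused by camera perspective",
            "C. Cannot be determined from this view",
        ],
        "answer": "B",  # illustrative label, not real ground truth
    }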

Designing a Benchmark Grounded in Urban Needs 


To close the gap between generic AI performance and real-world city demands, we designed a benchmark grounded in real government operations and built on three key principles: 

 

  1. Alignment with localized logic 

All tasks were created based on Kaohsiung’s actual use cases, ensuring practical and operational relevance.  


  2. Multi-variable realism 

We incorporated real-world variables such as weather, lighting, and signage occlusion, and designed multi-step tasks that mirror how city departments assess and respond to incidents (see the sketch after this list). 


  3. Scalability and potential for global interoperability 

Designed with reference to benchmarks like MMBench, the framework supports local deployment while enabling cross-city and cross-country evaluation. 
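
To illustrate the second principle, a multi-step item can chain the questions a city department would actually work through when triaging an incident. A minimal sketch, with hypothetical step wording:

    # Hypothetical multi-step task mirroring incident triage.
    # The steps and wording are illustrative, not the released format.
    flooding_task = [
        {"step": 1, "question": "Is standing water visible on the roadway?"},
        {"step": 2, "question": "Does the water block a traffic lane or "
                                "reach vehicle wheel height?"},
        {"step": 3, "question": "Which response applies: monitor, dispatch a "
                                "drainage crew, or close the road?"},
    ]

    for item in flooding_task:
        print(f"Step {item['step']}: {item['question']}")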

 

This allowed us to simulate real operational complexity rather than evaluating capabilities in artificial isolation. 

Real-world distributions guided data selection
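
The sketch below shows one way distribution-guided selection can work in practice: frames are drawn so that condition frequencies in the benchmark match those observed in the field. The weights and helper function here are assumptions for illustration, not our production pipeline.

    import random

    # Hypothetical condition frequencies observed in city camera footage;
    # in practice these would be measured, not assumed.
    condition_weights = {
        "day_clear": 0.45,
        "day_rain": 0.15,
        "night_clear": 0.20,
        "night_rain": 0.12,
        "fog_or_glare": 0.08,
    }

    def sample_conditions(n, weights=condition_weights, seed=0):
        """Draw n condition tags so the benchmark mirrors field distributions."""
        rng = random.Random(seed)
        tags = list(weights)
        return rng.choices(tags, weights=[weights[t] for t in tags], k=n)

    # Example: assign condition tags for 1,000 benchmark frames.
    print(sample_conditions(1000)[:5])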

Key Findings and Model Limitations 


In our evaluation, even strong VLMs exhibited significant performance drops in several areas: 

  • Handling nighttime footage with rain or glare 

  • Distinguishing between semantically similar categories 

  • Reasoning consistently about edge cases that demand a nuanced understanding of urban scenarios 

 

The benchmark highlighted where models require targeted fine-tuning and where synthetic data or multimodal augmentation could improve generalization. 
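
One straightforward way to surface such condition-specific drops is to slice accuracy by the condition tags attached to each benchmark item. A minimal sketch, assuming the tagged record format from the item example above:

    from collections import defaultdict

    def accuracy_by_condition(results):
        """results: iterable of (condition_tags, is_correct) pairs.
        Returns per-tag accuracy so weak slices stand out."""
        correct = defaultdict(int)
        total = defaultdict(int)
        for tags, is_correct in results:
            for tag in tags:
                total[tag] += 1
                correct[tag] += int(is_correct)
        return {tag: correct[tag] / total[tag] for tag in total}

    # Toy results for illustration only, not measured numbers.
    toy = [(["night", "rain"], False), (["day_clear"], True),
           (["night", "glare"], False), (["day_clear"], True)]
    print(accuracy_by_condition(toy))
    # -> {'night': 0.0, 'rain': 0.0, 'day_clear': 1.0, 'glare': 0.0}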

 

Toward Actionable AI for Cities 


We believe benchmarks should evolve beyond academic datasets to reflect real-world demands. Our Smart City VLM Benchmark is a step toward that goal: practical, operationally aligned, and grounded in actual municipal needs.  

By aligning model evaluation with the realities of urban governance, we aim to help cities deploy AI that’s not just accurate, but actionable. 

 

In the short term, our focus is on validating the accuracy of locally fine-tuned AI models in Kaohsiung. In the mid-term, we plan to extend this benchmark framework to encompass city operations across Taiwan. And in the long term, our goal is to scale these capabilities globally, supporting smart city deployments and VLM adaptation in diverse urban environments worldwide. 

 

▶ Read the full post to explore how this benchmark helps evaluate AI readiness for real city tasks: https://www.linkervision.com/vlm-benchmark-release-report


Note: The datasets and question banks will be released by the end of 2025 to promote collaborative validation and continuous improvement of practical AI capabilities for smart cities across the public, private, and academic sectors.


 
 
 