VLM Benchmark for Smart City Governance: Kaohsiung Local Sovereign AI
Document Title: VLM Benchmark for Smart City Governance: Kaohsiung Local Sovereign AI
Authors: Linker Vision Co., Ltd. & Kaohsiung City Government
Release Version: 0.9 (draft)
Release Date: May 2025
Document Type: Public White Paper (Open Access)
Citation: Linker Vision & Kaohsiung City Government. (2025). VLM Benchmark for Smart City Governance – Kaohsiung Pilot Report (Version 0.9, draft).
Image Sources: Some images are provided by Kaohsiung City Government and are authorized for use solely within this report. Unauthorized use or redistribution is prohibited.
Contact: https://www.linkervision.com
1. Introduction
With the rapid advancement of Vision-Language Models (VLMs) within the broader field of artificial intelligence, their potential applications have expanded to a diverse range of complex real-world domains, notably smart city development, industrial governance frameworks, and data-driven community management. Despite this technological growth, most contemporary evaluation frameworks for VLMs remain predominantly academic and research-oriented, and often lack validation mechanisms and practical testing methodologies that are genuinely aligned with real-world governance challenges and operational constraints.
Drawing from a variety of cross-departmental, scenario-based use cases provided by the Kaohsiung City Government, this research proposes the establishment of a VLM Benchmark specifically tailored to smart city applications. This benchmark, developed with high fidelity to practical urban governance, integrates representative event scenarios, rigorous question type design, and a standardized evaluation protocol. The overarching goal is to provide a reference system for evaluating model performance that maintains both academic integrity and practical relevance.
This proposed benchmark rests upon three fundamental design principles:
- Alignment with localized application logic: All assessment items are formulated based on real-world urban scenarios and governance needs identified in Kaohsiung City, ensuring the evaluation content is not only contextually relevant but also grounded in policy and operational logic.
- Capacity for multi-variable testing: The benchmark accounts for diverse scene variations and environmental conditions—ranging across diurnal cycles (day and night), spatial contexts (indoor and outdoor), and meteorological variables (such as clear, cloudy, and rainy weather)—to simulate the multifaceted and dynamic nature of urban environments.
- Scalability and potential for global interoperability: The structure of the benchmark was designed with compatibility in mind, referencing internationally recognized evaluation paradigms such as MMBench. This allows the framework to support both local adaptation and comparative international benchmarking.
To realize this goal, the benchmark construction process incorporated governance requirements and visual data contributions from eight distinct departments and affiliated public agencies, covering domains such as transportation planning, environmental supervision, emergency service coordination, infrastructure management, and urban management. A multi-stakeholder VLM Benchmark Committee was assembled to guide the project, composed of end-user representatives from municipal government, system integrators and AI developers, as well as academic researchers. Through cross-sector collaboration, this committee jointly defined data selection criteria, question development workflows, and review protocols to uphold authenticity, representativeness, and fairness across all benchmark components.
2. Current Status and Related Work
2.1 Existing Benchmarks
Within the field of multimodal model evaluation, numerous datasets and benchmark frameworks have emerged and gained traction in academic research, including but not limited to COCO Caption, VQAv2, ScienceQA, GQA, and the more recent MMBench. While these resources serve as effective tools for evaluating general VLM capabilities in controlled settings, they tend to face three key limitations when extended to the domain of smart city governance.
First and foremost, the vast majority of these benchmarks rely heavily on static imagery and closed-form question-answer pairs, which inherently limits their applicability in dynamic, context-rich, and highly interactive city management environments. Second, these benchmarks are overwhelmingly designed for English-language contexts and thus lack sufficient support for other languages, particularly Traditional Chinese, which is essential for use in regions such as Taiwan. Third, these datasets often fall short in capturing task-specific logic, lacking the detailed operational workflows and decision-making structures that characterize real-world urban governance.
2.2 Gaps in Smart City Application
AI deployments in the smart city domain—especially those intended for frontline, citizen-facing tasks—demand sophisticated comprehension of diverse information sources and robust cross-modal reasoning. This is particularly evident in scenarios involving environmental ambiguity or infrastructural uncertainty. For example, accurately interpreting a photographic image depicting a temporary construction zone on a cloudy evening requires the model to conduct visual object recognition (identifying elements such as barricades or caution signs), infer spatial relationships (such as the proximity between pedestrians and hazard zones), and apply contextual governance rules (e.g., identifying violations or triggering alert mechanisms).
Although these tasks may appear conceptually simple, the execution requires a high degree of contextual awareness, multi-step inference, and domain-specific knowledge. Without a dedicated benchmark framework that reflects these complexities, it is exceedingly difficult to evaluate whether an AI model is adequately prepared for deployment in high-stakes, real-time urban governance settings. Such a gap emphasizes the critical need for new evaluation structures that go beyond generic accuracy measures and incorporate deeper dimensions of practical performance.
3. Data Sources and Construction Principles
3.1 Participating Departments and Dataset Scope
This benchmark was co-developed through the joint efforts of eight governmental departments and state-owned enterprises based in Kaohsiung City. These include the Transportation Bureau, Sports Development Bureau, Mass Rapid Transit Bureau, Water Resources Bureau, Public Works Bureau, and three major public sector enterprises—Taiwan International Ports Corporation, China Steel Corporation, and Taiwan Power Company (Taipower). Each participating unit contributed a set of domain-relevant governance scenarios based on their daily operational experiences. These scenarios were later formalized into test items with the support of the Information Technology Office, which served as the coordinating lead.
In summary:
- Total participating units: 8 (comprising 5 city government departments and 3 state-owned enterprises)
- Core governance scenarios compiled: 108
- Finalized test items: 642 questions derived from practical operational contexts (subject to dynamic adjustment based on operational requirements)
- Question types: Binary classification, category recognition, ordinal ranking tasks
- Design approach: Scenario-driven, governance-informed, and aimed at multi-dimensional evaluation diversity
3.2 Question Types and Structure
To systematically assess different aspects of model capability, the benchmark questions were categorized into three primary types:
- Binary Classification: Designed to determine the presence or absence of specific events (e.g., traffic violations, obstruction).
- Category Recognition: Aims to classify the situation or object type, such as identifying the category of an accident (collision, rollover, scratch, etc.).
- Ordinal/Ranked Scenarios: Intended to gauge the model’s ability to assess severity levels, such as rating the degree of traffic obstruction from level A to F.
Each of these categories reflects a key cognitive skill required for governance reasoning and real-time decision-making. In the final scoring system, appropriate weighting and accuracy thresholds were defined per type.
The illustrative examples presented below feature representative camera footage provided by the Water Resources Bureau, Sports Development Bureau, and Public Works Bureau. These demonstrate that every captured scenario can be formulated into the three clearly defined categories of questions; a schematic encoding of such test items is sketched after the examples.
Water Resources Bureau:
- Is there flooding or water accumulation: Yes / No
- Affected road section due to flooding: No impact / Single (one-way) or partial lane / Entire (both directions) road or intersection
- Water accumulation depth: Shallow (< 30 cm) / Moderate (30–50 cm) / Deep (> 50 cm)
Sports Development Bureau:
- Is there an emergency exit: Yes / No
- Is the emergency exit blocked: Yes / No
- Type of blockage at the emergency exit: Crowd / Debris / Trash / Other / More than one
- Degree of blockage at the emergency exit: Partially blocked (passable) / Completely blocked (impassable)
- Are the obstructions at the emergency exit movable (including people, vehicles): Yes / No
Public Works Bureau:
- Is there road construction occupying lanes and affecting traffic: Yes / No
- Type of construction: Road repair / Road milling and paving / Sidewalk construction / Road excavation / Aerial lift work (e.g., streetlight or tree maintenance) / Road widening / Bridge construction / Road occupation by building site (equipment and materials) / Other
- Are there construction signs or indicators: Yes / No
- Are traffic control measures in the construction area adequate (e.g., traffic cones with connectors, steel barriers, Jersey barriers): Yes / No
- Is fencing installed around the roadside construction area: Yes / No
- Do traffic control measures include nighttime warnings: Yes / No
- Are construction personnel present: Yes / No
- Time of scene: Daytime / Nighttime / Other
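To make the structure of such items concrete, the minimal sketch below shows one hypothetical way a test item could be encoded for automated evaluation. The field names (question_type, options, scene_tags, etc.) and the sample values are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class QuestionType(Enum):
    BINARY = "binary_classification"
    CATEGORY = "category_recognition"
    ORDINAL = "ordinal_ranking"


@dataclass
class BenchmarkItem:
    """One test item: a question tied to an image and its scene context (illustrative schema)."""
    item_id: str
    department: str                  # contributing unit, e.g. "Water Resources Bureau"
    question_type: QuestionType
    question: str
    options: list[str]               # predefined answer options
    answer: str                      # ground-truth option, confirmed by domain experts
    image_path: str
    scene_tags: dict = field(default_factory=dict)  # e.g. {"time": "daytime", "weather": "rainy"}


# Hypothetical item modeled on the Water Resources Bureau scenario above.
flood_depth_item = BenchmarkItem(
    item_id="WRB-0042",
    department="Water Resources Bureau",
    question_type=QuestionType.ORDINAL,
    question="Water accumulation depth",
    options=["Shallow (< 30 cm)", "Moderate (30-50 cm)", "Deep (> 50 cm)"],
    answer="Moderate (30-50 cm)",
    image_path="images/wrb/flood_0042.jpg",
    scene_tags={"time": "daytime", "weather": "rainy", "location": "outdoor"},
)
```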
3.3 Dataset Volume and Design Targets
To ensure international comparability, the benchmark design referenced leading evaluation frameworks frequently used to assess VLMs such as VILA and LLaVA. For example, ScienceQA contains 4,241 labeled items, MME includes 2,374, and MMBench comprises 1,784. Based on these precedents, this benchmark includes approximately 1,400 unique answer options. Under the assumption that each option is supported by a minimum of four distinct images, the total dataset size would reach approximately 5,600 images.
Design targets are as follows:
- Baseline coverage: 4 images per answer option
- Expanded objective: 15 images per option
- Initial dataset scale: ~5,600 image-question pairs
- Maximum target: 20,000 items if image sourcing capacity improves
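As a quick check of these targets, the short sketch below reproduces the scaling arithmetic. The figures come directly from the text; treating the 20,000-item ceiling as a hard cap on the expanded target is an assumption made here for illustration.

```python
ANSWER_OPTIONS = 1_400           # approximate number of unique answer options
BASELINE_IMAGES_PER_OPTION = 4   # baseline coverage
EXPANDED_IMAGES_PER_OPTION = 15  # expanded objective
MAX_ITEMS = 20_000               # maximum target if image sourcing capacity improves

baseline_scale = ANSWER_OPTIONS * BASELINE_IMAGES_PER_OPTION                  # 5,600 image-question pairs
expanded_scale = min(ANSWER_OPTIONS * EXPANDED_IMAGES_PER_OPTION, MAX_ITEMS)  # capped at 20,000

print(f"Baseline: {baseline_scale:,} items; expanded (capped): {expanded_scale:,} items")
```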
To preserve fairness and eliminate the risk of test leakage, all selected test images were manually reviewed and confirmed to be excluded from any training datasets used in developing the evaluated models.
3.4 Data Selection Rules and Scenario Planning
A structured, three-layer data selection strategy was implemented:
- Scene Type Differentiation: Indoor scenes were excluded from time-of-day and weather constraints, while outdoor scenes were tagged with complete environmental parameters.
- Time-based Distribution: Dataset images were distributed based on real-world collection ratios—daytime: 75%, dusk: 2%, nighttime: 23%.
- Weather-based Distribution: Aligned with Kaohsiung’s 10-year climatological data—sunny: 52%, cloudy: 20%, rainy: 28%.
This approach ensures scenario realism and contributes to comprehensive evaluation across variable operational conditions.
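One way to operationalize these ratios is a simple stratified sampling pass over a pool of tagged candidate images. The sketch below is illustrative only, assuming a pool of dictionaries with "time" and "weather" tags and independence between the two distributions; it is not the actual selection pipeline.

```python
import random
from collections import defaultdict

# Target proportions from the selection strategy (outdoor scenes only).
TIME_RATIOS = {"daytime": 0.75, "dusk": 0.02, "nighttime": 0.23}
WEATHER_RATIOS = {"sunny": 0.52, "cloudy": 0.20, "rainy": 0.28}


def stratified_sample(pool, total, time_ratios=TIME_RATIOS, weather_ratios=WEATHER_RATIOS, seed=0):
    """Draw up to `total` images whose (time, weather) mix approximates the target ratios.

    `pool` is a list of dicts such as
    {"path": "img_001.jpg", "time": "nighttime", "weather": "rainy"}.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for img in pool:
        by_stratum[(img["time"], img["weather"])].append(img)

    selected = []
    for t, t_ratio in time_ratios.items():
        for w, w_ratio in weather_ratios.items():
            quota = round(total * t_ratio * w_ratio)   # independence between time and weather assumed
            candidates = by_stratum.get((t, w), [])
            selected.extend(rng.sample(candidates, min(quota, len(candidates))))
    return selected
```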
4. Evaluation Design and Scoring Mechanism
4.1 Evaluation Workflow
To simulate field-level deployment conditions, the benchmark’s evaluation procedure incorporates repeated testing cycles and semantic normalization. This allows for assessment of both predictive accuracy and output stability under variation.
Core workflow components:
- Multiple inputs per test item: Each question is linked to multiple visual stimuli to test consistency.
- Answer Standardization: Model-generated responses are mapped to the predefined answer options established by the benchmark for scoring purposes. The benchmark includes multiple question types—namely, binary classification, categorical identification, and ordinal/hierarchical classification—each associated with distinct scoring rubrics and evaluation criteria. (Detailed scoring methodologies for each type are outlined in the subsequent section.)
- Performance metrics: Accuracy, error rate, output variability, and interpretability scores are captured and analyzed.
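The answer-standardization step can be pictured as a normalization-and-matching pass from free-form model output to the closest predefined option. The sketch below is a minimal illustration under that assumption; the matching heuristics and cutoff value are not the benchmark's actual mapping logic.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def map_to_option(model_output: str, options: list[str], cutoff: float = 0.5) -> str | None:
    """Map a free-form model response to the closest predefined answer option.

    Returns None when no option is sufficiently close; such cases are counted as errors.
    """
    normalized_output = normalize(model_output)
    normalized_options = {normalize(opt): opt for opt in options}

    # Exact match, or the option appearing as a whole-word phrase in the response
    # (e.g. "yes, there is flooding" -> "Yes").
    for norm_opt, original in normalized_options.items():
        if norm_opt == normalized_output or f" {norm_opt} " in f" {normalized_output} ":
            return original

    # Fall back to fuzzy string matching against the option texts.
    match = difflib.get_close_matches(normalized_output, list(normalized_options), n=1, cutoff=cutoff)
    return normalized_options[match[0]] if match else None
```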
4.2 Scoring Rules by Question Type
Each question type is paired with tailored scoring rules:
- Binary Classification: Simple correct/incorrect scoring.
- Category Recognition: Full credit for precise matches, partial credit for semantically adjacent categories.
- Ordinal/Ranking Tasks: Proportional scores based on distance from the correct rank.
All answers are aligned with domain-specific governance logic and validated through expert review.
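A minimal sketch of how such type-specific rules could be expressed is given below. The partial-credit value and the adjacency table are illustrative assumptions; the actual rubrics and weights are defined per question type by the benchmark committee.

```python
def score_binary(predicted: str, correct: str) -> float:
    """Binary classification: full credit only for an exact match."""
    return 1.0 if predicted == correct else 0.0


def score_category(predicted: str, correct: str, adjacent: dict[str, set[str]],
                   partial_credit: float = 0.5) -> float:
    """Category recognition: full credit for an exact match, partial credit
    when the prediction falls in a semantically adjacent category."""
    if predicted == correct:
        return 1.0
    if predicted in adjacent.get(correct, set()):
        return partial_credit
    return 0.0


def score_ordinal(predicted: str, correct: str, levels: list[str]) -> float:
    """Ordinal tasks: score decreases proportionally with rank distance
    (e.g. obstruction levels "A" through "F")."""
    if predicted not in levels:
        return 0.0
    distance = abs(levels.index(predicted) - levels.index(correct))
    return max(0.0, 1.0 - distance / (len(levels) - 1))


# Example: predicting obstruction level "C" when the ground truth is "B".
print(score_ordinal("C", "B", ["A", "B", "C", "D", "E", "F"]))  # 0.8
```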
4.3 Flexible Threshold Adjustment
To reflect real-world difficulty, scoring thresholds are dynamically adjusted based on:
- Data scarcity or collection difficulty
- Task-level inference complexity
- Scene condition rarity or ambiguity
Such flexibility ensures fair benchmarking even in cases of rare scenarios, while also encouraging model improvement in traditionally underperforming domains.
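As one possible reading of this mechanism, the sketch below lowers a base accuracy threshold by small, capped discounts for each difficulty factor. The factor names, discount values, and floor are assumptions for illustration; the actual adjustments are set during expert review.

```python
def adjusted_threshold(base: float = 0.85,
                       data_scarcity: bool = False,
                       high_inference_complexity: bool = False,
                       rare_or_ambiguous_scene: bool = False,
                       floor: float = 0.60) -> float:
    """Lower the passing accuracy threshold for harder test items, but never below `floor`."""
    discount = 0.0
    if data_scarcity:
        discount += 0.05
    if high_inference_complexity:
        discount += 0.05
    if rare_or_ambiguous_scene:
        discount += 0.10
    return max(floor, base - discount)


# Example: a rare nighttime-rain scene with limited training data.
print(adjusted_threshold(data_scarcity=True, rare_or_ambiguous_scene=True))  # 0.70
```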
5. Preliminary Validation and Interim Results
We have progressively completed early-stage evaluation tasks across a variety of test items that reflect key domains of urban governance. These include traffic accidents, roadway violations, environmental anomalies, and surveillance of community public spaces. The responses generated by VLM systems under evaluation have verified that this benchmarking framework is effective in uncovering typical model errors and behavioral biases—particularly those arising from insufficient training data or specific scene complexities. Notable examples include significant performance degradation under nighttime rain conditions and blurred, distant imagery, as well as confusion among semantically similar options in classification tasks. These analyses provide essential insights that inform targeted fine-tuning and dataset augmentation in future model development phases.
6. Future Directions
In summary, the Smart City VLM Benchmark established through this project has evolved beyond a conventional academic dataset for capability evaluation. It now serves as a policy-driven, multi-agency collaborative, locally sovereign, practically validated, and globally extendable AI evaluation platform. Through close cooperation among governance units of Kaohsiung City Government, technology developers, and academic research communities, we have realized a closed-loop paradigm in which data and evaluation items are generated from real needs, and model outputs provide direct feedback for decision-making processes.
We will continue refining and localizing the Kaohsiung-specific VLM Benchmark to support dataset standardization and fairness efforts. This initiative aims to promote broader inter-municipal collaboration and co-development, driving progress toward exporting the standard internationally as a common AI testing criterion for urban governance across countries.