GloBI Architecture - Detailed System Design#

This diagram provides a comprehensive view of the GloBI system architecture, including all major components, data flows, and external dependencies.

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#4a9eff','primaryTextColor':'#000','primaryBorderColor':'#2563eb','lineColor':'#64748b','secondaryColor':'#fbbf24','tertiaryColor':'#34d399','noteBkgColor':'#fef3c7','noteTextColor':'#000','noteBorderColor':'#f59e0b'}}}%%
flowchart TD
    %% User Inputs
    subgraph INPUTS["System Inputs"]
        M[Manifest File<br/>GloBIExperimentSpec]
        GIS[GIS Building Data<br/>Shapefile/GeoJSON/GPKG]
        CDB[Component Database<br/>SQLite via Prisma]
        SF[Semantic Fields<br/>YAML]
        CM[Component Map<br/>YAML]
        EPW[Weather Files<br/>EPW Archive]
    end

    %% CLI Layer
    subgraph CLI["CLI Layer (main.py)"]
        CLI1[submit manifest]
        CLI2[simulate]
        CLI3[get experiment]
        CLI4[ui]
        CLI5[output_viz]
    end

    %% Configuration Layer
    subgraph CONFIG["Configuration Layer (models/configs.py)"]
        EXP[GloBIExperimentSpec]
        FC[FileConfig]
        GPC[GISPreprocessorConfig]
        HDC[HourlyDataConfig]
    end

    %% GIS Processing
    subgraph GISPROCESS["GIS Preprocessing (pipelines.preprocess_gis_file)"]
        direction TB
        GP1[Load GIS File<br/>GeoPandas]
        GP2[Validate & Reproject CRS]
        GP3[Rename Columns<br/>Handle Shapefile Limits]
        GP4[Validate Semantic Fields]
        GP5[Generate/Validate IDs]
        GP6[Extract Coordinates]
        GP7[Filter by Height/Floors]
        GP8[Filter by WWR]
        GP9[Filter by Basement/Attic]
        GP10[Validate Geometry]
        GP11[Create Rotated Rectangles<br/>gis/geometry.py]
        GP12[Filter by Area & Edge Length]
        GP13[Compute Neighbor Indices]
        GP14[Extract Neighbor Geometries]
        GP15[Inject Semantic Context]
        GP16[Assign Weather Files<br/>gis/weather.py]

        GP1 --> GP2 --> GP3 --> GP4 --> GP5
        GP5 --> GP6 --> GP7 --> GP8 --> GP9
        GP9 --> GP10 --> GP11 --> GP12 --> GP13
        GP13 --> GP14 --> GP15 --> GP16
    end

    %% Allocation
    subgraph ALLOC["Allocation Layer (allocate.py)"]
        direction TB
        A1[For Each Building Row:<br/>Create GloBIBuildingSpec]
        A2[Calculate Branching Factor<br/>Based on Payload Size]
        A3[Create BaseExperiment<br/>with simulate_globi_building]
        A4[Configure RecursionMap<br/>Distribution Strategy]
        A5[Submit to Hatchet<br/>experiment.allocate]
    end

    %% Distributed Computing
    subgraph DIST["Distributed Computing Infrastructure"]
        direction TB
        H[Hatchet Workflow<br/>Orchestrator]
        W1[Docker Worker 1]
        W2[Docker Worker 2]
        WN[Docker Worker N]
        CS[S3/Cloud Storage]

        H --> W1
        H --> W2
        H --> WN
    end

    %% Simulation
    subgraph SIM["Energy Simulation (pipelines.simulate_globi_building)"]
        direction TB
        S1[Receive GloBIBuildingSpec]
        S2[Construct Zone Definition<br/>from Semantic Fields]
        S3[Build EnergyPlus Model<br/>epinterface.sbem]
        S4[Validate Conditioned Areas]
        S5[Run EnergyPlus Simulation<br/>model.run]
        S6[Extract Results from SQL<br/>Monthly Energy & Peak]
        S7[Extract Hourly Data<br/>Optional Timeseries]
        S8[Create GloBIOutputSpec<br/>DataFrames + Metadata]

        S1 --> S2 --> S3 --> S4 --> S5
        S5 --> S6 --> S7 --> S8
    end

    %% Results Aggregation
    subgraph AGG["Results Aggregation (Scythe Framework)"]
        direction TB
        R1[Collect Outputs from Workers]
        R2[Aggregate DataFrames<br/>Results + HourlyData]
        R3[Apply Semantic Versioning]
        R4[Store in Cloud<br/>Parquet Format]
    end

    %% Output
    subgraph OUTPUT["Output Layer"]
        direction TB
        O1[Download from S3<br/>to Local Directory]
        O2[Results.parquet<br/>Monthly Energy Data]
        O3[HourlyData.parquet<br/>Timeseries Optional]
        O4[Generate Visualization<br/>D3 Dashboard]
        O5[CSV Exports]
    end

    %% External Dependencies
    subgraph EXT["External Dependencies"]
        EPL[EnergyPlus<br/>Simulation Engine]
        EPI[EPInterface<br/>IDF Generation]
        ARC[Archetypal<br/>Building Templates]
        SCY[Scythe<br/>Distributed Framework]
        PRIS[Prisma<br/>Database ORM]
    end

    %% Data Flow Connections
    M --> CLI1
    GIS --> CLI1
    CDB --> CLI1
    SF --> CLI1
    CM --> CLI1
    EPW --> CLI1

    CLI1 --> EXP
    EXP --> FC
    EXP --> GPC
    EXP --> HDC

    FC --> GISPROCESS
    GPC --> GISPROCESS
    GIS --> GP1
    SF --> GP4
    EPW --> GP16

    GP16 --> A1
    A1 --> A2
    A2 --> A3
    A3 --> A4
    A4 --> A5
    A5 --> H

    W1 --> S1
    W2 --> S1
    WN --> S1

    CDB --> S3
    S8 --> R1

    R1 --> R2
    R2 --> R3
    R3 --> R4
    R4 --> CS

    CLI3 --> O1
    CS --> O1
    O1 --> O2
    O1 --> O3
    O2 --> O4
    O2 --> O5

    CLI5 --> O4

    %% External dependency connections
    S5 -.uses.-> EPL
    S3 -.uses.-> EPI
    S3 -.uses.-> ARC
    A5 -.uses.-> SCY
    S3 -.uses.-> PRIS

    %% Styling - using medium-toned colors for better contrast
    style INPUTS fill:#60a5fa,stroke:#2563eb,stroke-width:3px,color:#000
    style CLI fill:#d1d5db,stroke:#6b7280,stroke-width:3px,color:#000
    style CONFIG fill:#fcd34d,stroke:#f59e0b,stroke-width:3px,color:#000
    style GISPROCESS fill:#fcd34d,stroke:#f59e0b,stroke-width:3px,color:#000
    style ALLOC fill:#fca5a5,stroke:#dc2626,stroke-width:3px,color:#000
    style DIST fill:#fca5a5,stroke:#dc2626,stroke-width:3px,color:#000
    style SIM fill:#fca5a5,stroke:#dc2626,stroke-width:3px,color:#000
    style AGG fill:#4ade80,stroke:#16a34a,stroke-width:3px,color:#000
    style OUTPUT fill:#4ade80,stroke:#16a34a,stroke-width:3px,color:#000
    style EXT fill:#d8b4fe,stroke:#9333ea,stroke-width:3px,color:#000

Component Details#

System Inputs#

Manifest File (GloBIExperimentSpec)#

  • Experiment name and scenario identifier
  • File paths configuration
  • GIS preprocessor parameters (thresholds, defaults, CRS)
  • Hourly data extraction settings (optional)

GIS Building Data#

  • Building footprints as polygons (Shapefile/GeoJSON/GeoPackage)
  • Properties: height, number of floors, typology, age, region
  • Coordinate reference system (CRS) information

Component Database#

  • SQLite database accessed via Prisma ORM
  • Building components: walls, windows, roofs, floors
  • Material properties and thermal characteristics
  • Accessed during simulation to construct energy models

Semantic Fields & Component Map#

  • YAML files defining categorical building attributes
  • Maps building typologies to component selections
  • Examples: residential/commercial, construction era, climate zone

Weather Data#

  • EPW (EnergyPlus Weather) files or archive
  • Can be queried dynamically based on building location
  • Provides hourly climate data for simulation
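Weather assignment (the last preprocessing step in the diagram) boils down to matching each building centroid to the nearest available EPW station. A minimal stdlib sketch, assuming an illustrative station list; the real lookup lives in gis/weather.py:

```python
import math

# Hedged sketch of location-based EPW matching; the station names and
# coordinates below are illustrative, not part of the real archive.
def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

stations = {
    "boston.epw": (42.36, -71.01),
    "worcester.epw": (42.27, -71.87),
}

def nearest_epw(lat, lon):
    """Return the EPW file whose station is closest to the given point."""
    return min(stations, key=lambda name: haversine_km(lat, lon, *stations[name]))

epw = nearest_epw(42.35, -71.05)
```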

CLI Layer#

The command-line interface provides user-facing commands:

  • submit manifest: Load experiment configuration and initiate preprocessing/allocation
  • simulate: Run single building simulation (testing/debugging)
  • get experiment: Retrieve results from cloud storage
  • ui: Launch Streamlit web interface for interactive exploration
  • output_viz: Generate D3 visualization dashboards from results

Configuration Layer#

GloBIExperimentSpec#

  • Top-level experiment configuration
  • Links to FileConfig, GISPreprocessorConfig, HourlyDataConfig
  • Supports manifest loading from YAML files

FileConfig#

  • Paths to all required input files
  • File validation and existence checks

GISPreprocessorConfig#

  • Geometric filtering thresholds (min/max area, edge length)
  • Default values (height, WWR, basement, attic)
  • CRS projection settings
  • Weather query parameters

HourlyDataConfig#

  • Variables to extract from EnergyPlus SQL output
  • Enables optional hourly timeseries capture
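The four configuration classes above compose into a single YAML manifest loaded by `submit manifest`. A hypothetical sketch; the key names are illustrative and inferred from the class descriptions, not the actual GloBIExperimentSpec schema:

```yaml
# Illustrative manifest sketch -- field names are assumptions, not the real schema.
experiment_name: boston-residential-stock
scenario: baseline
file_config:
  gis_file: data/buildings.geojson
  component_db: data/components.db
  semantic_fields: config/semantic_fields.yaml
  component_map: config/component_map.yaml
  epw_archive: data/weather/
gis_preprocessor:
  cartesian_crs: "EPSG:26919"
  min_area_m2: 50
  max_area_m2: 10000
  default_wwr: 0.3
hourly_data:
  variables:
    - Zone Mean Air Temperature
```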

GIS Preprocessing Pipeline#

The preprocessing pipeline transforms raw GIS data into simulation-ready building specifications:

  1. Load & Validate: Read GIS file into a GeoDataFrame, validate schema
  2. Reproject: Convert to a Cartesian CRS for geometric operations
  3. Column Mapping: Handle the 10-character Shapefile column name limit
  4. Semantic Validation: Ensure semantic fields exist in the GIS data
  5. ID Handling: Generate UUIDs for buildings without IDs
  6. Coordinate Extraction: Extract latitude/longitude for weather queries
  7. Property Filters: Filter by height, floors, WWR, basement, attic
  8. Geometry Processing:
     • Remove invalid geometries (non-polygons, self-intersections)
     • Convert to rotated rectangles (gis/geometry.py)
     • Filter by minimum/maximum building area
     • Filter by minimum/maximum edge length
  9. Neighbor Analysis: Identify adjacent buildings for shading calculations
  10. Semantic Context: Inject building typology, age, region metadata
  11. Weather Assignment: Match buildings to EPW files by location

Output: Clean GeoDataFrame with enriched building data and column mappings
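The property-filter steps above can be sketched with plain Python. This is a stdlib-only illustration of the thresholding logic; the field names and threshold values are assumptions, and the real pipeline operates on a GeoDataFrame via GISPreprocessorConfig:

```python
# Hedged sketch of the property filters (steps 7-8); data and thresholds
# are illustrative, not the pipeline's actual defaults.
buildings = [
    {"id": "b1", "height": 9.0, "num_floors": 3, "wwr": 0.3, "area": 120.0},
    {"id": "b2", "height": 2.0, "num_floors": 1, "wwr": 0.3, "area": 40.0},
    {"id": "b3", "height": 12.0, "num_floors": 4, "wwr": 0.9, "area": 300.0},
]

def passes_filters(b, min_height=3.0, max_wwr=0.8, min_area=50.0):
    """Keep a building only if every attribute falls inside its threshold."""
    return (
        b["height"] >= min_height
        and b["wwr"] <= max_wwr
        and b["area"] >= min_area
    )

kept = [b for b in buildings if passes_filters(b)]
```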


Allocation Layer#

Prepares building specifications for distributed execution:

  1. Spec Generation: Create a GloBIBuildingSpec for each building row
     • Extract geometry (rotated rectangle, neighbors)
     • Extract properties (height, floors, WWR, basement, attic)
     • Link semantic context and weather file
  2. Branching Factor Calculation:
     • Sample 1000 random specs
     • Measure average JSON payload size
     • Calculate: sims_per_branch = 3 MB / avg_size
     • Determine: branches_required = total_specs / sims_per_branch
  3. Job Submission:
     • Create BaseExperiment with the simulate_globi_building function
     • Configure RecursionMap for the distribution strategy
     • Submit to Hatchet with an S3 client for result storage

Output: Hatchet job reference and run metadata
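The branching-factor calculation above can be sketched in a few lines. This is a hedged illustration of the sizing arithmetic, not the allocate.py implementation; the function name and the 3 MB target are taken from the steps above:

```python
import json
import random

def plan_branching(specs, target_branch_bytes=3 * 1024 * 1024, sample_size=1000):
    """Estimate how many specs fit in one ~3 MB branch payload.

    Mirrors the branching-factor steps described above: sample up to 1000
    specs, measure average serialized size, then size the branches.
    """
    sample = random.sample(specs, min(sample_size, len(specs)))
    avg_size = sum(len(json.dumps(s).encode()) for s in sample) / len(sample)
    sims_per_branch = max(1, int(target_branch_bytes // avg_size))
    branches_required = -(-len(specs) // sims_per_branch)  # ceiling division
    return sims_per_branch, branches_required

# Illustrative payloads standing in for serialized GloBIBuildingSpec objects.
specs = [{"id": i, "height": 10.0, "wwr": 0.3} for i in range(5000)]
per_branch, branches = plan_branching(specs)
```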


Distributed Computing Infrastructure#

Hatchet Workflow Orchestrator#

  • Receives job submissions from allocation layer
  • Manages task queue and worker assignment
  • Handles retries and error recovery
  • Tracks job progress and completion status

Docker Workers#

  • Containerized execution environments
  • Pre-configured with EnergyPlus, EPInterface, dependencies
  • Scale horizontally based on workload
  • Stream results to cloud storage via Scythe framework

S3/Cloud Storage#

  • Stores simulation results as Parquet files
  • Versions experiments using semantic versioning
  • Provides durable storage for large-scale experiments
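One way to picture the versioned layout is a key scheme that embeds the experiment name and semantic version. The path structure below is purely an assumption for illustration, not the actual storage layout used by Scythe:

```python
# Hypothetical sketch of a semantic-versioned result key; the path scheme
# is an assumption, not the framework's real layout.
def result_key(experiment: str, version: str, artifact: str) -> str:
    """Build an object-store key that scopes artifacts by experiment/version."""
    return f"experiments/{experiment}/{version}/{artifact}"

key = result_key("boston-stock", "1.2.0", "Results.parquet")
```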

Energy Simulation Pipeline#

Each worker executes the following for assigned buildings:

  1. Receive Spec: Deserialize GloBIBuildingSpec from JSON
  2. Zone Definition: Construct building zones from semantic fields and component map
  3. Model Construction: Use EPInterface/Archetypal to build EnergyPlus IDF
  4. Validation: Check conditioned floor areas match geometry
  5. Simulation: Run EnergyPlus simulation via model.run()
  6. Results Extraction:
     • Query SQL output for monthly energy and peak results
     • Create MultiIndex DataFrame (Measurement, Feature levels)
     • Optionally extract hourly timeseries data
  7. Output Creation: Build GloBIOutputSpec with results and metadata

Output: GloBIOutputSpec with DataFrames and hourly data references
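The (Measurement, Feature) MultiIndex layout mentioned in the results-extraction step can be sketched with pandas. The column names, units, and values below are illustrative assumptions, not the pipeline's actual output schema:

```python
import pandas as pd

# Hedged sketch of the two-level (Measurement, Feature) column layout;
# the specific measurements and values are illustrative.
columns = pd.MultiIndex.from_tuples(
    [
        ("Energy", "Heating"),
        ("Energy", "Cooling"),
        ("Peak", "Heating"),
    ],
    names=["Measurement", "Feature"],
)
results = pd.DataFrame(
    [[120.5, 40.2, 8.1], [95.0, 55.3, 7.4]],
    index=pd.Index(["building-a", "building-b"], name="building_id"),
    columns=columns,
)

# Selecting one (Measurement, Feature) pair yields a per-building series.
heating = results[("Energy", "Heating")]
```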


Results Aggregation#

The Scythe framework handles result consolidation:

  1. Collection: Gather GloBIOutputSpec objects from all workers
  2. Aggregation: Concatenate DataFrames across buildings
  3. Versioning: Apply semantic version to experiment results
  4. Storage: Write aggregated Parquet files to S3

Output: Versioned experiment results in cloud storage
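The aggregation step is, at its core, a concatenation of per-worker frames into one experiment-wide frame. A minimal sketch with illustrative data; in the real pipeline the combined frame would then be written to Parquet:

```python
import pandas as pd

# Hedged sketch of aggregation step 2: per-worker result frames are
# concatenated row-wise into a single experiment frame (data is illustrative).
worker_a = pd.DataFrame(
    {"eui_kwh_m2": [100.0]}, index=pd.Index(["b1"], name="building_id")
)
worker_b = pd.DataFrame(
    {"eui_kwh_m2": [80.0]}, index=pd.Index(["b2"], name="building_id")
)
combined = pd.concat([worker_a, worker_b])
```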


Output Layer#

Results are delivered to users via:

  1. Download: Retrieve Parquet files from S3 to a local directory
     • Results.parquet: Monthly energy and peak data
     • HourlyData.parquet: Optional hourly timeseries
  2. Visualization: Generate interactive D3 dashboards
     • Summary statistics (mean, min, max)
     • Energy use intensity (EUI) distributions
     • Peak demand analysis
  3. CSV Export: Convert Parquet to CSV for external analysis tools
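The dashboard's summary statistics amount to simple aggregations over the results frame. A hedged sketch with illustrative EUI values; the column name is an assumption:

```python
import pandas as pd

# Illustrative sketch of the mean/min/max summary statistics shown in the
# dashboard; values and the column name are assumptions.
results = pd.DataFrame({"eui_kwh_m2": [85.0, 120.0, 60.0, 95.0]})
summary = results["eui_kwh_m2"].agg(["mean", "min", "max"])
```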


External Dependencies#

EnergyPlus#

Building energy simulation engine that performs physics-based thermal calculations

EPInterface#

Python library for generating EnergyPlus IDF (Input Data File) models programmatically

Archetypal#

Provides building archetype templates and simplified building energy modeling (SBEM)

Scythe#

Distributed computing framework for experiment allocation, result aggregation, and storage

Prisma#

Database ORM for accessing component database during model construction


Key Design Principles#

  1. Separation of Concerns: Clear boundaries between GIS processing, allocation, simulation, and results
  2. Scalability: Horizontal scaling via distributed workers and cloud storage
  3. Reproducibility: Version-controlled experiments with full provenance tracking
  4. Flexibility: Configurable preprocessing, semantic mappings, and simulation parameters
  5. Fault Tolerance: Retry logic and error handling throughout the pipeline
  6. Data Efficiency: Parquet format for compressed, columnar data storage
  7. Modularity: Independent components can be tested and deployed separately

Data Flow Summary#

User Manifest
  → CLI loads configuration
  → GIS preprocessing enriches building data
  → Allocation creates building specs
  → Hatchet distributes specs to workers
  → Workers run EnergyPlus simulations
  → Results aggregated and stored in S3
  → CLI downloads and visualizes results
  → User analyzes building stock performance

This architecture enables regional-scale building energy modeling with minimal manual intervention, supporting urban planning, policy analysis, and decarbonization strategies.