Municipal Similarity in Switzerland

Executive Summary #

Client: HSLU (University Project)
The Main Win: Pivoted from a failed clustering approach to a successful vector-space model, resulting in an end-to-end product that visualises complex socioeconomic relationships.
Tech Stack: Python (Scikit-learn, Pandas), PostgreSQL, Express.js, Vue 3, Docker.

The Challenge #

Regional planning in Switzerland frequently relies on “benchmarking”—comparing municipalities. The existing approach defaults to geographic proximity (comparing a town to its neighbour) or high-level cantonal aggregates.

This creates a data blind spot. A wealthy, industrial municipality in Aargau might face challenges more similar to a Zürich suburb than its rural neighbour. The lack of a tool to quantitatively answer “Which municipalities are actually structured like mine?” leads to inefficient resource allocation and poor policy modeling.

The Solution #

I built a “Similarity Engine” that ingests raw socioeconomic data from the Swiss Federal Statistical Office (BFS).

Instead of forcing municipalities into arbitrary groups, the system calculates the pairwise distance between every municipality in a 43-dimensional vector space. This enables “Nearest Neighbour” searches based on data (demographics, politics, land use) rather than GPS coordinates.

The Results #

Data Validation: The model reconstructed the “Röstigraben” (the cultural split between French and German-speaking Switzerland) purely based on voting patterns and demographics, without explicit geographic training.
Performance: Similarity queries run in <1ms via indexed SQL lookups on a 2,172 x 2,172 matrix.
Granularity: Querying Zürich returns Winterthur, Bern, and Lausanne as top matches, validating that the model prioritises urban structure over geography.

Technical Architecture #

The system follows a standard ETL pipeline feeding a containerised web app.

Data Pipeline (Python): Ingests BFS datasets, imputes missing values, standardises features, and computes the Cosine Similarity matrix.
Persistence (PostgreSQL): Stores precomputed similarity scores. Bidirectional indexing on source and target columns ensures $O(1)$ retrieval.
API (Express.js): A lightweight REST interface.
Frontend (Vue 3 + Leaflet): Interactive map visualisation.

Key Technical Challenges #

The Clustering Failure Initially, I attempted to segment municipalities using K-Means and Gaussian Mixture Models. The results were statistically weak (Silhouette score: 0.045), indicating that Swiss municipalities exist on a continuous spectrum rather than in discrete buckets.

The Fix: Cosine Similarity I pivoted to a vector-based approach. By calculating the cosine of the angle between vectors, the system captures proportional relationships (e.g., ratio of industrial to agricultural land) rather than absolute magnitude. This solved the issue of comparing large cities to smaller towns with similar structural profiles.

Implementation Details #

The core logic relies on vectorising the dataset. Since magnitude (population size) shouldn’t skew the similarity, Cosine Similarity is preferred over Euclidean Distance.

def compute_similarity_matrix(df: pd.DataFrame, feature_cols: List[str]) -> pd.DataFrame:
    """
    Computes pairwise cosine similarity for all entities in the dataframe.
    
    Args:
        df: DataFrame containing municipality data.
        feature_cols: List of column names to use for the vector space.
        
    Returns:
        DataFrame: A symmetric N x N matrix where index/columns are municipality_ids.
    """
    # 1. Standardize features (Mean=0, Std=1)
    # Essential for PCA/Similarity to prevent features with large scales 
    # (e.g., average income) from dominating features with small scales (e.g., tax rate).
    scaler = StandardScaler()
    feature_matrix = df[feature_cols].to_numpy()
    scaled_features = scaler.fit_transform(feature_matrix)

    # 2. Compute Cosine Similarity (Result is N x N matrix)
    similarity_matrix = cosine_similarity(scaled_features)

    # 3. Convert back to DataFrame for readability
    return pd.DataFrame(
        similarity_matrix, 
        index=df['municipality_id'], 
        columns=df['municipality_id']
    )

Infrastructure #

The stack is defined in Docker Compose for reproducibility on a Linux VPS.

Database: PostgreSQL container with persistent volume.
Backend: Node.js container (Express).
Frontend: Nginx container serving static Vue build.

Get in touch