
Building a Geospatial Lakehouse, Part 2


In Part 1 of this two-part series on how to build a Geospatial Lakehouse, we introduced a reference architecture and design principles to consider when building a Geospatial Lakehouse. The Lakehouse paradigm combines the best elements of data lakes and data warehouses. It simplifies and standardizes data engineering pipelines across the enterprise, based on the same design pattern. Structured, semi-structured, and unstructured data are managed under one system, effectively eliminating data silos.

In Part 2, we focus on the practical considerations and provide guidance to help you implement them. We present an example reference implementation with sample code to get you started.

Design Guidelines

To realize the benefits of the Databricks Geospatial Lakehouse for processing, analyzing, and visualizing geospatial data, you will need to:

  1. Define and break down your geospatial-driven problem. What problem are you solving? Are you analyzing and/or modeling in-situ location data (e.g., map vectors aggregated with satellite TIFFs) to combine with, for example, time-series data (weather, soil information)? Are you looking for insights into, or modeling, movement patterns across geolocations (e.g., device pings at points of interest between residential and commercial areas) or multi-party relationships between these? Depending on your workload, each use case will require different underlying geospatial libraries and tools to process, query/model and render your insights and predictions.
  2. Decide on the data format standards. Databricks recommends the Delta Lake format, based on the open Apache Parquet format, for your geospatial data. Delta comes with data skipping and Z-ordering, which are particularly well suited to geospatial indexing (such as geohashing or hexagonal indexing), bounding-box min/max x/y generated columns, and geometries (such as those generated by Sedona or GeoMesa). A shortlist of these standards will allow you to better understand the minimum viable pipeline needed.
  3. Know and scope the volumes, timeframes and use cases required for:
    • raw data and data processing at the Bronze layer
    • analytics at the Silver and Gold layers
    • modeling at the Gold layers and beyond

    Geospatial analytics and modeling performance and scale depend greatly on format, transforms, indexing and metadata decoration. Data windowing can be applicable to geospatial and other use cases, where windowing and/or querying across broad timeframes overcomplicates your work without any analytics/modeling value and/or performance benefits. Geospatial data is rife with enough challenges around frequency, volume, and the lifecycle of formats throughout the data pipeline, without adding very expensive, grossly inefficient extractions across these.

  4. Select from a shortlist of recommended libraries, technologies and tools optimized for Apache Spark; those targeting your data format standards together with the defined problem set(s) to be solved. Consider whether the data volumes being processed in each stage and run of your data analytics and modeling can fit into memory or not. Consider what types of queries you will need to run (e.g., range, spatial join, kNN, kNN join, etc.) and what types of training and production algorithms you will need to execute, together with Databricks recommendations, to understand and choose how to best support these.
  5. Define, design and implement the logic to process your multi-hop pipeline. For example, with your Bronze tables for mobility and POI data, you might generate geometries from your raw data and decorate these with a first-order partitioning scheme (such as a suitable "region" superset of postal code/district/US-county, subset of province/US-state) together with secondary/tertiary partitioning (such as a hexagonal index). With Silver tables, you might focus on additional orders of partitioning, applying Z-ordering, and further optimizing with Delta OPTIMIZE + VACUUM. For Gold, you might consider data coalescing, windowing (where applicable, and across shorter, contiguous timeframes), and LOB segmentation together with further Delta optimizations specific to these tighter data sets. You also may find you need an additional post-processing layer for your Line of Business (LOB) or data science/ML users. With each layer, validate these optimizations and understand their applicability.
  6. Leverage Databricks SQL Analytics for the top-layer consumption of your Geospatial Lakehouse.
  7. Define the orchestration to drive your pipeline, with idempotency in mind. Start with a simple notebook that calls the notebooks implementing your raw data ingestion, Bronze=>Silver=>Gold layer processing, and any post-processing needed (a minimal sketch of such a driver notebook follows this list). Ensure that any component of your pipeline can be idempotently executed and debugged. Elaborate from there only as necessary. Integrate your orchestrations into your management, monitoring and CI/CD ecosystem as simply and minimally as possible.
  8. Apply the distributed programming observability paradigm – the Spark UI, MLflow experiments, Spark and MLflow logs, metrics, and even more logs – for troubleshooting issues. If you have applied the previous step correctly, this is a simple process. There is no "easy button" to magically solve issues in distributed processing; you need good old-fashioned distributed software debugging, reading logs, and using other observability tools. Databricks offers self-paced and instructor-led trainings to guide you if needed.
    From here, configure your end-to-end data and ML pipeline to monitor these logs, metrics, and other observability data, and to replicate and report on them. There is more depth on these topics available in the Databricks Machine Learning blog, along with Drifting Away: Testing ML models in Production and AutoML Toolkit – Deep Dive from the 2021 Data + AI Summit.
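As an illustration of step 7, here is a minimal sketch of such a driver notebook, assuming each stage notebook implements one hop of the pipeline idempotently; the notebook paths and the run_date argument are hypothetical placeholders, not part of the reference implementation.

# Minimal driver sketch for the orchestration described in step 7.
# Notebook paths and arguments are hypothetical placeholders.
stages = [
    "./01_raw_ingest",          # land raw POI + pings data
    "./02_bronze_to_silver",    # geometries, indexing, partitioning
    "./03_silver_to_gold",      # coalescing, windowing, LOB segmentation
]

for nb in stages:
    # each stage notebook should be idempotent so a failed run can simply be retried
    result = dbutils.notebook.run(nb, 3600, {"run_date": "2021-09-03"})
    print(f"{nb} finished with status: {result}")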

Implementation considerations

Data pipeline

For your Geospatial Lakehouse, in the Bronze layer we recommend landing raw data in its "original fidelity" format, then standardizing this data into the most workable format, cleansing and then decorating the data to best utilize Delta Lake's data skipping and compaction optimization capabilities. In the Silver layer, we then incrementally process pipelines that load and join high-cardinality data, apply multi-dimensional clustering and grid indexing, and decorate the data further with relevant metadata to support highly performant queries and effective data management. These are the prepared tables/views of effectively queryable geospatial data in a standard, agreed taxonomy. For Gold, we provide segmented, highly refined data sets from which data scientists develop and train their models and data analysts glean their insights, optimized specifically for their use cases. These tables carry LOB-specific data for purpose-built solutions in data science and analytics.

Putting this together for your Databricks Geospatial Lakehouse: there is a progression from raw, easily transportable formats to highly optimized, manageable, multidimensionally clustered and indexed formats that are most easily queryable and accessible to end users.

Queries

Given the plurality of business questions that geospatial data can answer, it is important that you choose the technologies and tools that best serve your requirements and use cases. To best inform these choices, you must evaluate the types of geospatial queries you plan to perform.

The principal geospatial query types include:

  • Range-search query
  • Spatial-join query
  • Spatial k-nearest-neighbor query (kNN query)
  • Spatial k-nearest-neighbor join query (kNN-join query)
  • Spatio-textual operations

Libraries such as GeoSpark/Sedona support range-search, spatial-join and kNN queries (with the help of UDFs), while GeoMesa (with Spark) and LocationSpark support range-search, spatial-join, kNN and kNN-join queries.
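To make these query types concrete, below is a minimal sketch of a point-in-polygon spatial join expressed with Apache Sedona SQL, assuming Sedona is installed on the cluster and that `pings` (with latitude/longitude columns) and `poi_polygons` (with a WKT geometry column) are hypothetical registered tables.

from sedona.register import SedonaRegistrator

SedonaRegistrator.registerAll(spark)  # registers the ST_* functions with Spark SQL

joined_df = spark.sql("""
  SELECT p.ad_id, poi.location_name
  FROM pings p
  JOIN poi_polygons poi
    ON ST_Contains(ST_GeomFromWKT(poi.polygon_wkt),
                   ST_Point(CAST(p.longitude AS DECIMAL(24, 20)),
                            CAST(p.latitude  AS DECIMAL(24, 20))))
""")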

Partitioning

It’s a well-established sample that information is first queried coarsely to find out broader tendencies. That is adopted by querying in a finer-grained method in order to isolate all the pieces from information hotspots to machine studying mannequin options.

This pattern, applied to spatio-temporal data such as that generated by geographic information systems (GIS), presents several challenges. Firstly, the data volumes make it prohibitive to index broadly categorized data to a high resolution (see the next section for more details). Secondly, geospatial data defies uniform distribution regardless of its nature — geographies are clustered around the features analyzed, whether these are related to points of interest (clustered in denser metropolitan areas), mobility (similarly clustered for foot traffic, or clustered in transit channels per transportation mode), soil characteristics (clustered in specific ecological zones), and so on. Thirdly, certain geographies are demarcated by multiple timezones (such as Brazil, Canada, Russia and the US), and others (such as China, Continental Europe, and India) are not.

It’s tough to keep away from information skew given the dearth of uniform distribution until leveraging particular methods. Partitioning this information in a way that reduces the usual deviation of knowledge volumes throughout partitions ensures that this information will be processed horizontally. We advocate to first grid index (in our use case, geohash) uncooked spatio-temporal information primarily based on latitude and longitude coordinates, which teams the indexes primarily based on information density relatively than logical geographical definitions; then partition this information primarily based on the bottom grouping that displays probably the most evenly distributed information form as an efficient data-defined area, whereas nonetheless adorning this information with logical geographical definitions. Such areas are outlined by the variety of information factors contained therein, and thus can symbolize all the pieces from massive, sparsely populated rural areas to smaller, densely populated districts inside a metropolis, thus serving as a partitioning scheme higher distributing information extra uniformly and avoiding information skew.

At the same time, Databricks is developing a library, known as Mosaic, to standardize this approach; see our blog Efficient Point in Polygons via PySpark and BNG Geospatial Indexing, which covers the approach we used. An extension to the Apache Spark framework, Mosaic allows easy and fast processing of very large geospatial datasets, and includes built-in indexing applying the above patterns for performance and scalability.

Geolocation fidelity:

Generally, the greater the geolocation fidelity (resolution) used for indexing geospatial datasets, the more unique index values will be generated. Consequently, the data volume itself post-indexing can dramatically increase by orders of magnitude. For example, increasing resolution fidelity from 24,000ft2 to 3,500ft2 increases the number of possible unique indices from 240 billion to 1.6 trillion; from 3,500ft2 to 475ft2 increases the number of possible unique indices from 1.6 trillion to 11.6 trillion.

We should always step back and question the necessity and value of high resolution, as its practical applications are really limited to highly specialized use cases. For example, consider POIs; on average these range from 1,500-4,000ft2 and can be sufficiently captured for analysis well below the highest resolution levels; analyzing traffic at higher resolutions (covering 400ft2, 60ft2 or 10ft2) will only require greater cleanup (e.g., coalescing, rollup) of that traffic and exponentiates the unique index values to capture. With mobility + POI data analytics, you will probably never need resolutions beyond 3,500ft2.
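When weighing these tradeoffs, it helps to look at how cell area and index count scale with resolution. Here is a small sketch using the h3-py (v3) API; the range of resolutions printed is illustrative.

import h3

for res in range(7, 14):
    avg_area_m2 = h3.hex_area(res, unit="m^2")   # average hexagon area at this resolution
    unique_ids = h3.num_hexagons(res)            # total possible unique indices at this resolution
    print(f"resolution {res}: ~{avg_area_m2:,.1f} m^2 per cell, {unique_ids:,} unique indices")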

For another example, consider agricultural analytics, where relatively smaller land parcels are densely outfitted with sensors to determine and understand fine-grained soil and climatic features. Here the logical zoom lends the use case to applying higher-resolution indexing, given that each point's significance will be uniform.

If a valid use case calls for high geolocation fidelity, we recommend only applying higher resolutions to subsets of data filtered by specific, higher-level classifications, such as those partitioned uniformly by data-defined region (as discussed in the previous section). For example, if you find a particular POI to be a hotspot for your particular features at a resolution of 3,500ft2, it may make sense to increase the resolution for that POI data subset to 400ft2, and likewise for similar hotspots in a manageable geolocation classification, while maintaining a relationship between the finer resolutions and the coarser ones on a case-by-case basis, all while broadly partitioning data by the region concept we discussed earlier.

Geospatial library architecture & optimization:

Geospatial libraries vary in their designs and implementations to run on Spark. These factors weigh greatly into the performance, scalability and optimization of your geospatial solutions.

Given the commoditization of cloud infrastructure, such as on Amazon Web Services (AWS), Microsoft Azure Cloud (Azure), and Google Cloud Platform (GCP), geospatial frameworks may be designed to take advantage of scaled cluster memory, compute, and/or IO. Libraries such as GeoSpark/Apache Sedona are designed to favor cluster memory; using them naively, you may experience memory-bound behavior. These technologies may require data repartitioning, and may cause a large volume of data to be sent to the driver, leading to performance and stability issues. Running queries using these types of libraries is better suited for experimentation on smaller datasets (e.g., lower-fidelity data). Libraries such as GeoMesa are designed to favor cluster IO; they use multi-layered indices in persistence (e.g., Delta Lake) to efficiently answer geospatial queries and suit the Spark architecture well at scale, allowing for large-scale processing of higher-fidelity data. Libraries such as sf for R or GeoPandas for Python are optimized for a range of queries operating on a single machine, and are better used for smaller-scale experimentation with even lower-fidelity data.

At the same time, Databricks is actively developing a library, known as Mosaic, to standardize this approach. An extension to the Spark framework, Mosaic provides native integration for easy and fast processing of very large geospatial datasets. It includes built-in geo-indexing for high-performance queries and scalability, and encapsulates much of the data engineering needed to generate geometries from common data encodings, including the well-known text (WKT), well-known binary (WKB), and JTS Topology Suite (JTS) formats.

See our blog on Efficient Point in Polygons via PySpark and BNG Geospatial Indexing for more on the approach.

Rendering:

What data you plan to render and how you aim to render it will drive your choices of libraries/technologies. We must consider how well rendering libraries suit distributed processing and large data sets; and what input formats (GeoJSON, H3, Shapefiles, WKT), interactivity levels (from none to high), and animation methods (convert frames to mp4, native live animations) they support. Geovisualization libraries such as kepler.gl, plotly and deck.gl are well suited for rendering large datasets quickly and efficiently, while providing a high degree of interaction, native animation capabilities, and ease of embedding. Libraries such as folium can render large datasets with more limited interactivity.
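As a minimal sketch of the embedding workflow in a Databricks notebook, assuming the keplergl package is installed and `poi_pdf` is a hypothetical pandas DataFrame produced by one of your queries:

from keplergl import KeplerGl

m = KeplerGl(height=600, data={"poi_data": poi_pdf})
html = m._repr_html_()
# some keplergl versions return bytes here; decode before handing off to displayHTML
displayHTML(html.decode("utf-8") if isinstance(html, bytes) else html)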

Language and platform flexibility:

Your data science and machine learning teams may write code principally in Python, R, Scala or SQL; or with another language entirely. In selecting the libraries and technologies used for implementing a Geospatial Lakehouse, we need to think about the core language and platform competencies of our users. Libraries such as GeoSpark/Apache Sedona and GeoMesa support PySpark, Scala and SQL, while others such as GeoTrellis support Scala only; and there are a body of R and Python packages built upon the C Geospatial Data Abstraction Library (GDAL).

Example implementation using mobility and point-of-interest data

Architecture

As presented in Part 1, the general architecture for this Geospatial Lakehouse example is as follows:

Diagram 1

Applying this architectural design pattern to our earlier example use case, we will implement a reference pipeline for ingesting two example geospatial datasets, point-of-interest (Safegraph) and mobile device pings (Veraset), into our Databricks Geospatial Lakehouse. We primarily focus on the three key stages – Bronze, Silver, and Gold.

A Databricks Geospatial Lakehouse detailed design for our example Pings + POI geospatial use case
Diagram 2

As per the aforementioned approach, architecture, and design principles, we used a combination of Python, Scala and SQL in our example code.

We next walk through each stage of the architecture.

Raw Data Ingestion:

We start by loading a sample of raw geospatial point-of-interest (POI) data. This POI data can be in any number of formats. In our use case, it is CSV.

# `schema` is defined earlier in the notebook for the Safegraph core POI CSV layout
raw_df = (spark.read.format("csv").schema(schema)
    .option("delimiter", ",")
    .option("quote", "\"")
    .option("escape", "\"")
    .option("header", "true")
    .load("dbfs:/ml/blogs/geospatial/safegraph/raw/core_poi-geometry/2021/09/03/22/*"))

display(raw_df)

Bronze Tables: Unstructured, proto-optimized 'semi-raw' data

For the Bronze Tables, we transform raw data into geometries and then clean the geometry data. Our example use case includes pings (GPS, mobile-tower-triangulated device pings) with the raw data indexed by geohash values. We then apply UDFs to transform the WKTs into geometries, and index by geohash 'regions'.

import geopandas
import h3
import pandas as pd
from pyspark.sql.functions import pandas_udf, col, split, explode, hex

resolution = 13  # H3 resolution; illustrative value (see the fidelity discussion above)

@pandas_udf('string')
def poly_to_H3(wkts: pd.Series) -> pd.Series:
    # fill each WKT polygon with H3 indices and return them as a comma-separated string
    polys = geopandas.GeoSeries.from_wkt(wkts)
    h3_lists = [list(h3.polyfill(poly.__geo_interface__, resolution, True)) for poly in polys]
    return pd.Series([",".join(indices) for indices in h3_lists])

@pandas_udf('float')
def poly_area(wkts: pd.Series) -> pd.Series:
    # polygon area in the units of the source coordinate system (degrees here)
    polys = geopandas.GeoSeries.from_wkt(wkts)
    return polys.area

raw_df.write.format("delta").mode("overwrite").saveAsTable("geospatial_lakehouse_blog_db.raw_safegraph_poi")

h3_df = (spark.table("geospatial_lakehouse_blog_db.raw_safegraph_poi")
        .select("placekey", "safegraph_place_id", "parent_placekey", "parent_safegraph_place_id", "location_name", "brands", "latitude", "longitude", "street_address", "city", "region", "postal_code", "polygon_wkt")
        .filter(col("polygon_wkt").isNotNull())
        .withColumn("area", poly_area(col("polygon_wkt")))
        .filter(col("area") < 0.001)
        .withColumn("h3", poly_to_H3(col("polygon_wkt")))
        .withColumn("h3_array", split(col("h3"), ","))
        .drop("polygon_wkt")
        .withColumn("h3", explode("h3_array"))
        .drop("h3_array")
        .withColumn("h3_hex", hex("h3")))

Silver Tables: Optimized, structured & fixed schema data

For the Silver Tables, we recommend incrementally processing pipelines that load and join high-cardinality data, indexing and decorating the data further to support highly performant queries. In our example, we used pings from the Bronze Tables above, then aggregated and transformed these with point-of-interest (POI) data and hex-indexed these data sets using H3 queries to write Silver Tables using Delta Lake. These tables were then partitioned by region and postal code, and Z-ordered by the H3 indices.

We also processed US Census Block Group (CBG) data capturing US Census Bureau profiles, indexed by GEOID codes, aggregating and transforming these codes using GeoMesa to generate geometries, then hex-indexed these aggregates/transforms using H3 queries to write additional Silver Tables using Delta Lake. These were then partitioned and Z-ordered similarly to the above.

These Silver Tables were optimized to support fast queries such as "find all device pings for a given POI location within a particular time window," and "coalesce frequent pings from the same device + POI into a single record, within a time window."
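A minimal sketch of writing and optimizing such a Silver Table is shown below; the partition columns are illustrative, and `silver_df` stands in for the joined, H3-indexed DataFrame described above.

(silver_df
   .write.format("delta")
   .mode("overwrite")
   .partitionBy("region", "postal_code")
   .saveAsTable("silver_tables.silver_h3_indexed"))

# co-locate rows sharing H3 indices within each partition, then clean up stale files
spark.sql("OPTIMIZE silver_tables.silver_h3_indexed ZORDER BY (h3_index)")
spark.sql("VACUUM silver_tables.silver_h3_indexed")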

# Silver-to-Gold H3 indexed queries
%python
gold_h3_indexed_ad_ids_df = spark.sql("""
     SELECT ad_id, geo_hash_region, geo_hash, h3_index, utc_date_time 
     FROM silver_tables.silver_h3_indexed
     ORDER BY geo_hash_region 
                       """)
gold_h3_indexed_ad_ids_df.createOrReplaceTempView("gold_h3_indexed_ad_ids")

gold_h3_lag_df = spark.sql("""
     select ad_id, geo_hash, h3_index, utc_date_time, row_number()
     OVER(PARTITION BY ad_id
     ORDER BY utc_date_time asc) as rn,
     lag(geo_hash, 1) over(partition by ad_id 
     ORDER BY utc_date_time asc) as prev_geo_hash
     FROM gold_h3_indexed_ad_ids
""")
gold_h3_lag_df.createOrReplaceTempView("gold_h3_lag")

gold_h3_coalesced_df = spark.sql(""" 
select ad_id, geo_hash, h3_index, utc_date_time as ts, rn, coalesce(prev_geo_hash, geo_hash) as prev_geo_hash from gold_h3_lag
""")
gold_h3_coalesced_df.createOrReplaceTempView("gold_h3_coalesced")

gold_h3_cleansed_poi_df = spark.sql(""" 
        select ad_id, geo_hash, h3_index, ts,
               SUM(CASE WHEN geo_hash = prev_geo_hash THEN 0 ELSE 1 END) OVER (ORDER BY ad_id, rn) AS group_id from gold_h3_coalesced
        """)
...

# write this out into a gold table
gold_h3_cleansed_poi_df.write.format("delta").partitionBy("group_id").save("/dbfs/ml/blogs/geospatial/delta/gold_tables/gold_h3_cleansed_poi")

Gold Tables: Highly optimized, structured data with evolving schema

For the Gold Tables, respective to our use case, we effectively a) sub-queried and further coalesced frequent pings from the Silver Tables to produce a next level of optimization, b) decorated the coalesced pings from the Silver Tables and windowed these with well-defined time intervals, and c) aggregated with the CBG Silver Tables and transformed for modeling/querying on CBG/ACS statistical profiles in the United States. The resulting Gold Tables were thus refined for the line-of-business queries to be performed daily, together with providing up-to-date training data for machine learning.
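For (b), a minimal sketch of windowing the coalesced pings into well-defined time intervals might look like the following; the 15-minute interval is illustrative, and `ts` is assumed to be a timestamp column.

from pyspark.sql.functions import window, countDistinct

windowed_poi_df = (
    gold_h3_cleansed_poi_df
      .groupBy("geo_hash", "h3_index", window("ts", "15 minutes"))
      .agg(countDistinct("ad_id").alias("distinct_devices"))
)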

# KeplerGL rendering of Silver/Gold H3 queries
...
# imports for this snippet; create_kepler_html and map_config are helpers
# defined in earlier (elided) cells of the notebook
import h3
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import udf, lit

lat = 40.7831
lng = -73.9712
resolution = 6
parent_h3 = h3.geo_to_h3(lat, lng, resolution)
res11 = [Row(x) for x in list(h3.h3_to_children(parent_h3, 11))]

schema = StructType([
    StructField('hex_id', StringType(), True)
])

sdf = spark.createDataFrame(data=res11, schema=schema)

@udf
def getLat(h3_id):
  return h3.h3_to_geo(h3_id)[0]

@udf
def getLong(h3_id):
  return h3.h3_to_geo(h3_id)[1]

@udf
def getParent(h3_id, parent_res):
  return h3.h3_to_parent(h3_id, parent_res)


# Note that parent and child hexagonal indices may sometimes not
# perfectly align; as such this is not intended to be exhaustive,
# but rather simply demonstrates one type of business question that
# a Geospatial Lakehouse can help to easily address
pdf = (sdf.withColumn("h3_res10", getParent("hex_id", lit(10)))
       .withColumn("h3_res9", getParent("hex_id", lit(9)))
       .withColumn("h3_res8", getParent("hex_id", lit(8)))
       .withColumn("h3_res7", getParent("hex_id", lit(7)))
       .withColumnRenamed('hex_id', "h3_res11")
       .toPandas() 
      )

example_1_html = create_kepler_html(data={"hex_data": pdf}, config=map_config, height=600)
displayHTML(example_1_html)
...

Results

For a practical example, we applied a use case ingesting, aggregating and transforming mobility data in the form of geolocation pings (providers include Veraset, Tamoco, Irys, inmarket, Factual) with point-of-interest (POI) data (providers include Safegraph, AirSage, Factual, Cuebiq, Predicio) and with US Census Block Group (CBG) and American Community Survey (ACS) data, to model POI features vis-a-vis traffic, demographics and residence.

Bronze Tables: Unstructured, proto-optimized 'semi-raw' data

We found that the sweet spot for loading and processing historical, raw mobility data (which typically is in the range of 1-10TB) is best achieved on large clusters (e.g., a dedicated 192-core cluster or larger) over a shorter elapsed time period (e.g., 8 hours or less). Sharing the cluster with other workloads is ill-advised, as loading Bronze Tables is one of the most resource-intensive operations in any Geospatial Lakehouse. One can reduce DBU expenditure by a factor of 6x by dedicating a large cluster to this stage. Of course, results will vary depending upon the data being loaded and processed.

Silver Tables: Optimized, structured & fixed schema data

While H3 indexing and querying performs and scales out far better than non-approximated point-in-polygon queries, it is often tempting to apply hex indexing resolutions to the extent that doing so overcomes any gain. With mobility data, as used in our example use case, we found our "80/20" H3 resolutions to be 11 and 12 for effectively "zooming in" to the finest-grained activity. H3 resolution 11 captures an average hexagon area of 2,150m2/~23,000ft2; 12 captures an average hexagon area of 307m2/3,305ft2. For reference regarding POIs, an average Starbucks coffeehouse has an area of 186m2/2,000ft2; an average Dunkin' Donuts has an area of 242m2/2,600ft2; and an average Wawa location has an area of 372m2/4,000ft2. H3 resolution 11 captures up to 237 billion unique indices; 12 captures up to 1.6 trillion unique indices. Our findings indicated that the balance between H3 index data explosion and data fidelity was best found at resolutions 11 and 12.

Increasing the resolution level, say to 13 or 14 (with average hexagon areas of 44m2/472ft2 and 6.3m2/68ft2 respectively), one finds the exponentiation of H3 indices (to 11 trillion and 81 trillion, respectively) and the resultant storage burden plus performance degradation far outweigh the benefits of that level of fidelity.

Taking this approach has, from experience, led to total Silver Table capacities in the 100 trillion record range, with disk footprints from 2-3 TB.

Gold Tables: Highly optimized, structured data with evolving schema

In our example use case, we found the pings data as bound (spatially joined) within POI geometries to be significantly noisy, with what effectively were redundant or extraneous pings in certain time intervals at certain POIs. To remove the data skew these introduced, we aggregated pings within narrow time windows in the same POI and high-resolution geometries to reduce noise, decorating the datasets with additional partition schemes and thus enabling further processing of these datasets for frequent queries and EDA. This approach reduces the capacity needed for Gold Tables by 10-100x, depending on the specifics. While you may have a plurality of Gold Tables to support your line-of-business queries, EDA or ML training, these will greatly reduce the processing times of those downstream activities and outweigh the incremental storage costs.

For visualizations, we rendered specific analytics and modeling queries from selected Gold Tables to best reflect specific insights and/or features, using kepler.gl.

With kepler.gl, we can quickly render millions to billions of points and perform spatial aggregations on the fly, visualizing these with different layers together with a high degree of interactivity.

You can render multiple resolutions of data in a reductive manner — execute broader queries, such as those across regions, at a lower resolution.

Below are some examples of the renderings across different layers with kepler.gl:

Here we use a set of coordinates in NYC (The Alden by Central Park West) to produce a hex index at resolution 6. We can then find all the children of this hexagon at a fairly fine-grained resolution, in this case, resolution 11:

kepler.gl rendering of H3 indexed data at resolution 6 overlaid with resolution 11 children centered at The Alden by Central Park in NYC
Diagram 3

Next, we query POI data for Washington DC postal code 20005 to demonstrate the relationship between polygons and H3 indices; here we capture the polygons for various POIs together with the corresponding hex indices computed at resolution 13. Supporting data points include attributes such as the location name and street address:

Polygons for POI with corresponding H3 indices for Washington DC postal code 20005
Diagram 4

Zooming in on the location of the National Portrait Gallery in Washington, DC, with our associated polygon and overlapping hexagons at resolutions 11, 12 and 13, illustrates how to break out polygons from individual hex indexes to constrain the total volume of data used to render the map.

Zoom in at National Portrait Gallery in Washington, DC, displaying overlapping hexagons at resolutions 11, 12, and 13
Diagram 5

You can explore and validate your points, polygons, and hexagon grids on the map in a Databricks notebook, and create similarly useful maps with these.

Technologies

For our example use cases, we used GeoPandas, Geomesa, H3 and KeplerGL to produce our results. In general, you will expect to use a combination of either GeoPandas, with Geospark/Apache Sedona or Geomesa, together with H3 + kepler.gl, plotly, folium; and for raster data, Geotrellis + Rasterframes.

Below we provide a list of geospatial technologies integrated with Spark for your reference:

  • Ingestion
    • GeoPandas
      • Simple, easy to use and robust ingestion of formats from ESRI ArcSDE, PostGIS, Shapefiles through to WKBs/WKTs
      • Can scale out on Spark by ‘manually’ partitioning source data files and running more workers
    • GeoSpark/Apache Sedona
      • GeoSpark is the original Spark 2 library; Sedona (in incubation with the Apache Foundation as of this writing), the Spark 3 revision
      • GeoSpark ingestion is straightforward, well documented and works as advertised
      • Sedona ingestion is WIP and needs more real world examples and documentation
    • GeoMesa
      • Spark 2 & 3
      • GeoMesa ingestion is generalized for use cases beyond Spark, therefore it requires one to understand its architecture more comprehensively before applying to Spark. It is well documented and works as advertised.
    • CARTO’s Spatial Extension for Databricks
      • Spark 3
      • Provides import optimizations and tooling for Databricks for common spatial encodings, including geoJSON, Shapefiles, KML, CSV, and GeoPackages
      • Scala API
    • Databricks Mosaic (to be released)
      • Spark 3
      • This project is currently under development. More details on its ingestion capabilities will be available upon release.
      • Easy conversion between common spatial encodings
      • Python, Scala and SQL APIs
  • Geometry processing
    • GeoSpark/Apache Sedona
      • GeoSpark is the original Spark 2 library; Sedona (in incubation with the Apache Foundation as of this writing), the Spark 3 revision
      • As with ingestion, GeoSpark is well documented and robust
      • RDDs and Dataframes
      • Bi-level spatial indexing
      • Range joins, Spatial joins, KNN queries
      • Python, Scala and SQL APIs
    • GeoMesa
      • Spark 2 & 3
      • RDDs and Dataframes
      • Tri-level spatial indexing via global grid
      • Range joins, Spatial joins, KNN queries, KNN joins
      • Python, Scala and SQL APIs
    • CARTO’s Spatial Extension for Databricks
      • Spark 3
      • Provides UDFs that leverage Geomesa’s SparkSQL geospatial capabilities in your Databricks cluster
      • Scala API
    • Databricks Mosaic (to be released)
      • Spark 3
      • This project is currently under development. More details on its geometry processing capabilities will be available upon release.
      • Optimizations for performing point-in-polygon joins co-developed with Ordnance Survey
      • Python, Scala and SQL APIs
  • Raster map processing
    • Geotrellis
      • Spark 2 & 3
      • RDDs
      • Cropping, Warping, Map Algebra
      • Scala APIs
    • Rasterframes
      • Spark 2, active Spark 3 branch
      • Dataframes
      • Map algebra, Masking, Tile aggregation, Time series, Raster joins
      • Python, Scala, and SQL APIs
  • Grid/Hexagonal indexing and querying
    • H3
      • Compatible with Spark 2, 3
      • C core
      • Scala/Java, Python APIs (along with bindings for JavaScript, R, Rust, Erlang and many other languages)
      • KNN queries, Radius queries
    • Databricks Mosaic (to be released)
      • Spark 3
      • This project is currently under development. More details on its indexing capabilities will be available upon release.
  • Visualization
    • kepler.gl, plotly, deck.gl
      • Well suited to rendering large datasets quickly and efficiently, with a high degree of interactivity, native animation capabilities, and ease of embedding
    • folium
      • Renders large datasets with more limited interactivity

We will continue to add to this list as technologies develop.

Downloadable notebooks

For your reference, you can download the following example notebook(s):

  1. Raw to Bronze processing of Geometries: Notebook with example of simple ETL of Pings data incrementally from raw parquet to bronze table with new columns added including H3 indexes, as well as how to use Scala UDFs in Python, which then runs incremental load from Bronze to Silver Tables and indexes these using H3
  2. Silver Processing of datasets with geohashing: Notebook that shows example queries that can be run off of the Silver Tables, and what kind of insights can be achieved at this layer
  3. Silver to Gold processing: Notebook that shows example queries that can be run off of the Silver Tables to produce useful Gold Tables, from which line of business intelligence can be gleaned
  4. KeplerGL rendering: Notebook that shows example queries that can be run off of the Gold Tables and demonstrates using the KeplerGL library to render over these queries. Please note that this is slightly different from using a Jupyter notebook as in the Kepler documentation examples

Summary

The Databricks Geospatial Lakehouse can provide an optimal experience for geospatial data and workloads, affording you the following advantages: domain-driven design; the power of Delta Lake, Databricks SQL, and collaborative notebooks; data format standardization; distributed processing technologies integrated with Apache Spark for optimized, large-scale processing; powerful, high-performance geovisualization libraries — all to deliver a rich yet flexible platform experience for spatio-temporal analytics and machine learning. There is no one-size-fits-all solution, but rather an architecture and platform enabling your teams to customize and model according to your requirements and the demands of your problem set. The Databricks Geospatial Lakehouse supports static and dynamic datasets equally well, enabling seamless spatio-temporal unification and cross-querying with tabular and raster-based data, and targets very large datasets from the 100s of millions to trillions of rows. Together with the collateral we are sharing with this article, we provide a practical approach with real-world examples for the most challenging and varied spatio-temporal analyses and models. You can explore and visualize the full wealth of geospatial data easily and without struggle and gratuitous complexity within Databricks SQL and notebooks.

Next Steps

Start with the aforementioned notebooks to begin your journey to highly available, performant, scalable and meaningful geospatial analytics, data science and machine learning today, and contact us to learn more about how we assist customers with geospatial use cases.

The above notebooks are not intended to be run in your environment as is. You will need access to geospatial data such as POI and Mobility datasets as demonstrated with these notebooks. Access to live ready-to-query data subscriptions from Veraset and Safegraph are available seamlessly through Databricks Delta Sharing. Please reach out to datapartners@databricks.com if you would like to gain access to this data.


