Tuesday, April 5, 2022

The Great Data Debate: Unbundling or Bundling? – Atlan

Thanks to “Bundlegate”, it’s been a wild couple of weeks. Here’s what happened and where I think we’re going.

It’s been a wild couple of weeks on Data Twitter with another new hot debate.

As I lay tossing and turning in bed, thinking about the future of the modern data stack, I couldn’t help but feel the pressure to write yet another opinion piece 😉

What actually happened?

If you were MIA, Gorkem Yurtseven kickstarted this debate with an article called “The Unbundling of Airflow”.

He explained that once small products become large platforms, they’re ripe for unbundling, or having their component functions abstracted into smaller, more focused products. Craigslist is a great example: its Community section has been taken over by Nextdoor, Personals by Tinder, Discussion Forums by Reddit, For Sale by OfferUp, and so on.

Gorkem argued that in the data world, the same thing is happening with Airflow.

Before the fragmentation of the data stack, it wasn’t uncommon to create end-to-end pipelines with Airflow. Organizations used to build almost entire data workflows as custom scripts developed by in-house data engineers. Bigger companies even built their own frameworks inside Airflow, for example, frameworks with dbt-like functionality for SQL transformations in order to make it easier for data analysts to write these pipelines.

Today, data practitioners have many tools under their belt and only very rarely they have to reach for a tool like Airflow… If the unbundling of Airflow means all the heavy lifting is done by separate tools, what is left behind?

Gorkem Yurtseven

We were one of the data teams that, back in the day, ended up building our own dbt-like functionality for transformations in R and Python, so Gorkem’s words hit home.

Airflow’s functions have been built into ingestion tools (Airbyte, Fivetran, and Meltano), transformation layers (dbt), reverse ETL (Hightouch and Census), and more.

The Unbundling of Airflow. Image by Gorkem Yurtseven; shared here with his permission.

Unfortunately, this has led to a crazy amount of fragmentation in the data stack. I joke about this a lot, but honestly I feel terrible for someone buying data technology right now. The fragmentation and overlaps are mind-blowing for even an insider like me to fully grasp.

Nick Schrock from Elementl wrote a response on Dagster’s blog titled the “Rebundling of the Data Platform” that broke the data community… again. He agreed with Gorkem that the data stack was being unbundled, and said that this unbundling was creating its own set of problems.

I don’t think anyone believes that this is an ideal end state. The post itself advocates for consolidation. Having this many tools without a coherent, centralized control plane is lunacy, and a terrible endstate for data practitioners and their stakeholders… And yet, this is the reality we are slouching toward in this “unbundled” world.

Nick Schrock

Then, while I was writing this article, Ananth Packkildurai chimed in on the debate, first with a tweet and then with the latest issue of his Data Engineering Weekly newsletter.

Ananth agreed with the idea that unbundling has happened, but tied it to a larger issue. As data teams and companies have grown, data has become more complex, and data orchestration, quality, lineage, and model management have become significant problems.

The data stack unbundled to solve these specific problems, which just resulted in siloed, “duct tape systems”.

The data community often compares the modern tech stack with the Unix philosophy. However, we are missing the operating system for the data. We need to merge both the model and task execution unit into one unit. Otherwise, any abstraction we build without the unification will further amplify the disorganization of the data. The data as an asset will remain an aspirational goal.

Ananth Packkildurai

So… what’s my take? Where are we headed?

There are two kinds of people in the data world: those who believe in bundling and those who think unbundling is the future.

I believe that the answer lies somewhere in the middle. Here are some of my predictions and takes.

1. There will absolutely be more bundling from our current version of the modern data stack.

The current version of the modern data stack, with a new company launching every 45 minutes, is unsustainable. We’re absolutely in the middle of the golden era of innovation in the MDS, funded quite generously by Venture Capital $$, all in search of the next Snowflake. I’ve heard stories of perfectly happy (data) product managers in FAANG companies being handed millions of dollars to “try out any idea”.

This euphoria has had big advantages. A ton of smart people are solving data teams’ biggest tooling challenges. Their work has made the modern data stack a thing. It has made the “data function” more mainstream. And, most importantly, it has spurred innovation.

But, honestly, this won’t last forever. The cash will dry up. Consolidation, mergers, and acquisitions will happen. (We’ve already started seeing glimpses of this with dbt’s move into the metrics layer and Hevo’s move to introduce reverse ETL along with their data ingestion product.) Most importantly, customers will start demanding less complexity as they make choices about their data stack. This is where bundling will start to win.

From a conversation about Gorkem’s blog on Reddit

2. However, we never will (and shouldn’t ever) have a fully bundled data stack.

Believe it or not, the data world started off with the vision of a fully bundled data stack. A decade ago, companies like RJ Metrics and Domo aimed to create their own holistic data platforms.

The challenge with a fully bundled stack is that resources are always limited and innovation stalls. This gap will create an opportunity for unbundling, and so I believe we’ll go through cycles of bundling and unbundling. That being said, I believe that the data space in particular has peculiarities that make it difficult for bundled platforms to truly win.

My co-founder Varun and I spend a ton of time thinking about the DNA of companies or products. We think it’s important, perhaps the most important thing that defines who succeeds in a category of product.

Let’s look at the cloud battles. AWS, for example, has always been largely focused on scale, something they do a great job at. On the other hand, Azure, coming from Microsoft, has always had a more end-user-focused DNA, stemming from its MS Office days. It’s no surprise that AWS doesn’t do as well as Azure in creating world-class, user experience–focused applications, while Azure doesn’t do as well as AWS in scaling technical workloads.

Before we can talk about the DNA of the data world, we have to acknowledge its sheer diversity. The humans of data are data engineers, analysts, analytics engineers, scientists, product managers, business analysts, citizen scientists, and more. Each of these people has their own favorite and equally diverse data tools, everything from SQL, Looker, and Jupyter to Python, Tableau, dbt, and R. And data projects have their own technical requirements and peculiarities: some need real-time processing while some need speed for ad-hoc analysis, leading to a whole host of data infrastructure technologies (warehouses, lakehouses, and everything in between).

Because of this diversity, the technology for each of these different personas and use cases will have very different DNA. For example, a company building BI should be focused on the end-user experience, while a company building a data warehouse should be focused on reliability and scaling.

This is why I believe that while bundling will happen, it will only happen in areas where products’ fundamental DNA is similar. For example, we will likely see data quality merge with data transformation, and potentially data ingestion merge with reverse ETL. However, we probably won’t see data quality bundled with reverse ETL.

3. Metadata holds the key to unlocking harmony in a diverse data stack.

While we’ll see more consolidation, the fundamental diversity of data and the humans of data is never going away. There will always be use cases where Python is better than SQL or real-time processing is better than batch, and vice versa.

If you understand this fundamental reality, you have to stop searching for a future with a universal “bundled data platform” that works for everyone, and instead find ways for the diverse parts of our unbundled data stack to work together, in perfect harmony.

Data is chaos. That doesn’t mean that work needs to be.

I believe that the key to helping our data stack work together is in activating metadata. We’ve only scratched the surface of what metadata can do for us, but using metadata to its fullest potential can fundamentally change how our data systems operate.

Today, metadata is used for (relatively) simplistic use cases like data discovery and data catalogs. We take a bunch of metadata from a bunch of tools and put it into a tool we call the data catalog or the data governance tool. The problem with this approach is that it basically adds one more siloed tool to an already siloed data stack.

Instead, take a moment and imagine what a world could look like if you could have a Segment or Zapier-like experience in the modern data stack, where metadata can create harmony across all our tools and power perfect experiences.

Image by Atlan, from my talk at Starburst Datanova.

For example, one use case for metadata activation could be as simple as notifying downstream consumers of upstream changes.

Image by Atlan.

A Zap-like workflow for this simple process could look like this:

When a data store changes…

  1. Refresh metadata: Crawl the data store to retrieve its updated metadata.
  2. Detect changes: Compare the new metadata against the previous metadata. Identify any changes that could cause an impact, such as adding or removing columns.
  3. Find dependencies: Use lineage to find users of the data store. These could include transformation processes, other data stores, BI dashboards, and so on.
  4. Notify consumers: Notify each consumer through their preferred communication channel (Slack, Jira, and so on).

This workflow could be incorporated as part of the testing phases of changing a data store. For example, the CI/CD process that changes the data store could also trigger this workflow. Orchestration can then notify consumers before production systems change.
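To make the four steps concrete, here is a minimal Python sketch of how such a workflow might be wired together. Everything in it is hypothetical and for illustration only: the function names (`detect_changes`, `find_consumers`, `on_store_change`), the column-to-type snapshot format, the lineage dictionary, and the channel lookup are assumptions, not the API of any real catalog or orchestration tool. Step 1 (crawling) is assumed to have already produced the `current` snapshot, and step 4 only builds the notifications rather than sending them.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    consumer: str
    channel: str
    message: str


def detect_changes(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Step 2: compare column -> type snapshots and describe added/removed columns."""
    removed = [f"removed column '{c}'" for c in sorted(previous.keys() - current.keys())]
    added = [f"added column '{c}'" for c in sorted(current.keys() - previous.keys())]
    return removed + added


def find_consumers(lineage: dict[str, list[str]], store: str) -> list[str]:
    """Step 3: breadth-first walk of everything downstream of `store`."""
    found: list[str] = []
    queue = deque(lineage.get(store, []))
    while queue:
        node = queue.popleft()
        if node != store and node not in found:
            found.append(node)
            queue.extend(lineage.get(node, []))
    return found


def on_store_change(
    store: str,
    previous: dict[str, str],
    current: dict[str, str],
    lineage: dict[str, list[str]],
    channels: dict[str, str],
) -> list[Notification]:
    """Run steps 2 to 4; step 1 (crawling) is assumed to have produced `current`."""
    changes = detect_changes(previous, current)
    if not changes:
        return []
    message = f"{store}: " + "; ".join(changes)
    # Step 4: route each consumer to its preferred channel (hypothetical default: email).
    return [
        Notification(consumer, channels.get(consumer, "email"), message)
        for consumer in find_consumers(lineage, store)
    ]


# Hypothetical metadata snapshots, lineage graph, and channel preferences.
previous = {"id": "int", "email": "string", "plan": "string"}
current = {"id": "int", "email": "string", "tier": "string"}
lineage = {
    "warehouse.customers": ["dbt.customer_metrics"],
    "dbt.customer_metrics": ["bi.revenue_dashboard"],
}
channels = {"dbt.customer_metrics": "slack", "bi.revenue_dashboard": "jira"}

notifications = on_store_change("warehouse.customers", previous, current, lineage, channels)
for n in notifications:
    print(f"[{n.channel}] {n.consumer} <- {n.message}")
```

In a CI/CD setting, a function like `on_store_change` could run against the proposed schema before the change ships, so the notifications go out ahead of production.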

In Stephen Bailey’s words, “No one knows what the data stack will look like in ten years, but I can guarantee you this: metadata will be the glue.”

I think this debate is far from over, but it’s amazing how many insights and hot takes (like this one) it has stirred up.

To keep the discussion going, we just hosted Gorkem Yurtseven, Nick Schrock, and Ananth Packkildurai for our Great Data Debate.

Check out the recording here!

This article was originally published on Towards Data Science.

Header image: Bakhrom Tursunov on Unsplash


