
Data Contracts: Community Conversation

52 mins

Slack Convos: https://getdbt.slack.com/archives/C033K8SF2CF/p1664560849333259 Statically typed dbt code: https://gist.github.com/bstancil/ee9ee57743e7423741fef3c0a3cc669d

Transcript

That's just uncommon. Well, there's definitely... Oh, okay, sorry, I was just making a comment about the recording real quick. But the last part of the thought was just: okay, Abby, you're a hundred percent right that there's the technical component.

But I think that the way this more often falls down is just really boring organizational misalignment that happens at, like, every freaking company you go to that's above some threshold of complexity.

So anyway, I'll leave it there. I'm curious how many people are cynics versus optimists on this take.

So, I agree with the take. Look, a lot of orgs can't do it. But I do wanna talk a little bit more about how Chad sees the world. So Chad was at Convoy, right?

I was at Flexport, <laugh>, and Flexport did this first. Flexport and Convoy worked very closely together; we were partners, and Chad and I had been talking about contracts for a number of quarters before it became a thing.

And what I shared with him was really this thing that I'm gonna drop in the chat. This is how we positioned these teams, right?

And I think this speaks to Tristan's point about where this can go wrong, because how many teams can do this? Contracts worked at Flexport because we followed this tech development life cycle with respect to data, right?

Product analysts and analytics engineers play key roles here. Basically, every new change is instigated by a PRD, right?

Every new feature created by the technology groups has what we think of as essentially three side effects on data.

You either create new data, you mutate old data, or you ask new business questions of data. And the job of the product requirements process, the PRD and the ARD (so there are all these docs, right, and the product analyst writes this analytics requirements doc), is to clarify what those side effects are.

Here's the new data we're creating, here are the new business questions we're asking, here's what we're going to be mutating as a result of this feature.

That set of requirements then goes down before anything is shipped, right? It forcibly moves down into the engineering requirements process, where we position analytics engineers (or positioned, I'm not there anymore <laugh>) to own thinking about data and debt.

So: based on this PRD, based on the ARD, how should we think about evolving the domain model?

How should we think about evolving the telemetry? And based on that, how should we think about what we raise to the rest of the data environment and the rest of the transactional environment, in terms of exposing our internal state?

So the analytics engineers here are the ones that define the contracts, and they do that in partnership with the producer teams, right?

So who does it? Well, it's the producer team, but it's not a software engineer. It's someone with data context.

It's someone with close ties to the rest of the data community, someone with the skills to think about how to model a domain with respect to data, right?

Data first and foremost. And then the chain continues, right? The value chain continues. This can work, but it requires a lot.

It requires alignment with product. It requires an organization that has this regimented process of PRDs and ERDs.

It requires that the engineering group will allow analytics engineers to play, right? To be a part of this process and define the data requirements. It requires you to have enough folks.

And it requires those folks, the analytics engineers, to actually be competent not just at writing SQL, but at thinking about the structure of data, how we represent data logically, right? The domain model in transactional systems. Those are a ton of requirements. Most orgs will not get there.

But if you have those, this is gonna be awesome <laugh>, right? And look, I hope everyone that looks at this chart sees that this could be the promised land.

But it's hard. So I don't know, Tristan, does that contextualize a little bit how Chad and I see the world, and also your cynicism? Yeah, totally.

And I guess I'm curious. You kind of alluded to this right there: you said this could be the promised land, and I wonder if you see trade-offs involved here.

Certainly there's a lot of maturity involved in this, and I think you could create a lot of good stuff out of a process like this.

But whenever I see a diagram with enough boxes and arrows on it, I start to get worried about the costs of complexity associated with it.

The number of people involved, the number of steps involved. Do you think that this is always appropriate, or are there only certain contexts in which this is appropriate?

I mean, it scales. And I think there are a lot of hands up, so maybe we should address those too. Yeah.

Yeah, go ahead. Sure, I'd love to hear from the folks with their hands up. Sure, I can go.

I guess the question I have is: is there something specific to the logistics industry that makes it more willing to accept this cost and this process compared to other industries?

Is it maybe that the leadership understands this better? No, I actually don't think it has anything to do with logistics.

I think it has everything to do, in this company and at some of the other companies I advise, with the move to distributed systems, right?

That is actually the big key here. Look, one of the reasons data mesh takes off in orgs that have distributed systems is that it's basically the data analog to a service mesh, right?

And with distributed services, well, one of the central tenets of distributed systems is that you don't expose your internal state, right?

You don't expose your internal persistence models. So if you're already living in this world of, yeah, I don't want other people coupled to my database, well, what does that look like?

It looks like contracts. What does that entail? It looks like thinking about a domain model. It looks like thinking about a logical way to represent your domain in a way that's going to be scalable, right?

So for us it's not logistics. It's whether your org is mature enough to think in distributed systems, or at least in services. It doesn't have to be that you communicate by events over Kafka, right, over a stream. But if you think in terms of a service-oriented architecture, if you're big enough to think in that way, where there are these clean interfaces between teams, this can work, right?

Does that make sense? Yeah, I think that makes sense. Is there a certain threshold where it starts to be worth the cost,

maybe in terms of number of employees or something? Well, I think that's what we all have to figure out.

I mean, Tristan just talked about the complexity costs here. And yeah, look, I get it.

There are complexity costs, and we can scale this down. One of the things I'll throw in the doc: the analytics engineers write what's called a data design doc, this log doc, and maybe we can talk about it later if folks are interested.

And yeah, look, we can scale down what this looks like. But I do think that every organization, if you're developing product, and I don't care if you're a five-person setup, should write a PRD for everything you do and should write an ERD for everything you do.

And if you're gonna write that ERD, you should just think thoughtfully about what the data design should be as part of that architecture.

If that's all we're saying (and that's a hill I'm happy to die on, for any org),

then yeah, we'll scale down the process, scale down the people, scale down the intensity, but we can still follow the scaffold.

At least, that's my take. Great, thanks. David, go for it. Yeah, I was just gonna say, I don't think it's specific to logistics at all.

I think it has to do with size. What I found this week when reaching out to some people, experienced data engineers like Chris Tabb and a few others, is that when they've had to work at a large-scale org that's having to use distributed systems, that's just the way they thought everyone works with data.

Because when you're at that level of advancement, that's what they do. They didn't necessarily call them data contracts.

They had different names. And they even showed me examples from 2010, 2016, at companies they worked at, where they've had these in place, and I think it probably goes back before that, right?

Like Abby said, if you have to hide the schema, you end up down this route. It's the logical step.

I think it can go down to smaller orgs. And I think the way it gets there is through automation.

I think you're starting to see the rudiments of that in tools like Segment, for example. Segment is not a tool that a large enterprise tech company would use;

they'd build their own, right? But for smaller companies like ours, we use it, and it's starting to have things like Protocols, which is the start of a data contracts kind of feature inside it.

And then you've got other open source tools, like Jake's (Jake's on the call), that are also starting to have these features for people to use.

Jake, you had your hand up for a bit before. Sure, yeah, I can speak to a couple things there. I think that that is, from my experience, very similar to the workflow I've been used to over the past couple companies, I guess.

One of which just so happens to be logistics as well. And working in that space: logistics, but very specifically robotics within warehouses.

So basically a niche side of the broader logistics spectrum. And a couple of things that we ran into there.

So first and foremost: yeah, very much so, the consumer of whatever data was emitted was very much the starting point for the contract draft.

And in industry, it feels like a post-Babel world, where product engineers are speaking this language and analytics engineers are speaking SQL plus dbt, and on and on it goes. Everybody is speaking a very different language, quite literally: TypeScript, JavaScript, Python, Go, and SQL over here.

And you get an analyst or an analytics engineer trying to push upstream, and it's very hard for them to bridge the communications gap between what they need

and, literally, TypeScript or whatever language happens to be on the other end of those pipes. And the data design document... so, I worked at Shopify on their robotics.

I also got to work very closely with Shopify the mothership <laugh>, who had done this for almost 10 years, pre Kafka schema registry, pre a whole bunch of stuff. That's what they were doing.

Like that's what they were doing. Um, and we did a couple things. So from a contract drafting perspective, like almost all the time, the consumer was the, uh, initial draft of that contract.

And we circled around, uh, Jason Schema because protocol buffers are really hard for that. And it was kind of a good, like, middle ground, um, between teams and between technologies.

Um, and, you know, an analyst drafted that Jason Schema, this is what we want, this is SCADA gata. We had some tools to make that easier.

Uh, but still it was Jason Schema that they were using. Um, and the product engineer would basically be on the other side of that.

Like, Yeah, we can get that for you. Yeah, no, like, we're gonna have to work on that, or it's not possible in, you know, whatever scope or context upstream that, uh, was necessary.
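(As a concrete illustration of the kind of artifact being described, here is a minimal sketch of a consumer-drafted contract as JSON Schema, checked with Python's jsonschema library. The event name and fields are invented for illustration, not Shopify's actual schemas.)

```python
# Hypothetical consumer-drafted contract for one tracking event.
from jsonschema import ValidationError, validate

ORDER_PICKED_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "order_picked",
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "robot_id": {"type": "string"},
        "picked_at": {"type": "string", "format": "date-time"},
        "item_count": {"type": "integer", "minimum": 0},
    },
    "required": ["order_id", "robot_id", "picked_at"],
    "additionalProperties": False,
}

def check_event(payload: dict) -> bool:
    """Return True if the payload satisfies the contract, else print why not."""
    try:
        validate(instance=payload, schema=ORDER_PICKED_SCHEMA)
        return True
    except ValidationError as err:
        print(f"Contract violation: {err.message}")
        return False

check_event({"order_id": "o-1", "robot_id": "r-9", "picked_at": "2022-09-30T12:00:00Z"})
```

The same document can then be read by both sides: the analyst treats it as the spec of what they will receive, and the producer validates against it before emitting.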

And then what we did with automation, again, like David said: we used those JSON Schemas to rip out tracking SDKs, so the product engineers were never disincentivized. They didn't have to deploy and then check, hey, is my pipeline redirecting events or blowing up because this was the wrong structure?

They had all of that feedback long before production, using VS Code and TypeScript, because we had everything written in TypeScript.

And coming back to the logistics side, we did that very intentionally, because deploying code onto robots, thousands of robots in hundreds of places around the world, is a very expensive, time-consuming process.

So we couldn't afford, like software teams for the internet can, to slap some instrumentation up, then go check and, oops, revert and fix our instrumentation.

We had to be on top of everything before it went out. Otherwise it's super expensive and you don't get those quick cycles in the world of robotics.

So that's something that worked pretty well. Sorry for the rant. So Jake, when you said the consumers drive the contract: relative to the data tech life cycle chart that I just showed, where did this happen?

When did this happen? Was this first, before the feature? Or was this at any point, like, analysts saying, hey, I need some data and I would like it to look like this?

Correct? Yeah. So that's a really good question. It was a combination of both. For features and larger product initiatives,

it would be baked into that project itself: here's the instrumentation that we plan on, yada yada. But we used the same pipes to instrument other stuff as well.

So like, DevOps wants a thing, or we want to instrument internal developer tools. It would just be: write this contract, and hey, you on the other team, this is what it bakes out for you in Go.

And here's a Go struct that you can instrument with. And oftentimes people, like, I'm a developer and I want analytics on my own stuff:

I would do it. I would just rip out the thing and basically create a contract with myself, for my own code.

Interestingly, there was so much automation. We redirected that data using those contracts to various tables in BigQuery, or Pub/Sub topics, and auto-created those.

We did a lot of instrumentation and event-level metrics using those same contracts. So there was a ton of automation that incentivized people basically across the spectrum to just use them.
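(A rough sketch of what that contract-driven automation could look like, assuming the google-cloud-bigquery client library: derive a table definition from a JSON-Schema-style contract and create it if it's missing. The type mapping and naming are illustrative, not the actual internal tooling described.)

```python
# Sketch: auto-create a BigQuery table from a contract like the one above.
from google.cloud import bigquery

JSON_TO_BQ = {"string": "STRING", "integer": "INT64", "number": "FLOAT64", "boolean": "BOOL"}

def table_from_contract(client: bigquery.Client, dataset: str, schema: dict) -> bigquery.Table:
    fields = [
        bigquery.SchemaField(
            name,
            JSON_TO_BQ.get(prop.get("type", "string"), "STRING"),
            mode="REQUIRED" if name in schema.get("required", []) else "NULLABLE",
        )
        for name, prop in schema["properties"].items()
    ]
    table = bigquery.Table(f"{client.project}.{dataset}.{schema['title']}", schema=fields)
    # exists_ok makes the job idempotent, so every contract merge can re-run it.
    return client.create_table(table, exists_ok=True)
```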

<laugh> Makes sense. The reason I ask is: like everything in data, it's all "it depends," and we constantly fight semantic battles, and I think when people talk about contracts, we don't recognize that we mean different things.

And producer-driven contracts are very different from consumer-driven contracts. So what you're describing is consumer-driven... well, sort of a hybrid.

I think a lot of people in the data space, when they think about data contracts, are thinking about consumer-driven contracts.

Like, hey, I'm an analyst, I would like this data to look like this because of the analysis I'm doing. Correct.

And to me, yeah, of course that's gonna be a hard sell to software engineering, right? <laugh> And to me, I'm pro contracts, and I completely set aside consumer-driven contracts, because they're a non-starter to me.

One of the reasons I don't find the usual objection problematic (people are always like, oh, how are you gonna get engineers on board?)

is that if you start with producer-driven contracts, there's a good case to be made that producer-driven contracts help producers, right?

There are tons of reasons, performance, storage, what have you, that in the code base I might represent a user table in four different ways

for my application. But as an engineer, the domain model is what helps me reason about the system. If I can come up with a cohesive way to reason about the system in a durable way, I'm gonna be able to onboard onto systems better, and I'm gonna be able to work in those systems better.

I'm gonna be able to communicate with people in other subsystems better. And so, yeah, I think consumer-driven contracts are a bit of a wild west. With consumer-driven contracts, you might have tons of different people creating tons of different contracts on the same data.

And I think that's messy, and it can create friction with engineers, where they're like, hey, I don't really wanna support this;

this isn't part of feature development. But for producer-driven contracts, especially if they're grounded in feature development, I think there's a really good case that this is a win-win, right?

It actually benefits engineers, because it helps them reason about their domains. So it's just something worth thinking about: producer-driven contracts look very different, and there are far fewer of them.

Like, if you go to the producer-driven world, yeah, you have fewer producers <laugh>, and they'll be much more bounded, and it'll look different, with different trade-offs.

I really like the distinction between producer- and consumer-driven contracts. Maybe that dichotomy is even more salient than the individual personas involved, because every contract is a negotiation between a producer and a consumer.

But the thing I don't understand in the producer-driven contracts world... okay, I kind of understand where you're coming from there, but when the producer controls the contract, mm-hmm

<affirmative>, they can also change it. How do you navigate that? Well, I think that's where it changes, where it is a different thing, right?

Because I agree consumer-driven contracts won't scale. Number one, the consumer doesn't know the context, what's available in that context for the engineer.

Cuz the engineers are like, what? You want me to add extra database calls to add data that wasn't in the context to begin with?

That's crazy; they're not gonna want to do that. But once the producer-driven contract is published, it's like dbt version 1.1, right?

You're gonna support that for a time, and it's gonna be long-term stable, and you can't just change it. So there's a deprecation process.

Yeah, there's totally deprecation. I mean, in the DDD (I don't know if everyone had a chance to skim it), we look at: what are new domain data products?

What are new contracts? What are new events? What's updated? What's deleted? And as part of each of them, we version, right?

So we'll announce a deprecation, deprecate, et cetera. And the other thing we should call out here is that just because you're producer-driven doesn't mean you don't get consumer feedback.

So the model that I've helped introduce at other companies now is: yeah, the producer defines the contract. The analytics engineer, right,

defines the contract, but they go talk to consumers and they broadcast out: hey, this is gonna be our contract, and we know these are the typically affected consumers.

Do you want anything else? Should we modify this? Does this work for you? And there's like a five-day, speak-now-or-forever-hold-your-peace period.

That feedback is incorporated, and that's how the producer contract is defined. And if it needs to be broken later, then there's plenty of this same kind of conversation.

But there needs to be someone... and again, I view analytics engineering differently than a lot of people do.

An analytics engineer is not someone that writes SQL. An analytics engineer, for me, is someone that understands how to model data at every layer of the stack, right?

How do you model telemetry? How do you model the domain model? How do you model, yes, the warehouse models? Those are different skills.

And it needs to be someone that can bridge the producers and consumers, and can keep in their head: how do we evolve this domain gracefully?

That needs to be a person. So I think this model is very hard. Personally, I don't think SWEs can do it, right?

It's funny, because all applications are basically model-view-controller, in some sense. But modern software engineering training and incentives are all about the view and the controller: all about the front end, all about JavaScript frameworks.

We don't teach software engineers about data models, right? Not SQL models, but how you model data and reason about a domain model. And someone needs to do that.

If that's left in the hands of SWEs, these projects will fail. So I do think it needs to be data experts, which is why it's expensive.

But yeah, that's my take. Maybe other things work for other people. So this is an interesting inflection point that we could turn on here, because what I'm hearing is: we're getting to this place where analytics engineers and data people are the definers, and potentially the producers, of these contracts; basically the people best suited to define them, the data model, the domain model, at the company, in the organization.

The tools that we encourage analytics engineers to use today are the warehouse, SQL, dbt. Are there ways that we can produce those contracts using those tools?

Do we need an entirely different suite of tools? And I'm asking about tools here in the sense of automation, right?

Not just proposals, not just requests for requirements and so on, because someone has to enforce these things, or make something happen when they're broken.

Yeah, I can answer that from an analytics engineer perspective. We actually do do this at VO. We've basically jerry-rigged dbt to create contracts for us.

I'd love to hear it. Tell me more. So essentially what we do (and keep in mind, I used to be a software engineer before being an analytics engineer,

so I actually know how to code) is: every hour we go out to Mode, use Mode's API, scrape the queries for all the active reports, use a SQL parser to convert those to dbt refs, and then essentially create a YAML file that we store in S3.

And there are kind of two things. That YAML file keeps getting updated every hour, and then at CI time we pull that YAML file into the project, and also when generating the dbt docs.

So they're added in at build time, but they aren't in our version control. And we just use that as the contract, and it works well enough that we can identify who is using a particular model in Mode, ask them to migrate off, and then sunset the models.
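(A hedged sketch of what that hourly job might look like. The Mode endpoint, response shape, and the regex standing in for a real SQL parser are assumptions for illustration, not Mode's documented API or VO's actual code.)

```python
# Sketch: scrape Mode report queries, map table references to dbt refs,
# and publish the result as an exposures-style YAML file in S3.
import re
import boto3
import requests
import yaml

def harvest_mode_refs(token: str, workspace: str) -> dict:
    resp = requests.get(
        f"https://app.mode.com/api/{workspace}/reports",  # illustrative endpoint
        headers={"Authorization": f"Bearer {token}"},
    )
    exposures = []
    for report in resp.json().get("reports", []):
        for query in report.get("queries", []):
            # Naive stand-in for SQL parsing: find analytics.<table> references.
            tables = re.findall(r"\bfrom\s+analytics\.(\w+)", query["raw_query"], re.I)
            exposures.append({
                "name": f"mode_report_{report['id']}",
                "type": "dashboard",
                "depends_on": [f"ref('{t}')" for t in tables],
            })
    return {"version": 2, "exposures": exposures}

def publish(doc: dict, bucket: str, key: str) -> None:
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=yaml.safe_dump(doc).encode("utf-8"))
```

At CI time, pulling this file into the project means a PR that drops a model still referenced by a Mode report fails the build, which is exactly the contract behavior described next.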

And along with that, we have a model versioning strategy, where we just use CalVer: we slap a date at the end of the model name, call it good, figure out who's using the old model, and ask them to move. So it's a three-step process.

First, you create a new model with a new date at the end of it. Then you migrate all the exposures from the old model to the new model; sometimes there's some translation work required,

but most of the time it's a drop-in replacement. Then you essentially deprecate the old model. And that's the best solution we've found with dbt.

But it's still very clunky, and you need to be a software engineer in order to implement it.

So I can appreciate that, tactically, there's some clunk in there, as you said. But I'm interested: conceptually, does that feel like the right model?

I still hear that we're still asking end consumers, right? And if they don't migrate off to the new version, what happens? Is this a standoff?

So our CI build breaks, which is great. We end up just not pushing it; like, we cannot deprecate the old model, because the CI build will break.

And so I think you're pushing the responsibility... It also requires the right organizational structure to do it.

So the way it works is: I'm an analytics engineer on the product side, and I build the analytics CI build; this is all within the dbt project.

We're only using dbt Core; we don't use dbt Cloud, sorry guys. But essentially, the way it works at VO is that the analytics team owns all the reporting in Mode, and they work very closely with us.

So we can push things onto their roadmap: hey, you need to update this model.

But at the same time, if you didn't have that structure, if random teams owned it, they'd be like, yeah, we're not gonna update it.

So I think we have that buy-in because we have a very centralized structure in place for the data modeling piece. We're very permissive on the reporting piece in Mode, on how you use the models,

but for the data modeling itself, we're very structured about it. And I think that's how we've essentially created data contracts downstream.

We're looking at creating data contracts upstream by embedding essentially our dbt tests into the product staging database.

But we haven't gone there yet, so TBD. But it's gotta fail someone's CI for that person to do anything.

Yeah, it's gotta fail someone's CI. It just ends up always failing the team that's one level up. Um, Callum, you had your hand raised; do you want to say anything?

Uh, no, my question was going back to Abby's point. I can take it in the Slack, because it was just a follow-up, and this is a good thread to go down, so I don't wanna derail.

Michael, I'm gonna call on you, and then David. Cool, yeah. I was just gonna say quickly: as a software engineer, as a product engineer, what I would love is if, when an analytics engineer defined a source in dbt (and I work with TypeScript almost exclusively), that created a TypeScript type that is the expected type I have to send over, whether it's the payload for my events or however I'm getting that data into analytics.

Now that type has been automatically generated, and that's what I have to use so that I don't get errors in my project. That would be amazing.

Along that note, tying it to what we discussed earlier about producer versus consumer: if it's only consumer-oriented and you don't have those types, then, if it's a TypeScript project (which an increasing number of things are, or are at least getting TypeScript injected into them), you have to go `any` in TypeScript.

And if you have lint rules and all the other mechanisms to keep your TypeScript good, but you don't have some producer-oriented mechanism, then the code quality just drops off.

You know, you just end up using `any` all over the place, and it's just a really big mess.

So yeah, plus one there. David, go for it. Yeah, I was just gonna say, I don't like the idea of automatically generating the contract from usage, because I feel like you're gonna have blind spots: data that's not being used at this time, or something that's rarely used.

I feel like it needs to be somewhere that's not solely in the analytics engineer's toolbox or the software engineer's toolbox.

It's like this third-party place, like some lawyer's drawer, right? <laugh> That should be the place where the contract is stored, whether that's a JSON Schema or something, but it's what everyone's agreed to.

So the analytics engineer is pulling from that JSON Schema to define their model going from raw to staging, and the software engineer is looking at that schema to know: this is the format in which I need to produce the data. And so that's the point of agreement.

And I kind of built something clunky in dbt, where we were basically storing a JSON Schema at the top of every staging model that we used.

And some of the software engineers would look at that schema, but it wasn't very reliable. So yeah, that's why I think it should be somewhere else.

It reminds me of people who put SQL queries in the back of Google Sheets: like, if you wanna regen this Google Sheet, just use the SQL query.

Yeah. What about jumping off of sources for that? Could you theoretically auto-generate the JSON Schema from... Yeah.

Yeah. It's a little bit arduous to describe it in JSON Schema. No, we don't like that.

No one likes that idea. But it has to be structured data somewhere; that's the claim. Yeah. Shopify mothership's schema creation process is a hundred percent YAML.

There were 9,810 schemas when I left a couple weeks ago. They're all versioned and they're all YAML. And they were created by 1,800-plus contributors, most of whom aren't software engineers, and many of whom aren't even data people: PMs, et cetera.

Like, "this is what I want." They kick off the process. But yes, it's a hundred percent YAML. And in their case, the schema... they actually were going YAML to Avro, but for performance reasons,

like, they're literally doing many millions of concurrent requests at any time, Avro was literally too slow,

it went to whatever the native format was: Go on the collector side, TypeScript interfaces on the production side.

There was some Flink stuff downstream, so it basically went to Java. But yeah, YAML was very much the interface for the drafting process, and then JSON Schema got ripped out for OpenAPI specs and stuff like that.

Super interesting. Just my two cents there. So, there's a product name that we haven't said at all for the past 41 minutes that I just wanna say out loud, because I think it should be in this conversation.

How does Fivetran fit into all of this? In a world where there are tighter guarantees between data production and consumption, how does the EL model typified by Fivetran, and plenty of others, fit into the mix here?

Yeah, when I was thinking about this earlier, I was almost thinking the contracts almost have to be at that level,

enforced by something like Fivetran, because it's the system that actually has connections to all of the source systems and can run a little query against them and ask: hey, does this fit my schema?

But the other part of it is, sometimes they're gonna do transforms, or dump things into a table downstream in your data warehouse, where it's not gonna be obvious when the schema is broken, because you're adding onto something where the schema already exists. So it almost needs to be something that Fivetran or a similar system can check.
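(That "run a little query against the source" idea might look like this sketch: compare the columns a contract expects against what the source's information_schema reports. The DB-API connection object and the contract format are illustrative.)

```python
# Sketch: detect schema drift between a source table and a contract.
def schema_drift(conn, table: str, contract: dict) -> list:
    """contract maps column name -> expected data type, e.g. {"order_id": "integer"}."""
    cur = conn.cursor()
    cur.execute(
        "select column_name, data_type from information_schema.columns "
        "where table_name = %s",
        (table,),
    )
    actual = {name.lower(): dtype.lower() for name, dtype in cur.fetchall()}
    problems = []
    for col, expected in contract.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col] != expected:
            problems.append(f"type changed: {col} is {actual[col]}, expected {expected}")
    return problems
```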

So we went from talking about the transform system generating the contracts, maybe enforcing the contracts, and now the EL system generating them. We're crawling our way up the stack, and I just don't know why we don't crawl all the way.

You said that the EL system has access to all the sources and can run queries against them. Well, the other thing that can access the source is the source itself.

And I like the idea of failing these analytics CI builds, but why don't we just start at the source and fail the product CI build?

It just seems like we're contorting ourselves backwards to try to come up with technical solutions downstream, when there are technical solutions upstream that already exist.

You don't need to be fancy, right? It's the normal CI process for these sources. And all we need to do to leverage those is process, right?

This is all process debt. I think Benn Stancil, who joined the conversation, had this great line in his last post on data contracts.

Something about (well, he has the opposite take from me) how the great spin here is trying to turn technical problems into process problems.

And I think the exact opposite. I mean, I think he's right that that's a cleaving point here, but we're trying to turn what is fundamentally a process problem, for which there's already technology to help mediate upstream at the sources, into a technology problem.

Why are we contorting ourselves backwards to solve this downstream? Well, that's fun <laugh>. Hundred percent. Yeah. So I definitely agree with that, especially when it's internal.

I think the point about Fivetran is that we're dealing with third parties, and therefore we can't push upstream. Is that what you meant, Tristan, in part?

There's a little bit of that. But also, I feel like even if you're using Fivetran to replicate application databases, where you control the source and you control the target, the fact that they are mediating the data transport means there's probably a role for them to play there.

And you could theoretically work around them, but I just feel like the technology that's moving the data somehow needs to be involved in connecting the beginning and the end.

So what I was gonna say before I asked that: with Fivetran, when it's a third party, there are actually two contracts. There's the one between the third party and Fivetran, and then there's the one between Fivetran and their customer, who they're landing the data to.

There's the one between the third party and Tran, and then there's the one between five Trend and their customer who, who they're landing the data to.

Mm. Right? I think five for their last, I don't, you don't know about the bit between five Tran and their, their source, right?

But other than if you are the source, but when it's between five trying, and you, I think they, they kind of do have, have that contract where they, they've got this, they've draw out the ERDs, they tell you about what's gonna, what, what version of this api it's gonna be, They tell you when they're gonna deprecate it, they tell you when there's gonna be changes, so they've kind of got a contract.

It's just, it's entirely on their terms, but it's, it is there I feel to an extent. Okay. For me, like I wanna push back a bit on what a, be a be said.

I think the solutions exist for product. I haven't seen a solution that exists for SaaS tools, specifically Salesforce. In my mind, in my ideal world, that's where the tests need to live,

not really in the ingest, Fivetran kind of layer. I think it should just prevent you from creating a Salesforce object, right?

At creation is the way I would want to think about it. But in case that's not available, I think we'll basically converge on the standard software engineering practice (I was telling somebody about this earlier) of dead-letter queues.

And I think Fivetran will start to have a dead-letter queue of: these rows didn't pass these tests, so we've put them in this other quarantine box, and you can go look at them and figure out if you wanna change your tests, or pass them through, or whatever. Right?
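(A minimal sketch of that dead-letter pattern, with invented row-level checks: failing rows are quarantined with a record of which checks failed, instead of blocking the whole load.)

```python
# Sketch: split incoming rows into "load as usual" and "quarantine for review".
def split_rows(rows, checks):
    passed, quarantined = [], []
    for row in rows:
        failures = [name for name, check in checks if not check(row)]
        if failures:
            quarantined.append({**row, "_failed_checks": failures})
        else:
            passed.append(row)
    return passed, quarantined

checks = [
    ("non_negative_amount", lambda r: r.get("amount", 0) >= 0),
    ("has_order_id", lambda r: bool(r.get("order_id"))),
]
good, bad = split_rows([{"order_id": "o-1", "amount": -5}], checks)
# `good` loads normally; `bad` lands in a quarantine table so someone can decide
# whether to fix the data, relax the test, or pass the rows through.
```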

That's an interesting point, and maybe I could reframe it, if this isn't too naive a question for everyone: what actually needs to go in one of these contracts?

I always think of the case where a column gets renamed, or added, or removed, or has a data type we're not expecting. But there are lots and lots of expectations here that could be violated even if a column hasn't been renamed or changed its type; it could simply be redefined in a way that's pretty subtle and hard to pin down, right?

Like, there needs to be some place where you can define... Yeah. So I think there are two parts to it.

Some part of it is gonna be manual: if your currency has to be greater than or equal to zero, no schema checker is really gonna know about that.

But I think we can borrow from a lot of the work Avro specifically has done, where they have this concept of backwards-compatible and backwards-incompatible changes.

And they've defined this pretty well in their spec: essentially, if you add a new column, that's backwards compatible; if you change the type of a column, that's backwards incompatible. And I think one of the more interesting things that came out of it is that if you make a required field optional, that's backwards compatible, but not the other way around.

So they have a bunch of rules around stuff like this, and I think starting there is a good base point, and then you can layer on top.
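(Those rules are straightforward to encode. Here is a toy checker in the spirit of, but much simpler than, Avro's compatibility rules, over an invented column-spec format.)

```python
# Sketch: backward-compatibility check between two versions of a contract.
# Each contract is {"columns": {name: {"type": ..., "required": bool}}}.
def is_backward_compatible(old: dict, new: dict):
    errors = []
    for col, spec in old["columns"].items():
        if col not in new["columns"]:
            errors.append(f"removed column: {col}")
            continue
        new_spec = new["columns"][col]
        if new_spec["type"] != spec["type"]:
            errors.append(f"type change on {col}: {spec['type']} -> {new_spec['type']}")
        # required -> optional is fine; optional -> required breaks old producers.
        if new_spec.get("required", False) and not spec.get("required", False):
            errors.append(f"{col} became required")
    for col, spec in new["columns"].items():
        if col not in old["columns"] and spec.get("required", False):
            errors.append(f"new required column: {col}")
    return (not errors, errors)
```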

Yeah, and I think I'm even getting at: you have a calculated field in Salesforce, and the definition of the calculation has changed, but nothing else has. Yep.

Protocol buffers have similar mechanisms too, but there's not really a good union of the functionality of Avro and protocol buffers and these other things.

So in the world of protocol buffers, you can never take stuff out without incrementing and creating an entirely new field.

But then you don't get the JSON Schema things, where you can say, make sure this is greater than or equal to zero, which you can't do in other systems.

So you end up having protocol buffers, and then some other thing that layers on top, and then another thing that layers on top of that, and from the analyst perspective, the downstream consumer perspective, it's so much tedium that probably isn't useful.

But yeah, I think for JSON Schema particularly, if they just made fields nullable by default, that would remove so much boilerplate. Like, it's incredible.
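(For context on that boilerplate complaint: JSON Schema only accepts null where you allow it explicitly, so every nullable field needs a type union, as in this illustrative snippet.)

```python
# Today: null must be allowed explicitly, field by field.
verbose = {"email": {"type": ["string", "null"]}}

# The wish: write the simple form and have null allowed by default.
wished = {"email": {"type": "string"}}
```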

Like it's incredible. Yeah, definitely. There's, there's all kinds of, uh, and Jason typed that actually is like a really interesting Kind of like, uh, you know, came out of segment, uh, segment still uses Jason Schema quite a bit.

Uh, but, uh, JTD is this a very interesting project out of, out of segment that has the boiler plate removed.

So we've got 10 minutes left in this conversation. Well, I'll say this. Has anyone looked at, um, the get up discussions and some of the work we've been talking about, cuz I'm noticing here the center of gravity of this conversation is around how do we create contract mechanisms between the EL and the T versus the work My teammates, Matt, Doug and I have been doing, have been around contracts within the T itself.

And so I guess maybe taking a step back, like, um, I think both need to be solved for, but I guess it sounds like the most painful signal is coming from the EL plus T versus my lens and worldview necessitates I focus on the tea at, at this time just given my role and my biases.

I guess: are we focusing on the right components here? Can you say more about what you mean by focusing on the T?

Is it like, this model produces some schema, and then downstream ones know exactly what they're using, or something else?

Yeah, along those lines. I'll name a problem statement I hear from customers all the time: we have 2,000-plus dbt models across five projects.

One is owned by a core team; there's a finance team, a marketing team, et cetera. It's really hard to unite them together.

You know what we do today? We duplicate logic and sources. I heard signals of that earlier from you, David, where it's like: hey, the output of project A, we're gonna integrate that into project B and just replicate the logic as sources.

We're gonna retest it. I know it's a waste of time and energy to retest it, but just for peace of mind, that's what we're gonna do, to be conservative from project B's perspective. And that mechanism doesn't scale very well when you get to 20-plus, 30-plus, 40-plus team members.

And so the problem we're focused on in the short term with this research is: how do we enable large internal data teams to work ergonomically with each other?

That's what I'm talking about. I hear you on the monorepo piece, but we have organizations... I've even heard from a prospect that came to us: hey, we have a genuine use case to unite lineage across our core project and the projects of our contractors, and we need some clear least-privilege mechanisms to make that happen.

I don't want them to see all my code as a dbt package, right?

That's a non-negotiable. I just want to expose these two nodes out of a hundred; they only get to see the code for those, via some elegant mechanism that makes that happen.

Right? I guess I just wonder... I'd be surprised if this is such a prevalent problem. Of course, you see far more customers and users of dbt than I do; it just seems like most people should be able to use dbt tests and take advantage of a monorepo.

Is this actually something... do you see a ton of people that can't use dbt tests effectively to manage this and have to go poly-repo?

When it comes to the monorepo, managing that amongst our largest customers can feel like a hot-potato exercise of who owns what, and who has to babysit these 10 concurrent PRs at any given time to make sure things aren't going wrong.

Plus, it's easier for teams to reason about things with the poly-repo approach. Imagine I'm a new data team hire, and on day one I see 2,000 SQL files in my models folder.

I'm like, this is too much. I'm not even gonna bother thinking through whether I'm duplicating something or not; I'm just gonna have at it, right?

Versus when you have your own subset: hey, I'm in project B versus project A, okay, I know what universe I'm in, I can reason about that. Why can't you just have a folder?

You know, this is kind of like the argument that we should have distributed systems to reason about things, but you could also just have a monorepo and have folders.

Yeah. I mean, we've always tried to choose which things to solve ourselves and which things not to solve ourselves.

And one of the reasons that I think dbt moved into more complex organizations a little faster and more seamlessly than a lot of other data products is that we actually don't try to do things like solve permissions at all.

We rely on git to solve permissions, or we rely on the data warehouse to solve permissions. And right now, inside a given dbt project, there's an assumption that it all uses the same git credentials.

It all uses the same data warehouse credentials. So even something as simple as managing who has permission to operate on a warehouse, or to operate on the files, requires you to split things out into multiple projects if you wanna use the same mechanism; that is, if you don't want all of the authorization to suddenly be managed by dbt itself, which is something I have not a lot of interest in doing.

So, like you're saying with a service mesh, you need the ability to say: okay, there are these multiple applications here; they're all dbt applications; they all kind of know how to interact with one another; but they're published by different teams, and they release on different cadences, and they have all these different rules associated with them.

So we're trying to figure out how to create these kinds of boundaries, to let teams operate in their own ways without completely losing track of each other's work.

Their DAG still has to fit together, but they don't have to be bound by each other's complexity.

Yeah, go ahead. Yeah, I was gonna say: I've seen organizations, particularly in FinTech, which is pretty big in London, where they have dbt models that are doing regulated things that they don't necessarily want the rest of the company to see.

And it's quite hard, if it's in a monorepo, to restrict that. Mm-hmm <affirmative>. We deal with HR data that way.

Like, we just can't let your standard dbt user access HR data in the same way that it's gonna access your orders table.

So, I was speaking to Pete about this topic recently, and he was saying that dbt tests are great as quality tests, but they're not actually what you'd define as a unit test.

I think that's something that probably needs to come in: you have a fixed data set that you can use as an input to a model, and you test the model once you've changed it. Does it still produce the same output?

And if it doesn't, then your model has probably broken the data contract of some downstream consumer.
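(A hedged sketch of that kind of model unit test, with pandas standing in for the warehouse and an invented orders model: pin a small input fixture and an expected output, and fail CI when the transformation's output changes.)

```python
# Sketch: a unit test over a fixed fixture, the pattern being described.
import pandas as pd

def orders_model(stg_orders: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for a dbt model's transformation logic."""
    out = stg_orders[stg_orders["status"] != "cancelled"].copy()
    out["amount_usd"] = out["amount_cents"] / 100
    return out[["order_id", "amount_usd"]]

def test_orders_model():
    fixture = pd.DataFrame({
        "order_id": [1, 2],
        "status": ["complete", "cancelled"],
        "amount_cents": [1250, 999],
    })
    expected = pd.DataFrame({"order_id": [1], "amount_usd": [12.5]})
    # Any change to the model's logic that alters this output fails the test,
    # flagging a potential break of the downstream contract.
    pd.testing.assert_frame_equal(
        orders_model(fixture).reset_index(drop=True), expected
    )
```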

I think we've gotta start doing some things like that. Shameless plug: I have a package for that, called datamocktool, for dbt, if any of you want to use it. But, not exactly on the permissions part, I feel like a lot of what we're talking about (sorry to go back to this) is what a statically typed language like TypeScript solves in software engineering. And I wonder if you could just add some concept of types, like columns with types, or interfaces, to dbt models, and give people a lot more power to compose them and interact with them that way.

I literally was just sending Benn a side chat saying: I feel like what we're trying to do is build a modern data stack that is statically typed, and we just don't quite know how to do that.

I mean, this thing that we had, a dbt-internal thing, did this, and it was effectively a stupid data contract. I could share it in the chat.

The way it got created was basically that we just had a table definition up front of every model.

So there was typing on that, and it enforced a lot of things that we didn't think of as any kind of contract. But a lot of things did not get broken because of it, because everything was defined up front, and it was a very simple way of doing things.

We actually did it because we were kinda lazy: we had to do it to generate the DDL.

It literally just took that. But it was effective in that way. Let me see if I can find this link.

It was this. So we did something like that, where these were our .sql files.

But they had this typing up front, so you could then understand the schema of every model.

We didn't actually do this, but you could parse the schemas of every model in a much richer way than what you can do in dbt today.
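(A sketch of that pattern with an invented header syntax: a typed column block at the top of each .sql file that a script can parse to generate DDL and expose each model's schema.)

```python
# Sketch: parse a typed header comment from a .sql file and emit DDL.
import re

HEADER = """
-- column: id          integer
-- column: email       varchar(256)
-- column: created_at  timestamp
select ...
"""

def ddl_from_header(sql: str, table: str) -> str:
    cols = re.findall(r"--\s*column:\s*(\w+)\s+(\S+)", sql)
    body = ",\n  ".join(f"{name} {dtype}" for name, dtype in cols)
    return f"create table {table} (\n  {body}\n);"

print(ddl_from_header(HEADER, "analytics.users"))
```

Because the DDL is generated from the header, the declared schema can never silently drift from the table that actually gets built, which is the "lazy but effective" contract being described.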

Yeah. After writing Python for quite a few years, and writing more TypeScript and Go recently, my life has improved drastically because of some of that. I don't wanna get into language wars, but it really has.

Things that I had to be really concerned about when writing Python, I just don't have to think about, and VS Code just yells at me if I do something dumb, which happens a lot.

So yeah, it's a substantively better quality of life. Yeah. Thanks, everyone, for joining. We're at the top of the hour, so I know a couple folks have to run.

This was a lot of fun. Thanks for taking the time out of your day. We might do some more of these in the future.

So if you liked it, and especially if you didn't like it or have thoughts about how we should do it differently next time, definitely let us know, and keep hanging out in that channel.

Have more conversations like this one, yada yada. Any closing notes that you wanna share? I'm gonna post a little survey, a quick one, inspired by your feedback comments.

So thank you, everyone, for joining in. We'll definitely have more of these, so stay tuned in the channel.

Thank you. Bye.

