dbt Contracts: Public/Private dbt Models: Monorepo Demo

8 mins

Transcript

Folks, this is sunk speaking here, and I'm gonna give you a rundown of my research so far when it comes to public and private models within a monorepo.

And so, taking a step back, the problem I wanna solve for is this: let's say you have, theoretically, 2,000 dbt models within a project, contained in various subfolders. Think things like core, finance, marketing, sales.

It's very easy to get lost in the sauce, so to speak, and wonder, hey, what do I really care about?

Let's say I'm part of the finance team that owns this subfolder of dbt models. What do I really need from core or marketing or sales to be effective in my job and transform the data that matters to me?

And so instead of running commands that run everything upstream (think things like staging models, et cetera), what if there was a mechanism that makes it really intuitive and elegant to reason about: hey, I know the core team is working on these 100 models, but the only thing I care about is model A, for example. Then I can cascade that down into my respective workflow and move on with my life.

And so, explaining what this could look and feel like, at least from a conceptual standpoint: let's say I'm on the finance team.

I developed these three things. I want to expose model D and model H, but model C is a staging model, for example, that has intermediate transformation logic.

And so what I'm doing here is just making sure that I'm reffing model A into model C, and that's kosher.

Everything's smooth sailing. But if I try and ref model B directly into model D, it'll be a broken ref. What that means is it won't use model B as part of its transformation, or rather, it won't even allow you to do that in the first place.

So instead, the data will just flow through model C into model D. But that's okay, because the logic from model B is inherently embedded within model A.

And so I'm gonna presume I have the data and scope to do the things I want to do. That same example replicates itself if I'm in a marketing subfolder and I own this set of dbt models. Same example again: hey, I'm trying to ref model I in this case, but it won't let me. But that's okay, because I'm gonna presume everything I need from the core team is in model A.
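The ref rule described above can be sketched in a few lines. This is a minimal illustration, not dbt's actual behavior: the `MODELS` table, the `can_ref` helper, and the assumption that a private model stays referenceable from inside its own subfolder are all hypothetical, chosen to match the model A through model D example.

```python
# Hypothetical layout: model name -> (subfolder, access level).
MODELS = {
    "model_a": ("core", "public"),
    "model_b": ("core", "private"),
    "model_c": ("finance", "private"),  # staging-style model, not exposed
    "model_d": ("finance", "public"),
}


def can_ref(child: str, parent: str) -> bool:
    """Return True if `child` may ref `parent`.

    Assumption: a ref is allowed when the parent is public, or when
    both models live in the same subfolder.
    """
    child_folder, _ = MODELS[child]
    parent_folder, parent_access = MODELS[parent]
    return parent_access == "public" or parent_folder == child_folder


print(can_ref("model_c", "model_a"))  # public core model into finance
print(can_ref("model_d", "model_b"))  # private core model into finance
print(can_ref("model_d", "model_c"))  # private model, same subfolder
```

Under these assumptions the first and third refs are allowed and the second is the broken ref.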

And then we go from there. So the next thing is, how do we wanna represent this within dbt docs? Because another problem we wanna solve for with this is that sometimes you get something really huge within the dbt docs lineage, and it can be really overwhelming. Hey, I don't need to look at all the staging models all the time.

I don't need to look at all the sources, for example. So what if we imagine the dbt docs graph as something like this, where instead of seeing everything blown out for these private nodes, we have lines like this that say, hey, there are private nodes hidden behind the scenes?

And so at a glance, hey, I know there are private nodes, and I don't need to care about them.

But in case I do want to investigate further, to understand dependencies over time, that's where I could click on that line, and then it shows me that private node.

I can click on that as per usual and then see the full lineage. The same premise applies if, let's say, I want to focus on the model A node and view the documentation.

I want to view that subgraph within that documentation view to get full context when I'm investigating or doing my own troubleshooting.
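One way to picture the consolidated graph is as a transformation on the lineage edges: private nodes are dropped, and any path that ran through them becomes a single edge flagged as hiding something. The sketch below is only an illustration of that idea; `collapse_private`, the edge list, and the `hidden` flag are hypothetical and are not part of dbt docs.

```python
def collapse_private(edges: list, private: set) -> set:
    """Collapse private nodes out of a lineage graph.

    `edges` is a list of (parent, child) pairs. The result is a set of
    (public_parent, public_child, hidden) triples, where `hidden` is
    True when the connection passes through at least one private node.
    """
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    all_nodes = {node for edge in edges for node in edge}
    collapsed = set()
    for start in all_nodes - private:
        # Walk downstream, passing through private nodes only.
        stack = [(child, False) for child in children.get(start, [])]
        seen = set()
        while stack:
            node, hidden = stack.pop()
            if node in private:
                if node not in seen:
                    seen.add(node)
                    stack.extend((c, True) for c in children.get(node, []))
            else:
                collapsed.add((start, node, hidden))
    return collapsed


# Hypothetical lineage: model_a feeds model_d through private model_c.
EDGES = [("model_a", "model_c"), ("model_c", "model_d"), ("model_a", "model_h")]
PRIVATE = {"model_c"}

for parent, child, hidden in sorted(collapse_private(EDGES, PRIVATE)):
    print(parent, "->", child, "(private nodes hidden)" if hidden else "")
```

A docs UI could render the flagged edges differently and expand them on click, which is the interaction described above.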

And so in summary: one, the dbt docs DAG, or graph, is a lot more consolidated, aka cleaner to reason about.

And then two, I'm not allowed to ref private models (think things like staging), because I don't need to, and I don't want to accidentally run those things when I'm doing upstream references.

Okay, now let's see just some pseudocode to see what this could look and feel like in the future.

So let's say I have something that looks like this, where I have a project YAML file and I have something like permissions over here for my staging tables.

Okay, or staging views, I should say. And I say private. Same thing goes for another example over here, where I say permissions is private.

And you're probably wondering, how does that show up in the manifest? It shows up looking something like this, where it's embedded within the node configuration.
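To make the folder-level setting and the node-level result concrete, here is a rough Python sketch of how a subfolder config could resolve onto a single node. Everything in it is an assumption for illustration: the `+permissions` key, the project and model names, the `resolve_permissions` helper, and the default of `public` are hypothetical, and the dictionary only mimics what the project YAML file might contain once parsed.

```python
# Hypothetical parsed `models:` section of the project YAML file.
PROJECT_MODELS_CONFIG = {
    "my_project": {
        "staging": {"+permissions": "private"},
        "core": {"+permissions": "public"},
    }
}


def resolve_permissions(fqn: list, config: dict, default: str = "public") -> str:
    """Walk the config tree along a node's fqn; the deepest setting wins."""
    level = config
    resolved = default
    for part in fqn:
        level = level.get(part)
        if not isinstance(level, dict):
            break  # no more folder-level config below this point
        resolved = level.get("+permissions", resolved)
    return resolved


# The resolved value ends up embedded in the node's config in the manifest.
fqn = ["my_project", "staging", "stg_parts"]
node = {
    "unique_id": "model.my_project.stg_parts",
    "fqn": fqn,
    "config": {"permissions": resolve_permissions(fqn, PROJECT_MODELS_CONFIG)},
}
print(node["config"])
```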

If it's private, now we can do something with this information within this JSON manifest. And so that's where I'm gonna run a command that looks like this, and it's gonna show you the behind-the-scenes logs that are powering this experience.

So let's say I have dim parts, but it's referencing a staging table upstream. And so it's gonna have a broken ref as a result.

And that's what I want, because it's saying, hey, you can't actually do that because it's not exposed to you yet.

And so over here right now, I have a broken ref, and it's saying error. That's because the depends-on doesn't exist, even though I know, when I look at dim parts, that I'm actually explicitly calling that out.

And so ignore these log messages. What I really care about is that there's an error in the first place. It's selecting from this stage TPC parts, and it's like, hey, how come dim parts is ignoring this piece of important dependency information?

That's where we have the notion of private nodes and permissions. I created this new protected function, denoted by this underscore, to essentially say, hey, for each of the nodes in the manifest, I want you to look up their permissions.

If it's private, I want you to add it to this dictionary of private nodes. And that's where you get logs that look like, whoops, this up here, where these are the private subfolders and these are the nodes. It's a dictionary with the unique ID for that node as the key plus the value, which is the subfolder.

We're just gonna keep track of that. And then here's just the keys for that, so we only get the unique ID nodes.

And then you're probably going, hey, how are you getting these logs over here? That's where I build the manifest and then I recreate the dependencies based off whether that node is in the private keys.

And if it is in the private keys, I remove that from the dependency list overall. That's where you see examples like this, where I said, hey, for this unique ID, dim suppliers, here's what the depends-on dependencies look like beforehand. After I remove the private nodes, it's empty.

So it's assuming there's nothing to depend on in the first place. And that's the general behavior I want, because I wanna break those references if they're private.

Now, there's some other elegant things I needed to do to have better error messaging, but that's the gist of it.
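The dependency rewrite can be sketched as a filter over each node's `depends_on` list. Again, this is an illustration under assumptions rather than the demo's code: `strip_private_dependencies` and the sample IDs are hypothetical, and the manifest is reduced to the `depends_on.nodes` field.

```python
import copy

# Hypothetical manifest fragment: dim_suppliers depends on a private staging node.
MANIFEST = {
    "nodes": {
        "model.my_project.stg_suppliers": {"depends_on": {"nodes": []}},
        "model.my_project.dim_suppliers": {
            "depends_on": {"nodes": ["model.my_project.stg_suppliers"]}
        },
    }
}
PRIVATE_IDS = {"model.my_project.stg_suppliers"}


def strip_private_dependencies(manifest: dict, private_ids: set) -> dict:
    """Return a copy of the manifest with private nodes removed from depends_on."""
    pruned = copy.deepcopy(manifest)  # leave the original manifest untouched
    for node in pruned["nodes"].values():
        before = node["depends_on"]["nodes"]
        node["depends_on"]["nodes"] = [d for d in before if d not in private_ids]
    return pruned


pruned = strip_private_dependencies(MANIFEST, PRIVATE_IDS)
dim = "model.my_project.dim_suppliers"
print("before:", MANIFEST["nodes"][dim]["depends_on"]["nodes"])
print("after: ", pruned["nodes"][dim]["depends_on"]["nodes"])
```

With the private node filtered out, the dependency list for the downstream model comes back empty, which is what breaks the ref.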

But it's cool that it was a bit easier than I thought to inject some hacks to make dependencies be removed in the first place, which is super, super cool.

And so going back here, in summary, we wanna have broken references to private models so that it forces behaviors to only hit public or published models that a core team wants to expose in the first place, all in the context of a monorepo.

Now the question is, is this useful for everyone or only some people? My gut tells me this is primarily useful for teams that have 20-plus team members, to avoid accidentally running things in the core data mart when they really don't need to.

And imagine having, like, five people per subfolder as a generic example. How do we make sure that these people are as productive as possible, without having to comb through dozens of these private nodes just to understand, hey, do I really just need model A to be productive with my subject matter and move on with my life?

And so, easier said than done, but it's cool to know that, at least getting started, it's a bit more intuitive than not.

And that's so awesome. And so, yeah, let me know your feedback. Is this heading in the right direction, is it not?

And then, uh, we'll go from there. All right. See ya.
