Hey there! I hope you’ve had a great Friday. Today, we’re wrapping up Data Modeling Week here at SQL University. Many thanks to Jorge Segarra (Blog|Twitter) for giving me the opportunity to cover these topics. Data Modeling is my first love, and I’m thrilled that I’ve had a chance to write about it. There is no way to cover anything but some of the big ideas in three posts, but there is a lot of information out there for you to learn from. I provided some resources in the last post, Entity-Relationship Modeling (Part I), and I think they bear repeating. Jump to the bottom of this post for recommendations on further learning.
Any-who… On Wednesday, we talked about the basics of an Entity-Relationship Diagram (ERD). Entities, Relationships, and Attributes were discussed, and we got the beginnings of an ERD about Country down on paper. If you haven’t read that post yet, you might want to jump over to it and go through that first. (Link in paragraph above) What we’ll talk about here will probably make more sense. Or not. All up to you, my friend.
So what are we talking about today? Refinement! No, not finishing school refinement. Wait… maybe we are. What we’re going to do is polish up the first cut of the ERD by walking through the entities, relationships, and attributes and identifying areas that don’t quite work. Then, we’ll talk about how to tweak those areas to make them walk with their shoulders back and their eyes up, like the prim and proper ERD elements they are.
<Soapbox 1> You will never get any model perfectly right on the first try. It’s okay. The fear of imperfection can create inertia, and that’s why a lot of modeling projects stall and eventually fail. Recognize that you don’t know everything up front, and that your data isn’t perfect (and neither are you). Just do what you can with what you’ve got. Something is almost always better than nothing. </Soapbox 1>
Between Wednesday and today, I went through the entities for the Country model and added them to the ERD. I added attributes and even some inherited information between the entities.
A note on attribute notation. I used underlines and italics to note some different things about the attributes. This isn’t formalized, but it helps me see what’s what. If I’m using a real modeling tool (which I’m not here), a lot of this metadata about the attributes will be handled by the modeling tool GUI.
But, for today, here’s the notation I’m using. I kind of like it, because I’m not dependent on a particular piece of software, but it tells me what’s going on. There’s a not-so-subtle message here: Care less about formal notation and more about consistency and readability. Do what works for you and your organization.
Here’s what I ended up with:
I’ll let you read through the model since you totally rock at reading ERD’s now. Next up, we’ll start addressing issues we see.
The first issue we’re going to address is Many-to-Many relationships. In the real world, these kinds of relationships happen all the time. For example, COFFEE is drunk by many PEOPLE, and PEOPLE drink many cups of COFFEE. However, in a database, Many-to-Many relationships can create all kinds of problems. Trying to query data like this can be a bit of a nightmare, so we resolve the problem with what is called an Intersect Entity. It is just what it sounds like: An entity that intersects the relationship between two entities. We take that Many-to-Many relationship and turn it into two One-to-Many relationships. You know what? It’ll make more sense if I just show you:
This is actually very simple:
1) Create a new Entity, and give it a name that is the concatenation of the names of the entities you’re intersecting
2) Split up that Many-to-Many relationship into two One-to-Many relationships, with the Many side being on the Intersect Entity. Now I have two sets of verb phrases:
“LANGUAGE may be the communication method for one or more COUNTRY LANGUAGE, and COUNTRY LANGUAGE must be a type of LANGUAGE”
“COUNTRY must communicate via one or more COUNTRY LANGUAGE, and COUNTRY LANGUAGE must be the communication method for a COUNTRY”
One thing that you’ll want to make sure you adjust in the relationships: The optionality of the intersect entity needs to be mandatory, because it’s all inherited information.
3) Allow the Intersect Entity to inherit the key from the two entities. The combination of these two becomes the key for the Intersect Entity. (Note: A key made up of more than one attribute is known as a composite key)
4) Finally, add non-key attributes to the Intersect Entity. Now, this is very, very cool. Remember on Wednesday when we talked about Attribute rules? One of them was “Attributes must be atomic”. The example provided was about the percentage of people who speak a language within a country. An attribute like “English-60%” is not valid. However, we can now take advantage of that Intersect Entity, and we have a perfect place to put that information. Percentage is only one thing (atomic), and it helps to define the combination of COUNTRY and LANGUAGE.
We have one other Many-to-Many to resolve: CONTINENT and COUNTRY. Trivia Question: What countries are transcontinental? Answer: Egypt, Russia, Turkey, Kazakhstan, Azerbaijan, and Georgia. Keep that one in your back pocket for your next team trivia night.
When there is a relationship between two entities, the child entity (the one on the Many side of a One-to-Many relationship) will inherit the key of the parent entity (the one on the One side of a One-to-Many relationship). If you’re familiar with physical database implementation, these are your foreign keys. Thinking in the context of an ERD, there is a very important principle at play:
An Entity owns its data.
An entity is the one and only place for information about that particular thing. I don’t want information in more than one place. Let’s look at COUNTRY and GOVERNMENT TYPE:
I have an inherited key of Government Type Name. I also have an attribute in COUNTRY of Government Type. Because I’ve inherited the key of the GOVERNMENT TYPE entity, I don’t need the Government Type attribute any more. The advantage of allowing entities to inherit attributes like this is that I have a single place to store Government Type information, and will ensure that I don’t get variations on the descriptors later on. We’ll remove the Government Type attribute from COUNTRY.
The Aggregation Question
Another attribute that we should look at is Population. In my original ERD, I have Population in multiple entities: CONTINENT, COUNTRY, STATE/PROVINCE, and CITY. This is a tough one, and is going to require a decision on your part. Here’s what you need to ask yourself: Can the Population of a CITY aggregate up to a STATE/PROVINCE Population, then up to COUNTRY, then up to CONTINENT? Ideally, yes. In practice? Maybe not. Not only do you need to think about the principles of data modeling here, you need to think about your data quality and your business.
From a purist standpoint, we should store this attribute at the lowest level we can (CITY), and depend on the relationships between the entities to allow us to calculate Population at a higher level (STATE/PROVINCE, COUNTRY, or CONTINENT). From a pragmatic standpoint, we don’t really know if the entire population of a continent lives in defined cities. Also, are we looking for estimates or hard numbers?
The downside to storing an attribute in multiple places is that now I’m going to have to maintain that attribute multiple times. Let’s say I increase the Population of Atlanta by 5,000. Now I have to go through and increase the Population of Georgia by 5,000, the Population of the United States by 5,000, and the Population of North America by 5,000. That’s a bit of a nightmare. Conversely, what happens if I get an update to the population of North America? How do I push that data down to the lower levels? Allocation rules? Let me tell you something about allocation rules… they aren’t fun. I still get the shakes when I think of a system I worked on that had to manage financial allocations.
So, all that being said, be a purist and approach this question optimistically. If you’re not sure, assume you can define Population at the CITY level and aggregate up as needed. If someone proves you wrong, ask yourself if you really need Population at every level. Remember, you can always change it later, and use physical implementation to manage your data. Let the structure of the ERD work for you and keep things as simple as possible for as long as possible. We’re going to remove Population from STATE/PROVINCE, COUNTRY, and CONTINENT.
I could make the same argument about Area (as in the measurement width X height). But looking at the business rules, how often does the area of a continent change? Especially the “official” measurement. Same goes for COUNTRY and STATE/PROVINCE. While I could theoretically aggregate this data, I don’t see a lot of value in it. Entity-Relationship Modeling isn’t all about following checklists and rules. Sometimes you’re going to have to make judgment calls. Don’t panic. First, you’re pretty smart. Second, you can always change the model if you chose wrong.
Let’s take a quick look at the current state of our model now that we’ve made some tweaks:
Wow, she’s looking pretty good so far. Let’s see what else we can to do polish her up.
We touched on attribute granularity a little when talking about Population and Area. We also have another, less obvious, issue. Take a look at STATE/PROVINCE. We have two attributes, Capital City and Largest City. You can definitely make the argument that these attributes help define a STATE/PROVINCE. But, sitting back and thinking about it for a while, we also have a CITY entity. We could also define a CITY as being the Capital City or the Largest City.
Plus, remember that principle we talked about earlier? Entities own their data. Which entity owns information about cities? CITY does. STATE/PROVINCE has a relationship to CITY, but it shouldn’t be in the business of defining information about cities. CITY is quite capable of doing that herself, thankyouverymuch. I want attributes to define business keys at the lowest level possible. Plus, we’ve got a potential data integrity issue here. What if someone fat-fingers the value for Capital City and types “Sacremento” instead of “Sacramento”? And what if I want to know the population of the capital of California? Now I can’t get that information from CITY, because the misspelled name doesn’t exist in my table. We’re going to move Capital City and Largest City out of STATE/PROVINCE and into CITY, and instead of having to store the name of the city again, now we just need some sort of indicator.
IS MY ATTRIBUTE REALLY AN ATTRIBUTE?
The final area of refinement we’re going to discuss is the one you’re most likely to miss. Again, don’t panic. If you miss it, remember that you can always clean it up later. What we want to do is look at each attribute and ask the question, “Is this really an attribute?”
Let’s take the Currency attribute in COUNTRY. At first glance, yeah, a country has a currency. Makes sense. But, let’s dig in a bit more.
1) Does this attribute have meaningful information I’d like to store about it?
Remember the initial definition of an Entity? A thing/object that is significant enough that we need to know information about it. What might we want to know about currency? Well, we might care about exchange rate and stability. I can’t throw Exchange Rate into COUNTRY, because it doesn’t describe the COUNTRY, it describes the CURRENCY. (Attributes should not describe other attributes) Oh, look at that… CURRENCY just got upgraded to ALL CAPS. It’s an entity now.
2) Will the values in this attribute repeat?
Another way to evaluate an attribute is to look ahead to the data that might be stored in it. Looking at Currency again, let’s think about what will end up in this attribute. Many countries have their own currency, so those values wouldn’t repeat. But… Euro-Zone. Most of Europe is now on a single currency. Do I want to store the value “Euro” multiple times in the COUNTRY entity, or do I want to store it one time in a CURRENCY entity and allow the COUNTRY entity to inherit the CURRENCY key? Naysayers will tell you that more entities make a model overly complex. I disagree. Having to manage one value across multiple instances of an entity makes a model complex.
<Soapbox 2> I believe that data modeling is 90% thinking and 10% physical labor. This can be tough for someone who likes Seeing Things Happening to understand. Give yourself time to think. Digest your model, think about it, and then let it stew for a bit. If you’re anything like me, you’ll be in the shower at 6:00 in the morning, and somewhere between shampoo and conditioner, you’ll go, “Hey! Currency isn’t an attribute!” Epiphanies are important, and fight for the time to allow them to happen. </Soapbox 2>
Now we have two good arguments for a CURRENCY entity. Let’s add it to the model:
This has turned into a pretty clean model. We’ve eliminated our Many-to-Many relationships, ensured that attributes live in the correct entities, that attributes really are attributes. I’m pretty happy with where we ended up. As I’ve said a few times, there is so much to learn about Entity-Relationship Modeling that there is no way we could cover it in a single week. But, thank you for sticking it out and reading what I have to say about one of my very favorite topics.
Now get out there and model something! It’s fun, I promise!
As promised: Resources!
A note: Not only do these three people write about data modeling, they also present regularly. Keep an eye on their blogs for great content and information about where you can go watch them teach in person.
Case*Method: Entity Relationship Modelling by Richard Barker
This book is the one I received as course material in my very first database design class in 1997. I still use it. In fact, it’s sitting beside me right now so I can reference it as I work on this post. It’s out of print, but you can still find new and used versions for sale online. First published in 1990, it’s my personal definitive guide to purist modeling. It says Oracle on the cover, and you’re going to see COBOL code inside, but trust me… you’ll learn from this book.
Pro SQL Server 2008 Relational Database Design and Implementation by Louis Davidson, et al
Louis does an amazing job of teaching relational data modeling as well as couching his work in the context of SQL Server. If you’re a SQL Server professional (or want to be), pick up this book.
Data Model Patterns: Conventions of Thought by David C. Hay
Once upon a time, I used to think that all work I did had to be original. Now, I’m older and wiser. Mr. Hay provides an entire book of data model patterns for different industries and types of data. It’s a great reference to see how to do it right, and to get some ideas for your own database.