Schema Design: One Fact, One Place

Concept. Schema design decides which tables hold which columns. One rule does most of the work: store each fact once, in one place.

Intuition. Put everything in one wide table and a user's email rides along on every listen. The day Mickey changes his email, you have to fix it in every row, and you will miss one. Split the repeated facts into their own tables so each lives once. That split is normalization.

Example. The Spotify schema you have used since day one is already normalized. That is why it is three tables (Users, Songs, Listens), not one.

The problem: one big table repeats itself

One big table repeats itself Left: a single wide table holding listen_id, name, email, title, artist, and rating. Mickey's email is copied on all three of his rows (highlighted amber), and the artist Taylor Swift is copied on row after row (highlighted amber). Right: three red cards naming the anomalies this repetition causes. Update: Mickey changes his email, so you must fix every Mickey row and will miss one. Insert: a brand-new song nobody has played has no listen row to live on. Delete: removing Mickey's last listen deletes his email too. Color key: amber marks a fact copied many times; red marks the three anomalies that repetition causes. One big table repeats itself Put every column in one table and the same facts appear over and over. Listens, one wide table listen_id name email title artist rating 1Mickeymickey@ex.comEvermoreTaylor Swift4.5 2Mickeymickey@ex.comWillowTaylor Swift4.2 3Mickeymickey@ex.comYesterdayBeatles3.9 7Daffydaffy@ex.comEvermoreTaylor Swift2.9 8Daffydaffy@ex.comWillowTaylor Swift4.9 Mickey's email, copied on all 3 rows one artist, copied again and again What repetition costs you Update Mickey changes his email. Fix every Mickey row, miss one, and now one person has two emails. Insert Add a brand-new song nobody has played yet. There is no listen row to hang it on. Delete Delete Mickey's last listen and his email disappears with it. The fact had nowhere else to live. Color key  amber = a fact copied many times · red = the anomalies that repetition causes.

Figure 1. Repetition is the root problem. Because Mickey's email and the artist of each song are copied onto every listen, three kinds of bug become possible. An update has to touch many rows and can leave them disagreeing. An insert cannot record a song until someone plays it. A delete can erase a fact that happened to live in only one row. These are the update, insert, and delete anomalies.

The fix: give each fact one home

Pull the repeated facts out into their own tables. A user's email lives once in Users. A song's artist lives once in Songs. Listens keeps only the events, and points at the other two by key.

Give each fact one home Three tables. Users holds user_id, name, and email, with a note that Mickey's email lives here once. Songs holds song_id, title, artist, and genre, with a note that Taylor Swift lives here once. Listens at the bottom holds listen_id, user_id, song_id, and rating, and two arrows run from its user_id to Users and from its song_id to Songs, the foreign keys. Color key: green is the normalized design; the arrows are foreign keys that point at the one place each fact lives. Give each fact one home Each fact is stored once. Listens points to it by key. Users user_id (PK) · name · email 1 Mickey mickey@ex.com 3 Daffy daffy@ex.com Mickey's email lives here once Songs song_id (PK) · title · artist · genre 1 Evermore Taylor Swift 2 Willow Taylor Swift Taylor Swift lives here once Listens listen_id (PK) user_id (FK) · song_id (FK) · rating just the events, pointing by key Color key  green = the normalized design · arrows are foreign keys pointing at the one place each fact lives.

Figure 2. This split is normalization. The strict, formal version has a name, Boyce-Codd Normal Form (BCNF), but the intuition is the whole point: no fact is stored twice, so nothing can drift out of sync. Change Mickey's email in one row of Users and every query sees the new one. A foreign key, like `Listens.user_id`, is just a pointer to the single home of a fact.

Splitting the table is exactly why you JOIN. A query recombines the facts you separated, so every join you wrote in Module 1 was putting a normalized schema back together to answer one question. Normalization and joins are two halves of one design: pull facts apart to store them safely, join them back to use them.

The trade-off: normalize first, denormalize only when forced

Normalize first. Denormalize only when forced. Two panels. Left, green, Normalized: each fact once, no anomalies, recombine with a JOIN; the cost is a join at read time. Right, amber, Denormalized: copy facts back together for fast reads with no join, accepting the risk that the copies drift apart; used for read-heavy analytics at scale. Color key: green is the safe default, amber is the deliberate exception you take only when a measured read cost forces it. Normalize first. Denormalize only when forced. Same data, two layouts, opposite trade-offs. Normalized (the default) Each fact stored once. No update, insert, or delete anomalies. Cost: you JOIN to recombine at read time. safe by default; correctness first Denormalized (the exception) Copy facts back together so a read needs no join. Faster reads. Cost: the copies can drift apart. used for read-heavy analytics at scale Color key  green = the safe default · amber = the deliberate exception, taken only when a measured read cost forces it.

Figure 3. Normalization buys correctness and pays with joins. Denormalization buys read speed and pays with the risk that copies of the same fact disagree. Default to normalized; reach for denormalization only when a real, measured read cost forces it. You will meet denormalization again at scale in Module 5, where copying data on purpose is how a system serves millions of reads.

So the three-table Spotify schema was a design choice all along: one fact, one place, recombined by joins. Keep the rule in your head every time you sketch a table. If a column repeats the same value down the rows, it probably belongs in a table of its own.