idempotence
same same but same
Idempotence. I don’t think this word showed up in my SAT prep books (maybe because I didn’t read them… I promise I’ll pay you back mom & dad), but it’s a crucial concept to understand for all data plumbers.
Let’s say your AWS environment goes down for “maintenance” (sadly, this happens more often than you would think…) and all your data pipelines fail. Once AWS is back online (hopefully sooner than 12 hours this time), you want to be able to just click the big green arrow in Airflow and run them again without duplicating or mangling anything in your downstream tables. You can even click the big green arrow again …or even 1,000 more times (sorry finance for the exorbitant compute costs this month), and the data should not change. That’s the essence of idempotence.
mAtH Is cOoL. Let’s look at a mathematical equation that encompasses idempotence. This probably doesn’t help you understand the concept any better, but I wanted this post to seem more legit.
f(f(x)) = f(x)
See? Did that really help? Anyways, in plain English, this “fancy” equation means:
If applying a function to its own output gives you the same result as applying it once, the function is idempotent. Run it once or run it a thousand times, you end up in the same place.
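If code reads easier than math for you, here is a tiny sketch of the same idea. Both functions are made up for illustration: `normalize` is idempotent, `append_suffix` is not.

```python
def normalize(name: str) -> str:
    """Trim whitespace and lowercase. Running it again changes nothing."""
    return name.strip().lower()


def append_suffix(name: str) -> str:
    """Tack on a suffix. Every extra call changes the output again."""
    return name + "_v2"


x = "  Data Plumber  "

# f(f(x)) == f(x): idempotent
normalize_is_idempotent = normalize(normalize(x)) == normalize(x)

# f(f(x)) != f(x): not idempotent
append_is_idempotent = append_suffix(append_suffix(x)) == append_suffix(x)

print(normalize_is_idempotent, append_is_idempotent)
```

Swap “function” for “pipeline run” and “x” for “your table” and you have the whole post.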
Why is this important?
Nuke the dupes. Yeah. Dupes are bad. Analysts hate them, and they will hate you if they find them. You don’t want them to hate you. Trust me. They’re scary.
If your data pipeline is idempotent, it should elegantly identify / overwrite rows that already exist using a table’s primary key definition or partition mappings. If you don’t have the primary keys / partitions, or your pipeline is append-only, your pipeline should have a rollback mechanism in case of a failure.
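Here is a rough sketch of the primary key approach, with an in-memory SQLite database standing in for your warehouse. The table names and sample rows are invented, and `INSERT OR REPLACE` is SQLite’s flavor of upsert (your warehouse probably spells it differently, so treat this as an illustration, not a recipe).

```python
import sqlite3

# Hypothetical batch that the pipeline loads on every run.
ROWS = [(1, "alice", 10.0), (2, "bob", 20.0)]

conn = sqlite3.connect(":memory:")
# No key: nothing stops the same row from landing twice.
conn.execute("CREATE TABLE orders_append (id INTEGER, customer TEXT, amount REAL)")
# Primary key: a rerun overwrites the existing row instead of adding a copy.
conn.execute(
    "CREATE TABLE orders_upsert (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)


def load_append(rows):
    """Append-only load: not idempotent."""
    conn.executemany("INSERT INTO orders_append VALUES (?, ?, ?)", rows)
    conn.commit()


def load_upsert(rows):
    """Keyed load: rows with an existing id are replaced, not duplicated."""
    conn.executemany("INSERT OR REPLACE INTO orders_upsert VALUES (?, ?, ?)", rows)
    conn.commit()


def row_count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Click the big green arrow three times.
for _ in range(3):
    load_append(ROWS)
    load_upsert(ROWS)

append_count = row_count("orders_append")  # grows with every run
upsert_count = row_count("orders_upsert")  # stays at the size of the batch
print(append_count, upsert_count)
```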
If your pipeline is NOT idempotent, and one run fails with a partial load, you’ll end up with duplicates on the next run. The analysts will hate you.
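To make the partial load scenario concrete, here is a minimal sketch of a partition overwrite wrapped in a transaction, again using in-memory SQLite as a stand-in. The `events` table, the `ds` partition column, and the `fail_after` knob that fakes an outage are all assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ds TEXT, user TEXT, clicks INTEGER)")
conn.commit()

# Hypothetical rows for a single date partition.
ROWS = [("alice", 3), ("bob", 5), ("carol", 1)]


def load_partition(ds, rows, fail_after=None):
    """Delete the partition, then reload it, all in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("DELETE FROM events WHERE ds = ?", (ds,))
        for i, (user, clicks) in enumerate(rows):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated outage mid-load")
            conn.execute("INSERT INTO events VALUES (?, ?, ?)", (ds, user, clicks))


def row_count():
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


# First run dies after inserting two of the three rows.
try:
    load_partition("2024-01-01", ROWS, fail_after=2)
except RuntimeError:
    pass
count_after_failure = row_count()  # the rollback leaves no partial load behind

# Rerun it, then rerun it again for good measure.
load_partition("2024-01-01", ROWS)
load_partition("2024-01-01", ROWS)
count_after_reruns = row_count()

print(count_after_failure, count_after_reruns)
```

The failed run leaves nothing behind, and the reruns replace the partition instead of stacking on top of it, so the scary analyst never sees a dupe.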
More compute (hi, finance). Dupes = more rows = more compute required = more cloud costs.
With more rows comes more compute. And in this case, more rows is not actually more data. It’s the same data, just duplicated. If we have duplicate rows, we need separate jobs to clean them up, leading to even more compute costs. If finance realizes you unnecessarily contributed to those costs, they might just take it out of your paycheck…
Last, and definitely not least, fewer fires. We all hate it. Getting pinged by the scary analyst. Scavenging through random .parquet, .avro, .csv, and .wtf files to figure out what’s going on. What the hell is going on? Oh yeah, my pipeline is not idempotent…
Idempotent data pipelines mean less time banging your head on your keyboard while fixing self-created data bugs, and more time writing dumb Medium posts that no one will read. Just look at me :)
Final words. Spin up your dev environment. Open up that not so pretty Airflow console. See the big green arrow? Click it. Did the output data change?
- Yes — Go refactor that crappy pipeline to be idempotent. You’ll thank me later.
- No — Go write a dumb substack post about how great your pipeline is.
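If eyeballing the output feels too manual, the “did the data change?” test can be scripted. This is only a sketch: the table is a plain Python list of dicts, and both toy pipelines are invented for the demo. In real life you would fingerprint the actual output table before and after a rerun.

```python
import hashlib
import json


def fingerprint(rows):
    """Order-insensitive hash of a table's rows."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode()).hexdigest()


def append_pipeline(table):
    """Blindly appends: a rerun adds a duplicate."""
    table.append({"id": 1, "amount": 10})


def overwrite_pipeline(table):
    """Drops any existing row with the same id, then writes it back."""
    table[:] = [row for row in table if row["id"] != 1]
    table.append({"id": 1, "amount": 10})


def changes_on_rerun(pipeline):
    """Run the pipeline twice and report whether the second run changed the data."""
    table = []
    pipeline(table)
    first = fingerprint(table)
    pipeline(table)
    return fingerprint(table) != first


append_changed = changes_on_rerun(append_pipeline)        # Yes: go refactor
overwrite_changed = changes_on_rerun(overwrite_pipeline)  # No: go write a post
print(append_changed, overwrite_changed)
```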
Like this post? Hit the subscribe button to read more about data engineering (Pretty please).
Twitter: @NishantRRaman
Substack: dataplumber.substack.com