SQL Hack: How to join tables when the key is buried in a blob of text

SQL Hack: How to join tables when the key is buried in a blob of text

The Situation

Text blobs are a pain, but they are commonly used with data from email platforms, Salesforce, or dynamic UTMs. In these cases, the JOIN key is probably hidden in a blob of text.

For example, say you want to join your users table using the user_id from the Salesforce notes column. But the notes data looks like this…

Notes Column:
Ahmed (user_id: 1f65e898-0f67-4d34-8ac3-ae223bb71a79) called us today

First Attempt: Regex Parsing

Initially, you can try to use regex to parse the user_id out of the value column. This is always annoying but can get close.

If it works, then a simple subquery and join should do the trick. However, like any free text column, the regex may not work perfectly.

Regex can be risky when the structure of the data changes:

Notes Column:
Ahmed (user: 1f65e898-0f67-4d34-8ac3-ae223bb71a79)
(references “user” not “user_id”)

When regex doesn’t work, use this hack…

Solution: Use the "ilike" SQL hack

Use an ilike to JOIN data by creating a dynamic key:

FROM some_table a.    -- structured data
JOIN other_table b    -- data with the text blob	
    ON ( a.column ilike '%' || b.other_column || '%')

This can be a handy trick if you’re in a pinch, but it can also be slow, so use it with caution. And watch out for unwanted duplication in your JOIN if the “ilike” matches more than once!

Good Luck!

Check us out on the Data Engineering Podcast

Find it on the podcast page or stream it below