Row Access Policies to Implement a Data Clean Room in Snowflake

by Rackspace Technology Staff

It’s common for advertisers to ask for more analytics from their publishers. They want to know if their ads performed well, along with a series of other factors. But the publishers cannot traditionally share their sensitive or proprietary data.

Two such partners may have their own private lists of users, customers, and products, that they cannot share with each other. Because of PII and other privacy concerns, or simply because this is valuable proprietary data.

However, they could eventually try to derive some information with real business value. For instance, they could try to internally match their customers and see how many they have in common. But how and where to combine those lists, when both parties do not want to share the details?

Yao’s Millionaires’ Problem revisited

I wrote several times in the past about this famous puzzle, that goes like this:

How can two millionaires (Alice and Bob) determine whose net worth is higher without telling each other how much money they each have?

You see how similar the problem is with the scenario described before. And we used to implement this in Snowflake with a secure view and a secure function, but not without challenges:

technical-blog-Row-Access-Policies-for-a-Data-Clean-Room-in-Snowflake-1

A while ago, I challenged my coworkers (and my fellow Snowflake “Data Superheroes”) with a variation of the problem, to avoid looking for a solution by simply googling. And let’s implement this simple variation here, in Snowflake:

Two different users have exclusive access each to a different table, in which they each store some text. Let’s say Jack stores “I am great” in his table, and Mary stores “Friday it is” in her table. None of them knows what the other stored. Jack needs to know if they both stored the same text, but without telling Mary what he stored. How can they do this in Snowflake?

Yao’s Millionaires’ Problem implemented

Let’s start by implementing the infrastructure, which is straightforward. We create both tables in a separate test database, but each owned and accessed by two different roles: Jack and Mary.

use role accountadmin;
use warehouse compute_wh;
create or replace database yao_db;
create or replace role jack;
grant role jack to role accountadmin;
grant usage on database yao_db to role jack;
grant usage, create table on schema yao_db.public to role jack;
grant operate, usage on warehouse compute_wh to role jack;
create or replace role mary;
grant role mary to role accountadmin;
grant usage on database yao_db to role mary;
grant usage, create table on schema yao_db.public to role mary;
grant operate, usage on warehouse compute_wh to role mary;

Let’s create the “secret” tables now. Jack creates his table, Mary creates hers.

use role jack;
create or replace table jack_table(say text) as select 'I am great';
select * from jack_table;
use role mary;
create or replace table mary_table(say text) as select 'Friday it is';
select * from mary_table;

Test that Jack cannot see indeed Mary’s table, and Mary cannot access Jack’s:

use role jack;
select * from mary_table;
use role mary;
select * from jack_table;

And now the big hack, this is how we do it! Mary will be allowed to create a row access policy, which will be attached to her table and the column with her secret. The policy says that when the current role is Jack, the current statement (e.g. the SQL query he tries to execute) can be only the text defined there. Which simply joins the tables and returns True if the secrets are the same. Without giving up what is actually stored.

use role accountadmin;
grant create row access policy on schema yao_db.public to role mary;

use role mary;
create or replace row access policy mary_policy as (say text)
  returns boolean ->
  current_role() <> 'JACK'
  or current_statement() = 'select exists(select * from mary_table m join jack_table j on m.say = j.say) as result;';
alter table mary_table
  add row access policy mary_policy on (say);

And yes, now Mary can safely grant Jack access to her table. But remember that Jack cannot see raw table data. In fact, he can see nothing else! The only thing Jack can do is to run the exact query that Mary allowed him to execute. That will return only True or False.

grant select on table mary_table to role jack;

With the Jack role, let’s check if it works indeed. It works and this will return False:

use role jack;
select * from mary_table;
select exists(select * from mary_table m join jack_table j on m.say = j.say) as result;

Allow Mary to set her secret the same as Jack’s, just to test the functionality here:

use role mary;
update mary_table set say='I am great';

Now this will return True:

use role jack;
select exists(select * from mary_table m join jack_table j on m.say = j.say) as result;

It is important to remember that the query should be called exactly as it has been defined in the policy. You may not add, change, or remove any blank space, new line, or alias, because this will make it a different new query.

The following query, that changed the m alias to mm, will always return False:

select exists(select * from mary_table mm join jack_table j on mm.say = j.say) as result;

Data Clean Rooms in real life

I wrote a while ago another article with a concrete example.

Here below the Consumer – which can be a separate Snowflake partner account – has its own list of associates he doesn’t want to share. The Producer – which can be your Snowflake account – has a list of customers you don’t want to share. The names of the associates and the customers are PII (Personally Identifiable Information), which by law you cannot share with any third-party.

technical-blog-Row-Access-Policies-for-a-Data-Clean-Room-in-Snowflake-2

You saved the total sales value for each of your customers. And this is your very valuable proprietary information, you may never want to share this with anyone. However, you may extract some business value by allowing your partner account to eventually derive some information. Only for those customers and associates with the same names, assuming they are the same persons, you may allow a query that groups by their profession and returns the sales as an average value. Properly matching records by name it’s another story, for another time.

Your row access policy may define even more than one single query a third-party can run. It’s recommended to store all these allowed statements into a separate table that you own.

Remark also that you cannot run any of these queries because your partner granted you no access to their tables. But you are also safe, because the partner can execute only what you allow them to execute.

While we deal with separate Snowflake account, you must also create a read-only secure data share, that your partner will discover as an inbound share in their account.

Conclusions

  • Data Clean Rooms is a booming niche market today. Most companies collected and continue to collect huge amounts of data, and they are looking for better commercial value for every gigabyte they have.

  • Many companies partner together on a lot of issues. They each have their proprietary data and sensitive data they cannot share because of legal issues. But it is still possible to derive some valuable information without exposing the rest.

  • Snowflake made it very easy lately, when most other ways to implement a Data Clean Room are rather unsafe and complicated. Row access policies allow you to grant access to only a few statements that you know for sure will return predictable data. For the rest, both you and your partner are totally protected.

Let our experts guide you on your Data Warehousing journey