How to Backup Snowflake Data to S3 or GCS
In this post, we take a look at how to back up data in your Snowflake account to an external S3 or external GCS location.
Snowflake has excellent built-in backup functionality called Time Travel, which lets you access data as of a certain point in time. In addition, Snowflake protects your data in the event of a system failure or other catastrophic event with its Fail-safe feature, which allows Snowflake support to restore data for you during the Fail-safe window.
In addition, Snowflake has a near-magical ability to undrop objects that were dropped, which is a direct result of its immutable architecture.
However, Time Travel is limited to 1 day for Standard Edition (up to 90 days for Enterprise Edition), while Fail-safe adds 7 days of peace of mind. So, while both Time Travel and Fail-safe are convenient and storage-efficient means of backing up your data, some Snowflake customers, especially those using Standard Edition, may want to back up their data using alternate means.
One easy way is to simply unload your data to an object storage location, using Snowflake's built-in `copy into` command. Let's explore how to automate this using everyone's favorite data transformation library, `dbt`, for a number of databases and schemas. (See more on dbt here.)
External Stages
Snowflake can access external (i.e. in your AWS/GCP account, not within Snowflake's own AWS/GCP environment) S3/GCS buckets for both read and write operations. The easiest way to take advantage of that is to create an external stage in Snowflake to encapsulate a few things. (We've used stages in a previous post on loading data using Snowpipes.)
Create S3 Stage
Let’s create a dedicated stage for our backups:
create or replace stage backup_stage url='s3://my_s3_bucket/snowflake_backup'
credentials=(aws_key_id='<AWS_KEY_ID>', aws_secret_key='<AWS_SECRET_KEY>')
file_format=(type=csv compression='GZIP' field_optionally_enclosed_by='"' skip_header=1);
- Note that we're specifying the `csv` file format, along with `gzip` compression, as part of the stage definition. Your business case will vary here, and `parquet` or other format types may be better choices for you.
- The `skip_header` option here only applies to reading data from this stage, not to unloading data. We *will* specify during our unload process that we want to save table headers with our backups. However, for restore purposes, we want to skip headers. You can leave this option out if you'd rather get the relevant table headers back as the first row when you read from this external stage later.
You can view more information on the stage you just created by running:
desc stage backup_stage;
Pay attention to the database and schema in which you create the stage. The fully qualified name for the external stage will be `<database>.<schema>.backup_stage`. Also consider the roles that will be copying data to or from the stage. You can view the grants on the stage with:
show grants on stage backup_stage;
The only privilege you'll need to worry about is `USAGE`:
grant usage on stage backup_stage to role your_role;
Find more information on creating stages here.
Create GCS Stage
This assumes that you have a Storage Integration already configured for GCS. We can use the same format as for S3:
create or replace stage backup_stage url='gcs://my_gc_bucket/snowflake_backup'
storage_integration = YOUR_GCS_INTEGRATION
file_format=(type=csv compression='GZIP' field_optionally_enclosed_by='"' skip_header=1);
All of the details for the S3 stage creation are also relevant to the GCS stage.
Find more information on setting up your own integration here and on setting up a GCS stage here.
Copy into Stage
We can now `copy into` our external stage from any Snowflake table. This works for either S3 or GCS:
copy into @<your_database>.<your_schema>.backup_stage/my_database/my_schema/my_table/data_
from my_database.my_schema.my_table
header = true
overwrite = true
max_file_size = 104857600;
Let’s quickly talk about what’s going on here:
- We are referencing the fully qualified name of the stage, just as we would with a table.
- We are copying from a table into our external stage, which uses the compressed format specified earlier.
- By naming the output file `data_`, with no other related options, we specify that we want Snowflake to create multiple files, all starting with `data_`, which allows Snowflake to run this command in parallel in our virtual warehouse (parallelism is limited by the size of the virtual warehouse).
- We want the exported files to include table headers.
- We're overwriting any existing files at that location.
- We specified an upper limit of 100MB per file (100 * 1024^2 = 104,857,600 bytes). (See Snowflake's General File Sizing Recommendations for more.)
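As a quick sanity check on that byte arithmetic, here's a one-liner in Python (the `MB` constant name is just for illustration):

```python
# max_file_size is given in bytes: 100 MB = 100 * 1024^2 bytes.
MB = 1024 ** 2
max_file_size = 100 * MB
print(max_file_size)  # 104857600
```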
There are many more options to be explored; see the `copy into` docs for help.
Automation via dbt
So far, this backs up a single table. However, we likely have dozens or hundreds more, across various schemas and databases. Thankfully, `dbt` allows us to script the generation of SQL commands using `jinja2`. What we really want is a flexible macro that will build the `copy into` command for a given database/schema/table combination.
For example:
{% macro get_backup_table_command(table, day_of_month) %}
{% set backup_key -%}
day_{{ day_of_month }}/{{ table.database.lower() }}/{{ table.schema.lower() }}/{{ table.name.lower() }}/data_
{%- endset %}
copy into @<your_database>.<your_schema>.backup_stage/{{ backup_key }}
from {{ table.database }}.{{ table.schema }}."{{ table.name.upper() }}"
header = true
overwrite = true
max_file_size = 1073741824;
{% endmacro %}
In this case, we're creating the `copy into` statement that will back up our given table into an object store key that looks something like this:
[gcs|s3]://my_bucket/snowflake_backup/day_09/dw/shop/dim_product/
(We're upper-casing the table name in quotes in the `from` clause to guard against reserved words in table names, such as "ORDER" in the case of Fivetran-sourced Shopify data. 😒)
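To make the key layout concrete, here's a small Python sketch that mirrors what the macro's `backup_key` block produces (the `build_backup_key` function name is our own, purely illustrative):

```python
def build_backup_key(database: str, schema: str, table: str, day_of_month: str) -> str:
    # Mirror the Jinja macro: lower-cased identifiers under a day-of-month
    # prefix, ending in "data_" so Snowflake can emit multiple parallel files.
    return f"day_{day_of_month}/{database.lower()}/{schema.lower()}/{table.lower()}/data_"

print(build_backup_key("DW", "SHOP", "DIM_PRODUCT", "09"))
# day_09/dw/shop/dim_product/data_
```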
Using this macro, we can loop through a list of databases, schemas, and tables and execute backup statements like this in sequence.
Let's see how that would look in a `dbt` macro.
- The key component here is the `backups` dictionary, which specifies a `list` of schemas for each database `key`. In this example, we're backing up 2 databases (`"RAW"` and `"DW"`) with several schemas in each.
- In this case we're looking to back up all tables in each specified schema, so we're using the handy `get_tables_by_prefix` macro from the `dbt_utils` package to get a list of `relations` (i.e. tables) that match our filter. Note that we can exclude tables as well, such as any metadata tables (here, excluding anything starting with `FIVETRAN_%`).
- In this example, we chose to keep up to 31 days of backups, one for each day of the month. Once the month is over, we simply roll over and start overwriting backups. This is a very simple backup scheme, so please make sure to modify it to suit your business needs. It's just meant to illustrate that we have access to some of Python's date formatting functionality via Jinja to create SQL statements.
- Since we ultimately want to operationalize this backup process using a `dbt run-operation`, we wrap our code in a `statement`, which allows us to use a `run-operation` to execute SQL that is not a `select` statement.
{% macro backup_to_object_store() %}
{%- call statement('backup', fetch_result=true, auto_begin=true) -%}
{% set backups =
{
"RAW":
["APP_DATA",
"FACEBOOK",
"ADWORDS",
"SHOPIFY",
"ZENDESK"],
"DW":
["CUSTOMER",
"SHOP",
"FINANCE"]
}
%}
{% set day_of_month = run_started_at.strftime("%d") %}
{{ log('Backing up for Day ' ~ day_of_month, info = true) }}
{% for database, schemas in backups.items() %}
{% for schema in schemas %}
{{ log('Getting tables in schema ' ~ schema ~ '...', info = true) }}
{% set tables = dbt_utils.get_tables_by_prefix(schema.upper(), '', exclude='FIVETRAN_%', database=database) %}
{% for table in tables %}
{{ log('Backing up ' ~ table.name ~ '...', info = true) }}
{% set backup_table_command = get_backup_table_command(table, day_of_month) %}
{{ backup_table_command }}
{% endfor %}
{% endfor %}
{% endfor %}
{%- endcall -%}
{%- endmacro -%}
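Stripped of the dbt/Jinja machinery, the macro's control flow amounts to the following Python sketch (the per-schema table list is faked here; in the real macro it comes from `dbt_utils.get_tables_by_prefix`):

```python
from datetime import date

# Same databases/schemas as the macro's backups dictionary.
backups = {
    "RAW": ["APP_DATA", "FACEBOOK", "ADWORDS", "SHOPIFY", "ZENDESK"],
    "DW": ["CUSTOMER", "SHOP", "FINANCE"],
}

day_of_month = date(2019, 10, 9).strftime("%d")  # zero-padded: "09"

commands = []
for database, schemas in backups.items():
    for schema in schemas:
        # Stand-in for dbt_utils.get_tables_by_prefix(...): one fake table per schema.
        for table in [f"{schema}_EXAMPLE"]:
            key = f"day_{day_of_month}/{database.lower()}/{schema.lower()}/{table.lower()}/data_"
            commands.append(
                f'copy into @backup_stage/{key} from {database}.{schema}."{table}"'
            )

print(len(commands))  # one copy into statement per table: 8
```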
We can then call this from the command line like so:
dbt run-operation backup_to_object_store
which we can then automate in our CI/CD environment of choice, such as dbt Cloud or GitLab.
Data Recovery
So, how do we use these backup files if we want to recover data from a previous backup?
A great thing about Snowflake external stages is that we can simply read from them, using the same stage definition we’ve used for unloading data to them.
For example, we could pick a specific table (`my_table`) and day (`day_09`), and read from the backed-up data like so:
select
$1 as id,
to_timestamp($2, 'yyyy-mm-dd hh24:mi:ss.ff Z') as ts
from
@<your_database>.<your_schema>.backup_stage/day_09/my_db/my_schema/my_table/
Which would result in output like this:
| ID | TS |
|---------|-------------------------------|
| 9147485 | 2019-10-09 15:06:43.627000000 |
| 9147484 | 2019-10-09 15:06:36.230000000 |
| 9147483 | 2019-10-09 15:06:28.263000000 |
| 9147482 | 2019-10-09 15:06:09.158000000 |
| 9147481 | 2019-10-09 15:06:02.582000000 |
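If you want to spot-check recovered timestamps outside Snowflake, note that the nine fractional digits above exceed what Python's `%f` directive (microseconds) can parse, so a sketch like this trims them first:

```python
from datetime import datetime

raw = "2019-10-09 15:06:43.627000000"  # value as unloaded by Snowflake
# %f accepts at most 6 fractional digits, so trim nanoseconds to microseconds.
ts = datetime.strptime(raw[:26], "%Y-%m-%d %H:%M:%S.%f")
print(ts.isoformat())  # 2019-10-09T15:06:43.627000
```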
We can then create a table in Snowflake with the recovered data:
create table my_table_recovered as (
select
$1 as id,
to_timestamp($2, 'yyyy-mm-dd hh24:mi:ss.ff Z') as ts
from
@<your_database>.<your_schema>.backup_stage/day_09/my_db/my_schema/my_table/
);
If we're happy with the recovered data, we can then swap the recovered data with the current ("bad") data:
alter table my_table rename to my_table_bad;
alter table my_table_recovered rename to my_table;
Then at a future point, we could delete the “bad” table:
drop table my_table_bad;
Hopefully, this gave you some idea of how to extend the built-in Snowflake recovery features like Time Travel and Fail-safe with a few alternatives.
As always, please let me know if you have any feedback or comments!