Technical Blog


Technical and Product News and Insights from Rackspace

Amazon Redshift: Best practices for optimizing query performance


Originally published on June 17, 2020 at Onica.com/blog

Organizations across many industries rely on data analytics for operations and other functions critical to success. However, as data volumes grow, managing that data and extracting value from it becomes increasingly complex.

Amazon Redshift

Amazon® Redshift® is a powerful data warehouse service from Amazon Web Services® (AWS) that simplifies data management and analytics. Let’s take a look at Amazon Redshift and some best practices you can implement to optimize data querying performance.

Data lakes versus data warehouses

Before digging into Amazon Redshift, it’s important to know the differences between data lakes and data warehouses. A data lake, like Amazon S3, is a centralized data repository that stores structured and unstructured data, at any scale and from many sources, without altering the data. Data warehouses, on the other hand, store data in a reconciled state optimized for ongoing analytics, loading only the data needed for analysis from data lakes.

Amazon Redshift takes storage for data analytics one level further, amalgamating the qualities of data lakes and warehouses into a “lake house” approach. It allows the querying of exabyte-scale data lakes while remaining cost-effective, minimizing data redundancy, and keeping maintenance overhead and operational costs low.

Amazon Redshift architecture

To rapidly process complex queries on big data sets, Amazon Redshift architecture supports massively parallel processing (MPP) that distributes the job across many compute nodes for concurrent processing.

These nodes are grouped into clusters, and each cluster consists of three types of nodes:

  • Leader node: Manages client connections, acts as the SQL endpoint, and coordinates parallel SQL processing across the compute nodes.

  • Compute nodes: Execute queries in parallel on data stored in a columnar format in 1 MB immutable blocks. An Amazon Redshift cluster can contain between 1 and 128 compute nodes, each partitioned into slices that hold a portion of the table data and act as a local processing zone.

  • Amazon Redshift Spectrum nodes: Execute queries directly against an Amazon S3 data lake.
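
Redshift Spectrum is reached from ordinary SQL by registering an external schema over the data lake. A minimal sketch, assuming an AWS Glue Data Catalog database and an IAM role — the schema, database, table, and role names below are placeholders, not part of any real deployment:

```sql
-- Register an external schema backed by the Glue Data Catalog
-- (names and the IAM role ARN are illustrative placeholders).
CREATE EXTERNAL SCHEMA spectrum_demo
FROM DATA CATALOG
DATABASE 'demo_lake'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Spectrum nodes scan the S3-resident data; only the aggregated
-- result is returned to the cluster.
SELECT event_date, COUNT(*) AS events
FROM spectrum_demo.clickstream
GROUP BY event_date;
```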

Optimizing query performance

By aligning the physical layout of data in the cluster with your query patterns, you can extract optimal query performance. If Amazon Redshift is not performing optimally, start by reconfiguring workload management.

Reconfigure workload management (WLM)

WLM is often left in its default configuration, but tuning it can improve performance. You can manage WLM automatically or manually. In automatic mode, Amazon Redshift manages memory usage and concurrency based on cluster-resource usage, and you can set up to eight priority-designated queues. In manual mode, you adjust queue concurrency and memory allocation yourself.

You can also optimize querying performance through the following WLM configuration parameters:

  • Query monitoring rules help you manage expensive or runaway queries.

  • Short query acceleration helps you prioritize short-running queries over longer-running queries by using machine learning algorithms to predict querying execution time.

  • Concurrency scaling helps you add multiple transient clusters in seconds to speed up concurrent read queries.
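
To see how your current WLM configuration behaves before changing it, you can query Redshift’s WLM system tables. A sketch (the `STV` views below are part of Redshift; the threshold comment reflects the convention that user-defined queues use service classes above the system-reserved ones):

```sql
-- Concurrency slots and working memory per user-defined queue.
SELECT service_class, num_query_tasks, query_working_mem
FROM stv_wlm_service_class_config
WHERE service_class > 5;  -- classes 1-5 are reserved for system/superuser

-- Queries currently queued or executing, grouped by queue and state.
SELECT service_class, state, COUNT(*) AS queries
FROM stv_wlm_query_state
GROUP BY service_class, state
ORDER BY service_class;
```

A queue that consistently shows queries in a queued state is a candidate for more slots, more memory, or concurrency scaling.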

WLM best practices

Some WLM tuning best practices include:

  • Creating separate WLM queues for different types of workloads.
  • Limiting maximum total concurrency for the main cluster to 15 or fewer to maximize throughput.
  • Enabling concurrency scaling.
  • Keeping the number of queues to a minimum.

Refining data distribution

Amazon Redshift automatically distributes the rows of a table across node slices, using one of the following distribution styles:

  • AUTO: The default; starts with ALL and switches to EVEN as the table grows.
  • ALL: Best for small, frequently joined, and infrequently modified tables; a full copy of the table is placed on the first slice of each compute node.
  • EVEN: Best for large, standalone fact tables that are not frequently joined or aggregated; rows are distributed round-robin across the slices.
  • KEY: Best for frequently joined fact tables or large dimension tables; a column value is hashed, and rows with the same hash value are placed on the same slice.

Using the right distribution patterns can maximize the performance of JOIN, GROUP BY, and INSERT INTO SELECT operations.
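
The distribution style is declared in the table DDL. A sketch of the styles above, with purely hypothetical table and column names:

```sql
-- KEY: co-locate rows that join on customer_id on the same slice.
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id);

-- ALL: replicate a small dimension table to every compute node.
CREATE TABLE region (
    region_id   INT,
    region_name VARCHAR(64)
)
DISTSTYLE ALL;

-- EVEN: round-robin a large table that is rarely joined.
CREATE TABLE raw_events (
    event_id BIGINT,
    payload  VARCHAR(256)
)
DISTSTYLE EVEN;

-- Omitting DISTSTYLE entirely gives you AUTO, the default.
```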

Refining data sorting

Sort keys define the physical order of data on disk. Table columns used in WHERE clause predicates are a good choice for sort keys; date and time-related columns are common candidates. Zone maps, generated automatically and stored in memory, record the minimum and maximum values for each block of data. Used together, sort keys and zone maps let Amazon Redshift restrict scans to the minimum required number of blocks.
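
For example, a compound sort key on a date column keeps time-based scans narrow. A sketch with hypothetical names:

```sql
-- Physically order rows by event_date, the column most queries filter on.
CREATE TABLE events (
    event_id   BIGINT,
    event_date DATE,
    user_id    BIGINT
)
DISTSTYLE EVEN
COMPOUND SORTKEY (event_date);

-- Because rows are ordered by event_date, this scan only touches blocks
-- whose zone-map min/max range overlaps the one-week predicate.
SELECT COUNT(*)
FROM events
WHERE event_date BETWEEN '2020-06-01' AND '2020-06-07';
```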

The following diagram illustrates how table sorting focuses scanning targets for time-based queries, thereby improving query performance.

Optimal query performance best practices

Applying the previously mentioned Amazon Redshift changes can improve query performance as well as cost and resource efficiency. Here are some additional best practices you can implement for further performance improvement:

  • Use SORT keys on columns that are often used in WHERE clause filters.
  • Use DISTKEY on columns that are often used in JOIN predicates.
  • Compress all columns except the first sort-key column.
  • Partition data in the data lake according to common query filters, such as date, to match access patterns.
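
Putting several of these practices together, a single hypothetical table definition might look like the following (names and encodings are illustrative, not prescriptive):

```sql
-- DISTKEY on the join column, SORTKEY on the filter column, and explicit
-- compression on every column except the leading sort-key column,
-- which stays RAW so its zone maps remain selective.
CREATE TABLE orders (
    order_date  DATE        ENCODE RAW,   -- leading sort key: uncompressed
    customer_id BIGINT      ENCODE AZ64,  -- frequent JOIN predicate
    order_id    BIGINT      ENCODE AZ64,
    status      VARCHAR(16) ENCODE LZO
)
DISTSTYLE KEY DISTKEY (customer_id)
COMPOUND SORTKEY (order_date);
```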

To explore some more best practices, take a deeper dive into the Amazon Redshift changes, and see an example of an in-depth query analysis, read the AWS Partner Network (APN) Blog.

If you are embarking on a data journey and are looking to leverage AWS services to quickly, reliably, and cost-effectively develop your data platform, contact our Data Engineering & Analytics team today.

Learn more about Onica services.


Scott Peters

Scott Peters is a Lead Data Science Architect at Rackspace Technology. He has been working with AWS since 2013, and with Rackspace Technology since 2018. With a background in software development, he excels in application migration, software architecture, and big data engineering.
