Unified data platform and data virtualization through PolyBase: Part One


I am excited to finally build on my earlier blog post Unified Data Platform - SQL 2019.

In this series, I dig deep into how to achieve data virtualization through PolyBase®. This post covers the introduction and demo prerequisites, and Part Two gets to the demo itself.

Recap

To refresh your memory, recall that the previous post covered the following items:

  1. Evolution of SQL Server® from DB Engine to the Unified Data Platform
  2. SQL 2019 presents Unified Data Platform using:
    • SQL DB Engine for OLTP
    • Data Virtualization through Polybase
    • Data Mart through Columnar store
    • Data Lake through HDFS
    • Big Data, ML, Streaming through Apache Spark
  3. Management and Monitoring using Azure® Data Studio (ADS)

Introduction

With data being omnipresent, we continuously face challenges to move or copy it to another location to process it further. With a small data set, this is easy enough, but it can be a problem with ever-increasing data sizes. Also, with the progressive augmentation of data mining by organizations, data leaders do not advocate keeping data in one place. Likewise, it can be tedious to fetch or use data from different data stores of structured and unstructured data and Big Data.

Data virtualization is the solution to this problem.

What is data virtualization?

Data virtualization is an approach to data management that enables an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at the source or physically located. It can provide a single customer view of the overall data.

There are a lot of Data Virtualization tools present in the market, such as the following tools:

  • Microsoft┬« Polybase┬«
  • Actifio┬« Virtual Data Pipe (VDP)
  • Informatica┬« Powercenter
  • IBM┬« Cloud Pak for Data
  • RedHat┬« JBoss Data Virtualization

For this series, I focus on PolyBase, which Microsoft introduced in SQL 2016 and has improved with each subsequent SQL version.

Polybase enables SQL Server to run Transact-SQL queries on external data sources like Azure® Blob, Hadoop®, Oracle®, MongoDB®, and so on. The same Transact-SQL used to process external data can also run on relational databases. This ability helps to integrate data from external sources with relational data in your database. The following image shows a simple illustration of SQL Polybase:

Figure 1


Now that you know the basics of Polybase, I want to share a demo that fetches data from an Azure blob external source by using SQL PolyBase. This post covers the prerequisites for the demo.

Demo prerequisites

Before you can run the demo, you need to perform the following prerequisite tasks:

  1. Install SQL 2016 or later with the PolyBase feature.
  2. Enable PolyBase on SQL Server.
  3. Create an Azure Storage account.
  4. Create an Azure blob container.
  5. Place a data file in the blob container.
1. Install SQL Polybase

You can install Polybase with only one SQL instance on a machine.

Currently, I have one default SQL 2019 instance running on my local machine. However, I did not select Polybase during installation. The following image shows the SQL Server configuration manager:

Figure 2


I had to re-run the SQL setup and select the following elements during the feature selection window to install Polybase:

  • PolyBase Query Service for external data
  • Java Connector for HDFS data sources

Go ahead and run the SQL Setup and perform the following steps to install the PolyBase feature. Keep clicking Next till you reach the very last screen. Then, click the Finish Installation tab, selecting the tabs highlighted in dark red.

  1. Click Installation in the sidebar and select New SQL Server Stand-alone installation or add features to an existing installation.

Figure 3


  1. After you reach the Installation Type window, select Add features to an existing instance, and select the required instance from the drop-down menu.

Figure 4


  1. After you reach the Feature Selection window, select the PolyBase features.

Figure 5


  1. In the Polybase Configuration window, select Use this SQL Server as standalone Polybase-enabled instance.

Figure 6


  1. For the rest of the sidebar options, select the defaults and click Install. After the insyallation completes, the folloiwng window displays:

Figure 7


At this point, you can see in the SQL configuration manager that we have two more features installed. However, you might still get the error message Polybase not installed in SSMS while trying to enable Polybase. To fix this, restart the server after installing Polybase.

Figure 8


2. Enable SQL Polybase

To enable PolyBase, run the following steps:

  1. Connect to SQL Server in SSMS and run the following query to confirm that Polybase installed successfully.

     SELECT SERVERPROPERTY ('IsPolyBaseInstalled') AS IsSuccessfullyInstalled;
    

    The following image shows the output for a successful installation:


Figure 9

  1. Enable Polybase by running the following queries:

     EXEC sp_configure 'polybase enabled', 1;
     Go
    
  2. Run the following query:

     Reconfigure
    

    This query is important. Without this step, errors can appear during external file format creation in the steps covered in Part Three of this series.

Figure 10


3. Create an Azure Storage account

Perform the following steps to create an Azure Storage account:

  1. Log in to the Azure Portal by using your credentials.

  2. Search for the Azure Storage account service and perform the following steps to create the storage account. Keep clicking Next till you reach the very last screen. Then, click the Review and Create option. Select the tabs highlighted in dark red.

  3. On the Azure Portal search bar, select Azure Storage account and click + Add to create a new storage account.

Figure 11


  1. On the Basics tab, enter the required details and click Next: Networking.

Figure 12


  1. Keep the default settings for the Networking, Data Protection, Advanced, and Tags screens.

  2. Next, click Review+Create and, after the validation succeeds, click Create Tab to create the storage account, as shown in the following image:


Figure 13

  1. On successful deployment, click Go To Resource, which takes you to the created storage account.

Figure 14


4. Create an Azure container

To create an Azure container, go to the created Azure storage account, click on Containers in the left pane, and then click on +Container.

Figure 15


5. Place a data file in the container

At this stage, create a text data file and upload it to the container.

  1. Create a text file similar to the following file:

Figure 16


Note: You can also use CSV, Excel®, or other external data sources. However, depending on the external data source, you need to take a few additional steps. For example, for CSV or Excel data sources, you should install proper drivers on the SQL Server and add the connection properties to an ODBC data source name (DSN). You can use the Microsoft ODBC Data Source Administrator to create and configure ODBC DSNs.

  1. Go to the container you created, polybasedemocontainer, click Upload, click the folder icon on the right,-hand side, and select the file to upload.

Figure 17


Next step

You have successfully completed the prerequisites for the PolyBase demo. Part Two presents the demo.

Learn more about our Data services.

Use the Feedback tab to make any comments or ask questions. You can also click Let’s Talk to start the conversation.

post avatar
Vashali Misra

I am a SQL Server DBA at Rackspace Technology APJ Region. My area of expertise is SQL Server, and I’m slowly and steadily gaining experience into the cloud technologies. I have 9 Years of experience with major projects like Dowjones, MarketWatch, Regeneron, etc.

Share this information: