Using the Amazon Redshift Data API to interact with Amazon Redshift clusters
This post was updated on July 28, 2021, to include multi-statement and parameterization support.
Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing ETL (extract, transform, and load), business intelligence (BI), and reporting tools. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as BI, predictive analytics, and real-time streaming analytics.
As a data engineer or application developer, for some use cases, you want to interact with Amazon Redshift to load or query data with a simple API endpoint without having to manage persistent connections. With the Amazon Redshift Data API, you can interact with Amazon Redshift without having to configure JDBC or ODBC. This makes it easier and more secure to work with Amazon Redshift and opens up new use cases.
This post explains how to use the Amazon Redshift Data API from the AWS Command Line Interface (AWS CLI) and Python. We also explain how to use AWS Secrets Manager to store and retrieve credentials for the Data API.
Introducing the Data API
The Amazon Redshift Data API enables you to painlessly access data from Amazon Redshift with all types of traditional, cloud-native, and containerized serverless web service-based applications and event-driven applications. The following diagram illustrates this architecture.
The Amazon Redshift Data API simplifies data access, ingest, and egress from programming languages and platforms supported by the AWS SDK such as Python, Go, Java, Node.js, PHP, Ruby, and C++.
The Data API simplifies access to Amazon Redshift by eliminating the need for configuring drivers and managing database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours. The Data API federates AWS Identity and Access Management (IAM) credentials so you can use identity providers like Okta or Azure Active Directory or database credentials stored in Secrets Manager without passing database credentials in API calls.
For customers using AWS Lambda, the Data API provides a secure way to access your database without the additional overhead for Lambda functions to be launched in an Amazon Virtual Private Cloud (Amazon VPC). Integration with the AWS SDK provides a programmatic interface to run SQL statements and retrieve results asynchronously.
Relevant use cases
The Amazon Redshift Data API is not a replacement for JDBC and ODBC drivers, and is suitable for use cases where you don't need a persistent connection to a cluster. It's applicable in the following use cases:
- Accessing Amazon Redshift from custom applications with any programming language supported by the AWS SDK. This enables you to integrate web service-based applications to access data from Amazon Redshift using an API to run SQL statements. For example, you can run SQL from JavaScript.
- Building a serverless data processing workflow.
- Designing asynchronous web dashboards because the Data API lets you run long-running queries without having to wait for them to complete.
- Running your query once and retrieving the results multiple times without having to run the query again within 24 hours.
- Building your ETL pipelines with AWS Step Functions, Lambda, and stored procedures.
- Having simplified access to Amazon Redshift from Amazon SageMaker and Jupyter notebooks.
- Building event-driven applications with Amazon EventBridge and Lambda.
- Scheduling SQL scripts to simplify data load, unload, and refresh of materialized views.
The Data API GitHub repository provides examples for different use cases.
Create an Amazon Redshift cluster
If you haven't already created an Amazon Redshift cluster, or want to create a new one, see Step 1: Create an IAM role. In this post, we create a table and load data using the COPY command. Make sure that the IAM role you attach to your cluster has AmazonS3ReadOnlyAccess permission.
Prerequisites for using the Data API
You must be authorized to access the Amazon Redshift Data API. Amazon Redshift provides the RedshiftDataFullAccess managed policy, which offers full access to Data APIs. This policy also allows access to Amazon Redshift clusters, Secrets Manager, and IAM API operations needed to authenticate and access an Amazon Redshift cluster by using temporary credentials. If you want to use temporary credentials with the managed policy RedshiftDataFullAccess, you have to create one with the user name in the database as redshift_data_api_user.
You can also create your own IAM policy that allows access to specific resources by starting with RedshiftDataFullAccess as a template. For details, refer to Querying a database using the query editor.
The Data API allows you to access your database either using your IAM credentials or secrets stored in Secrets Manager. In this post, we use Secrets Manager.
For instructions on using database credentials for the Data API, see How to rotate Amazon Redshift credentials in AWS Secrets Manager.
Use the Data API from the AWS CLI
You can use the Data API from the AWS CLI to interact with the Amazon Redshift cluster. For instructions on configuring the AWS CLI, see Setting up the Amazon Redshift CLI. The Amazon Redshift CLI (aws redshift) is a part of the AWS CLI that lets you manage Amazon Redshift clusters, such as creating, deleting, and resizing them. The Data API now provides a command line interface to the AWS CLI (redshift-data) that allows you to interact with the databases in an Amazon Redshift cluster.
Before we get started, ensure that you have the updated AWS SDK configured.
You can invoke help using the following command:
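For example (this assumes an AWS CLI version recent enough to include the redshift-data commands):

```bash
aws redshift-data help
```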
The following table shows you the different commands available with the Data API CLI.
| Command | Description |
| --- | --- |
| list-databases | Lists the databases in a cluster. |
| list-schemas | Lists the schemas in a database. You can filter this by a matching schema pattern. |
| list-tables | Lists the tables in a database. You can filter the tables list by a schema name pattern, a matching table name pattern, or a combination of both. |
| describe-table | Describes the detailed information about a table, including column metadata. |
| execute-statement | Runs a SQL statement, which can be SELECT, DML, DDL, COPY, or UNLOAD. |
| batch-execute-statement | Runs multiple SQL statements in a batch as a part of a single transaction. The statements can be SELECT, DML, DDL, COPY, or UNLOAD. |
| cancel-statement | Cancels a running query. To be canceled, a query must be in the RUNNING state. |
| describe-statement | Describes the details of a specific SQL statement run. The information includes when the query started, when it finished, the number of rows processed, and the SQL statement. |
| list-statements | Lists the SQL statements. By default, only finished statements are shown. |
| get-statement-result | Fetches the temporarily cached result of the query. The result set contains the complete result set and the column metadata. You can paginate through a set of records to retrieve the entire result as needed. |
If you want to get help on a specific command, run the following command:
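For example, to see the options for execute-statement (any Data API subcommand works the same way):

```bash
aws redshift-data execute-statement help
```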
Now we look at how you can use these commands. First, get the secret key ARN by navigating to your key on the Secrets Manager console.
Listing databases
Most organizations use a single database in their Amazon Redshift cluster. You can use the following command to list the databases you have in your cluster. This operation requires you to connect to a database and therefore requires database credentials:
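A minimal sketch of the call; the cluster identifier, database name, secret ARN, and Region are placeholders you replace with your own values:

```bash
aws redshift-data list-databases \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --region us-east-1
```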
List schema
Similar to listing databases, you can list your schemas by using the list-schemas command:
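For example (same placeholder values as before):

```bash
aws redshift-data list-schemas \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret
```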
You have several schemas that match demo (demo, demo2, demo3, and so on). You can optionally provide a pattern to filter your results matching to that pattern:
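For example, to return only schemas whose names start with demo (the --schema-pattern value is illustrative):

```bash
aws redshift-data list-schemas \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --schema-pattern "demo%"
```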
List tables
The Data API provides a simple command, list-tables, to list tables in your database. You might have thousands of tables in a schema; the Data API lets you paginate your result set or filter the table list by providing filter conditions.
You can search across your schemas with table-pattern; for example, you can filter the table list by all tables across all your schemas in the database. See the following code:
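For example (the table pattern here is a hypothetical one):

```bash
aws redshift-data list-tables \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --table-pattern "sales%"
```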
You can filter your tables list in a specific schema pattern:
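For example, restricting the search to schemas that match demo:

```bash
aws redshift-data list-tables \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --schema-pattern "demo%" \
    --table-pattern "sales%"
```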
Run SQL commands
You can run SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift with the Data API. You can optionally specify a name for your statement, and whether you want to send an event to EventBridge after the query runs. The query is asynchronous, and you get a query ID after running a query.
Create a schema
Let's now use the Data API to see how you can create a schema. The following command lets you create a schema in your database. You don't have to run this SQL if you have pre-created the schema.
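A sketch of the call (placeholder cluster, database, and secret values; the schema name demo matches the examples in this post):

```bash
aws redshift-data execute-statement \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --sql "CREATE SCHEMA demo;"
```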
The following shows an example output; we discuss later how you can check the status of a SQL statement that you ran with execute-statement.
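The response looks roughly like the following (all values are placeholders); the Id field is the query ID you pass to describe-statement and get-statement-result:

```json
{
    "ClusterIdentifier": "mycluster",
    "CreatedAt": "2021-07-28T12:00:00.000000+00:00",
    "Database": "dev",
    "Id": "d9b6c0c9-0747-4bf4-b142-e8883122f766",
    "SecretArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret"
}
```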
Create a table
You can use the following command to create a table with the CLI.
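As a sketch, the following creates a hypothetical demo.sales table (the column list is modeled loosely on the TICKIT sample data and is only illustrative):

```bash
aws redshift-data execute-statement \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --sql "CREATE TABLE demo.sales (salesid INTEGER, sellerid INTEGER, buyerid INTEGER, eventid INTEGER, qtysold SMALLINT, pricepaid DECIMAL(8,2), saletime TIMESTAMP);"
```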
Load sample data
The COPY command lets you load bulk data into your table in Amazon Redshift. You can use the following command to load data into the table we created earlier:
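A sketch of a COPY run through the Data API; the S3 path and IAM role ARN are placeholders for your own bucket and the role attached to your cluster:

```bash
aws redshift-data execute-statement \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --sql "COPY demo.sales FROM 's3://your-bucket/tickit/sales_tab.txt' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' DELIMITER '\t' REGION 'us-east-1';"
```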
Retrieve Data
The following query uses the table we created earlier:
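For example, a hypothetical aggregation over the demo.sales table:

```bash
aws redshift-data execute-statement \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --sql "SELECT eventid, SUM(pricepaid) AS total_price FROM demo.sales GROUP BY eventid ORDER BY total_price DESC LIMIT 10;"
```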
If you're fetching a large amount of data, using UNLOAD is recommended. You can unload data into Amazon Simple Storage Service (Amazon S3) either in CSV or Parquet format. UNLOAD uses the MPP capabilities of your Amazon Redshift cluster and is faster than retrieving a large amount of data to the client side.
The output of execute-statement includes a query ID. You can fetch results using the query ID that you receive as an output of execute-statement.
Check the status of a statement
You can check the status of your statement by using describe-statement. The output for describe-statement provides additional details such as PID, query duration, number of rows in and size of the result set, and the query ID given by Amazon Redshift. See the following command:
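For example (the statement ID is a placeholder; use the Id returned by your own execute-statement call):

```bash
aws redshift-data describe-statement \
    --id d9b6c0c9-0747-4bf4-b142-e8883122f766
```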
The following is an example output:
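An abridged sketch of what the response contains (field names follow the Data API; values are placeholders):

```json
{
    "ClusterIdentifier": "mycluster",
    "Database": "dev",
    "Duration": 52324000,
    "HasResultSet": true,
    "Id": "d9b6c0c9-0747-4bf4-b142-e8883122f766",
    "QueryString": "SELECT eventid, SUM(pricepaid) AS total_price FROM demo.sales GROUP BY eventid ORDER BY total_price DESC LIMIT 10;",
    "RedshiftPid": 12345,
    "RedshiftQueryId": 6789,
    "ResultRows": 10,
    "ResultSize": 321,
    "Status": "FINISHED"
}
```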
The status of a statement can be FINISHED, RUNNING, or FAILED.
Run SQL statements with parameters
You can run SQL statements with parameters. The following example uses two named parameters in the SQL that are specified using name-value pairs:
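A sketch using two hypothetical named parameters, sellerid and qtysold, passed through the --parameters option:

```bash
aws redshift-data execute-statement \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --sql "SELECT * FROM demo.sales WHERE sellerid = :sellerid AND qtysold > :qtysold;" \
    --parameters name=sellerid,value=100 name=qtysold,value=2
```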
The describe-statement output returns QueryParameters along with QueryString:
You can map the name-value pairs in the parameters list to one or more parameters in the SQL text, and the name-value parameters can be in any order. You can specify a type cast, for example :sellerid::BIGINT, with a parameter. You can also specify a comment in the SQL text while using parameters. You can't specify a NULL value or zero-length value as a parameter.
Cancel a running statement
If your query is still running, you can use cancel-statement to cancel a SQL query. See the following command:
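For example (placeholder statement ID):

```bash
aws redshift-data cancel-statement --id d9b6c0c9-0747-4bf4-b142-e8883122f766
```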
Fetch results from your query
You can fetch the query results by using get-statement-result. The query result is stored for 24 hours. See the following command:
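For example (placeholder statement ID):

```bash
aws redshift-data get-statement-result --id d9b6c0c9-0747-4bf4-b142-e8883122f766
```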
The output of the result contains metadata such as the number of records fetched, column metadata, and a token for pagination.
Run multiple SQL statements
You can run multiple SELECT, DML, DDL, COPY, or UNLOAD commands for Amazon Redshift in a batch with the Data API. The batch-execute-statement command enables you to create tables and run multiple COPY commands, or create temporary tables as a part of your reporting system and run queries on that temporary table. See the following code:
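A sketch of a two-statement batch (the temporary table and the follow-up query are hypothetical):

```bash
aws redshift-data batch-execute-statement \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --sqls "CREATE TEMP TABLE tmp_sales AS SELECT * FROM demo.sales WHERE qtysold > 2;" \
           "SELECT sellerid, COUNT(*) AS num_sales FROM tmp_sales GROUP BY sellerid;"
```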
The describe-statement output for a multi-statement query shows the status of all sub-statements. In the preceding example, we had two SQL statements, and therefore the output includes the IDs for the SQL statements as 23d99d7f-fd13-4686-92c8-e2c279715c21:1 and 23d99d7f-fd13-4686-92c8-e2c279715c21:2. Each sub-statement of a batch SQL statement has a status, and the status of the batch statement is updated with the status of the last sub-statement. For example, if the last statement has status FAILED, then the status of the batch statement shows as FAILED.
You can fetch query results for each statement separately. In our example, the first statement is a SQL statement to create a temporary table, so there are no results to retrieve for the first statement. You can retrieve the result set for the second statement by providing the statement ID for the sub-statement:
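For example, using the sub-statement ID from the preceding output:

```bash
aws redshift-data get-statement-result --id 23d99d7f-fd13-4686-92c8-e2c279715c21:2
```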
Export the data
Amazon Redshift allows you to export from database tables to a set of files in an S3 bucket by using the UNLOAD command with a SELECT statement. You can unload data in either text or Parquet format. The following command shows you an example of how you can use the data lake export with the Data API:
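A sketch of an UNLOAD to Parquet; the bucket path and IAM role ARN are placeholders:

```bash
aws redshift-data execute-statement \
    --cluster-identifier mycluster \
    --database dev \
    --secret-arn arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret \
    --sql "UNLOAD ('SELECT * FROM demo.sales') TO 's3://your-bucket/unload/sales_' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' FORMAT AS PARQUET;"
```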
You can use batch-execute-statement if you want to use multiple statements with UNLOAD or combine UNLOAD with other SQL statements.
Use the Data API from the AWS SDK
You can use the Data API in any of the programming languages supported by the AWS SDK. For this post, we use the AWS SDK for Python (Boto3) as an example to illustrate the capabilities of the Data API.
We first import the Boto3 package and establish a session:
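For example (the Region is a placeholder; Boto3 can also pick it up from your environment or profile):

```python
import boto3

session = boto3.Session(region_name="us-east-1")
```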
Get a client object
You can create a client object from the boto3.Session object using RedshiftData:
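For example ('redshift-data' is the Boto3 service name for the Data API):

```python
client = session.client("redshift-data")
```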
If you don't want to create a session, your client is as simple as the following code:
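For example:

```python
import boto3

client = boto3.client("redshift-data")
```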
Run a statement
The following example code uses the Secrets Manager key to run a statement. For this post, we use the table we created earlier. You can use DDL, DML, COPY, and UNLOAD as a parameter:
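A minimal sketch (cluster identifier, database, secret ARN, and table are placeholders):

```python
response = client.execute_statement(
    ClusterIdentifier="mycluster",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret",
    Sql="SELECT eventid, SUM(pricepaid) AS total_price FROM demo.sales GROUP BY eventid;",
)
statement_id = response["Id"]  # use this ID with describe_statement and get_statement_result
```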
As we discussed earlier, running a query is asynchronous; running a statement returns an ExecuteStatementOutput, which includes the statement ID.
If you want to publish an event to EventBridge when the statement is complete, you can use the additional parameter WithEvent set to true:
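For example:

```python
response = client.execute_statement(
    ClusterIdentifier="mycluster",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret",
    Sql="SELECT COUNT(*) FROM demo.sales;",
    WithEvent=True,  # publish an event to EventBridge when the statement completes
)
```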
Use IAM credentials
Amazon Redshift allows users to get temporary database credentials using GetClusterCredentials. We recommend scoping the access to a specific cluster and database user if you're allowing your users to use temporary credentials. The following example code gets temporary IAM credentials. As you can see in the code, we use redshift_data_api_user. The managed policy RedshiftDataFullAccess scopes the use of temporary credentials only to redshift_data_api_user.
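A sketch of the same call with temporary credentials: pass DbUser instead of SecretArn (cluster and table names are placeholders):

```python
response = client.execute_statement(
    ClusterIdentifier="mycluster",
    Database="dev",
    DbUser="redshift_data_api_user",  # the user allowed by RedshiftDataFullAccess
    Sql="SELECT COUNT(*) FROM demo.sales;",
)
```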
Describe a statement
You can use describe_statement to find the status of the query and number of records retrieved:
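For example:

```python
desc = client.describe_statement(Id=statement_id)
print(desc["Status"], desc.get("ResultRows"))
```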
Fetch results from your query
You can use get_statement_result to retrieve results for your query if your query is complete:
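For example:

```python
if client.describe_statement(Id=statement_id)["Status"] == "FINISHED":
    result = client.get_statement_result(Id=statement_id)
    print(result["TotalNumRows"])
```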
The get_statement_result command returns a JSON object that includes metadata for the result and the actual result set. You might need to process the data to format the result if you want to display it in a convenient format.
Fetch and format results
For this post, we demonstrate how to format the results with the Pandas framework. The post_process function processes the metadata and results to populate a DataFrame. The query function retrieves the result from a database in an Amazon Redshift cluster. See the following code:
import pandas as pd
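The following is a minimal sketch of the post_process and query functions; the cluster identifier, database, secret ARN, and table are placeholders, and the simple polling loop is just one reasonable way to wait for the asynchronous statement to finish:

```python
import time

import boto3
import pandas as pd

client = boto3.client("redshift-data")


def post_process(meta, records):
    # Column names come from the metadata; each field in a record is a dict with a
    # single typed value (stringValue, longValue, doubleValue, ...) or an isNull flag.
    columns = [col["name"] for col in meta]
    rows = []
    for record in records:
        rows.append([None if field.get("isNull") else list(field.values())[0] for field in record])
    return pd.DataFrame(rows, columns=columns)


def query(sql, cluster="mycluster", database="dev",
          secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:myredshiftsecret"):
    # Run the statement, poll until it completes, then fetch and format the result.
    statement_id = client.execute_statement(
        ClusterIdentifier=cluster, Database=database, SecretArn=secret_arn, Sql=sql
    )["Id"]
    while True:
        desc = client.describe_statement(Id=statement_id)
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(1)
    if desc["Status"] != "FINISHED":
        raise RuntimeError(desc.get("Error", "statement did not finish"))
    result = client.get_statement_result(Id=statement_id)
    return post_process(result["ColumnMetadata"], result["Records"])


df = query("SELECT eventid, SUM(pricepaid) AS total_price FROM demo.sales GROUP BY eventid;")
print(df.head())
```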
In this post, we demonstrated using the Data API with Python. However, you can use the Data API with other programming languages supported by the AWS SDK.
Best practices
We recommend the following best practices when using the Data API:
- Federate your IAM credentials to the database to connect with Amazon Redshift. Amazon Redshift allows users to get temporary database credentials with GetClusterCredentials. We recommend scoping the access to a specific cluster and database user if you're granting your users temporary credentials. For more information, see Example policy for using GetClusterCredentials.
- Use a custom policy to provide fine-grained access to the Data API in the production environment if you don't want your users to use temporary credentials. You have to use Secrets Manager to manage your credentials in such use cases.
- Ensure that the record size that you retrieve is smaller than 64 KB.
- Don't retrieve a large amount of data from your client; use the UNLOAD command to export the query results to Amazon S3. You're limited to retrieving only 100 MB of data with the Data API.
- Don't forget to retrieve your results within 24 hours; results are stored only for 24 hours.
Customer feedback
Datacoral is a fast-growing startup that offers an AWS-native data integration solution for analytics. Datacoral integrates data from databases, APIs, events, and files into Amazon Redshift while providing guarantees on data freshness and data accuracy to ensure meaningful analytics. With the Data API, they can create a completely event-driven and serverless platform that makes data integration and loading easier for our mutual customers.
Founder and CEO Raghu Murthy says, "As an Amazon Redshift Ready Advanced Technology Partner, we have worked with the Redshift team to integrate their Redshift API into our product. The Redshift API provides the asynchronous component needed in our platform to submit and respond to data pipeline queries running on Amazon Redshift. It is the last piece of the puzzle for us to offer our customers a fully event-driven and serverless platform that is robust, cost-effective, and scales automatically. We are thrilled to be part of the launch."
Zynga Inc. is an American game developer running social video game services, founded in April 2007. Zynga uses Amazon Redshift as its central data warehouse for game event, user, and acquisition data. They use the data in the data warehouse for analytics, BI reporting, and AI/ML across all games and departments. Zynga wants to replace any programmatic access clients connected to Amazon Redshift with the new Data API. Currently, Zynga's services connect using a wide variety of clients and drivers, and they plan to consolidate all of them. This will remove the need for Amazon Redshift credentials and regular password rotations.
Johan Eklund, Senior Software Engineer, Analytics Engineering team at Zynga, who participated in the beta testing, says, "The Data API would be an excellent option for our services that will use Amazon Redshift programmatically. The main improvement would be authentication with IAM roles without having to involve the JDBC/ODBC drivers since they are all AWS hosted. Our most common service client environments are PHP, Python, Go, plus a few more."
Conclusion
In this post, we introduced you to the newly launched Amazon Redshift Data API. We also demonstrated how to use the Data API from the Amazon Redshift CLI and Python using the AWS SDK. We also provided best practices for using the Data API.
To learn more, see Using the Amazon Redshift Data API or visit the Data API GitHub repository for code examples.
About the Authors
Debu Panda, a Principal Product Manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry, has published numerous articles on analytics, enterprise Java, and databases, and has presented at multiple conferences. He is lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).
Martin Grund is a Principal Engineer working in the Amazon Redshift team on all topics related to data lake (e.g., Redshift Spectrum), AWS platform integration, and security.
Chao Duan is a software development manager at Amazon Redshift, where he leads the development team focusing on enabling self-maintenance and self-tuning with comprehensive monitoring for Redshift. Chao is passionate about building high-availability, high-performance, and cost-effective databases to empower customers with data-driven decision making.
Daisy Yanrui Zhang is a software Dev Engineer working in the Amazon Redshift team on database monitoring, serverless database and database user experience.
Source: https://aws.amazon.com/blogs/big-data/using-the-amazon-redshift-data-api-to-interact-with-amazon-redshift-clusters/