
Getting Started with Presto & Hive on AWS

September 29, 2023

Omotayo Akinbode

Today I will show you how to get started with Hive and Presto. I compare the two tools by looking at the strengths of each and their use cases. At the end of this post, I walk through a scenario in which we can leverage both tools together. Let's dive right in!

What is Apache Hive?

Apache Hive is a data warehousing tool that sits on top of Apache Hadoop for querying and analyzing large datasets. Hive uses a SQL-like language (HiveQL) to query data, making it easy to explore and analyze big data.

Hive's original release was primarily focused on leveraging data that existed within HDFS. However, data analytics storage and access patterns have evolved since its inception, and Hive has become a valuable tool for building out external table structures for querying objects on storage platforms like Amazon S3.

What is Presto?

Presto is a distributed SQL query engine that is used for querying datasets from multiple sources, including Hadoop, S3, MySQL, Teradata, and other relational and non-relational databases.

Presto was developed by Facebook to run queries against multiple data stores with response times ranging from sub-second to minutes. Other companies using Presto include Netflix, Airbnb, and Dropbox.

One of the main advantages of Presto is that it supports concurrent query workloads. Being able to run multiple queries at the same time makes it a go-to engine for BI users.

Amazon Athena is essentially a Presto service managed by AWS. Presto can also be deployed as an application on an Amazon EMR cluster.


Presto vs Hive

  • Presto is best suited for quick, interactive data analysis that does not require large amounts of memory.
  • Presto is designed for low latency, while Hive is designed for high query throughput and queries that require very large amounts of memory.
  • Both tools can be used to explore data sitting on top of a Hadoop system.
  • Apache Hive uses a language similar to SQL (HiveQL), but it has enough differences that BI users may need to learn new syntax for some queries.

Using Hive and Presto

Presto does not include built-in support for the Hadoop file system, so it needs to leverage other tools such as the Hive connector (also known as HCatalog) to access that data.

One of the strengths of Presto is that it is well suited to star schema models.

The use case below shows how to leverage Hive and Presto together when working with a NoSQL (key-value) store, partitioned Parquet tables, and file-system data paths.

Note: If you plan on leveraging Presto as a reporting query engine, it can be used against both flattened and dimensional models. However, if Presto is querying dimensional report models, or you are planning on joining datasets together, you will want to consider the sizes of the tables being joined and the Presto configuration to ensure optimal performance.

High memory usage and poor performance are commonly seen when Presto is not configured correctly for your use case, so it is worth reviewing the tuning documentation here: https://teradata.github.io/presto/docs/141t/admin/tuning.html

Example of How to use Hive and Presto


  • Let's say we need to create a view that joins a fact table (Parquet data) with multiple small dimension tables.
  • The ideal engine for this would be Presto; however, Presto can't explore the Parquet tables directly.
  • We need another tool, such as Hive, that allows us to create external tables over the Parquet data.
  • With this setup we can read the other file-system data sources with Presto.

In this example we have three dimension tables stored in a key-value database and a partitioned Parquet fact table.

Creating the Fact Partitioned Table with Spark

# Assumes an active SparkSession named `spark`, as in the pyspark shell or an EMR notebook
data = [(1, 1001, 'SMALL8', '', 'USD', 10, 2020, 9),
        (2, 1002, 'MED9', '', 'CAD', 17, 2019, 10),
        (3, 1003, 'LARGE10', '', 'CAD', 90, 2019, 10)]
columns = ['uid', 'productid', 'sizeid', 'description', 'currency', 'quantity', 'year', 'month']
df = spark.createDataFrame(data, columns)

Writing Parquet Data with Spark

# `data_location` is the target path for the Parquet data, e.g. an S3 or HDFS URI
df.write.format("parquet") \
    .mode("overwrite") \
    .option("key", "uid") \
    .partitionBy("year", "month") \
    .save(data_location)

Note: when working with partitioned Parquet tables, the partition fields need to be lowercase before you can create an external table over them in Hive.
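If your DataFrame column names are not already lowercase, one simple way to normalize them before writing is a rename pass like the sketch below (this reuses the df defined above):

# Lowercase every column name, including the year/month partition fields
df = df.toDF(*[c.lower() for c in df.columns])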

Setup Hive

To set up Hive on AWS, you can deploy Hive directly on your Amazon EMR cluster.
This link provides instructions on how that can be done: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html
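As a rough sketch, you can also launch an EMR cluster with Hive and Presto installed from Python using boto3; the cluster name, release label, region, and instance settings below are illustrative assumptions, not required values:

import boto3

# Launch a small EMR cluster with Hive, Presto, and Spark installed
emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='hive-presto-demo',                  # illustrative cluster name
    ReleaseLabel='emr-5.31.0',                # pick a release that ships Hive and Presto
    Applications=[{'Name': 'Hive'}, {'Name': 'Presto'}, {'Name': 'Spark'}],
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)

print(response['JobFlowId'])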


Additional Fields

If you have extra fields you want to add to the table in the future, these fields can be included when creating the external table. Hive allows us to define fields that are not yet available in the underlying Parquet data.
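As a minimal sketch of what that external table definition might look like, run here through spark.sql on an EMR cluster where Spark shares the Hive metastore (the table name fact_sales, the S3 location, and the extra discount field are illustrative assumptions); the same statements can also be run directly in the Hive shell:

# Define an external Hive table over the partitioned Parquet data
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS fact_sales (
        uid INT,
        productid INT,
        sizeid STRING,
        description STRING,
        currency STRING,
        quantity INT,
        discount DOUBLE  -- extra field not yet present in the Parquet files
    )
    PARTITIONED BY (year INT, month INT)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/fact_sales/'
""")

# Register the existing year/month partitions with the Hive metastore
spark.sql("MSCK REPAIR TABLE fact_sales")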

Now that the external table is created, we can access the data in Presto.

Presto can also be set up on an Amazon EMR cluster. This link provides information on how to set it up: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html

Table in Hive, Data from Presto
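Here is a sketch of querying the Hive-defined table from Presto using the presto-python-client (prestodb) package; the host name, user, catalog, and schema are assumptions based on a default EMR Presto install:

import prestodb

# Connect to the Presto coordinator (Presto on EMR typically listens on port 8889)
conn = prestodb.dbapi.connect(
    host='emr-master-node',  # assumed EMR master node hostname
    port=8889,
    user='hadoop',
    catalog='hive',
    schema='default',
)

cur = conn.cursor()
cur.execute(
    "SELECT uid, productid, quantity "
    "FROM fact_sales "
    "WHERE year = 2019 AND month = 10"
)
print(cur.fetchall())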

Lastly, we create the view that joins the fact data and the key-value dimension data together.
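A sketch of such a view, reusing the Presto connection from the previous snippet; the kv catalog stands in for whatever key-value connector you have configured (for example Cassandra or Accumulo), and all dimension table and column names are illustrative:

# Create a Presto view (stored in the Hive metastore) joining fact and dimension data
cur.execute("""
    CREATE VIEW hive.default.sales_report AS
    SELECT f.uid,
           f.quantity,
           f.currency,
           p.product_name,
           s.size_label,
           c.exchange_rate
    FROM hive.default.fact_sales f
    JOIN kv.dims.product  p ON f.productid = p.productid
    JOIN kv.dims.size     s ON f.sizeid    = s.sizeid
    JOIN kv.dims.currency c ON f.currency  = c.currency_code
""")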

In conclusion, this example shows how to leverage Presto and Hive together: the external table is created in Hive, and the key-value data is joined directly in Presto.
