How to work with Hive tables with a lot of partitions from Spark

One common practice for improving the performance of Hive queries is partitioning. Partitions are simply parts of the data separated by the values of one or more fields. Creating a partitioned table is simple.
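As a rough illustration, here is a small Python sketch that builds a HiveQL `CREATE TABLE ... PARTITIONED BY` statement. The table name `events`, the data columns, and the exact partition columns (year, month, day, hour, country) are assumptions made for this example; the article only tells us that the table is partitioned by time down to the hour and by country.

```python
# Hypothetical table: the name, data columns, and partition columns below
# are assumptions for illustration, not taken from the original article.
def create_table_ddl(table, columns, partition_columns):
    """Build a HiveQL CREATE TABLE statement with a PARTITIONED BY clause."""
    cols = ", ".join(f"{name} {type_}" for name, type_ in columns)
    parts = ", ".join(f"{name} {type_}" for name, type_ in partition_columns)
    return (
        f"CREATE TABLE {table} ({cols}) "
        f"PARTITIONED BY ({parts}) STORED AS PARQUET"
    )


ddl = create_table_ddl(
    "events",
    columns=[("user_id", "BIGINT"), ("payload", "STRING")],
    partition_columns=[
        ("year", "INT"),
        ("month", "INT"),
        ("day", "INT"),
        ("hour", "INT"),
        ("country", "STRING"),
    ],
)
print(ddl)
```

Note that the partition columns are declared only in the `PARTITIONED BY` clause, not in the regular column list.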

On HDFS, this creates a nested folder structure with one directory level per partition field.
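To make the layout concrete, the sketch below builds the directory path of a single partition using the `key=value` folder naming that Hive uses for partitions. The table root `/warehouse/events` and the partition values are made up for the example.

```python
def partition_path(table_root, partition):
    """Join key=value directory names in partition-column order."""
    folders = "/".join(f"{key}={value}" for key, value in partition.items())
    return f"{table_root}/{folders}"


# Hypothetical table location and partition values.
path = partition_path(
    "/warehouse/events",
    {"year": 2019, "month": 3, "day": 15, "hour": 7, "country": "US"},
)
print(path)  # /warehouse/events/year=2019/month=3/day=15/hour=7/country=US
```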

So every time we use the partition fields in a query, Hive knows exactly which folders to search for data. This can significantly speed up execution, because instead of a full scan the Hive engine reads only part of the data.

Win? Not quite. We have another problem: many recommendations say to limit the number of partitions to about 10,000. Let's calculate how many partitions our table could have per year.
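A quick back-of-the-envelope version of that calculation is sketched below. The 12 x 30 x 24 breakdown matches the 8,640 figure used in this article; the number of countries is not stated anywhere, so the value of 120 here is only an assumption chosen to land near "about 1MM".

```python
MONTHS_PER_YEAR = 12
DAYS_PER_MONTH = 30      # rounded, as implied by the 8,640 figure
HOURS_PER_DAY = 24
COUNTRIES = 120          # assumption: the article does not give this number
RECOMMENDED_LIMIT = 10_000

hourly_partitions = MONTHS_PER_YEAR * DAYS_PER_MONTH * HOURS_PER_DAY
with_country = hourly_partitions * COUNTRIES

print(hourly_partitions)                 # 8640
print(with_country)                      # 1036800
print(with_country / RECOMMENDED_LIMIT)  # 103.68
```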

About 1MM partitions is far more than the recommended 10K! OK, we can remove country from the partitioning, which gets us 8,640 partitions per year. Much better. But what if we need data for 2, 3, 5, or 10 years? We could also drop the by-hour partitioning, but then our queries become slower, or maybe we load data by hour and sometimes need to reload some hours. The solution is simple: keep our partitioning structure as is. Hive can work efficiently even with 1MM partitions, but with some reservations.

Let's review the Hive architecture.

What are the main parts here?

Spark implements its own SQL Thrift Server and interacts with the Metastore (the Schema Catalog, in Spark terms) directly.

When HiveServer builds an execution plan for a partitioned table, it requests data about the available partitions, and it has two methods for doing so.

With listPartitionsByFilter, the part of the filter related to the partition columns is used by the Metastore Server to return only the required partitions.
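The difference between fetching every partition and fetching by filter can be sketched with a toy in-memory model. This is not the Hive Metastore API: the function names, the partition columns, and the list of dictionaries standing in for the Metastore are all hypothetical, and the point is only to show how much less data comes back when the filter is applied on the Metastore side.

```python
from itertools import product

# Toy in-memory stand-in for the Metastore's partition list.
# Partition columns (year, month, day, hour) are assumed for illustration.
PARTITIONS = [
    {"year": 2019, "month": m, "day": d, "hour": h}
    for m, d, h in product(range(1, 13), range(1, 31), range(24))
]


def list_all_partitions():
    """Return every partition; the caller has to prune them itself."""
    return list(PARTITIONS)


def list_partitions_by_filter(**wanted):
    """Return only partitions whose column values match the filter."""
    return [
        p for p in PARTITIONS
        if all(p[col] == val for col, val in wanted.items())
    ]


everything = list_all_partitions()
one_day = list_partitions_by_filter(year=2019, month=3, day=15)
print(len(everything), len(one_day))  # 8640 24
```

With a year of hourly partitions, a query for a single day needs 24 partition entries instead of 8,640, and the gap grows with every extra partition field.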

So, to have fast queries, we need to be sure that the listPartitionsByFilter method is used. For Spark to use it, the spark.sql.hive.metastorePartitionPruning option must be enabled.
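One way to pass this option is as a `--conf` flag to `spark-submit`. The sketch below just assembles such a command line in Python; the script name `my_job.py` is a placeholder, and how you actually launch jobs in your environment may differ.

```python
import shlex

# The option named in the text; "true" turns on metastore-side pruning.
SPARK_CONF = {"spark.sql.hive.metastorePartitionPruning": "true"}


def spark_submit_command(script, conf):
    """Build a spark-submit command line with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return " ".join(shlex.quote(arg) for arg in args)


# "my_job.py" is a hypothetical script name.
command = spark_submit_command("my_job.py", SPARK_CONF)
print(command)
```

The same key and value can also be set when building the Spark session or in the cluster's default Spark configuration.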

To make sure that everything works correctly, you can set the Hive Metastore log level to INFO and search the log for the lines that show which partition-listing method was called.
