Introduction

AWS Glue is a very useful Hive Metastore service for people who use Hive on EMR, Spark on EMR, or Presto on Athena. However, I felt that fetching partitions is very slow, especially for tables with many partitions. Technically, users need to call the API many times because it does not return all partitions at once: each response contains only some of the partitions, together with a next token for fetching the rest. I suspected that a query against a table with many partitions needs many API calls, so in this article I measure the number of API calls to check.

Conclusion

The more partitions a table has, the slower the query is.

Suggestion

If you use Glue, you should not create too many partitions, or your queries will be slow.
You can create up to 10,000,000 partitions, but that is not realistic because of the API limit.
I hope that AWS fixes this problem.

How to measure

1. Create table

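import boto3

# Placeholder values; replace them with your own account and resources.
session = boto3.Session()
CATALOG_ID = '123456789012'  # AWS account ID that owns the Data Catalog
DATABASE_NAME = 'example_db'
TABLE_NAME = 'example_table'
LOCATION = 's3://example-bucket/example_table/'

session.client('glue').create_table(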
    CatalogId=CATALOG_ID,
    DatabaseName=DATABASE_NAME,
    TableInput={
        'Description': 'Description of table',
        'Name': TABLE_NAME,
        'Parameters': {'EXTERNAL': 'TRUE'},
        'PartitionKeys': [
            {
                'Name': 'patition_id',
                'Type': 'int'
            },
        ],
        'Retention': 0,
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'col1', 'Type': 'string'},
            ],
            'Compressed': True,
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'Location': LOCATION,
            'NumberOfBuckets': 0,
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'Parameters': {},
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            },
            'SkewedInfo': {
                'SkewedColumnNames': [],
                'SkewedColumnValueLocationMaps': {},
                'SkewedColumnValues': []
            },
            'SortColumns': [],
            'StoredAsSubDirectories': False
        },
        'TableType': 'EXTERNAL_TABLE',
    }
)

2. Create partitions and measure the number of API calls needed to fetch a specific partition

Python script

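def add_partition(session, id_from, id_to):
    # Register partitions id_from .. id_to - 1 in a single request.
    # batch_create_partition accepts at most 100 partitions per call.
    # partition(j) must return a PartitionInput dict for partition id j.
    partitions = [partition(j) for j in range(id_from, id_to)]
    session.client('glue').batch_create_partition(
        CatalogId=CATALOG_ID,
        DatabaseName=DATABASE_NAME,
        TableName=TABLE_NAME,
        PartitionInputList=partitions
    )

def get_partition(session):
    # Fetch the single partition patition_id=1000 and count how many
    # GetPartitions calls it takes to exhaust the NextToken chain.
    glue = session.client('glue')
    kwargs = dict(
        CatalogId=CATALOG_ID,
        DatabaseName=DATABASE_NAME,
        TableName=TABLE_NAME,
        # 'patition_id' is spelled as in the table definition.
        Expression='patition_id=1000',
    )
    partitions = []
    n_api_call = 0
    while True:
        res = glue.get_partitions(**kwargs)
        n_api_call += 1
        partitions.extend(res.get('Partitions', []))
        if 'NextToken' not in res:
            break
        # Pass the token only when the previous response returned one.
        kwargs['NextToken'] = res['NextToken']
    return partitions, n_api_call

# Add 100 partitions per round (100 rounds, 10,000 partitions in total).
# After each round, print the partition count and the number of API calls.
for i in range(100):
    add_partition(session, i * 100, (i + 1) * 100)
    ps, cnt = get_partition(session)
    print((i + 1) * 100, cnt)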
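
The script calls a helper named partition(j) that the article does not show. Below is a minimal sketch of what such a helper might look like, assuming each partition lives under a Hive-style path such as `.../patition_id=<id>` and reuses the table's Parquet settings. The `EXAMPLE_LOCATION` value and the `base_location` parameter are hypothetical and are not taken from the original script. The helper would need to be defined before the loop runs.

```python
# Hypothetical base path; the article does not show the real LOCATION value.
EXAMPLE_LOCATION = 's3://example-bucket/example_table'


def partition(partition_id, base_location=EXAMPLE_LOCATION):
    """Build a PartitionInput dict for one value of the 'patition_id' key."""
    return {
        # Glue expects partition values as strings, one per partition key.
        'Values': [str(partition_id)],
        'StorageDescriptor': {
            'Columns': [{'Name': 'col1', 'Type': 'string'}],
            # Assumed Hive-style layout: <base>/patition_id=<id>
            'Location': f'{base_location}/patition_id={partition_id}',
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            },
        },
    }


# Example: the inputs for the first batch of 100 partitions.
first_batch = [partition(j) for j in range(0, 100)]
```

No data files need to exist under these paths for the measurement, since only the catalog metadata is queried.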

stdout is below

stdout

This article is the English version of https://ktr89.hateblo.jp/entry/2019/09/01/152131

Data is Nutritious

Data Engineer's Memo

AWS Glue's GetPartitions API is slow for tables with many partitions.
