Data is Nutritious

Data Engineer's Memo

AWS Glue's GetPartition API is slow for table with much Partitions.

Introduction

AWS Glue is very useful Hive Metastore service for people using Hive on EMR / Spark on EMR / Presto on Athena. I felt that fetching partitions is very slow, especially tables with much partitions. Technically users need to call API many times because the api does not response all partitions at once, it response some of all partitions with next token. Query for table with much partitions may need much api call, I felt. So, in this article I measure the number of api call to check it.

Conclusion

f:id:ktr89:20190901151711p:plain

The more partitions table has, the slower the query is.

Suggestion

  • If you use Glue, you should not create too much parttitions. Your query will be slow.
  • You can create 10,000,000 partitions, but it's not realistic(API Limit).
  • I hope that AWS fix this problem.

How to measure

  1. create table

create table

session.client('glue').create_table(
    CatalogId=CATALOG_ID,
    DatabaseName=DATABASE_NAME,
    TableInput={
        'Description': 'Description of table',
        'Name': TABLE_NAME,
        'Parameters': {'EXTERNAL': 'TRUE'},
        'PartitionKeys': [
            {
                'Name': 'patition_id',
                'Type': 'int'
            },
        ],
        'Retention': 0,
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'col1', 'Type': 'string'},
            ],
            'Compressed': True,
            'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
            'Location': LOCATION,
            'NumberOfBuckets': 0,
            'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat',
            'SerdeInfo': {
                'Parameters': {},
                'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
            },
            'SkewedInfo': {
                'SkewedColumnNames': [],
                'SkewedColumnValueLocationMaps': {},
                'SkewedColumnValues': []
            },
            'SortColumns': [],
            'StoredAsSubDirectories': False
        },
        'TableType': 'EXTERNAL_TABLE',
    }
)

2. Create partitions and measure the number of API call to fetch specific partition

Python script

def add_patition(session, id_from, id_to):
    partitions = [
        partition(j)
        for j in range(id_from,  id_to)
    ]
    session.client('glue').batch_create_partition(
        CatalogId=CATALOG_ID,
        DatabaseName=DATABASE_NAME,
        TableName=TABLE_NAME,
        PartitionInputList=partitions
    )

def get_partition(session):
    def f(next_token=''):
        return session.client('glue').get_partitions(
            CatalogId=CATALOG_ID,
            DatabaseName=DATABASE_NAME,
            TableName=TABLE_NAME,
            Expression=f"patition_id=1000",
            NextToken=next_token
        )
    next_token = ''
    partitions = []
    n_api_call = 0
    while True:
        res = f(next_token)
        n_api_call += 1
        if 'Partitions' in res:
            partitions.extend(res['Partitions'])
        if 'NextToken' not in res:
            break
        next_token = res['NextToken']
    return partitions, n_api_call

for i in range(100):
    add_patition(session, i * 100, (i + 1) * 100)
    ps, cnt = get_partition(session) 
    print((i + 1) * 100, cnt)

stdout is below

stdout

100 2
200 2
300 2
400 2
500 2
600 3
700 3
800 3
900 3
1000 3
1100 4
1200 4
1300 4
1400 4
1500 4
1600 5
1700 5
1800 5
1900 5
2000 5
2100 6
2200 6
2300 6
2400 6
2500 6
2600 7
2700 7
2800 7
2900 7
3000 7
3100 8
3200 8
3300 8
3400 8
3500 8
3600 9
3700 9
3800 9
3900 9
4000 9
4100 10
4200 10
4300 10
4400 10
4500 10
4600 11
4700 11
4800 11
4900 11
5000 11
5100 12
5200 12
5300 12
5400 12
5500 12
5600 13
5700 13
5800 13
5900 13
6000 13
6100 14
6200 14
6300 14
6400 14
6500 14
6600 15
6700 15
6800 15
6900 15
7000 15
7100 16
7200 16
7300 16
7400 16
7500 16
7600 17
7700 17
7800 17
7900 17
8000 17
8100 18
8200 18
8300 18
8400 18
8500 18
8600 19
8700 19
8800 19
8900 19
9000 19
9100 20
9200 20
9300 20
9400 20
9500 20
9600 21
9700 21
9800 21
9900 21
10000 21

This article is English version of https://ktr89.hateblo.jp/entry/2019/09/01/152131