AWS Glue's GetPartition API is slow for table with much Partitions.
Introduction
AWS Glue is very useful Hive Metastore service for people using Hive on EMR / Spark on EMR / Presto on Athena. I felt that fetching partitions is very slow, especially tables with much partitions. Technically users need to call API many times because the api does not response all partitions at once, it response some of all partitions with next token. Query for table with much partitions may need much api call, I felt. So, in this article I measure the number of api call to check it.
Conclusion
The more partitions table has, the slower the query is.
Suggestion
- If you use Glue, you should not create too much parttitions. Your query will be slow.
- You can create 10,000,000 partitions, but it's not realistic(API Limit).
- I hope that AWS fix this problem.
How to measure
- create table
session.client('glue').create_table( CatalogId=CATALOG_ID, DatabaseName=DATABASE_NAME, TableInput={ 'Description': 'Description of table', 'Name': TABLE_NAME, 'Parameters': {'EXTERNAL': 'TRUE'}, 'PartitionKeys': [ { 'Name': 'patition_id', 'Type': 'int' }, ], 'Retention': 0, 'StorageDescriptor': { 'Columns': [ {'Name': 'col1', 'Type': 'string'}, ], 'Compressed': True, 'InputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat', 'Location': LOCATION, 'NumberOfBuckets': 0, 'OutputFormat': 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat', 'SerdeInfo': { 'Parameters': {}, 'SerializationLibrary': 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' }, 'SkewedInfo': { 'SkewedColumnNames': [], 'SkewedColumnValueLocationMaps': {}, 'SkewedColumnValues': [] }, 'SortColumns': [], 'StoredAsSubDirectories': False }, 'TableType': 'EXTERNAL_TABLE', } )
2. Create partitions and measure the number of API call to fetch specific partition
def add_patition(session, id_from, id_to): partitions = [ partition(j) for j in range(id_from, id_to) ] session.client('glue').batch_create_partition( CatalogId=CATALOG_ID, DatabaseName=DATABASE_NAME, TableName=TABLE_NAME, PartitionInputList=partitions ) def get_partition(session): def f(next_token=''): return session.client('glue').get_partitions( CatalogId=CATALOG_ID, DatabaseName=DATABASE_NAME, TableName=TABLE_NAME, Expression=f"patition_id=1000", NextToken=next_token ) next_token = '' partitions = [] n_api_call = 0 while True: res = f(next_token) n_api_call += 1 if 'Partitions' in res: partitions.extend(res['Partitions']) if 'NextToken' not in res: break next_token = res['NextToken'] return partitions, n_api_call for i in range(100): add_patition(session, i * 100, (i + 1) * 100) ps, cnt = get_partition(session) print((i + 1) * 100, cnt)
stdout is below
100 2 200 2 300 2 400 2 500 2 600 3 700 3 800 3 900 3 1000 3 1100 4 1200 4 1300 4 1400 4 1500 4 1600 5 1700 5 1800 5 1900 5 2000 5 2100 6 2200 6 2300 6 2400 6 2500 6 2600 7 2700 7 2800 7 2900 7 3000 7 3100 8 3200 8 3300 8 3400 8 3500 8 3600 9 3700 9 3800 9 3900 9 4000 9 4100 10 4200 10 4300 10 4400 10 4500 10 4600 11 4700 11 4800 11 4900 11 5000 11 5100 12 5200 12 5300 12 5400 12 5500 12 5600 13 5700 13 5800 13 5900 13 6000 13 6100 14 6200 14 6300 14 6400 14 6500 14 6600 15 6700 15 6800 15 6900 15 7000 15 7100 16 7200 16 7300 16 7400 16 7500 16 7600 17 7700 17 7800 17 7900 17 8000 17 8100 18 8200 18 8300 18 8400 18 8500 18 8600 19 8700 19 8800 19 8900 19 9000 19 9100 20 9200 20 9300 20 9400 20 9500 20 9600 21 9700 21 9800 21 9900 21 10000 21
This article is English version of https://ktr89.hateblo.jp/entry/2019/09/01/152131