Data is Nutritious

Data Engineer's Memo

parquet-tools is easy and useful

TL;DR: I installed parquet-tools and tried it out. It's easy to install and useful for fetching Parquet files on Amazon S3. How to install parquet-tools: The original Apache parquet-tools is not easy to use since it has to be built with Java. But it's …

Speeding up URL forward-matching queries by splitting schema

Introduction: In data processing, we often write queries with conditions on URLs. For example, using Google Analytics URL parameters, you can measure where your site's users come from (search engine, listing ad, display ad, etc.). Forward-mat…

AWS Glue's GetPartition API is slow for tables with many partitions

Introduction: AWS Glue is a very useful Hive Metastore service for people using Hive on EMR, Spark on EMR, or Presto on Athena. I found that fetching partitions is very slow, especially for tables with many partitions. Technically, users need to cal…