Data is Nutritious

Data Engineer's Memo

parquet-tools is easy and useful

TL;DR

  • I installed parquet-tools and tried it out.
  • It's easy to install and useful for reading Parquet files on Amazon S3.

How to install parquet-tools

The original Apache parquet-tools is not easy to use because it has to be built with Java.

But this one is simple to install. Just run:

pip install parquet-tools

How to use it

Show parquet file contents.

parquet-tools show /path/to/parquet
+-------+-------+---------+
|   one | two   | three   |
|-------+-------+---------|
|  -1   | foo   | True    |
| nan   | bar   | False   |
|   2.5 | baz   | True    |
+-------+-------+---------+

Show parquet file schema.

parquet-tools inspect /path/to/parquet

############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 3
num_rows: 3
num_row_groups: 1
format_version: 1.0
serialized_size: 2226


############ Columns ############
one
two
three

############ Column(one) ############
name: one
path: one
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE

############ Column(two) ############
name: two
path: two
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8

############ Column(three) ############
name: three
path: three
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE

Speeding up URL forward-matching Query by splitting schema

Introduction

In a data processing context, we often run queries with a URL condition. For example, using Google Analytics URL parameters you can measure where your site's users come from (search engine, listing ad, display ad, etc.). A forward-matching (prefix) query is useful for that.

In this article, I try to speed up SQL like the query below.

select uid, url
from accesslog
where regexp_like(url, '^http://1.example.com/.*$')
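
To make the idea concrete, here is a minimal Python sketch of one way to read "splitting" a URL: break each URL into scheme, host, and path, then replace the regular-expression prefix match with plain equality on the split parts. This is an assumption about the approach, not the article's actual implementation; the `accesslog` rows, `split_url`, and `forward_match` are hypothetical names used only for illustration.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into (scheme, host, path) so each part could be its own column."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


# Hypothetical access-log rows: (uid, url)
accesslog = [
    ("u1", "http://1.example.com/index.html"),
    ("u2", "http://2.example.com/index.html"),
    ("u3", "http://1.example.com/items/42?utm_source=search"),
]


def forward_match(rows, scheme, host):
    """Approximate the prefix condition with equality checks on the split parts."""
    matched = []
    for uid, url in rows:
        row_scheme, row_host, row_path = split_url(url)
        # The original pattern requires a "/" after the host, so require a path.
        if row_scheme == scheme and row_host == host and row_path.startswith("/"):
            matched.append((uid, url))
    return matched


result = forward_match(accesslog, "http", "1.example.com")
```

In a query engine, the same idea would mean storing the host in its own column so the filter becomes an equality comparison instead of a regular-expression scan. Whether that is faster depends on the engine and the data layout.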

AWS Glue's GetPartition API is slow for tables with many partitions

Introduction

AWS Glue is a very useful Hive Metastore service for people using Hive on EMR, Spark on EMR, or Presto on Athena. However, I felt that fetching partitions is very slow, especially for tables with many partitions. Technically, users need to call the API many times because it does not return all partitions at once; it returns a subset of the partitions together with a next token. I suspected that a query against a table with many partitions needs many API calls. So, in this article, I measure the number of API calls to check.
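
The next-token pattern described above can be sketched as a small loop that keeps calling until no token comes back, counting calls along the way. This is only an illustration: `make_fake_api` is a hypothetical in-memory stand-in for the real service, and the page size of 1000 is an assumed value, not a documented Glue limit. The response keys `Partitions` and `NextToken` follow the shape returned by the Glue partition-listing call.

```python
def count_api_calls(fetch_page):
    """Follow next tokens until exhausted; return (all_items, number_of_calls)."""
    items = []
    calls = 0
    token = None
    while True:
        page = fetch_page(token)
        calls += 1
        items.extend(page["Partitions"])
        token = page.get("NextToken")
        if not token:
            break
    return items, calls


def make_fake_api(total, page_size):
    """Hypothetical stand-in for a paginated partition-listing API."""
    def fetch_page(token):
        start = int(token) if token else 0
        end = min(start + page_size, total)
        page = {"Partitions": list(range(start, end))}
        if end < total:
            # More partitions remain, so hand back a token for the next call.
            page["NextToken"] = str(end)
        return page
    return fetch_page


# Assumed numbers: 2500 partitions, 1000 per page.
partitions, calls = count_api_calls(make_fake_api(total=2500, page_size=1000))
print(len(partitions), calls)
```

With a real client, `fetch_page` would wrap the actual API call and pass the token through. The number of calls grows with the partition count divided by the page size, which is why tables with many partitions need many round trips.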
