parquet-tools is easy and useful
TL;DR
- I installed parquet-tools and tried it out.
- It's easy to install and useful for reading Parquet files stored on Amazon S3.
How to install parquet-tools
The original Apache parquet-tools is not easy to use because it has to be built with Java.
The pip package, however, is simple to install. Just run:
pip install parquet-tools
How to use it
Show parquet file contents.
parquet-tools show /path/to/parquet
+-------+-------+---------+
| one   | two   | three   |
|-------+-------+---------|
| -1    | foo   | True    |
| nan   | bar   | False   |
| 2.5   | baz   | True    |
+-------+-------+---------+
Show parquet file schema.
parquet-tools inspect /path/to/parquet
############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 3
num_rows: 3
num_row_groups: 1
format_version: 1.0
serialized_size: 2226

############ Columns ############
one
two
three

############ Column(one) ############
name: one
path: one
max_definition_level: 1
max_repetition_level: 0
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE

############ Column(two) ############
name: two
path: two
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8

############ Column(three) ############
name: three
path: three
max_definition_level: 1
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
Speeding up URL forward-matching queries by splitting the schema
Introduction
In data processing, we often run queries with a condition on a URL. For example, using Google Analytics URL parameters, you can measure where your site's users come from (search engine, listing ad, display ad, etc.). A forward-matching (prefix) query is useful for this.
In this article, I try to speed up SQL like the one below.
select uid, url from accesslog where regexp_like(url, "^http://1.example.com/.*$")
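The excerpt does not show how the schema is split, so the following is only a minimal sketch of one possible reading: store the scheme and host of each URL as separate values, so that a prefix condition ending at the host boundary becomes a plain equality check instead of a regular expression. The sample rows and the helper names (`ACCESS_LOG`, `match_by_regex`, `match_by_split_columns`) are assumptions made for illustration and do not come from the article.

```python
import re
from urllib.parse import urlsplit

# Hypothetical access-log rows: (uid, url). Not taken from the article.
ACCESS_LOG = [
    ("u1", "http://1.example.com/index.html?utm_source=search"),
    ("u2", "http://2.example.com/index.html"),
    ("u3", "http://1.example.com/items/42"),
    ("u4", "https://1.example.com.evil.test/"),
]

PREFIX = "http://1.example.com/"


def match_by_regex(rows):
    """Forward matching with a regular expression, as in the SQL above."""
    # re.escape keeps the dots in the host name literal.
    pattern = re.compile("^" + re.escape(PREFIX) + ".*$")
    return [(uid, url) for uid, url in rows if pattern.match(url)]


def split_url(url):
    """Split a URL into (scheme, host, path), i.e. the 'split schema' columns."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


def match_by_split_columns(rows):
    """Same filter, expressed as equality on the scheme and host columns."""
    wanted_scheme, wanted_host, _ = split_url(PREFIX)
    result = []
    for uid, url in rows:
        scheme, host, _path = split_url(url)
        if scheme == wanted_scheme and host == wanted_host:
            result.append((uid, url))
    return result


if __name__ == "__main__":
    print(match_by_regex(ACCESS_LOG))
    print(match_by_split_columns(ACCESS_LOG))
```

In a real table the split would happen once at load time, so each query compares short column values rather than scanning every URL with a regular expression. The two filters agree here only because the prefix ends exactly at the host; a prefix that reaches into the path would still need a condition on the path column.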
AWS Glue's GetPartitions API is slow for tables with many partitions
Introduction
AWS Glue is a very useful Hive Metastore service for people using Hive on EMR, Spark on EMR, or Presto on Athena. I felt that fetching partitions is very slow, especially for tables with many partitions. Technically, users need to call the API many times because it does not return all partitions at once; it returns a subset of the partitions together with a next token. So I suspected that a query on a table with many partitions may need many API calls. In this article, I measure the number of API calls to check this.
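To make the next-token behaviour concrete, here is a minimal sketch that counts how many calls are needed to read every partition from a paginated API. It does not call AWS: `make_fake_get_partitions` is a hypothetical stand-in that only mimics the response shape (a `Partitions` list plus an optional `NextToken`), and the page size of 1000 is an assumption chosen for illustration.

```python
def make_fake_get_partitions(total_partitions, page_size):
    """Build a stand-in for a paginated API (hypothetical, not the real Glue client)."""
    partitions = [f"part={i}" for i in range(total_partitions)]

    def get_partitions(next_token=None):
        # The token is simply the offset of the next page in this fake.
        start = int(next_token) if next_token is not None else 0
        end = start + page_size
        response = {"Partitions": partitions[start:end]}
        if end < len(partitions):
            response["NextToken"] = str(end)
        return response

    return get_partitions


def fetch_all_partitions(get_partitions):
    """Follow next tokens until exhausted; return (partitions, number_of_calls)."""
    collected = []
    calls = 0
    token = None
    while True:
        response = get_partitions(next_token=token)
        calls += 1
        collected.extend(response["Partitions"])
        token = response.get("NextToken")
        if token is None:
            break
    return collected, calls


if __name__ == "__main__":
    fake_api = make_fake_get_partitions(total_partitions=2500, page_size=1000)
    partitions, calls = fetch_all_partitions(fake_api)
    print(len(partitions), "partitions fetched in", calls, "calls")
```

The number of calls grows with the partition count divided by the page size, which is why a table with many partitions can be slow to list even when each individual call is fast.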