Data is Nutritious

Data Engineer's Memo

Speeding up URL forward-matching Query by splitting schema

Introduction

In data processing context, we often use query with URL condition. For example, using Google Analytics URL parameters you can measure where your site's users are from(Search Engine, Listing Ad or Display Ad, etc.). Forward-matching query is useful for that query.

In this article, I try to speed up SQL like below.

select uid, url
from accesslog
where regexp_like(url "^http://1.example.com/.*$")

Hypothesis

In my opinion, comparing first 7 or 8 bytes is worthless since in access log file or table, url value starts with http:// or https:// for all records. Splitting URL into 2 columns( schema and others ) may speed up forward-matching query. It will somewhat effective as access log has very huge records.

Experiment in Scala

In order to check my hypothesis, I chose Scala and make two types of log file.

Experiment Data

Access Log1

http://0.example.com
http://1.example.com
...
http://999999.example.com

Access Log2

0.example.com
1.example.com
...
999999.example.com

Scala code

import org.scalameter._
object main extends App{

  val rmSchemes = {
    for{
      i <- 0 until 1000000
    } yield i.toString + ".example.com"
  }.toList

  val allUrls = rmSchemes.map(i => "http://" + i)


  val standardConfig = config(
    Key.exec.minWarmupRuns -> 5,
    Key.exec.maxWarmupRuns -> 20,
    Key.exec.benchRuns -> 10,
    Key.verbose -> false
  ) withWarmer(new Warmer.Default)

  val time1 = standardConfig measure {
    allUrls.filter(url => url matches ("^http://1.example.com.*"))
  }
  println(time1)


  val time2 = standardConfig measure {
    rmSchemes.filter(url => url matches ("^1.example.com.*"))
  }
  println(time2)
}
sbt:compare> run
[info] Compiling 1 Scala source to /home/ec2-user/environment/compare/target/scala-2.12/classes ...
[info] Done compiling.
[info] Packaging /home/ec2-user/environment/compare/target/scala-2.12/compare_2.12-0.1.0-SNAPSHOT.jar ...
[info] Done packaging.
[info] Running main 
3706.5724933000006 ms
3387.8245134 ms

Conclusion

Access Log Type avg time
Access Log1 3706 ms
Access Log2 3387ms
  • Splitting url into two columns may somewhat speed up forward-matching query.

This article is English version of https://ktr89.hateblo.jp/entry/2019/01/19/024115