Speeding up URL forward-matching Query by splitting schema
Introduction
In data processing context, we often use query with URL condition. For example, using Google Analytics URL parameters you can measure where your site's users are from(Search Engine, Listing Ad or Display Ad, etc.). Forward-matching query is useful for that query.
In this article, I try to speed up SQL like below.
select uid, url from accesslog where regexp_like(url "^http://1.example.com/.*$")
Hypothesis
In my opinion, comparing first 7 or 8 bytes is worthless since in access log file or table, url value starts with http://
or https://
for all records.
Splitting URL into 2 columns( schema and others ) may speed up forward-matching query.
It will somewhat effective as access log has very huge records.
Experiment in Scala
In order to check my hypothesis, I chose Scala and make two types of log file.
Experiment Data
Access Log1
http://0.example.com http://1.example.com ... http://999999.example.com
Access Log2
0.example.com 1.example.com ... 999999.example.com
Scala code
import org.scalameter._ object main extends App{ val rmSchemes = { for{ i <- 0 until 1000000 } yield i.toString + ".example.com" }.toList val allUrls = rmSchemes.map(i => "http://" + i) val standardConfig = config( Key.exec.minWarmupRuns -> 5, Key.exec.maxWarmupRuns -> 20, Key.exec.benchRuns -> 10, Key.verbose -> false ) withWarmer(new Warmer.Default) val time1 = standardConfig measure { allUrls.filter(url => url matches ("^http://1.example.com.*")) } println(time1) val time2 = standardConfig measure { rmSchemes.filter(url => url matches ("^1.example.com.*")) } println(time2) }
sbt:compare> run [info] Compiling 1 Scala source to /home/ec2-user/environment/compare/target/scala-2.12/classes ... [info] Done compiling. [info] Packaging /home/ec2-user/environment/compare/target/scala-2.12/compare_2.12-0.1.0-SNAPSHOT.jar ... [info] Done packaging. [info] Running main 3706.5724933000006 ms 3387.8245134 ms
Conclusion
Access Log Type | avg time |
---|---|
Access Log1 | 3706 ms |
Access Log2 | 3387ms |
- Splitting url into two columns may somewhat speed up forward-matching query.
This article is English version of https://ktr89.hateblo.jp/entry/2019/01/19/024115