12 months ago

A few years ago, I was assigned the task of extracting city and suburb names from our crawler results. I wrote a parser using a bunch of if/else statements and regular expressions. It mostly worked, except in some extreme cases. To handle those, I added more if statements and more obscure regular expressions. In the end, I felt the code had become unreadable.

But was I an incompetent programmer? A few months ago I read a blog post about using machine learning for address parsing, and I realized that my old approach of writing rules is not how our brains work. Many cases really require thinking in terms of likelihood ("if there are more than three characters followed by this, it is probably a street"). This is fuzzy logic, whereas my if/else statements and regular expressions are discrete logic operating on booleans.

So as a pet project, I decided to implement an address parser in Ruby. The Python community already has Parserator, so why not Rubyland? I am from Taiwan, so I also wanted to try applying it to addresses here.

I used the Conditional Random Fields model, though reading the Wikipedia article fried my brain:

I didn't understand any of it. However, I still kept my hopes up that I could just copy and paste something and it would work out eventually. We don't know how to manufacture a Lego block, but we can still build things with it without all the background knowledge, right?

The first step was to gather training data. My friend said that such data is confidential and can cost money, so I looked elsewhere. Eventually I found that people add address entries on a site called OpenStreetMap, and that regional data can be downloaded from a site called Gisgraphy. The files are in .pbf, which stands for Protocolbuffer Binary Format, so I used the pbf_parser gem to access the data inside. Not all of the data is addresses; some is bus routes and some is geometry. I wrote a parser to extract the addresses into a SQL database. There were around 15000 records.

Although in OSM people enter addresses in separate fields such as city and suburb, in practice there is no strict agreement on which field represents what. This is especially true in Eastern countries, where a few distinct levels have no English counterpart. People also put the full address in the street field, and the like. So I had to write scripts to boldly move data between columns and add new columns to match Taiwanese address rules. I feel I touched more than two thirds of the addresses. I call this part cleaning.

Once cleaning is done, all we have to do is feed the data in to train the model. Sylvester Keil wrote two RubyGems for CRF training, one of which is called wapiti. It is a wrapper around a C library of the same name. He was very kind and helped me when I wanted to know how to use the gem.

Eventually I was able to feed my data into wapiti and create a model file. Some East Asian languages do not separate phrases with space characters, so I had to chop each address into individual characters and feed those in. On the other end, when the model returns its result, I have to combine neighbouring characters with the same label back into a phrase.
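The splitting and merging steps can be sketched in plain Ruby. This is only an illustration and not the gem's actual code: the method names, the label names ("city", "road") and the sample address are all made up here, and the labelled pairs stand in for whatever the trained model would return.

```ruby
# Split an address into single characters, the unit fed to the model.
def to_characters(address)
  address.chars
end

# Merge neighbouring [character, label] pairs that share a label
# back into [phrase, label] pairs.
def merge_labeled(chars_with_labels)
  chars_with_labels
    .chunk_while { |(_c1, a), (_c2, b)| a == b }
    .map { |group| [group.map(&:first).join, group.first.last] }
end

# Hypothetical model output for "台北市中山路" (labels are assumptions).
labeled = to_characters("台北市中山路").zip(%w[city city city road road road])
phrases = merge_labeled(labeled)
# phrases is [["台北市", "city"], ["中山路", "road"]]
```

Working per character keeps the model independent of any word segmenter; the cost is that every phrase boundary has to be recovered from label changes, as above.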

The result was much better than I expected: it parses common addresses just fine, and all of this without me writing any rules at all. I created a website for people to try it out, http://addresstokenizer.lulalala.com/, so I can also gather some new data.

People do tell me about extreme cases where the tokenization fails. This being my first time writing something with machine learning, the feeling is quite different, something like this:

if result.wrong?
  say "Not me! It's its fault!
       The machine is too stupid to learn~~"
  guilt = 0 # do not feel guilty at all~
else
  say "Hehe"
  feels "complimented"
  happiness += 100
end

I have published a gem (https://github.com/lulalala/lulalala_address_tokenizer) along with a model file. The gem is intended for East Asian addresses (Chinese, Japanese and Korean), so if you are in one of these regions, please try creating your own model. Once you plug it in, it should just work. When I have time, I plan to put my training data online for others to correct.

over 1 year ago

We have a controller action with an error-handling branch.

  if error
    flash[:error] = render_to_string(partial: "foo_error").html_safe
    render :error # renders error.js.erb
  end

However, even though the JS template is rendered, the response content-type is still "text/html". (The request Accept header is set to
*/*;q=0.5, text/javascript, application/javascript, application/ecmascript, application/x-ecmascript
) This was very puzzling to us, because it had always worked before.

Later we found out that the root cause was our call to render_to_string. If we remove that call, Rails is again able to guess the content-type as text/javascript. Something in the Rails internals must set the type when render_to_string is called, even though its result is not directly used to render the response.

Using respond_to would also solve this issue, forcing Rails to return text/javascript as the content type.

  if error
    flash[:error] = render_to_string(partial: "foo_error").html_safe
    respond_to do |format|
      format.js { render :error }
    end
  end

I guess we should always specify respond_to, so the content-type can be deterministic.

almost 2 years ago

I attended Ruby Kaigi 2015 at the end of last year. Recordings of all the talks are now online: http://rubykaigi.org/2015/schedule



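The conference opened with Matz, who put forward the slogan "Ruby 3x3": the hope that Ruby 3 will be three times as fast as Ruby 2. In response, many talks revolved around this theme. Sometimes the same technique came up under different topics or at different levels of abstraction, which made it stick in my mind.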


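The second talk on the first day was IBM's Experiments in sharing Java VM technology with CRuby. Its subject was OMR, a technology IBM is developing that will be open-sourced soon. The project aims to open-source the modules of J9, IBM's own JVM engine, that language implementations commonly need, in the hope that other languages will adopt them in their own implementations. The benefit is that languages can share technology instead of reinventing the wheel, and when these modules improve, the languages get the improvements for free. On the third day IBM went further and introduced how OMR's GC works, along with other ways Ruby might be optimized. Clearly IBM is pushing this hard.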

For some speculation about OMR, there is an interesting Chinese-language article worth reading

Meanwhile, on the JRuby side, Oracle explained how Truffle makes metaprogramming fast as well


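The last talk on the last day echoed the earlier ones: the keynote Ruby 2020 by Evan Phoenix, the developer of Rubinius. Small-time developers like me used to think that if Ruby is slow, rewriting core functionality in C is the best way to speed it up. But that leaves the JIT-style acceleration techniques of recent years with nothing to work on, because the C code is a black box whose internal logic cannot be inspected, and these techniques rely heavily on knowing what the program will do next. Rewriting Ruby's core library in Ruby is one solution, but it is too big an undertaking to be realistic.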

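So Evan proposed processing the core library written in C with LLVM, so that Ruby and C end up at the same level, which gives techniques such as JIT room to work. Of course this needs more research, but the speaker himself believes it is feasible, which got a layman like me very excited. The talk was explained simply enough that even I, who know nothing about VMs, could follow what was going on. I highly recommend watching it.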

High Performance Template Engine: Guide to optimize your Ruby code

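Two developers from the same company each wrote their own Haml engine to make Haml faster. They discussed how template engines work, where Haml's bottlenecks are, and how to overcome them. Well suited to people who are already comfortable with Ruby; interesting and easy to follow.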

Charming Robots

Using Ruby to control a drone with a dance mat~ If I had to pick the most fun demo, this was it. Watch the video

Plugin-based software design with Ruby and RubyGems

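When writing open-source software, you often want what you build to be extensible, the way Firefox can install add-ons. This talk covered how Treasure Data approached a plugin architecture. The result is that plugins are themselves gems, with dependency management and so on. I hope to find the time to understand it properly someday.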

Turbo Rails with Rust

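An introduction to the Rust language and how Ruby can call into Rust code. I was successfully sold on it, and I kind of want to find time to learn it. What stuck with me most was his point that static dispatch is important, because it is the prerequisite for other optimizations.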




TRICK2015 results from mametter


Food and parties

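The organizer, Akira Matsuda (松田明), mentioned that the conference was held in Tsukiji because the Tsukiji market was due to close in 2016, so he hoped everyone would take the chance to try all kinds of sushi and seafood (which is also why sushi was the visual motif of this conference). Naturally, that meant dropping ¥2000 here and ¥3000 there. Delicious!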

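There were parties every day. Besides the organizers' own, companies hosted theirs too; I heard they started as early as the Tuesday before the conference. I went to the Heroku Pre-Party on Thursday, and the first person I chatted with was Paolo Perrotta, the author of Metaprogramming Ruby, who was very friendly.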

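One day there was karaoke in Ginza. It was a fancy place, so it cost ¥5000 (faints). I felt a bit sorry for some of the foreigners: everyone was singing anime songs and there was no notion of taking turns, so they hardly got to queue or sing anything. This kind of karaoke really should limit everyone to queueing one song at a time~ Still, watching JRuby's Charles Nutter sing the Hikaru no Go theme song while reading romaji off his phone screen was priceless~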

Ruby Kaigi 2016

The next Ruby Kaigi is confirmed for Kyoto, September 8 to 10. It is marked on my calendar and I am looking forward to it~

about 2 years ago

Naming view helper methods in Rails can be difficult. What do you think about the following helpers?


Since all helpers live in the same scope, each needs a different name to avoid collisions. This becomes tedious as the number of models grows.

Personally, two issues always bugged me when creating a helper method:

  1. I don't know where to put the helper method.
  2. I worry about name collision.

Reinventing the wheel

People have thought of ways to organize these helper methods, which are now commonly called presenters, decorators or exhibits.

One school of thought uses decorators to add functionality on top of the ActiveModel object; draper, display_case and active_decorator are the three best-known solutions.

Built-in helper methods all reside in the view-context object, so a presenter needs access to that object in order to work. However, passing it in on every helper call can be tedious. I like how draper and active_decorator grab the view_context for you, so you don't have to keep passing it to the presenter when calling it.

However, I find that their approach of decorating the model object can easily leak. For example, would you remember to also decorate the associated objects?

I wanted something else instead: a presenter object which contains only the helper methods. I add a presenter() method to the model object to access the presenter object. No decoration, and fewer issues. The presenter() method also takes care of caching the presenter object, so you instantiate only one presenter per model object.
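The idea of a cached presenter() accessor can be sketched in plain Ruby. This is not the gem's source: HasPresenter and BasePresenter are hypothetical names, the lookup by class name is an assumption, and the view-context handling (h) is left out entirely.

```ruby
# Hypothetical mixin: gives any object a memoized presenter() accessor.
module HasPresenter
  # Looks up "<ClassName>Presenter" and builds it once per object.
  def presenter
    @presenter ||= Object.const_get("#{self.class.name}Presenter").new(self)
  end
end

# Hypothetical base class: holds the wrapped model, nothing else.
class BasePresenter
  attr_reader :model

  def initialize(model)
    @model = model
  end
end

class Auction
  include HasPresenter
  attr_reader :body

  def initialize(body)
    @body = body
  end
end

class AuctionPresenter < BasePresenter
  # Plain string slicing stands in for the view helper truncate().
  def title
    model.body[0, 40]
  end
end

auction = Auction.new("A very long auction description that keeps going and going")
short_title = auction.presenter.title
same_object = auction.presenter.equal?(auction.presenter)
```

Because the presenter is memoized on the instance, calling auction.presenter repeatedly in a view does not allocate a new object each time.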

So I made my changes on top of active_decorator, and called it lulalala_presenter (all the good names are taken already).


First install by putting this in the Gemfile:

gem 'lulalala_presenter'

Say you want to add a title() helper method for auctions, which takes care of truncation and HTML cleansing. Simply create an AuctionPresenter class. Remember to subclass LulalalaPresenter::Base:

# app/presenters/auction_presenter.rb
class AuctionPresenter < LulalalaPresenter::Base
  # h is the view context; model is the wrapped auction.
  def title
    h.truncate(model.body, length: 40)
  end
end

And then in the view I call it like this

<%= @auction.presenter.title %>

Less magic

I think it is good to start with less magic, but allow users to add magic if they want more convenience.

By default you use h to access all the built-in helper methods, and model to access the model object. If you are lazy, you can add things to your presenter so you can type less. For example, you can delegate a few model attributes:

delegate :body, to: :model

You can probably also automatically delegate everything to models using other delegation libraries.


In the end I quite like the result. For the example at the top, I now have


Having to type .presenter when calling helpers isn't as bad as I first thought. It makes it obvious that the method belongs to the presenter, so you never mix up helper methods with model methods.

My two goals are also fulfilled. If I need an og_image_tag helper for my auction, I immediately know I should put it in the AuctionPresenter, and I can name it without worrying about prefixing model names.

over 2 years ago

Rails developers often override to_param to add more information to the URL. Taking this very page as an example:


We can clearly get an idea of what this page is about, which is why search engines favour this kind of URL.

However, if we just override to_param, other URLs change as well. The most notable example is the edit path:


I think this can cause issues, as the SEO-friendly param can get very long, which makes it difficult to notice the edit in the URL. It may also not be that useful to do SEO on action URLs (as opposed to content URLs). We probably never want the edit page to appear in search engines.

I think we should instead apply SEO only to the URLs that need it. One way to do this is to have our own version of the URL helpers. For example, we can define post_url and post_path like this:

# Override default generated Rails route helpers.
module CustomUrlHelper
  def post_path(target, options = nil)
    # Build "<id>-<title>" only for records that have a title.
    target = "#{target.id}-#{target.title[0..60]}" if target.respond_to?(:title)
    super(target, options)
  end
end
Rails.application.routes.url_helpers.send(:include, CustomUrlHelper)

This means only show/delete/update paths will be SEO friendly. Other paths (especially custom actions) will be unaffected.
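The id-prefixed param itself can be tried outside Rails. The sketch below is an illustration under assumptions: Post is a stand-in Struct, seo_param is a hypothetical helper, and the slug cleanup only handles ASCII titles (the helper above uses the raw title instead).

```ruby
# Stand-in for a model with an id and a title.
Post = Struct.new(:id, :title)

# Build "<id>-<slug>", keeping the slug part to max_length characters.
def seo_param(record, max_length: 60)
  slug = record.title.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
  "#{record.id}-#{slug[0, max_length]}"
end

post = Post.new(12, "Hello, World!")
param = seo_param(post)

# String#to_i stops at the first non-digit, so the numeric id
# can still be recovered from the longer param.
recovered_id = param.to_i
```

Putting the id first is what makes the scheme safe: the title part can be changed or truncated freely without affecting which record the param points to.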

over 2 years ago

I was trying to upgrade from Rails 4.0 to Rails 4.1. Some specs broke, and I found a small behaviour change.

So my user has_one profile.

user.association(:profile).loaded? # false
FactoryGirl.create(:profile, user: user)
user.association(:profile).loaded? # false

Prior to my upgrade, the profile association was not loaded either before or after the FactoryGirl.create call.

However, after upgrading to Rails 4.1:

user.association(:profile).loaded? # false
FactoryGirl.create(:profile, user: user)
user.association(:profile).loaded? # true

This means I have to be careful not to end up with a stale profile in tests, and call reload more often.

P.S. My factory_girl gem remained unchanged at version 4.4.0, and factory_girl_rails at 4.4.1, so this is solely due to a change in Rails behavior.

over 2 years ago

This weekend I tried migrating data from sqlite to postgresql, and ran into a lot of pitfalls.

A sqlite data dump needs quite a few changes before it can be used with another database. That part is error-prone, so I looked for other approaches.


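First I saw the taps gem on RailsCast. It reads the data and imports it by starting a separate server. Unfortunately the gem is no longer maintained, and my attempts to fix its problems myself all ended in failure.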

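Next I tried pgloader, which seems able to import data directly into pg. However, recent versions appear to have a UTF8 problem. Even after grabbing master and compiling it by hand I could not get past it, so I gave up.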

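Just as I was losing heart, it occurred to me that although SQL would not work, ActiveRecord supports all kinds of databases, so I could back up the old data in yaml format and load it back into the new database. After some searching I found the active-dump gem, and with a few adjustments I finally migrated the database successfully.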


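  • If you use foreign keys, remember to order the import according to the key dependencies: tables that keys point to go first.
  • Before exporting, delete orphan records first (rows whose foreign key points to data that no longer exists).
  • Before importing, switch the database and run the migrations.
  • After importing, each table's id sequence is not automatically set to continue from the current maximum id. You have to run something like ALTER SEQUENCE product_id_seq RESTART WITH 1453; for each table one by one, otherwise newly created records may start from ids near the beginning that are not in use.
  • Previously, on MySQL or Sqlite, results seemed to be ordered by primary key automatically when no order was specified. On Postgresql I found that what comes back is not ordered that way (what I saw was the reverse, id desc). This means queries need an explicit order(:id) to guarantee the same primary-key ordering as before.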
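The sequence step is tedious to type by hand, so the statements can be generated. A minimal sketch, assuming the sequences follow PostgreSQL's default <table>_<column>_seq naming for serial columns; restart_sequence_sql, the table names and the maximum ids are made up for illustration, and in practice the maximum would come from a SELECT MAX(id) on each table.

```ruby
# Build the statement that makes the next generated id follow max_id.
def restart_sequence_sql(table, max_id, column: "id")
  "ALTER SEQUENCE #{table}_#{column}_seq RESTART WITH #{max_id + 1};"
end

# Hypothetical current maximum ids per table.
max_ids = { "products" => 1452, "users" => 87 }

statements = max_ids.map { |table, max_id| restart_sequence_sql(table, max_id) }
puts statements
```

The generated statements can be reviewed first and then pasted into psql.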


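After finishing the migration, I discovered by chance that the sequel gem seems to offer a similar migration feature. Since that gem is well known, it may well be more reliable and convenient than the method I used. I will leave it to someone else to try it and report back.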

almost 3 years ago

Recently I wanted to write an email domain checker, and to provide wrappers so that users can add a cache layer or a dummy for testing. That made me think of the Rack middleware pattern.

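But I wondered whether building a middleware DSL would be overkill for such a simple library, so I decided against the onion-peeling feel of the middleware pattern. Instead I flattened it: the wrappers are stored in an array, each wrapper has a valid? method, and they are called in order, returning early on true and moving on to the next on false.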

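After implementing it, I realized I had forgotten that after a cache miss and the real check, the result needs to be stored. To do that, I would have to run the real check and then loop over every wrapper again, calling a second method to store the result. Only then did I appreciate the elegance of middleware. First, a single method handles both input and output. Second, testing is simpler, since each wrapper can be tested independently. With my homebrew approach, I would have to put the wrappers around the core class, run an integration test, and verify that the inner layers are not called on a cache hit.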
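The nested approach can be sketched as follows. This is a minimal illustration, not the library described here: CoreChecker, CacheWrapper and the domain list are hypothetical, and the real check would do a network lookup rather than consult an array.

```ruby
# Hypothetical innermost layer: performs the "real" check and counts calls.
class CoreChecker
  attr_reader :calls

  def initialize(known_domains)
    @known_domains = known_domains
    @calls = 0
  end

  def valid?(domain)
    @calls += 1
    @known_domains.include?(domain)
  end
end

# Middleware-style wrapper: one method handles the way in (cache lookup)
# and the way out (storing the inner layer's answer).
class CacheWrapper
  def initialize(inner)
    @inner = inner
    @cache = {}
  end

  def valid?(domain)
    return @cache[domain] if @cache.key?(domain)

    @cache[domain] = @inner.valid?(domain)
  end
end

core = CoreChecker.new(["example.com"])
checker = CacheWrapper.new(core)

first = checker.valid?("example.com")  # cache miss, reaches the core
second = checker.valid?("example.com") # cache hit, core is not called
```

Each wrapper only needs an object that responds to valid?, so it can be tested on its own with a stub as the inner layer.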

It looks like my wish to reduce the levels of method calls was just premature optimization. How embarrassing.

about 3 years ago

When integrating offsite payments with ActiveMerchant, I always need to provide an endpoint to receive payment notifications from gateway providers. After a couple of projects, I have settled on a practice for doing this that works well for me, and I thought it might be useful to others.

I have a PaymentNotification class solely to record notifications. This is because I want to decouple the persistence of notification from the order status change. If something goes wrong when I change Order, the notification will still be safely persisted in the database for later analysis.

As you can see, the controller action is pretty light. It only saves the notification. The notification request is set up so that it contains the order_id and payment method.

The interesting part is the PaymentNotification model. It has a few columns for persisting different information (mostly serialized). It is also linked to Order, for easier analysis.

The process method checks the notification's validity and triggers the order status update. It is meant to be called separately (maybe via a scheduled job). A few kinds of custom exceptions may be raised during processing, because they are handled differently. Either way, errors are recorded in the errors column.
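The separation between recording and processing can be sketched in plain Ruby. This is not the code from the project: the class below is not an ActiveRecord model, InvalidNotification and the :amount check are hypothetical, and persistence is replaced by in-memory attributes.

```ruby
# Hypothetical custom error; the real exception classes are not shown here.
class InvalidNotification < StandardError; end

# Stand-in for the Order model.
Order = Struct.new(:id, :status)

class PaymentNotification
  attr_reader :order, :params, :errors

  # Creating the object only records what the gateway sent.
  def initialize(order:, params:)
    @order = order
    @params = params
    @errors = []
  end

  # Called later: validates, then updates the order. Failures are
  # recorded on the notification instead of being lost.
  def process
    raise InvalidNotification, "missing amount" unless params.key?(:amount)

    order.status = :paid
    true
  rescue InvalidNotification => e
    @errors << e.message
    false
  end
end

order = Order.new(1, :pending)
notification = PaymentNotification.new(order: order, params: { amount: 100 })
notification.process
```

Because process is separate from creation, a failed update leaves the notification intact and the job can simply be retried.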

Any suggestions for improvement are welcome :D

about 3 years ago


Re: A new amateur option: physician 李宏信 to register tomorrow to run for Taipei mayor


I happen to be bored, so let me post two election-related models I came across while reading about game theory XD

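1. Voters' political attributes have only one dimension (in Taiwan's case, blue/green)
2. Voters vote for the candidate whose position is **closest** to their own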



Now the candidates are about to announce their platforms! Let's use Lien and Ko (連 and 柯) as the names XD








Voter-Candidate Model



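1. There is a group of people distributed along a one-dimensional political spectrum
2. Nobody can freely choose their position, because it is fixed by their past words and deeds
3. Each person can decide whether or not to run
4. Each person votes for whoever is closest to their own position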
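The rules above are easy to play with in code. The sketch below is only an illustration: the voter grid and the candidate positions are made up and are not the figures from this post, and candidates A, B and C are placeholders.

```ruby
# Each voter votes for the candidate whose position is nearest to their own.
def tally(voters, candidates)
  voters.each_with_object(Hash.new(0)) do |voter, votes|
    nearest, _position = candidates.min_by { |_name, position| (position - voter).abs }
    votes[nearest] += 1
  end
end

# Eleven voters spread evenly over a one-dimensional spectrum.
voters = (0..10).to_a

# Two candidates: B sits closer to the middle and wins.
two_way = tally(voters, { "A" => 3, "B" => 6 })

# A third candidate enters on B's side and splits B's support.
three_way = tally(voters, { "A" => 3, "B" => 6, "C" => 9 })
```

In the two-way race B gets 6 votes to A's 5. Once C enters, A still has 5 votes while B and C get 3 each, so A wins without gaining a single voter.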


Counting the cells gives 33 : 39, so Ko leads by 18%, which is close to the 15% in the actual polls.


33 : 32 : 7, and Sean Lien (連勝文) wins,

The KMT made this mistake once in 2000 and once in 2004,


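So the KMT is still 15% behind in the polls. What can it do?
But if you were Sean Lien, would you dare bet on whether the polls are accurate XD Of course you wouldn't XDDD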

But if this tactic is seen through, it stops working XD I just don't know how many people would see through it