4 months ago

## I wrote an address tokenizer using machine learning

A few years ago, I was assigned the task to extract the city/suburb names from our crawler results. I wrote a parser, using a bunch of if/else statements and regular expressions. It worked mostly, except in some extreme cases. In order to parse those extreme cases, I added more if statements and more obscure regular expressions. At the end I feel the code was very unreadable.

But was I an incompetent programmer? A few months ago I read a blog post about using machine learning to do address parsing, and I realized my old approach of creating rules, is not how our brains work. A lot of cases really requires us thinking in terms of possibility ("if there are more than three characters followed by this, it is probably a street"). These are fuzzy logics, but my if/ else regular expressions are discrete logics operating on a boolean level.

So as a pet project, I decided to implement an address parser in Ruby. In the Python community they already have Parserator. So why not in Rubyland? I am from Taiwan, so I also want to try applying that to addresses here.

I used the Conditional Random Fields model, though reading the Wikipedia article fried my brain:

I don't understand any of these. However I still keep my hopes that I can just copy & paste something and it would work out eventually. Though we don't know how to create a lego block, we can still build things using it without all the background knowledge right?

The first step is to gather the training data. My friend said that these are confidential, and can cost money. So I looked elsewhere. Eventually I found out that there are people adding address entries on this site called OpenStreetMap. Regional data can be downloaded at this site called Gisgraphy. The file is in .pbf, which stands for Protocolbuffer Binary Format. So I used pbf_parser gem to access the data inside. Not all data are for addresses, some are bus routes and some are geometry data. I wrote a parser to extract addresses into the a SQL database. There were around 15000 records.

Though in OSM people enters address in different sections such as city and suburb, in reality it is not strictly followed as to which field represents what. This is especially true in Eastern countries. there are a few distinct levels which does not have an English counterpart. People also puts the full address in the street field and the like. So I have to write scripts to boldly move the data around the columns, add new columns to match Taiwanese address rules. I feel I have touched more than 2/3 of the addresses. I call this part cleaning.

Once cleaning is done, all we have to do is to feed those data in to train the model. Sylvester Keil wrote two Rubygems to do CRF training, one of which is called wapiti. It is a wrapper to a C library of the same name. He was very kind and helped me when I wanted to know how to use the gem.

Eventually I was able to feed my data into wapiti and create a model file. Some East-Asian languages have the property that pharaes are not separated by space characters, I have to chop the address into individual characters, and then feed them in. On the other end, when the model determines the result, I then have to combine neighbouring characters of the same label back into a phrase.

The result was much better than I expected, it can parse common addresses just fine. All of these are me writing no rules at all. I created a website for people to try out http://addresstokenizer.lulalala.com/, so I can also gather some new data.

People do inform me extreme cases where the tokenization fails. As my first time writing something using Machine Learning, the feeling is quite different, as something like this:

I provided a gem (https://github.com/lulalala/lulalala_address_tokenizer) and provided a model file. The gem is intended for East Asian addresses (Chinese, Japanese and Korean), so if you are in these region, please try create your own model. Once you plug it in, it should just work. Once I have time, I plan to put my training data online for others to make correction on.

10 months ago

## Always use respond_to

We have a controller action which has a error handling state.

However eventhough the js is rendered, the response content-type is still "text/html". (The request Accept is set to  */*;q=0.5, text/javascript, application/javascript, application/ecmascript, application/x-ecmascript) This is very puzzling to us, because this always worked for us.

Later we found out that the root cause was because we call render_to_string. If we remove that call, Rails will again be able to guess the content-type as text/javascript. Somewhere in the Rails internal must have set the type when render_to_string is called, eventhough it is not directly used by the response rendering.

Using respond_to would also solve this issue, forcing Rails to return text/javascript as the content type.

I guess we should always specify respond_to, so the content-type can be deterministic.

## RubyKaigi 2015 感想

### VM技術

（果真這種大會議才會看到各個重量級公司來參一腳）

over 1 year ago

## My Presenter Library for Rails

Naming your helper methods in Rails can be very difficult. What do you feel when you see this:

To avoid method name collision, we usually have to append the model name in front. Helper methods are also global, so you can access all kinds of helpers in one go.

For me personally, I want the following when I create a helper method:

1. easily know where to put a helper method
2. easily know how to name the method

#### Reinventing the wheel

People thought of ways to organize these helper methods, which are now commonly called presenters, decorators or exhibiters.

One school of thought uses decorators to append functionalities on top the ActiveModel, draper, display_case and active_decorator being the three most well-known solutions.

Built-in helper methods all reside within view-context objects. So for presenter to work it needs access to the view-context object. However passing it for every helper call can be tedius. I liked how draper and active_decorator grabs that view_context for you, so you won't have to keep passing it to presenter when calling it.

However I see that its effort in decorating the model object can easily leak. For example would you remember to also decorate the association objects?

I wanted something else instead, a presenter object which contains only the helper methods. I add a presenter() method in the model object to access the presenter object. No decoration and less issues. The presenter() object also takes care of caching the presenter object, so you only instantiate a presenter per model once.

So I made my changes on top the active_decorator, and called it lulalala_presenter (all the good names are taken already).

#### Usage

First install by putting this in the Gemfile:

Say you want to add a title() helper method for auction, which takes care of truncation and html cleansing. I simply create a AuctionPresenter class. Remember to subclas it with LulalalaPresenter::Base:

And then in the view I call it like this

#### Less magic

I think it is good to start with less magic, but allow user to add magic if they want more convenience.

By default you use h to access all the build-in helper methods, and model to access the model object. If you are lazy you can choose to add stuff in your presenter so you can type less. For example you can delegate a few model attributes:

You can probably also automatically delegate everything to models using other delegation libraries.

For skipping typing h, you can mix-in active_decorator's helper module so it will redirect calls to view-context using method_missing.

#### Conclusion

At the end I quite like my results. For the top example, now I have

Having to type .presenter when calling helpers isn't so bad as I first thought. This makes it very obvious that the method belongs to presenter, so you won't ever mix up a helper methods with model methods.

My two goals are also fulfilled. If I need an og_image_tag helper for my auction, I immediately know I should put it in the AuctionPresenter, and I can name it without worrying about prefixing model names.

over 1 year ago

## Apply SEO-friendly url only to show path

Rails developers often use to_param to add more information to the url. Taking this very page as an example:

http://logdown.com/account/posts/290439-seo-and-to-param

We can clear get an idea of what this page is about. Therefore search engines favours this kind of urls.

However, if we just override to_param, we would also see other urls getting changed. One most notable example would be edit path:

http://logdown.com/account/posts/290439-seo-and-to-param/edit

I think this can cause issues as the seo-friend param can potentially get very long. Then it would be difficult to notice the edit in the url. Also it may not be that useful to do SEO on action urls (rather than the content urls). We probably will never want edit page to appear on search engines.

I think instead, we should conditionally do SEO on urls that needs it. One way to do this is to have our own version of the url helpers. For example, we can define post_url and post_path like this:

This means only show/delete/update paths will be SEO friendly. Other paths (especially custom actions) will be unaffected.

over 1 year ago

## Rails 4.1 upgrade and FactoryGirl.create with association

I was trying to upgrade from Rails 4.0 to Rails 4.1. Some specs broke, and I found a small behaviour change.

So my user has_one profile.

Prior to my upgrade, the profile association will not be loaded before/after FactoryGirl.create call.

However after Rails 4.1

This means I have to be careful to not have stale profile in tests, calling reload more often.

P.S. my factory_girl gem has remain unchanged in version 4.4.0 and factory_girl_rails gem in 4.4.1. So this is solely due to Rails behavior change.

over 1 year ago

## 從 sqlite 轉換到 postgresql

sqlite 的 data dump 其實需要更改不少地方，才能用在其他資料庫。這部份很容易犯錯，所以我想找其他方法。

## 注意的地方

• 有使用 foreign key 的話，記得依照 key 的 dependency 把匯入的順序調整，被 Key 指向的 table 移到前面。
• 匯出前也要先把 orphan record 先刪除（就是有 foreign key 指向已經不存在的資料時）
• 匯入前，先轉換資料庫然後跑 migration
• 匯入後，每個 table 的 id 並不會自動設成從目前最大的 id 開始繼續遞增。必須跑類似 ALTER SEQUENCE product_id_seq RESTART WITH 1453; 來對每個 table 一一作設定，不然的話可能新增的 record 會從前面沒有的 id 開始建立。
• 以前在 MySQL 或是 Sqlite 時似乎沒有指定 order 就會以 primary key 自動作排序。但是在 Postgresql 我發現丟回來的東西順序不是這樣（我看到的是相反的 id desc）。這代表 query 都得手動加入 order(:id) 才能確保跟以前一樣以 id desc 排序。

## Middleware模式的隨想

over 2 years ago

When I am integrating offsite payments in ActiveMerchant, I always need to provide an endpoint to receive payment notificaiton from gateway providers. After a couple of projects, I have come up with a best-practice to do it, and I thought this might be useful for others.

I have a PaymentNotification class solely to record notifications. This is because I want to decouple the persistence of notification from the order status change. If something goes wrong when I change Order, the notification will still be safely persisted in the database for later analysis.

As you can see, the controller action is pretty light. It only saves the notification. The notification request is setup so it contains order_id and payment method.

The interesting part is the PaymentNotification model. It has a few columns for persisting different information (mostly serialized). It is also linked to Order, for easier analysis.

The process method is to check the notification's validity and trigger order status update. It is to be called separately (maybe via a scheduled job). A few kinds of custom exceptions may be raised during process, because they are handled differently. Either way, errors will be recorded in the errors column.

Any improvement is welcomed :D

over 2 years ago

## [轉載] 圖解醫師李宏信明參選是為了瓜分柯的選票

### Re: 新增素人選項　醫師李宏信明登記參選北市長

（歷史上也有最後選擇「跟對方完全一樣的政見」的例子，還選贏了！）

（但其他縣市就不一定了，其他縣市的候選人可能會想要淡化自己的藍色）

33 : 32 : 7 連勝文贏了，

「國民黨落選」才是民進黨最重要的事情，成功不必在我。

「台北市選民還有沉默的多數沒講話，民調不準啦！」