Some form fields cannot be retrieved with WWW::Mechanize (the Ruby one) - N.E.E.T―Never Ending Engineer’s Tragicomedy Annex

It looks like some form fields cannot be retrieved when the HTML is not well-formed. For example, with a form like the one below, bar cannot be retrieved (WWW::Mechanize#page.forms.first.field('bar').nil? == true).

<p>
<form>
<input name="foo">
</p>
<input name="bar">

I searched around but found no information. Is nobody else running into this?

I have not worked out a fix yet, so for now I am only posting the test. Once it passes, the problem is solved.

#!/usr/local/bin/ruby
# $Id$
require 'rubygems'
require 'mechanize'
require 'logger'
require 'webrick'
require 'test/unit'
require 'ruby-debug'

class DumbHTTPD
  class Servlet < WEBrick::HTTPServlet::AbstractServlet
    @@tmpl = lambda do |val|
      <<_EOT_
<html>
<body>
<h1>Fill out below form:</h1>
<p>
  <form action="/" method="POST">
  <input type="text" name="foo" value="#{val['foo']}">
</p>
  <input type="text" name="bar" value="#{val['bar']}">
  <input type="submit" value="save">
</form>
</body>
</html>
_EOT_
    end
    def do_GET(req, res)
      res['Content-Type'] = 'text/html; charset=utf-8'
      res.body = @@tmpl.call({
        'foo' => 'Hello world',
        'bar' => 'hoge',
      })
      return res
    end
    def do_POST(req, res)
      res['Content-Type'] = 'text/html; charset=utf-8'
      res.body = @@tmpl.call(req.query)
      return res
    end
  end

  attr_accessor :webrick, :runtime, :bindaddr, :port
  def initialize(bindaddr, port)
    self.bindaddr, self.port = bindaddr, port
  end

  def init_webrick
    self.webrick = WEBrick::HTTPServer.new(
      :BindAddress => self.bindaddr,
      :Port => self.port)
    self.webrick.mount('/', Servlet)
  end

  def start
    self.init_webrick
    self.runtime = Thread.new(self.webrick) {|w| w.start}
    return self
  end

  def stop
    # Ask WEBrick to shut down first, then wait for the server thread to exit.
    self.webrick.shutdown
    self.runtime.join
    return self
  end
end


class TC_Mech < Test::Unit::TestCase
  attr_accessor :agent, :servlet
  def setup
    self.agent = WWW::Mechanize.new {|a| a.log = Logger.new($stderr) }
    self.agent.max_history = 1
    self.agent.user_agent_alias = 'Windows IE 6'

    self.servlet = DumbHTTPD.new('localhost', 10182).start
  end

  def teardown
    self.servlet.stop
  end

  def test_scrape_not_wellformed_html
    page = self.agent.get('http://localhost:10182')
    form = page.forms.first
    {'foo' => 'Hello world',
     'bar' => 'hoge',
    }.each do |k, v|
      field = form.field(k)
      assert(!field.nil?, "form field '#{k}' does not exist")
      assert_equal(v, field.value, "form field '#{k}'.value != '#{v}'")
    end
  end
end

Thoughts on a workaround (notes)

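The first time WWW::Mechanize::Page#forms is called, it runs Hpricot.parse, calls search('form'), builds one form for each entry in the resulting Hpricot::Elements, and caches them (in an instance variable of WWW::Mechanize::Page). The catch is that when Hpricot reads broken HTML like the above, it ignores the inputs that follow.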
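To picture how an input can fall out of a form, here is a toy model in plain Ruby. It is not Hpricot's actual algorithm, and the method name inputs_in_first_form is made up for this sketch. It only assumes a stack-based parser in which a closing tag also closes every element still open inside it, so </p> takes the unclosed <form> down with it.

```ruby
# Toy model only: NOT Hpricot's real algorithm (hypothetical helper).
# Returns the names of the <input> elements that end up inside the first <form>.
def inputs_in_first_form(html)
  void_tags = %w[input br img hr meta link]  # elements that never get a close tag
  stack = []         # currently open elements, innermost last
  names = []         # input names collected while the first form is open
  form_done = false  # becomes true once the first form has been closed

  html.scan(/<(\/)?([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/) do |closing, tag, attrs|
    tag = tag.downcase
    if closing
      idx = stack.rindex(tag)
      next if idx.nil?                # stray close tag: ignore it
      popped = stack.slice!(idx..-1)  # closes the tag and everything nested in it
      form_done = true if popped.include?('form')
    elsif void_tags.include?(tag)
      if tag == 'input' && stack.include?('form') && !form_done
        names << attrs[/name\s*=\s*["']?([^"'\s>]+)/, 1]
      end
    else
      stack.push(tag)
    end
  end
  names.compact
end

broken_html = '<p><form><input name="foo"></p><input name="bar">'
p inputs_in_first_form(broken_html)  # => ["foo"]
```

In this model, bar is seen only after the form has already been popped off the stack, which matches the symptom in the test above.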

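That said, from Hpricot's point of view this is probably not wrong behavior. The fix most likely belongs in WWW::Mechanize. Mech has a mechanism called pluggable_parser that lets you swap out the HTML parser dynamically. Using it, you can write a descendant class of WWW::Mechanize::Page that interprets HTML forms loosely and plug it in like this.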

class WWW::Mechanize::LamePage < WWW::Mechanize::Page
  def initialize(uri=nil, response=nil, body=nil, code=nil)
    super(uri, response, body, code)
  end

  def forms
    # redefine forms here
  end
end

agent = WWW::Mechanize.new
agent.pluggable_parser.html = WWW::Mechanize::LamePage  # replace the standard parser
# after that, use it as usual

So I know what the workaround should look like, but Hpricot is hard to use and I am not making much progress...
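As a rough idea of what interpreting forms loosely could mean, here is a minimal sketch that avoids the parser entirely. The helper lenient_form_fields is hypothetical and is not part of WWW::Mechanize or Hpricot. It assumes quoted attribute values and simply treats every <input> between a <form> open tag and the next </form> (or the end of the document) as belonging to that form, no matter how <p> and friends are nested. A real forms override would still need to turn these pairs into Mechanize form objects.

```ruby
# Hypothetical helper, not part of WWW::Mechanize or Hpricot.
# Returns one Hash per <form>, mapping each named <input> to its value.
def lenient_form_fields(html)
  forms = []
  # Split at each <form ...> open tag; the first chunk precedes any form.
  html.split(/<form\b[^>]*>/i).drop(1).each do |chunk|
    # Keep only the text up to the next </form>, if there is one.
    body = chunk.split(/<\/form\s*>/i, 2).first.to_s
    fields = {}
    body.scan(/<input\b([^>]*)>/i).flatten.each do |attrs|
      name = attrs[/\bname\s*=\s*["']([^"']*)["']/i, 1]
      next if name.nil?  # e.g. an unnamed submit button
      fields[name] = attrs[/\bvalue\s*=\s*["']([^"']*)["']/i, 1]
    end
    forms << fields
  end
  forms
end

html = <<~HTML
  <p>
    <form action="/" method="POST">
    <input type="text" name="foo" value="Hello world">
  </p>
    <input type="text" name="bar" value="hoge">
    <input type="submit" value="save">
  </form>
HTML

fields = lenient_form_fields(html).first
puts fields['bar']  # prints "hoge"
```

A regex scan like this is deliberately crude: it ignores select and textarea elements, unquoted attributes, and forms hidden inside comments or scripts, so it is only a starting point.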

Addendum (2009/2/24)

I retested with WWW::Mechanize 0.9.0, and the problem described in this entry was gone. Good.
Apparently the HTML parser was switched from Hpricot to Nokogiri, and that made this kind of problem disappear.