Parsing an HTML page with net/http
In a previous question I found a hacky, but working, way to parse the title from a page using curl:
    # the result has to be stored in the same variable that is matched below
    curl = %x(curl http://google.com)
    simian = curl.match(/<title>(.*)<\/title>/)
    puts simian
Now I would like to know whether there is a better way, using a Ruby standard library such as net/http to fetch the URL (in lieu of curl).
Another issue is that if the page has non-standard characters in the title, the title is not parsed and curl.match cannot complete. I have tried

    simian = s.encode('UTF-8')

and then

    simian = curl.match(/<title>(.*)<\/title>/)

but it shows weird characters like 1#.

Thanks in advance for your help.
Using Nokogiri is probably the simplest solution:
    require 'nokogiri'
    require 'open-uri'

    # URI.open comes from open-uri; a bare open() no longer accepts URLs on Ruby 3.0+
    doc = Nokogiri::HTML(URI.open('http://www.google.com'))
    elt = doc.xpath('//title').first
    puts elt.text unless elt.nil?
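
If installing a gem is not an option, something similar can be sketched with net/http alone. The helper names `fetch_body` and `extract_title` below are made up for illustration. The sketch assumes the page is UTF-8 (it does not read the charset from the Content-Type header) and that any redirect sends an absolute URL in its Location header. Bytes that are not valid UTF-8 are replaced with `?` instead of making the match fail. A regex is still a rough tool for HTML, so treat this as a fallback rather than a replacement for a real parser.

```ruby
require 'net/http'
require 'uri'

# Fetch the raw body of a page with the standard library only.
# Follows up to `limit` redirects; assumes Location is an absolute URL.
def fetch_body(url, limit = 5)
  raise ArgumentError, 'too many redirects' if limit.zero?

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    fetch_body(response['location'], limit - 1)
  else
    raise "request failed: #{response.code}"
  end
end

# Pull the <title> text out of an HTML string, or return nil if there is none.
# Assumption: the bytes are meant to be UTF-8; invalid bytes become '?'.
def extract_title(html)
  text = html.dup.force_encoding('UTF-8')
  text = text.scrub('?') unless text.valid_encoding?
  # i: match <TITLE> as well; m: let the title span several lines
  match = text.match(%r{<title[^>]*>(.*?)</title>}im)
  match && match[1].strip
end
```

It could then be used as `puts extract_title(fetch_body('http://www.google.com'))`. Keeping the fetch and the parsing separate means the title extraction can be tried on a plain string without touching the network.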