Parsing an HTML page with net/http

In a previous question I found an answer with a hacky, but working, way to parse the title from a page:

 curl = %x(curl http://google.com)
 simian = curl.match(/<title>(.*)<\/title>/)[1]
 puts simian

Now I would like to know whether there is a better way to fetch the URL using a Ruby standard library such as net/http (in lieu of curl).

Another issue is that if the page has some non-standard characters in the title, it isn't parsed and curl.match cannot be completed. I have tried

 simian = curl.encode('UTF-8')

and then

 simian = simian.match(/<title>(.*)<\/title>/)[1]

but it still shows weird characters like 1#. Thanks in advance for your help.
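
For reference, a minimal sketch of the cleanup I am attempting, assuming the page really is served as UTF-8 and the bytes just need to be scrubbed before matching:

 curl = %x(curl http://google.com)
 # treat the raw bytes as UTF-8 and replace any invalid sequences
 curl = curl.force_encoding('UTF-8').scrub
 simian = curl.match(/<title>(.*)<\/title>/)[1]
 puts simian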

Answers


Using Nokogiri is probably the simplest solution:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open('http://www.google.com')) # URI.open, since Kernel#open no longer accepts URLs on Ruby 3+
elt = doc.xpath('//title').first
puts elt.text unless elt.nil?
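
If you want to stick with the standard library as asked, a rough equivalent with net/http (and uri) might look like the sketch below; it assumes a simple title tag and does not follow redirects:

require 'net/http'
require 'uri'

# fetch the page body with the standard library instead of shelling out to curl
body = Net::HTTP.get(URI('http://www.google.com'))
# treat the bytes as UTF-8, drop invalid sequences, then grab the title
title = body.force_encoding('UTF-8').scrub[/<title>(.*?)<\/title>/mi, 1]
puts title if title

Note that Net::HTTP.get returns the body as a plain string, so the same regex and encoding handling from the question apply to it directly.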
