Zend_Dom gives you a DOMElement… how do I use it?

I'm trying to use Zend_Dom for some very light screen scraping (I want to grab a headline, some body text and a link from a small block of news items on my website) and I'm not sure how to handle the DOMElement that it gives me.

In the manual for Zend_Dom the code says:

foreach ($results as $result) {
    // $result is a DOMElement
}

How do I make use of this DOMElement?

A detailed example (looking for the anchor elements on Google):

$url='http://google.com/';
$client = new Zend_Http_Client($url);
$response = $client->request();
$html = $response->getBody();
$dom = new Zend_Dom_Query($html);
$results = $dom->query('a');
foreach($results as $r){
     Zend_Debug::dump($r);
}

This gives me:

object(DOMElement)#81 (0) {
}
object(DOMElement)#82 (0) {
}
object(DOMElement)#83 (0) {
}
... etc, etc...

What I find confusing is that this looks like each element contains nothing (0)! This isn't the case but that is my first impression. So I poke around online and find I can add nodeValue to get something out of this:

Zend_Debug::dump($r->nodeValue);

which gives me:

string(6) "Images"
string(6) "Videos"
string(4) "Maps"
...etc, etc...

But where I run into trouble is getting specific elements and their contents.

For instance given this html:

  <div class="newsBlurb">
   <span class="newsDate">Mon, 11 October 2010</span>
   <h3 class="newsHeadline"><a href="http://foo.com/1/2/">Some text</a></h3>
   <a class="newsMore" href="http://foo.com/1/2/">More</a>
  </div> 
  <div class="hr"></div>
  <div class="newsBlurb">
   <span class="newsDate">Mon, 16 August 2010</span>
   <h3 class="newsHeadline"><a href="http://bar.com/pants.html">Stuff is here</a></h3>
   <a class="newsMore" href="http://bar.com/pants.html">More</a>
  </div> 

I can grab the text from each newsBlurb, using the technique I use in the Google example, but cannot get each element by itself. I want to get the date and stick it somewhere, get the headline text and stick it somewhere and get the link to use. But all I get is the actual text in the div.

How do I get what I want from this?


EDIT: Here is another example that does not work as I expect. Any ideas why?

$url = 'http://php.net/manual/en/class.domelement.php';
$client = new Zend_Http_Client($url);
$response = $client->request();
$html = $response->getBody();
$dom = new Zend_Dom_Query($html);
$newsBlurbNode = $dom->query('div.note');
Zend_Debug::dump($newsBlurbNode);

This gives me:

object(Zend_Dom_Query_Result)#867 (7) {
  ["_count":protected] => NULL
  ["_cssQuery":protected] => string(8) "div.note"
  ["_document":protected] => object(DOMDocument)#79 (0) {
  }
  ["_nodeList":protected] => object(DOMNodeList)#864 (0) {
  }
  ["_position":protected] => int(0)
  ["_xpath":protected] => NULL
  ["_xpathQuery":protected] => string(33) "//div[contains(@class, ' note ')]"
}

To try to get anything out of this, I used:

$children = $newsBlurbNode->childNodes;
foreach ($children as $child) {
}

This results in an error because the foreach loop has nothing to iterate over. Ack! What am I not getting?

Answers


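$dom->query() returns a Zend_Dom_Query_Result, not a DOMElement, so it has no childNodes property. Iterate over the result first; each item is a DOMElement whose children you can then walk:

foreach ($newsBlurbNode as $node) {
    // $node is a DOMElement matched by the query
    foreach ($node->childNodes as $child) {
        // do something with individual nodes
    }
}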

Otherwise I would go through: http://php.net/manual/en/class.domelement.php
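
As a rough illustration of which DOMElement members are useful, here is a minimal sketch that uses only PHP's built-in DOM extension, so it runs without Zend Framework. The sample markup and the $summary array are assumptions made for the example. The objects you get while looping over a Zend_Dom_Query result are the same DOMElement class, so the same properties and methods should apply there. A dump may look empty, as in the question, because these values are supplied by the extension rather than stored as ordinary object properties.

```php
<?php
// Sample markup (an assumption for this sketch, modelled on the question).
$html = '<p><a class="newsMore" href="http://foo.com/1/2/">More</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$summary = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // $a is a DOMElement, just like the items in a Zend_Dom_Query result.
    $summary[] = array(
        'tag'   => $a->nodeName,               // element name
        'text'  => $a->nodeValue,              // text inside the element
        'href'  => $a->getAttribute('href'),   // attribute lookup by name
        'class' => $a->getAttribute('class'),
    );
}

print_r($summary);
```

getAttribute() returns an empty string when the attribute is missing, so hasAttribute() is the safer check if you need to tell "absent" from "empty".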


I have been experimenting with something similar. Let me know whether this is enough to help; if not, I can explain it further.

$data = "<p id='p_1'><a href='testing1.html'><span>testing in a span 1</span></a></p>
         <p id='p_2'><a href='testing2.html'></a></p>
         <p id='p_3'><a href='testing3.html'><span>testing in a span 3</span></a></p>
         <p id='p_4'><a href='testing4.html'><span>testing in a span 4</span></a></p>
         <p id='p_5'><a href='testing5.html'><span>testing in a span 5</span></a></p>";

$dom = new Zend_Dom_Query();
$dom->setDocumentHtml($data);

//Look for any links inside of paragraph tags
$results = $dom->query('p a');

foreach ($results as $r) {
    // $r is a DOMElement for each matched <a>
    echo "Parent Tag: ".$r->nodeName."<br />";
    echo $r->nodeValue."<br />";

    // childNodes is a DOMNodeList; it is empty for the link in p_2
    foreach ($r->childNodes as $c) {
        echo "Child Tag: <br />";
        echo $c->nodeName."<br />";
        echo $c->nodeValue."<br />";
    }

    echo $r->getAttribute('href')."<br /><br />";
}

echo $data;
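
For the newsBlurb markup in the question, one possible approach is to select each blurb and then run queries relative to it. The sketch below is not the only way to do it: it uses PHP's built-in DOMDocument and DOMXPath instead of Zend_Dom_Query so that it is self-contained, and the $items array and its keys are names invented for the example.

```php
<?php
// The HTML from the question, inlined so the sketch runs on its own.
$html = <<<HTML
<div class="newsBlurb">
 <span class="newsDate">Mon, 11 October 2010</span>
 <h3 class="newsHeadline"><a href="http://foo.com/1/2/">Some text</a></h3>
 <a class="newsMore" href="http://foo.com/1/2/">More</a>
</div>
<div class="hr"></div>
<div class="newsBlurb">
 <span class="newsDate">Mon, 16 August 2010</span>
 <h3 class="newsHeadline"><a href="http://bar.com/pants.html">Stuff is here</a></h3>
 <a class="newsMore" href="http://bar.com/pants.html">More</a>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$items = array();
foreach ($xpath->query('//div[@class="newsBlurb"]') as $blurb) {
    // The leading "." keeps each query relative to the current blurb.
    $date = $xpath->query('.//span[@class="newsDate"]', $blurb)->item(0);
    $link = $xpath->query('.//h3[@class="newsHeadline"]/a', $blurb)->item(0);
    if ($date === null || $link === null) {
        continue; // skip blurbs that are missing either piece
    }
    $items[] = array(
        'date'     => trim($date->nodeValue),
        'headline' => trim($link->nodeValue),
        'url'      => $link->getAttribute('href'),
    );
}

print_r($items);
```

If you start from a Zend_Dom_Query result instead, each DOMElement exposes its document through the ownerDocument property, so a DOMXPath built from $r->ownerDocument should let you run the same relative queries with $r as the context node.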
