Extract text from HTML: looking for a good SAX-like parser or advice on using a DOM parser

I have an HTML document formatted this way:

<p>
 some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
 just some plain text
</p>
<p>
  <strong>strong text </strong> followed by plain, <a>with a link at the end!</a>
</p>

I'd like to extract the text. With a DOM-like parser I could extract each paragraph, but the problem lies inside the paragraphs: I'd also have to extract the text from the inner tags and end up with a single string in the same order. In the example above, for the first paragraph, I want to extract:

some plain text some emphatized text, some strong text

For this purpose I guess a SAX-like parser would be better than a DOM one, given that I can't know the number or sequence of the inner tags: a paragraph can have zero or more inner tags, of different types.

Answers


You can use a DOM parser: get the content of each p tag (including its child HTML elements) into a string variable, then strip all the HTML tags out of that string. This leaves you with all of the text between the p tags, in its original order, without any of the child element tags.

Example

<p>
    some plain text <em>some emphatized text</em>, <strong> some strong text</strong>
</p>
<p>
    just some plain text
</p>
<p>
    <strong>strong text </strong> followed by plain, <a>with a link at the end!</a>
</p>

Use a DOM parser to extract the contents of each p tag as a string. You would then have a string like this:

String content = "some plain text <em>some emphatized text</em>, <strong> some strong text</strong>";
content = stripHtmlTags(content); // stripHtmlTags is a placeholder for whatever tag-stripping routine you use
System.out.println(content); // some plain text some emphatized text, some strong text
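
The stripHtmlTags placeholder above is not a library function. A minimal sketch of what it might look like, assuming a simple regular expression is good enough for your input (the class name TagStripper is made up for illustration, and a regex is not a real HTML parser, so entities such as &amp; are left undecoded):

```java
import java.util.regex.Pattern;

class TagStripper {
    // Matches anything that looks like a tag: '<', any characters except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: removes tags, then collapses runs of whitespace
    // into single spaces and trims the ends.
    static String stripHtmlTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll("");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String content = "some plain text <em>some emphatized text</em>, "
                + "<strong> some strong text</strong>";
        System.out.println(stripHtmlTags(content));
    }
}
```

The whitespace collapsing is a design choice: removing the tags alone leaves a double space where ", " meets " some strong text", so the sketch normalizes it.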

String extractedText = Html.fromHtml(yourHtmlString).toString();

This gives you the extracted text (Html here is Android's android.text.Html class). Hope this helps.


To read CDATA sections when parsing with DOM, also check for this node type:
childNode.getNodeType() == Node.CDATA_SECTION_NODE

If you are using XMLUtils, modify getNodeValue like this:

// Requires: import org.w3c.dom.Node; import org.w3c.dom.NodeList;
public static String getNodeValue(Node node) {
    node.normalize(); // merge adjacent text nodes
    String response = node.getNodeValue();
    if (response != null) {
        return response;
    }
    // Element nodes have a null value, so look at the direct children.
    NodeList list = node.getChildNodes();
    int size = list == null ? 0 : list.getLength();
    for (int j = 0; j < size; j++) {
        Node childNode = list.item(j);
        if (childNode.getNodeType() == Node.TEXT_NODE
                || childNode.getNodeType() == Node.CDATA_SECTION_NODE) {
            // Note: only the first text or CDATA child is returned.
            return childNode.getNodeValue();
        }
    }
    return "";
}
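
Note that getNodeValue above stops at the first text or CDATA child, so for the question's paragraphs it would miss the text inside em, strong and a. A possible variation, sketched here with the standard org.w3c.dom API, walks all descendants and concatenates their text in document order. The names ParagraphText, collectText and paragraphTexts are made up for illustration, and the sketch assumes the input is well-formed XML wrapped in a single root element, which real-world HTML often is not:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ParagraphText {
    // Appends the value of every TEXT and CDATA descendant of node, in document order.
    static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }

    // Hypothetical helper: returns the whitespace-normalized text of each p element.
    static List<String> paragraphTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList paragraphs = doc.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                StringBuilder out = new StringBuilder();
                collectText(paragraphs.item(i), out);
                result.add(out.toString().replaceAll("\\s+", " ").trim());
            }
            return result;
        } catch (Exception e) {
            // Parsing fails if the input is not well-formed XML.
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><p> some plain text <em>some emphatized text</em>, "
                + "<strong> some strong text</strong></p></root>";
        for (String text : paragraphTexts(xml)) {
            System.out.println(text);
        }
    }
}
```

The built-in Node.getTextContent() returns the same concatenated descendant text for an element, so the recursion is only needed if you want to filter or transform individual nodes along the way.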
