How to extract exact tags in Scrapy

I wrote a Scrapy spider class to extract a piece of content from a page, like so:

import html2text
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class StockSpider(BaseSpider):
    name = "stock_spider"
    allowed_domains = [""]
    start_urls = [""]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # sample = hxs.select("WhatShouldIputHere").extract()[AndHere]
        converter = html2text.HTML2Text()
        converter.ignore_links = True
        print converter.handle(sample)

My main problem is the statement that I commented out.

What XPath should I use there, and what index should follow extract()?

Can you guide me through this and give me some examples?

Thank you


First, decide what data you want to get out of the page, then define an Item class with a set of Fields. To fill the item fields with data, use XPath expressions in the parse() method of your spider.

Here's an example that retrieves all of the paragraphs from the news body (all of the news, I suppose):

from scrapy.item import Item, Field
from scrapy.spider import Spider
from scrapy.selector import Selector

class MyItem(Item):
    content = Field()

class StockSpider(Spider):
    name = "stock_spider"
    allowed_domains = [""]
    start_urls = [""]

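    def parse(self, response):
        sel = Selector(response)
        # Select the text nodes of each <p> that is a direct child of
        # <div class="newsBodyCont">; extract() returns a list of strings.
        paragraphs = sel.xpath("//div[@class='newsBodyCont']/p/text()").extract()
        for p in paragraphs:
            # Yield one item per extracted string.
            item = MyItem()
            item['content'] = p
            yield item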

Note that I'm using the Selector class because HtmlXPathSelector is deprecated. For the same reason, I'm using the xpath() method instead of select().

Also, note that it's better to move your Item definition into a separate Python module (items.py in a standard Scrapy project) to follow the Scrapy project structure.
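
If you want to see what the XPath selects and what the index after extract() does before running the spider, here is a rough sketch that runs without Scrapy. It is not Scrapy code: it swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no text() step), and it assumes a made-up, well-formed page snippet. SAMPLE and paragraph_texts are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; the real page's structure is an assumption.
SAMPLE = """
<html><body>
  <div class="newsBodyCont">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <div class="footer"><p>Not news.</p></div>
</body></html>
"""


def paragraph_texts(markup, css_class="newsBodyCont"):
    """Return the text of each <p> directly inside <div class=css_class>."""
    root = ET.fromstring(markup)
    # ElementTree has no text() step, so select the <p> elements
    # and read .text from each one instead.
    nodes = root.findall(".//div[@class='%s']/p" % css_class)
    return [node.text for node in nodes]


paragraphs = paragraph_texts(SAMPLE)
print(paragraphs)     # the whole list, like the result of extract()
print(paragraphs[0])  # a single entry, like extract()[0]
```

The result is a list with one string per matching paragraph, so the number after extract() is just a list index: 0 picks the first match, 1 the second, and so on. In the spider itself the same path goes into sel.xpath(), with /text() appended if you want the text rather than the element.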

Hope that helps.
