How to extract exact tags in scrapy
I wrote a Scrapy spider to extract one piece of content from a page:
    #!/usr/bin/python
    import html2text
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector


    class StockSpider(BaseSpider):
        name = "stock_spider"
        allowed_domains = ["www.hamshahrionline.ir"]
        start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # sample = hxs.select("WhatShouldIputHere").extract()[AndHere]
            converter = html2text.HTML2Text()
            converter.ignore_links = True
            print converter.handle(sample)
My main problem is the statement I commented out.
What XPath expression should I pass to select(), and what index should I use on the result of extract()?
Could you guide me through this and give me some examples?
First, decide what data you want to get out of the page, then define an Item class with a set of Fields for it. To fill the item's fields with data, use XPath expressions in the parse() method of your spider.
Here's an example that retrieves all of the paragraphs from the article body (the whole news text, I suppose):
    from scrapy.item import Item, Field
    from scrapy.spider import Spider
    from scrapy.selector import Selector


    class MyItem(Item):
        content = Field()


    class StockSpider(Spider):
        name = "stock_spider"
        allowed_domains = ["www.hamshahrionline.ir"]
        start_urls = ["http://www.hamshahrionline.ir/details/261730/Health/publichealth"]

        def parse(self, response):
            sel = Selector(response)
            paragraphs = sel.xpath("//div[@class='newsBodyCont']/p/text()").extract()
            for p in paragraphs:
                item = MyItem()
                item['content'] = p
                yield item
Also, note that it's better to move your Item definition into a separate Python module, to follow the usual Scrapy project structure.
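If you want to see what that XPath expression selects without running a crawl, here is a minimal sketch that uses only the standard library. It is not Scrapy: xml.etree.ElementTree stands in for the Selector and supports only a subset of XPath, so the text() step is replaced by reading .text on each element. SAMPLE_PAGE and extract_paragraphs are hypothetical names, and the markup is an assumed, well-formed stand-in for the real page, which may look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the article page (an assumption;
# the real page's markup may differ).
SAMPLE_PAGE = """
<html>
  <body>
    <div class="header"><p>Site navigation</p></div>
    <div class="newsBodyCont">
      <p>First paragraph of the story.</p>
      <p>Second paragraph of the story.</p>
    </div>
  </body>
</html>
"""


def extract_paragraphs(markup):
    """Return the text of every <p> that is a direct child of div.newsBodyCont."""
    root = ET.fromstring(markup)
    # Same path as in the spider, minus the trailing text() step, which
    # ElementTree's limited XPath does not support.
    nodes = root.findall(".//div[@class='newsBodyCont']/p")
    return [node.text for node in nodes]


paragraphs = extract_paragraphs(SAMPLE_PAGE)  # a list, like the result of extract()
first = paragraphs[0]                         # index the list to get one string
print(paragraphs)
```

The point for the commented-out line in the question is that the expression decides which nodes match, while the index after extract() only picks one string out of the resulting list. Note that paragraphs[0] raises IndexError when nothing matches, so check the list before indexing it.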
Hope that helps.