Scraping multiple pages for parsing in Beautiful Soup

I'm trying to scrape multiple pages off of a single website for BeautifulSoup to parse. So far, I've tried using urllib2 to do this, but have been encountering some problems. What I've attempted is:

import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)

title = soup.find("span", {"class":"paperstitle"})
date = soup.find("span", {"class":"docdate"})
span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
paras = [x for x in span.findAllNext("p")]

first = title.string
second = date.string
start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]

print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)

This only gives me results for the second number in the numb sequence, i.e. http://www.presidency.ucsb.edu/ws/index.php?pid=87433. I also made some attempts at using mechanize, but had no success. Ideally what I would like to be able to do is have a page, with a list of links, and then automatically select a link, pass the HTML off to BeautifulSoup, and then move to the next link in the list.

Answers


You need to put the rest of the code inside the loop. Right now you're iterating over both items in the tuple, but at the end of the iteration only the last item remains assigned to address which subsequently gets parsed outside the loop.


I think you just missed the indentation in the loop :

import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)

    title = soup.find("span", {"class":"paperstitle"})
    date = soup.find("span", {"class":"docdate"})
    span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
    paras = [x for x in span.findAllNext("p")]

    first = title.string
    second = date.string
    start = span.string
    middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
    last = paras[-1].contents[0]

    print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)

I think this should solve the problem..


Here's a tidier solution (using lxml):

import lxml.html as lh

root_url = 'http://www.presidency.ucsb.edu/ws/index.php?pid='
page_ids = ['85753', '87433']

def scrape_page(page_id):
    url = root_url + page_id
    tree = lh.parse(url)

    title = tree.xpath("//span[@class='paperstitle']")[0].text
    date = tree.xpath("//span[@class='docdate']")[0].text
    text = tree.xpath("//span[@class='displaytext']")[0].text_content()

    return title, date, text

if __name__ == '__main__':
    for page_id in page_ids:
        title, date, text = scrape_page(page_id)

Need Your Help

Using the jQuery validatation plugin to send multiple values to an ASP.NET MVC controller action?

asp.net-mvc jquery-plugins jquery jquery-validate

Using the jQuery Validation plugin and AJAX, how can I validate the contents of say an input (textbox) but pass more than one parameter to a controller action?

Stable alternative to RXTX

java serial-port rxtx

After using RXTX for a number of different projects, I've come across many annoying discrepancies and issues that can only sensibly be put down to bugs in the library - deadlocks, race hazards, and