Locating specific <p> tag after <h1> tag in Python Html Parser
I'm trying to parse a series of webpages and grab just the 3 paragraphs that follow the header on each page. They all have the same format (I think). I'm using urllib2 and Beautiful Soup, but I'm not quite sure how to jump to the header and then grab the few <p> tags that follow it. I know the first split("<h1>") is not correct, but it's my only decent attempt so far. Here's my code:
    from bs4 import BeautifulSoup
    import urllib2
    from HTMLParser import HTMLParser

    BANNED = ["/events/new"]

    def main():
        soup = BeautifulSoup(urllib2.urlopen('http://b-line.binghamton.edu').read())
        for link in soup.find_all('a'):
            link = link.get('href')
            if link != None and link not in BANNED and "/events/" in link:
                print()
                print(link)
                eventPage = "http://b-line.binghamton.edu" + link
                bLineSubPage = urllib2.urlopen(eventPage)
                bLineSubPageStr = bLineSubPage.read()
                headAccum = 0
                for data in bLineSubPageStr.split("<h1>"):
                    if(headAccum < 1):
                        accum = 0
                        for subData in data.split("<p>"):
                            if(accum < 5):
                                try:
                                    print(BeautifulSoup(subData).get_text())
                                except Exception as e:
                                    print(e)
                                accum+=1
                        print()
                    headAccum += 1
                bLineSubPage.close()
                print()

    main()
    >>> import bs4, urllib2
    >>> page_txt = urllib2.urlopen("http://b-line.binghamton.edu/events/9305").read()
    >>> soup = bs4.BeautifulSoup(page_txt.split("<h1>", 1)[-1])
    >>> print soup.find_all("p")[:3]
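Is that what you want? The idea is to split the page source on the first <h1>, parse only what comes after it, and keep the first three <p> tags.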
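If you would rather not split the raw string, here is a minimal sketch of the same idea using only the standard library parser that the question already imports. It is written for Python 3, where the module is html.parser (in Python 2 it is HTMLParser). The names ParagraphsAfterHeading and paragraphs_after_h1 and the sample markup are made up for illustration, and it assumes the paragraphs you want are the first <p> tags that appear anywhere after the first <h1>.

```python
from html.parser import HTMLParser  # Python 3; the Python 2 module is named HTMLParser


class ParagraphsAfterHeading(HTMLParser):
    """Collect the text of the first `limit` <p> elements after the first <h1>."""

    def __init__(self, limit=3):
        HTMLParser.__init__(self)
        self.limit = limit
        self.seen_h1 = False   # becomes True once the first <h1> start tag is seen
        self.in_p = False      # True while inside a <p> that is being collected
        self.paragraphs = []   # text of each collected paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.seen_h1 = True
        elif tag == "p":
            # Only collect paragraphs after the heading, and only up to the limit.
            self.in_p = self.seen_h1 and len(self.paragraphs) < self.limit
            if self.in_p:
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def paragraphs_after_h1(html, limit=3):
    """Return the stripped text of the first `limit` paragraphs after the first <h1>."""
    parser = ParagraphsAfterHeading(limit)
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs]


# Hypothetical sample markup, standing in for a downloaded event page.
sample = """
<p>Navigation text before the heading</p>
<h1>Event title</h1>
<p>First</p>
<p>Second <b>bold</b> text</p>
<p>Third</p>
<p>Fourth</p>
"""
print(paragraphs_after_h1(sample))  # ['First', 'Second bold text', 'Third']
```

Since you already have Beautiful Soup, the shorter route is probably soup.find("h1").find_all_next("p", limit=3), which walks forward from the heading in document order. Either way, check the result against a couple of the real event pages, because this only works if the markup there matches the assumption above.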