Web-scraping JavaScript page with Python

I'm trying to develop a simple web scraper. I want to extract the text without the HTML markup. This works on static pages, but on pages where JavaScript loads or modifies content I don't get good results.

For example, if some JavaScript code adds some text, I can't see it, because when I call

response = urllib2.urlopen(request)

I get the original text without the added one (because JavaScript is executed in the client).
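To make the problem concrete, here is a minimal sketch using only the standard library (Python 3). The names `TextExtractor`, `extract_text`, and `SAMPLE_HTML` are made up for illustration, and the page is a hypothetical one. It extracts the text from the raw markup, and since nothing executes the script, only the original text comes back:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: in a browser, the script would replace the paragraph text.
SAMPLE_HTML = """<html><body>
<p id="msg">original text</p>
<script>document.getElementById('msg').innerHTML = 'added by JavaScript';</script>
</body></html>"""

print(extract_text(SAMPLE_HTML))
```

The parser only reads the markup as it was sent by the server, so the text that the script would add never shows up.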

So, I'm looking for some ideas to solve this problem.

Answers


I've successfully done this in Java (I've used the Cobra toolkit http://lobobrowser.org/cobra.jsp)

Since you want to hack in Python (always a good choice), I recommend these two options:


You can also use the Python library dryscrape to scrape JavaScript-driven websites.

Example

To give an example, I created a sample page with the following HTML code:

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

Without JavaScript it says "No javascript support"; with JavaScript it says "Yay! Supports javascript".
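If you want to reproduce this locally, one option is to serve the sample page yourself. This is only a sketch using the standard library (Python 3); `PAGE`, `SamplePageHandler`, and `start_sample_server` are names I made up, and the server answers every GET request with the same page:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# The sample page from above, served as-is.
PAGE = b"""<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script>
</body>
</html>
"""


class SamplePageHandler(BaseHTTPRequestHandler):
    """Answer every GET request with the sample page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the console quiet


def start_sample_server(port=0):
    """Start the server in a background thread and return (server, url).

    Port 0 lets the operating system pick a free port.
    """
    server = HTTPServer(("127.0.0.1", port), SamplePageHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, "http://127.0.0.1:%d/" % server.server_address[1]
```

Calling `server, my_url = start_sample_server()` gives a URL to scrape; call `server.shutdown()` when you are done.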

Scraping without JS support:

>>> import requests
>>> from bs4 import BeautifulSoup
>>> response = requests.get(my_url)
>>> soup = BeautifulSoup(response.text, "html.parser")
>>> soup.find(id="intro-text")
<p id="intro-text">No javascript support</p>

Scraping with JS support:

>>> import dryscrape
>>> from bs4 import BeautifulSoup
>>> session = dryscrape.Session()
>>> session.visit(my_url)
>>> response = session.body()
>>> soup = BeautifulSoup(response, "html.parser")
>>> soup.find(id="intro-text")
<p id="intro-text">Yay! Supports javascript</p>

It sounds like the data you're really looking for can be accessed via a secondary URL called by some JavaScript on the primary page.

While you could try running JavaScript on the server to handle this, a simpler approach might be to load the page in Firefox and use a tool like Charles or Firebug to identify exactly what that secondary URL is. Then you can just query that URL directly for the data you are interested in.
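Once you know the secondary URL, querying it could look roughly like this. It is a sketch that assumes the endpoint returns UTF-8 encoded JSON, which may not be true for your page; `fetch_json` is a helper name I made up, and the `opener` parameter exists only so the function can be tried without a network connection:

```python
import json
import urllib.request


def fetch_json(url, opener=urllib.request.urlopen, timeout=10):
    """Request `url` and decode the response body as JSON.

    Assumes the endpoint answers with UTF-8 encoded JSON.
    """
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `data = fetch_json(secondary_url)`, where `secondary_url` is whatever address you found in the network panel. If the endpoint returns an HTML fragment instead of JSON, parse the body with BeautifulSoup rather than `json.loads`.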

