Scrape the Web
From Driscollwiki
Notes
- Use Firebug to traverse XHTML tree
- "Scraping ... is reconstructing someone else's thought processes"
Parsers
- BeautifulSoup
  - Relies on a deprecated parser
  - Not as good under Py3k
  - Not maintained
- html5lib
  - Can return BeautifulSoup objects
- lxml parse
  - XPath
    - Query language for tracing through a DOM tree
    - Queries can break if the page's tree structure changes
  - CSS Select
    - Useful to search for CSS ids/classes
    - XPath
  - PyQuery
    - Traverse the tree in a similar fashion to jQuery
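The XPath note about tree changes can be illustrated with a minimal sketch. It uses the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, as a stand-in for lxml's fuller XPath support. The XHTML snippets, the class name "price", and the helper name find_prices are all made up for illustration. The sketch compares a position-based path, which stops matching once the page gains a wrapper element, with an attribute-anchored path that keeps working.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same XHTML page: the second wraps the
# table in an extra <div>, as a site redesign might.
PAGE_V1 = """<html><body>
<table><tr><td class="name">eggplant</td><td class="price">1.50</td></tr></table>
</body></html>"""

PAGE_V2 = """<html><body><div id="main">
<table><tr><td class="name">eggplant</td><td class="price">1.50</td></tr></table>
</div></body></html>"""


def find_prices(xhtml, path):
    """Return the text of every element matching path (ElementTree XPath subset)."""
    root = ET.fromstring(xhtml)
    return [el.text for el in root.findall(path)]


# Position-based path: spells out every step down from the root element.
BRITTLE = "./body/table/tr/td[2]"
# Attribute-anchored path: matches any <td class="price"> at any depth.
ROBUST = ".//td[@class='price']"

for label, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
    print(label, find_prices(page, BRITTLE), find_prices(page, ROBUST))
```

On the second page the position-based path returns an empty list, while the attribute-anchored path still finds the price. Anchoring on ids or classes generally survives layout changes better than spelling out every step from the root.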
Simplest scraper
    import urllib2

    # url is the address of the page to fetch
    fd = urllib2.urlopen(url)
    print 'eggplant' in fd.read()  # True if the page contains 'eggplant'
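The scraper above is Python 2. In Python 3, urllib2 was split into urllib.request and urllib.error, and print became a function. A minimal sketch of the same check under Python 3 follows. The function name page_contains and the data: URL in the demo (used so the example runs without network access) are illustrative and not part of the original notes.

```python
from urllib.request import urlopen


def page_contains(url, needle):
    """Fetch url and report whether needle occurs in the decoded body."""
    with urlopen(url) as fd:
        # Use the charset the response declares, falling back to UTF-8.
        charset = fd.headers.get_content_charset() or "utf-8"
        body = fd.read().decode(charset, errors="replace")
    return needle in body


# A data: URL stands in for a real page so the demo needs no network.
demo_url = "data:text/plain,roasted%20eggplant"
print(page_contains(demo_url, "eggplant"))
```

Python 3's read() returns bytes, so the body is decoded before the substring test. For a real page, pass an http:// or https:// URL instead of the demo value.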
Anti-bot measures
- &s= parameter in the URL
  - Might be a token to verify that you aren't a bot
- Sites may also check the Referer header (HTTP's spelling of "referrer")
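Both measures can be sketched together, assuming (hypothetically) that the site embeds its token in an s= query parameter on links in the previous page and checks that the Referer header names that page. The URLs, the token value, and the helper names extract_token and build_followup are invented for illustration. Real sites vary, and the request is only built here, not sent.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlsplit
from urllib.request import Request


def extract_token(link, param="s"):
    """Return the value of the query parameter `param` in link, or None."""
    values = parse_qs(urlsplit(link).query).get(param)
    return values[0] if values else None


def build_followup(previous_page, link, query):
    """Build a request that carries the token and names previous_page as referrer."""
    target = urljoin(previous_page, urlsplit(link).path)
    params = {"q": query}
    token = extract_token(link)
    if token is not None:
        params["s"] = token  # echo the token back unchanged
    request = Request(target + "?" + urlencode(params))
    # HTTP spells this header "Referer" (a single "r" in the middle).
    request.add_header("Referer", previous_page)
    return request


# Hypothetical values: a search page and a link found in its HTML.
previous_page = "http://example.com/search"
link = "/results?q=eggplant&s=abc123"

req = build_followup(previous_page, link, "eggplant")
print(req.full_url)
print(req.get_header("Referer"))
```

Passing the resulting Request to urllib.request.urlopen would send it. If the token is tied to a session, cookies from the previous page would likely need to be carried over as well, which this sketch does not attempt.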