Scrapes scraperwiki.com and classic.scraperwiki.com
ScraperWiki
Showing 10 of 16,895 rows.

Columns: lang | date_scraped | code | id | name | status | author
(status and author are empty in every row shown; each row's code column appears as a block between its date_scraped and id fields.)
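
For anyone working with the downloaded table, here is a minimal sketch of iterating over the rows. It assumes the dataset has been exported to CSV; the filename scraperwiki_scrapers.csv and the export format are assumptions, not something this page specifies.

import csv

# Hypothetical filename for a CSV export of this table; adjust it to match
# whatever the actual download is called.
with open("scraperwiki_scrapers.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Each row carries the columns listed above; row["code"] holds the
        # full scraper source as a multi-line string.
        print(row["lang"], row["date_scraped"], row["id"], row["name"])
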
lang: php | date_scraped: 2010-11-24 13:37:28

<?php
$url = 'http://www.allbuffs.com/';
$html = scraperwiki::scrape($url);
print get_include_path() . "\n";
$dail = "Daíl";
$q = "Здрав".
"ствуй".
"те";
//print html_entity_decode($q, ENT_NOQUOTES, 'UTF-8')."\n";
//echo $dail[1] . "\n";
//echo $dail . "\n";
//echo "Daíl\n";
//echo strlen("Hello"). "\n";
//echo strlen($html) . "\n";
scraperwiki::save(Array('html'), Array('html' => $html));
echo $html;
echo "Hello\n";
?>

id: php-bug | name: PHP Bug

lang: python | date_scraped: 2011-02-07 17:20:04

###############################################################################
# START HERE: Tutorial 3: More advanced scraping. Shows how to follow 'next'
# links from page to page: use functions, so you can call the same code
# repeatedly. SCROLL TO THE BOTTOM TO SEE THE START OF THE SCRAPER.
###############################################################################
import scraperwiki
from BeautifulSoup import BeautifulSoup

# define the order our columns are displayed in the datastore
scraperwiki.metadata.save('data_columns', ['Artist', 'Album', 'Released', 'Sales (m)'])

# scrape_table function: gets passed an individual page to scrape
def scrape_table(soup):
    data_table = soup.find("table", {"class": "data"})
    rows = data_table.findAll("tr")
    for row in rows:
        # Set up our data record - we'll need it later
        record = {}
        table_cells = row.findAll("td")
        if table_cells:
            record['Artist'] = table_cells[0].text
            record['Album'] = table_cells[1].text
            record['Released'] = table_cells[2].text
            record['Sales (m)'] = table_cells[4].text
            # Print out the data we've gathered
            print record, '------------'
            # Finally, save the record to the datastore - 'Artist' is our unique key
            scraperwiki.datastore.save(["Artist"], record)

# scrape_and_look_for_next_link function: calls the scrape_table
# function, then hunts for a 'next' link: if one is found, calls itself again
def scrape_and_look_for_next_link(url):
    html = scraperwiki.scrape(url)
    soup = BeautifulSoup(html)
    scrape_table(soup)
    next_link = soup.find("a", {"class": "next"})
    print next_link
    if next_link:
        next_url = base_url + next_link['href']
        print next_url
        scrape_and_look_for_next_link(next_url)

# ---------------------------------------------------------------------------
# START HERE: define your starting URL - then
# call a function to scrape the first page in the series.
# ---------------------------------------------------------------------------
base_url = 'http://www.madingley.org/uploaded/'
starting_url = base_url + 'example_table_1.html'
scrape_and_look_for_next_link(starting_url)

id: ti | name: ti

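The tutorial above targets the Python 2 classic-ScraperWiki stack (BeautifulSoup 3 and scraperwiki.datastore). As a minimal sketch of the same follow-the-'next'-link pattern with current libraries, assuming requests and beautifulsoup4 are available and collecting rows into a plain list instead of a datastore:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all(start_url):
    records = []
    url = start_url
    while url:
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        # Same table layout as the tutorial: a table with class="data"
        for row in soup.select("table.data tr"):
            cells = row.find_all("td")
            if len(cells) >= 5:
                records.append({
                    "Artist": cells[0].get_text(strip=True),
                    "Album": cells[1].get_text(strip=True),
                    "Released": cells[2].get_text(strip=True),
                    "Sales (m)": cells[4].get_text(strip=True),
                })
        # Follow the 'next' link, resolving a relative href against the current page
        next_link = soup.find("a", class_="next")
        url = urljoin(url, next_link["href"]) if next_link else None
    return records

#records = scrape_all('http://www.madingley.org/uploaded/example_table_1.html')
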
lang: (blank) | date_scraped: 2011-02-07 17:20:26

import scraperwiki
import mechanize
import urllib, urlparse
import lxml.etree, lxml.html
import re
import urlparse
import datetime

#scraperwiki.cache(True)

def Main():
    purl = 'http://www.courtsni.gov.uk/en-GB/Judicial+Decisions/'
    root = lxml.html.parse(purl).getroot()
    urls = [ a.get('href') for a in root.cssselect('ul.QuickNav li a') if re.match("Published", a.text) ]
    for i, url in enumerate(urls):
        if i >= 1:  # skip all but the first
            continue
        print i, url
        parsesession(url)

def cleanup(data):
    assert data.get('Judge') == data.get('Author'), data
    author = data.pop('Author')
    judge = data.pop('Judge')
    mjudge = re.match('(.*?)\s+([LCJ]+)$', judge)
    if mjudge:
        data['judgename'] = mjudge.group(1)
        data['judgetitle'] = mjudge.group(2)
    else:
        data['judgename'] = judge
    for dkey in ['Date Created', 'Date Modified', 'Date Issued']:
        if data.get(dkey):
            mdate = re.match('(\d\d)/(\d\d)/(\d\d\d\d)$', data.get(dkey))
            data[dkey] = datetime.datetime(int(mdate.group(3)), int(mdate.group(2)), int(mdate.group(1)))
    mdate = re.match('(\d+) (\w\w\w) (\d\d\d\d)', data.get('Date'))
    #print mdate.groups()
    imonth = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'].index(mdate.group(2))
    data['Date'] = datetime.datetime(int(mdate.group(3)), imonth+1, int(mdate.group(1)))

def parsepage(url):
    for i in range(3):
        try:
            root = lxml.html.parse(url).getroot()
            break
        except IOError:
            pass
    rows = root.cssselect('table.MetaGrid_Display tr')
    data = { }
    for row in rows:
        td1, td2 = list(row)
        data[td1.text.strip(': ')] = td2.text
    a = root.cssselect('#MSCMSAttachmentCustom6_PresentationModeControlsContainer_presHyperLink')[0]
    #print data.keys()
    data['Judgment'] = urlparse.urljoin(url, a.get('href'))
    return data

def parsecontentspage(response):
    root = lxml.html.parse(response).getroot()
    rows = root.cssselect('table#DynamicControlDataGrid1_MSCSDataListXML_Document_Dynamic1 tr')
    headers = [ th[0].text for th in rows.pop(0) ]
    assert headers == ['Date', 'Title', 'Identifier', 'Judge'], headers
    numdocs = int(root.cssselect("span#DynamicControlDataGrid1_lblDocCount")[0].text)
    lrt = ''
    if len(list(rows[-1])) == 1:
        lr = rows.pop()
        lrt = lxml.etree.tostring(lr)
    for tr in rows:
        data = dict(zip(headers, [tr[0].text, tr[1][0].text, tr[2].text, tr[3].text]))
        data['url'] = tr[1][0].get('href')
        fdata = parsepage(data['url'])
        assert data.get('Identifier') == fdata.get('Identifier'), (data, fdata)
        assert data.get('Title') == fdata.get('Title'), (data, fdata)
        data.update(fdata)
        cleanup(data)
        #print data
        scraperwiki.datastore.save(unique_keys=['Identifier'], data=data, date=data.get('Date Issued'))
    return lrt, numdocs, len(rows)

def parsesession(url):
    br = mechanize.Browser()
    response = br.open(url)
    lrt, numdocs, count = parsecontentspage(response)
    ipage = 2
    while lrt:
        plinks = re.findall("\"javascript:__doPostBack\('(.*?)','(.*?)'\)\">(\d+|\.\.\.)</a>", lrt)
        lklook = dict([ (int(x[2]), x) for x in plinks if x[2] != '...' ])
        nx = lklook.get(ipage)
        if not nx:
            if plinks and plinks[-1][2] == '...':
                assert (ipage % 10) == 1, lklook
                nx = plinks[-1]
            else:
                return
        print "Page", ipage
        br.select_form('DynamicTable')
        br.form.set_all_readonly(False)
        #print br.form.controls
        br['__EVENTTARGET'] = nx[0]
        br['__EVENTARGUMENT'] = nx[1]
        #print br.form.controls[-1].name
        br.form.find_control('DynamicControlDataGrid1:SearchButton').disabled = True
        response = br.submit()
        lrt, lnumdocs, lcount = parsecontentspage(response)
        ipage += 1
        assert numdocs == lnumdocs
        count += lcount
        assert count == lcount

Main()

id: courts-ni-judicial-decisions | name: Courts NI Judicial Decisions

lang: python | date_scraped: 2011-02-07 17:20:28

import scraperwiki
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin

def scrape(base_url):
    soup = BeautifulSoup(scraperwiki.scrape(base_url))
    main_table = soup.findAll('table')
    assert len(main_table) == 1
    main_table = main_table[0]
    for row in main_table.findAll('tr'):
        if row['class'] == 'table_header':
            continue
        server_info = {}
        server_info['offline'] = 'offline' in row['class']
        server_td = row.find('td', 'server')
        server_info['software'] = server_td.find('img')['title']
        server_div = server_td.find('div', 'tooltip_container')
        server_info['name'] = server_div.find('a').string
        if server_div.find('a').has_key('href'):
            server_info['url'] = server_div.find('a')['href']
        print server_info['name']
        #product_td = product_row.findAll('td')
        #product['type'] = ''.join(product_td[1].findAll(text=True))
        #product_manufacturer = product_td[2].find('a')
        #if product_td[2].find('a') is None:
        #    product_manufacturer = product_td[2]
        #product['manufacturer'] = product_manufacturer.string
        #product_line = product_td[3].find('a')
        #product['url'] = urljoin(base_url, product_line['href'])
        #product['model'] = product_line.string
        #if 'title' in product_line:
        #    product['name'] = product_line['title']
        scraperwiki.datastore.save(['name'], server_info)

scrape('http://www.jabberes.org/servers/servers.html')

id: jabberxmpp-server-list | name: Jabber servers

lang: python | date_scraped: 2010-11-24 13:44:05

###############################################################################
# Basic scraper
###############################################################################
# Blank Python
import re
import scraperwiki
from BeautifulSoup import BeautifulSoup

# define the order our columns are displayed in the datastore
scraperwiki.metadata.save('users', ['Count', 'Name', 'Private', 'URL', 'Friends', 'Interests'])

# scrape the fan section
def scrape_personal(vcard):
    # set up the data record
    record = {}
    record['Count'] = count
    record['Name'] = vcard.h1.text
    p = re.compile('.register.')
    s = vcard.h1.a['href']
    m = p.search(s)
    if m:
        record['Private'] = "Y"
    else:
        record['Private'] = "N"
    record['URL'] = s
    scraperwiki.datastore.save(["Count"], record)

def scrape_interests(info):
    # set up the data record
    record = {}
    record['Count'] = count
    friends_row = info.find("div", {"class": "UIGridRenderer_Row clearfix"})
    if friends_row:
        s = ""
        friends = friends_row.findAll("div", {"class": "UIPortrait_Text"})
        for friend in friends:
            s += friend.text + ":"
        record['Friends'] = s
    public_listing = info.find("div", {"id": "public_listing_pages"})
    if public_listing:
        l = ""
        likes = public_listing.findAll("th")
        things = public_listing.findAll("a", {"class": "psl"})
        for like in likes:
            l += "|" + like.text + "|"
        for psl in things:
            l += psl.text + ":"
        record['Interests'] = l
    scraperwiki.datastore.save(["Count"], record)

def find_page(url):
    global count
    check = False
    try:
        html = scraperwiki.scrape(url)
        check = True
    except:
        print "Unable to retrieve page"
        check = False
        #continue
    if check:
        soup = BeautifulSoup(html)
        vcard = soup.find("div", {"class": "vcard"})
        info = soup.find("div", {"class": "info_column"})
        if vcard or info:
            scrape_personal(vcard)
            scrape_interests(info)
            #print info
            count += 1
        else:
            directory = soup.find("div", {"class": "clearfix"})  # find the directory of links
            UIDirectoryBox = directory.findAll("ul", {"class": "UIDirectoryBox_List"})  # find all the directory boxes
            for UI in UIDirectoryBox:
                links = UI.findAll("a")
                for link in links:
                    #print link['href']
                    urls = link['href']
                    find_page(urls)

def scrape_page(url):
    html = scraperwiki.scrape(url)  # get the landing page
    soup = BeautifulSoup(html)
    link_table = soup.findAll(attrs={'class': re.compile("alphabet_list clearfix")})
    for link in link_table:  # get each link section
        next_urls = link.findAll("a")
        for urls in next_urls:
            #print urls['href']  # debug to check the urls
            url = urls['href']  # get the href
            print url
            find_page(url)

# set up the base url
base_url = 'http://www.facebook.com/directory/people/'
# set the counter
count = 0
# call the scraping function
scrape_page(base_url)
#find_page('http://en-gb.facebook.com/people/Abdullah-Bin-A/867565175')

id: public-facebook-records | name: Public facebook records

lang: python | date_scraped: 2011-02-07 17:20:37

#################################################################
# BBC Weather Scraper
#################################################################
import scraperwiki
from BeautifulSoup import BeautifulSoup

# URL for the region Hull, East Riding of Yorkshire
html = scraperwiki.scrape('http://news.bbc.co.uk/weather/forecast/336?printco=Forecast')
print html
soup = BeautifulSoup(html)
days = soup.findAll('tr')
for day in days:
    if day['class'].find('day') == -1:
        continue
    record = {
        'day': None,
        'summary': None,
        'temp_max_c': None,
        'temp_min_c': None,
        # 'windspeeddir': None,
        # 'humpresvis': None,
    }
    tds = day.findAll('td')
    for abbr in tds[0].findAll('abbr'):
        record['day'] = abbr.text
    for span in tds[2].findAll('span'):
        try:
            if span['class'].find('temp max') != -1:
                record['temp_max_c'] = span.findAll('span', {'class': 'cent'})[0].text[:-6]
        except:
            pass
    for span in tds[3].findAll('span'):
        try:
            if span['class'].find('temp min') != -1:
                record['temp_min_c'] = span.findAll('span', {'class': 'cent'})[0].text[:-6]
        except:
            pass
    # Windspeed & Direction
    # for span in tds[4].findAll('span'):
    #     try:
    #         if span['class'].find('wind') != -1:
    #             record['windspeeddir'] = span.findAll('span', {'class': 'mph'})[0].text[:-6]
    #     except:
    #         pass
    # Humidity, Pressure, Visibility
    # for span in tds[5].findAll('span'):
    #     try:
    #         if span['class'].find('humpresvis') != -1:
    #             record['humpresvis'] = span.findAll('span', {'class': 'hum'})[0].text[:-6]
    #     except:
    #         pass
    record['summary'] = day.findAll('div', {'class': 'summary'})[0].findAll('strong')[0].text
    print
    scraperwiki.datastore.save(["day"], record)

id: bbc-weather-5-day-forecast-for-hull-uk | name: BBC Weather 5 Day Forecast for Hull, UK

lang: python | date_scraped: 2011-02-07 17:20:40

import scraperwiki
from BeautifulSoup import BeautifulSoup
from urlparse import urljoin

def scrape():
    base_url = 'http://auto-hifi.ru/price.php'
    html = scraperwiki.scrape(base_url)
    soup = BeautifulSoup(html)
    div_contentpanel = soup.findAll('td', 'contentpanel')
    assert len(div_contentpanel) == 1
    for product_row in soup.findAll('tr', {'align': 'center'})[1:]:
        product = {}
        product_td = product_row.findAll('td')
        product['type'] = ''.join(product_td[1].findAll(text=True))
        product_manufacturer = product_td[2].find('a')
        if product_td[2].find('a') is None:
            product_manufacturer = product_td[2]
        product['manufacturer'] = product_manufacturer.string
        product_line = product_td[3].find('a')
        product['url'] = urljoin(base_url, product_line['href'])
        product['model'] = product_line.string
        if 'title' in product_line:
            product['name'] = product_line['title']
        scraperwiki.datastore.save(['url'], product)

scrape()

id: auto-hifiru | name: auto-hifi.ru

lang: python | date_scraped: 2011-02-07 17:20:41

###############################################################################
# Basic scraper
###############################################################################
import scraperwiki
from BeautifulSoup import BeautifulSoup

"""
soupselect.py

CSS selector support for BeautifulSoup.

soup = BeautifulSoup('<html>...')
select(soup, 'div')
- returns a list of div elements
select(soup, 'div#main ul a')
- returns a list of links inside a ul inside div#main
"""
import re

tag_re = re.compile('^[a-z0-9]+$')
attribselect_re = re.compile(
    r'^(?P<tag>\w+)?\[(?P<attribute>\w+)(?P<operator>[=~\|\^\$\*]?)' +
    r'=?"?(?P<value>[^\]"]*)"?\]$'
)

# /^(\w+)\[(\w+)([=~\|\^\$\*]?)=?"?([^\]"]*)"?\]$/
#   \---/  \---/\-------------/    \-------/
#     |      |         |               |
#     |      |         |           The value
#     |      |    ~,|,^,$,* or =
#     |   Attribute
#    Tag

def attribute_checker(operator, attribute, value=''):
    """
    Takes an operator, attribute and optional value; returns a function that
    will return True for elements that match that combination.
    """
    return {
        '=': lambda el: el.get(attribute) == value,
        # attribute includes value as one of a set of space separated tokens
        '~': lambda el: value in el.get(attribute, '').split(),
        # attribute starts with value
        '^': lambda el: el.get(attribute, '').startswith(value),
        # attribute ends with value
        '$': lambda el: el.get(attribute, '').endswith(value),
        # attribute contains value
        '*': lambda el: value in el.get(attribute, ''),
        # attribute is either exactly value or starts with value-
        '|': lambda el: el.get(attribute, '') == value \
            or el.get(attribute, '').startswith('%s-' % value),
    }.get(operator, lambda el: el.has_key(attribute))

def select(soup, selector):
    """
    soup should be a BeautifulSoup instance; selector is a CSS selector
    specifying the elements you want to retrieve.
    """
    tokens = selector.split()
    current_context = [soup]
    for token in tokens:
        m = attribselect_re.match(token)
        if m:
            # Attribute selector
            tag, attribute, operator, value = m.groups()
            if not tag:
                tag = True
            checker = attribute_checker(operator, attribute, value)
            found = []
            for context in current_context:
                found.extend([el for el in context.findAll(tag) if checker(el)])
            current_context = found
            continue
        if '#' in token:
            # ID selector
            tag, id = token.split('#', 1)
            if not tag:
                tag = True
            el = current_context[0].find(tag, {'id': id})
            if not el:
                return []  # No match
            current_context = [el]
            continue
        if '.' in token:
            # Class selector
            tag, klass = token.split('.', 1)
            if not tag:
                tag = True
            found = []
            for context in current_context:
                found.extend(
                    context.findAll(tag,
                        {'class': lambda attr: attr and klass in attr.split()}
                    )
                )
            current_context = found
            continue
        if token == '*':
            # Star selector
            found = []
            for context in current_context:
                found.extend(context.findAll(True))
            current_context = found
            continue
        # Here we should just have a regular tag
        if not tag_re.match(token):
            return []
        found = []
        for context in current_context:
            found.extend(context.findAll(token))
        current_context = found
    return current_context

def monkeypatch(BeautifulSoupClass=None):
    """
    If you don't explicitly state the class to patch, defaults to the most
    common import location for BeautifulSoup.
    """
    if not BeautifulSoupClass:
        from BeautifulSoup import BeautifulSoup as BeautifulSoupClass
    BeautifulSoupClass.findSelect = select

def unmonkeypatch(BeautifulSoupClass=None):
    if not BeautifulSoupClass:
        from BeautifulSoup import BeautifulSoup as BeautifulSoupClass
    delattr(BeautifulSoupClass, 'findSelect')

# retrieve a page
monkeypatch(BeautifulSoup)
starting_url = 'http://republicanwhip.house.gov/Newsroom/va7news.html'
html = scraperwiki.scrape(starting_url)
soup = BeautifulSoup(html)

re_href = re.compile(r'href="(.*?)"', re.I)

def strip_tags(x):
    r = re.compile(r'<[^>]*>')
    return r.sub('', x)

statements = {}

# use BeautifulSoup to get all <tr> tags
trs = select(soup, 'tr.style2')
for tr in trs:
    links = select(tr, 'td a')
    if len(links) == 2:
        date = strip_tags(str(links[0])).strip()
        title = strip_tags(str(links[1])).strip()
        href = re_href.search(str(links[0])).group(1).strip()
        try:
            release = scraperwiki.scrape(href)
        except:
            continue
        b2 = BeautifulSoup(release)
        body = select(b2, 'div.asset-body div.style1')
        if not statements.has_key(date):
            statements[date] = {}
        statements[date][title] = str(body)
        record = {'date': date, 'title': title, 'text': body}
        scraperwiki.datastore.save(['date', 'title'], record)

id: rep-eric-cantors-press-releases | name: Rep. Eric Cantor's Press Releases

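The soupselect helper embedded in the scraper above documents its usage only in a docstring. A small usage sketch, assuming classic BeautifulSoup 3 is installed and reusing the select() function defined in that scraper; the HTML snippet is invented purely for illustration:

from BeautifulSoup import BeautifulSoup  # classic BeautifulSoup 3, as used above

snippet = """
<div id="main">
  <ul>
    <li><a href="/one" rel="external">one</a></li>
    <li><a href="/two">two</a></li>
  </ul>
</div>
<p class="style2">not a link</p>
"""
soup = BeautifulSoup(snippet)
print(select(soup, 'div#main ul a'))    # id + descendant tag selectors: both links
print(select(soup, 'p.style2'))         # class selector
print(select(soup, 'a[rel=external]'))  # attribute selector: the first link only
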
lang: python | date_scraped: 2011-02-07 17:20:42

import scraperwiki
import re
from BeautifulSoup import BeautifulSoup

# The URLs we're going to scrape:
url = "http://www.taoiseach.gov.ie/eng/Taoiseach_and_Government/Government_Legislation_Programme/SECTION_A1.html"
html = scraperwiki.scrape(url)

# http://www.taoiseach.gov.ie/eng/eng/Taoiseach_and_Government/Government_Legislation_Programme/SECTION_A1.html  SECTION A - Lists 23 Bills which the Government expect to publish from the start of the Dáil session up to the beginning of the next session
# http://www.taoiseach.gov.ie/eng/Taoiseach_and_Government/Government_Legislation_Programme/SECTION_B1.html  SECTION B - Lists 13 Bills in respect of which Heads of Bills have been approved by Government and of which texts are being prepared
# http://www.taoiseach.gov.ie/eng/Taoiseach_and_Government/Government_Legislation_Programme/SECTION_C11.html  SECTION C - Lists 55 Bills in respect of which heads have yet to be approved by Government
# http://www.taoiseach.gov.ie/eng/Taoiseach_and_Government/Government_Legislation_Programme/SECTION_D1.html  SECTION D - Lists 25 Bills which are currently before the Dáil or Seanad
# http://www.taoiseach.gov.ie/eng/Taoiseach_and_Government/Government_Legislation_Programme/SECTION_E1.html  SECTION E - Lists 113 Bills which were enacted since the Government came to office on 14th June, 2007
# http://www.taoiseach.gov.ie/eng/Taoiseach_and_Government/Government_Legislation_Programme/SECTION_F1.html  SECTION F - Lists 121 Bills which were published since the Government came to office on 14th June, 2007

# This is a useful helper function that you might want to steal.
# It cleans up the data a bit.
# def gettext(html):
#     """Return the text within html, removing any HTML tags it contains."""
#     cleaned = re.sub('<.*?>', '', html)  # remove tags
#     cleaned = ' '.join(cleaned.split())  # collapse whitespace
#     return cleaned

#text = scraperwiki.scrape(url)
soup = BeautifulSoup(html)
#rows = re.findall('(?si)<tr[^>]*>(.*?)</tr>', text)
# <td valign="top">Forestry Bill</td><td valign="top">To reform and update the legislative framework relating to forestry in order to support the development of a modern forestry sector, which enshrines the principles of sustainable forest management and protection of the environment<br /><br /></td></tr><tr>
#for row in rows:

trs = soup.findAll('tr')
for tr in trs:
    if tr.find(colspan="3"):
        continue
    elif tr.contents[1].contents[0] == " Name Of Company ":
        continue
    else:
        number, bill, desc = tr.contents[0].contents, tr.contents[1].contents, tr.contents[2].contents
        #dept = re.search('<td colspan="3"><strong>(.*?)</strong></td>', row)
        #if dept:
        #    deptb = dept
        #    deptb, number, bill, desc = None, None, None, None
        #print deptb, number, bill, desc
        data = {'number': number, 'bill': bill, 'desc': desc}  #'deptb': deptb,
        scraperwiki.datastore.save(['number'], data)

id: legislation-status | name: Irish Government: Legislation status

lang: ruby | date_scraped: 2011-02-07 17:20:42

# Welcome to the second ScraperWiki Ruby tutorial
# At the end of the last tutorial we had downloaded the text of
# a webpage. We will do that again ready to process the contents
# of the page
html = ScraperWiki.scrape("http://www.stadtbranchenbuch.com/muenster/P/338.html")
puts html

# Next we use Nokogiri to extract the values from the HTML source.
# Uncomment the next five lines (i.e. delete the # at the start of the lines)
# and run the scraper again. There should be output in the console.
require 'nokogiri'
doc = Nokogiri::HTML(html)

# Then we can store this data in the datastore. Uncomment the following three lines and run
# the scraper again.
doc.search('.rLiTop').each do |td|
  data = td.css("a").first.inner_html.split("<br>").map(&:strip)
  #name = td.css("address").first.inner_html.split("</a>").map(&:strip)
  #td.css("a").inner_html
  something = "#{data}"  # the commented-out 'name' extraction above is unused, so only 'data' is saved
  ScraperWiki.save(['data'], {'data' => something})
end

# Check the 'Data' tab - here you'll see the data saved in the ScraperWiki store.

id: stadtbranchenbuch | name: stadtbranchenbuch
