There are plenty of articles on Zhihu, WeChat official accounts and GitHub on how to scrape data from the web. Scrapy, one of the most powerful open-source Python packages, can handle almost any web-scraping job. Here, I note the basic steps of scraping with looter and the visualisation of the crawled data with pyecharts.
Basic scraping steps
The basic steps in scraping are: send a request, parse the response, and store the data.
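Put together, a minimal sketch of the whole pipeline with requests, lxml and json looks like this (the URL and the a selector are placeholders; a real spider would use selectors matching its target page):
import json
import requests
from lxml import etree

# 1. send a request (placeholder URL)
r = requests.get('https://example.com')

# 2. parse the response and select the interesting elements
tree = etree.HTML(r.text)
titles = [a.text for a in tree.cssselect('a') if a.text]

# 3. store the extracted data
with open('titles.json', 'w') as f:
    json.dump(titles, f, ensure_ascii=False, indent=2)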
Make a Request
One can send a request using the package requests:
>>> import requests
>>> r = requests.get('https://api.github.com/events') # return a Response object r
>>> r.text # return the content of the response
>>> r.status_code # check the response status
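Since the example URL above is a JSON API rather than an HTML page, requests can also decode the body directly (r.json() and r.headers are standard parts of requests):
>>> r.headers['content-type'] # e.g. 'application/json; charset=utf-8'
>>> r.json() # decode the JSON body into Python lists/dicts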
While with looter, one can:
looter shell <your url>
>>> import looter as lt
# looter already does `from lxml import etree` internally
# lxml is used for processing XML and HTML; Elements behave like lists and carry attributes as a dict
>>> tree = lt.fetch(url) # return etree.HTML(r.text), the ElementTree of the response
Parse a Response
To parse an ElementTree, one can use cssselect or xpath; here I’ll use cssselect. For a response fetched directly with requests.get, we first have to convert its content explicitly with lxml.etree, while with looter a tree object has already been returned by fetch.
# with looter, we don't need to import this explicitly, as it is already imported internally
>>> from lxml import etree
>>> tree = etree.HTML(r.text)
# select the interesting part of tree with CSS tag
>>> items = tree.cssselect('td input') # select input elements inside td cells; td and input are HTML tags
>>> items = tree.cssselect('.question-summary') #another example for stackoverflow
# cssselect returns a list of Elements
>>> items
[<Element div at 0x1037f89c8>, <Element div at 0x1037f8a08>, ..., <Element div at 0x103954ac8>]
>>> item = items[0]
>>> item
<Element div at 0x1037f89c8>
# each Element carries its HTML attributes as a dict
>>> item.attrib
{'class': 'question-summary', 'id': 'question-summary-231767'}
>>> subitem = item.cssselect('a.question-hyperlink')
>>> len(subitem) # cssselect returns a list even when only one element matches
1
>>> subitem[0].attrib
{'href': '/questions/231767/what-does-the-yield-keyword-do', 'class': 'question-hyperlink'}
>>> subitem[0].text
'What does the “yield” keyword do?'
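The same selection can also be written with xpath instead of cssselect. A rough equivalent of the calls above would be the following (note that @class compares the attribute exactly, so a contains() test is safer when elements carry several classes):
>>> items = tree.xpath('//div[@class="question-summary"]') # like cssselect('.question-summary'), but an exact class match
>>> link = items[0].xpath('.//a[@class="question-hyperlink"]')[0]
>>> link.get('href')
'/questions/231767/what-does-the-yield-keyword-do'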
Save wanted data
For data, we can write it out as JSON; for images, we use the function provided directly by looter. One can also write one's own implementation, taking the one in looter as a reference.
>>> import looter as lt # use the save_img function from looter
>>> lt.save_img(item.attrib.get('data-src')) # the 'data-src' attribute carries the image URL in this example
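For plain data, a minimal sketch with the standard json module (the filename data.json and the choice of dumping the raw attribute dicts are just for illustration):
>>> import json
>>> rows = [dict(item.attrib) for item in items] # e.g. collect the attribute dicts shown above
>>> with open('data.json', 'w') as f:
...     json.dump(rows, f, ensure_ascii=False, indent=2)
looter also ships a save_as_json helper, which the full example below relies on.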
Scraping with looter
As declared in its documentation, if one wants to quickly write a spider, looter can automatically generate one :)
looter genspider <name> <tmpl> [--async]
Here, tmpl is the template, either data or image. The --async flag generates a spider that uses asyncio instead of a thread pool. In the generated template you can customize the domain and the tasklist; the tasklist is simply the list of pages you want to crawl.
domain = 'https://konachan.com'
tasklist = [f'{domain}/post?page={i}' for i in range(1, 9777)]
Then you should customize your crawl function, which is the core of your spider.
def crawl(url):
    tree = lt.fetch(url)
    items = tree.cssselect('ul li')
    for item in items:
        data = dict()
        # data[...] = item.cssselect(...)
        pprint(data)
In most cases, the content you want to crawl is a list (a ul or ol tag in HTML), so you can select it as items. Then just iterate over the items with a for loop, select the things you want, and store them in a dict. Before you finish the spider, you’d better debug your cssselect expressions in the shell provided by looter, as in the snippet below.
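For instance, after tree = lt.fetch(url) (in the looter shell or any Python REPL), the selectors used in the next example can be tried one at a time before writing the full loop:
>>> items = tree.cssselect('.question-summary')
>>> len(items) # first check that the selector matches anything at all
>>> items[0].cssselect('a.question-hyperlink')[0].text # then check one field at a time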
Here is an example shipped with looter (with a small modification: the data is written to JSON instead of printed on screen).
import looter as lt
from pprint import pprint
from concurrent import futures

domain = 'https://stackoverflow.com'
total = []

def crawl(url):
    tree = lt.fetch(url)
    items = tree.cssselect('.question-summary')
    for item in items:
        data = dict()
        data['question'] = item.cssselect('a.question-hyperlink')[0].text
        data['link'] = domain + item.cssselect('a.question-hyperlink')[0].get('href')
        data['votes'] = int(item.cssselect('.vote-count-post strong')[0].text)
        data['answers'] = int(item.cssselect('.status strong')[0].text)
        data['views'] = int(''.join(item.cssselect('.views')[0].get('title')[:-6].split(',')))
        data['timestamp'] = item.cssselect('.relativetime')[0].get('title')
        data['tags'] = []
        for tag in item.cssselect('a.post-tag'):
            data['tags'].append(tag.text)
        #pprint(data)
        total.append(data)
    # save once per crawled page, after its items have been appended
    lt.save_as_json(total, name='stackoverflow')
if __name__ == '__main__':
    tasklist = [f'{domain}/questions/tagged/python?page={n}&sort=votes&pagesize=50' for n in range(1, 31)]
    #result = [crawl(task) for task in tasklist]
    with futures.ThreadPoolExecutor(10) as executor:
        executor.map(crawl, tasklist)
Data Visualisation
Here I just show one possibility, using pyecharts. To use pyecharts, we need to convert the crawled data into two lists: one storing the tag names and the other their frequencies. Here is an example.
import json
from pprint import pprint
from pyecharts import WordCloud

tag = []
with open('stackoverflow.json') as json_file:
    data = json.load(json_file)
for tgs in data:
    for t in tgs['tags']:
        tag.append(t)

tagNm = []
tagNb = []
for t in tag:
    if t in tagNm:
        tagNb[tagNm.index(t)] += 1
    else:
        tagNm.append(t)
        tagNb.append(1)

wordcloud = WordCloud(width=1300, height=620)
wordcloud.add("", tagNm, tagNb, word_size_range=[20, 100])
wordcloud.render()
The rendered result is a word cloud of the scraped tags.
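As a side note, the name/frequency lists above can also be built with collections.Counter from the standard library; a sketch assuming the same stackoverflow.json layout (each record carrying a 'tags' list):
import json
from collections import Counter

with open('stackoverflow.json') as json_file:
    data = json.load(json_file)

# count every tag across all crawled questions
counts = Counter(t for record in data for t in record['tags'])
tagNm, tagNb = list(counts.keys()), list(counts.values())
# the two lists stay aligned, so they can be passed to WordCloud.add exactly as above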