How to set up Juniper's OpenStack FWaaS Plugin

I have written a tech wiki article on how to install Juniper's OpenStack FWaaS Plugin @ http://forums.juniper.net/t5/Data-Center/How-to...

Tuesday, August 25, 2015

Scrapy: A Python framework for web crawling

Scrapy in the words of its creators:
"Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival."
 The example on the Scrapy site shows how concise working code can be.
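A minimal spider in the same spirit looks roughly like this (the target site quotes.toscrape.com and the selectors are only illustrative, not the exact snippet from the Scrapy site):

import scrapy

class QuotesSpider(scrapy.Spider):
    # Illustrative example: crawl one page and yield a dict per quote.
    name = "quotes"
    start_urls = [
        "http://quotes.toscrape.com",
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text':   quote.xpath('span[@class="text"]/text()').extract()[0],
                'author': quote.xpath('.//small[@class="author"]/text()').extract()[0],
            }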

Scrapy works only with Python 2.7. The objective of this blog is to get you started with Scrapy and give you enough information to take it further on your own. In this blog, I will set up a Scrapy project and retrieve some data from my blog site.

Pre-requisites
Python 2.7 and pip.

Setup
pip install scrapy
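You can confirm the installation by checking the version:
scrapy version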

Create a Project
scrapy startproject blog

This command will create a 'blog' directory with the following structure:
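The exact file list can vary a little between Scrapy versions, but the generated layout is roughly:

blog/
    scrapy.cfg
    blog/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py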

The file scrapy.cfg can be used to configure the project. items.py holds the item models (the structured data you want to extract), and the spiders that crawl the web and fetch data are defined under the spiders folder.

Let's say our goal is to extract the following information from this blog site:
  • Blog Title
  • Blog Description
  • Blogs listed in Popular Posts section
We can define the following model in items.py:

import scrapy

class BlogItem(scrapy.Item):
    # Fields for the blog title, its description and the four entries
    # listed in the Popular Posts section.
    title = scrapy.Field()
    desc  = scrapy.Field()
    pop1  = scrapy.Field()
    pop2  = scrapy.Field()
    pop3  = scrapy.Field()
    pop4  = scrapy.Field()

We can then proceed and create the spider to fetch the required data.
import scrapy
from blog.items import BlogItem

class BlogSpider(scrapy.Spider):
    name = "blog"
    allowed_domains = ["blogspot.in"]
    start_urls = [
        "http://sarathblogs.blogspot.in",
    ]

    def parse(self, response):
        # Populate a BlogItem with the blog title, description and the
        # titles of the four entries in the Popular Posts section.
        b = BlogItem()
        b['title'] = response.xpath('//h1[@class="title"]/text()').extract()[0]
        b['desc'] = response.xpath('//p[@class="description"]/span/text()').extract()[0]
        b['pop1'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[1]/div[1]/div[2]/a/b/text()').extract()[0]
        b['pop2'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[2]/div[1]/div[2]/a/b/text()').extract()[0]
        b['pop3'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[3]/div[1]/div[1]/a/b/text()').extract()[0]
        b['pop4'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[4]/div[1]/div[1]/a/b/text()').extract()[0]
        return b


By default the spider crawls the URLs listed in start_urls and hands each response to the parse method. The response object provides selectors for parsing the HTML with XPath, CSS, regular expressions, etc.; response.xpath() is the shorthand for the XPath selector. I am using XPath to traverse the HTML response and extract the required data, which is stored in a BlogItem object and returned from parse.
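A handy way to work out these XPath expressions is the interactive scrapy shell, which fetches a page and lets you try selectors against the response (the expression below is one of the ones used in the spider):

scrapy shell "http://sarathblogs.blogspot.in"
>>> response.xpath('//h1[@class="title"]/text()').extract()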

Running the Spider
You can run the spider by executing the following command from the project's root directory:
scrapy crawl blog

Scrapy logs the crawl to the console, and the scraped item appears in that output.


Saving the results
Scrapy provides handy feed exports for saving the scraped data as JSON, CSV, etc., and item pipelines for pushing it into a database (a pipeline sketch follows the JSON output below). The commands for the JSON and CSV exports are:
  • scrapy crawl blog -o blog.json
  • scrapy crawl blog -o blog.csv

The JSON output is as follows:
[
    {
        "title": "\nSarath Chandra Mekala\n",
        "pop1": "\nOpenStack Kilo MultiNode VM Installation using Centos 7 on VirtualBox\n",        
        "pop2": "\nHow to run Juniper Firefly (vSRX) on KVM -- SRX in a box setup\n",
        "pop3": "\nOpenstack : Fixing Failed to create network. No tenant network is available for allocation issue.\n",        
        "pop4": "\nFixing Openstack VM spawning issue: No suitable host found/vif_type=binding_failed error\n",
        "desc": "Openstack, Cloud Orchestration, Networking, Java & J2EE"
    }
]
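
Saving to a database is typically done with an item pipeline: a class whose process_item method Scrapy calls for every scraped item, enabled via ITEM_PIPELINES in settings.py. Here is a minimal sketch that just writes JSON lines to a file (the file name and class name are only illustrative; a real pipeline would insert into your database instead):

# blog/pipelines.py
import json

class JsonWriterPipeline(object):
    # Called once when the spider starts/stops, and once per scraped item.
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

Enable it in settings.py:

ITEM_PIPELINES = {
    'blog.pipelines.JsonWriterPipeline': 300,
}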

Hope this was helpful and motivates you to give Scrapy a shot (www.scrapy.org).