Scrapy: A Python framework for web crawling

Scrapy in the words of its creators:
"Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival."
 A screenshot grabbed from the site shows how concise the working code can be:

At the time of writing, Scrapy supports only Python 2.7. The objective of this post is to get you started with Scrapy and provide enough information for you to carry on further on your own. Below, I will set up a Scrapy project and retrieve some data from my blog site.

Pre-requisites
Setup
pip install scrapy

Create a Project
scrapy startproject blog

This command will create a 'blog' directory with the following structure:
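The standard layout generated by scrapy startproject looks roughly like this (a sketch; the exact set of files varies by Scrapy version):

```text
blog/
    scrapy.cfg          # project configuration file
    blog/               # the project's Python module
        __init__.py
        items.py        # item model definitions
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/        # spider definitions go here
            __init__.py
```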

The file scrapy.cfg can be used to configure the project. items.py holds the item models for the project, and the spiders that crawl the web and fetch data are defined under the spiders folder.

Let's say our goal is to extract the following information from this blog site:
  • Blog Title
  • Blog Description
  • Blogs listed in Popular Posts section
We can define the following model in items.py:

import scrapy

class BlogItem(scrapy.Item):
    title = scrapy.Field()
    desc = scrapy.Field()
    pop1 = scrapy.Field()
    pop2 = scrapy.Field()
    pop3 = scrapy.Field()
    pop4 = scrapy.Field()

We can then proceed and create the spider to fetch the required data.
import scrapy
from blog.items import BlogItem

class BlogSpider(scrapy.Spider):
    name = "blog"
    allowed_domains = ["blogspot.in"]
    start_urls = [
        "http://sarathblogs.blogspot.in",
    ]

    def parse(self, response):
        b = BlogItem()
        b['title'] = response.xpath('//h1[@class="title"]/text()').extract()[0]
        b['desc'] = response.xpath('//p[@class="description"]/span/text()').extract()[0]
        b['pop1'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[1]/div[1]/div[2]/a/b/text()').extract()[0]
        b['pop2'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[2]/div[1]/div[2]/a/b/text()').extract()[0]
        b['pop3'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[3]/div[1]/div[1]/a/b/text()').extract()[0]
        b['pop4'] = response.xpath('//div[@id="PopularPosts1"]/div/ul/li[4]/div[1]/div[1]/a/b/text()').extract()[0]
        return b


By default, the spider crawls the URLs listed in start_urls and hands each response over to the parse method. The response object provides selectors for parsing the HTML using XPath, regular expressions, etc., and response.xpath() is the shorthand notation for the XPath selector. I am using XPath to traverse the HTML response and extract the required data, which is stored in a BlogItem object and returned as the result.
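Before hard-coding XPath expressions into a spider, it is convenient to try them interactively in the Scrapy shell. A sketch of such a session (the exact element indexes depend on the blog's markup at crawl time):

```shell
scrapy shell "http://sarathblogs.blogspot.in"
>>> response.xpath('//h1[@class="title"]/text()').extract()
>>> response.xpath('//p[@class="description"]/span/text()').extract()
```

Once an expression returns the expected text here, it can be copied into the parse method as-is.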

Running the Spider
You can run the spider by executing the following command from the root directory:
scrapy crawl blog

The following snapshot shows the output:


Saving the results
Scrapy provides various handy options to save the data to a database, or to export it as JSON, CSV, etc. The -o flag names the output file, and the format is inferred from the file extension. Let's check out the commands for the last two options:
  • scrapy crawl blog -o blog.json
  • scrapy crawl blog -o blog.csv

The json output is as follows:
[
    {
        "title": "\nSarath Chandra Mekala\n",
        "pop1": "\nOpenStack Kilo MultiNode VM Installation using Centos 7 on VirtualBox\n",        
        "pop2": "\nHow to run Juniper Firefly (vSRX) on KVM -- SRX in a box setup\n",
        "pop3": "\nOpenstack : Fixing Failed to create network. No tenant network is available for allocation issue.\n",        
        "pop4": "\nFixing Openstack VM spawning issue: No suitable host found/vif_type=binding_failed error\n",
        "desc": "Openstack, Cloud Orchestration, Networking, Java & J2EE"
    }
]
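Notice that the scraped strings keep the surrounding newlines from the page markup. A small post-processing pass with the standard json module can tidy them up (the sample data below is abridged from the output above):

```python
import json

# Abridged sample mirroring the spider's JSON output above
raw = '[{"title": "\\nSarath Chandra Mekala\\n", "desc": "Openstack, Cloud Orchestration, Networking, Java & J2EE"}]'
items = json.loads(raw)

# Strip the leading/trailing whitespace picked up from the page
cleaned = [{k: v.strip() for k, v in item.items()} for item in items]
print(cleaned[0]["title"])  # Sarath Chandra Mekala
```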

Hope this was helpful and motivates you to give Scrapy (www.scrapy.org) a shot.

