Scrapy is a fast, open-source web crawling and data extraction framework written in Python, widely used for web scraping and mining structured data from websites. Unlike simple scraping libraries, Scrapy is a comprehensive framework that manages every aspect of the scraping process—from sending HTTP requests and handling responses to processing, cleaning, and storing the extracted data in formats such as JSON, CSV, or XML. In the example below, we have a Python Scrapy script that crawls both a regular website and a dark-web onion site.
# Create a scrapy project first
# scrapy startproject myproject
# cd myproject
# scrapy crawl combinedspider
import json
import socket

import redis
import scrapy
import socks
from elasticsearch import Elasticsearch
from onionscan import Onion
class CombinedSpider(scrapy.Spider):
    """Spider that crawls both clearnet and Tor onion URLs.

    Onion URLs are scanned with OnionScan and the report is indexed;
    clearnet pages are scraped for a title and first paragraph. Every
    result is indexed in Elasticsearch, and the most recent result of
    each kind is cached in Redis. All traffic is routed through a local
    Tor SOCKS5 proxy (127.0.0.1:9050).
    """

    name = 'combinedspider'
    start_urls = ['https://www.example.com/', 'http://exampleonion.onion/']

    def __init__(self, *args, **kwargs):
        # Initialize the base Spider first, before attaching our own state.
        super().__init__(*args, **kwargs)
        # Route socket traffic through the local Tor SOCKS5 proxy.
        # NOTE(review): this monkey-patches socket.socket process-wide, not
        # just for this spider — confirm that is intended.
        socks.setdefaultproxy(socks.PROXY_TYPE_SOCKS5, "127.0.0.1", 9050)
        socket.socket = socks.socksocket
        # Elasticsearch client used to index scraped documents.
        self.es = Elasticsearch(['http://localhost:9200'])
        # Redis client used as a "last result" cache.
        self.redis = redis.Redis(host='localhost', port=6379, db=0)
        # OnionScan wrapper for scanning hidden services.
        self.onion = Onion()

    def parse(self, response):
        """Handle a crawled page: scan onions, scrape clearnet, follow links.

        Yields follow-up Requests for every link found on the page.
        """
        if response.url.startswith('http://exampleonion.onion/'):
            # Scan the onion service and extract structured fields
            # from the OnionScan report.
            report = self.onion.scan(response.url)
            data = {
                'summary': report.summary(),
                'services': report.services(),
                'links': report.links(),
            }
            self.es.index(index='onionindex', doc_type='oniontype', body=data)
            # BUG FIX: redis-py cannot store a dict directly (raises
            # DataError) — serialize to JSON before caching.
            self.redis.set('last_scanned_onion', json.dumps(data))
        else:
            data = {
                'title': response.css('h1::text').get(),
                # NOTE(review): 'p' without ::text returns the raw HTML of
                # the first <p> element, not its text — confirm intended.
                'body': response.css('p').get(),
            }
            self.es.index(index='myindex', doc_type='mytype', body=data)
            # BUG FIX: serialize the dict before caching (see above).
            self.redis.set('last_crawled_page', json.dumps(data))
        # Follow every link on the page, parsing it with this same callback.
        for href in response.css('a::attr(href)'):
            yield response.follow(href, self.parse)