
Overview

Make life productive
BitSky is open-source software for extracting data from websites (web crawling) and running web automation jobs with headless Chrome and Puppeteer, in a fast, simple, scalable, and extensible way.

Why do I need BitSky?

Compared with other web crawling and web scraping frameworks or libraries (e.g. Scrapy, Apify SDK), BitSky has the following unique features:
  1. BitSky has a desktop application for macOS, Windows, and Ubuntu with the packages you need pre-installed, so you don't need to spend time installing or configuring an environment for Python, NodeJS, or other programming languages.
  2. BitSky supports all programming languages (e.g. Python, Java, NodeJS), so you can use the language you are already familiar with and don't need to learn a new one just for web crawling.
  3. BitSky can easily be deployed to any cloud service (e.g. Heroku, AWS); the same code can run both on your laptop and in the cloud.
Besides those unique features, BitSky also has the following features:
  1. Crawls any type of website. BitSky can crawl static websites or single-page applications.
  2. Based on a microservices architecture, so it naturally supports distributed deployment and is easy to scale and extend.
With BitSky you only need to focus on extracting data; BitSky does the rest for you.

Getting Help

  1. Try the FAQ, which lists the most common questions
  2. Try the How-Tos, which list the most common solutions
  3. Report bugs or request features in our issue tracker

Terminology and Relationships

BitSky is based on a microservices architecture: the Retailer, Producer, and Supplier are each a microservice.

Supplier

A Supplier links Retailers and Producers. It manages Retailer Configurations and Producer Configurations, receives Tasks from Retailers, assigns Tasks to suitable Producers, and moves successful or failed Tasks to the Task History.
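
The routing role described above can be pictured with a small in-memory sketch. Everything in it is hypothetical (the class names, the fields, and the first-come-first-served choice); it is not BitSky's actual Supplier code, only an illustration of receiving Tasks, assigning them to a Producer of a matching type, and moving finished Tasks to a history list.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QueuedTask:
    # Hypothetical fields for illustration; not the real Task Schema.
    url: str
    producer_type: str  # assumed values: "HTTP" or "HEADLESS"
    state: str = "PENDING"


@dataclass
class ToySupplier:
    pending: List[QueuedTask] = field(default_factory=list)
    history: List[QueuedTask] = field(default_factory=list)

    def receive(self, task: QueuedTask) -> None:
        """Accept a Task created by a Retailer."""
        self.pending.append(task)

    def assign(self, producer_type: str) -> Optional[QueuedTask]:
        """Hand the oldest pending Task of a matching type to a Producer."""
        for index, task in enumerate(self.pending):
            if task.producer_type == producer_type:
                task.state = "ASSIGNED"
                return self.pending.pop(index)
        return None  # nothing suitable for this Producer right now

    def finish(self, task: QueuedTask, success: bool) -> None:
        """Move a successful or failed Task to the Task History."""
        task.state = "FINISHED" if success else "FAILED"
        self.history.append(task)
```

A real Supplier would also persist Tasks and consult the Producer and Retailer Configurations; the sketch leaves that out to keep the flow visible.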

Producer Configuration

The configuration for a Producer. It controls whether a Producer can execute Tasks and how it executes them. A Producer MUST connect to a Producer Configuration before it can be assigned Tasks, and a Producer Configuration has a one-to-one relationship with a Producer.

Retailer Configuration

The configuration for a Retailer. It holds information about a Retailer, for example its Base URL, Health Check URL, and receive-Tasks URL. A Retailer MUST connect to a Retailer Configuration before it can create and receive Tasks, and a Retailer Configuration has a one-to-one relationship with a Retailer.
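
As a rough sketch of what such a configuration carries, the snippet below models the three URLs and the one-to-one rule. The names `RetailerConfig` and `OneToOneRegistry`, the default paths `/health` and `/tasks`, and the port are assumptions made for illustration, not BitSky's real settings.

```python
from dataclasses import dataclass
from typing import Dict
from urllib.parse import urljoin


@dataclass(frozen=True)
class RetailerConfig:
    base_url: str
    health_path: str = "/health"  # assumed default, not BitSky's
    tasks_path: str = "/tasks"    # assumed default, not BitSky's

    @property
    def health_url(self) -> str:
        return urljoin(self.base_url, self.health_path)

    @property
    def tasks_url(self) -> str:
        return urljoin(self.base_url, self.tasks_path)


class OneToOneRegistry:
    """Tracks which Retailer is connected to which configuration."""

    def __init__(self) -> None:
        self._retailer_by_config: Dict[str, str] = {}

    def connect(self, config_id: str, retailer_id: str) -> None:
        for known_config, known_retailer in self._retailer_by_config.items():
            same_config = known_config == config_id
            same_retailer = known_retailer == retailer_id
            # A pair may be reconnected, but neither side may pair twice.
            if same_config != same_retailer:
                raise ValueError("configuration and Retailer must pair one to one")
        self._retailer_by_config[config_id] = retailer_id


config = RetailerConfig(base_url="http://localhost:8081")
print(config.health_url, config.tasks_url)
```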

Task

A contract that the Retailer, Supplier, and Producer all understand. It describes each crawling Task: for example, the endpoint of the website to crawl, the priority of the Task, and any JavaScript that needs to be executed in the browser. For more detail, please check the Task Schema.
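
To make the idea of a shared contract concrete, here is a minimal sketch of a Task that can be serialized to JSON and read back by another service. The field names and the priority convention are assumptions chosen to mirror the examples in the paragraph above; the authoritative definition is the Task Schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TaskSketch:
    url: str                       # endpoint of the website to crawl
    priority: int = 100            # assumption: smaller number = more urgent
    script: Optional[str] = None   # JavaScript to run in the browser, if any
    retailer_id: str = ""          # hypothetical: which Retailer created it
    dataset: Optional[str] = None  # hypothetical: crawled data, set by a Producer

    def to_json(self) -> str:
        """Serialize so that any service, in any language, can read it."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "TaskSketch":
        return cls(**json.loads(payload))


task = TaskSketch(url="https://example.com", script="return document.title")
wire = task.to_json()                       # what travels between services
assert TaskSketch.from_json(wire) == task   # the round trip preserves the Task
```

Because the contract is plain data, a Retailer written in one language and a Producer written in another can exchange it without sharing code.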

Producer

A Producer crawls websites based on the Tasks assigned to it and sends the crawled data back to the Retailer that created those Tasks. Currently, there are two types of Producer: Headless Producer and HTTP Producer. A Headless Producer executes Tasks using headless Chrome with Puppeteer, which makes it a good fit for crawling single-page applications or executing JavaScript on a page. An HTTP Producer executes Tasks using plain HTTP requests, which makes it a good fit for crawling static websites. A Headless Producer has all the features of an HTTP Producer and can also execute JavaScript, but an HTTP Producer is about 10x faster.
A Producer MUST connect to a Producer Configuration, and the Producer Configuration and the Producer should have the same type.
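
The trade-off between the two Producer types and the same-type rule can be summarized in a few lines. The enum values and function names below are hypothetical and only restate the rules from the text; they are not part of BitSky's API.

```python
from enum import Enum


class ProducerType(Enum):
    HTTP = "HTTP"          # plain HTTP requests; fast, no JavaScript
    HEADLESS = "HEADLESS"  # headless Chrome with Puppeteer; slower, runs JavaScript


def pick_producer_type(needs_javascript: bool) -> ProducerType:
    """Prefer the faster HTTP Producer unless the page needs JavaScript."""
    return ProducerType.HEADLESS if needs_javascript else ProducerType.HTTP


def can_connect(producer: ProducerType, configuration: ProducerType) -> bool:
    """A Producer and its Producer Configuration should have the same type."""
    return producer is configuration


print(pick_producer_type(needs_javascript=False).value)
print(can_connect(ProducerType.HEADLESS, ProducerType.HTTP))
```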

Retailer

A Retailer creates Tasks and sends them to the Supplier, which assigns them to suitable Producers. After a Producer successfully executes a Task, it sends the Task back to the Retailer; the returned Task contains the crawled data (e.g. HTML). The Retailer can then extract useful information from the received Tasks or create more Tasks. The Retailer also decides where to store the extracted data and in what format. Most of your time will be spent creating your own Retailer.
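
The extract-and-follow step of a Retailer might look like the sketch below. It uses only Python's standard library and assumes, purely for illustration, that a returned Task is a dict with `url` and `html` keys; the real payload is defined by the Task Schema, and `handle_returned_task` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the page title and every link target."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def handle_returned_task(task):
    """Extract one record and derive follow-up Tasks from crawled HTML."""
    parser = LinkAndTitleParser()
    parser.feed(task["html"])
    record = {"url": task["url"], "title": parser.title.strip()}
    # Resolve relative links against the page URL to build new Tasks.
    new_tasks = [{"url": urljoin(task["url"], href)} for href in parser.links]
    return record, new_tasks
```

The record would then be written to whatever store and format the Retailer chooses, and the new Tasks would be sent to the Supplier.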