
Quick Start

This tutorial will help you get started with BitSky in a few minutes.
BitSky is open-source software for extracting data from websites (a web crawler) and running web automation jobs with headless Chrome and Puppeteer in a fast, simple, scalable, and extensible way.

Terminology and Relationships

Let us first spend some time understanding the terminology and how the terms relate to each other.

Supplier

A Supplier links Retailers and Producers. It manages Retailer Configurations and Producer Configurations, receives Tasks from Retailers, assigns Tasks to suitable Producers, and moves succeeded or failed Tasks to Task History.

Producer Configuration

The configuration for a Producer. It controls whether a Producer can execute Tasks and how it executes them. A Producer MUST connect to a Producer Configuration before it can be assigned Tasks, and a Producer Configuration has a one-to-one relationship with a Producer.

Retailer Configuration

The configuration for a Retailer. It holds information about the Retailer, for example its Base URL, Health Check URL, and receive Tasks URL. A Retailer MUST connect to a Retailer Configuration before it can create and receive Tasks, and a Retailer Configuration has a one-to-one relationship with a Retailer.

Task

A contract that the Retailer, Supplier, and Producer all understand. It describes a single crawling job: for example, the URL of the website to crawl, the priority of the Task, the JavaScript that needs to be executed in the browser, and so on. For more detail, please check the Task Schema.
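As a rough illustration of what a Task carries, here is a minimal sketch that uses plain objects. The field names `url`, `priority`, and `metadata` come from the `generateTask` call shown later in this tutorial. The `makeTask` helper and the `/post-1` path are hypothetical stand-ins, not part of the SDK or the example blog, and the real Task Schema contains more fields than this.

```javascript
// Hypothetical stand-in for baseRetailerService.generateTask (NOT the SDK call).
function makeTask({ url, priority, metadata }) {
  return { url, priority, metadata };
}

const pendingTasks = [
  // The "/post-1" path is made up for illustration.
  makeTask({
    url: "http://exampleblog.bitsky.ai/post-1",
    priority: 2,
    metadata: { type: "blog" },
  }),
  makeTask({
    url: "http://exampleblog.bitsky.ai/",
    priority: 1,
    metadata: { type: "bloglist" },
  }),
];

// `1` is the highest priority, so a lower number sorts first.
const ordered = [...pendingTasks].sort((a, b) => a.priority - b.priority);
console.log(ordered.map((task) => `${task.priority}: ${task.metadata.type}`));
```

Sorting a copy keeps the original array untouched, which makes it easy to compare the order before and after.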

Producer

A Producer crawls websites based on its assigned Tasks and sends the crawled data back to the Retailer that created those Tasks. Currently, there are two types of Producer: Headless Producer and HTTP Producer. A Headless Producer executes Tasks using headless Chrome with Puppeteer, which makes it a good fit for crawling single-page applications or executing JavaScript on a page. An HTTP Producer executes Tasks using plain HTTP requests, which makes it a good fit for crawling static websites. A Headless Producer can do everything an HTTP Producer can and can also execute JavaScript, but an HTTP Producer is about 10x faster.
A Producer MUST connect to a Producer Configuration, and the Producer Configuration and the Producer should have the same type.
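The two connection rules above (matching types, and one Producer per Producer Configuration) can be sketched as a small in-memory check. This is only an illustration of the rules, not how the Supplier implements them; the `connect` function, the `"HEADLESS"`/`"HTTP"` labels, and the IDs are all hypothetical.

```javascript
// Hypothetical in-memory registry: configuration globalId -> producer id.
const connections = new Map();

function connect(producer, config) {
  // Rule 1: the Producer and its Producer Configuration have the same type.
  if (producer.type !== config.type) {
    throw new Error(`type mismatch: ${producer.type} producer, ${config.type} configuration`);
  }
  // Rule 2: a Producer Configuration is one-to-one with a Producer.
  if (connections.has(config.globalId)) {
    throw new Error(`configuration ${config.globalId} is already connected`);
  }
  connections.set(config.globalId, producer.id);
  return true;
}

// Returns true on success, or the error message on failure.
function tryConnect(producer, config) {
  try {
    return connect(producer, config);
  } catch (err) {
    return err.message;
  }
}

const headlessConfig = { globalId: "config-1", type: "HEADLESS" };
const httpConfig = { globalId: "config-2", type: "HTTP" };

const first = tryConnect({ id: "producer-a", type: "HEADLESS" }, headlessConfig);
const mismatch = tryConnect({ id: "producer-b", type: "HTTP" }, headlessConfig);
const duplicate = tryConnect({ id: "producer-c", type: "HEADLESS" }, headlessConfig);
const second = tryConnect({ id: "producer-b", type: "HTTP" }, httpConfig);

console.log({ first, mismatch, duplicate, second });
```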

Retailer

A Retailer creates Tasks and sends them to the Supplier, and the Supplier assigns them to suitable Producers. After a Producer successfully executes a Task, it sends the Task back to the Retailer together with the crawled data (e.g. HTML). The Retailer can then extract useful information from the received Tasks or create more Tasks. The Retailer also decides where to store the extracted data and in what format. Most of your time will be spent writing your own Retailer.
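To make the flow between Retailer, Supplier, and Producer concrete, here is a toy, single-process sketch of the loop. Everything in it is an assumption made for illustration: the `blog.test` pages, the `produce` and `run` functions, and the use of plain objects instead of crawled HTML. In BitSky itself these roles run as separate services.

```javascript
// Fake "website": a blog list page that links to two blog pages.
const pages = {
  "http://blog.test/": { links: ["http://blog.test/a", "http://blog.test/b"] },
  "http://blog.test/a": { title: "A" },
  "http://blog.test/b": { title: "B" },
};

// Retailer: creates the first Task.
function trigger() {
  return {
    tasks: [{ url: "http://blog.test/", priority: 1, metadata: { type: "bloglist" } }],
  };
}

// Producer stand-in: "crawls" the URL and attaches the result to the Task.
function produce(task) {
  return { ...task, dataset: { url: task.url, data: { content: pages[task.url] } } };
}

// Retailer: extracts data from returned Tasks and may create more Tasks.
function parse(returnTasks) {
  const tasks = [];
  const data = [];
  for (const task of returnTasks) {
    const content = task.dataset.data.content;
    if (task.metadata.type === "bloglist") {
      for (const url of content.links) {
        tasks.push({ url, priority: 2, metadata: { type: "blog" } });
      }
    } else if (task.metadata.type === "blog") {
      data.push({ title: content.title, url: task.dataset.url });
    }
  }
  return { data, tasks };
}

// Supplier stand-in: hands out Tasks by priority and records finished ones.
function run() {
  const queue = [...trigger().tasks];
  const stored = [];
  const history = [];
  while (queue.length > 0) {
    queue.sort((a, b) => a.priority - b.priority);
    const finished = produce(queue.shift());
    history.push(finished);
    const result = parse([finished]);
    stored.push(...result.data);
    queue.push(...result.tasks);
  }
  return { stored, history };
}

const { stored, history } = run();
console.log(`stored ${stored.length} blogs after ${history.length} tasks`);
```

The loop ends when the queue is empty, which mirrors the later step in this tutorial where an empty Tasks page means crawling has finished.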

Install and setup

Getting up and running with BitSky is quick and easy. It is a small download, so you can install it in a matter of minutes and give BitSky a try.
BitSky is a free and open-source application that runs on MacOS, Windows, and Linux.
The BitSky Desktop Application includes a Supplier, a Headless Producer, an HTTP Producer, a Retailer (Hello Retailer Service), and the pre-installed Node modules you need to get started.

MacOS

  1. Download BitSky for MacOS
  2. Double-click the downloaded archive to expand its contents
  3. Drag BitSky.app to the Applications folder, making it available in the Launchpad
Open a Mac app from an unidentified developer
If MacOS shows a dialog about opening an app from an unidentified developer:
  1. Go to Security & Privacy (System Preferences > Security & Privacy) and click Open Anyway
  2. Click BitSky.app again and, in the popup dialog, click Open

Windows

  1. Download BitSky for Windows
  2. Once it is downloaded, run the exe file (BitSky-{version}-x64-setup.exe). This will take a minute.
  3. By default, BitSky is installed under C:\Users\{username}\AppData\Local\BitSky\app-{version}, and a shortcut is created on the Desktop
Windows protected your PC
If you see the "Windows protected your PC" dialog, click More info, then click Run anyway.
If you see a pop-up asking about network access, also allow public networks.

Ubuntu

  1. Download BitSky for Ubuntu
  2. Unzip the downloaded archive
  3. Double-click the deb file (bitsky_{version}_amd64.deb)
  4. In the pop-up dialog, click Install
Awesome job, you have now successfully installed BitSky on your computer.

Extract Blogs from exampleblog

Let us use BitSky to extract all articles from https://exampleblog.bitsky.ai and save them in JSON format on the local disk.
This section will walk you through these tasks:
  1. Create a Retailer Configuration
  2. Configure and start Hello Retailer Service
  3. Create a Headless Producer Configuration and an HTTP Producer Configuration, and configure the Headless Producer and HTTP Producer
  4. Extract articles and view the extracted data
In this example, Hello Retailer Service is written in JavaScript, but a Retailer can be written in any programming language, such as Python.

1. Create a Retailer Configuration

A Retailer is responsible for creating Tasks, receiving successfully executed Tasks, extracting useful information from them or sending more Tasks, and then saving the extracted data. Before you can use a Retailer, you must create a Retailer Configuration.

Open Retailer Configurations

Click Retailer Configurations to open the Retailer Configurations page. If you don't have any Retailer Configuration yet, you should see a similar screen.

Create a Retailer Configuration

Click the Create button and, in the drawer panel, enter your configuration:
  1. Retailer Service Name: first retailer configuration
  2. Base URL: http://localhost:8081
  3. Click Create
You may need to update the Base URL after you start Hello Retailer Service, because port 8081 may already be used by another application. We will do this in the next step.
In this step, we registered a Retailer Configuration in the Supplier, and this Retailer Configuration is now ready for a Retailer (e.g. Hello Retailer Service) to connect to it.
A Retailer MUST connect to a Retailer Configuration, and a Retailer Configuration has a one-to-one relationship with a Retailer.
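Because the Base URL registered here has to match the address Hello Retailer Service actually listens on, a quick comparison can help when port 8081 turns out to be taken. The sketch below uses Node's built-in `URL` class; the `sameBaseUrl` helper is hypothetical and not part of BitSky.

```javascript
// Hypothetical helper: compares scheme, host, and port, ignoring a trailing slash.
function sameBaseUrl(configured, actual) {
  return new URL(configured).origin === new URL(actual).origin;
}

const configuredBaseUrl = "http://localhost:8081";

const matches = sameBaseUrl(configuredBaseUrl, "http://localhost:8081/");
const portChanged = sameBaseUrl(configuredBaseUrl, "http://localhost:8082");

console.log({ matches, portChanged });
```

If the comparison fails, update the Base URL in the Retailer Configuration to the address the service reports.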

2. Configure Hello Retailer Service

Click first retailer configuration to open the Retailer Configuration Panel
Click the copy icon to copy the Retailer Configuration Global ID. In this example, the Retailer Configuration Global ID is e96fc7f9-398f-41e1-b903-81e4428bd9e6
Click the close icon to close the Retailer Configuration Panel
Click Hello Retailer Service to open the Retailer Editor
In the Retailer Editor, worker.js is selected by default; if it isn't, select worker.js
Paste the Retailer Configuration Global ID you copied (in this example, e96fc7f9-398f-41e1-b903-81e4428bd9e6) into settings.GLOBAL_ID (line 50).
Press CTRL+S (Win/Linux) or CMD+S (MacOS) to save your change. After it is saved successfully, you should see a notification that asks you to restart.
Let us quickly go through worker.js

Pre-Installed Packages

The BitSky Desktop Application comes with several frequently used node_modules pre-installed, and you can also require Node.js native packages.
In the Retailer Editor, you can ONLY use the node_modules listed in the Node Modules Full List and the Node.js native packages.
// https://apis.bitsky.ai/bitsky-retailer-sdk/BaseRetailerService.html
const baseRetailerService = require("@bitskyai/retailer-sdk");
//--------------------------------------------------------------
// The following are frequently used packages you may need
// All the listed packages are already pre-installed
//
// Available Packages:
// 1. https://docs.bitsky.ai/user-manual/retailer-editor/node-modules-full-list
// 2. https://nodejs.org/dist/latest-v12.x/docs/api/
//--------------------------------------------------------------
const path = require("path");
// `cheerio`: Fast, flexible & lean implementation of core jQuery designed specifically for the server
// https://www.npmjs.com/package/cheerio
const cheerio = require("cheerio");
// `xpath`: DOM 3 XPath 1.0 implementation and helper for JavaScript, with node.js support.
// https://www.npmjs.com/package/xpath
const xpath = require('xpath');
// `lodash`: A modern JavaScript utility library delivering modularity, performance & extras
// https://lodash.com/
const _ = require("lodash");
// `moment`: Parse, validate, manipulate, and display dates and times in JavaScript
// https://momentjs.com/
const moment = require("moment");
// `fs-extra`: adds file system methods that aren't included in the native `fs` module and adds promise support to the `fs` methods
// https://www.npmjs.com/package/fs-extra
const fs = require("fs-extra");
// `uuid`: Generate RFC-compliant UUIDs in JavaScript
// https://www.npmjs.com/package/uuid
const { v4: uuidv4 } = require("uuid");
// `xlsx`: Parser and writer for various spreadsheet formats
// https://www.npmjs.com/package/xlsx
const XLSX = require("xlsx");
// `papaparse`: The powerful, in-browser CSV parser for big boys and girls
// https://www.papaparse.com/
const Papa = require("papaparse");
// `txt-file-to-json`: Reads a text file or data variable having a table and returns an array of objects
// https://www.npmjs.com/package/txt-file-to-json
const txtToJSON = require("txt-file-to-json");
// winston logger - https://www.npmjs.com/package/winston. It is useful for you to debug
// log file path: public/log/retailer.log
// Examples:
// logger.info('Hello again distributed logs');
// logger.error('Hello again distributed logs', {error: err});
const logger = require('./utils/logger');

Settings

Settings for Hello Retailer Service. You MUST change them to the correct values.
  1. GLOBAL_ID: The Global ID of the Retailer Configuration you want this Retailer to connect to. In this example, that is the first retailer configuration you created earlier, and the GLOBAL_ID value is e96fc7f9-398f-41e1-b903-81e4428bd9e6
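A copy-and-paste slip in GLOBAL_ID (a missing character or a stray space) is easy to make, so a small sanity check can be useful. The sketch below assumes that Global IDs are always UUID-shaped like the example value; the tutorial only shows one such ID, so treat the pattern and the `looksLikeGlobalId` helper as assumptions.

```javascript
// Assumption: Global IDs look like the example above (8-4-4-4-12 hex digits).
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Hypothetical helper; no trimming, so stray whitespace is reported as invalid.
function looksLikeGlobalId(value) {
  return typeof value === "string" && UUID_PATTERN.test(value);
}

const checks = {
  example: looksLikeGlobalId("e96fc7f9-398f-41e1-b903-81e4428bd9e6"),
  truncated: looksLikeGlobalId("e96fc7f9-398f-41e1-b903"),
  trailingSpace: looksLikeGlobalId("e96fc7f9-398f-41e1-b903-81e4428bd9e6 "),
  empty: looksLikeGlobalId(""),
};

console.log(checks);
```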

Execute JavaScript

customFunction makes the page wait 5 seconds after it opens.
// The page will wait 5 seconds; this shows how to execute JavaScript inside the page
// For more information, please take a look at `metadata.scripts` in https://apis.bitsky.ai/bitsky-retailer-sdk/global.html#Task
async function customFunction() {
  await $$page.waitFor(5 * 1000);
}

Initial Crawling

trigger creates the first Task or Tasks for your data crawling job. The Supplier uses the Task information to decide when to assign it to a suitable Producer. After the Producer successfully executes the Task, it sends the Task back to the parse function. trigger is the entry point, similar to the main function in Java or C/C++. For more information, please take a look at https://apis.bitsky.ai/bitsky-retailer-sdk/BaseRetailerService.html#trigger
/**
 * Trigger creates the first task/tasks for your data crawling job.
 * The **Supplier** uses the Task information to decide when to assign it to a suitable **Producer** to execute.
 * After the **Producer** successfully executes the Task, it sends the Task back to the **parse** function.
 * It is the **entry point**, similar to the `main` function in Java or C/C++
 * For more information, please take a look at https://apis.bitsky.ai/bitsky-retailer-sdk/BaseRetailerService.html#trigger
 *
 * @returns {object} - A JSON object that has a tasks property. Normally you can use `baseRetailerService.generateTask` to generate a Task.
 * Detailed information: https://apis.bitsky.ai/bitsky-retailer-sdk/global.html#TriggerFunReturn
 */
const trigger = async function trigger({ req, res }) {
  return {
    tasks: [
      // API: https://apis.bitsky.ai/bitsky-retailer-sdk/BaseRetailerService.html#generateTask
      baseRetailerService.generateTask({
        // Target website URL
        url: "http://exampleblog.bitsky.ai/",
        // Priority of this task. This is useful if your tasks need to be executed in order. `1` is the highest priority
        priority: 1,
        // Additional metadata for this task; add whatever your job requires. `script` is reserved: it is only used to pass a JavaScript code string
        // In this example, `type` distinguishes the two kinds of page - `bloglist` or `blog`.
        // If it is `bloglist`, get all blog links and add new tasks to continue crawling those blogs; otherwise save the blog to JSON
        //
        // In this example, the page waits 5 seconds. This isn't necessary; it only shows how to execute JavaScript code.
        // `script` is useful for crawling a single-page application or when you need to interact with the page. Only a `Headless Producer` can execute tasks that have a script
        // `script` is the JavaScript code you want to execute, converted to a string. Normally you can use `functionName.toString()`
        metadata: { type: "bloglist", script: customFunction.toString() },
      }),
    ],
  };
};
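The `customFunction.toString()` call above turns the function into plain source text, which is what travels inside `metadata.script`. The sketch below shows that round trip in isolation. How the Headless Producer actually evaluates the string is not described in this tutorial, so the `new Function` wrapper and the `stubPage` object are assumptions made purely for illustration.

```javascript
// Same function as in worker.js. `$$page` is not defined here; in this sketch
// it is supplied by the wrapper below.
async function customFunction() {
  await $$page.waitFor(5 * 1000);
}

// What the Retailer puts into metadata.script: the function's source text.
const script = customFunction.toString();

// Hypothetical receiving side: rebuild the function with a stub `$$page`
// passed in as a parameter, so no browser is needed to run the sketch.
const waits = [];
const stubPage = {
  waitFor: async (ms) => {
    waits.push(ms); // record the requested delay instead of sleeping
  },
};
const revived = new Function("$$page", `return (${script});`)(stubPage);

revived().then(() => console.log("requested waits (ms):", waits));
```

Because only the source text is sent, the function cannot rely on variables from worker.js; anything it needs has to be available where the script runs.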

Receive Tasks

After a Producer successfully executes Tasks, the parse function receives those Tasks, which now contain the crawled data. parse is used to extract data and to decide whether to add more Tasks.
For example, in trigger we create a Task to crawl http://exampleblog.bitsky.ai/. After the Producer crawls it successfully, it sends back a Task that contains the HTML of http://exampleblog.bitsky.ai/. Inside the parse function, we parse the returned HTML, get the URL of each blog, and create Tasks to crawl each blog.
/**
 * After a **Producer** successfully executes a Task, the parse function is called and receives the **Task** that contains the crawled data.
 * Parse is used to extract data and to decide whether to add more tasks.
 *
 * For example, in **trigger** we create a task to crawl http://exampleblog.bitsky.ai/. After the **Producer** crawls it successfully, it sends back a Task that contains the HTML of http://exampleblog.bitsky.ai/
 * Inside the **parse** function, we parse the returned HTML, get the URL of each blog, and create tasks to crawl each blog
 *
 * @returns {object} - https://apis.bitsky.ai/bitsky-retailer-sdk/global.html#ParseFunReturn
 */
const parse = async function parse({ req, res }) {
  try {
    // Tasks returned from the Producer, Task Schema - https://github.com/bitskyai/bitsky-supplier/blob/develop/src/schemas/task.json
    // By default, crawled HTML is stored in task.dataset.data.content
    const returnTasks = req.body;
    // New Tasks that need to be sent to BitSky Supplier
    const tasks = [];
    // Crawled data, by default it will be stored on the local disk
    const storeData = [];
    // Base URL for the new Tasks
    const targetBaseURL = "http://exampleblog.bitsky.ai/";
    for (let i = 0; i < returnTasks.length; i++) {
      let task = returnTasks[i];
      // Crawled HTML - https://github.com/bitskyai/bitsky-supplier/blob/develop/src/schemas/task.json
      let htmlString = task.dataset.data.content;
      // You can find how to use cheerio at https://cheerio.js.org/
      // cheerio: Fast, flexible & lean implementation of core jQuery designed specifically for the server.
      // If you like, you can also try `xpath`, please check https://www.npmjs.com/package/xpath
      let $ = cheerio.load(htmlString);
      if (task.metadata.type === "bloglist") {
        // If the task type is **bloglist**, get the blog links
        // Get more detail from https://docs.bitsky.ai/tutorials/crawl-example-blog#crawl-each-blog-list-page-and-get-blogs-link
        let blogUrls = $("div.post-preview a");
        for (let j = 0; j < blogUrls.length; j++) {
          let $blog = $(blogUrls[j]);
          // Get the blog page link, don't forget to add the Base URL
          let url = new URL($blog.attr("href"), targetBaseURL).toString();
          // You can use `logger.info` and `logger.error` for debugging
          // Please check https://www.npmjs.com/package/winston for detail
          logger.info(`blog page link: ${url}`);
          // Add a Task to crawl the blog page
          tasks.push(
            baseRetailerService.generateTask({
              url,
              // Set `priority` to `2`, so all blog list pages are crawled first, then all blogs
              priority: 2,
              metadata: {
                // Add `type: "blog"` to indicate this task crawls a blog
                type: "blog",
              },
            })
          );
        }
        // Get the next blog list page link
        let nextUrl = $("ul.pager li.next a").attr("href");
        if (nextUrl) {
          nextUrl = new URL(nextUrl, targetBaseURL).toString();
          logger.info(`blog list page link: ${nextUrl}`);
          // If there is a next blog list page, create a Task to crawl it
          tasks.push(
            baseRetailerService.generateTask({
              url: nextUrl,
              // A blog list page has the highest priority
              priority: 1,
              metadata: {
                // Indicate this task crawls a blog list page
                type: "bloglist",
                // Just to show you how to execute JavaScript in the browser
                script: customFunction.toString(),
              },
            })
          );
        }
      } else if (task.metadata.type === "blog") {
        // If it is a blog page, extract the data and put it into storeData
        storeData.push({
          title: $("div.post-heading h1").text(),
          author: $("div.post-heading p.meta span.author").text(),
          date: $("div.post-heading p.meta span.date").text(),
          content: $("div.post-container div.post-content").text(),
          url: task.dataset.url,
        });
      } else {
        logger.error("unknown type");
      }
    }
    // Return the data that needs to be stored and the tasks that need to be executed
    // Check https://apis.bitsky.ai/bitsky-retailer-sdk/global.html#ParseFunReturn for more detail
    return {
      data: storeData,
      tasks: tasks,
    };
  } catch (err) {
    logger.error(`parse error: ${err.message}`);
  }
};
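The parse function relies on `new URL(href, targetBaseURL)` to turn the links found in the page into absolute URLs. The sketch below shows how that resolution behaves for a few kinds of `href`. The sample paths are made up and are not taken from the example blog, and `resolveLink` is a hypothetical helper.

```javascript
const targetBaseURL = "http://exampleblog.bitsky.ai/";

// Hypothetical helper wrapping the same call that parse uses.
function resolveLink(href, base = targetBaseURL) {
  return new URL(href, base).toString();
}

// Made-up hrefs: root-relative, relative, and already absolute.
const resolved = [
  resolveLink("/blog/first-post.html"),
  resolveLink("page2/"),
  resolveLink("https://example.com/elsewhere"),
];

console.log(resolved);
```

An `href` that is already absolute is returned unchanged, so the same call is safe for both internal and external links.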

3. Start Hello Retailer Service

After you have configured GLOBAL_ID, click the Start button to start Hello Retailer Service
After it starts successfully, the Retailer Editor should look like this:

User Manual(1)

Open user manual document in the browser

http://localhost:8081(2)

The Base URL of Hello Retailer Service. Click it to view configuration information about this Retailer.
If the Base URL of Hello Retailer Service isn't http://localhost:8081, you need to update the Base URL value in first retailer configuration

Add trigger tasks(3)

Clicking Add trigger tasks adds the initial Tasks

Refresh Folder Structure(4)

Show or hide Folder Structure(5)

Let us click Add trigger tasks; it will add one Task to the Supplier
You should see one Task on the Tasks page

4. Create Producer Configurations

Now that you have added an initial Task, let us configure the Headless Producer and the HTTP Producer.

Open Producer Configurations

If you don't have any Producer Configuration yet, you should see a similar page

Create a Producer Configuration for Headless Producer

Click Create to open the Create a Producer Configuration drawer
  1. Name: first headless configuration
You can keep the other fields at their defaults.

Create a Producer Configuration for HTTP Producer

Click Create to open the Create a Producer Configuration drawer
  1. Name: first http configuration
  2. Producer Type: HTTP Producer
You can keep the other fields at their defaults.

5. Configure Headless Producer

Click first headless configuration
Copy the Producer Configuration Global ID
Click Headless Producer
  1. Producer Configuration Global ID (2): Paste the Global ID you copied from first headless configuration
  2. Headless Mode: Change to No, so that you can see Chrome open automatically when the Headless Producer executes Tasks. Setting Headless Mode to No is normally only used for debugging.
Your change is saved automatically. After it is saved successfully, you should see Update producer configuration, and restarting... (4)

6. Configure HTTP Producer

Click first http configuration
Copy the Producer Configuration Global ID
Click HTTP Producer
  1. Producer Configuration Global ID (2): Paste the Global ID you copied from first http configuration
Your change is saved automatically. After it is saved successfully, you should see Update producer configuration, and restarting... (3)

7. Activate Producers

Click Producer Configurations
Click Activate (1, 2) to activate first headless configuration and first http configuration. Now you just need to wait for the Headless Producer and HTTP Producer to execute your Tasks.
After you activate them, it normally takes about 30 seconds before the Headless Producer and HTTP Producer start executing Tasks. Wait about 10 more seconds, then click Tasks.
The Tasks page should be empty; if it isn't, wait until it is. An empty Tasks page means your data crawling is finished. Click Tasks History, and you should see 14 Tasks.

8. View crawled data

Click Hello Retailer Service to open the Retailer Editor
Click the Show all files icon, then select data.json; the crawled data is saved in data.json
Awesome job! You have just successfully crawled https://exampleblog.bitsky.ai/. That is how simple data crawling is with BitSky, and you can now use it to crawl all kinds of websites.
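Once the data is on disk, you will usually want to post-process it. The sketch below assumes that data.json holds a JSON array of the records built in parse (`title`, `author`, `date`, `content`, `url`); this tutorial does not show the exact on-disk layout, so check your own file first. The `summarize` helper is hypothetical, and the sample record is made up and written to a temporary file so that the sketch runs on its own.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: load records and produce a short summary of each.
function summarize(filePath) {
  const records = JSON.parse(fs.readFileSync(filePath, "utf8"));
  return records.map((record) => ({
    title: record.title.trim(),
    url: record.url,
    words: record.content.split(/\s+/).filter(Boolean).length,
  }));
}

// Made-up sample data standing in for a real data.json.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "bitsky-demo-"));
const demoFile = path.join(demoDir, "data.json");
fs.writeFileSync(
  demoFile,
  JSON.stringify([
    {
      title: " Hello ",
      author: "Sample Author",
      date: "2020-01-01",
      content: "one two three",
      url: "http://exampleblog.bitsky.ai/hello",
    },
  ])
);

const summary = summarize(demoFile);
console.log(summary);
```

To use it on real output, pass the path of your own data.json to `summarize` instead of the temporary file.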

What is next

Now that you know how to use the BitSky Desktop Application, you can also start BitSky with Docker or deploy BitSky to Heroku.