Quick Start
This tutorial will help you get started with BitSky in just a few minutes.
BitSky is open-source software for extracting data from websites (web crawling) and running web automation jobs with headless Chrome and Puppeteer, in a fast, simple, scalable, and extensible way.
Let us first spend some time understanding the terminology and the relationships between the concepts.

A Supplier creates a chain between Retailers and Producers. A Supplier includes all the functions that manage Retailer Configurations, manage Producer Configurations, receive Tasks from a Retailer, assign Tasks to suitable Producers, and move successful or failed Tasks to Task History.
A Producer Configuration controls whether a Producer can execute Tasks and how it executes them. A Producer MUST connect to a Producer Configuration before it can be assigned Tasks, and a Producer Configuration has a one-to-one relationship with a Producer.
A Retailer Configuration holds information about a Retailer, for example its Base URL, Health Check URL, and receive-Tasks URL. A Retailer MUST connect to a Retailer Configuration before it can create and receive Tasks, and a Retailer Configuration has a one-to-one relationship with a Retailer.
A Task is a contract that Retailer, Supplier, and Producer all understand. It describes each crawling Task: for example, the endpoint of the crawled website, the priority of the Task, and the JavaScript that needs to be executed in the browser. For more detail, please check Task Schema.
A Producer crawls websites based on its assigned Tasks and sends the crawled data back to the Retailer that created those Tasks. Currently, there are two types of Producer - Headless Producer and HTTP Producer. A Headless Producer executes Tasks using headless Chrome with Puppeteer; it is good for crawling Single Page Applications or executing JavaScript on a page. An HTTP Producer executes Tasks using plain HTTP requests; it is good for crawling static websites. The HTTP Producer is fast and efficient, about 10x faster than the Headless Producer, while the Headless Producer has all the features of the HTTP Producer and can also execute JavaScript.
A Producer MUST connect to a Producer Configuration, and the Producer Configuration and Producer must have the same type.
A Retailer creates Tasks and sends them to the Supplier, and the Supplier assigns the Tasks to suitable Producers. After the Producers successfully execute the Tasks, they send the Tasks back to the Retailer; the returned Tasks contain the crawled data (e.g. HTML). The Retailer can then extract useful information from the received Tasks or create more Tasks. The Retailer also needs to decide where to store the extracted data and in what format. Most of your time will be spent creating your own Retailer.

Getting up and running with BitSky is quick and easy. It is a small download, so you can install it in a matter of minutes and give BitSky a try.
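The Retailer → Supplier → Producer → parse loop can be sketched as a tiny in-memory simulation. Everything below (the fake Producer, the sample links, the queue) is hypothetical and only illustrates the control flow, not the real SDK API:

```javascript
// Hypothetical sketch of the BitSky Task loop (NOT the real SDK):
// trigger() seeds Tasks, the Supplier queues them by priority,
// a Producer "executes" each Task, and parse() stores data or adds Tasks.
const queue = [];
const stored = [];

// Retailer's trigger: seed the first Task (a blog list page)
const trigger = () => [
  { url: "http://exampleblog.bitsky.ai/", priority: 1, metadata: { type: "bloglist" } },
];

// Fake Producer: pretend the blog list page links to two blogs
const execute = (task) =>
  task.metadata.type === "bloglist"
    ? { ...task, links: ["/blog/1", "/blog/2"] }
    : { ...task, content: `content of ${task.url}` };

// Retailer's parse: either create follow-up Tasks or store extracted data
const parse = (done) => {
  if (done.metadata.type === "bloglist") {
    return {
      tasks: done.links.map((l) => ({ url: l, priority: 2, metadata: { type: "blog" } })),
      data: [],
    };
  }
  return { tasks: [], data: [{ url: done.url, content: done.content }] };
};

queue.push(...trigger());
while (queue.length) {
  queue.sort((a, b) => a.priority - b.priority); // `1` is the highest priority
  const result = parse(execute(queue.shift()));
  queue.push(...result.tasks);
  stored.push(...result.data);
}
console.log(stored.length); // → 2 (one entry per crawled blog)
```

In the real system the Supplier and Producers run as separate services and communicate over HTTP, but the ordering of responsibilities is the same.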
BitSky is a free and open-source application that runs on macOS, Windows, and Linux.
The BitSky Desktop Application includes a Supplier, a Headless Producer, an HTTP Producer, a Retailer (Hello Retailer Service), and the pre-installed Node Modules necessary to help you get started.
1. Download the BitSky application for macOS
2. Double-click the downloaded archive to expand the contents
3. Drag BitSky.app to the Applications folder, making it available in the Launchpad
Open a Mac app from an unidentified developer
When you see the following dialog

1. Go to Security & Privacy (System Preferences > Security & Privacy), and click Open Anyway
2. Click BitSky.app again, and in the pop-up dialog, click Open
1. Download the BitSky installer for Windows
2. Once it is downloaded, run the exe file (BitSky-{version}-x64-setup.exe). This will take a minute.
3. By default, it is installed under C:\Users\{username}\AppData\Local\BitSky\app-{version}, and a shortcut is created on the Desktop
Windows protected your PC
When you see the following dialog

Click More info

Then click Run anyway
If you see this pop-up, also allow access on public networks.

1. Download the BitSky package for Linux
2. Unzip the downloaded archive
3. Double-click the deb file (bitsky_{version}_amd64.deb)
4. In the pop-up dialog, click Install
Awesome job! You have successfully installed BitSky on your computer.
Let us use BitSky to extract all articles from https://exampleblog.bitsky.ai and save them in JSON format on your local disk.
This section will walk you through these tasks:
1. Create a Retailer Configuration
2. Configure and start the Hello Retailer Service
3. Create Headless Producer and HTTP Producer Configurations, and configure the Headless Producer and HTTP Producer
4. Extract articles and view the extracted data
In this example, the Hello Retailer Service is written in JavaScript, but a Retailer can be written in any programming language, such as Python.
A Retailer is responsible for creating Tasks, receiving successfully executed Tasks, extracting useful information from the received Tasks or sending more Tasks, and saving the extracted data. Before you can use a Retailer, you must create a Retailer Configuration first.
Click Retailer Configurations, and you should see the Retailer Configurations page. If you don't have any Retailer Configurations yet, you should see a screen similar to this:

Click the Create button, and in the drawer panel, type your configuration:

1. Retailer Service Name: first retailer configuration
2. Base URL: http://localhost:8081
3. Click Create
The Base URL may need to be updated after you start the Hello Retailer Service, because port 8081 may already be in use by another application. We will do this in the next step.

Click first retailer configuration to open the Retailer Configuration Panel
Click the copy icon to copy the Retailer Configuration Global ID. In this example, the Retailer Configuration Global ID is e96fc7f9-398f-41e1-b903-81e4428bd9e6
Click the close icon to close the Retailer Configuration Panel



In the opened Retailer Editor, worker.js is selected by default; if it isn't, please select worker.js
Paste the copied Retailer Configuration Global ID (in this example, e96fc7f9-398f-41e1-b903-81e4428bd9e6) into settings.GLOBAL_ID (line 50). Press CTRL+S (Win/Linux) or CMD+S (macOS) to save your change. After saving successfully, you should see this notification:

Let us quickly go through worker.js
The BitSky Desktop Application comes with several frequently used node_modules pre-installed; you can also require Node.js native packages.
In the Retailer Editor, you can ONLY use the node_modules listed in the Node Modules Full List and Node.js native packages.
// https://apis.bitsky.ai/bitsky-retailer-sdk/BaseRetailerService.html
const baseRetailerService = require("@bitskyai/retailer-sdk");
//--------------------------------------------------------------
// Following are frequently use packages you possible need
// All the list packages are already pre-installed
//
// Available Packages:
// 1. https://docs.bitsky.ai/user-manual/retailer-editor/node-modules-full-list
// 2. https://nodejs.org/dist/latest-v12.x/docs/api/
//--------------------------------------------------------------
const path = require("path");
// `cheerio`: Fast, flexible & lean implementation of core jQuery designed specifically for the server
// https://www.npmjs.com/package/cheerio
const cheerio = require("cheerio");
// DOM 3 XPath 1.0 implementation and helper for JavaScript, with node.js support.
// https://www.npmjs.com/package/xpath
const xpath = require('xpath');
// `lodash`: A modern JavaScript utility library delivering modularity, performance & extras
// https://lodash.com/
const _ = require("lodash");
// `moment`: Parse, validate, manipulate, and display dates and times in JavaScript
// https://momentjs.com/
const moment = require("moment");
// `fs-extra`: adds file system methods that aren't included in the native `fs` module and adds promise support to the `fs` methods
// https://www.npmjs.com/package/fs-extra
const fs = require("fs-extra");
// `uuid`: Generate RFC-compliant UUIDs in JavaScript
// https://www.npmjs.com/package/uuid
const { v4: uuidv4 } = require("uuid");
// `xlsx`: Parser and writer for various spreadsheet formats
// https://www.npmjs.com/package/xlsx
const XLSX = require("xlsx");
// `papaparse`: The powerful, in-browser CSV parser for big boys and girls
// https://www.papaparse.com/
const Papa = require("papaparse");
// `txt-file-to-json`: Reads a text file or data variable having a table and returns an array of objects
// https://www.npmjs.com/package/txt-file-to-json
const txtToJSON = require("txt-file-to-json");
// winston logger - https://www.npmjs.com/package/winston. It is useful for you to debug
// log file path: public/log/retailer.log
// Examples:
// logger.info('Hello again distributed logs');
// logger.error('Hello again distributed logs', {error: err});
const logger = require('./utils/logger');
These are the settings for the Hello Retailer Service; you MUST change them to the correct values.
1. GLOBAL_ID: the Global ID of the Retailer Configuration you want this Retailer to connect to. In this example, it is the first retailer configuration you created before, and the GLOBAL_ID value is e96fc7f9-398f-41e1-b903-81e4428bd9e6
customFunction will wait 5 seconds when the page opens:

// Page will wait 5 seconds; this is to show you how to execute JavaScript inside the page
// For more information, please take a look at `metadata.scripts` in https://apis.bitsky.ai/bitsky-retailer-sdk/global.html#Task
async function customFunction() {
await $$page.waitFor(5 * 1000);
}
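To ship a function like customFunction inside a Task, the worker serializes it with Function.prototype.toString() and passes it as metadata.script (shown later in trigger). A minimal sketch of that serialization step, runnable on its own (the $$page object is only available when the Headless Producer runs the script, so it is never called here):

```javascript
// `metadata.script` carries JavaScript as a *string*, so a local function is
// serialized with `Function.prototype.toString()` before being attached to a
// Task. Only the Headless Producer can execute it.
async function customFunction() {
  // `$$page` is injected by the Headless Producer at execution time
  await $$page.waitFor(5 * 1000);
}

const script = customFunction.toString();
console.log(typeof script); // → "string"
console.log(script.includes("waitFor")); // → true
```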
trigger is used to create the initial task(s) for your data crawling job. The Supplier uses Task information to decide when to assign each Task to a suitable Producer for execution. After a Producer successfully executes a Task, it sends the Task back to the parse function. trigger is the entry point, similar to the main function in Java or C/C++. For more information, please take a look at https://apis.bitsky.ai/bitsky-retailer-sdk/BaseRetailerService.html#trigger

/**
 * Trigger is used to create the initial task(s) for your data crawling job.
 * **Supplier** uses Task information to decide when to assign it to a suitable **Producer** to execute.
 * After **Producer** successfully executes a Task, it will send the Task back to the **parse** function.
 * It is the **entry point**, similar to the `main` function in Java, C/C++
 * For more information, please take a look at https://apis.bitsky.ai/bitsky-retailer-sdk/BaseRetailerService.html#trigger
 *
 * @returns {object} - A JSON object that has a tasks property. Normally you can use `baseRetailerService.generateTask` to generate a Task.
 * Detail information: https://apis.bitsky.ai/bitsky-retailer-sdk/global.html#TriggerFunReturn
 */
const trigger = async function trigger({ req, res }) {
return {
tasks: [
// API: https://apis.bitsky.ai/bitsky-retailer-sdk/BaseRetailerService.html#generateTask
baseRetailerService.generateTask({
// Target website URL
url: "http://exampleblog.bitsky.ai/",
// Priority of this task. This is useful if your tasks need to be executed in order. `1` is the highest priority
priority: 1,
// Additional metadata for this task; add what you need based on your requirements. `script` is reserved - it is only used to pass a JavaScript code string
// In this example, I use `type` to distinguish different pages - `bloglist` or `blog`.
// If it is `bloglist`, then get all blog links and add new tasks to continue crawling those blogs; otherwise save the blog as JSON
//
// In this example, I make the page wait 5 seconds; this isn't necessary, it is only used to show you how to execute JavaScript code.
// `script` is useful for crawling single page applications or when you need to interact with the page. Only the `Headless Producer` can execute tasks that have a script
// `script` is the JavaScript code you want to execute; you need to convert your function to a string. Normally you can use `functionName.toString()`
metadata: { type: "bloglist", script: customFunction.toString() },
}),
],
};
};
After a Producer successfully executes a Task, the parse function receives the returned Task, which contains the crawled data. parse is used to extract data and decide whether to continue adding more Tasks. For example, in trigger we create a Task to crawl http://exampleblog.bitsky.ai/; after the Producer crawls it successfully, it sends back a Task that contains the HTML of http://exampleblog.bitsky.ai/. Inside the parse function, we parse the returned HTML, get the URL of each blog, and create Tasks to continue crawling each blog.
/**
* After **Producer** successfully executes a Task, the parse function will be called and receive the **Task** containing the crawled data.
* Parse is used to extract data and decide whether to continue adding more tasks.
*
* For example, in **trigger** we create a task to crawl http://exampleblog.bitsky.ai/; after **Producer** crawls it successfully, it will send back a Task that contains the HTML of http://exampleblog.bitsky.ai/
* And inside the **parse** function, we parse the returned HTML, get the URL link of each blog, and create tasks to continue crawling each blog
*
* @returns {object} - https://apis.bitsky.ai/bitsky-retailer-sdk/global.html#ParseFunReturn
*/
const parse = async function parse({ req, res }) {
try {
// Task return from Producer, Task Schema - https://github.com/bitskyai/bitsky-supplier/blob/develop/src/schemas/task.json
// By default, crawled HTML was stored in task.dataset.data.content
const returnTasks = req.body;
// New Tasks that need to be sent to BitSky Supplier
const tasks = [];
// Crawled Data, by default will be stored in local disk
const storeData = [];
// Base URL for the new Task
const targetBaseURL = "http://exampleblog.bitsky.ai/";
for (let i = 0; i < returnTasks.length; i++) {
let task = returnTasks[i];
// Crawled HTML - https://github.com/bitskyai/bitsky-supplier/blob/develop/src/schemas/task.json
let htmlString = task.dataset.data.content;
// You can find how to use cheerio from https://cheerio.js.org/
// cheerio: Fast, flexible & lean implementation of core jQuery designed specifically for the server.
// if you like you also can try to use `xpath`, please check https://www.npmjs.com/package/xpath
let $ = cheerio.load(htmlString);
if (task.metadata.type == "bloglist") {
// If task type is **bloglist**, then need to get blog link
// Get more detail from https://docs.bitsky.ai/tutorials/crawl-example-blog#crawl-each-blog-list-page-and-get-blogs-link
let blogUrls = $("div.post-preview a");
for (let i = 0; i < blogUrls.length; i++) {
let $blog = blogUrls[i];
$blog = $($blog);
// Get blog page link, don't forget to add Base URL
let url = new URL($blog.attr("href"), targetBaseURL).toString();
// you can use `logger.info`, `logger.error` for debug
// please check https://www.npmjs.com/package/winston for detail
logger.info(`blog page link: ${url}`);
// Add Task to crawl blog page
tasks.push(
baseRetailerService.generateTask({
url,
// Set `priority` to `2`, so we first crawl all blog list pages, then crawl all blogs
priority: 2,
metadata: {
// Add `type: "blog"` to indicate this task is for crawl blog
type: "blog",
},
})
);
}
// Get next blog list page link
let nextUrl = $("ul.pager li.next a").attr("href");
if (nextUrl) {
nextUrl = new URL(nextUrl, targetBaseURL).toString();
logger.info(`blog list page link: ${nextUrl}`);
// If it has next blog list page, then create a Task to crawl Next Blog List page
tasks.push(
baseRetailerService.generateTask({
url: nextUrl,
// blog list page is highest priority
priority: 1,
metadata: {
// indicate this task is for crawl blog list page
type: "bloglist",
// Just to show you how to execute JavaScript in the browser
script: customFunction.toString(),
},
})
);
}
} else if (task.metadata.type == "blog") {
// If it is a blog page, then extract the data and store it
storeData.push({
title: $("div.post-heading h1").text(),
author: $("div.post-heading p.meta span.author").text(),
date: $("div.post-heading p.meta span.date").text(),
content: $("div.post-container div.post-content").text(),
url: task.dataset.url,
});
} else {
logger.error("unknown type");
}
}
// return data that need to store and tasks need to be executed
// Check https://apis.bitsky.ai/bitsky-retailer-sdk/global.html#ParseFunReturn for more detail
return {
data: storeData,
tasks: tasks,
};
} catch (err) {
logger.error(`parse error: ${err.message}`);
}
};
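One detail worth calling out in parse: the hrefs scraped from the HTML may be relative, so they are joined with targetBaseURL using the WHATWG URL constructor before new Tasks are generated. A standalone illustration (the href values below are made up for demonstration):

```javascript
// parse() resolves relative links against the base URL before creating Tasks.
// The WHATWG URL constructor handles both relative and absolute hrefs.
const targetBaseURL = "http://exampleblog.bitsky.ai/";

// A relative href (hypothetical example) is joined with the base:
const relative = new URL("/post/first.html", targetBaseURL).toString();
console.log(relative); // → http://exampleblog.bitsky.ai/post/first.html

// An absolute href passes through unchanged:
const absolute = new URL("http://other.example/x", targetBaseURL).toString();
console.log(absolute); // → http://other.example/x
```

This is why new URL($blog.attr("href"), targetBaseURL).toString() works regardless of how the blog links are written in the page.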
After you have configured GLOBAL_ID, click the Start button to start the Hello Retailer Service

After it starts successfully, you should see the Retailer Editor like this:

Open user manual document in the browser
This is the Base URL of the Hello Retailer Service; click it to view configuration information about this Retailer.

If the Base URL of the Hello Retailer Service isn't http://localhost:8081, you need to update the Base URL value in first retailer configuration

Clicking Add Trigger Tasks will add the initial Tasks

Let us click Add Trigger Tasks; it will add one Task to the Supplier

You should be able to see one Task in the Tasks page

Now that you have added an initial Task, let us configure the Headless Producer and HTTP Producer.
If you don't have any Producer Configurations yet, you should see a page similar to this

Click Create to popup Create a Producer Configuration drawer

1. Name: first headless configuration

You can keep the other fields at their default values.
Click Create to popup Create a Producer Configuration drawer

1. Name: first http configuration
2. Producer Type: HTTP Producer

You can keep the other fields at their default values.
Click first headless configuration


Click Headless Producer

1. Producer Configuration Global ID (2): Paste the copied first headless configuration Global ID
2. Headless Mode: Change to No, so that when the Headless Producer executes Tasks, you can see Chrome open automatically.

Your changes are saved automatically; after a successful save, you should be able to see Update producer configuration, and restarting... (4)
Changing Headless Mode to No is normally only used for debugging purposes.

Click first http configuration

Copy Producer Configuration Global ID

Click HTTP Producer

1. Producer Configuration Global ID (2): Paste the copied first http configuration Global ID

Your changes are saved automatically; after a successful save, you should be able to see Update producer configuration, and restarting... (3)
Click Producer Configurations

Click Activate (1, 2) to activate first headless configuration and first http configuration. Now you just need to wait for the Headless Producer and HTTP Producer to execute your Tasks. After you activate them, you normally need to wait about 30 seconds before the Headless Producer and HTTP Producer start executing Tasks. Wait about 10 more seconds, then click Tasks

You should see that the Tasks page is empty; if it isn't empty, wait until it is. When the Tasks page is empty, your data crawling is finished. Click Tasks History, and you should be able to see 14 Tasks

Click Hello Retailer Service to open Retailer Editor

Click the Show all files icon, then select data.json; the crawled data is saved in data.json


Awesome job! You just successfully crawled https://exampleblog.bitsky.ai/. See how simple data crawling is when you use BitSky; now you can use BitSky to crawl all kinds of websites.