
Crawl Example Blog

Read the requirement for Crawl Example Blog below. This tutorial then shows you, step by step, how to implement it with BitSky.
Requirement:
Crawl all blogs from https://exampleblog.bitsky.ai/, and extract each blog's title, author, date, content, and url.
Example Blog is a static website, without heavy Ajax.
After reviewing the site structure of https://exampleblog.bitsky.ai/, this is what we need to do to crawl all blogs:
To distinguish blog list pages from blog pages when we create a Task, we add metadata.type = "bloglist" to each blog list page Task and metadata.type = "blog" to each blog page Task.
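As a minimal sketch of this tagging scheme, the snippet below builds plain task objects and routes them by metadata.type. The helper names (makeTask, routeTask) and the bare { url, metadata } shape are illustrative assumptions, not BitSky's actual Task schema or API.

```javascript
// Hypothetical helpers: makeTask and routeTask are not part of BitSky's API.
// The { url, metadata } shape is an assumed, simplified stand-in for a Task.
function makeTask(url, type) {
  if (type !== "bloglist" && type !== "blog") {
    throw new Error(`Unknown task type: ${type}`);
  }
  return { url, metadata: { type } };
}

// Pick the parser for a crawled task based on its metadata.type.
function routeTask(task, handlers) {
  const handler = handlers[task.metadata.type];
  if (!handler) {
    throw new Error(`No handler for type: ${task.metadata.type}`);
  }
  return handler(task);
}

const listTask = makeTask("https://exampleblog.bitsky.ai/", "bloglist");
const result = routeTask(listTask, {
  bloglist: (task) => `parse list page ${task.url}`,
  blog: (task) => `parse blog page ${task.url}`,
});
console.log(result);
```

Keeping the type in metadata means a single entry point can receive every crawled page and still know which selectors to apply.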

Crawl each blog list page and get the blog links

First, extract the link to the next blog list page (1) from section 2, repeating until a page has no OLDER POSTS link, and create a Task to crawl each blog list page.
CSS selector of OLDER POSTS: ul.pager li.next a
In NodeJS, you can use cheerio (https://www.npmjs.com/package/cheerio) to find this element. In this example, let nextPage = $("ul.pager li.next a").attr("href"); sets nextPage to /page2.
Then extract each blog page (3) link from section 4 of each blog list page, and create a Task to crawl each blog.
CSS selector of the blog page links: let blogUrls = $("div.post-preview a"); iterating over blogUrls then yields each blog page link.
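Because the extracted href values can be relative (nextPage is /page2 above), they need to be resolved against the URL of the page they came from before new Tasks are created. A minimal sketch using Node's built-in URL class follows. planNextTasks is a hypothetical helper, the task shape is a simplified assumption, and the blog hrefs in the usage example are made-up placeholders.

```javascript
// Hypothetical helper, not a BitSky API. Inputs are the raw href strings
// pulled from a blog list page, for example with cheerio:
//   const nextHref = $("ul.pager li.next a").attr("href"); // undefined on the last page
//   const blogHrefs = $("div.post-preview a").map((i, el) => $(el).attr("href")).get();
function planNextTasks(pageUrl, nextHref, blogHrefs) {
  const tasks = [];
  // No OLDER POSTS link means this is the last blog list page.
  if (nextHref) {
    tasks.push({
      url: new URL(nextHref, pageUrl).href,
      metadata: { type: "bloglist" },
    });
  }
  // Resolve each blog link to an absolute URL and skip duplicates.
  const seen = new Set();
  for (const href of blogHrefs) {
    if (!href) continue;
    const url = new URL(href, pageUrl).href;
    if (seen.has(url)) continue;
    seen.add(url);
    tasks.push({ url, metadata: { type: "blog" } });
  }
  return tasks;
}

// Placeholder hrefs, for illustration only.
const tasks = planNextTasks(
  "https://exampleblog.bitsky.ai/",
  "/page2",
  ["/post/a", "/post/a", "/post/b"]
);
console.log(tasks.map((t) => `${t.metadata.type} ${t.url}`));
```

Deduplicating is a design choice of this sketch: a preview block may link to the same blog more than once, and crawling it twice would only produce duplicate records.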

Crawl blog page

Extract the title (1), author (2), date (3), content (4), and url (5), then store the crawled data:
{
  title: $("div.post-heading h1").text(),
  author: $("div.post-heading p.meta span.author").text(),
  date: $("div.post-heading p.meta span.date").text(),
  content: $("div.post-container div.post-content").text(),
}
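The object above can be wrapped in a small function. The sketch below takes an already loaded cheerio-style $ (for example the value returned by cheerio.load(html)), so it does not depend on how the page was fetched. extractBlog is a hypothetical name, trimming whitespace is an added assumption, and filling url from the crawled page's own URL is also an assumption, since the selectors above do not cover it. The stand-in $ and its sample values exist only so the sketch runs without installing cheerio.

```javascript
// Hypothetical helper. $ is a cheerio-style query function, e.g. the value
// returned by cheerio.load(html); pageUrl is the URL the page was crawled from.
function extractBlog($, pageUrl) {
  // Trim whitespace left over from the HTML layout (an assumption about the
  // desired output, not something the original selectors do).
  const text = (selector) => $(selector).text().trim();
  return {
    title: text("div.post-heading h1"),
    author: text("div.post-heading p.meta span.author"),
    date: text("div.post-heading p.meta span.date"),
    content: text("div.post-container div.post-content"),
    url: pageUrl, // assumed: reuse the crawled page's URL
  };
}

// Stand-in for a loaded cheerio instance with made-up sample values.
const samplePage = {
  "div.post-heading h1": "  Sample title  ",
  "div.post-heading p.meta span.author": "Sample author",
};
const fakeQuery = (selector) => ({ text: () => samplePage[selector] || "" });
const blog = extractBlog(fakeQuery, "https://exampleblog.bitsky.ai/sample-post");
console.log(blog);
```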
You can implement a Retailer that fulfills the above requirement in any programming language you are familiar with. We provide the following tutorials; choose the language you know best to continue.

Tutorials

  1. NodeJS
  2. Python