Crawl Example Blog
Please read the requirement for Crawl Example Blog below. We will then show you, step by step, how to implement it using BitSky.
Requirement:
Crawl all blogs from https://exampleblog.bitsky.ai/, and extract each blog's title, author, date, content, and url.
Example Blog is a static website, without heavy Ajax.
After reviewing the site structure of https://exampleblog.bitsky.ai/, this is what we need to do in order to crawl all blogs:
To distinguish blog list pages from blog pages, we tag each Task when we create it: a blog list page Task gets metadata.type = "bloglist", and a blog page Task gets metadata.type = "blog".
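A minimal sketch of how the two Task types could be tagged. The `makeTask` helper, the `{ url, metadata }` object shape, and the sample blog URL are assumptions made for illustration, not the confirmed BitSky Task schema; only `metadata.type` and its two values come from the text above.

```javascript
// Hypothetical helper: the { url, metadata } shape is an assumption for
// illustration. Only metadata.type and its two values come from this page.
function makeTask(url, type) {
  if (type !== "bloglist" && type !== "blog") {
    throw new Error(`Unknown task type: ${type}`);
  }
  return { url, metadata: { type } };
}

// A Task for a blog list page and a Task for a single blog page.
const listTask = makeTask("https://exampleblog.bitsky.ai/", "bloglist");
// The blog URL below is a made-up placeholder.
const blogTask = makeTask("https://exampleblog.bitsky.ai/example-post", "blog");

console.log(listTask.metadata.type, blogTask.metadata.type);
```

When a crawled page comes back, checking `metadata.type` is then enough to decide whether to parse it as a blog list or as a blog.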

First, extract the link to the next blog list page(1) from section 2, and create a Task to crawl that page. Repeat this on every blog list page until there is no OLDER POSTS link on the page.
CSS Selector of OLDER POSTS:
ul.pager li.next a
In NodeJS, you can use cheerio (https://www.npmjs.com/package/cheerio) to find this element:
let nextPage = $("ul.pager li.next a").attr("href");
In this example, nextPage will be /page2.
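Because the extracted value is a relative path, it has to be resolved against the current page before a new Task can be created from it. A small sketch of that step, using Node's built-in `URL` class; the `resolveNextPage` name is hypothetical, and the `null` return for the last page is a design choice of this sketch rather than anything BitSky prescribes.

```javascript
// Hypothetical helper: turn the href read from "ul.pager li.next a" into an
// absolute URL, or return null when the page has no OLDER POSTS link.
function resolveNextPage(href, currentPageUrl) {
  if (!href) {
    // cheerio's .attr("href") yields undefined when no element matched.
    return null;
  }
  return new URL(href, currentPageUrl).href;
}

const next = resolveNextPage("/page2", "https://exampleblog.bitsky.ai/");
console.log(next); // https://exampleblog.bitsky.ai/page2

// On the last list page there is nothing to follow.
console.log(resolveNextPage(undefined, "https://exampleblog.bitsky.ai/page2")); // null
```

A `null` result is the signal to stop creating blog list Tasks.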

Then extract each blog page(3) link from section 4 of every blog list page, and create a Task to crawl each blog.
CSS Selector of blog page link:
div.post-preview a
With cheerio:
let blogUrls = $("div.post-preview a");
Then iterate over blogUrls to get each blog page link.
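The iteration could look like the sketch below. It assumes `$` is the function returned by `cheerio.load(html)`, where `html` is the crawled blog list page; `collectBlogLinks` is a hypothetical helper name. The duplicate filtering is there only in case several anchors in one preview point to the same post, which is an assumption about the markup.

```javascript
// Sketch: `$` is assumed to be the function returned by cheerio.load(html).
// collectBlogLinks is a hypothetical helper name.
function collectBlogLinks($, pageUrl) {
  const links = new Set(); // a Set drops duplicate URLs
  $("div.post-preview a").each((index, element) => {
    const href = $(element).attr("href");
    if (href) {
      // Resolve relative links against the page they were found on.
      links.add(new URL(href, pageUrl).href);
    }
  });
  return [...links];
}
```

Each returned URL can then become one Task with metadata.type = "blog".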
Finally, extract title(1), author(2), date(3), content(4), and url(5) from each blog page, and store the crawled data:
{
  title: $("div.post-heading h1").text(),
  author: $("div.post-heading p.meta span.author").text(),
  date: $("div.post-heading p.meta span.date").text(),
  content: $("div.post-container div.post-content").text(),
}
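The object above does not include the url field that the requirement lists. A minimal sketch that fills it in, assuming the url is simply the address the blog page was crawled from. As before, `$` is assumed to come from `cheerio.load(html)`, and `extractBlog` is a hypothetical name; trimming the text is an extra step added here, not part of the original snippet.

```javascript
// Sketch: `$` is assumed to come from cheerio.load(html), and `pageUrl` is
// assumed to be the URL the blog page was crawled from.
function extractBlog($, pageUrl) {
  // Read the text of the matched element and strip surrounding whitespace.
  const text = (selector) => $(selector).text().trim();
  return {
    title: text("div.post-heading h1"),
    author: text("div.post-heading p.meta span.author"),
    date: text("div.post-heading p.meta span.date"),
    content: text("div.post-container div.post-content"),
    url: pageUrl,
  };
}
```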
You can implement a Retailer that meets the above requirement in any programming language you are familiar with. We provide the following tutorials; choose the language you are most comfortable with to continue.