I have been fantasizing about collecting personalized data daily. Astrology is personalized data based on your birthday. It changes daily.

Instead of copying and pasting all twelve astrological signs every day, a web scraper extracts the data from the HTML pages. The data is then saved in DynamoDB, a connection-less NoSQL database.

The web scraper is deployed and executed on AWS Lambda, and EventBridge schedules its runs.

Web scraping in Python

This web scraping technique is based on HTML parsing. To avoid overloading the scraped website, the HTTP response is cached (a sketch follows the dependency list below).

The dependencies are:

  • BeautifulSoup (to extract data from HTML)
  • requests (to fetch the HTTP response)
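The article does not show the fetching code itself; below is a minimal sketch of how a cached request could look, assuming a simple in-memory cache with a time-to-live. The URL handling, TTL value and function name are illustrative, not the original implementation.

    import time
    import requests

    _CACHE: dict[str, tuple[float, str]] = {}   # url -> (fetched_at, html)
    CACHE_TTL_SECONDS = 6 * 60 * 60             # refresh at most every 6 hours (assumption)

    def fetch_cached(url: str) -> str:
        """Return the page HTML, reusing a cached copy while it is still fresh."""
        cached = _CACHE.get(url)
        if cached and time.time() - cached[0] < CACHE_TTL_SECONDS:
            return cached[1]
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        _CACHE[url] = (time.time(), response.text)
        return response.text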

The dependencies are then packaged in an AWS Lambda layer.

Let’s take the example of https://www.jessicaadams.com/horoscopes/daily-horoscopes/ below, where all twelve astrological signs are on a single page.
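The exact markup of that page is not shown in the article, so the selectors below are assumptions; the sketch only illustrates how BeautifulSoup could extract one text block per sign from a single page.

    from bs4 import BeautifulSoup

    SIGNS = {"aries", "taurus", "gemini", "cancer", "leo", "virgo",
             "libra", "scorpio", "sagittarius", "capricorn", "aquarius", "pisces"}

    def parse_horoscopes(html: str) -> dict[str, str]:
        """Map each sign to its horoscope text, assuming one heading per sign."""
        soup = BeautifulSoup(html, "html.parser")
        horoscopes = {}
        for heading in soup.find_all(["h2", "h3"]):
            sign = heading.get_text(strip=True).lower()
            if sign in SIGNS:
                paragraph = heading.find_next("p")
                if paragraph:
                    horoscopes[sign] = paragraph.get_text(strip=True)
        return horoscopes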

The main limitation is that only server-side rendered pages can be scraped this way; content rendered client-side with JavaScript does not appear in the downloaded HTML.

Why DynamoDB

The extracted data is then persisted in DynamoDB. Why is DynamoDB a popular database choice for serverless applications?

Pros:

  • DynamoDB is itself serverless.
  • Although DynamoDB is a NoSQL database, it can be queried with a SQL-compatible language (PartiQL).
  • Most important, DynamoDB is connection-less: it doesn’t maintain a connection pool and the interactions are stateless, so applications do not need to keep persistent network connections. Most RDBMSs require persistent connections initiated with a login and password, whereas authorization in DynamoDB is handled by Identity and Access Management (IAM), as the sketch after these lists illustrates.

Limitations:

  • A single Query operation can retrieve a maximum of 1 MB of data.
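The persistence code is not included in the article; here is a minimal boto3 sketch of how writing and reading could look, assuming a table keyed by sign and day. The table name, attribute names and helper functions are illustrative.

    import boto3

    dynamodb = boto3.client("dynamodb")
    TABLE_NAME = "horoscopes"   # hypothetical table name

    def save_horoscope(sign: str, day: str, text: str) -> None:
        """Persist one horoscope; credentials come from the Lambda's IAM role, not a login."""
        dynamodb.put_item(
            TableName=TABLE_NAME,
            Item={
                "sign": {"S": sign},
                "day": {"S": day},      # e.g. "2024-01-31"
                "text": {"S": text},
            },
        )

    def read_sign(sign: str) -> list[dict]:
        """Read back all stored days for one sign with PartiQL, the SQL-compatible syntax."""
        response = dynamodb.execute_statement(
            Statement=f'SELECT * FROM "{TABLE_NAME}" WHERE "sign" = ?',
            Parameters=[{"S": sign}],
        )
        return response["Items"]

Note that execute_statement is subject to the same 1 MB limit mentioned above; larger result sets come back with a NextToken to paginate.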

IaC with CDK

Instead of using Terraform, I wanted to try CDK (Cloud Development Kit). AWS CDK lets you write infrastructure as code ( IaC ) in the programming language of the application.

Under the hood, the code is synthesized into a lower-level language: an AWS CloudFormation template.

Terraform, by contrast, talks directly to the cloud providers’ APIs.
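The article does not reproduce the stack itself; below is a minimal CDK v2 sketch in Python of how the pieces could be wired together, including the dependency layer mentioned earlier. The construct IDs, asset paths, runtime version and key schema are all assumptions.

    from aws_cdk import Stack, aws_dynamodb as dynamodb, aws_lambda as _lambda
    from constructs import Construct

    class HoroscopeStack(Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)

            # Serverless table: on-demand billing, keyed by sign and day.
            table = dynamodb.Table(
                self, "Horoscopes",
                partition_key=dynamodb.Attribute(name="sign", type=dynamodb.AttributeType.STRING),
                sort_key=dynamodb.Attribute(name="day", type=dynamodb.AttributeType.STRING),
                billing_mode=dynamodb.BillingMode.PAY_PER_REQUEST,
            )

            # Layer holding BeautifulSoup and requests, built beforehand into ./layer.
            deps_layer = _lambda.LayerVersion(
                self, "ScraperDeps",
                code=_lambda.Code.from_asset("layer"),
                compatible_runtimes=[_lambda.Runtime.PYTHON_3_11],
            )

            # The scraping function itself.
            self.scraper_fn = _lambda.Function(
                self, "Scraper",
                runtime=_lambda.Runtime.PYTHON_3_11,
                handler="handler.main",
                code=_lambda.Code.from_asset("lambda"),
                layers=[deps_layer],
                environment={"TABLE_NAME": table.table_name},
            )

            # Grant write access through IAM instead of storing credentials.
            table.grant_write_data(self.scraper_fn)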

The AWS Lambda function is scheduled by EventBridge. Here it is scheduled at 22:00 to scrape a target website in Tasmania, Australia (UTC+10:00).

Here is how the event schedule can be expressed.
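The following is a sketch of how such a rule could be declared with CDK in Python, assuming the scraper_fn function from the previous sketch; the helper function is illustrative, and EventBridge cron expressions are evaluated in UTC.

    from aws_cdk import Stack, aws_events as events, aws_events_targets as targets, aws_lambda as _lambda

    def add_daily_schedule(stack: Stack, scraper_fn: _lambda.IFunction) -> events.Rule:
        """Attach a daily EventBridge rule that invokes the scraper Lambda."""
        rule = events.Rule(
            stack, "DailyScrape",
            # Fire once a day at 22:00; cron expressions are interpreted in UTC.
            schedule=events.Schedule.cron(minute="0", hour="22"),
        )
        rule.add_target(targets.LambdaFunction(scraper_fn))
        return rule

In the stack sketched above, this could be called as add_daily_schedule(self, self.scraper_fn) at the end of __init__.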

Next

Now the data is collected.

In the next episode, I will show how to query this data from an endpoint served by API Gateway.