Introduction to Data Pipeline with Serverless Architecture

  1. We only pay for the resources we use. For instance, if we have code that takes 10 seconds to run on a very powerful machine, we do not have to reserve or buy that machine for an entire day, month or year. We can pay for just the 10 seconds our code actually ran on the machine.
  2. Secondly, every machine comes with software to maintain, such as the operating system. Traditionally, as an organisation we would have to manage software upgrades ourselves, which includes hiring a system administrator to configure, manage and upgrade the system. This not only costs money but, more importantly, takes a lot of time. With the serverless framework, this responsibility is also handed over to the cloud platform.
  3. Lately, the tech industry has been moving to a microservice architecture. This means that rather than having a monolithic codebase with thousands of lines of code serving multiple purposes, the features are distributed as microservices, such that each microservice is independent and serves a single purpose or a handful of them. This has benefits of its own: if one service fails or crashes, the rest of the application is not affected. Serverless computing fits this model really well, as we have a set of resources that communicate with each other but are not dependent on each other. This approach is also termed decoupling the application.
  4. Finally, with serverless we have the capacity to scale the number of requests our application serves up or down. This means that, as business owners, if millions of customers are hitting the website, the architecture can serve all of them without considerable delay. This is achieved by horizontal scaling, i.e. allocating more resources to serve more requests.
  • Firstly, we can drop our data files into the pipeline.
  • The pipeline should detect any change in the state of our data source and trigger a series of events to store the file in a datalake.
  • Additionally, if certain conditions are met, i.e. if the file contains user information, the pipeline also triggers a series of events to store the data in a database (a sketch of this routing follows below).
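To make the flow concrete, here is a minimal sketch, not the actual application code, of how such an event-driven trigger could look as a Lambda handler written in Python with boto3. The topic ARNs and the users/ prefix used to recognise files containing user information are placeholder assumptions.

```python
import json
import urllib.parse

import boto3

sns = boto3.client("sns")

# Placeholder values for illustration; the real pipeline defines its own.
DATALAKE_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:datalake-load"
DATABASE_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:database-load"
USER_DATA_PREFIX = "users/"  # assumed marker for files containing user information


def handler(event, context):
    """Fan out S3 'object created' events to the datalake and database loaders."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        message = json.dumps({"bucket": bucket, "key": key})

        # Every new file is forwarded to the datalake loader.
        sns.publish(TopicArn=DATALAKE_TOPIC_ARN, Message=message)

        # Files that look like user information also go to the database loader.
        if key.startswith(USER_DATA_PREFIX):
            sns.publish(TopicArn=DATABASE_TOPIC_ARN, Message=message)
```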

Application Code:

Architecture Diagram

  1. Data Source Layer: On the left-hand side of the architecture, with a yellow background, we have the data sources. This is where the data gets generated; in our case it is generated from a website we build using Amplify.
  2. Landing Layer: In the middle of the architecture, with a green background, we have the landing zone. This is where data from different sources “lands”. It acts as our raw data, i.e. untouched data from the different source systems.
  3. Database / Structured Layer: In the top right, with a purple background, we have the database layer. This is where the structured data gets ingested into the database.
  4. Datalake / Unstructured Layer: In the bottom right, in blue, we have the datalake layer. This is where the data from different sources gets stored as files in a well-managed storage solution.
  5. Application Layer: On the far right, in white, we have the application layer. This layer consumes the data from the database or the datalake to generate business value.
  • Amplify
  • Cognito
  • Cloud Formation
  • Simple Storage Service (S3)
  • Lambda Functions
  • Simple Notification Service (S.N.S)
  • Simple Queue Service (S.Q.S)
  • Elastic Container Registry (E.C.R.)
  • Dynamo DB
  • Identity and Access Management (I.A.M.)
  • Docker
  • Terraform

Amplify

Cognito

Cloud Formation

Simple Storage Service (S3)

  1. Data Source Layer to store data for our full-stack application.
  2. Landing Layer to get data from different data sources into a single location.
  3. Datalake Layer to store the files in a well-managed location for easy access (a sketch of the datalake layout follows below).
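To illustrate the datalake role, a loader could copy a landed file into the datalake bucket under a partitioned prefix so it stays easy to query and manage. This is a hedged sketch only: the bucket names and the source=/dt= key convention are assumptions, not taken from the application code.

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Placeholder bucket names for illustration only.
LANDING_BUCKET = "pipeline-landing"
DATALAKE_BUCKET = "pipeline-datalake"


def store_in_datalake(key: str, source: str = "website") -> str:
    """Copy a landed file into a date-partitioned path in the datalake bucket."""
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    datalake_key = f"source={source}/dt={today}/{key}"

    s3.copy_object(
        Bucket=DATALAKE_BUCKET,
        Key=datalake_key,
        CopySource={"Bucket": LANDING_BUCKET, "Key": key},
    )
    return datalake_key
```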

Lambda Function

  1. Landing Layer to get data from the data source and store it in the landing bucket.
  2. Database Layer to get data from the landing bucket and store it in a database (a sketch of this load follows below).
  3. Datalake Layer to get data from the landing bucket and store it in a managed S3 datalake.
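As a rough sketch of the database-layer Lambda, and assuming, purely for illustration, that the landed file is newline-delimited JSON of user records and that the DynamoDB table is called users with a user_id key, the load could look like this:

```python
import json

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("users")  # assumed table name and key schema


def handler(event, context):
    """Read a landed file from S3 and write each record into DynamoDB."""
    for record in event["Records"]:
        # Assumes the queue delivers a raw {"bucket": ..., "key": ...} message.
        body = json.loads(record["body"])
        obj = s3.get_object(Bucket=body["bucket"], Key=body["key"])
        lines = obj["Body"].read().decode("utf-8").splitlines()

        with table.batch_writer() as batch:
            for line in lines:
                batch.put_item(Item=json.loads(line))  # each record carries a user_id
```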

Simple Notification Service (S.N.S)


Simple Queue Service (S.Q.S)

  • Database Layer: to deliver a message to Lambda so as to trigger the data load into the database.
  • Datalake Layer: to deliver a message to Lambda so as to trigger the data load into the datalake (a sketch of the message handling follows below).
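Both loader Lambdas are driven by messages arriving on their queue. The sketch below shows one hedged way to recover the bucket and key from an SQS record; the exact shape depends on whether the SNS subscription uses raw message delivery, which is an assumption here.

```python
import json


def extract_payload(sqs_record: dict) -> dict:
    """Return the {"bucket": ..., "key": ...} payload carried by an SQS record.

    If the queue is subscribed to SNS without raw message delivery, the payload
    is wrapped in an SNS envelope under the "Message" field.
    """
    body = json.loads(sqs_record["body"])
    if isinstance(body, dict) and "Message" in body:
        body = json.loads(body["Message"])  # unwrap the SNS envelope
    return body


def handler(event, context):
    for record in event["Records"]:
        payload = extract_payload(record)
        print(f"Loading s3://{payload['bucket']}/{payload['key']}")
        # ...hand off to the datalake or database load logic...
```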

Elastic Container Registry (E.C.R.)


Dynamo DB

Identity and Access Management


Docker

Terraform

Application Code:

Personal Note:
