Introduction to Data Pipeline with Serverless Architecture

Vachan Anand
9 min read · Mar 22, 2022

In this blog we are going to have a gentle introduction to building end-to-end data pipelines using some of the serverless technologies. Although this blog focuses on Amazon Web Services (AWS) to build the pipeline, the architecture can easily be replicated on any other cloud platform, including Google Cloud Platform (GCP), Microsoft Azure, etc.

Photo by JJ Ying on Unsplash

Before we proceed with the blog, let me introduce the concept of serverless computing. Serverless computing is a model where the infrastructure, such as servers and their underlying software, is owned and managed by the cloud computing platform. We as organisations or individuals can rent the infrastructure (so to say) for as long as we desire. Each resource we rent might come with some added constraints, however in its most basic form the model implies that we own the code we want to run and do not have to worry too much about the machines it runs on.

This has a few advantages:

  1. We only pay for the resources we use. For instance, if we have a piece of code that takes 10 seconds to run on a very powerful machine, we do not have to reserve or buy that machine for an entire day, month or year. We could just pay for the 10 seconds our code actually ran on the machine.
  2. Secondly, every machine involves software such as the operating system. Traditionally, as an organisation we would have to manage software upgrades ourselves, which includes hiring a system administrator to configure, manage and upgrade the system. This not only costs money but, more importantly, takes a lot of time. With the serverless framework, this responsibility is also handed over to the cloud platform.
  3. Lately, the tech industry has been moving to a micro-service architecture. This means that, rather than having a monolithic codebase with thousands of lines of code serving multiple purposes, the features are distributed as micro-services, each of which is independent and serves a single purpose or a handful of purposes. This has benefits of its own: for example, if one service fails or crashes, the rest of the application is not affected. Serverless computing fits this approach really well, as we have a set of resources that communicate with each other but do not depend on each other. This is also termed decoupling the application.
  4. Finally, with serverless we have the capacity to scale the number of requests our application serves up or down. This means that, as a business owner, if millions of customers are hitting the website, the architecture can serve all of them without a considerable delay. This is done by horizontal scaling, i.e. allocating more resources to serve more requests.

Considering all of the points mentioned above, I personally like to use this architecture wherever feasible.

So the problem statement we are trying to solve is as follows: as an organisation, we would like to have a pipeline that collects data from our website and stores it in either a database or a file system that is easy to use and well managed.

For the purpose of the demonstration we will have a website and a pipeline such that:

  • Firstly, we can drop our data files onto the website to kick off the pipeline.
  • The pipeline detects any change in the state of our data source and triggers a series of events to store the file in a datalake.
  • Additionally, if certain conditions are met, i.e. if the file contains user information, the pipeline also triggers a series of events to store the data in a database.

Application Code:

https://github.com/VachanAnand/ML-Ops-pipeline-AWS.git

Architecture Diagram

The entire solution can be divided into five layers, or zones:

  1. Data Source Layer: On the left-hand side of the architecture, with a yellow background, we have the data sources; this is where the data gets generated. In our case it is generated by a website we build using Amplify.
  2. Landing Layer: In the middle of the architecture, with a green background, we have the landing zone. This is where data from different sources “lands”. It acts as our raw data, i.e. untouched data from the different source systems.
  3. Database / Structured Layer: In the top right, with a purple background, we have the database layer. This is where structured data gets ingested into the database.
  4. Datalake / Unstructured Layer: In the bottom right, in blue, we have the datalake layer. This is where data from different sources gets stored as files in a well-managed storage solution.
  5. Application Layer: On the far right, in white, we have the application layer. This layer consumes the data from the database or the datalake to generate business value.

In the architecture diagram above, we make use of the following technologies:

  • Amplify
  • Cognito
  • CloudFormation
  • Simple Storage Service (S3)
  • Lambda Functions
  • Simple Notification Service (S.N.S)
  • Simple Queue Service (S.Q.S)
  • Elastic Container Registry (E.C.R.)
  • DynamoDB
  • Identity and Access Management (I.A.M.)

Additionally, we make use of

  • Docker
  • Terraform

Let us go through what each technology mentioned above does and how it is used in our data pipeline. We will move from left to right, which is also the direction of data flow in our proposed solution.

Amplify

Amplify is a service that helps build full-stack applications easily. In the demonstration we use it to simulate a data source: we build an interface where a user can drop in a file that needs to be ingested by our data pipeline.

The interface we build is simple: it has a button that lets us select a file from our local machine and ingest it into the pipeline.

Cognito

Cognito is a service used to configure and manage application access for users. We use Cognito to authenticate users before giving them access to our website.

CloudFormation

CloudFormation is a service that provides infrastructure as code. We will take a deeper look at infrastructure as code later in this blog when we talk about Terraform, another infrastructure-as-code tool that is not an AWS service.

In the demonstration, we don’t explicitly create resources using CloudFormation. However, Amplify creates a number of resources under the hood using CloudFormation templates.

Simple Storage Service (S3)

Simple Storage Service is a highly scalable, available and reliable cloud storage solution on AWS. In its simplest form, think of it as a file system on the cloud. Within S3 we have the concept of buckets; think of a bucket as a directory where we store data. We make extensive use of S3 in our demonstration. It is used in:

  1. The Data Source Layer, to store data for our full-stack application.
  2. The Landing Layer, to bring data from different data sources into a single location.
  3. The Datalake Layer, to store files in a well-managed location for easy access.

Additionally, S3 has a very useful feature called event-based notifications. It can send a notification (a message, so to say) to a resource when a condition is met. We use this feature to send a message to services such as a Lambda function or S.N.S to tell them that a data file has landed in a data source. This acts as the trigger for our pipeline to start processing the new data, as sketched below.
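
As a rough illustration (not code from the repository; the bucket name and topic ARN below are placeholders), configuring such a notification with boto3 looks roughly like this:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket and topic; in the real pipeline these would be the
# resources created by Terraform / Amplify.
s3.put_bucket_notification_configuration(
    Bucket="landing-zone-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                # Publish a message for every object created in the bucket.
                "TopicArn": "arn:aws:sns:ap-southeast-2:123456789012:file-landed",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)
```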

Lambda Function

Lambda is AWS’s serverless computing service. Think of it as a compute resource on the cloud where our backend application / data pipeline code runs. We also make extensive use of Lambda functions in our demonstration. They are used in the following layers (a minimal handler sketch follows the list):

  1. The Landing Layer, to get data from the data source and store it in the landing bucket.
  2. The Database Layer, to get data from the landing bucket and store it in a database.
  3. The Datalake Layer, to get data from the landing bucket and store it in a managed S3 datalake.
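
To give a feel for what such a function looks like, here is a minimal handler sketch for the Landing Layer, assuming the function is triggered directly by an S3 event notification. The bucket name is a placeholder, not the one used in the repository.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

LANDING_BUCKET = "landing-zone-bucket"  # placeholder bucket name


def handler(event, context):
    # An S3 event notification carries one record per object created.
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Copy the new file from the data source bucket into the landing bucket.
        s3.copy_object(
            Bucket=LANDING_BUCKET,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )

    return {"statusCode": 200, "body": json.dumps("copied")}
```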

Simple Notification Service (S.N.S)

The Simple Notification Service, as the name suggests, is a messaging service. It receives a message from a producer (i.e. a resource that produces the message) and delivers the message to a number of consumers (i.e. the resource or resources for which the message is intended).

We use S.N.S in the Landing Layer to receive the message from S3 indicating that a file has landed in the bucket, and to deliver that message to multiple consumers, such as the database and datalake layers, so they can start processing the file.
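
As a sketch of how that delivery is wired up (the topic and queue ARNs here are made up for illustration), the database and datalake queues are simply subscribed to the same topic:

```python
import boto3

sns = boto3.client("sns")

# Placeholder ARNs: one topic fanning the "file landed" message out to two queues.
TOPIC_ARN = "arn:aws:sns:ap-southeast-2:123456789012:file-landed"
QUEUE_ARNS = [
    "arn:aws:sqs:ap-southeast-2:123456789012:database-queue",
    "arn:aws:sqs:ap-southeast-2:123456789012:datalake-queue",
]

for queue_arn in QUEUE_ARNS:
    # Each subscription receives a copy of every message published to the topic.
    sns.subscribe(TopicArn=TOPIC_ARN, Protocol="sqs", Endpoint=queue_arn)
```

In practice the queues also need an access policy that allows the topic to deliver messages to them.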

Photo by Jason Leung on Unsplash

Simple Queue Service (S.Q.S)

The Simple Queue Service is similar to the SNS described above. Its purpose is to deliver messages from a producer to a consumer; however, unlike SNS, a queue has just one consumer.

The main intent of SQS is to decouple resources in a micro-service architecture. SNS and SQS together form a “fanout” architecture, which is very useful for improving the reliability of the solution.

We make use of SQS in the following layers (a consumer sketch follows the list):

  • Database Layer: to deliver messages to a Lambda function so as to trigger the data load into the database.
  • Datalake Layer: to deliver messages to a Lambda function so as to trigger the data load into the datalake.
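
For illustration, a Lambda function consuming one of these queues unwraps the S.N.S envelope inside each S.Q.S record to get back to the original S3 event. This sketch assumes raw message delivery is not enabled, and it only prints the file location where the real pipeline would start its load.

```python
import json


def handler(event, context):
    # Each SQS record wraps an SNS envelope, which in turn wraps the S3 event.
    for record in event["Records"]:
        sns_envelope = json.loads(record["body"])
        s3_event = json.loads(sns_envelope["Message"])

        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            # The real pipeline would load this file into the database or datalake.
            print(f"New file ready for processing: s3://{bucket}/{key}")
```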

Elastic Container Registry (E.C.R.)

ECR is a fully managed container registry in AWS. A container is a lightweight, standalone, executable package for running an application. Conceptually it is like a shipping container that stores cargo (in our case, code). The code/cargo can be run on different machines independently.

We use it to store the container images that hold the code for our Lambda functions in the Datalake and Database Layers.

Photo by Rinson Chory on Unsplash

DynamoDB

DynamoDB is a high-performance NoSQL database in AWS. It stores data as key-value pairs and, unlike a traditional SQL database, does not have to conform to a tabular structure; therefore each record in the database can have a different number of attributes (columns).

Although DynamoDB is a NoSQL database, the data we store in it for our demonstration is well structured.
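
As a small illustration of the key-value model (the table and attribute names here are hypothetical, not the ones in the repository), two items in the same table can carry different attributes:

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical table with "user_id" as its partition key.
table = dynamodb.Table("users")

# Items in the same table do not need to share the same set of attributes.
table.put_item(Item={"user_id": "u-001", "name": "Jane", "country": "Australia"})
table.put_item(Item={"user_id": "u-002", "name": "Ravi", "age": 31})
```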

Identity and Access Management

IAM is, in my opinion, the backbone of AWS. It is used to grant access to users so that they are authorised to perform certain tasks. Additionally, since we make use of so many different resources, it is used to give each resource the permissions it needs to access and communicate with the other resources it depends on.
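
As a rough example of what such a permission looks like (the role, policy and bucket names are placeholders), an inline policy letting a Lambda function read objects from the landing bucket could be attached like this:

```python
import json

import boto3

iam = boto3.client("iam")

# Least-privilege policy: the function may only read objects from the landing bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::landing-zone-bucket/*",
        }
    ],
}

iam.put_role_policy(
    RoleName="datalake-lambda-role",  # placeholder execution role
    PolicyName="read-landing-bucket",
    PolicyDocument=json.dumps(policy),
)
```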

Photo by FLY:D on Unsplash

Some of the other tools used for the demonstration include:

Docker

Docker is a technology used to bundle application code into a lightweight, standalone, executable package called a container. Containers are very useful for building scalable products and are used extensively with Kubernetes.

We use Docker to build the container images for our Lambda functions and push them to Amazon ECR.

Terraform

In our demonstration, we had to create 11 resources across the Landing, Database and Datalake layers combined. To each resource we had to assign an IAM role, and to each role we had to attach an IAM policy. Even though the demonstration uses a minimal number of resources, that is still a lot of services to provision. Each one can be created manually via the console. However, a real-world solution would involve hundreds of resources. Moreover, the architecture would usually be replicated in different environments such as dev, test and production. Managing all of these resources via the console becomes increasingly difficult and, at some point, very inefficient.

To solve this problem, we make use of infrastructure as code, i.e. we create these resources via code, which makes it easy to manage, maintain and replicate them across different environments.

Application Code:

https://github.com/VachanAnand/ML-Ops-pipeline-AWS.git

Personal Note :

As a data scientist, one of the areas I never worried about was infrastructure. My experience was mostly around building models in Jupyter notebooks. However, I was never able to operationalise the models I created, as each model had to fit into the larger network of technologies used by the organisation I was building it for.

Soon, I started to understand the importance of having a broader knowledge of infrastructure and data pipelines, so that I could connect the models I create to an operational environment. This not only helped me understand the world of MLOps (machine learning operations) but also made me a better data scientist.


Vachan Anand

A consultant with an interest in Data Science, Data Engineering and Cloud Technology.