4 Strategies for Building a New Service from Scratch Very Fast (as a Software Engineer)

Published: September 29, 2023
Tags: GitHub Actions, AWS ECS, AWS CDK, TypeScript, OpenAPI, Monitoring, Prometheus, Grafana Loki

Intro

If you were a software engineer joining a project with a month and a half to go before it goes live, what would you do? Without even knowing what to build, where would you start to make this project a success?
If you're working in a large organization, you probably don't have to think about this much, because there's already an established process for developing, deploying, and operating services within the company. On the other hand, if you're at an early-stage startup or on a small team trying to launch a new service, you're likely to run into this situation quite often.
There are a lot of challenges that come with being a software engineer, and I think one of the hardest is not being sure what you need to build, or being on a tight schedule. (It's even harder when both happen at the same time, right?)
A few months ago, I found myself in exactly this situation as the only backend engineer on a team of five, and I'd like to share my strategy and how I executed it.

4 Strategies

With each meeting, the requirements for the service changed rather than becoming more concrete. Accepting this challenge, the engineering team adopted a strategy of rapidly developing an MVP from the abstract requirements and refining it with feedback. To support this strategy, I prepared four pillars, the first of which was CI/CD pipelining.

CI/CD pipelining to deploy easily

First and foremost, I worked on the CI/CD pipeline. I realized that not being able to reliably integrate the codebase and deploy fixes would be a major stumbling block for the project.
If I could have used an existing in-house CI/CD platform (GitLab CI, Jenkins) or other infrastructure, I could have skipped this step. However, due to the specific needs of the project, I was unable to use an in-house platform, so I had to configure the pipeline from scratch 😇
 
CI/CD pipeline with GitHub Actions
There were many options for configuring the infrastructure, but since the team had a single backend engineer managing all of it, I chose AWS ECS Fargate because it requires little hands-on management. The CI/CD pipeline, built with GitHub Actions, builds a new version of the application Docker image and deploys it to the ECS cluster. First, it dockerizes the application so it runs as a single container in the ECS service. The built image is then pushed to Amazon Elastic Container Registry (ECR). Next, it updates the image tag in the ECS task definition and applies the updated task definition to the service, which deploys the new version of the application to the ECS cluster. Below is a full example YAML file for the GitHub Actions workflow.
deploy.yaml
name: Deploy API Server

on:
  workflow_dispatch:

env:
  AWS_REGION: ap-northeast-2 # set this to your preferred AWS region, e.g. us-west-1
  ECR_REPOSITORY: YOUR_ECR_REPOSITORY
  ECS_CLUSTER: YOUR_ECS_CLUSTER
  ECS_SERVICE: api-server
  ECS_TASK_DEFINITION: .aws/api-server/task-definition.json # set this to the path to your Amazon ECS task definition
                                                            # file, e.g. .aws/task-definition.json
  CONTAINER_NAME: app # set this to the name of the container in the
                      # containerDefinitions section of your task definition

jobs:
  deploy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [20.x]
    steps:
      - uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
          # https://stackoverflow.com/questions/61010294/how-to-cache-yarn-packages-in-github-actions
          cache: "yarn"

      # https://github.com/kiranojhanp/fullstack-typescript-turborepo-starter/blob/main/.github/workflows/api.yaml
      - name: Get yarn cache directory path
        id: yarn-cache-dir-path
        run: echo "dir=$(yarn cache dir)" >> $GITHUB_OUTPUT

      - name: Cache node modules
        uses: actions/cache@v2
        env:
          cache-name: cache-node-modules
        id: yarn-cache # use this to check for `cache-hit` (`steps.yarn-cache.outputs.cache-hit != 'true'`)
        with:
          path: ${{ steps.yarn-cache-dir-path.outputs.dir }}
          key: ${{ runner.os }}-yarn-${{ hashFiles('**/yarn.lock') }}
          restore-keys: |
            ${{ runner.os }}-yarn-

      - name: Install Dependencies
        run: yarn install --frozen-lockfile --prefer-offline

      - name: Build and Push docker
        id: build-and-push-docker
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -f PATH_TO_DOCKERFILE -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG -t $ECR_REGISTRY/$ECR_REPOSITORY:latest .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY --all-tags
          echo "::set-output name=image::$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG"

      - name: Fill in the new image ID in the Amazon ECS task definition
        id: task-def
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        with:
          task-definition: ${{ env.ECS_TASK_DEFINITION }}
          container-name: ${{ env.CONTAINER_NAME }}
          image: ${{ steps.build-and-push-docker.outputs.image }}

      - name: Deploy Amazon ECS task definition
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.task-def.outputs.task-definition }}
          service: ${{ env.ECS_SERVICE }}
          cluster: ${{ env.ECS_CLUSTER }}
          wait-for-service-stability: true
In addition, I used AWS Systems Manager to get direct access to the running container (as with docker exec) when troubleshooting in the ECS Fargate environment, and leveraged S3 buckets to inject environment variables when launching the container.
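For reference, here is a minimal sketch of how that environment-file injection might look in CDK code; the bucket name, object key, and image are placeholders, and the task definition is assumed to be the one defined in the Fargate stack shown later in this post.

import { Stack } from 'aws-cdk-lib';
import { ContainerImage, EnvironmentFile, TaskDefinition } from 'aws-cdk-lib/aws-ecs';
import { Bucket } from 'aws-cdk-lib/aws-s3';

// Assumed to exist elsewhere in the CDK app (see the Fargate stack later in this post).
declare const stack: Stack;
declare const taskDefinition: TaskDefinition;

// Reference the bucket holding the .env file (bucket name and key are illustrative).
const envBucket = Bucket.fromBucketName(stack, 'EnvFileBucket', 'my-service-env-files');

taskDefinition.addContainer('app', {
  image: ContainerImage.fromRegistry('node:20-alpine'),
  // ECS downloads the object at task startup and exposes its contents as environment variables.
  environmentFiles: [EnvironmentFile.fromBucket(envBucket, 'api-server/.env')],
});

// With enableExecuteCommand: true on the ECS service (see the stack below), a shell can be
// opened in the running Fargate task via AWS Systems Manager, for example:
//   aws ecs execute-command --cluster <cluster> --task <task-id> --container app \
//     --interactive --command "/bin/sh"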
 

Reduce the cost of communication with API docs

I believe that minimizing unnecessarily wasted time is key to moving a project forward at a fast pace. However, communication is essential to working together as a team, and it inevitably takes time.
One of the most common topics of communication between frontend and backend engineers is the API interface. Imagine talking through each of dozens of API interfaces with your fellow frontend engineers. It's obvious that backend engineers would be significantly less productive and the project would slow down.
We leveraged the @nestjs-library/crud library to automatically generate OpenAPI (Swagger) documentation on the backend, and the swagger-typescript-api library to automatically generate API client code on the frontend.
The team's productivity took off, as most of the communication between frontend and backend engineers was reduced to a simple, straightforward pattern: "take AA, process BB, and create an API that returns CC."
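For illustration, here is a minimal sketch of how the generated OpenAPI document could be exposed at application bootstrap using @nestjs/swagger; the module name, title, and docs path are placeholders rather than the project's actual code.

import { NestFactory } from '@nestjs/core';
import { DocumentBuilder, SwaggerModule } from '@nestjs/swagger';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Build the OpenAPI document from the decorated controllers and DTOs.
  const config = new DocumentBuilder()
    .setTitle('API Server')
    .setVersion('1.0')
    .build();
  const document = SwaggerModule.createDocument(app, config);

  // Serve the interactive Swagger UI at /docs (the raw JSON is served at /docs-json).
  SwaggerModule.setup('docs', app, document);

  await app.listen(3000);
}
bootstrap();

The JSON document served alongside the UI is what swagger-typescript-api consumes on the frontend to generate typed client code.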
 

Manage infrastructure with code (feat. AWS CDK)

In a large company, you have expert SREs and DevOps engineers, so you can focus on development without worrying too much about cloud infrastructure. However, in this project, I couldn't get that support, so I started thinking about how to minimize the cost of managing infrastructure.
Recalling my earlier personal projects, configuring infrastructure across services such as EC2, S3, and Route 53 in the AWS console used to take up a considerable portion of the overall effort. Without sufficient proficiency in each AWS service, the console didn't feel very intuitive, and it was difficult to understand the infrastructure at a glance. It was also hard to reuse infrastructure once configured, and not easy to roll back if I made a mistake while changing settings.
The AWS Cloud Development Kit (CDK) was the solution that came to mind. The ability to define cloud infrastructure in code using a familiar language (TypeScript) appealed to me.
 
Entire infrastructure configured with AWS CDK.
Because the infrastructure is defined in code, new ECS services can always be configured consistently, and the risk of human error in configuration is reduced. These advantages contributed greatly to the productivity of service development.
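To give a feel for what this looks like, here is a rough sketch of a CDK app entry point defining a shared VPC and ECS cluster; the construct names and values are illustrative rather than the project's actual code.

// bin/app.ts - illustrative CDK entry point (names and values are placeholders).
import { App, Stack, StackProps } from 'aws-cdk-lib';
import { Vpc } from 'aws-cdk-lib/aws-ec2';
import { Cluster } from 'aws-cdk-lib/aws-ecs';
import { Construct } from 'constructs';

class ClusterStack extends Stack {
  public readonly cluster: Cluster;

  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // One VPC and one ECS cluster shared by the API server and monitoring services.
    const vpc = new Vpc(this, 'ServiceVpc', { maxAzs: 2 });
    this.cluster = new Cluster(this, 'ServiceCluster', { vpc, clusterName: 'service-cluster' });
  }
}

const app = new App();
new ClusterStack(app, 'ClusterStack', {
  env: { account: process.env.CDK_DEFAULT_ACCOUNT, region: 'ap-northeast-2' },
});
app.synth();

With this in place, cdk diff previews changes against what is currently deployed, cdk deploy applies them, and rolling back a bad change is as simple as redeploying the previous commit.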

Monitoring infrastructure to catch problems quickly

What's the first thing you do after deploying a feature or patch you've worked so hard on? You might celebrate with your teammates and visit the user community to gather feedback - but if you're an engineer, you're probably monitoring the system right after the release. Even if you've done tons of testing before deployment, it's still not the time to let your guard down.
 
SpaceX Starship: Elon Musk promises second launch within months. https://www.bbc.com/news/science-environment-65334810
Monitoring tools are essential for any service, no matter how big or small. Even for a small service, it is important to quickly detect and analyze system failures and recover from them in a timely manner. Logs and metrics are the first priority for ensuring system observability.

Logging Infrastructure

First, let's introduce the logging infrastructure I've configured.
Logging infra with Promtail, Loki and Grafana
The application ECS service runs the application container and the Promtail container. Engineers developing the application don't need to worry about the logging infrastructure; they simply write logs as usual, which are stored locally as files.
FROM node:20-alpine

RUN mkdir -p /vanguard/logs/app
WORKDIR /vanguard/app

# build and install dependencies

EXPOSE 3000
CMD ["/bin/sh", "-c", "node main.js 2>> /vanguard/logs/app/stderr.log 1>> /vanguard/logs/app/stdout.log"]
Dockerfile
These log files are then forwarded to the monitoring ECS service with the help of the Promtail agent. Promtail scrapes the local log files and pushes them to a Loki instance running as a container.
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: LOKI_INSTANCE_URL:3100/loki/api/v1/push

scrape_configs:
  - job_name: applogs
    static_configs:
      - targets:
          - localhost
        labels:
          job: applogs
          service: ${SERVICE_NAME}
          __path__: /vanguard/logs/app/stdout.log
  - job_name: apperrs
    static_configs:
      - targets:
          - localhost
        labels:
          job: apperrs
          service: ${SERVICE_NAME}
          __path__: /vanguard/logs/app/stderr.log
promtail-config.yml
If you use the AWS CDK mentioned above, you can configure the application ECS service with the following TypeScript code. This example assumes the application writes its logs as files under the /vanguard/logs/app directory.
import { App, Stack } from 'aws-cdk-lib';
import { Repository } from 'aws-cdk-lib/aws-ecr';
import {
  Compatibility,
  ContainerImage,
  EnvironmentFile,
  LogDriver,
  NetworkMode,
  TaskDefinition,
} from 'aws-cdk-lib/aws-ecs';
import { ApplicationLoadBalancedFargateService } from 'aws-cdk-lib/aws-ecs-patterns';

import { AutoScalingFargateServiceProps } from '../type';

export class ApiServerAutoScalingFargateServiceStack extends Stack {
  constructor(scope: App, id: string, props: AutoScalingFargateServiceProps) {
    super(scope, id, props);

    const nameOfVolume = 'fargate-application-task-volume';

    /**
     * Task Definition
     */
    const taskDefinition = new TaskDefinition(this, 'ApiServerTaskDefinition', {
      networkMode: NetworkMode.AWS_VPC,
      compatibility: Compatibility.FARGATE,
      cpu: '2048',
      memoryMiB: '16384',
      volumes: [{ name: nameOfVolume }],
    });

    taskDefinition
      .addContainer('app', {
        containerName: 'app',
        image: ContainerImage.fromEcrRepository(
          Repository.fromRepositoryName(this, props.repository.id, props.repository.name),
        ),
        portMappings: [{ containerPort: 3000, hostPort: 3000 }],
      })
      .addMountPoints({ sourceVolume: nameOfVolume, containerPath: '/vanguard/logs/app', readOnly: false });

    taskDefinition
      .addContainer('Promtail', {
        containerName: 'promtail',
        image: ContainerImage.fromEcrRepository(
          Repository.fromRepositoryName(this, 'MonitoringGrafanaPromtail', 'grafana/promtail'),
        ),
        command: ['-config.file=/etc/promtail/promtail-config.yml', '-config.expand-env=true'],
        environment: { SERVICE_NAME: 'api-server' },
        logging: LogDriver.awsLogs({ streamPrefix: 'promtail' }),
      })
      .addMountPoints({ sourceVolume: nameOfVolume, containerPath: '/vanguard/logs/app', readOnly: true });

    /**
     * Fargate Service
     */
    const fargateService = new ApplicationLoadBalancedFargateService(this, 'api-server', {
      cluster: props.cluster,
      taskDefinition,
      serviceName: 'api-server',
      loadBalancerName: 'ApiServerLoadBalancer',
      cpu: 512,
      memoryLimitMiB: 4096,
      enableExecuteCommand: true,
      cloudMapOptions: { cloudMapNamespace: props.cloudMapNamespace },
    });

    // ...
  }
}
api-service-auto-scaling-fargate-service.stack.ts
 

Metric Infrastructure

Next, I'll briefly go over the metric collection system.
Metric infra with Node Exporter, Prometheus and Grafana
The Prometheus Node Exporter collects metrics about the host hardware and kernel. Metrics such as cpu, diskstats, and meminfo show how each resource on the machine is being used.
The application uses a Prometheus client library for its language to aggregate custom metrics and expose them at a specific endpoint (e.g. /metrics). The figure below shows an example of a websocket application measuring the number of live websocket connections as a metric and visualizing it in Grafana.
The number of live websocket connections measured in the websocket application
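As a rough sketch (the metric and handler names are made up for illustration), a Node.js service could track that gauge with the prom-client library and expose it on a /metrics endpoint like this:

import { createServer } from 'http';
import { Gauge, register } from 'prom-client';

// Gauge tracking the number of currently open websocket connections.
const liveConnections = new Gauge({
  name: 'websocket_live_connections',
  help: 'Number of currently open websocket connections',
});

// Call these from the websocket server's connection/close handlers.
export const onConnection = () => liveConnections.inc();
export const onClose = () => liveConnections.dec();

// Expose all registered metrics at /metrics for Prometheus to scrape.
createServer(async (req, res) => {
  if (req.url === '/metrics') {
    res.setHeader('Content-Type', register.contentType);
    res.end(await register.metrics());
  } else {
    res.writeHead(404).end();
  }
}).listen(9464);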
The Prometheus server scrapes the metrics exposed by the Node Exporter and the application, and they can be visualized by registering Prometheus as a data source in Grafana. When looking for targets to scrape metrics from, Prometheus's HTTP Service Discovery can help.
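For example, a scrape configuration using HTTP Service Discovery might look like the following sketch; the discovery URL is hypothetical and would need to return the JSON target list Prometheus expects.

scrape_configs:
  - job_name: api-server
    metrics_path: /metrics
    scrape_interval: 15s
    http_sd_configs:
      # Hypothetical endpoint returning [{"targets": ["10.0.1.23:3000"], "labels": {"service": "api-server"}}]
      - url: http://service-discovery.internal:8080/targets
        refresh_interval: 60s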

Conclusion

By the end of the project schedule, our team successfully delivered services such as attendance, rankings, posts, and live MMO games to users in the form of a web application, a mobile application, an admin webpage, and a live broadcasting webpage. Over 2,000 users enjoyed the services without any downtime, and their positive feedback was a great achievement for the team.
I doubt this would have been possible without CI/CD pipelining for reliable code integration and fast deployment, CRUD API and API-doc auto-generation tools to maximize development productivity, and code-managed cloud infrastructure and monitoring tools. Of course, it all comes down to teamwork and having the right people by your side. I hope this post serves as a reference for teams like ours that are just starting out and need to build a service from scratch quickly.