System Design Interview: Scanning Manifest Files with Snyk

Shahar Shokrani
11 min read · Aug 23, 2024


Intention:

This article is designed to guide you through the process of architecting a system for scanning manifest files with Snyk, focusing on preparing for system design interviews. We will explore the key components, challenges, and solutions involved in building such a system, providing you with practical insights into real-world system design scenarios.

The System Design Blueprint (Credit):

Requirements:
The core features of the system, each defined by the phrase “user should.” These outline the primary functionalities that the system must deliver.

Non-Functional Requirements:
The quality attributes that measure how well the system fulfills the core requirements, including performance, scalability, and reliability.

Core Entities:
The data that will be persisted or exchanged via the API. It’s acceptable to initially identify only some of these and revisit them as the design progresses.

API:
The specific requests that clients will make to fulfill the functional requirements by interacting with the core entities.

External API:
The interactions with external systems required to fulfill the system’s functionality.

High-Level Design:
A design that addresses only the functional requirements, outlining how the system will deliver its core features.

Deep Dives:
Detailed explorations focused on how the system meets the non-functional requirements, ensuring it performs efficiently and effectively.

What is Snyk?

Snyk is a developer-first security platform that helps identify and fix vulnerabilities in open-source dependencies. By integrating with various development tools, Snyk ensures that security is embedded into the software development lifecycle, helping teams maintain secure code from the earliest stages.

Functional Key Requirements We’ll Cover:

  1. Dependencies Scanning:
    Users should be able to push changes to dependency manifest files to a Source Control Manager and have those files automatically scanned, with their project dependencies cross-referenced against a constantly updated database of known vulnerabilities.
  2. Notifications and Alerts:
    Users should receive customized alerts whenever a new vulnerability is detected or a critical issue arises, keeping developers and security teams informed and ready to take action.

Out Of Scope

  • Code/CI Scans: Scans that are part of the Continuous Integration (CI) pipeline or static code analysis that might check for code quality, security issues in the code itself, or other non-dependency-related concerns (maybe in Part 2).
  • Remediation Implementation via API: While we discuss the identification of vulnerabilities, how Snyk provides and applies remediation suggestions for those vulnerabilities via API is out of scope.

Non-Functional Requirements:

Here are the non-functional requirements for each core requirement, emphasizing the quality attributes that determine how well the requirement will perform:

1. Dependencies Scanning

  1. Fast scanning is critical to providing developers with immediate feedback, enabling them to quickly address vulnerabilities or issues without disrupting the development workflow. Prolonged scanning times can cause delays and frustration, impacting overall project efficiency. Therefore, optimizing for Performance is essential.
  2. The platform should seamlessly scale to manage concurrent scans across numerous projects and repositories, maintaining consistent performance. Therefore, optimizing for Scale is essential.
  3. The scanning process must accurately identify dependencies and match them with the correct entries in the vulnerability database, ensuring minimal false positives and negatives. Therefore, optimizing for Consistency is essential.

2. Notifications and Alerts

  1. As with dependency scanning, Performance matters here: alerts should reach users shortly after a vulnerability is detected.
  2. The notification system must guarantee delivery, even during high traffic or system load, ensuring that critical alerts are never missed. Therefore, optimizing for Reliability is essential.

By defining these non-functional requirements, we establish the quality standards that each core feature must meet to perform well in a production environment.

This approach ensures that Snyk not only functions correctly but also delivers a robust, scalable, and user-friendly experience.

Core Entities

  • Webhook Event: Represents the event triggered by the Source Control Management (SCM) system (e.g., GitHub, GitLab) when a code push occurs. It includes metadata about the repository, the commit, and the files that have been modified.
  • Dependency Manifest Files: Configuration files used in software projects to declare the external libraries, packages, or modules that the project depends on (pom.xml, package.json, requirements.txt ...).
  • Dependency: A library, package, or module parsed from a dependency manifest file (e.g., lodash@4.17.21).
  • Vulnerability: An external entity representing a known vulnerability in a dependency; we could persist only its ID and fetch the details from the provider.
  • Alert: Represents a notification sent to users when a vulnerability or critical issue is detected.
  • User: Represents an individual or team using Snyk to monitor and secure their projects.

API

1. Dependencies Scanning

Webhook-Triggered Scan: POST /api/v1/scan { metadata } → 202 Accepted

When changes to dependency manifest files occur in a platform like GitHub or GitLab, a webhook triggers SnykCore to start the scanning process (we would like to have a separate endpoint for each SCM’s webhook).
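As a rough sketch of such an endpoint (Flask and the in-memory queue stand-in are my assumptions, and the payload fields are illustrative rather than any SCM’s exact shape), the handler only extracts the metadata, enqueues it, and acknowledges immediately:

```python
# Minimal sketch of the webhook-triggered scan endpoint. Flask and the
# in-memory queue are stand-ins; in production the queue would be SQS/Kafka
# and the payload shape depends on the SCM sending the webhook.
import queue
from flask import Flask, request, jsonify

app = Flask(__name__)
event_queue = queue.Queue()  # stand-in for a durable message queue

@app.route("/api/v1/scan", methods=["POST"])
def scan_webhook():
    payload = request.get_json(force=True)
    # Keep only the metadata needed downstream: repo, commit, changed files.
    event_queue.put({
        "repo": payload["repository"]["full_name"],
        "commit": payload.get("after"),
        "changed_files": payload.get("changed_files", []),
    })
    # Acknowledge immediately so the SCM does not retry or time out.
    return jsonify({"status": "accepted"}), 202
```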

2. Notifications and Alerts

Manage Alerts: GET /api/v1/alerts → Alert[]

The API allows users to retrieve alerts based on the outcomes of dependency scans.

Also, users can be notified through various channels like SMS, email, Slack, etc.
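A hedged usage example of the alerts endpoint (the base URL, auth header, and the severity/status filter parameters are assumptions based on the criteria discussed later):

```python
# Example client call for GET /api/v1/alerts with filters (URL, token, and
# query parameter names are illustrative assumptions).
import requests

resp = requests.get(
    "https://snykcore.example.com/api/v1/alerts",
    params={"severity": "critical", "status": "open"},
    headers={"Authorization": "Bearer <token>"},
    timeout=10,
)
resp.raise_for_status()
for alert in resp.json():
    print(alert["id"], alert["severity"], alert["dependency"])
```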

External API

1. Getting the file content from SCMs

Getting file content: GET /api/{scm}/{repo}/{file_path}

This API endpoint is used to retrieve the contents of a dependency manifest file from a Source Control Management (SCM) system such as GitHub or GitLab.
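For example, against GitHub’s contents API (one concrete SCM; the owner, repo, path, and ref values below are placeholders):

```python
# Fetching a manifest file's content from GitHub's contents API.
import base64
import requests

def fetch_manifest(owner: str, repo: str, path: str, ref: str, token: str) -> str:
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    resp = requests.get(
        url,
        params={"ref": ref},
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        timeout=10,
    )
    resp.raise_for_status()
    # GitHub returns the file body base64-encoded in the "content" field.
    return base64.b64decode(resp.json()["content"]).decode("utf-8")
```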

2. Getting the Vulnerability from Security Advisory Repo

Getting vulnerability: GET /api/v1/vulnerabilities/{dependency}

This API endpoint is used to check the security vulnerabilities associated with specific dependencies by querying a security advisory database (e.g., GitHub Security Advisory, NVD).
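As one concrete advisory source, the open OSV database exposes a query endpoint; a sketch (OSV is my example choice for illustration, not necessarily the provider Snyk queries):

```python
# Looking up known vulnerabilities for a single dependency via the OSV API.
import requests

def lookup_vulnerabilities(name: str, version: str, ecosystem: str = "npm") -> list:
    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"package": {"name": name, "ecosystem": ecosystem},
              "version": version},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("vulns", [])

# e.g. lookup_vulnerabilities("lodash", "4.17.21")
```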

High Level Design

1. Dependency Scanning

At first, we will begin with the Monolith Approach, which we might name SnykCore: the entire process of dependency scanning and vulnerability database integration is managed within a single service.

Workflow:

  1. Webhook Reception: The API receives a webhook event with metadata about changes to dependency manifest files.
  2. Ecosystem Check: SnykCore checks if the ecosystem (e.g., npm, Maven) is supported by one of our vulnerability providers.
  3. Files Fetch: SnykCore fetches the specific dependency manifest files that have been modified, as indicated in the webhook event.
  4. Dependency Extraction: The file is parsed, and its dependencies are extracted.
  5. Vulnerability Lookup: SnykCore queries external APIs to check for vulnerabilities associated with the extracted dependencies.
  6. Result Aggregation: The results from the external APIs are aggregated.
  7. Data Persistence: The system persists the Dependencies, Vulnerabilities, and the Scan details in the database.
Dependency Scanning: Monolith Approach (SnykCore)

The flow is pretty clear-cut. SnykCore receives the file and its metadata, checks if the ecosystem is supported, extracts dependencies, looks up vulnerabilities, aggregates the results, and then persists everything in the database. It’s a neat, end-to-end process housed within a single monolithic application.
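To make the flow concrete, here is a compressed sketch of that pipeline for a single webhook event. It reuses the fetch_manifest and lookup_vulnerabilities helpers sketched above; the ecosystem map and the parse_manifest/save_scan stubs are illustrative assumptions of my own, not Snyk’s implementation.

```python
# Condensed sketch of the SnykCore monolith flow for one webhook event.
import json

SUPPORTED_ECOSYSTEMS = {"package.json": "npm", "pom.xml": "Maven",
                        "requirements.txt": "PyPI"}

def handle_event(event: dict, token: str) -> None:
    owner, repo = event["repo"].split("/")
    for path in event["changed_files"]:
        filename = path.rsplit("/", 1)[-1]
        ecosystem = SUPPORTED_ECOSYSTEMS.get(filename)
        if ecosystem is None:                                 # 2. ecosystem check
            continue
        content = fetch_manifest(owner, repo, path, event["commit"], token)  # 3. fetch
        deps = parse_manifest(filename, content)              # 4. extraction
        results = {}
        for dep in deps:                                      # 5. vulnerability lookup
            name, version = dep.rsplit("@", 1)
            results[dep] = lookup_vulnerabilities(name, version, ecosystem)
        save_scan(event, results)                             # 6-7. aggregate + persist

def parse_manifest(filename: str, content: str) -> list:
    # Only package.json is handled here; every other format needs its own parser.
    if filename == "package.json":
        deps = json.loads(content).get("dependencies", {})
        return [f"{name}@{version.lstrip('^~')}" for name, version in deps.items()]
    return []

def save_scan(event: dict, results: dict) -> None:
    # Stub: persist the Dependencies, Vulnerabilities, and Scan rows in the database.
    print(event["repo"], {dep: len(vulns) for dep, vulns in results.items()})
```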

But while this approach demonstrates a solid working high-level design, there are issues we should be aware of:

  • Ecosystem Complexity: Verifying if the file’s ecosystem is supported by the vulnerability providers is straightforward initially. However, as more ecosystems are supported, the logic could become increasingly complex, potentially making the system harder to maintain.
  • Dependency Extraction: Handling different file formats like pom.xml, package.json, and requirements.txt requires specific parsing logic for each format. This adds complexity and could make the system more difficult to manage as the number of supported formats grows.
  • External API Risks: Querying external APIs is inherently risky. There’s always the possibility of rate limits, timeouts, or downtime on the provider’s side. In a monolithic architecture like SnykCore, any issues with one provider could impact the entire application, potentially causing delays or failures in processing.
  • Result Aggregation Challenges: Aggregating results is crucial for providing a cohesive view of vulnerabilities, but it involves pulling data from multiple sources and combining it accurately. Ensuring consistency in how data is aggregated and presented could be challenging, especially if different APIs provide varying levels of detail or different data structures.
  • Database Scalability: As the amount of data grows, the database could become a bottleneck, particularly if it’s not optimized for handling large volumes of dependency and vulnerability data. This could impact the performance and reliability of the system.
  • Webhook Latency and Downtime: The most problematic issue arises when the service experiences latency or downtime. When the SCM (Source Control Management) system sends a webhook with changes, SnykCore must process them quickly. If SnykCore is down or slow to respond, the webhook payload could be lost because the SCM won’t wait for the service to recover. Additionally, if the system goes down while processing a file, that process could be interrupted, potentially leaving vulnerabilities unscanned.

It’s a solid approach, particularly for a small startup looking to move quickly. However, it’s important to remain mindful of these potential challenges as the system evolves.

2. Notifications and Alerts

The system is designed to deliver timely and relevant alerts through multiple channels, enabling teams to take prompt action.

Workflow:

  1. Alert Generation in SnykCore: Upon completing a dependency scan, SnykCore generates an alert if vulnerabilities or critical issues are detected and immediately places it into the AlertQueue.
  2. AlertService: Fetches alerts from the AlertQueue, stores each alert in the database, and places it in the NotificationQueue.
  3. NotificationService: Retrieves notifications from the NotificationQueue and sends them through the pre-configured channels.
  4. User Interaction and Alert Management:
    After the user receives a notification with the alert IDs, they interact with the alerts via the GET /api/v1/alerts API endpoint. This API allows users to filter and retrieve alerts based on criteria such as severity and status.
Notifications and Alerts: Monolith Approach (SnykCore)

This approach keeps alerts and notifications within the monolithic SnykCore architecture, ensuring that users are promptly informed and able to manage vulnerabilities efficiently.
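To make the handoff concrete, here is a minimal sketch of the two workers. The in-memory queues and the save_alert/send_via_channel stubs are stand-ins I have assumed; in production these would be durable queue consumers and real channel integrations.

```python
# Sketch of AlertService and NotificationService as queue workers.
import itertools
import queue

alert_queue = queue.Queue()
notification_queue = queue.Queue()
_alert_ids = itertools.count(1)

def alert_service_loop():
    while True:
        alert = alert_queue.get()            # produced by SnykCore after a scan
        alert_id = save_alert(alert)         # persist first...
        notification_queue.put({**alert, "alert_id": alert_id})  # ...then hand off

def notification_service_loop():
    while True:
        notification = notification_queue.get()
        for channel in notification.get("channels", ["email"]):
            send_via_channel(channel, notification)   # Slack / email / SMS integration

def save_alert(alert: dict) -> int:
    # Stub: insert into the alerts table and return its primary key.
    return next(_alert_ids)

def send_via_channel(channel: str, notification: dict) -> None:
    print(f"[{channel}] alert {notification['alert_id']}: {notification.get('title')}")
```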

But while this approach demonstrates a solid working high-level design, there are issues we should be aware of:

  • GET /alerts performance: filtering queries against a relational database can become slow as the number of alerts grows.
  • AlertService has two roles: inserting the alert into the database and pushing it onto the notification queue; if one succeeds and the other fails, consistency suffers.

Deep Dives

Now let’s address our non-functional requirements one by one and see how they affect our design.

1. Dependencies Scanning

Performance

At first, we could introduce an EventHooks queue to make sure the SCM’s webhooks get an immediate 202 Accepted response.

Introducing an Event Queue for the incoming webhook events

1. Dependencies Scanning — Scalability

To improve performance and scalability, SnykCore can be separated into distinct services, each responsible for specific tasks in the dependency scanning process. Here’s how this separation could be structured:

EventService: This service handles events from the event queue, retrieves the actual file from the Source Control Management (SCM) systems (e.g., GitHub, GitLab), and stores it in a blob storage (e.g., AWS S3, Azure Blob Storage). The service then sends the blob’s URL, along with the event object, to the ParsingQueue for further processing.

Redundant Download Issue: The initial step involves downloading the file from the SCM and uploading it to blob storage, which can seem redundant. However, this approach is necessary unless the SCM supports direct download of files to blob storage using pre-signed URLs.

Why not send the file directly to the queue? While sending the file directly to the queue might seem reasonable, it’s important to note that most message queues are designed to handle relatively small payloads, typically up to 1MB. Sending larger files through the queue would not only exceed these limits but also slow down the queue's performance and introduce unnecessary complexity. By storing the file in blob storage and passing the URL via the queue, we keep the messages lightweight and maintain the efficiency and speed of the system. You could always add logic to skip the blob step when the file size is below the queue's threshold.

Event Service: reads from the Event Queue, gets the file from the SCM, saves it in blob storage, and sends a message to the Parse Queue
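A compact sketch of this EventService worker (the S3 bucket name and queue objects are assumptions, and fetch_manifest is the helper sketched earlier):

```python
# Sketch of the EventService worker: pull an event, download the manifest,
# store it in S3, and forward a lightweight message to the ParsingQueue.
import boto3

s3 = boto3.client("s3")
BUCKET = "snykcore-manifests"            # assumed bucket name

def event_service_loop(event_queue, parsing_queue, token: str):
    while True:
        event = event_queue.get()
        for path in event["changed_files"]:
            owner, repo = event["repo"].split("/")
            content = fetch_manifest(owner, repo, path, event["commit"], token)
            key = f"{event['repo']}/{event['commit']}/{path}"
            s3.put_object(Bucket=BUCKET, Key=key, Body=content.encode("utf-8"))
            # Only the blob URL plus metadata travels through the queue, not the file.
            url = s3.generate_presigned_url(
                "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=3600)
            parsing_queue.put({"event": event, "file": path, "blob_url": url})
```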

Parsing Services: These services are dedicated to parsing dependency files. They are optimized for handling specific file types (e.g., pom.xml, package.json, requirements.txt). As with the EventService, each parsing service can be easily scaled horizontally, and the extracted dependencies are inserted into the dependency queue.

For the parse queue to handle the scale, we could partition it by file type, ensuring that each parsing service instance only processes files it is optimized for.

Incremental Parsing: If the SCM can support it, we could implement incremental parsing techniques where the system only parses the parts of the file that have changed since the last scan. This can reduce processing time, especially for large projects with minimal changes.

Parsing Services: read from the Parse Queue and push the extracted dependencies to the Dependency Queue
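A sketch of one parsing service instance dedicated to package.json (the parser registry, queue objects, and blob download via a pre-signed URL are illustrative assumptions):

```python
# Sketch of a parsing service instance specialized for one file type.
import json
import requests

def parse_package_json(content: str) -> list:
    data = json.loads(content)
    deps = {**data.get("dependencies", {}), **data.get("devDependencies", {})}
    return [f"{name}@{version.lstrip('^~')}" for name, version in deps.items()]

PARSERS = {"package.json": parse_package_json}   # pom.xml, requirements.txt, ...

def parsing_service_loop(parsing_queue, dependency_queue):
    while True:
        msg = parsing_queue.get()                # queue partitioned by file type
        filename = msg["file"].rsplit("/", 1)[-1]
        content = requests.get(msg["blob_url"], timeout=10).text
        deps = PARSERS[filename](content)
        dependency_queue.put({"event": msg["event"], "dependencies": deps})
```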

Dependency Service: This service is responsible for checking the extracted dependencies against a vulnerability database by querying external vendor APIs. The service first checks an in-memory cache (e.g., Redis) for known vulnerabilities to minimize external API calls.

Dependency Service: reads from the Dependency Queue, checks Redis, queries the security APIs on cache misses, and persists the results in the DB.
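A sketch of that cache-then-API pattern (the Redis key scheme and TTL are assumptions; lookup_vulnerabilities and save_scan are the helpers from the earlier sketches):

```python
# Sketch of the DependencyService: check Redis first, fall back to the external
# advisory API, then persist the aggregated results.
import json
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL = 60 * 60 * 6          # re-check a dependency every 6 hours (assumed)

def check_dependency(dep: str, ecosystem: str = "npm") -> list:
    cached = cache.get(f"vulns:{dep}")
    if cached is not None:
        return json.loads(cached)              # cache hit: no external call
    name, version = dep.rsplit("@", 1)
    vulns = lookup_vulnerabilities(name, version, ecosystem)
    cache.setex(f"vulns:{dep}", CACHE_TTL, json.dumps(vulns))
    return vulns

def dependency_service_loop(dependency_queue):
    while True:
        msg = dependency_queue.get()
        results = {dep: check_dependency(dep) for dep in msg["dependencies"]}
        save_scan(msg["event"], results)        # persistence stub from the earlier sketch
```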

Dependencies Scanning — Consistency

To ensure the consistency requirements are met, we could address these issues:

External API calls: calls against the source control management repo or the security advisories might fail due to network issues, rate limits, or temporary outages. We can address this with retries and exponential backoff, re-queuing failed messages.
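A minimal sketch of retries with exponential backoff and jitter around such a call (the attempt limits and delays are arbitrary; after the last attempt the message would be re-queued or dead-lettered):

```python
# Retry an external call with exponential backoff plus jitter.
import random
import time
import requests

def call_with_backoff(func, *args, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return func(*args)
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise                      # let the queue redeliver or dead-letter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)

# e.g. call_with_backoff(lookup_vulnerabilities, "lodash", "4.17.21")
```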

Consistency Checks: We could periodically poll the client repository to make sure no webhooks were lost.

Dependencies Scanning with non-functional requirements addressed.

2. Notifications and Alerts — Performance

To enhance the performance of the GET /alerts/ API, especially when retrieving large sets of alerts based on various criteria such as severity, status, or time range, integrating Elasticsearch can significantly improve query performance and response times.

Adding ElasticSearch
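A sketch of how GET /alerts could be served from an Elasticsearch index (the index name "alerts", the field names, and the local endpoint are assumptions):

```python
# Serving filtered alert queries from Elasticsearch via its REST search API.
from typing import Optional
import requests

def search_alerts(severity: Optional[str] = None, status: Optional[str] = None,
                  size: int = 50) -> list:
    filters = []
    if severity:
        filters.append({"term": {"severity": severity}})
    if status:
        filters.append({"term": {"status": status}})
    body = {"query": {"bool": {"filter": filters}},
            "sort": [{"created_at": "desc"}],
            "size": size}
    resp = requests.post("http://localhost:9200/alerts/_search", json=body, timeout=10)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```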

2. Notifications and Alerts — Reliability

Change Data Capture (CDC): We could add a CDC pipeline, such as Kafka Connect, between the database and the NotificationService, and between the database and Elasticsearch.
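A sketch of the downstream side of that pipeline (the topic name and change-event shape are assumptions; the stream itself would be produced by Kafka Connect, e.g. with Debezium, watching the alerts table, and notification_queue refers to the earlier AlertService sketch):

```python
# Consume the CDC stream of the alerts table and fan out to Elasticsearch
# and the notification path.
import json
import requests
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "snykcore.public.alerts",                       # assumed CDC topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def index_into_elasticsearch(alert: dict) -> None:
    # Keep the alerts search index in sync with the database row.
    requests.put(f"http://localhost:9200/alerts/_doc/{alert['id']}",
                 json=alert, timeout=10)

for message in consumer:
    change = message.value
    if change.get("op") == "c":                      # a newly inserted alert row
        alert = change["after"]
        index_into_elasticsearch(alert)
        notification_queue.put(alert)                # NotificationQueue from the earlier sketch
```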

Unacknowledge Alert Endpoint: To provide users with the ability to manage the status of their alerts more effectively, an endpoint for unacknowledging an alert is necessary. This allows users to revert an alert’s status if it was mistakenly acknowledged or resolved: POST /api/v1/alerts/{eventId}/status.

Notifications and Alerts with non-functional requirements addressed

Final system design diagram:

Full diagram

Conclusion

Designing a system for scanning manifest files with Snyk requires balancing immediate needs with future scalability. Starting with a monolithic approach (SnykCore) provides a solid foundation, but as the system grows, it’s essential to address challenges like ecosystem complexity, external API risks, and database scalability.

By introducing event queues, service separation, and caching, we can ensure the system remains resilient and performant. These design strategies not only meet current requirements but also prepare the system for future growth. Whether for interviews or real-world applications, this approach equips you to build scalable, reliable solutions.

Buy me a coffee.
