Using the Open Source Insights dataset to analyze the security and compliance of your dependencies

By mullaned2002

November 18, 2022

Recent security events, like the Log4j issue, underscore the need for security teams to be vigilant in evaluating not only the code written by developers in-house, but also the open source packages that code depends on, and the code that those open source packages depend on, and so on. This web of dependencies forms a dependency graph, in which each node can introduce security vulnerabilities and other surprises. Reviewing all of these dependencies is a complex task for security teams.

Visualization of the dependency graph of TensorFlow 2.10

Security teams generally use one of two approaches to evaluate dependencies for vulnerabilities. With the first approach, teams research every dependency and its security posture at the start of application development and then, once the project is underway, continue to monitor and analyze any changes to each dependency’s security posture. The drawback is that continually checking every dependency is tedious and difficult. This is why security teams frequently choose the second approach, which is to rely on commercial software that evaluates security posture. Unfortunately, this approach limits the control that security teams have over how dependency evaluations are integrated into their workflows.

Now, security teams have a new option for reviewing the security posture of dependencies! Google recently announced the Open Source Insights dataset, which was created to help developers better understand the structure and security of the software they use, and provide access to critical software supply chain information for developers, maintainers, and consumers of open-source software. The Open Source Insights project scans millions of open-source packages from the npm, Go, Maven, PyPI, and Cargo ecosystems and computes their dependency graphs. It then annotates those graphs with security advisories, license information, popularity metrics, and other metadata. Google teams are working to regularly update the dataset, specifically with information from Open Source Vulnerabilities and OpenSSF Scorecard, to make sure the data is current and relevant while also providing a snapshotted view of changes over time.

Let’s get hands-on with a few ways your security team can use this dataset.

Using the Open Source Insights dataset

The Open Source Insights dataset is available as part of the Google Cloud Public Dataset Program, and can be explored both using SQL in BigQuery and using the interactive UI at deps.dev. The UI is especially useful for visualizing the dependency graph, while the BigQuery option enables you to write complex, custom queries to analyze the data. This post focuses on accessing and analyzing the data in BigQuery, with example queries you can adopt and adapt along the way.

For research across the ecosystem

The Open Source Insights dataset can be used to research packages across an ecosystem as a whole. Going back to Log4Shell as an example, it was determined at the time of the advisory that more than 35,000 Java packages, amounting to over 8% of the Maven Central repository, were impacted by the Log4j vulnerability. If your team wanted to research all packages that depend on Log4j today, you could run a single query against the Open Source Insights dataset. That query scans over 4 TB, which shows how large this dataset is and how much work goes into building it on the backend: the dependencies table is 4 TB because it contains a row for every dependency of every package in npm, Maven Central, PyPI, Cargo, and Go.

Java packages that depend on the Log4j package as of 11/1/22 snapshot
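
As a rough sketch, a research query of this kind could be assembled as follows. Python is used only to parameterize the dependency name and snapshot day. The table name `Dependencies` and the columns `SnapshotAt`, `System`, `Name`, and `Dependency.System` / `Dependency.Name` are assumptions about the schema and should be checked against docs.deps.dev before running anything. The helper `dependents_query` is hypothetical.

```python
from datetime import date

# Assumed names: verify the table and its columns against docs.deps.dev.
DEPENDENCIES_TABLE = "bigquery-public-data.deps_dev_v1.Dependencies"
# Maven packages are named groupId:artifactId.
LOG4J_CORE = "org.apache.logging.log4j:log4j-core"


def dependents_query(dependency: str, snapshot_day: date) -> str:
    """Build a query listing Maven packages that depend on `dependency`."""
    # Keep the literal simple: refuse anything that could break out of it.
    if "'" in dependency or "\\" in dependency:
        raise ValueError("unexpected character in package name")
    return f"""
SELECT DISTINCT
  Name
FROM
  `{DEPENDENCIES_TABLE}`
WHERE
  TIMESTAMP_TRUNC(SnapshotAt, DAY) = TIMESTAMP('{snapshot_day.isoformat()}')
  AND System = 'MAVEN'
  AND Dependency.System = 'MAVEN'
  AND Dependency.Name = '{dependency}'
""".strip()


if __name__ == "__main__":
    print(dependents_query(LOG4J_CORE, date(2022, 11, 1)))
```

Because a query like this reads a very large table, it is worth checking the estimated bytes processed in the BigQuery console before running it.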

For finding vulnerabilities and licenses that your code uses

Now let’s transition from the ecosystem as a whole and zoom in on your codebase. Imagine your team is building a JavaScript application using npm, and you want to audit the dependencies to ensure that each is compliant and contains no vulnerabilities. You could take the full transitive dependencies from your lock file and query the deps.dev dataset to check the license of each one, and then look for any vulnerabilities that might be lurking in your dependency depths. This process could take days if done manually, but with automation this review can be completed much faster.

For example, the query below fetches the licenses for specific versions of the react, react-dom, react-is, react-router, qs, and send packages.

```sql
-- Temporary table holding the exact package versions to audit.
CREATE TEMP TABLE Packages (
  System STRING,
  Name STRING,
  Version STRING)
CLUSTER BY
  System,
  Name,
  Version;

INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
  ('NPM', 'react-dom', '16.0.0'),
  ('NPM', 'react-is', '16.13.1'),
  ('NPM', 'react-router', '5.2.1'),
  ('NPM', 'qs', '6.3.1'),
  ('NPM', 'send', '0.1.0');

-- Group the audited packages by license.
SELECT
  License,
  ARRAY_AGG(
    STRUCT(
      p.System,
      p.Name,
      p.Version)) AS Packages
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Licenses AS License
JOIN
  Packages
USING
  (System, Name, Version)
GROUP BY
  License;
```

Similarly, this query finds vulnerabilities in the same versions of the react, react-dom, react-is, react-router, qs, and send packages. Using the severity information it returns, security teams can tackle the most severe vulnerabilities first.

```sql
-- Temporary table holding the exact package versions to audit.
CREATE TEMP TABLE Packages (
  System STRING,
  Name STRING,
  Version STRING)
CLUSTER BY
  System,
  Name,
  Version;

INSERT INTO Packages
VALUES ('NPM', 'react', '16.0.0'),
  ('NPM', 'react-dom', '16.0.0'),
  ('NPM', 'react-is', '16.13.1'),
  ('NPM', 'react-router', '5.2.1'),
  ('NPM', 'qs', '6.3.1'),
  ('NPM', 'send', '0.1.0');

-- Collect the advisories that affect each audited package version.
SELECT
  p.System,
  p.Name,
  p.Version,
  ARRAY_AGG(
    STRUCT(
      a.Source,
      a.SourceID,
      a.Title,
      a.Severity,
      a.GitHubSeverity)) AS Advisories
FROM
  `bigquery-public-data.deps_dev_v1.PackageVersionsLatest` AS p,
  p.Advisories AS Advisory
JOIN
  Packages
USING
  (System, Name, Version)
JOIN
  `bigquery-public-data.deps_dev_v1.AdvisoriesLatest` AS a
ON
  a.Source = Advisory.Source
  AND a.SourceID = Advisory.SourceID
GROUP BY
  System,
  Name,
  Version;
```
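
The `Packages` table in the queries above is filled in by hand. One possible way to populate it from a lock file is sketched below. It assumes an npm `package-lock.json` with `lockfileVersion` 2 or 3, where the `packages` map is keyed by paths such as `node_modules/a/node_modules/b`. The helpers `packages_from_lock` and `insert_statement` and the sample data are hypothetical.

```python
import json
import re

# Conservative pattern for values considered safe to embed in a SQL string literal.
_SAFE = re.compile(r"^[A-Za-z0-9@/._+~-]+$")

# Tiny stand-in for a real package-lock.json (lockfileVersion 2/3 layout assumed).
SAMPLE_LOCK = json.loads("""
{
  "lockfileVersion": 3,
  "packages": {
    "": {"name": "my-app", "version": "1.0.0"},
    "node_modules/react": {"version": "16.0.0"},
    "node_modules/send": {"version": "0.1.0"},
    "node_modules/send/node_modules/qs": {"version": "6.3.1"}
  }
}
""")


def packages_from_lock(lock: dict) -> list[tuple[str, str, str]]:
    """Return sorted, de-duplicated ('NPM', name, version) rows from a lock file."""
    rows = set()
    for path, meta in lock.get("packages", {}).items():
        # Skip the root project, workspace folders, and links without a version.
        if "node_modules/" not in path or "version" not in meta:
            continue
        # The package name is whatever follows the last "node_modules/".
        name = path.rsplit("node_modules/", 1)[-1]
        rows.add(("NPM", name, meta["version"]))
    return sorted(rows)


def insert_statement(rows: list[tuple[str, str, str]]) -> str:
    """Render an INSERT statement for the Packages temp table."""
    for row in rows:
        for value in row:
            if not _SAFE.match(value):
                raise ValueError(f"refusing to embed {value!r} in SQL")
    values = ",\n  ".join(f"('{s}', '{n}', '{v}')" for s, n, v in rows)
    return f"INSERT INTO Packages\nVALUES {values};"


if __name__ == "__main__":
    print(insert_statement(packages_from_lock(SAMPLE_LOCK)))
```

In practice the dictionary would come from `json.load` on your project's own lock file, and query parameters may be a safer choice than string building for anything beyond a quick audit.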

Integrate queries into a CI/CD pipeline

In addition to running queries for ad-hoc analysis, you can also operationalize the analysis and insights from these queries in your CI/CD pipeline using the BigQuery API and CLI.
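
A minimal sketch of such a pipeline step is shown below, assuming the `bq` CLI is installed and authenticated on the build agent. It assumes `sql` is a single SELECT statement whose rows have the same shape as the advisory query above (`Name`, `Version`, and an `Advisories` array), and that `Severity` holds labels such as `CRITICAL` or `HIGH`. Both the severity labels and the nested JSON output shape are assumptions to verify against the dataset documentation. The function names are hypothetical.

```python
import json
import subprocess
import sys

# Assumption: Severity is a label such as "CRITICAL" or "HIGH".
BLOCKING = frozenset({"CRITICAL", "HIGH"})


def run_query(sql: str) -> list:
    """Run a standard SQL query with the bq CLI and return its rows."""
    out = subprocess.run(
        ["bq", "--format=json", "query", "--use_legacy_sql=false", sql],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # Treat empty output as "no rows".
    return json.loads(out.strip() or "[]")


def blocking_findings(rows, blocking=BLOCKING):
    """Return (name, version, advisory title) for each blocking advisory."""
    findings = []
    for row in rows:
        for advisory in row.get("Advisories", []):
            if str(advisory.get("Severity", "")).upper() in blocking:
                findings.append(
                    (row["Name"], row["Version"], advisory.get("Title", ""))
                )
    return findings


def gate(sql: str) -> int:
    """Return a process exit code: 1 if any blocking advisory is found."""
    findings = blocking_findings(run_query(sql))
    for name, version, title in findings:
        print(f"{name}@{version}: {title}", file=sys.stderr)
    return 1 if findings else 0
```

A build script could call `sys.exit(gate(sql))` so that the pipeline stops when a blocking advisory appears, while lower-severity findings are left for regular review.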

Next steps

These are just a few of the many ways to use the Open Source Insights dataset to understand and analyze the security posture of your codebase’s dependencies.

You can get started with the Open Source Insights dataset through its visual UI on deps.dev; for example, this graph showing exactly how a version of Apache Pulsar from late last year depends on log4j-core. You can also get started with the dataset by creating a BigQuery sandbox and running SQL queries on the BigQuery dataset. The BigQuery sandbox enables you to query data within the limits of the free tier without needing a credit card. If you decide to enable billing and go above the free tier threshold, you are subject to regular BigQuery pricing.

To learn more about the Open Source Insights dataset, head to the Marketplace listing and access detailed documentation at docs.deps.dev.
