Introducing Open Source Insights data in BigQuery to help secure software supply chains

By mullaned2002

May 12, 2022

462

Today we’re announcing a new Google Cloud Dataset from Open Source Insights which will help developers better understand the structure and security of the software they use. This dataset provides access to critical software supply chain information for developers, maintainers and consumers of open-source software.

Your users rely not only on the code you write, but also on the code your code depends on, the code that code depends on, and so on. This web of dependencies forms a dependency graph, and while each node in the graph brings useful functionality to your project, they may also introduce security vulnerabilities, licensing issues, or other surprises, as recent events like the log4j issue demonstrated. To understand your code, you must have an accurate view of its dependency graph.

The Open Source Insights project scans millions of open-source packages from the npm, Go, Maven, PyPI, and Cargo ecosystems, computes their dependency graphs, and annotates those graphs with security advisories, license information, popularity metrics, and other metadata. The dataset is regularly updated, keeping it current and relevant while also providing a snapshotted view of change over time. Generated by resolving each package’s dependency constraints, this data provides precise, accurate, and actionable dependency graphs.

The rate of change in open-source packages is significant. Our analysis shows that roughly 15% of the packages in npm see changes to their dependency sets each day, and for 40,000 of those packages (2% of packages in npm) this results in a change to their license or advisory set. Keeping up with these changes is critical yet intractable without good tooling.

This new dataset allows anyone to use Google Cloud BigQuery to explore and analyze the dependencies, advisories, ownership, license and other metadata of open-source packages across supported ecosystems, and how this metadata has changed over time.

We are eagerly looking forward to seeing how this data will be used. Whether you’re a developer, security engineer, or researcher, you can use this public dataset to analyze components of your software supply chain, and integrate this information with your existing tools and pipelines.

How the Open Source Insights dataset works

We’re bringing Google’s mission to “organize the world’s information and make it universally accessible and useful” to open-source software. Open Source Insights examines each package in the packaging systems we cover, including npm, Go, Maven (Java), PyPI (Python), and Cargo (Rust) and more to come. A full, detailed graph of its dependencies and their properties is constructed and annotated with security advisory, license, owner, release information and other metadata, making a rich dataset covering entire package management language ecosystems.

The dataset is updated regularly, making this a valuable resource for tracking ecosystem level changes over time, analyzing the scope and impact of issues, or integrating into custom dashboards and build systems.

Getting started with the Open Source Insights dataset

To begin exploring these public dataset tables, you can look at the schema and try some sample queries, like the following examples. As with all other Google Cloud Datasets, users can obtain access without charges of up to 1TB/month in queries and up to 10GB/month in storage through BigQuery’s free tier. SQL queries above these thresholds are subject to regular BigQuery pricing. Users can also leverage the BigQuery sandbox to access BigQuery without the need to create a Google Cloud account or provide credit card information, subject to the sandbox’s limits and BigQuery’s free tier thresholds.

What are the most common licenses across each ecosystem?

We can aggregate the license data across packages within each dependency management system to get a list of the top three licenses per system. To do so we first find the newest snapshot in the dataset. Then within that snapshot we count the number of unique packages with at least one version using each license (multiple versions of a package are not double counted).

code_block[StructValue([(u’code’, u’– Find the most recent snapshot.rnDECLARErn Time TIMESTAMP DEFAULT (rn SELECTrn MAX(Time)rn FROMrn `bigquery-public-data.deps_dev_v1.Snapshots`);rnrnWITHrn — Compute the count of unique packages per system and license.rn Counts AS (rn SELECTrn System,rn License,rn COUNT(DISTINCT Name) AS NPackagesrn FROMrn `bigquery-public-data.deps_dev_v1.PackageVersions`rn CROSS JOINrn UNNEST(Licenses) AS Licensern WHERErn SnapshotAt = Timern GROUP BYrn System,rn License),rn — Compute a rank for each license within its systemrn Ranked AS (rn SELECTrn System,rn License,rn NPackages,rn ROW_NUMBER() OVER (PARTITION BY System ORDER BY NPackages DESC ) AS LicenseRankrn FROMrn Counts)rnrn– Finally output the top 3 per system.rnSELECTrn System,rn License,rn NPackagesrnFROMrn RankedrnWHERErn LicenseRank <= 3rnORDER BYrn System,rn LicenseRank;’), (u’language’, u”)])]

What are the most depended upon package versions?

We can use the dependency graphs to identify the most depended upon package versions in the cargo ecosystem. To do so, we filter all packages and available versions for just the release with the highest semantic version per package. We then sum the number of these highest release versions that depend on each version.

code_block[StructValue([(u’code’, u”– The dependency management system whose packages we will query. rnDECLARErn Sys STRING DEFAULT ‘CARGO’;rnrn– Find the most recent snapshot.rnDECLARErn Time TIMESTAMP DEFAULT (rn SELECTrn MAX(Time)rn FROMrn `bigquery-public-data.deps_dev_v1.Snapshots`);rnrnWITHrn — Select just the package-versions that are considered releasesrn — in the system of interest.rn Releases AS (rn SELECTrn Name,rn Version,rn VersionInforn FROMrn `bigquery-public-data.deps_dev_v1.PackageVersions`rn WHERErn SnapshotAt = Timern AND VersionInfo.IsReleasern AND System = Sys),rn — For each package, find its release with the highest version number.rn HighestReleases AS (rn SELECTrn Name,rn Versionrn FROM (rn SELECTrn Name,rn Version,rn ROW_NUMBER() OVER (PARTITION BY Name ORDER BY VersionInfo.Ordinal DESC) AS RowNumberrn FROMrn Releases)rn WHERErn RowNumber = 1)rnrn– Finally compute the number of dependents per package-version andrn– rank package-versions by this count in descending order.rnSELECTrn D.Dependency.Name,rn D.Dependency.Version,rn COUNT(*) AS NDependentsrnFROMrn `bigquery-public-data.deps_dev_v1.Dependencies` AS DrnJOINrn HighestReleases AS HrnONrn H.Name = D.Namern AND H.Version = D.VersionrnWHERErn D.SnapshotAt = Timern AND D.System = SysrnGROUP BYrn D.Dependency.Name,rn D.Dependency.VersionrnORDER BYrn NDependents DESCrnLIMITrn 10;”), (u’language’, u”)])]

What’s next for software supply chain security?

We hope this dataset will make it easier for developers to learn more fundamental information about their dependencies. You can also explore the Open Source Insights website for the latest open-source software insights and visualizations, learn more about our open source security and software supply chain security solutions at the upcoming Google Cloud Security Summit on May 17.

Cloud BlogRead More

Previous articleIntelligently search your Jira projects with Amazon Kendra Jira cloud connector

Next articleImage classification and object detection using Amazon Rekognition Custom Labels and Amazon SageMaker JumpStart

Introducing Open Source Insights data in BigQuery to help secure software supply chains

How the Open Source Insights dataset works

Getting started with the Open Source Insights dataset

What’s next for software supply chain security?

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Hex-LLM: High-efficiency large language model serving on TPUs in Vertex AI Model Garden

LEAVE A REPLY Cancel reply

Most Popular

Schneider Electric automates Salesforce account hierarchy management with generative artificial intelligence (AI) using Amazon Aurora and Amazon Bedrock

Leverage enterprise data with Denodo and Vertex AI for generative AI applications

TypeScript takes aim at truthy and nullish bugs

Make relevant movie recommendations using Amazon Neptune, Amazon Neptune Machine Learning, and Amazon OpenSearch Service

Recent Comments

EDITOR PICKS

Exploring the Click Element Variable in Google Tag Manager

How to track events with Google Tag Manager and Google Analytics

Data Layer Variable in GTM: What, Why, and Where?

POPULAR POSTS

Scale your Amazon ElastiCache for Redis clusters at a lower cost with data tiering

Supply chain data analysis and visualization using Amazon Neptune and the Neptune workbench

Why xHE-AAC is being embraced at Meta

POPULAR CATEGORY

Introducing Open Source Insights data in BigQuery to help secure software supply chains

How the Open Source Insights dataset works

Getting started with the Open Source Insights dataset

What’s next for software supply chain security?

With software supply chain security, developers have a big role to play

LEAVE A REPLY Cancel reply

Most Popular

Recent Comments

EDITOR PICKS

POPULAR POSTS

POPULAR CATEGORY