Ultra-large-scale compressed graphs: The case of Software Heritage

Presenter: Zoe Kotti
Date: 29 May 2020

Abstract

The amount of publicly available original source code doubles every two years, and state-of-the-art "big data" approaches require several machines to handle it. Compressed graphs can dramatically reduce the hardware resources needed to mine such large corpora. In this presentation we explore the compressed graph of Software Heritage, a dataset that aims to collect and store all publicly available source code. More generally, graphs are a suitable data model for version control system analyses, and compressed graphs offer impressive graph traversal performance.