Yearly Report 2022

Members

Faculty Members

  • Damianos Chatziantoniou
  • Maria Kechagia
  • Dimitris Mitropoulos
  • Panos (Panagiotis) Louridas
  • Diomidis Spinellis

Senior Researchers

  • Nikolaos Alexopoulos
  • Vaggelis Atlidakis
  • Makrina Viola Kosti
  • Vasiliki Efstathiou
  • Stefanos Georgiou
  • Thodoris Sotiropoulos
  • Marios Fragkoulis

Associate Researchers

  • Ilias Balampanis
  • Charalambos-Ioannis Mitropoulos
  • Zoe Kotti
  • Konstantinos Kravvaritis
  • Stefanos Chaliasos
  • Antonios Gkortzis
  • Tushar Sharma
  • Konstantina Dritsa

Researchers

  • Georgios Liargkovas
  • Apostolos Garos
  • Chris Lazaris
  • Rafaila Galanopoulou
  • Evangelia Panourgia
  • Christina Zacharoula Chaniotaki
  • Georgios-Petros Drosos
  • George Theodorou
  • Christos Pappas
  • Angeliki Papadopoulou
  • George Metaxopoulos
  • Theodosis Tsaklanos
  • Michael Loukeris
  • Marios Papachristou
  • Christos Chatzilenas
  • Ioannis Batas
  • Efstathia Chioteli
  • Vitalis Salis

Overview in numbers

New Publications
Monographs and Edited Volumes 0
PhD Theses 1
Journal Articles 1
Book Chapters 0
Conference Publications 2
Technical Reports 0
White Papers 0
Magazine Articles 0
Working Papers 0
Datasets 0
Total New Publications 4
Projects
New Projects 1
Ongoing Projects 0
Completed Projects 2
Members
Faculty Members 5
Senior Researchers 7
Associate Researchers 8
Researchers 18
Total Members 38
New Members 6
PhDs
Ongoing PhDs 5
Completed PhDs 1
New Seminars
New Seminars 15

New Publications

PhD Theses

    • Thodoris Sotiropoulos. Abstractions for software testing. PhD thesis, Athens University of Economics and Business, Athens, Greece, 2022.

Journal Articles

    • Georgios Liargkovas, Angeliki Papadopoulou, Zoe Kotti, and Diomidis Spinellis. Software engineering education knowledge versus industrial needs. IEEE Transactions on Education, 65(3):419–427, August 2022.

Conference Publications

    • Konstantina Dritsa, Kaiti Thoma, John Pavlopoulos, and Panos Louridas. A Greek parliament proceedings dataset for computational linguistics and political analysis. In Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), Datasets and Benchmarks Track, 2022.
    • Stefanos Chaliasos, Thodoris Sotiropoulos, Diomidis Spinellis, Arthur Gervais, Benjamin Livshits, and Dimitris Mitropoulos. Finding typing compiler bugs. In Proceedings of the 43rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI'22. ACM, June 2022. Distinguished Paper Award, Best Artifact Award.

Projects

New Projects

    • HFRI III (PhD Scholarship) - Data Analysis Applications in Software Engineering

Completed Projects

    • FASTEN - Fine-Grained Analysis of Software Ecosystems as Networks
    • REA - Real Estate Analytics

New Members

    • Apostolos Garos
    • Ilias Balampanis (PhD student)
    • Chris Lazaris
    • Nikolaos Alexopoulos
    • Evangelia Panourgia
    • Christina Zacharoula Chaniotaki

Ongoing PhDs

    • Ilias Balampanis Topic: Cybersecurity & Data Analytics
    • Zoe Kotti Topic: Data Analysis Applications in Software Engineering
    • Konstantinos Kravvaritis Topic: Data and Quality Metrics of System Configuration Code
    • Antonios Gkortzis Topic: Secure Systems on Cloud Computing Infrastructures
    • Konstantina Dritsa Topic: Data Science

Completed PhDs

    • Thodoris Sotiropoulos Topic: Abstractions for software testing

Seminars

      Developing trustworthy AI systems

      Date: 05 January 2022
      Presenter: Theodoros Evgeniou, Professor INSEAD
      Abstract

      While research in AI has focused for the past 50 years on the AI algorithms themselves, little work has been done on how to (a) ensure that AI systems are safe and trustworthy, and (b) ensure that they are developed using sound and robust software/product development processes and tools. Discussion in both academia and industry has shifted in recent years towards trustworthy AI (“ethical AI”, “AI risks”, “responsible AI”, etc.), and regulators (e.g. in the EU) are also developing new rules that companies will need to abide by as they develop, or procure and deploy, AI systems. In this seminar (and open discussion) I will present an overview of key aspects of what is currently considered part of so-called “trustworthy AI”, some tools and practices that are under development to support the implementation of trustworthy AI, as well as some key initiatives by organisations such as the OECD and the IEEE in this space. One of the goals of this discussion is also to explore together potential synergies between regulation, AI, and software engineering research.

      Theos Evgeniou is Professor of Decision Sciences and Technology Management at INSEAD and director of the INSEAD Executive Education program on Transforming your Business with AI.

      He has been working on Machine Learning and AI for the past 25 years, on areas ranging from AI innovations for business process optimization and improving decisions in Marketing and Finance, to AI regulation, as well as on new Machine Learning methods. His research has appeared in leading journals, such as Science Magazine, Nature Machine Intelligence, Machine Learning, Lancet Digital Health, Journal of Machine Learning Research, Management Science, Marketing Science, Harvard Business Review magazine, and others.

      Professor Evgeniou has been a member of the OECD Network of Experts on AI, an advisor for the BCG Henderson Institute, and a World Economic Forum Academic Partner for Artificial Intelligence. He gives talks and consults for a number of organisations in his areas of expertise, and he has been involved in developing hedge fund strategies with more than $100 million invested. He has received four degrees from MIT: two BSc degrees earned simultaneously, one in Computer Science and one in Mathematics, as well as a Master's and a PhD degree in Computer Science.


      An Empirical Investigation of Boilerplate Code

      Date: 19 January 2022
      Presenter: Christina Chaniotaki, AUEB
      Abstract

      Over the years, the use of computer applications, and thus the creation of new software programs, has been growing rapidly. As a result, new programming languages and services that meet existing needs are also being created and adopted quickly. Code repetition and reuse have therefore become widespread in the development of new applications, whether intentional or not, giving rise to a research field of Computer Science known as “Code Cloning”. One subcategory of this field is the use of boilerplate code: code snippets that are used over and over again with little or no modification, either in the same program or in different ones.

      This work presents the concept of boilerplate code as analyzed and understood by us and by previous research on the subject over the last 21 years. In addition, an analysis has been carried out in order to understand Code Cloning, Clone Detection, and Clone Management. These theoretical analyses were performed using the Systematic Literature Review (SLR) process, and two surveys were also conducted. The first is a qualitative boilerplate code analysis using a clone detection tool: open-source repositories were analyzed in terms of the amount and type of boilerplate code they contain, and the boilerplate code was categorized using the Open Coding method from Grounded Theory. The second is an open survey in which questionnaires were distributed to software engineers and developers in order to understand how the concept of boilerplate code is perceived and how it is used. Finally, the results of these surveys are presented and discussed.
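
      As a rough illustration of the kind of analysis involved (a minimal sketch, not the clone detection tool used in the study), the following Python fragment flags repeated, lightly normalized line windows across source files; the window size and normalization rules are arbitrary choices made for the example.

          # Minimal sketch of window-based near-clone detection (illustrative only).
          # Normalizes string and numeric literals and whitespace, hashes fixed-size
          # line windows, and reports windows that occur more than once.
          import hashlib
          import re
          import sys
          from collections import defaultdict

          WINDOW = 5  # arbitrary window size chosen for the example

          def normalize(line):
              line = re.sub(r'"[^"]*"', '"STR"', line)   # collapse string literals
              line = re.sub(r'\b\d+\b', 'NUM', line)      # collapse numeric literals
              return re.sub(r'\s+', ' ', line).strip()    # collapse whitespace

          def windows(path):
              with open(path, encoding='utf-8', errors='ignore') as f:
                  lines = [normalize(l) for l in f if l.strip()]
              for i in range(len(lines) - WINDOW + 1):
                  chunk = '\n'.join(lines[i:i + WINDOW])
                  yield i + 1, hashlib.sha1(chunk.encode()).hexdigest()

          def find_repeats(paths):
              seen = defaultdict(list)                    # digest -> [(file, line), ...]
              for path in paths:
                  for lineno, digest in windows(path):
                      seen[digest].append((path, lineno))
              return {h: locs for h, locs in seen.items() if len(locs) > 1}

          if __name__ == '__main__':
              for digest, locations in find_repeats(sys.argv[1:]).items():
                  print(digest[:12], locations)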


      Dependable Software Supply Chains

      Date: 24 February 2022
      Presenter: Diomidis Spinellis, AUEB
      Abstract

      Modern software is typically based on hundreds or even thousands of software components. This practice has boosted software development productivity and has allowed the creation of extremely sophisticated software systems. However, software components, many of which rely on other components, come at a cost. They are part of an often brittle software supply chain (SSC) with varying and sometimes lacking quality controls, which has led to major losses and disasters. I will present a research agenda aiming to reduce the considerable risk that modern software projects face by systematizing its analysis, by establishing responses through inter-disciplinary research, and by proposing a validated method for increasing SSC dependability.


      Detecting hardware security threats through fingerprinting

      Date: 24 February 2022
      Presenter: Konstantinos Alexakis, AUEB
      Abstract

      Software fingerprinting is a popular method for mapping a large amount of arbitrary data to a shorter fingerprint that identifies the original data for all practical purposes. One such use is identifying the presence of a security threat through this fingerprint. More specifically, according to the literature, hardware security leaks can be detected by fingerprinting CPUs using hardware measurement techniques. However, these require access to “golden” chips and cannot be easily performed in the field. The objective of this study is to design, develop, and evaluate a fingerprinting method that can reliably detect hardware deviations. The software will obtain the fingerprints by monitoring the CPU’s performance counters while executing different sets of CPU instructions. Testing can be done by fingerprinting the same CPUs on the same hardware batches (they should provide identical fingerprints) and by testing CPUs with different microcode updates or disabled features (they should yield mismatched fingerprints). As part of a crowd-sourcing process, fingerprints can be collected from a wide variety of CPUs, which, when matched against CPU identifiers and microcode versions, may even allow the detection of genuine Trojan horses. The comparison between CPUs will be performed through statistical methods capable of providing enough evidence to make the final decision.
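
      The comparison step might look like the following sketch (an assumed workflow, not the study's actual tool): repeated counter readings for a fixed workload are reduced to a mean vector, and two CPUs are declared to match when every counter agrees within a tolerance. All counter values below are invented purely for illustration.

          # Illustrative fingerprint comparison; obtaining the counter samples
          # (e.g. via perf or PAPI while running a fixed workload) is outside
          # this sketch, and all numbers below are made up for the example.
          import statistics

          COUNTERS = ('cycles', 'instructions', 'branch-misses', 'cache-misses')
          TOLERANCE = 0.02  # arbitrary 2% relative tolerance

          def fingerprint(samples):
              """Reduce repeated counter readings to one mean value per counter."""
              return {c: statistics.mean(s[c] for s in samples) for c in COUNTERS}

          def same_cpu(fp_a, fp_b):
              """Declare a match if every counter mean agrees within the tolerance."""
              return all(abs(fp_a[c] - fp_b[c]) <= TOLERANCE * max(fp_a[c], 1)
                         for c in COUNTERS)

          # Hypothetical readings: two CPUs from the same batch, and one CPU
          # running a different microcode revision.
          batch_a = [{'cycles': 1.00e6, 'instructions': 2.5e6,
                      'branch-misses': 1.2e3, 'cache-misses': 3.00e3}] * 5
          batch_b = [{'cycles': 1.01e6, 'instructions': 2.5e6,
                      'branch-misses': 1.2e3, 'cache-misses': 3.05e3}] * 5
          patched = [{'cycles': 1.30e6, 'instructions': 2.5e6,
                      'branch-misses': 4.8e3, 'cache-misses': 3.00e3}] * 5

          print(same_cpu(fingerprint(batch_a), fingerprint(batch_b)))  # True
          print(same_cpu(fingerprint(batch_a), fingerprint(patched)))  # False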


      Decentralized Finance: Privacy and security issues in the Ethereum ecosystem

      Date: 18 April 2022
      Presenter: Stefanos Chaliasos
      Abstract

      Decentralized Finance (DeFi) is an emerging area at the intersection of blockchain, digital assets, and financial applications. DeFi uses smart contracts to create protocols that provide (mainly) financial services in an interoperable and transparent way. The leading blockchain for DeFi applications is Ethereum. Despite the substantial progress in DeFi security research, many challenges are yet to be tackled. I will present some exciting research avenues on DeFi, focusing on privacy and security issues in the Ethereum ecosystem.


      New Approaches to Software Security Metrics and Measurements

      Date: 09 May 2022
      Presenter: Nikolaos Alexopoulos, TU Darmstadt
      Abstract

      Meaningful metrics and methods for measuring software security would greatly improve the security of software ecosystems. Such means would make security an observable attribute, helping users make informed choices and allowing vendors to 'charge' for it, thus providing strong incentives for more security investment. In this talk, I will present an overview of the contributions of my dissertation consisting of three empirical measurement studies introducing new approaches to measuring aspects of software security, focusing on Free/Libre and Open Source Software (FLOSS).

      For reference, I will talk about work already published as:

      [1] Nikolaos Alexopoulos, Sheikh Mahbub Habib, Steffen Schulz, Max Mühlhäuser. "The Tip of the Iceberg: On the Merits of Finding Security Bugs." In ACM Trans. Priv. Secur. 24, 1, Article 3 (February 2021), 2021.

      [2] Nikolaos Alexopoulos, Andrew Meneely, Dorian Arnouts, Max Mühlhäuser. "Who are Vulnerability Reporters?: A Large-scale Empirical Study on FLOSS." In ESEM ‘21: ACM / IEEE International Symposium on Empirical Software Engineering and Measurement, Bari, Italy, October 11-15, 2021, 2021.

      [3] Nikolaos Alexopoulos, Manuel Brack, Jan Wagner, Tim Grube, Max Mühlhäuser. "How Long Do Vulnerabilities Live in the Code? A Large-Scale Empirical Measurement Study on FOSS Vulnerability Lifetimes." In 31st USENIX Security Symposium, USENIX Security 2022, Boston, MA, USA, August 10-12, 2022 (to appear), 2022.


      Software Engineering Education Knowledge versus Industrial Needs

      Date: 16 May 2022
      Presenter: Georgios Liargkovas and Angeliki Papadopoulou
      Abstract

      Contribution: Determine and analyze the gap between software practitioners' education, as outlined in the 2014 IEEE/ACM Software Engineering Education Knowledge (SEEK), and industrial needs, as indicated by Wikipedia articles referenced in Stack Overflow (SO) posts.

      Background: Previous work has uncovered deficiencies in the coverage of computer fundamentals, people skills, software processes, and human-computer interaction, suggesting rebalancing.

      Research Questions: 1) To what extent are developers' needs, in terms of Wikipedia articles referenced in SO posts, covered by the SEEK knowledge units? 2) How does the popularity of Wikipedia articles relate to their SEEK coverage? 3) What areas of computing knowledge can be better covered by the SEEK knowledge units? 4) Why are Wikipedia articles covered by the SEEK knowledge units cited on SO?

      Methodology: Wikipedia articles were systematically collected from SO posts. The most cited ones were manually mapped to the SEEK knowledge units and assessed according to their degree of coverage. Articles insufficiently covered by the SEEK were classified by hand using the 2012 ACM Computing Classification System. A sample of posts referencing sufficiently covered articles was manually analyzed. A survey of software practitioners was conducted to validate the study's findings.

      Findings: SEEK appears to sufficiently cover computer science fundamentals, software design, and mathematical concepts, but less so areas like the World Wide Web, software engineering components, and computer graphics. Developers seek advice, best practices, and explanations about software topics, as well as code review assistance. Future SEEK models and computing education could dive deeper into information systems, design, testing, security, and soft skills.
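
      A rough sketch of the collection step described in the methodology (assuming the SO post bodies are available as HTML strings, for instance from the Stack Exchange data dump; this is not the study's actual pipeline) could count the Wikipedia articles cited across posts:

          # Count Wikipedia articles cited in Stack Overflow post bodies
          # (illustrative; the input format and example posts are assumptions).
          import re
          from collections import Counter
          from urllib.parse import unquote

          WIKI_LINK = re.compile(r'https?://en\.wikipedia\.org/wiki/([^"#\s<>]+)')

          def cited_articles(post_bodies):
              counts = Counter()
              for body in post_bodies:
                  for slug in WIKI_LINK.findall(body):
                      counts[unquote(slug).replace('_', ' ')] += 1
              return counts

          posts = [
              '<p>See <a href="https://en.wikipedia.org/wiki/Dependency_injection">DI</a></p>',
              '<p>https://en.wikipedia.org/wiki/Dependency_injection explains it.</p>',
              '<p>Compare https://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller</p>',
          ]
          for article, n in cited_articles(posts).most_common():
              print(n, article)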


      Finding Typing Compiler Bugs

      Date: 30 May 2022
      Presenter: Thodoris Sotiropoulos
      Abstract

      We propose a testing framework for validating static typing procedures in compilers. Our core component is a program generator suitably crafted for producing programs that are likely to trigger typing compiler bugs. One of our main contributions is that our program generator gives rise to transformation-based compiler testing for finding typing bugs. We present two novel approaches (type erasure mutation and type overwriting mutation) that apply targeted transformations to an input program to reveal type inference and soundness compiler bugs respectively. Both approaches are guided by an intra-procedural type inference analysis used to capture type information flow.

      We implement our techniques as a tool, which we call Hephaestus. The extensibility of Hephaestus enables us to test the compilers of three popular JVM languages: Java, Kotlin, and Groovy. Within nine months of testing, we have found 156 bugs (137 confirmed and 85 fixed) with diverse manifestations and root causes in all the examined compilers. Most of the discovered bugs lie at the heart of many critical components related to static typing, such as type inference.
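
      The type erasure mutation idea can be illustrated with a deliberately simplistic sketch (this is not Hephaestus, and the regex is far cruder than a real program generator): explicit type annotations on initialized declarations are dropped so that the compiler must infer them, and comparing the compiler's diagnostics on the original and the mutated program can then expose type inference bugs.

          # Toy illustration of the type erasure mutation on a Kotlin snippet:
          # `val x: List<Int> = listOf(1)` becomes `val x = listOf(1)`. Both
          # versions would then be fed to the compiler and the diagnostics
          # compared; a disagreement hints at a type inference bug.
          import re

          ANNOTATED_DECL = re.compile(r'\b(val|var)\s+(\w+)\s*:\s*[\w.<>?,\s]+=\s*')

          def erase_types(program):
              return ANNOTATED_DECL.sub(r'\1 \2 = ', program)

          original = """
          fun main() {
              val xs: List<Int> = listOf(1, 2, 3)
              var total: Int = 0
              for (x in xs) total += x
              println(total)
          }
          """
          print(erase_types(original))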


      Permanent and ephemeral linking in scientific publishing

      Date: 22 August 2022
      Presenter: Diomidis Spinellis
      Abstract

      The World Wide Web has allowed scientific publications to include links to resources available in it via URLs. Previous research has shown that the availability of resources identified by URLs is ephemeral, decaying with the passage of time. As a response to the problem of URL decay, more permanent identifiers, such as DOIs, have been developed and are often used.

      I describe the methods and first findings of an ongoing study on the evolution of web and DOI linking in scientific publications. The study is based on processing a collection of n-grams collected from 71 million documents on a small cluster of computers. Preliminary results indicate that the density of published links per document has increased tenfold over the past quarter century, while the percentage of DOIs has also been increasing, reaching 20% in 2018. Looking at failures, I find the expected increase of URL failures as the years go by. The DOI failures are surprising and warrant closer investigation.

      As an aside, I also looked at the feasibility of recreating documents from the n-grams via the Eulerian path approach used for stitching together DNA fragments. However, I reached the conclusion that there are too many candidate paths to do this deterministically.
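
      The reconstruction idea and the ambiguity that defeats it can be shown on a toy example (my own illustration, not the study's code): treat each word 3-gram as an edge from its two-word prefix to its two-word suffix, and every Eulerian path through the resulting graph is a candidate reconstruction. Even the tiny input below already admits two different orderings.

          # Reconstruct a text from its word 3-grams via Eulerian paths in a
          # de Bruijn-style graph (nodes are word pairs, edges are 3-grams).
          # Assumes, for simplicity, that each 3-gram occurs at most once.
          from collections import defaultdict

          def ngrams(words, n=3):
              return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

          def eulerian_paths(grams):
              graph = defaultdict(list)
              in_deg = defaultdict(int)
              for g in grams:
                  graph[g[:-1]].append(g)
                  in_deg[g[1:]] += 1
              # Start at a node with one more outgoing than incoming edge, if any.
              starts = [n for n in graph if len(graph[n]) - in_deg[n] == 1] or list(graph)

              def walk(node, used, path):
                  if len(path) == len(grams):
                      yield list(path)
                      return
                  for g in graph[node]:
                      if g not in used:
                          used.add(g)
                          path.append(g)
                          yield from walk(g[1:], used, path)
                          path.pop()
                          used.remove(g)

              for start in starts:
                  yield from walk(start, set(), [])

          words = 'start and the cat and the dog and the end'.split()
          for path in eulerian_paths(ngrams(words)):
              print(' '.join(path[0][:-1]), ' '.join(g[-1] for g in path))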


      The SecOPERA Project

      Date: 08 September 2022
      Presenter: Panos Louridas, Dimitris Mitropoulos
      Abstract

      The SecOPERA project (Secure Open-source softwarE and hardwaRe Adaptable framework), which was recently accepted for funding by the EU, aims at improving the security of Open Source solutions. It will use a number of different approaches, contributed by the different partners, in order to provide a functional toolbox for security assessment.

      SecOPERA will tackle the security problem by recognising that it can be divided into four layers, all of which can be present in today's systems:

      1. Cognitive: attacks on systems using Machine Learning and Artificial Intelligence.

      2. Network.

      3. Software.

      4. Hardware.

      These different levels call for different techniques. For instance, adversarial attacks on Neural Networks are very different from vulnerabilities in IP (Intellectual Property) cores. Trying to prevent attacks on software, and in particular system libraries, is not the same as securing network devices or pieces of the network stack in an operating system. That said, well-developed principles can apply to the different levels and SecOPERA aims exactly at developing a holistic solution, leveraging the knowledge brought by the different partners of the project.

      The partners will follow the SecOPERA Secure Flow, which will be developed in detail in the project, and which will consist of the following activities:

      • Decompose an open source solution into its components, map them to the four layers described above, associate components with their source repositories, and create dependency graphs (a minimal sketch follows this list).

      • Audit/Assess by performing vulnerability scans both against known vulnerabilities and by using techniques that may indicate problems beyond those.

      • Secure by creating a pool of secure modules that will contain both secured components and tools for securing components.

      • Adapt existing open source solutions by combining SecOPERA secure modules that can harden open source solutions with the actual audited components of an open source solution.

      • Update/Patch open source software and hardware using formal verification mechanisms.
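
      As a minimal sketch of the decomposition step referenced above (the manifest format and component names are hypothetical; a real pipeline would derive them from package metadata and source repositories), one could map components to layers and compute their transitive dependencies as follows:

          # Hypothetical component manifest for an open source solution:
          # each component lists its layer and its direct dependencies.
          manifest = {
              'web-frontend': {'layer': 'software',  'deps': ['tls-lib', 'ml-ranker']},
              'tls-lib':      {'layer': 'network',   'deps': ['crypto-core']},
              'ml-ranker':    {'layer': 'cognitive', 'deps': []},
              'crypto-core':  {'layer': 'hardware',  'deps': []},
          }

          def transitive_deps(component, seen=None):
              """All components the given one depends on, directly or indirectly."""
              seen = set() if seen is None else seen
              for dep in manifest[component]['deps']:
                  if dep not in seen:
                      seen.add(dep)
                      transitive_deps(dep, seen)
              return seen

          for name, info in manifest.items():
              deps = transitive_deps(name)
              layers = sorted({manifest[d]['layer'] for d in deps} | {info['layer']})
              print(f'{name}: layers {layers}, depends on {sorted(deps)}')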

      The SecOPERA project will run for 36 months and will be validated with two user pilots. The first, concerning automotive software, involves the partner Voxel from Austria, which develops such software, and KTM, also from Austria, which embeds the software in the motorcycles it produces. The second involves Greencityzen, which offers an IoT (Internet of Things) water management system.


      Abstractions for Software Testing

      Date: 03 October 2022
      Presenter: Thodoris Sotiropoulos
      Abstract

      Developers and practitioners spend a considerable amount of their time testing their software and fixing software bugs. To do so more effectively, they integrate automated bug-finding tools into the software development process, automating the discovery of deep software bugs that are challenging to uncover via manually written test cases. A challenge of automated bug detection is the identification of subtle and latent defects in software that involves complex functionality. Such bugs easily remain unnoticed because the software under test does not complain with warnings or other runtime failures (e.g., crashes) during its execution. Worse, subtle defects often confuse users, who do not blame the buggy software for the unexpected behavior because they believe that the error is on their side (e.g., that they have given wrong input). Another shortcoming of many existing bug-finding tools is their limited applicability: many of them are tailored to a specific piece of software. This lack of applicability is mainly attributed to fundamental issues in the design of the underlying methods.

      To tackle the aforementioned issues, this thesis investigates ways of improving the effectiveness and applicability of automated software testing by introducing different forms of abstractions into the testing workflow. The aim of these abstractions is to provide a common platform for reasoning about and identifying (subtle) bugs in software systems and programs that exhibit dissimilar interfaces, implementations, or semantics. The thesis demonstrates this concept by applying abstractions in the context of three important problems: the detection of (1) compiler typing bugs, (2) bugs in data-centric software, and (3) dependency bugs in file system resources. This is achieved through the design and development of three bug-finding tools: Hephaestus, Cynthia, and FSMoVe, respectively.

      The work presented in this thesis improved the reliability of well-established and critical software used by millions of users and applications. Overall, our bug-finding techniques and tools led to the disclosure and fixing of more than 400 bugs found in complex software systems, such as the Java and Groovy compilers, the Django web framework, and dozens of popular configuration management libraries used for managing critical infrastructures (e.g., the Apache server). This thesis has practical impact on the software industry and opens new research opportunities related to the application of programming language concepts to automated software testing and software reliability.


      Reality check on developers' perception! A case study of software testability and its effects

      Date: 12 October 2022
      Presenter: Tushar Sharma
      Abstract

      Software testability is commonly defined as the degree to which the development of test cases can be facilitated by software design choices. Despite various studies on testability and its characteristics, the effects of testability on tests and overall software quality are unknown. In this presentation, I will discuss a catalog of four testability smells, and a survey targeted to gather software developers' perspectives on testability in general and our proposed testability smells in particular. I will also elaborate on a large-scale empirical study on 941 Java repositories containing approximately 11 million lines of code to investigate whether empirical data supports the perception of the developers on testability. Specifically, the study explores the relationship of testability with test quality, the number of tests, and reported bugs.


      Alexandria3k: Reproducible publication research on the desktop

      Date: 21 November 2022
      Presenter: Diomidis Spinellis
      Abstract

      Sustained exponential advances in computing power, drops in associated costs, and Open Science initiatives allow us to process on a personal computer the metadata from most major international academic publishers as well as corresponding author, funder, and journal details. Alexandria3k is an open-source software library and command-line tool that builds on this capability to allow the conduct of sophisticated bibliometric and scientometric studies as well as systematic literature reviews in a transparent, repeatable, reproducible, and efficient manner. In total, the Alexandria3k system provides relational access to 1.5 PB of data comprising 134 million publication records, of which 60 million contain full citation data, 15 million author records, 109 thousand journal records, and 32 thousand research funding bodies. The system allows the execution of simple ad hoc queries over the publication dataset or the selective population of a database for running more complex relational queries. Application examples include the independently verifiable calculation of bibliometric figures, such as the journal impact factor and the h-index, the creation of detailed research data subsets based on the publication topic, funder, institution, publication outlet, or author country, and the study of the publication citation graph.
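
      Once such a database has been selectively populated, the relational queries the abstract mentions are plain SQL. The sketch below is illustrative only; the database file, table, and column names are assumptions, not necessarily Alexandria3k's documented schema.

          # Example relational query over a (hypothetical) populated SQLite
          # database of publication metadata: works per year for one journal.
          import sqlite3

          con = sqlite3.connect('publications.db')   # assumed pre-populated file
          query = """
              SELECT published_year, COUNT(*) AS works
              FROM works
              WHERE journal_issn = ?
              GROUP BY published_year
              ORDER BY published_year
          """
          for year, works in con.execute(query, ('1234-5678',)):  # placeholder ISSN
              print(year, works)
          con.close()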


      Data Virtual Machines: Simplifying Data Sharing, Exploration & Querying in Big Data Environments

      Date: 12 December 2022
      Presenter: Damianos Chatziantoniou
      Abstract

      Today’s analytics environments are characterized by a high degree of heterogeneity in terms of data systems, formats, and types of analysis. Many occasions call for the rapid, ad hoc, on-demand construction of a data model that represents (parts of) the data infrastructure of an organization, including ML tasks. This data model is given to data scientists to play with (express reports, build ML models, explore, etc.). We present a novel graph-based conceptual model, the Data Virtual Machine (DVM), representing the data (persistent, transient, derived) of an organization. A DVM can be built quickly and agilely, offering schema flexibility. It is amenable to visual interfaces for schema and query management. Dataframing, a frequent preprocessing task, is usually carried out by experienced data engineers employing Python or R: a procedural approach with all the known drawbacks. Dataframes over DVMs are expressed declaratively and visually, via a simple and intuitive tool. This way, non-IT experts can be involved in dataframing. In addition, query evaluation takes place within an algebraic framework with all the known benefits. That is, a DVM enables the delegation of data engineering tasks to less technical users. Finally, a DVM offers a formalism that facilitates data sharing, data portability, and a single view of any entity, because a DVM’s node is an attribute and an entity at the same time. In this respect, DVMs can excellently serve as a data virtualization technique, an emerging trend in the industry. We argue that DVMs can have a significant practical impact in today’s big data environments.
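
      One deliberately simplified reading of the node-as-attribute-and-entity idea (my own illustration, not the DVM formalism itself) is a labelled graph whose edges map a subject node's values to the values of a related node; a small dataframe can then be assembled declaratively by naming a subject and the attributes to follow:

          # Toy attribute/entity graph and a declarative "dataframe" over it.
          # Node names, edges, and values are invented for the example.
          edges = {
              ('customer', 'order'): {'c1': ['o1', 'o2'], 'c2': ['o3']},
              ('customer', 'city'):  {'c1': ['Athens'],   'c2': ['Patras']},
              ('order', 'amount'):   {'o1': [120], 'o2': [80], 'o3': [42]},
          }

          def dataframe(subject, attributes):
              """One row per subject value; each column follows one edge."""
              rows = []
              for value in edges[(subject, attributes[0])]:
                  row = {subject: value}
                  for attr in attributes:
                      row[attr] = edges[(subject, attr)].get(value, [])
                  rows.append(row)
              return rows

          for row in dataframe('customer', ['city', 'order']):
              print(row)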


      AI Explainability and Tech Trust & Safety

      Date: 19 December 2022
      Presenter: Theodoros Evgeniou, INSEAD
      Abstract

      Discussions about trustworthy and responsible AI have become central across multiple communities in recent years - machine learning, law, social sciences, among others. A key challenge regarding trust in AI - also considered important by regulators as part of transparency for some AI applications - is to understand why “black boxes” may be making specific predictions. As a result, explainable AI (XAI) has been a growing topic of research. In this talk, I will discuss some potential drawbacks XAI may have - including the potential to erode safety in practice - and also present some work that takes into account behavioural aspects researchers and practitioners may need to consider when developing XAI.

      Biography

      Theos Evgeniou is a professor of Decision Sciences and Technology Management at INSEAD and director of the INSEAD Executive Education program on Transforming your Business with AI.

      He has been working on Machine Learning and AI for the past 25 years, on areas ranging from AI innovations for business process optimization and improving decisions in Marketing and Finance, to AI regulation, as well as on new Machine Learning methods. His research has appeared in leading journals, such as Science Magazine, Nature Machine Intelligence, Machine Learning, Lancet Digital Health, Journal of Machine Learning Research, Management Science, Marketing Science, Harvard Business Review magazine, and others.

      Professor Evgeniou is a member of the OECD Network of Experts on AI, an advisor for the BCG Henderson Institute, an advisor for the World Economic Forum Academic Partner for Artificial Intelligence, and, together with three INSEAD alums, a co-founder of Tremau, a B2B SaaS company whose mission is to build a digital world that is safe & beneficial for all. He gives talks and consults for a number of organisations in his areas of expertise, and in the past he has been involved in developing hedge fund strategies with more than $100 million invested. He has received four degrees from MIT: two BSc degrees earned simultaneously, one in Computer Science and one in Mathematics, as well as a Master's and a PhD degree in Computer Science.


Note: Data before 2017 may refer to grandparented work conducted by BALab's members at its progenitor laboratory, ISTLab.