New Publications | Number |
---|---|
Monographs and Edited Volumes | 1 |
PhD Theses | 0 |
Journal Articles | 5 |
Book Chapters | 0 |
Conference Publications | 8 |
Technical Reports | 0 |
White Papers | 0 |
Magazine Articles | 0 |
Working Papers | 0 |
Datasets | 0 |
Total New Publications | 14 |
Projects | |
New Projects | 0 |
Ongoing Projects | 2 |
Completed Projects | 0 |
Members | |
Faculty Members | 4 |
Senior Researchers | 6 |
Associate Researchers | 7 |
Researchers | 15 |
Total Members | 32 |
New Members | 7 |
PhDs | |
Ongoing PhDs | 6 |
Completed PhDs | 1 |
New Seminars | |
New Seminars | 21 |
Date: 20 February 2020
Presenter: Damianos Chatziantoniou
Abstract
In this talk we introduce the concept of Data Virtual Machines (DVM), a graph-based conceptual model of the data infrastructure of an organization, much like the traditional Entity-Relationship (ER) model. However, while ER uses a top-down approach, in which real-world entities and their relationships are depicted and then used to produce a relational representation, DVMs are based on a bottom-up approach that maps the data infrastructure of an organization onto a graph-based model. With the term "data infrastructure" we refer not only to data persistently stored in data management systems adhering to some data model, but also to generic data processing tasks that produce output useful in decision making. For example, a Python program that "does something" and computes for each customer the probability that she will churn is an essential component of the organization's data landscape, and has to be made available to the user, e.g. a data scientist, in a manner that is easy to understand and intuitive to use, the same way a customer's age or gender are made available. We formally define DVMs, queries over DVMs, and an algebraic framework for query evaluation. We also argue that a conceptual layer, such as the DVM, is a prerequisite for end-to-end processing. In addition, we present a prototype tool based on DVMs, called DataMingler, which enables end-to-end processing in analytics environments by data stakeholders.
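As a toy illustration of this bottom-up view (a sketch under assumed names and structure, not the formal DVM definition), the snippet below treats every data source, whether a table extract or a program, as an edge of a graph over attributes that a data scientist can query uniformly:

```python
# Toy illustration of a bottom-up, graph-based view of a data landscape.
# Each edge maps a key attribute to values, regardless of whether the
# mapping comes from a relational table, a CSV file, or a Python task.
# This is an illustrative sketch, not the formal DVM definition.

def churn_probability(customer_id):
    """Stand-in for a generic data processing task (e.g. an ML model)."""
    return 0.42  # hypothetical output

# Attribute graph: node -> {related attribute: mapping implementation}
dvm = {
    "customer": {
        "age":   lambda cid: {"c1": 34, "c2": 27}.get(cid),   # from a table
        "churn": churn_probability,                           # from a program
    }
}

# A data scientist uses both mappings in the same, uniform way:
for cid in ("c1", "c2"):
    print(cid, dvm["customer"]["age"](cid), dvm["customer"]["churn"](cid))
```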
Date: 29 May 2020
Presenter: Konstantinos Kravvaritis
Abstract
Spring Boot is a Spring Framework extension that aims to provide convenience utilities and accelerate the development process. In this presentation, we give a brief overview of Spring Boot, the reasons it was introduced, and its advantages.
Date: 29 May 2020
Presenter: Stefanos Georgiou
Abstract
Raspberry Pi is a small, compact, and cheap computer for easily building reliable systems. In this presentation, we attach a DHT-11 sensor to obtain temperature and humidity measurements. Afterwards, we extend this computer system to act upon events and inform the user if something goes wrong.
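A minimal sketch of the described setup, assuming the Adafruit_DHT Python library, a DHT-11 data pin on GPIO 4, and an arbitrary alert threshold:

```python
import Adafruit_DHT  # assumes the Adafruit_DHT library is installed

SENSOR = Adafruit_DHT.DHT11
PIN = 4              # assumed GPIO data pin
MAX_TEMP = 30.0      # assumed alert threshold in Celsius

# read_retry polls the sensor several times, as DHT-11 reads can fail
humidity, temperature = Adafruit_DHT.read_retry(SENSOR, PIN)
if temperature is not None:
    print(f"Temperature: {temperature:.1f} C, humidity: {humidity:.1f}%")
    if temperature > MAX_TEMP:
        print("Alert: temperature above threshold!")  # act upon the event
else:
    print("Sensor read failed")
```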
Date: 29 May 2020
Presenter: Zoe Kotti
Abstract
While publicly available original source code doubles every two years, and state-of-the-art "big data" approaches require several machines, compressed graphs can dramatically reduce the hardware resources needed to mine such large corpora. In this presentation we explore the compressed graph of Software Heritage, a dataset aimed at collecting and storing all publicly available source code. In general, graphs are suitable data models for conducting version control system analyses, while compressed graphs offer impressive graph traversal performance.
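To give a flavor of such traversals, the sketch below runs a BFS over a generic successors() accessor; the accessor is a stand-in for a compressed-graph lookup, not the actual Software Heritage graph API:

```python
from collections import deque

def successors(node):
    """Stand-in for a compressed-graph successor lookup."""
    toy_graph = {0: [1, 2], 1: [2], 2: [3], 3: []}
    return toy_graph.get(node, [])

def reachable(start):
    """BFS counting nodes reachable from start, e.g. a revision's history."""
    seen, queue = {start}, deque([start])
    while queue:
        for succ in successors(queue.popleft()):
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return len(seen)

print(reachable(0))  # -> 4
```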
Date: 05 June 2020
Presenter: Konstantina Dritsa
Abstract
We enrich the Greek Parliament Proceedings Dataset (https://zenodo.org/record/2587904) with information on the gender and parliament role of the Parliament members.
Date: 05 June 2020
Presenter: Diomidis Spinellis
Abstract
Epidose is an open source software reference implementation for an epidemic dosimeter. Just as a radiation dosimeter measures dose uptake of external ionizing radiation, the epidemic dosimeter tracks potential exposure to viruses or bacteria associated with an epidemic. The dosimeter measures a person's exposure to an epidemic, such as COVID-19, based on exposure to contacts that have tested positive. The epidemic dosimeter is designed to be widely accessible and to safeguard privacy. Specifically, it is designed to run on the $10 open-hardware Raspberry Pi Zero-W computer, with a minimal user interface comprising LED indicators for operation and exposure risk, and a physical interlock switch to allow the release of contact data. The software is based on the DP3T contact tracing "unlinkable" design and corresponding reference implementation code.
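As a rough flavor of the DP3T-style "unlinkable" design the dosimeter builds on, the sketch below rotates a daily secret key by hashing and derives short-lived ephemeral broadcast IDs from it, so the IDs cannot be linked back to the device. The constants and derivation details are simplified assumptions, not the exact DP3T specification:

```python
import hashlib
import hmac

def next_day_key(sk: bytes) -> bytes:
    """Rotate the daily secret key: SK_{t+1} = H(SK_t)."""
    return hashlib.sha256(sk).digest()

def ephemeral_ids(sk: bytes, n: int = 4) -> list:
    """Derive n short-lived broadcast IDs from the daily key (simplified)."""
    prf = hmac.new(sk, b"broadcast key", hashlib.sha256).digest()
    return [hashlib.sha256(prf + bytes([i])).digest()[:16] for i in range(n)]

sk = hashlib.sha256(b"initial seed").digest()  # assumed initial key
for day in range(2):
    print(day, [eid.hex() for eid in ephemeral_ids(sk)][:2])
    sk = next_day_key(sk)  # yesterday's IDs stay unlinkable to today's
```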
Date: 05 June 2020
Presenter: Vasiliki Efstathiou
Abstract
Details will be provided during the seminar.
Date: 19 June 2020
Presenter: Stefanos Chaliasos
Abstract
A challenging issue in analyzing C packages is that they are hard to reproduce. In this talk, we present how we can exploit sbuild, a tool that automates the build process of Debian packages, for running static analysis tools that need to rebuild a package in order to analyze it. Moreover, we will see how we can query the Ultimate Debian Database to select packages based on some criteria.
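A minimal sketch of the package-selection step, assuming Debian's public read-only UDD mirror (udd-mirror.debian.net) and the psycopg2 driver; the table and column names are assumptions about UDD's schema, and the criterion is illustrative:

```python
import psycopg2  # assumes the psycopg2 driver is installed

# Public read-only UDD mirror; credentials as documented by Debian.
conn = psycopg2.connect(
    host="udd-mirror.debian.net", dbname="udd",
    user="udd-mirror", password="udd-mirror")

# Illustrative criterion: source packages in unstable ("sid").
cur = conn.cursor()
cur.execute("""
    SELECT source, version FROM sources
    WHERE release = 'sid' LIMIT 10""")
for source, version in cur.fetchall():
    print(source, version)
conn.close()
```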
Date: 19 June 2020
Presenter: Antonis Gkortzis
Abstract
This short presentation is a compilation of the tasks and activities completed during the COVID-19 quarantine. The first part of the quarantine was dedicated to finalizing and submitting the revised paper, titled "Software Reuse Cuts Both Ways", to JSS. The second part was dedicated to the EU-funded Real Estate Analytics program. Specifically, we finalized a data warehouse schema consisting of more than 300 tables (facts and dimensions), and began populating it. Throughout both parts, teaching was an always-on task that required great effort, with participation being the largest of the last four years.
Date: 19 June 2020
Presenter: Charalambos Ioannis Mitropoulos
Abstract
We will introduce our recent work on fuzzing Android native libraries, and we will look at how fuzzing can be improved with the use of neural networks, presenting three different approaches proposed in three papers we read during quarantine.
Date: 26 June 2020
Presenter: Diomidis Spinellis
Abstract
GitHub projects can be easily replicated through the site's fork process or through a Git clone-push sequence. This is a problem for empirical software engineering, because it can lead to skewed results or mistrained machine learning models. We provide a dataset of 10.6 million GitHub projects that are copies of others, and link each record with the project's ultimate parent. The ultimate parents were derived from a ranking along six metrics. The related projects were calculated as the connected components of an 18.2 million node and 12 million edge denoised graph created by directing edges to ultimate parents. The graph was created by filtering out more than 30 hand-picked and 2.3 million pattern-matched clumping projects. Projects that introduced unwanted clumping were identified by repeatedly visualizing shortest path distances between unrelated important projects. Our dataset identified 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects. An evaluation of our dataset against another created independently with different methods found a significant overlap, but also differences attributed to the operational definition of what projects are considered as related.
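The grouping step can be pictured with a small union-find sketch: edges point from each copy toward a candidate parent, and connected components become families of copies. The real pipeline adds the ranking and denoising steps described above; the project names here are made up:

```python
from collections import defaultdict

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(parent, a, b):
    parent[find(parent, a)] = find(parent, b)

# Illustrative fork/clone edges: project -> candidate parent
edges = [("alice/lib", "upstream/lib"), ("bob/lib", "alice/lib"),
         ("carol/tool", "upstream/tool")]
parent = {}
for a, b in edges:
    parent.setdefault(a, a)
    parent.setdefault(b, b)
    union(parent, a, b)

# Each connected component is a family of copies sharing an ultimate parent.
families = defaultdict(list)
for p in parent:
    families[find(parent, p)].append(p)
print(dict(families))
```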
Date: 26 June 2020
Presenter: Diomidis Spinellis
Abstract
We present a dataset of open source software developed mainly by enterprises rather than volunteers. This can be used to address known generalizability concerns, and, also, to perform research on open source business software development. Based on the premise that an enterprise's employees are likely to contribute to a project developed by their organization using the email account provided by it, we mine domain names associated with enterprises from open data sources as well as through white- and blacklisting, and use them through three heuristics to identify 17,264 enterprise GitHub projects. We provide these as a dataset detailing their provenance and properties. A manual evaluation of a dataset sample shows an identification accuracy of 89%. Through an exploratory data analysis we found that projects are staffed by a plurality of enterprise insiders, who appear to be pulling more than their weight, and that in a small percentage of relatively large projects development happens exclusively through enterprise insiders.
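A toy version of the email-domain heuristic might look as follows; the domain list, commit data, and threshold are all illustrative assumptions, not the paper's exact rules:

```python
# Toy version of the email-domain heuristic described above.
ENTERPRISE_DOMAINS = {"example-corp.com"}  # assumed mined domain list

commits = [  # author emails of a project's commits; illustrative data
    "alice@example-corp.com", "bob@example-corp.com", "carol@gmail.com",
]

def is_enterprise_project(emails, threshold=0.5):
    """Flag a project when insiders author most commits (threshold assumed)."""
    insiders = sum(e.split("@")[1] in ENTERPRISE_DOMAINS for e in emails)
    return insiders / len(emails) >= threshold

print(is_enterprise_project(commits))  # -> True
```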
Date: 26 June 2020
Presenter: Thodoris Sotiropoulos
Abstract
Puppet is a popular computer system configuration management tool. By providing abstractions that model system resources it allows administrators to set up computer systems in a reliable, predictable, and documented fashion. Its use suffers from two potential pitfalls. First, if ordering constraints are not correctly specified whenever a Puppet resource depends on another, the non-deterministic application of resources can lead to race conditions and consequent failures. Second, if a service is not tied to its resources (through the notification construct), the system may operate in a stale state whenever a resource gets modified. Such faults can degrade a computing infrastructure's availability and functionality.
We have developed an approach that identifies these issues through the analysis of a Puppet program and its system call trace. Specifically, a formal model for traces allows us to capture the interactions of Puppet resources with the file system. By analyzing these interactions we identify (1) resources that are related to each other (e.g., operate on the same file), and (2) resources that should act as notifiers so that changes are correctly propagated. We then check the relationships from the trace's analysis against the program's dependency graph: a representation containing all the ordering constraints and notifications declared in the program. If a mismatch is detected, our system reports a potential fault.
We have evaluated our method on a large set of popular Puppet modules, and discovered 92 previously unknown issues in 33 modules. Performance benchmarking shows that our approach can analyze in seconds real-world configurations with a magnitude measured in thousands of lines and millions of system calls.
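A toy sketch of the mismatch check: derive related resources from the file-system paths they touch in the trace, and report pairs that lack a declared dependency or notification. All data here is illustrative, not the actual analysis implementation:

```python
from itertools import combinations

# (resource, path) pairs observed in the system call trace (illustrative).
trace = [("File[conf]", "/etc/app.conf"), ("Service[app]", "/etc/app.conf"),
         ("Package[app]", "/usr/bin/app")]

# Ordering/notification edges declared in the Puppet program (illustrative).
declared = {("Package[app]", "Service[app]")}

# Resources touching the same file should be related in the program.
by_path = {}
for resource, path in trace:
    by_path.setdefault(path, set()).add(resource)

for path, resources in by_path.items():
    for a, b in combinations(sorted(resources), 2):
        if (a, b) not in declared and (b, a) not in declared:
            print(f"Potential fault: {a} and {b} both use {path}, "
                  f"but no dependency/notification is declared")
```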
Date: 03 July 2020
Presenter: Marios Papachristou
Abstract
In this seminar, we are going to talk about how one can infer the interests (e.g. hobbies) of users in online social networks using information from highly influential users of the network. More specifically, we experimentally observe that the majority of the network users (>70%) is dominated by a sublinear fraction of highly influential nodes (core nodes). This structural property of networks is also known as the "core-periphery" structure, a phenomenon long studied in economics and sociology.
Using the influencers' initial opinions as steady-state trend-setters, we develop a generative model that explains how the users' interests (opinions) evolve over time, where each peripheral user looks at her k nearest neighbors. Our model has strong theoretical and experimental guarantees, surpasses node embedding methods and related opinion dynamics methods, and scales to networks with millions of nodes.
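A minimal numpy sketch of the averaging dynamics as described, with illustrative sizes, positions, and k; this is a simplification of the generative model, in which opinions also evolve over time:

```python
import numpy as np

rng = np.random.default_rng(0)
n_core, n_periph, k = 5, 20, 3               # illustrative sizes and k
core_pos = rng.normal(size=(n_core, 2))      # core ("influencer") nodes
core_opinion = rng.uniform(-1, 1, n_core)    # fixed, steady-state opinions
per_pos = rng.normal(size=(n_periph, 2))     # peripheral users

# Each peripheral user adopts the mean opinion of her k nearest core nodes.
# (One pass suffices here because core opinions are fixed; the full model
# evolves interests over time.)
per_opinion = np.empty(n_periph)
for i in range(n_periph):
    dist = np.linalg.norm(core_pos - per_pos[i], axis=1)
    per_opinion[i] = core_opinion[np.argsort(dist)[:k]].mean()

print(per_opinion.round(2))
```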
Duration: 30-40 minutes.
Joint work with D. Fotakis (NTUA).
Date: 11 September 2020
Presenter: Georgios Theodorou
Abstract
Pawk is an extension to GoAwk designed with efficiency in mind. It achieves considerably higher performance than both standard Awk and GoAwk. The two reasons behind the significant speed boost offered by Pawk are the use of multi-threaded programming and the choice of Golang as the implementation language. Since Pawk makes use of parallel programming, it is naturally restricted to operations that can be executed in parallel. However, we believe that Pawk can come in handy in a plethora of cases, especially when dealing with multi-GB files.
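The parallel pattern Pawk exploits can be sketched in Python: split the input into chunks, process them concurrently, and merge the partial results, which is exactly why only associative, order-insensitive operations qualify. The file name is assumed:

```python
from multiprocessing import Pool

def count_words(lines):
    """Per-chunk work, analogous to an Awk action run on a slice of input."""
    return sum(len(line.split()) for line in lines)

def parallel_wc(path, workers=4):
    with open(path) as f:
        lines = f.readlines()
    step = max(1, len(lines) // workers)
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    with Pool(workers) as pool:
        # Word counts are associative, so partial results merge by summing.
        return sum(pool.map(count_words, chunks))

if __name__ == "__main__":
    print(parallel_wc("input.txt"))  # assumes an input.txt exists
```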
Date: 25 September 2020
Presenter: Dimitris Mitropoulos
Abstract
GRNET CERT (Computer Emergency Response Team) provides incident response and security services both to the National Infrastructures for Research and Technology (GRNET) and to all Greek academic and research institutions. To do so, it employs open-source software (OSS) and approaches proposed by the academic community. In this talk we will discuss how GRNET CERT uses OSS to provide early warnings and alerts to its members and relevant organizations regarding risks and incidents. Furthermore, we will discuss how the team utilizes program analysis methods to assist the security audits it performs.
Date: 16 October 2020
Presenter: Konstantina Dritsa
Abstract
How different are search engines? The search engine wars are a favorite topic of on-line analysts, as two of the biggest companies in the world, Google and Microsoft, battle for prevalence of the web search space. Differences in search engine popularity can be explained by their effectiveness or other factors, such as familiarity with the most popular engine, peer imitation, or force of habit. In this work we present a thorough analysis of the affinity of the two major search engines, Google and Bing, along with DuckDuckGo, which goes to great lengths to emphasize its privacy-friendly credentials. To do so, we collected search results using a comprehensive set of 300 unique queries for two time periods in 2016 and 2019, and developed a new similarity metric that leverages both the content and the ranking of search responses. We evaluated the characteristics of the metric against other metrics and approaches proposed in the literature, and used it to (1) investigate the similarity of search engine results, (2) trace the evolution of their affinity over time, (3) determine which aspects of the results influence similarity, and (4) examine how the metric behaves over different kinds of search services. We found that Google stands apart, but Bing and DuckDuckGo are largely indistinguishable from each other.
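As a toy illustration of combining content and ranking (not the metric developed in this work), shared results can be weighted by how high and how similarly they rank in the two response lists:

```python
def rank_weighted_similarity(results_a, results_b):
    """Toy similarity: shared URLs score higher when both engines rank
    them similarly and near the top. Illustrates combining content and
    ranking; it is not the metric developed in the paper."""
    pos_a = {url: i for i, url in enumerate(results_a)}
    pos_b = {url: i for i, url in enumerate(results_b)}
    score = sum(
        1 / (1 + abs(pos_a[u] - pos_b[u])) / (1 + min(pos_a[u], pos_b[u]))
        for u in pos_a.keys() & pos_b.keys())
    # Normalize by the best possible score (identical result lists).
    best = sum(1 / (1 + i) for i in range(min(len(results_a), len(results_b))))
    return score / best if best else 0.0

a = ["u1", "u2", "u3"]
b = ["u1", "u3", "u4"]
print(round(rank_weighted_similarity(a, b), 3))
```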
Date: 23 October 2020
Presenter: Thodoris Sotiropoulos
Abstract
Incremental and parallel builds are crucial features of modern build systems. Parallelism enables fast builds by running independent tasks simultaneously, while incrementality saves time and computing resources by processing only the build operations affected by a particular code change. Writing build definitions that lead to error-free incremental and parallel builds is a challenging task. This is mainly because developers are often unable to predict the effects of build operations on the file system and how different build operations interact with each other. Faulty build scripts may seriously degrade the reliability of automated builds, as they cause build failures, and non-deterministic and incorrect outputs.
To reason about arbitrary build executions, we present BuildFS, a generally applicable model that takes into account the specification (as declared in build scripts) and the actual behavior (low-level file system operations) of build operations. We then formally define different types of faults related to incremental and parallel builds in terms of the conditions under which a file system operation violates the specification of a build operation. Our testing approach, which relies on the proposed model, analyzes the execution of a single full build, translates it into BuildFS, and uncovers faults by checking for corresponding violations.
We evaluate the effectiveness, efficiency, and applicability of our approach by examining 612 Make and Gradle projects. Notably, thanks to our treatment of build executions, our method is the first to handle JVM-oriented build systems. The results indicate that our approach is (1) able to uncover several important issues (247 issues found in 47 open-source projects have been confirmed and fixed by the upstream developers), and (2) much faster than a state-of-the-art tool for Make builds (the median and average speedup is 39X and 74X respectively).
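The essence of such a violation check can be sketched as comparing each task's declared inputs against the files its operations actually read during the build; the build definitions and observed operations below are illustrative:

```python
# Toy check in the spirit of BuildFS: declared task inputs vs. observed reads.
tasks = {  # task -> declared input files (illustrative build definitions)
    "compile": {"main.c"},
    "link": {"main.o"},
}
observed_reads = {  # task -> files actually read during a full build
    "compile": {"main.c", "config.h"},   # config.h is not declared
    "link": {"main.o"},
}

for task, declared in tasks.items():
    for path in observed_reads[task] - declared:
        # A change to this file would not re-trigger the task, so an
        # incremental build could produce a stale output.
        print(f"Fault: task '{task}' reads undeclared input {path}")
```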
Date: 13 November 2020
Presenter: Uri Goldshtein
Abstract
GraphQL is an open-source data query and manipulation language for APIs, and a runtime for fulfilling queries with existing data. I'll talk a bit about GraphQL in general, then about the ideas behind GraphQL-Mesh, and then about future directions, such as GraphQL and the semantic web, where there is a lot of room for exploration.
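For readers unfamiliar with GraphQL, a query is a shaped selection of fields posted over HTTP; the endpoint and schema below are hypothetical:

```python
import json
import urllib.request

# Hypothetical GraphQL endpoint and schema, for illustration only.
query = """
{
  user(id: "42") {
    name
    repositories { name stars }
  }
}
"""
req = urllib.request.Request(
    "https://api.example.com/graphql",          # hypothetical endpoint
    data=json.dumps({"query": query}).encode(),
    headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    # The server returns exactly the fields the query selected.
    print(json.load(resp)["data"])
```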
Uri is the founder of The Guild, a group of open-source developers mostly focused on GraphQL. He was one of the authors of the GraphQL Subscriptions spec. The Guild works with companies around the world, helping them with their API technologies and improving the open-source libraries while doing it.
Date: 27 November 2020
Presenter: Anders Møller, Benjamin Barslev Nielsen, Martin Toldam Torp
Abstract
JavaScript libraries are widely used and evolve rapidly. When adapting client code to non-backwards compatible changes in libraries, a major challenge is how to locate affected API uses in client code, which is currently a difficult manual task. In this paper we address this challenge by introducing a simple pattern language for expressing API access points and a pattern-matching tool based on lightweight static analysis.
Experimental evaluation on 15 popular npm packages shows that typical breaking changes are easy to express as patterns. Running the static analysis on 265 clients of these packages shows that it is accurate and efficient: it reveals usages of breaking APIs with only 14% false positives and no false negatives, and takes less than a second per client on average. In addition, the analysis is able to report its confidence, which makes it easier to identify the false positives. These results suggest that the approach, despite its simplicity, can reduce the manual effort of the client developers.
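As a much cruder stand-in for the paper's lightweight static analysis, the toy matcher below finds uses of an API access point such as lib.foo with regular expressions; it only illustrates the idea of matching access points in client code:

```python
import re

# Toy detector for uses of a breaking API, e.g. the access path "lib.foo".
client_code = """
const lib = require('lib');
lib.foo(1);
lib.bar(2);
"""

def find_api_uses(source, module, member):
    # Find the local alias bound to require('<module>').
    alias = re.search(
        r"(?:const|let|var)\s+(\w+)\s*=\s*require\(['\"]"
        + re.escape(module) + r"['\"]\)", source)
    if not alias:
        return []
    name = alias.group(1)
    # Match call sites of <alias>.<member>(...).
    return re.findall(
        re.escape(name) + r"\." + re.escape(member) + r"\s*\(", source)

print(find_api_uses(client_code, "lib", "foo"))  # -> ['lib.foo(']
```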
This is a paper presented at this year's OOPSLA (Object-Oriented Programming, Systems, Languages & Applications) conference and recommended for the seminar by Thodoris Sotiropoulos. We will watch the talk's video and then discuss the method, the results, and the paper.
Date: 18 December 2020
Presenter: Orestis Polychroniou
Abstract
Query execution engines for analytics are continuously adapting to the underlying hardware in order to maximize performance. Wider SIMD registers and more complex SIMD instruction sets are emerging in mainstream CPUs and new processor designs such as the many-core Intel Xeon Phi CPUs that rely on SIMD vectorization to achieve high performance per core while packing a greater number of smaller cores per chip. In the database literature, using SIMD to optimize stand-alone operators with key–rid pairs is common, yet the state-of-the-art query engines rely on compilation of tightly coupled operators where hand-optimized individual operators become impractical. In this article, we extend a state-of-the-art analytical query engine design by combining code generation and operator pipelining with SIMD vectorization, and show that the SIMD speedup is diminished when execution is dominated by random memory accesses. To better utilize the hardware features, we introduce VIP, an analytical query engine designed and built bottom-up from precompiled column-oriented data-parallel sub-operators and implemented entirely in SIMD. In our evaluation using synthetic and TPC-H queries on a many-core CPU we show that VIP outperforms hand-optimized query-specific code without incurring the runtime compilation overhead, and highlight the efficiency of VIP at utilizing the hardware features of many-core CPUs.
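A rough Python analogy for the data-parallel sub-operator style, with numpy's internal vectorization standing in for explicit SIMD: the same selection runs as a whole-column operation instead of a row-at-a-time loop:

```python
import time
import numpy as np

n = 1_000_000
prices = np.random.default_rng(0).uniform(0, 100, n)

# Row-at-a-time (interpreter-bound) selection.
t0 = time.perf_counter()
count_scalar = sum(1 for p in prices if p < 24.0)
t1 = time.perf_counter()

# Column-at-a-time, data-parallel selection (numpy stands in for SIMD).
count_vector = int((prices < 24.0).sum())
t2 = time.perf_counter()

assert count_scalar == count_vector
print(f"scalar {t1 - t0:.3f}s vs vectorized {t2 - t1:.3f}s")
```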
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that makes it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. Since launching in February 2013, it has been Amazon Web Services' (AWS) fastest-growing service, with many thousands of customers and many petabytes of data under management. Amazon Redshift's pace of adoption has been a surprise to many participants in the data warehousing community. While Amazon Redshift was priced disruptively at launch, available for as little as $1000/TB/year, there are many open-source data warehousing technologies and many commercial data warehousing engines that provide free editions for development or under some usage limit. While Amazon Redshift provides a modern MPP, columnar, scale-out architecture, so too do many other data warehousing engines. And, while Amazon Redshift is available in the AWS cloud, one can build data warehouses using EC2 instances and the database engine of one's choice with either local or network-attached storage.
Orestis Polychroniou is a senior software engineer at Amazon Web Services working on the core query execution performance of Amazon Redshift. Before joining Amazon he did his PhD with Kenneth A. Ross at Columbia University on modern databases, with his thesis focusing on analytical query execution optimized for all layers of modern hardware. His research has ~500 citations.