
Interview: Felix Dorrek, Ph.D.


I recently caught up with Felix Dorrek to discuss his research in the area of deep generative networks. The interview touches on the technology's power, its practical applications, its importance to business in 2020, and its use for generating synthetic data. Felix holds a Ph.D. in mathematics from the Vienna Technical University, where he conducted research in the field of convex geometry. In 2018 he joined the Austrian tech startup Mostly AI, where he researched and developed deep generative models and their applications to data privacy.

insideBIGDATA: Can you give us a quick overview of your area of research: deep generative networks and their importance to business in 2020?

Felix Dorrek: Deep generative models have gained a lot of prominence in recent years, particularly for generating highly realistic imitation images of human faces. A lesser-known fact is that similar technology can be leveraged to model structured business data. This has several applications with enormous potential, such as leveraging synthetic data for data-privacy purposes. The key idea is to use generative AI to create synthetic datasets that retain the structure of the privacy-sensitive source data while not leaking any private information about customers. When I started working on this at Mostly AI, a Vienna-based startup pioneering this technology, it wasn't much more than a novel concept. More recently, however, several banks, including Capital One, and some insurance companies have established research efforts and are actively using this technology to strike a better trade-off between the ever-increasing demand for data privacy and the need for data utility. Another leading example is the Alphabet spinoff Replica, a company using synthetic data for urban planning.

insideBIGDATA: Why are deep generative networks being revealed as a more powerful form of classical statistical modeling?

Felix Dorrek: Classical statistical methods use only a handful of parameters to describe a particular dataset, making them overly simplistic. Deep learning models, on the other hand, allow one to model with millions, or even billions, of parameters, provided there is sufficient data to learn from. This difference is a game-changer, as it allows detailed modeling of highly complex datasets. A relevant example is the behavior patterns of a city's population, possibly based on people's mobile phone locations. In this instance, deep learning makes it possible, for the first time, to create a synthetic population that behaves almost identically to the real one. Using a synthetic dataset unlocks this data for many scenarios, which is especially useful in situations where privacy would otherwise have been a blocking issue (e.g. easy access for city planners).

insideBIGDATA: Please describe your work in developing AI systems for generating synthetic data sets that are completely anonymous as well as their value.

Felix Dorrek: Modeling complex business datasets with deep learning poses unique challenges. A key challenge is the marriage of 'hard' business rules (e.g. opening hours of stores) with the 'softness' of statistical modeling. Through my work, I have created solutions to overcome these types of problems and developed innovative methods to incorporate business rules into deep generative networks. Another challenge my work tries to solve is how to incorporate privacy-preserving mechanisms into these models. A particular focus of my work here is differential privacy, a mathematical framework for privacy that is actively used by companies such as Apple.
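To make the idea of differential privacy more concrete, here is a minimal sketch of the Laplace mechanism, one of the simplest differentially private building blocks. It is an illustration only and not the method described in the interview, which concerns privacy mechanisms inside deep generative models. The function names, the example ages, and the choice of epsilon are all assumptions made for this sketch.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent Exponential(1) draws follows a
    # Laplace(0, 1) distribution; multiplying by `scale` rescales it.
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values matching `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    ages = [23, 35, 41, 58, 62, 19, 44]  # hypothetical customer ages
    noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
    print(f"noisy count of customers aged 40+: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy, while a larger epsilon gives a more accurate answer with weaker privacy. The standard-library random generator is used here for simplicity; a real deployment would need a cryptographically secure noise source.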

insideBIGDATA: As the quantity and complexity of data grows do you believe there will be more demand for realistic, but anonymized, data sets?

Felix Dorrek: Privacy is without a doubt becoming increasingly important. You only need to look at the legislation that has been passed in recent years, such as GDPR in Europe. However, the increasing detail of data collected on individuals renders classical anonymization techniques almost useless. Realistic anonymous synthetic datasets are one of a few emerging technologies that provide a solution to this problem. Other approaches include technologies like homomorphic encryption and federated learning. Ultimately, all of these approaches have their strengths and weaknesses, meaning we will need a combination of them to solve the privacy issue. It is safe to say that realistic synthetic data will see a huge increase in demand in the coming years.

insideBIGDATA: What are some practical applications of deep generative networks?

Felix Dorrek: In addition to the data-privacy applications already discussed, there are several others. One important application lies in creating training data for other machine-learning models. If a generative model is designed well, it can be leveraged to create highly realistic data for edge cases, where real data coverage is sparse.

An example of this would be a system that is tasked with flagging fraudulent credit card transactions. To make this system work well, it needs many examples of fraudulent transactions, which are not that common. Therefore, in this case, AI-generated data can be very valuable.
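The fraud example can be sketched in a few lines. The code below is a toy stand-in, not a deep generative model: it fits an independent Gaussian to each feature of the rare fraud class and samples extra synthetic rows from it. The feature layout (amount, hour of day), the sample values, and the function names are assumptions made purely for illustration.

```python
import random
import statistics


def fit_gaussian(rows):
    """Estimate a (mean, standard deviation) pair for each feature column."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]


def sample_synthetic(params, n, rng):
    """Draw n synthetic rows, sampling each feature independently."""
    return [tuple(rng.gauss(mu, sigma) for mu, sigma in params) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    # Hypothetical fraudulent transactions: (amount, hour of day)
    fraud = [(950.0, 2.0), (1200.0, 3.0), (870.0, 1.0), (1430.0, 4.0)]
    params = fit_gaussian(fraud)
    synthetic = sample_synthetic(params, 100, rng)
    print(f"generated {len(synthetic)} synthetic fraud rows")
```

This per-feature Gaussian is exactly the kind of model with only a handful of parameters: it ignores correlations between features and can produce implausible values, such as a negative hour. Capturing that structure is what a well-designed deep generative model is meant to do.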
