How CISOs Can Enable Productization of Valuable Data Assets

Machine learning is being adopted in industry at lightning speed. As a result, data has rapidly become one of the most valuable assets an organization can own. However, in use cases where regulatory compliance and data privacy are paramount, unlocking the full potential of that data raises unique challenges.

Why productize data now?

A machine learning model is only as good as the data it is trained on. Developing machine learning systems capable of powering extraordinary breakthroughs requires data of the right quantity and quality to be readily available. A decade ago, the prevailing approach was theory-driven modeling, in which models were built from expert knowledge and predefined rules. Today, we find ourselves in a new era dominated by data-driven approaches: driven by the availability of data and compute, companies can now harness the power of large models trained on large datasets.

As a result, we are witnessing a paradigm shift in which commoditization is enabling greater innovation. In the past, companies struggled with small, bespoke models trained on limited datasets, yielding suboptimal results. Now, companies are leveraging increasingly commoditized foundation models, pre-trained on extensive datasets, and adapting them to their own proprietary data. This approach yields significantly more accurate models, exemplified by generative AI, which is currently applied predominantly to text, with image and time-series applications not far behind.
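
As a rough sketch of what this adaptation step can look like in practice, the code below freezes a stand-in pre-trained backbone and trains only a small task-specific head on proprietary data. The model, dimensions, and data here are hypothetical placeholders, not any specific vendor's API.

```python
import torch
import torch.nn as nn

# Stand-in for a commoditized, pre-trained foundation model (hypothetical).
backbone = nn.Sequential(nn.Linear(512, 256), nn.ReLU())
for param in backbone.parameters():
    param.requires_grad = False  # keep the commoditized weights fixed

# Small task-specific head, trained on the company's proprietary data.
head = nn.Linear(256, 2)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fine_tune_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One adaptation step on a batch of proprietary data."""
    with torch.no_grad():
        features = backbone(x)  # reuse the pre-trained representation
    logits = head(features)
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative batch: 8 examples with 512 features and binary labels.
print(fine_tune_step(torch.randn(8, 512), torch.randint(0, 2, (8,))))
```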

By addressing the long-standing challenge of data scarcity, this approach empowers companies to unlock the full potential of their proprietary datasets. Some industries have already leveraged these benefits of AI with huge success, for tasks such as internet search, digital personal assistants, and targeted advertising.

What are the challenges?

However, for use cases involving sensitive data such as personally identifiable information (PII), companies must take a more nuanced approach. This data, due to its inherently sensitive nature, is unsuitable for driving commercial products without appropriate safeguards in place. And where datasets are unique to an enterprise, protecting the valuable intellectual property they encode becomes imperative.

Highly regulated industries, while acknowledging AI’s benefits, must exercise caution due to the extreme sensitivity of their data. In healthcare and similarly data-sensitive sectors such as finance and the public sector, the presence of sensitive data introduces constraints that limit an organization’s capacity to productize it.

In healthcare, where collaboration among numerous independent organizations is essential, navigating the regulations and geographical boundaries that govern data access and sharing is the norm. Whether complying with GDPR in Europe or with healthcare data restrictions in other geographies, such as HIPAA in the US, data custodians are bound by region-specific regulations that can, in some cases, simply prohibit the sharing of data.

Machine learning will inevitably be embedded into digital products. But these efforts are often held back by Information Security (InfoSec) and Data Protection teams, who must strike a balance between providing access to data and ensuring sufficient governance.

How can these challenges be met?

When seeking to make the most of sensitive data in regulated industries, there are three main principles that both business leaders and CISOs should consider:

  1. Bring the compute to the data

Increasing regulation, along with practical considerations such as security and cost, dictates that data, and in particular sensitive data, shouldn’t move. A better approach is therefore to bring the compute to the data rather than bringing the data to the compute. By avoiding the centralization of data, organizations can more easily comply with data-residency requirements, and decentralized compute paradigms additionally help safeguard privacy and IP. This also makes it easier to ensure compliance over both the data and the models trained on it.
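
A minimal sketch of this pattern, assuming a hypothetical DataSilo abstraction: the computation is shipped to each silo, and only aggregate results ever leave.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataSilo:
    name: str
    records: list  # raw data; never leaves this object in this pattern

    def run(self, computation: Callable[[list], float]) -> float:
        """Execute an approved computation locally; return only the result."""
        return computation(self.records)

def mean_age(records: list) -> float:
    """An example aggregate computation on synthetic records."""
    return sum(r["age"] for r in records) / len(records)

silos = [
    DataSilo("hospital_eu", [{"age": 34}, {"age": 51}]),
    DataSilo("hospital_us", [{"age": 47}, {"age": 29}, {"age": 62}]),
]

# The computation travels; the raw records stay put.
results = [silo.run(mean_age) for silo in silos]
print(results)  # per-silo aggregates, not record-level data
```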

  2. Consider federated learning when working with distributed sensitive data

Even when data residency is strictly adhered to, the data within any one pool may not be adequate to build a sufficiently accurate model. Ideally, one would increase the size of the data pool without jeopardizing privacy or security. One solution is to train a model across distributed data repositories without moving or sharing raw data, a practice referred to as distributed machine learning. Federated learning, originally proposed by Google, is emerging as a key distributed machine learning solution. Although originally proposed for mobile use cases, federated learning is gaining traction in markets with access to other types of fleets of data, such as healthcare, industrial equipment operations, and finance.
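
The sketch below illustrates the core federated averaging (FedAvg) loop under simplified assumptions: a linear model, synthetic client data, and plain gradient descent. The point to note is that only model weights cross organizational boundaries; raw data never does.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Gradient-descent steps on one client's private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):  # three data custodians, e.g. hospitals
    X = rng.normal(size=(50, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=50)
    clients.append((X, y))

global_w = np.zeros(2)
for _ in range(10):  # each round: broadcast, train locally, average
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients])
    # Weight each client's update by its dataset size, as in FedAvg.
    global_w = np.average(local_ws, axis=0, weights=sizes)

print(global_w)  # approaches true_w without raw data leaving any client
```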

  3. Look to techno-regulation as a governance solution

Many enterprise use cases are bound by regulatory complexities, better known as ‘red tape’. This regulatory landscape can differ significantly by geographic location and may even vary within the same organization. Techno-regulation refers to influencing behavior by embedding values and rules in the technology itself. It allows data custodians to apply computational governance via a technology solution, giving organizations the means to efficiently enforce ever-changing, location-specific regulations.

This can also be applied at a granular level where audit and regulation demand it. Technology solutions can enable data owners to exercise control over which computations are permitted on which data, without substantially slowing the pace of innovation.
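
As a simple illustration, computational governance can start with a machine-readable policy table that is checked before any computation runs. The datasets and computation names below are purely illustrative.

```python
# Dataset -> computations its custodian has approved (illustrative policies).
POLICIES = {
    "eu_patient_records": {"aggregate_statistics", "federated_training"},
    "us_claims_data": {"aggregate_statistics"},
}

def authorize(dataset: str, computation: str) -> bool:
    """Return True only if the custodian's policy permits this computation."""
    return computation in POLICIES.get(dataset, set())

def run_computation(dataset: str, computation: str) -> None:
    """Enforce the policy, then (in a real system) log and execute."""
    if not authorize(dataset, computation):
        raise PermissionError(
            f"{computation!r} is not permitted on {dataset!r} by policy"
        )
    print(f"Running {computation} on {dataset}")

run_computation("eu_patient_records", "federated_training")  # allowed
try:
    run_computation("us_claims_data", "federated_training")  # not approved
except PermissionError as err:
    print(err)
```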

Conclusion

In the current landscape, enterprises grapple with the complexities of sensitive data: regulatory compliance, data scarcity, and the imperative to safeguard privacy. However, a path forward exists for product teams looking to productize data assets.

To enable the productization of data assets within their organizations, CISOs must first ensure that data residency is in alignment with regulation. Federated learning then offers an elegant solution to the problems of data scarcity, data residency, and safeguarding privacy. Finally, using technology solutions to meet regulatory requirements can dramatically improve the pace of innovation, even when working with sensitive data.

By remaining compliant and embracing technological advancements, businesses can navigate the intricate data landscape, fostering cross-border collaboration and unlocking the true potential of data.

About the Author

Ellie Dobson is VP Product at Apheris, with a rich career spanning various industries. Prior to this, she held key leadership positions at Graphcore and Arundo Analytics, Inc., leveraging her expertise in product management and data science. With academic roots at the University of Oxford, Ellie holds an MPhys and a DPhil in Elementary Particle Physics. Her career journey, from Research Fellow at CERN to leading roles in tech companies, reflects her commitment to innovation and leadership. With an extensive background in technology and data science, Ellie is a distinguished leader in the field.
