This is the fifth article in our “AI 101” series, where the team at Lewis Silkin will unravel the legal issues involved in the development and use of AI text and image generation tools. In the previous article of the series, we considered the regulatory framework for AI being proposed by the European Commission, and its impact. In this article we consider the key data protection issues with these new tools and what you need to think about when you use them in order to remain compliant.


With the exponential growth of AI tools, AI continues to spread steadily into all parts of our lives, amassing more data as it grows. Examples include generative tools such as ChatGPT and DALL-E, eBay’s Image Processing feature that allows users to remove distracting backgrounds in product photos, DeepMind’s AlphaFold revolutionising medical research and bioengineering by predicting protein folds, and Amazon Go streamlining our in-person grocery shopping. Is this a data protection headache in waiting, or an opportunity to place data protection front and centre?

While many are delighted with the world of possibilities opened up by this new tech, some are calling it a tipping point, a wake-up call, or even the last opportunity to have some element of control over your personal data.

As the previous articles in our AI 101 series have discussed, these AI tools rely on vast data sets, requiring mind-boggling amounts of data not only to operate but also to continuously learn and improve. Looking just at our homes by way of a simple example: 70% of UK residents use a voice assistant every month, there are over 2.2 million smart homes in the UK, and the UK’s ‘smart home industry’ is worth approximately £7 billion per annum. With voice assistants such as Alexa offering ever more features, it’s no surprise that they collect ever more data: records of communication requests, voice recordings from smart assistant interactions, purchase histories and shopping habits, payment information and live location, to name but a few categories.

So, for these AI tools to work, where is all this data coming from? In short, data is being sourced from all over the place. For example, tools such as ChatGPT rely heavily on data that is scraped from the internet. Tools such as Linkfluence, which look at customer insights, trends and patterns, ingest data from various customer touchpoints including live chats, social media and purchase histories. Tools focused on improving workplace efficiencies will likely rely on data collected about employee behaviours. Does this raise any data protection concerns? Yes, you bet it does. It is very likely that these data sources will include personal data, as well as potentially special category personal data and even data about minors. This throws up numerous questions around how this data can be lawfully processed by the tool: What is the legal basis for processing the data? Who is the controller and the processor? How does the AI tool comply with data protection legislation? Lots of different views are emerging, and some of the key issues can be highlighted by looking at the UK GDPR/GDPR data protection principles in turn.

Data Protection Principles

1.  Fair, Lawful, and Transparent

According to the ICO’s guidance on AI and data protection, as AI systems process personal data in different ways for different purposes “you must break down and separate each distinct processing operation, and identify the purpose and an appropriate lawful basis for each one, in order to comply with the principle of lawfulness”.

The ICO guidance is clear it is “your responsibility to decide which lawful basis applies to your processing”, that this decision should be made before the processing is started, your decision should be documented and for special category data you need “both a lawful basis and an additional condition for processing”.

This assessment can get quite complicated. Many tend to rely on ‘legitimate interests’ as a catch-all and adaptable lawful basis for their processing, but ultimately the availability and appropriateness of legitimate interests depends on the purpose of the AI tool and processing in question, as well as the types of personal data being processed. Legitimate Interest Assessments (LIAs) are essential in assessing whether this is the correct lawful basis, and must always be carried out in advance of the data processing.

Deciding on the appropriate lawful basis becomes even more complicated where the processing or system in question entails any automated decision making or profiling (as many AI systems will indeed do). Data subjects have a right not to be subject to decisions based solely on automated processing that go on to affect them significantly. Obtaining individuals’ explicit consent is one way to overcome this hurdle, but ensuring that consent is (and remains) ‘valid’ under the UK GDPR/GDPR can be a complicated task in and of itself, and requires careful thought about how to implement “opt-ins” for such processing, and how to deliver adequate transparency information to individuals. Likewise, use of special category data in an AI tool will also require an additional processing condition or lawful basis (again likely to be explicit consent).

The context of the intended processing can introduce further considerations; for example, where you’re looking to introduce an AI tool in the workplace for monitoring employees, such as Aware (which transforms digital conversation data from Slack, Teams, Zoom and more into real-time insights), if you are in the EU you will likely need to consider involving works councils, whose requirements vary depending on which jurisdiction you are in.

Moving on to fairness, if you use AI to infer data about people, you must ensure:

  • the system is sufficiently statistically accurate and avoids discrimination; and
  • you consider the impact of individuals’ reasonable expectations.

Unsurprisingly, the potential for bias and discrimination is one of the key issues faced by AI developers, as AI systems trained on data that reflects human biases or historical inequalities will learn and implement those same patterns. Over in the US, a Department of Commerce study found that facial recognition AI often misidentifies people of colour, whilst Amazon found that its AI recruiting algorithm was biased against women, as it was based on the resumes submitted over the past 10 years and the consequent hires, most of whom were men.
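By way of illustration, one simple way to start testing whether a system is “sufficiently statistically accurate and avoids discrimination” is to compare its outcomes across groups. The sketch below is a minimal example only: the function name `rates_by_group`, the group labels and the sample rows are all invented for illustration, and a real fairness assessment would use more data and more than one metric.

```python
from collections import defaultdict

def rates_by_group(rows):
    """Compute selection rate and accuracy per group.

    rows: iterable of (group, predicted_positive, actually_positive) tuples.
    """
    stats = defaultdict(lambda: {"n": 0, "selected": 0, "correct": 0})
    for group, predicted, actual in rows:
        s = stats[group]
        s["n"] += 1
        s["selected"] += int(predicted)            # how often the tool says "yes"
        s["correct"] += int(predicted == actual)   # how often the tool is right
    return {
        g: {
            "selection_rate": s["selected"] / s["n"],
            "accuracy": s["correct"] / s["n"],
        }
        for g, s in stats.items()
    }

# Invented sample data: (group, tool said "shortlist", candidate was suitable)
rows = [
    ("A", True, True), ("A", True, False), ("A", False, False), ("A", True, True),
    ("B", False, True), ("B", False, False), ("B", True, True), ("B", False, True),
]

report = rates_by_group(rows)
# Gap in selection rates between the two groups; a large gap is a prompt
# for further investigation, not proof of unlawful discrimination.
selection_gap = abs(report["A"]["selection_rate"] - report["B"]["selection_rate"])
```

In this made-up sample the tool shortlists group A far more often than group B and is also less accurate for group B, which is the kind of pattern that should be investigated and documented before deployment.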

A large part of the solution to this problem must come during the design and development process; data scientists need to be educated on responsible AI and how an organisation’s values should be embedded into the model itself, and organisations should follow a standardised production framework (taking into account applicable legislation such as the EU’s Artificial Intelligence Act). Synthetic data should also be considered when training AI systems. Later down the production line, organisations should focus on increasing transparency into how their AI systems arrive at decisions.

And what of transparency in telling data subjects what you will do with their data? We know the perils of getting this wrong: we all remember the Irish Data Protection Commission’s €225 million WhatsApp fine or the more recent €390 million Meta fine. However, trying to explain in easy-to-understand terms how complicated AI algorithms work is no easy feat. So how do you comply? Well, there is further guidance the ICO produced in conjunction with the Alan Turing Institute, entitled Explaining decisions made by AI, which should help, although a quick internet search for explainability statements shows they are still largely conceptual or aspirational rather than widely adopted in practice, meaning there’s not much yet by way of example.

2.  Purpose Limitation

This is where a major philosophical difference between the aims of data protection legislation and the aim of AI becomes clear. Article 5(1)(b) of the UK GDPR/GDPR states personal data shall be:

“collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes;”

(unless, of course, you obtain consent for another purpose), while AI tools and their language models use the data for any purpose, and indeed a range of purposes. How are we going to square this circle?

3.  Data Minimisation

How does a vast (and possibly expanding) data set comply with the data minimisation principle? The ICO guidance on this point says “the key is that you only process the personal data you need for your purpose”.

Determining what is “adequate, relevant and limited” is key to complying with this principle, and robust risk management practices from concept to delivery are essential, e.g. if you are building the AI tool, data minimisation should be considered from the design phase and all the way through to deployment, or if you are purchasing AI tools or they are provided by third parties, data minimisation should be high up on your procurement due diligence. To address these issues in practice, companies might also look to data masking (i.e. modifying sensitive data in such a way that it is of little value to unauthorised users, while still being usable by (and of value to) authorised personnel or software), or full data anonymisation.
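To make data minimisation and masking a little more concrete, here is a minimal sketch in Python of what these steps might look like before data reaches an AI tool. Everything in it is an assumption made for illustration: the field names, the `PSEUDONYM_KEY` value and the helper functions `minimise`, `mask_email` and `pseudonymise` are hypothetical. Note also that masked or pseudonymised data will generally still be personal data, unlike data that has been fully anonymised.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would live in a key store, not in code.
PSEUDONYM_KEY = b"example-key-not-for-production"

def minimise(record: dict, needed_fields: set) -> dict:
    """Drop every field that the stated purpose does not need."""
    return {k: v for k, v in record.items() if k in needed_fields}

def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not domain:
        return "*" * len(email)
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain

def pseudonymise(value: str, key: bytes = PSEUDONYM_KEY) -> str:
    """Replace an identifier with a keyed hash so records can still be linked."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# Invented example record
record = {
    "customer_id": "C-1001",
    "email": "jane.doe@example.com",
    "dob": "1990-04-01",
    "basket_total": 42.5,
}

# Date of birth is not needed for this (hypothetical) purpose, so it is dropped.
slim = minimise(record, {"customer_id", "email", "basket_total"})
slim["email"] = mask_email(slim["email"])
slim["customer_id"] = pseudonymise(slim["customer_id"])
```

The keyed hash means the same customer always maps to the same token, so analysis across records still works, while anyone without the key cannot readily recover the original identifier.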

4.  Storage Limitation

To make the AI tool really effective, there might be a tendency to want to retain data on an indefinite basis. However, this will fall foul of the UK GDPR/GDPR storage limitation principle if there is no appropriate justification. In order to comply with this principle, robust audit trails and effective retention and deletion processes will be key when developing the AI tool. Documentation of decisions around the policies and processes put in place will also be essential to demonstrate compliance.
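As a rough illustration of how retention, deletion and an audit trail can fit together, the sketch below applies a retention period per category of data and records what was deleted. The categories, the retention periods and the `purge_expired` function are all hypothetical; the actual periods are a legal and business decision that needs its own documented justification.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods per data category.
RETENTION = {
    "chat_transcript": timedelta(days=90),
    "training_sample": timedelta(days=365),
}

def purge_expired(records, now=None):
    """Return (kept, audit_log); records past their retention period are dropped."""
    now = now or datetime.now(timezone.utc)
    kept, audit_log = [], []
    for rec in records:
        limit = RETENTION.get(rec["category"])
        if limit is not None and now - rec["collected_at"] > limit:
            # Log what was deleted and when, without copying the data itself.
            audit_log.append({
                "id": rec["id"],
                "category": rec["category"],
                "deleted_at": now.isoformat(),
            })
        else:
            # Categories with no configured period are kept in this sketch;
            # a real system would probably flag them for review instead.
            kept.append(rec)
    return kept, audit_log

# Invented example records
utc = timezone.utc
records = [
    {"id": 1, "category": "chat_transcript", "collected_at": datetime(2022, 11, 1, tzinfo=utc)},
    {"id": 2, "category": "chat_transcript", "collected_at": datetime(2023, 2, 1, tzinfo=utc)},
    {"id": 3, "category": "training_sample", "collected_at": datetime(2022, 6, 1, tzinfo=utc)},
]

kept, audit_log = purge_expired(records, now=datetime(2023, 3, 1, tzinfo=utc))
```

Here the first transcript is older than 90 days and is removed, while the other two records are still within their periods. The audit log gives you something to show a regulator without keeping the deleted data.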

5.  Accuracy and Data Quality

Although the importance of data accuracy and data quality goes without saying, it was highlighted by Google’s Bard (a new AI tool) recently when in a promotional video Bard answered a question incorrectly. Not only did this impact Google’s parent company Alphabet’s bottom line, it also raised questions around data accuracy and data quality and how you would go about correcting data in a vast AI data set and teaching the AI model to ‘unlearn’ the mistake and ‘relearn’ the correct answer in a live, rather than a test, environment.

6.  Security - Integrity and Confidentiality

Again the quantity of data used for AI tools has raised security concerns. Does the vast data set make it more vulnerable to attack and therefore increase the risk profile? What if there was a data breach? How would you identify which data was affected? How would you notify data subjects?

While these are indeed concerns, the counter argument might be that if the data set is publicly available information that has been scraped from the internet, there would be no value in the data from a security attack perspective. While this might be one argument, pooling this data all in one place is always going to be a huge risk, even if the information is publicly available. Further, should the data be combined in the future with other data sets to improve the AI tool, this may alter the risk profile and need further consideration and action. Of course, it is not just about unauthorised disclosure or misuse of the data: there are new types of adversarial attacks on AI machine learning models designed to introduce bias or to skew the results in favour of the threat actor’s agenda.

The ICO guidance to assess appropriate security measures to minimise risks of privacy attacks on AI models is helpful, e.g. it summarises the factors that need to be considered when assessing appropriate security measures, stating it is dependent on:

  • the way the technology is built and deployed;
  • the complexity of the organisation deploying it;
  • the strength and maturity of the existing risk management capabilities; and
  • the nature, scope, context, and purposes of the processing of personal data by the AI system, and the risks posed to individuals as a result.

It also gives guidance on what steps should be taken to minimise attacks (i.e. how to protect the underlying code itself from attack), and gives examples of privacy-preserving techniques that are available for AI systems. For example, for code developed by a third party, the guidance suggests that external code security measures should always include subscribing to security advisories to be notified of vulnerabilities. Where code is developed internally, it states that internal code security measures should include adhering to coding standards and instituting source code review processes.

The guidance also suggests using a range of techniques for enhancing privacy, such as perturbation (e.g. adding ‘noise’ by changing the value of some of the data points), synthetic data (e.g. using dummy data along with real data) and/or federated learning (e.g. allowing multiple parties to train the algorithm using their own local data, and then combining the patterns (known as gradients) into a “global” model without sharing the underlying training data).
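For readers who want to see what two of these techniques look like in practice, the sketch below shows perturbation and a single round of federated learning in plain Python. It is a toy example under stated assumptions, not a description of any particular product: the function names (`perturb`, `local_update`, `federated_round`), the one-parameter model and the sample data are all invented. Adding noise in this way does not by itself amount to a formal privacy guarantee.

```python
import random

def perturb(values, scale, rng):
    """Perturbation: add zero-mean Gaussian noise to each data point."""
    return [v + rng.gauss(0.0, scale) for v in values]

def local_update(w, data, lr=0.1):
    """One gradient step for the toy model y = w * x, using one party's own data."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_round(global_w, parties, lr=0.1):
    """Each party trains locally; only updated weights are shared and averaged."""
    updates = [local_update(global_w, data, lr) for data in parties]
    sizes = [len(data) for data in parties]
    # Weighted average by data set size; the raw (x, y) data never leaves a party.
    return sum(u * n for u, n in zip(updates, sizes)) / sum(sizes)

# Perturbation example with invented ages
rng = random.Random(0)
ages = [34.0, 51.0, 27.0]
noisy_ages = perturb(ages, scale=2.0, rng=rng)

# Federated example: two parties hold data that follows y = 3x
party_a = [(1.0, 3.0), (2.0, 6.0)]
party_b = [(1.5, 4.5), (0.5, 1.5)]

w = 0.0
for _ in range(50):
    w = federated_round(w, [party_a, party_b])
```

Because each party takes a single local step with the same learning rate, averaging the updated weights here is equivalent to averaging the parties’ gradients, which is the “combining the patterns” step described above. After a number of rounds the shared model settles close to the underlying value of 3, even though neither party has seen the other’s data.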

7.  Data Subjects have Rights

How can a data subject find out if the company behind the AI tool holds their personal data? If so, is it stored? How long for? For what purposes is it used? Is it accurate? Is the data secure? What if a data subject wants their data corrected or deleted completely from the AI tool? And what about deleting a data subject’s data from the actual model that powers the AI tool?

Again these are issues that have been addressed by the ICO in their guidance, looking in turn at training data, data used to make predictions in a live environment, data contained in the model itself and Article 22 UK GDPR/GDPR automated decision making and the role of human oversight.

8.  Adequate protections when Personal Data is exported

International data transfers are a hot topic at the moment. Understanding your data flows, the options available to transfer data and how to do so in the most business efficient and compliant way are essential for a global business today.

9.  Accountability

This is the principle which, in the words of the ICO, “makes you responsible for complying with data protection law and for demonstrating that compliance in any AI system that processes personal data”. The uses for AI are evolving all the time and therefore so are the data protection implications. A detailed risk assessment for the specific use cases of the AI is essential, considering regulatory requirements in the round, which data subjects are affected by the AI model, as well as “social, cultural and political considerations”.

How can we demonstrate compliance? 

It is widely acknowledged these AI tools are here to stay and there is a huge potential for their use, as long as we use them in a responsible and compliant manner. From a data protection point of view the usual assessment and recording of risk (i.e. in the form of a comprehensive data protection impact assessment) will almost always be required and will be essential should a regulator come knocking on the door.

There is an opportunity to put data protection higher on the business agenda with these AI tools. It is clear the spotlight of several regulators is shining on AI, e.g. the Italian Data Protection Authority’s widely reported decision in relation to Replika. In this instance the many headlines generated show the reputational issues and the reaction of wider society, while, as already discussed, the error made by Google’s Bard had a significant financial impact on Alphabet’s bottom line.

Co-ordinated AI ‘regulatory sandboxes’ are another compliance tool for businesses to consider; these sandboxes (recommended by the European Commission) are intended to act as a tool allowing businesses to explore and experiment with innovative products and services, under a regulator’s supervision. Aimed in particular at SMEs and start-ups, the sandbox initiative is expected to generate easy-to-follow best practice guidelines for businesses to learn from.

Again, to ensure you don’t fall foul of the regulators, and to demonstrate your compliance and best practice, when assessing risk and documenting the outcome of your decision making, there are a number of existing data protection documents to assist you in doing so, e.g. Data Protection Impact Assessments (DPIAs) (see the ICO’s blog on AI DPIAs), and Legitimate Interest Assessments (LIAs) (both of which are essential if AI tools are being deployed), the ICO’s guidance on Data Protection by Design and Default, the ICO’s AI and data protection toolkit and Explainability Statements – to name but a few! Any companies using AI tools and systems should have a specific policy in place addressing the related data privacy concerns and practices.

Businesses should always keep in the forefront of their minds that the risk profile of any AI tool will always be impacted by the purpose of the tool; an AI tool used for basic customer insights will likely overcome the issues we have outlined above more easily, as the potential for harm to data subjects is lower. Conversely, AI tools being used for analysing specific customer behaviours and subsequent decisions will require more careful consideration of the associated risk.

It will also be important to keep up-to-date with the evolving legislative and regulatory landscape, as well as tracking the risk appetite of your customers and clients to the opportunities such AI tools present.

And finally…

It seems only fair to let an AI tool have its say on the whole debate, so here’s what ChatGPT had to say when asked “are there any concerns that data scraped from the internet for use in AI might not comply with the GDPR?” – we think the closing statement speaks for itself!

And don’t miss our upcoming AI 101 webinar on 22 March 2023, aimed at organisations that need to get up to speed with AI and spot the opportunities and hazards on the horizon.