In this episode, Ellis explores the critical issue of data privacy for technical writers using AI tools and chatbots. He delves into the potential risks, from data leaks and copyright infringement to compliance violations and intellectual property concerns. The episode also provides practical solutions and strategies for mitigating these risks, empowering technical writers to leverage AI responsibly and ethically.
Key discussion points
- The Promise and Peril of AI: AI offers significant productivity gains for technical writers (content creation, first drafts, automation of tasks), but introduces critical privacy risks.
- Potential Risks of Using AI:
  - Data Leaks: Inputted data becoming part of the AI model, accessible to others.
  - Copyright Infringement: AI generating content based on competitor data.
  - Data Breaches: Risk of AI providers being hacked.
  - Data Sovereignty: Data stored in different countries potentially conflicting with regulations.
  - Compliance Violations: Risks related to regulated industries (healthcare, finance).
  - Intellectual Property Rights: Ambiguity over who owns AI-generated content.
Transcript
Hello again, and welcome to the Cherryleaf Podcast. My name is Ellis Pratt. I’m one of the directors at Cherryleaf.
Privacy, with regard to what technical authors do, has recently become a topic of discussion on a project. It was also raised by one of the delegates on our AI in technical writing course. So I thought it would be good to talk about privacy, the considerations we should make, and the protections we should implement when using chatbots in our documentation work.
As we’ve talked about on previous episodes of this podcast, AI tools can be useful for technical writers. They can be useful when it comes to writing content, for generating first drafts, and freeing up time.
And also for taking source content and converting it into user-orientated content.
It can be good for creating code samples.
And in the future, we’re likely to see AI be used to automate repetitive tasks like updating documentation, writing release notes and changelog documents, and also personalising content so that end users get information that’s specific to their scenario, to their levels of expertise, to their needs. And all of these benefits come with the promise of creating better outputs, saving time, and being more efficient, but with that crucial caveat, and that is data privacy. So let’s look at what the risks are for technical writers when using AI.
Probably the biggest concern is: if we give data to a chatbot provider, can we be confident that it's only used for our purposes? Is there a risk that others might get access to that information? Because often, technical writers are writing about new products, they manage confidential information, and they create end-user documentation. And then, when the product's released, that information is made available to users, to the public. So you might be taking some technical documentation provided by a developer and using an AI chatbot to help you create the end-user documentation.
But there’s this fear: what happens if the information that you give the chatbot becomes part of the chatbot’s data set, potentially making it accessible to employees of the chatbot provider, or incorporated into the model itself, so that it then becomes available to other organisations?
And one example of this might be that you ask a chatbot, an AI system, to generate some sample documentation for your application. And it does that, but it creates content that’s actually based on a competitor’s internal documentation that has somehow made it into the model’s training data. And if you use that content, do you then expose your organisation to the risk of breaching the competitor’s copyright?
And if the information is stored by the AI provider, what happens if there is a data leak at that organisation? Might they be hacked? Might they share that data with third parties, with external vendors? Might they collect metadata about you as a user and how you’re interacting with the software that could be used to surveil and gain information about you and your organisation, and the types of applications that you’re developing?
And could that, if you’re talking about security and passwords, compromise the protections that you have in place to stop people accessing your product? In most cases, we don’t know the answers to those questions because many AI models operate as black boxes. They make it difficult to understand how they process data and where that data is stored.
And this lack of transparency can make it difficult to be confident that compliance measures are in place and that privacy regulations are being followed.
And linked to that are issues around data residency and data sovereignty, because the data processed by these chatbots, by these AI tools, might be stored in data centres located in different countries. And that could potentially conflict with the residency requirements and sovereignty laws of the country that your organisation is based in. For example, a popular chatbot now is DeepSeek, and that’s based in China.
The issue of where the data is going to be is particularly important for organisations that are regulated: healthcare, finance, and also governmental bodies, because they have strict rules on how data is handled, and using AI tools could potentially create compliance risks when you’re dealing with that regulated information. For example, if you’re involved in documenting medical devices, then there is the potential to inadvertently process examples of patient data, and that might violate requirements under regulations such as HIPAA in America. And the final risk, which isn’t unique to technical writers but could affect anybody within an organisation that’s using a chatbot, an AI system, is who owns the content that’s generated. Unclear licensing terms from an AI provider might create ambiguity as to who owns it, and that potentially might compromise your organisation’s intellectual property rights.
And we’re seeing questions in law about whether information can be copyrighted when it’s been generated by an AI system. In fact, what we’re tending to see is that if AI-generated content is then modified by a human, then yes, it’s copyrightable.
But is there the risk that the AI provider has terms and conditions that mean they claim partial rights to the generated content? Certainly something to check.
Okay, so enough of the scary stuff. Let’s look at some of the practical solutions to mitigate and avoid those risks.
The most important thing as a technical author, as a technical writer, that you can do when using an AI system is sanitise the content before you put it into a chatbot window.
So that can mean replacing API keys and credentials and sensitive data relating to APIs with placeholder text. It can mean changing the name of the product so that it becomes ABC or XYZ.
That way, without knowing the real name, it’s not possible to understand what the product is doing or how it’s different from any other generic product. If you’re using data in your documentation, generate generic examples rather than using actual customer data.
Another thing that you can do is only put portions of content into the chatbot. So if you’re asking the chatbot to rephrase some content from a developer so that it’s in clearer English, that fragment by itself wouldn’t reveal what the piece of information relates to, beyond a general description of a particular technology or product.
So only provide the minimum amount of data required for the AI tool to function effectively. What this means is that you will then need to change and improve the output that you get from the chatbot before you can use it. So if you have given it source content where the product is called XYZ, you’ll have to go back through the output and replace XYZ with your actual product name or company name.
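As a rough illustration of that sanitise-then-restore workflow, here is a minimal Python sketch. The regular expression and the placeholder names (XYZ, API_KEY_1 and so on) are purely hypothetical examples, not a complete solution; you would adapt them to the product names, key formats and customer identifiers your own documentation actually contains.

```python
import re

# Hypothetical example values only; adapt to your own product names and key formats.
SENSITIVE_TERMS = {
    "AcmeRocket": "XYZ",        # real product name -> placeholder
    "Acme Ltd": "ABC Company",  # real company name -> placeholder
}
# Very rough pattern for API-key-like strings (long alphanumeric tokens).
API_KEY_PATTERN = re.compile(r"\b[A-Za-z0-9_\-]{32,}\b")


def sanitise(text: str) -> tuple[str, dict[str, str]]:
    """Replace product names and key-like strings with placeholders.

    Returns the sanitised text plus a mapping, so the placeholders can be
    swapped back into the chatbot's output afterwards.
    """
    mapping: dict[str, str] = {}
    for real, placeholder in SENSITIVE_TERMS.items():
        if real in text:
            mapping[placeholder] = real
            text = text.replace(real, placeholder)

    for i, token in enumerate(set(API_KEY_PATTERN.findall(text)), start=1):
        placeholder = f"API_KEY_{i}"
        mapping[placeholder] = token
        text = text.replace(token, placeholder)
    return text, mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the real names back into the AI-generated output."""
    for placeholder, real in mapping.items():
        text = text.replace(placeholder, real)
    return text
```

The idea is that you run sanitise over the source before pasting it into the chatbot window, and run restore over the generated draft before you publish it.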
It means anonymising and, excuse me for this word, pseudonymising values in the content to reduce the risk of revealing sensitive information.
And it also means reviewing and redacting before you put the content into the AI system, looking for any sensitive information and taking it out. So if there ever were a data breach at the AI provider, then you shouldn’t be exposed to having your sensitive information made available elsewhere.
It’s also worth checking what’s already in the public domain. If you’re updating content that’s already in the public domain, then you may not have these concerns, because that information is publicly available anyway.
The second thing to do is to look at the AI provider’s privacy policies and settings, and not all AI tools handle data equally. So before you start, you should check what their data retention policies are and whether there’s an opt-out for data collection.
And if you have a choice between different tools, pick the one with the best data deletion options.
So, for example, with Google’s Gemini in AI Studio and with Claude, by default, they don’t use the queries and the content that’s inputted into the chatbot window to train their large language models.
With ChatGPT, by default, they do. If you go into the settings, the three dots in the top right-hand corner, and then select the settings option, you’ll find an option for data controls. And if you turn off the option “Improve the model for everyone,” then you disable that data collection and the potential for them to use your queries to improve their large language model. So that’s an essential step for every user of ChatGPT who is using it in a commercial setting.
You also have, from providers like Google and Microsoft, the option of having a large language model that isn’t the one that’s publicly available. It’s managed and controlled by that provider, it can be specific to you, and it has extra layers of data protection.
And if you trust Google with storing your company documents on Google Drive, and if you trust Microsoft with having your data on SharePoint and on OneDrive, then you’re probably also likely to trust their versions of the large language models.
And if your organisation does go down that route, those would be the chatbots and large language models to use, and you may then feel freer to use proprietary information in those AI systems.
Another option is to use local or self-hosted AI models. By local, we mean that it’s installed on your computer, and by self-hosted, we mean it’s in a private cloud. That allows for greater control over data storage and processing. It means the risk of data leakage is no greater than for all the other data that your organisation stores.
The downside is that they can be slower, and they can have smaller context windows than the most popular systems that are out there. But if they do the job, then you have that protection of your data not being on somebody else’s computer.
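To give a flavour of what a local setup can look like, here is a minimal Python sketch that sends developer notes to a locally running model, assuming an Ollama-style server listening on localhost. The endpoint, model name and prompt are assumptions for illustration; the same idea applies to any self-hosted or private-cloud model server.

```python
import json
import urllib.request

# Assumes an Ollama-style local model server on localhost:11434;
# the endpoint and model name are illustrative only.
LOCAL_ENDPOINT = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3"  # whichever model you have pulled locally


def rewrite_locally(source_text: str) -> str:
    """Ask a locally hosted model to rewrite developer notes.

    Because the model runs on your own machine (or private cloud),
    the source text never leaves your infrastructure.
    """
    payload = {
        "model": MODEL_NAME,
        "prompt": "Rewrite the following notes as clear end-user documentation:\n\n"
        + source_text,
        "stream": False,
    }
    request = urllib.request.Request(
        LOCAL_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    return result.get("response", "")
```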
And there are also hybrid approaches that can use public AI for general tasks while keeping sensitive processing internal on your computer. What we’ve talked about so far has been really the technical aspects of using the systems. There can also be a need for guidelines and procedures to make sure that everybody within the team is working in the same way. So you can also develop policies specifically for the technical publishing team.
Within those, there might be procedures based on a tiered approach: identifying which types of documents can and cannot be processed by AI tools, and which types of documents can be processed by which AI tools.
You can establish processes for using AI with confidential information, with checklists that people can use before they upload information into a chatbot. You can supplement that with training materials to help them identify privacy risks, and you can also implement peer reviews when using AI for sensitive documentation, so you get somebody else to double-check that there isn’t going to be a leak.
Another potential approach is to upload the content that will be used in an AI system to a specific area on your network, and to apply content filters to that folder, so that anything that looks like sensitive information is flagged or removed. You then only use content from that sanitised area when using an AI system.
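As a sketch of what such a content filter might look like, here is a simple Python example that scans the files in a staging folder for patterns that look like email addresses or key-like tokens and flags them. The folder name and patterns are placeholders; a real deployment would use your organisation’s own rules or a proper data-loss-prevention tool.

```python
import re
from pathlib import Path

# Illustrative folder and patterns only.
STAGING_FOLDER = Path("ai-staging")
SUSPICIOUS_PATTERNS = {
    "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "key-like token": re.compile(r"\b[A-Za-z0-9_\-]{32,}\b"),
}


def check_folder(folder: Path = STAGING_FOLDER) -> list[str]:
    """Flag any staged files that appear to contain sensitive values."""
    findings = []
    for path in folder.glob("*.md"):  # checks Markdown files as an example
        text = path.read_text(encoding="utf-8", errors="ignore")
        for label, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(text):
                findings.append(f"{path.name}: possible {label} found")
    return findings


if __name__ == "__main__":
    for finding in check_folder():
        print(finding)
    # Only files with no findings should go on to be used with an AI system.
```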
You can also carry out audits and assessments. You can get your IT security people involved to do impact assessments and security audits, just to double-check whether there are any potential vulnerabilities and what level of risk is associated with those AI tools.
What you’re likely to end up with is a combination of technology approaches, solutions, policies and procedures, and human oversight to check that it’s all working in a way where you can balance privacy with the promise of being able to use AI systems to make your life as a technical author better.
So you can start small: start with low-risk documentation to build up your experience and become more aware of some of the AI privacy considerations. You can stay informed by keeping yourself up to date with what the privacy risks and solutions might be. For example, what privacy settings are there in tools like Claude and ChatGPT, and similar applications?
You can collaborate with people who might be more expert than you: your IT security people, your legal team, your compliance team, and make sure that you’re following any advice or rules that they have established.
And just think about security when you’re using AI: be aware that it’s a tool, and one where you’re giving information to somebody else, to a third party, so maybe do some checks to minimise the risk, cleaning up the document and anonymising information.
And so to conclude, AI and chatbots offer technical writing, as they do other sectors and professions, the huge promise of a way to be more efficient and more effective in what we do. But particularly with the work that technical writers do, we need to be aware that there is a privacy risk, and we need to address that to minimise it or to avoid it completely.
We need to be aware that there are two sides to AI. It’s both a powerful productivity tool and a potential risk. So we need to think about those two aspects and manage them.
So we can take preventative actions. We can minimise the data that is uploaded, portioning up the information. We can anonymise information, using XYZ as a product name, for example.
We can use robust data processing agreements, only going with organisations that we trust, and use products from trusted vendors like Microsoft and Google, so that we’re in a position where we can use AI and also safeguard sensitive data.
So we’re going to be on a tightrope, a privacy tightrope. We need this commitment to data privacy to maintain trust, to ensure compliance, just to be ethical.
But the good news is that this is recognised by the providers of AI systems, and it is certainly possible to keep those data privacy checks and balances in place.
So I hope you found this episode useful. If you’ve got any questions or comments, then you can contact us at info@cherryleaf.com. If you’re interested in our technical writing services, you’ll find information on the Cherryleaf website, and you’ll also find information on our training course on using generative AI in technical communication.
Thank you for listening.