This is the unedited (final draft) version of the paper, written with a more technical voice. The formally published version of the paper can be found here.
The potential of generative artificial intelligence (GenAI) to create malicious software is sparking varied reactions within the cybersecurity community. Opinions are split between those who fear this technology could lead to sophisticated novel threats and those who argue it merely automates the assembly of existing malicious code found on the internet. This debate highlights the growing complexity of GenAI’s role in cybersecurity, specifically its ability to create malicious software. Drawing on academic research and grey literature, this paper seeks to offer a balanced exploration of both the capabilities and limitations of GenAI, contributing to a more nuanced understanding of its impact.
What is malware? Malware is a catch-all term for software designed with malicious intent, such as viruses, worms, trojans, ransomware, scareware, spyware, and adware. It is deployed for many purposes, including stealing identities and financial details, and gaining control of computers in order to launch denial-of-service attacks or mine cryptocurrencies. Malware can also be used to spread mis- and disinformation.
What do we mean by AI-generated code? For the purposes of this paper, we refer to code generated by large language models, trained on vast datasets of publicly available source code. These systems assist software developers by generating code in response to text prompts (code snippets or full functions), refactoring, repairing and refining code, and by being able to explain code. The first successful model of this kind was OpenAI’s Codex[1], which powers GitHub’s Copilot. As of February 2024, GitHub Copilot had 1.3 million paid subscribers. Embedded within the development environment, Copilot serves as an autocomplete tool that reportedly enhances developer productivity, enabling 55% faster coding[2], [3]. However, it is essential to recognize that AI-generated code can contain flaws and should be meticulously reviewed, edited, and refined by developers[4].
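To make this concrete, the snippet below is a minimal sketch of requesting a code completion programmatically, assuming the OpenAI Python SDK; the model name and prompt are illustrative and say nothing about how Copilot itself is implemented.

```python
# Minimal sketch of prompting a code-generation LLM for a function.
# Assumes the OpenAI Python SDK (openai>=1.0); model and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # any capable code model would do
    messages=[
        {"role": "system", "content": "You are a coding assistant. Return only code."},
        {"role": "user", "content": "Write a Python function that validates an email address with a regular expression."},
    ],
)

generated_code = response.choices[0].message.content
print(generated_code)  # the output still needs human review before it is used
```

The final comment is the important part: whatever comes back is a plausible completion, not a verified implementation, and must be reviewed like any other untrusted contribution.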
In the wider view, GenAI is good at creation and communication, and its helpfulness is largely agnostic to the intent of the user. Although many of the main commercial LLM systems have safety protections, such as content filtering, these are not wholly reliable: they can often be defeated by clever prompting, or avoided altogether by using open-source models. Fundamentally, these systems are designed to ingest knowledge found on the internet and generate statistically plausible responses, often in the guise of an ‘AI chatbot’ persona. At the time of writing, however, they struggle with attention, determinism, reasoning, and contextual understanding; the latter two are capabilities users expect but which cannot be achieved with current transformer architectures.
GenAI has proven itself to be a particularly useful tool in support of social engineering attacks, due to an LLM’s ability to process natural language: it can both help select suitable targets for spear phishing campaigns and write the ‘personalised’ messages[5]. A recent report from the NCSC on the near-term impact of AI on the cyber threat[6] speaks to AI uplifting the social engineering/spear phishing ability of all threat actors: highly skilled/state-sponsored, skilled but resource-constrained/organised crime, and novice hackers/opportunistic criminals.
… But could GenAI autonomously create novel malware that exploits previously unknown vulnerabilities, and which is able to evade state-of-the-art defences, reminiscent of the insecurity prevalent in the late 90s, as some would suggest?
Since GPT-4’s public release over a year ago, there hasn’t been a noticeable uptick in novel malware detections in the wild[7]. This observation suggests that while GenAI may be a powerful tool, it currently lacks the specific capabilities and training necessary to independently create operational malware.
It is worth noting that not all attacks require malware. In fact, given the hostile environment that modern antivirus, platform security, and improved response processes create for malware, malware-free attacks have proven more lucrative and more popular with serious criminals. This trend is partly related to the success of identity attacks (phishing, social engineering, and access brokers). According to CrowdStrike, malware-free attacks represented 75% of detections in 2023[8].
The first research question this paper sets out to answer is: what specific coding tasks within malware development can currently be automated by GenAI?
To answer this question a breakdown of what comprises malware is first required:
For each of these pieces of functionality, the LLM training dataset likely contains examples acquired from many publicly available sources, including public disclosures, academic publications, hacker forums, security blogs, cybersecurity training, red-teaming resources, and bug bounty and vulnerability reports. This means the LLM can generate code that achieves the described functionality. But does that mean it can autonomously create malware? Likely not – at least, not yet.
Current LLMs are not capable of autonomously writing high-quality code[9], [10], [11], [12], [13]; they typically require human intervention to correct and refine what they generate. This is primarily because LLMs lack an understanding of the logical structures and contextual nuances needed for sophisticated software development[14], [15], [16].
Additionally, LLMs are generally trained on more commonplace and less sophisticated examples, as comprehensive and high-quality datasets of malicious code are rarely publicly available due to ethical, legal, and logistical issues.
There are systems that have been fine-tuned on malicious content, such as WormGPT, FraudGPT and DarkBERT[17], [18], and many of these systems are not constrained by content filtering or safety requirements, as they are specialised versions of open-source models. Indeed, FraudGPT and WormGPT are available as subscription services, essentially democratising the ability to perform social engineering attacks. The inclusion of dark web content will increase the breadth of the training dataset, but it will still be constrained, as the most effective techniques (i.e. sophisticated, operationally reliable, and stealthy) will remain inaccessible due to their value (an exploit can sell for up to $7 million[19]). Access to the most sophisticated malware is limited to state actors[20], platform owners, and security product vendors.
Even with advanced models like GPT-4, the capabilities necessary to autonomously create sophisticated malware are lacking and will continue to rely on human expertise[21]. Sophisticated malware often requires a delicate balance between stealth, security, and functionality. It involves making strategic decisions that weigh the effectiveness of the malware against its detectability. Such decisions require a level of tactical foresight and adaptive problem-solving that current LLMs do not possess[22], [23].
While GenAI may currently lack the capabilities necessary to autonomously create sophisticated malware due to limitations in training data and reasoning abilities, its utility for malicious purposes should not be underestimated.
When it comes to coding, LLMs can act as both teacher and assistant, effectively lowering the barrier to entry for writing malicious software[24]. However, the quality of LLM output hinges on the expertise of the person prompting them, who uses knowledge and experience to frame questions that maximize the quality of the response[25].
There are several reports of hackers using (abusing) ChatGPT for tasks such as ‘improving’ existing info-stealing malware[26] and learning how to write ransomware scripts[27] (although small fixes were required). A recent threat intelligence blog[28] by Microsoft sheds light on the activities of known threat actors exploring the use of LLMs for malicious purposes. All five actors described in the report used LLMs for ‘LLM-enhanced scripting techniques’, as categorised by MITRE. These techniques involved using LLMs to generate or refine scripts that could be used in cyberattacks, ranging from basic scripting tasks (programmatically identifying certain user events on a system, fixing coding errors) through to troubleshooting and understanding various web technologies, and refining scripts to automate or streamline their cyber tasks.
In summary, while current GenAI capabilities fall short of autonomously creating sophisticated malware due to limitations in training data and reasoning abilities, they still serve as valuable tools for malicious actors. Through specialised models and exploitation of LLMs, hackers can leverage GenAI to assist with some aspects of malware development, although human expertise remains a crucial factor.
The second research question is: to what extent can GenAI autonomously identify and exploit new security vulnerabilities in software?
To achieve their goals, malicious binaries need ways to defeat security and privacy enhancing subsystems; to do this they often exploit vulnerabilities within the host operating system or applications. These vulnerabilities arise from a range of issues, including human error such as coding mistakes or configuration oversights, but also stem from the inherent complexity of software development and the pressures of tight release schedules. Zero-day vulnerabilities are defects of which the vendor is not yet aware; one-day vulnerabilities have been disclosed but not yet patched.
Exploitable vulnerabilities command high prices because they are valuable to both attackers and the platform/application owners whose customers/users are being attacked. As such, automated vulnerability discovery has been a holy grail of the cyber security world for many years, and much research is dedicated to this topic. Traditional machine learning (deep learning) techniques have shown a strong ability to detect vulnerable functions with a high degree of accuracy[29], and to reduce both false positives and false negatives compared with baseline static analysers[30].
The use of LLMs has been explored for various purposes: classifying insecure functions[31], improving the results returned by traditional static analysers[32], and finding and fixing vulnerabilities. In the latter cases, this is achieved by utilising certain ‘oddities’ of LLM behaviour, such as the performance improvement gained through Self-Reflection[33] and Chain of Thought[34] prompting.
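As a rough illustration of how these prompting patterns can be combined, the sketch below runs a Chain-of-Thought review pass over a code snippet and then a Self-Reflection pass over the model’s own answer. It assumes the OpenAI Python SDK; the prompts, model name, and example snippet are illustrative rather than a reproduction of any of the cited systems.

```python
# Illustrative sketch of Chain-of-Thought plus Self-Reflection prompting for
# vulnerability review. Assumes the OpenAI Python SDK; prompts are examples only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(prompt: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

SNIPPET = """
char buf[64];
strcpy(buf, user_input);   /* user_input comes straight from the network */
"""

# Pass 1: Chain of Thought -- ask the model to reason step by step.
first_pass = ask_llm(
    "Review the following C code for security vulnerabilities. "
    "Think step by step about how untrusted input flows into each call, "
    "then list any CWEs you find with a one-line justification.\n" + SNIPPET
)

# Pass 2: Self-Reflection -- feed the model its own answer and ask it to critique it.
final_answer = ask_llm(
    "Here is your previous analysis:\n" + first_pass +
    "\n\nReflect on it: note anything you missed or got wrong, "
    "then give a corrected final list of vulnerabilities and suggested fixes."
)
print(final_answer)
```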
A common approach to measuring the performance of LLMs in this field is to compare their results to those of traditional static analysers, such as CppCheck and SonarQube. However, it has long been understood that for these tools to be effective, i.e. to return pertinent results and not too many false positives, they need to be configured correctly within a representative build environment. Because this effort is burdensome, it is a first source of uncertainty: comparisons can be skewed by poorly configured baselines. It is, of course, also the reason that using LLMs as an alternative is so appealing.
The second uncertainty comes from not being able to determine whether the test data is present in the training data of the system being examined. If the test data is curated from existing datasets such as CVEs[35] and CWEs[36], it is difficult to determine the effectiveness of the LLM’s ability to generalise, as opposed to performing “approximate retrieval”[37]. This raises concerns about the system’s real-world applicability and its ability to identify novel vulnerabilities.
A paper[38], [39] published in April 2024 stated that GPT-4 powered agents were able to autonomously exploit real-world security vulnerabilities by reading security advisories. Using the CVE description, the agents were able to exploit 87% of the vulnerabilities they were presented with, 11 of which were disclosed after the LLM’s training data cut-off. This built on previous work by the same authors demonstrating that agents using the LLM’s planning ability could autonomously hack websites[40]: for low-difficulty security weaknesses the LLM agents (using GPT-4) were able to develop working exploits, although the authors point out that the success rate for harder vulnerabilities was lower. In both scenarios the agents were equipped with web-searching tools, so it does not follow that they worked out how to exploit the vulnerabilities autonomously; indeed, one researcher responding to the paper stated that they were able to find publicly available exploits for 11 of the vulnerabilities[41]. This throws doubt on the original paper’s claim that LLM agents can autonomously write exploits as an emergent behaviour.
While it appears that the current generation of LLMs are not capable of autonomously finding and exploiting vulnerabilities, in part due to the inherent complexity of the task, they still have unique strengths which may help vulnerability researchers.
A range of tools exists to aid the vulnerability researcher in their task, such as static analysers, port scanners, and decompilers; these tools fall into five main categories:
As discussed in the previous section, GenAI is most effective when it is being used by an expert. Below is a set of novel LLM applications in which an LLM’s unique talents positively offset its stochastic limitations, with a view to aiding vulnerability research:
Fuzzing is a form of dynamic testing that involves feeding a program random, invalid, or unexpected data to uncover defects which may lead to vulnerabilities and/or security flaws. LLM-enhanced fuzzing tools provide superior API and code coverage, find more complex bugs, and improve the automation of testing[42]. TitanFuzz[43], a tool for fuzzing the PyTorch and TensorFlow libraries, utilises the many code snippets within the LLM’s training set to derive correct and diverse programs that can then be used as input to the fuzzing system. TitanFuzz achieved 30% and 51% better code coverage on the two libraries respectively, compared to other state-of-the-art fuzzers[44].
Another system, ChatAFL[45], [46], a guided fuzzing engine for protocol implementations, utilises the RFC knowledge that LLMs have absorbed from their training; ChatAFL iteratively uses the LLM to generate machine-readable grammars to use as inputs, to increase the diversity of those inputs, and to generate specific inputs aimed at reaching new states within the protocol implementation. Another researcher simply gave Claude 3 an entire C library for decoding GIF files and asked it to write a Python function to generate random GIFs to exercise the library; the GIFs it generated achieved 92% line coverage and found four memory safety bugs and one hang[47], [48].
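The sketch below captures the spirit of that GIF experiment under stated assumptions: an LLM-written generator emits quasi-valid GIFs, which are fed to a decoder under test. Here `generate_random_gif` and `decode_gif` are hypothetical stand-ins for the model’s output and the library being exercised; this is not the researcher’s actual harness.

```python
# Sketch of LLM-assisted fuzzing in the spirit of the GIF experiment above.
# generate_random_gif() stands in for an LLM-written seed generator;
# decode_gif() stands in for the decoder under test (e.g. a C library via FFI).
import random

def generate_random_gif() -> bytes:
    """Emit a mostly well-formed GIF with randomised fields and trailing junk."""
    header = b"GIF89a"
    width, height = random.randint(0, 65535), random.randint(0, 65535)
    screen = width.to_bytes(2, "little") + height.to_bytes(2, "little")
    flags = bytes([random.randint(0, 255), 0, 0])          # GCT flags, bg colour, aspect
    junk = bytes(random.randint(0, 255) for _ in range(random.randint(0, 256)))
    return header + screen + flags + junk + b"\x3b"        # 0x3B = GIF trailer

def decode_gif(data: bytes) -> None:
    """Placeholder for the decoder under test."""
    raise NotImplementedError

for i in range(10_000):
    sample = generate_random_gif()
    try:
        decode_gif(sample)                                  # crashes/hangs indicate bugs
    except NotImplementedError:
        break                                               # no real decoder wired in
    except Exception:
        with open(f"crash_{i}.gif", "wb") as f:             # keep the reproducer
            f.write(sample)
```

Coverage instrumentation (the 92% line coverage figure cited above) would sit around the `decode_gif` call; the value the LLM adds is in producing inputs that are structured enough to get past the parser’s early checks.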
Penetration testing, or PenTest, is when a computer network is investigated for weaknesses. Typical activities include discovery (mapping the target network and assets), scanning (using tools to scan for known vulnerabilities), identification (analysing scan results to identify weak points), and possibly even exploiting any weakness to determine what level of access can be achieved.
Unlike red-teaming, these activities are normally carried out within a timeframe agreed with the owners and managers of the network under investigation.
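As a concrete, if simplified, illustration of the scanning step, the sketch below performs a basic TCP connect scan against a handful of common ports. The target address and port list are placeholders, and such a scan should only ever be run against systems you are authorised to test.

```python
# Minimal sketch of the "scanning" step of a pen test: a TCP connect scan over
# a handful of common ports. The target (a TEST-NET placeholder) and port list
# are illustrative; only scan hosts you are authorised to test.
import socket

TARGET = "192.0.2.10"
COMMON_PORTS = [22, 80, 443, 445, 3389]

def is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0

for port in COMMON_PORTS:
    state = "open" if is_open(TARGET, port) else "closed/filtered"
    print(f"{TARGET}:{port} {state}")
```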
PentestGPT[49] utilises an LLM in three main ways: first, its planning ability lets it adopt the role of lead tester, using an attack-tree structure to steer the testing process; second, its generative ability lets it play the role of junior tester, constructing tests for specific tasks; and last, its natural language ability is used to parse outputs and results.
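A rough sketch of that three-role structure is shown below; the class layout, prompts, and the `ask_llm` placeholder are assumptions for illustration, not PentestGPT’s actual implementation.

```python
# Sketch of a lead-tester / junior-tester / parser loop in the spirit of the
# three roles described above. ask_llm is a placeholder to be wired to any chat
# model; prompts and structure are illustrative assumptions.
def ask_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a chat model and return its text reply."""
    raise NotImplementedError

class PentestLoop:
    def __init__(self, target: str):
        self.target = target
        self.attack_tree: list[str] = []      # tasks proposed by the "lead tester"
        self.findings: list[str] = []

    def plan(self) -> None:
        """Lead tester: maintain an attack-tree-like task list for the engagement."""
        reply = ask_llm(
            f"You are the lead penetration tester for {self.target}. "
            f"Findings so far: {self.findings}. "
            "List the next tasks to attempt, one per line."
        )
        self.attack_tree = [line for line in reply.splitlines() if line.strip()]

    def execute(self, task: str) -> str:
        """Junior tester: turn a task into concrete commands or tests to run."""
        return ask_llm(f"Suggest the exact commands or tests for this task: {task}")

    def parse(self, raw_output: str) -> None:
        """Parser: summarise raw tool output into findings that feed back into planning."""
        self.findings.append(ask_llm("Summarise the key findings in this output:\n" + raw_output))
```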
Although the system showed some promising results, solving most of the easy targets and some medium ones, researchers acknowledged that the system struggles with harder targets that typically demand a deep understanding from the penetration tester, which is not present in the LLM.
Consistent with conclusions drawn in the previous section, while GenAI cannot yet autonomously identify and exploit novel vulnerabilities due to limitations in reasoning abilities and training data, specialised applications like ChatAFL and PentestGPT demonstrate how LLMs can assist researchers in vulnerability detection and analysis. However, concerns about their reliance on existing datasets and inability to generalise beyond training data limit their effectiveness. Ultimately, the collaboration between automated tools and expert human oversight remains essential.
So far, we have reviewed whether current generative AI systems can create novel malware and/or find and exploit new vulnerabilities. This section looks to the future and how AI could change the malware landscape, considering the trajectory of AI as a field rather than generative AI alone.
For malware to remain undetected, it must evade antivirus products. Modern consumer antivirus products employ two main types of detection: signature-based and behaviour-based. Signature-based detection involves reverse engineering a known piece of malware to identify unique patterns within its binary structure, which can then be checked for as files are downloaded to disk or loaded into memory. Behaviour-based detection analyses the actions of applications and processes to identify suspicious behaviour; for example, it would be suspicious for Word to try to write to files in a protected system folder. Traditional machine learning underpins capabilities such as pattern recognition, continuous learning and adaptation, and false-positive reduction, all of which are critical if antivirus is to respond effectively to evolving threats.
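As a toy illustration of the signature-based half of this picture, the sketch below scans files for known byte patterns. The signatures here are fictitious, and real engines combine far richer signatures with behavioural telemetry and machine learning.

```python
# Toy illustration of signature-based detection: scan files for known byte
# patterns. The signatures are fictitious examples, not real malware indicators.
from pathlib import Path

SIGNATURES = {
    "Example.Dropper": b"\xde\xad\xbe\xef\x13\x37",
    "Example.Stealer": b"TOTALLY-FAKE-MARKER-0001",
}

def scan_file(path: Path) -> list[str]:
    """Return the names of any signatures found in the file's raw bytes."""
    data = path.read_bytes()
    return [name for name, pattern in SIGNATURES.items() if pattern in data]

for p in Path(".").glob("*"):
    if p.is_file():
        hits = scan_file(p)
        if hits:
            print(f"{p}: {', '.join(hits)}")
```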
From the perspective of the malware, these defences pose a significant challenge: it aims to be installed, persist, and operate without being quarantined or alerting the user to its presence. Techniques that allow the malware to alter its code each time it executes (polymorphic[50], [51]) or to rewrite itself entirely (metamorphic) improve its chances of avoiding detection.
To achieve this, back-end support would be required, as once a piece of malware is quarantined it is game over. However, creating a feedback loop between the malware and its C&C server could be effective: by exchanging details about the malware’s current environment, and receiving patches or self-patching instructions, the malware can adapt to remain undetected. This would require substantial processing at the back end, including automated testing and simulation of antivirus systems, and machine learning techniques to identify patterns and generate improvements for new variants. Viewed as a reversal of what AV vendors currently do, and with the uplift GenAI provides for generating code, this strategy is not only plausible but increasingly feasible.
Advanced persistent threats (APTs) are sustained cyber-attacks in which the malware remains relatively dormant within networks for extended periods, waiting for the right conditions to execute. Commonly associated with upstream/supply chain attacks, they are predominantly deployed for strategic advantage rather than immediate financial gain; in other words, they are more likely to be state sponsored. The recent vulnerability found in the XZ Utils open-source library (CVE-2024-3094) demonstrates such a long-term supply chain operation[52]. The goal was to install a backdoor in a library that would have left many Linux servers world-wide compromised. Remarkably, the attack was detected almost at the last minute, having survived pre-deployment checks, thanks to vigilant performance monitoring by a Microsoft engineer.
As the AI industry matures away from large, resource-intensive systems towards on-device and at-edge computing, it is reasonable to theorise that such capabilities, when embedded in an APT, could evolve the malware into a malicious autonomous agent able to devise its own plans to achieve its stated goals. Such localised autonomy could significantly complicate defence strategies. Traditional cybersecurity measures that rely on detecting known patterns of malicious communication or behaviour may become less effective against APT attacks that make decisions and adapt without relying on outside contact. This could be revolutionary; the confidence currently afforded by air-gapped networks would be eroded.
While much attention is given to the intrusion and initial infection of systems, the true nastiness often unfolds through the deployment of the malware’s payload. With the advent of AI in cybersecurity, imagine malware that not only infiltrates a network but also autonomously generates its own payloads based on the systems it encounters.
This AI-enhanced malware could identify the most lucrative data repositories and transaction systems, or maybe a company’s proprietary designs and confidential information. It would then craft payloads on-the-fly to extract financial information, intellectual property, or even manipulate transaction processes.
Additionally, it could dynamically adjust its programming to mask its C&C channels, for example by embedding its data exfiltration within legitimate business processes tailored to the culture it finds itself in. This capability would significantly complicate detection and mitigation efforts, as the malware’s activity would slip past conventional security measures designed to flag known suspicious activities.
While sophisticated cyberattacks, like the XZ Utils attack described above, often capture headlines, a significant threat persists from less skilled developers deploying ‘muck spreading’ malware tactics. Using kits such as Metasploit, these attackers develop malware that exploits known vulnerabilities which have not been patched. They rely on volume of deployments over sophistication, indiscriminately targeting vast numbers of users. This approach is economically viable due to the minimal costs involved and the potential for returns even if only a small percentage of attempts succeed. However, if this approach were augmented by a GenAI system capable of autonomously generating functional, buildable code, performing testing and deployment, and adding simple but effective social engineering based on AI-enabled reconnaissance, the return on investment could increase substantially. The prospect of AI-enhanced, volume-driven attacks underlines the need for stronger cybersecurity defences that proactively anticipate an increase in attack volume and a reduced time from vulnerability disclosure to exploit use.
Of course, sometimes AI can surprise us. In 2016, AlphaGo beat one of the world’s best Go players by making a move that no human player would ever consider, a move that “so perfectly demonstrated the enormously powerful and rather mysterious talents of modern artificial intelligence”[53]. Although trained on human gameplay, the system was able to look beyond how we play to an entirely different level of the game. If this type of phenomenon were to occur within the volume-over-sophistication style approach to malware development, we could see the emergence of novel malware being generated autonomously, i.e. without the expert assistance and specialised data as previously suggested. This could lead to a new era where AI-generated malware could innovate novel attacks independently, presenting unprecedented challenges for cyber defence.
“I think that copy pasting from Stack Overflow is inherently less bad than the AI suggestions. When you copy paste from SO you know the answer is not for your question. It’s the answer for someone else’s question that happens to match your question. The AI answer however is presented to us as the answer to our question. But it’s not. It’s just the logical completion of the previous tokens, whether that is text or the code.”– Reddit user in r/programming. [54]
The ability of GenAI to produce workable malware hinges on the data it is trained or fine-tuned on, as well as on its ability to interpret instructions and deliver expected results. But how good is GenAI at writing ‘normal’ code? This question elicits a variety of responses.
The integration of tools like Copilot, which are embedded directly in the development environment and operated through natural language, means that there is virtually no barrier to adoption. However, experienced programmers soon realise that such systems have limitations: they generate code which is often buggy, incomplete, or ‘not what I asked for’. They typically restrict its use to assisting with repetitive tasks such as unit testing, or when faced with the dreaded blank page of a new file.
Copilot, ChatGPT, GPT-4 et al are more reliable when used for coding tasks which are found more often in their training sets but are less helpful for novel or proprietary interfaces. They don’t make good design decisions for critical aspects such as privacy and security. Indeed, there have been instances of leakage of personal information and API keys from the training set[55].
Studies, including one reported in Visual Studio Magazine[56], indicate troubling trends for maintainability: “We find disconcerting trends for maintainability. Code churn – the percentage of lines that are reverted or updated less than two weeks after being authored – is projected to double in 2024 compared to its 2021, pre-AI baseline”. This suggests that Copilot dissuades developers from upholding the DRY (don’t repeat yourself) principle of good coding practice, leading to less maintainable code and therefore increased technical debt.
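To make the quoted metric concrete, the sketch below computes code churn over a toy set of line-change records. The input format is an assumption; a real measurement would mine version-control history (e.g. git blame) rather than hand-written tuples.

```python
# Illustrative calculation of the "code churn" metric quoted above: the share of
# changed lines that are modified or reverted within two weeks of being authored.
# The hard-coded records are a toy dataset for demonstration only.
from datetime import datetime, timedelta

CHURN_WINDOW = timedelta(days=14)

# (date the line was originally authored, date the line was changed again)
changed_lines = [
    (datetime(2024, 3, 1), datetime(2024, 3, 9)),    # churned: changed after 8 days
    (datetime(2024, 1, 10), datetime(2024, 4, 2)),   # stable: changed months later
    (datetime(2024, 2, 20), datetime(2024, 2, 22)),  # churned: changed after 2 days
]

churned = sum(1 for authored, changed in changed_lines if changed - authored < CHURN_WINDOW)
print(f"code churn: {churned / len(changed_lines):.0%}")   # -> code churn: 67%
```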
While coding, especially secure coding, is challenging, Copilot is not a panacea. Its use without guard rails simply shifts the burden of finding and detecting defects down the development pipeline, putting more emphasis on quality assurance activities to catch the bugs before release. As any software engineer will confirm, the later you find a bug, the more difficult and costly it is to fix.
Agentic approaches which utilise an LLM’s ability to break down a task and to understand code for iterative refinement, could improve outcomes. Devin.ai is a new system being promoted as a replacement for software engineers. However, it faces many of the same limitations as Copilot and GPT-4, and its demo projects seem to have been selectively chosen[57].
The unchecked use of Copilot risks integrating more bugs into codebases. Consequently, any increase in bugs not only undermines code quality but also amplifies the risk of security weaknesses. Some argue that the onus remains on engineers to ensure the integrity of the code they produce, regardless of the tools used. This view highlights the need for developers to maintain a high level of vigilance and responsibility.
However, there is also a school of thought that posits that relying too heavily on Copilot could erode the fundamental coding skills of new programmers, who might become overly dependent on AI assistance. This dependency raises a critical question: if Copilot diminishes the coding proficiency of new engineers, where will the experienced programmers necessary for effective quality assurance (QA) come from? The industry could face a shortage of skilled programmers capable of performing the essential QA tasks that catch critical bugs before software releases. Such a scenario underscores the importance of balanced training and development practices that ensure programmers are proficient in both coding and software engineering fundamentals.
Generative AI has the potential to both enhance and complicate the cyber security landscape. While GenAI can amplify existing risks with respect to the speed and scale of reconnaissance, social engineering, and spear-phishing, its code generation abilities currently have a lesser effect on the attack landscape. However, as models become more sophisticated and training datasets more comprehensive, GenAI’s role in cyber security is likely to grow significantly.
There is a belief among some experts that future advancements in AI might lead to scenarios where malware created by AI can only be effectively countered by other AI-driven defence systems. This reflects a broader concern of an arms race, where capabilities of offensive and defensive technologies continually evolve to outpace each other.
Whether such worst-case scenarios play out will partly depend on the rate of AI development. It is uncertain whether the next generation of LLMs will match the leap in capabilities from GPT-3 to GPT-4. Some experts even cast doubt that current architectures can provide the foundation for further spikes in capability, suggesting that GPT-4 may be nearing the upper bound of its abilities given available training data.
Current GenAI systems offer unique strengths, particularly in pattern recognition and natural language processing, drawing on extensive training data and offering multimodal capabilities. Targeted application of these abilities to enhance state-of-the-art systems could significantly elevate existing technologies, for both cyber threat and cyber defence.
However, if cyber defence is to stay ahead of the game, realising this potential requires collaboration between currently disconnected groups, bridging not only the gap between cyber security and AI, but also the compartmentalisation within the AI field itself. The heightened focus on GenAI over the last year has led to an increase in research in this area. Our review of the literature indicates that much of this research tends to be conducted in pockets. There are instances where LLMs are used to develop solutions that may not surpass existing approaches, leading to scepticism about their utility. Simultaneously, innovators at the forefront of advanced solutions sometimes overlook the value LLMs can contribute.
A first step could be for professionals and researchers from both fields to share best practice on managing ethical and security considerations, develop frameworks to bridge terminology gaps, and ensure that the practical application of research is prioritised. Fostering a culture of collaboration and open dialogue is necessary to stay ahead of adversaries in an ever-evolving cyber threat landscape.
This publication is licensed under the terms of the Creative Commons Attribution License 4.0 which permits unrestricted use, provided the original authors and source are credited.