The mission of the hackathon: to write a program that can scan millions of lines of open-source code, identify security flaws and fix them, all without human intervention. Success would mean winning millions of dollars in a two-year contest sponsored by DARPA, the Defense Advanced Research Projects Agency.
The contest is one of the clearest signs to date that the government sees flaws in open-source software as one of the country’s biggest security risks, and considers artificial intelligence vital to addressing them.
Free open-source programs, such as the Linux operating system, help run everything from websites to power stations. The code isn’t inherently worse than what’s in proprietary programs from companies like Microsoft and Oracle, but there aren’t enough skilled engineers tasked with testing it.
As a result, poorly maintained free code has been at the root of some of the most expensive cybersecurity breaches of all time, including the 2017 Equifax disaster that exposed the personal information of half of all Americans. The incident, which led to the largest-ever data breach settlement, cost the company more than $1 billion in improvements and penalties.
If people can’t keep up with all the code being woven into every industrial sector, DARPA hopes machines can.
“The goal is having an end-to-end ‘cyber reasoning system’ that leverages large language models to find vulnerabilities, prove that they are vulnerabilities, and patch them,” explained one of the advising professors, Arizona State’s Yan Shoshitaishvili.
To get there, the team is grappling with the often grim reality behind lofty AI aspirations. The students are doing things like imposing “sanity checks” to catch hallucinations, verifying that patches actually solve the issues they are supposed to, and having two AI systems debate each other over the best fixes — with a third AI deciding the winner.
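The kind of check described above can be illustrated with a minimal Python sketch. It is not Shellphish's code: the buggy function, the two candidate patches, and the `judge` callable (a stand-in for the third AI) are all hypothetical. The idea is only that a patch is accepted if the reported crash really reproduces on the original, no longer happens after the fix, and normal inputs still behave the same.

```python
from typing import Callable, Optional, Sequence

Func = Callable[[bytes], int]

def vulnerable(data: bytes) -> int:
    """Hypothetical buggy function: reads data[1] without checking the length."""
    return data[0] + data[1]

def patch_a(data: bytes) -> int:
    """Hypothetical candidate fix that hides the crash but breaks valid inputs."""
    return 0

def patch_b(data: bytes) -> int:
    """Hypothetical candidate fix that rejects short inputs and keeps normal behavior."""
    if len(data) < 2:
        return 0
    return data[0] + data[1]

def crashes(fn: Func, data: bytes) -> bool:
    """Return True if calling fn on data raises any exception."""
    try:
        fn(data)
    except Exception:
        return True
    return False

def passes_sanity_checks(candidate: Func, original: Func,
                         crash_input: bytes,
                         regression_inputs: Sequence[bytes]) -> bool:
    # The reported bug must reproduce on the original, or the report
    # may be a hallucination.
    if not crashes(original, crash_input):
        return False
    # The patch must actually stop the crash.
    if crashes(candidate, crash_input):
        return False
    # The patch must not change behavior on inputs that already worked.
    return all(not crashes(candidate, x) and candidate(x) == original(x)
               for x in regression_inputs)

def pick_patch(candidates: Sequence[Func],
               judge: Callable[[Sequence[Func]], Func],
               original: Func, crash_input: bytes,
               regression_inputs: Sequence[bytes]) -> Optional[Func]:
    """Filter candidates through the sanity checks, then let the judge choose."""
    survivors = [c for c in candidates
                 if passes_sanity_checks(c, original, crash_input, regression_inputs)]
    return judge(survivors) if survivors else None
```

Here a call such as `pick_patch([patch_a, patch_b], judge=lambda s: s[0], original=vulnerable, crash_input=b"\x01", regression_inputs=[b"\x01\x02"])` would discard `patch_a` before the judge ever sees it, since that candidate silences the crash only by breaking normal behavior.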
“AI is like a 3-year-old with infinite knowledge,” said UC-Santa Barbara graduate student and team co-captain Lukas Dresel. “You have to give it actionable feedback.”
Team Shellphish is one of about 40 contestants in a competition known as AIxCC, for artificial intelligence cyber challenge, which is run by DARPA, the Pentagon research arm charged with developing secret weapons and defending against them.
“We want to redefine how we secure widely used, critical codebases, because of how ubiquitous open-source is across the critical infrastructure sectors,” said Andrew Carney, DARPA project manager for the contest.
Though DARPA helped birth the internet to survive communication failures, it has become painfully obvious that the net also introduced enormous weaknesses.
With no built-in security, the vast interconnections allow anyone or anything to start from anywhere and look for ways into machines that power the modern world. Once inside, users can pose as employees or system administrators, steal national or trade secrets, and shut the place down or hold it up for ransom.
Hackers are claiming more victims than ever: The number of data breaches reported to the FBI-run U.S. Internet Crime Complaint Center tripled between 2021 and 2023. Government agents burrow into rival nations’ power and water plants. Crime gangs engorged by illicit profit think nothing of knocking out hospitals and sending desperate patients elsewhere.
Open-source software, whether written by students or farseeing geniuses, is almost as ubiquitous as the internet itself, by some estimates nestling inside 90% of commercial software.
Like all software, it has bugs, some of which can be exploited to seize control of a machine.
Some large open-source projects are run by near-Wikipedia-size armies of volunteers and are generally in good shape. Others have maintainers who receive grants from big corporate users, turning the work into a job.
And then there is everything else, including programs written as homework assignments by authors who barely remember them.
“Open source has always been ‘Use at your own risk,’” said Brian Behlendorf, who started the Open Source Security Foundation after decades of maintaining Apache, a pioneering piece of free server software, and other projects at the Apache Software Foundation.
“It’s not free as in speech, or even free as in beer,” he said. “It’s free as in puppy, and it needs care and feeding.”
The risks have been underscored recently by two very different incidents.
The first was a vulnerability in a small program for keeping track of system activity, known as Log4j, used by thousands of software developers and installed on millions of machines.
In 2013, a user proposed adding some code to Log4j, and the small Apache Foundation team maintaining Log4j approved it. In November 2021, a Chinese engineer saw that the added section contained a massive design flaw that would allow system takeovers, and he flagged the issue to the Apache group.
While Apache was working on a patch to fix the problem, an unidentified researcher discovered the pending changes and developed a malicious tool to grab control of computers running Log4j. Apache rushed out the patch, setting off a race between thousands of defenders and those trying to exploit the flaw before it was fixed.
Many Log4j instances have still not been fixed. On Thursday, the National Security Agency and others warned that North Korean spies were still breaking into U.S. web servers running old versions.
The White House’s Cyber Safety Review Board concluded that only better coding and thorough audits could have stopped the Log4j flaw’s distribution, and that open-source efforts like Apache’s “would need sustained financial support and expertise.”
The Department of Homeland Security’s Cybersecurity and Infrastructure Security Agency (CISA) has responded with small grants to start-ups and has been pushing companies to declare what’s inside their software. But those are slow-moving initiatives.
The most recent reminder of the vulnerability came in March. That’s when a Microsoft engineer traced a slight increase in processor use to open-source tools for Linux that had just been updated. He found that a back door for spying had been inserted by the tools’ official maintainer, and blew the whistle in time to stop it from shipping in the most popular versions of Linux.
In a nightmare scenario for security professionals, the anonymous maintainer had won control of the project after contributing for years, aided by secret allies who lobbied the previous manager to cede control.
As open-source security was rising to become a top priority for CISA and the national security establishment, OpenAI and Microsoft loosed ChatGPT and generative artificial intelligence on the world.
By democratizing programming, the new tools allowed non-coders to create software. AI also aided existing programmers, including criminal hackers who could more quickly incorporate tricks to take advantage of vulnerabilities and deliver more convincing lures, such as emails that appeared to come from regular contacts with shared interests.
AI is also boosting defensive endeavors, such as analyzing reams of logs for unusual behavior and summarizing security incidents. It can also flag security missteps in programs as they are written.
But figuring out where the holes in open-source programs are before attackers find them is a holy grail for DARPA and the contestants of AIxCC.
DARPA ran a cyber challenge at the 2016 Def Con hacker convention, where programs competed in a “capture the flag” contest to hack into one another in an artificial environment.
In this year’s contest, the teams use their AI-enhanced programs to digest and improve millions of lines of real code.
Shellphish is one of seven teams that wrote papers outlining their approach well enough to get $1 million in funding for the work leading up to the semifinals, which attracted 40 entries and will be held in August at Def Con. The winner will get another $2 million in 2025.
Some of Shellphish’s first million dollars went for the Airbnb-listed home in Brea, which housed hackers for three weeks in June and another two in July. More went for a huge testing environment that used 5,000 central processing unit cores.
Shellphish is no random group of hackers. Though strongly associated with two public universities with changing populations, the team has been around for 20 years, and its founders are still involved.
Italian native Giovanni Vigna was teaching computer security at UC-Santa Barbara, including techniques for attacking and defending, when he founded a capture-the-flag team in 2003 to get students more interested and stretch their capabilities. It won the Def Con competition in 2005 and hosted the contest later for a four-year stretch.
As his students graduated and spread to Arizona and elsewhere, some stayed involved, or got their own students into it.
Shellphish competed in the original 2016 Cyber Grand Challenge, but got knocked out before the finals.
“We had all these cool tools but ran out of time to integrate them,” Shoshitaishvili recalled. “So ‘Don’t get nerd-sniped’ was my No. 1 piece of advice.” (Nerd-sniping refers to distracting someone technical with an interesting problem.)
Core to the effort are tools known in security as “fuzzers.” These fire all manner of data at a program to see how it handles the unexpected.
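For readers who want to see the idea in miniature, here is a hedged Python sketch of the simplest possible fuzzer. It is a toy, not anything the team runs: `parse_record` is a made-up target with a deliberate bug, and real fuzzers are far more sophisticated about which inputs to try next.

```python
import random

def parse_record(data: bytes) -> int:
    """Hypothetical toy target: the first byte is a length field, the rest is payload."""
    if not data:
        return 0
    length = data[0]
    payload = data[1:]
    if length == 0:
        return 0
    # Deliberate bug: trusts the length field without checking the payload size.
    return payload[length - 1]

def fuzz(target, rounds=1000, max_len=8, seed=0):
    """Fire random byte strings at target and collect the inputs that make it raise."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    found = []
    for _ in range(rounds):
        size = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except Exception as exc:
            # Keep the input so the failure can be reproduced and later fixed.
            found.append((data, type(exc).__name__))
    return found
```

Calling `fuzz(parse_record)` returns a list of inputs paired with the name of the exception each one triggered. Every saved input doubles as proof that the flaw is real, because replaying it against the target reproduces the failure.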
Even souped-up fuzzers are unlikely to find the most obscure flaws or deliberate back doors, the team members admit. At its best, Shellphish’s master program and the others will be able to find a lot of low-hanging fruit, quickly, and get rid of it before malicious hackers can exploit it.
“AI will be able to solve things that take humans months,” Dresel said.
Under the terms of the DARPA contest, all finalists must release their programs as open source, so that software vendors and consumers will be able to run them.
Shoshitaishvili compared the expected advance to security milestones like forced software updates and browser “sandboxes” that keep web programs from escaping the browser and executing elsewhere on a user’s device.
AI won’t be able to make all software safe, he said. But it will give the humans more time to try.
After a final, near-sleepless night of debugging and panicked last-minute fixes, Shellphish submitted its program at the 9 a.m. deadline. In a few weeks, at the next Def Con in Las Vegas, they will find out if they’re finalists. Win or lose, their AI-aided code will be available for others to build on, improving security for everyone.