Hi,
So here are some things I’m thinking about this week.
AI needs wizards
It’s not commonly understood outside of the cybersecurity industry that cybersecurity isn’t just an industrial-scale effort. Of course, many aspects of cybersecurity have been improved through systematic, rigorous, and often genius efforts applied at scale. For example, every line of code in the browser you’re reading this on has been reviewed not only by humans, but by a rigorous battery of automated tests, static analysis tools, and the like, designed to catch and prevent entire classes of vulnerabilities that are known to commonly recur when Humans Write Code. This work has slowly but surely raised the waterline of cybersecurity, to the point where you now have very little to worry about when, e.g., buying things with your credit card online, which was not the case a quarter-century ago.1
But as much as the industrial approach has moved us forward, in parallel a more-bespoke brand of wizardry is needed, too. You need people with strange curiosity and an adversarial mindset banging on the doors of your software to know if it’s truly secure; and many of those people, frankly, don’t want to work office jobs. To be sure, many people who think that way do work in the industrial-scale side of things, but a significant number of them contribute their wizardry in ways that look more like freelance experimentation in a wizard’s tower.
The cybersecurity industry resisted this for the longest time: first, denying that the long-haired wizards had anything to contribute at all; then, demanding that the wizards sign massive NDAs and hand over their findings for free in exchange for not being turned over to the FBI; but eventually, and currently, accepting that there’s a role for bringing the wizards inside the tent and paying them (in prestige or cash or, increasingly, both) to nudge them to work with you.
This, frankly, scares the hell out of lawyers; they might reasonably ask, “You’re telling me we’re not only going to be okay with people hacking our stuff, but we’re going to encourage them?”
Well, yes. The software you’re reading this on is in part secured by an implicit deal that Apple or Google or whoever will not only try to restrain its own lawyers from suing a random 14-year-old hacker in (e.g.) Egypt, but will also often reward them if they report a serious bug or vulnerability. (Of course, there’s also a massive, nation-state-scale level of expertise inside each of those software companies testing and fortifying their own software; the industrial and wizardry efforts are complementary approaches.)
So, what happens if you have software that you really, really need to secure, but it’s really, really new and weird? So new and weird that there’s no industrial approach, just yet? Something like, oh I don’t know, Large Language Model AIs like ChatGPT or Bard or Claude or something?
You throw a party, and invite both the industrial types and the wizards.
Earlier this month, the AI Village at the hacking conference DEF CON in Las Vegas held what they called the “Generative AI Red Teaming Challenge” with the support of, no kidding, the White House Office of Science and Technology Policy, as well as all the major AI companies and a bunch of other nonprofits. (Red teaming, for those less familiar, means attacking a computer system while pretending to be a bad guy.)
In the challenge, you got access to eight white-labeled Large Language Models (LLMs) from various AI companies.2 Some had features intentionally weakened for pedagogy; some were closer to full strength. You had 50 minutes to get them to do bad stuff, and your successes were scored.
Terrifyingly, the official estimate is that the ~2,450 hackers who participated doubled the total number of humans who have ever red-teamed an AI model. And some novel and dangerous stuff was reportedly found and disclosed.
(Oh, not to worry, we’re building a god in our basement, and sufficiently few experts are involved that you can double the testing team by asking hackers to delay getting a beer for an hour3 at a convention and come play with your toys.)
It was a bold and awesome decision to bring these models to DEF CON to be tested by thousands of professional hackers, and I genuinely salute the companies and nonprofits, as well as the White House, for having the confidence to put their work out there for criticism, especially since some of the models were deliberately weakened and so would fail in ways they would not in real life.
While specific attacks are supposed to remain undisclosed for a period of one year, the companies appear to be OK with us discussing general takeaways. So here is what I learned, all of which I've seen (in some form) in the media or in previous writing. In other words, it’s a best-practices list, NOT novel research. I’m no wizard, but I do know how to copy a few of their spells.
And even if you’re not using Generative AI for hacking, you might well be using it for, well, actual work in your day job, so perhaps this will be of some interest anyway. For each security takeaway, I’ll offer a “mundane utility” application as well.
My key takeaways:
1. Why not just:
In general, your first approach should always be to just ask the model to Do The Thing before coming up with a cleverer attack. It works surprisingly often. Always ask yourself, "why not just ______".
Mundane utility application: Same.
2. Old tricks are the best tricks:
Similarly, always ask yourself how a computer programmer from 1995, before modern mitigations were created, would attack an LLM: basic attacks on inputs, renaming things, manipulating stored data, and so on. The LLM equivalents of those protections haven't been fully fleshed out yet, so you can often get a nearly-right output, then tweak it into the right one by asking the model to find-and-replace various elements.
Mundane utility application: For those with more of a business background, think, "how would I manipulate this text in Excel to get the desired output?"
3. Conversations aren't soundbites:
A bunch of the good safety features for LLMs only really work in a conversational context: you can often get a model to give warnings early in a conversation, but it won't always repeat them throughout the convo. You can often ask it to rephrase previous content tersely, or as bullet points, or by combining multiple previous outputs.
Bear in mind that the LLM can be its own attacker here: LLMs work on next-token prediction, so if you get the chatbot to produce an output that's sorta-kinda close to where you’re going, it’s more likely to move further in that direction when prompted to look back at its own output.
Mundane utility application: Think about a prompt you’ve failed to get good results from in the past: is there a way to break it up into steps or pieces? Can you get the chatbot to encourage you along the way, or can you encourage the chatbot along the way, to get the final result?
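For the programmatically inclined, here's a minimal sketch of that break-it-into-steps approach against a chat API. To be clear, this is my own illustration, not anything from the challenge: it assumes the OpenAI Python SDK, an illustrative model name, and placeholder prompts.

```python
# Minimal sketch: break a big ask into steps and feed the model's own
# output back to it, so each turn builds on the last.
# Assumes the OpenAI Python SDK (`pip install openai`) and an API key in
# the OPENAI_API_KEY environment variable; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a concise writing assistant."}]

def ask(prompt: str) -> str:
    """Send one turn, keep it in the running conversation, and return the reply."""
    history.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    text = response.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return text

# Step 1: get something in the neighborhood of what you want.
draft = ask("Draft a short, apologetic email about a deadline I missed.")
# Step 2: nudge the model to build on its own prior output.
tighter = ask("Good. Now rephrase that as three terse bullet points.")
print(tighter)
```

The point of keeping the running history is exactly the dynamic described above: each new turn conditions on the model’s own prior output, which pulls it further in whatever direction that output already leans.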
4. Dual-use remains an unsolved moral problem:
Some LLMs will refuse to give you information about bad things even if you give them a good reason; others won't. Sometimes saying "I need this info to protect myself from harm" is a sufficient predicate to get instructions on how to do something bad; other times, the model will refuse to give you information useful to good guys and bad guys alike.
Note that the most advanced LLMs will often flinch from providing information that you can easily Google, if they're convinced the topic is dangerous. In general, the more benign a reason you have for wanting to do something, the more likely the LLM is to play along.
Mundane utility application: Consider telling an LLM, in detail, a fictional story that precisely justifies the type of output you want, and discourages the type of output you don’t.
5. Think about the training data:
Whenever you're in doubt about how to proceed, think about what the LLM is likely trained on: the corpus of the whole searchable internet and most books. Where does the thing you want to get out of the model appear most often in that corpus, and how can you get the model to focus on that context, rather than on potential harms? (As Simon Willison points out, for example, a popular fictional character typically appears in that data set far more often in fan fiction than in the original work.)
Mundane utility application: think of a specific named internet person4 who did the thing you want the model to do, and then prompt the model to go there. Did they tweet, or write a bunch of essays on the topic and post them online? The more they did, the more likely the model is to be able to narrow in on their voice. For example, as I’ve tweeted about before: my current best approach for getting useful text to send to a bureaucrat is prompting ChatGPT to write in the style of my friend, Patrick “patio11” McKenzie, who has made a career out of helping people out of bureaucratic jams.5
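If you want to reuse that last trick programmatically, the prompt really is the whole technique. Here's a tiny, purely illustrative helper that wraps the footnoted patio11 prompt; the function name, arguments, and the permit example are placeholders of my own, not anything official.

```python
# Purely illustrative helper for the style-anchoring trick above.
# The wording is adapted from the footnoted patio11 prompt; swap in whichever
# well-documented internet voice (and task) fits your own situation.
def style_anchored_prompt(persona: str, task: str, paragraphs: int = 3) -> str:
    """Build a 'write in the voice of a well-documented internet person' prompt."""
    return (
        f"Write a letter of {paragraphs} paragraphs, in a polite but terse "
        f"manner modeled after {persona}. The letter should {task}."
    )

print(style_anchored_prompt(
    persona='Patrick "patio11" McKenzie',
    task="ask a city records office to correct a misspelled name on a permit",
))
```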
Overall, this was the most fun that I’ve ever had at DEF CON, and encouraged me to try a bunch of new projects with generative AI. I can’t wait to see what the next year brings, and what the eventual report-out by the Generative AI Red Team organizers reveals about what the true wizards found…
I’m looking for new opportunities. I’ve been especially successful in the past in corporate strategy and consulting roles where I work closely with, yes, wizards, but also technical, product, and legal experts, and I’m looking for something similar in my next role. I’m open to a range of industries, but I’m especially interested in organizations where I can help keep people safe from bad guys, and/or help them get easier access to the government services and benefits they deserve. If you’re aware of an interesting opportunity, I’d love to chat with you and learn more.
Disclosures:
Views are my own and do not represent those of current or former clients, employers, or friends.
1. Is this what getting old feels like?
2. While the models were supposed not to reveal which company they came from, some of them would answer, “As an AI model from [COMPANY NAME]…”
3. Honesty compels me to note that some hackers had probably already had a beer. But you get the point.
4. Or, failing that, at least a class of person about whom many articles have been written. For example, “Harvard Business Review manager” is also a useful prompt.
5. “Write a letter of three paragraphs, in a polite but terse manner modeled after Patrick "patio11" McKenzie” is an extremely useful prompt for many mundane tasks.