My friend Nikhil and I uncovered something unexpected: OpenAI's Deep Research feature (which I will refer to as ODR) can be used to hunt down exposed API keys in public GitHub and HuggingFace repositories. Ironically, this includes the very keys needed to access OpenAI's own services. The feature was recently made available to $20/month Plus subscribers, putting a powerful web search tool in many more hands. What concerned us most (no pun intended) is that we did not employ any sophisticated jailbreaking techniques to achieve this outcome: anyone could have done the same.
What is Deep Research?
This term has come to refer to a broad class of LLM tools augmented with web search capabilities. ODR specifically empowers ChatGPT to automatically search the web, analyze and parse content from a large number of sources, and provide a comprehensive summary in response to a user's query. For example, you can ask it to search the market for a bicycle that matches your height, budget, and must-have features. It will then access dozens of websites, parse the listings, and synthesize a list for you in about five minutes.
Traditional chat usage has been to ask a question and get an answer instantly, but we are now seeing an evolution in how LLMs are used as part of larger systems capable of longer-horizon tasks. In fact, OpenAI wasn't the first to release this kind of tool: Google released a product with the same name late last year, Grok just added the feature as well, and Perplexity's whole business is integrating LLMs with web search. While these tools represent the cutting edge of AI assistance, they can also create unexpected security implications, as we discovered recently.
Our Serendipitous Discovery
While working on a project for an LLM security class, my partner Nikhil and I were using Gemini's free fine-tuning service, which made me wonder whether OpenAI had similar free offerings. To my dismay, I found that they had discontinued their sign-up promotion for free API credits long ago.
This reminded me of when ChatGPT was first released and produced a flock of unseasoned developers who knew little about security and ended up exposing their API keys online. Mostly as a joke, and partly out of wishful thinking, I wondered whether ChatGPT's new web search feature could be used to find these secrets. Here is what I tried:

While I thought I had ended up empty-handed, Nikhil noticed that it didn't refuse the request. Having a Plus membership, he tried the prompt with ODR along with a few clarifying follow-up statements. The end result is shown below:

The first deep research run returned about a dozen keys, one of which actually worked (we responsibly notified the owner of the repository after confirming the key was active). A naive second attempt did not yield any new active keys, but with a limited number of queries per month, we couldn't test this much further.
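As an aside, confirming that a key is active does not require spending the owner's credits: a read-only request to an authenticated endpoint is enough. The sketch below illustrates that general approach; it is not the exact script we used, and the key shown is a placeholder.

```python
# Illustrative sketch: check whether an OpenAI API key still authenticates by
# hitting a lightweight, read-only endpoint. HTTP 200 means the key is active;
# 401 means it is invalid or has been revoked. No credits are consumed.
import requests

def key_is_active(api_key: str) -> bool:
    resp = requests.get(
        "https://api.openai.com/v1/models",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    return resp.status_code == 200

if __name__ == "__main__":
    print(key_is_active("sk-PLACEHOLDER"))  # placeholder, not a real credential
```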
This initial success raised an important question: was this capability unique to OpenAI's implementation, or could other similar tools be used in the same way? Both Perplexity and Grok have similar features available for free. I tried the exact same prompt and some variations, but none succeeded: both said they weren't really capable of parsing through websites the way a human would. It seems they only have access to a limited search API.
This highlights a key difference between Perplexity and ODR. Perplexity simply draws on its pre-training data and whatever relevant data comes back from simple searches. ODR, on the other hand, can perform real-time analysis by running Python code. This is evident when looking at the provided reasoning trace:

How it Differs From Historical Approaches
While our discovery might alarm some, security professionals will recognize that automated secret scanning isn't new. Tools for automatically scanning repositories for exposed secrets have existed for years. In fact, GitHub has its own secret scanner that is free for any public repository.
Traditional scanning tools like GitRob and TruffleHog approach the problem through a few core techniques (a minimal sketch in code follows this list):
- Using regular expressions to match known formats of API keys
- Measuring the randomness of strings to identify potential keys
- Examining all commits, not just the current repository state
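To make that concrete, here is a minimal sketch of the classic approach: a regular expression for one well-known key format plus a Shannon-entropy check to flag suspiciously random strings. It is purely illustrative and far simpler than what GitRob or TruffleHog actually do (it scans a blob of text rather than full commit history).

```python
# Minimal sketch of classic secret scanning: regex matching for a known key
# format plus an entropy check to flag suspiciously random strings.
import math
import re

# Example pattern for one well-known format (OpenAI-style "sk-" keys).
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}")

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character; random keys score higher than prose."""
    if not s:
        return 0.0
    counts = {c: s.count(c) for c in set(s)}
    return -sum((n / len(s)) * math.log2(n / len(s)) for n in counts.values())

def scan(text: str, entropy_threshold: float = 3.5) -> list[str]:
    """Return regex hits that also look high-entropy enough to be real keys."""
    return [m for m in KEY_PATTERN.findall(text) if shannon_entropy(m) > entropy_threshold]

if __name__ == "__main__":
    sample = 'OPENAI_API_KEY = "sk-abc123fakeFAKEfake456xyz789"  # committed by accident'
    print(scan(sample))
```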
While we can't know exactly how Deep Research works internally, our observations suggest it might differ in these ways:
- It appears to leverage web scraping rather than direct API access
- It seems to reason about where it's most likely to find keys (like .env files)
- It seems capable of writing rudimentary scripts that perform pattern recognition while also using context from the surrounding code (a speculative sketch follows this list)
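Based on the reasoning trace, the rudimentary scripts ODR writes appear to be targeted: rather than scanning everything, they look at likely files and use surrounding names (such as a variable called OPENAI_API_KEY) as context. The following is our speculation about what such a script might resemble; the variable names and provider mapping are hypothetical, not pulled from ODR itself.

```python
# Speculative sketch of context-aware matching: look only at likely files
# (.env, config) and use the surrounding variable name to guess which
# provider a candidate secret belongs to.
import re

# Hypothetical mapping from common env-variable names to providers.
PROVIDER_HINTS = {
    "OPENAI_API_KEY": "openai",
    "HF_TOKEN": "huggingface",
    "AWS_SECRET_ACCESS_KEY": "aws",
}

# Matches simple KEY=value assignments, as found in .env files.
ASSIGNMENT = re.compile(r"""^\s*([A-Z0-9_]+)\s*=\s*["']?([^"'\s#]+)""", re.MULTILINE)

def extract_candidates(env_text: str) -> list[tuple[str, str]]:
    """Return (provider, value) pairs for assignments whose names look sensitive."""
    hits = []
    for name, value in ASSIGNMENT.findall(env_text):
        if name in PROVIDER_HINTS or "KEY" in name or "TOKEN" in name:
            hits.append((PROVIDER_HINTS.get(name, "unknown"), value))
    return hits

if __name__ == "__main__":
    sample = 'DEBUG=true\nOPENAI_API_KEY="sk-fake-example-value"\n'
    print(extract_candidates(sample))
```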
What's most significant is not necessarily advanced technical capabilities, but rather the accessibility. ODR packages existing scanning techniques into a user-friendly interface that requires no technical setup or specialized knowledge. This democratization of security tooling is the real change, as it puts previously specialized techniques into anyone's hands through a simple chat interface.
In our limited testing, we can't conclusively state that it's more effective than traditional tools, but in a rapidly evolving landscape, many capabilities may seem to emerge out of nowhere. Just remember how most of the world was shocked by the launch of ChatGPT in 2022, even though anyone who had been paying attention could have seen it coming. For better or worse, these tech giants set the direction of the field. If they make a shiny new tool, everyone will try to recreate it, including open-source versions. That creates more and more opportunities for adversaries to find exploits.
Highlighting the Need for Red Teaming
GitHub repositories, while often used for code sharing and collaboration, are not secure storage locations for sensitive credentials. In 2016, Uber's AWS account was compromised when hackers gained access to its GitHub account and found credentials stored in plain text, allowing them to access the personal information of around 50 million users. Uber is not alone in this blunder: GitGuardian's State of Secrets Sprawl report details the detection of millions of unique secrets in both public and private code repositories, and the number of exposed secrets it finds keeps growing year over year. Yet many years later, the same issue that plagued Uber is becoming even easier to exploit, highlighting how security considerations often remain secondary to functionality.
In hindsight, this seemingly intuitive yet malicious use case of Deep Research likely flew under the product team's radar because they were focused on intended applications rather than potential misuse. This oversight demonstrates a recurring pattern in AI development:
- A capability is released with legitimate use cases in mind
- The same capability enables unexpected misuse cases
- Only after public release do these misuse cases become apparent
This is precisely why red teaming is crucial for AI systems. Effective red teams approach products with an adversarial mindset, asking not "What was this designed to do?" but "How could this be misused?" The challenge isn't just identifying technical exploits like jailbreaks or prompt injections, but also anticipating how normal features might intersect with sensitive domains in unexpected ways.
As the capabilities of these LLM agents increase, we must be increasingly aware of these adversarial use cases. Browser automation could be used for digital dumpster diving: systematically searching through public data sources for specific types of information (financial data, personal details, internal documentation) that was accidentally made public. These agents could also aid in illegal operations; for example, they could potentially find and utilize credentials to create networks of accounts for crypto tumbling, wash trading, or market manipulation. It's important to reiterate that these tools will only get stronger in the next few years.
To this end, we should encourage more bounty events like the one Anthropic ran recently. They thought they had made progress toward jailbreak-proof models, only to find their defenses broken within five days of releasing the bounty. Companies can also benefit from dedicated adversarial use teams that specifically focus on how features could be repurposed in unexpected ways, and from ensuring that employees are aware of critical safety practices. After all, no lock is strong enough when you leave the door wide open.
Acknowledgements
Shout out to Nikhil for helping me test ODR and for finding that it worked. You can check out his website here. I also want to thank Prof. Earlence Fernandes for encouraging us to write about this and for guiding us on how to responsibly disclose the vulnerability. You can check out his website here.