Search This Blog

Powered by Blogger.

Blog Archive

Labels

Footer About

Footer About

Labels

Showing posts with label prompt injection attack. Show all posts

OpenClaw Security Flaws Expose AI Agents to Hidden Commands and Data Theft Risks

 

Two independent cybersecurity studies published this week have uncovered serious security weaknesses in OpenClaw, a widely used self-hosted AI agent platform. The findings demonstrate how attackers can manipulate AI agents into executing malicious code or leaking sensitive information through seemingly harmless inputs.

Researchers from Imperva and Varonis approached the issue from different angles but reached a similar conclusion: AI agents that trust incoming data and possess broad system access can become powerful attack vectors when exploited.

Hidden Instructions Embedded in Everyday Content

Imperva researchers discovered that OpenClaw could be tricked into processing concealed instructions embedded within shared contacts, vCards, and location pins. These malicious commands were executed by the AI agent without any visible indication to the user.

The issue stemmed from how OpenClaw handled certain message objects before passing them to the large language model (LLM). While content fetched from the web was clearly marked as untrusted, information contained within contacts, vCards, and location labels was inserted directly into prompts without any trust boundary.

According to Imperva researcher Yohann Sillam, this allowed attackers to hide instructions inside fields such as contact names. Since angle brackets are permitted in contact names, the model could not reliably distinguish legitimate information from injected commands.

Only selected fields were transmitted to the model, making them attractive targets. In one example, a shared contact was serialized as <contact: name, number>, allowing attackers to insert malicious instructions within the name field itself. Because messaging apps truncate long contact names, victims often never saw the hidden payload.

The same attack method was also successful through WhatsApp-supported vCards and shared location labels.

During testing against Gemini 3.1 Pro's preview build, hidden instructions successfully convinced the AI agent to download and execute a script hosted on servers controlled by the researchers. Similar attempts using images with embedded instructions failed, likely because AI models have become more resistant to that well-known attack technique.

Imperva warned that OpenClaw's default memory functionality could amplify the threat. A single malicious piece of widely shared content could potentially affect multiple agents if adequate sandboxing protections were absent.

Following responsible disclosure, OpenClaw addressed the issue in version 2026.4.23. The update separates contact names, vCard information, and location labels from the main prompt and places them in an isolated untrusted metadata channel.

Researchers also noted that similar design patterns exist in several other personal AI assistant platforms, suggesting the issue extends beyond OpenClaw alone.

Social Engineering Defeats Technical Safeguards

While Imperva focused on prompt injection, Varonis Threat Labs explored how AI agents respond to social engineering attacks.

Led by researcher Itay Yashar, the Varonis team created an OpenClaw-based agent called Pinchy and connected it to a Gmail inbox filled with realistic business communications and synthetic sensitive information. The researchers then tested the agent using four different phishing scenarios involving Google Gemini 3.1 Pro and OpenAI Codex GPT-5.4.

Varonis distinguishes traditional prompt injection from what it calls "agent phishing." Unlike hidden instructions embedded in content, agent phishing relies on convincing requests delivered through normal communication channels, exploiting the agent's willingness to act before verifying legitimacy.

The tests revealed significant weaknesses.

In one scenario, an email impersonating a team leader named Dan requested urgent staging access during a simulated production emergency. The message originated from an external Gmail account, yet the agent located and forwarded mock AWS IAM access keys, database connection credentials, and SSH details in plain text.

A second phishing attempt used a more routine business request, asking for a weekly customer export supposedly needed for a QBR presentation. The agent responded by sending a synthetic database containing information on 247 enterprise customers, including contact details and contract values.

Notably, these failures occurred despite the agent being configured with instructions to verify sender identities before responding. Researchers observed that urgency successfully bypassed safeguards in one case, while routine business language defeated them in another.

The agent demonstrated stronger performance against technically oriented threats. It interacted with a phishing page designed to steal gift-card credentials but ultimately withheld sensitive information and flagged suspicious behavior. A stricter configuration blocked the page entirely.

Similarly, when presented with a malicious OAuth consent screen disguised as a timesheet application, the agent examined the redirect destination, recognized warning signs, and refused access.

Researchers concluded that AI agents may outperform many users when identifying suspicious URLs and fraudulent login portals. However, they remain vulnerable to social manipulation that exploits helpfulness and trust.

Varonis also observed that OpenAI Codex GPT-5.4 behaved more cautiously than Gemini 3.1 Pro when interacting with external websites or transmitting data. Nevertheless, both models ultimately fell victim to the social-engineering scenarios.

One Core Problem Behind Multiple Attacks

Varonis linked both attack methods to what researcher Simon Willison describes as the "lethal trifecta": an AI system capable of accessing private data, consuming untrusted content, and transmitting information externally.

OpenClaw satisfies all three conditions, making both hidden prompt injections and phishing-based attacks highly effective.

Additional concerns emerged from a separate InfoSec Write-ups analysis. Researchers converted historical OpenClaw security advisories into static-analysis rules and uncovered five additional vulnerabilities affecting integrations with Slack, Discord, Matrix, Zalo, and Microsoft Teams.

Each flaw originated from the same design issue. Channel allowlists were validated using mutable display names rather than permanent identifiers. Attackers could therefore impersonate trusted users simply by changing their display names to match approved accounts.

OpenClaw has since patched these vulnerabilities.

The platform's extensive permissions—including access to files, shell environments, and more than twenty messaging services—have previously prompted warnings regarding prompt injection and data exfiltration risks.

The strongest criticism came from the Dutch data protection authority, the Autoriteit Persoonsgegevens, which advised users and organizations against deploying OpenClaw on systems containing sensitive information due to concerns over data breaches and account compromise.

Recommended Defenses

Organizations using OpenClaw are advised to upgrade immediately to version 2026.4.23 or newer to mitigate the message-object vulnerability identified by Imperva.

However, researchers stress that software updates alone cannot solve the broader trust problem inherent in autonomous AI systems.

Varonis recommends four key safeguards:

  • Treat agent instruction files as strict, version-controlled policies rather than informal guidance.

  • Require approval before agents send messages to unfamiliar recipients, reducing the risk of automated phishing or data leakage.

  • Restrict access to connected systems based on the trustworthiness of the triggering source.

  • Require human review for high-risk actions such as credential sharing, financial transactions, or sensitive data transfers.

Both research teams ultimately advocate the same mindset. Varonis recommends treating AI agents as inexperienced employees with extensive system access but limited judgment, while Imperva describes them as authenticated executors that inherently trust incoming information.

Although vendors continue to introduce patches and protective controls, the fundamental challenge remains unresolved. AI agents derive their usefulness from acting on instructions, processing inputs, and helping users accomplish tasks. Those same characteristics also create opportunities for attackers, and the industry has yet to develop a universal solution.

Researcher Warns of ‘ChatGPhish’ Vulnerability That Could Turn Web Summaries Into Phishing Attacks

 

A cybersecurity researcher has raised concerns over a newly identified vulnerability in ChatGPT that could allow attackers to manipulate the chatbot's responses through hidden instructions embedded within web pages.

The issue, discovered by Permiso threat hunter Andi Ahmeti, reportedly enables malicious actors to influence ChatGPT when users ask the AI assistant to summarize online content. According to Ahmeti, if a webpage contains concealed prompt instructions, ChatGPT may unknowingly follow them and display attacker-controlled content alongside legitimate summaries.

The researcher explained that this weakness could be exploited to insert phishing links, fake security notifications, or other deceptive messages that appear to originate from ChatGPT itself. In some cases, attackers could even leverage QR codes embedded within AI-generated responses to redirect users to malicious websites.

“AI systems increasingly render untrusted content directly inside browsers, which expands risk significantly,” Ahmeti told us. “The bigger issue is that AI products are starting to resemble browser or operating system environments, which creates a much larger security surface.”

Ahmeti disclosed the vulnerability, which he has named “ChatGPhish,” through OpenAI’s Bugcrowd disclosure program. He initially submitted the report on April 29 and later updated it on May 1 with additional information.

“The initial submission was marked as not reproducible,” he said. “We resubmitted with additional detail and it was marked as a duplicate.”

According to Ahmeti, the issue his team reported differed significantly from the previously identified vulnerability it was allegedly linked to.

“The issue Permiso reported and the supposed duplicate ‘had major differences,’” Ahmeti said. “We reached out again to clarify those differences and request additional details, but we did not receive a response.”

At the time of publication, OpenAI had not confirmed whether any remediation measures had been implemented.

“At the time of publication, ‘we have not received confirmation from OpenAI on whether a fix has been applied,’” he told us.

To demonstrate the threat, Ahmeti embedded hidden instructions into a GitHub-hosted CloudLens page. The injected prompt directed ChatGPT to generate a standard summary while also appending a fabricated account-security warning containing a malicious hyperlink.

When users asked ChatGPT to summarize the page, the chatbot correctly described CloudLens and its cloud security functions. However, it also displayed an additional warning message suggesting that a new device had accessed the user's account, along with a clickable link controlled by the attacker.

The researcher noted that the same technique could be used to insert QR codes into ChatGPT’s responses.

“Because the chatgpt.com client auto-fetches and displays Markdown images, an attacker can place a QR code in the assistant’s output,” he wrote. “Scanning it on a phone takes the victim to an attacker-controlled URL that has never been displayed in plaintext.”

To verify that the issue was not specific to GitHub, Ahmeti repeated the experiment on a self-hosted website based in Kosovo. The results were reportedly identical, with ChatGPT generating a legitimate summary before appending a misleading security alert containing an attacker-controlled link.

“The behavior is identical: the assistant produces a normal summary, then appends a spoofed alert with a clickable attacker link,” Ahmeti wrote.

While Ahmeti acknowledged that there may not be a single solution to prompt injection attacks, he recommended stronger isolation mechanisms, stricter content filtering, and rendering safeguards for AI-generated outputs.

“Do not trust model output,” Ahmeti said. “AI-generated content should always be treated as untrusted. Assume prompt injection will happen.”

He also emphasized that prompt injection should be viewed as a broader application-security challenge rather than solely a model-alignment issue.

“Prompt injection has increasingly become an application-security problem, not just a model alignment issue,” he told us. “The real concern is what systems the model can influence: browsers, plugins, tools, memory, or external services.”

AI’s Hidden Weak Spot: How Hackers Are Turning Smart Assistants into Secret Spies

 

As artificial intelligence becomes part of everyday life, cybercriminals are already exploiting its vulnerabilities. One major threat shaking up the tech world is the prompt injection attack — a method where hidden commands override an AI’s normal behavior, turning helpful chatbots like ChatGPT, Gemini, or Claude into silent partners in crime.

A prompt injection occurs when hackers embed secret instructions inside what looks like an ordinary input. The AI can’t tell the difference between developer-given rules and user input, so it processes everything as one continuous prompt. This loophole lets attackers trick the model into following their commands — stealing data, installing malware, or even hijacking smart home devices.

Security experts warn that these malicious instructions can be hidden in everyday digital spaces — web pages, calendar invites, PDFs, or even emails. Attackers disguise their prompts using invisible Unicode characters, white text on white backgrounds, or zero-sized fonts. The AI then reads and executes these hidden commands without realizing they are malicious — and the user remains completely unaware that an attack has occurred.

For instance, a company might upload a market research report for analysis, unaware that the file secretly contains instructions to share confidential pricing data. The AI dutifully completes both tasks, leaking sensitive information without flagging any issue.

In another chilling example from the Black Hat security conference, hidden prompts in calendar invites caused AI systems to turn off lights, open windows, and even activate boilers — all because users innocently asked Gemini to summarize their schedules.

Prompt injection attacks mainly fall into two categories:

  • Direct Prompt Injection: Attackers directly type malicious commands that override the AI’s normal functions.

  • Indirect Prompt Injection: Hackers hide commands in external files or links that the AI processes later — a far stealthier and more dangerous method.

There are also advanced techniques like multi-agent infections (where prompts spread like viruses between AI systems), multimodal attacks (hiding commands in images, audio, or video), hybrid attacks (combining prompt injection with traditional exploits like XSS), and recursive injections (where AI generates new prompts that further compromise itself).

It’s crucial to note that prompt injection isn’t the same as “jailbreaking.” While jailbreaking tries to bypass safety filters for restricted content, prompt injection reprograms the AI entirely — often without the user realizing it.

How to Stay Safe from Prompt Injection Attacks

Even though many solutions focus on corporate users, individuals can also protect themselves:

  • Be cautious with links, PDFs, or emails you ask an AI to summarize — they could contain hidden instructions.
  • Never connect AI tools directly to sensitive accounts or data.
  • Avoid “ignore all instructions” or “pretend you’re unrestricted” prompts, as they weaken built-in safety controls.
  • Watch for unusual AI behavior, such as strange replies or unauthorized actions — and stop the session immediately.
  • Always use updated versions of AI tools and apps to stay protected against known vulnerabilities.

AI may be transforming our world, but as with any technology, awareness is key. Hidden inside harmless-looking prompts, hackers are already whispering commands that could make your favorite AI assistant act against you — without you ever knowing.

Researchers Expose AI Prompt Injection Attack Hidden in Images

 

Researchers have unveiled a new type of cyberattack that can steal sensitive user data by embedding hidden prompts inside images processed by AI platforms. These malicious instructions remain invisible to the human eye but become detectable once the images are downscaled using common resampling techniques before being sent to a large language model (LLM).

The technique, designed by Trail of Bits experts Kikimora Morozova and Suha Sabi Hussain, builds on earlier research from a 2020 USENIX paper by TU Braunschweig, which first proposed the concept of image-scaling attacks in machine learning systems.

Typically, when users upload pictures into AI tools, the images are automatically reduced in quality for efficiency and cost optimization. Depending on the resampling method—such as nearest neighbor, bilinear, or bicubic interpolation—aliasing artifacts can emerge, unintentionally revealing hidden patterns if the source image was crafted with this purpose in mind.

In one demonstration by Trail of Bits, carefully engineered dark areas within a malicious image shifted colors when processed through bicubic downscaling. This transformation exposed black text that the AI system interpreted as additional user instructions. While everything appeared normal to the end user, the model silently executed these hidden commands, potentially leaking data or performing harmful tasks.

In practice, the team showed how this vulnerability could be exploited in Gemini CLI, where hidden prompts enabled the extraction of Google Calendar data to an external email address. With Zapier MCP configured to trust=True, the tool calls were automatically approved without requiring user consent.

The researchers emphasized that the success of such attacks depends on tailoring the malicious image to the specific downscaling algorithm used by each AI system. Their testing confirmed the method’s effectiveness against:

  1. Google Gemini CLI
  2. Vertex AI Studio (Gemini backend)
  3. Gemini’s web interface
  4. Gemini API via llm CLI
  5. Google Assistant on Android
  6. Genspark

Given the broad scope of this vulnerability, the team developed Anamorpher, an open-source tool (currently in beta) that can generate attack-ready images aligned with multiple downscaling methods.

To defend against this threat, Trail of Bits recommends that AI platforms enforce image dimension limits, provide a preview of the downscaled output before submission to an LLM, and require explicit user approval for sensitive tool calls—especially if text is detected in images.

"The strongest defense, however, is to implement secure design patterns and systematic defenses that mitigate impactful prompt injection beyond multi-modal prompt injection," the researchers said, pointing to their earlier paper on robust LLM design strategies.