API tokens belonging to tech giants Meta, Microsoft, Google, VMware, and more have been found exposed on Hugging Face, opening those organizations up to potential supply chain attacks.
Researchers at Lasso Security found more than 1,500 exposed API tokens on the open source data science and machine learning platform – which allowed them to gain access to 723 organizations' accounts.
In the vast majority of cases (655), the exposed tokens had write permissions granting the ability to modify files in account repositories. A total of 77 organizations were exposed in this way, including Meta, EleutherAI, and BigScience Workshop – which run the Llama, Pythia, and Bloom projects respectively.
The three companies were contacted by The Register for comment; Meta and BigScience Workshop had not responded at the time of publication, although all of them closed the holes shortly after being notified.
Hugging Face is akin to GitHub for AI enthusiasts and hosts a plethora of major projects, including more than 250,000 datasets and more than 500,000 AI models.
The researchers say that if attackers had exploited the exposed API tokens, it could have led to them swiping data, poisoning training data, or stealing models altogether, impacting more than 1 million users.
Through their own testing alone, the researchers say they gained sufficient access to modify 14 different datasets with tens of thousands of downloads per month.
Data poisoning attacks of this kind are among the most critical threats facing AI and ML as their prominence grows, Forcepoint says. The attack is in OWASP's top 10 risks for LLMs and could lead to a range of consequences.
Google's anti-spam filters for Gmail are effective because of the reliably trained models that power the feature, but these have been compromised on a number of occasions in the past to push malicious emails, disguised as benign, into users' inboxes.
Another hypothetical scenario in which data poisoning could have a serious organizational impact is if the dataset that designates different types of network traffic were to be sabotaged. If network traffic isn't correctly identified as email, web browsing, etc, then it could lead to misallocated resources and potential network performance issues.
Lasso Security's researchers were also able to gain the required access to steal more than 10,000 private models, a threat that also makes OWASP's top 10 AI security risks.
"The ramifications of this breach are far-reaching, as we successfully attained full access, both read and write permissions to Meta Llama 2, BigScience Workshop, and EleutherAI, all of these organizations own models with millions of downloads – an outcome that leaves the organization susceptible to potential exploitation by malicious actors," says Bar Lanyado, security researcher at Lasso Security.
"The gravity of the situation cannot be overstated. With control over an organization boasting millions of downloads, we now possess the capability to manipulate existing models, potentially turning them into malicious entities. This implies a dire threat, as the injection of corrupted models could affect millions of users who rely on these foundational models for their applications."
The researchers discovered the exposed API tokens by running a series of substring searches on the platform and collecting the results manually. They then called Hugging Face's whoami API to determine whether each token was valid, who owned it, the owner's email, which organizations the owner belonged to, and the token's permissions.
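That validation step amounts to a single API call. Below is a minimal sketch of what it might look like using the huggingface_hub library's whoami method; the response fields pulled out here are illustrative assumptions, not a claim about the researchers' exact tooling.

```python
# Minimal sketch: checking whether a harvested string is a live Hugging Face
# token. The response field names used below are illustrative assumptions.
from huggingface_hub import HfApi
from huggingface_hub.utils import HfHubHTTPError

def check_token(candidate: str):
    """Return identity details for a valid token, or None if it is rejected."""
    try:
        info = HfApi().whoami(token=candidate)  # raises on an invalid token
    except HfHubHTTPError:
        return None
    return {
        "owner": info.get("name"),
        "email": info.get("email"),
        "orgs": [org.get("name") for org in info.get("orgs", [])],
    }
```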
API tokens are typically exposed when a developer stores the token in a variable for use in certain functions, then forgets to hide it before pushing the code to a public repository.
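The mistake usually looks something like the hypothetical snippet below: a literal token value (shown here as a placeholder) gets committed alongside the code, when it should be read from the environment instead.

```python
# The leak: a literal token committed to a public repository.
API_TOKEN = "hf_..."  # placeholder value; a real token here is exposed to anyone

# The safer pattern: keep the secret outside the codebase entirely.
import os
API_TOKEN = os.environ["HF_TOKEN"]  # set in the shell or CI config, never committed
```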
GitHub's Secret Scanning feature, available to all users free of charge, exists to prevent leaks like this, and Hugging Face runs a similar tool that alerts users to exposed API tokens hardcoded into projects.
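A toy version of such a scanner is sketched below. The pattern is an assumption based on Hugging Face user tokens conventionally starting with "hf_"; production scanners use vendor-supplied patterns plus validity checks.

```python
# Toy secret scanner: flag strings that look like Hugging Face tokens.
# The regex is an assumption; real tools use more precise patterns.
import re

TOKEN_RE = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")

def scan(source_code: str) -> list:
    """Return candidate tokens found in a blob of source code."""
    return TOKEN_RE.findall(source_code)

# Example: this flags the kind of hardcoded token shown in the snippet above.
print(scan('API_TOKEN = "hf_abcdefghijklmnopqrstuvwxyz123456"'))
```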
While investigating the exposed secrets on Hugging Face, the researchers also found a weakness in its organization API tokens (org_api). These had already been announced as deprecated, yet could still be used for read access to repositories and billing access to a resource. Hugging Face had also blocked them in its Python library by adding a check on the token type in the login function.
"Therefore we decided to investigate it, and indeed the write functionality didn't work, but apparently, even with small changes made for the login function in the library, the read functionality still worked, and we could use tokens that we found to download private models with exposed org_api token e.g. Microsoft," says Lanyado in thre blog.
Lasso Security says all the affected organizations were contacted, and major players including Meta, Google, Microsoft, and VMware responded on the same day, revoking the tokens and removing the code from their respective repositories.
Stella Biderman, executive director at EleutherAI, told us: "We are always grateful to ethical hackers for their important work identifying vulnerabilities in the ecosystem and are committed to building community norms and best practices that promote safety in machine learning research."
Biderman pointed to a recent collaboration between EleutherAI, Hugging Face, and Stability AI to develop a new checkpointing format to mitigate attacker modifications, saying "the harm that can be done by such attacks has been massively reduced."
"We helped develop an alternative checkpointing format (now the norm on the Hub) where such behavior is not possible now, limiting the harm someone could do with an exploit like the key leak," she added. "Of course, there are still very real harms to both users and organizations due to key leaks and we are always on the lookout for such things and how we can further mitigate harm." ®
Updated at 12.49 UTC on December 5, 2023, to add:
Following publication of this article, Hugging Face sent a statement from Clement Delangue, co-founder and CEO at the company:
"The tokens were exposed due to users posting their tokens in platforms such as the Hugging Face Hub, GitHub, and others. In general we recommend users do not publish any tokens to any code hosting platform.
"All Hugging Face tokens detected by the security researcher have been invalidated and the team has taken and is continuing to take measures to prevent this issue from happening more in the future, for example, by giving companies more granularity in terms of permissions for their tokens with enterprise hub and detection of malicious behaviors. We are also working with external platforms like Github to prevent valid tokens from getting published in public repositories."