Google Search Algorithm Documentation Leaked Online
A collection of internal Google documents totaling more than 2,500 pages was recently published online, revealing detailed information about how Google’s search system works and ranks results. The leak reportedly happened by accident in March 2024, when an automated Google tool pushed the documentation to a public GitHub repository owned by the company and, following the standard practice for Google’s public documentation, attached an open-source Apache 2.0 license to it. The documents were taken down in a follow-up commit on May 7, 2024.
By that time, the leak had already been noticed by Erfan Azimi, CEO of SEO firm EA Digital Eagle. Soon after, Rand Fishkin, CEO of SparkToro, and Michael King, CEO of iPullRank, brought public attention to the incident and analyzed the leaked materials.
What the Leaked Documents Reveal
According to researchers, the leaked documentation describes an older version of the Google Search Content Warehouse API, offering insight into the internal workings of Google Search. The documents contain no code; they explain how to use the API, which appears to be intended for internal use only, and include numerous references to Google’s internal systems and projects. While a similar API is already available through Google Cloud, the leaked version contains far more detail than the public one.
Analysts say the files shed light on which criteria Google considers important when ranking web pages—a topic of great interest to SEO specialists and website owners hoping to attract more traffic. The documents describe more than 14,000 attributes exposed by or referenced in the API, but say little about how these signals are actually used or how much each one matters. As a result, it is difficult to determine the weight Google assigns to any given attribute in its ranking algorithm.
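This is why a list of attribute names, however long, reveals little on its own: without the weights, the same attributes can produce entirely different rankings. A toy linear scoring model makes the point (everything below is invented for illustration; the real ranking algorithm is not public and is certainly not a simple weighted sum):

```python
# Toy illustration: the same set of ranking attributes yields opposite
# orderings depending on weights, which the leak does not reveal.
pages = {
    "page_a": {"freshness": 0.9, "site_quality": 0.4},
    "page_b": {"freshness": 0.3, "site_quality": 0.8},
}

def score(attrs, weights):
    """Weighted sum of attribute values -- a stand-in for a real ranker."""
    return sum(weights[name] * value for name, value in attrs.items())

def rank(weights):
    """Order pages by score, best first."""
    return sorted(pages, key=lambda p: score(pages[p], weights), reverse=True)

# Two equally plausible weightings invert the result:
print(rank({"freshness": 1.0, "site_quality": 0.1}))  # page_a first
print(rank({"freshness": 0.1, "site_quality": 1.0}))  # page_b first
```

Knowing that `freshness` and `site_quality` exist as attributes tells you nothing about which page wins; the weights do all the work, and those are exactly what the leaked documents omit.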
Nevertheless, SEO experts believe the documents contain noteworthy details that differ significantly from Google’s public statements. Fishkin notes, “Many of the claims [Azimi made in an email describing the leak] directly contradict public statements made by Google representatives over the years. For example, the company has repeatedly denied using click-related signals, denied that subdomains are ranked separately, denied the existence of a sandbox for new sites, denied that domain age is recorded or considered, and much more.”
King references a statement by Google Search Advocate John Mueller, who previously claimed there is “nothing like a site authority index” at Google. This refers to whether Google considers a particular site authoritative and therefore worthy of higher search rankings. King writes that the leaked documents indicate Google computes a siteAuthority metric as part of its Compressed Quality Signals.
“‘Lying’ is a harsh word, but it’s the only accurate one here,” King writes. “While I don’t blame Google representatives for protecting proprietary information, I disagree with their efforts to actively discredit people in marketing, tech, and journalism who have presented this data for analysis.”
Key Findings: Clicks, Chrome Data, and Content Quality
Experts uncovered several other interesting facts. One concerns the importance of clicks and different types of clicks (good, bad, long, etc.) for ranking web pages. During the U.S. antitrust case against Google, company representatives admitted to using click metrics as a ranking factor, and the leaked documents provide more details about these systems.
[Image: Good, bad, and other clicks in the leaked documentation]
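In SEO practice, the distinction between click types is usually understood in terms of post-click behavior, chiefly dwell time and whether the user bounces back to the results page. The sketch below illustrates that common interpretation; the thresholds and labels are hypothetical and are not taken from the leaked documents:

```python
def classify_click(dwell_seconds: float, returned_to_results: bool) -> str:
    """Classify a click by post-click behavior (illustrative thresholds only)."""
    if returned_to_results and dwell_seconds < 10:
        return "bad"   # quick bounce back to the results page
    if dwell_seconds >= 60:
        return "long"  # user stayed a while: a strong satisfaction signal
    return "good"      # reasonable engagement

print(classify_click(5, True))     # "bad"
print(classify_click(120, False))  # "long"
```

Whatever Google’s actual definitions are, the leaked attribute names suggest the system draws exactly this kind of qualitative distinction between clicks rather than counting them all equally.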
The documents also reveal that the number of site views in Chrome is used to determine resource quality, reflected in the API as the ChromeInTotal parameter. However, Google has repeatedly stated that Chrome data is not used for page ranking. Chrome is also mentioned in sections describing how additional links (sitelinks) beneath a result are generated.
[Image: Additional links created using Chrome data]
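As reported, ChromeInTotal is a raw view count. A raw count is rarely useful as-is, so a ranking system would plausibly squash it into a bounded signal. The sketch below shows one such squashing; the name `chromeInTotal` follows the leak’s reported attribute, while the log-based transform and everything else here are our own illustrative assumptions, not Google’s:

```python
import math

def chrome_popularity_signal(chrome_in_total: int) -> float:
    """Hypothetical: squash a raw Chrome view count into a [0, 1) signal.

    The leak reportedly names the raw counter "chromeInTotal"; the
    log-squashing used here is an illustrative choice, not Google's.
    """
    return 1.0 - 1.0 / (1.0 + math.log1p(chrome_in_total))

print(chrome_popularity_signal(0))  # 0.0: no views, no signal
print(chrome_popularity_signal(10_000) > chrome_popularity_signal(100))  # True
```

The point of the transform is diminishing returns: going from 100 to 10,000 views moves the signal far less than going from 0 to 100, which is how popularity-style signals typically behave.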
Furthermore, the documents show that Google considers other factors, such as content freshness, authorship, the relationship of a page to the main site topic, the match between the title and page content, and even the “average weighted font size.” This appears to contradict the company’s previous statements that E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) is not itself a ranking factor, even though Google does use it to assess result quality.
For example, King describes in detail how Google collects author data from pages and has a special field to indicate whether a particular entity is the author. The documents state that this field is “mainly designed and configured for news articles, but is also filled in for other content (e.g., scientific publications).” While this does not confirm that authorship is a separate ranking metric, it shows that Google at least tracks this attribute.
Google’s Response
After the leak gained media and expert attention, Google representatives were forced to issue a statement confirming the leak and noting that important context may be missing from the accidentally published files.
“We caution against making inaccurate assumptions about search based on out-of-context, outdated, or incomplete information. We share extensive information about how search works and what factors our systems consider, and we strive to protect the integrity of search results from manipulation,” the company said.