How to Find All Current and Archived URLs on a Website

There are many reasons you might need to find all the URLs on a website, but your specific goal will determine what you’re looking for. For instance, you might want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Gather current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors

In each scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and hard to extract data from.

In this post, I’ll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site’s size.

Old sitemaps and crawl exports
If you’re looking for URLs that disappeared from the live site recently, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get so lucky.
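If you do turn up an old sitemap file, extracting its URLs takes only a few lines. Here is a minimal sketch using Python’s standard library; the filename is a placeholder for wherever the saved sitemap lives:

```python
# Minimal sketch: pull every <loc> URL out of a saved sitemap.xml.
# Assumes the standard sitemap namespace; "sitemap.xml" is a placeholder path.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("sitemap.xml")
urls = [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

print(f"{len(urls)} URLs found")
```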

Archive.org
Archive.org is an invaluable tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To get around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these constraints mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
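If a scraping plugin feels brittle, Archive.org also exposes its CDX API, which returns captured URLs directly. Here is a minimal sketch; example.com is a placeholder, and very large sites may need the API’s pagination parameters:

```python
# Minimal sketch: query the Wayback Machine CDX API for captured URLs.
import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com/*",  # placeholder domain
        "output": "json",
        "fl": "original",        # return only the original URL field
        "collapse": "urlkey",    # deduplicate captures of the same URL
    },
    timeout=60,
)
rows = resp.json()
urls = [row[0] for row in rows[1:]]  # first row is the header row
print(f"{len(urls)} URLs retrieved")
```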

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re managing a large website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.

It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.
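As one way to process a large export outside a spreadsheet, here is a minimal sketch that reads an inbound-links CSV with pandas; the filename and the “Target URL” column name are assumptions, so check them against the headers of your actual export:

```python
# Minimal sketch: extract unique target URLs from a Moz inbound-links export.
# "moz_inbound_links.csv" and the "Target URL" column are assumptions;
# match them to the headers in your actual export file.
import pandas as pd

df = pd.read_csv("moz_inbound_links.csv")
targets = df["Target URL"].dropna().unique()
print(f"{len(targets)} unique target URLs")
```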

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Much like Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t apply to the export, you may need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export itself is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
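For reference, here is a minimal sketch against the Search Analytics API with the official Python client, paging through results 25,000 rows at a time; the site URL and service-account file are placeholders, and the service account must be added as a user on the property:

```python
# Minimal sketch: page through Search Console's Search Analytics API
# to collect every page with search impressions.
from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service_account.json",  # placeholder credentials file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = [], 0
while True:
    body = {
        "startDate": "2024-01-01",
        "endDate": "2024-12-31",
        "dimensions": ["page"],
        "rowLimit": 25000,  # API maximum per request
        "startRow": start_row,
    }
    resp = service.searchanalytics().query(
        siteUrl="https://example.com/", body=body  # placeholder property
    ).execute()
    rows = resp.get("rows", [])
    pages += [row["keys"][0] for row in rows]
    if len(rows) < 25000:
        break
    start_row += len(rows)

print(f"{len(pages)} pages with impressions")
```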

Indexing → Pages report:


This section provides exports filtered by issue type, though these too are limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create distinct URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click “Create a new segment.”


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
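If you need to go beyond the UI export, the GA4 Data API can return page paths directly. Here is a minimal sketch using the official google-analytics-data client; the property ID is a placeholder, and credentials are assumed to be configured via GOOGLE_APPLICATION_CREDENTIALS:

```python
# Minimal sketch: pull page paths from GA4 via the Data API
# instead of exporting from the UI.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # uses GOOGLE_APPLICATION_CREDENTIALS

request = RunReportRequest(
    property="properties/GA4_PROPERTY_ID",  # placeholder property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="365daysAgo", end_date="today")],
    limit=100000,
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths")
```

A dimension filter on pagePath can replicate the /blog/ segment idea from the steps above.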

Server log data files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive record of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be enormous, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but several tools are available to simplify the process, and a small script can get you started, as shown below.
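If you’d rather not reach for a dedicated tool right away, a minimal sketch like the following can extract requested paths from a combined-format access log; the filename is a placeholder, and the Googlebot check here is a crude substring match rather than a proper reverse-DNS verification:

```python
# Minimal sketch: extract requested paths from a combined-format access log.
# "access.log" is a placeholder; adjust the regex if your log format differs.
import re

LINE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) HTTP/[\d.]+"')

all_paths, googlebot_paths = set(), set()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        all_paths.add(m.group("path"))
        if "Googlebot" in line:  # crude check; verify via reverse DNS for rigor
            googlebot_paths.add(m.group("path"))

print(len(all_paths), "paths total,", len(googlebot_paths), "seen by Googlebot")
```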
Combine, and good luck
Once you’ve gathered URLs from all these sources, it’s time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
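For the Jupyter Notebook route, here is a minimal sketch, assuming you’ve saved each source’s URLs as a one-per-line text file (the filenames are placeholders); it normalizes each URL before deduplicating:

```python
# Minimal sketch: merge URL lists from multiple sources, normalize, deduplicate.
# The filenames are placeholders for one-URL-per-line exports from each tool.
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",  # treat /page and /page/ as one URL
        parts.query,
        "",  # drop fragments
    ))

sources = ["wayback.txt", "gsc.txt", "ga4.txt", "logs.txt"]
urls = set()
for name in sources:
    with open(name, encoding="utf-8") as f:
        urls.update(normalize(line) for line in f if line.strip())

print(f"{len(urls)} unique URLs")
```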

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
