Google has used links to determine the authority of websites since its early days: the idea that web pages cast "votes" for other pages by linking to them is central to the PageRank algorithm. This has led to the emergence of many link manipulation tactics. It's important to note that crawlers use your sitemap as a hint, not a definitive guide on how to index your site. Bots also consider other factors (such as your internal linking structure) to understand what your site is about. The most important thing with your XML sitemap is to make sure that the message you send to search engines is consistent with your robots.txt file. For example, the sitemap should not list a URL that robots.txt tells crawlers to stay away from.
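One way to check that consistency is to compare the two files directly. The following is a minimal sketch, not a definitive audit tool: it uses only the Python standard library, the function names and the sample sitemap and robots.txt content are made up for illustration, and it assumes a plain URL sitemap (not a sitemap index) that is already available as a string.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample data, used only to demonstrate the check.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/blog/</loc></url>"
    "<url><loc>https://example.com/private/report</loc></url>"
    "</urlset>"
)
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"


def sitemap_urls(sitemap_xml):
    """Return every <loc> value found in a URL sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def blocked_sitemap_urls(sitemap_xml, robots_txt, user_agent="*"):
    """List sitemap URLs that robots.txt disallows for the given user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url in sitemap_urls(sitemap_xml)
        if not parser.can_fetch(user_agent, url)
    ]


if __name__ == "__main__":
    for url in blocked_sitemap_urls(SAMPLE_SITEMAP, SAMPLE_ROBOTS):
        print("Listed in sitemap but blocked by robots.txt:", url)
```

Any URL this reports is sending a mixed message: the sitemap asks for it to be crawled while robots.txt asks for the opposite.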

Second, the Internet is changing. The idea behind PageRank is that the web is a graph of pages connected by hyperlinks. You don't want to accidentally give crawlers thousands of pages of thin content to sort through. If you do, they might never reach your most important pages. The second most important thing is to make sure your XML sitemaps only include canonical URLs, as Google considers your XML sitemaps a canonicalization signal. Canonicalization matters if you have duplicate content on your site.
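To illustrate the canonical-only rule, the sketch below flags sitemap entries whose page declares a different canonical URL. It is an illustration under stated assumptions: the class and function names are hypothetical, the HTML for each page is assumed to be fetched already and passed in as a string, and URLs are compared as exact strings (relative `href` values are not resolved).

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonical = attr["href"].strip()


def canonical_of(html):
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


def non_canonical_entries(pages):
    """Map each sitemap URL to its canonical when the two differ.

    `pages` maps a sitemap URL to the HTML served at that URL.
    """
    flagged = {}
    for url, html in pages.items():
        canonical = canonical_of(html)
        if canonical is not None and canonical != url:
            flagged[url] = canonical
    return flagged


if __name__ == "__main__":
    # Hypothetical pages: the sorted listing points back to the main page.
    head = '<html><head><link rel="canonical" href="https://example.com/shoes"></head></html>'
    pages = {
        "https://example.com/shoes": head,
        "https://example.com/shoes?sort=price": head,
    }
    for url, canonical in non_canonical_entries(pages).items():
        print(f"{url} should be dropped from the sitemap (canonical: {canonical})")
```

Entries flagged this way are candidates for removal from the sitemap, since listing them contradicts the canonical declared on the page itself.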

Duane Forrester, former senior product manager at Bing, raised the subject of unlinked mentions at SMX West 2016. Many people don't realize that their site may be hosting multiple copies of the same page at different URLs. If a search engine tries to index these pages, there is a risk that they will trigger the duplicate content filter, or at the very least dilute your link value. Note that adding the canonical link element will not prevent crawlers from crawling duplicate pages. A homepage is a common example: it can end up indexed multiple times by Google under different URLs.
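To look for this kind of duplication in a list of crawled URLs, one rough approach is to group URLs that differ only in common surface details. The sketch below is a heuristic, not a statement of how any search engine decides what counts as a duplicate: the normalization rules (ignore the scheme, a leading `www.`, a trailing slash, the query string, and default documents such as `index.html`) and all names are assumptions chosen for illustration, and whether the grouped URLs really serve the same content still has to be verified.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed default document names; adjust for the server in question.
DEFAULT_DOCUMENTS = ("index.html", "index.htm", "index.php")


def normalize(url):
    """Reduce a URL to a host-plus-path key that ignores common variants."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path
    for doc in DEFAULT_DOCUMENTS:
        if path.endswith("/" + doc):
            path = path[: -len(doc)]
            break
    # Treat "", "/" and a trailing slash as the same location.
    path = path.rstrip("/") or "/"
    return host + path


def duplicate_groups(urls):
    """Return only the keys that more than one URL maps to."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    crawled = [
        "http://example.com",
        "https://example.com/",
        "https://www.example.com/index.html",
        "https://example.com/about",
    ]
    for key, members in duplicate_groups(crawled).items():
        print(key, "is reachable as:", ", ".join(members))
```

Each group it reports is a set of URLs worth consolidating, typically by redirecting the variants to one preferred address.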

