All you need to know for moving to HTTPS (part 1)
Given Google’s initiative to get more websites onto HTTPS and Snowden’s revelations about the level of surveillance by US authorities, the number of websites moving to HTTPS has been growing steadily. The switch to HTTPS is a major technical challenge that should not be underestimated.
From an SEO point of view, it requires resource allocation, long-term planning and preparation, and watertight execution, and it is never free of risk. This guide shows beginner and advanced website owners how to move from HTTP to HTTPS from an SEO perspective, discussing why the move is important, how to select the right SSL certificate, and how much of the website to move at once.
It will also help identify the most common technical on-page signals that need to be updated to HTTPS to avoid sending conflicting signals to search engines, how to configure Google Search Console for the move, how to monitor the impact of the move in Google Search Console and the server log files, and how to improve overall performance for the user with HSTS and HTTP/2.
What is HTTPS and why should you care?
If you have a website, or visit websites online, you have to care about HTTPS.
HTTPS is short for Hypertext Transfer Protocol over TLS. It is a protocol that allows secure communication across computer networks, such as between the browser on a local computer and the server serving the content being accessed. Every website on the World Wide Web uses either HTTP or HTTPS. An example of the address bar of a website using the HTTPS protocol is shown in Figure 1.
The benefits of HTTPS are numerous, but the ones that stand out for regular users are:
- Security, for example, HTTPS prevents man-in-the-middle attacks;
- Privacy, so no online eavesdropping on users by third parties;
- Speed (will get into this later).
For website owners, HTTPS also brings a number of advantages to the table, which are as follows:
- Security, allowing processing of sensitive information such as payment processing;
- Keeps referral data in Analytics: visitors who move from an HTTPS website to an HTTP website lose their referral data, whereas an HTTPS website retains the referral data of visitors coming from either an HTTP or an HTTPS website;
- Potentially improves rankings in Google search results;
- HTTP websites will be marked as insecure in upcoming browser updates, whereas HTTPS websites will be marked as secure;
- Speed (will cover this later).
However, HTTPS is not without challenges, which is why, until now, only a modest percentage of the World Wide Web has been using this secure protocol – although this number is growing steadily. Some of the challenges include:
- Additional cost, as commercial SSL certificates cost money, and modern server infrastructure is necessary to avoid adding RAM/CPU overhead;
- Technical complexity, as implementing SSL certificates on a server is far from easy, and until recently each HTTPS-enabled domain required a unique IP address – luckily, there is now Server Name Indication (SNI), which is supported by most major browsers;
- Switching from HTTP to HTTPS is considered to be a content move by search engines, resulting in lower rankings until all new HTTPS and old HTTP URLs have been re-crawled and reprocessed;
- Conflicts with the original design of the World Wide Web, where the additional “S” in the HTTPS protocol breaks hyperlinks (a fundamental pillar of the World Wide Web)8.
Switching to HTTPS is a long-term strategy. Once committed, it is difficult to stop operating an HTTPS version of the website. Even if the HTTPS version only redirects traffic back to the HTTP version, it needs to be kept live to continue passing external link juice and visitors to the HTTP version. In other words, once a website has operated on HTTPS for even a little while, it is unlikely that either the HTTPS or the HTTP version can ever be shut down for as long as the website is up and running.
However, the long-term benefits do outweigh the challenges. This guide primarily focuses on addressing the content move challenge of moving content from an SEO point of view.
Getting ready with SSL Certificates
Before going further into the SEO aspects of moving to HTTPS, let’s make sure the setup of the server is correctly implemented and nothing stands in the way of continuing with the content move.
Buy a commercial SSL certificate
Although it is possible to use self-signed SSL certificates9 or free community-provided SSL certificates10, and most public SSL certificate types do have a certificate authority11 behind them, the one thing only commercial SSL certificates offer is extended validation (EV) SSL certificates12. These EV SSL certificates require additional verification of the requesting entity’s identity and can take some additional legwork to get approved. Users see this reflected as a green bar with the company name in the browser address bar.
When choosing an SSL certificate type, keep in mind that community-provided SSL certificates are still in their early days, and that in the last few years prices for commercial SSL certificates have dropped significantly, to less than $10 USD per certificate per year. As such, commercial SSL certificates are recommended for commercial websites for now, as it will still be possible to switch to other options later13. When choosing a commercial SSL certificate, also consider that there are several validation types of SSL certificates to buy and use. In my experience it makes no difference to Google which type is used, but it can make a difference to the users of the website (see Figure 2).
Encryption
Several options are available when creating an SSL certificate (commercial or self-signed). Choose a SHA-2 certificate (e.g. SHA-256), as this is more secure than a SHA-1 certificate; SHA-1 certificates have been deprecated by most major browsers for this reason.
By the end of 2016, websites using SHA1 certificates will appear as insecure, thereby defeating the purpose of using HTTPS.
Server implementation
To implement the SSL certificates on the server infrastructure, check with the hosting provider, the IT team or the web developers. If using Server Name Indication, double check in the Analytics data of the current website whether old browsers that do not support SNI14 still frequently visit the website. Once the server has been set up with an SSL certificate for the domain name on port 443, the setup and the server environment need to be checked and validated.
The SSL certificate can be validated using the SSL Shopper tool15, and the server setup can be checked with the SSL Labs tool16. All errors, if any, have to be resolved before continuing. Note: To avoid any issues while updating the website for the move to HTTPS in the next steps, it is recommended to create a separate home directory on the same server or on another server instance and to forward the HTTPS traffic there.
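Alongside the SSL Shopper and SSL Labs tools, a quick scripted check can confirm which certificate the server actually presents on port 443. The sketch below uses only Python’s standard library; the hostname in the usage example is a placeholder, and this only inspects the leaf certificate, not the wider server configuration that SSL Labs evaluates.

```python
# Sketch: fetch and summarise the certificate a server presents on port 443.
import socket
import ssl

def fetch_certificate(hostname, port=443):
    """Return the peer certificate dict for hostname:port (chain is verified)."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()

def summarise_certificate(cert):
    """Return (issuer organisation, expiry string) from a peer certificate dict."""
    issuer = dict(field[0] for field in cert["issuer"])
    return issuer.get("organizationName", "unknown"), cert["notAfter"]

# Usage (requires network access; replace the placeholder hostname):
# org, expires = summarise_certificate(fetch_certificate("www.example.com"))
# print(f"Issued by {org}, expires {expires}")
```

If the handshake fails here, the certificate chain or the port 443 setup needs attention before continuing with the move.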
Preparing for the move to HTTPS
Before discussing the next steps, this guide is based on a few assumptions:
- No changes to the content (except link annotations) are made;
- No changes to the templates (except link annotations) are made;
- No changes to the URL structure (except for the protocol change) are made;
- The HTTPS version of the domain name is live on port 443, and validated as described in the previous chapter. Most likely, the root of the domain name on HTTPS returns an empty directory listing.
If any of the first three mentioned assumptions are incorrect, then be sure to read: How to move your content to a new location17 on the official Google Webmaster Blog.
Define a content move strategy
The next step is to choose a strategy for moving to HTTPS. Moving a small website (e.g. less than 10,000 URLs) or a large website (e.g. more than one million URLs) to HTTPS can result in different options for moving content, for example:
- To move the entire domain to HTTPS, including all subdomains, in one go;
- To move only one or more subdomains and/or subdirectories to HTTPS, before moving the rest;
- To move the content and operate two duplicate sites on HTTP and HTTPS18, before finalizing the move.
As part of the strategy, the following question also needs to be answered: How long will the HTTP version still be accessible? The factors to consider will be different depending on the size of the website, the availability of the IT support team, and the organizational structure behind the website (e.g. the internal company politics). While this guide cannot address the last two points for every website, the first point is definitely something to consider from an SEO point of view. This translates to the available crawl budget.
Importance of crawl budget
In order for search engines to process the protocol change, their bots have to re-crawl a significant share of the old HTTP URLs and all of the new HTTPS URLs of the website. A website with one million URLs will therefore require search engine bots to re-crawl roughly two million URLs (or a significant portion of them) to pick up the 301 redirects and recalculate the rankings of the new HTTPS URLs based on the history of the HTTP URLs.
If a search engine bot crawls an average of 30,000 unique, non-repeated URLs of the website per day, it can take roughly 67 days to re-crawl all URLs (see Figure 3). During this time, the website may suffer in search engine rankings, assuming there are no “crawl budget wasted URLs”.
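The arithmetic behind this estimate can be sketched in a few lines. The factor of two reflects that both the old HTTP URLs (to pick up the 301s) and the new HTTPS URLs need to be crawled, as described above:

```python
# Sketch of the crawl-budget estimate: an HTTP -> HTTPS move roughly doubles
# the number of URLs that must be re-crawled (old redirects plus new pages).
import math

def recrawl_days(total_urls, crawled_per_day):
    """Rough number of days to re-crawl old HTTP and new HTTPS URLs."""
    return math.ceil((total_urls * 2) / crawled_per_day)

print(recrawl_days(1_000_000, 30_000))  # 1M URLs at 30,000/day -> 67 days
```

This is only a lower bound: crawl rates fluctuate, and bots re-visit important URLs repeatedly rather than crawling each URL exactly once.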
Utilise server log files
To make sure search engine bots do not waste crawl budget, first double check the server log files and find out which URLs have been crawled by each search bot in the last two years (or longer, if possible). It will also be helpful to know how often each URL was crawled (to determine priority), but for this process all the URLs are needed anyway.
Moving forward, this guide will focus primarily on crawl data from Googlebot. For smaller sites, Screaming Frog Log Analyser19 can do this task rather easily. For larger websites, talk to the IT team and/or utilize a big data solution such as Google BigQuery20 to extract all URLs.
It may also be necessary to ask the hosting provider of the website for the log files. If there are no log files available for the last two years (assuming the website is not brand new), start logging as soon as possible. Without log files, the organisation will miss out on crucial SEO data and important analytical business data. Save the extracted URLs in a separate file (one URL per line), for example, as logs_extracted_urls.csv.
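For smaller sites without access to a log analyser, the extraction step above can be scripted directly. The sketch below assumes Apache/Nginx combined-format logs and a simple user-agent substring match; both the log path and the matching rule would need adjusting for your own setup (and note that strict verification of genuine Googlebot traffic requires a reverse DNS check, not just the user-agent string):

```python
# Sketch: pull the unique URLs a given bot requested out of an access log
# and save them, one per line, as logs_extracted_urls.csv.
import re

REQUEST_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

def extract_bot_urls(log_path, bot="Googlebot"):
    """Return the sorted unique request paths attributed to `bot`."""
    urls = set()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if bot not in line:
                continue
            match = REQUEST_LINE.search(line)
            if match:
                urls.add(match.group(1))
    return sorted(urls)

def save_urls(urls, out_path="logs_extracted_urls.csv"):
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(urls) + "\n")
```

For multi-gigabyte log archives this line-by-line approach still works, but a solution such as BigQuery (mentioned above) will be far faster.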
Extract sitemap URLs
Assuming the website has one or more XML sitemaps, and these sitemaps contain all the unique canonicals of the indexable pages of the website21, Google Search Console22 will report how many URLs of the XML sitemaps are currently submitted to Google. Download and extract all the unique URLs from the XML sitemaps. Save the extracted URLs in a separate file (one URL per line), for example, as sitemap_extracted_urls.csv.
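Extracting the `<loc>` values from downloaded sitemap files is straightforward with the standard library. In this sketch, fetching the sitemap files (and recursing into a sitemap index, if the site uses one) is left to your own tooling:

```python
# Sketch: collect every <loc> URL from a sitemap file's XML and save the
# de-duplicated list, one per line, as sitemap_extracted_urls.csv.
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(xml_text):
    """Return all <loc> values found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

def save_urls(urls, out_path="sitemap_extracted_urls.csv"):
    with open(out_path, "w", encoding="utf-8") as out:
        out.write("\n".join(sorted(set(urls))) + "\n")
```

The count of URLs extracted here should roughly match the submitted-URL figure that Google Search Console reports for the sitemaps; a large discrepancy is worth investigating before the move.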
How much of the website to move?
At this point, there is enough data to determine the next step: How much of the website to move to HTTPS? To calculate the number of URLs to move, gather the following data:
- A list of unique URLs crawled by Googlebot extracted from the server log files;
- An average number of how many URLs Googlebot crawls per day, based on numbers from Crawl Stats23 in Google Search Console, and the server log files;
- A list of unique URLs extracted from the XML sitemaps.
If the average number of URLs crawled by Googlebot per day (based on the server log files) is anywhere between 5% and 100% of the total volume of unique URLs extracted from the XML sitemaps, it is relatively safe to move the entire domain to HTTPS in one go. Chances are, in this case, that the entire domain will be re-crawled by search engine bots within three to four weeks – depending on the internal linking structure and several other factors.
Let’s call this scenario 1: “move in-one-go.” If the average number of URLs crawled by Googlebot per day is anywhere between 1% and 5% compared to the total size of unique URLs extracted from the XML sitemaps, it is safer to move one or more subdomains and/or subdirectories in phases to HTTPS. Chances are that, in this case, it will take a long time for Googlebot to re-crawl all URLs of the entire domain, and as such it may take longer than the standard few weeks to recover in Google search results.
Let’s call this scenario 2: “partial move.” This phase is repeated as many times as necessary until the entire domain has been moved to HTTPS. If the site is really big, then another option is on the table: operating two websites next to each other, one on HTTP and one on HTTPS. While waiting for a significant number of URLs to be re-crawled, canonicals are used to move the content. For this to work, site owners depend on the website’s canonical signals being trusted by search engines.
Let’s call this scenario 3: “move through canonicals.” Once an adequate number of URLs have been re-crawled, scenario 1 or 2 can be applied to finalize the move to HTTPS.
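The decision rule described across the three scenarios can be condensed into a small helper. The 5% and 1% thresholds are the rough guide values given above, not hard limits:

```python
# Sketch of the scenario selection above: compare Googlebot's daily crawl
# rate against the number of unique sitemap URLs and pick a move strategy.
def choose_scenario(crawled_per_day, sitemap_urls):
    """Return the suggested content-move scenario for these crawl numbers."""
    ratio = crawled_per_day / sitemap_urls
    if ratio >= 0.05:                      # 5% or more of the site per day
        return "move in-one-go"
    if ratio >= 0.01:                      # between 1% and 5% per day
        return "partial move"
    return "move through canonicals"       # under 1% per day

print(choose_scenario(30_000, 1_000_000))  # 3% per day -> "partial move"
```

In practice, factors such as internal linking, IT capacity, and company politics (mentioned earlier) weigh on this decision as well; the ratio is only the SEO starting point.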
Crawl the HTTP website
Next, utilise a crawler, such as Screaming Frog SEO Spider24, DeepCrawl25, Botify26 or OnPage.org27, to crawl the entire website (or the relevant sections) on the HTTP protocol, and extract all unique, internally linked URLs that search engines can crawl. This includes all assets internal to the website, such as robots.txt, JavaScript, image, font and CSS files. This data will be needed towards the end of the process, to double check whether the move has been successful. Save the extracted URLs in a separate file (one URL per line), for example, as crawl_extracted_urls.csv.
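At this point there are three URL lists: from the logs, the sitemaps, and the crawl. Merging them into one de-duplicated master list gives the reference set to verify against once the move is complete. A minimal sketch, assuming each file holds one URL per line as produced in the previous steps:

```python
# Sketch: merge the extracted URL lists (logs, sitemaps, crawl) into a single
# de-duplicated, sorted master list for post-move verification.
def merge_url_lists(*paths):
    """Union the non-empty lines of the given files, sorted alphabetically."""
    urls = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            urls.update(line.strip() for line in f if line.strip())
    return sorted(urls)

# Usage:
# master = merge_url_lists("logs_extracted_urls.csv",
#                          "sitemap_extracted_urls.csv",
#                          "crawl_extracted_urls.csv")
```

After the move, each URL in the master list can be requested once to confirm it 301-redirects to its HTTPS counterpart.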
Note: If the website is too big, e.g. more than ten million URLs, either the “partial move” or the “move through canonicals” scenario is recommended as the safest course of action. Try to split the website into manageable sections, based on the subdomains and/or subdirectories, and crawl these one by one instead to get as many unique URLs as possible.
Blocking search engine bots
In the “move in-one-go” and “partial move” scenarios, depending on the size of the website and how quickly the next steps can be completed, it may be useful to block search engine bots from crawling the HTTPS website while it is being set up, to prevent sending conflicting signals to search engines. This can be done with the robots.txt on the HTTPS version, which can block either the entire HTTPS version or just part of it from being crawled.
Use the following snippet in the robots.txt on the HTTPS version to block all bots completely:

```
User-Agent: *
Disallow: /
```

Note: This step is optional and heavily dependent on how quickly the website can be moved. If it can be moved in less than a few days, there is no need for this. This method can also be used to safely test most aspects of the move before letting search engine bots know about it. In the case of the “move through canonicals” scenario, this particular step is not recommended.
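Before relying on the temporary block, it is worth verifying that the rules actually disallow crawling. Python’s standard-library robots.txt parser can evaluate the snippet offline; the example URL below is a placeholder for your own HTTPS domain:

```python
# Sketch: check offline that a "block everything" robots.txt really denies
# a crawler access, using the standard-library parser.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.modified()  # mark the rules as freshly loaded before parsing directly
parser.parse(["User-Agent: *", "Disallow: /"])

# Placeholder URL; any path on the blocked host should be denied.
print(parser.can_fetch("Googlebot", "https://example.com/page"))  # False
```

Remember to remove the block (or serve a permissive robots.txt) at the moment of the actual move, otherwise search engines cannot discover the redirects and new HTTPS URLs.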
Conclusion
Part 2 of this series in the next issue of iGB Affiliate will cover the actual move to HTTPS in great technical detail, together with the necessary next steps within Google Search Console.