Many people around the world use Google Sheets for a range of personal and professional tasks. Over the years, Sheets has grown into an all-in-one data analysis tool that can handle all sorts of requests. As workloads and datasets grew, so did the need for an import function that could scrape the required data from external sources. This is where IMPORTXML comes in.
In this article, we show you how to use it to collect data in different scenarios. Once you start applying it regularly, populating a Google Sheets file with the details you need becomes much easier. The IMPORTXML function saves a considerable amount of time and effort, which becomes even more apparent if you handle large spreadsheets regularly.
What are XML and HTML?
The IMPORTXML function is designed to import data from the web into your spreadsheet. To use the function’s arguments properly, you need to understand the basics of XML and HTML. As some of you already know, HTML is the markup language that serves as a building block for most content you see in your web browser. The basic format of HTML works by enclosing an element inside two tags. The first one marks the starting point of that element, whereas the second defines the end. Anything contained within these tags is formatted, displayed, and treated in a specific manner.
XML stands for Extensible Markup Language. It uses the same tag-based syntax as HTML, but instead of describing how a page is displayed, it describes and structures data. XML lets authors define their own tags, so it can hold arbitrary data, which turns out to be a pretty helpful feature when you are importing that data into Google Sheets. The IMPORTXML function fetches a page and applies an XPath query to its markup to extract a particular set of elements. In simple terms, this means that IMPORTXML can pull links, emails, lists, and other specific types of elements from a publicly accessible webpage.
We can direct this function to look only for a particular type of HTML tag. That way, it ignores all other tags and targets those specified in the function call. Because XML and HTML are structured, text-based formats, developers, data scrapers, and text miners can extract online content from them fairly easily. Some sites add extra layers, such as login walls or content loaded by JavaScript, which limit this functionality. However, in many cases, you will be able to use IMPORTXML without issues. In the subsequent sections, we list step-by-step methods for carrying out different tasks with IMPORTXML in Google Sheets.
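To make the idea of opening and closing tags concrete, here is a minimal Python sketch that walks through a tiny HTML snippet and records each opening tag, piece of text, and closing tag in order. The snippet and the `TagLogger` class are hypothetical, made up for illustration; only the standard-library `html.parser` module is assumed.

```python
from html.parser import HTMLParser

# Hypothetical one-line snippet, invented for this illustration.
SNIPPET = "<p>Postal code: <b>10001</b></p>"


class TagLogger(HTMLParser):
    """Records opening tags, text, and closing tags in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("open", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

    def handle_endtag(self, tag):
        self.events.append(("close", tag))


logger = TagLogger()
logger.feed(SNIPPET)
logger.close()

# Each element starts with an "open" event and ends with a matching "close".
for kind, value in logger.events:
    print(kind, repr(value))
```

The `<b>` element opens and closes inside the `<p>` element, which is the nesting that a query later relies on to locate content.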
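As a rough illustration of targeting one tag and ignoring the rest, the Python sketch below selects only the `<a>` elements from a small, well-formed page. The page content and URLs are hypothetical. It uses the standard-library `xml.etree.ElementTree`, which supports only a subset of XPath and requires a leading `.` in the path; in Sheets, the comparable query would likely be written as "//a" or "//a/@href".

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for this example.
PAGE = """
<html>
  <body>
    <h1>Contact</h1>
    <p>Read the <a href="https://example.com/docs">docs</a>.</p>
    <ul>
      <li><a href="https://example.com/blog">Blog</a></li>
      <li>No link here</li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ".//a" matches every <a> element at any depth; <h1>, <p>, and <li> are ignored.
links = [(a.text, a.get("href")) for a in root.findall(".//a")]
print(links)
# [('docs', 'https://example.com/docs'), ('Blog', 'https://example.com/blog')]
```

Real web pages are often not well-formed XML, so this strict parser is a stand-in for the idea rather than a description of how Sheets processes a page.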
Extracting a multi-column list with IMPORTXML
One of the applications for IMPORTXML is extracting multi-column lists from web pages. You may need a specific set of ordered data for a project or an official task. Such datasets are readily available on the internet. However, many people do not know of a direct way to extract such lists automatically without disturbing their original order. The IMPORTXML function lets you do just that.
For the sake of this tutorial, we will assume that you need to extract a list of cities with their respective postal codes. Such information is available on sites such as Wikipedia or a local bureau of labor statistics.
Step 1: Launch your preferred internet browser and search for the required postal codes. Make sure to find a webpage that displays this information in a list format.
Step 2: Scroll down to the list and select one of the postal codes. Right-click on it and then go to “Inspect” from the drop-down menu. It will open the source code of that webpage in a side window.
Step 3: The source code window will highlight the relevant section based on the text you selected in Step 2. Here, you will notice that the postal code sits inside a specific HTML tag, “<td>.” This tag marks a single cell in an HTML table, so it appears wherever a page presents tabular data.
Step 4: Open a new tab on your browser and go to “Google Sheets.” Log in with your account.
Step 5: Create a new spreadsheet and type the following command in the formula bar, “=IMPORTXML(URL, "//td").” The first argument, URL, is the web address of that page in quotation marks, whereas the second argument is an XPath query built from the tag you identified in the source code. Here, “//td” matches every <td> element on the page.
Doing so will extract all elements with the “<td>” tag on that page. To refine your results and put them into your sheet in a more presentable form, you can import each column of the list separately. To do that, add identifiers such as the row tag and the column number to the query. For example, “//tr/td[1]” returns only the first cell of every table row.
Importing email addresses with IMPORTXML
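The difference between grabbing every `<td>` and pulling one column at a time can be sketched in Python. The table below is invented (the city names and postal codes are placeholders, not real data), and `xml.etree.ElementTree` stands in for the XPath engine. In IMPORTXML, the second argument is an XPath query, so the comparable queries would be "//td", "//tr/td[1]", and "//tr/td[2]".

```python
import xml.etree.ElementTree as ET

# Invented table; the names and codes are placeholders, not real data.
TABLE = """
<table>
  <tr><td>Alphaville</td><td>10001</td></tr>
  <tr><td>Betatown</td><td>20002</td></tr>
  <tr><td>Gammaburg</td><td>30003</td></tr>
</table>
""".strip()

root = ET.fromstring(TABLE)

# Every cell in document order: both columns end up mixed in one flat list.
all_cells = [td.text for td in root.findall(".//td")]

# A position predicate keeps only the n-th <td> of each row (1-based).
cities = [td.text for td in root.findall(".//tr/td[1]")]
codes = [td.text for td in root.findall(".//tr/td[2]")]

# Pairing the two columns restores the original row order.
rows = list(zip(cities, codes))
print(all_cells)
print(rows)
# [('Alphaville', '10001'), ('Betatown', '20002'), ('Gammaburg', '30003')]
```

Querying each column separately is what lets you place cities and postal codes side by side in the sheet instead of in one long column.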
Another useful application for the IMPORTXML function is obtaining email addresses from a specific website. Many online marketers rely on this information to build up an extensive dataset of potential consumers. Getting all of these addresses in a single command may be a dream come true for most email marketers.
All you need to do is open the target website and go to its “About Us” page. These pages often display email addresses for staff and management. There are many Chrome extensions available for extracting this information. Nevertheless, if you want a more hands-on approach, importing these addresses into a spreadsheet directly gives you more control over how to format them afterward.
Step 1: Launch your preferred internet browser and head to the target website. Open the “About Us” page or any other section which contains the addresses you are interested in.
Step 2: Scroll down to an address and select it with your mouse. Right-click on it and then go to “Inspect” from the drop-down menu. It will open the source code of that webpage in a side window.
Step 3: The source code window will lay out the relevant section based on the email address you selected in Step 2. Here, you will notice that these addresses are marked with a specific HTML class.
Step 4: Open a new tab on your browser and go to “Google Sheets.” Log in with your account.
Step 5: Create a new spreadsheet and type the following command in the formula bar, “=IMPORTXML(URL, "//*[@class='Class']").” The first argument, URL, is the web address of that page in quotation marks, whereas “Class” in the second argument is the name of the class you identified from the source code. The second argument as a whole is an XPath query that matches every element carrying that class.
Doing so will extract all email addresses with that HTML class on that page. You can refine these results by checking for specific characters in each address. In XPath, such conditions go inside square brackets, for instance “[contains(., '@')]” appended to the query. The same method can be repeated with additional bracketed conditions to filter on more than one set of characters.
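Selecting by class and then filtering on characters can be sketched in Python as follows. The markup, names, and addresses are hypothetical. Python's `xml.etree.ElementTree` supports attribute predicates such as `[@class='email']` but not the XPath `contains()` function, so the character check is done in plain Python here; in Sheets, a predicate like `[contains(., '@')]` would be the likely equivalent.

```python
import xml.etree.ElementTree as ET

# Hypothetical "About Us" fragment; the names and addresses are made up.
STAFF = """
<div>
  <p class="name">Alex Example</p>
  <span class="email">alex@example.com</span>
  <p class="name">Sam Sample</p>
  <span class="email">sam@example.org</span>
  <span class="email">press office (no address yet)</span>
</div>
""".strip()

root = ET.fromstring(STAFF)

# Attribute predicate: any element whose class attribute equals "email".
by_class = [el.text for el in root.findall(".//*[@class='email']")]

# Refinement: keep only the entries that actually contain an "@" sign.
addresses = [text for text in by_class if "@" in text]

# A second condition narrows the result further, here by domain ending.
org_only = [text for text in addresses if text.endswith(".org")]

print(addresses)  # ['alex@example.com', 'sam@example.org']
print(org_only)   # ['sam@example.org']
```

The class match alone also returns the placeholder text, which is why the extra character check is worth adding.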
Scraping web data with IMPORTXML
You can also use the IMPORTXML function to scrape web data from classified advertisements on Craigslist or online marketplaces such as Amazon or eBay. Such imports are performed for product research or price comparisons.
Step 1: Launch your preferred internet browser and head to an online marketplace. Search for a product you are interested in.
Step 2: Scroll down and click on any item from the search results. It will open the product page in a new tab. Right-click on the price tag or any other element you want to scrape. Go to “Inspect” from the drop-down menu. It opens the source code of that webpage in a side window.
Step 3: The source code window will lay out the relevant section based on the element you selected in Step 2. Here, you will notice that these items are marked with a specific HTML class.
Step 4: Open a new tab on your browser and go to “Google Sheets.” Log in with your account.
Step 5: Create a new spreadsheet and type the following command in the formula bar, “=IMPORTXML(URL, "//*[@class='Class']").” The first argument, URL, is the web address of the page you inspected, whereas “Class” in the second argument is the name of the class you identified from the source code.
Following the five steps we just described, you can scrape web data from many sites. Just make sure to identify the correct classes and tags. Otherwise, the cells in your spreadsheet either display the “#N/A” error or fill up with unrelated information.
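The effect of picking the right or wrong class can be sketched in Python. The product markup and the `query` helper below are hypothetical. The point is only that a query for a class that does not exist on the page matches nothing, which is the situation where a spreadsheet has no value to show.

```python
import xml.etree.ElementTree as ET

# Hypothetical product listing; the class names and price are invented.
PRODUCT = """
<div class="listing">
  <h2 class="title">Sample Widget</h2>
  <span class="price">19.99</span>
</div>
""".strip()

root = ET.fromstring(PRODUCT)


def query(node, css_class):
    """Return the text of every descendant whose class equals css_class."""
    return [el.text for el in node.findall(f".//*[@class='{css_class}']")]


prices = query(root, "price")   # the class that exists in the markup
missing = query(root, "cost")   # a misremembered class name

print(prices)   # ['19.99']
print(missing)  # []
```

If a query comes back empty, re-inspecting the element and copying the class name exactly as it appears in the source is usually the first thing to try.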
Similar import functions in Google Sheets
Please note that IMPORTXML brings in data based on XML and HTML tags. Google Sheets offers a range of other functions for extracting information from external sources, each suited to a different set of use cases. Some of these are as follows:
- IMPORTRANGE: Imports a range of cells from a specified spreadsheet.
- IMPORTHTML: Imports data from a table or list within an HTML page.
- IMPORTFEED: Imports an RSS or ATOM feed.
- IMPORTDATA: Imports data at a given URL in .CSV or .TSV format.
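To show what the .CSV and .TSV formats mentioned for IMPORTDATA look like once split into rows and cells, here is a small Python sketch using the standard-library `csv` module. The sample text and the `to_rows` helper are hypothetical, and this is not a description of how Sheets parses files internally.

```python
import csv
import io

# The same made-up data in comma-separated and tab-separated form.
CSV_TEXT = "city,postal_code\nAlphaville,10001\nBetatown,20002\n"
TSV_TEXT = "city\tpostal_code\nAlphaville\t10001\nBetatown\t20002\n"


def to_rows(text, delimiter):
    """Split delimited text into a list of rows, each a list of cell strings."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


csv_rows = to_rows(CSV_TEXT, ",")
tsv_rows = to_rows(TSV_TEXT, "\t")

# Only the delimiter differs; the resulting grid of cells is identical.
print(csv_rows)
print(csv_rows == tsv_rows)  # True
```

Because these formats are already laid out as rows and columns, no tag or class query is needed to import them.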
Conclusion
Knowing how to populate Google Sheets effectively is crucial to mastering spreadsheets. The IMPORTXML function enables you to scrape or import data from live webpages. With a combination of tags and identifiers, you can use it to collect large datasets quickly.