This tutorial will lead you through an exploration of OpenRefine and demonstrate how it can be used to interrogate and clean your own dataset.
It freely acknowledges the wonderful introductory tutorial: Seth van Hooland , Ruben Verborgh and Max De Wilde , “Cleaning Data with OpenRefine,” Programming Historian (05 August 2013), http://programminghistorian.org/lessons/cleaning-data-with-openrefine as well as the basic tutorials provide at OpenRefine.org and their curated tutorial list: https://github.com/OpenRefine/OpenRefine/wiki/External-Resources
This tutorial uses a sample dataset available at http://digitalnomad.ie/ACIrishBrews2018.tsv
So, let’s get started…
OpenRefine Tutorial #1 (approximate Duration 30 minutes)
- The first step is to install the OpenRefine application to your own machine (if it has not already been installed) – http://openrefine.org/download
- Once downloaded, execute the installer and it will guide you through the process. The end result will be a new application icon on your system that looks like any other application. You can click on the icon to launch OpenRefine.
- When you do click on it. You will find that it doesn’t seem to present a window or any other user components to interact with it. Hmmmm.
- In fact, it creates a web service (runs as a local application) that you access through your browser, via a specific address – in this case: http://127.0.0.1:3333.
If you open your web browser and direct it to this we will see something a little more promising:
- This is our window on the OpenRefine service, listing existing projects and offering you the opportunity to create new ones (i.e. datasets that you want to explore and manipulate).
- Once OpenRefine is launched in your browser, the left margin has options to
Open Project, or
Import Project. Here we will create a new project:
Create Projectand select
Get data from
Web Addresses (URL).
2. Input the URL
3. Click the
Next>>button at the foot of the screen to load the data into OpenRefine.
4. OpenRefine gives you a preview – a chance to show you it understood the file. If, for example, your file was really tab-delimited, the preview might look strange, you would choose the correct separator in the box shown and click
Update Preview(bottom left). If this is the wrong file, click
<<Start Over(upper left). There are also options to indicate whether the dataset has column headers included and whether OpenRefine should skip a number of rows before reading the data.
- This is a collection of information about craft beers produced in Ireland in the last decade. It is useful for our exercise and it demonstrates a nice open research ethos for being shared. Let’s import it to see what it contains. Do this by choosing Create Project, choosing ‘Web Addresses (URLs) and pointing to the file and choosing the Next… button.
- OpenRefine inspects the datafile and before finishing the import process (in this case there are about 57 records and it is about less than 1Mb in size. From this screen, we can set some parameters. In this case, change the ‘Quotation marks are used to enclose …” and also give the project a more meaningful name. Note: the ‘encoding’ tab lets you see in realtime what happens if you change it – see ‘funny’ characters – try this one. Then click the Create Project… button.
- OpenRefine then imports the file – noting that it is stored locally on your own machine.
- What we then see is our dataset presented in all its glory:
- From this screen, we get some immediate feedback about the size of the file and what it contains. Let’s take a look at the headers and summaries.
- We can start to get to know our data. OpenRefine offers us an ability to facet the data by aggregating similarities as we search for patterns. This alone can help us know and also get an idea about how we might want to manipulate this file. Try faceting by breweryName by clicking the small arrow next to column heading and choosing Facet/Text Facet.
- There are 52 breweries listed in our dataset. The text facet lists these along with the number of beers in our dataset produced by that brewery.
- You can filter data immediately and find all the beers from a particular brewery by choosing that brewey’s name from the facet list. Check out the 8 beers from Barrelhead Brewery.
- If you are observant you might note that there are brewery names that probably refer to the same brewery but are mispelled.
- We can choose a category and OpenRefine will ‘filter’ the dataset to show us just those records with that define category. With a subset selected I can conduct in cell editing if I wished. Note that the little edit button appears when you hover over a cell. This is good for quick edits but would get laborious for extensive ones:
- Let’s explore some of the more powerful bulk editing capabilities that really make OpenRefine useful.
- Remove the Categories Facet by closing the small window ‘Categories’.
- Let’s look at the RecordID column.
- As inputted the column sees the values in this column as Text. Record ID must be transformed from Text to Number. Change the way Refine sees the values by Choosing Edit Cells > Transform > Common Transforms > To Number.
- When we do this and then choose Numeric Facet … there are three rows that have blank IDs.
- We can now uncheck Numeric (we don’t want to operate on all rows) and are left with just the 3 blank rows. We can now select from the All column – ‘Remove all Matching Rows. We have now cleaned up our dataset and removed superfluous data. and are left with 75811 rows.
- Now we want to explore removing duplicate records – a great bedevilment of oh so many datasets. But first let’s just note that OPenRefine is watching what we do and gives us an opportunity to recover from mistakes in our process. You can step backward by choosing a past state – OpenRefine will then restore your environment and data to the way it was.
- Now we want to try and find duplicates. To do this we carry out two process.
Note that we want to choose to reorder rows permanently by using the global sort drop down and choosing this option – otherwise rows are only sorted based on the current operation.
First sort the RecordID by numeric values – so that duplicate Record IDs would appear in adjoining rows. Then choose to ‘Edit Cell-Blank Down’. OpenRefine will then blank out rows with duplicated ID. As in the last example we can now choose to display only Blank records and eliminate these duplicates.
- Now we turn to the Atomisation of our data – what you say is that? Well, this is one of the more common transformations that we would carry out —> Taking the contents of one column and breaking it into discrete chunks that we can operate on independently. In this case, we have a category field that contains multiple values separated by ‘pipes’.
- To separate these categories into categories and subcategories, we can use the Split Multi-Value Cells pre-built transformation. If we choose the drop down menu from the Categories column – we can select Edit Cells – Split Multivalue Cells. When we do this we end up with new duplicated rows (but not records) – one for each of the values contained in the categories field.
- So … what has now happened? We now have the same number of actual records, but there are additional rows – one for each category in a record so that we can manipulate them individually.
- To my mind we have just made a huge jump in usefulness. We can now choose to facet the Categories column by Text and determine that we have 4,934 distinct categories applied to the objects in our collection.
- We now don’t have to search to see if a category exists…we get a picture of all categories that do and we can browse rather than search to discover things of interest. We can choose to sort the categories alphabetically or by frequency.Now that we have atomised our categories we can really start working with the data and using the power of Refine and machine intelligence to manipulate and refine our dataset. If we now choose Cluster and Edit from the Edit Cells drop down, Refine will immediately suggest cells that seem like they might represent the same thing and we can bulk edit the dataset to harmonise our valued. We can eliminate errors in entry of the data, inconsistencies in the use of descriptive terms.
- When we choose to invoke the clustering algorithms, we see the magic suggestions :
The final Thing we want to learn here is how to take our data out of OpenRefine after we have finished with it. OpenRefine supports a wide number of useful formats and also allows you to define your own custom formats if needed. The ultimate in flexibility.
- To do this use the Export Tab at the top of the workspace:
- That’s all we will cover today. There is much more even at the beginner’s tutorial referenced above. When you want to learn more Open Refine has a deep collection of tutorials as well as a cookbook
available at Openrefine.org