This tutorial will lead you through an exploration of OpenRefine and demonstrate how it can be used to interrogate and clean your own dataset.
It freely acknowledges the wonderful introductory tutorial: Seth van Hooland, Ruben Verborgh and Max De Wilde, “Cleaning Data with OpenRefine,” Programming Historian (05 August 2013), http://programminghistorian.org/lessons/cleaning-data-with-openrefine as well as the basic tutorials provide at OpenRefine.org and their curated tutorial list: https://github.com/OpenRefine/OpenRefine/wiki/External-Resources
This tutorial uses a sample dataset available at http://digitalnomad.ie/ACIrishBrews2018.tsv
So, let’s get started…
OpenRefine Tutorial #1 (approximate Duration 30 minutes)
- The first step is to install the OpenRefine application to your own machine (if it has not already been installed) – http://openrefine.org/download – the latest version will always be available here – as of October 2019, this is v3.2.
- Once downloaded, execute the installer and it will guide you through the process. The end result will be a new application icon on your system that looks like any other application. You can click on the icon to launch OpenRefine.
- When you do click on the icon, you may not see any immediate effect. Unlike other applications that you may work with it doesn’t seem to present a window or any other user-facing components to interact with it (at least immediately). Hmmmm.
- In fact, it spawns a web service (runs as a local application) that you access through your browser, via a specific address – in this case: http://127.0.0.1:3333. If all goes according to plan and you are slightly patient, the application will direct you to the browser and open a new window. It’s worth waiting for.
If you open your web browser and direct it to this we will see something a little more promising:
- This is our window on the OpenRefine service, listing existing projects and offering you the opportunity to create new ones (i.e. datasets that you want to explore and manipulate).
Getting to Know the Interface
- Once OpenRefine is launched in your browser, the left margin has options to,
Import Project. To start we will create a new project:
Create Projectand select
Get data from
Web Addresses (URL).
2. Input the URL
3. Click the
Next>>button at the foot of the screen to load the data into OpenRefine.
4. OpenRefine gives you a preview – a chance to show you it understood the file. If, for example, your file was really comma-delimited (as opposed to tab-separated vales (tsv), the preview might look strange. You would choose the correct separator in the box shown and click
Update Preview(bottom left). If this is the wrong file, click
<<Start Over(upper left). There are also options to indicate whether the dataset has column headers included and whether OpenRefine should skip a number of rows before reading the data. As it is, you should not be faced with these immediate challenges and should have a properly formatted file.
- This dataset is a collection of information about craft beers produced in Ireland during the last decade. It is useful for our exercise and it demonstrates a nice open research ethos for being shared. Let’s import it to see what it contains. Do this by choosing Create Project, choosing ‘Web Addresses (URLs) and pointing to the file and choosing the Next… button.
- OpenRefine inspects the datafile and before finishing the import process (in this case there are only about 837 records and it is about less than 1Mb in size). From this screen, we can set some parameters. In this case, change the ‘Quotation marks are used to enclose …” and also give the project a more meaningful name. Note: the ‘encoding’ tab lets you see in realtime what happens if you change it – see ‘funny’ characters – try this one. Then click the Create Project… button.
- OpenRefine then imports the file – noting that it is stored locally on your own machine.
- What we then see is our dataset presented in all its glory:
- From this screen, we get some immediate feedback about the size of the file and what it contains. Let’s take a look at the headers and summaries.
Facetting by Data Columns
- Here we can begin to get to know ‘our’ data. OpenRefine offers us an ability to facet the data by aggregating similarities as we search for patterns. This alone can help us know and also get an idea about how we might want to manipulate this file. Try faceting by breweryName by clicking the small arrow next to that column heading and choosing Facet/Text Facet.
- There are 126 breweries listed in our dataset. The text facet lists these along with the number of beers in our dataset produced by that brewery.
- How many unique beers are produced by Barrelhead? You can filter data immediately and find all the beers from a particular brewery by choosing that brewery name from the facet list. Is that 7 beers from Barrelhead Brewery? That’s quite handy!
- But are there actually 126 distinct breweries? The human can see there are probably data entry errors – you might note that there are brewery names that probably refer to the same brewery but are mispelled – hint: Arthurtown, etc.
Identifying Data Entry Issues
- To narrow down the group we are working with we can select the brewery name we suspect is misspelled and OpenRefine will give us a subset. With a subset selected you can conduct in cell editing by clicking the small edit icon that appears. Note that the little edit button appears when you hover over a cell. This is good for quick edits but would get laborious for extensive ones:
- Let’s explore some of the more powerful bulk editing capabilities that really make OpenRefine useful.
- Remove the breweryName Facet by closing the small window ‘breweryName’ – click the close X.
Removing Incomplete Data
- Let’s look at the ABV (Alcohol by Volume) column.
- As ingested the column sees the values in this column as Text. We know this should be a number, so ABV must be transformed from Text to Number. To change the way OpenRefine sees the values by choosing Edit Cells > Transform > Common Transforms > To Number.
- When we do this and then choose Facet > Numeric Facet we can see there are 38 rows with null ABV values – we have incomplete records.
- To clean up our records, we can uncheck Numeric (we don’t want to operate on all rows) and are left with 38 matching rows with non-numeric or null values. We can now select from the All column – ‘Remove all Matching Rows. We have now cleaned up our dataset and removed superfluous data. and are left with 798 rows.
Removing Duplicate Records
- Now we want to remove duplicate records – a great bedevilment of oh so many datasets. But first, let’s just note that OpenRefine is watching what we do and gives us an opportunity to recover from mistakes in our process. You can step backwards by choosing a past state – OpenRefine will then restore your environment and data to the way it was.
- Now we want to try and find duplicates. To do this we carry out two processes.
First, sort the RecordID by numeric values (this is our unique record indicator) – so that duplicate Record IDs would appear in adjoining rows. Note that we want to choose to reorder rows permanently by using the Sort > Reorder rows permanently drop-down and choosing this option – otherwise rows are only sorted based on the current operation.
- You know, the MfgsDesc filed has a lot of data in it and it is making it difficult to see multiple rows at once. Let’s make things clearer. Since we aren’t working with this column at the moment, choose View > Collapse this column and voila…things should be clearer.
- Then choose to ‘Edit Cell-Blank Down’ for the column RecordID. OpenRefine will then blank out rows with duplicated ID so that we can immediately locate them.
- RecordID is a flag for our duplicates. As in the last example, we can now choose Edit cells > Common transformations > To number. The when we do a numeric facet it allows us to uncheck Numeric values and be left with Blank records.
- Delete the duplicates by choosing All > Edit rows > Remove matching rows to eliminate these 7 duplicates.
Splitting Multiple Value Records
- Now we turn to the Atomisation of our data – what, you say, is that? Well, this is one of the more common transformations that we would carry out —> Taking the contents of one column and breaking it into discrete chunks that we can operate on independently. In this case, we have a ‘type’ field that contains multiple values separated by ‘commas’.
- To separate these types into multiple types we can use the Split Multi-Value Cells pre-built transformation. If we choose the drop-down menu from the type column – we can select Edit Cells – Split Multivalue Cells. When we do this we end up with new duplicated rows (but not records) – one for each of the values contained in the type field.
- So … what has now happened? We now have the same number of actual records, but there are additional rows – one for each category in a record so that we can manipulate them individually – in fact, if we choose show by records there are the original 791 records. But, if we choose show by rows there are now 811 rows as Open Refine has browkn apart (atomised) the type filed where there were multiple types.
- To my mind, we have just made a huge jump in usefulness. We can now choose to facet the types column by Text and determine that we have 123 distinct types applied to the beers in our dataset.
- We now don’t have to search to see if a beer type exists…we get a picture of all types that do and we can browse rather than search to discover things of interest. We can choose to sort the categories alphabetically or by frequency/count. What is the most popular type of beer produced amongst Irish brewers?
Using Machine Learning to Clean and Refine Data
- Now that we have atomised our categories we can really start working with the data and using the power of Refine and machine intelligence to manipulate and refine our dataset. If we now choose Cluster and Edit from the Edit Cells drop-down, Refine will immediately suggest cells that seem like they might represent the same thing and we can bulk edit the dataset to harmonise our valued. We can eliminate errors in data entry or inconsistencies in the use of descriptive terms.
- Choose Edit Cells > Cluster & Edit from the type column. When we choose to invoke the clustering algorithms, we see the magic suggestions.
- We can choose to merge on Imperial Red Ale to combine related types. You may also note spacing, capitalisation and spelling inconsistencies that can be replaced in bulk. Imagine trying to do this by eye.
Exporting Clean Datasets
- The final thing we want to learn here is how to take our data out of OpenRefine after we have finished with it. OpenRefine supports a wide number of useful formats and also allows you to define your own custom formats if needed. The ultimate in flexibility.
- To do this use the Export Tab at the top of the workspace:
- That’s all we will cover today. There is much more even at the beginner’s tutorial referenced above. When you want to learn more Open Refine has a deep collection of tutorials as well as a cookbook
available at Openrefine.org