This is a brief tutorial to explore how OpenRefine can be used to interrogate and clean your own dataset.
It acknowledges the wonderful tutorial: Seth van Hooland, Ruben Verborgh and Max De Wilde, “Cleaning Data with OpenRefine,” Programming Historian (05 August 2013), http://programminghistorian.org/lessons/cleaning-data-with-openrefine as well as the basic tutorials provided at OpenRefine.org and their curated tutorial list: https://github.com/OpenRefine/OpenRefine/wiki/External-Resources
This tutorial uses their freely distributable data file from the Powerhouse Museum: https://github.com/tracykteal/gapminder-spreadsheet/raw/gh-pages/phm-collection.tsv
So, let’s get started…
OpenRefine Tutorial (approximate duration: 45 minutes)
- The first step is to install the OpenRefine application on your own machine:
- Once downloaded, execute the installer and it will guide you through the process. The end result will be a new application icon on your system that looks like any other application. You can click on the icon to launch OpenRefine.
- When you do click on it, you will find that it doesn’t seem to present a window or any other user interface components to interact with. Hmmmm.
- In fact it is creating a web service that you access through your browser, via a specific address – in this case: http://127.0.0.1:3333. If you open your web browser and direct it to this address, you will see something a little more promising:
- This is our window on the OpenRefine service, listing existing projects and offering you the opportunity to create new ones (i.e. datasets that you want to explore and manipulate).
- We will use the existing dataset referenced above: https://github.com/tracykteal/gapminder-spreadsheet/raw/gh-pages/phm-collection.tsv
- This is a collection of metadata from a museum, listing all of the objects from the catalogue and providing the associated metadata. It is useful for our exercise, and sharing it demonstrates a nice open-research ethos. Let’s import it to see what it contains. Do this by choosing New Project and then pointing to the downloaded file:
- OpenRefine inspects the data file before finishing the import process (in this case there are about 78K records and the file is about 58 MB in size). From this screen we can set some parameters: in this case, change the ‘Quotation marks are used to enclose …’ option and also give the project a more meaningful name. Note: the ‘encoding’ setting lets you see in real time what happens if you change it – if you see ‘funny’ characters, try changing this one. Then click Create Project.
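Why does the quotation-marks option matter? The effect can be sketched in plain Python with the standard `csv` module, using a hypothetical two-line sample in the style of a museum TSV (the column names and values here are made up for illustration):

```python
import csv
import io

# A hypothetical sample in the style of the museum TSV; the real file
# has ~78K rows and many more columns.
sample = 'Record ID\tObject Title\n12345\t3" floppy disk\n'

# With quoting disabled (csv.QUOTE_NONE), a stray double quote inside a
# value -- like the inch mark above -- is kept as literal text instead of
# being treated as the start of a quoted field that could swallow
# following rows.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[1])  # ['12345', '3" floppy disk']
```

This is why turning that option off is usually the right call for messy real-world TSV exports.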
- OpenRefine then imports the file – noting that it is stored locally on your own machine:
- What we then see is our dataset presented in all its glory:
- From this screen we get some immediate feedback about the size of the file and what it contains. Let’s take a look at the headers and summaries.
- Now, we can start to get to know our data. OpenRefine offers us the ability to facet the data, aggregating similarities as we search for patterns. This alone can help us get to know the data and form an idea of how we might want to manipulate this file. Try faceting by Categories, for example, by clicking the small arrow next to the column heading.
- There are actually 14,805 distinct categories listed. We can choose a category and OpenRefine will ‘filter’ the dataset to show us just those records with that category. With a subset selected we can perform in-cell editing if we wish. Note that a little edit button appears when you hover over a cell. This is good for quick edits but would get laborious for extensive ones:
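Under the hood, a text facet is essentially a group-and-count over the distinct values in a column. A minimal sketch with `collections.Counter`, using a toy stand-in for the Categories column (the real dataset yields 14,805 distinct values at this stage):

```python
from collections import Counter

# Toy stand-in for the Categories column.
categories = ["Clocks", "Ceramics", "Clocks", "Numismatics", "Ceramics", "Clocks"]

facet = Counter(categories)

# Sorted by frequency, the way OpenRefine can display a text facet.
for value, count in facet.most_common():
    print(f"{value}: {count}")
```

Choosing a facet value then corresponds to filtering the rows down to those whose cell matches it.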
- Let’s explore some of the more powerful bulk editing capabilities that really make OpenRefine useful.
- Remove the Categories Facet by closing the small window ‘Categories’.
- Let’s look at the RecordID column.
- As imported, Refine sees the values in this column as text, but Record ID should be treated as a number. Change the way Refine interprets the values by choosing Edit Cells > Common Transforms > To Number.
- When we do this and then choose Facet > Numeric Facet, we see that there are three rows that have blank IDs.
- We can now uncheck ‘Numeric’ (we don’t want to operate on all rows) and are left with just the 3 blank rows. We can now select, from the ‘All’ column, ‘Remove All Matching Rows’. We have now cleaned up our dataset, removed the superfluous rows, and are left with 75,811 rows.
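The convert-then-filter pattern we just applied can be sketched in Python with a few toy rows (the column names and values here are hypothetical; the real operation dropped 3 blank-ID rows to leave 75,811):

```python
# Toy rows standing in for the dataset.
records = [
    {"RecordID": "101", "Title": "Teapot"},
    {"RecordID": "", "Title": "Unnumbered object"},
    {"RecordID": "102", "Title": "Clock"},
]

def to_number(value):
    """Mimic the 'To Number' transform: return an int, or None for blank
    or unparseable values."""
    try:
        return int(value)
    except ValueError:
        return None

# Keep only rows whose RecordID converts cleanly -- like unchecking the
# numeric bucket of the facet and removing the remaining (blank) rows.
cleaned = [r for r in records if to_number(r["RecordID"]) is not None]
print(len(cleaned))  # 2
```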
- Now we want to explore removing duplicate records – a great bedevilment of oh so many datasets. But first let’s just note that OpenRefine is watching what we do and gives us an opportunity to recover from mistakes in our process. You can step backward by choosing a past state – OpenRefine will then restore your environment and data to the way it was.
- Now we want to try and find duplicates. To do this we carry out two steps.
Note that we want to reorder the rows permanently, by using the global Sort drop-down and choosing this option – otherwise rows are only sorted for the current view.
First, sort the RecordID column by numeric values, so that duplicate Record IDs appear in adjoining rows. Then choose Edit Cells > Blank Down. OpenRefine will then blank out the ID cells of rows with a duplicated ID. As in the last example, we can then choose to display only blank records and eliminate these duplicates.
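The sort-then-blank-down trick amounts to clearing any cell that equals the cell directly above it, which is easy to sketch with a toy list of already-sorted IDs (these values are made up for illustration):

```python
# Toy RecordIDs already sorted numerically, with one duplicate.
record_ids = [101, 102, 102, 103]

# 'Blank Down' clears a cell when it equals the cell directly above it.
blanked = [
    rid if i == 0 or rid != record_ids[i - 1] else None
    for i, rid in enumerate(record_ids)
]
print(blanked)  # [101, 102, None, 103]

# Removing the blanked rows leaves exactly one row per RecordID.
deduplicated = [rid for rid in blanked if rid is not None]
print(deduplicated)  # [101, 102, 103]
```

This is why the numeric sort matters: Blank Down only catches duplicates that sit in adjoining rows.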
- Now we turn to the atomisation of our data – what, you say, is that? Well, this is one of the more common transformations that we would carry out: taking the contents of one column and breaking it into discrete chunks that we can operate on independently. In this case, we have a Categories field that contains multiple values separated by ‘pipes’.
- To separate these categories into individual values, we can use the Split Multi-Valued Cells pre-built transformation. If we choose the drop-down menu from the Categories column, we can select Edit Cells > Split Multi-Valued Cells. When we do this we end up with new duplicated rows (but not records) – one for each of the values contained in the Categories field.
- So … what has now happened? We now have the same number of actual records, but there are additional rows – one for each category in a record so that we can manipulate them individually.
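Splitting multi-valued cells behaves like ‘exploding’ each record into one row per value. A minimal sketch with hypothetical records (field names and values are made up for illustration):

```python
# Toy records whose Categories field packs several values with pipes.
records = [
    {"RecordID": 101, "Categories": "Clocks|Horology"},
    {"RecordID": 102, "Categories": "Ceramics"},
]

# Splitting on '|' yields one row per category, while the number of
# underlying records stays the same.
split_rows = [
    {"RecordID": rec["RecordID"], "Category": cat}
    for rec in records
    for cat in rec["Categories"].split("|")
]
print(len(split_rows))  # 3 rows from 2 records
```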
- To my mind we have just made a huge jump in usefulness. We can now choose to facet the Categories column by Text and determine that we have 4,934 distinct categories applied to the objects in our collection.
- We now don’t have to search to see if a category exists: we get a picture of all the categories that do, and we can browse rather than search to discover things of interest. We can choose to sort the categories alphabetically or by frequency. Now that we have atomised our categories, we can really start working with the data and using the power of Refine and machine intelligence to manipulate and refine our dataset. If we now choose Cluster and Edit from the Edit Cells drop-down, Refine will immediately suggest cells that seem like they might represent the same thing, and we can bulk-edit the dataset to harmonise our values, eliminating data-entry errors and inconsistencies in the use of descriptive terms.
- When we choose to invoke the clustering algorithms, we see the magic suggestions:
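A minimal sketch of the idea behind key-collision clustering, in the spirit of OpenRefine’s default fingerprint method: normalise each value down to a key, and treat values that share a key as candidates for merging. The normalisation below is a simplification, and the category values are invented for illustration:

```python
import re
from collections import defaultdict

def fingerprint(value):
    """A simplified fingerprint key: lowercase, strip punctuation, then
    sort and deduplicate the remaining tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

# Toy category values with typical data-entry inconsistencies.
values = ["Agricultural Equipment", "agricultural equipment", "Equipment, Agricultural"]

clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

# All three variants collide on the same key, so a tool like Refine
# would suggest merging them into one harmonised value.
print(dict(clusters))
```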
The final thing we want to learn here is how to take our data out of OpenRefine after we have finished with it. OpenRefine supports a wide range of useful formats and also allows you to define your own custom formats if needed. The ultimate in flexibility.
- To do this use the Export Tab at the top of the workspace:
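Exporting is, at heart, just writing the cleaned rows back out in a chosen format. A sketch of a CSV export with the standard `csv` module, on hypothetical cleaned rows:

```python
import csv
import io

# Hypothetical cleaned rows after our faceting and splitting work.
cleaned_rows = [
    {"RecordID": 101, "Category": "Clocks"},
    {"RecordID": 102, "Category": "Ceramics"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["RecordID", "Category"])
writer.writeheader()
writer.writerows(cleaned_rows)
print(buf.getvalue())
```

Swapping `delimiter="\t"` into the `DictWriter` call would give a TSV instead, matching the format we imported.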
- That’s all we will cover today. There is much more even in the beginner’s tutorial referenced above. When you want to learn more, OpenRefine has a deep collection of tutorials as well as a cookbook available at OpenRefine.org.