A year of big data

With three major projects ranging from web archives to parliamentary data to using digital tools to solve research challenges, 2014-15 was a year of big data for the Institute of Historical Research (IHR).

Big UK Domain Data for the Arts and Humanities (funded by the Arts and Humanities Research Council) is concerned with the archive of UK web space from 1996 to 2013 – all 65 terabytes of it. The archived web is very different from the live web, and there is not yet the expertise or the tools to work with it effectively. Both the data itself and the process of collection are poorly understood, and it is possible to draw only the broadest of conclusions using current analytical tools. Working with researchers and developers at the British Library, the Oxford Internet Institute and Aarhus University, this group has begun to develop a theoretical and methodological framework for analysing this vitally important primary source. Ten researchers, from a range of arts and humanities disciplines, were awarded bursaries to work with the dataset, under the guidance of the project team. Their proposals ranged from analysing Euro-scepticism on the web to studying the Ministry of Defence’s recruitment strategy, from examining the history of disability campaigning groups and charities online to looking at Beat literature in the contemporary imagination. The case studies produced demonstrate some of the challenges posed by the archived web, but also its value and significance.

The project has resulted in one of the largest full-text indexes of web archive (WARC) files in the world, and also a sophisticated interface which supports complex query building and gives researchers the ability to create and manipulate corpora derived from the larger dataset. The tools and knowledge developed during the project have already influenced provision of and access to web archives at the British Library, and the software and processes have informed similar work in Denmark and Canada. The project is beginning to transform how researchers interact with this essential part of our digital cultural heritage.

The second project, Digging into Linked Parliamentary Data (funded under the Digging into Data Challenge 3), involves the IHR, the universities of Toronto and Amsterdam, King’s College London and the History of Parliament Trust. Like so much work in the digital humanities, it is notably interdisciplinary, with historians, political scientists, computational linguists and information scientists working together to analyse parliamentary proceedings from the UK, Canada and the Netherlands over a period of 200 years. Key subjects for exploration are left/right ideological polarisation in parliamentary discourse, the way in which migration has been discussed since 1800, and the influence of gender on the language and topics of debate and discussion. The School of Advanced Study and its institutes are uniquely placed to host and facilitate large-scale collaborations of this kind, and to ensure that a humanities perspective informs big data research. And in keeping with the School’s remit to promote and facilitate research both nationally and internationally, all of the data produced by the project will be open for re-use and sharing.

Finally, Traces through Time: Prosopography in Practice across Big Data, funded by the AHRC and led by The National Archives (TNA), is addressing the problem of how to identify individuals – or instances of individuals – securely within and across large datasets. Ultimately the aim is to embed some of the tools developed in The National Archives’ Discovery service. As TNA digitises more and more of its collections and continues to add data to its catalogue, at the very least there is the potential to help a large number of users manage and refine their searching. If this problem of identification can be even partially mitigated, researchers can begin to reveal the lives hidden in the records and continue to explore history from the bottom up as well as the top down.

Jane Winters, Institute of Historical Research