NASTAC - Work Package 4
Our data generation and processing efforts have focused on collecting and integrating data on state borders and capitals, settlement areas of ethnic groups, conflict patterns, and railroad networks.
- State borders and capitals. As a first step, we completed the backdating of the existing CShapes dataset (Weidmann et al. 2010) on state borders and capitals worldwide from 1945 to 1886. The new dataset, called CShapes 2.0, also covers colonial holdings. These data have now been published online and in Schvitz et al. (2021) in the Journal of Conflict Resolution. The next step concerned state border data for the premodern period, which are used in Work Package 1. In this case, we were generously given access to Scott Abramson’s comprehensive dataset on European state borders covering the period 1100-1790. For compatibility purposes, these data had to be reprojected to fit existing maps. We also added capitals to all state units from 1400. Finally, to fill the temporal gap between CShapes 2.0 and Abramson, we coded a new dataset that covers European state borders from 1816 through 1886 based on information from the two datasets, the Centennia Atlas, and other sources. This dataset also contains information about capitals.
- Ethnic groups. We have nearly completed our work on the Historical Ethnic Geography (HEG) database, which provides spatial data on 120 ethnic groups in Europe from 1855 to 2020 and accounts for changes in ethnic geography due to forced displacement, assimilation and genocide. Data collection started with over 350 historical maps taken from online and archival sources. From these, we selected the 77 most suitable maps, which we converted into GIS data. We used the Ethnologue Dataset’s (Lewis, 2009) language tree to match group categories across different maps and developed new procedures to aggregate information from multiple maps, while incorporating uncertainty due to discrepancies between maps. Finally, we collected information on over 150 ethnic cleansing events to identify major shifts in the ethnic landscape. The resulting data can be used to produce “best estimates” of both individual group settlement areas and the ethnic composition of locations at any point in time since 1855. We have used the data in two papers that examine the impact of ethnic geography on the size and shape of states (currently under review), and are currently working on a third paper that examines the drivers of ethnic cleansing (to be presented at the APSA 2021 conference).
- Ethnic power relations. Update of the datasets needed to prepare the research in Work Package 3.
- Conflict patterns. We have linked a series of existing conflict datasets to the state and ethnicity data described above. One effort concerned the matching of Brecke’s war dataset to state units in the Abramson and Centennia datasets. We also matched historical peace agreements to Abramson’s state data.
- Railroads. Using raster images of the history of the railroad network "Histoire chronologique des chemins de fer européens (1834 - 1900)" provided by Bernard Cima (used with permission), we created a vectorized spatiotemporal dataset of the European railroad network.
- Historical ethnic power relations. We are collecting a streamlined version of the Ethnic Power Relations data for the 1816-1945 period covering all multiethnic states in Europe. The dataset is based on the state border and ethnic data, and measures, for each combination of ethnic group and state, the group’s access to power and regional autonomy. The project is now completing the pilot phase and has coded the relevant variables for 19 out of 280 units. We expect the data to be ready by mid to late autumn.
- Nationalist claims. In order to measure nationalism more directly and validate our theory of nationalist state transformation, we are collecting Europe-wide data on the first nationalist claims made in the name of ethnic groups within states. This data collection runs in parallel to the historical EPR coding and exploits the case-specific knowledge developed by the coders to speed up the process. It similarly covers multiethnic states from 1816 to 1945 and measures the year in which nationalist organizations with clear ethnic links made various types of political claims (i.e., claims to autonomy, inclusion in government, independence).
- Newspapers and other text data. We were able to retrieve large amounts of text data from the Europeana Collection of historical newspapers. Unfortunately, we found that the quality of the optical character recognition severely limits the use of the newspaper articles for analytical purposes. We dealt with that problem in several ways. First, we researched other sources of historical text data, such as the WorldCat database for metadata of historical books or text data from historical parliamentary debates. Second, Dennis Atzenhofer was involved in a project that looked at the use of quantitative text analysis with more modern newspaper texts. A brief paper summarizing these efforts will be published in the Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021). Lastly, we are exploring ways in which the subset of historical newspapers with sufficient quality could be used for specific case studies.
To support the above activities, we purchased two servers with sufficient computational power (24 cores each), memory (1.5 TB and 768 GB), and storage (> 40 TB in aggregate). We installed them with Linux and macOS to make sure our results can be replicated under different conditions. A data pipeline to process the data has been developed using a combination of R, Java, and SQL procedures, storing intermediate and final datasets in a geospatial relational database (PostgreSQL/PostGIS). Access to these two servers is protected by a firewall and secured through strong encryption with public key authentication. All code is kept under revision control for strict reproducibility and archived alongside the data in three redundant backup systems.
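As a purely illustrative sketch of the idea described under "Ethnic groups" (aggregating information from several maps while keeping track of disagreement between them), the following Python snippet computes, for a single location, the share of maps that assign each group and a simple disagreement score. This is not the HEG procedure: the function name `aggregate_maps`, the majority-share rule, and the example labels are all assumptions made for illustration only.

```python
from collections import Counter


def aggregate_maps(map_labels):
    """Summarize the group labels that several maps assign to one location.

    map_labels: list of group labels, one per map covering the location
    (hypothetical input format, not the project's actual data model).
    """
    if not map_labels:
        raise ValueError("at least one map label is required")

    counts = Counter(map_labels)
    total = len(map_labels)

    # Share of maps that assign each group to this location.
    shares = {group: n / total for group, n in counts.items()}

    # "Best estimate" here is simply the group with the largest share;
    # on ties, the group that appears first in the input wins.
    best_group, best_share = max(shares.items(), key=lambda kv: kv[1])

    # Disagreement between maps, used as a crude uncertainty score.
    return {
        "shares": shares,
        "best": best_group,
        "uncertainty": 1.0 - best_share,
    }


# Hypothetical labels from four maps for a single location.
result = aggregate_maps(["A", "A", "B", "A"])
print(result)
```

Applied to every location and every year for which maps are available, a rule of this kind would yield both a most likely group and a measure of how strongly the sources disagree; the actual procedures developed in the project may weight maps differently.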