Resource Data

Big (Crisis) Data for Predictive Models 2021 – A Literature Review and Snapshot of Opportunities for UNHCR – World


The recent Global Compact for Refugees recognized that the growing number of forcibly displaced people, and the difficulty of predicting mass movements of people, have created an urgent need for data-driven early warning systems that enable governments and aid organizations to use their limited resources effectively [1, 2]. Efforts to establish early warning systems have led to advances in forecasting global migratory flows; however, forced displacement remains the most elusive and difficult form of migration to predict [3, 2, 4].

People base their decision to migrate on a complex set of factors. In addition to the complexity of the individual decision-making process, however, the prediction of forced displacement flows is further complicated by the need to detect or predict triggering events early. Triggering events, which are often the last link in a long chain of events that tips an individual's decision to flee, tend to occur randomly and abruptly [5]. Both processes are difficult to model in and of themselves, and the lack of up-to-date and accurate data at the micro, meso and macro levels further exacerbates the problem.

The data needed to model these processes are often either unavailable (micro level) or obsolete (meso and macro level).

Furthermore, while existing refugee flows between countries are often self-sustaining and can be predicted from historical data, refugee flows triggered by new events pose a challenge because (i) the event may not yet be known to the modeler; (ii) the effect of a new event on future refugee flows is unknown; and (iii) no historical data for the new event exist, and it is uncertain to what extent historical data from other events are applicable to predicting refugee flows from the new event. Predicting forced displacement flows therefore requires first and foremost a thorough understanding of the mechanisms of forced displacement. Specifically:

  1. Factors that lead to an event that can potentially trigger forced displacement.

  2. The characteristics of an event that have the power to create significant forced displacement.

  3. Factors influencing the magnitude, demographics and direction of forced displacement.
In addition, reliable prediction of forced displacement flows requires timely data, preferably at the micro, meso and macro levels. New data sources, such as Big (Crisis) Data, can deliver such data in a timely manner and complement or replace more traditional data sources [1, 2].
Big (Crisis) Data is an umbrella term for data sources characterized by volume, velocity, and variety, such as satellite imagery, data from social media sites, or digital trace data such as call detail records (CDRs), search engine data, or connection data [1]. Data sources from various digital devices generate an estimated 2.5 quintillion bytes of data per day. "According to Statista (2018), the world's social network users numbered 2.46 billion in 2017. According to Internet World Stats (2018), at the start of 2018, the Internet penetration rate varied from 95.0% in North America and 85.7% in the European Union to 48.1% in Asia and 35.2% in Africa." [6] However, the vastness and high granularity of Big (Crisis) Data are both a boon and a bane. On the one hand, these data provide timely access to information not available through traditional survey methods. On the other hand, the same vastness means that finding valuable and applicable information can be like finding a needle in a haystack [7].
To assess which information in Big (Crisis) Data is valuable for predictive models of forced displacement, we assess each data source using three criteria: (A)ccuracy, (B)ias, and (S)calability (ABS).
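The ABS assessment can be thought of as a simple scoring rubric applied to each candidate data source. The sketch below is purely illustrative: the example sources, the 1–5 scores, and the equal weighting of the three criteria are assumptions for demonstration, not values taken from the literature.

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    """Hypothetical ABS rating for a Big (Crisis) Data source (1 = poor, 5 = strong)."""
    name: str
    accuracy: int     # signal-to-noise quality of the extracted information
    bias: int         # representativeness of the underlying user population
    scalability: int  # ease of extending the source to new regions and crises

    def abs_score(self) -> float:
        # Unweighted mean; a real assessment would weight criteria by use case.
        return (self.accuracy + self.bias + self.scalability) / 3

# Invented scores, for illustration only.
sources = [
    DataSource("satellite imagery", accuracy=3, bias=5, scalability=4),
    DataSource("social media posts", accuracy=2, bias=2, scalability=4),
    DataSource("call detail records", accuracy=4, bias=3, scalability=1),
]

for s in sorted(sources, key=DataSource.abs_score, reverse=True):
    print(f"{s.name}: {s.abs_score():.2f}")
```

In practice the criteria trade off against each other (the sections below discuss each in turn), so a single composite score is at best a starting point for comparing sources.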

• Accuracy: Big Data generally suffers from a low signal-to-noise ratio [8, 3]. In particular, content from social networking sites is likely to contain false, misleading or irrelevant information: bots using trending hashtags to gain traction for sales ads, actors with political agendas, and trolls posting misleading or false information all contribute to the noise on social networking sites. Likewise, satellite images require intensive training of advanced deep learning algorithms to extract usable information from pixels, and advanced natural language processing algorithms are required to extract relevant information from textual sources in various languages and dialects, which often contain spelling and grammatical errors.
In short, the effort required to filter the signal from the noise can in some cases become substantial and may thus affect the scalability of the data source.

• Bias: Producing user-generated content, internet data exhaust, or CDRs requires some form of access to electronic devices. Although global mobile phone and Internet penetration rates increase every year, they have not yet reached full saturation. Studies have shown that this lack of saturation is not evenly distributed across demographics, but leads to a user base that is more Western, more urban, more educated, and more male [8, 9, 10]. Although household-level mobile phone penetration rates are high in developing countries, male household members typically have primary access to the device, often to the exclusion of women and minors. Younger and less educated people prefer channels centered on direct communication, such as Instagram, Pinterest, and Facebook, while Twitter and LinkedIn are preferred for more professional posts or for social and political activism [6].
These factors create an inherent bias in many big data sources that is difficult to correct, because detailed demographics of user groups are often unavailable or are inferred by the platform using unreliable imputation methods [3, 11].

• Scalability: Using Big Data in analyses with a large (ideally global) context requires that the data source scale easily. However, ownership rights often stand in the way. CDRs, for example, belong to the mobile network operator and require bilateral agreements between the operator and the analyst before the latter can access them. The need for several bilateral contracts, owing to the multiple operators within and between countries, calls into question the use of CDRs in internationally oriented studies. In addition, setting up these agreements requires considerable time and resources, so CDRs are more readily accessible where pre-crisis agreements already exist. Likewise, social media providers such as Facebook that do not provide real-time access to their data can compromise the timeliness and flexibility of the data source.
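One common mitigation for the demographic skew described under Bias is post-stratification: reweighting observations so that the sample's demographic composition matches known population totals, for example from a census. The sketch below is a minimal illustration; the strata, proportions, and indicator values are invented for demonstration and are not drawn from any of the cited studies.

```python
# Post-stratification: reweight a demographically skewed digital sample
# toward known census proportions. All figures are invented for illustration.

sample_share = {"male": 0.70, "female": 0.30}      # observed in the platform's user base
population_share = {"male": 0.50, "female": 0.50}  # known from a census

# Weight = population share / sample share, per demographic stratum.
weights = {g: population_share[g] / sample_share[g] for g in sample_share}

# Per-group mean of some indicator measured in the sample (e.g. intent to move).
group_means = {"male": 0.40, "female": 0.60}

naive = sum(sample_share[g] * group_means[g] for g in group_means)
adjusted = sum(sample_share[g] * weights[g] * group_means[g] for g in group_means)

print(f"naive estimate:    {naive:.3f}")     # pulled toward the over-represented group
print(f"weighted estimate: {adjusted:.3f}")  # equals the population-weighted mean
```

Note that this only corrects for strata the analyst can observe; as the text points out, platform-inferred demographics are often unreliable, which limits how far such reweighting can go.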
In the following sections, we assess different sources of Big (Crisis) Data in the context of a "forced displacement system" and using the ABS criteria. Based on these three criteria, we discuss the pros and cons of different big data sources in various contexts and offer suggestions for their use.