%!TEX root=index.tex
\newacronym{acm}{ACM}{Association for Computing Machinery}
\newacronym{ncdc}{NCDC}{National Climatic Data Center}
\newacronym{soi}{SOI}{Southern Oscillation Index}
\newacronym{tvs}{TVS}{National Tornado Vortex Signature}
\newacronym{usgs}{USGS}{United States Geological Survey}
\newacronym{xml}{XML}{Extensible Markup Language}
\section{Use Cases}
Tornado and flood are two offerings with sufficiently different data and processing requirements to serve as contrasting use cases. Both aim to produce a probability of occurrence for any given geographic point in the coverage area, but the technology required to accomplish this varies greatly. A primary carrier of property and casualty insurance would be very interested in knowing the probability of natural catastrophes in geographic areas in which it has a large customer base. Having accurate probabilities of seasonal event occurrences, months in advance, constitutes a significant business advantage over competitors when issuing or renewing policies.\\

In a \textsc{CSC} meeting with Novarica\index{Novarica}, an insurance industry consulting group, a scenario was laid out and positively received in which insurance underwriters would be given access to an additional data feed that included catastrophic weather event probabilities [personal communication, 2013]. This feed would be ingested into their current toolset, which assists in the decisions related to policy offering and renewal. The probabilistic occurrence of future catastrophic weather events can help insurance companies balance high risk policies with low risk policies in their overall portfolio in order to maximize revenue. Insurance companies need these long term outlooks because existing books of business, once written, cannot be changed in the short term.
\subsection{Analytical Offerings}
Quantitative probabilities for natural catastrophic events such as tornado, hail, or flood can be expressed as a Poisson distribution showing the probability of a given number of events occurring over a fixed interval of time \cite{anderson}. It is possible to represent the resulting Poisson distribution in a number of ways, either textually or graphically. While an interactive \gls{html} geographic map may best present the results visually, the goal is to get the information into a system capable of overlaying it with parcel and policy data, so as to present a complete picture for risk analysis. \textsc{CSC} has two different administration tools for insurance carriers called POINT IN and Exceed which can handle \gls{xml} input feeds containing the event probabilities for each geographic grid location. For the global flood offering, the Integral software package, also from \textsc{CSC}, could serve the same purpose \cite{integral}.\\
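
As a point of reference, the standard Poisson probability mass function can be written as follows, where the symbols are notation assumed here for illustration rather than taken from the cited source: $N$ is the number of events in a grid cell over the interval and $\lambda$ is the expected number of such events.
\[
    P(N = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots
\]
The probability of at least one event follows as $P(N \geq 1) = 1 - e^{-\lambda}$. With a purely hypothetical rate of $\lambda = 0.1$ events per interval, this gives $1 - e^{-0.1} \approx 0.095$, or roughly a 9.5\% chance of at least one event in that cell.\\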

Climate phenomena that drive extreme weather events, such as El Niño or La Niña, change only over a period of weeks to months, making it unnecessary to generate results more frequently. The \gls{soi} data set, updated monthly by the Australian Bureau of Meteorology, is useful in climate analytics in general and its update frequency can serve as a baseline for future ClimatEdge\index{ClimatEdge} offerings. While specific customers may or may not require monthly analytics feeds, the methods will be in place to generate them at that frequency. Although the results generally would not be substantially different from the scheduled runs, it would be technically possible to run the computations on demand.
\subsection{Big Data}
In a guest \gls{acm}\index{ACM} blog post, Michael Stonebraker gives four detailed definitions for the term big data\index{big data} \cite{stonebraker}. Big data is classified as having one of the following attributes: velocity, volume, or variety. Velocity describes the frequency at which data is expected to arrive, either pushed or pulled. Volume is simply the size of the data sets to be ingested, stored, and processed. Variety describes the various heterogeneous sources and formats which comprise the data set. A big data label can therefore be applied to a data set having  one or more of these attributes.\\

The initial input data sets for tornado comprise the following \cite{walker}:
\begin{itemize}
    \item 1 MB of \gls{soi} historical archives \cite{bom}
    \item 100 MB of \gls{tvs} data \cite{hdss}
    \item 1.5 TB of \gls{ncdc}\footnote{\gls{ncdc} is part of \gls{noaa} and provides climatological services.} historical tornado report data \cite{ncdc}
\end{itemize}
All of this data is in the public domain and freely available to anyone that has the resources to store and analyze it. It is also possible that there exist variables stored within the three dimensional \gls{merra} archives that can be utilized to further improve calculations of probabilities by reducing uncertainty [personal communication, 2013]. This additional data would be:
\begin{itemize}
    \item 1.5 TB of \gls{nasa} \gls{merra} three dimensional historical data \cite{mdisc}
\end{itemize}
Even with data volumes approaching several terabytes for the full effort, tornado probability computation does not conform to the criteria for big data\index{big data} because it lacks all three attributes. The incoming data velocity can be based on several factors: how often the input data is available, how frequently the output is generated, and how long it takes to process the incoming data. The tornado analytic is expected to be generated monthly, which means input must be processed at least that often. The \gls{soi} and \gls{tvs} data sets are small enough to be considered noise with respect to the larger \gls{merra} set. Even at several terabytes, the subset of \gls{merra} useful for tornado probabilities can still fit on one or two hard drives, eliminating the need for more than one system to store all the data. Although four data sets will require four different parsers to process and ingest the data, the parsers are fairly straightforward since the data sets are structured and ample documentation is available. Preliminary tests have shown that processing a single day of \gls{merra}\footnote{A single day of two dimensional atmospheric variables is approximately 300 megabytes uncompressed.} and writing into an SQLite\index{SQLite} flat file takes approximately 20 seconds per variable on a modern CPU \cite{keller1}. Taken together, the velocity, volume, and variety of data do not constitute a challenge in the development of the tornado analytic. However, the information system that handles the tornado data should also be capable of scaling to handle the flood data; it is therefore important to understand that the tornado analytic represents only a first step in the development of future offerings.\\
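
A rough check of the claim that the smaller sets are negligible, assuming decimal units ($1$ TB $= 10^{6}$ MB) and the sizes listed above:
\[
    \frac{1\ \mathrm{MB} + 100\ \mathrm{MB}}{1.5 \times 10^{6}\ \mathrm{MB}} \approx 6.7 \times 10^{-5}
\]
The \gls{soi} and \gls{tvs} sets together therefore amount to less than 0.01\% of the \gls{merra} volume.\\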

The same attributes of big data\index{big data} that were useful in categorizing tornado can also be applied to the flood offering. The flood analytic requires the following initial public domain data sets \cite{walker}:
\begin{itemize}
    \item \gls{nasa} satellite rainfall data
    \item 2.5 TB of \gls{nasa} \gls{merra} soil moisture, atmospheric wind, humidity, and runoff data
    \item streamflow data as available from \gls{noaa} and the \gls{usgs}
    \item damage data (newspaper, social media, etc.)
    \item satellite data of built structures and inundation
    \item hydrologic response models
\end{itemize}
When exploring the velocity, volume, and variety of the above collections, it becomes clear that flood probabilities have the characteristics of big data\index{big data}. It is expected that ingesting relevant newspaper and social media data will add a daily component with reports of damage, especially during the various flood seasons, which may assist with the compilation of a database of worldwide flood claims. Even without taking into account the global nature of the offering for flood probabilities, the satellite rainfall and \gls{merra} data sets will be larger than the corresponding sets for tornado\footnote{More input variables means more data volume.}. Without even addressing existing structures or the hydrologic response models, the entirety of the input data set is larger. Many of the data sets have fixed formats, leading to structured input; however, the media component implies unstructured data. The flood data has greater volume, higher velocity, and wider variety than the tornado data and is more representative of a big data problem.\\

As seen in Table~\ref{qualifiers}, the flood analytic ingests larger and more varied data sets at a faster pace than the corresponding tornado analytic. Although the format and frequency of both analytic offerings are similar, the infrastructure needed to produce those offerings is quite different in scale.
\begin{table}[htbp]
    \centering
    \begin{tabular}{l l l}
        \hline
        Attribute & Tornado & Flood\\ [0.5ex]
        \hline
        data velocity & monthly & daily\\
        data volume &  3 TB+  & 1 PB+\\
        data variety &  4 sets all structured & 6 sets some structured\\
        \hline
    \end{tabular}
    \caption{Big Data Qualifiers}
    \label{qualifiers}
\end{table}
\subsection{Velocity, Volume, and Variety}
The speed at which data sets are processed and stored varies with the particular data. Some data sets are generated daily, e.g. \gls{merra}, some are generated monthly, e.g. \gls{soi}, and some require daily parsing, e.g. newspapers, Twitter\index{Twitter}, or other social media. Although the velocity of incoming data varies by type, it is expected that the output analytics for both tornado and flood are generated monthly on a fixed schedule as determined by the specific customer. If a month's worth of data can be pre-processed in a timely manner, there is no reason to move to daily ingestion. Tests performed on one month of \gls{merra}\footnote{representative of a larger data feed} have shown that on a single 3~GHz core it takes approximately thirty-five minutes to parse and store the data, exclusive of retrieval time. It is not expected that monthly retrieval of the structured data sets will present a problem. The unstructured data sets will need to be gathered on a daily or semi-daily basis in order to stay current. If they are not retrieved in a timely manner, there is a risk that the data will disappear or that the time spent processing weekly or monthly bulk loads will prove prohibitive.\\

It is possible to store all the necessary data required for the tornado analytic on a 1U server with several one terabyte drives. Many home computers also fall into this category. In contrast, if four drives of four terabytes each were configured in a 1U server, it would require sixty-four servers (or 1.5 standard racks) to store one petabyte's worth of flood related data. Replicating the data for a high availability production environment can add a factor of two or three to these estimates. Clearly, flood data requires capacity that is orders of magnitude greater than that of the tornado data. An important point is that the persistent storage of data could be significantly less than what was needed to pre-process it originally. For example, \gls{merra} files are typically 100 to 300 MB in size, depending on whether they are two or three dimensional. These data files contain variables that may or may not be useful in specific offerings. It is possible that out of over thirty variables, only two or three would actually be used, which dramatically reduces the amount of persistent data stored and indexed. The data parsers would know which data they need to permanently store and would discard the rest. For this reason, it is crucial that the storage used for pre-processing be elastic to save on costs.\\
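
The server count above can be reproduced with a short calculation, assuming binary units ($1$ PB $= 1024$ TB) and a 42U rack filled entirely with 1U servers; both are assumptions made here, since the text does not state them:
\[
    4 \times 4\ \mathrm{TB} = 16\ \mathrm{TB\ per\ server}, \qquad
    \frac{1024\ \mathrm{TB}}{16\ \mathrm{TB}} = 64\ \mathrm{servers}, \qquad
    \frac{64}{42} \approx 1.5\ \mathrm{racks}
\]
Under the same assumptions, a replication factor of two or three raises the requirement to 128 or 192 servers, or roughly 3.0 to 4.6 racks.\\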

Between the tornado and flood data, there are nine separate data sets that need to be pre-processed and stored. While the technology and approach to extracting and storing relevant information from each data set are similar, they are still distinct processes that must be developed individually. All four data sets for the tornado analytic are structured, with well defined approaches for extracting the required data. The flood analytic also contains several data sets that are well structured and do not present any significant challenges to data extraction on a continual basis. Contained within the flood analytic, however, is the unstructured newspaper and associated social media data related to flood reports and damage. This will require a significantly different parsing technology than the structured data and will be the most challenging to develop, implement, and refine.\\

The flood analytic requires data sets with different attributes than those of tornado. The real-time nature of newspaper articles and other social media requires daily or semi-daily ingestion. Significant flooding in a key geographic region may require a quicker than usual delivery of the associated analytics. This information will be reported first by the media in near real-time and not seen in the structured feeds until the end of the month. The additional data sets required by the global and domestic flood offering are many times larger than those of tornado. The same storage technology, but not necessarily the same platform, will work for both offerings at different scales. Tornado data is well structured, while flood data is both structured and unstructured. Different algorithms involving semantic and natural language processing are necessary to effectively parse the variety of feeds associated with flood reporting by the media. The attributes shown by the flood data sets map very well to the accepted criteria of big data, while those of tornado do not. The technology necessary to produce accurate flood analytics is more complex than that for tornado and must operate at a larger scale. The expertise developed by building a platform framework capable of generating analytics on both structured and unstructured data will naturally transfer to the next ClimatEdge\index{ClimatEdge} offering, as well as other big data\index{big data} projects.