%!TEX root=index.tex
\newacronym{api}{API}{Application Programming Interface}
\newacronym{it}{IT}{Information Technology}
\newacronym{tcpip}{TCP/IP}{Transmission Control Protocol/Internet Protocol}
\section{Future Directions}
Volume is a defining attribute of any big data problem. Some offerings involve more data than others, but every big data solution should provide cost-effective data access as a core capability. With respect to ClimatEdge\index{ClimatEdge}, petabytes of publicly accessible climate data reside at the various government agencies that collected them. Should a data center be constructed and the data simply downloaded? There are three broad approaches to the problem of remote data access:
\begin{itemize}
    \item provide an \gls{api} that facilitates access to the data at its remote location
    \item clone the data and store it locally within a \textsc{CSC} data center
    \item process the data remotely on site
\end{itemize}
This section examines each of the three alternatives along with its advantages and disadvantages.
\subsection{Application Programming Interface}
Many Federal data sets are already publicly available in compressed formats, usually organized by time or type. In order to provide data as a service, the provider would need to implement various \gls{api}s that allow the consumer to slice and dice the data sets on demand. This has the advantage of being very simple for the consumer: issue the request, retrieve the results, and implement the analytics locally. There are two obvious problems with this approach: request latency rules out real-time analytics, and everyone pulling down data simultaneously creates enormous network bandwidth requirements. Local storage space is still required, at least as large as the largest possible retrieval, which makes this a questionable approach from the start. Additionally, the data provider needs to provide an \gls{api} that supports both the granularity of individual results and the breadth of the entire data set. This type of approach simply does not scale well to the largest data sets and places a burden on already strained Federal agency \gls{it} budgets.
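
As a rough illustration of the bandwidth concern, suppose $N$ consumers each retrieve $D$ bits within a window of $T$ seconds. The figures used here are assumed purely for illustration and do not come from any agency: $N = 100$ consumers, $D = 1$~TB $= 8 \times 10^{12}$ bits (taking 1~TB as $10^{12}$ bytes), and $T = 24$ hours $= 86{,}400$ seconds. The sustained aggregate bandwidth $B$ that the provider would have to serve is then
\[
B = \frac{N D}{T} = \frac{100 \times 8 \times 10^{12}\ \mathrm{bits}}{86{,}400\ \mathrm{s}} \approx 9.3 \times 10^{9}\ \mathrm{bits/s},
\]
or roughly 9.3~Gb/s before any protocol overhead. Under these assumptions, even a modest number of consumers would keep a 10~Gb/s uplink nearly saturated around the clock.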
\subsection{Cloning}
One common approach to running analytics on large amounts of remote data is to clone the data to local storage, thus becoming a value-added data provider. The benefits are obvious: network-adjacent access speeds, self-imposed reliability, and freedom to store and index the data in any way necessary. Even the initial population of historical data sets can be staged according to project timelines or handled by shipping tapes or hard disks between data centers. The downside, of course, is the immense cost of replicating petascale data sets locally. Even with storage costs dropping and data center efficiencies increasing, the outlay of capital necessary to stand up a large data center is significant. One often overlooked issue with duplicating public data sets is quality assurance and quality control. It is trivial to ensure that exact duplicates of the data have been created: \gls{tcpip} is reliable, and checksumming and hashing are both widely adopted solutions. But what happens if an error is discovered months later in the original copy of the data? What if that error leads to erroneous results that are passed on to customers? While the U.S. government does not always prevail in weather-related liability proceedings, it usually wins on the basis of immunity \cite[Chapter~4]{fairweather}, a protection that a commercial provider redistributing the data may not be able to rely on. For these reasons, it may be wise to leave the data where it is.
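
To put the initial population of a local clone in perspective, consider the time needed to move the data over a network. The figures here are assumptions chosen for illustration: a data set of $S = 1$~PB $= 8 \times 10^{15}$ bits (taking 1~PB as $10^{15}$ bytes) and a dedicated link sustaining $R = 10$~Gb/s $= 10^{10}$ bits/s with no protocol overhead. The transfer time is
\[
t = \frac{S}{R} = \frac{8 \times 10^{15}\ \mathrm{bits}}{10^{10}\ \mathrm{bits/s}} = 8 \times 10^{5}\ \mathrm{s} \approx 9.3\ \mathrm{days}.
\]
Real-world throughput would likely be lower, so this is best read as a lower bound per petabyte, which helps explain why shipping physical media remains a practical option.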
\subsection{Appliances}
A hybrid of the first two approaches may lie in processing the data locally at each site where it resides. At least one government agency, \gls{nasa}, is developing an analytics cluster to work with customers and partners on processing large data sets on site \cite{duffy}. While this approach is heavily dependent on cooperation (and possibly funding) from the various agencies, it is certainly worth investigating. The benefits lie in potentially reduced storage and network costs, since only summary results are transferred. It is reasonable to assume that different entities would have different requirements for analyzing the same data sets. These differences lead to multiple methods of storing and indexing the data efficiently, which would be prohibitively expensive for agencies to undertake. One solution may lie in the development of a \textsc{CSC} big data appliance for remote storage and analytics. This appliance could be dropped into the agency data centers and, through secure \gls{api}s, handle the analytics on site. Only the results necessary for presentation would be transferred back to the main cluster at \textsc{CSC}, thus substantially reducing network costs for offline analytics. The framework proposed in this paper could be deployed in an appliance form factor. For its own solutions, \textsc{CSC} is currently exploring technologies such as Cetas\index{Cetas}, which include many of the components necessary for an appliance \cite{cetas}.\\

Within these three broad categories there are further variations, such as Federal data centers whose mandate is to gather public data and provide access or even basic analytics. As with any problem, multiple solutions are possible. This section proposed three approaches, each with its own advantages and disadvantages. Any one approach, or a combination of them, would be suitable for a thesis centered on methods for processing large data sets. As big data technology evolves and data sets grow ever larger, methods of efficiently analyzing publicly available data will be at the forefront of discussion. This creates an opportunity for \textsc{CSC} to leverage its experience in developing commercial products to aid Federal agencies interested in meeting the growing public sector demand for public climate data.