
Data Processing at Scale
  • Raju Singh
Arizona State University

Corresponding Author:[email protected]



Data generation and data collection have both improved steadily over the past several years. As both have evolved, a new question has emerged: how to process and manage the resulting data at scale.
Relational DBMSs have long been the widely accepted approach to processing and managing data, but they have their own pros and cons: the constraints they place on data to prevent integrity violations are a trade-off between performance and manageability. With advances in storage, compute, and network technology, the state of relational database management has progressed reliably, but the work is not yet done. Traditional database architectures handle failures poorly because they have a single point of failure. Distributed systems, however, only multiply the failure points. Failure is therefore expected, and solutions for availability are designed around these expected failures. Designed this way, distributed computing offers gains in performance, availability, and reliability.
But that is not all. We live in an era in which we communicate every now and then through different devices. Beyond that, we generate, collect, and manage data of varied types (mostly unstructured and multi-dimensional, carrying a lot of noise and bias, etc.). NoSQL DBMSs, Apache Spark, and Hadoop come to the rescue.
One area that exemplifies the use of big data is the transportation industry, which encompasses shipping, airlines, trucking, and, in the context of this report, taxicabs. NYC taxi data is available as an open dataset that stores, among other things, geospatial data collected from individual taxis as they navigate the streets of New York City. Processing geospatial data at this scale is very time-consuming and resource-intensive, as anyone who has used ArcGIS on a large dataset can attest. Distributed and parallel data processing presents an opportunity to process this type of data faster. The Apache Spark framework is well suited to this task because it distributes the work across a cluster and processes it in parallel. Additionally, its built-in libraries and APIs let it run SQL queries, which many users are likely to be familiar with given SQL's ubiquity.
In this report, we demonstrate our approaches to hot spot analysis on the NYC taxi data. Hot-zone analysis performs a range join between rectangles and points to identify the zones (rectangular boundaries) from which the most pickups originate. Hot-cell analysis uses statistical measures to identify hot zones while also considering time as an additional dimension.
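
To make the hot-zone idea concrete, the following is a minimal single-machine sketch of a range join between rectangles and pickup points, written in plain Python rather than Spark. It is an illustration under stated assumptions, not the implementation used in the report: the `Rect` type, the `contains` and `hot_zones` names, the corner-based rectangle layout, and the choice to count points on a boundary as inside are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned rectangle given by two opposite corners (assumed layout)."""
    x1: float
    y1: float
    x2: float
    y2: float


def contains(rect: Rect, x: float, y: float) -> bool:
    """Return True if (x, y) lies inside rect or on its boundary."""
    # Sort the corners so the check works whichever corner comes first.
    lo_x, hi_x = sorted((rect.x1, rect.x2))
    lo_y, hi_y = sorted((rect.y1, rect.y2))
    return lo_x <= x <= hi_x and lo_y <= y <= hi_y


def hot_zones(
    rects: Iterable[Rect],
    points: Iterable[Tuple[float, float]],
) -> List[Tuple[Rect, int]]:
    """Range-join rectangles with points and rank rectangles by pickup count.

    This is a naive nested-loop join, O(len(rects) * len(points)); a
    distributed engine would partition the data or use a spatial index.
    """
    pts = list(points)  # materialize so the points can be scanned per rectangle
    counts: Counter = Counter()
    for rect in rects:
        counts[rect] = sum(1 for (x, y) in pts if contains(rect, x, y))
    # Hottest zones first; ties are broken by rectangle coordinates.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `hot_zones` with a list of zone rectangles and a list of `(x, y)` pickup coordinates returns every rectangle paired with the number of pickups it contains, hottest first. In a distributed setting the same join condition could be expressed as a SQL predicate over the two datasets, with the containment check registered as a user-defined function.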