Get Local Clout
Big data and how your server configuration will change
via WordPress getlocalclout.com/big-data-and-servers/
Today’s data volumes are forcing system administrators to rethink their server configurations. With over 2.7 zettabytes of data worldwide, and volumes doubling every 1.2 years (see big-data stats), this massive data deluge is creating new opportunities for business development.
Big data not only describes the massive amounts of data stored on disk, but also refers to streaming data that originates from a wide range of sources, including web activity, mobile connected devices, and a myriad of business transactions. Unstructured, large, and arriving in multiple formats from multiple sources, big data is often characterized by the five Vs: Volume, Velocity, Variety, Veracity, and Value.
The tools for data infrastructures
To deal with data at this scale, computing infrastructures use an array of complex technologies, making server configuration particularly challenging, but also more interesting than traditional computing. Whether reading massive amounts of data from disk or from live streams, the modern data infrastructure must be a distributed platform, selecting and moving data across parallel nodes for downstream processing and, ultimately, interpretation. This move towards distribution dramatically increases data throughput by spreading storage and computation over many nodes in a cluster.
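As an illustration (not from the original post), this is roughly how a distributed platform decides which node should hold a given record: hash the record's key and take it modulo the number of nodes, so storage and work spread evenly across the cluster. The node count and keys here are hypothetical.

```python
# Sketch of hash partitioning, the mechanism behind spreading data
# over parallel nodes in a cluster (as Kafka, Cassandra, and HDFS-style
# systems do in their own ways).
import hashlib

NUM_NODES = 4  # hypothetical cluster size

def node_for(key: str) -> int:
    """Deterministically map a record key to one of NUM_NODES nodes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

records = ["user:alice", "user:bob", "event:click", "event:view"]
partitions = {n: [] for n in range(NUM_NODES)}
for r in records:
    partitions[node_for(r)].append(r)

# Every record lands on exactly one node, so partition sizes sum to 4.
print(sum(len(v) for v in partitions.values()))  # 4
```

Because the mapping is deterministic, any node can recompute where a record lives without consulting a central index, which is what makes this scheme scale.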
To convert your server for dealing with large data, new warehousing platform tools include Apache Hadoop with the HDFS filesystem, and Apache Kafka (used with Samza or Flink) for low-latency distributed streaming. Unstructured data can be accessed using modern distributed NoSQL databases, such as Cassandra or MongoDB, as well as many others. Also part of the modern data infrastructure are the downstream data analytics applications built on statistical machine learning libraries; popular choices include Apache Spark and H2O.ai. Several visualization platforms provide remarkable web-based data discovery, such as the D3 JavaScript library.
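The programming model that underpins Hadoop can be shown without any cluster at all. The following pure-Python sketch (an illustration, not Hadoop's actual API) walks through the three MapReduce phases: map emits (word, 1) pairs, shuffle groups them by key, and reduce sums each group.

```python
# Minimal pure-Python sketch of the MapReduce pattern Hadoop popularized.
from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big clusters", "data moves fast"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

On a real cluster, the map and reduce phases run in parallel on many nodes and the shuffle moves data between them over the network; the logic, however, is exactly this.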
Getting your hands dirty
With so many software tools, the complex interplay between them, and the need for distributed cluster computing, building an effective production-level data infrastructure has become a daunting task even for the experienced administrator. Fortunately, you don’t have to build a cluster to become familiar with the big-data ecosystem. Using a Linux server and virtualization, you can configure a Hadoop cluster (see this tutorial) on a single machine. For step-by-step guides, see this Hadoop-on-VirtualBox blog and this how-to, or this blog using Linux containers, which offer a lower-overhead solution. To configure a server for analyzing Twitter data with Hadoop, this how-to shows how cloud computing infrastructures can be configured. For machine learning with Apache Spark, this step-by-step guide shows how to install it on your server, and this article covers some basic programming examples.
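For a single-machine Hadoop setup like the ones the guides above describe, the core of the configuration is just two files from the official pseudo-distributed setup: `core-site.xml` points the default filesystem at a local HDFS instance, and `hdfs-site.xml` drops block replication to 1 because there is only one DataNode. Port 9000 is the conventional default; your tutorial may use a different one.

```xml
<!-- etc/hadoop/core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- etc/hadoop/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

With these in place, formatting the NameNode and starting the daemons gives you a working one-node HDFS to practice against before you ever touch a real cluster.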
Why not GPUs?
Apart from cluster-based computing, you can exploit the inherent parallelism of modern graphics processing units (GPUs). This blog article describes the hardware and software configurations for a deep-learning workstation server.
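The parallelism GPUs exploit is data parallelism: the same operation applied to many elements at once. NumPy's vectorized operations illustrate the idea on the CPU (this is an illustration, not GPU code); GPU libraries such as CuPy or PyTorch expose a very similar array interface.

```python
# Data parallelism in miniature: one logical operation over a whole
# array, instead of a Python-level loop over each element. On a GPU,
# the elementwise work below would run across thousands of cores.
import numpy as np

x = np.arange(10_000, dtype=np.float64)

# Elementwise square root and scaling, applied to all elements at once.
y = np.sqrt(x) * 2.0

print(y[4])  # 4.0, since sqrt(4) * 2 = 4
```

The same mental model (replace loops with whole-array operations) carries over directly when you move the arrays onto a GPU.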
Conclusions
Whatever your needs, it is an exciting time to learn how to adapt your server to the new data world.