How to choose a big data security solution


Big data security refers to the measures and tools used to detect threats and protect a big data system from malicious actors. Threats to big data platforms include DDoS attacks, ransomware and data theft.

From a security standpoint, the biggest challenge for big data is the protection of user privacy. With the continual growth of existing social media networks and the emergence of new ones, identifiable user information is becoming more prevalent in big data.

The problem is magnified for big data platforms because they store far more information about users. Should such a system be compromised, the impact of the attack will be more severe.

The solution, then, is to adopt tools that offer protection from such malicious actors. Yet picking a big data security platform isn’t easy. Five key factors should be considered when selecting a big data security solution:

  • Scalability
  • Data type support
  • Specialization
  • Reporting and visualization
  • Data augmentation


Scalability

One factor that makes big data different from ‘small’ data is the sheer amount of information transferred into and out of the network. A solution selected to secure a big data platform should be able to ingest massive amounts of data without downtime.

The servers and endpoints that the security platform talks to are constantly changing state. These changes often need to be logged and communicated to other devices both on and off the network. If the security platform does not have enough bandwidth to keep up, the biggest risk is a complete failure of the whole system.

Another crucial part of security analysis is inspecting the data packets transmitted to the system. While packet inspection is useful in itself, it is the platform’s ability to correlate data historically and across systems that makes the largest difference.
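As a rough illustration of what correlating across systems means, the sketch below groups events that share a source IP address and occur within a short time window, then keeps only the groups that involve more than one system. The event fields (`time`, `system`, `src_ip`) and the five-minute window are assumptions made for the example, not the schema of any particular product.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def correlate_by_source(events, window=timedelta(minutes=5)):
    """Group events that share a source IP and fall within `window`
    of the first event in the group; keep groups spanning 2+ systems."""
    by_ip = defaultdict(list)
    for event in events:
        by_ip[event["src_ip"]].append(event)

    groups = []
    for ip, ip_events in by_ip.items():
        ip_events.sort(key=lambda e: e["time"])
        group = [ip_events[0]]
        for event in ip_events[1:]:
            if event["time"] - group[0]["time"] <= window:
                group.append(event)
            else:
                groups.append((ip, group))
                group = [event]
        groups.append((ip, group))

    # Only activity seen by more than one system counts as correlated
    return [(ip, g) for ip, g in groups if len({e["system"] for e in g}) > 1]

# Hypothetical events from two different systems
events = [
    {"time": datetime(2024, 1, 1, 12, 0), "system": "firewall",
     "src_ip": "203.0.113.7", "msg": "port scan"},
    {"time": datetime(2024, 1, 1, 12, 2), "system": "auth",
     "src_ip": "203.0.113.7", "msg": "failed login"},
    {"time": datetime(2024, 1, 1, 12, 3), "system": "auth",
     "src_ip": "198.51.100.4", "msg": "failed login"},
]
incidents = correlate_by_source(events)
```

Here the firewall and authentication events from the same address end up in one group, while the lone failed login from the other address is left out. A real platform would do this continuously and against historical data rather than an in-memory list.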

Data type support

Of the three V’s of big data (velocity, variety and volume), ‘variety’ is the one most likely to be ignored, often mistakenly. It is crucial that a big data security solution supports different types of data.

Events collected by such systems come from different sources, carry different kinds of information and may be far more fine-grained from some sources than from others. For instance, data packets may provide information about the attacker’s method of accessing a server illegally.

These are very fine-grained pieces of data that require network analysis tools to capture. Access logs, on the other hand, are high-level and human-readable. The tools that handle each kind of data may need to communicate with each other seamlessly.

Other data types that may need to be collected include audit logs, system logs (alongside application logs), flow data and other security events. A system that supports adding more data types incrementally is preferable. Such a system should also support querying data that is relevant from an infosec perspective, a feature more prevalent in graph-like tools than in conventional relational databases.
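One way to picture incremental data type support is a registry of parsers that each convert one raw format into a common event shape. The sketch below is only an illustration: the two input formats, the field names and the `parser` decorator are all hypothetical.

```python
import json

PARSERS = {}  # maps a data type name to its parsing function

def parser(data_type):
    """Register a function that turns one raw data type into a common event dict."""
    def register(func):
        PARSERS[data_type] = func
        return func
    return register

@parser("access_log")
def parse_access_log(raw):
    # Hypothetical line format: "<ip> <user> <path> <status>"
    ip, user, path, status = raw.split()
    return {"src_ip": ip, "user": user, "action": path, "status": int(status)}

@parser("audit_json")
def parse_audit_json(raw):
    # Hypothetical JSON audit record with "ip", "user" and "event" keys
    record = json.loads(raw)
    return {"src_ip": record.get("ip"), "user": record.get("user"),
            "action": record.get("event"), "status": None}

def normalize(data_type, raw):
    """Convert a raw record into the common event shape, tagged with its type."""
    if data_type not in PARSERS:
        raise ValueError(f"unsupported data type: {data_type}")
    event = PARSERS[data_type](raw)
    event["type"] = data_type
    return event

access_event = normalize("access_log", "203.0.113.7 alice /admin 403")
audit_event = normalize("audit_json", '{"ip": "203.0.113.7", "user": "alice", "event": "sudo"}')
```

Supporting a new source, such as flow data, then means registering one more parser without touching the existing ones.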

Platform specialization

The Hadoop vs. Spark debate is still on. Both are general-purpose tools with different approaches to the same problem. While either one could be used to solve our security problem, neither is specialized for it. This is especially true of Hadoop, which is infamous for being insecure by default.

While both Spark and Hadoop perform exceptionally well at big data analytics, even at scale, they lack the built-in ability to account for relations between different types of data, such as users, networks and even interconnected systems. And while it is possible to achieve the same with a bit of tinkering, it’s hardly worth the effort, considering the availability of specialized platforms like Fortscale (acquired by RSA Security) and LogRhythm.

In addition to data collection, big data security solutions also perform functions such as risk contextualization, time normalization and metadata tagging. Such functionality would have to be built from the ground up on a big data analysis platform like Hadoop or Spark.
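To give a sense of what two of these functions involve, here is a minimal plain-Python sketch of time normalization and metadata tagging. It assumes timestamps arrive as ISO 8601 strings with a UTC offset, and the `asset_tags` mapping and field names are hypothetical; it is not how any specific product implements these features.

```python
from datetime import datetime, timezone

def normalize_time(timestamp):
    """Convert an ISO 8601 timestamp that carries a UTC offset to UTC.
    (A timestamp without an offset would be treated as local time.)"""
    return datetime.fromisoformat(timestamp).astimezone(timezone.utc)

def tag_event(event, asset_tags):
    """Return a copy of the event with a UTC time and tags for its host."""
    tagged = dict(event)
    tagged["time"] = normalize_time(event["time"])
    tagged["tags"] = asset_tags.get(event["host"], [])
    return tagged

# Hypothetical event logged by a server five hours behind UTC
event = {"time": "2024-01-01T09:00:00-05:00", "host": "db-01", "msg": "login"}
tagged = tag_event(event, {"db-01": ["database", "pci-scope"]})
```

With every event on the same clock and labelled with what the host is, events from different time zones and sources can be ordered and filtered consistently.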

Reporting and visualization

As previously mentioned, big data involves processing and interacting with many different kinds of data. In addition, big data platforms generate a lot of data that needs to be studied to guide organizations toward sounder decisions. Such information is much easier to study and draw conclusions from when it is presented visually, which in the case of big data means charts and other visualizations.

A big data security platform should provide access to dashboards with pre-configured security indicators that give a higher-level overview of an organization’s security performance.
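The kind of indicator such a dashboard might show can be sketched as a simple aggregation over events. The indicators chosen here (event counts by severity and the users with the most failed logins) and the event fields are assumptions for illustration only.

```python
from collections import Counter

def security_indicators(events):
    """Summarize events into a few dashboard-style indicators."""
    by_severity = Counter(e["severity"] for e in events)
    failed_logins = Counter(
        e["user"] for e in events if e["action"] == "login_failed"
    )
    return {
        "total_events": len(events),
        "events_by_severity": dict(by_severity),
        "top_failed_logins": failed_logins.most_common(3),
    }

# Hypothetical events
events = [
    {"user": "alice", "action": "login_failed", "severity": "medium"},
    {"user": "alice", "action": "login_failed", "severity": "medium"},
    {"user": "bob", "action": "login_failed", "severity": "medium"},
    {"user": "carol", "action": "file_read", "severity": "low"},
]
indicators = security_indicators(events)
```

A dashboard would render these summaries as charts, so an analyst sees the trend before reading any individual event.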

Data augmentation

Data augmentation is the addition of contextual information to an event as data is collected. For instance, rather than leaving analysts to interact directly with low-level data harvested from packets, a big data security system should collect such information and inject it directly into the event.

The system might then make available information such as threat indicators, metadata about the network session and other relevant details, making low-level data more accessible to security analysts.

Data augmentation is particularly useful because it has the potential to eliminate false positives. These are far more common in traditional systems that don’t collect contextual information about events. In turn, data augmentation also reduces the workload that security analysts have to deal with.

It also makes analyzing and visualizing those events much simpler. Without it, analysts will quickly be overwhelmed by the amount of information they have to deal with.
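As a hedged sketch of how added context can cut down on false positives, the example below enriches each event with two lookups and then uses that context when deciding whether to alert. The `KNOWN_BAD_IPS` and `SCANNER_HOSTS` sets stand in for a threat-intelligence feed and an asset inventory; both, along with the field names, are hypothetical.

```python
KNOWN_BAD_IPS = {"203.0.113.7"}  # hypothetical threat-intelligence feed
SCANNER_HOSTS = {"10.0.0.5"}     # hypothetical internal vulnerability scanner

def augment(event):
    """Return a copy of the event with contextual fields injected."""
    enriched = dict(event)
    enriched["threat_match"] = event["src_ip"] in KNOWN_BAD_IPS
    enriched["internal_scanner"] = event["src_ip"] in SCANNER_HOSTS
    return enriched

def should_alert(enriched):
    """Decide whether an enriched event deserves an analyst's attention."""
    # A scan from the organization's own scanner is expected traffic
    if enriched["internal_scanner"]:
        return False
    return enriched["threat_match"] or enriched["signature"] == "port_scan"

internal = augment({"src_ip": "10.0.0.5", "signature": "port_scan"})
external = augment({"src_ip": "203.0.113.7", "signature": "port_scan"})
alerts = [should_alert(internal), should_alert(external)]
```

Without the injected context, both port scans would look identical and both would raise an alert; with it, only the external one does.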


User privacy and security have never been as important to organizations as they are today. And as the need for big data platforms increases with each passing day, so does the need for big data security platforms. But before adopting one, users should take a few factors into consideration: platform specialization, data augmentation, scalability, data type support, and reporting and visualization. Depending on the organization, some features may be more important than others: a startup may not be in a hurry to scale, for example, while a large or