big data

Breaking Down Apache’s Hadoop Distributed File System

Apache Hadoop is a framework for storing and processing big data. One of its main components is HDFS, the Hadoop Distributed File System, which stores that data. You might expect a storage framework that holds large quantities of data to require state-of-the-art infrastructure so that the file system does not fail, but quite the contrary is true.

Deepak Vohra
When to Use MapReduce with Big Data

MapReduce is a programming model for parallel, distributed computation on big data sets. It's a module in the Apache Hadoop open source ecosystem, and it supports a range of queries, depending on the algorithms available. Here's when it's suitable (and not suitable) to use MapReduce for generating and processing data.

Deepak Vohra
Before Data Analysis, You Need Data Preparation

One of the prerequisites for any type of analytics in data science is data preparation. Raw data usually has several shortcomings in structure, format, and consistency, so first it has to be converted to a usable form. These are some types of data preparation you can conduct to make your data useful for analysis.

Deepak Vohra
Exploring Big Data Options in the Apache Hadoop Ecosystem

With the emergence of the World Wide Web came the need to manage large, web-scale quantities of data, or “big data.” The most notable tool to manage big data has been Apache Hadoop. Let’s explore some of the open source Apache projects in the Hadoop ecosystem, including what they're used for and how they interact.

Deepak Vohra
Data-Driven Testing Skills in an Agile and DevOps World

In agile and DevOps, understanding the role of data analysis in the test strategy helps teams accelerate development, testing, and deployment. As we continue to enhance our testing effectiveness, data analytics skills are an important dimension in managing risks in a “continuous everything” world.

Michael Sowers
Test Your Data Quality to Increase the Return on Your QA Investment

With the high volume of data coming into your organization, it’s important that it be complete, correct, and timely. But considering the velocity at which this data is moving, how do you measure its current quality? You must be able to test it wherever it sits still enough to be viewable, without altering it.

Shauna Ayers
What You Should Consider to Make the Best Use of Your Collected Data

We live in a world where data is constantly being recorded. In software, deciding when to use that data is critical to making the most of the information. You should take into account data freshness, the data-gathering processes and any dependencies between them, and when to distribute information.

Catherine Cruz Agosto
Here There Be Monsters: The Value of Data Profiling

Monsters appeared on medieval maps to identify the unknown dangers of the sea. Likewise, the data profiles for an organization identify the points within its data. A robust data-profiling strategy can provide a more accurate picture of an organization’s data systems and find risks before they become monsters.

Shauna Ayers