How big does data have to be to count as Big Data?

Tags: hadoop, mapreduce, bigdata

How much data qualifies to be categorised as Big Data?

At what data size can one decide that it is time to adopt technologies like Hadoop and use the power of distributed computing?

I believe there is a certain premium in adopting these technologies, so how can I make sure that using Big Data methods will actually benefit the current system?

asked Oct 11, 2015 by abhimca2006
0 votes

2 Answers

0 votes

To quote from the wiki page for Bigdata:

When it becomes difficult to store, search, analyse, share, etc. a given amount of data using traditional database management tools, that large and complex dataset is said to be Big Data.

Basically, it’s all relative. What is considered Bigdata varies depending on the capabilities of the organization managing the dataset. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.

The amount of data is just one of the key elements in defining Big Data. The variety of the data and the velocity at which it grows are the other two major elements in deciding whether a data set is Big Data.

Variety means having many different data and file types that may need to be analysed and processed in ways that are out of reach of traditional relational databases. Some examples of this variety include sound and movie files, images, documents, geospatial data, web logs, and text strings.

Velocity is about the speed of change in the data and how quickly it must be processed to generate significant value. Traditional technologies are poorly suited to storing and using high-velocity data, so new approaches are needed. The faster the data in question is created and aggregated, and the more swiftly it must be used to uncover patterns and problems, the greater the velocity and the more likely you are to have a Big Data problem at hand.
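
To make the variety and velocity points a bit more concrete: the kind of processing that falls outside a relational database is often something like scanning raw web logs or free text at scale. Below is a minimal, illustrative sketch of a word count written as a Hadoop Streaming mapper and reducer in Python; the file names and comments are my own assumptions, not something from this answer.

    #!/usr/bin/env python
    # mapper.py - minimal Hadoop Streaming mapper (illustrative sketch).
    # Reads raw text lines from stdin and emits "word<TAB>1" pairs.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word.lower(), 1))

    #!/usr/bin/env python
    # reducer.py - minimal Hadoop Streaming reducer (illustrative sketch).
    # Hadoop sorts mapper output by key, so counts for one word arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

Such a pair of scripts would typically be submitted with Hadoop's streaming jar, passing them as the -mapper and -reducer options along with HDFS input and output paths.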

By the way, if you are looking for a cost-effective solution, you can explore Amazon's EMR (Elastic MapReduce).
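
For completeness, here is a minimal, hypothetical sketch of what spinning up a small EMR cluster looks like with the boto3 library; the release label, instance types, bucket and role names are placeholders and should be checked against the current EMR documentation.

    # Illustrative sketch only: launch a small EMR cluster with boto3.
    # Release label, instance type, log bucket and roles are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.run_job_flow(
        Name="wordcount-demo",                     # illustrative name
        ReleaseLabel="emr-5.30.0",                 # pick a current release
        Applications=[{"Name": "Hadoop"}],
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,                    # 1 master + 2 core nodes
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
        },
        LogUri="s3://my-bucket/emr-logs/",         # placeholder bucket
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Cluster id:", response["JobFlowId"])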

answered Oct 11, 2015 by mannar kande
0 votes

"Big Data" is a somewhat vague term, used more for marketing purposes than making technical decisions. What one person calls "big data" another may consider just to be day to day operations on a single system.

My rule of thumb is that big data starts where your working set of data no longer fits into main memory on a single system. The working set is the data you are actively working on at a given time. For instance, if you have a filesystem that stores 10 TB of data but you are using it to store video for editing, your editors may only need a few hundred gigabytes of it at any given time, and they are generally streaming that data off the disks, which doesn't require random access. But if you are trying to run database queries against the full 10 TB data set, and it is changing on a regular basis, you don't want to be serving that data off disk; that is where it starts to become "big data."

As a basic rule of thumb, I can configure an off-the-shelf Dell server with 2 TB of RAM right now, but you pay a substantial premium to stuff that much RAM into a single system. 512 GB of RAM in a single server is much more affordable, so it would generally be more cost effective to use four machines with 512 GB of RAM each than a single machine with 2 TB. So you can probably say that anything above 512 GB of working-set data (data that you need to access for any given computation on a day-to-day basis) would qualify as "big data".

Given the additional cost of developing software for "big data" systems as opposed to traditional databases, for some people it might be more cost effective to move to that 2 TB system rather than redesign their application to be distributed among several machines. So, depending on your needs, anywhere between 512 GB and 2 TB of working-set data may be the point where you need to move to "big data" systems.
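
To put that rule of thumb in one place, here is a small, purely illustrative back-of-envelope helper; the 512 GB and 2 TB thresholds come from the reasoning above, while the function name and example figures are made up.

    # Back-of-envelope sketch of the rule of thumb above (the thresholds are
    # the 512 GB / 2 TB figures from this answer; the helper is illustrative).

    SINGLE_BOX_COMFORTABLE_GB = 512    # affordable RAM in one commodity server
    SINGLE_BOX_MAX_GB = 2 * 1024       # premium-priced single-server ceiling

    def sizing_advice(working_set_gb):
        """Classify a working set (data actively touched, not total storage)."""
        if working_set_gb <= SINGLE_BOX_COMFORTABLE_GB:
            return "fits in RAM on one affordable server - not a big-data problem"
        if working_set_gb <= SINGLE_BOX_MAX_GB:
            return ("grey zone: weigh the premium for a large-memory server "
                    "against the cost of redesigning for a distributed system")
        return "exceeds a single server's RAM - distributed ('big data') territory"

    # Example: 10 TB stored, but editors only touch ~300 GB at a time.
    print(sizing_advice(300))      # not a big-data problem
    print(sizing_advice(1200))     # grey zone
    print(sizing_advice(4096))     # distributed territory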

I wouldn't use the term "big data" to make any technical decisions. Instead, formulate your actual needs, and determine what kinds of technologies are needed to address those needs now. Consider growth a bit, but also remember that systems are still growing in capacity; so don't try to over-plan. Many "big data" systems can be hard to use and inflexible, so if you don't actually need them to spread your data and computation to dozens or hundreds of systems, they can be more trouble than they're worth.

answered Oct 11, 2015 by amit_cmps