Question: MQ to process, aggregate and publish data asynchronously


Some background, before getting to the real question:

I am working on a back-end application that consists of several different modules. Each module is currently a command-line Java application that is run "on demand" (more details later).

Each module is a "step" in a bigger process that you can think of as a data flow. The first step collects data files from an external source and loads them into SQL database tables. The following steps, triggered by different conditions and events (timing, presence of data in the DB, messages and processing done through a web service/web interface), take data from one or more DB tables, process it, and write the results to different tables. Steps run on three different servers and read data from three different DBs, but write to a single DB. The purpose is to aggregate data and compute metrics and statistics.

Currently, each module is executed periodically by a cronjob (every few minutes or hours for the first modules, every few days for the last ones in the chain, which need to aggregate more data and therefore wait "longer" for it to be available). When a module (currently a Java console application) runs, it checks the database for new, unprocessed information in a given datetime window and does its job.

The problem: it works, but I need to expand and maintain it, and this approach is starting to show its limits.

  1. I do not like relying on "polling"; it is wasteful, considering that the information held by earlier modules could be sufficient to "tell" modules further down the chain when the data they need is available and they can proceed.
  2. It is "slow": the several days of delay for modules down the chain exist because we have to be sure the data has arrived and been processed by the previous modules. So we "stop" these modules until we are sure we have all the data. New additions require real-time (not hard real-time, but "as soon as possible") computation of some metrics. A very good example is what happens here, on SO, with badges! :) I need to obtain something really similar.

To solve the second problem, I am going to introduce "partial", or "incremental", computations: as soon as I have a set of relevant information, I process it. Then, when some other linked information arrives, I compute the difference and update the data accordingly, but then I also need to notify other (dependent) modules.
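
The incremental idea can be sketched roughly as follows. This is only an illustration under assumptions: the class `IncrementalMetric`, its methods, and the choice of a running sum as the metric are hypothetical and not part of the application described above. Each newly arrived value is folded into the stored total, and dependent modules are told only about the difference.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical running-sum metric that is updated incrementally. */
class IncrementalMetric {
    private final Map<String, Long> totals = new HashMap<>();
    private final List<BiConsumer<String, Long>> listeners = new ArrayList<>();

    /** Register a dependent module; it receives (key, delta) on each change. */
    void onChange(BiConsumer<String, Long> listener) {
        listeners.add(listener);
    }

    /** Fold newly arrived data into the total and notify dependents of the difference. */
    long add(String key, long delta) {
        long updated = totals.merge(key, delta, Long::sum);
        if (delta != 0) {
            for (BiConsumer<String, Long> listener : listeners) {
                listener.accept(key, delta);
            }
        }
        return updated;
    }

    /** Current value of the metric for a key (0 if nothing has arrived yet). */
    long total(String key) {
        return totals.getOrDefault(key, 0L);
    }
}
```

In a multi-process setup the listener call would be replaced by publishing a message, but the shape (update, then emit the delta) stays the same.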

The question(s)

  • 1) What is the best way to do it?
  • 2) Related: what is the best way to "notify" other modules (Java executables, in my case) that relevant data is available?

I can see three ways:

  • add other, "non-data" tables to the DB, in which each module writes "Hey, I have done this and it is available". When the cronjob starts another module, it reads the table(s), decides that it can compute subset xxx, and does it. And so on
  • use message queues, like ZeroMQ (or Apache Camel, as @mjn suggested), instead of DB tables
  • use a key-value store, like Redis, instead of DB tables

I am biased towards the second or third solution; in particular,

  • 3) is there any solution that helps me get rid of the cronjobs completely?

(That, IMO, would mean some sort of queue that:

  • stores messages
  • based on the content of the queue(s), sends a message to a particular application (subscriber?); for example, ModuleC receives a message only if QueueA and QueueB both have a message
  • fires messages based on a time delay as well (I can post a message and say "notify the subscriber after two hours")
  • if the subscriber is not running, can start it)
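
To make the "ModuleC receives a message only if QueueA and QueueB both have a message" requirement concrete, here is a minimal in-memory sketch. The class `JoinTrigger` and its methods are hypothetical names invented for this illustration; it does not persist messages, delay them, or start a subscriber process, all of which a real solution would need.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical coordinator: fires only when every input queue holds a message. */
class JoinTrigger {
    private final Map<String, Queue<String>> queues = new HashMap<>();
    private final Consumer<Map<String, String>> subscriber;

    JoinTrigger(Consumer<Map<String, String>> subscriber, String... queueNames) {
        this.subscriber = subscriber;
        for (String name : queueNames) {
            queues.put(name, new ArrayDeque<>());
        }
    }

    /** Store a message; if all queues are now non-empty, deliver one message from each. */
    synchronized void publish(String queueName, String message) {
        Queue<String> target = queues.get(queueName);
        if (target == null) {
            throw new IllegalArgumentException("Unknown queue: " + queueName);
        }
        target.add(message);
        boolean ready = queues.values().stream().noneMatch(pending -> pending.isEmpty());
        if (ready) {
            Map<String, String> batch = new HashMap<>();
            queues.forEach((name, pending) -> batch.put(name, pending.poll()));
            subscriber.accept(batch);
        }
    }
}
```

With `new JoinTrigger(handler, "QueueA", "QueueB")`, the handler is not called after a publish to QueueA alone; it is called once a message is also published to QueueB.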

Is there something like this or, if nothing is available, does it make sense to build it on my own (probably on top of Redis)? In any case,

  • 4) should I build this message (event?)-based solution as a centralized service, running it as a daemon/service on one of the servers?
  • 5) should I abandon this idea of starting the subscribers on demand, and have each module running continuously as a daemon/service?
  • 6) what are the pros and cons (reliability, single point of failure vs. resource usage and complexity...)?

asked Sep 13, 2013 in Java Interview Questions by anonymous
edited Sep 12, 2013
0 votes
35 views



4 Answers

0 votes

1> I suggest using a message queue. Choose the queue depending on your requirements, but for most cases any of them would do. I suggest picking one based on JMS (ActiveMQ) or AMQP (RabbitMQ), and either writing a simple wrapper over it or using the ones provided by Spring (spring-jms or spring-amqp).

2> You can write queue consumers so that your system is notified when a new message arrives. For example, with RabbitMQ you can implement the MessageListener interface:

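    // Assumption: the MessageListener meant here is the one from spring-amqp.
    import org.springframework.amqp.core.Message;
    import org.springframework.amqp.core.MessageListener;

    public class AutomationInputQueueListener implements MessageListener {
        @Override
        public void onMessage(Message message) {
            /* Handle the message */
        }
    }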

3> If you use async consumers as in <2>, you can get rid of all polling and cron jobs.

4> It depends on your requirements. If you have millions of events/messages passing through your queue, then running the queue middleware on a centralized server makes sense.

5> If resource consumption is not an issue, then keeping your consumers/subscribers running all the time is the easiest way to go. If these consumers are distributed, you can orchestrate them using a service like ZooKeeper.

6> Scalability: most queuing systems make it easy to distribute messages, so provided that your consumers are stateless, scaling is possible just by adding new consumers and some configuration.
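
Points 3, 5 and 6 can be illustrated without any broker at all. The sketch below is an in-process stand-in built only on the JDK; `ConsumerPool` and its methods are hypothetical names, and a real deployment would use the broker's own listener container instead. It shows always-running consumers that block until a message arrives (so no cron job and no sleep loop), and scaling by raising the number of identical, stateless workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical in-process stand-in for a broker queue with competing consumers. */
class ConsumerPool implements AutoCloseable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<Thread> workers = new ArrayList<>();

    /** Starts identical workers; the handler must be stateless or thread-safe. */
    ConsumerPool(int consumers, Consumer<String> handler) {
        for (int i = 0; i < consumers; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        // take() blocks until a message arrives: no cron, no sleep loop.
                        handler.accept(queue.take());
                    }
                } catch (InterruptedException stop) {
                    Thread.currentThread().interrupt(); // shutdown requested
                }
            }, "consumer-" + i);
            worker.setDaemon(true);
            worker.start();
            workers.add(worker);
        }
    }

    /** Each message is handled by exactly one of the workers. */
    void send(String message) {
        queue.add(message);
    }

    @Override
    public void close() {
        workers.forEach(Thread::interrupt);
    }
}
```

Going from one consumer to several is a change to the constructor argument only, which is the in-process analogue of adding consumers to a broker queue.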

 

answered Sep 13, 2013 by rajesh
edited Sep 12, 2013
0 votes
Parts of the queue description sound like what systems based on "enterprise integration patterns", such as Apache Camel, do.

A delayed message can be expressed with a constant:

from("seda:b").delay(1000).to("mock:result");

or with a variable, for example a message header value:

from("seda:a").delay().header("MyDelay").to("mock:result");
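
For comparison, the same "notify the subscriber after a delay" idea can be sketched with nothing but the JDK's `ScheduledExecutorService`. This is not Camel and not a replacement for it: `DelayedPublisher` is a hypothetical helper, and pending messages live only in memory, so they are lost if the process stops.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: hands a message to a subscriber after a delay. */
class DelayedPublisher implements AutoCloseable {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    /** Schedule delivery of the message; returns immediately. */
    void publish(String message, long delayMillis, Consumer<String> subscriber) {
        timer.schedule(() -> subscriber.accept(message), delayMillis, TimeUnit.MILLISECONDS);
    }

    /** Stop accepting new messages and let the timer thread exit when idle. */
    @Override
    public void close() {
        timer.shutdown();
    }
}
```

A two-hour delay would be `publish(msg, TimeUnit.HOURS.toMillis(2), subscriber)`; surviving a restart over that span is exactly where a persistent queue becomes necessary.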
answered Sep 13, 2013 by rajesh
edited Sep 12, 2013

...