Java – Batch Upload to Database

Usually I/O operations are costly. When I  tried to upload my huge csv file of 1.8 lakhs records one by one,  into a MySQL table , it took almost more than 30 minutes. And obviously it was not an acceptable result. So I had to resort  to batch uploading. On writing data as batches of 1000 records, everything was over in 30-40 seconds.

Lets see the code in detail.

The important thing to remember is that we need to turn off  “auto commit” mode. This means that if this mode is enable, every time record is pushed DB memory, it would automatically get written into tables, nullifying the effects of batch upload. At the same we have to enable it just before writing the records into table once enough number of records are pushed into DB memory or cache, using the commit().

In the the following example, records are read from a csv file named input.csv and its first three fields are written into tables called “batch” in the DB test. MySQL was DB of my choice.

At first auto commit mode is turned off by calling setAutoCommit(false) on DB connection object. Each record will be read and pushed them into DB cache using the addBatch(); 

When the we have 1000 records in the cache ie count variables becomes multiples of 1000, we need to write them into DB . For that we call following method, executeBatch(); Since we have disabled the auto commit mode, we need to enable that as well by calling commit()  on connection object in order to get this data written into the DB.

Gitub link here

BatcUpload.java

public class BatchUpload {

public static void main(String[] args) throws IOException, SQLException {

String line = “”;
String delimiter = “,”;
int count = 1;
DBConnector.createConnection();
Connection dbConn = DBConnector.getDBConnection();
System.out.println(dbConn);
PreparedStatement ps = DBConnector.getPSInstance();
dbConn.setAutoCommit(false);
String inputFile = “input.csv”;
BufferedReader br = new BufferedReader(new FileReader(inputFile));
while ((line = br.readLine()) != null) {
String[] entities = line.split(delimiter);
try {
ps.setString(1,entities[0]);
ps.setString(2,entities[1]);
ps.setString(3,entities[2]);

ps.addBatch();
if(count%1000==0){
ps.executeBatch();
dbConn.commit();
}
count++;

System.out.println(“Records are inserted into DBUSER table!”);

} catch (SQLException e) {

System.out.println(e.getMessage());

}

}

/*** To write the remaining records into DB*/

ps.executeBatch();
dbConn.commit();
dbConn.close();

}
}

DBConnector.java 

public class DBConnector {
private static final String DB_DRIVER = “com.mysql.jdbc.DRIVER”;
private static final String DB_CONNECTION = “jdbc:mysql://localhost:3306/test”;
private static final String DB_USER = “root”;
private static final String DB_PASSWORD = “root”;

private static Connection conn;
private static PreparedStatement ps;

public static void createConnection() {

// conn = null;
System.out.println(“asdasdfafds”);

try {

Class.forName(DB_DRIVER);

} catch(ClassNotFoundException cnf){
System.out.println(“Driver could not be loaded: ” + cnf);
}

try{
conn = DriverManager.getConnection(DB_CONNECTION, DB_USER,DB_PASSWORD);
String query = “INSERT INTO batch”
+ “(userID, username, address) VALUES”
+ “(?,?,?)”;

ps = conn.prepareStatement(query);
}
catch (SQLException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

}

public static Connection getDBConnection() {

return conn;

}

public static PreparedStatement getPSInstance() {
return ps;

}
}

Advertisements

HTTP Gateway : Invoking Celery Tasks from Java (Non Python Application) – Part #3

HTTP Gateway is the ideal way for celery tasks invocation and status polling from any non python languages like java. For this, first we have to set up a django application which would handle all the incoming requests for tasks invocation and status polling. As a first step, we need to install python virtual enviroment named env (not mandatory, But I prefer this as this wont affect python packages installed in the machine).

Step1 :
activate the virtual environment and install all required libraries like django,djecelery etc.

Step 2:
copy and paste the django app for http-gateway in this activated virtual environment,env.(I have uploaded the code for this app in github, https://github.com/nishijayaraj/celery-HTTPgateway)

This app contains all tasks in tasks.py file. Now run the app from this activated virtual environment using the following command :
python manage.py runserver

Now with this, our djnago app would start at localhost:8080, to which we can send rest api calls. It is likely to be getting some module not found errors at this point, in case u get them, just try to install the missing packages and libraries in this activated virtual environment. and then run the app again.

The structure of the entire app is described in the image added below :
Screenshot from 2016-07-14 18:21:27

Now we need workers to execute the tasks. So open a terminal and run the following command from the root directory of our application (env/celery-HTTPgateway),

celery worker -A tasks –loglevel=INFO

It would then list all registered tasks there. We would be able to see to tasks,
1.tasks.hello_world
2.UploadTask

Now let’s see how we can invoke a celery task and poll its status using REST apis. These rest apis can be called from any programming languages using appropriate native apis.

For the sake of simplicity, here I am using linux curl command for simulating REST api calls.
( / at end of the url is mandatory for this command to work)

Open a terminal and run the following command,

curl -X GET http://localhost:8000/apply/tasks.hello_world/

Then you would a json data as response,

{“ok”: “true”, “task_id”: “0fc2150e-b321-4cc6-aaef-b1ce9b30e7fe”}

The respose contains the id of the invoked tasks which can be used to track its status.

Status Polling:

curl -X GET http://localhost:8000/0fc2150e-b321-4cc6-aaef-b1ce9b30e7fe/status/

Response:
{“task”: {“status”: “SUCCESS”, “result”: “Hello world………”, “id”: “0fc2150e-b321-4cc6-aaef-b1ce9b30e7fe”}}

These api would display custom states,something at which celery flower apis fails most of the time. A task named UploadTask has been written in tasks.py file with a view to showcase this feature. For this, first we need to invoke the tasks and then track the status.

Task invocation :

curl -X GET http://localhost:8000/apply/tasks.UploadTask/
Response –
{
“ok”: “true”,
“task_id”: “cc51e093-372f-42c1-8344-c1def70c544a”
}

The status checking of above task, can be done :

curl -X GET http://localhost:8000/cc51e093-372f-42c1-8344-c1def70c544a/status/
Response
{
“task”: {
“status”: “PROGRESS”,
“result”: {
“progress”: 0
},
“id”: “cc51e093-372f-42c1-8344-c1def70c544a”
}
}

References :
http://sofc.developer-works.com/article/25718503/How+to+use+Celery+in+Java+web+Application
https://github.com/ask/celery/tree/master/examples/celery_http_gateway

Invoking Celery Tasks from Java Application – Part #2

In the previous post we have seen how to invoke a celery tasks from java application. but it was based on sending messge to  rabbitMQ queue using respective rabbitMQ libraries. But in this post, let’s be be familiar with more convenient way or rather using Rest APIs.

For this, we need to install a celery monitoring tool called flower. Not all version of flower is supposed to serve our purpose. What worked for me is the development version. (the command to install is written below)
pip install https://github.com/mher/flower/zipball/master#egg=flower

So let me assume that we have tasks.py with a task named add

@app.task
def add(x, y):
print x+y

Now run the worker
celery -A tasks worker –loglevel=info

Starting flower
Finally it is time to start flower so that we access/control both tasks and workers using flower REST apis. For that we need to run the following command :

celery flower -A appname (celery flower -A tasks)

Care should be taken to specify the project name in the above command(here tasks) when we start flower because the apis would not work properly otherwise.

Now this can be viewed from the url http://localhost:5555 (or using respective hostname). This has got different tabs to show the status of tasks, workers and so on. So basically what we are going to do is, use the the apis which flower is using for aforementioned feature, directly in our application.

In order to simulate REST api call, throughout this post I am using curl command as I am coming from linux background. This apis can be integrated from any programming languages.

1. Invoking a celery task

curl -X POST -d ‘{“args”:[1,2]}’ http://localhost:5555/api/task/async-apply/tasks.add

this would trigger celery task add with parameters 1 and 2 and would generate an output similar to the following:

{
“task-id”: “81775ebb-7d88-4e91-b580-b3a2d79fe668”,
“state”: “PENDING”
}

So this api would return the task id of the generaed task, which can be used for tracking it whenever we want.

2. Retrieving information regarding a specific task using its id

curl -X GET http://localhost:5555/api/task/info/81775ebb-7d88-4e91-b580-b3a2d79fe668

output :
{
“task-id”: “81775ebb-7d88-4e91-b580-b3a2d79fe668”,
“result”: “‘None'”,
“clock”: 371,
“routing_key”: null,
“retries”: 0,
“failed”: false,
“state”: “SUCCESS”,
“kwargs”: “{}”,
“sent”: false,
“expires”: null,
“exchange”: null,
“started”: 1466248131.745754,
“timestamp”: 1466248131.837694,
“args”: “[1, 2]”,
“worker”: “celery@space-Vostro-3800”,
“revoked”: false,
“received”: 1466248131.744577,
“exception”: null,
“name”: “tasks.add”,
“succeeded”: 1466248131.837694,
“traceback”: null,
“eta”: null,
“retried”: false,
“runtime”: 0.09263942600227892
}

3. Listing all the tasks sent to workers

curl -X GET http://localhost:5555/api/tasks

output :
{
“81775ebb-7d88-4e91-b580-b3a2d79fe668”: {
“received”: 1466248131.744577,
“revoked”: false,
“name”: “tasks.add”,
“succeeded”: 1466248131.837694,
“clock”: 371,
“started”: 1466248131.745754,
“timestamp”: 1466248131.837694,
“args”: “[1, 2]”,
“retries”: 0,
“failed”: false,
“state”: “SUCCESS”,
“result”: “‘None'”,
“retried”: false,
“kwargs”: “{}”,
“runtime”: 0.09263942600227892,
“sent”: false,
“uuid”: “81775ebb-7d88-4e91-b580-b3a2d79fe668”
},
“50c589e1-b613-496f-af1e-c94c04b163dc”: {
“received”: 1466248086.289584,
“revoked”: false,
“name”: “tasks.add”,
“succeeded”: 1466248086.339701,
“clock”: 313,
“started”: 1466248086.291148,
“timestamp”: 1466248086.339701,
“args”: “[4, 3]”,
“retries”: 0,
“failed”: false,
“state”: “SUCCESS”,
“result”: “‘None'”,
“retried”: false,
“kwargs”: “{}”,
“runtime”: 0.049509562999446644,
“sent”: false,
“uuid”: “50c589e1-b613-496f-af1e-c94c04b163dc”
}
}

4. Terminating a task
curl -X POST -d ‘terminate=True’ http://localhost:5555/api/task/revoke/81775ebb-7d88-4e91-b580-b3a2d79fe668

References :
https://pypi.python.org/pypi/flower
http://flower.readthedocs.io/en/latest/api.html

http://nbviewer.jupyter.org/github/mher/flower/blob/master/docs/api.ipynb

 

 

Reading Java property file in Python

Accessing  a java property file in a python code is an easy task. For this we need to install, a python module called pyjavaproperties. (There are many other ways in which we can do this. I prefer this module)

For installing this, please run the following command :

sudo pip install http://pypi.python.org/packages/source/p/pyjavaproperties/pyjavaproperties-0.6.tar.gz

How to use it 

we have a property file named  config.properties and which is as follows,

config.properties
user=Crunchify
company1=Google
company2=eBay
company3=Yahoo
Now open a python ide and add the following lines
from pyjavaproperties import Properties
p = Properties()
p.load(open('test2.properties'))
p.list()     #will all the properties and its valuesprint                                               print p['user']   #prints Cruchify
Ref :                                                                                                   https://pypi.python.org/pypi/pyjavaproperties                                                           https://www.versioneye.com/python/pyjavaproperties/0.6

Invoking Celery Tasks from Java Application – Part #1

Invoking a celery task from java application is not hassle but not an easy one either.  This java celery integration was  implemented  with the help of a message broker/queue  and  what I chose for this was RabbitMQ.  There are many options out there  for message broker but I opted this as I had used it earlier with celery.

This article is based on the assumptions that  readers have a bit prior experience with celery and RabbitMQ. For a start, we can go through a very very brief introduction. (might write a few articles on celery, if time favours ..)

As we know, celery is task queue which absorbs messsags from message queues (iie here RabbitMQ ) and execute them in celery worker threads. Here, in order to trigger a celery task from java application, what we need to is, to create a rabbitMQ queue and push the messages to this queue from java application, in an appropriate format ie format that celery tasks messages adhere to (http://docs.celeryproject.org/en/latest/internals/protocol.html). Then We need to  define tasks and start  worker thread to execute these tasks as messages are available in this message queue, (as the java application pushes them).

Following are the steps I have done to make it working  :

1. Download java – rabbitMQ client library   from here                                                                                                                    Then extract the zip file and copy all the jar files into the root directory of the project.

2.  Now we need to write the java code to integrate RabbitMQ and push tasks messages to  specific RabbitMQ queues.          This needs to be placed in the root directory of the project where we have just copied jar files into, in the previous step.

Send.java

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.AMQP;
public class Send {
public static void main(String[] argv) throws Exception {
String QUEUE_NAME = “celery”;
ConnectionFactory factory = new ConnectionFactory();
factory.setHost(“localhost”);
Connection connection = factory.newConnection();
Channel channel = connection.createChannel();
channel.queueDeclare(QUEUE_NAME, true, false, false, null);
String message = “{\”id\”: \”4cc7438e-afd4-4f8f-a2f3-f46567e7ca77\”, \”task\”: \”tasks.add\”, \”args\”: [1,2], \”kwargs\”: {}, \”retries\”: 0, \”eta\”: \”2009-11-17T12:30:56.527191\”}”;
channel.basicPublish(“”, QUEUE_NAME, new AMQP.BasicProperties.Builder()
.contentType(“application/json”).contentEncoding(“utf-8”)
.build(), message.getBytes(“utf-8″));
System.out.println(” [x] Sent ‘” + message + “‘”);
channel.close();
connection.close();
}
}

In this, we are sending messages to a queue named “celery” . Also the format of the message is specified in the variable message and from which it is clear that name of the task is tasks.add and arguments are 1 and 2. This is equivalent to calling add.delay(1,2). Hence bow the task message for addition has been sent to the queue celery.

3.  Now we need to define the celery task.                                                                                                                                      In the root directory of the project add following python files.

tasks.py

 

from celery import Celery
app = Celery(‘tasks’)
app.config_from_object(‘celeryconfig’)

@app.task
def add(x, y):
# return x + y
print “haiii”
print x+y

celeryconfig.py

CELERY_IMPORTS = (“tasks”, )
CELERY_RESULT_BACKEND = “amqp”
BROKER_URL = “amqp://guest:guest@localhost:5672//”
CELERY_TASK_RESULT_EXPIRES = 300

4.  Now we can start celery worker to execute the tasks messages

From the root directory of the project open a terminal and run the following command :                                                           celery -A tasks worker –loglevel=info    

here -A stands for application name, since the argument passed in while creating celery instance is                                       tasks ( app =Celery(“tasks”) ), here our application name is tasks.

5. compile and run Send.java to push celery asks to rabbitMQ queue :

 javac -cp rabbitmq-client.jar Send.java                                                                                                                                      java -cp .:commons-io-1.2.jar:commons-cli-1.1.jar:rabbitmq-client.jar Send

Here we use the parameter -cp to keep the required jar files in class path. With this we have successfully pushed the          tasks messages to queue.

6. Now open the terminal where celery worker is started, you would get the following result.

Screenshot from 2016-05-26 17:33:31