March 1, 2015

esProc Helps Process Heterogeneous Data Sources in Java - HDFS

It is not difficult for Java to access HDFS through API provided by Hadoop. But to realize computations, like grouping, filtering and sorting, on files in HDFS in Java is troublesome. esProc is a good helper in Java’s dealing with these computations. It can execute the access to HDFS too. With the help of esProc, Java will increase its ability in performing structured and semi-structured data computing, like the above-mentioned computations. Let’s look at how it works through an example.

The text file employee.gz in HDFS contains the employee data. You are required to import the data and select the female employees who were born on and after January 1st, 1981. The text file has been zipped with gzip in HDFS and cannot be loaded to the memory entirely.

The data in employee.gz is as follows:
EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY
1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000
2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000
3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000
4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000
5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000
6       Matthew Johnson   M     California 1984-07-07       2005-07-07       Sales         11000
7       Alexis        Smith        F       Illinois       1972-08-16       2002-08-16       Sales         9000
8       Megan     Wilson      F       California 1979-04-19       1984-04-19       Marketing        11000
9       Victoria    Davis        F       Texas        1983-12-07       2009-12-07       HR    3000
10     Ryan         Johnson   M     Pennsylvania    1976-03-12       2006-03-12       R&D          13000
11     Jacob        Moore      M     Texas        1974-12-16       2004-12-16       Sales         12000
12     Jessica     Davis        F       New York 1980-09-11       2008-09-11       Sales         7000
13     Daniel       Davis        M     Florida      1982-05-14       2010-05-14       Finance    10000
Implementation approach: Call the esProc script with Java program, import and compute the data, then return the result to Java program in the form of ResultSet.

First, you should develop and debug program in esProc’s Integration Development Environment (IDE). The preparatory work is to copy the core packages and the configuration packages of Hadoop to “esProc’s installation directory\esProc\lib”, such as commons-configuration-1.6.jarcommons-lang-2.4.jarhadoop-core-1.0.4.jarHadoop1.0.4.


Because esProc supports analyzing and evaluating expressions dynamically, it will enable Java to filter the data in HDFS file as flexibly as SQL does. For example, to query the data of female employees who were born on and after January 1st, 1981, esProc will use an input parameter “where” as the condition, as shown in the figure below:

“where” is a string, its value is BIRTHDAY>=date(1981,1,1) && GENDER=="F".
The code in esProc is as follows:

A1: Define a HDFS file object cursor with the first row being the title and tab being the default field separator. The zipping mode is determined by the filename extension. Here gzip is used. esProc also supports other zipping modes. UTF-8 is a charset, which is a JVM charset by default.

A2: Filter the cursor according to the condition. Here macro is used to realize analyzing the expression dynamically, in which “where” is the input parameter. esProc will first compute the expression surrounded by ${…}, take the computed result as the macro string value and replace ${…} with it, then interpret and execute the code. The final code executed in this example is =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F").

A3: Return the cursor. If the filtering condition is changed, you only need to change the parameter “where” without modifying the code. For example, you are required to query the data of the female employees who were born on January 1st, 1981, or of the employees in which NAME+SURNAME is ”RebeccaMoore”. The code for the value of “where” can be written as BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore".

The code for calling this block of code in Java with esProc JDBC is as follows (save the esProc program as test.dfx and put the Hadoop jars needed by HDFS in Java’s classpath):
          // create a connection using esProc jdbc
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
// call the program in esProc (the stored procedure); test is the file name of dfx
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?)");
//set the parameters
st.setObject(1," BIRTHDAY>=date(1981,1,1) && GENDER==\"F\" ||NAME+SURNAME==\"RebeccaMoore\"");// the parameters are the dynamic filtering conditions
// execute esProc stored procedure
st.execute();
// get the result set, which is the eligible set of employees
ResultSet set = st.getResultSet();

February 26, 2015

esProc Helps Process Heterogeneous Data sources in Java - Hive

It is easy for Java to connect to Hive using JDBC. But the computational ability of Hive is less than that of SQL in other databases. So to deal with uncommon computations, data should be retrieved before further operation is performed using Java. Thus the code for will be complicated.

But if esProc is used to help with the Java programming, the complicated operation as a result of using Hive in Java will become simpler. The following example will show how esProc works with Java in detail. orders is a table in Hive containing the detailed data of sales orders. Now it is required to compute the year-on-year comparison and link relative ratio. The data is as follows:

ORDERID CLIENT     SELLERID AMOUNT ORDERDATE
1       UJRNP      17     392  2008/11/2 15:28
2       SJCH         6       4802         2008/11/9 15:28
3       UJRNP      16     13500       2008/11/5 15:28
4       PWQ         9       26100       2008/11/8 15:28
5       PWQ         11     4410         2008/11/12 15:28
6       HANAR     18     6174         2008/11/7 15:28
7       EGU 2       17800       2008/11/6 15:28
8       VILJX         7       2156         2008/11/9 15:28
9       JAYB          14     17400       2008/11/12 15:28
10     JAXE          19     19200       2008/11/12 15:28
11     SJCH         7       13700       2008/11/10 15:28
12     QUICK      11     21200       2008/11/13 15:28
13     HL    12     21400       2008/11/21 15:28
14     JAYB          1       7644         2008/11/16 15:28
15     MIP  16     3234         2008/11/19 15:28


Link relative ratio refers to comparison between the current data and data of the previous period, using month as the time interval. For example, divide the sales figure in April by that in March and we get the link relative ratio of April. Year-on-year comparison is the comparison between the current data and data of the corresponding period of the previous year, which means, for example, dividing the sales figure of April 2014 by that of April 2013. Since Hive provides no window functions, it cannot complete the computation unless using nested SQL. But Hive supports very poor subquery and usually the computation should be performed outside of the database. With esProc, however, the computation can be realized easily. The code is as follows:

A1: Connect to the database through JDBC using the datasource Hive defined in advance.

A2: Query the data in the database by the time period using external parameters begin and end. Such as begin="2011-01-01 00:00:00", end="2014-07-08 00:00:00" (i.e. the current date which can be obtained using now() function).

A3: Group orders by the year and the month and sum up to get the sales of each month.

A4: Add a new field Irr, which is the monthly link relative ratio. The expression is mAmount/mAmount[-1], in which mAmount represents the sales in the current time period and mAmount[-1] represents the sales in the previous one. Note that the link relative ratio of the initial month (January of 2011) is empty.

A5: Sort the data in A4 by the month and the year before we compute the year-on-year comparison. Complete code should be =A4.sort(m,y). But since A4 has been sorted by the year, here we just need to sort it by the month, that is A4.sort(m), which has a better performance.

A6: Add another new field yoy, which is the year-on-year comparison of the 
monthly sales figure. The expression is if(m==m[-1],mAmount/mAmount[-1],null), which means the year-on-year comparison is valid only between the same months of the two time periods. The year-on-year comparison of each month in the initial year (the year of 2011) is empty.

A7Sort the data in A6 by the year in descending order and by the month in ascending order. Note that the data is valid up to July of 2014. The result is as follows:
A8: Close Hive database connection.

A9: Return the result.

This block of code can be called by Java using esProc JDBC to get the final result (the above esProc program will be saved as test.dfx). The code for this is as follows:
          // create a connection using esProc jdbc
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
// call the esProc program (the stored procedure); test is the file name of dfx
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?,?)");
// set the parameters
st.setObject(1,"2011-01-01 00:00:00");//begin
st.setObject(1,"2014-07-08 00:00:00");//end
// execute esProc stored procedure
st.execute();
// get the result set
ResultSet set = st.getResultSet();

It is the same way in which esProc accesses Hive and other ordinary databases. Just configure their JDBC while detailed process is omitted here. 

January 30, 2015

esProc Helps Process Heterogeneous Data Sources in Java - JSON

Java’s JSON open source package can only parse JSON data and hasn’t the computational function. It is troublesome for programmers to develop a general program for performing computations, such as grouping, sorting, filtering and joining, by themselves. For example, during developing program for performing conditional filtering in JSON files using Java, the code has to be modified if the conditional expression is changed. If they want to make it as flexible as SQL in performing conditional filtering, they have to write code for analyzing and evaluating expressions dynamically. This requires quite a lot of programming work.

esProc supports dynamic expressions. It also can be embedded in Java to write the general program for computing JSON data. Let’s give an example to see how it works. There are to-be-processed JSON strings that contain employee information, including fields such as EID, NAME, SURNAME, GENDER, STATE, BIRTHDAY, HIREDATE and DEPT, etc. Parse the strings and select female employees who were born on and after January 1st, 1981. The content of the strings is shown as follows:

[{EID:1,NAME:"Rebecca",SURNAME:"Moore",GENDER:"F",STATE:"California",BIRTHDAY:1974-11-20,HIREDATE:2005-03-11,DEPT:"R&D",SALARY:7000},
{EID:2,NAME:"Ashley",SURNAME:"Wilson",GENDER:"F",STATE:"New York",BIRTHDAY:1980-07-19,HIREDATE:2008-03-16,DEPT:"Finance",SALARY:11000},
{EID:3,NAME:"Rachel",SURNAME:"Johnson",GENDER:"F",STATE:"New Mexico",BIRTHDAY:1970-12-17,HIREDATE:2010-12-01,DEPT:"Sales",SALARY:9000},…]

Implementation approach: Call esProc program using Java and input the JSON strings which will then be parsed by esProc, perform the conditional filtering and return the result in JSON format to Java. Because esProc supports parsing and evaluating expression dynamically, it enables Java to filter JSON data as flexibly as SQL does.


For example, it is required to query female employees who were born on and after January 1, 1981. esProc can input two parameters: “jsonstr” and “where”, as the conditions. This is shown as follows:
“where” is a string, its values is BIRTHDAY>=date(1981,1,1) && GENDER=="F".

The code written in esProc is as follows:
A1: Parse the JSON data into a table sequence. esProc’s IDE can display the imported data visually, as shown in the right part of the above figure.

A2: Perform the conditional filtering, using macro to realize parsing the expression dynamically. The “where” in this process is an input parameter. In executing, esProc will first compute the expression surrounded by ${…}, take the computed result as macro string value, replace ${…} with it and then interpret and execute the code. The final code to be executed in this example is =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F").

A3: Generate JSON strings using the filtered table sequence.

A4: Return the eligible result set to the external program.
When the filtering condition is changed, you just need to modify “where”– the parameter. For example, it is required to query female employees who were born on and after January 1, 1981, or employees whose NAME+SURNAME is equal to “RebeccaMoore”. The value of “where” can be written as BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore". After the code is executed, the result set in A2 is as follows:
Since the esProc script is called in Java through the JDBC interface, the returned result is set - the object of ResultSet. Fetch the first field of string type in set, and this is the filtered JSON string. Detailed code is as follows (save the above program in esProc as test.dfx):
         // create a connection
         Class.forName("com.esproc.jdbc.InternalDriver");
         con= DriverManager.getConnection("jdbc:esproc:local://");
         // call the program in esProc (the stored procedure); test is the name of file dfx
         com.esproc.jdbc.InternalCStatementst;
         st =(com. esproc.jdbc.InternalCStatement)con.prepareCall("call json(?,?)");
         // set the parameters; as the JSON string is long, part of it is omitted.
// In practice, JSON strings may be generated by various ways; see below for the explanation.
          String jsonstr=
"[{EID:1,NAME:\"Rebecca\",SURNAME:\"Moore\",GENDER:\"F\",STATE:\"California\...}]";
         st.setObject(1,jsonstr);
         st.setObject(2,"BIRTHDAY>=date(1981,1,1) && GENDER==\"F\"");
// execute the esProcstored procedure
         ResultSet set=st.executeQuery();
         // get the filtered JSON string
         String jsonstrResult;
         if(set.next()) jsonstrResult = set.getString(1);

JSON format is common used by interactive data in internet application. In practice, JSON strings may be retrieved from local files or remote HTTP server. esProc can directly read JSON strings from the files or the HTTP server. Take the latter as an example. It is assumed that there is a testServlet which returns JSON strings of employee information. The code for performing the operation is as follows:
A1: Define an httpfile object, the URL is
  http://localhost:6080/myweb/servlet/testServlet?table=employee&type=json.
A2: Read the result returned by the httpfile object.
A3: Parse the JSON string and generate a table sequence.

A4: Filter data according to the conditions.
A5: Convert the filtered table sequence to JSON strings.

A6: Return the result in A4 to the Java code that calls this piece of esProc program. 

January 27, 2015

esProc Helps Process Heterogeneous Data Sources in Java - MongoDB

It is convenient to establish and close a connection to MongoDB in esProc , as well as to call the database to query and count the data, perform distinct and aggregate operations. 

1. Preparation for connecting to MongoDB

The MongoDB Java driver (like mongo-java-driver-2.12.2.jar), which esProc does’nt provide, should be put into [esProc installation directory]\common\jdbc beforehand when esProc designer is used to access the database. The MongoDB Java driver’s download link is https://github.com/mongodb/mongo-java-driver/releases.

To access MongoDB by calling the esProc script through esProc JDBC in a Java program, the above MongoDB Java driver needs to be put into the classpath of the Java program. 

2. Create and close a connection to MongoDB

In an esProc script, mongodb(con) function is used to establish a connection to the MongoDB database and mdb.close() function is used to close it, as shown by the following code:

A1: Connect to MongoDB. IP and port are localhost:27017; database name is test, and so are the user name and the password.
A2: Close the MongoDB connection. 

3. Common data query and selection

Let’s take the employee, a collection in the database whose name is test, of MongoDB as an example to illustrate the esProc script for accessing MongoDB. Some of the data of employee are as follows:
{ "_id" : ObjectId("546aea22ee62ea452bdef4fe"), "EID" : 1, "NAME" : "Rebecca", "
SURNAME" : "Moore", "GENDER" : "F", "STATE" : "California", "BIRTHDAY" : "1974-1
1-20", "HIREDATE" : "2005-03-11", "DEPT" : "R&D", "SALARY" : 7000 }
{ "_id" : ObjectId("546aea22ee62ea452bdef4ff"), "EID" : 2, "NAME" : "Ashley", "S
URNAME" : "Wilson", "GENDER" : "F", "STATE" : "New York", "BIRTHDAY" : "1980-07-
19", "HIREDATE" : "2008-03-16", "DEPT" : "Finance", "SALARY" : 11000 }
{ "_id" : ObjectId("546aea22ee62ea452bdef500"), "EID" : 3, "NAME" : "Rachel", "S
URNAME" : "Johnson", "GENDER" : "F", "STATE" : "New Mexico", "BIRTHDAY" : "1970-
12-17", "HIREDATE" : "2010-12-01", "DEPT" : "Sales", "SALARY" : 9000 }

The following is a sample of esProc script for querying data of the collection of MongoDB: 

A2: Filter and query the data of employee. find function’s first parameter is employee, the collection’s name; its second parameter is the filtering criterion: STATE=” California” and SALARY >4,000; the third one specifies which properties should be fetched (here GENDER won’t be fetched). Noticed that esProc uses the same parameter format as that used in MongoDB’s find statement.

The function returns a cursor, which can fetch data in batch for computing when the result set is big, so as to avoid the memory overflow. The usage of cursor has been explained in detail in esProc-related documents. Here suppose the data are not big and use fetch function to fetch all the results for observation: 

4.Query and filter nested arrays and documents

Nested arrays and docements are multilevel data structure specific to MongoDB. The Java application doesn’t support this multilevel data structure when it accesses MongoDB through JDBC. Because esProc supports referencing data by fields, like referencing a record sequence by a field, it can fetch the nested arrays and documents of MongoDB.

For a nested array, let’s look at a collection – fruitshop:
{ "_id" : ObjectId("548a58c19cfbc9dd702eb71e"), "name" : "big fruit", "fruits" :
 [ "apple", "pear", "orange" ] }
{ "_id" : ObjectId("548a58c19cfbc9dd702eb71f"), "name" : "good fruit", "fruits"
: [ "banana", "pear", "orange" ] }
{ "_id" : ObjectId("548a58c29cfbc9dd702eb720"), "name" : "my fruit", "fruits" :
[ "banana", "apple", "tomato" ] }

And for a nested docment, let’s look at this collection – blogs:
{ "_id" : ObjectId("548a59fa9cfbc9dd702eb721"), "content" : "It is too hot", "co
mment" : [ { "author" : "joe", "score" : 3, "comment" : "just so so!" }, { "auth
or" : "jimmy", "score" : 5, "comment" : "cool! good!" } ] }
{ "_id" : ObjectId("548a59fa9cfbc9dd702eb722"), "content" : "It is too cold", "c
omment" : [ { "author" : "james", "score" : 1, "comment" : "yes!" }, { "author"
: "jimmy", "score" : 5, "comment" : "cool!" } ] }
{ "_id" : ObjectId("548a59fb9cfbc9dd702eb723"), "content" : "It is windy day tod
ay", "comment" : [ { "author" : "tom", "score" : 3, "comment" : "I do not think
so!" }, { "author" : "jimmy", "score" : 5, "comment" : "cool!" } ] }

esProc script for querying and filtering data of the two collections: 

A2: The filtering criterion of find function includes the pear insead of being equal to pear, for the fruits is an array. The return result of find funciton is a cursor, from which the resulting table sequence is fetched using fetch function and you can see fruits field is a referencing field, which references a sequence consisiting of various fruits names, as shown below: 

Click the blue part of the first row to see the detailed information of the referenced sequence, as shown below: 

Based on this table sequence with a referencing field, esProc can further perform more complicated set operations.

A3: comment is a collection of documents, and the filtering criterion is the comment’s documents whose author is jimmy. find function returns a cursor, from which the table sequence is fetched using fetch function. comment field of the table sequence is a referencing field that references a table sequence, as shown below: 

Click the blue part of the second row to see the detailed information of the table sequence, as shown below: 

Based on this table sequence with a referencing field, esProc can further perform more complicated set operations. Here further discussion is omitted. 

5. Count the data, perfor distinct and aggregate operations

With more powerful ability in computing and flow control, usually esProc is used to perform the complicated multi-step operation and MongoDB, a highly-efficient storage solution, is responsible for providing data without participating in the computation. But considering the performance loss during the data transmission, you can use MongoDB alone to perform some simple aggregate operations.

esProc provides the way of calling MongoDB interface to count the data, perform distinct and aggregate operations. Different from the find function for querying and filtering data, the return results of the three operations are table sequences instead of cursors. A sample code is as follows: 

A2: Count the number of eligible files in employee, a collection, using count function. The function’s first parameter is employee, the collection’s name; its second parameter is the filtering criterion: STATE=” California”. The result is count value: 

A3: Get the distinct values from employee using distinct function. The function’s first parameter is employee, the collection’s name, and the second parameter is one of the properties, DEPT. The return result is a table sequence consisting of employee’s DEPT without duplicate members: 
A4: Group and summarize employee. The first parameter of aggregate function is employee, the collection’s name, and the second one includes the grouping property and aggregate property, which means the data of the collection will be grouped by DEPT and then count the number of files in each group. aggregate function returns a cursor as well, whose data will be fetched all at once using fetch function. 

A5: Group and summarize employee, which is the same as the above. But another aggregate property is added to the second parameter of aggregate function, which means grouping the data of the set by DEPT and computing the average value of SALARY in each group. 


January 21, 2015

esProc Helps Process Heterogeneous Data Sources in Java –Cross-Database Relating

JoinRowSet and FilteredRowSet provided by RowSet– Java’s class library for data computing – can perform cross-database related computing, but they have a lot of weaknesses. First, JoinRowSet only supports inner join, it doesn’t support outer join. Second, test shows that db2, mysql and hsql can work with JoinRowSet, yet the result set of join oracle11g to other databases is empty though no error reporting will appear. The fact is there were two users who perform cross-database join using oracle11g database even got the correct result. This suggests that JDBC produced by different database providers will probably affect the result obtained by using this method. Last, the code is complicated.

esProc has proved its ability in assisting Java to perform cross-database relating. It can work with various databases, such as oracle, db2, mysql, sqlserver, sybase and postgresql, to perform a variety of cross-database related computing, like inner join and outer join involving heterogeneous data. An example will teach you the way esProc works. Requirement: relate table sales in db2 to table employee in mysql through sale.sellerid and employee.eid, and then filter data in both sales and employee according to the criterion state=”California”. The way the code is written in this task applies to situations where other types of databases are involved.


The structure and data of table sales are as follows: 

The structure and data of table employee are as follows:

Implementation approach: Call esProc script using Java program, join the multiple databases together to realize the cross-database relating, perform filtering and return the result to Java in the form of ResultSet.

The code written in esProc is as follows:

A1: Connect to the data source db2 configured in advance.
A2: Connect to the data source mysql configured in advance. In fact oracle and other types of databases can be used too.
A3, A4: Retrieve table sequences: sales and employee, from db2 and mysql respectively. esProc’s Integration Development Environment (IDE) can display the retrieved data visually, as shown in the right part of the figure in the above.
A5: Relate sales to employee through sellerid=eid using esProc’s object reference mechanism.
A6: Filter the two table sequences according to state="California".
A7: Generate a new table sequence and get the desired fields.
A8,A9: Close the connection
A10: Return the result to the caller of esProc program.

This piece of program is called in Java using esProc JDBC to get the result. The code is as follows (save the above esProc program as test.dfx):
//create a connection using esProcjdbc
Class.forName("com.esproc.jdbc.InternalDriver");
con= DriverManager.getConnection("jdbc:esproc:local://");
// call esProc program (the stored procedure) in which test is the name of file dfx
com.esproc.jdbc.InternalCStatementst;
st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");
// execute esProc stored procedure
st.execute();
// get the result set
ResultSet set = st.getResultSet();