October 19, 2014

esProc Helps Process Structured Texts in Java – Alignment Join

The join statements of the database can be used conveniently to perform the operation of alignment join. But sometimes the data is stored in the text files, and to compute it in Java alone we need to write a large number of loop statements. This makes the code cumbersome. Using esProc to help with programming in Java can solve the problem easily and quickly. Let’s look at how this works through an example.

The text file emp.txt contains employee information, except that in which EId is 1. Another text file sOrder.txt contains information of sales orders in which field SellerId corresponds to field EId inemp and from which the information whose SellerId is 2 is excluded. Part of the original data is listed below:



It is required to join the three fields: Name, Dept and Gender, in emp to sOrder in alignment and output the computed result to a new file. The expected result is as follows:

Code written in esProc:

In cells A1 and A2 respectively, data is imported from the two text files and stored in two variables: emp and sOrder. Here import function uses tab as the column separator by default. Option @t represents the first row will be imported as the field names. Because only some of the fields in emp.txt are needed, the code in A1 uses the names of these desired fields as parameters. After execution, values of emp and sOrder are as follows:

In the code in A3: =join@1(sOrder:s,SellerId;emp:e,EId), join function performs the operation of alignment join and changes the names of the two tables to s and e respectively. Option @1 represents the left join which is in line with the requirement of the example: join emp to sOrder in alignment. The computed result is as follows:

Click the numbers in blue and we can see the detailed information, as shown below:

esProc can also be used to realize the right join which only requires exchanging positions of data in alignment. For example, to align sOrder according to emp, we just need to exchange their positions in the code, that is, =join@1(emp:e,EId;sOrder:s,SellerId). The computed result is as follows:

It is also easy to realize the full join using option @f. The code is join@f(sOrder:s,SellerId;emp:e,EId). The computed result is as follows:

There are altogether four operations of alignment join: left join, right join, full join and inner join. By default, join function is used to execute the inner join, the code is =join(sOrder:s,SellerId;emp:e,EId). The computed result is as follows:

Let‘s get back to the example. The code in A4: =A3.new(s.OrderID, s.Client, s.SellerId, s.Amount, s.OrderDate,e.Name, e.Dept, e.Gender), is for getting the desired fields from table eand creating a new structured two-dimensional table. The computed result is as follows:
Now the alignment is done and data needs to be exported to a new file. The code for this is =file("E: \\result.txt").export@t(A4). In export function, tab is by default the column separator and option @t represents the field names are exported to the first row. Open result.txt and we can see information as follows:

         //create a connection using esProcjdbc
         con= DriverManager.getConnection("jdbc:esproc:local://");
         //call esProc script; the name of the script file is test
         st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");
         // execute esProc stored procedure

By executing the above Java code, emp will be joined to sOrder in alignment and the result will be output to file result.txt.

If the requirement is changed to this: query data in sOrder according to dynamic periods of time, execute the same operation of alignment join and return the result directly to Java. To complete the task esProc needs to define two parameters: begin and end, to represent starting time and ending time respectively. The esProc code is as follows:

A2Filter sOrder again using select function according to the starting and ending time passed from Java, that is, @begin and @end.
A5Output the computed result in A4 to JDBC interface.
And Java code should be modified too to pass parameters to esProc code and get thefinal result. The modified code is as follows:
         con= DriverManager.getConnection("jdbc:esproc:local://");
         st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test(?,?)");
         ResultSet set = st.getResultSet();

October 15, 2014

esProc Simplifies SQL-style Computation –Records Corresponding to Max Value

In developing database applications, usually it is the records corresponding to the max/min value that we need to retrieve, instead of the value itself. For example, the occasion in which each employee gets his/her biggest pay raise; the three lowest scores ever got in golf; the five days in each month when each product gets its highest sales amount; and so on. As the max function of SQL can only retrieve the max value, instead of the records to which the max value corresponds, it is quite complicated to handle the computation in SQL with the aid of some advanced techniques, such as window functions or the nested sub-queries or keep/top/row number. If multi-layered grouping or relations are involved, the computation will become even more complicated.

With the top function in esProc, the records corresponding to the max value can be retrieved and the computation becomes much easier. The following is such an example.

The database table golf contains the scores of members in a golf club. Please select the best three scores each member has ever got. Part of the data is as follows:

The code written in esProc:

A1Retrieve data from the database. If the data come from a structured text file, the following equivalent code can be used: =file("\\golf").import@t(). Click the cell and we can check the retrieving result:

A2=A1.group(User_ID), i.e., group the result of A1. The result is as follows:

As shown in the above figure, the data have been separated into multiple groups by User_ID and each row is a group. Click the blue hyperlink and members of the group will be shown as follows:

A3=A2.(~.top(-Score;3)).The code is to compute the records of each group of data whose field Score is in the top three. Here "~" represents each group of data. ~.top() represents that top function will work on every group of data in turn. The top function can retrieve the N biggest/smallest records from a data set. For example, top(Score;3) represents sorting by Score in ascending order and fetching the first three records (i.e. min values); top(-Score;3) represents sorting in descending order and fetching the first three records (i.e. max values). The result of this step is as follows:

A4=A3.union(), which means union data of every group. The result is as follows:
In the above, the computation is performed step by step. But the steps can be integrated into one for the convenience of maintenance and debugging:db.query("select * from golf").group(User_ID). (~.top(-Score;3)).union().

The method with which a Java program calls esProc is similar to that with which it calls an ordinary database. The JDBC provided by esProc can be used to return a computed result of the form of ResultSet to Java main program. For more details, please refer to the related documents 

October 14, 2014

esProc Simplifies SQL-style Computation – In-group Computation

During developing the database applications, we often need to perform computations on the grouped data in each group. For example, list the names of the students who have published papers in each of the past three years; make statistics of the employees who have taken part in all previous training; select the top three days when each client gets the highest scores in a golf game; and the like. To perform these computations, SQL needs multi-layered nests, which will make the code difficult to understand and maintain. By contrast, esProc is better at handling this kind of in-group computation, as well as easy to integrate with Java and the reporting tool. We’ll illustrate this through an example.

According to the database table SaleData, select the clients whose sales amount of each month in the year 2013 is always in the top 20. Part of the data of SalesData is as follows:
To complete the task, first select the sales data of the year of 2013, and then group the data by the month and, in each group, select the clients whose monthly sales amount is in the top 20. Finally, compute the intersection of these groups.

With esProc we can split this complicated problem into several steps and then get the final result. First, retrieve the data of 2013 from SaleData and group it by the month:

Note: The code for filtering in A2 can also be written in SQL.

It is the real grouping that esProc separates data into multiple groups. This is different from the case in SQL, whose group by command will compute the summary value of each group directly and won’t keep the intermediate results of the grouping. After grouping, the data in A3 are as follows:

esProc will sorts the data automatically before grouping. Each group is a set of sales data. The data of March, for example, are as follows:

In order to compute every client‘s sales amount of each month, we need to group the data a second time by clients. In esProc, we just need to perform this step by looping the data of each month and group it respectively. A.(x) can be used to execute the loop on members of a certain group, with no necessity for loop code.
In A4, the data of each month constitute a subgroup of each previous group after the second grouping:

At this point, the data of March are as follows:
It can be seen that each group of data in March contains the sales data of a certain client. 

Please note "~" in the above code represents each member of the group, and the code written with “~” is called in-group computation code, like the above-mentioned ~.group(Client).

Next, select the clients whose rankings of each month are in the top 20 through the in-group computation:

A5 computes the top 20 clients of each month in sales amount by looping each month's data. A6 lists the clients and their sales amount every month. The result of A6 is as follows:

Finally, list the field Client of each subgroup and compute the intersection of the subgroups:

A7 computes the top 20 clients of each month in sales amount. A8 computes the intersection of the field Clients of the twelve months. The result is as follows:
As can be seen from this problem, esProc can easily realize the in-group computation, including the second group and sort, on the structured data, make the solving way more visually, and display a clear and smooth data processing in each step. Moreover, the operations, like looping members of a group or computing intersection, become easier in esProc, which will reduce the amount of code significantly.

The method with which a Java program calls esProc is similar to that with which it calls an ordinary database. The JDBC provided by esProc can be used to return a computed result of the form of ResultSet to Java main program. For more details, please refer to the related documents 

October 13, 2014

Programming Languages for Desktop Data Analysis

For a programming language for desktop data analysis, the most important is that it must be user-friendly and have great computing power. We can judge whether a language is suitable for analyzing data on desktop from six aspects: application environment, file processing, text and string processing, structured data processing, predictive modeling algorithms and other less important indicators.
Application environment
Most users who make desktop data analysis are not professional programmers. They are accustomed to jobs under Windows, lack the skill of environment configuration which a professional should have. So the application environment of the programming language for desktop data analysis should be simple, Windows compatible and easy to install and configure.

In this respect, both esProc and R language have done well. They have simple application environment which can be used directly after installation. Python itself can manage without problem, but Pandas – frequently used to improve the computing capacity– has complex installation under Windows and is very strict about versions, though it is easily installed under Linux. Produced by Microsoft, SSAS gets along quite well with Windows except thatthe installation and configuration is a little complicated.
File processing
TXT and XLS are file formats most likely to be generated in routine work. And the degree of support for them shows the degree of ease of use of an interpreted language.

Generally, all analysis languages support TXT directly. In the case of XLS, the situation is complicated, especially the installation of third party modules and version compatibility. Since both Excel and SSAS are Microsoft products, SSAS can support XLS seamlessly. esProc, SPSS and SAS don’t need third party packages too, they can access XLS directly. Pandas is special, because it can directly support XLS, but, in itself, it is the third party library function; and its support for certain versions of XLS is limited. R language needs third party library function and Perl operating environment, and requires that versions of the three be matched; this makes installation complicated.

SSAS is easiest to use in read/write. Pandas and R language provide abundant parameters.

The ability to process big files should be taken into account, such as processing while the files are being imported. esProc is the best for it, with the most concise code.
Text and string processing
Sometimes, the disordered, semi-structured or non-structured raw data needs to be preprocessed to become the easy-to-use structured data. Thus the text and string processing is another focus in evaluating an interpreted language.
In this aspect, Python is the best and R is satisfactory. esProc comes next and SSAS is the worst.
Structured data processing
In practice, most of the data to be analyzed on desktop is the structured data. Therefore, the most common operation in desktop data analysis is the structured data computing and the corresponding computing power becomes the core competence of programming languages for desktop data analysis.

esProc is the most professional one in this field because it is specially designed for structured data computing. R language is unprofessional, especially in ordered data computing, though dataframe - a new data type – has been created for it. Pandas' dataframe is developed and improved on the basis of R’s, which makes it as able as R language but easier to understand. By contrast, SPSS, SAS and SSAS boast little in performing structured data computing.

Ordered data computing includes operations like link relative ratio, year-on-year comparison, fetching data in a relative interval, rank ordering during grouping data, and getting records in the top or bottom. It often involves relative position and cross-row and –group, and is a typical case of desktop data analysis. With inherent serial numbers, esProc performs the best in ordered data computing. Python and R language perform well, but because their basic element is the vector  instead of the record, the code written in them is elusive and more suitable for scientific use.

Predictive modeling algorithms
The predictive modeling is mainly used in scientific field and not common in desktop data analysis. Yet it is still an essential indicator.

R language boasts the latest, the richest and the most mature third party algorithms. SASS is easy to use but inflexible. Python/Pandas has always been trying to catch up with and imitate R language. It achieves an easier to understand syntax but hasn't been fully-formed. SAS and SPSS have established their authority in this regard. By contrast, esProc almost hasn't any ready-made predictive modeling algorithms.

Other less important indicators
Some less important indicators, such as the support for databases and parallel computing and graphing ability, also attract attention in special cases.

SASS works the best with databases, but it is not good at heterogeneous computing between text files, databases and self-defined data. esProc also gets along well with databases and performs satisfactorily in handling situations involving heterogeneous data sources. But Pandas, R Language, SAS and SPSS are not good in dealing with the both.

SASS is an expert at graphing, though it lacks flexibility. R language, esProc and Pandas, however, are flexible and have abundant inherent charts at the same time.

As to parallel computing, esProc has a built-in engine for it, which is easy to configure and develop. R language need third party software to perform parallel computing, resulting in complicated configuration and development. 

October 12, 2014

Comparison of esProc and R Language in Processing Text Files

As languages for data computations, both esProc and R language have rich functions to process text files. They have many similarities in basic usage, as well as obvious differences, such as in the aspect of processing files with fixed column width and big text files, reading and writing designated columns, computational performance, etc. The article aims to compare their similarities and differences.

1.Comparison of basic functions

There are six columns in sales.txt, they are separated from each other by tab \t. Lines are separated from each other by line break\n. The first row contains column names. Read the file into the memory and write it anew. The first rows of the file are as follows:
R language:
data<-read.table("e:\\sales.txt",sep="\t", header=TRUE)
write.table(data, file="e:\\ salesResult.txt",sep="\t",quote=FALSE,row.names=FALSE)
1.Both esProc and R language can do this job conveniently. esProc uses function option "@t"to represent that the first row contains column names, while R language uses "header=TURE" to do the same thing.

2. Line breaks are the most common separators for separating lines from each other. Both esProc and R language support line breaks by default. And tabs are the most common separators for separating columns from each other. esProc supports tabs by default. If other types of separators like comma are designated to be used, the code should be import@t(;",").In R language, default column separators are "blanks and tabs", which can mistakenly separate the Client column containing blanks into two columns, thus sep="\t" is needed to define separators as tabs. In addition, "quote=FALSE,row.names=FALSE" in the code represents that it is not necessary to put elements in quotes and to output row number.

3.Usually, files read into the memory will be stored as structured two-dimensional data objects, which are called table sequence(TSeq) in esProc or data frame (data.frame) in R language. Both TSeq and data.frame have rich computational functions. For example,group by Client and SellerID, then sum upAmount and find maximum. The code for esProc to perform the computations is:
As data.frame doesn’t directly support simultaneous use of multiple aggregation methods, two steps are needed to sum up and find maximum. Finally, cbind will be used to combine the results. See below:
4.  Except storing files as the structured two-dimensional data objects in the memory, esProc can access files by cursor objects. While R language can access files by matrix objects.
Conclusion:For basic file reading and writing, both esProc and TSeq provide rich functions to meet users' needs.

2.Reading files with fixed column width

In some files, fixed width, instead of separators, is used to differentiate one column from another. For example, read file static.txt which contains three columns of data into the memory and modify column names respectively to col1, col2 and col3, among which the width of col1 is 1, that of col2 is 4 and that of col3 is 3. 
data.new(mid(_1,1,1):col1, mid(_1,2,4):col2, mid(_1,6,8):col3)
R language:
data<-read.fwf("e:\\sales.txt ", widths=c(1, 4, 3),col.names=c("col1","col2","col3"))

R language does this job directly while esProc does it indirectly by reading the file into the memory first and split it into multiple columns. Note that in the code mid(_1,1,1), “_1” represents default column names, and if the file read into the memory has more than one column, the default column names will be in due order: _1_2_3 and so on.

Conclusion:R language is more convenient than esProc because it can read files with fixed column width.

3.Reading and writing designated columns

Sometimes only some of the data columns are needed in order to save memory and enhance performance. In this example, read columns ORDERID, CLIENT and AMOUNT into the memory and write ORDERID and AMOUNT to a new file.

R language:
data<-read.table("e:\\sales.txt",sep="\t", header=TRUE)
write.table(col2, file="e:\\ salesResult.txt", sep="\t",quote=FALSE,row.names=FALSE)

esProc does the job directly, while R language does it indirectly by reading all columns into the memory and saving designated columns in a new variable.
R language can only read all columns into the memory, which will occupy a relatively large memory.

4.Processing big text files

Big text files are files whose sizes are bigger than memory size. Usually they are processed by reading and computing in batches. For example, in big text file sales.txt, filter data according to the condition Amount>2000 and sum up Amount of each SellerID.

A1As reading the big text file into the memory at a time will result in memory overflow, it will be read in batches with cursor.

A2Read by loop with 100,000 rows of data each time and store them in TSeq A2.

B3Among each batch of data, filter out records whose order amount is greater than 2,000.

B4Group and summarize the filtered data, and seek each seller's sales amount in this batch.

B5Append the computed results of this batch to a certain variable (B1), and begin the computation of the next batch.

B6After the computations all batches are over, each seller's sales amount of each batch can be found in B1, execute another and the last grouping and summarizing to get the total sales amount of each seller.

R language:

1-4Create an empty data frame data to generate each batch's data frame databatch.

5-9Create an empty data frame agg to append the results of grouping and summarizing of each batch.

11-13Read in the file by rows, with 100,000 lines each time, but skip the column names of the first row.

15-21In each batch of data, filter out records whose order amount is greater than 2,000.

22Group and summarize the filtered data, and seek each seller’s sales amount of this batch.

23Append the computed results of this batch to a certain variable (agg), and begin the computation of next batch.

24After the computations of all batches are over, each seller's sales amount of each batch can be found in B1, execute another and the last grouping and summarizing to get the total sales amount of each seller. 

1. Both of them have the same way of thinking. Differences are that esProc does the job with library function and its code is concise and easy to understand, while R language needs to process a great deal of details manually and its code is lengthy, complicated and error-prone.

2 .With esProc cursor, the above computations can be performed more easily, that is:

In this piece of code, esProc engine can automatically process data in batches, and it is not necessary for programmers to control manually by loop statements.
Conclusion:In processing big text files, esProc code is more concise, more flexible and easier to understand than that of R language.

5.Processing big text files in parallel

Parallel computing can make full use of the resource of multi-core CPU and significantly improve computational performance. 

The example in the above part is still used here, but parallel computing is used. That is, divide sales.txt into four segments to give four CPU cores to perform computations, then filter data according to the condition Amount>2000 and compute the total sales amount of each seller.
Main programpro5.dfx
A1Set the number of parallel tasks as four, meaning the file would be divided into four segments.

A2Call subprogram to perform multi-threaded parallel computing, and there are two task parameters: to(A1) and A1. Value of to(A1) is [1,2,3…24], representing segment number assigned to each task; A1 is the total number of segments. When all the tasks are completed, all computed results will be stored in the current cell.

A3Merge the computed results of every task in A2 according to SellerID.

A4Group and summarize the merge results and seek each seller's sales amount.

A1Read the file with cursor, and decide which segment of the file the current task should process according to the parameter sent by the main program. Take the third task as an example, value of the parameter segment is 3 and that of parameter total is always 4.

A2Select records whose order amount is greater than 2,000.

A3Group and summarize the filtered data.

A4Return the computed results of current task to main program.

R language:It cannot do this job by using parallel computing.

esProc can read big text files segmentally by bytes, and designated part by skipping useless data and supporting multi-threaded parallel computing in the low level.

Though R language can perform parallel computing of in-memory data, it cannot read files in disk segmentally by bytes. It can also read data by skipping multiple rows, but this method has to traversal all useless data, resulting in poor performance and inability to perform parallel computing of big text files in the low level.

In addition, esProc can automatically manage the situation that there is only half line of data when segmenting by bytes, as shown in the above code, thus it is unnecessary for programmers to handle it manually.
esProc can process big text files in parallel and has a high computational performance. R language cannot perform the parallel computing of big text files in the low level and has a much poorer performance.

6.Computational performance

Under the same test circumstance, use esProc and R language to read a file of 1G size, and summarize one of the fields.

R language:
con<- file("d:/T21.txt", "r")
while( length(lines) != 0) {
         for(line in lines){
1. It takes esProc 26 seconds and R language 9 minutes and 47 seconds respectively to finish the task. Their gap exceeds an order of magnitude.
2..In processing big files, R language cannot use data frame objects and library function. It can only write loop statements manually and compute while the file is being read, so the performance is poor. esProc can directly use cursor objects and library function and has a higher performance. But there is no big difference between them when processing small files.

Summary: esProc's performance is far beyond that of R language in processing big text files.