September 18, 2014

Program Eight Queens Problem with esProc

Among the many chess puzzles, the Eight Queens Problem is the most famous. Place eight Queens on an 8×8 chessboard so that no two of them attack each other. How many layouts satisfy this requirement?

Since a Queen attacks along its row, its column and the diagonals through it, at most one Queen may appear in any of these lines; otherwise the layout is ineligible.

Since each row holds at most one Queen, a sequence of length 8 can be used to record, row by row, the column number of the Queen in that row. A row without a Queen is marked with zero. When a new Queen is placed on the board, its column number must not already appear in the sequence, which means no other Queen occupies that column.

At the same time, we also need to make sure the new Queen shares no diagonal with a Queen already placed. If a Queen is located in row m, column k, then in row m+n at most two squares lie on the same diagonals as that Queen: the squares whose horizontal distance from it equals the vertical distance n, namely (m+n, k+n) and (m+n, k-n).

So, after examining every possible position of the Queen in every row, we know the number of eligible layouts.

esProc can do all the work with loop computation, as shown in the following form:

i is employed during the computation to record the number of the row in which a Queen is currently being placed. The code in the 2nd line of the above form moves the Queen one column to the right on each pass of the loop, so that every square of the current row is traversed. The code in the 3rd line handles the moment the Queen reaches the 9th column: traversal of the current row is then complete, so the record for the current row is reset to zero and i becomes i-1, returning to continue the traversal of the previous row. Note that once the first row has been fully traversed, i drops to zero and the whole loop ends. The code in the 4th line deals with the first row: since no other Queen has been placed yet, we can go on to place the second Queen without any checks. The code in the 6th line checks whether an already placed Queen occupies the same column, and the code in the 7th line checks whether one occupies the same diagonal. If both checks pass, we place the Queen in the next row. When all eight Queens have been placed, the code in the 9th line records their current positions.
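
The esProc cellset itself is shown in the form above; to make the traversal concrete, here is a rough sketch of the same logic in Python (an illustration only, not the esProc code):

    # queens[r] holds the column (1..8) of the Queen in row r, or 0 if row r is empty.
    def eight_queens(n=8):
        queens = [0] * (n + 1)      # 1-based rows; index 0 is unused
        layouts = []
        i = 1                       # row currently being processed
        while i >= 1:
            queens[i] += 1          # move the Queen of row i to the next column
            if queens[i] > n:       # the whole row has been traversed
                queens[i] = 0       # reset this row and back up to the previous row
                i -= 1
                continue
            # check columns and diagonals against the Queens already placed
            ok = all(queens[r] != queens[i] and abs(queens[r] - queens[i]) != i - r
                     for r in range(1, i))
            if not ok:
                continue            # try the next column of row i
            if i == n:              # all eight Queens placed: record the layout
                layouts.append(queens[1:])
            else:
                i += 1              # go on to place the Queen in the next row
        return layouts

    print(len(eight_queens()))      # prints 92, the number of eligible layouts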

The computed result in A10 is:

Check detailed result in C1:

September 17, 2014

How to Compute Moving Average in R Language

A moving average is used to smooth out a time series, and computing it is a typical case of ordered data computing. The basic method is to take a subset of N consecutive members of the series, compute the average of that subset, and shift the subset forward one position at a time. The following example shows how to compute a moving average in R.

Case description:

The data frame sales has two fields: salesDate and Amount, the sales amount of that date. Requirement: compute the three-day moving average, i.e. for each date average the amounts of the previous day, the current day and the next day, shifting forward along the dates. A part of the source data is as follows:
Code:
Computed result:
Code interpretation:
The filter function in R can be used to compute a moving average, and it produces concise code. This method is quite convenient.

Despite its convenience, the filter function is difficult for beginners to understand. For example, sales$Amount/3 on its own means dividing each value of the Amount field by three, but when the expression is used in filter it amounts to adding three consecutive values together and dividing the sum by three. [1,1,1] is the value of the expression rep(1,3), which is used here to specify the range of data to fetch. In addition, because neither the name nor the parameters of filter contain the words "average" or "moving", many R developers are not even aware that it can be used to compute a moving average.

In fact, filter is a general linear filter whose uses go well beyond computing moving averages. Its full signature is filter(x, filter, method = c("convolution", "recursive"), sides = 2, circular = FALSE, init).
Any modification of the requirement makes the code harder to understand. For example, the code for the moving average of the current day and the previous two days cannot be written as filter(sales$Amount/3, rep(0,2)); it has to be filter(sales$Amount/3, rep(1,3), sides = 1).
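
To make the two filter calls concrete, here is a minimal plain-Python sketch of what they compute (the Amount values are hypothetical sample data):

    amount = [120, 150, 90, 200, 170, 130]        # hypothetical Amount column

    # filter(sales$Amount/3, rep(1,3)): centred average of the previous,
    # current and next day; the two ends stay undefined, like NA in R.
    centred = [None] * len(amount)
    for i in range(1, len(amount) - 1):
        centred[i] = (amount[i - 1] + amount[i] + amount[i + 1]) / 3

    # filter(sales$Amount/3, rep(1,3), sides = 1): average of the current day
    # and the previous two days.
    trailing = [None] * len(amount)
    for i in range(2, len(amount)):
        trailing[i] = (amount[i - 2] + amount[i - 1] + amount[i]) / 3

    print(centred)
    print(trailing)
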
Summary:
R can compute a moving average, but the code is rather hard to understand.

Third-party solutions
We can also use Python, esProc or Perl to handle this case. Like R, all of these languages can perform statistical data analysis and compute moving averages. The solutions in Python and esProc are briefly introduced below.

Python (pandas)
Pandas is a third-party Python library. It is powerful at processing structured data, and its basic data type imitates R's data frame. At present the latest version is 0.14. Its code for handling this case is as follows:
         pandas.stats.moments.rolling_mean(sales["Amount"], 3)

The name of the rolling_mean function is self-explanatory; even a developer with no pandas experience can understand it easily. The function's usage is simple too: the first parameter is the sequence to be computed and the second is N, the number of days over which the moving average is taken.
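
As a usage illustration with hypothetical data: the article targets pandas 0.14, where the function lives at pandas.stats.moments.rolling_mean; in current pandas versions the same computation is written with Series.rolling, as in this sketch:

    import pandas as pd

    amount = pd.Series([120, 150, 90, 200, 170, 130])   # hypothetical Amount column

    # three-value rolling mean, the counterpart of rolling_mean(sales["Amount"], 3)
    print(amount.rolling(3).mean())

    # centred variant averaging the previous, current and next day
    print(amount.rolling(3, center=True).mean())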

esProc
esProc is good at expressing business logic freely with its agile syntax. Its expressions for relative position make computational problems on ordered data easy to solve. The code is as follows:
         sales.(Amount{-1,1}.avg())

{-1,1} in the code represents a relative interval, that is, the three days consisting of the previous day, the current day and the next day. As can be seen, a moving average can be worked out clearly and flexibly by using a relative interval. If it is required, for example, to compute the moving average of the current day and the previous two days, we just need to change the interval to {-2,0} in esProc.

A relative interval is a set. esProc can also express a single element by relative position; for example, it can compute the sales growth rate conveniently with (Amount - Amount[-1]). In contrast, the equivalent code in R and Python is harder to understand.
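
As a small illustration of the relative-position idea with hypothetical data, the counterpart of Amount - Amount[-1] in plain Python would look like this:

    amount = [120, 150, 90, 200, 170, 130]   # hypothetical Amount column

    # growth over the previous day; the first element has no predecessor
    growth = [None] + [amount[i] - amount[i - 1] for i in range(1, len(amount))]
    print(growth)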

September 16, 2014

Computing the Online Time for Users with esProc (IV)

In the last article we mentioned that the IT engineers of the Web Company used esProc to write a single-machine, multi-threaded program that could handle a large data volume and complex requirements, leveraging the full power of one multi-core, multi-CPU machine. Now these engineers face a new issue: with the number of users of the online application growing explosively, colleagues from the Operation Department complain that the online time computation program still runs too slowly.

The IT engineers leveraged esProc's multi-machine parallel computing capability to split the task across multiple machines, and the performance problem was resolved successfully. Single-machine parallel processing was shifted to multi-machine parallel processing at a relatively low cost in hardware and software upgrades.

To improve performance, the Web Company increased the number of servers from one to three. Accordingly, the following steps are needed to shift from single-machine parallelism to multi-machine parallelism:


The first step: modify the esProc program that processes the weekly log files. Divide each user ID by 3 and split the weekly log file into 3 files according to the remainder, with each server processing one of them. This reduces the file sizes and shortens the file transfer time. The three files are then uploaded to the three servers, where parallel programs do the computation. The actual program is as follows:


Note in the last screenshot that A6 uses the @g option of the export function to write the week's log data into three binary files. During subsequent parallel processing, the content of these files can be retrieved in blocks for different users. The @g option ensures that segmented retrieval is aligned to group borders, so data of the same user can never be assigned to two blocks.
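
As a rough illustration of the splitting idea only (not the esProc program in the screenshot), assuming the weekly log is a CSV file with the user ID in the first column and hypothetical file names:

    import csv

    with open("weeklog.csv") as src:
        reader = csv.reader(src)
        outs = [open("weeklog_part%d.csv" % r, "w", newline="") for r in range(3)]
        writers = [csv.writer(f) for f in outs]
        for row in reader:
            user_id = int(row[0])
            writers[user_id % 3].writerow(row)   # the remainder picks the target file
        for f in outs:
            f.close()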

The second step: the single-machine, multi-threaded program stays essentially unchanged. Let's look back at it.

The subroutine parameters are shown below. They pass the weekly log file name, the block number and the total number of blocks when the subroutine is called by the main program. Here the log file name for the week (the week file parameter) is already one of the three segmented files corresponding to this machine.


The subroutine is as follows:


The above screenshot illustrates that:
1. Because we previously used export @g to output the file grouped by user ID, the cursor in A2 can use the @z option to handle a specific block (value: the block number) out of total (value: the total number of blocks) of the file, retrieving the complete group for each userID. Data for one user will not be split into two blocks (a rough Python sketch of this idea follows these notes).

2. The code line in the red box returns the result to the main program as a cursor. Since multi-machine parallel processing is used here, this is a remote cursor (see esProc's documentation for a detailed introduction to remote cursors).
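
The real @g/@z mechanics belong to esProc's binary file format; as a simplified stand-in, the sketch below shows the idea of handing out block number block of total blocks without ever splitting one user's group, assuming the records are already sorted by userID:

    # records: (userID, ...) tuples sorted by userID
    def read_block(records, block, total):
        groups = []                              # one entry per user
        for rec in records:
            if groups and groups[-1][0][0] == rec[0]:
                groups[-1].append(rec)
            else:
                groups.append([rec])
        # deal out whole groups to blocks, never splitting a group in two
        per_block = (len(groups) + total - 1) // total
        chosen = groups[(block - 1) * per_block : block * per_block]
        return [rec for grp in chosen for rec in grp]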

The third step: write the main program for parallel computing, which calls the parallel subroutine. As illustrated below, the main program launches parallel tasks on three machines, which effectively improves the computation performance.

The server list in the program could also be moved into a configuration file, so that servers could easily be added or removed later.

Note: for specific measurements of esProc's performance gain with parallel computing, please refer to the related esProc test reports.

Notes on the above screen capture:

1. The callx@ parameter specifies the 3 servers in A1 to A3 to handle the three log files in B1 to B3.

2. The syntax of callx's input parameters specifies the three servers through A5 and six parallel computing tasks for each server in A6.

3. The server list, the number of servers and the number of tasks per server can be adjusted to the actual situation, to exploit the servers' full performance potential. A rough sketch of this fan-out pattern follows these notes.
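
A rough sketch of the fan-out these notes describe (hypothetical host and file names; esProc's callx does the actual dispatching):

    servers = ["server1", "server2", "server3"]   # could also come from a config file
    log_files = ["part0.btx", "part1.btx", "part2.btx"]
    tasks_per_server = 6

    jobs = []
    for server, log_file in zip(servers, log_files):
        for block in range(1, tasks_per_server + 1):
            # each job: the server that runs it, its log file, block number, total blocks
            jobs.append((server, log_file, block, tasks_per_server))

    for job in jobs:
        print(job)                                # 3 servers x 6 tasks = 18 blocks in all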

The fourth step: deploy the esProc servers and upload the related program and data files. Refer to the esProc instructions for the specific steps and methods.

After the transformation to multi-machine parallel computing, the Operation Department saw a significant improvement in the speed of computing users' online time. The cost of this transformation is much lower than that of upgrading the application databases; in particular, on the hardware side only 2 additional PC servers were needed.

With this, the Web Company has finished implementing its esProc-based user behavior analysis and computation platform. Its main advantages are:

1. The platform can easily be adjusted to more complex algorithms in the future, which shortens response time and saves engineering labor costs.

2. It can easily scale out to even larger data volumes in the future, with shorter project time and lower upgrade cost.

September 15, 2014

Computing the Online Time for Users with esProc (III)

In the last article we mentioned that the IT engineers of the Web Company used esProc to write a program that could handle a large data volume and complex requirements. Not only did it meet the demands of online time computation, it is also relatively easy to extend with new conditions.

However, the engineers found that the single-threaded program does not take full advantage of the server's computing power. Practice has proved that esProc's multi-threading capability can exploit the server's multi-core, multi-CPU hardware. The change from single-threaded to multi-threaded requires very little work.

The Operation Department provided the following requirements for the computation of users' online time:

1. Login should be taken as the starting point of online time, and sessions spanning midnight should be taken into account.

2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.


3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.


4. If there is only login, without logout, then the last operation time should be treated as time for logout.


5. For a user who completed a post operation, the online time of the current session is tripled in the computation.


To shift from single-threaded computing to parallel computing, the following steps need to be done:


The first step: adjust the log file preprocessor to use the @g option of the export function, writing the week's log data into a segmented binary file. In subsequent parallel processing, the log file can then be retrieved in blocks for different users. The @g option ensures that segmented retrieval is aligned to group borders, so data of the same user can never be assigned to two blocks. The actual procedure is as follows:


The second step: rewrite the online time computation program as a parallel subroutine. The part in the red box below is what needs to be modified for parallel processing. Because different parallel tasks compute for different users, very few changes are required: the only change is to replace reading whole files with reading different blocks of the binary file.

First we need to add parameters to the subroutine, so that the main program can pass in the weekly log file name, the block number and the total number of blocks.


Then modify the program as follows:

The above screenshot illustrates that:
1. Because we previously used export@g to output the file grouped by user ID, the cursor's @z option, shown in the red box, handles a specific block (value: the block number) out of total (value: the total number of blocks) of the file and retrieves the complete group for each userID. Data for one user will not be split into two blocks.

2. A16 returns the result to the main program as a cursor.

The third step: write the main program for parallel computing, which calls the parallel subroutine. Because the server's CPUs have 8 cores in total, the IT engineers decided to use six threads for parallel computing. This takes full advantage of the multi-core CPUs to improve performance.
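
A rough Python analogue of this step (not the esProc program in the screenshot): six workers each handle one block of the weekly file and the partial results are merged; compute_block and the file name are hypothetical stand-ins for the parallel subroutine and the segmented binary file:

    from concurrent.futures import ThreadPoolExecutor

    TOTAL_BLOCKS = 6

    def compute_block(week_file, block, total):
        # placeholder: would read block `block` of `total` blocks and
        # compute the online time for the users in that block
        return []

    def main(week_file="weeklog.btx"):
        with ThreadPoolExecutor(max_workers=TOTAL_BLOCKS) as pool:
            futures = [pool.submit(compute_block, week_file, b, TOTAL_BLOCKS)
                       for b in range(1, TOTAL_BLOCKS + 1)]
            results = [f.result() for f in futures]
        return [row for part in results for row in part]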

Note: for specific measurements regarding esProc's performance gain with parallel computing, please refer to related test reports for esProc.

After meeting this requirement, the IT engineers of the Web Company face a new problem: the number of users of the online application has grown explosively, and colleagues from the Operation Department complain that the online time computation program still runs too slowly. The single-machine, multi-threaded approach can no longer raise the computing speed significantly. Can these IT engineers solve the performance issue with esProc's multi-machine parallel computing capability? Is it too costly to move to a multi-machine parallel mode? See "Computing the Online Time for Users with esProc (IV)".

September 14, 2014

Using esProc to Compute the Online Time of Users (II)

In the last part we mentioned that the Operation Department of the Web Company raised a new demand: adding new conditions to the way online time is computed. Because the IT department uses esProc as the computation tool, it is easy to handle such changes in requirements. Meanwhile, the growing amount of data can be accommodated by out-of-memory computation with esProc's file cursor functionality.

Previously, the Operation Department provided the following requirements for the computation of users' online time:


1. Login should be taken as the starting point of online time, and sessions spanning midnight should be taken into account.

2. If the time interval between any two operations is less than 3 seconds, then this interval should not be added to online time.

3. If after login, the time interval between any two operations is longer than 600 seconds, then the user should be considered as logged out.

4. If there is only login, without logout, then the last operation time should be treated as time for logout.

Over time, the Operation Department found a "key point" in user behavior: between login and logout, users who perform post actions are more loyal to the online application. Therefore the Web Company plans to introduce an incentive: on top of the original rules, if a user performs a post operation, his/her online time will be tripled in the computation.

After receiving the task, the IT engineers considered the possibility of future adjustments to the computation, plus the need for additional conditions. The decision was to implement the computation with an out-of-memory (file) cursor and a for loop.

Analysis showed that most user behavior analysis is done for each user independently. Thus, if the log file is pre-sorted by userid, the performance of the various analytical computations improves, with reduced difficulty and shorter processing time. The pre-processing program is as follows:

As we can see, pre-processing sorts the seven days of log files and outputs them to a binary file. This eliminates the need for subsequent consolidation and sorting. Meanwhile, the binary file format provided by esProc also helps to improve read/write performance.
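
As a rough illustration of the pre-processing idea only (not the esProc program in the screenshot), assuming seven daily CSV log files with userid in the first column and hypothetical file names:

    import csv, pickle

    records = []
    for day in range(1, 8):
        with open("log_day%d.csv" % day) as f:
            records.extend(csv.reader(f))

    records.sort(key=lambda row: row[0])      # sort all records by userid

    with open("weeklog.bin", "wb") as out:    # binary output, here simply via pickle
        pickle.dump(records, out)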

After pre-processing, the code for online time computation can be written as follows:

Note that:
1. The volume of data for one user in seven days is not big, so in cell A5 we can retrieve all the log data for a user into memory in one batch.

2. In the loop that handles one user at a time, the code in the red box implements the new business logic: if a post operation was performed, the user's online time for the current login session is tripled in the computation. Unqualified records are removed in cell B9, and in B10 a serial number (lognum) is calculated for every login. Records are grouped by lognum to compute the sum of online time for each group; if there is at least one "post" action in the current group of operations, the sum of online time for that group is tripled (a rough Python sketch of this logic follows these notes).

3. Considering the relatively large result set, once the computation has been done for 10,000 users and the result has reached 10,000 lines, we output that batch from memory to the result file. This improves performance while avoiding memory overflow.
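
To make the business rules concrete, here is a minimal Python sketch of the per-user logic (not the esProc code in the screenshot), assuming each user's records arrive as time-sorted (timestamp in seconds, action) pairs and interpreting rule 3 as ending the session at the previous operation:

    def online_time(records):
        total = 0
        session = 0          # accumulated online time of the current login session
        has_post = False     # whether a "post" occurred in this session
        last = None          # previous (time, action) inside the session

        def close_session():
            nonlocal total, session, has_post
            total += session * 3 if has_post else session   # triple if a post occurred
            session, has_post = 0, False

        for time, action in records:
            if last is None:                 # not inside a session: wait for a login
                if action == "login":
                    last = (time, action)
                continue
            gap = time - last[0]
            if gap > 600:                    # treated as logged out at the previous operation
                close_session()
                last = (time, action) if action == "login" else None
                continue
            if gap >= 3:                     # intervals under 3 seconds are not counted
                session += gap
            if action == "post":
                has_post = True
            if action == "logout":
                close_session()
                last = None
            else:
                last = (time, action)
        if last is not None:                 # only login, no logout: last operation ends it
            close_session()
        return total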


After meeting this demand, the IT engineers of the Web Company found that the single-threaded program does not take full advantage of the server's computing power. Here comes another question: can these engineers leverage esProc's multi-threaded parallel computing capability to take full advantage of the server's multi-core CPUs? Is it troublesome to shift from single-threaded to multi-threaded? See "Computing the Online Time for Users with esProc (III)".