May 22, 2015

esProc Improves Text Processing – Remove Duplicate Rows

During processing the text file, sometimes we need to remove duplicate rows from the grouped data. But the operation becomes complicated when the file under processing is too big to be entirely loaded into the memory. esProc’s group operation supports importing a whole group of data with the file cursor, as well as many options. So it can handle this kind of operation effortlessly. The following example will show you how it works.

The file EPRom.log has four columns separated by tab and its data have been grouped by the second column. Duplicate rows need to be removed (only the first row of each group is wanted). Some of the source data are as follows:

esProc code for doing this

A1=file("E:\\EPRom.log").import()

This line of code imports the file into the memory. By default, the separator is tab and column names are _1_2_3……. If it is a CSV file, the separator needs to be specified using the code import(;”,”). If the file’s first row contains column names, @t option can be used to import them, that is, import@t(). The result of A1 is as follows:

A2=A1.group@1o(_2)

This line of code gets the first row from every group. The group field is _2, the second field. This is the case’s final result, as shown below.

By default, group function will regroup the data. For instance, A1.group(_2) will divide A1 into two groups by the second field, as the following shows:

But the use of @o option won’t make that happen. For instance, result of A1.group@o(_2) is as follows:

With @1 option, the function will get the first row of every group. By using both @1 and @o, we’ve hit the target of this case.
        
In the situation that the file is too big to be wholly loaded into the memory, esProc cursor can be used to deal with it. Corresponding code is as follows:

A1=file("E:\\EPRom.log").cursor()

This line of code opens the log file in the form of cursor. cursor function returns a cursor object according to the corresponding file object, with tab being the separator and _1,_2…_n being column names by default. Notice that the code merely creates the cursor object without importing data. Data importing will be started by for statement or fetch function.

B1= file("e:\\result.txt")

This line of code creates a file object to which the computed results can be exported later on.

A2:for A1;_2

This line of code imports data from cursor A1 by loop. Each time it imports a group of data with the same second column (column name is _2). In this step the data are truly being imported into the memory.

Here for statement deserves special attention. In esProc, for cs,n imports n records from cursor cs each time. for cs;x imports a group of records with the same x field from cursor cs each time, the condition is that records have been grouped by x field.

The x in for cs;x statement can be an expression, which means multiple rows will be imported until the result of computing expression x changes. Take for A1 ; floor(_1/5) as an example. It divides _1 field by 5 and rounds the result off, put the records with the same results into the same group, like the first row to the fifth row.

B2=file("e:\\result.txt").export@a([A2(1)])

As the loop body of for statement in A2, it processes every group of data in the same way. The method is to get the first row of the current group and append it to file result.txt. A2 is the loop variable which represents all records in the current group. A2(1) represents the first record in 2. export function is used to write the structured data into a new file, its @a option means appending. Since A2(1) is a single record, it needs to be converted into array with the operator [].

We can see the final result in result.txt:

In esProc, the working range of for statement can be represented by indentation instead of the parentheses or identifiers like begin/end. In the following code block, for instance, B2-B5 is A2’s working range.

May 19, 2015

esProc Improves Text Processing – Parse Logs with Arbitrary Number of Lines

When parsing logs into structured data, we often find that the records consist of a variable number of lines. This makes the conversion, as well as the corresponding operation, quite complicated. Equipped with various flexible functions for structured data processing, such as regular expressions, string splitting, fetching data located in a different row, and data concatenation, esProc is ideal for processing this kind of text. Following example will show you how it works.


The log file reportXXX.log holds records, each of which consists of multiple lines with 14 data items (fields) and starts with the string “Object Type”. Our goal is to rearrange the log into structured data and write the result to a new text file. Some of the source data are as follows:

esProc code for doing this

A1=file("e:\\reportXXX.log").read()


This line of code reads the logs entirely into the memory. Result is as follows:

A2=A1.array("Object Type: ").to(2,)

This line of code can be divided into two parts. The first part - A1.array("Object Type: ") – splits A1 into strings according to “Object Type”. Result is as follows: 

Except the first item, every item of data is valid. to(2,) means getting items from the second one to the last one. Result of A2 is as follows:


This line of code applies the same regular expression to each member of A2 and gets the 14 fields separated by commas. Following lists the first fields:
 

A4=file("e:\\result.txt").export@t(A3)  

This line of code writes the final result to a new file. Tab is the default separator. The use of @t option will export the field names as the file’s first row. We can see the following data in result.txt:

The regular expression used in the code above is complicated. We’ll use esProc’ built-in functions to make the operation more intuitive. For example, ObjectType field is the first line of each record, so we can separate the records from each other with the line break and then get the first line. left\top\right\bottom actually splits each record’s second line by space and get item 3, 5, 7 and 9.

The task can be handled with esProc built-in functions as follows:

In the above code, pjoin function concatenates many sets together; array function splits a string into many segments by the specified delimiter and creates a set with them, in which (~.array("\r\n") splits each record by carriage return.
In the above example, we assumed that the log file is not big and can be wholly loaded into the memory for computing. But sometimes the file is big and needs to be imported, parsed and exported in batch, which makes the code extremely difficult to write. Besides, because the number of records is variable, there is always a record in a batch of data which cannot be imported completely. This further complicates the coding.

esProc can handle the big log files with arbitrary number of lines more easily using file cursors. Following is a code sample: 

A1=file("\\reportXXX.log").cursor@s()

This line of code opens the log file in the form of a cursor. cursor function returns a cursor object according to the file object, with tab being the default separator and _1,_2…_n being the default column names. @s option means ignoring the separator and importing the file as a one-column string, with _1 being the column name. Note that this code only creates a cursor object and doesn’t import data. Data importing will be started by for statement or fetch function.

A2: for A1,10000

A2 is a loop statement, which imports a batch of data (10,000 rows) each time and sends them to the loop body. This won’t stop until the end of the log file. It can be seen that a loop body in esProc is visually represented by the indentation instead of the parentheses or identifiers like begin/end. The area of B3-B7 is A2’s loop body which processes data like this: by the carriage-return the current batch of data is restored to the text which is split into records again according to “Object Type” , and then the last, incomplete record is saved in B1, a temporary variable, and the first and the last record, both of which are useless, are deleted; and then the regular expression is parsed with each of the rest of the records, getting a two-dimensional table to be written into result.txt. Following will explain this process in detail:

B2=B1+A2.(_1).string@d("\r\n")

This line of code concatenates the temporary variable B1 with the current text. In the first-run loop, B1 is empty. But after that B1 will accept the incomplete record from the previous loop and then concatenate with the current text, thus making the incomplete record complete.

string function concatenates members of a set by the specified separator and @d function forbids surrounding members with quotation marks. Top rows in A2 are as follows:

A2.(_1) represents the set formed by field _1 in A2 :

A2.(_1).string@d("\r\n") means concatenating members of the above set into a big string, which is Object Type: Symbol Location: left: 195 top: 11 right: 123 bottom: 15 Line Color: RGB ( 1 0 0 ) Fill Color:   RGB ( 251 255 0 ) Link:l11…. 

B3=B2.array("Object Type: ")


This line of code splits the big text in B2 into strings by “Object Type”. Result of B3’s first-run loop is as follows:
Since the last string in B3 is not a complete record and cannot be computed, it will be stored in the temporary variable and concatenated with the new string created in the next loop. B4’s code will store this last string in the temporary variable B1.

B4=B1="Object Type: "+B3.m(-1)+"\r\n"

m function gets one or more members of a set in normal or reverse order. For example, m(1) gets the first one, m([1,2,3]) gets the top three and m(-1) gets the bottom one. Or B3(1) can be used to get the first one. And now we should restore the “Object Type” at the beginning of each record which has been 
deleted in the previous string splitting in A2. And the carriage return removed during fetching the text by rows from cursors will be appended.

The first member of B3 is an empty row and the last one is an incomplete row, both of them cannot be computed. We can delete them as follows:

B5=B3.to(2,B3.len()-if(A1.fetch@0(1),1,0)))

This line of code fetches the valid data from B3. If the data under processing is not the last batch, fetch rows from the second one to the second-last one and give up the first empty row and last incomplete row. But if the current batch is the last one, fetch rows from the second one and the last one which is complete and give up the first empty row only.

B3.to(m,n) function fetches rows from the mth one and the nth one in B3. B3.len() represents the number of records in B3, which is the sequence number of the last record in the current batch of data. A1.fetch(n) means fetching n rows from cursor A1 and @0 option means only peeking data but the position of cursor remaining unchanged. if function has three parameters, which are respectively boolean expression, return result when the expression is true and return result when the expression is false. When the current batch of data is not the last one, A1.fetch@0(1) is the valid records and if function will return 1; when it is the last one, value of A1.fetch@0(1) is null and if function will return 0.

B6=B5.regex(regular expression;field names list). This line of code applies the same regular expression to each member of B5 and gets the 14 fields separated by commas. Following lists the first fields:

B7=file("e:\\result.txt").export@a(B6)

This line of code appends the results of B6 to result.txt. It will append a batch of records to the file after each loop until the loop is over. We can view this example’s final result in the big file result.txt:

In the above algorithm, regular expression was used in the loop. But it has a relatively poor compilation performance, so we’d better avoid using it. In this case, we can use two esProc scripts along with pcursor function to realize the stream-style splitting and parsing.

First let’s look at the code for master routine main.dfx:

pcursor function calls a subroutine and returns a cursor consisting of one-column records. A2 parses the regular expression with each record in A1 and returns structured data. Note that the result of A2 is a cursor instead of the in-memory data. Data will be exported to the memory for computing from A2’s cursor segmentally and automatically by executing export function.

Subroutine sub.dfx is used to return the cursor, whose code is similar to the previous one. The difference is that the results need not be written to a new file, the one-column records will be returned instead, as the following code shows:

B6’s result statement can convert the result of B5 to a one-column table sequence and return it to the caller (pcursor function in main.dfx) in the form of a cursor.

With pcursor function, master routine main.dfx can fetch data from the subroutine sub.dfx by regarding it as an ordinary cursor and ignoring the process of data generation. While main.dfx needs data, pcursor function will judge if the loop in sub.dfx should continue, or if it should supply data by returning them from the buffer area. The whole process is automatic. 

March 13, 2015

esProc Improves Text Processing – String Matching with Big Files

There are many occasions during text processing which require performing string matching with big files. Coding with command line grep\cat is simple yet inefficient. Though higher efficiency can be achieved with high-level languages, coding will be rather difficult.

Yet this operation, as well as multithreaded parallel computing, can be handled more easily in esProc, with more concise code and much better performance. The following examples will show esProc method in detail.


file1.txt has a great many strings. Find out the rows ending with “.txt” and export them to result.txt. Some of the original data are as follows:

esProc code for doing this task:

A1: Open the file in the form of cursors. Instead of importing all the data into the memory at a time, cursor function opens the file in the form cursors (stream) without memory footprint. The function uses default parameters to import all the fields with tab being the column separator and to automatically name them _1, _2, _3…_n respectively. There is only one field, _1, in this example.

A2=A1.select(like@c(_1,"*.txt"))

This line of code selects rows ending with “.txt” from cursor A1. select function executes the query and like function performs string matching. _1 represents the first field. The use of @c option in like function means the matching is case insensitive.

One point worth noting is that the result of A2 is still a cursor without memory footprint. Only with the use of functions like export/fetch/groups will esProc allocate suitable memory buffers and convert the cursor computing to memory computing.

A3=file("e:\\result.txt").export(A2). This line of code exports the final result to a file. Some of the data are as follows:

The matching rule in the example above is relatively simple. If the rule is complex, a regular expression will be needed. For example, find out rows starting with “c:\windows” and not ending with “.txt”.

regex function is used to perform string matching with the regular expression. Just modify A2’s code to A1.regex@c("^c:\\\\windows.*(?<!\\\\(.txt)$)") , in which @c option means case insensitive.

Though the regular expression can be used to realize the string matching with complex rule, its performance is not satisfactory. For example, to find out rows ending with “.txt” from a file of 2.13G size in the same test environment, it takes 206 seconds with a regular expression, while it takes only 119 seconds with an ordinary expression (the select statement).

In fact, many tasks of string matching with complex rule can also be realized with the ordinary expression. Moreover, the syntax is more visual and cost of learning is lower. For example, emp.txt holds a large number of user records, each of which has multiple fields, separated by tab and with the first row being the column names. Suppose you are to find out rows with the rule that “Eid field is lesser than 100, the first letter of Name filed is a and Birthday field is greater than 1984-01-01”. You can do it in esProc as follows:

The @t option used with cursor function means that the first row will be imported as column names for the use of accessing data at a later time.

The three query conditions can be represented by EId>100, like@c(Name,"a*") and Birthday>=date("1984-01-01") respectively. The logic relation between the conditions is “AND”, which can be represented by &&.

The above algorithm is sequential computation. The performance can be further improved if parallel computing is used. The method is this: Import the file using multithreads, each of which will access some of the data of the file with a cursor, and perform set operations at the same time; finally, merge the result of each cursor together.

Test the processing of a file of 2.13G size under the same hardware environment. It takes an average of 119 seconds with the sequential computation, whereas it takes only an average of 56 seconds with the parallel computing, which speeds the performance almost doubly. The algorithm used in the example is not so complex, so the bottleneck is the hard driver’s ability to import data. With the increase of the complexity of the computation, the performance will be improved more greatly.

esProc code for parallel computing:

A1=4. A1 is the number of segments, which means the file will be divided into 4 segments. The number is equal to the number of parallel tasks in operation, which generally should not exceed the number of CPU cores. Otherwise the tasks will be queued for processing and the performance won’t be really increased. The maximum number of the parallel tasks can be configured in the environment option.

A2=A1.(file("e:\\file1.txt").cursor@z(;, ~:A1))
This line of code will generate four cursors according to the specified number of segments. A1.(express) means computing the expression with members of A1 respectively. “~” can be used in the parentheses to represent the current member. Generally A1 is a set, like ["file1", " file2"] or [2,3]. If members of the set are consecutive numbers starting with 1, like [1,2,3,4], the code can be written in a simple form as 4.( express), as with the code in this example.

In the expression, file("e:\\file1.txt").cursor@z(;, ~:A1), surrounded in the parentheses, cursor function uses @z option to segment the file and fetch each part with a cursor. ~:A1 means that the file is roughly divided into four segments (A1=4) and the ~th segment is fetched. “~” represents the current member in A1 and each cursor corresponds to the first, the second, the third and the fourth segment respectively.

Besides, though exact division will result in incomplete lines, esProc can import complete lines automatically by skipping the beginning half line of a segment and completing the ending half line of the segment. This is why the file should be divided “roughly”.

A3=A2.(~.select(like@c(_1,"*.txt"))). This line of code queries data of each cursor (i.e. ~) in A2 and selects the eligible rows. The computed results are still four cursors.
A4=A3.conj@xm(). This line of code merges the four cursors in A3 in parallel.
A5=file("e:\\result.txt”).export(A4). This line of code exports the final result to a file.

An esProc script not only can work independently in an Integration Development Environment (IDE), it also can be called by a Java program through JDBC interface. The calling method is the same as the method of calling an ordinary database. A one-step esProc script can be embedded in the Java program directly without script file. Actually the above steps can be combined into one single step:
file("e:\\result.txt").export(4.(file("e:\\file1.txt").cursor@z(;, ~:4)).(~.select(like@c(_1, "*.txt"))).conj@xm())

It is also allowed to run this kind of one-step script in operating system’s command line. Please refer to related documents for further information. 

March 11, 2015

esProc Improves Text Processing – Set Operations on Big Files

It is common to perform set operations on big files in text processing. For example, find different rows between two files. The code for handling the operations with command line grep or cat command is simple but inefficient. While the operational efficiency is high when high-level languages are used to handle the operations, the code is difficult to write.

esProc supports performing set operations on big files and multithreaded parallel computing. Its code is concise and its performance is remarkable. The following examples will show the esProc method in detail.


file.1txt and file2.txt hold a large number of strings respectively. Find their common rows (that is, compute the intersection). Some of the data are shown as follows:

Both files are big
When both files are too big to be loaded into the memory, esProc’s way of cursor merge can be used to realize the intersection operation. The code is as follows:

A1, B1Open the files as cursors. cursor function doesn’t import all the data into the memory, it opens a file in the form of a cursor (stream) without the memory footprint. The function uses default parameters to import all the fields and to name the new columns automatically as _1, _2, _3…_n, with tab being the column separator. There is only one field - _1 - in this example.

A2=[A1.sortx(_1),B1.sortx(_1)].merge@xi(_1)

The above code uses the merge operation to find the common rows of the two files, that is, to compute the intersection. A merge operation requires ordered data, so sortx function is used to sort the cursors first. The corresponding code is A1.sortx(_1) and A2.sortx(_1), in which _1 is the default field name in the files. merge function is used to merge multiple groups of data (two groups as with this example). Without parameter options, it merges the sets in the memory; the use of parameter option @x means to merge the cursors and @i means that the merge result is the intersection.

Notice that the result of A2 is still a cursor without the memory footprint. Only when the computation involves functions like export, fetch, groups and etc. will esProc engine allocate suitable buffers and automatically convert the cursor computing into in-memory computing. 

A3=file("E:\\result.txt").export(A2)

This line of code writes cursor A2 to a file. export function can be used to write both the in-memory data and cursors to files. Here default parameters are used, that is, no column name, using tab as the column separator, overwriting instead of appending files and writing as text files instead of binary files. Open result.txt and you can see some of the data as shown below:

Actually the above three-step esProc code can be combined into a single line of code: 
A1=file("e:\\result.txt").export([file("E:\\file1.txt").cursor().sortx(_1),file("E:\\file2.txt").cursor().sortx(_1)].merge@xi(_1))   

In addition to the intersection operation performed in this example, there are also operations of union, concatenation and difference. Just to modify the option of merge function in A2 to realize them. For example, union file1.txt and file2.txt and remove the duplicate members to get their union. Function option @u can be used to compute the union with the following code: [A1.sortx(_1),B1.sortx(_1)].merge@xu(_1). Their union is as follows:

By not removing the duplicate members, concatenation will be computed. The code is [A1.sortx(_1),B1.sortx(_1)].merge@x(_1), which shows the real meaning of merge operation. The result is as follows:

Use function option @d to compute the difference, the data which are included in file1.txt but not included in file2.txt. The code is [A1.sortx(_1),B1.sortx(_1)].merge@xd(_1). Result is as follows:

Note: The commutative law doesn’t apply to difference operation. So the code for getting the data which are included in file2.txt but not included in file1.txt is [B1.sortx(_1),A1.sortx(_1)].merge@xd(_1). The result is as follows:

One file is big, the other is small
When there is only one big file that runs out of the memory, the smaller one can be loaded into the memory and then use a hash table to perform the set operations. Thus the efficiency can be increased significantly. esProc code is as follows:

A1Open the files as cursors.

A2=file("e:\\file2.txt").import(). This line of code imports the smaller file2.txt into the memory entirely. Similar to cursor function, import function’s default field names are _1 and _2 as well. Click B1 in esProc’s Integrated Development Environment (IDE) to see the computed result of the current cell:

B2>B1.primary(_1).index()

The above code defines a primary key for B1 and creates a hash index. The query speed can be increased greatly in this way. B1 and B2 can be combined into a single line of code : B1=file("E:\\ file2is.txt").import().primary(_1).index()

A3=A1.select(B1.find(~._1))

This line of code selects the common data of cursor A1 and cursor B1, which is computing the intersection. select function is used to execute the query statement and eligible data will be selected. In the function, ~ represents the current record.

The query criterion is B1.find(~._1), meaning to find in B1 the _1 field of the current record of A1.

Notice that the result is still a cursor when the code is executed.

A4=file("E:\\result.txt").export(A3). This line of code writes the final result into a file.

When computing difference, modify the code in A3 to A1.select(!B1.find(~._1)), which selects from A1 the rows which are not included in B1.

To compute the union of file1 and file2, first compute the difference of file1 and file2 and then union it with file2.

conj function in the expression [A3,B1.cursor()].conj@x() in A4 can concatenate multiple sets (which is concatenation operation) – [[1,2,3],[2,3],[]]=[1,2,3,2,3], for example. Function option @x is used to concatenate the data in multiple cursors. Since B1 is not a cursor, cursor function should be used to convert it to a cursor, i.e. B1.cursor(). This is faster than importing data from the file.

Compute a big file and a small one in parallel
The sequential computation is used in the above case, but parallel computation can further enhance the performance. The method is to import files using multithreads. Each thread accesses a part of the file with a cursor, and meanwhile performs a set operation and finally combines the result of each cursor together.

Test a big file of 2.77G size and a small one of 39.93M size under the same hardware environment. It takes an average of 85 seconds to accomplish the sequential computation while it takes an average of 47 seconds to accomplish it with parallel computing. The speed has nearly been doubled. As the set operations are not that complicated in themselves, the computation hits a bottleneck in the hard drive’s ability to import data. The performance will be improved more greatly while the operation is getting more complex.
esProc code for parallel computing is as follows:

B1Import the small file into the memory, define a primary and create an index.
A2=4. A2 represents the number of segments into which a file is going to be divided. Here the number is 4. Generally the number of segments should not exceed the number of CPU cores; otherwise the tasks will be queued for processing and the speed won’t be really increased. The maximum parallel number can be configured in the environment option.

A3=A2.(file("e:\\file1.txt").cursor@z(;, ~:A2))

The above code generates four cursors according to the number of segments. A2.(express) means computing the expression with members of A2 in order. In the parentheses, “~” can be used to represent the current member. Generally A2 is a set, like ["file1", " file2"] or [2,3]. If members of the set are a series of consecutive numbers beginning from 1, like [1,2,3,4], A2 can be abbreviated to 4.( express), as with the code in this example.

In the expression - file("e:\\file1.txt").cursor@z(;, ~:A2) - surrounded by the parentheses, cursor function uses @z option to divide the file into multiple segments and to import one of them using a cursor. ~:A2 means that the file will be roughly divided into 4 segments (A2=4), with “~” representing the current member in A2. Thus the cursors correspond to the first, the second, the third and the fourth segment of file respectively.

The reason for dividing a file “roughly” is that half rows will appear with the exact division. esProc can make sure of importing whole rows automatically by skipping the head incomplete row and making up the tail incomplete row, this is tedious to realize in Java.

A4=A3.(~.select(B1.find(~._1))). This line of code computes intersection on each cursor in A3 and the result is a set of cursors.

A5=A4.conj@xm(). This line of code combines the multiple cursors in A4 in parallel.

A6=file("e:\\result.txt”).export(A5). The code writes the final result into a file. The intersection operation has been accomplished at this point. You can refer to the previous examples to compute difference and union.

esProc can be used alone or be integrated into a Java program. The following is to integrate the esProc script into the Java program through JDBC. The Java code is as follows:
  // create a connection using esProc jdbc
  Class.forName("com.esproc.jdbc.InternalDriver");
  con= DriverManager.getConnection("jdbc:esproc:local://");
  // call esProc script, whose name is test.dfx
  st =(com.esproc.jdbc.InternalCStatement)con.prepareCall("call test()");
  st.execute();//execute esProc stored procedure, or input a parameter and output the computed result

For the simple script as with the first example, it can be embedded in the Java code and executed directly, without writing the script file (text.dfx). The corresponding Java code is as follows:
ResultSet set=st.executeQuery("file(\"e:\\result.txt\").export([file(\"E:\\file1.txt\").cursor().sortx(_1),file(\"E:\\file2.txt\").cursor().sortx(_1)].merge@xi(_1))");

March 9, 2015

esProc Improves Text Processing – Conditional Query on Big Files

During text processing, you often have the tasks of querying data from a big file on one or more conditions. Command line grep\cat command can be used to handle some simple situations with simple command yet low efficiency. Or high-level languages can be used to get a much higher efficiency with complicated code. If the query conditions are complex or dynamic, you need to create an additional SQL-like low-level class library, which increases the complexity of the computation.

esProc supports performing conditional query on big files and multithreaded parallel computing, and its code for handling this kind of problem is both concise and efficient. The following example will teach you the esProc method of doing the job.

A text file - employee.txt – holds the employee data. Import the data, select the female employees born after January 1, 1981 inclusive and export the query result to result.txt.

The format of employee.txt is as follows:
EID   NAME       SURNAME        GENDER  STATE        BIRTHDAY        HIREDATE         DEPT         SALARY
1       Rebecca   Moore      F       California 1974-11-20       2005-03-11       R&D          7000
2       Ashley      Wilson      F       New York 1980-07-19       2008-03-16       Finance    11000
3       Rachel      Johnson   F       New Mexico     1970-12-17       2010-12-01       Sales         9000
4       Emily         Smith        F       Texas        1985-03-07       2006-08-15       HR    7000
5       Ashley      Smith        F       Texas        1975-05-13       2004-07-30       R&D          16000

esProc code for accomplishing the task: 

A1Open the file as a cursor. cursor function won’t import all the data into the memory, it will open the file in the form of a cursor (stream) without the memory footprint. The function uses a default parameter which makes tab as the column separator to import all the fields. @t option means that the file’s first line will be the column names and thus specific column names can be used in the expression later. Without the option, columns will be named _1, _2, _3…_n automatically.

A2=A1.select(${where})

Filter the data according to the condition. Here a macro is used to dynamically parse the expression. "where" is the dynamic input parameter, it needs to be pre-defined. The following is the interface on which a parameter is defined:

The esProc program will first compute the expression surrounded by ${…}, then assign the computed result as the value to the macro string and replace ${…} with it; after that, the program will interpret and execute the code. For example, if where gets assigned as BIRTHDAY>=date(1981,1,1) && GENDER=="F" according to the given condition in the example, the expression in A2 will be =A1.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F"). The parameter can be entered into esProc’s Integration Development Environment (IDE), or can be passed from the Java code or the command line.

A3=file("D:/result.txt").export@t(A2). This line of code exports the computed result to a file. If the size of computing result is always small, use the code =A2.fetch() in A3 to fetch the results into the memory for direct observation, or use result A2.fetch() to return the results to the Java application. 
The final result of this example is as follows:

This example shows the method of realizing a dynamic query, that is, there is no need to change the code when the query condition changes, just modify the value of the parameter “where”. For example, if the condition becomes “query female employees born after January 1, 1981 inclusive, or employees whose FULLNAME is RebeccaMoore”, the value of “where” can be written as BIRTHDAY>=date(1981,1,1) && GENDER=="F" || NAME+SURNAME=="RebeccaMoore". After the code is executed, the result set of A2 will be as follows:

The above algorithm is a sequential computation. But the use of parallel computation can further improve the performance. The method is: Import the file using multithreads, each of which accesses a part of the file with a cursor; meanwhile query the data according to the condition and finally merge the result of each cursor together.

The esProc code for parallel computing is as follows:

A1=4. A1 is the number of segments, which means the file will be divided into 4 segments. The number is equal to the number of parallel tasks in operation, which generally should not exceed the number of CPU cores. Otherwise the tasks will be queued for processing and the efficiency won’t be really increased. The maximum number of parallel tasks can be configured in the environment option.
A2=A1.(file("d:/employee.txt").cursor@z(;, ~:A1))
This line of code will generate four cursors according to the specified number of segments. A1.(express) means computing the expression with each member of A1 in order. “~” can be used in the parentheses to represent the current member. Generally A1 is a set, like ["file1", " file2"] or [2,3]. If members of the set are consecutive numbers starting with 1, like [1,2,3,4], the code can be written in a simple form as 4.( express), as with the code in this example.

file("d:/employee.txt ").cursor@z(;, ~:A1) surrounded in the parentheses is an expression, in which cursor function uses @z option to segment the file and fetch each part with a cursor. ~:A1 means that the file will be roughly divided into four segments (A1=4) and the ~th segment will be fetched. “~” represents the current member in A1 and each cursor corresponds to the first, the second, the third and the fourth segment respectively.

Besides, though exact division will result in incomplete lines, esProc can import complete lines automatically by skipping the beginning half line of a segment and completing the ending half line of the segment. This is why the file should be divided “roughly”.

A3=A2.(~.select(${where})). This line of code will query data of each cursor (i.e. ~) in A2 and select the eligible rows. The computed results are still four cursors.

A4=A3.conj@xm(). This line of code will merge the four cursors in A3 in parallel.

A5=file("d:/result.txt”).export(A4). This line of code will export the final result to a file.