July 17, 2014

Process Text Files with esProc

It is quite convenient to process data from text files with esProc, which provides many functions for processing text files: import various text files; process big text files; visit text files of hdfs; as well as general operations, such as, file moving, deleting and checking whether a file exists. The following will illustrate these functions through examples.

File function and import function can be used if data in text files need to be read in. For example, the following text file, empolyee.txt, stores employee information:

EID   NAME SURNAME GENDER STATE BIRTHDAY   HIREDATE    DEPT SALARY
1   Rebecca Moore F California 1974-11-20 2005-03-11  R&D 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16  Finance   11000
3   Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000
4   Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5   Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000
6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales   11000
7 Alexis Smith F Illinois   1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07  HR 3000


It is convenient to import the file code:


In the above figure, cell A1 uses file function to open a designated file and cell A2 uses the file’s import function to import the file’s content. We can see in the figure that esProc's integrated development environment provides debugging functions, like single step, break point and so on. The imported file content can be displayed visually in the window on the right, making  debugging program and the observation of result more convenient. 

As the other functions provided by esProc, import function contains options and parameters. @t in import@t() in the above figure shows option t is used, and by looking up in the function reference, @t means that the data file’s first row is the field name. It is the parameter that the parentheses contains. The blank in parentheses in the above figure shows that all parameters are default.

Let’s look at how to code with those functions in esProc program:

1. Separator 
In the above figure, because tab is used as a separator in text file empolyee.txt, no input parameter is used in the parentheses of import function. The default character tab is used directly instead. If other separators need to be imported, they can be defined with parameters. An example: comma is used to separate fields in csv file supported by excel, so the code for import function is import@t(;”,”).   

2. Import a portion of the fields
When a portion of the fields, instead of all of them, is to be imported, parameters can be used to define them, e.g., import@t(EID,NAME,BIRTHDAY).
3. Data type designated to be imported
esProc automatically judges the type of imported fields. For instance, BIRTHDAY is in consistent with the format of date and time, so esProc will automatically import the datetime type. If the field is required to be imported as a string, the code may be like this: import@t(EID,NAME,BIRTHDAY:string). The data types compatible with esProc include bool, int, long, float, decimal, number, string, date, time and datetime. 

4. Importing data piecewise 
It is required to piecewise import data when the data size of a text file is larger than the memory. import function of esProc provides two input parameters to read data from designated starting bytes to end bytes. For instance, import@t(EID,NAME,BIRTHDAY;,1000:2000) reads data from the 1000th bytes to the 2000th bytes. Usually, text file data gets its meaning when read by rows. import function automatically adjusts the position of starting point and end point to ensure a full data row. In order to ensure that the imported data consists exclusively of full rows, when the starting point happens to be in the middle of a data row, the import function will skip the row and begin with the next row; but if it is the end point that happens to be in the same position, the reading will continue until the end of the row. Reading data piecewise rules that the starting character will not be read while the end character will. So by making each end point to be the starting one of the next reading, all data rows will be imported without omission and repetition. Thus, in the preceding example, import@t(EID,NAME,BIRTHDAY;,2000:3000) is the appropriate code for the next reading. 

It's worth pointing out that because the data should not be segmented according to rows, the current row number can only be determined after the traversal of all preceding rows. Otherwise, the expected excellent performance of importing data piecewise won’t be achieved.  

esProc also segments data automatically. For example, if a file is divided into three parts and the second one is to be read, the code is like this: import@tz(EID,NAME,BIRTHDAY:string;,2:3). With the option @z, meanings of the two parameters become the total number of parts and the serial number of current part waiting to be read. In this method, it is convenient for multiple parallel tasks to piecewise process a big file. 

5. The process and export of big files
After the file is imported to the memory, it can be exported or re-imported into the database. For example, after empolyee.txt is imported, NAME and SURNAME could be merged into FULLNAME to be exported to employee_result.txt. If data size of the file is big enough, text cursor can be used to process it. esProc cursor can take out file data one by one and make computations. In addition, it provides quite a lot of cursor functions, making code more concise.

A complete esProc code is as follows:


In the above figure, cell A2 defines a cursor of file; cell A3 calculates full name; cell A4 exports the result of A3 to the result file. 

6. Access hdfs files.
It’s easy to access hdfs files with esProc. Your just need to change file functions to hdfsfile functions, e.g., hdfsfile("hdfs://192.168.0.204:9000/user/root/student.txt").For a compressed file, change the extension, e.g., hdfsfile("hdfs://192.168.0.204:9000/user/root/student.gzip"). Thus hdfs compresses or decompresses according to the extensions of files when reading and writing hdfs files. Compress methods includes gzip, LZO and LZ4 etc.

7. General file operations
We can make general file operations with esProc. For example, functions, such as move, exists, date, size, property and file name, respectively have the function of move (delete), checking whether a file exists, modifying date, size checking, reading property of configuration files and splitting full path, as well as obtaining file names and extensions.

No comments:

Post a Comment