esProc, A Script Language for Data Analytics with Parallel Mechanism: Implement Basic Functions of SQL through R Language & esProc

As we know, SQL users have to finish composing all codes and then run them all at one time, resulting in inconvenient interactive computing. However, the simple and easy-to-understand query syntax of SQL is always welcomed by programmers. R language and esProc as the computing and analysis tool are surely required to offer the similar query syntax. In this essay, we will discuss how they implement the basic functions of SQL through R language and esProc through some examples.

The example data is from 2 tables of the classical Northwind database:
Orders table with the main fields: OrderID, EmployeeID, OrderDate, Freight, CustomerID

Customer table with the main fields: CustomerID, CompanyName

Retrieve data of the entire table
SQL: select * from Orders

R: A1<-sqlQuery(conn,'select * from Orders')

esProc: $select * from Orders

Comments: The word count of esProc and SQL codes only differs by one word. The R language has a strong flavor of programming.

Where: Search for the order which has a freightage higher than 100 and is placed before the year of 1998
SQL: select * from Orders where Freight > 100 and OrderDate < '1998-01-01'

R : subset(A1,Freight > 100 & OrderDate < as.POSIXlt('1998-01-01'))

esProc: =A1.select(Freight > 100 && OrderDate < date("1998-01-01"))

Comments: Both SQL and R solutions are close to SQL to some extent. R is of the typical function style, and esProc is of the typical object style.

Order: Sort by employees correctly, and then sort by freightage in reverse order
SQL: select * from Orders order by EmployeeID, Freight desc

R: A1[order(A1$ EmployeeID,-A1$Freight),]

esProc: =A1.sort(EmployeeID,Freight:-1)

Comments: R solution is to retrieve 2 vectors from A1 at first, pass them to the function order to group them together for sorting, and then export the serial numbers. At last, data will be rearranged according to the serial number. The computing process of R language is rather winding, not as straightforward as SQL. It is because that R is good at handling vector, the access to structured data will take column as the basic unit, and the parameter usually takes up a whole column. By comparison, SQL takes the record (column) as the basic unit, with parameter as the column name.

esProc solution resembles that of SQL because esProc takes the record (row) as the basic unit .

Group & Sum: Summarize by employee, sum up the freightage, and count the orders:
SQL: select EmployeeID, count(OrderID), sum(Freight) from Orders group by EmployeeID

R :
A4<-aggregate (A1$Freight,list(A1$EmployeeID),sum)
A4$count<-tapply(A1$Freight,A1$EmployeeID,length)

esProc : =A1.group(EmployeeID;EmployeeID,~.count(),~.sum(Freight))

Comments: In this case, it is obvious that R and SQL differ greatly. The algorithms available in R may be clearer for mathematicians, and means more learning efforts for users accustomed to SQL.

Join: Perform left join on Orders table and Customers table by CustomerID.
SQL :select * from Orders leftjoin Customers on Orders.CustomerID=Customers.CustomerID

R :merge(A1,B1,by.x="CustomerID",by.y="CustomerID",all.x=TRUE)

esProc: =join@1(A1:CustomerID:Orders, B1:CustomerID:Customers)

Comments: The join of SQL equals to join of esProc or the merge of R. Similarly, the left join of SQL equals to join@1 of esProc, or merge(...all.x=TRUE) of R. Obviously, esProc is more alike SQL in the respects of both the syntax conventions and the literal meanings.

Distinct: Remove the duplicate CustomerID
SQL: selectdistinct CustomerID fromOrders

R: unique(B2$CustomerID)

esProc: =B2.id(CustomerID)
Comments: The keywords of the two solutions respectively differ to that of SQL. However, their usages are basically the same to that of SQL. In which, R is the typical function style, and esProc is the typical object style.

Like: Search for the record with Island in ShipName
SQL : select * from Orders where ShipName like '%Island%'

R: subset(A1,grepl("Island",ShipName,ignore.case = TRUE))

esProc : =A1.select(like@c(ShipName ,"*Island*"))

Comments: R supports several means to match, including the regular expressions, and is more powerful than esProc in this respect. The usages of esProc are more close to that of SQL, and fit for those who are familiar with SQL.

As can be seen from the above comparison, esProc has a coding style more close to that of SQL since esProc supports the data structure of “Record” by the infrastructure, making it more suitable for users who are familiar with SQL. In addition, compared with R language, esProc provides the representation style of "object + functions" that is much easier for those who are familiar with database to accept.
R is more resourceful in details and third party support, ideal for mathematicians.

menu

July 28, 2014

Implement Basic Functions of SQL through R Language & esProc

No comments:

Post a Comment