Cloudera Impala is a SQL query engine for Hadoop. Impala is meant to be better suited to real-time SQL queries than MapReduce-based batch processing software such as Hive or Pig, because it does not depend on MapReduce. On one of my ongoing projects, which runs Impala on Hadoop, I had to set up connections to it from PHP so that the web developers could start using it.
This blog post explains how to query Impala using PHP and ODBC. Conveniently, Cloudera (the company behind Impala) provides an RPM containing ODBC drivers for Impala. What follows is how to install, configure and use the Impala ODBC drivers with PHP (version 5.4) on the 64-bit CentOS 7 Linux distribution.
Download the Impala ODBC drivers from http://www.cloudera.com/content/cloudera/en/downloads/connectors/impala/odbc/impala-odbc-v2-5-29.html. On this page, click the “Download Bits” button next to Linux RHEL6 – 64 Bit. After you fill in the popup form, the RPM will start downloading. At the time of writing, the filename of the RPM is ClouderaImpalaODBC-2.5.29.1009-1.el6.x86_64.rpm.
Install this RPM on your PHP powered web server as:
rpm -ivh ClouderaImpalaODBC-2.5.29.1009-1.el6.x86_64.rpm
On CentOS 7, the above command will fail with the following error:
cyrus-sasl-gssapi >= 2.1.22 is needed by ClouderaImpalaODBC-2.5.29.1009-1.x86_64
cyrus-sasl-plain >= 2.1.22 is needed by ClouderaImpalaODBC-2.5.29.1009-1.x86_64
libsasl2.so.2()(64bit) is needed by ClouderaImpalaODBC-2.5.29.1009-1.x86_64
Now install cyrus-sasl-gssapi and cyrus-sasl-plain as follows:
yum install cyrus-sasl-gssapi cyrus-sasl-plain
Now run the install again:
rpm -ivh ClouderaImpalaODBC-2.5.29.1009-1.el6.x86_64.rpm
This should now leave you with only one error:
libsasl2.so.2()(64bit) is needed by ClouderaImpalaODBC-2.5.29.1009-1.x86_64
Next, install the RPM with the --nodeps option:
rpm -ivh ClouderaImpalaODBC-2.5.29.1009-1.el6.x86_64.rpm --nodeps
To satisfy the missing libsasl2.so.2 dependency, symlink it to the libsasl2.so.3 already present in /usr/lib64:
ln -s /usr/lib64/libsasl2.so.3 /usr/lib64/libsasl2.so.2
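The symlink step can be made safe to re-run. The following is a minimal sketch, not part of the Cloudera instructions; link_sasl_compat is a hypothetical helper name, and the default directory /usr/lib64 is taken from the command above:

```shell
#!/bin/sh
# Hypothetical helper: create the libsasl2.so.2 compatibility symlink
# only when it is actually needed.
link_sasl_compat() {
    libdir="${1:-/usr/lib64}"
    target="$libdir/libsasl2.so.3"
    link="$libdir/libsasl2.so.2"
    # Stop if the library the symlink should point at is missing.
    [ -e "$target" ] || { echo "missing $target" >&2; return 1; }
    # Leave an existing file or symlink untouched.
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "$link already exists, nothing to do"
        return 0
    fi
    ln -s "$target" "$link"
}

# Usage (as root): link_sasl_compat
```

Calling it a second time just reports that the link exists instead of failing the way a bare ln -s would.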
Install unixODBC as follows:
yum install unixODBC
Configure ODBC
Copy the files odbc.ini and odbcinst.ini, found in /opt/cloudera/impalaodbc/Setup, to the /etc directory (overwriting the existing files). Next, open the file named cloudera.impalaodbc.ini, found in the directory /opt/cloudera/impalaodbc/lib/64, in a text editor. Comment out the line ODBCInstLib=libiodbcinst.so by prefixing it with a #, as follows:
#ODBCInstLib=libiodbcinst.so
Next, add the following line towards the end of the file:
ODBCInstLib=libodbcinst.so
Save the file.
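The same edit can be scripted, which is handy when setting up more than one web server. A minimal sketch, assuming the file contains the ODBCInstLib=libiodbcinst.so line exactly as shown above; use_unixodbc_instlib is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: switch the driver's ODBCInstLib setting
# from iODBC (libiodbcinst.so) to unixODBC (libodbcinst.so).
use_unixodbc_instlib() {
    ini="$1"
    tmp="$(mktemp)" || return 1
    # Comment out the active iODBC line; already-commented lines do not match.
    sed 's/^ODBCInstLib=libiodbcinst\.so/#&/' "$ini" > "$tmp" || return 1
    # Append the unixODBC line only if it is not already present.
    grep -q '^ODBCInstLib=libodbcinst\.so' "$tmp" ||
        echo 'ODBCInstLib=libodbcinst.so' >> "$tmp"
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$ini" && rm -f "$tmp"
}

# Usage (as root):
#   use_unixodbc_instlib /opt/cloudera/impalaodbc/lib/64/cloudera.impalaodbc.ini
```

Running it twice leaves the file unchanged after the first pass, so it should be safe to include in a provisioning script.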
The PHP side
Install php-odbc as:
yum install php-odbc
Reload the Apache web server:
service httpd reload
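Before trying the script, it can help to confirm that PHP actually sees the ODBC extension. A minimal sketch, assuming the php command-line binary is installed and reads the same extension configuration as Apache's PHP (which is not guaranteed on every setup); module_listed is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the module named in $1 appears as a
# whole line (case-insensitive) in the module list read from stdin.
module_listed() {
    grep -qix -- "$1"
}

# Usage (assumes the php CLI is installed):
#   php -m | module_listed odbc && echo "odbc extension loaded"
```

The whole-line match avoids a false positive from other modules whose names merely contain "odbc".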
Now the following PHP code should work and query data in Impala via ODBC:
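<?php
// Connect through the sample DSN from the ini files copied to /etc.
$dsn = "DSN=Sample Cloudera Impala DSN 64;host=192.168.0.5;port=21050;database=bigdata;";
$conn = odbc_connect($dsn, '', '');
if (!$conn) {
    die("Connection failed: " . odbc_errormsg());
}
$result = odbc_exec($conn, "select * from tbl limit 10");
while ($row = odbc_fetch_array($result)) {
    print_r($row);
}
odbc_close($conn);
?>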
In the above code, replace 192.168.0.5 with the hostname or IP address of a machine running an Impala datanode. If you point it at the namenode instead, you will get the following error:
PHP Warning: odbc_connect(): SQL error: [unixODBC][Cloudera][ImpalaODBC] (100) Error from the Impala Thrift API: connect() failed: Connection refused, SQL state S1000 in SQLConnect
Replace bigdata in the DSN with the name of your database, and replace tbl in the query select * from tbl limit 10 with the name of your table.
This blog post aims to get you started with PHP, ODBC and Impala using the default configuration and minimal changes. So feel free to change the ugly DSN name “Sample Cloudera Impala DSN 64” in /etc/odbcinst.ini, /etc/odbc.ini and in the PHP script.
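Renaming the DSN in the ini files can be scripted as well. A rough sketch, assuming the name appears as plain text in the file; rename_dsn and the new name impala are hypothetical, and the DSN in the PHP script still has to be updated by hand:

```shell
#!/bin/sh
# Hypothetical helper: replace every literal occurrence of an old DSN
# name with a new one in the given ini file.
rename_dsn() {
    ini="$1"; old="$2"; new="$3"
    tmp="$(mktemp)" || return 1
    awk -v old="$old" -v new="$new" '
        {
            # Literal (non-regex) replacement, one occurrence at a time.
            out = ""
            while ((i = index($0, old)) > 0) {
                out = out substr($0, 1, i - 1) new
                $0 = substr($0, i + length(old))
            }
            print out $0
        }' "$ini" > "$tmp" || return 1
    # Write back through the original file so its permissions are kept.
    cat "$tmp" > "$ini" && rm -f "$tmp"
}

# Usage (as root):
#   rename_dsn /etc/odbc.ini "Sample Cloudera Impala DSN 64" "impala"
```

It is worth keeping a backup copy of the ini file first, since the replacement touches every line that mentions the old name.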