1.什么是impala?
Impala是Cloudera公司主导开发的新型查询系统,它提供SQL语义,能查询存储在Hadoop的HDFS和HBase中的PB级大数据。已有的Hive系统虽然也提供了SQL语义,但由于Hive底层执行使用的是MapReduce引擎,仍然是一个批处理过程,难以满足查询的交互性。相比之下,Impala的最大特点也是最大卖点就是它的快速。
Impala组成
客户端:包括JDBC、ODBC、Hue、Impala Shell等,用于执行查询或完成管理任务;
Hive Metastore:存储可用于Impala数据的信息,包括可用数据库及其结构。当执行Impala Sql语句进行schema对象的创建、修改及删除,或加载数据到表中等操作时,相关元数据的变化,通过单独的catalog服务自动广播到所有Impala节点;
Cloudera Impala(Impalad进程):运行于数据节点的Impala程序,用于协调和执行查询。每一个Impala的实例可以获取、解析以及协调Impala客户端传来的查询。查询是被分布到各Impala节点间,这些节点作为workers,并行执行查询片段;
HDFS、HBase、kudu:数据的实际存储位置。
impala查询执处理过程
用户程序通过JDBC、ODBC、Impala Shell等Impala 客户端发送Sql语句给Impala;
用户程序连接到集群中任意Impalad进程,这一进程作为整个查询的协调器;
Impala解析、分析查询,确定哪些任务由集群中哪一Impalad实例执行,并生成最优执行计划;
Impalad实例访问对应HDFS、HBase服务,获取数据;
每一个Impalad实例将数据返回给协调器Impalad,由其发送结果给客户端。
2.环境准备
install kudu
详见代码仓库kudu模块里面的docker目录
install impala
docker run -d --name kudu-impala --add-host kudu-master-1:{ip} --add-host kudu-master-2:{ip} --add-host kudu-master-3:{ip} -p 21000:21000 -p 21050:21050 -p 25000:25000 -p 25010:25010 -p 25020:25020 --memory=4096m apache/kudu:impala-latest impala
需要注意增加主机映射关系,不然impala找不带kudu的机器。 访问http://127.0.0.1:25000/
run impala-shell
docker exec -it kudu-impala impala-shellCreate a Kudu Table
Now that you are in an impala-shell that is connected to Impala you can use an Impala DDL statement to create a Kudu table.
CREATE TABLE my_first_table(id BIGINT,name STRING,PRIMARY KEY(id))PARTITION BY HASH PARTITIONS 4STORED AS KUDU;DESCRIBE my_first_table;
Insert and Modify Data
With my_first_table created you can now use Impala DML statements to INSERT, UPDATE, UPSERT, and DELETE data.
-- Insert a row.INSERT INTO my_first_table VALUES (99, "sarah");SELECT * FROM my_first_table;-- Insert multiple rows.INSERT INTO my_first_table VALUES (1, "john"), (2, "jane"), (3, "jim");SELECT * FROM my_first_table;-- Update a row.UPDATE my_first_table SET name="bob" where id = 3;SELECT * FROM my_first_table;-- Use upsert to insert a new row and update another.UPSERT INTO my_first_table VALUES (3, "bobby"), (4, "grant");SELECT * FROM my_first_table;-- Delete a row.DELETE FROM my_first_table WHERE id = 99;SELECT * FROM my_first_table;-- Delete multiple rows.DELETE FROM my_first_table WHERE id < 3;SELECT * FROM my_first_table;
install hue
docker run -it -p 8888:8888 gethue/hue:latest拷贝配置文件出来
docker cpb72dfc588c76:/usr/share/hue/desktop/conf/z-hue-overrides.ini ./z-hue-overrides.ini编辑文件内容
[[database]]engine=mysqlhost=xxx.xxx.xxx.xxxport=3306user=xxxpassword=xxxname=database[impala]server_host=xxx.xxx.xxx.xxxserver_port=21050
将修改好的文件放回去
docker cp./z-hue-overrides.ini b72dfc588c76:/usr/share/hue/desktop/conf/z-hue-overrides.ini
重启
docker stop ${container_id}docker start ${container_id}
访问http://127.0.0.1:8888/

3.代码工程
实验目标
实现JDBC访问impala
pom.xml
因为ImpalaJDBC41是由Cloudera维护的,而且包并没有上传到maven仓库,所以只能去他官网下载 Cloudera’s official Connector site.然后本地引入
xml version="1.0" encoding="UTF-8"?>xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">org.springframework.boot spring-boot-starter-parent 3.2.1 4.0.0 impala 17 17 org.springframework.boot spring-boot-starter-web org.springframework.boot spring-boot-autoconfigure org.springframework.boot spring-boot-starter-test test com.cloudera.impala.jdbc ImpalaJDBC41 2.6.4.1005 system ${project.basedir}/src/lib/ImpalaJDBC41.jar
测试类
使用datasource方法
DataSource ds =new com.cloudera.impala.jdbc41.DataSource();ds.setURL(<CONNECTION_URL>);connection = ds.getConnection();
直连方式
String IMPALA_URL="jdbc:impala://:21050/ " Connection connection = DriverManager.getConnection(IMPALA_URL);
package com.et.impala;import com.cloudera.impala.jdbc41.DataSource;import java.sql.Connection;import java.sql.DriverManager;import java.sql.ResultSet;import java.sql.Statement;import java.util.Properties;public class ImpalaTest {public static void main(String[] args) {Connection connection = null;Statement statement = null;ResultSet resultSet = null;try {Class.forName("com.cloudera.impala.jdbc41.Driver");Properties p = new Properties();p.setProperty( "user", "System" );p.setProperty( "password", "Hyd20240531" );p.setProperty( "latency", "0" );p.setProperty( "communicationtimeout", "0" );String currentschema = "default";String url = "jdbc:impala://172.24.4.35:21050/"+currentschema;//DataSourceDataSource ds = new com.cloudera.impala.jdbc41.DataSource();ds.setURL(url);connection = ds.getConnection();//don't use DataSource//connection = DriverManager.getConnection( url, p );System.out.println("Schema: " + connection.getSchema());String sqlQuery = "SELECT * FROM my_first_table";statement = connection.createStatement();resultSet = statement.executeQuery(sqlQuery);while (resultSet.next()) {int column1 = resultSet.getInt("id");String column2 = resultSet.getString("name");System.out.println("id: " + column1 + ", name: " + column2);}} catch (Exception e) {e.printStackTrace();} finally {try {if (resultSet != null) {resultSet.close();}if (statement != null) {statement.close();}if (connection != null) {connection.close();}} catch (Exception e) {e.printStackTrace();}}}}
以上只是一些关键代码,所有代码请参见下面代码仓库
代码仓库
https://github.com/Harries/springboot-demo(impala)
4.测试
运行测试类main方法
Schema:id: 99, name: sarahid: 1, name: johnid: 2, name: janeid: 3, name: bobbyid: 4, name: grant
5.引用
https://tech-spaghetti.com/2019/01/27/jdbc-connection-to-impala/
https://www.cloudera.com/downloads/connectors/impala/jdbc/2-6-4.html