Hadoop – Hive / Impala

Hive / Impala

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
Cloudera Impala is a query engine that runs on Apache Hadoop.




# Normal
SELECT name, age FROM mytable LIMIT 10;

# Regex Querie 
SELECT '(ds|hr)?+.+' FROM sales;

# Subqueries
  (SELECT c1 + c2 AS total FROM mytable) my_query;

# Join

Create table

CREATE TABLE mytable (name string, age int)

Load Data

LOAD DATA LOCAL INPATH 'path/mydata/data.txt'
INTO TABLE mytable;

Load from an existing table

INSERT INTO birthdays SELECT firstname, lastname, birthday FROM customers WHERE birthday IS NOT NULL;

Apache access log to hive

First copy an apache access log to hdfs.
sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/original_access_logs
sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/log_files/access.log.2 /user/hive/warehouse/original_access_logs
After the copy ends, create a table using hive with that hdfs file as source:
CREATE EXTERNAL TABLE intermediate_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    'input.regex' = '([^ ]*) - - \\[([^\\]]*)\\] "([^\ ]*) ([^\ ]*) ([^\ ]*)" (\\d*) (\\d*) "([^"]*)" "([^"]*)"',
    'output.format.string' = "%1$$s %2$$s %3$$s %4$$s %5$$s %6$$s %7$$s %8$$s %9$$s")
LOCATION '/user/hive/warehouse/original_access_logs';
CREATE EXTERNAL TABLE tokenized_access_logs (
    ip STRING,
    date STRING,
    method STRING,
    url STRING,
    http_version STRING,
    code1 STRING,
    code2 STRING,
    dash STRING,
    user_agent STRING)
LOCATION '/user/hive/warehouse/tokenized_access_logs';
ADD JAR /usr/lib/hive/lib/hive-contrib.jar;
INSERT OVERWRITE TABLE tokenized_access_logs SELECT * FROM intermediate_access_logs;
Now you can work with impala to query this data.
invalidate metadata; -- to reload and read this new table.
select count(*),url from tokenized_access_logs
where url like '%\/product\/%'
group by url order by count(*) desc;

Hadoop – HCatalog

HCatalog presents a relational view of data. Data is stored in tables and these tables can be placed into databases. Tables can also be partitioned on one or more keys. For a given value of a key (or set of keys) there will be one partition that contains all rows with that value (or set of values).





hcat.py -g mygrpoup ...
# Indicates to HCatalog that table that needs to be created must have group as "mygroup"

hcat.py -p rwxr-xr-x ...
# Indicates to HCatalog to set permissions as "rwxr-xr-x"

hcat.py -f myscript.hcatalog ...
# Indicates to HCatalog that myscript.hcatalog is a file which contains DDL commands it needs to execute

hcat.py -e "create table mytrable (a int) ...
# Indicates to HCatalog to treat the following string as DDL command and execute it

Hadoop – Secondary namenode

​Namenodes are where the block’s metadata and directions are stored, and, that is done using 2 files, fsimage and edits. The fsimage is the first complete snapshot of the filesystem metadata, whereas edits contains only incremental modifications.

But, over time, the edits file grows and grows (its a log-based system) and it would take a long time to restore in the event of server failure. The edits file needs to be periodically applied to the fsimage. The problem is that the namenode may not have the avaliable resources (RAM/CPU) to do on-hot. This is where the secondary namenode comes in.
* The secondary namenode asks to the namenode to roll its edits file and begin writing to edits.new.
* The secundary namenode copies the namenode’s fsimage and edits to its local checkpoint directory.
* The secundary namenode “merge” fsimage and edits and write a new compacted fsimage to disk.
* The secundary nm sends the new fsimage file to the namenote, which adopts it (Renaming the edits.new to edits.).



  • HDFS is a highly-available, distributed file system.
    • A NameNode manages the file system.
    • A DataNode provides storage resources to the file system.
  • YARN is the data operating system for Hadoop that supports different types of distributed applications.
    • A ResourceManager manages cluster CPU and memory resources and features a pluggable job scheduler.
    • A NodeManager provides CPU and memory resources to the cluster.
    • An ApplicationMaster starts, monitors, restarts, and stops applications.
  • MapReduce is a framework, and not a program, used to process data in parallel across many machines.
    • It works in cooperation with HDFS to achieve scalability and performance.
  • MapReduce is so named because there is a map phase and a reduce phase.
    • The main purpose of the map phase is to read all of the input data and transform or filter it.
    • The main purpose of the reduce phase is to employ business logic to answer a question or solve a problem.
  • Use cases for MapReduce include data analytics, data mining, and full-text indexing.




Comand-Line Tools (hdfs)

Hadoop come with a command-line tool called hadoop.

The usage of it’s like unix commands, example:

# Create dir
hadoop fs -mkdir /user/hive/warehouse/original_access_logs

# List
hadoop fs -ls /user/hive/warehouse/original_access_logs

# Set the replication factor of a file
hadoop fs -setrep 5 -R /user/hive/warehouse

# Inspect and show file health
hadoop fsck /user/hive/warehouse -files -blocks -locations

* You can also access HDFS via REST Using WebHDFS.


Sqoop is a tool, which can be used to import our relational data from MySQL into HDFS.
Usage example:
sqoop import-all-tables \
-m 1 \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--compression-codec=snappy \
--as-parquetfile \
--warehouse-dir=/user/hive/warehouse \