Distributed data storage in the Data Lake concept: where to start

In the enterprise world, there has come a saturation of front-end systems, data buses, and other classic systems that everyone has been implementing for the last 10-15 years. But one segment until recently remained in the status of "everyone wants it, but no one knows what it is." And that is Big Data. It sounds beautiful and is promoted by top Western companies - how could it not be a tempting morsel?



But while the majority is only looking around and asking the price, some companies have begun to actively implement solutions based on this technology stack in their IT landscape. An important role in this was played by the emergence of commercial distributions of Apache Hadoop, whose developers provide technical support to their customers. Sensing the need for such a solution, one of our clients decided to organize a distributed data warehouse in the Data Lake concept based on Apache Hadoop.

Project Goals


First, to optimize the work of the risk management department. Before the project started, an entire department was occupied with calculating credit risk factors (CRF), and all calculations were made manually. Each recalculation took about a month, and the data on which it was based had time to become obsolete. Therefore, the tasks of the solution included the daily loading of the data delta into the warehouse, the recalculation of the CRF, and the construction of data marts in a BI tool (SpagoBI functionality was enough for this task) to visualize them.


Secondly, to provide high-performance Data Mining tools for bank employees involved in Data Science. Tools such as Jupyter and Apache Zeppelin can be installed locally and used to explore data and build models. But integrating them with the Cloudera cluster makes it possible to use the hardware resources of the most productive nodes of the system for calculations, which speeds up data analysis tasks by tens or even hundreds of times.


The Oracle Big Data Appliance rack was chosen as the target hardware solution, so the Apache Hadoop distribution from Cloudera was taken as the basis. The rack took quite a long time to arrive, and to speed up the process, servers in the customer's private cloud were allocated for this project. The decision was reasonable, but there were a number of problems, which I will discuss below.


The project planned the following tasks:

  1. Deploy Cloudera CDH (Cloudera's Distribution including Apache Hadoop) and the additional services required for work.
  2. Set up the installed software.
  3. Configure continuous integration to speed up the development process (to be covered in a separate article).
  4. Install BI tools for building reports and Data Discovery tools to support the work of data users (also to be covered in a separate post).
  5. Develop applications for loading the necessary data from source systems, as well as for regularly updating it.
  6. Develop reporting forms to visualize the data in the BI tool.

Neoflex has been developing and implementing systems based on Apache Hadoop for many years and even has its own product for visual development of ETL processes: Neoflex Datagram. I had long wanted to take part in a project of this class and gladly took on the administration of this system. The experience turned out to be very valuable and motivated me to study the topic further, so I hasten to share it with you. I hope it will be interesting.


Resources


Before starting the installation, it is recommended to prepare everything necessary for it.
The amount and power of hardware depend on how many environments need to be deployed and what kind. For development purposes, you can install all the components on a single feeble virtual machine, but this approach is not encouraged.


At the stage of project piloting and active development, when the number of users of the system was minimal, one basic environment was enough - it allowed us to speed up by reducing the time for loading data from source systems (the most frequent and lengthy procedure in developing data warehouses). Now that the system has stabilized, we have arrived at a configuration with three environments - test, pre-prod, and prod (the main one).


In the private cloud, servers were allocated for the organization of two environments - the main one and the test one. The environment specifications are shown in the table below:

Purpose                                 | Count | vCPU | vRAM, Gb | Disks, Gb
Main environment, Cloudera services     | 3     | 8    | 64       | 2,200
Main environment, HDFS                  | 3     | 22   | 288      | 5,000
Main environment, Data Discovery tools  | 1     | 16   | 128      | 2,200
Test environment, Cloudera services     | 1     | 8    | 64       | 2,200
Test environment, HDFS                  | 2     | 22   | 256      | 4,000
Test environment, Data Discovery tools  | 1     | 16   | 128      | 2,200
CI                                      | 2     | 6    | 48       | 1,000

Later, the main environment migrated to the Oracle BDA, and the freed-up servers were used to organize the pre-prod environment.


The decision to migrate was justified - the resources allocated for the HDFS servers were objectively insufficient. The bottlenecks were the tiny disks (what is 5 Tb for Big Data?) and processors not powerful enough, stably loaded at 95% by regular data loading tasks. With the other servers, the situation is the opposite - almost all the time they stand idle, and their resources could be put to better use in other projects.


The situation with the software was not easy either - since the development was conducted in a private cloud without Internet access, all files had to be transferred through the security service, and only by agreement. In this regard, all the necessary distributions, packages, and dependencies had to be downloaded in advance.


In this difficult task, the keepcache=1 setting in the /etc/yum.conf file helped a lot (RHEL 7.3 was used as the OS) - installing the necessary software on a machine with network access is much easier than downloading it package by package from the repositories along with all the dependencies ;)
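For reference, flipping this flag can be done in one line (a minimal sketch assuming the stock /etc/yum.conf, where keepcache=0 is present by default):

 # enable the RPM cache so installed packages remain in /var/cache/yum
 sed -i 's/^keepcache=0/keepcache=1/' /etc/yum.conf
 grep keepcache /etc/yum.conf   # should now print keepcache=1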

What you need to download:

  1. Oracle JDK (you can't get anywhere without Java).
  2. A database for storing the information created and used by the CDH services (for example, the Hive Metastore). In our case, PostgreSQL version 9.2.18 was installed, but any of the databases supported by Cloudera can be used (the list differs between versions of the distribution, see the "Requirements and Supported Versions" section of the official site). It should be noted here that the choice of database was not entirely successful - the Oracle BDA ships with a MySQL database (one of Oracle's products, which came to them with the purchase of Sun), and it would have been more logical to use the same database for the other environments, which would have simplified the migration procedure. It is recommended to choose the database with the target hardware solution in mind.
  3. The chrony daemon for time synchronization on the servers.
  4. Cloudera Manager Server.
  5. Cloudera Manager daemons.

Preparing to install


Before starting the installation of CDH, it is worth carrying out a series of preparatory steps. Some of them are useful during installation, others will simplify operation.


Install and configure the OS


First of all, it is worth preparing the virtual (or physical) machines on which the system will be located: install a supported version of the operating system on each of them (the list differs between versions of the distribution, see the "Requirements and Supported Versions" section of the official site), assign friendly host names (for example, <system_name>master1,2,3..., <system_name>slave1,2,3...), and also partition the disks for file storage and for the temporary files created during system operation.
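As an illustration of the naming scheme, the /etc/hosts file on each machine might look like this (a sketch with hypothetical addresses, assuming the system name "dwh"):

 # /etc/hosts - hypothetical cluster addresses for illustration
 10.0.0.11  dwhmaster1
 10.0.0.12  dwhmaster2
 10.0.0.13  dwhmaster3
 10.0.0.21  dwhslave1
 10.0.0.22  dwhslave2
 10.0.0.23  dwhslave3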


The disk partitioning recommendations are as follows:


Configuring http server and offline installation of yum and CDH packages


Since the software is installed without Internet access, to simplify package installation it is recommended to set up an HTTP server and use it to create a local repository accessible over the network. You could install all the software locally using, for example, rpm, but with a large number of servers and the appearance of several environments it is convenient to have a single repository from which packages can be installed without transferring them from machine to machine by hand.


The installation was performed on RHEL 7.3, so the article contains commands specific to it and to other CentOS-based operating systems. When installing on other operating systems the sequence will be similar, only the package managers will differ.
In order not to write sudo everywhere, we assume that the installation is performed as root.


Here is what you need to do:
1. Select the machine on which the HTTP server and distributions will be located.
2. On a machine with a similar OS but connected to the Internet, set the keepcache=1 flag in the /etc/yum.conf file and install httpd with all its dependencies:

yum install httpd 

If this command does not work, you need to add to the list of repositories a yum repository that contains these packages, for example this one - centos.excellmedia.net/7/os/x86_64:

 echo -e "\n[centos.excellmedia.net]\nname=excellmedia\nbaseurl=http://centos.excellmedia.net/7/os/x86_64/\nenabled=1\ngpgcheck=0" > /etc/yum.repos.d/excell.repo 

After that, use the yum repolist command to check that the repository has been picked up - the added repository should appear in the list (repo id - centos.excellmedia.net; repo name - excellmedia).
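The check might look roughly like this (output abbreviated; the package count is hypothetical):

 yum repolist

 # repo id                    repo name      status
 # centos.excellmedia.net     excellmedia    9,591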
Now check that yum sees the packages we need:

 yum list | grep httpd 

If the output contains the necessary packages, you can install them with the command above.

3. To create the yum repository, we need the createrepo package. It is also in the above repository and is installed in the same way:

 yum install createrepo 

4. As I said, a database is required for the CDH services to work. Let's install PostgreSQL for this purpose:

 yum install postgresql-server 

5. One of the prerequisites for the correct operation of CDH is time synchronization on all servers in the cluster. The chronyd package is used for this (on the OS where I had to deploy CDH it was installed by default). Check that it is present:

 chronyd -v 

If it is not installed, then install:

 yum install chrony 

If it is already installed, just download it:

 yumdownloader --destdir=/var/cache/yum/x86_64/7Server/<repo id>/packages chrony 

6. At the same time, immediately download the packages needed to install CDH. They are available at archive.cloudera.com: archive.cloudera.com/cm<CDH major version>/<OS name>/<OS version>/x86_64/cm/<full CDH version>/RPMS/x86_64/. You can download the packages by hand (cloudera-manager-server and cloudera-manager-daemons) or, by analogy with paragraph 2, add the repository and install them (an example repository description is given after the command):

 yum install cloudera-manager-daemons cloudera-manager-server 
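For example, the repository description might look like this (a sketch assuming CDH 5.12.1 on RHEL 7, following the URL pattern above - substitute your own versions):

 echo -e "\n[cloudera-manager]\nname=cloudera-manager\nbaseurl=https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/5.12.1/\nenabled=1\ngpgcheck=0" > /etc/yum.repos.d/cloudera-manager.repo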

7. After installation, the packages and their dependencies are cached in the /var/cache/yum/x86_64/7Server/<repo id>/packages folder. Transfer them to the machine selected for the HTTP server and distributions, and install them:

 rpm -ivh <package names> 

8. Start httpd, make it visible from the other hosts of our cluster, and also add it to the list of services started automatically after boot:

 systemctl start httpd
 systemctl enable httpd
 systemctl stop firewalld      # open access from the other hosts for the duration of setup
 systemctl disable firewalld   # keep the firewall off after reboots
 setenforce 0

9. Now we have a working HTTP server. Its working directory is /var/www/html. Create two folders in it - one for the yum repository, the other for the Cloudera parcels (more on them later):

 cd /var/www/html
 mkdir yum_repo parcels

10. Java is needed for the Cloudera services to work. The same JDK version must be installed on all machines; Cloudera recommends Oracle HotSpot. Download the distribution from the official site (http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) and transfer it to the yum_repo folder.
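The transfer itself can be done with scp (a sketch; the file name matches the JDK 8u161 rpm installed below, and <httpd host> is a placeholder):

 scp jdk-8u161-linux-x64.rpm root@<httpd host>:/var/www/html/yum_repo/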

11. Create a yum repository in the yum_repo folder using the createrepo utility to make the JDK package available for installation from the cluster machines:

 createrepo -v /var/www/html/yum_repo/ 

12. After creating our local repository, on each of the hosts you need to add its description, by analogy with paragraph 2:

 echo -e "\n[yum.local.repo]\nname=yum_repo\nbaseurl=http://<httpd host address>/yum_repo/\nenabled=1\ngpgcheck=0" > /etc/yum.repos.d/yum_repo.repo 

You can also run checks similar to those in paragraph 2.

13. The JDK is now available, install it:

 yum install jdk1.8.0_161.x86_64 

For Java to work, you need to set the JAVA_HOME variable. I recommend exporting it immediately after installation, and also writing it to the /etc/environment and /etc/default/bigtop-utils files so that it is automatically exported after the servers restart and its location is provided to the CDH services:

 export JAVA_HOME=/usr/java/jdk1.8.0_161
 echo "JAVA_HOME=/usr/java/jdk1.8.0_161" >> /etc/environment
 echo "export JAVA_HOME=/usr/java/jdk1.8.0_161" >> /etc/default/bigtop-utils

14. In the same way, install chronyd on all the cluster machines (if, of course, it is absent):

 yum install chrony 

15. Select the host on which PostgreSQL will work, and install it:

 yum install postgresql-server 

16. Similarly, select the host on which the Cloudera Manager will work, and install it:

 yum install cloudera-manager-daemons cloudera-manager-server 

17. The packages are installed; you can now begin configuring the software prior to the CDH installation.

Addition:


During the development and operation of the system, you will need to add packages to the yum repository in order to install them on the cluster hosts (for example, the Anaconda distribution). To do this, in addition to transferring the files to the yum_repo folder, you need to update the repository metadata and refresh the cache on the hosts, as sketched below.
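A typical sequence, assuming the repository layout described above (the --update flag rebuilds only the changed metadata):

 # on the HTTP server: rebuild the repository metadata after adding new rpm files
 createrepo -v --update /var/www/html/yum_repo/

 # on each cluster host: drop the cached metadata so yum sees the new packages
 yum clean all
 yum makecache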


Configuring auxiliary software


It is time to configure PostgreSQL and create databases for our future services. These settings are relevant for CDH version 5.12.1; when installing other versions of the distribution, it is recommended to read the "Cloudera Manager and Managed Service Datastores" section of the official site.


First, we will initialize the database:


 postgresql-setup initdb 

Now let's set up network interaction with the database. In the /var/lib/pgsql/data/pg_hba.conf file, in the "IPv4 local connections" section, we change the method for 127.0.0.1/32 to "md5", add a line for it with the "trust" method, and add the cluster subnet, also with the "trust" method:

 vi /var/lib/pgsql/data/pg_hba.conf

 pg_hba.conf:
 -----------------------------------------------------------------------
 # TYPE  DATABASE        USER            ADDRESS                 METHOD
 # "local" is for Unix domain socket connections only
 local   all             all                                     peer
 # IPv4 local connections:
 host    all             all             127.0.0.1/32            md5
 host    all             all             127.0.0.1/32            trust
 host    all             all             <cluster_subnet>        trust
 -----------------------------------------------------------------------

Then we make some adjustments to the /var/lib/pgsql/data/postgresql.conf file (only the lines that need to be changed or checked are given):

 vi /var/lib/pgsql/data/postgresql.conf

 postgresql.conf:
 -----------------------------------------------------------------------
 listen_addresses = '*'
 max_connections = 100
 shared_buffers = 256MB
 checkpoint_segments = 16
 checkpoint_completion_target = 0.9
 logging_collector = on
 log_filename = 'postgresql-%a.log'
 log_truncate_on_rotation = on
 log_rotation_age = 1d
 log_rotation_size = 0
 log_timezone = 'W-SU'
 datestyle = 'iso, mdy'
 timezone = 'W-SU'
 lc_messages = 'en_US.UTF-8'
 lc_monetary = 'en_US.UTF-8'
 lc_numeric = 'en_US.UTF-8'
 lc_time = 'en_US.UTF-8'
 default_text_search_config = 'pg_catalog.english'
 -----------------------------------------------------------------------

After the configuration is complete, you need to create databases (for those who are closer to Oracle terminology - schemas) for the services that we will install. In our case, the following services were installed: Cloudera Management Service, HDFS, Hive, Hue, Impala, Oozie, YARN, and ZooKeeper. Of these, Hive, Hue, and Oozie need databases, and two more databases are needed by the Cloudera services - one for the Cloudera Manager server and the other for the report manager included in the Cloudera Management Service. Start PostgreSQL and add it to autostart:

 systemctl start postgresql
 systemctl enable postgresql

Now we can connect and create the necessary databases:

 sudo -u postgres psql
 > CREATE ROLE scm LOGIN PASSWORD '<password>';
 > CREATE DATABASE scm OWNER scm ENCODING 'UTF8';            -- database for the Cloudera Manager server
 > CREATE ROLE rman LOGIN PASSWORD '<password>';
 > CREATE DATABASE rman OWNER rman ENCODING 'UTF8';          -- database for the report manager
 > CREATE ROLE hive LOGIN PASSWORD '<password>';
 > CREATE DATABASE metastore OWNER hive ENCODING 'UTF8';     -- database for the Hive Metastore
 > ALTER DATABASE metastore SET standard_conforming_strings = off;  -- required for PostgreSQL 8.2.23 and higher
 > CREATE ROLE hue_u LOGIN PASSWORD '<password>';
 > CREATE DATABASE hue_d OWNER hue_u ENCODING 'UTF8';        -- database for Hue
 > CREATE ROLE oozie LOGIN ENCRYPTED PASSWORD '<password>' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
 > CREATE DATABASE "oozie" WITH OWNER = oozie ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' CONNECTION LIMIT = -1;  -- database for Oozie; when done, exit:
 > \q

Databases for the other services are created in a similar way.


Do not forget to run the script that prepares the Cloudera Manager server database, passing it the connection details for the database created for it:

 /usr/share/cmf/schema/scm_prepare_database.sh postgresql scm scm <password> 

Creating a repository with CDH files


Cloudera provides two ways to install CDH - using packages and using parcels. The first option involves downloading a set of packages of the required service versions and installing them one by one. This method gives more flexibility in cluster configuration, but Cloudera does not guarantee the compatibility of the packages. Therefore, the second option, installation using parcels - pre-assembled sets of packages of mutually compatible versions - is more popular. The latest versions are available at archive.cloudera.com/cdh5/parcels/latest; earlier ones can be found one directory level higher. In addition to the parcel with CDH, you need to download manifest.json from the same repository directory.
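For example, fetching the CDH 5.12.1 parcel used in this project might look like this (a sketch; the exact parcel file name, here an el7 build, should be taken from the directory listing or from manifest.json):

 wget https://archive.cloudera.com/cdh5/parcels/5.12.1/CDH-5.12.1-1.cdh5.12.1.p0.3-el7.parcel
 wget https://archive.cloudera.com/cdh5/parcels/5.12.1/manifest.json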


For the operation of the developed functionality we also needed Spark 2.2, which is not included in the CDH parcel (only the first version of this service is available there). To install it, you need to download a separate parcel with this service and the corresponding manifest.json, which are also available in the Cloudera archive.


After downloading the parcels and the manifest.json files, transfer them to the appropriate folders of our repository. Create separate folders for the CDH and Spark files:

 cd /var/www/html/parcels
 mkdir cdh spark

Move the parcels and the manifest.json files into the created folders. To make them available for installation over the network, grant the parcels folder the appropriate access rights:

 chmod -R ugo+rX /var/www/html/parcels 
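To verify that the files are reachable over the network, you can request a directory listing from any cluster host (a quick check; <httpd host> is a placeholder, and directory indexes are enabled by default in the RHEL httpd config):

 curl http://<httpd host>/parcels/cdh/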

You can start installing CDH, which I will discuss in the next post.

Source: https://habr.com/ru/post/413149/

