First, to optimize the work of the risk management department. Before the project began, an entire department was occupied with calculating credit risk factors (CRF), and all calculations were done manually. Each recalculation took about a month, and the data it was based on had time to become obsolete. The solution therefore had to load the daily data delta into the warehouse, recalculate the CRF every day, and build data marts in a BI tool to visualize them (SpagoBI's functionality was sufficient for this task).
Secondly, to provide bank employees involved in Data Science with high-performance Data Mining tools. Tools such as Jupyter and Apache Zeppelin can be installed locally and used to explore data and build models, but integrating them with the Cloudera cluster makes it possible to run calculations on the most powerful nodes of the system, which speeds up data analysis tasks by tens or even hundreds of times.
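To make this concrete, here is a minimal sketch (not the project's actual configuration) of launching a Jupyter-fronted PySpark session on the YARN cluster instead of in local mode; the resource figures are invented for the example:

```bash
# Use Jupyter as the PySpark driver frontend; the executors run on the YARN cluster.
# The resource numbers below are illustrative only.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=0.0.0.0 --port=8888'
pyspark --master yarn --deploy-mode client \
        --num-executors 10 --executor-cores 4 --executor-memory 8g
# With Cloudera's Spark 2 parcel, the binary is called pyspark2 instead of pyspark.
```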
The Oracle Big Data Appliance rack was chosen as the target hardware, so the Apache Hadoop distribution from Cloudera was taken as the basis. The rack took quite a long time to be delivered, and to speed up the process, servers in the customer's private cloud were allocated for the project. The decision was reasonable, but it caused a number of problems, which I will discuss below.
The following tasks were planned within the project:
Neoflex has been developing and implementing systems based on Apache Hadoop for many years and even has its own product for visual development of ETL processes, Neoflex Datagram. I had long wanted to take part in a project of this class and gladly took on the administration of this system. The experience turned out to be very valuable and motivated me to study the topic further, so I hasten to share it with you. I hope it will be interesting.
Before starting the installation, it is recommended to prepare everything necessary for it.
The amount and power of the hardware depends on how many environments need to be deployed and what they are for. For development purposes, you can squeeze all the components onto a single modest virtual machine, but this approach is not recommended.
At the stage of piloting and active development, when the number of users of the system was minimal, one basic environment was enough: it sped things up by reducing the time for loading data from source systems (the most frequent and lengthy procedure in developing data warehouses). Now that the system has stabilized, we have arrived at a configuration with three environments: test, pre-production and production (the main one).
In the private cloud, servers were allocated for two environments, the main one and the test one. The server specifications are shown in the table below:
Purpose | Amount | vCPU | vRAM, Gb | Disks, Gb |
---|---|---|---|---|
Core environment, Cloudera services | 3 | 8 | 64 | 2200
Core environment, HDFS | 3 | 22 | 288 | 5000
Core environment, Data Discovery tools | 1 | 16 | 128 | 2200
Test environment, Cloudera services | 1 | 8 | 64 | 2200
Test environment, HDFS | 2 | 22 | 256 | 4000
Test environment, Data Discovery tools | 1 | 16 | 128 | 2200
CI | 2 | 6 | 48 | 1000
Later, the main environment migrated to the Oracle BDA, and the freed-up servers were used to organize the pre-production environment.
The decision to migrate was justified: there were objectively not enough resources allocated to the HDFS servers. The bottlenecks were the tiny disks (what is 5 Tb for Big Data?) and the insufficiently powerful processors, which sat at a steady 95% load during routine data loading tasks. With the other servers the situation is the opposite: almost all the time they stand idle, and their resources could be put to better use in other projects.
The situation with the software was not easy either: since development was conducted in a private cloud without Internet access, all files had to be transferred through the security service, and only by agreement. Because of this, all the necessary distributions, packages and dependencies had to be downloaded in advance.
The keepcache=1 setting in the /etc/yum.conf file helped a lot in this difficult task (RHEL 7.3 was used as the OS): installing the necessary software on a machine with network access is much easier than pulling it out of the repositories together with all its dependencies ;)
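For concreteness, here is a minimal sketch of the setting and of where yum leaves the downloaded rpms (the cache path is the standard RHEL 7 layout; the repo id segment varies from system to system):

```bash
# In /etc/yum.conf, [main] section (add the line by hand if it is not there):
#   keepcache=1
sed -i 's/^keepcache=0/keepcache=1/' /etc/yum.conf

# After any "yum install <package>", the rpms, dependencies included, remain in the cache:
ls /var/cache/yum/x86_64/7Server/*/packages/*.rpm
```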
What you need to deploy (everything on this list appears in the steps below): the httpd and createrepo packages for the local repository, postgresql-server, chrony, the Cloudera Manager packages (cloudera-manager-daemons, cloudera-manager-server), a JDK, and the CDH and Spark 2 parcels with their manifest.json files.
Before starting the installation of CDH, it is worth doing a series of preparatory steps. Some of them are useful during installation, others will simplify operation.
First of all, prepare the virtual (or physical) machines on which the system will reside: install a supported version of the operating system on each of them (the list differs between versions of the distribution; see the "Requirements and Supported Versions" section of the official site), assign friendly host names (for example, <system_name>-master1,2,3..., <system_name>-slave1,2,3...), and partition the disks for file storage and for the temporary files created during system operation.
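For example, preparing the names might look like this (the host names and addresses below are invented for illustration; substitute your own):

```bash
# On each machine: assign a friendly host name (invented example name).
hostnamectl set-hostname dwh-master1.bank.local

# Make all cluster hosts resolvable from each other, e.g. via /etc/hosts:
cat >> /etc/hosts <<'EOF'
10.0.0.11  dwh-master1.bank.local  dwh-master1
10.0.0.21  dwh-slave1.bank.local   dwh-slave1
10.0.0.22  dwh-slave2.bank.local   dwh-slave2
EOF
```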
The recommendations for partitioning the disks are as follows:
Since the software is installed without Internet access, it is recommended to set up an HTTP server and use it to create a local repository accessible over the network; this simplifies package installation. You could install all the software locally with, say, rpm, but with a large number of servers and the appearance of several environments it is convenient to have a single repository from which packages can be installed without transferring them from machine to machine by hand.
The installation was performed on RHEL 7.3, so the article contains commands specific to it and to other CentOS-based operating systems. On other operating systems the sequence will be similar; only the package managers will differ.
In order not to write sudo everywhere, we assume the installation is performed as root.
Here is what you need to do:
1. Select the machine on which the HTTP server and distributions will be located.
2. On a machine with a similar OS, but connected to the Internet, set the keepcache=1 flag in the /etc/yum.conf file and install httpd with all its dependencies:
```bash
yum install httpd

# If httpd (or any other package) is not visible, hook up a mirror repository:
echo -e "\n[centos.excellmedia.net]\nname=excellmedia\nbaseurl=http://centos.excellmedia.net/7/os/x86_64/\nenabled=1\ngpgcheck=0" > /etc/yum.repos.d/excell.repo

# Check that the package is now visible:
yum list | grep httpd

# Install the rest of the software whose rpms we want to collect in the cache:
yum install createrepo
yum install postgresql-server

# chrony may already be installed: check, and if so, pull its rpm into the cache manually:
chronyd -v
yum install chrony
yumdownloader --destdir=/var/cache/yum/x86_64/7Server/<repo id>/packages chrony

yum install cloudera-manager-daemons cloudera-manager-server

# Transfer the cached rpms to the offline machine and install them there:
rpm -ivh <package files>
```
```bash
systemctl start httpd
systemctl enable httpd
systemctl stop firewalld      # turn the firewall off
systemctl disable firewalld   # so that it does not interfere with access to the repository
setenforce 0
```
```bash
cd /var/www/html
mkdir yum_repo parcels
```
```bash
createrepo -v /var/www/html/yum_repo/
```
```bash
# On each cluster host, point yum at the local repository:
echo -e "\n[yum.local.repo]\nname=yum_repo\nbaseurl=http://<httpd host>/yum_repo/\nenabled=1\ngpgcheck=0" > /etc/yum.repos.d/yum_repo.repo
```
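A quick way to make sure the hosts actually see the new repository (standard yum commands, nothing project-specific):

```bash
# On a cluster host: flush cached metadata and confirm yum.local.repo is listed.
yum clean all
yum repolist

# Fetching the repo metadata directly also checks the HTTP server side:
curl http://<httpd host>/yum_repo/repodata/repomd.xml
```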
```bash
yum install jdk1.8.0_161.x86_64
```
```bash
export JAVA_HOME=/usr/java/jdk1.8.0_161
echo "JAVA_HOME=/usr/java/jdk1.8.0_161" >> /etc/environment
echo "export JAVA_HOME=/usr/java/jdk1.8.0_161" >> /etc/default/bigtop-utils
```
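A simple sanity check that the variable points at a working JDK (assuming the path above):

```bash
# Should print the jdk1.8.0_161 path and the Java version banner.
echo $JAVA_HOME
$JAVA_HOME/bin/java -version
```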
```bash
# On all cluster hosts: time synchronization.
yum install chrony

# On the host chosen for Cloudera Manager: its database and the Cloudera Manager packages.
yum install postgresql-server
yum install cloudera-manager-daemons cloudera-manager-server
```
During the development and operation of the system you will need to add packages to the yum repository in order to install them on the cluster hosts (for example, the Anaconda distribution). To do this, in addition to transferring the files to the yum_repo folder, the following actions are required:
```bash
# On the repository host: re-index the repository.
createrepo -v --update /var/www/html/yum_repo/

# On the cluster hosts: reset the yum cache so the new packages become visible.
yum clean all
rm -rf /var/cache/yum
```
Now it is time to configure PostgreSQL and create databases for our future services. These settings are relevant for CDH 5.12.1; when installing other versions of the distribution, it is recommended to read the "Cloudera Manager and Managed Service Datastores" section of the official site.
First, we will initialize the database:
```bash
postgresql-setup initdb
```
```
vi /var/lib/pgsql/data/pg_hba.conf

pg_hba.conf:
-----------------------------------------------------------------------
# TYPE  DATABASE        USER            ADDRESS                 METHOD
# "local" is for Unix domain socket connections only
local   all             all                                     peer
# IPv4 local connections:
host    all             all             127.0.0.1/32            md5
host    all             all             127.0.0.1/32            trust
host    all             all             <cluster_subnet>        trust
-----------------------------------------------------------------------
```
Then we will make some adjustments to the file /var/lib/pgsql/data/postgresql.conf (I will give only the lines that need to be changed or checked):
```
vi /var/lib/pgsql/data/postgresql.conf

postgresql.conf:
-----------------------------------------------------------------------
listen_addresses = '*'
max_connections = 100
shared_buffers = 256MB
checkpoint_segments = 16
checkpoint_completion_target = 0.9
logging_collector = on
log_filename = 'postgresql-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_timezone = 'W-SU'
datestyle = 'iso, mdy'
timezone = 'W-SU'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
-----------------------------------------------------------------------
```
After the configuration is complete, you need to create databases (for those closer to Oracle terminology: schemas) for the services we are going to install. In our case the following services were installed: Cloudera Management Service, HDFS, Hive, Hue, Impala, Oozie, YARN and ZooKeeper. Of these, Hive, Hue and Oozie need databases, and two more databases are needed by the Cloudera services: one for the Cloudera Manager server and one for the Report Manager that is part of the Cloudera Management Service. First, start PostgreSQL and add it to autostart:
```bash
systemctl start postgresql
systemctl enable postgresql
```
Now we can connect and create the necessary databases:
```
sudo -u postgres psql

> CREATE ROLE scm LOGIN PASSWORD '<password>';
> CREATE DATABASE scm OWNER scm ENCODING 'UTF8';    # Cloudera Manager
> CREATE ROLE rman LOGIN PASSWORD '<password>';
> CREATE DATABASE rman OWNER rman ENCODING 'UTF8';  # Report Manager
> CREATE ROLE hive LOGIN PASSWORD '<password>';
> CREATE DATABASE metastore OWNER hive ENCODING 'UTF8';  # Hive Metastore
> ALTER DATABASE metastore SET standard_conforming_strings = off;  # needed for PostgreSQL 8.2.23 and higher
> CREATE ROLE hue_u LOGIN PASSWORD '<password>';
> CREATE DATABASE hue_d OWNER hue_u ENCODING 'UTF8';  # Hue
> CREATE ROLE oozie LOGIN ENCRYPTED PASSWORD '<password>' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
> CREATE DATABASE "oozie" WITH OWNER = oozie ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'en_US.UTF-8' LC_CTYPE = 'en_US.UTF-8' CONNECTION LIMIT = -1;  # Oozie
> \q
```
Databases for the remaining services are created in a similar way.
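The generic pattern for any additional service looks like this (the role and database names here are placeholders, not names used in the project):

```bash
# Create a role and a UTF8 database owned by it, for a hypothetical service "svc".
sudo -u postgres psql -c "CREATE ROLE svc LOGIN PASSWORD '<password>';"
sudo -u postgres psql -c "CREATE DATABASE svc_db OWNER svc ENCODING 'UTF8';"
```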
Do not forget to run the script that prepares the Cloudera Manager server database, passing it the connection details of the database created for it:
```bash
/usr/share/cmf/schema/scm_prepare_database.sh postgresql scm scm <password>
```
Cloudera provides two ways to install CDH: from packages and from parcels. The first option involves downloading a set of packages of the required service versions and installing them yourself. This method gives more flexibility in cluster configuration, but Cloudera does not guarantee the compatibility of the packages. Therefore the second option, installation from parcels (pre-assembled sets of packages of mutually compatible versions), is more popular. The latest versions are available at archive.cloudera.com/cdh5/parcels/latest ; earlier ones can be found one level up. In addition to the parcel with CDH, you need to download manifest.json from the same repository directory.
For the developed functionality we also needed Spark 2.2, which is not part of the CDH parcel (only the first version of the service is available there). To install it, download a separate parcel with this service and the corresponding manifest.json, also available in the Cloudera archive.
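On a machine with Internet access, the downloads might look like this (the exact parcel file names depend on the build, so the names below are placeholders for the el7 builds):

```bash
# CDH parcel and its manifest (substitute the real file name from the directory listing):
wget https://archive.cloudera.com/cdh5/parcels/latest/CDH-<version>-el7.parcel
wget https://archive.cloudera.com/cdh5/parcels/latest/manifest.json

# Spark 2 parcel and its manifest:
wget https://archive.cloudera.com/spark2/parcels/latest/SPARK2-<version>-el7.parcel
wget https://archive.cloudera.com/spark2/parcels/latest/manifest.json
```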
After downloading the parcels and manifest.json files, transfer them to the appropriate folders of our repository. Create separate folders for the CDH and Spark files:
```bash
cd /var/www/html/parcels
mkdir cdh spark
```
Move the parcels and the manifest.json files into the created folders. To make them available for installation over the network, give the parcels folder the appropriate permissions:
```bash
chmod -R ugo+rX /var/www/html/parcels
```
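Before moving on, it is worth checking that the parcels are reachable over HTTP (this assumes the default httpd configuration serving /var/www/html):

```bash
# Should return HTTP 200 for each parcel file:
curl -I http://<httpd host>/parcels/cdh/<CDH parcel file>
curl -I http://<httpd host>/parcels/spark/<Spark 2 parcel file>
```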
Now you can start installing CDH itself, which I will cover in the next post.
Source: https://habr.com/ru/post/413149/