Huazhi Fang
Big Data Engineer at Yahoo
Credentials
- Data Analytics & Data Science, Digi-Safari & Tredence Inc. (Aug 2019 - Oct 2024)
- Big Data 101, IBM
- Hadoop 101, IBM
- Simplifying Data Pipelines with Apache Kafka, IBM
- Spark Fundamentals I, IBM
- Spark Fundamentals II, IBM
- Using HBase for Real-time Access to Your Big Data, IBM
Experience
Yahoo - Australia - Online Media - 100-200 employees
Big Data Engineer
Sep 2019 - Present
• Implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.
• Worked on AWS Kinesis for processing large volumes of real-time data.
• Optimized storage in Hive using partitioning and bucketing on both managed and external tables.
• Used the Spark SQL and DataFrames API to load structured and semi-structured data into Spark clusters.
• Built the CI/CD pipeline for code deployment using Git, Jenkins, and CodePipeline, covering the process from developer check-in to production deployment.
• Used Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters.
• Created, configured, and monitored shard sets; analyzed the data to be sharded and chose a shard key to distribute data evenly.
• Enforced YARN resource pools to share cluster resources among YARN jobs submitted by users.
• Explored Spark to improve the performance of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
• Used Spark to export transformed streaming datasets into Redshift on AWS.
• Created AWS Lambda functions to move data from S3 into Spark Structured Streaming, producing schema-structured output.
• Installed, configured, and monitored a Kafka cluster; architected a lightweight Kafka broker; integrated Kafka with Spark for real-time data processing.
• Extracted the needed data from the server into the Hadoop file system (HDFS) and bulk-loaded the cleaned data into HBase using Spark.
• Accessed HDFS using Spark and managed data in Hadoop data lakes.
• Worked with the Spark SQL context to create DataFrames that filter input data for model execution.
• Used the Spark DataFrame and Dataset APIs from Spark SQL extensively for data processing.
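The Hive bucketing mechanism mentioned above can be sketched in plain Python (not actual Hive): Hive routes each row to bucket `hash(bucket_column) % num_buckets`, so equal keys always land in the same bucket file. Column names and data below are hypothetical, and a stable CRC32 hash stands in for Hive's hash function.

```python
# Sketch of Hive-style bucketing. Hive assigns each row to bucket
# hash(bucket_column) % num_buckets; equal keys always co-locate,
# which is what makes bucket-based joins and sampling cheap.
import zlib

NUM_BUCKETS = 4

def bucket_for(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministic bucket assignment, analogous to Hive's hash % n.

    Python's built-in hash() is salted per process for strings, so a
    stable CRC32 is used here for reproducibility.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets

# Hypothetical rows; "user_id" plays the role of the bucketing column.
rows = [{"user_id": f"user{i}", "clicks": i} for i in range(10)]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for row in rows:
    buckets[bucket_for(row["user_id"])].append(row)

# Every row lands in exactly one of the NUM_BUCKETS buckets.
assert sum(len(v) for v in buckets.values()) == len(rows)
```

Because the assignment is deterministic, two bucketed tables that share the bucket column and bucket count can be joined bucket-by-bucket without a full shuffle.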
Spotify - Sweden - Musicians - 700+ employees
Data Engineer
Dec 2017 - Sep 2019
• Installed and configured a Kafka producer to ingest data from a REST API.
• Installed and configured a Spark consumer to stream data from the Kafka producer.
• Wrote queries, stored procedures, functions, and triggers in SQL; supported development, testing, and operations teams during new system deployments.
• Wrote custom user-defined functions (UDFs) for complex Hive queries (HQL).
• Configured and deployed production-ready multi-node Hadoop services (Hive, Sqoop, Flume, Oozie) on the Hadoop cluster with the latest patches.
• Developed scripts for collecting high-frequency log data from various sources and integrating it into HDFS using Flume; staged data in HDFS for further analysis.
• Used Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters.
• Configured a multi-node cluster of 10 nodes and 30 brokers for consuming high-volume, high-velocity data.
• Used Spark SQL to perform transformations and actions on data residing in Hive.
• Used ZooKeeper for centralized configuration and for Kafka offset management.
• Created Hive tables, loaded data, and wrote Hive queries that run internally as MapReduce jobs.
• Imported/exported data between HDFS and Hive using Sqoop and Kafka.
• Created partitions and buckets based on State for further processing with bucket-based Hive joins.
• Built a prototype for real-time analysis using Spark Streaming and Kafka.
• Wrote Flume and HiveQL scripts to extract, transform, and load data into the database.
• Loaded ingested data into Hive managed and external tables.
• Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
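The micro-batch model behind Spark Streaming, mentioned above, can be illustrated in plain Python (this is not Spark itself): an unbounded stream is divided into fixed-size batches, each of which a batch engine processes in turn. The event data below is a made-up stand-in for a Kafka topic's records.

```python
# Illustration of the micro-batch idea behind Spark Streaming:
# divide a (possibly unbounded) stream into small batches that
# a batch-processing engine consumes one at a time.
from itertools import islice
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield fixed-size batches from a stream; the final batch may be short."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical stand-in for records arriving from a Kafka topic.
events = range(10)
batches = list(micro_batches(events, batch_size=4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

In the real system the batch interval is time-based rather than count-based, but the principle is the same: each batch is handed to the Spark engine as an ordinary batch job.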
eBay - United States - Technology, Information and Internet - 700+ employees
Big Data Engineer
Jun 2015 - Dec 2017
• Installed and configured a Kafka producer to ingest data from a REST API.
• Installed and configured a Spark consumer to stream data from the Kafka producer.
• Installed and configured Hive for data warehousing and HQL-based ETL.
• Used Spark to migrate data to Hive.
• Worked on AWS to create and manage EC2 instances and Hadoop clusters.
• Deployed the big data Hadoop application using Talend on AWS.
• Used AWS Redshift for storing data in the cloud.
• Performed maintenance, monitoring, deployments, and upgrades across the infrastructure supporting all Hadoop clusters.
• Used ZooKeeper and Oozie for coordinating the cluster and scheduling workflows.
• Moved data from relational tables into HDFS and HBase tables.
• Transformed log data into the data model and wrote UDFs to format the log data.
• Used HBase to store the majority of data that needed to be partitioned by column and region.
• Used Spark to process ingested data from various sources.
• Created HBase tables to store variable data formats coming from different portfolios.
• Used the Spark SQL and DataFrames API to load structured and semi-structured data into Spark clusters.
• Wrote shell scripts to move log files to the Hadoop cluster through automated processes.
• Loaded files from MySQL into HDFS using Spark.
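The "load semi-structured data into a tabular view" step above can be sketched with only the standard library (the real work used Spark SQL / DataFrames; record contents and field names here are hypothetical): each JSON record is projected onto a fixed schema, with absent optional fields becoming nulls.

```python
# Sketch of loading semi-structured JSON into fixed-schema rows,
# the same shape of work Spark SQL's DataFrame reader performs.
import json

# Hypothetical raw records; the second one omits an optional field.
raw_records = [
    '{"item_id": 1, "price": 9.99, "tags": ["new"]}',
    '{"item_id": 2, "price": 4.50}',
]

columns = ("item_id", "price", "tags")

def to_row(line: str) -> tuple:
    """Parse one JSON record into a fixed-schema row, nulling absent fields."""
    obj = json.loads(line)
    return tuple(obj.get(c) for c in columns)

rows = [to_row(r) for r in raw_records]
# rows == [(1, 9.99, ["new"]), (2, 4.5, None)]
```

Projecting onto a declared column list is what lets downstream SQL-style filters and joins treat irregular input as a uniform table.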
Ahold Delhaize - Netherlands - Retail - 700+ employees
Data Engineer
Oct 2013 - Jun 2015
• Installed and configured a Hadoop cluster including HDFS, YARN, and MapReduce.
• Used Spark to migrate data from HDFS to a MySQL database.
• Installed and configured Hive and wrote Hive UDFs.
• Worked with different file formats and compression techniques to meet standards.
• Loaded data from the UNIX file system into HDFS.
• Installed and configured a MySQL server to allow remote user access on Ubuntu.
• Loaded large RDBMS datasets into the big data platform using Sqoop.
• Accessed the Hadoop cluster (CDM) and reviewed log files of all daemons.
• Analyzed datasets using Hive, MapReduce, and Sqoop to recommend business improvements.
• Maintained and troubleshot network connectivity.
• Collected and aggregated large amounts of log data using Apache Flume and staged data in HDFS for further analysis.
• Installed and configured a Flume agent to ingest data from a REST API.
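The Sqoop-based RDBMS load above works, conceptually, by splitting a table on a numeric key range so multiple mappers can copy chunks in parallel. A minimal sketch, with sqlite3 standing in for MySQL and made-up table/column names:

```python
# Conceptual sketch of a Sqoop-style import: read an RDBMS table in
# contiguous key-range splits (like Sqoop's --split-by) so each split
# could become one mapper's slice, written to its own HDFS file.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 101)])

def key_range_splits(lo: int, hi: int, num_splits: int):
    """Divide the inclusive key range [lo, hi] into num_splits ranges."""
    step = (hi - lo + 1 + num_splits - 1) // num_splits  # ceiling division
    return [(s, min(s + step - 1, hi)) for s in range(lo, hi + 1, step)]

exported = []
for lo, hi in key_range_splits(1, 100, num_splits=4):
    # In Sqoop, each range would be fetched by a separate mapper.
    exported.extend(conn.execute(
        "SELECT id, amount FROM orders WHERE id BETWEEN ? AND ?", (lo, hi)))
```

Splitting on the primary key keeps each mapper's query a cheap range scan while guaranteeing every row is exported exactly once.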
Education
University of Science and Technology Beijing
Doctor of Philosophy - PhD, Modeling and Simulation in Materials Science