1. 2010/10 ~ 2011/09. TrendMicro. Developed the Global Meta Store System. Used Hadoop and HBase to build a large-scale metadata store and compute platform, and used MapReduce to implement flexible queries and index building.
2. 2011/01 ~ 2011/12. TrendMicro. HBase Service Owner for services including the Website Reputation System (WRS) and the TrendMicro Log System (TLS). Successfully upgraded the HBase cluster to 0.92 with security enabled.
3. 2011/12 ~ Now. Alibaba Group. HBase Service Owner and HBase operations team leader. Manage and maintain 30+ HBase-related services and 20+ HBase clusters. During the 2012 Taobao 'double sticks day' (Singles' Day, November 11), the HBase-backed Taobao Log Service reached a peak output of 10 Gbps and a peak input of 5 Gbps, and the HBase-backed Alipay Security System reached a peak of 280k+ QPS with latency below 4 ms throughout the day.
The following work helped us get through 'double sticks day' while keeping all HBase services stable and at consistently low response latency:
a) Dynamically modified important configurations online, including the number of compaction threads, the number of flush threads, and the major compaction schedule and throughput, without restarting the regionservers.
b) Divided the cluster into several groups with different software and hardware configurations to support different types of services, including log systems, read-only systems, and read-write systems.
c) Worked with the application owners to prepare degradation plans for unexpectedly large request volumes.
d) Built an efficient monitoring system. By monitoring important HBase metrics such as HLog count, MemStore size, compaction queue, and flush queue, we could detect potential problems more than 4 hours before an actual incident and prevent it by stepping in and fixing them early.
e) Monitored the JVM, focusing especially on GC pause time.
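The early-warning idea in item d) can be illustrated with a minimal sketch, assuming the monitor keeps a short history of each metric and extrapolates its trend linearly. The function name `hours_until_limit`, the sample format, and the example limit are hypothetical and not taken from the production system.

```python
def hours_until_limit(samples, limit):
    """Estimate hours until a metric reaches `limit` by linear extrapolation.

    `samples` is a time-ordered list of (hour, value) pairs, e.g. the
    compaction queue length sampled over time (hypothetical format).
    Returns 0.0 if the latest value is already at or above the limit,
    and None if the metric is flat or falling (no crossing predicted).
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if v1 >= limit:
        return 0.0
    if t1 <= t0 or v1 <= v0:
        return None
    slope = (v1 - v0) / (t1 - t0)  # growth per hour
    return (limit - v1) / slope


def needs_attention(samples, limit, horizon_hours=4.0):
    """True if the metric is predicted to hit `limit` within the horizon."""
    eta = hours_until_limit(samples, limit)
    return eta is not None and eta <= horizon_hours
```

A metric growing from 10 to 20 over two hours with a limit of 50 would be predicted to cross in six hours, so it would not yet be flagged with a four-hour horizon.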
In addition, we carried out the following HBase operations work:
a) Moved a cluster from one data center to another without stopping the services.
b) Set up replication between two data centers to build an active-standby cluster for the Alibaba Monitor System.
4. 2012/7 ~ Now. Alibaba Group. Developed a Hadoop cold data analysis system, deleted more than 2 PB of useless data, and saved 80+ machines.
Taobao's largest Hadoop cluster (which we call YUNTI) consists of 3000+ machines and is the platform that stores all Alibaba Group business data; thousands of analysis jobs run on it every day. However, cluster utilization stays above 75% and storage utilization has reached 80%. I used Hive and the Hadoop OIV (Offline Image Viewer) tool to analyze the audit log and the Hadoop directory tree, and identified cold data that had not been touched for one month or for half a year. We then confirmed the findings with the application owners and pushed them to delete the garbage data. So far we have deleted more than 2 PB of data and saved 80+ machines.
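The cold data selection step can be sketched as follows. This is only an illustration, assuming the last access time of each path has already been extracted from the audit log and the image dump; the function name `find_cold_paths` and the example paths are hypothetical.

```python
from datetime import datetime, timedelta


def find_cold_paths(last_access, now, min_idle_days):
    """Return paths not accessed for at least `min_idle_days`, sorted by name.

    `last_access` maps a path to the datetime of its most recent access
    (assumed to be precomputed from the audit log).
    """
    cutoff = now - timedelta(days=min_idle_days)
    return sorted(path for path, ts in last_access.items() if ts <= cutoff)


# Hypothetical example data: three directories with different idle times.
now = datetime(2012, 12, 1)
last_access = {
    "/group/a/logs": datetime(2012, 11, 20),
    "/group/b/tmp": datetime(2012, 10, 1),
    "/group/c/archive": datetime(2012, 1, 1),
}
cold_one_month = find_cold_paths(last_access, now, 30)
cold_half_year = find_cold_paths(last_access, now, 180)
```

Running the selection at both thresholds gives two candidate lists, which can then be reviewed with the application owners before anything is deleted.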
5. 2012/7 ~ Now. Alibaba Group. Developing a cloud operations platform, including automatic upgrades, automatic online repairs, and automatic configuration updates.