I am currently a Research Scientist at Facebook.
Prior to my current position, I pursued a PhD in Computer Science at the School of Informatics and Computing, Indiana University, where I was a research assistant at the Digital Science Center, Pervasive Technology Institute.
Personal email:

Education

Research

I joined Facebook in 2012 as a Research Scientist. Since then, my peers and I have designed and implemented a foundational service and framework at the bottom of our distributed system stack. It provides crucial functionality to upper-level sharded services, including failure detection, automatic failover, automatic load balancing, locality-based data placement preference, automatic load testing, and disaster readiness. The framework has been adopted widely and backs many critical backend services, enabling our customers to sustain very high throughput (requests per second) reliably. It supports different types of services (e.g. stateless services, stateful services with replication, storage systems, read-only cache services). As the scale keeps growing (e.g. number of servers, number of users, geographically dispersed data centers, feature set), there are challenges on multiple fronts that require innovation beyond state-of-the-art technologies.
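To give a flavor of what automatic failover in a sharded service involves, here is a minimal sketch of shard-to-server assignment via consistent hashing, where shards owned by a failed server transparently move to the next live server on the ring. This is a hypothetical toy model for illustration only; the class and method names (`ShardManager`, `owner`, `mark_failed`) are my invention and do not reflect Facebook's actual framework or API.

```python
import hashlib
from bisect import bisect_right

class ShardManager:
    """Toy consistent-hash ring mapping shard ids to servers.

    When a server is marked failed, its shards are served by the next
    live server clockwise on the ring -- a simplified model of the
    automatic-failover behavior described above.
    """

    def __init__(self, servers):
        # Place each server on the ring at the hash of its name.
        self.ring = sorted((self._h(s), s) for s in servers)
        self.hashes = [h for h, _ in self.ring]
        self.live = set(servers)

    @staticmethod
    def _h(key):
        # Stable hash so shard placement is deterministic across calls.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def mark_failed(self, server):
        self.live.discard(server)

    def owner(self, shard_id):
        # Walk clockwise from the shard's position to the first live server.
        start = bisect_right(self.hashes, self._h(shard_id))
        for i in range(len(self.ring)):
            _, server = self.ring[(start + i) % len(self.ring)]
            if server in self.live:
                return server
        raise RuntimeError("no live servers")
```

A real system layers replication, health checking, and load-aware rebalancing on top of placement; this sketch only captures the core idea that failover should not require remapping every shard.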
Facebook runs a graduate fellowship program that funds the research of outstanding PhD students. If you have done excellent research, I encourage you to apply on the program website, or reach out to me at zhguo@fb.com.

Prior to joining Facebook, I was a member of the Digital Science Center, supervised by Prof. Geoffrey Fox. My research was on distributed systems, especially data-parallel systems and science gateways/portals. I thoroughly investigated how to improve multiple critical aspects of MapReduce, such as data locality, load balancing, resource utilization, and speculative execution. Given the huge volume of data modern big data systems process, data locality is crucial. I was among the first to deeply analyze how commonly tuned factors such as the replication factor and the number of servers affect data locality. I quantified their relationship with a mathematical model and proposed an innovative task scheduling algorithm that significantly improves data locality. The resulting publications are Investigation of data locality in MapReduce and Investigation of data locality and fairness in MapReduce.

Perceiving that the artificially imposed partitioning of resources into map/reduce slots causes significant resource underutilization, I proposed multiple improvements that boost efficiency. In Automatic task re-organization in MapReduce, I presented mechanisms to dynamically split and consolidate tasks to cope with load imbalance and break through the concurrency limit imposed by fixed task granularity. For the single-job case, two algorithms were proposed, one for when prior knowledge is available and one for when it is not. For the multi-job case, I proposed a modified shortest-job-first strategy, which, combined with task splitting, theoretically minimizes job turnaround time. In Improving Resource Utilization in MapReduce, I proposed resource stealing, which enables running tasks to steal resources reserved for idle slots and return them proportionally whenever new tasks are assigned. Resource stealing puts otherwise wasted resources to full use without interfering with normal job scheduling.
I also proposed Benefit Aware Speculative Execution (BASE), which evaluates the potential benefit of speculative tasks and eliminates unnecessary runs. In Improving MapReduce Performance in Heterogeneous Network Environments and Resource Utilization, I investigated the performance of MapReduce in heterogeneous network environments and proposed a novel network-heterogeneity-aware scheduling algorithm.
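The core intuition behind locality-aware task scheduling can be sketched in a few lines: when a node has a free slot, prefer a pending task whose input block is stored on that node, and fall back to a remote read only when no local task exists. This is a deliberately simplified illustration of the general idea, not the actual algorithms from the papers above; the function and parameter names are assumptions for the example.

```python
def schedule(idle_node, pending_tasks, block_locations):
    """Pick a task for idle_node, preferring data-local tasks.

    pending_tasks   -- list of task ids, consumed as tasks are assigned
    block_locations -- maps task id -> set of nodes holding its input
                       block (the set's size is the replication factor)
    Returns (task, is_local), or (None, False) if nothing is pending.
    """
    # First pass: look for a task whose input already lives on this node.
    for task in pending_tasks:
        if idle_node in block_locations[task]:
            pending_tasks.remove(task)
            return task, True          # data-local assignment
    # No local work: assign any task and pay the network cost.
    if pending_tasks:
        return pending_tasks.pop(0), False
    return None, False
```

Even this greedy version shows why the replication factor matters: the more nodes hold each block, the more likely the first pass succeeds, which is the relationship the mathematical model quantifies.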

Observing the limited environments and application types MapReduce supports, I proposed new architectures that greatly expand the scenarios in which MapReduce can be used. I proposed a new paradigm, Hierarchical MapReduce, that enables MapReduce to be deployed on top of geographically dispersed compute clusters across research institutes and universities. The work was published as A hierarchical framework for cross-domain MapReduce execution. Some projects in our lab ran data processing pipelines consisting of multiple applications, which had to be manually scheduled to run on appropriate platforms, including Hadoop and Twister (iterative MapReduce). I designed Hybrid MapReduce, which allows users to orchestrate complicated processing workflows across multiple runtime platforms without worrying about implementation details (e.g. copying/transforming data between platforms). The work was published as HyMR: a Hybrid MapReduce Workflow System.
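The two-level structure of Hierarchical MapReduce can be illustrated with a toy word-count: each cluster runs a local MapReduce over its own partition of the data, and a global reduce then merges the per-cluster partial results. The function names below are illustrative assumptions, and the sketch glosses over the cross-cluster data transfer the real framework handles; it only works when the reducer is associative (as `sum` is), so that reducing partial results equals reducing all values at once.

```python
from collections import defaultdict

def local_mapreduce(records, mapper, reducer):
    """One cluster's local MapReduce over its own partition."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return {k: reducer(vs) for k, vs in groups.items()}

def hierarchical_mapreduce(cluster_partitions, mapper, reducer):
    # Phase 1: each geographically dispersed cluster reduces locally.
    partials = [local_mapreduce(p, mapper, reducer)
                for p in cluster_partitions]
    # Phase 2: a global reduce merges the per-cluster partial results.
    merged = defaultdict(list)
    for partial in partials:
        for key, value in partial.items():
            merged[key].append(value)
    return {k: reducer(vs) for k, vs in merged.items()}
```

For example, counting words across two clusters holding `["a b", "a"]` and `["b b"]` respectively yields the same totals as a single-cluster run, while shipping only compact partial results rather than raw data between sites.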

Besides backend distributed systems, I also worked intensively on revolutionizing science portal/gateway development. I applied cutting-edge web technologies such as OpenID, OAuth, AJAX, OpenSocial, gadgets, and widgets to science gateway development, which significantly improved reusability, flexibility, and agility. I shared my work in the open source project OGCE (Open Gateway Computing Environments), which was the largest initiative to improve the accessibility of large compute clusters and received millions of dollars in funding from the National Science Foundation. A series of papers was published: Building the PolarGrid Portal Using Web 2.0 and OpenSocial, Cyberaide JavaScript: A JavaScript Commodity Grid Kit, Investigating the Use of Gadgets, Widgets, and OpenSocial to Build Science Gateways, The QuakeSim Portal and Services: New Approaches to Science Gateway Development Techniques, Open Community Development for Science Gateways with Apache Rave, and Using Web 2.0 for Scientific Applications and Scientific Communities.

I was a member of the FutureGrid Experts Team. My main responsibility was to use my expertise to guide FutureGrid users on i) the use of High Performance Computing clusters, ii) the use of cloud computing platforms such as Eucalyptus and Nimbus, iii) the use of distributed computing runtimes such as Hadoop and Twister, and iv) other system issues.

I am an Apache committer on the Rave project, which aims to build a generic, Web 2.0-friendly portal framework using OpenSocial gadgets and W3C widgets. My main responsibility is to add science-gateway-specific features (e.g. integration with existing grid computing systems and High Performance Computing clusters).

Blogs/Wikis

I maintained several blogs, wikis, and Google projects for different purposes.