Scheduling Shared Scans of Large Data Files
Parag AgrawalStanford University
Daniel KiferYahoo! Research
Christopher OlstonYahoo! Research
ABSTRACTWe study how best to schedule scans of large data files, inthe presence of many simultaneous requests to a commonset of files. The objective is to maximize the overall rate ofprocessing these files, by sharing scans of the same file asaggressively as possible, without imposing undue wait timeon individual jobs. This scheduling problem arises in batchdata processing environments such as Map-Reduce systems,some of which handle tens of thousands of processing re-quests daily, over a shared set of files.
As we demonstrate, conventional scheduling techniquessuch as shortest-job-first do not perform well in the presenceof cross-job sharing opportunities. We derive a new familyof scheduling policies specifically targeted to sharable work-loads. Our scheduling policies revolve around the notionthat, all else being equal, it is good to schedule nonsharablescans ahead of ones that can share IO work with future jobs,if the arrival rate of sharable future jobs is expected to behigh. We evaluate our policies via simulation over variedsynthetic and real workloads, and demonstrate significantperformance gains compared with conventional schedulingapproaches.
1. INTRODUCTIONAs disk seeks become increasingly expensive relative to
sequential access, data processing systems are being archi-tected to favor bulk sequential scans of large files. Database,warehouse and mining systems have incorporated scan-centric access methods for a long time, but at the mo-ment the most prominent example of scan-centric archi-tectures is Map-Reduce . Map-Reduce systems executeUDF-enhanced group-by programs over extremely large, dis-tributed files. Other architectures in this space includeDryad  and River .
Large Map-Reduce installations handle tens of thousandsof jobs daily, where a job consists of a scan of a large file ac-companied by some processing and perhaps communicationwork. In many cases the processing is relatively light (e.g.,count the number of times Britney Spears is mentioned on
Permission to copy without fee all or part of this material is granted providedthat the copies are not made or distributed for direct commercial advantage,the VLDB copyright notice and the title of the publication and its date appear,and notice is given that copying is by permission of the Very Large DataBase Endowment. To copy otherwise, or to republish, to post on serversor to redistribute to lists, requires a fee and/or special permission from thepublisher, ACM.VLDB 08, August 24-30, 2008, Auckland, New ZealandCopyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.
the web), and the communication is minimal (distributiveand algebraic aggregation functions enable early aggregationon the Map side of the job, and the data transmitted to theReduce side is small). Many jobs even disable the Reducecomponent, because they do not require global processing(e.g., generate a hash-based synopsis of every document ina large collection).
The execution time of these jobs is dominated by scanningthe input file. If the number of unique input files is smallrelative to the number of daily jobs (e.g., in a search enginecompany many jobs process the web crawl, user click log,and search query log), then it is desirable to amortize thework of scanning one of these files across multiple jobs. Un-fortunately, caching is not good enough because often thesedata sets are so large that they do not fit in memory, evenif spread across a large cluster of machines.
Cooperative scans [6, 8, 21] can help here: multiple jobsthat require scanning the same file can be executed simulta-neously, with the scanning performed once and the scanneddata fed into each jobs processing component. The work oncooperative scans has focused on mechanisms to realize IOsavings across multiple co-executing jobs. However there isanother opportunity here: In the Map-Reduce context jobstend to run for a long time, and users do not expect quickturnaround. It is acceptable to reorder pending jobs, withina reasonable limit on delaying individual jobs, if doing socan improve the total amount of useful work performed bythe system.
In this paper we study how to schedule jobs that can ben-efit from shared scans over a common set of files. To ourknowledge this scheduling problem has not been posed be-fore. Existing scheduling techniques such as shortest-job-first do not necessarily work well in the presence of sharablejobs, and it is not obvious how to design ones that do workwell. We illustrate these points via a series of informal ex-amples (rigorous formal analysis follows).
1.1 Motivating Examples
Example 1Suppose the systems work queue contains two pending jobs,J1 and J2, which are unrelated (i.e., they scan different files),and hence there is no benefit in executing them jointly.Therefore we execute them sequentially, and we must de-cide which one to execute first. We might consider execut-ing them in order of arrival (FIFO), or perhaps in order ofexpected running time (a policy known as shortest-job-firstscheduling, which aims for low average response time in non-
Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for components of this work owned by others than VLDB Endowment must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists requires prior specific permission and/or a fee. Request permission to republish from: Publications Dept., ACM, Inc. Fax +1 (212) 869-0481 or firstname.lastname@example.org. PVLDB '08, August 23-28, 2008, Auckland, New Zealand Copyright 2008 VLDB Endowment, ACM 978-1-60558-305-1/08/08
sharable workloads). If J1 arrived slightly earlier and has aslightly shorter execution time than J2, then both FIFOand shortest-job-first would schedule J1 first. This decision,which is made without taking sharing into account, seemsreasonable because J1 and J2 are unrelated.
However, one might want to consider the fact that addi-tional jobs may arrive in the queue while J1 and J2 are beingexecuted. Since future jobs may be sharable with J1 or J2,they can influence the optimal execution order of J1 and J2.Even if one does not anticipate the exact arrival schedule offuture jobs, a simple stochastic model of future job arrivalscan influence the decision of which of J1 or J2 to executefirst.
Suppose J1 scans file F1, and J2 scans file F2. Let idenote the frequency with which jobs that scan Fi are sub-mitted. In our example, if 1 > 2, then all else beingequal it might make sense to schedule J2 first. While J2 isexecuting, new jobs that are sharable with J1 may arrive,permitting us to amortize J1s work across multiple jobs.This amortization of work, in turn, can lead to lower av-erage job response times going forward. The schedule weproduced by considering future job arrival rates differs fromthe one produced by FIFO and shortest-job-first.
Example 2In a more subtle scenario, suppose instead that 1 = 2.Suppose F1 is 1 TB in size, and F2 is 10 TB. Assumeeach jobs execution time is dominated by scanning the file.Hence, J2 takes about ten times as long to execute as J1.
Now, which one of J1 and J2 should we execute first?Perhaps J1 should be executed first because J2 can benefitmore from sharing, and postponing J2s execution permitsadditional, sharable F2 jobs to accumulate in the queue. Onthe other hand, perhaps J2 ought to be executed first sinceit takes roughly ten times as long as J1, thereby allowingten times as many F1 jobs to accumulate for future jointexecution with J1.
Which of these opposing factors dominates in this case?How can we reason about these issues in general, in orderto maximize system productivity or minimize average jobresponse time?
1.2 Contributions and OutlineIn this paper we formalize and study the problem of schedul-
ing sharable jobs, using a combination of analytical and em-pirical techniques. We demonstrate that scheduling policiesthat work well in the traditional context of nonsharable jobscan yield poor schedules in the presence of sharing. Weidentify simple policies that do work well in the presence ofsharing, and are robust to fluctuations in the workload suchas bursts of job arrivals.
The remainder of this paper is structured as follows. Wediscuss related work in Section 2, and give our formal modelof scheduling jobs with shared scans in Section 3. Then inSection 4 we derive a family of scheduling policies, whichhave some convenient properties that make them practicalas we discuss in Section 5. We perform some initial empiricalanalysis of our policies in Section 6. Then in Section 7 weextend our family of policies to include hybrid ones thatbalance multiple scheduling objectives. We present our finalempirical evaluation in Section 8.
2. RELATED WORKWe are not aware of any prior work that addresses the
problem studied in this paper. That said, there is a tremen-dous amount of work, in both the database and schedulingtheory communities, that is peripherally related. We surveythis work below.
2.1 Database LiteraturePrior work on cooperative scans [6, 8, 21] focused on mech-
anisms for sharing scans across jobs or queries that get ex-ecuted at the same time. Our work is complementary: weconsider how to schedule a queue of pending jobs to ensurethat sharable jobs get executed together and can benefitfrom cooperative scan techniques.
Gupta et al.  study how to select an execution order forenqueued jobs, to maximize the chance that data cached onbehalf of one job can be reused for a subsequent job. Thatwork only takes into account jobs that are already in thequeue, whereas our work focuses on scheduling in view ofanticipated future jobs.
2.2 Scheduling LiteratureScheduling theory is a vast field with countless variations
on the scheduling problem, including various performancemetrics, machine environments (such as single machine, par-allel machines, and shop), and constraints (such as releasetimes, deadlines, precedence constraints, and preemption). Some of the earliest complexity results for schedulingproblems are given in . In particular, the problem ofminimizing the sum of completion times on a single proces-sor in the presence of release dates (i.e. job arrival times)is NP-hard. On the other hand, minimizing the maximumabsolute or relative wait times can be done in polynomialtime using the algorithm proposed in . Both of theseproblems are special cases of the problem considered in thispaper when all of the shared costs are zero.
In practice, the quality of a schedule depends on severalfactors (such as maximum completion time, average com-pletion time, maximum earliness, maximum lateness). Op-timizing schedules with respect to several performance met-rics is known as multicriteria scheduling .
Online scheduling algorithms [18, 20] make scheduling de-cisions without knowledge of future jobs. In non-clairvoyantscheduling , the characteristics of the jobs (such as run-ning time) are not known until the job finishes. Online al-gorithms are typically evaluated using competitive analysis[18, 20]: if C(I) is the cost of an online schedule on instanceI and Copt(I) is the cost of the optimal schedule, then theonline algorithm is c-competitive if C(I) c Copt(I)+ b forall instances I and for some constant b.
Divikaran and Saks  studied the online scheduling prob-lem with setup times. In this scenario, jobs belong to jobfamilies and a setup cost is incurred whenever the proces-sor switches between jobs of different families. For example,jobs in the same family can perform independent scans ofthe same file, in which case the setup cost is the time ittakes to load a file into memory. The problem consideredin this paper differs in two ways: all jobs executed in onebatch have the same completion time since the scans occurconcurrently instead of serially; also, once a batch has beenprocessed, the next batch still has a shared cost even if it isfrom the same job family (for example, if the entire file doesnot fit into memory).
Figure 1: Model: input queues and job executor.
Stochastic scheduling  considers another variation onthe scheduling problem: the processing time of a job is arandom variable, usually with finite mean and variance, andtypically only the distribution or some of its moments areknown. Online versions of these problems for minimizingexpected weighted completion time have also been consid-ered [3, 14, 19] in cases where there is no sharing of workamong jobs.
3. MODELMap-Reduce and related systems execute jobs on large
clusters, over data files that are spread across many nodes(each node serves a dual storage and computation role).Large files (e.g., a web crawl, or a multi-day search query andresult log) are spread across essentially all nodes, whereassmaller files may only occupy a subset of nodes. Correspond-ingly, jobs that access large files are spread onto the entirecluster, and jobs over small files generally only use a subsetof nodes.
In this paper we focus on the issue of ordering jobs tomaximize shared scans, rather than the issue of how to al-locate data and jobs onto individual cluster nodes. Hencefor the purpose of this paper we abstract away the per-nodedetails and model the cluster as a single unit of storage andexecution. For workloads dominated by large data sets andjobs that get spread across the full cluster, this abstractionis appropriate.
Our model of a data processing engine has two parts: anexecutor module that processes jobs, and an input queuethat holds pending jobs. Each job Ji requires a scan over a(large) input file Fi, and performs some custom processingover the content of the file. Jobs can be categorized basedon their input file into job families, where all jobs that accessfile Fi belong to family Fi. It is useful to think of the inputqueue as being divided into a set of smaller queues, one perjob family, as shown in Figure 1.
The executor is capable of executing a batch of multiplejobs from the same family, in which case the input file isscanned once and each jobs custom processing is appliedover the stream of data generated by scanning the file. Forsimplicity we assume that one batch is executed at a time,although our techniques can easily be extended to the caseof k simultaneous batches.
The time to execute a batch consisting of n jobs fromfamily Fi equals tsi + n tni , where tsi represents the cost ofscanning the input file Fi (i.e., the sharable execution cost),and tni represents the custom processing cost incurred byeach job (i.e., the nonsharable cost). We assume that tsi islarge relative to tni , i.e., the jobs are IO-bound as discussedin Section 1.
Given that tsi is the dominant cost, for simplicity we treatthe nonshared execution cost tni as being the same for alljobs in a batch, even though in reality each job may incur adifferent cost in its custom processing. We verify empiricallyin Section 6 that nonuniform within-batch processing costsdo not throw off our results.
3.1 System WorkloadFor the purpose of our analysis we model job arrival as
a stationary process (in Section 8.2.2 we study the effect ofbursty job arrivals empirically). In our model, for each jobfamily Fi, jobs arrive according to a Poisson process withrate parameter i.
Obviously, a high enough aggregate job arrival rate canoverwhelm a given system, regardless of the scheduling pol-icy. To reason about what job workload a system is capableof handling, it is instructive to consider what happens if jobsare executed in extremely large batches. In the asymptote,as batch sizes approach infinity, the tn values dominate andthe ts values become insignificant, so system load convergestoPi i t
ni . If this quantity exceeds the systems intrin-
sic processing capacity, then it is impossible to keep queuelengths from growing without bound, and the system cannever catch up with pending work under any schedulingregime. Hence we impose a workload feasibility condition:
asymptotic load =Xi
i tni < 1
3.2 Scheduling ObjectivesThe performance metric we use in this paper is average
perceived wait time. The perceived wait time (PWT) of jobJ is the difference between the systems response time inhandling J , and the minimum possible response time t(J).(Response time is the total delay between submission andcompletion of a job.)
As stated in Section 1, the class of systems we consideris geared toward maximizing overall system productivity,rather than committing to response time targets for indi-vidual jobs. This stance would seem to suggest optimizingfor system throughput. However, in our context maximiz-ing throughput means maximizing batch sizes, which leadsto indefinite job wait times. While these systems may findit acceptable to delay some jobs in order to improve overallthroughput, it does not make sense to delay all jobs.
Optimizing for average PWT still gives an incentive tobatch multiple jobs together when the sharing opportunityis large (thereby improving throughput), but not so muchthat the queues grow indefinitely. Furthermore, PWT seemslike an appropriate metric because it corresponds to usersend-to-end view of system performance. Informally, averagePWT can be thought of as an indicator of how unhappyusers are, on average, due to job processing delays. Anotherconsideration is the maximum PWT across all jobs, whichindicates how unhappy the least happy user is.
Our aim is to minimize average PWT, while keeping maxi-mum PWT from being excessively high. We focus on steady-state behavior, rather than a fixed time period such as oneday, to avoid knapsack-style tactics that squeeze shortjobs in at the end of the period. Knapsack-style behav-ior only makes sense in the context of real-time scheduling,which is not a concern in the class of systems we study.
For a given job J , PWT can either be measured on an ab-solute scale as the difference between the systems response
Figure 2: Ways to measure perceived wait time.
time and the minimum possible response time (e.g., 10 min-utes), or on a relative scale as the ratio of the systems re-sponse time to the minimum possible response time (e.g.,1.5 t(J)). (Relative PWT is also known as stretch .)
The space of PWT metric variants is shown in Figure 2.For convenience we adopt the abbreviations AA, MA, ARand MR to refer to the four variants.
3.3 Scheduling PolicyA scheduling policy is an online algorithm that is (re)invoked
each time the executor becomes idle. Upon invocation, thepolicy leaves the executor idle for some period of time (pos-sibly zero time), and then removes a nonempty subset ofjobs from the input queue, packages them into an executionbatch, and submits the batch to the executor.
In this paper, to simplify our analysis we impose two veryreasonable restrictions on our scheduling policies:
No idle. If the input queue is nonempty, do not leavethe executor idle. Given the stochastic nature of jobarrivals, this policy seems appropriate.
Always share. Whenever a job family Fi is scheduledfor execution, all enqueued jobs from family Fi areincluded in the execution batch. While it is true thatif tn > ts, one achieves lower average absolute PWTby scheduling jobs sequentially instead of in a batch,in this paper we assume ts > tn, as stated above. Ifts > tn it is always beneficial to form large batches, interms of average absolute PWT of jobs in the batch.In all cases, large batches reduce the wait time of jobsoutside the batch that are executed afterward.
4. BASIC SCHEDULING POLICIESWe derive scheduling policies aimed at minimizing each of
average absolute PWT (Section 4.1) and maximum absolutePWT (Section 4.2).1
The notation we use in this section is summarized in Ta-ble 1.
4.1 Average Absolute PWTIf there is no sharing, low average absolute PWT is achieved
via shortest-job-first (SJF) scheduling and its variants. (Ina stochastic setting, the generalization of SJF is asymptoti-cally optimal .) We generalize SJF to the case of sharablejobs as follows.
1We tried deriving policies that directly aim to minimize rel-ative PWT, but the resulting policies did not perform well,perhaps due to breakdowns in the approximation schemesused to derive the policies.
symbol meaningFi ith job familytsi sharable execution time for Fi jobstni nonsharable execution time for Fi jobsi arrival rate of Fi jobsBi theoretical batch size for Fiti theoretical time to execute one Fi batchTi theoretical scheduling period for Fifi theoretical processing fraction for Fii perceived wait time for Fi jobsPi scheduling priority of FiBi queue length for FiTi waiting time of oldest enqueued Fi job
Table 1: Notation.
Let Pi denote the scheduling priority of family Fi. If thereis no sharing, SJF sets Pi equal to the time to complete onejob. If there is sharing, then we let Pi equal the averageper-job execution time of a job batch. Suppose Bi is thenumber of enqueued jobs in family Fi, in other words, thecurrent batch size for Fi. Then the total time to execute abatch is tsi + Bi tni . The average per-job execution time is(tsi +Bi tni )/Bi, which gives us the SJF scheduling priority:
SJF Policy : Pi = tsiBi
Unfortunately, as we demonstrate empirically in Section 6,
SJF does not work well in the presence of sharing. To under-stand why, consider a simple example with two job families:
F1 : ts1 = 1, tn1 = 0, 1 = aF2 : ts2 = a, tn2 = 0, 2 = 1
for some constant a > 1.In this scenario, F2 jobs have long execution time (ts2 = a)
so SJF schedules F2 infrequently: once every a2 time units,on expectation. The average perceived wait time under thisschedule is O(a) due to holding back F2 jobs a long timebetween batches. A policy that is aware of the fact that F2jobs are relatively rare (2 = 1) would elect to schedule F2more often, and schedule F1 less often but in much largerbatches. In fact, a policy that schedules F2 every a3/2 timeunits achieves an average PWT of only O(a1/2). For largea, SJF performs very poorly in comparison.
Since SJF does not always produce good schedules in thepresence sharing, we begin from first principles. Unfortu-nately, as discussed in Section 2.2, solving even the non-shared scheduling problem exactly is NP-hard. Hence, tomake our problem tractable we consider a relaxed version ofthe problem, find an optimal solution to the relaxed prob-lem, and apply this solution to the original problem.
4.1.1 Relaxation 1In our initial, simple relaxation, each job family (each
queue in Figure 1) has a dedicated executor. The total workdone by all executors, in steady state, is constrained to beless than or equal to the total work performed by the oneexecutor in the original problem. Furthermore, rather thandiscrete jobs, in our relaxation we treat jobs as continuouslyarriving, infinitely divisible units of work.
In steady state, an optimal schedule will exhibit periodicbehavior: For each job family Fi, wait until Bi jobs havearrived on the queue and execute those Bi jobs as a batch.
Given the arrival rate i, on expectation a new batch isexecuted every Ti = Bi/i time units. A batch takes timeti = t
si + Bi tni to complete. The fraction of time Fis
executor is in use (rather than idle), is fi = ti/Ti.We arrive at the following optimization problem:X
fi 1 minXi
where AAi is the average absolute PWT for jobs in Fi.There are two factors that contribute to the PWT of a
newly-arrived job: (1) the delay until the next batch isformed (2) the fact that a batch of size Bi takes longer tofinish than a singleton batch. The expected value of Factor1 is Ti/2. Factor 2 equals (Bi 1) tni . Overall,
+ (Bi 1) tni
We solve the above optimization problem using the methodof Lagrange Multipliers. In the optimal solution the follow-ing quantity is constant across all job families Fi:
(1 + 2 i tni )
Given the , ts and tn values, one can select batch sizes (Bvalues) accordingly.
4.1.2 Relaxation 2Unfortunately, the optimal solution to Relaxation 1 can
differ substantially from the optimal solution to the origi-nal problem. Consider the simple two-family example wepresented earlier in Section 4.1. The optimal policy underRelaxation 1 schedules job families in a round robin fashion,yielding an average PWT of O(a). Once again this result is
much worse than the achievable O(a1/2) value we discussedearlier.
Whereas SJF errs by scheduling F2 too infrequently, theoptimal Relaxation 1 policy errs in the other direction: itschedules F2 too frequently. Doing so causes F1 jobs to waitbehind F2 batches too often, hurting average wait time.
The problem is that Relaxation 1 reduces the originalscheduling problem to a resource allocation problem. UnderRelaxation 1, the only interaction among job families is factthat they must share the overall processing time (
Pi fi 1).
In reality, resource allocation is not the only important con-sideration. We must also take into account the fact that theexecution batches must be serialized into a single sequen-tial schedule and executed on a single executor. When along-running batch is executed, other batches must wait fora long time.
Consider a job family Fi, for which a batch of size Bi isexecuted once every Ti time units. Whenever an Fi batchis executed, the following contributions to PWT occur:
In-batch jobs. The Bi Fi jobs in the current batchare delayed by (Bi 1) tni time units each, for a totalof D1 = Bi (Bi 1) tni time units.
New jobs. jobs that arrive while the Fi batch is beingexecuted, are delayed. The expected number of suchjobs is ti
Pj j . The delay incurred to each one is
ti/2 on average, making the overall delay incurred toother new jobs equal to
Old jobs. jobs that are already in the queue whenthe Fi batch is executed, are also delayed. UnderRelaxation 1, the expected number of such jobs isPj 6=i(Tj j)/2. The delay incurred to each one is
ti, making the overall delay incurred to other in-queuejobs equal to
D3 =ti2Xj 6=i
The total delay imposed on other jobs per unit time isproportional to 1/Ti (D1 + D2 + D3). If we minimize thesum of this quantity across all families Fi, again subjectto the resource utilization constraint
Pi fi 1 using the
Lagrange method, we obtain the following invariant acrossjob families:
i tsi (i tni )
j + 1
!The scheduling policy resulting from this invariant does
achieve the hoped-for O(a1/2) average PWT in our exampletwo-family scenario.
4.1.3 Implementation and IntuitionRecall the workload feasibility condition
Pi i t
ni < 1
from Section 3.1. If the executors load is spread across alarge number of job families, then for each Fi, i tni is small.Hence, it is reasonable to drop the terms involving i tnifrom our above formulae, yielding the following simplifiedinvariants2:
Relaxation 1 result: For all job families Fi, thefollowing quantity is equal:
Relaxation 2 result: For all job families Fi, thefollowing quantity is equal:
A simple way to translate these statements into imple-mentable policies is as follows: Assign a numeric priorityPi to each job family Fi. Every time the executor becomesidle schedule the family with the highest priority, as a sin-gle batch of Bi jobs, where Bi denotes the queue length forfamily Fi. If we are in steady state, then Bi should roughlyequal Bi. This observation suggests the following priorityvalues for the scheduling policies implied by Relaxations 1and 2, respectively:
AA Policy 1 : Pi =B2ii tsi
AA Policy 2 : Pi =B2ii tsi
2There are also practically-motivated reasons to drop termsinvolving tn, as we discuss in Section 5. In Section 6 we giveempirical justification for dropping the tn terms.
These formulae have a fairly simple intuitive explanation.First, if many new jobs with a high degree of sharing areexpected to arrive in the future (i tsi in the denomina-tor, which we refer to as the sharability of family Fi), weshould postpone execution of Fi and allow additional jobsto accumulate into the same batch, so as to achieve greatersharing with little extra waiting. On the other hand, as thenumber of enqueued jobs becomes large (B2i in the numer-ator), the execution priority increases quadratically, whicheventually forces the execution of a batch from family Fi toavoid imposing excessive delay on the enqueued jobs.
Policy 2 has an extra subtractive term, which penalizeslong batches (i.e., ones with large ts) if the overall rate ofarrival of jobs is high (i.e., high
Pj j). Doing so allows
short batches to execute ahead of long batches, in the spiritof shortest-job-first.
For singleton job families (families with just one job),tsi = 0 and the priority value Pi goes to infinity. Hencenonsharable jobs are to be scheduled ahead of sharable ones.The intuition is that nonsharable jobs cannot be beneficiallycoexecuted with future jobs, so we might as well executethem right away. If there are multiple nonsharable jobs, tiescan be broken according to shortest-job-first.
4.2 Maximum Absolute PWTHere, instead of optimizing for average absolute PWT,
we optimize for the maximum. We again adopt a relaxationof the original problem that assumes parallel executors andinfinitely divisible work. Under the relaxation, the objectivefunction is:
where MAi is the maximum absolute PWT for Fi jobs.As stated in Section 4.1.1 there are two factors that con-
tribute to the PWT of a newly-arrived job: (1) the delayuntil the next batch is formed (2) the fact that a batch ofsize Bi takes longer to finish than a singleton batch. Themaximum values of these factors are Ti and (Bi 1) tni ,respectively. Overall,
MAi = Ti + (Bi 1) tni
or, written differently:
MAi = Ti (1 + i tni ) tni
In the optimal solution MAi is constant across all job fam-ilies Fi. The intuition behind this result is that if one ofthe MAi values is larger than the others, we can decrease itsomewhat by increasing the other MAi values, thereby re-ducing the maximum PWT. Hence in the optimal solutionall MAi values are equal.
4.2.1 Implementation and IntuitionAs justified in Section 4.1.3, we drop terms involving i
tni from our MA formula and obtain MA Ti tni . As
stated in Section 3, we assume the tn values to be a smallcomponent of the overall job execution times, so we also dropthe tni term and arrive at the approximation MA Ti.
Let Ti denote the waiting time of the oldest enqueued Fijob, which should roughly equal Ti in steady state. We useTi as the basis for our priority based scheduling policy:
MA Policy(FIFO) : Pi = Ti
This policy can be thought of as FIFO applied to job familybatches, since it schedules the family of the job that hasbeen waiting the longest.
5. PRACTICAL CONSIDERATIONSThe scheduling policies we derived in Section 4 rely on
several parameters related to job execution cost and job ar-rival rates. In this section we explain how these parameterscan be obtained in practice.
Robust cost estimation: The fact that we were able todrop the nonsharable execution time tn from our schedulingpriority formulae not only keeps them simple, it also meansthat the scheduler does not need to estimate this quantity.In practice, estimating the full execution time of a job accu-rately can be difficult, especially in the Map-Reduce contextin which processing is specified via opaque user-defined func-tions. (In Section 6 we verify empirically that the perfor-mance of our policies is not sensitive to whether the factorsinvolving tn are included.)
Our formulae do require estimates of the sharable exe-cution time ts, i.e., the IO cost of scanning the input file.For large files, this cost is nearly linearly proportional tothe size of the input file, a quantity that is easy to obtainfrom system metadata. (The proportionality constant canbe dropped, as linear scaling of the ts values does not affectour priority-based scheduling policies.)
Dynamic estimation of arrival rates: Some of our pri-ority formulae contain values, which denote job arrivalrates. Under the Poisson model of arrival, one can estimatethe values dynamically, by keeping a time-decayed countof arrivals. In this way the arrival rate estimates ( values)automatically adjust as the workload shifts over time. (SeeSection 6.1 for details.)
6. BASIC EXPERIMENTSIn this section we present experiments that:
Justify ignoring the nonsharable execution time compo-nent tn in our scheduling policies (Section 6.2).
Compare our scheduling policy variants empirically (Sec-tion 6.3).
(We compare our policies against baseline policies in Sec-tion 8.)
6.1 Experimental SetupWe built a simulator and a workload generator. Our work-
load consists of 100 job families. For each job family, thesharable cost ts is generated from the heavy-tailed distri-bution 1 + |X |, where X is a Cauchy random variable. Forgreater realism, the nonsharable cost tn is on a per-job basis,rather than a per-family basis as in our model in Section 3.
In our default workload, each time a job arrives, we selecta nonshared cost randomly as follows: with probability 0.6,tn = 0.1 ts; with probability 0.2, tn = 0.2 ts; with prob-ability 0.2, tn = 0.3 ts. (The scenario we focus on in thispaper is one in which the shared cost dominates, because itrepresents IO and jobs tend to be IO-bound, as discussed inSection 3.) In some of our experiments we deviate from thisdefault workload and study what happens when tn tends tobe larger than ts.
Shared cost divisor
tn-ignorant policy for AA PWTtn-aware policy for AA PWT
Figure 3: tn-awareness versus tn-ignorance for AAPolicy 2.
Shared cost divisor
tn-ignorant policy for MA PWT
tn-aware policy for MA PWT
Figure 4: tn-awareness versus tn-ignorance for MAPolicy.
Job arrival events are generated using the standard ho-mogenous Poisson point process . Each job family Fi hasan arrival parameter i which represents the expected num-ber of jobs that arrive in one unit of time. There are 500, 000units of time in each run of the experiments. The i valuesare initially chosen from a Pareto distribution with parame-ter = 1.9 and then are rescaled so that
ni ] = load.
The total asymptotic system load (Pitni ) is 0.5 by default.
Some of our scheduling policies require estimation of thejob arrival rate i. To do this, we maintain an estimate Iiof the difference in the arrival times of the next two jobsin family Fi. We adjust Ii as new job arrivals occur, bytaking a weighted average of our previous estimate Ii andAi, the difference in arrival times of the two most recent jobsfrom Fi. Formally, the update step is Ii 0.05Ai + 0.95Ii.Given Ii and the time t since the last arrival of a job in Fi,we estimate i as 1/Ii if t < Ii and as 10.05+0.95Ii otherwise.
6.2 Influence of Nonshared Execution TimeIn our first set of experiments, we measure how knowl-
edge of tn affects our scheduling policies. Recall that inSections 4.1.3 and 4.2.1 we dropped tn from the priorityformulae, on the grounds that the factors involving tn aresmall relative to other factors. To validate ignoring tn in ourscheduling policies, we compare tn-aware variants (whichuse the full formulae with tn values) against the tn-ignorantvariants presented in Sections 4.1.3 and 4.2.1. (The tn-awarevariants are given knowledge of the precise tn value of each
0 200 400 600 800
1000 1200 1400 1600
Shared cost skew
AA Policy 2, lambda knownAA Policy 1, est. lambdaAA Policy 2, est. lambda
AA Policy 1, known lambda
Figure 5: AA Policy 1 versus AA Policy 2, varyingshared cost skew.
AA Policy 2, est. lambdaAA Policy 1, est. lambda
Figure 6: AA Policy 1 versus AA Policy 2, varyingshared cost skew, it
si = const.
job instance in the queue.)Figures 3 and 4 plot the performance of the tn-aware and
tn-ignorant variants of our policies (AA Policy 2 and MAPolicy, respectively) as we vary the magnitude of the sharedcost (keeping the tn distribution and values fixed). Inboth graphs, the y-axis plots the metric the policy is tunedto optimize (AA PWT and MA PWT, respectively). Thex-axes plot the shared cost divisor, which is the factor bywhich we divided all shared costs. When the shared costdivisor is large (e.g., 100), the ts values become quite smallrelative to the tn values, on average.
Even when nonshared costs are large relative to sharedcosts (right-hand side of Figures 3 and 4), tn-awareness haslittle impact on performance. Hence from this point forwardwe only consider the simpler, tn-ignorant variants of ourpolicies.
6.3 Comparison of Policy Variants
6.3.1 Relaxation 1 versus Relaxation 2We now turn to a comparison of AA Policy 1 versus AA
Policy 2 (recall that these are based on Relaxation 1 (Sec-tion 4.1.1) and Relaxation 2 (Section 4.1.2) of the originalAA PWT minimization problem, respectively). Figure 5shows that the two variants exhibit nearly identical perfor-mance, even as we vary the skew in the shared cost (ts) dis-tribution among job families (here there are five job familiesFi with shared cost tsi = i, where is the skew parameter).
Figure 7: Relative effectiveness of different priorityformula variants.
However, if we introduce the invariant that i tsi (whichrepresents the sharability of jobs in family Fi; see Sec-tion 4.1.3) remain constant across all job families Fi, a dif-ferent picture emerges. Figure 6 shows the result of varyingthe shared cost skew, as we hold i tsi constant across jobfamilies. (Here there are two job families: ts2 = 1 = 1 andts1 = 2 = skew parameter (x-axis).) In this case, we see aclear difference in performance between the policies basedon the two relaxations, with the one based on Relaxation 2(AA Policy 2) performing much better.
Overall, it appears that AA Policy 2 dominates AA Policy1, as expected. As to whether the case in which AA Policy2 performs significantly better than AA Policy 1 is likelyto occur in practice, we do not know. Clearly, using AAPolicy 2 is the safest option, and besides it is not muchmore complex to implement than AA Policy 1.
6.3.2 Use of Different EstimatorsRecall that our AA Policies 1 and 2 (Section 4.1.3) have a
B2i /i term. In the model assumed by Relaxation 1, usingthe equivalence Bi = Ti i, we can rewrite this term infour different ways: B2i /i (using batch size), T
2i i (using
waiting time), BiTi (the geometric mean of the two previousoptions), and max
B2i /i, T
In Figure 7 we compare these variants, and also compareusing the true values versus using an online estimator for as described in Section 6.1. We used a more skewed non-shared cost (tn) distribution than in our other experiments,to get a clear separation of the variants. In particular weused: with probability 0.6, tn = 0.1 ts; with probability 0.2,tn = 0.2 ts; with probability 0.1, tn = 0.5 ts; with probabil-ity 0.1, tn = 1.0 ts. We generated 20 sample workloads, andfor each workload we computed the best AA PWT amongthe policy variants. For each policy variant, Figure 7 plotsthe fraction of times the policy variant had an AA PWT thatwas more than 3% worse than the best AA PWT for eachworkload. The result is that the variant that uses B2i /i(the form given in Section 4.1.3) clearly outperforms therest. Furthermore, estimating the arrival rates ( values)works fine, compared to knowing them in advance via anoracle.
6.4 Summary of FindingsThe findings from our basic experiments are:
Estimating the arrival rates ( values) online, as op-posed to knowing them from an oracle, does not hurtperformance.
It is not necessary to incorporate tn estimates into thepriority functions.
AA Policy 2 (which is based on Relaxation 2) domi-nates AA Policy 1 (based on Relaxation 1).
From this point forward, we use tn-ignorant AA Policy 2with online estimation.
7. HYBRID SCHEDULING POLICIESThe quality of a scheduling policy is generally evaluated
using several criteria  and so optimizing for either the av-erage or maximum perceived wait time, as in Section 4, maybe too extreme. If we optimize solely for the average, theremay be certain jobs with very high PWT. Conversely if weoptimize solely for the maximum, we end up punishing themajority of jobs in order to help a few outlier jobs. In prac-tice it may make more sense to optimize for a combinationof average and maximum PWT. A simple approach is tooptimize for a linear combination of the two:
AAi + (1 ) MAi
where AA denotes average absolute PWT and MA denotesmaximum absolute PWT. The parameter [0, 1] denotesthe relative importance of having low average PWT versuslow maximum PWT.
We apply the methods used in Section 4 to the hybridoptimization objective, resulting in the following policy:
Hybrid Policy : Pi = 1
2 Pj j
+ xi (1 ) T 2itsi
where xi = 1 if Ti = maxj Tj , and xi = 0 otherwise.
The hybrid policy degenerates to the nonhybrid policiesof Section 4 if we set = 0 or = 1. For intermediatevalues of , job families receive the same relative priorityas they would under the average PWT regime, except thefamily that has been waiting the longest (i.e., the one withxi = 1), which gets an extra boost in priority. This extraboost reduces the maximum wait time, while raising theaverage wait time a bit.
8. FURTHER EXPERIMENTSWe are now ready for further experiments. In particular
The behavior of our hybrid policy (Section 8.1). The performance of our policies compared to baseline
policies (Section 8.2.1).
The ability to cope with large bursts of job arrivals (Sec-tion 8.2.2).
200 220 240 260 280 300 320 340 360 380 400 420
1.000.990.950.800.600.400.200 500 1000 1500 2000 2500 3000 3500 4000 4500
AA PWTMA PWT
Figure 8: Hybrid Policy performance on average andmaximum absolute PWT, as we vary the hybrid pa-rameter .
AR PWTMR PWT
Figure 9: Hybrid Policy performance on average andmaximum relative PWT, as we vary .
8.1 Hybrid PolicyFigure 8 shows the performance of our Hybrid Policy (Sec-
tion 7), in terms of both average and maximum absolutePWT. Figure 9 shows the same thing, but for relative PWT.In both graphs the x-axis plots the hybrid parameter (thisaxis is not on a linear scale, for the purpose of presentation).The decreasing curve plots average PWT, whose scale is onthe left-hand y-axis; the increasing curve plots maximumPWT, whose scale is on the right-hand y-axis.
With = 0, the hybrid policy behaves like the MA Pol-icy (FIFO), which achieves low maximum PWT at the ex-pense of very high average PWT. On the other extreme,with = 1 it behaves like the AA Policy, which achieveslow average PWT but very high maximum PWT. Using in-termediate values of trades off the two objectives. In boththe absolute and relative cases, a good balance is achievedat approximately = 0.99: maximum PWT is only slightlyhigher than with = 0, and average PWT is only slightlyhigher than with = 1.
Basically, when configured with = 0.99, the Hybrid Pol-icy mimics the AA Policy most of the time, but makes anexception if it notices that one job has been waiting for avery long time.
8.2 Comparison Against BaselinesIn the following experiments, we compare the policies AA
Policy 2, MA Policy (FIFO), and the Hybrid Policy with
Load (lambda increasing)
FIFOAA Policy 2
Figure 10: Policy performance on AA PWT met-ric, as job arrival rates increase (both SJF variantsshown).
= 0.99 against two generalizations of shortest-job-first(SJF): The policy Aware SJF is the one given in Sec-tion 4.1, which knows the nonshared cost of jobs in its queue,and chooses the job family for which it can execute the mostnumber of jobs per unit of time (i.e., the family that min-imizes (batch execution cost)/B). By a simple interchangeargument it can be shown that this policy is optimal for thecase when jobs have stopped arriving. The policy Oblivi-ous SJF does not know the nonshared cost of jobs and so itchooses the family for which ts/B is minimized. This policyis optimal for the case when jobs have stopped arriving andthe nonshared costs are small.
In these experiments we tested how these policies are af-fected by the total load placed on the system. (Recall fromSection 3.1 that asymptotic load =
Pi tni .) To vary load,
we started with workloads with asymptotic load = 0.1, andthen caused load to increase by various increments, in oneof two ways: (1) increase the nonshared costs (tn values), or(2) increase the job arrival rates ( values). In both cases,all other workload parameters are held constant.
In Section 8.2.1 we report results for the case where jobarrivals are generated by a homogeneous Poisson point pro-cess. In Section 8.2.2 we report results under bursty arrivals.
8.2.1 Stationary WorkloadsIn Figure 10 we plot AA PWT as the job arrival rate,
and thus total system load, increases. It is clear that AwareSJF has terrible performance. The reason is as follows: Inour workload generator, expected nonshared costs are pro-portional to shared costs (e.g., the cost of a CPU scan ofthe file is roughly proportional to its size on disk). Hence,Aware SJF has a very strong preference for job families withsmall shared cost (essentially ignoring the batch size), whichleads to starvation of ones with large shared cost.
In the rest of our experiments we drop Aware SJF, so wecan focus on the performance differences among the otherpolicies. Figure 11 is the same as Figure 10, with AwareSJF removed and the y-axis re-scaled. Here we see that AAPolicy 2 and the Hybrid Policy outperform both FIFO andSJF, especially at higher loads.
In Figure 12 we show the corresponding graph with MAPWT on the y-axis. Here, as expected, FIFO and the Hy-brid Policy perform very well.
Figures 13 and 14 show the corresponding plots for thecase where load increases due to a rise in nonshared cost.
Load (lambda increasing)
FIFOAA Policy 2
Figure 11: Policy performance on AA PWT metric,as job arrival rates increase.
Load (lambda increasing)
FIFOAA Policy 2
Figure 12: Policy performance on MA PWT metric,as job arrival rates increase.
These graphs are qualitatively similar to Figures 11 and 12,but the differences among the scheduling policies are lesspronounced.
Figures 15, 16, 17 and 18 are the same as Figures 11,12, 13 and 14, respectively, but with the y-axis measuringrelative PWT. If we are interested in minimizing relativePWT, our policies, which aim to minimize absolute PWT,do not necessarily do as well as SJF. Devising policies thatspecifically optimize for relative PWT is an important topicof future work.
8.2.2 Bursty WorkloadsTo model bursty job arrival behavior we use two different
Poisson processes for each job family. One Poisson processcorresponds to a low arrival rate and the other correspondsto an arrival rate that is ten times as fast. We switch be-tween these processes using a Markov process: after a jobarrives, we switch states (from high arrival rate to low ar-rival rate or vice versa) with probability 0.05, and stay inthe same state with probability 0.95. The initial probabilityof either state is the stationary distribution of this process(i.e. with probaility 0.5 we start with a high arrival rate).The expected number of jobs coming from bursts is the sameas the expected number of jobs not coming from bursts. Ifi is the arrival rate for the non-burst process, then the ex-pected i (number of jobs per second) asymptotically equals20i/11. Thus the load is
In Figures 19 and 20 we show the average and maximumabsolute PWT, respectively, for bursty job arrivals as load
0 200 400 600 800
1000 1200 1400 1600 1800
Load (non-shared cost increasing)
FIFOAA Policy 2
Figure 13: Policy performance on AA PWT metric,as nonshared costs increase.
10000 15000 20000 25000 30000 35000 40000
Load (non-shared cost increasing)
FIFOAA Policy 2
Figure 14: Policy performance on MA PWT metric,as nonshared costs increase.
increases via increasing non-shared costs. Here, SJF slightlyoutperforms our policies on AA PWT, but our Hybrid Policyperforms well on both average and maximum PWT.
Figure 21 shows average absolute PWT as the job arrivalrate increases, while keeping the nonshared cost distributionconstant. Here AA Policy 2 and Hybrid slightly outperformSJF.
To visualize the temporal behavior in the presence of bursts,Figure 22 shows a moving average of absolute PWT on they-axis, with time plotted on the x-axis. This time series is asample realization of the experiment that produced Figure19, with load = 0.7.
Since our policies focus on exploiting job arrival rate ()estimates, it is not surprising that under extremely burstyworkloads where there is no semblance of a steady-state ,they do not perform as well relative to the baselines as un-der stationary workloads (Section 8.2.1). However, it is re-assuring that our Hybrid Policy does not perform noticeablyworse than shortest-job-first, even under these extreme con-ditions.
8.3 Summary of FindingsThe findings from our experiments on the absolute PWT
metric which our policies are designed to optimize, are:
Our MA Policy (a generalization of FIFO to sharedworkloads) is the best policy on maximum PWT, butperforms poorly on average PWT, as expected.
Load (lambda increasing)
FIFOAA Policy 2
Figure 15: Policy performance on AR PWT metric,as job arrival rates increase.
0 1000 2000 3000 4000 5000 6000 7000 8000
Load (lambda increasing)
FIFOAA Policy 2
Figure 16: Policy performance on MR PWT metric,as job arrival rates increase.
Our Hybrid Policy, if properly tuned, achieves a sweetspot in balancing average and maximum PWT, andis able to perform quite well on both.
With stationary workloads, our Hybrid Policy substan-tially outperforms the better of two generalizations ofshortest-job-first to shared workloads.
With extremely bursty workloads, our Hybrid Policyperforms on par with shortest-job-first.
9. SUMMARYIn this paper we studied how to schedule jobs that can
share scans over a common set of input files. The goal is toamortize expensive file scans across many jobs, but withoutunduly hurting individual job response times.
Our approach builds a simple stochastic model of jobarrivals for each input file, and takes into account antici-pated future jobs while scheduling jobs that are currentlyenqueued. The main idea is as follows: If an enqueued jobJ requires scanning a large file F , and we anticipate thenear-term arrival of additional jobs that also scan F , then itmay make sense to delay J if it has not already waited toolong and other, less sharable, jobs are available to run.
We formalized the problem and derived a simple and ef-fective scheduling policy, under the objective of minimizingperceived wait time (PWT) for completion of user jobs. Ourpolicy can be tuned for average PWT, maximum PWT, or
100 150 200 250 300 350 400
Load (non-shared cost increasing)
FIFOAA Policy 2
Figure 17: Policy performance on AR PWT metric,as nonshared costs increase.
1000 1500 2000 2500 3000 3500 4000
Load (non-shared cost increasing)
FIFOAA Policy 2
Figure 18: Policy performance on MR PWT metric,as nonshared costs increase.
a combination of the two objectives. Compared with thebaseline shortest-job-first and FIFO policies, which do notaccount for future sharing opportunities, our policies achievesignificantly lower perceived wait time. This means thatusers jobs will generally complete earlier under our schedul-ing policies.
10. REFERENCES R. H. Arpaci-Dusseau. Run-time adaptation in River.
ACM Trans. on Computing Systems, 21(1):3686, Feb.2003.
 P. Billingsley. Probability and Measure. John Wiley &Sons, Inc., New York, 3nd edition, 1995.
 M. C. Chou, H. Liu, M. Queyranne, andD. Simchi-Levi. On the asymptotic optimality of asimple on-line algorithm for the stochasticsingle-machine weighted completion time problem andits extensions. Operations Research, 54(3):464474,2006.
 J. Dean and S. Ghemawat. MapReduce: Simplifieddata processing on large clusters. In Proc. OSDI, 2004.
 S. Divakaran and M. Saks. Online scheduling withrelease times and set-ups. Technical Report 2001-50,DIMACS, 2001.
 P. M. Fernandez. Red brick warehouse: A read-mostlyRDBMS for open SMP platforms. In Proc. ACMSIGMOD, 1994.
0 1000 2000 3000 4000 5000 6000 7000 8000 9000
Load (non-shared cost increasing)
FIFOAA Policy 2
Figure 19: Policy performance on AA PWT metric,as nonshared costs increase, with bursty job arrivals.
Load (non-shared cost increasing)
FIFOAA Policy 2
Figure 20: Policy performance on MA PWT metric,as nonshared costs increase, with bursty job arrivals.
 A. Gupta, S. Sudarshan, and S. Vishwanathan. Queryscheduling in multiquery optimization. InInternational Symposium on Database Engineeringand Applications (IDEAS), 2001.
 S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki.QPipe: A simultaneously pipelined relational queryengine. In Proc. ACM SIGMOD, 2005.
 H. Hoogeveen. Multicriteria scheduling. EuropeanJournal of Operational Research, 167(3):592623,2005.
 M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly.Dryad: Distributed data-parallel programs fromsequential building blocks. In Proc. EuropeanConference on Computer Systems (EuroSys), 2007.
 D. Karger, C. Stein, and J. Wein. Schedulingalgorithms. In M. J. Atallah, editor, Handbook ofAlgorithms and Theory of Computation. CRC Press,1997.
 E. L. Lawler. Optimal sequencing of a single machinesubject to precedence constraints. ManagementScience, 19(5):544546, 1973.
 J. Lenstra, A. R. Kan, and P. Brucker. Complexity ofmachine scheduling problems. Annals of DiscreteMathematics, 1:343362, 1977.
 N. Megow, M. Uetz, and T. Vredeveld. Models andalgorithms for stochastic online scheduling.Mathematics of Operations Research, 31(3), 2006.
Load with bursts (non-shared cost increasing)
FIFOAA Policy 2
Figure 21: Policy performance on AA PWT metric,as arrival rates increase, with bursty job arrivals.
Figure 22: Performance over time, with bursty jobarrivals.
 R. H. Mohring, F. J. Radermacher, and G. Weiss.Stochastic scheduling problems I general strategies.Mathematical Methods of Operations Research,28(7):193260, 1984.
 R. Motwani, S. Phillips, and E. Torng.Non-clairvoyant scheduling. In Proc. SODAConference, pages 422431, 1993.
 S. Muthukrishnan, R. Rajaraman, A. Shaheen, andJ. E. Gehrke. Online scheduling to minimize averagestretch. In Proc. FOCS Conference, 1999.
 K. Pruhs, J. Sgall, and E. Torng. Online scheduling,chapter 15. Handbook of Scheduling: Algorithms,Models, and Performance Analysis. Chapman &Hall/CRC, 2004.
 A. S. Schulz. New old algorithms for stochasticscheduling. In Algorithms for Optimization withIncomplete Information, Dagstuhl SeminarProceedings, 2005.
 J. Sgall. Online scheduling a survey. In On-LineAlgorithms, Lecture Notes in Computer Science.Springer-Verlag, 1997.
 M. Zukowski, S. Heman, N. Nes, and P. Boncz.Cooperative scans: Dynamic bandwidth sharing in aDBMS. In Proc. VLDB Conference, 2007.