mrjob-0.3.3.2/CHANGES.txt

v0.3.3.2, 2012-04-10 -- It's a race [condition]!
 * Option parsing no longer dies when -- is used as an argument (#435)
 * Fixed race condition where two jobs can join same job flow thinking it is idle, delaying one of the jobs (#438)
 * Better error message when a config file contains no data for the current runner (#433)

v0.3.3.1, 2012-04-02 -- Hothothothothothothotfix
 * Fixed S3 locking mechanism parsing of last modified time to work around a bug in boto

v0.3.3, 2012-03-29 -- Bug...bug...bug...bug...bug...FEATURE!
 * EMR:
   * Error detection code follows symlinks in Hadoop logs (#396)
   * terminate_idle_job_flows locks job flows before terminating them (#391)
   * terminate_idle_job_flows -qq silences all output (#380)
 * Other fixes:
   * mr_tower_of_powers test no longer requires Testify (#395)
   * Various runner du() implementations no longer broken (#393, #394)
   * Hadoop counter parser regex handles long lines better (#388)
   * Hadoop counter parser regex is more correct (#305)
   * Better error when trying to parse YAML without PyYAML (#348)

v0.3.2, 2012-02-22 -- AMI versions, spot instances, and more
 * Docs:
   * 'Testing with mrjob' section in docs (includes #321)
   * MRJobRunner.counters() included in docs (#321)
   * terminate_idle_job_flows is spelled correctly in docs (#339)
 * Running jobs:
   * local mode:
     * Allow non-string jobconf values again (this changed in v0.3.0)
     * Don't split *.gz files (#333)
   * emr mode:
     * Spot instance support via ec2_*_instance_bid_price and renamed instance type/number options (#219)
     * ami_version option to allow switching between EMR AMIs (#306)
     * 'Error while reading from input file' displays correct file (#358)
     * python_bin used for bootstrap_python_packages instead of just 'python' (#355)
     * Pooling works with bootstrap_mrjob=False (#347)
     * Pooling makes sure a job flow has space for the new job before joining it (#324)
 * EMR tools:
   * create_job_flow no longer tries to use an option that does not exist (#349)
   * report_long_jobs tool alerts on jobs that have run for more than X hours (#345)
   * mrboss no longer spells stderr 'stsderr'
   * terminate_idle_job_flows counts jobs with pending (but not running) steps as idle (#365)
   * terminate_idle_job_flows can terminate job flows near the end of a billable hour (#319)
   * audit_usage breaks down job flows by pool (#239)
   * Various tools (e.g. audit_usage) get list of job flows correctly (#346)

v0.3.1, 2011-12-20 -- Nooooo there were bugs!
 * Instance-type command-line arguments always override mrjob.conf (Issue #311)
 * Fixed crash in mrjob.tools.emr.audit_usage (Issue #315)
 * Tests now use unittest; python setup.py test now works (Issue #292)

v0.3.0, 2011-12-07 -- Worth the wait
 * Configuration:
   * Saner mrjob.conf locations (Issue #97):
     * ~/.mrjob is deprecated in favor of ~/.mrjob.conf
     * searching in PYTHONPATH is deprecated
     * MRJOB_CONF environment variable for custom paths
 * Defining Jobs (MRJob):
   * Combiner support (Issue #74)
   * *_init() and *_final() methods for mappers, combiners, and reducers (Issue #124)
   * mapper/combiner/reducer methods no longer need to contain a yield statement if they emit no data
   * Protocols:
     * Protocols can be anything with read() and write() methods, and are instances by default (Issue #229)
     * Set protocols with the *_PROTOCOL attributes or by re-defining the *_protocol() methods
     * Built-in protocol classes cache the encoded and decoded value of the last key for faster decoding during reducing (Issue #230)
     * --*protocol switches and aliases are deprecated (Issue #106)
   * Set Hadoop formats with HADOOP_*_FORMAT attributes or the hadoop_*_format() methods (Issue #241)
     * --hadoop-*-format switches are deprecated
     * Hadoop formats can no longer be set from mrjob.conf
   * Set jobconf with JOBCONF attribute or the jobconf() method (in addition to --jobconf)
   * Set Hadoop partitioner class with --partitioner, PARTITIONER, or partitioner() (Issue #6)
   * Custom option parsing (Issue #172)
   * Use mrjob.compat.get_jobconf_value() to get jobconf values from environment
 * Running jobs:
   * All modes:
     * All runners are Hadoop-version aware and use the correct jobconf and combiner invocation styles (Issue #111)
     * All types of URIs can be passed through to Hadoop (Issue #53)
     * Speed up steps with no mapper by using cat (Issue #5)
     * Stream compressed files with cat() method (Issue #17)
     * hadoop_bin, python_bin, and ssh_bin can now all take switches (Issue #96)
     * job_name_prefix option is gone (was deprecated)
     * Better cleanup (Issue #10):
       * Separate cleanup_on_failure option
       * More granular cleanup options
     * Cleaner handling of passthrough options (Issue #32)
   * emr mode:
     * job flow pooling (Issue #26)
     * vastly improved log fetching via SSH (Issue #2)
     * New tool: mrjob.tools.emr.fetch_logs
     * default Hadoop version on EMR is 0.20 (was 0.18)
     * ec2_instance_type option now only sets instance type for slave nodes when there are multiple EC2 instances (Issue #66)
     * New tool: mrjob.tools.emr.mrboss for running commands on all nodes and saving output locally
   * inline mode:
     * Supports cmdenv (Issue #136)
     * Passthrough options can now affect steps list (Issue #301)
   * local mode:
     * Runs 2 mappers and 2 reducers in parallel by default (Issue #228)
     * Preliminary Hadoop simulation for some jobconf variables (Issue #86)
 * Misc:
   * boto 2.0+ is now required (Issue #92)
   * Removed debian packaging (should be handled separately)

v0.2.8, 2011-09-07 -- Bugfixes and betas
 * Fix log parsing crash dealing with timeout errors
 * Make mr_travelling_salesman.py work with simplejson
 * Add emr_additional_info option, to support EMR beta features
 * Remove debian packaging (should be handled separately)
 * Fix crash when creating tmp bucket for job in us-east-1

v0.2.7, 2011-07-12 -- Hooray for interns!
 * All runner options can be set from the command line (Issue #121)
   * Including for mrjob.tools.emr.create_job_flow (Issue #142)
 * New EMR options:
   * availability_zone (Issue #72)
   * bootstrap_actions (Issue #69)
   * enable_emr_debugging (Issue #133)
 * Read counters from EMR log files (Issue #134)
 * Clean old files out of S3 with mrjob.tools.emr.s3_tmpwatch (Issue #9)
 * EMR parses and reports job failure due to steps timing out (Issue #15)
 * EMR bootstrap files are no longer made public on S3 (Issue #70)
 * mrjob.tools.emr.terminate_idle_job_flows handles custom hadoop streaming jars correctly (Issue #116)
 * LocalMRJobRunner separates out counters by step (Issue #28)
 * bootstrap_python_packages works regardless of tarball name (Issue #49)
 * mrjob always creates temp buckets in the correct AWS region (Issue #64)
 * Catch abuse of __main__ in jobs (Issue #78)
 * Added mr_travelling_salesman example

v0.2.6, 2011-05-24 -- Hadoop 0.20 in EMR, inline runner, and more
 * Set Hadoop to run on EMR with --hadoop-version (Issue #71).
   * Default is still 0.18, but will change to 0.20 in mrjob v0.3.0.
 * New inline runner, for testing locally with a debugger
 * New --strict-protocols option, to catch unencodable data (Issue #76)
 * Added steps_python_bin option (for use with virtualenv)
 * mrjob no longer chokes when asked to run on an EMR job flow running Hadoop 0.20 (Issue #110)
 * mrjob no longer chokes on job flows with no LogUri (Issue #112)

v0.2.5, 2011-04-29 -- Hadoop input and output formats
 * Added hadoop_input/output_format options
 * You can now specify a custom Hadoop streaming jar (hadoop_streaming_jar)
 * extra args to hadoop now come before -mapper/-reducer on EMR, so that e.g. -libjar will work (worked in hadoop mode since v0.2.2)
 * hadoop mode now supports s3n:// URIs (Issue #53)

v0.2.4, 2011-03-09 -- fix bootstrapping mrjob
 * Fix bootstrapping of mrjob in hadoop and local mode (Issue #89)
 * SSH tunnels try to use the same port for the same job flow (Issue #67)
 * Added mr_postfix_bounce and mr_pegasos_svm to examples.
 * Retry on spurious 505s from EMR API

v0.2.3, 2011-02-24 -- boto compatibility
 * Fix incompatibility with boto 2.0b4 (Issue #91)

v0.2.2, 2011-02-15 -- GET/POST EMR issue
 * Use POST requests for most EMR queries (EMR was choking on large GETs)
 * find_probable_cause_of_failure() ignores transient errors (Issue #31)
 * --hadoop-arg now actually works (Issue #79)
 * on Hadoop, extra args are added first, so you can set e.g. -libjar
 * S3 buckets may now have . in their names
 * MRJob scripts now respect --quiet (Issue #84)
 * added --no-output option for MRJob scripts (Issue #81)
 * added --python-bin option (Issue #54)

v0.2.1, 2010-11-17 -- laststatechangereason bugfix
 * Don't assume EMR sets laststatechangereason

v0.2.0, 2010-11-15 -- Many bugfixes, Windows support
 * New Features/Changes:
   * EMRJobRunner now prints % of mappers and reducers completed when you enable the SSH tunnel.
   * Added mr_page_rank example
   * Added mrjob.tools.emr.audit_usage script (Issue #21)
   * You can specify alternate job owners with the "owner" option. Useful for auditing usage. (Issue #59)
   * The job_name_prefix option has been renamed to label (the old name still works but is deprecated)
   * bootstrap_cmds and bootstrap_scripts no longer automatically invoke sudo
 * Bugs Fixed/Cleanup:
   * bootstrap files no longer get uploaded to S3 twice (Issue #8)
   * When using add_file_option(), show_steps() can now see the local version of the file (Issue #45)
   * Now works on Windows (Issue #46)
   * No longer requires external jar, tar, or zip binaries (Issue #47)
   * mrjob-* scratch bucket is only created as needed (Issue #50)
   * Can now specify us-east-1 region explicitly (Issue #58)
   * mrjob.tools.emr.terminate_idle_job_flows leaves Hive jobs alone (Issue #60)

v0.1.0, 2010-10-28 -- Same code, better version. It's official!

v0.1.0-pre3, 2010-10-27 -- Pre-release to run Yelp code against
 * Added debian packaging
 * mrjob bootstrapping can now deal with symlinks in site-packages/mrjob
 * MRJobRunner.stream_output() can now be called multiple times

v0.1.0-pre2, 2010-10-25 -- Second pre-release after testing
 * Fixed small bugs that broke Python 2.5.1 and Python 2.7
 * Fixed reading mrjob.conf without yaml installed
 * Fix tests to work with modern simplejson and pipes.quote()
 * Auto-create temp bucket on S3 if we don't have one (Issue #16)
 * Auto-infer AWS region from bucket (Issue #7)
 * --steps now passes in all extra args (e.g. --protocol) (Issue #4)
 * Better docs

v0.1.0-pre1, 2010-10-21 -- Initial pre-release. YMMV!

mrjob-0.3.3.2/LICENSE.txt

Copyright 2009-2011 Yelp

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

mrjob-0.3.3.2/MANIFEST.in

include *.rst
include *.txt
prune docs

mrjob-0.3.3.2/mrjob/__init__.py

# Copyright 2009-2012 Yelp
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Write and run Hadoop Streaming jobs on Amazon Elastic MapReduce
or your own Hadoop cluster.
""" __author__ = 'David Marin ' __credits__ = [ 'Jordan Andersen ', 'Hunter Blanks ', 'Jim Blomo ', 'James Brown ', 'Kevin Burke ', 'David Dehghan ', 'Adam Derewecki ', 'Benjamin Goldenberg ', 'Brandon Haynes ', 'Brett Hoerner ', 'Stephen Johnson ', 'Matt Jones ', 'Nikolaos Koutsopoulos ', 'Julian Krause ', 'Robert Leftwich ', 'Wahbeh Qardaji ', 'Jimmy Retzlaff ', 'Ned Rockson ', 'Steve Spencer ', 'Jyry Suvilehto ', 'Matthew Tai ', 'Paul Wais ', ] __version__ = '0.3.3.2' mrjob-0.3.3.2/mrjob/boto_2_1_1_83aae37b.py0000664€q(¼€tzÕß0000003226611717277734023444 0ustar sjohnsonAD\Domain Users00000000000000# Copyright (c) 2010 Spotify AB # Copyright (c) 2010-2011 Yelp # # Permission is hereby granted, free of charge, to any person obtaining a # copy of this software and associated documentation files (the # "Software"), to deal in the Software without restriction, including # without limitation the rights to use, copy, modify, merge, publish, dis- # tribute, sublicense, and/or sell copies of the Software, and to permit # persons to whom the Software is furnished to do so, subject to the fol- # lowing conditions: # # The above copyright notice and this permission notice shall be included # in all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS # OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABIL- # ITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT # SHALL THE AUTHOR BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, # WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS # IN THE SOFTWARE. """Code from a bleeding-edge version of boto on github, copied here so that mrjob can formally depend on a stable release of boto (in this case, 2.0). This module will hopefully go away in mrjob v0.4. Please don't make multiple boto_* modules; just bump the module name to whatever version you need to work from, and re-copy the relevant code. This is intentionally somewhat ugly and tedious; our goal is to check the patches we need into boto as fast as we can, so that we don't need to copy code from future versions of boto into mrjob. """ import types import boto.emr.connection import boto.emr.emrobject import boto.emr.instance_group from boto.emr.emrobject import RunJobFlowResponse from boto.emr.step import JarStep # add the AmiVersion field to JobFlow class JobFlow(boto.emr.emrobject.JobFlow): Fields = boto.emr.emrobject.JobFlow.Fields | set(['AmiVersion']) # this is used into describe_jobflows(), below. We don't actually patch # the code for describe_jobflows(); just by virtue of being in this module, # it refers to the JobFlow class above rather than the one in boto. # copied in run_jobflow() and supporting functions. This supports the # additional_info, ami_version, and instance_groups keywords, which don't # exist in boto 2.0, as well as disabling the HadoopVersion API parameter. 
class EmrConnection(boto.emr.connection.EmrConnection): def describe_jobflows(self, states=None, jobflow_ids=None, created_after=None, created_before=None): """ Retrieve all the Elastic MapReduce job flows on your account :type states: list :param states: A list of strings with job flow states wanted :type jobflow_ids: list :param jobflow_ids: A list of job flow IDs :type created_after: datetime :param created_after: Bound on job flow creation time :type created_before: datetime :param created_before: Bound on job flow creation time """ params = {} if states: self.build_list_params(params, states, 'JobFlowStates.member') if jobflow_ids: self.build_list_params(params, jobflow_ids, 'JobFlowIds.member') if created_after: params['CreatedAfter'] = created_after.strftime( boto.utils.ISO8601) if created_before: params['CreatedBefore'] = created_before.strftime( boto.utils.ISO8601) return self.get_list('DescribeJobFlows', params, [('member', JobFlow)]) def run_jobflow(self, name, log_uri, ec2_keyname=None, availability_zone=None, master_instance_type='m1.small', slave_instance_type='m1.small', num_instances=1, action_on_failure='TERMINATE_JOB_FLOW', keep_alive=False, enable_debugging=False, hadoop_version=None, steps=[], bootstrap_actions=[], instance_groups=None, additional_info=None, ami_version=None): """ Runs a job flow :type name: str :param name: Name of the job flow :type log_uri: str :param log_uri: URI of the S3 bucket to place logs :type ec2_keyname: str :param ec2_keyname: EC2 key used for the instances :type availability_zone: str :param availability_zone: EC2 availability zone of the cluster :type master_instance_type: str :param master_instance_type: EC2 instance type of the master :type slave_instance_type: str :param slave_instance_type: EC2 instance type of the slave nodes :type num_instances: int :param num_instances: Number of instances in the Hadoop cluster :type action_on_failure: str :param action_on_failure: Action to take if a step terminates :type keep_alive: bool :param keep_alive: Denotes whether the cluster should stay alive upon completion :type enable_debugging: bool :param enable_debugging: Denotes whether AWS console debugging should be enabled. :type hadoop_version: str :param hadoop_version: Version of Hadoop to use. If ami_version is not set, defaults to '0.20' for backwards compatibility with older versions of boto. :type steps: list(boto.emr.Step) :param steps: List of steps to add with the job :type bootstrap_actions: list(boto.emr.BootstrapAction) :param bootstrap_actions: List of bootstrap actions that run before Hadoop starts. :type instance_groups: list(boto.emr.InstanceGroup) :param instance_groups: Optional list of instance groups to use when creating this job. NB: When provided, this argument supersedes num_instances and master/slave_instance_type. :type ami_version: str :param ami_version: Amazon Machine Image (AMI) version to use for instances. Values accepted by EMR are '1.0', '2.0', and 'latest'; EMR currently defaults to '1.0' if you don't set 'ami_version'. :type additional_info: JSON str :param additional_info: A JSON string for selecting additional features :rtype: str :return: The jobflow id """ # hadoop_version used to default to '0.20', but this won't work # on later AMI versions, so only default if it ami_version isn't set. 
if not (hadoop_version or ami_version): hadoop_version = '0.20' params = {} if action_on_failure: params['ActionOnFailure'] = action_on_failure params['Name'] = name params['LogUri'] = log_uri # Common instance args common_params = self._build_instance_common_args(ec2_keyname, availability_zone, keep_alive, hadoop_version) params.update(common_params) # NB: according to the AWS API's error message, we must # "configure instances either using instance count, master and # slave instance type or instance groups but not both." # # Thus we switch here on the truthiness of instance_groups. if not instance_groups: # Instance args (the common case) instance_params = self._build_instance_count_and_type_args( master_instance_type, slave_instance_type, num_instances) params.update(instance_params) else: # Instance group args (for spot instances or a heterogenous cluster) list_args = self._build_instance_group_list_args(instance_groups) instance_params = dict( ('Instances.%s' % k, v) for k, v in list_args.iteritems() ) params.update(instance_params) # Debugging step from EMR API docs if enable_debugging: debugging_step = JarStep(name='Setup Hadoop Debugging', action_on_failure='TERMINATE_JOB_FLOW', main_class=None, jar=self.DebuggingJar, step_args=self.DebuggingArgs) steps.insert(0, debugging_step) # Step args if steps: step_args = [self._build_step_args(step) for step in steps] params.update(self._build_step_list(step_args)) if bootstrap_actions: bootstrap_action_args = [self._build_bootstrap_action_args(bootstrap_action) for bootstrap_action in bootstrap_actions] params.update(self._build_bootstrap_action_list(bootstrap_action_args)) if ami_version: params['AmiVersion'] = ami_version if additional_info is not None: params['AdditionalInfo'] = additional_info response = self.get_object( 'RunJobFlow', params, RunJobFlowResponse, verb='POST') return response.jobflowid def _build_instance_common_args(self, ec2_keyname, availability_zone, keep_alive, hadoop_version): """ Takes a number of parameters used when starting a jobflow (as specified in run_jobflow() above). Returns a comparable dict for use in making a RunJobFlow request. """ params = { 'Instances.KeepJobFlowAliveWhenNoSteps' : str(keep_alive).lower(), } if hadoop_version: params['Instances.HadoopVersion'] = hadoop_version if ec2_keyname: params['Instances.Ec2KeyName'] = ec2_keyname if availability_zone: params['Instances.Placement.AvailabilityZone'] = availability_zone return params def _build_instance_count_and_type_args(self, master_instance_type, slave_instance_type, num_instances): """ Takes a master instance type (string), a slave instance type (string), and a number of instances. Returns a comparable dict for use in making a RunJobFlow request. """ params = { 'Instances.MasterInstanceType' : master_instance_type, 'Instances.SlaveInstanceType' : slave_instance_type, 'Instances.InstanceCount' : num_instances, } return params def _build_instance_group_args(self, instance_group): """ Takes an InstanceGroup; returns a dict that, when its keys are properly prefixed, can be used for describing InstanceGroups in RunJobFlow or AddInstanceGroups requests. 
""" params = { 'InstanceCount' : instance_group.num_instances, 'InstanceRole' : instance_group.role, 'InstanceType' : instance_group.type, 'Name' : instance_group.name, 'Market' : instance_group.market } if instance_group.market == 'SPOT': params['BidPrice'] = instance_group.bidprice return params def _build_instance_group_list_args(self, instance_groups): """ Takes a list of InstanceGroups, or a single InstanceGroup. Returns a comparable dict for use in making a RunJobFlow or AddInstanceGroups request. """ if type(instance_groups) != types.ListType: instance_groups = [instance_groups] params = {} for i, instance_group in enumerate(instance_groups): ig_dict = self._build_instance_group_args(instance_group) for key, value in ig_dict.iteritems(): params['InstanceGroups.member.%d.%s' % (i+1, key)] = value return params # This version of InstanceGroup has spot support. class InstanceGroup(boto.emr.instance_group.InstanceGroup): def __init__(self, num_instances, role, type, market, name, bidprice=None): self.num_instances = num_instances self.role = role self.type = type self.market = market self.name = name if market == 'SPOT': if not isinstance(bidprice, basestring): raise ValueError('bidprice must be specified if market == SPOT') self.bidprice = bidprice def __repr__(self): if self.market == 'SPOT': return '%s.%s(name=%r, num_instances=%r, role=%r, type=%r, market = %r, bidprice = %r)' % ( self.__class__.__module__, self.__class__.__name__, self.name, self.num_instances, self.role, self.type, self.market, self.bidprice) else: return '%s.%s(name=%r, num_instances=%r, role=%r, type=%r, market = %r)' % ( self.__class__.__module__, self.__class__.__name__, self.name, self.num_instances, self.role, self.type, self.market) mrjob-0.3.3.2/mrjob/compat.py0000664€q(¼€tzÕß0000005701511740642733021574 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
"""Utility functions for compatibility with different version of hadoop.""" from distutils.version import LooseVersion import os # keep a mapping for all the names of old/new jobconf env variables # http://hadoop.apache.org/mapreduce/docs/current/mapred-default.html # lists alternative names for jobconf variables # full listing thanks to translation table in # http://pydoop.sourceforge.net/docs/examples/intro.html#hadoop-0-21-0-notes JOBCONF_DICT_LIST = [ {'0.18': 'create.empty.dir.if.nonexist', '0.21': 'mapreduce.jobcontrol.createdir.ifnotexist'}, {'0.18': 'hadoop.job.history.location', '0.21': 'mapreduce.jobtracker.jobhistory.location'}, {'0.18': 'hadoop.job.history.user.location', '0.21': 'mapreduce.job.userhistorylocation'}, {'0.18': 'hadoop.net.static.resolutions', '0.21': 'mapreduce.tasktracker.net.static.resolutions'}, {'0.18': 'hadoop.pipes.command-file.keep', '0.21': 'mapreduce.pipes.commandfile.preserve'}, {'0.18': 'hadoop.pipes.executable', '0.21': 'mapreduce.pipes.executable'}, {'0.18': 'hadoop.pipes.executable.interpretor', '0.21': 'mapreduce.pipes.executable.interpretor'}, {'0.18': 'hadoop.pipes.java.mapper', '0.21': 'mapreduce.pipes.isjavamapper'}, {'0.18': 'hadoop.pipes.java.recordreader', '0.21': 'mapreduce.pipes.isjavarecordreader'}, {'0.18': 'hadoop.pipes.java.recordwriter', '0.21': 'mapreduce.pipes.isjavarecordwriter'}, {'0.18': 'hadoop.pipes.java.reducer', '0.21': 'mapreduce.pipes.isjavareducer'}, {'0.18': 'hadoop.pipes.partitioner', '0.21': 'mapreduce.pipes.partitioner'}, {'0.18': 'io.sort.factor', '0.21': 'mapreduce.task.io.sort.factor'}, {'0.18': 'io.sort.mb', '0.21': 'mapreduce.task.io.sort.mb'}, {'0.18': 'io.sort.spill.percent', '0.21': 'mapreduce.map.sort.spill.percent'}, {'0.18': 'job.end.notification.url', '0.21': 'mapreduce.job.end-notification.url'}, {'0.18': 'job.end.retry.attempts', '0.21': 'mapreduce.job.end-notification.retry.attempts'}, {'0.18': 'job.end.retry.interval', '0.21': 'mapreduce.job.end-notification.retry.interval'}, {'0.18': 'job.local.dir', '0.21': 'mapreduce.job.local.dir'}, {'0.18': 'jobclient.completion.poll.interval', '0.21': 'mapreduce.client.completion.pollinterval'}, {'0.18': 'jobclient.output.filter', '0.21': 'mapreduce.client.output.filter'}, {'0.18': 'jobclient.progress.monitor.poll.interval', '0.21': 'mapreduce.client.progressmonitor.pollinterval'}, {'0.18': 'keep.failed.task.files', '0.21': 'mapreduce.task.files.preserve.failedtasks'}, {'0.18': 'keep.task.files.pattern', '0.21': 'mapreduce.task.files.preserve.filepattern'}, {'0.18': 'key.value.separator.in.input.line', '0.21': 'mapreduce.input.keyvaluelinerecordreader.key.value.separator'}, {'0.18': 'local.cache.size', '0.21': 'mapreduce.tasktracker.cache.local.size'}, {'0.18': 'map.input.file', '0.21': 'mapreduce.map.input.file'}, {'0.18': 'map.input.length', '0.21': 'mapreduce.map.input.length'}, {'0.18': 'map.input.start', '0.21': 'mapreduce.map.input.start'}, {'0.18': 'map.output.key.field.separator', '0.21': 'mapreduce.map.output.key.field.separator'}, {'0.18': 'map.output.key.value.fields.spec', '0.21': 'mapreduce.fieldsel.map.output.key.value.fields.spec'}, {'0.18': 'mapred.binary.partitioner.left.offset', '0.21': 'mapreduce.partition.binarypartitioner.left.offset'}, {'0.18': 'mapred.binary.partitioner.right.offset', '0.21': 'mapreduce.partition.binarypartitioner.right.offset'}, {'0.18': 'mapred.cache.archives', '0.21': 'mapreduce.job.cache.archives'}, {'0.18': 'mapred.cache.archives.timestamps', '0.21': 'mapreduce.job.cache.archives.timestamps'}, {'0.18': 
'mapred.cache.files', '0.21': 'mapreduce.job.cache.files'}, {'0.18': 'mapred.cache.files.timestamps', '0.21': 'mapreduce.job.cache.files.timestamps'}, {'0.18': 'mapred.cache.localArchives', '0.21': 'mapreduce.job.cache.local.archives'}, {'0.18': 'mapred.cache.localFiles', '0.21': 'mapreduce.job.cache.local.files'}, {'0.18': 'mapred.child.tmp', '0.21': 'mapreduce.task.tmp.dir'}, {'0.18': 'mapred.cluster.average.blacklist.threshold', '0.21': 'mapreduce.jobtracker.blacklist.average.threshold'}, {'0.18': 'mapred.cluster.map.memory.mb', '0.21': 'mapreduce.cluster.mapmemory.mb'}, {'0.18': 'mapred.cluster.max.map.memory.mb', '0.21': 'mapreduce.jobtracker.maxmapmemory.mb'}, {'0.18': 'mapred.cluster.max.reduce.memory.mb', '0.21': 'mapreduce.jobtracker.maxreducememory.mb'}, {'0.18': 'mapred.cluster.reduce.memory.mb', '0.21': 'mapreduce.cluster.reducememory.mb'}, {'0.18': 'mapred.committer.job.setup.cleanup.needed', '0.21': 'mapreduce.job.committer.setup.cleanup.needed'}, {'0.18': 'mapred.compress.map.output', '0.21': 'mapreduce.map.output.compress'}, {'0.18': 'mapred.create.symlink', '0.21': 'mapreduce.job.cache.symlink.create'}, {'0.18': 'mapred.data.field.separator', '0.21': 'mapreduce.fieldsel.data.field.separator'}, {'0.18': 'mapred.debug.out.lines', '0.21': 'mapreduce.task.debugout.lines'}, {'0.18': 'mapred.healthChecker.interval', '0.21': 'mapreduce.tasktracker.healthchecker.interval'}, {'0.18': 'mapred.healthChecker.script.args', '0.21': 'mapreduce.tasktracker.healthchecker.script.args'}, {'0.18': 'mapred.healthChecker.script.path', '0.21': 'mapreduce.tasktracker.healthchecker.script.path'}, {'0.18': 'mapred.healthChecker.script.timeout', '0.21': 'mapreduce.tasktracker.healthchecker.script.timeout'}, {'0.18': 'mapred.heartbeats.in.second', '0.21': 'mapreduce.jobtracker.heartbeats.in.second'}, {'0.18': 'mapred.hosts', '0.21': 'mapreduce.jobtracker.hosts.filename'}, {'0.18': 'mapred.hosts.exclude', '0.21': 'mapreduce.jobtracker.hosts.exclude.filename'}, {'0.18': 'mapred.inmem.merge.threshold', '0.21': 'mapreduce.reduce.merge.inmem.threshold'}, {'0.18': 'mapred.input.dir', '0.21': 'mapreduce.input.fileinputformat.inputdir'}, {'0.18': 'mapred.input.dir.formats', '0.21': 'mapreduce.input.multipleinputs.dir.formats'}, {'0.18': 'mapred.input.dir.mappers', '0.21': 'mapreduce.input.multipleinputs.dir.mappers'}, {'0.18': 'mapred.input.pathFilter.class', '0.21': 'mapreduce.input.pathFilter.class'}, {'0.18': 'mapred.jar', '0.21': 'mapreduce.job.jar'}, {'0.18': 'mapred.job.classpath.archives', '0.21': 'mapreduce.job.classpath.archives'}, {'0.18': 'mapred.job.classpath.files', '0.21': 'mapreduce.job.classpath.files'}, {'0.18': 'mapred.job.id', '0.21': 'mapreduce.job.id'}, {'0.18': 'mapred.job.map.memory.mb', '0.21': 'mapreduce.map.memory.mb'}, {'0.18': 'mapred.job.name', '0.21': 'mapreduce.job.name'}, {'0.18': 'mapred.job.priority', '0.21': 'mapreduce.job.priority'}, {'0.18': 'mapred.job.queue.name', '0.21': 'mapreduce.job.queuename'}, {'0.18': 'mapred.job.reduce.input.buffer.percent', '0.21': 'mapreduce.reduce.input.buffer.percent'}, {'0.18': 'mapred.job.reduce.markreset.buffer.percent', '0.21': 'mapreduce.reduce.markreset.buffer.percent'}, {'0.18': 'mapred.job.reduce.memory.mb', '0.21': 'mapreduce.reduce.memory.mb'}, {'0.18': 'mapred.job.reduce.total.mem.bytes', '0.21': 'mapreduce.reduce.memory.totalbytes'}, {'0.18': 'mapred.job.reuse.jvm.num.tasks', '0.21': 'mapreduce.job.jvm.numtasks'}, {'0.18': 'mapred.job.shuffle.input.buffer.percent', '0.21': 'mapreduce.reduce.shuffle.input.buffer.percent'}, 
{'0.18': 'mapred.job.shuffle.merge.percent', '0.21': 'mapreduce.reduce.shuffle.merge.percent'}, {'0.18': 'mapred.job.tracker', '0.21': 'mapreduce.jobtracker.address'}, {'0.18': 'mapred.job.tracker.handler.count', '0.21': 'mapreduce.jobtracker.handler.count'}, {'0.18': 'mapred.job.tracker.http.address', '0.21': 'mapreduce.jobtracker.http.address'}, {'0.18': 'mapred.job.tracker.jobhistory.lru.cache.size', '0.21': 'mapreduce.jobtracker.jobhistory.lru.cache.size'}, {'0.18': 'mapred.job.tracker.persist.jobstatus.active', '0.21': 'mapreduce.jobtracker.persist.jobstatus.active'}, {'0.18': 'mapred.job.tracker.persist.jobstatus.dir', '0.21': 'mapreduce.jobtracker.persist.jobstatus.dir'}, {'0.18': 'mapred.job.tracker.persist.jobstatus.hours', '0.21': 'mapreduce.jobtracker.persist.jobstatus.hours'}, {'0.18': 'mapred.job.tracker.retire.jobs', '0.21': 'mapreduce.jobtracker.retirejobs'}, {'0.18': 'mapred.job.tracker.retiredjobs.cache.size', '0.21': 'mapreduce.jobtracker.retiredjobs.cache.size'}, {'0.18': 'mapred.jobinit.threads', '0.21': 'mapreduce.jobtracker.jobinit.threads'}, {'0.18': 'mapred.jobtracker.instrumentation', '0.21': 'mapreduce.jobtracker.instrumentation'}, {'0.18': 'mapred.jobtracker.job.history.block.size', '0.21': 'mapreduce.jobtracker.jobhistory.block.size'}, {'0.18': 'mapred.jobtracker.maxtasks.per.job', '0.21': 'mapreduce.jobtracker.maxtasks.perjob'}, {'0.18': 'mapred.jobtracker.restart.recover', '0.21': 'mapreduce.jobtracker.restart.recover'}, {'0.18': 'mapred.jobtracker.taskScheduler', '0.21': 'mapreduce.jobtracker.taskscheduler'}, {'0.18': 'mapred.jobtracker.taskalloc.capacitypad', '0.21': 'mapreduce.jobtracker.taskscheduler.taskalloc.capacitypad'}, {'0.18': 'mapred.join.expr', '0.21': 'mapreduce.join.expr'}, {'0.18': 'mapred.join.keycomparator', '0.21': 'mapreduce.join.keycomparator'}, {'0.18': 'mapred.lazy.output.format', '0.21': 'mapreduce.output.lazyoutputformat.outputformat'}, {'0.18': 'mapred.line.input.format.linespermap', '0.21': 'mapreduce.input.lineinputformat.linespermap'}, {'0.18': 'mapred.linerecordreader.maxlength', '0.21': 'mapreduce.input.linerecordreader.line.maxlength'}, {'0.18': 'mapred.local.dir', '0.21': 'mapreduce.cluster.local.dir'}, {'0.18': 'mapred.local.dir.minspacekill', '0.21': 'mapreduce.tasktracker.local.dir.minspacekill'}, {'0.18': 'mapred.local.dir.minspacestart', '0.21': 'mapreduce.tasktracker.local.dir.minspacestart'}, {'0.18': 'mapred.map.child.env', '0.21': 'mapreduce.map.env'}, {'0.18': 'mapred.map.child.java.opts', '0.21': 'mapreduce.map.java.opts'}, {'0.18': 'mapred.map.child.log.level', '0.21': 'mapreduce.map.log.level'}, {'0.18': 'mapred.map.child.ulimit', '0.21': 'mapreduce.map.ulimit'}, {'0.18': 'mapred.map.max.attempts', '0.21': 'mapreduce.map.maxattempts'}, {'0.18': 'mapred.map.output.compression.codec', '0.21': 'mapreduce.map.output.compress.codec'}, {'0.18': 'mapred.map.task.debug.script', '0.21': 'mapreduce.map.debug.script'}, {'0.18': 'mapred.map.tasks', '0.21': 'mapreduce.job.maps'}, {'0.18': 'mapred.map.tasks.speculative.execution', '0.21': 'mapreduce.map.speculative'}, {'0.18': 'mapred.mapoutput.key.class', '0.21': 'mapreduce.map.output.key.class'}, {'0.18': 'mapred.mapoutput.value.class', '0.21': 'mapreduce.map.output.value.class'}, {'0.18': 'mapred.mapper.regex', '0.21': 'mapreduce.mapper.regex'}, {'0.18': 'mapred.mapper.regex.group', '0.21': 'mapreduce.mapper.regexmapper..group'}, {'0.18': 'mapred.max.map.failures.percent', '0.21': 'mapreduce.map.failures.maxpercent'}, {'0.18': 'mapred.max.reduce.failures.percent', '0.21': 
'mapreduce.reduce.failures.maxpercent'}, {'0.18': 'mapred.max.split.size', '0.21': 'mapreduce.input.fileinputformat.split.maxsize'}, {'0.18': 'mapred.max.tracker.blacklists', '0.21': 'mapreduce.jobtracker.tasktracker.maxblacklists'}, {'0.18': 'mapred.max.tracker.failures', '0.21': 'mapreduce.job.maxtaskfailures.per.tracker'}, {'0.18': 'mapred.merge.recordsBeforeProgress', '0.21': 'mapreduce.task.merge.progress.records'}, {'0.18': 'mapred.min.split.size', '0.21': 'mapreduce.input.fileinputformat.split.minsize'}, {'0.18': 'mapred.min.split.size.per.node', '0.21': 'mapreduce.input.fileinputformat.split.minsize.per.node'}, {'0.18': 'mapred.min.split.size.per.rack', '0.21': 'mapreduce.input.fileinputformat.split.minsize.per.rack'}, {'0.18': 'mapred.output.compress', '0.21': 'mapreduce.output.fileoutputformat.compress'}, {'0.18': 'mapred.output.compression.codec', '0.21': 'mapreduce.output.fileoutputformat.compress.codec'}, {'0.18': 'mapred.output.compression.type', '0.21': 'mapreduce.output.fileoutputformat.compress.type'}, {'0.18': 'mapred.output.dir', '0.21': 'mapreduce.output.fileoutputformat.outputdir'}, {'0.18': 'mapred.output.key.class', '0.21': 'mapreduce.job.output.key.class'}, {'0.18': 'mapred.output.key.comparator.class', '0.21': 'mapreduce.job.output.key.comparator.class'}, {'0.18': 'mapred.output.value.class', '0.21': 'mapreduce.job.output.value.class'}, {'0.18': 'mapred.output.value.groupfn.class', '0.21': 'mapreduce.job.output.group.comparator.class'}, {'0.18': 'mapred.permissions.supergroup', '0.21': 'mapreduce.cluster.permissions.supergroup'}, {'0.18': 'mapred.pipes.user.inputformat', '0.21': 'mapreduce.pipes.inputformat'}, {'0.18': 'mapred.reduce.child.env', '0.21': 'mapreduce.reduce.env'}, {'0.18': 'mapred.reduce.child.java.opts', '0.21': 'mapreduce.reduce.java.opts'}, {'0.18': 'mapred.reduce.child.log.level', '0.21': 'mapreduce.reduce.log.level'}, {'0.18': 'mapred.reduce.child.ulimit', '0.21': 'mapreduce.reduce.ulimit'}, {'0.18': 'mapred.reduce.max.attempts', '0.21': 'mapreduce.reduce.maxattempts'}, {'0.18': 'mapred.reduce.parallel.copies', '0.21': 'mapreduce.reduce.shuffle.parallelcopies'}, {'0.18': 'mapred.reduce.slowstart.completed.maps', '0.21': 'mapreduce.job.reduce.slowstart.completedmaps'}, {'0.18': 'mapred.reduce.task.debug.script', '0.21': 'mapreduce.reduce.debug.script'}, {'0.18': 'mapred.reduce.tasks', '0.21': 'mapreduce.job.reduces'}, {'0.18': 'mapred.reduce.tasks.speculative.execution', '0.21': 'mapreduce.reduce.speculative'}, {'0.18': 'mapred.seqbinary.output.key.class', '0.21': 'mapreduce.output.seqbinaryoutputformat.key.class'}, {'0.18': 'mapred.seqbinary.output.value.class', '0.21': 'mapreduce.output.seqbinaryoutputformat.value.class'}, {'0.18': 'mapred.shuffle.connect.timeout', '0.21': 'mapreduce.reduce.shuffle.connect.timeout'}, {'0.18': 'mapred.shuffle.read.timeout', '0.21': 'mapreduce.reduce.shuffle.read.timeout'}, {'0.18': 'mapred.skip.attempts.to.start.skipping', '0.21': 'mapreduce.task.skip.start.attempts'}, {'0.18': 'mapred.skip.map.auto.incr.proc.count', '0.21': 'mapreduce.map.skip.proc-count.auto-incr'}, {'0.18': 'mapred.skip.map.max.skip.records', '0.21': 'mapreduce.map.skip.maxrecords'}, {'0.18': 'mapred.skip.on', '0.21': 'mapreduce.job.skiprecords'}, {'0.18': 'mapred.skip.out.dir', '0.21': 'mapreduce.job.skip.outdir'}, {'0.18': 'mapred.skip.reduce.auto.incr.proc.count', '0.21': 'mapreduce.reduce.skip.proc-count.auto-incr'}, {'0.18': 'mapred.skip.reduce.max.skip.groups', '0.21': 'mapreduce.reduce.skip.maxgroups'}, {'0.18': 
'mapred.speculative.execution.speculativeCap', '0.21': 'mapreduce.job.speculative.speculativecap'}, {'0.18': 'mapred.submit.replication', '0.21': 'mapreduce.client.submit.file.replication'}, {'0.18': 'mapred.system.dir', '0.21': 'mapreduce.jobtracker.system.dir'}, {'0.18': 'mapred.task.cache.levels', '0.21': 'mapreduce.jobtracker.taskcache.levels'}, {'0.18': 'mapred.task.id', '0.21': 'mapreduce.task.attempt.id'}, {'0.18': 'mapred.task.is.map', '0.21': 'mapreduce.task.ismap'}, {'0.18': 'mapred.task.partition', '0.21': 'mapreduce.task.partition'}, {'0.18': 'mapred.task.profile', '0.21': 'mapreduce.task.profile'}, {'0.18': 'mapred.task.profile.maps', '0.21': 'mapreduce.task.profile.maps'}, {'0.18': 'mapred.task.profile.params', '0.21': 'mapreduce.task.profile.params'}, {'0.18': 'mapred.task.profile.reduces', '0.21': 'mapreduce.task.profile.reduces'}, {'0.18': 'mapred.task.timeout', '0.21': 'mapreduce.task.timeout'}, {'0.18': 'mapred.task.tracker.http.address', '0.21': 'mapreduce.tasktracker.http.address'}, {'0.18': 'mapred.task.tracker.report.address', '0.21': 'mapreduce.tasktracker.report.address'}, {'0.18': 'mapred.task.tracker.task-controller', '0.21': 'mapreduce.tasktracker.taskcontroller'}, {'0.18': 'mapred.tasktracker.dns.interface', '0.21': 'mapreduce.tasktracker.dns.interface'}, {'0.18': 'mapred.tasktracker.dns.nameserver', '0.21': 'mapreduce.tasktracker.dns.nameserver'}, {'0.18': 'mapred.tasktracker.events.batchsize', '0.21': 'mapreduce.tasktracker.events.batchsize'}, {'0.18': 'mapred.tasktracker.expiry.interval', '0.21': 'mapreduce.jobtracker.expire.trackers.interval'}, {'0.18': 'mapred.tasktracker.indexcache.mb', '0.21': 'mapreduce.tasktracker.indexcache.mb'}, {'0.18': 'mapred.tasktracker.instrumentation', '0.21': 'mapreduce.tasktracker.instrumentation'}, {'0.18': 'mapred.tasktracker.map.tasks.maximum', '0.21': 'mapreduce.tasktracker.map.tasks.maximum'}, {'0.18': 'mapred.tasktracker.memory_calculator_plugin', '0.21': 'mapreduce.tasktracker.resourcecalculatorplugin'}, {'0.18': 'mapred.tasktracker.memorycalculatorplugin', '0.21': 'mapreduce.tasktracker.resourcecalculatorplugin'}, {'0.18': 'mapred.tasktracker.reduce.tasks.maximum', '0.21': 'mapreduce.tasktracker.reduce.tasks.maximum'}, {'0.18': 'mapred.temp.dir', '0.21': 'mapreduce.cluster.temp.dir'}, {'0.18': 'mapred.text.key.comparator.options', '0.21': 'mapreduce.partition.keycomparator.options'}, {'0.18': 'mapred.text.key.partitioner.options', '0.21': 'mapreduce.partition.keypartitioner.options'}, {'0.18': 'mapred.textoutputformat.separator', '0.21': 'mapreduce.output.textoutputformat.separator'}, {'0.18': 'mapred.tip.id', '0.21': 'mapreduce.task.id'}, {'0.18': 'mapred.used.genericoptionsparser', '0.21': 'mapreduce.client.genericoptionsparser.used'}, {'0.18': 'mapred.userlog.limit.kb', '0.21': 'mapreduce.task.userlog.limit.kb'}, {'0.18': 'mapred.userlog.retain.hours', '0.21': 'mapreduce.job.userlog.retain.hours'}, {'0.18': 'mapred.work.output.dir', '0.21': 'mapreduce.task.output.dir'}, {'0.18': 'mapred.working.dir', '0.21': 'mapreduce.job.working.dir'}, {'0.18': 'mapreduce.combine.class', '0.21': 'mapreduce.job.combine.class'}, {'0.18': 'mapreduce.inputformat.class', '0.21': 'mapreduce.job.inputformat.class'}, {'0.18': 'mapreduce.jobtracker.permissions.supergroup', '0.21': 'mapreduce.cluster.permissions.supergroup'}, {'0.18': 'mapreduce.map.class', '0.21': 'mapreduce.job.map.class'}, {'0.18': 'mapreduce.outputformat.class', '0.21': 'mapreduce.job.outputformat.class'}, {'0.18': 'mapreduce.partitioner.class', '0.21': 
'mapreduce.job.partitioner.class'}, {'0.18': 'mapreduce.reduce.class', '0.21': 'mapreduce.job.reduce.class'}, {'0.18': 'min.num.spills.for.combine', '0.21': 'mapreduce.map.combine.minspills'}, {'0.18': 'reduce.output.key.value.fields.spec', '0.21': 'mapreduce.fieldsel.reduce.output.key.value.fields.spec'}, {'0.18': 'sequencefile.filter.class', '0.21': 'mapreduce.input.sequencefileinputfilter.class'}, {'0.18': 'sequencefile.filter.frequency', '0.21': 'mapreduce.input.sequencefileinputfilter.frequency'}, {'0.18': 'sequencefile.filter.regex', '0.21': 'mapreduce.input.sequencefileinputfilter.regex'}, {'0.18': 'slave.host.name', '0.21': 'mapreduce.tasktracker.host.name'}, {'0.18': 'tasktracker.contention.tracking', '0.21': 'mapreduce.tasktracker.contention.tracking'}, {'0.18': 'tasktracker.http.threads', '0.21': 'mapreduce.tasktracker.http.threads'}, {'0.18': 'user.name', '0.21': 'mapreduce.job.user.name'}, ] def _dict_list_to_compat_map(dict_list): # compat_map = { # ... # a: {'0.18': a, '0.21': b} # b: {'0.18': a, '0.21': b} # .. # } compat_map = {} for version_dict in dict_list: for value in version_dict.itervalues(): compat_map[value] = version_dict return compat_map _jobconf_map = _dict_list_to_compat_map(JOBCONF_DICT_LIST) def get_jobconf_value(variable, default=None): """Get the value of a jobconf variable from the runtime environment. For example, a :py:class:`~mrjob.job.MRJob` could use ``get_jobconf_value('map.input.file')`` to get the name of the file a mapper is reading input from. If the name of the jobconf variable is different in different versions of Hadoop (e.g. in Hadoop 0.21, `map.input.file` is `mapreduce.map.input.file`), we'll automatically try all variants before giving up. Return *default* if that jobconf variable isn't set. """ # try variable verbatim first name = variable.replace('.', '_') if name in os.environ: return os.environ[name] # try alternatives (arbitrary order) for var in _jobconf_map[variable].itervalues(): name = var.replace('.', '_') if name in os.environ: return os.environ[name] return default def translate_jobconf(variable, version): """Translate *variable* to Hadoop version *version*. If it's not a variable we recognize, leave as-is. """ if not variable in _jobconf_map: return variable req_version = LooseVersion(version) possible_versions = sorted(_jobconf_map[variable].keys(), reverse=True, key=lambda(v): LooseVersion(v)) for possible_version in possible_versions: if req_version >= LooseVersion(possible_version): return _jobconf_map[variable][possible_version] # return oldest version if we don't find required version return _jobconf_map[variable][possible_versions[-1]] def supports_combiners_in_hadoop_streaming(version): """Return True if this version of Hadoop Streaming supports combiners (i.e. >= 0.20.203), otherwise False. 
""" return version_gte(version, '0.20') def supports_new_distributed_cache_options(version): """Use ``-files`` and ``-archives`` instead of ``-cacheFile`` and ``-cacheArchive`` """ return version_gte(version, '0.20') def uses_020_counters(version): return version_gte(version, '0.20') def uses_generic_jobconf(version): """Use ``-D`` instead of ``-jobconf``""" return version_gte(version, '0.20') def version_gte(version, cmp_version_str): """Return True if version >= *cmp_version_str*.""" if not isinstance(version, basestring): raise TypeError('%r is not a string' % version) if not isinstance(cmp_version_str, basestring): raise TypeError('%r is not a string' % cmp_version_str) return LooseVersion(version) >= LooseVersion(cmp_version_str) mrjob-0.3.3.2/mrjob/conf.py0000664€q(¼€tzÕß0000003166211741151504021226 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """"mrjob.conf" is the name of both this module, and the global config file for :py:mod:`mrjob`. """ from __future__ import with_statement import glob import logging import os import shlex from mrjob.util import expand_path try: import simplejson as json # preferred because of C speedups json # quiet "redefinition of unused ..." warning from pyflakes except ImportError: import json # built in to Python 2.6 and later # yaml is nice to have, but we can fall back on JSON if need be try: import yaml yaml # quiet "redefinition of unused ..." warning from pyflakes except ImportError: yaml = None log = logging.getLogger('mrjob.conf') ### READING AND WRITING mrjob.conf ### def find_mrjob_conf(): """Look for :file:`mrjob.conf`, and return its path. Places we look: - The location specified by :envvar:`MRJOB_CONF` - :file:`~/.mrjob.conf` - :file:`~/.mrjob` (deprecated) - :file:`mrjob.conf` in any directory in :envvar:`PYTHONPATH` (deprecated) - :file:`/etc/mrjob.conf` Return ``None`` if we can't find it. Print a warning if its location is deprecated. """ def candidates(): """Return (path, deprecation_warning)""" if 'MRJOB_CONF' in os.environ: yield (expand_path(os.environ['MRJOB_CONF']), None) # $HOME isn't necessarily set on Windows, but ~ works # use os.path.join() so we don't end up mixing \ and / yield (expand_path(os.path.join('~', '.mrjob.conf')), None) # DEPRECATED: yield (expand_path(os.path.join('~', '.mrjob')), 'use ~/.mrjob.conf instead.') if os.environ.get('PYTHONPATH'): for dirname in os.environ['PYTHONPATH'].split(os.pathsep): yield (os.path.join(dirname, 'mrjob.conf'), 'Use $MRJOB_CONF to explicitly specify the path' ' instead.') # this only really makes sense on Unix, so no os.path.join() yield ('/etc/mrjob.conf', None) for path, deprecation_message in candidates(): log.debug('looking for configs in %s' % path) if os.path.exists(path): log.info('using configs in %s' % path) if deprecation_message: log.warning('This config path is deprecated and will stop' ' working in mrjob 0.4. 
%s' % deprecation_message) return path else: log.info("no configs found; falling back on auto-configuration") return None def real_mrjob_conf_path(conf_path=None): if conf_path is False: return None elif conf_path is None: return find_mrjob_conf() else: return expand_path(conf_path) def conf_object_at_path(conf_path): if conf_path is None: return None with open(conf_path) as f: if yaml: return yaml.safe_load(f) else: try: return json.load(f) except json.JSONDecodeError, e: msg = ('If your mrjob.conf is in YAML, you need to install' ' yaml; see http://pypi.python.org/pypi/PyYAML/') # JSONDecodeError currently has a msg attr, but it may not in # the future if hasattr(e, 'msg'): e.msg = '%s (%s)' % (e.msg, msg) else: e.msg = msg raise e # TODO 0.4: move to tests.test_conf def load_mrjob_conf(conf_path=None): """.. deprecated:: 0.3.3 Load the entire data structure in :file:`mrjob.conf`, which should look something like this:: {'runners': 'emr': {'OPTION': VALUE, ...}, 'hadoop: {'OPTION': VALUE, ...}, 'inline': {'OPTION': VALUE, ...}, 'local': {'OPTION': VALUE, ...}, } Returns ``None`` if we can't find :file:`mrjob.conf`. :type conf_path: str :param conf_path: an alternate place to look for mrjob.conf. If this is ``False``, we'll always return ``None``. """ # Only used by mrjob tests and possibly third parties. log.warn('mrjob.conf.load_mrjob_conf is deprecated.') conf_path = real_mrjob_conf_path(conf_path) return conf_object_at_path(conf_path) def load_opts_from_mrjob_conf(runner_alias, conf_path=None, already_loaded=None): """Load a list of dictionaries representing the options in a given mrjob.conf for a specific runner. Returns ``[(path, values)]``. If conf_path is not found, return [(None, {})]. :type runner_alias: str :param runner_alias: String identifier of the runner type, e.g. ``emr``, ``local``, etc. :type conf_path: str :param conf_path: an alternate place to look for mrjob.conf. If this is ``False``, we'll always return ``{}``. :type already_loaded: list :param already_loaded: list of :file:`mrjob.conf` paths that have already been loaded """ # Used to use load_mrjob_conf() here, but we need both the 'real' path and # the conf object, which we can't get cleanly from load_mrjob_conf. This # means load_mrjob_conf() is basically useless now except for in tests, # but it's exposed in the API, so we shouldn't kill it until 0.4 at least. conf_path = real_mrjob_conf_path(conf_path) conf = conf_object_at_path(conf_path) if conf is None: return [(None, {})] if already_loaded is None: already_loaded = [] already_loaded.append(conf_path) try: values = conf['runners'][runner_alias] or {} except (KeyError, TypeError, ValueError): log.warning('no configs for runner type %r in %s; returning {}' % (runner_alias, conf_path)) values = {} inherited = [] if conf.get('include', None): includes = conf['include'] if isinstance(includes, basestring): includes = [includes] for include in includes: if include in already_loaded: log.warn('%s tries to recursively include %s! (Already included:' ' %s)' % (conf_path, conf['include'], ', '.join(already_loaded))) else: inherited.extend(load_opts_from_mrjob_conf( runner_alias, include, already_loaded)) return inherited + [(conf_path, values)] def dump_mrjob_conf(conf, f): """Write out configuration options to a file. Useful if you don't want to bother to figure out YAML. *conf* should look something like this: {'runners': 'local': {'OPTION': VALUE, ...} 'emr': {'OPTION': VALUE, ...} 'hadoop: {'OPTION': VALUE, ...} } :param f: a file object to write to (e.g. 
``open('mrjob.conf', 'w')``) """ if yaml: yaml.safe_dump(conf, f, default_flow_style=False) else: json.dump(conf, f, indent=2) f.flush() ### COMBINING OPTIONS ### # combiners generally consider earlier values to be defaults, and later # options to override or add on to them. def combine_values(*values): """Return the last value in *values* that is not ``None``. The default combiner; good for simple values (booleans, strings, numbers). """ for v in reversed(values): if v is not None: return v else: return None def combine_lists(*seqs): """Concatenate the given sequences into a list. Ignore ``None`` values. Generally this is used for a list of commands we want to run; the "default" commands get run before any commands specific to your job. """ result = [] for seq in seqs: if seq: result.extend(seq) return result def combine_cmds(*cmds): """Take zero or more commands to run on the command line, and return the last one that is not ``None``. Each command should either be a list containing the command plus switches, or a string, which will be parsed with :py:func:`shlex.split` Returns either ``None`` or a list containing the command plus arguments. """ cmd = combine_values(*cmds) if cmd is None: return None elif isinstance(cmd, basestring): return shlex.split(cmd) else: return list(cmd) def combine_cmd_lists(*seqs_of_cmds): """Concatenate the given commands into a list. Ignore ``None`` values, and parse strings with :py:func:`shlex.split`. Returns a list of lists (each sublist contains the command plus arguments). """ seq_of_cmds = combine_lists(*seqs_of_cmds) return [combine_cmds(cmd) for cmd in seq_of_cmds] def combine_dicts(*dicts): """Combine zero or more dictionaries. Values from dicts later in the list take precedence over values earlier in the list. If you pass in ``None`` in place of a dictionary, it will be ignored. """ result = {} for d in dicts: if d: result.update(d) return result def combine_envs(*envs): """Combine zero or more dictionaries containing environment variables. Environment variables later from dictionaries later in the list take priority over those earlier in the list. For variables ending with ``PATH``, we prepend (and add a colon) rather than overwriting. If you pass in ``None`` in place of a dictionary, it will be ignored. """ return _combine_envs_helper(envs, local=False) def combine_local_envs(*envs): """Same as :py:func:`combine_envs`, except that paths are combined using the local path separator (e.g ``;`` on Windows rather than ``:``). """ return _combine_envs_helper(envs, local=True) def _combine_envs_helper(envs, local): if local: pathsep = os.pathsep else: pathsep = ':' result = {} for env in envs: if env: for key, value in env.iteritems(): if key.endswith('PATH') and result.get(key): result[key] = value + pathsep + result[key] else: result[key] = value return result def combine_paths(*paths): """Returns the last value in *paths* that is not ``None``. Resolve ``~`` (home dir) and environment variables.""" return expand_path(combine_values(*paths)) def combine_path_lists(*path_seqs): """Concatenate the given sequences into a list. Ignore None values. 
Resolve ``~`` (home dir) and environment variables, and expand globs that refer to the local filesystem.""" results = [] for path in combine_lists(*path_seqs): expanded = expand_path(path) # if we can't expand a glob, leave as-is (maybe it refers to # S3 or HDFS) paths = sorted(glob.glob(expanded)) or [expanded] results.extend(paths) return results def combine_opts(combiners, *opts_list): """The master combiner, used to combine dictionaries of options with appropriate sub-combiners. :param combiners: a map from option name to a combine_*() function to combine options by that name. By default, we combine options using :py:func:`combine_values`. :param opts_list: one or more dictionaries to combine """ final_opts = {} keys = set() for opts in opts_list: if opts: keys.update(opts) for key in keys: values = [] for opts in opts_list: if opts and key in opts: values.append(opts[key]) combine_func = combiners.get(key) or combine_values final_opts[key] = combine_func(*values) return final_opts ### PRIORITY ### def calculate_opt_priority(opts, opt_dicts): """Keep track of where in the order opts were specified, to handle opts that affect the same thing (e.g. ec2_*instance_type). Here is a rough guide to the values set by this function. They are Where specified Priority unset everywhere -1 blank 0 non-blank default 1 base conf file 2 inheriting conf [3-n] command line n+1 :type opts: iterable :type opt_dicts: list of dicts with keys also appearing in **opts** """ opt_priority = dict((opt, -1) for opt in opts) for priority, opt_dict in enumerate(opt_dicts): if opt_dict: for opt, value in opt_dict.iteritems(): if value is not None: opt_priority[opt] = priority return opt_priority mrjob-0.3.3.2/mrjob/emr.py0000664€q(¼€tzÕß0000035170111741151504021063 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp and Contributors # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from __future__ import with_statement from collections import defaultdict from datetime import datetime from datetime import timedelta import fnmatch import logging import os import posixpath import random import re import shlex import signal import socket from subprocess import Popen from subprocess import PIPE import time import urllib2 try: from cStringIO import StringIO StringIO # quiet "redefinition of unused ..." warning from pyflakes except ImportError: from StringIO import StringIO try: import simplejson as json # preferred because of C speedups json # quiet "redefinition of unused ..." warning from pyflakes except ImportError: import json # built in to Python 2.6 and later try: import boto import boto.ec2 import boto.emr import boto.exception import boto.utils from mrjob import boto_2_1_1_83aae37b boto # quiet "redefinition of unused ..." 
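# --- Editor's illustrative sketch (not part of the original module) ---
# A rough example of how the module-level describe_all_job_flows() helper
# defined later in this file might be used to page through recent job flows;
# the time window is hypothetical and a real call needs AWS credentials
# available to boto:
#
#     from datetime import datetime, timedelta
#
#     from mrjob.boto_2_1_1_83aae37b import EmrConnection
#     from mrjob.emr import describe_all_job_flows
#
#     conn = EmrConnection()
#     last_week = datetime.utcnow() - timedelta(days=7)
#     job_flows = describe_all_job_flows(conn, created_after=last_week)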
warning from pyflakes except ImportError: # don't require boto; MRJobs don't actually need it when running # inside hadoop streaming boto = None import mrjob from mrjob import compat from mrjob.conf import combine_cmds from mrjob.conf import combine_dicts from mrjob.conf import combine_lists from mrjob.conf import combine_paths from mrjob.conf import combine_path_lists from mrjob.logparsers import TASK_ATTEMPTS_LOG_URI_RE from mrjob.logparsers import STEP_LOG_URI_RE from mrjob.logparsers import EMR_JOB_LOG_URI_RE from mrjob.logparsers import NODE_LOG_URI_RE from mrjob.logparsers import scan_for_counters_in_files from mrjob.logparsers import scan_logs_in_order from mrjob.parse import is_s3_uri from mrjob.parse import parse_s3_uri from mrjob.pool import est_time_to_hour from mrjob.pool import pool_hash_and_name from mrjob.retry import RetryWrapper from mrjob.runner import MRJobRunner from mrjob.runner import GLOB_RE from mrjob.ssh import ssh_cat from mrjob.ssh import ssh_ls from mrjob.ssh import ssh_copy_key from mrjob.ssh import ssh_slave_addresses from mrjob.ssh import SSHException from mrjob.ssh import SSH_PREFIX from mrjob.ssh import SSH_LOG_ROOT from mrjob.ssh import SSH_URI_RE from mrjob.util import buffer_iterator_to_line_iterator from mrjob.util import cmd_line from mrjob.util import extract_dir_for_tar from mrjob.util import hash_object from mrjob.util import read_file log = logging.getLogger('mrjob.emr') JOB_TRACKER_RE = re.compile('(\d{1,3}\.\d{2})%') # if EMR throttles us, how long to wait (in seconds) before trying again? EMR_BACKOFF = 20 EMR_BACKOFF_MULTIPLIER = 1.5 EMR_MAX_TRIES = 20 # this takes about a day before we run out of tries # the port to tunnel to EMR_JOB_TRACKER_PORT = 9100 EMR_JOB_TRACKER_PATH = '/jobtracker.jsp' MAX_SSH_RETRIES = 20 # ssh should fail right away if it can't bind a port WAIT_FOR_SSH_TO_FAIL = 1.0 # sometimes AWS gives us seconds as a decimal, which we can't parse # with boto.utils.ISO8601 SUBSECOND_RE = re.compile('\.[0-9]+') # map from AWS region to EMR endpoint. See # http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?ConceptsRequestEndpoints.html REGION_TO_EMR_ENDPOINT = { 'EU': 'eu-west-1.elasticmapreduce.amazonaws.com', 'us-east-1': 'us-east-1.elasticmapreduce.amazonaws.com', 'us-west-1': 'us-west-1.elasticmapreduce.amazonaws.com', '': 'elasticmapreduce.amazonaws.com', # when no region specified } # map from AWS region to S3 endpoint. See # http://docs.amazonwebservices.com/AmazonS3/latest/dev/MakingRequests.html#RequestEndpoints REGION_TO_S3_ENDPOINT = { 'EU': 's3-eu-west-1.amazonaws.com', 'us-east-1': 's3.amazonaws.com', # no region-specific endpoint 'us-west-1': 's3-us-west-1.amazonaws.com', 'ap-southeast-1': 's3-ap-southeast-1.amazonaws.com', # no EMR endpoint yet '': 's3.amazonaws.com', } # map from AWS region to S3 LocationConstraint parameter for regions whose # location constraints differ from their AWS regions. 
See # http://docs.amazonwebservices.com/AmazonS3/latest/API/index.html?RESTBucketPUT.html REGION_TO_S3_LOCATION_CONSTRAINT = { 'us-east-1': '', } # map from instance type to number of compute units # from http://aws.amazon.com/ec2/instance-types/ EC2_INSTANCE_TYPE_TO_COMPUTE_UNITS = { 't1.micro': 2, 'm1.small': 1, 'm1.large': 4, 'm1.xlarge': 8, 'm2.xlarge': 6.5, 'm2.2xlarge': 13, 'm2.4xlarge': 26, 'c1.medium': 5, 'c1.xlarge': 20, 'cc1.4xlarge': 33.5, 'cg1.4xlarge': 33.5, } # map from instance type to GB of memory # from http://aws.amazon.com/ec2/instance-types/ EC2_INSTANCE_TYPE_TO_MEMORY = { 't1.micro': 0.6, 'm1.small': 1.7, 'm1.large': 7.5, 'm1.xlarge': 15, 'm2.xlarge': 17.5, 'm2.2xlarge': 34.2, 'm2.4xlarge': 68.4, 'c1.medium': 1.7, 'c1.xlarge': 7, 'cc1.4xlarge': 23, 'cg1.4xlarge': 22, } # Use this to figure out which hadoop version we're using if it's not # explicitly specified, so we can keep from passing deprecated command-line # options to Hadoop. If we encounter an AMI version we don't recognize, # we use whatever version matches 'latest'. # # The reason we don't just create a job flow and then query its Hadoop version # is that for most jobs, we create the steps and the job flow at the same time. AMI_VERSION_TO_HADOOP_VERSION = { None: '0.18', # ami_version not specified means version 1.0 '1.0': '0.18', '2.0': '0.20.205', 'latest': '0.20.205', } # EMR's hard limit on number of steps in a job flow MAX_STEPS_PER_JOB_FLOW = 256 def s3_key_to_uri(s3_key): """Convert a boto Key object into an ``s3://`` URI""" return 's3://%s/%s' % (s3_key.bucket.name, s3_key.name) # AWS actually gives dates in two formats, and we only recently started using # API calls that return the second. So the date parsing function is called # iso8601_to_*, but it also parses RFC1123. # Until boto starts seamlessly parsing these, we check for them ourselves. # Thu, 29 Mar 2012 04:55:44 GMT RFC1123 = '%a, %d %b %Y %H:%M:%S %Z' def iso8601_to_timestamp(iso8601_time): iso8601_time = SUBSECOND_RE.sub('', iso8601_time) try: return time.mktime(time.strptime(iso8601_time, boto.utils.ISO8601)) except ValueError: return time.mktime(time.strptime(iso8601_time, RFC1123)) def iso8601_to_datetime(iso8601_time): iso8601_time = SUBSECOND_RE.sub('', iso8601_time) try: return datetime.strptime(iso8601_time, boto.utils.ISO8601) except ValueError: return datetime.strptime(iso8601_time, RFC1123) def describe_all_job_flows(emr_conn, states=None, jobflow_ids=None, created_after=None, created_before=None): """Iteratively call ``EmrConnection.describe_job_flows()`` until we really get all the available job flow information. Currently, 2 months of data is available through the EMR API. This is a way of getting around the limits of the API, both on number of job flows returned, and how far back in time we can go. :type states: list :param states: A list of strings with job flow states wanted :type jobflow_ids: list :param jobflow_ids: A list of job flow IDs :type created_after: datetime :param created_after: Bound on job flow creation time :type created_before: datetime :param created_before: Bound on job flow creation time """ all_job_flows = [] ids_seen = set() # weird things can happen if we send no args the DescribeJobFlows API # (see Issue #346), so if nothing else is set, set created_before # to a day in the future. 
if not (states or jobflow_ids or created_after or created_before): created_before = datetime.utcnow() + timedelta(days=1) while True: if created_before and created_after and created_before < created_after: break log.debug('Calling describe_jobflows(states=%r, jobflow_ids=%r,' ' created_after=%r, created_before=%r)' % (states, jobflow_ids, created_after, created_before)) try: results = emr_conn.describe_jobflows( states=states, jobflow_ids=jobflow_ids, created_after=created_after, created_before=created_before) except boto.exception.BotoServerError, ex: if 'ValidationError' in ex.body: log.debug( ' reached earliest allowed created_before time, done!') break else: raise # don't count the same job flow twice job_flows = [jf for jf in results if jf.jobflowid not in ids_seen] log.debug(' got %d results (%d new)' % (len(results), len(job_flows))) all_job_flows.extend(job_flows) ids_seen.update(jf.jobflowid for jf in job_flows) if job_flows: # set created_before to be just after the start time of # the first job returned, to deal with job flows started # in the same second min_create_time = min(iso8601_to_datetime(jf.creationdatetime) for jf in job_flows) created_before = min_create_time + timedelta(seconds=1) # if someone managed to start 501 job flows in the same second, # they are still screwed (the EMR API only returns up to 500), # but this seems unlikely. :) else: if not created_before: created_before = datetime.utcnow() created_before -= timedelta(weeks=2) return all_job_flows def make_lock_uri(s3_tmp_uri, emr_job_flow_id, step_num): """Generate the URI to lock the job flow ``emr_job_flow_id``""" return s3_tmp_uri + 'locks/' + emr_job_flow_id + '/' + str(step_num) def _lock_acquire_step_1(s3_conn, lock_uri, job_name, mins_to_expiration=None): bucket_name, key_prefix = parse_s3_uri(lock_uri) bucket = s3_conn.get_bucket(bucket_name) key = bucket.get_key(key_prefix) # EMRJobRunner should start using a job flow within about a second of # locking it, so if it's been a while, then it probably crashed and we # can just use this job flow. key_expired = False if key and mins_to_expiration is not None: last_modified = iso8601_to_datetime(key.last_modified) age = datetime.utcnow() - last_modified if age > timedelta(minutes=mins_to_expiration): key_expired = True if key is None or key_expired: key = bucket.new_key(key_prefix) key.set_contents_from_string(job_name) return key else: return None def _lock_acquire_step_2(key, job_name): key_value = key.get_contents_as_string() return (key_value == job_name) def attempt_to_acquire_lock(s3_conn, lock_uri, sync_wait_time, job_name, mins_to_expiration=None): """Returns True if this session successfully took ownership of the lock specified by ``lock_uri``. """ key = _lock_acquire_step_1(s3_conn, lock_uri, job_name, mins_to_expiration) if key is not None: time.sleep(sync_wait_time) success = _lock_acquire_step_2(key, job_name) if success: return True return False class LogFetchError(Exception): pass class EMRJobRunner(MRJobRunner): """Runs an :py:class:`~mrjob.job.MRJob` on Amazon Elastic MapReduce. :py:class:`EMRJobRunner` runs your job in an EMR job flow, which is basically a temporary Hadoop cluster. Normally, it creates a job flow just for your job; it's also possible to run your job in a specific job flow by setting *emr_job_flow_id* or to automatically choose a waiting job flow, creating one if none exists, by setting *pool_emr_job_flows*. Input, support, and jar files can be either local or on S3; use ``s3://...`` URLs to refer to files on S3. 
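You normally won't construct this class yourself; an :py:class:`~mrjob.job.MRJob` script selects it from the command line with ``-r emr``. A typical invocation looks something like this (the script name and input/output paths here are placeholders)::

    python your_mr_job_sub_class.py -r emr < input > output
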
This class has some useful utilities for talking directly to S3 and EMR, so you may find it useful to instantiate it without a script:: from mrjob.emr import EMRJobRunner emr_conn = EMRJobRunner().make_emr_conn() job_flows = emr_conn.describe_jobflows() ... See also: :py:meth:`~EMRJobRunner.__init__`. """ alias = 'emr' def __init__(self, **kwargs): """:py:class:`~mrjob.emr.EMRJobRunner` takes the same arguments as :py:class:`~mrjob.runner.MRJobRunner`, plus some additional options which can be defaulted in :ref:`mrjob.conf `. *aws_access_key_id* and *aws_secret_access_key* are required if you haven't set them up already for boto (e.g. by setting the environment variables :envvar:`AWS_ACCESS_KEY_ID` and :envvar:`AWS_SECRET_ACCESS_KEY`) Additional options: :type additional_emr_info: JSON str, None, or JSON-encodable object :param additional_emr_info: Special parameters to select additional features, mostly to support beta EMR features. Pass a JSON string on the command line or use data structures in the config file (which is itself basically JSON). :type ami_version: str :param ami_version: EMR AMI version to use. This controls which Hadoop version(s) are available and which version of Python is installed, among other things; see \ http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuideindex.html?EnvironmentConfig_AMIVersion.html for details. Implicitly defaults to AMI version 1.0 (this will change to 2.0 in mrjob v0.4). :type aws_access_key_id: str :param aws_access_key_id: "username" for Amazon web services. :type aws_availability_zone: str :param aws_availability_zone: availability zone to run the job in :type aws_secret_access_key: str :param aws_secret_access_key: your "password" on AWS :type aws_region: str :param aws_region: region to connect to S3 and EMR on (e.g. ``us-west-1``). If you want to use separate regions for S3 and EMR, set *emr_endpoint* and *s3_endpoint*. :type bootstrap_actions: list of str :param bootstrap_actions: a list of raw bootstrap actions (essentially scripts) to run prior to any of the other bootstrap steps. Any arguments should be separated from the command by spaces (we use :py:func:`shlex.split`). If the action is on the local filesystem, we'll automatically upload it to S3. :type bootstrap_cmds: list :param bootstrap_cmds: a list of commands to run on the master node to set up libraries, etc. Like *setup_cmds*, these can be strings, which will be run in the shell, or lists of args, which will be run directly. Prepend ``sudo`` to commands to do things that require root privileges. :type bootstrap_files: list of str :param bootstrap_files: files to download to the bootstrap working directory on the master node before running *bootstrap_cmds* (for example, Debian packages). May be local files for mrjob to upload to S3, or any URI that ``hadoop fs`` can handle. :type bootstrap_mrjob: boolean :param bootstrap_mrjob: This is actually an option in the base :py:class:`~mrjob.job.MRJobRunner` class. If this is ``True`` (the default), we'll tar up :py:mod:`mrjob` from the local filesystem, and install it on the master node. :type bootstrap_python_packages: list of str :param bootstrap_python_packages: paths of python modules to install on EMR. These should be standard Python module tarballs. If a module is named ``foo.tar.gz``, we expect to be able to run ``tar xfz foo.tar.gz; cd foo; sudo python setup.py install``. 
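For illustration (the tarball name is hypothetical), such a package could be passed straight to the runner, or listed under ``bootstrap_python_packages`` in mrjob.conf::

    EMRJobRunner(bootstrap_python_packages=['my_pkg.tar.gz'])
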
:type bootstrap_scripts: list of str :param bootstrap_scripts: scripts to upload and then run on the master node (a combination of *bootstrap_cmds* and *bootstrap_files*). These are run after the commands from *bootstrap_cmds*. :type check_emr_status_every: float :param check_emr_status_every: How often to check on the status of EMR jobs. Default is 30 seconds (check more often than this and AWS will throttle you anyway). :type ec2_instance_type: str :param ec2_instance_type: What sort of EC2 instance(s) to use on the nodes that actually run tasks (see http://aws.amazon.com/ec2/instance-types/). When you run multiple instances (see *num_ec2_instances*), the master node is just coordinating the other nodes, so usually the default instance type (``m1.small``) is fine, and using larger instances is wasteful. :type ec2_key_pair: str :param ec2_key_pair: name of the SSH key you set up for EMR. :type ec2_key_pair_file: str :param ec2_key_pair_file: path to file containing the SSH key for EMR. :type ec2_core_instance_type: str :param ec2_core_instance_type: like *ec2_instance_type*, but only for the core (also known as "slave") Hadoop nodes; these nodes run tasks and host HDFS. Usually you just want to use *ec2_instance_type*. Defaults to ``'m1.small'``. :type ec2_core_instance_bid_price: str :param ec2_core_instance_bid_price: when specified and not "0", this creates the core Hadoop nodes as spot instances at this bid price. You usually only want to set bid price for task instances. :type ec2_master_instance_type: str :param ec2_master_instance_type: like *ec2_instance_type*, but only for the master Hadoop node. This node hosts the job tracker and HDFS, and runs tasks if there are no other nodes. Usually you just want to use *ec2_instance_type*. Defaults to ``'m1.small'``. :type ec2_master_instance_bid_price: str :param ec2_master_instance_bid_price: when specified and not "0", this creates the master Hadoop node as a spot instance at this bid price. You usually only want to set bid price for task instances unless the master instance is your only instance. :type ec2_slave_instance_type: str :param ec2_slave_instance_type: An alias for *ec2_core_instance_type*, for consistency with the EMR API. :type ec2_task_instance_type: str :param ec2_task_instance_type: like *ec2_instance_type*, but only for the task Hadoop nodes; these nodes run tasks but do not host HDFS. Usually you just want to use *ec2_instance_type*. Defaults to the same instance type as *ec2_core_instance_type*. :type ec2_task_instance_bid_price: str :param ec2_task_instance_bid_price: when specified and not "0", this creates the task Hadoop nodes as spot instances at this bid price. (You usually only want to set bid price for task instances.) :type emr_endpoint: str :param emr_endpoint: optional host to connect to when communicating with EMR (e.g. ``us-west-1.elasticmapreduce.amazonaws.com``). Default is to infer this from *aws_region*. :type emr_job_flow_id: str :param emr_job_flow_id: the ID of a persistent EMR job flow to run jobs in (normally we launch our own job flow). It's fine for other jobs to be using the job flow; we give our job's steps a unique ID. :type emr_job_flow_pool_name: str :param emr_job_flow_pool_name: Specify a pool name to join. Set to ``'default'`` if not specified. Does not imply ``pool_emr_job_flows``. :type enable_emr_debugging: bool :param enable_emr_debugging: store Hadoop logs in SimpleDB :type hadoop_streaming_jar: str :param hadoop_streaming_jar: This is actually an option in the base :py:class:`~mrjob.runner.MRJobRunner` class.
Points to a custom hadoop streaming jar on the local filesystem or S3. If you want to point to a streaming jar already installed on the EMR instances (perhaps through a bootstrap action?), use *hadoop_streaming_jar_on_emr*. :type hadoop_streaming_jar_on_emr: str :param hadoop_streaming_jar_on_emr: Like *hadoop_streaming_jar*, except that it points to a path on the EMR instance, rather than to a local file or one on S3. Rarely necessary to set this by hand. :type hadoop_version: str :param hadoop_version: Set the version of Hadoop to use on EMR. Consider setting *ami_version* instead; only AMI version 1.0 supports multiple versions of Hadoop anyway. If *ami_version* is not set, we'll default to Hadoop 0.20 for backwards compatibility with :py:mod:`mrjob` v0.3.0. :type num_ec2_core_instances: int :param num_ec2_core_instances: Number of core (or "slave") instances to start up. These run your job and host HDFS. Incompatible with *num_ec2_instances*. This is in addition to the single master instance. :type num_ec2_instances: int :param num_ec2_instances: Total number of instances to start up; basically the number of core instances you want, plus 1 (there is always one master instance). Default is ``1``. Incompatible with *num_ec2_core_instances* and *num_ec2_task_instances*. :type num_ec2_task_instances: int :param num_ec2_task_instances: number of task instances to start up. These run your job but do not host HDFS. Incompatible with *num_ec2_instances*. If you use this, you must set *num_ec2_core_instances*; EMR does not allow you to run task instances without core instances (because there's nowhere to host HDFS). :type pool_emr_job_flows: bool :param pool_emr_job_flows: Try to run the job on a ``WAITING`` pooled job flow with the same bootstrap configuration. Prefer the one with the most compute units. Use S3 to "lock" the job flow and ensure that the job is not scheduled behind another job. If no suitable job flow is ``WAITING``, create a new pooled job flow. **WARNING**: do not run this without having :py:mod:`mrjob.tools.emr.terminate_idle_job_flows` in your crontab; job flows left idle can quickly become expensive! :type s3_endpoint: str :param s3_endpoint: Host to connect to when communicating with S3 (e.g. ``s3-us-west-1.amazonaws.com``). Default is to infer this from *aws_region*. :type s3_log_uri: str :param s3_log_uri: where on S3 to put logs, for example ``s3://yourbucket/logs/``. Logs for your job flow will go into a subdirectory, e.g. ``s3://yourbucket/logs/j-JOBFLOWID/``. Default is to append ``logs/`` to *s3_scratch_uri*. :type s3_scratch_uri: str :param s3_scratch_uri: S3 directory (URI ending in ``/``) to use as scratch space, e.g. ``s3://yourbucket/tmp/``. Default is ``tmp/`` in one of your existing buckets whose name starts with ``mrjob-`` (we create such a bucket if you don't have one). :type s3_sync_wait_time: float :param s3_sync_wait_time: How long to wait for S3 to reach eventual consistency. This is typically less than a second (zero in U.S. West) but the default is 5.0 to be safe. :type ssh_bin: str or list :param ssh_bin: path to the ssh binary; may include switches (e.g. ``'ssh -v'`` or ``['ssh', '-v']``). Defaults to :command:`ssh`. :type ssh_bind_ports: list of int :param ssh_bind_ports: a list of ports that are safe to listen on. Defaults to ports ``40001`` through ``40840``. :type ssh_tunnel_to_job_tracker: bool :param ssh_tunnel_to_job_tracker: If True, create an ssh tunnel to the job tracker and listen on a randomly chosen port.
This requires you to set *ec2_key_pair* and *ec2_key_pair_file*. See :ref:`ssh-tunneling` for detailed instructions. :type ssh_tunnel_is_open: bool :param ssh_tunnel_is_open: if True, any host can connect to the job tracker through the SSH tunnel you open. Mostly useful if your browser is running on a different machine from your job runner. """ super(EMRJobRunner, self).__init__(**kwargs) # make aws_region an instance variable; we might want to set it # based on the scratch bucket self._aws_region = self._opts['aws_region'] or '' # if we're going to create a bucket to use as temp space, we don't # want to actually create it until we run the job (Issue #50). # This variable helps us create the bucket as needed self._s3_temp_bucket_to_create = None self._fix_s3_scratch_and_log_uri_opts() self._fix_ec2_instance_opts() # pick a tmp dir based on the job name self._s3_tmp_uri = self._opts['s3_scratch_uri'] + self._job_name + '/' # pick/validate output dir if self._output_dir: self._output_dir = self._check_and_fix_s3_dir(self._output_dir) else: self._output_dir = self._s3_tmp_uri + 'output/' # add the bootstrap files to a list of files to upload self._bootstrap_actions = [] for action in self._opts['bootstrap_actions']: args = shlex.split(action) if not args: raise ValueError('bad bootstrap action: %r' % (action,)) # don't use _add_bootstrap_file() because this is a raw bootstrap # action, not part of mrjob's bootstrap utilities file_dict = self._add_file(args[0]) file_dict['args'] = args[1:] self._bootstrap_actions.append(file_dict) for path in self._opts['bootstrap_files']: self._add_bootstrap_file(path) self._bootstrap_scripts = [] for path in self._opts['bootstrap_scripts']: file_dict = self._add_bootstrap_file(path) self._bootstrap_scripts.append(file_dict) self._bootstrap_python_packages = [] for path in self._opts['bootstrap_python_packages']: name, path = self._split_path(path) if not path.endswith('.tar.gz'): raise ValueError( 'bootstrap_python_packages only accepts .tar.gz files!') file_dict = self._add_bootstrap_file(path) self._bootstrap_python_packages.append(file_dict) self._streaming_jar = None if self._opts.get('hadoop_streaming_jar'): self._streaming_jar = self._add_file_for_upload( self._opts['hadoop_streaming_jar']) if not (isinstance(self._opts['additional_emr_info'], basestring) or self._opts['additional_emr_info'] is None): self._opts['additional_emr_info'] = json.dumps( self._opts['additional_emr_info']) # if we're bootstrapping mrjob, keep track of the file_dict # for mrjob.tar.gz self._mrjob_tar_gz_file = None # where our own logs ended up (we'll find this out once we run the job) self._s3_job_log_uri = None # where to get input from. We'll fill this later. Once filled, # this must be a list (not some other sort of container) self._s3_input_uris = None # we'll create the script later self._master_bootstrap_script = None # the ID assigned by EMR to this job (might be None) self._emr_job_flow_id = self._opts['emr_job_flow_id'] # when did our particular task start? 
self._emr_job_start = None # ssh state self._ssh_proc = None self._gave_cant_ssh_warning = False self._ssh_key_name = None # cache for SSH address self._address = None self._ssh_slave_addrs = None # store the tracker URL for completion status self._tracker_url = None # turn off tracker progress until tunnel is up self._show_tracker_progress = False # default requested hadoop version if AMI version is not set if not (self._opts['ami_version'] or self._opts['hadoop_version']): self._opts['hadoop_version'] = '0.20' # init hadoop version cache self._inferred_hadoop_version = None @classmethod def _allowed_opts(cls): """A list of which keyword args we can pass to __init__()""" return super(EMRJobRunner, cls)._allowed_opts() + [ 'additional_emr_info', 'ami_version', 'aws_access_key_id', 'aws_availability_zone', 'aws_region', 'aws_secret_access_key', 'bootstrap_actions', 'bootstrap_cmds', 'bootstrap_files', 'bootstrap_python_packages', 'bootstrap_scripts', 'check_emr_status_every', 'ec2_core_instance_bid_price', 'ec2_core_instance_type', 'ec2_instance_type', 'ec2_key_pair', 'ec2_key_pair_file', 'ec2_master_instance_bid_price', 'ec2_master_instance_type', 'ec2_slave_instance_type', 'ec2_task_instance_bid_price', 'ec2_task_instance_type', 'emr_endpoint', 'emr_job_flow_id', 'emr_job_flow_pool_name', 'enable_emr_debugging', 'enable_emr_debugging', 'hadoop_streaming_jar_on_emr', 'hadoop_version', 'num_ec2_core_instances', 'num_ec2_instances', 'num_ec2_task_instances', 'pool_emr_job_flows', 's3_endpoint', 's3_log_uri', 's3_scratch_uri', 's3_sync_wait_time', 'ssh_bin', 'ssh_bind_ports', 'ssh_tunnel_is_open', 'ssh_tunnel_to_job_tracker', ] @classmethod def _default_opts(cls): """A dictionary giving the default value of options.""" return combine_dicts(super(EMRJobRunner, cls)._default_opts(), { 'check_emr_status_every': 30, 'ec2_core_instance_type': 'm1.small', 'ec2_master_instance_type': 'm1.small', 'emr_job_flow_pool_name': 'default', 'hadoop_version': None, # defaulted in __init__() 'hadoop_streaming_jar_on_emr': '/home/hadoop/contrib/streaming/hadoop-streaming.jar', 'num_ec2_core_instances': 0, 'num_ec2_instances': 1, 'num_ec2_task_instances': 0, 's3_sync_wait_time': 5.0, 'ssh_bin': ['ssh'], 'ssh_bind_ports': range(40001, 40841), 'ssh_tunnel_to_job_tracker': False, 'ssh_tunnel_is_open': False, }) @classmethod def _opts_combiners(cls): """Map from option name to a combine_*() function used to combine values for that option. This allows us to specify that some options are lists, or contain environment variables, or whatever.""" return combine_dicts(super(EMRJobRunner, cls)._opts_combiners(), { 'bootstrap_actions': combine_lists, 'bootstrap_cmds': combine_lists, 'bootstrap_files': combine_path_lists, 'bootstrap_python_packages': combine_path_lists, 'bootstrap_scripts': combine_path_lists, 'ec2_key_pair_file': combine_paths, 's3_log_uri': combine_paths, 's3_scratch_uri': combine_paths, 'ssh_bin': combine_cmds, }) def _fix_ec2_instance_opts(self): """If the *ec2_instance_type* option is set, override instance type for the nodes that actually run tasks (see Issue #66). Allow command-line arguments to override defaults and arguments in mrjob.conf (see Issue #311). Also, make sure that core and slave instance type are the same, total number of instances matches number of master, core, and task instances, and that bid prices of zero are converted to None. Helper for __init__. 
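As a rough illustration of the precedence rules (assuming the standard ``--ec2-instance-type`` and ``--num-ec2-instances`` switches, with hypothetical values): running with ``--ec2-instance-type c1.medium --num-ec2-instances 3`` should yield two ``c1.medium`` core nodes and one default ``m1.small`` master node, since the master only runs tasks when it is the only instance.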
""" # Make sure slave and core instance type have the same value # Within EMRJobRunner we only ever use ec2_core_instance_type, # but we want ec2_slave_instance_type to be correct in the # options dictionary. if (self._opts['ec2_slave_instance_type'] and (self._opt_priority['ec2_slave_instance_type'] > self._opt_priority['ec2_core_instance_type'])): self._opts['ec2_core_instance_type'] = ( self._opts['ec2_slave_instance_type']) else: self._opts['ec2_slave_instance_type'] = ( self._opts['ec2_core_instance_type']) # If task instance type is not set, use core instance type # (This is mostly so that we don't inadvertently join a pool # with task instance types with too little memory.) if not self._opts['ec2_task_instance_type']: self._opts['ec2_task_instance_type'] = ( self._opts['ec2_core_instance_type']) # Within EMRJobRunner, we use num_ec2_core_instances and # num_ec2_task_instances, not num_ec2_instances. (Number # of master instances is always 1.) if (self._opt_priority['num_ec2_instances'] > max(self._opt_priority['num_ec2_core_instances'], self._opt_priority['num_ec2_task_instances'])): # assume 1 master, n - 1 core, 0 task self._opts['num_ec2_core_instances'] = ( self._opts['num_ec2_instances'] - 1) self._opts['num_ec2_task_instances'] = 0 else: # issue a warning if we used both kinds of instance number # options on the command line or in mrjob.conf if (self._opt_priority['num_ec2_instances'] >= 2 and self._opt_priority['num_ec2_instances'] <= max(self._opt_priority['num_ec2_core_instances'], self._opt_priority['num_ec2_task_instances'])): log.warn('Mixing num_ec2_instances and' ' num_ec2_{core,task}_instances does not make sense;' ' ignoring num_ec2_instances') # recalculate number of EC2 instances self._opts['num_ec2_instances'] = ( 1 + self._opts['num_ec2_core_instances'] + self._opts['num_ec2_task_instances']) # Allow ec2 instance type to override other instance types ec2_instance_type = self._opts['ec2_instance_type'] if ec2_instance_type: # core (slave) instances if (self._opt_priority['ec2_instance_type'] > max(self._opt_priority['ec2_core_instance_type'], self._opt_priority['ec2_slave_instance_type'])): self._opts['ec2_core_instance_type'] = ec2_instance_type self._opts['ec2_slave_instance_type'] = ec2_instance_type # master instance only does work when it's the only instance if (self._opts['num_ec2_core_instances'] <= 0 and self._opts['num_ec2_task_instances'] <= 0 and (self._opt_priority['ec2_instance_type'] > self._opt_priority['ec2_master_instance_type'])): self._opts['ec2_master_instance_type'] = ec2_instance_type # task instances if (self._opt_priority['ec2_instance_type'] > self._opt_priority['ec2_task_instance_type']): self._opts['ec2_task_instance_type'] = ec2_instance_type # convert a bid price of '0' to None for role in ('core', 'master', 'task'): opt_name = 'ec2_%s_instance_bid_price' % role if not self._opts[opt_name]: self._opts[opt_name] = None else: # convert "0", "0.00" etc. to None try: value = float(self._opts[opt_name]) if value == 0: self._opts[opt_name] = None except ValueError: pass # maybe EMR will accept non-floats? def _fix_s3_scratch_and_log_uri_opts(self): """Fill in s3_scratch_uri and s3_log_uri (in self._opts) if they aren't already set. Helper for __init__. 
""" s3_conn = self.make_s3_conn() # check s3_scratch_uri against aws_region if specified if self._opts['s3_scratch_uri']: bucket_name, _ = parse_s3_uri(self._opts['s3_scratch_uri']) bucket_loc = s3_conn.get_bucket(bucket_name).get_location() # make sure they can communicate if both specified if (self._aws_region and bucket_loc and self._aws_region != bucket_loc): log.warning('warning: aws_region (%s) does not match bucket' ' region (%s). Your EC2 instances may not be able' ' to reach your S3 buckets.' % (self._aws_region, bucket_loc)) # otherwise derive aws_region from bucket_loc elif bucket_loc and not self._aws_region: log.info( "inferring aws_region from scratch bucket's region (%s)" % bucket_loc) self._aws_region = bucket_loc # set s3_scratch_uri by checking for existing buckets else: self._set_s3_scratch_uri(s3_conn) log.info('using %s as our scratch dir on S3' % self._opts['s3_scratch_uri']) self._opts['s3_scratch_uri'] = self._check_and_fix_s3_dir( self._opts['s3_scratch_uri']) # set s3_log_uri if self._opts['s3_log_uri']: self._opts['s3_log_uri'] = self._check_and_fix_s3_dir( self._opts['s3_log_uri']) else: self._opts['s3_log_uri'] = self._opts['s3_scratch_uri'] + 'logs/' def _set_s3_scratch_uri(self, s3_conn): """Helper for _fix_s3_scratch_and_log_uri_opts""" buckets = s3_conn.get_all_buckets() mrjob_buckets = [b for b in buckets if b.name.startswith('mrjob-')] # Loop over buckets until we find one that is not region- # restricted, matches aws_region, or can be used to # infer aws_region if no aws_region is specified for scratch_bucket in mrjob_buckets: scratch_bucket_name = scratch_bucket.name scratch_bucket_location = scratch_bucket.get_location() if scratch_bucket_location: if scratch_bucket_location == self._aws_region: # Regions are both specified and match log.info("using existing scratch bucket %s" % scratch_bucket_name) self._opts['s3_scratch_uri'] = ( 's3://%s/tmp/' % scratch_bucket_name) return elif not self._aws_region: # aws_region not specified, so set it based on this # bucket's location and use this bucket self._aws_region = scratch_bucket_location log.info("inferring aws_region from scratch bucket's" " region (%s)" % self._aws_region) self._opts['s3_scratch_uri'] = ( 's3://%s/tmp/' % scratch_bucket_name) return elif scratch_bucket_location != self._aws_region: continue elif not self._aws_region: # Only use regionless buckets if the job flow is regionless log.info("using existing scratch bucket %s" % scratch_bucket_name) self._opts['s3_scratch_uri'] = ( 's3://%s/tmp/' % scratch_bucket_name) return # That may have all failed. If so, pick a name. scratch_bucket_name = 'mrjob-%016x' % random.randint(0, 2 ** 64 - 1) self._s3_temp_bucket_to_create = scratch_bucket_name log.info("creating new scratch bucket %s" % scratch_bucket_name) self._opts['s3_scratch_uri'] = 's3://%s/tmp/' % scratch_bucket_name def _set_s3_job_log_uri(self, job_flow): """Given a job flow description, set self._s3_job_log_uri. This allows us to call self.ls(), etc. without running the job. 
""" log_uri = getattr(job_flow, 'loguri', '') if log_uri: self._s3_job_log_uri = '%s%s/' % ( log_uri.replace('s3n://', 's3://'), self._emr_job_flow_id) def _create_s3_temp_bucket_if_needed(self): """Make sure temp bucket exists""" if self._s3_temp_bucket_to_create: s3_conn = self.make_s3_conn() log.info('creating S3 bucket %r to use as scratch space' % self._s3_temp_bucket_to_create) location = REGION_TO_S3_LOCATION_CONSTRAINT.get( self._aws_region, self._aws_region) s3_conn.create_bucket(self._s3_temp_bucket_to_create, location=(location or '')) self._s3_temp_bucket_to_create = None def _check_and_fix_s3_dir(self, s3_uri): """Helper for __init__""" if not is_s3_uri(s3_uri): raise ValueError('Invalid S3 URI: %r' % s3_uri) if not s3_uri.endswith('/'): s3_uri = s3_uri + '/' return s3_uri def _run(self): self._prepare_for_launch() self._launch_emr_job() self._wait_for_job_to_complete() def _prepare_for_launch(self): self._setup_input() self._create_wrapper_script() self._create_master_bootstrap_script() self._upload_non_input_files() def _setup_input(self): """Copy local input files (if any) to a special directory on S3. Set self._s3_input_uris Helper for _run """ self._create_s3_temp_bucket_if_needed() # winnow out s3 files from local ones self._s3_input_uris = [] local_input_paths = [] for path in self._input_paths: if is_s3_uri(path): # Don't even bother running the job if the input isn't there, # since it's costly to spin up instances. if not self.path_exists(path): raise AssertionError( 'Input path %s does not exist!' % (path,)) self._s3_input_uris.append(path) else: local_input_paths.append(path) # copy local files into an input directory, with names like # 00000-actual_name.ext if local_input_paths: s3_input_dir = self._s3_tmp_uri + 'input/' log.info('Uploading input to %s' % s3_input_dir) s3_conn = self.make_s3_conn() for file_num, path in enumerate(local_input_paths): if path == '-': path = self._dump_stdin_to_local_file() target = '%s%05d-%s' % ( s3_input_dir, file_num, os.path.basename(path)) log.debug('uploading %s -> %s' % (path, target)) s3_key = self.make_s3_key(target, s3_conn) s3_key.set_contents_from_filename(path) self._s3_input_uris.append(s3_input_dir) def _add_bootstrap_file(self, path): name, path = self._split_path(path) file_dict = {'path': path, 'name': name, 'bootstrap': 'file'} self._files.append(file_dict) return file_dict def _pick_s3_uris_for_files(self): """Decide where each file will be uploaded on S3. Okay to call this multiple times. """ self._assign_unique_names_to_files( 's3_uri', prefix=self._s3_tmp_uri + 'files/', match=is_s3_uri) def _upload_non_input_files(self): """Copy files to S3 Pick S3 URIs for them if we haven't already.""" self._create_s3_temp_bucket_if_needed() self._pick_s3_uris_for_files() s3_files_dir = self._s3_tmp_uri + 'files/' log.info('Copying non-input files into %s' % s3_files_dir) s3_conn = self.make_s3_conn() for file_dict in self._files: path = file_dict['path'] # don't bother with files that are already on s3 if is_s3_uri(path): continue s3_uri = file_dict['s3_uri'] log.debug('uploading %s -> %s' % (path, s3_uri)) s3_key = self.make_s3_key(s3_uri, s3_conn) s3_key.set_contents_from_filename(file_dict['path']) def setup_ssh_tunnel_to_job_tracker(self, host): """setup the ssh tunnel to the job tracker, if it's not currently running. Args: host -- hostname of the EMR master node. 
""" REQUIRED_OPTS = ['ec2_key_pair', 'ec2_key_pair_file', 'ssh_bind_ports'] for opt_name in REQUIRED_OPTS: if not self._opts[opt_name]: if not self._gave_cant_ssh_warning: log.warning( "You must set %s in order to ssh to the job tracker!" % opt_name) self._gave_cant_ssh_warning = True return # if there was already a tunnel, make sure it's still up if self._ssh_proc: self._ssh_proc.poll() if self._ssh_proc.returncode is None: return else: log.warning('Oops, ssh subprocess exited with return code %d,' ' restarting...' % self._ssh_proc.returncode) self._ssh_proc = None log.info('Opening ssh tunnel to Hadoop job tracker') # if ssh detects that a host key has changed, it will silently not # open the tunnel, so make a fake empty known_hosts file and use that. # (you can actually use /dev/null as your known hosts file, but # that's UNIX-specific) fake_known_hosts_file = os.path.join( self._get_local_tmp_dir(), 'fake_ssh_known_hosts') # blank out the file, if it exists f = open(fake_known_hosts_file, 'w') f.close() log.debug('Created empty ssh known-hosts file: %s' % ( fake_known_hosts_file,)) bind_port = None for bind_port in self._pick_ssh_bind_ports(): args = self._opts['ssh_bin'] + [ '-o', 'VerifyHostKeyDNS=no', '-o', 'StrictHostKeyChecking=no', '-o', 'ExitOnForwardFailure=yes', '-o', 'UserKnownHostsFile=%s' % fake_known_hosts_file, '-L', '%d:localhost:%d' % (bind_port, EMR_JOB_TRACKER_PORT), '-N', '-q', # no shell, no output '-i', self._opts['ec2_key_pair_file'], ] if self._opts['ssh_tunnel_is_open']: args.extend(['-g', '-4']) # -4: listen on IPv4 only args.append('hadoop@' + host) log.debug('> %s' % cmd_line(args)) ssh_proc = Popen(args, stdin=PIPE, stdout=PIPE, stderr=PIPE) time.sleep(WAIT_FOR_SSH_TO_FAIL) ssh_proc.poll() # still running. We are golden if ssh_proc.returncode is None: self._ssh_proc = ssh_proc break if not self._ssh_proc: log.warning('Failed to open ssh tunnel to job tracker') else: if self._opts['ssh_tunnel_is_open']: bind_host = socket.getfqdn() else: bind_host = 'localhost' self._tracker_url = 'http://%s:%d%s' % ( bind_host, bind_port, EMR_JOB_TRACKER_PATH) self._show_tracker_progress = True log.info('Connect to job tracker at: %s' % self._tracker_url) def _pick_ssh_bind_ports(self): """Pick a list of ports to try binding our SSH tunnel to. 
We will try to bind the same port for any given job flow (Issue #67) """ # don't perturb the random number generator random_state = random.getstate() try: # seed random port selection on job flow ID random.seed(self._emr_job_flow_id) num_picks = min(MAX_SSH_RETRIES, len(self._opts['ssh_bind_ports'])) return random.sample(self._opts['ssh_bind_ports'], num_picks) finally: random.setstate(random_state) def _enable_slave_ssh_access(self): if not self._ssh_key_name: self._ssh_key_name = self._job_name + '.pem' ssh_copy_key( self._opts['ssh_bin'], self._address_of_master(), self._opts['ec2_key_pair_file'], self._ssh_key_name) ### Running the job ### def cleanup(self, mode=None): super(EMRJobRunner, self).cleanup(mode=mode) # always stop our SSH tunnel if it's still running if self._ssh_proc: self._ssh_proc.poll() if self._ssh_proc.returncode is None: log.info('Killing our SSH tunnel (pid %d)' % self._ssh_proc.pid) try: os.kill(self._ssh_proc.pid, signal.SIGKILL) self._ssh_proc = None except Exception, e: log.exception(e) # stop the job flow if it belongs to us (it may have stopped on its # own already, but that's fine) # don't stop it if it was created due to --pool because the user # probably wants to use it again if self._emr_job_flow_id and not self._opts['emr_job_flow_id'] \ and not self._opts['pool_emr_job_flows']: log.info('Terminating job flow: %s' % self._emr_job_flow_id) try: self.make_emr_conn().terminate_jobflow(self._emr_job_flow_id) except Exception, e: log.exception(e) def _cleanup_remote_scratch(self): # delete all the files we created if self._s3_tmp_uri: try: log.info('Removing all files in %s' % self._s3_tmp_uri) self.rm(self._s3_tmp_uri) self._s3_tmp_uri = None except Exception, e: log.exception(e) def _cleanup_logs(self): super(EMRJobRunner, self)._cleanup_logs() # delete the log files, if it's a job flow we created (the logs # belong to the job flow) if self._s3_job_log_uri and not self._opts['emr_job_flow_id'] \ and not self._opts['pool_emr_job_flows']: try: log.info('Removing all files in %s' % self._s3_job_log_uri) self.rm(self._s3_job_log_uri) self._s3_job_log_uri = None except Exception, e: log.exception(e) def _wait_for_s3_eventual_consistency(self): """Sleep for a little while, to give S3 a chance to sync up. """ log.info('Waiting %.1fs for S3 eventual consistency' % self._opts['s3_sync_wait_time']) time.sleep(self._opts['s3_sync_wait_time']) def _wait_for_job_flow_termination(self): try: jobflow = self._describe_jobflow() except boto.exception.S3ResponseError: # mockboto throws this for some reason return if (jobflow.keepjobflowalivewhennosteps == 'true' and jobflow.state == 'WAITING'): raise Exception('Operation requires job flow to terminate, but' ' it may never do so.') while jobflow.state not in ('TERMINATED', 'COMPLETED', 'FAILED', 'SHUTTING_DOWN'): msg = 'Waiting for job flow to terminate (currently %s)' % \ jobflow.state log.info(msg) time.sleep(self._opts['check_emr_status_every']) jobflow = self._describe_jobflow() def _create_instance_group(self, role, instance_type, count, bid_price): """Helper method for creating instance groups. For use when creating a jobflow using a list of InstanceGroups, instead of the typical triumverate of num_instances/master_instance_type/slave_instance_type. - Role is either 'master', 'core', or 'task'. - instance_type is an EC2 instance type - count is an int - bid_price is a number, a string, or None. If None, this instance group will be use the ON-DEMAND market instead of the SPOT market. 
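For illustration (the bid price is hypothetical): ``self._create_instance_group('CORE', 'm1.small', 2, '0.08')`` returns an InstanceGroup of two ``m1.small`` core nodes named ``core`` on the SPOT market with a $0.08 bid, while passing ``None`` for the bid price puts the group on the ON_DEMAND market.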
""" if not instance_type: if self._opts['ec2_instance_type']: instance_type = self._opts['ec2_instance_type'] else: raise ValueError('Missing instance type for %s node(s)' % role) if bid_price: market = 'SPOT' bid_price = str(bid_price) # must be a string else: market = 'ON_DEMAND' bid_price = None # Just name the groups "master", "task", and "core" name = role.lower() return boto_2_1_1_83aae37b.InstanceGroup( count, role, instance_type, market, name, bidprice=bid_price ) def _create_job_flow(self, persistent=False, steps=None): """Create an empty job flow on EMR, and return the ID of that job. persistent -- if this is true, create the job flow with the --alive option, indicating the job will have to be manually terminated. """ # make sure we can see the files we copied to S3 self._wait_for_s3_eventual_consistency() # figure out local names and S3 URIs for our bootstrap actions, if any self._name_files() self._pick_s3_uris_for_files() log.info('Creating Elastic MapReduce job flow') args = self._job_flow_args(persistent, steps) emr_conn = self.make_emr_conn() log.debug('Calling run_jobflow(%r, %r, %s)' % ( self._job_name, self._opts['s3_log_uri'], ', '.join('%s=%r' % (k, v) for k, v in args.iteritems()))) emr_job_flow_id = emr_conn.run_jobflow( self._job_name, self._opts['s3_log_uri'], **args) # keep track of when we started our job self._emr_job_start = time.time() log.info('Job flow created with ID: %s' % emr_job_flow_id) return emr_job_flow_id def _job_flow_args(self, persistent=False, steps=None): """Build kwargs for emr_conn.run_jobflow()""" args = {} args['ami_version'] = self._opts['ami_version'] args['hadoop_version'] = self._opts['hadoop_version'] if self._opts['aws_availability_zone']: args['availability_zone'] = self._opts['aws_availability_zone'] # The old, simple API, available if we're not using task instances # or bid prices if not (self._opts['num_ec2_task_instances'] or self._opts['ec2_core_instance_bid_price'] or self._opts['ec2_master_instance_bid_price'] or self._opts['ec2_task_instance_bid_price']): args['num_instances'] = self._opts['num_ec2_core_instances'] + 1 args['master_instance_type'] = ( self._opts['ec2_master_instance_type']) args['slave_instance_type'] = self._opts['ec2_core_instance_type'] else: # Create a list of InstanceGroups args['instance_groups'] = [ self._create_instance_group( 'MASTER', self._opts['ec2_master_instance_type'], 1, self._opts['ec2_master_instance_bid_price'] ), ] if self._opts['num_ec2_core_instances']: args['instance_groups'].append( self._create_instance_group( 'CORE', self._opts['ec2_core_instance_type'], self._opts['num_ec2_core_instances'], self._opts['ec2_core_instance_bid_price'] ) ) if self._opts['num_ec2_task_instances']: args['instance_groups'].append( self._create_instance_group( 'TASK', self._opts['ec2_task_instance_type'], self._opts['num_ec2_task_instances'], self._opts['ec2_task_instance_bid_price'] ) ) # bootstrap actions bootstrap_action_args = [] for file_dict in self._bootstrap_actions: # file_dict is not populated the same way by tools and real job # runs, so use s3_uri or path as appropriate s3_uri = file_dict.get('s3_uri', None) or file_dict['path'] bootstrap_action_args.append( boto.emr.BootstrapAction( file_dict['name'], s3_uri, file_dict['args'])) if self._master_bootstrap_script: master_bootstrap_script_args = [] if self._opts['pool_emr_job_flows']: master_bootstrap_script_args = [ 'pool-' + self._pool_hash(), self._opts['emr_job_flow_pool_name'], ] bootstrap_action_args.append( boto.emr.BootstrapAction( 
'master', self._master_bootstrap_script['s3_uri'], master_bootstrap_script_args)) if bootstrap_action_args: args['bootstrap_actions'] = bootstrap_action_args if self._opts['ec2_key_pair']: args['ec2_keyname'] = self._opts['ec2_key_pair'] if self._opts['enable_emr_debugging']: args['enable_debugging'] = True if self._opts['additional_emr_info']: args['additional_info'] = self._opts['additional_emr_info'] if persistent or self._opts['pool_emr_job_flows']: args['keep_alive'] = True if steps: args['steps'] = steps return args def _build_steps(self): """Return a list of boto Step objects corresponding to the steps we want to run.""" assert self._script # can't build steps if no script! # figure out local names for our files self._name_files() self._pick_s3_uris_for_files() # we're going to instruct EMR to upload the MR script and the # wrapper script (if any) to the job's local directory self._script['upload'] = 'file' if self._wrapper_script: self._wrapper_script['upload'] = 'file' # quick, add the other steps before the job spins up and # then shuts itself down (in practice this takes several minutes) steps = self._get_steps() step_list = [] version = self.get_hadoop_version() for step_num, step in enumerate(steps): # EMR-specific stuff name = '%s: Step %d of %d' % ( self._job_name, step_num + 1, len(steps)) # don't terminate other people's job flows if (self._opts['emr_job_flow_id'] or self._opts['pool_emr_job_flows']): action_on_failure = 'CANCEL_AND_WAIT' else: action_on_failure = 'TERMINATE_JOB_FLOW' # Hadoop streaming stuff if 'M' not in step: # if we have an identity mapper mapper = 'cat' else: mapper = cmd_line(self._mapper_args(step_num)) if 'C' in step: combiner = cmd_line(self._combiner_args(step_num)) else: combiner = None if 'R' in step: # i.e. if there is a reducer: reducer = cmd_line(self._reducer_args(step_num)) else: reducer = None input = self._s3_step_input_uris(step_num) output = self._s3_step_output_uri(step_num) step_args, cache_files, cache_archives = self._cache_args() step_args.extend(self._hadoop_conf_args(step_num, len(steps))) jar = self._get_jar() if combiner is not None: if compat.supports_combiners_in_hadoop_streaming(version): step_args.extend(['-combiner', combiner]) else: mapper = "bash -c '%s | sort | %s'" % (mapper, combiner) streaming_step = boto.emr.StreamingStep( name=name, mapper=mapper, reducer=reducer, action_on_failure=action_on_failure, cache_files=cache_files, cache_archives=cache_archives, step_args=step_args, input=input, output=output, jar=jar) step_list.append(streaming_step) return step_list def _cache_args(self): """Returns ``(step_args, cache_files, cache_archives)``, populating each according to the correct behavior for the current Hadoop version. For < 0.20, populate cache_files and cache_archives. For >= 0.20, populate step_args. step_args should be inserted into the step arguments before anything else. cache_files and cache_archives should be passed as arguments to StreamingStep.
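For illustration (the URI and name are hypothetical): a file dict with ``s3_uri='s3://yourbucket/files/mr_your_job.py'`` and ``name='mr_your_job.py'`` is rendered as the step arguments ``-files s3://yourbucket/files/mr_your_job.py#mr_your_job.py`` on newer Hadoop versions, and as the cache_files entry ``'s3://yourbucket/files/mr_your_job.py#mr_your_job.py'`` on older ones.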
""" version = self.get_hadoop_version() step_args = [] cache_files = [] cache_archives = [] if compat.supports_new_distributed_cache_options(version): # boto doesn't support non-deprecated 0.20 options, so insert # them ourselves def escaped_paths(file_dicts): # return list of strings to join with commas and pass to the # hadoop binary return ["%s#%s" % (fd['s3_uri'], fd['name']) for fd in file_dicts] # index by type all_files = {} for fd in self._files: all_files.setdefault(fd.get('upload'), []).append(fd) if 'file' in all_files: step_args.append('-files') step_args.append(','.join(escaped_paths(all_files['file']))) if 'archive' in all_files: step_args.append('-archives') step_args.append(','.join(escaped_paths(all_files['archive']))) else: for file_dict in self._files: if file_dict.get('upload') == 'file': cache_files.append( '%s#%s' % (file_dict['s3_uri'], file_dict['name'])) elif file_dict.get('upload') == 'archive': cache_archives.append( '%s#%s' % (file_dict['s3_uri'], file_dict['name'])) return step_args, cache_files, cache_archives def _get_jar(self): self._name_files() self._pick_s3_uris_for_files() if self._streaming_jar: return self._streaming_jar['s3_uri'] else: return self._opts['hadoop_streaming_jar_on_emr'] def _launch_emr_job(self): """Create an empty jobflow on EMR, and set self._emr_job_flow_id to the ID for that job.""" self._create_s3_temp_bucket_if_needed() # define out steps steps = self._build_steps() # try to find a job flow from the pool. basically auto-fill # 'emr_job_flow_id' if possible and then follow normal behavior. if self._opts['pool_emr_job_flows']: job_flow = self.find_job_flow(num_steps=len(steps)) if job_flow: self._emr_job_flow_id = job_flow.jobflowid # create a job flow if we're not already using an existing one if not self._emr_job_flow_id: self._emr_job_flow_id = self._create_job_flow( persistent=False, steps=steps) else: emr_conn = self.make_emr_conn() log.info('Adding our job to job flow %s' % self._emr_job_flow_id) log.debug('Calling add_jobflow_steps(%r, %r)' % ( self._emr_job_flow_id, steps)) emr_conn.add_jobflow_steps(self._emr_job_flow_id, steps) # keep track of when we launched our job self._emr_job_start = time.time() def _wait_for_job_to_complete(self): """Wait for the job to complete, and raise an exception if the job failed. Also grab log URI from the job status (since we may not know it) """ success = False while True: # don't antagonize EMR's throttling log.debug('Waiting %.1f seconds...' % self._opts['check_emr_status_every']) time.sleep(self._opts['check_emr_status_every']) job_flow = self._describe_jobflow() self._set_s3_job_log_uri(job_flow) job_state = job_flow.state reason = getattr(job_flow, 'laststatechangereason', '') # find all steps belonging to us, and get their state step_states = [] running_step_name = '' total_step_time = 0.0 step_nums = [] # step numbers belonging to us. 1-indexed steps = job_flow.steps or [] for i, step in enumerate(steps): # ignore steps belonging to other jobs if not step.name.startswith(self._job_name): continue step_nums.append(i + 1) step.state = step.state step_states.append(step.state) if step.state == 'RUNNING': running_step_name = step.name if (hasattr(step, 'startdatetime') and hasattr(step, 'enddatetime')): start_time = iso8601_to_timestamp(step.startdatetime) end_time = iso8601_to_timestamp(step.enddatetime) total_step_time += end_time - start_time if not step_states: raise AssertionError("Can't find our steps in the job flow!") # if all our steps have completed, we're done! 
if all(state == 'COMPLETED' for state in step_states): success = True break # if any step fails, give up if any(state in ('FAILED', 'CANCELLED') for state in step_states): break # (the other step states are PENDING and RUNNING) # keep track of how long we've been waiting running_time = time.time() - self._emr_job_start # otherwise, we can print a status message if running_step_name: log.info('Job launched %.1fs ago, status %s: %s (%s)' % (running_time, job_state, reason, running_step_name)) if self._show_tracker_progress: try: tracker_handle = urllib2.urlopen(self._tracker_url) tracker_page = ''.join(tracker_handle.readlines()) tracker_handle.close() # first two formatted percentages, map then reduce map_complete, reduce_complete = [float(complete) for complete in JOB_TRACKER_RE.findall( tracker_page)[:2]] log.info(' map %3d%% reduce %3d%%' % ( map_complete, reduce_complete)) except: log.error('Unable to load progress from job tracker') # turn off progress for rest of job self._show_tracker_progress = False # once a step is running, it's safe to set up the ssh tunnel to # the job tracker job_host = getattr(job_flow, 'masterpublicdnsname', None) if job_host and self._opts['ssh_tunnel_to_job_tracker']: self.setup_ssh_tunnel_to_job_tracker(job_host) # other states include STARTING and SHUTTING_DOWN elif reason: log.info('Job launched %.1fs ago, status %s: %s' % (running_time, job_state, reason)) else: log.info('Job launched %.1fs ago, status %s' % (running_time, job_state,)) if success: log.info('Job completed.') log.info('Running time was %.1fs (not counting time spent waiting' ' for the EC2 instances)' % total_step_time) self._fetch_counters(step_nums) self.print_counters(range(1, len(step_nums) + 1)) else: msg = 'Job failed with status %s: %s' % (job_state, reason) log.error(msg) if self._s3_job_log_uri: log.info('Logs are in %s' % self._s3_job_log_uri) # look for a Python traceback cause = self._find_probable_cause_of_failure(step_nums) if cause: # log cause, and put it in exception cause_msg = [] # lines to log and put in exception cause_msg.append('Probable cause of failure (from %s):' % cause['log_file_uri']) cause_msg.extend(line.strip('\n') for line in cause['lines']) if cause['input_uri']: cause_msg.append('(while reading from %s)' % cause['input_uri']) for line in cause_msg: log.error(line) # add cause_msg to exception message msg += '\n' + '\n'.join(cause_msg) + '\n' raise Exception(msg) def _script_args(self): """How to invoke the script inside EMR""" # We can invoke the script by its S3 URL, but we don't really # gain anything from that, and EMR is touchy about distinguishing # python scripts from shell scripts assert self._script # shouldn't call _script_args() if no script args = self._opts['python_bin'] + [self._script['name']] if self._wrapper_script: args = (self._opts['python_bin'] + [self._wrapper_script['name']] + args) return args def _mapper_args(self, step_num): return (self._script_args() + ['--step-num=%d' % step_num, '--mapper'] + self._mr_job_extra_args()) def _reducer_args(self, step_num): return (self._script_args() + ['--step-num=%d' % step_num, '--reducer'] + self._mr_job_extra_args()) def _combiner_args(self, step_num): return (self._script_args() + ['--step-num=%d' % step_num, '--combiner'] + self._mr_job_extra_args()) def _upload_args(self): """Args to upload files from S3 to the local nodes that EMR runs on.""" args = [] for file_dict in self._files: if file_dict.get('upload') == 'file': args.append('--cache') args.append('%s#%s' % (file_dict['s3_uri'], 
file_dict['name'])) elif file_dict.get('upload') == 'archive': args.append('--cache-archive') args.append('%s#%s' % (file_dict['s3_uri'], file_dict['name'])) return args def _s3_step_input_uris(self, step_num): """Get the s3:// URIs for input for the given step.""" if step_num == 0: return self._s3_input_uris else: # put intermediate data in HDFS return ['hdfs:///tmp/mrjob/%s/step-output/%s/' % ( self._job_name, step_num)] def _s3_step_output_uri(self, step_num): if step_num == len(self._get_steps()) - 1: return self._output_dir else: # put intermediate data in HDFS return 'hdfs:///tmp/mrjob/%s/step-output/%s/' % ( self._job_name, step_num + 1) ### LOG FETCHING/PARSING ### def _enforce_path_regexp(self, paths, regexp, step_nums=None): """Helper for log fetching functions to filter out unwanted logs. Only pass ``step_nums`` if ``regexp`` has a ``step_nums`` group. """ for path in paths: m = regexp.match(path) if (m and (step_nums is None or int(m.group('step_num')) in step_nums)): yield path else: log.debug('Ignore %s' % path) ## SSH LOG FETCHING def _ls_ssh_logs(self, relative_path): """List logs over SSH by path relative to log root directory""" full_path = SSH_PREFIX + SSH_LOG_ROOT + '/' + relative_path log.debug('Search %s for logs' % full_path) return self.ls(full_path) def _ls_slave_ssh_logs(self, addr, relative_path): """List logs over multi-hop SSH by path relative to log root directory """ root_path = '%s%s!%s%s' % (SSH_PREFIX, self._address_of_master(), addr, SSH_LOG_ROOT + '/' + relative_path) log.debug('Search %s for logs' % root_path) return self.ls(root_path) def ls_task_attempt_logs_ssh(self, step_nums): all_paths = [] try: all_paths.extend(self._ls_ssh_logs('userlogs/')) except IOError: # sometimes the master doesn't have these pass if not all_paths: # get them from the slaves instead (takes a little longer) try: for addr in self._addresses_of_slaves(): logs = self._ls_slave_ssh_logs(addr, 'userlogs/') all_paths.extend(logs) except IOError: # sometimes the slaves don't have them either pass return self._enforce_path_regexp(all_paths, TASK_ATTEMPTS_LOG_URI_RE, step_nums) def ls_step_logs_ssh(self, step_nums): return self._enforce_path_regexp(self._ls_ssh_logs('steps/'), STEP_LOG_URI_RE, step_nums) def ls_job_logs_ssh(self, step_nums): return self._enforce_path_regexp(self._ls_ssh_logs('history/'), EMR_JOB_LOG_URI_RE, step_nums) def ls_node_logs_ssh(self): all_paths = [] for addr in self._addresses_of_slaves(): logs = self._ls_slave_ssh_logs(addr, '') all_paths.extend(logs) return self._enforce_path_regexp(all_paths, NODE_LOG_URI_RE) def ls_all_logs_ssh(self): """List all log files in the log root directory""" return self.ls(SSH_PREFIX + SSH_LOG_ROOT) ## S3 LOG FETCHING ## def _ls_s3_logs(self, relative_path): """List logs over S3 by path relative to log root directory""" if not self._s3_job_log_uri: self._set_s3_job_log_uri(self._describe_jobflow()) if not self._s3_job_log_uri: raise LogFetchError('Could not determine S3 job log URI') full_path = self._s3_job_log_uri + relative_path log.debug('Search %s for logs' % full_path) return self.ls(full_path) def ls_task_attempt_logs_s3(self, step_nums): return self._enforce_path_regexp(self._ls_s3_logs('task-attempts/'), TASK_ATTEMPTS_LOG_URI_RE, step_nums) def ls_step_logs_s3(self, step_nums): return self._enforce_path_regexp(self._ls_s3_logs('steps/'), STEP_LOG_URI_RE, step_nums) def ls_job_logs_s3(self, step_nums): return self._enforce_path_regexp(self._ls_s3_logs('jobs/'), EMR_JOB_LOG_URI_RE, step_nums) def 
ls_node_logs_s3(self): return self._enforce_path_regexp(self._ls_s3_logs('node/'), NODE_LOG_URI_RE) def ls_all_logs_s3(self): """List all log files in the S3 log root directory""" if not self._s3_job_log_uri: self._set_s3_job_log_uri(self._describe_jobflow()) return self.ls(self._s3_job_log_uri) ## LOG PARSING ## def _fetch_counters(self, step_nums, skip_s3_wait=False): """Read Hadoop counters from S3. Args: step_nums -- the steps belonging to us, so that we can ignore counters from other jobs when sharing a job flow """ self._counters = [] new_counters = {} if self._opts['ec2_key_pair_file']: try: new_counters = self._fetch_counters_ssh(step_nums) except LogFetchError: new_counters = self._fetch_counters_s3(step_nums, skip_s3_wait) except IOError: # Can get 'file not found' if test suite was lazy or Hadoop # logs moved. We shouldn't crash in either case. new_counters = self._fetch_counters_s3(step_nums, skip_s3_wait) else: log.info('ec2_key_pair_file not specified, going to S3') new_counters = self._fetch_counters_s3(step_nums, skip_s3_wait) # step_nums is relative to the start of the job flow # we only want them relative to the job for step_num in step_nums: self._counters.append(new_counters.get(step_num, {})) def _fetch_counters_ssh(self, step_nums): uris = list(self.ls_job_logs_ssh(step_nums)) log.info('Fetching counters from SSH...') return scan_for_counters_in_files(uris, self, self.get_hadoop_version()) def _fetch_counters_s3(self, step_nums, skip_s3_wait=False): job_flow = self._describe_jobflow() if job_flow.keepjobflowalivewhennosteps == 'true': log.info("Can't fetch counters from S3 for five more minutes. Try" " 'python -m mrjob.tools.emr.fetch_logs --counters %s'" " in five minutes." % job_flow.jobflowid) return {} log.info('Fetching counters from S3...') if not skip_s3_wait: self._wait_for_s3_eventual_consistency() self._wait_for_job_flow_termination() try: uris = self.ls_job_logs_s3(step_nums) return scan_for_counters_in_files(uris, self, self.get_hadoop_version()) except LogFetchError, e: log.info("Unable to fetch counters: %s" % e) return {} def counters(self): return self._counters def _find_probable_cause_of_failure(self, step_nums): """Scan logs for Python exception tracebacks. 
Args: step_nums -- the numbers of steps belonging to us, so that we can ignore errors from other jobs when sharing a job flow Returns: None (nothing found) or a dictionary containing: lines -- lines in the log file containing the error message log_file_uri -- the log file containing the error message input_uri -- if the error happened in a mapper in the first step, the URI of the input file that caused the error (otherwise None) """ if self._opts['ec2_key_pair_file']: try: return self._find_probable_cause_of_failure_ssh(step_nums) except LogFetchError: return self._find_probable_cause_of_failure_s3(step_nums) else: log.info('ec2_key_pair_file not specified, going to S3') return self._find_probable_cause_of_failure_s3(step_nums) def _find_probable_cause_of_failure_ssh(self, step_nums): task_attempt_logs = self.ls_task_attempt_logs_ssh(step_nums) step_logs = self.ls_step_logs_ssh(step_nums) job_logs = self.ls_job_logs_ssh(step_nums) log.info('Scanning SSH logs for probable cause of failure') return scan_logs_in_order(task_attempt_logs=task_attempt_logs, step_logs=step_logs, job_logs=job_logs, runner=self) def _find_probable_cause_of_failure_s3(self, step_nums): log.info('Scanning S3 logs for probable cause of failure') self._wait_for_s3_eventual_consistency() self._wait_for_job_flow_termination() task_attempt_logs = self.ls_task_attempt_logs_s3(step_nums) step_logs = self.ls_step_logs_s3(step_nums) job_logs = self.ls_job_logs_s3(step_nums) return scan_logs_in_order(task_attempt_logs=task_attempt_logs, step_logs=step_logs, job_logs=job_logs, runner=self) ### Bootstrapping ### def _create_master_bootstrap_script(self, dest='b.py'): """Create the master bootstrap script and write it into our local temp directory. This will do nothing if there are no bootstrap scripts or commands, or if _create_master_bootstrap_script() has already been called.""" # we call the script b.py because there's a character limit on # bootstrap script names (or there was at one time, anyway) if not any(key.startswith('bootstrap_') and value for (key, value) in self._opts.iteritems()): return # don't bother if we're not starting a job flow if self._opts['emr_job_flow_id']: return # Also don't bother if we're not pooling (and therefore don't need # to have a bootstrap script to attach to) and we're not bootstrapping # anything else if not (self._opts['pool_emr_job_flows'] or any(key.startswith('bootstrap_') and key != 'bootstrap_actions' and # these are separate scripts value for (key, value) in self._opts.iteritems())): return if self._opts['bootstrap_mrjob']: if self._mrjob_tar_gz_file is None: self._mrjob_tar_gz_file = self._add_bootstrap_file( self._create_mrjob_tar_gz() + '#') # need to know what files are called self._name_files() self._pick_s3_uris_for_files() path = os.path.join(self._get_local_tmp_dir(), dest) log.info('writing master bootstrap script to %s' % path) contents = self._master_bootstrap_script_content() for line in StringIO(contents): log.debug('BOOTSTRAP: ' + line.rstrip('\r\n')) f = open(path, 'w') f.write(contents) f.close() name, _ = self._split_path(path) self._master_bootstrap_script = {'path': path, 'name': name} self._files.append(self._master_bootstrap_script) def _master_bootstrap_script_content(self): """Create the contents of the master bootstrap script. This will give names and S3 URIs to files that don't already have them. This function does NOT pick S3 URIs for files or anything like that; _create_master_bootstrap_script() is responsible for that. 
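        For illustration only (a trimmed sketch; the made-up S3 URI and the
        exact set of sections depend on this runner's options), the emitted
        script looks roughly like::

            #!/usr/bin/python
            import distutils.sysconfig
            from subprocess import call, check_call

            # download files using hadoop fs -copyToLocal
            check_call(['hadoop', 'fs', '-copyToLocal',
                        's3://walrus/tmp/files/mrjob.tar.gz', 'mrjob.tar.gz'])

            # bootstrap mrjob
            site_packages = distutils.sysconfig.get_python_lib()
            check_call(['sudo', 'tar', 'xfz', 'mrjob.tar.gz',
                        '-C', site_packages])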
""" out = StringIO() python_bin_in_list = ', '.join(repr(opt) for opt in self._opts['python_bin']) def writeln(line=''): out.write(line + '\n') # shebang writeln('#!/usr/bin/python') writeln() # imports writeln('from __future__ import with_statement') writeln() writeln('import distutils.sysconfig') writeln('import os') writeln('import stat') writeln('from subprocess import call, check_call') writeln('from tempfile import mkstemp') writeln('from xml.etree.ElementTree import ElementTree') writeln() # download files using hadoop fs writeln('# download files using hadoop fs -copyToLocal') for file_dict in self._files: if file_dict.get('bootstrap'): writeln( "check_call(['hadoop', 'fs', '-copyToLocal', %r, %r])" % (file_dict['s3_uri'], file_dict['name'])) writeln() # make scripts executable if self._bootstrap_scripts: writeln('# make bootstrap scripts executable') for file_dict in self._bootstrap_scripts: writeln("check_call(['chmod', 'a+rx', %r])" % file_dict['name']) writeln() # bootstrap mrjob if self._opts['bootstrap_mrjob']: writeln('# bootstrap mrjob') writeln("site_packages = distutils.sysconfig.get_python_lib()") writeln( "check_call(['sudo', 'tar', 'xfz', %r, '-C', site_packages])" % self._mrjob_tar_gz_file['name']) # re-compile pyc files now, since mappers/reducers can't # write to this directory. Don't fail if there is extra # un-compileable crud in the tarball. writeln("mrjob_dir = os.path.join(site_packages, 'mrjob')") writeln("call([" "'sudo', %s, '-m', 'compileall', '-f', mrjob_dir])" % python_bin_in_list) writeln() # install our python modules if self._bootstrap_python_packages: writeln('# install python modules:') for file_dict in self._bootstrap_python_packages: writeln("check_call(['tar', 'xfz', %r])" % file_dict['name']) # figure out name of dir to CD into assert file_dict['path'].endswith('.tar.gz') cd_into = extract_dir_for_tar(file_dict['path']) # install the module writeln("check_call([" "'sudo', %s, 'setup.py', 'install'], cwd=%r)" % (python_bin_in_list, cd_into)) # run our commands if self._opts['bootstrap_cmds']: writeln('# run bootstrap cmds:') for cmd in self._opts['bootstrap_cmds']: if isinstance(cmd, basestring): writeln('check_call(%r, shell=True)' % cmd) else: writeln('check_call(%r)' % cmd) writeln() # run our scripts if self._bootstrap_scripts: writeln('# run bootstrap scripts:') for file_dict in self._bootstrap_scripts: writeln('check_call(%r)' % ( ['./' + file_dict['name']],)) writeln() return out.getvalue() ### EMR JOB MANAGEMENT UTILS ### def make_persistent_job_flow(self): """Create a new EMR job flow that requires manual termination, and return its ID. You can also fetch the job ID by calling self.get_emr_job_flow_id() """ if (self._emr_job_flow_id): raise AssertionError( 'This runner is already associated with job flow ID %s' % (self._emr_job_flow_id)) log.info('Creating persistent job flow to run several jobs in...') self._create_master_bootstrap_script() self._upload_non_input_files() # don't allow user to call run() self._ran_job = True self._emr_job_flow_id = self._create_job_flow(persistent=True) return self._emr_job_flow_id def get_emr_job_flow_id(self): return self._emr_job_flow_id def usable_job_flows(self, emr_conn=None, exclude=None, num_steps=1): """Get job flows that this runner can use. 
We basically expect to only join available job flows with the exact same setup as our own, that is: - same bootstrap setup (including mrjob version) - have the same Hadoop and AMI version - same number and type of instances However, we allow joining job flows where for each role, every instance has at least as much memory as we require, and the total number of compute units is at least what we require. There also must be room for our job in the job flow (job flows top out at 256 steps). We then sort by: - total compute units for core + task nodes - total compute units for master node - time left to an even instance hour The most desirable job flows come *last* in the list. :return: list of (job_minutes_float, :py:class:`botoemr.emrobject.JobFlow`) """ emr_conn = emr_conn or self.make_emr_conn() exclude = exclude or set() req_hash = self._pool_hash() # decide memory and total compute units requested for each # role type role_to_req_instance_type = {} role_to_req_num_instances = {} role_to_req_mem = {} role_to_req_cu = {} role_to_req_bid_price = {} for role in ('core', 'master', 'task'): instance_type = self._opts['ec2_%s_instance_type' % role] if role == 'master': num_instances = 1 else: num_instances = self._opts['num_ec2_%s_instances' % role] role_to_req_instance_type[role] = instance_type role_to_req_num_instances[role] = num_instances role_to_req_bid_price[role] = ( self._opts['ec2_%s_instance_bid_price' % role]) # unknown instance types can only match themselves role_to_req_mem[role] = ( EC2_INSTANCE_TYPE_TO_MEMORY.get(instance_type, float('Inf'))) role_to_req_cu[role] = ( num_instances * EC2_INSTANCE_TYPE_TO_COMPUTE_UNITS.get(instance_type, float('Inf'))) sort_keys_and_job_flows = [] def add_if_match(job_flow): # this may be a retry due to locked job flows if job_flow.jobflowid in exclude: return # only take persistent job flows if job_flow.keepjobflowalivewhennosteps != 'true': return # match pool name, and (bootstrap) hash hash, name = pool_hash_and_name(job_flow) if req_hash != hash: return if self._opts['emr_job_flow_pool_name'] != name: return # match hadoop version if job_flow.hadoopversion != self.get_hadoop_version(): return # match AMI version job_flow_ami_version = getattr(job_flow, 'amiversion', None) if job_flow_ami_version != self._opts['ami_version']: return # there is a hard limit of 256 steps per job flow if len(job_flow.steps) + num_steps > MAX_STEPS_PER_JOB_FLOW: return # in rare cases, job flow can be WAITING *and* have incomplete # steps if any(getattr(step, 'enddatetime', None) is None for step in job_flow.steps): return # total compute units per group role_to_cu = defaultdict(float) # total number of instances of the same type in each group. # This allows us to match unknown instance types. role_to_matched_instances = defaultdict(int) # check memory and compute units, bailing out if we hit # an instance with too little memory for ig in job_flow.instancegroups: role = ig.instancerole.lower() # unknown, new kind of role; bail out! 
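# Illustrative aside (standalone sketch, not part of add_if_match()): the
# sort key built below is (core+task compute units, master compute units,
# estimated minutes left in the current billable hour), so after sorting,
# the most desirable candidate job flow is simply the last element.
# With made-up numbers:
_example_sort_keys = [
    ((8.0, 1.0, 12.5), 'j-SMALL'),       # fewer compute units
    ((16.0, 1.0, 20.0), 'j-BIG-SOON'),   # more units, hour almost up
    ((16.0, 1.0, 50.0), 'j-BIG-FRESH'),  # more units, most time left
]
assert sorted(_example_sort_keys)[-1][1] == 'j-BIG-FRESH'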
if role not in ('core', 'master', 'task'): return req_instance_type = role_to_req_instance_type[role] if ig.instancetype != req_instance_type: # if too little memory, bail out mem = EC2_INSTANCE_TYPE_TO_MEMORY.get(ig.instancetype, 0.0) req_mem = role_to_req_mem.get(role, 0.0) if mem < req_mem: return # if bid price is too low, don't count compute units req_bid_price = role_to_req_bid_price[role] bid_price = getattr(ig, 'bidprice', None) # if the instance is on-demand (no bid price) or bid prices # are the same, we're okay if bid_price and bid_price != req_bid_price: # whoops, we didn't want spot instances at all if not req_bid_price: continue try: if float(req_bid_price) > float(bid_price): continue except ValueError: # we don't know what to do with non-float bid prices, # and we know it's not equal to what we requested continue # don't require instances to be running; we'd be worse off if # we started our own job flow from scratch. (This can happen if # the previous job finished while some task instances were # still being provisioned.) cu = (int(ig.instancerequestcount) * EC2_INSTANCE_TYPE_TO_COMPUTE_UNITS.get( ig.instancetype, 0.0)) role_to_cu.setdefault(role, 0.0) role_to_cu[role] += cu # track number of instances of the same type if ig.instancetype == req_instance_type: role_to_matched_instances[role] += ( int(ig.instancerequestcount)) # check if there are enough compute units for role, req_cu in role_to_req_cu.iteritems(): req_num_instances = role_to_req_num_instances[role] # if we have at least as many units of the right type, # don't bother counting compute units if req_num_instances > role_to_matched_instances[role]: cu = role_to_cu.get(role, 0.0) if cu < req_cu: return # make a sort key sort_key = (role_to_cu['core'] + role_to_cu['task'], role_to_cu['master'], est_time_to_hour(job_flow)) sort_keys_and_job_flows.append((sort_key, job_flow)) for job_flow in emr_conn.describe_jobflows(states=['WAITING']): add_if_match(job_flow) return [job_flow for (sort_key, job_flow) in sorted(sort_keys_and_job_flows)] def find_job_flow(self, num_steps=1): """Find a job flow that can host this runner. Prefer flows with more compute units. Break ties by choosing flow with longest idle time. Return ``None`` if no suitable flows exist. """ chosen_job_flow = None exclude = set() emr_conn = self.make_emr_conn() s3_conn = self.make_s3_conn() while chosen_job_flow is None: sorted_tagged_job_flows = self.usable_job_flows( emr_conn=emr_conn, exclude=exclude, num_steps=num_steps) if sorted_tagged_job_flows: job_flow = sorted_tagged_job_flows[-1] status = attempt_to_acquire_lock( s3_conn, self._lock_uri(job_flow), self._opts['s3_sync_wait_time'], self._job_name) if status: return sorted_tagged_job_flows[-1] else: exclude.add(job_flow.jobflowid) else: return None def _lock_uri(self, job_flow): return make_lock_uri(self._opts['s3_scratch_uri'], job_flow.jobflowid, len(job_flow.steps) + 1) def _pool_hash(self): """Generate a hash of the bootstrap configuration so it can be used to match jobs and job flows. This first argument passed to the bootstrap script will be ``'pool-'`` plus this hash. """ def should_include_file(info): # Bootstrap scripts will always have a different checksum if 'name' in info and info['name'] in ('b.py', 'wrapper.py'): return False # Also do not include script used to spin up job if self._script and info['path'] == self._script['path']: return False # Only include bootstrap files if 'bootstrap' not in info: return False # mrjob.tar.gz is covered by the bootstrap_mrjob variable. 
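# Illustrative aside (standalone sketch; this is not mrjob's hash_object()):
# the point of the pool hash is that two runners with identical bootstrap
# configuration produce identical digests, so they can safely share a
# pooled job flow.  Roughly:
import hashlib

def _example_pool_hash(file_md5sums, bootstrap_cmds, mrjob_version):
    # any stable serialization of the bootstrap setup will do here
    description = repr((sorted(file_md5sums), bootstrap_cmds, mrjob_version))
    return hashlib.md5(description.encode('utf-8')).hexdigest()

assert (_example_pool_hash(['d41d8cd98f00b204e9800998ecf8427e'],
                           ['easy_install boto'], '0.3.3.2') ==
        _example_pool_hash(['d41d8cd98f00b204e9800998ecf8427e'],
                           ['easy_install boto'], '0.3.3.2'))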
# also, it seems to be different every time, causing an # undesirable hash mismatch. if (self._opts['bootstrap_mrjob'] and info is self._mrjob_tar_gz_file): return False # Ignore job-specific files if info['path'] in self._input_paths: return False return True # strip unique s3 URI if there is one cleaned_bootstrap_actions = [dict(path=fd['path'], args=fd['args']) for fd in self._bootstrap_actions] things_to_hash = [ [self.md5sum(fd['path']) for fd in self._files if should_include_file(fd)], self._opts['additional_emr_info'], self._opts['bootstrap_mrjob'], self._opts['bootstrap_cmds'], cleaned_bootstrap_actions, ] if self._opts['bootstrap_mrjob']: things_to_hash.append(mrjob.__version__) return hash_object(things_to_hash) ### GENERAL FILESYSTEM STUFF ### def du(self, path_glob): """Get the size of all files matching path_glob.""" if not is_s3_uri(path_glob): return super(EMRJobRunner, self).du(path_glob) return sum(self.get_s3_key(uri).size for uri in self.ls(path_glob)) def ls(self, path_glob): """Recursively list files locally or on S3. This doesn't list "directories" unless there's actually a corresponding key ending with a '/' (which is weird and confusing; don't make S3 keys ending in '/') To list a directory, path_glob must end with a trailing slash (foo and foo/ are different on S3) """ if SSH_URI_RE.match(path_glob): for item in self._ssh_ls(path_glob): yield item return if not is_s3_uri(path_glob): for path in super(EMRJobRunner, self).ls(path_glob): yield path return # support globs glob_match = GLOB_RE.match(path_glob) # if it's a "file" (doesn't end with /), just check if it exists if not glob_match and not path_glob.endswith('/'): uri = path_glob if self.get_s3_key(uri): yield uri return # we're going to search for all keys starting with base_uri if glob_match: # cut it off at first wildcard base_uri = glob_match.group(1) else: base_uri = path_glob for uri in self._s3_ls(base_uri): # enforce globbing if glob_match and not fnmatch.fnmatchcase(uri, path_glob): continue yield uri def _ssh_ls(self, uri): """Helper for ls(); obeys globbing""" m = SSH_URI_RE.match(uri) try: addr = m.group('hostname') or self._address_of_master() if '!' in addr: self._enable_slave_ssh_access() output = ssh_ls( self._opts['ssh_bin'], addr, self._opts['ec2_key_pair_file'], m.group('filesystem_path'), self._ssh_key_name, ) for line in output: # skip directories, we only want to return downloadable files if line and not line.endswith('/'): yield SSH_PREFIX + addr + line except SSHException, e: raise LogFetchError(e) def _s3_ls(self, uri): """Helper for ls(); doesn't bother with globbing or directories""" s3_conn = self.make_s3_conn() bucket_name, key_name = parse_s3_uri(uri) bucket = s3_conn.get_bucket(bucket_name) for key in bucket.list(key_name): yield s3_key_to_uri(key) def md5sum(self, path, s3_conn=None): if is_s3_uri(path): k = self.get_s3_key(path, s3_conn=s3_conn) return k.etag.strip('"') else: return super(EMRJobRunner, self).md5sum(path) def _cat_file(self, filename): ssh_match = SSH_URI_RE.match(filename) if is_s3_uri(filename): # stream lines from the s3 key s3_key = self.get_s3_key(filename) buffer_iterator = read_file(s3_key_to_uri(s3_key), fileobj=s3_key) return buffer_iterator_to_line_iterator(buffer_iterator) elif ssh_match: try: addr = ssh_match.group('hostname') or self._address_of_master() if '!' 
in addr: self._enable_slave_ssh_access() output = ssh_cat( self._opts['ssh_bin'], addr, self._opts['ec2_key_pair_file'], ssh_match.group('filesystem_path'), self._ssh_key_name, ) return read_file(filename, fileobj=StringIO(output)) except SSHException, e: raise LogFetchError(e) else: # read from local filesystem return super(EMRJobRunner, self)._cat_file(filename) def mkdir(self, dest): """Make a directory. This does nothing on S3 because there are no directories. """ if not is_s3_uri(dest): super(EMRJobRunner, self).mkdir(dest) def path_exists(self, path_glob): """Does the given path exist? If dest is a directory (ends with a "/"), we check if there are any files starting with that path. """ if not is_s3_uri(path_glob): return super(EMRJobRunner, self).path_exists(path_glob) # just fall back on ls(); it's smart return any(self.ls(path_glob)) def path_join(self, dirname, filename): if is_s3_uri(dirname): return posixpath.join(dirname, filename) else: return os.path.join(dirname, filename) def rm(self, path_glob): """Remove all files matching the given glob.""" if not is_s3_uri(path_glob): return super(EMRJobRunner, self).rm(path_glob) s3_conn = self.make_s3_conn() for uri in self.ls(path_glob): key = self.get_s3_key(uri, s3_conn) if key: log.debug('deleting ' + uri) key.delete() # special case: when deleting a directory, also clean up # the _$folder$ files that EMR creates. if uri.endswith('/'): folder_uri = uri[:-1] + '_$folder$' folder_key = self.get_s3_key(folder_uri, s3_conn) if folder_key: log.debug('deleting ' + folder_uri) folder_key.delete() def touchz(self, dest): """Make an empty file in the given location. Raises an error if a non-empty file already exists in that location.""" if not is_s3_uri(dest): super(EMRJobRunner, self).touchz(dest) key = self.get_s3_key(dest) if key and key.size != 0: raise OSError('Non-empty file %r already exists!' % (dest,)) self.make_s3_key(dest).set_contents_from_string('') ### EMR-specific STUFF ### def _wrap_aws_conn(self, raw_conn): """Wrap a given boto Connection object so that it can retry when throttled.""" def retry_if(ex): """Retry if we get a server error indicating throttling. Also handle spurious 505s that are thought to be part of a load balancer issue inside AWS.""" return ((isinstance(ex, boto.exception.BotoServerError) and ('Throttling' in ex.body or 'RequestExpired' in ex.body or ex.status == 505)) or (isinstance(ex, socket.error) and ex.args in ((104, 'Connection reset by peer'), (110, 'Connection timed out')))) return RetryWrapper(raw_conn, retry_if=retry_if, backoff=EMR_BACKOFF, multiplier=EMR_BACKOFF_MULTIPLIER, max_tries=EMR_MAX_TRIES) def make_emr_conn(self): """Create a connection to EMR. :return: a :py:class:`mrjob.boto_2_1_1_83aae37b.EmrConnection`, a subclass of :py:class:`boto.emr.connection.EmrConnection`, wrapped in a :py:class:`mrjob.retry.RetryWrapper` """ # ...which is then wrapped in bacon! Mmmmm! # give a non-cryptic error message if boto isn't installed if boto is None: raise ImportError('You must install boto to connect to EMR') region = self._get_region_info_for_emr_conn() log.debug('creating EMR connection (to %s)' % region.endpoint) raw_emr_conn = boto_2_1_1_83aae37b.EmrConnection( aws_access_key_id=self._opts['aws_access_key_id'], aws_secret_access_key=self._opts['aws_secret_access_key'], region=region) return self._wrap_aws_conn(raw_emr_conn) def _get_region_info_for_emr_conn(self): """Get a :py:class:`boto.ec2.regioninfo.RegionInfo` object to initialize EMR connections with. 
This is kind of silly because all :py:class:`boto.emr.connection.EmrConnection` ever does with this object is extract the hostname, but that's how boto rolls. """ if self._opts['emr_endpoint']: endpoint = self._opts['emr_endpoint'] else: # look up endpoint in our table try: endpoint = REGION_TO_EMR_ENDPOINT[self._aws_region] except KeyError: raise Exception( "Don't know the EMR endpoint for %s;" " try setting emr_endpoint explicitly" % self._aws_region) return boto.ec2.regioninfo.RegionInfo(None, self._aws_region, endpoint) def _describe_jobflow(self, emr_conn=None): emr_conn = emr_conn or self.make_emr_conn() return emr_conn.describe_jobflow(self._emr_job_flow_id) def get_hadoop_version(self): if not self._inferred_hadoop_version: if self._emr_job_flow_id: # if joining a job flow, infer the version self._inferred_hadoop_version = ( self._describe_jobflow().hadoopversion) else: # otherwise, read it from hadoop_version/ami_version hadoop_version = self._opts['hadoop_version'] if hadoop_version: self._inferred_hadoop_version = hadoop_version else: ami_version = self._opts['ami_version'] # don't explode if we see an AMI version that's # newer than what we know about. self._inferred_hadoop_version = ( AMI_VERSION_TO_HADOOP_VERSION.get(ami_version) or AMI_VERSION_TO_HADOOP_VERSION['latest']) return self._inferred_hadoop_version def _address_of_master(self, emr_conn=None): """Get the address of the master node so we can SSH to it""" # cache address of master to avoid redundant calls to describe_jobflow # also convenient for testing (pretend we can SSH when we really can't # by setting this to something not False) if self._address: return self._address try: jobflow = self._describe_jobflow(emr_conn) if jobflow.state not in ('WAITING', 'RUNNING'): raise LogFetchError( 'Cannot ssh to master; job flow is not waiting or running') except boto.exception.S3ResponseError: # This error is raised by mockboto when the jobflow doesn't exist raise LogFetchError('Could not get job flow information') self._address = jobflow.masterpublicdnsname return self._address def _addresses_of_slaves(self): if not self._ssh_slave_addrs: self._ssh_slave_addrs = ssh_slave_addresses( self._opts['ssh_bin'], self._address_of_master(), self._opts['ec2_key_pair_file']) return self._ssh_slave_addrs ### S3-specific FILESYSTEM STUFF ### # Utilities for interacting with S3 using S3 URIs. # Try to use the more general filesystem interface unless you really # need to do something S3-specific (e.g. setting file permissions) def make_s3_conn(self): """Create a connection to S3. 
:return: a :py:class:`boto.s3.connection.S3Connection`, wrapped in a :py:class:`mrjob.retry.RetryWrapper` """ # give a non-cryptic error message if boto isn't installed if boto is None: raise ImportError('You must install boto to connect to S3') s3_endpoint = self._get_s3_endpoint() log.debug('creating S3 connection (to %s)' % s3_endpoint) raw_s3_conn = boto.connect_s3( aws_access_key_id=self._opts['aws_access_key_id'], aws_secret_access_key=self._opts['aws_secret_access_key'], host=s3_endpoint) return self._wrap_aws_conn(raw_s3_conn) def _get_s3_endpoint(self): if self._opts['s3_endpoint']: return self._opts['s3_endpoint'] else: # look it up in our table try: return REGION_TO_S3_ENDPOINT[self._aws_region] except KeyError: raise Exception( "Don't know the S3 endpoint for %s;" " try setting s3_endpoint explicitly" % self._aws_region) def get_s3_key(self, uri, s3_conn=None): """Get the boto Key object matching the given S3 uri, or return None if that key doesn't exist. uri is an S3 URI: ``s3://foo/bar`` You may optionally pass in an existing s3 connection through ``s3_conn``. """ if not s3_conn: s3_conn = self.make_s3_conn() bucket_name, key_name = parse_s3_uri(uri) return s3_conn.get_bucket(bucket_name).get_key(key_name) def make_s3_key(self, uri, s3_conn=None): """Create the given S3 key, and return the corresponding boto Key object. uri is an S3 URI: ``s3://foo/bar`` You may optionally pass in an existing S3 connection through ``s3_conn``. """ if not s3_conn: s3_conn = self.make_s3_conn() bucket_name, key_name = parse_s3_uri(uri) return s3_conn.get_bucket(bucket_name).new_key(key_name) def get_s3_keys(self, uri, s3_conn=None): """Get a stream of boto Key objects for each key inside the given dir on S3. uri is an S3 URI: ``s3://foo/bar`` You may optionally pass in an existing S3 connection through s3_conn """ if not s3_conn: s3_conn = self.make_s3_conn() bucket_name, key_prefix = parse_s3_uri(uri) bucket = s3_conn.get_bucket(bucket_name) for key in bucket.list(key_prefix): yield key def get_s3_folder_keys(self, uri, s3_conn=None): """Background: S3 is even less of a filesystem than HDFS in that it doesn't have directories. EMR fakes directories by creating special ``*_$folder$`` keys in S3. For example if your job outputs ``s3://walrus/tmp/output/part-00000``, EMR will also create these keys: - ``s3://walrus/tmp_$folder$`` - ``s3://walrus/tmp/output_$folder$`` If you want to grant another Amazon user access to your files so they can use them in S3, you must grant read access on the actual keys, plus any ``*_$folder$`` keys that "contain" your keys; otherwise EMR will error out with a permissions error. This gets all the ``*_$folder$`` keys associated with the given URI, as boto Key objects. This does not support globbing. You may optionally pass in an existing S3 connection through ``s3_conn``. 
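        Example (illustrative; assumes ``runner`` is an
        :py:class:`EMRJobRunner` whose job wrote output under the made-up
        URI ``s3://walrus/tmp/output/``)::

            for key in runner.get_s3_folder_keys('s3://walrus/tmp/output/'):
                print key.name  # e.g. tmp_$folder$, tmp/output_$folder$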
""" if not s3_conn: s3_conn = self.make_s3_conn() bucket_name, key_name = parse_s3_uri(uri) bucket = s3_conn.get_bucket(bucket_name) dirs = key_name.split('/') for i in range(len(dirs)): folder_name = '/'.join(dirs[:i]) + '_$folder$' key = bucket.get_key(folder_name) if key: yield key mrjob-0.3.3.2/mrjob/examples/0000775€q(¼€tzÕß0000000000011741151621021535 5ustar sjohnsonAD\Domain Users00000000000000mrjob-0.3.3.2/mrjob/examples/__init__.py0000664€q(¼€tzÕß0000000000011706610131023631 0ustar sjohnsonAD\Domain Users00000000000000mrjob-0.3.3.2/mrjob/examples/mr_log_sampler.py0000664€q(¼€tzÕß0000001001411706610131025102 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2011 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """ MapReduce job to sample n lines from a file. The mapper iterates over each line and yields them to the reducer, combined with a random seed, so that Hadoop will resort the lines. Then, the reducer yields the first n lines. """ __author__ = 'Benjamin Goldenberg ' import random import sys from mrjob.job import MRJob from mrjob.protocol import RawValueProtocol, ReprProtocol SAMPLING_FUDGE_FACTOR = 1.2 class MRLogSampler(MRJob): # We use RawValueProtocol for input to be format agnostic # and avoid any type of parsing errors INPUT_PROTOCOL = RawValueProtocol # We use RawValueProtocol for output so we can output raw lines # instead of (k, v) pairs OUTPUT_PROTOCOL = RawValueProtocol # Encode the intermediate records using repr() instead of JSON, so the # record doesn't get Unicode-encoded INTERNAL_PROTOCOL = ReprProtocol def configure_options(self): super(MRLogSampler, self).configure_options() self.add_passthrough_option( '--sample-size', type=int, help='Number of entries to sample.' ) self.add_passthrough_option( '--expected-length', type=int, help=("Number of entries you expect in the log. If not specified," " we'll pass every line to the reducer.") ) def load_options(self, args): super(MRLogSampler, self).load_options(args) if self.options.sample_size is None: self.option_parser.error('You must specify the --sample-size') else: self.sample_size = self.options.sample_size # If we have an expected length, we can estimate the sampling # probability for the mapper, so that the reducer doesn't have to # process all records. Otherwise, pass everything thru to the reducer. if self.options.expected_length is None: self.sampling_probability = 1. else: # We should be able to bound this probability by using the binomial # distribution, but I haven't figured it out yet. So, let's just # fudge it. self.sampling_probability = (float(self.sample_size) * SAMPLING_FUDGE_FACTOR / self.options.expected_length) def mapper(self, _, line): """ For each log line, with probability self.sampling_probability, yield a None key, and (random seed, line) as the value, so that the values get sorted randomly and fed into a single reducer. 
Args: line - raw log line Yields: key - None value - (random seed, line) """ if random.random() < self.sampling_probability: seed = '%20i' % random.randint(0, sys.maxint) yield None, (seed, line) def reducer(self, _, values): """ Now that the values have a random number attached, they'll come in in random order, so we yield the first n lines, and return early. Args: values - generator of (random_seed, line) pairs Yields: key - None value - random sample of log lines """ for line_num, (seed, line) in enumerate(values): yield None, line # enumerate() is 0-indexed, so add 1 if line_num + 1 >= self.sample_size: break if __name__ == '__main__': MRLogSampler.run() mrjob-0.3.3.2/mrjob/examples/mr_page_rank.py0000664€q(¼€tzÕß0000000656311706610131024543 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Iterative implementation of the PageRank algorithm: http://en.wikipedia.org/wiki/PageRank """ from mrjob.job import MRJob from mrjob.protocol import JSONProtocol def encode_node(node_id, links=None, score=1): """Print out a node, in JSON format. :param node_id: unique ID for this node (any type is okay) :param links: a list of tuples of ``(node_id, weight)``; *node_id* is the ID of a node to send score to, and *weight* is a number between 0 and 1. Your weights should sum to 1 for each node, but if they sum to less than 1, the algorithm will still converge. :type score: float :param score: initial score for the node. Defaults to 1. Ideally, the average weight of your nodes should be 1 (but it if isn't, the algorithm will still converge). """ node = {} if links is not None: node['links'] = sorted(links.items()) node['score'] = score return JSONProtocol.write(node_id, node) + '\n' class MRPageRank(MRJob): INPUT_PROTOCOL = JSONProtocol # read the same format we write def configure_options(self): super(MRPageRank, self).configure_options() self.add_passthrough_option( '--iterations', dest='iterations', default=10, type='int', help='number of iterations to run') self.add_passthrough_option( '--damping-factor', dest='damping_factor', default=0.85, type='float', help='probability a web surfer will continue clicking on links') def send_score(self, node_id, node): """Mapper: send score from a single node to other nodes. Input: ``node_id, node`` Output: ``node_id, ('node', node)`` OR ``node_id, ('score', score)`` """ yield node_id, ('node', node) for dest_id, weight in node.get('links') or []: yield dest_id, ('score', node['score'] * weight) def receive_score(self, node_id, typed_values): """Reducer: Combine scores sent from other nodes, and update this node (creating it if necessary). Store information about the node's previous score in *prev_score*. 
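        Worked example (illustrative numbers): with the default damping
        factor of 0.85, a node that receives scores of 0.5 and 0.25 from its
        inbound links gets a new score of
        ``1 - 0.85 + 0.85 * (0.5 + 0.25) = 0.7875``.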
""" node = {} total_score = 0 for value_type, value in typed_values: if value_type == 'node': node = value else: assert value_type == 'score' total_score += value node['prev_score'] = node['score'] d = self.options.damping_factor node['score'] = 1 - d + d * total_score yield node_id, node def steps(self): return ([self.mr(mapper=self.send_score, reducer=self.receive_score)] * self.options.iterations) if __name__ == '__main__': MRPageRank.run() mrjob-0.3.3.2/mrjob/examples/mr_text_classifier.py0000664€q(¼€tzÕß0000005254511706610131026005 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """A text classifier that uses a modified version of Naive Bayes that is not sensitive to document length. This is a somewhat contrived example in that it does everything in one job; generally you'd run one job to generate n-gram scores, put them in a sqlite database, and run a second job to score documents. But this is simple, and it works! This takes as its input documents encoded by encode_document() below. For each document, you specify its text, and whether it belongs or does not belong to one or more categories. You can also specify a unique ID for each document. This job outputs the documents, with the field 'cat_to_score' filled in. Generally, positive scores indicate the document is in the category, and negative scores indicate it is not, but it's up to you to determine a threshold for each category. This job also outputs scores for each ngram, so that you can classify other documents. About half of the documents are placed in a test set (based on SHA1 hash of their text), which means they will not be used to train the classifier. The 'in_test_set' of each document will be filled accordingly. You can turn this off with the --no-test-set flag. (You can also effectively put docs in the training set by specifying no category information.) Some terminology: An "ngram" is a word or phrase. "foo" is a 1-gram; "foo bar baz" is a 3-gram. "tf" refers to term frequency, that is, the number of times an ngram appears. "df" referse to document frequency, that is, the number of documents an ngram appears in at least once. """ from collections import defaultdict import hashlib import math import re from mrjob.job import MRJob from mrjob.protocol import JSONValueProtocol def encode_document(text, cats=None, id=None): """Encode a document as a JSON so that MRTextClassifier can read it. Args: text -- the text of the document (as a unicode) cats -- a dictionary mapping a category name (e.g. 'sports') to True if the document is in the category, and False if it's not. None indicates that we have no information about this documents' categories id -- a unique ID for the document (any kind of JSON-able value should work). If not specified, we'll auto-generate one. 
""" text = unicode(text) cats = dict((unicode(cat), bool(is_in_cat)) for cat, is_in_cat in (cats or {}).iteritems()) return JSONValueProtocol.write( None, {'text': text, 'cats': cats, 'id': id}) + '\n' def count_ngrams(text, max_ngram_size, stop_words): """Break text down into ngrams, and return a dictionary mapping (n, ngram) to number of times that ngram occurs. n: ngram size ("foo" is a 1-gram, "foo bar baz" is a 3-gram) ngram: the ngram, as a space-separated string or None to indicate the ANY ngram (basically the number of words in the document). Args: text -- text, as a unicode max_ngram_size -- maximum size of ngrams to consider stop_words -- a collection of words (in lowercase) to remove before parsing out ngrams (e.g. "the", "and") """ if not isinstance(stop_words, set): stop_words = set(stop_words) words = [word.lower() for word in WORD_RE.findall(text) if word.lower() not in stop_words] ngram_counts = defaultdict(int) for i in range(len(words)): for n in range(1, max_ngram_size + 1): if i + n <= len(words): ngram = ' '.join(words[i:i + n]) ngram_counts[(n, ngram)] += 1 # add counts for ANY ngram for n in range(1, max_ngram_size + 1): ngram_counts[(n, None)] = len(words) - n + 1 return ngram_counts WORD_RE = re.compile(r"[\w']+", re.UNICODE) DEFAULT_MAX_NGRAM_SIZE = 4 DEFAULT_STOP_WORDS = [ 'a', 'about', 'also', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'but', 'by', 'can', 'com', 'did', 'do', 'does', 'for', 'from', 'had', 'has', 'have', 'he', "he'd", "he'll", "he's", 'her', 'here', 'hers', 'him', 'his', 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', 'it', "it's", 'its', 'just', 'me', 'mine', 'my', 'of', 'on', 'or', 'org', 'our', 'ours', 'she', "she'd", "she'll", "she's", 'some', 'than', 'that', 'the', 'their', 'them', 'then', 'there', 'these', 'they', "they'd", "they'll", "they're", 'this', 'those', 'to', 'us', 'was', 'we', "we'd", "we'll", "we're", 'were', 'what', 'where', 'which', 'who', 'will', 'with', 'would', 'you', 'your', 'yours', ] class MRTextClassifier(MRJob): INPUT_PROTOCOL = JSONValueProtocol def steps(self): """Conceptually, the steps are: 1. Parse documents into ngrams 2. Group by ngram to get a frequency count for each ngram, and to exclude very rare ngrams 3. Send all ngram information to one "global" reducer so we can assign scores for each category and ngram 4. Group scores and documents by ngram and compute score for that ngram for that document. Exclude very common ngrams to save memory. 5. Average together scores for each document to get its score for each category. The documents themselves are passed through from step 1 to step 5. Ngram scoring information is passed through from step 4 to step 5. """ return [self.mr(self.parse_doc, self.count_ngram_freq), self.mr(reducer=self.score_ngrams), self.mr(reducer=self.score_documents_by_ngram), self.mr(reducer=self.score_documents)] def configure_options(self): """Add command-line options specific to this script.""" super(MRTextClassifier, self).configure_options() self.add_passthrough_option( '--min-df', dest='min_df', default=2, type='int', help=('min number of documents an n-gram must appear in for us to' ' count it. Default: %default')) self.add_passthrough_option( '--max-df', dest='max_df', default=10000000, type='int', help=('max number of documents an n-gram may appear in for us to' ' count it (this keeps reducers from running out of memory).' 
' Default: %default')) self.add_passthrough_option( '--max-ngram-size', dest='max_ngram_size', default=DEFAULT_MAX_NGRAM_SIZE, type='int', help='maximum phrase length to consider') self.add_passthrough_option( '--stop-words', dest='stop_words', default=', '.join(DEFAULT_STOP_WORDS), help=("comma-separated list of words to ignore. For example, " "--stop-words 'in, the' would cause 'hole in the wall' to be" " parsed as ['hole', 'wall']. Default: %default")) self.add_passthrough_option( '--short-doc-threshold', dest='short_doc_threshold', type='int', default=None, help=('Normally, for each n-gram size, we take the average score' ' over all n-grams that appear. This allows us to penalize' ' short documents by using this threshold as the denominator' ' rather than the actual number of n-grams.')) self.add_passthrough_option( '--no-test-set', dest='no_test_set', action='store_true', default=False, help=("Choose about half of the documents to be the testing set" " (don't use them to train the classifier) based on a SHA1" " hash of their text")) def load_options(self, args): """Parse stop_words option.""" super(MRTextClassifier, self).load_options(args) self.stop_words = set() if self.options.stop_words: self.stop_words.update( s.strip().lower() for s in self.options.stop_words.split(',')) def parse_doc(self, _, doc): """Mapper: parse documents and emit ngram information. Input: JSON-encoded documents (see :py:func:`encode_document`) Output: ``('ngram', (n, ngram)), (count, cats)`` OR ``('doc', doc_id), doc`` n: ngram length ngram: ngram encoded encoded as a string (e.g. "pad thai") or None to indicate ANY ngram. count: # of times an ngram appears in the document cats: a map from category name to a boolean indicating whether it's this document is in the category doc_id: (hopefully) unique document ID doc: the encoded document. We'll fill these fields: ngram_counts: map from (n, ngram) to # of times ngram appears in the document, using (n, None) to represent the total number of times ANY ngram of that size appears (essentially number of words) in_test_set: boolean indicating if this doc is in the test set id: SHA1 hash of doc text (if not already filled) """ # only compute doc hash if we need it if doc.get('id') is not None and self.options.no_test_set: doc_hash = '0' # don't need doc hash else: doc_hash = hashlib.sha1(doc['text'].encode('utf-8')).hexdigest() # fill in ID if missing if doc.get('id') is None: doc['id'] = doc_hash # pick test/training docs if self.options.no_test_set: doc['in_test_set'] = False else: doc['in_test_set'] = bool(int(doc_hash[-1], 16) % 2) # map from (n, ngram) to number of times it appears ngram_counts = count_ngrams( doc['text'], self.options.max_ngram_size, self.stop_words) # yield the number of times the ngram appears in this doc # and the categories for this document, so we can train the classifier if not doc['in_test_set']: for (n, ngram), count in ngram_counts.iteritems(): yield ('ngram', (n, ngram)), (count, doc['cats']) # yield the document itself, for safekeeping doc['ngram_counts'] = ngram_counts.items() yield ('doc', doc['id']), doc def count_ngram_freq(self, type_and_key, values): """Reducer: Combine information about how many times each ngram appears for docs in/not in each category. Dump ngrams that appear in very few documents (according to --min-df switch). If two documents have the same ID, increment a counter and only keep one; otherwise pass docs through unchanged. 
Input (see parse_doc() for details): ('ngram', (n, ngram)), (count, cats) OR ('doc', doc_id), doc Output: ('global', None), ((n, ngram), (cat_to_df, cat_to_tf)) OR ('doc', doc_id), doc n: ngram length ngram: ngram encoded encoded as a string (e.g. "pad thai") or None to indicate ANY ngram. cat_to_df: list of tuples of ((cat_name, is_in_category), df); df is # of documents of this type that the ngram appears in cat_to_tf: list of tuples of ((cat_name, is_in_category), df); tf is # of time the ngram appears in docs of this type doc_id: unique document ID doc: the encoded document """ key_type, key = type_and_key # pass documents through if key_type == 'doc': doc_id = key docs = list(values) # if two documents end up with the same key, only keep one if len(docs) > 1: self.increment_counter( 'Document key collision', str(doc_id)) yield ('doc', doc_id), docs[0] return assert key_type == 'ngram' n, ngram = key # total # of docs this ngram appears in total_df = 0 # map from (cat, is_in_cat) to # number of documents in this cat it appears in (df), or # number of times it appears in documents of this type (tf) cat_to_df = defaultdict(int) cat_to_tf = defaultdict(int) for count, cats in values: total_df += 1 for cat in cats.iteritems(): cat_to_df[cat] += 1 cat_to_tf[cat] += count # don't bother with very rare ngrams if total_df < self.options.min_df: return yield (('global', None), ((n, ngram), (cat_to_df.items(), cat_to_tf.items()))) def score_ngrams(self, type_and_key, values): """Reducer: Look at all ngrams together, and assign scores by ngram and category. Also farm out documents to the reducer for any ngram they contain, and pass documents through to the next step. To score an ngram for a category, we compare the probability of any given ngram being our ngram for documents in the category against documents not in the category. The score is just the log of the ratio of probabilities (the "log difference") Input (see count_ngram_freq() for details): ('global', None), ((n, ngram), (cat_to_df, cat_to_tf)) OR ('doc', doc_id), doc Output: ('doc', doc_id), document OR ('ngram', (n, ngram)), ('doc_id', doc_id) OR ('ngram', (n, ngram)), ('cat_to_score', cat_to_score) n: ngram length ngram: ngram encoded encoded as a string (e.g. "pad thai") or None to indicate ANY ngram. cat_to_score: map from (cat_name, is_in_category) to score for this ngram doc_id: unique document ID doc: the encoded document """ key_type, key = type_and_key if key_type == 'doc': doc_id = key doc = list(values)[0] # pass document through yield ('doc', doc_id), doc # send document to reducer for every ngram it contains for (n, ngram), count in doc['ngram_counts']: # don't bother even creating a reducer for the ANY ngram # because we'd have to send all documents to it. if ngram is None: continue yield (('ngram', (n, ngram)), ('doc_id', doc_id)) return assert key_type == 'global' ngram_to_info = dict( ((n, ngram), (dict((tuple(cat), df) for cat, df in cat_to_df), dict((tuple(cat), tf) for cat, tf in cat_to_tf))) for (n, ngram), (cat_to_df, cat_to_tf) in values) # m = # of possible ngrams of any given type. This is not a very # rigorous estimate, but it's good enough m = len(ngram_to_info) for (n, ngram), info in ngram_to_info.iteritems(): # do this even for the special ANY ngram; it's useful # as a normalization factor. 
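        # Illustrative worked example (made-up numbers): if 'pad thai'
        # accounts for 1 in 1,000 terms in documents that are in the 'food'
        # category, but only 1 in 100,000 terms in documents that are not,
        # its score for 'food' is the log difference
        # log(1e-3) - log(1e-5) = log(100), roughly 4.6.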
cat_to_df, cat_to_tf = info # get the total # of documents and terms for ngrams of this size cat_to_d, cat_to_t = ngram_to_info[(n, None)] # calculate the probability of any given term being # this term for documents of each type cat_to_p = {} for cat, t in cat_to_t.iteritems(): tf = cat_to_tf.get(cat) or 0 # use Laplace's rule of succession to estimate p. See: # http://en.wikipedia.org/wiki/Rule_of_succession#Generalization_to_any_number_of_possibilities cat_to_p[cat] = (tf + (2.0 / m)) / (t + 2) cats = set(cat for cat, in_cat in cat_to_t) cat_to_score = {} for cat in cats: p_if_in = cat_to_p.get((cat, True), 1.0 / m) p_if_out = cat_to_p.get((cat, False), 1.0 / m) # take the log difference of probabilities score = math.log(p_if_in) - math.log(p_if_out) cat_to_score[cat] = score yield (('ngram', (n, ngram)), ('cat_to_score', cat_to_score)) def score_documents_by_ngram(self, type_and_key, types_and_values): """Reducer: For all documents that contain a given ngram, send scoring info to that document. Also pass documents and scoring info through as-is Input (see score_ngrams() for details): ('doc', doc_id), doc OR ('ngram', (n, ngram)), ('doc_id', doc_id) OR ('ngram', (n, ngram)), ('cat_to_score', cat_to_score) Output: ('doc', doc_id), ('doc', doc) ('doc', doc_id), ('scores', ((n, ngram), cat_to_score)) ('cat_to_score', (n, ngram)), cat_to_score n: ngram length ngram: ngram encoded encoded as a string (e.g. "pad thai") or None to indicate ANY ngram. cat_to_score: map from (cat_name, is_in_category) to score for this ngram doc_id: unique document ID doc: the encoded document """ key_type, key = type_and_key # pass documents through if key_type == 'doc': doc_id = key doc = list(types_and_values)[0] yield ('doc', doc_id), ('doc', doc) return assert key_type == 'ngram' n, ngram = key doc_ids = [] cat_to_score = None for value_type, value in types_and_values: if value_type == 'cat_to_score': cat_to_score = value continue assert value_type == 'doc_id' doc_ids.append(value) if len(doc_ids) > self.options.max_df: self.increment_counter('Exceeded max df', repr((n, ngram))) return # skip ngrams that are too rare to score if cat_to_score is None: return # send score info for this ngram to this document for doc_id in doc_ids: yield ('doc', doc_id), ('scores', ((n, ngram), cat_to_score)) # keep scoring info yield ('cat_to_score', (n, ngram)), cat_to_score def score_documents(self, type_and_key, types_and_values): """Reducer: combine all scoring information for each document, and add it to the document. Also pass ngram scores through as-is. To score a document, we essentially take a weighted average of all the scores for ngrams of each size, and then sum together those averages. ngrams that aren't scored (because they're very rare or very common) are considered to have a score of zero. Using averages allows us to be insensitive to document size. There is a penalty for very small documents. Input (see score_ngrams() for details): ('doc', doc_id), ('doc', doc) ('doc', doc_id), ('scores', ((n, ngram), cat_to_score)) ('cat_to_score', (n, ngram)), cat_to_score Output: ('doc', doc_id), doc ('cat_to_score', (n, ngram)), cat_to_score n: ngram length ngram: ngram encoded encoded as a string (e.g. "pad thai") or None to indicate ANY ngram. cat_to_score: map from (cat_name, is_in_category) to score for this ngram doc_id: unique document ID doc: the encoded document. this will contain an extra field 'cat_to_score', and will no longer have the 'ngram_counts' field. 
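        Worked example (illustrative numbers, assuming --short-doc-threshold
        is not set): if a document contains eight 1-grams in total, and its
        only scored 1-grams for category 'sports' are 'goal' (score 2.0,
        appearing twice) and 'election' (score -1.0, appearing once), then
        the 1-gram contribution is ``(2.0 * 2 + (-1.0) * 1) / 8 = 0.375``.
        Contributions for the other ngram sizes are computed the same way
        and summed to give ``cat_to_score['sports']``.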
""" key_type, key = type_and_key # pass through cat_to_score if key_type == 'cat_to_score': cat_to_score = list(types_and_values)[0] yield ('cat_to_score', key), cat_to_score return assert key_type == 'doc' doc_id = key # store the document and scoring info doc = None ngrams_and_scores = [] for value_type, value in types_and_values: if value_type == 'doc': doc = value continue assert value_type == 'scores' ((n, ngram), cat_to_score) = value ngrams_and_scores.append(((n, ngram), cat_to_score)) # total scores for each ngram size ngram_counts = dict(((n, ngram), count) for (n, ngram), count in doc['ngram_counts']) cat_to_n_to_total_score = defaultdict(lambda: defaultdict(float)) for (n, ngram), cat_to_score in ngrams_and_scores: tf = ngram_counts[(n, ngram)] for cat, score in cat_to_score.iteritems(): cat_to_n_to_total_score[cat][n] += score * tf # average scores for each ngram size cat_to_score = {} for cat, n_to_total_score in cat_to_n_to_total_score.iteritems(): total_score_for_cat = 0 for n, total_score in n_to_total_score.iteritems(): total_t = ngram_counts[(n, None)] total_score_for_cat += ( total_score / max(total_t, self.options.short_doc_threshold, 1)) cat_to_score[cat] = total_score_for_cat # add scores to the document, and get rid of ngram_counts doc['cat_to_score'] = cat_to_score del doc['ngram_counts'] yield ('doc', doc_id), doc if __name__ == '__main__': MRTextClassifier.run() mrjob-0.3.3.2/mrjob/examples/mr_wc.py0000664€q(¼€tzÕß0000000302011734727416023226 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """An implementation of wc as an MRJob. This is meant as an example of why mapper_final is useful.""" from mrjob.job import MRJob class MRWordCountUtility(MRJob): def __init__(self, *args, **kwargs): super(MRWordCountUtility, self).__init__(*args, **kwargs) self.chars = 0 self.words = 0 self.lines = 0 def mapper(self, _, line): # Don't actually yield anything for each line. Instead, collect them # and yield the sums when all lines have been processed. The results # will be collected by the reducer. self.chars += len(line) + 1 # +1 for newline self.words += sum(1 for word in line.split() if word.strip()) self.lines += 1 def mapper_final(self): yield('chars', self.chars) yield('words', self.words) yield('lines', self.lines) def reducer(self, key, values): yield(key, sum(values)) if __name__ == '__main__': MRWordCountUtility.run() mrjob-0.3.3.2/mrjob/examples/mr_word_freq_count.py0000664€q(¼€tzÕß0000000202711706610131026003 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
# See the License for the specific language governing permissions and # limitations under the License. """The classic MapReduce job: count the frequency of words. """ from mrjob.job import MRJob import re WORD_RE = re.compile(r"[\w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def combiner(self, word, counts): yield (word, sum(counts)) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run() mrjob-0.3.3.2/mrjob/hadoop.py0000664€q(¼€tzÕß0000007224511740642733021565 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp and Contributors # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. import getpass import logging import os import posixpath import re from subprocess import Popen from subprocess import PIPE from subprocess import CalledProcessError try: from cStringIO import StringIO StringIO # quiet "redefinition of unused ..." warning from pyflakes except ImportError: from StringIO import StringIO from mrjob import compat from mrjob.conf import combine_cmds from mrjob.conf import combine_dicts from mrjob.conf import combine_paths from mrjob.logparsers import TASK_ATTEMPTS_LOG_URI_RE from mrjob.logparsers import STEP_LOG_URI_RE from mrjob.logparsers import HADOOP_JOB_LOG_URI_RE from mrjob.logparsers import scan_for_counters_in_files from mrjob.logparsers import scan_logs_in_order from mrjob.parse import HADOOP_STREAMING_JAR_RE from mrjob.parse import is_uri from mrjob.parse import urlparse from mrjob.runner import MRJobRunner from mrjob.util import cmd_line from mrjob.util import read_file log = logging.getLogger('mrjob.hadoop') # to filter out the log4j stuff that hadoop streaming prints out HADOOP_STREAMING_OUTPUT_RE = re.compile(r'^(\S+ \S+ \S+ \S+: )?(.*)$') # used by mkdir() HADOOP_FILE_EXISTS_RE = re.compile(r'.*File exists.*') # used by ls() HADOOP_LSR_NO_SUCH_FILE = re.compile( r'^lsr: Cannot access .*: No such file or directory.') # used by rm() (see below) HADOOP_RMR_NO_SUCH_FILE = re.compile(r'^rmr: hdfs://.*$') # used to extract the job timestamp from stderr HADOOP_JOB_TIMESTAMP_RE = re.compile( r'(INFO: )?Running job: job_(?P\d+)_(?P\d+)') # find version string in "Hadoop 0.20.203" etc. 
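# Illustrative aside (standalone sketch of what HADOOP_VERSION_RE below is
# used for): pull the numeric version out of the first line of
# `hadoop version` output.
import re

_example_version_re = re.compile(r'^.*?(?P<version>(\d|\.)+).*?$')
_m = _example_version_re.match('Hadoop 0.20.203')
assert _m and _m.group('version') == '0.20.203'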
HADOOP_VERSION_RE = re.compile(r'^.*?(?P(\d|\.)+).*?$') def find_hadoop_streaming_jar(path): """Return the path of the hadoop streaming jar inside the given directory tree, or None if we can't find it.""" for (dirpath, _, filenames) in os.walk(path): for filename in filenames: if HADOOP_STREAMING_JAR_RE.match(filename): return os.path.join(dirpath, filename) else: return None def fully_qualify_hdfs_path(path): """If path isn't an ``hdfs://`` URL, turn it into one.""" if path.startswith('hdfs://') or path.startswith('s3n:/'): return path elif path.startswith('/'): return 'hdfs://' + path else: return 'hdfs:///user/%s/%s' % (getpass.getuser(), path) def hadoop_log_dir(): """Return the path where Hadoop stores logs""" try: return os.environ['HADOOP_LOG_DIR'] except KeyError: # Defaults to $HADOOP_HOME/logs # http://wiki.apache.org/hadoop/HowToConfigure return os.path.join(os.environ['HADOOP_HOME'], 'logs') class HadoopJobRunner(MRJobRunner): """Runs an :py:class:`~mrjob.job.MRJob` on your Hadoop cluster. Input and support files can be either local or on HDFS; use ``hdfs://...`` URLs to refer to files on HDFS. It's rare to need to instantiate this class directly (see :py:meth:`~HadoopJobRunner.__init__` for details). """ alias = 'hadoop' def __init__(self, **kwargs): """:py:class:`~mrjob.hadoop.HadoopJobRunner` takes the same arguments as :py:class:`~mrjob.runner.MRJobRunner`, plus some additional options which can be defaulted in :ref:`mrjob.conf `. *output_dir* and *hdfs_scratch_dir* need not be fully qualified ``hdfs://`` URIs because it's understood that they have to be on HDFS (e.g. ``tmp/mrjob/`` would be okay) Additional options: :type hadoop_bin: str or list :param hadoop_bin: name/path of your hadoop program (may include arguments). Defaults to *hadoop_home* plus ``bin/hadoop``. :type hadoop_home: str :param hadoop_home: alternative to setting the :envvar:`HADOOP_HOME` environment variable :type hdfs_scratch_dir: str :param hdfs_scratch_dir: temp directory on HDFS. Default is ``tmp/mrjob``. *hadoop_streaming_jar* is optional; by default, we'll search for it inside :envvar:`HADOOP_HOME` """ super(HadoopJobRunner, self).__init__(**kwargs) # fix hadoop_home if not self._opts['hadoop_home']: raise Exception( 'you must set $HADOOP_HOME, or pass in hadoop_home explicitly') self._opts['hadoop_home'] = os.path.abspath(self._opts['hadoop_home']) # fix hadoop_bin if not self._opts['hadoop_bin']: self._opts['hadoop_bin'] = [ os.path.join(self._opts['hadoop_home'], 'bin/hadoop')] # fix hadoop_streaming_jar if not self._opts['hadoop_streaming_jar']: log.debug('Looking for hadoop streaming jar in %s' % self._opts['hadoop_home']) self._opts['hadoop_streaming_jar'] = find_hadoop_streaming_jar( self._opts['hadoop_home']) if not self._opts['hadoop_streaming_jar']: raise Exception( "Couldn't find streaming jar in %s, bailing out" % self._opts['hadoop_home']) log.debug('Hadoop streaming jar is %s' % self._opts['hadoop_streaming_jar']) self._hdfs_tmp_dir = fully_qualify_hdfs_path( posixpath.join( self._opts['hdfs_scratch_dir'], self._job_name)) # Set output dir if it wasn't set explicitly self._output_dir = fully_qualify_hdfs_path( self._output_dir or posixpath.join(self._hdfs_tmp_dir, 'output')) # we'll set this up later self._hdfs_input_files = None # temp dir for input self._hdfs_input_dir = None self._hadoop_log_dir = hadoop_log_dir() # Running jobs via hadoop assigns a new timestamp to each job. # Running jobs via mrjob only adds steps. # Store both of these values to enable log parsing. 
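# Worked examples for fully_qualify_hdfs_path() defined earlier in this file
# (the paths are invented). Relative paths land under the current user's HDFS
# home directory, which is how hdfs_scratch_dir's default of tmp/mrjob works:
import getpass

assert (fully_qualify_hdfs_path('hdfs://namenode:9000/data') ==
        'hdfs://namenode:9000/data')
assert fully_qualify_hdfs_path('/data/input') == 'hdfs:///data/input'
assert (fully_qualify_hdfs_path('tmp/mrjob') ==
        'hdfs:///user/%s/tmp/mrjob' % getpass.getuser())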
self._job_timestamp = None self._start_step_num = None # init hadoop version cache self._hadoop_version = None @classmethod def _allowed_opts(cls): """A list of which keyword args we can pass to __init__()""" return super(HadoopJobRunner, cls)._allowed_opts() + [ 'hadoop_bin', 'hadoop_home', 'hdfs_scratch_dir', ] @classmethod def _default_opts(cls): """A dictionary giving the default value of options.""" return combine_dicts(super(HadoopJobRunner, cls)._default_opts(), { 'hadoop_home': os.environ.get('HADOOP_HOME'), 'hdfs_scratch_dir': 'tmp/mrjob', }) @classmethod def _opts_combiners(cls): """Map from option name to a combine_*() function used to combine values for that option. This allows us to specify that some options are lists, or contain environment variables, or whatever.""" return combine_dicts(super(HadoopJobRunner, cls)._opts_combiners(), { 'hadoop_bin': combine_cmds, 'hadoop_home': combine_paths, 'hdfs_scratch_dir': combine_paths, }) def get_hadoop_version(self): """Invoke the hadoop executable to determine its version""" if not self._hadoop_version: stdout = self._invoke_hadoop(['version'], return_stdout=True) if stdout: first_line = stdout.split('\n')[0] m = HADOOP_VERSION_RE.match(first_line) if m: self._hadoop_version = m.group('version') log.info("Using Hadoop version %s" % self._hadoop_version) return self._hadoop_version self._hadoop_version = '0.20.203' log.info("Unable to determine Hadoop version. Assuming 0.20.203.") return self._hadoop_version def _run(self): if self._opts['bootstrap_mrjob']: self._add_python_archive(self._create_mrjob_tar_gz() + '#') self._setup_input() self._upload_non_input_files() self._run_job_in_hadoop() def _setup_input(self): """Copy local input files (if any) to a special directory on HDFS. Set self._hdfs_input_files """ # winnow out HDFS files from local ones self._hdfs_input_files = [] local_input_files = [] for path in self._input_paths: if is_uri(path): # Don't even bother running the job if the input isn't there. if not self.ls(path): raise AssertionError( 'Input path %s does not exist!' % (path,)) self._hdfs_input_files.append(path) else: local_input_files.append(path) # copy local files into an input directory, with names like # 00000-actual_name.ext if local_input_files: hdfs_input_dir = posixpath.join(self._hdfs_tmp_dir, 'input') log.info('Uploading input to %s' % hdfs_input_dir) self._mkdir_on_hdfs(hdfs_input_dir) for i, path in enumerate(local_input_files): if path == '-': path = self._dump_stdin_to_local_file() target = '%s/%05i-%s' % ( hdfs_input_dir, i, os.path.basename(path)) self._upload_to_hdfs(path, target) self._hdfs_input_files.append(hdfs_input_dir) def _pick_hdfs_uris_for_files(self): """Decide where each file will be uploaded on S3. Okay to call this multiple times. """ hdfs_files_dir = posixpath.join(self._hdfs_tmp_dir, 'files', '') self._assign_unique_names_to_files( 'hdfs_uri', prefix=hdfs_files_dir, match=is_uri) def _upload_non_input_files(self): """Copy files to HDFS, and set the 'hdfs_uri' field for each file. 
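# A minimal mrjob.conf sketch for this runner, using the options whose
# defaults and combiners are declared above (paths are placeholders;
# hadoop_streaming_jar can be omitted because it is located automatically
# under hadoop_home):
#
#   runners:
#     hadoop:
#       hadoop_home: /usr/lib/hadoop
#       hdfs_scratch_dir: tmp/mrjob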
""" self._pick_hdfs_uris_for_files() hdfs_files_dir = posixpath.join(self._hdfs_tmp_dir, 'files', '') self._mkdir_on_hdfs(hdfs_files_dir) log.info('Copying non-input files into %s' % hdfs_files_dir) for file_dict in self._files: path = file_dict['path'] # don't bother with files already in HDFS if is_uri(path): continue self._upload_to_hdfs(path, file_dict['hdfs_uri']) def _mkdir_on_hdfs(self, path): log.debug('Making directory %s on HDFS' % path) self._invoke_hadoop(['fs', '-mkdir', path]) def _upload_to_hdfs(self, path, target): log.debug('Uploading %s -> %s on HDFS' % (path, target)) self._invoke_hadoop(['fs', '-put', path, target]) def _dump_stdin_to_local_file(self): """Dump sys.stdin to a local file, and return the path to it.""" stdin_path = os.path.join(self._get_local_tmp_dir(), 'STDIN') # prompt user, so they don't think the process has stalled log.info('reading from STDIN') log.debug('dumping stdin to local file %s' % stdin_path) stdin_file = open(stdin_path, 'w') for line in self._stdin: stdin_file.write(line) return stdin_path def _run_job_in_hadoop(self): # figure out local names for our files self._name_files() # send script and wrapper script (if any) to working dir assert self._script # shouldn't be able to run if no script self._script['upload'] = 'file' if self._wrapper_script: self._wrapper_script['upload'] = 'file' self._counters = [] steps = self._get_steps() version = self.get_hadoop_version() for step_num, step in enumerate(steps): log.debug('running step %d of %d' % (step_num + 1, len(steps))) streaming_args = (self._opts['hadoop_bin'] + ['jar', self._opts['hadoop_streaming_jar']]) # -files/-archives (generic options, new-style) if compat.supports_new_distributed_cache_options(version): # set up uploading from HDFS to the working dir streaming_args.extend(self._upload_args()) # Add extra hadoop args first as hadoop args could be a hadoop # specific argument (e.g. -libjar) which must come before job # specific args. 
streaming_args.extend( self._hadoop_conf_args(step_num, len(steps))) # set up input for input_uri in self._hdfs_step_input_files(step_num): streaming_args.extend(['-input', input_uri]) # set up output streaming_args.append('-output') streaming_args.append(self._hdfs_step_output_dir(step_num)) # -cacheFile/-cacheArchive (streaming options, old-style) if not compat.supports_new_distributed_cache_options(version): # set up uploading from HDFS to the working dir streaming_args.extend(self._upload_args()) # set up mapper and reducer if 'M' not in step: mapper = 'cat' else: mapper = cmd_line(self._mapper_args(step_num)) if 'C' in step: combiner_cmd = cmd_line(self._combiner_args(step_num)) version = self.get_hadoop_version() if compat.supports_combiners_in_hadoop_streaming(version): combiner = combiner_cmd else: mapper = ("bash -c '%s | sort | %s'" % (mapper, combiner_cmd)) combiner = None else: combiner = None streaming_args.append('-mapper') streaming_args.append(mapper) if combiner: streaming_args.append('-combiner') streaming_args.append(combiner) if 'R' in step: streaming_args.append('-reducer') streaming_args.append(cmd_line(self._reducer_args(step_num))) else: streaming_args.extend(['-jobconf', 'mapred.reduce.tasks=0']) log.debug('> %s' % cmd_line(streaming_args)) step_proc = Popen(streaming_args, stdout=PIPE, stderr=PIPE) # TODO: use a pty or something so that the hadoop binary # won't buffer the status messages self._process_stderr_from_streaming(step_proc.stderr) # there shouldn't be much output to STDOUT for line in step_proc.stdout: log.error('STDOUT: ' + line.strip('\n')) returncode = step_proc.wait() if returncode == 0: # parsing needs step number for whole job self._fetch_counters([step_num + self._start_step_num]) # printing needs step number relevant to this run of mrjob self.print_counters([step_num + 1]) else: msg = ('Job failed with return code %d: %s' % (step_proc.returncode, streaming_args)) log.error(msg) # look for a Python traceback cause = self._find_probable_cause_of_failure( [step_num + self._start_step_num]) if cause: # log cause, and put it in exception cause_msg = [] # lines to log and put in exception cause_msg.append('Probable cause of failure (from %s):' % cause['log_file_uri']) cause_msg.extend(line.strip('\n') for line in cause['lines']) if cause['input_uri']: cause_msg.append('(while reading from %s)' % cause['input_uri']) for line in cause_msg: log.error(line) # add cause_msg to exception message msg += '\n' + '\n'.join(cause_msg) + '\n' raise Exception(msg) raise CalledProcessError(step_proc.returncode, streaming_args) def _process_stderr_from_streaming(self, stderr): for line in stderr: line = HADOOP_STREAMING_OUTPUT_RE.match(line).group(2) log.info('HADOOP: ' + line) if 'Streaming Job Failed!' in line: raise Exception(line) # The job identifier is printed to stderr. We only want to parse it # once because we know how many steps we have and just want to know # what Hadoop thinks the first step's number is. 
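# Quick self-check of the "Running job:" parsing that follows. The stderr
# line is invented; the named groups ('timestamp', 'step_num') are the ones
# the code reads back with m.group() just below:
import re

_JOB_RE = re.compile(
    r'(INFO: )?Running job: job_(?P<timestamp>\d+)_(?P<step_num>\d+)')

_m = _JOB_RE.match('Running job: job_201204100921_0003')
assert _m.group('timestamp') == '201204100921'
assert int(_m.group('step_num')) == 3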
m = HADOOP_JOB_TIMESTAMP_RE.match(line) if m and self._job_timestamp is None: self._job_timestamp = m.group('timestamp') self._start_step_num = int(m.group('step_num')) def _hdfs_step_input_files(self, step_num): """Get the hdfs:// URI for input for the given step.""" if step_num == 0: return self._hdfs_input_files else: return [posixpath.join( self._hdfs_tmp_dir, 'step-output', str(step_num))] def _hdfs_step_output_dir(self, step_num): if step_num == len(self._get_steps()) - 1: return self._output_dir else: return posixpath.join( self._hdfs_tmp_dir, 'step-output', str(step_num + 1)) def _script_args(self): """How to invoke the script inside Hadoop""" assert self._script # shouldn't be able to run if no script args = self._opts['python_bin'] + [self._script['name']] if self._wrapper_script: args = (self._opts['python_bin'] + [self._wrapper_script['name']] + args) return args def _mapper_args(self, step_num): return (self._script_args() + ['--step-num=%d' % step_num, '--mapper'] + self._mr_job_extra_args()) def _combiner_args(self, step_num): return (self._script_args() + ['--step-num=%d' % step_num, '--combiner'] + self._mr_job_extra_args()) def _reducer_args(self, step_num): return (self._script_args() + ['--step-num=%d' % step_num, '--reducer'] + self._mr_job_extra_args()) def _upload_args(self): """Args to upload files from HDFS to the hadoop nodes.""" args = [] version = self.get_hadoop_version() if compat.supports_new_distributed_cache_options(version): # return list of strings ready for comma-joining for passing to the # hadoop binary def escaped_paths(file_dicts): return ["%s#%s" % (fd['hdfs_uri'], fd['name']) for fd in file_dicts] # index by type all_files = {} for fd in self._files: all_files.setdefault(fd.get('upload'), []).append(fd) if 'file' in all_files: args.append('-files') args.append(','.join(escaped_paths(all_files['file']))) if 'archive' in all_files: args.append('-archives') args.append(','.join(escaped_paths(all_files['archive']))) else: for file_dict in self._files: if file_dict.get('upload') == 'file': args.append('-cacheFile') args.append( '%s#%s' % (file_dict['hdfs_uri'], file_dict['name'])) elif file_dict.get('upload') == 'archive': args.append('-cacheArchive') args.append( '%s#%s' % (file_dict['hdfs_uri'], file_dict['name'])) return args def _invoke_hadoop(self, args, ok_returncodes=None, ok_stderr=None, return_stdout=False): """Run the given hadoop command, raising an exception on non-zero return code. This only works for commands whose output we don't care about. Args: ok_returncodes -- a list/tuple/set of return codes we expect to get back from hadoop (e.g. [0,1]). By default, we only expect 0. If we get an unexpected return code, we raise a CalledProcessError. ok_stderr -- don't log STDERR or raise CalledProcessError if stderr matches a regex in this list (even if the returncode is bad) return_stdout -- return the stdout from the hadoop command rather than logging it. If this is False, we return the returncode instead. 
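# Roughly what _mapper_args()/_combiner_args() above produce for step 0 of a
# job (the script name is a placeholder), and how the combiner fallback in
# _run_job_in_hadoop() wraps them when the Hadoop version has no streaming
# -combiner support:
mapper_cmd = 'python mr_your_job.py --step-num=0 --mapper'
combiner_cmd = 'python mr_your_job.py --step-num=0 --combiner'
wrapped_mapper = "bash -c '%s | sort | %s'" % (mapper_cmd, combiner_cmd)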
""" args = self._opts['hadoop_bin'] + args log.debug('> %s' % cmd_line(args)) proc = Popen(args, stdout=PIPE, stderr=PIPE) stdout, stderr = proc.communicate() log_func = log.debug if proc.returncode == 0 else log.error if not return_stdout: for line in StringIO(stdout): log_func('STDOUT: ' + line.rstrip('\r\n')) # check if STDERR is okay stderr_is_ok = False if ok_stderr: for stderr_re in ok_stderr: if stderr_re.match(stderr): stderr_is_ok = True break if not stderr_is_ok: for line in StringIO(stderr): log_func('STDERR: ' + line.rstrip('\r\n')) ok_returncodes = ok_returncodes or [0] if not stderr_is_ok and proc.returncode not in ok_returncodes: raise CalledProcessError(proc.returncode, args) if return_stdout: return stdout else: return proc.returncode def _cleanup_local_scratch(self): super(HadoopJobRunner, self)._cleanup_local_scratch() if self._hdfs_tmp_dir: log.info('deleting %s from HDFS' % self._hdfs_tmp_dir) try: self._invoke_hadoop(['fs', '-rmr', self._hdfs_tmp_dir]) except Exception, e: log.exception(e) ### LOG FETCHING/PARSING ### def _enforce_path_regexp(self, paths, regexp, step_nums): """Helper for log fetching functions to filter out unwanted logs. Keyword arguments are checked against their corresponding regex groups. """ for path in paths: m = regexp.match(path) if (m and (step_nums is None or int(m.group('step_num')) in step_nums) and (self._job_timestamp is None or m.group('timestamp') == self._job_timestamp)): yield path def _ls_logs(self, relative_path): """List logs on the local filesystem by path relative to log root directory """ return self.ls(os.path.join(self._hadoop_log_dir, relative_path)) def _fetch_counters(self, step_nums, skip_s3_wait=False): """Read Hadoop counters from local logs. Args: step_nums -- the steps belonging to us, so that we can ignore errors from other jobs run with the same timestamp """ job_logs = self._enforce_path_regexp(self._ls_logs('history/'), HADOOP_JOB_LOG_URI_RE, step_nums) uris = list(job_logs) new_counters = scan_for_counters_in_files(uris, self, self.get_hadoop_version()) # only include steps relevant to the current job for step_num in step_nums: self._counters.append(new_counters.get(step_num, {})) def counters(self): return self._counters def _find_probable_cause_of_failure(self, step_nums): all_task_attempt_logs = [] try: all_task_attempt_logs.extend(self._ls_logs('userlogs/')) except IOError: # sometimes the master doesn't have these pass # TODO: get these logs from slaves if possible task_attempt_logs = self._enforce_path_regexp(all_task_attempt_logs, TASK_ATTEMPTS_LOG_URI_RE, step_nums) step_logs = self._enforce_path_regexp(self._ls_logs('steps/'), STEP_LOG_URI_RE, step_nums) job_logs = self._enforce_path_regexp(self._ls_logs('history/'), HADOOP_JOB_LOG_URI_RE, step_nums) log.info('Scanning logs for probable cause of failure') return scan_logs_in_order(task_attempt_logs=task_attempt_logs, step_logs=step_logs, job_logs=job_logs, runner=self) ### FILESYSTEM STUFF ### def du(self, path_glob): """Get the size of a file, or None if it's not a file or doesn't exist.""" if not is_uri(path_glob): return super(HadoopJobRunner, self).du(path_glob) stdout = self._invoke_hadoop(['fs', '-dus', path_glob], return_stdout=True) try: return sum(int(line.split()[1]) for line in stdout.split('\n') if line.strip()) except (ValueError, TypeError, IndexError): raise Exception( 'Unexpected output from hadoop fs -du: %r' % stdout) def ls(self, path_glob): if not is_uri(path_glob): for path in super(HadoopJobRunner, self).ls(path_glob): yield path 
return components = urlparse(path_glob) hdfs_prefix = '%s://%s' % (components.scheme, components.netloc) stdout = self._invoke_hadoop( ['fs', '-lsr', path_glob], return_stdout=True, ok_stderr=[HADOOP_LSR_NO_SUCH_FILE]) for line in StringIO(stdout): fields = line.rstrip('\r\n').split() # expect lines like: # -rw-r--r-- 3 dave users 3276 2010-01-13 14:00 /foo/bar if len(fields) < 8: raise Exception('unexpected ls line from hadoop: %r' % line) # ignore directories if fields[0].startswith('d'): continue # not sure if you can have spaces in filenames; just to be safe path = ' '.join(fields[7:]) yield hdfs_prefix + path def _cat_file(self, filename): if is_uri(filename): # stream from HDFS cat_args = self._opts['hadoop_bin'] + ['fs', '-cat', filename] log.debug('> %s' % cmd_line(cat_args)) cat_proc = Popen(cat_args, stdout=PIPE, stderr=PIPE) def stream(): for line in cat_proc.stdout: yield line # there shouldn't be any stderr for line in cat_proc.stderr: log.error('STDERR: ' + line) returncode = cat_proc.wait() if returncode != 0: raise CalledProcessError(returncode, cat_args) return read_file(filename, stream()) else: # read from local filesystem return super(HadoopJobRunner, self)._cat_file(filename) def mkdir(self, path): self._invoke_hadoop( ['fs', '-mkdir', path], ok_stderr=[HADOOP_FILE_EXISTS_RE]) def path_exists(self, path_glob): """Does the given path exist? If dest is a directory (ends with a "/"), we check if there are any files starting with that path. """ if not is_uri(path_glob): return super(HadoopJobRunner, self).path_exists(path_glob) return bool(self._invoke_hadoop(['fs', '-test', '-e', path_glob], ok_returncodes=(0, 1))) def path_join(self, dirname, filename): if is_uri(dirname): return posixpath.join(dirname, filename) else: return os.path.join(dirname, filename) def rm(self, path_glob): if not is_uri(path_glob): super(HadoopJobRunner, self).rm(path_glob) if self.path_exists(path_glob): # hadoop fs -rmr will print something like: # Moved to trash: hdfs://hdnamenode:54310/user/dave/asdf # to STDOUT, which we don't care about. # # if we ask to delete a path that doesn't exist, it prints # to STDERR something like: # rmr: # which we can safely ignore self._invoke_hadoop( ['fs', '-rmr', path_glob], return_stdout=True, ok_stderr=[HADOOP_RMR_NO_SUCH_FILE]) def touchz(self, dest): if not is_uri(dest): super(HadoopJobRunner, self).touchz(dest) self._invoke_hadoop(['fs', '-touchz', dest]) mrjob-0.3.3.2/mrjob/inline.py0000664€q(¼€tzÕß0000002227711740642733021571 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2011 Matthew Tai # Copyright 2011 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Run an MRJob inline by running all mappers and reducers through the same process. Useful for debugging.""" from __future__ import with_statement __author__ = 'Matthew Tai ' import logging import os import shutil import subprocess import sys try: from cStringIO import StringIO StringIO # quiet "redefinition of unused ..." 
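# The "hadoop fs -lsr" line format that ls() in hadoop.py above parses,
# using the example line quoted in that method:
_line = '-rw-r--r--   3 dave users       3276 2010-01-13 14:00 /foo/bar'
_fields = _line.rstrip('\r\n').split()
assert len(_fields) >= 8
assert not _fields[0].startswith('d')        # 'd' would mean a directory
assert ' '.join(_fields[7:]) == '/foo/bar'   # re-join in case of spaces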
warning from pyflakes except ImportError: from StringIO import StringIO from mrjob.conf import combine_dicts from mrjob.conf import combine_local_envs from mrjob.runner import MRJobRunner from mrjob.job import MRJob from mrjob.util import save_current_environment log = logging.getLogger('mrjob.inline') class InlineMRJobRunner(MRJobRunner): """Runs an :py:class:`~mrjob.job.MRJob` without invoking the job as a subprocess, so it's easy to attach a debugger. This is NOT the default way of testing jobs; to more accurately simulate your environment prior to running on Hadoop/EMR, use ``-r local``. It's rare to need to instantiate this class directly (see :py:meth:`~InlineMRJobRunner.__init__` for details). """ alias = 'inline' def __init__(self, mrjob_cls=None, **kwargs): """:py:class:`~mrjob.inline.InlineMRJobRunner` takes the same keyword args as :py:class:`~mrjob.runner.MRJobRunner`. However, please note: * *hadoop_extra_args*, *hadoop_input_format*, *hadoop_output_format*, and *hadoop_streaming_jar*, *jobconf*, and *partitioner* are ignored because they require Java. If you need to test these, consider starting up a standalone Hadoop instance and running your job with ``-r hadoop``. * *cmdenv*, *python_bin*, *setup_cmds*, *setup_scripts*, *steps_python_bin*, *upload_archives*, and *upload_files* are ignored because we don't invoke the job as a subprocess or run it in its own directory. """ super(InlineMRJobRunner, self).__init__(**kwargs) assert ((mrjob_cls) is None or issubclass(mrjob_cls, MRJob)) self._mrjob_cls = mrjob_cls self._prev_outfile = None self._final_outfile = None self._counters = [] @classmethod def _opts_combiners(cls): # on windows, PYTHONPATH should use ;, not : return combine_dicts( super(InlineMRJobRunner, cls)._opts_combiners(), {'cmdenv': combine_local_envs}) # options that we ignore because they require real Hadoop IGNORED_HADOOP_OPTS = [ 'hadoop_extra_args', 'hadoop_streaming_jar', 'jobconf' ] # keyword arguments that we ignore that are stored directly in # self._ because they aren't configurable from mrjob.conf # use the version with the underscore to better support grepping our code IGNORED_HADOOP_ATTRS = [ '_hadoop_input_format', '_hadoop_output_format', '_partitioner', ] # options that we ignore because they involve running subprocesses IGNORED_LOCAL_OPTS = [ 'python_bin', 'setup_cmds', 'setup_scripts', 'steps_python_bin', 'upload_archives', 'upload_files', ] def _run(self): self._setup_output_dir() assert self._script # shouldn't be able to run if no script default_opts = self.get_default_opts() for ignored_opt in self.IGNORED_HADOOP_OPTS: if self._opts[ignored_opt] != default_opts[ignored_opt]: log.warning('ignoring %s option (requires real Hadoop): %r' % (ignored_opt, self._opts[ignored_opt])) for ignored_attr in self.IGNORED_HADOOP_ATTRS: value = getattr(self, ignored_attr) if value is not None: log.warning( 'ignoring %s keyword arg (requires real Hadoop): %r' % (ignored_attr[1:], value)) for ignored_opt in self.IGNORED_LOCAL_OPTS: if self._opts[ignored_opt] != default_opts[ignored_opt]: log.warning('ignoring %s option (use -r local instead): %r' % (ignored_opt, self._opts[ignored_opt])) with save_current_environment(): # set cmdenv variables os.environ.update(self._get_cmdenv()) # run mapper, sort, reducer for each step for step_number, step_name in enumerate(self._get_steps()): self._invoke_inline_mrjob(step_number, 'step-%d-mapper' % step_number, is_mapper=True, has_combiner=('C' in step_name)) if 'R' in step_name: mapper_output_path = self._prev_outfile 
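# The reduce input prepared just below comes from running the system "sort"
# over the mapper's output file with LC_ALL=C (plain byte order). A rough
# standalone equivalent of that step (paths invented):
import subprocess

def sort_for_reducer(mapper_out_path, sorted_path):
    with open(sorted_path, 'w') as out:
        subprocess.check_call(['sort', mapper_out_path],
                              stdout=out, env={'LC_ALL': 'C'})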
sorted_mapper_output_path = self._decide_output_path( 'step-%d-mapper-sorted' % step_number) with open(sorted_mapper_output_path, 'w') as sort_out: proc = subprocess.Popen( ['sort', mapper_output_path], stdout=sort_out, env={'LC_ALL': 'C'}) proc.wait() # This'll read from sorted_mapper_output_path self._invoke_inline_mrjob(step_number, 'step-%d-reducer' % step_number, is_reducer=True) # move final output to output directory self._final_outfile = os.path.join(self._output_dir, 'part-00000') log.info('Moving %s -> %s' % (self._prev_outfile, self._final_outfile)) shutil.move(self._prev_outfile, self._final_outfile) def _get_steps(self): """Redefine this so that we can get step descriptions without calling a subprocess.""" job_args = ['--steps'] + self._mr_job_extra_args(local=True) return self._mrjob_cls(args=job_args)._steps_desc() def _invoke_inline_mrjob(self, step_number, outfile_name, is_mapper=False, is_reducer=False, is_combiner=False, has_combiner=False, child_stdin=None): child_stdin = child_stdin or sys.stdin common_args = (['--step-num=%d' % step_number] + self._mr_job_extra_args(local=True)) if is_mapper: child_args = ( ['--mapper'] + self._decide_input_paths() + common_args) elif is_reducer: child_args = ( ['--reducer'] + self._decide_input_paths() + common_args) elif is_combiner: child_args = ['--combiner'] + common_args + ['-'] child_instance = self._mrjob_cls(args=child_args) # Use custom stdin if has_combiner: child_stdout = StringIO() else: outfile = self._decide_output_path(outfile_name) child_stdout = open(outfile, 'w') child_instance.sandbox(stdin=child_stdin, stdout=child_stdout) child_instance.execute() if has_combiner: sorted_lines = sorted(child_stdout.getvalue().splitlines()) combiner_stdin = StringIO('\n'.join(sorted_lines)) else: child_stdout.flush() child_stdout.close() while len(self._counters) <= step_number: self._counters.append({}) child_instance.parse_counters(self._counters[step_number - 1]) self.print_counters([step_number + 1]) if has_combiner: self._invoke_inline_mrjob(step_number, outfile_name, is_combiner=True, child_stdin=combiner_stdin) combiner_stdin.close() def counters(self): return self._counters def _decide_input_paths(self): # decide where to get input if self._prev_outfile is not None: input_paths = [self._prev_outfile] else: input_paths = [] for path in self._input_paths: if path == '-': input_paths.append(self._dump_stdin_to_local_file()) else: input_paths.append(path) return input_paths def _decide_output_path(self, outfile_name): # run the mapper outfile = os.path.join(self._get_local_tmp_dir(), outfile_name) log.info('writing to %s' % outfile) log.debug('') self._prev_outfile = outfile return outfile def _setup_output_dir(self): if not self._output_dir: self._output_dir = os.path.join( self._get_local_tmp_dir(), 'output') if not os.path.isdir(self._output_dir): log.debug('Creating output directory %s' % self._output_dir) self.mkdir(self._output_dir) mrjob-0.3.3.2/mrjob/job.py0000664€q(¼€tzÕß0000024362511740642733021067 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp and Contributors # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
# See the License for the specific language governing permissions and # limitations under the License. """To create your own map reduce job, subclass :py:class:`MRJob`, create a series of mappers and reducers, and override :py:meth:`~mrjob.job.MRJob.steps`. For example, a word counter:: from mrjob.job import MRJob class MRWordCounter(MRJob): def get_words(self, key, line): for word in line.split(): yield word, 1 def sum_words(self, word, occurrences): yield word, sum(occurrences) def steps(self): return [self.mr(self.get_words, self.sum_words),] if __name__ == '__main__': MRWordCounter.run() The two lines at the bottom are mandatory; this is what allows your class to be run by Hadoop streaming. This will take in a file with lines of whitespace separated words, and output a file with tab-separated lines like: ``"stars"\t5``. For one-step jobs, you can also just redefine :py:meth:`~mrjob.job.MRJob.mapper` and :py:meth:`~mrjob.job.MRJob.reducer`:: from mrjob.job import MRJob class MRWordCounter(MRJob): def mapper(self, key, line): for word in line.split(): yield word, 1 def reducer(self, word, occurrences): yield word, sum(occurrences) if __name__ == '__main__': MRWordCounter.run() To test the job locally, just run: ``python your_mr_job_sub_class.py < log_file_or_whatever > output`` The script will automatically invoke itself to run the various steps, using :py:class:`~mrjob.local.LocalMRJobRunner`. You can also run individual steps:: # test 1st step mapper: python your_mr_job_sub_class.py --mapper # test 2nd step reducer (--step-num=1 because step numbers are 0-indexed): python your_mr_job_sub_class.py --reducer --step-num=1 By default, we read from stdin, but you can also specify one or more input files. It automatically decompresses .gz and .bz2 files:: python your_mr_job_sub_class.py log_01.gz log_02.bz2 log_03 You can run on Amazon Elastic MapReduce by specifying ``-r emr`` or on your own Hadoop cluster by specifying ``-r hadoop``: ``python your_mr_job_sub_class.py -r emr`` Use :py:meth:`~mrjob.job.MRJob.make_runner` to run an :py:class:`~mrjob.job.MRJob` from another script:: from __future__ import with_statement # only needed on Python 2.5 mr_job = MRWordCounter(args=['-r', 'emr']) with mr_job.make_runner() as runner: runner.run() for line in runner.stream_output(): key, value = mr_job.parse_output_line(line) ... # do something with the parsed output See :py:mod:`mrjob.examples` for more examples. """ # don't add imports here that aren't part of the standard Python library, # since MRJobs need to run in Amazon's generic EMR environment from __future__ import with_statement import inspect import itertools import logging from optparse import Option from optparse import OptionParser from optparse import OptionGroup from optparse import OptionError import sys import time try: from cStringIO import StringIO StringIO # quiet "redefinition of unused ..." 
warning from pyflakes except ImportError: from StringIO import StringIO # don't use relative imports, to allow this script to be invoked as __main__ from mrjob.conf import combine_dicts from mrjob.parse import parse_port_range_list from mrjob.parse import parse_mr_job_stderr from mrjob.parse import parse_key_value_list from mrjob.protocol import DEFAULT_PROTOCOL from mrjob.protocol import JSONProtocol from mrjob.protocol import PROTOCOL_DICT from mrjob.protocol import RawValueProtocol from mrjob.runner import CLEANUP_CHOICES from mrjob.util import log_to_null from mrjob.util import log_to_stream from mrjob.util import parse_and_save_options from mrjob.util import read_input log = logging.getLogger('mrjob.job') # all the parameters you can specify when definining a job step _JOB_STEP_PARAMS = ( 'combiner', 'combiner_init', 'combiner_final', 'mapper', 'mapper_init', 'mapper_final', 'reducer', 'reducer_init', 'reducer_final', ) # used by mr() below, to fake no mapper def _IDENTITY_MAPPER(key, value): yield key, value # sentinel value; used when running MRJob as a script _READ_ARGS_FROM_SYS_ARGV = '_READ_ARGS_FROM_SYS_ARGV' # The former custom option class has been removed and this stub will disappear # permanently in mrjob 0.4. MRJobOptions = Option class UsageError(Exception): pass class MRJob(object): """The base class for all MapReduce jobs. See :py:meth:`__init__` for details.""" #: :py:class:`optparse.Option` subclass to use with the #: :py:class:`optparse.OptionParser` instance. OPTION_CLASS = Option def __init__(self, args=None): """Entry point for running your job from other Python code. You can pass in command-line arguments, and the job will act the same way it would if it were run from the command line. For example, to run your job on EMR:: mr_job = MRYourJob(args=['-r', 'emr']) with mr_job.make_runner() as runner: ... Passing in ``None`` is the same as passing in ``[]`` (if you want to parse args from ``sys.argv``, call :py:meth:`MRJob.run`). For a full list of command-line arguments, run: ``python -m mrjob.job --help`` """ # make sure we respect the $TZ (time zone) environment variable if hasattr(time, 'tzset'): time.tzset() self._passthrough_options = [] self._file_options = [] usage = "usage: %prog [options] [input files]" self.option_parser = OptionParser(usage=usage, option_class=self.OPTION_CLASS) self.configure_options() # don't pass None to parse_args unless we're actually running # the MRJob script if args is _READ_ARGS_FROM_SYS_ARGV: self._cl_args = sys.argv[1:] else: # don't pass sys.argv to self.option_parser, and have it # raise an exception on error rather than printing to stderr # and exiting. self._cl_args = args or [] def error(msg): raise ValueError(msg) self.option_parser.error = error self.load_options(self._cl_args) # Make it possible to redirect stdin, stdout, and stderr, for testing # See sandbox(), below. self.stdin = sys.stdin self.stdout = sys.stdout self.stderr = sys.stderr ### Defining one-step jobs ### def mapper(self, key, value): """Re-define this to define the mapper for a one-step job. Yields zero or more tuples of ``(out_key, out_value)``. :param key: A value parsed from input. :param value: A value parsed from input. If you don't re-define this, your job will have a mapper that simply yields ``(key, value)`` as-is. By default (if you don't mess with :ref:`job-protocols`): - ``key`` will be ``None`` - ``value`` will be the raw input line, with newline stripped. 
- ``out_key`` and ``out_value`` must be JSON-encodable: numeric, unicode, boolean, ``None``, list, or dict whose keys are unicodes. """ raise NotImplementedError def reducer(self, key, values): """Re-define this to define the reducer for a one-step job. Yields one or more tuples of ``(out_key, out_value)`` :param key: A key which was yielded by the mapper :param value: A generator which yields all values yielded by the mapper which correspond to ``key``. By default (if you don't mess with :ref:`job-protocols`): - ``out_key`` and ``out_value`` must be JSON-encodable. - ``key`` and ``value`` will have been decoded from JSON (so tuples will become lists). """ raise NotImplementedError def combiner(self, key, values): """Re-define this to define the combiner for a one-step job. Yields one or more tuples of ``(out_key, out_value)`` :param key: A key which was yielded by the mapper :param value: A generator which yields all values yielded by one mapper task/node which correspond to ``key``. By default (if you don't mess with :ref:`job-protocols`): - ``out_key`` and ``out_value`` must be JSON-encodable. - ``key`` and ``value`` will have been decoded from JSON (so tuples will become lists). """ raise NotImplementedError def mapper_init(self): """Re-define this to define an action to run before the mapper processes any input. One use for this function is to initialize mapper-specific helper structures. Yields one or more tuples of ``(out_key, out_value)``. By default, ``out_key`` and ``out_value`` must be JSON-encodable; re-define :py:attr:`INTERNAL_PROTOCOL` to change this. """ raise NotImplementedError def mapper_final(self): """Re-define this to define an action to run after the mapper reaches the end of input. One way to use this is to store a total in an instance variable, and output it after reading all input data. See :py:mod:`mrjob.examples` for an example. Yields one or more tuples of ``(out_key, out_value)``. By default, ``out_key`` and ``out_value`` must be JSON-encodable; re-define :py:attr:`INTERNAL_PROTOCOL` to change this. """ raise NotImplementedError def reducer_init(self): """Re-define this to define an action to run before the reducer processes any input. One use for this function is to initialize reducer-specific helper structures. Yields one or more tuples of ``(out_key, out_value)``. By default, ``out_key`` and ``out_value`` must be JSON-encodable; re-define :py:attr:`INTERNAL_PROTOCOL` to change this. """ raise NotImplementedError def reducer_final(self): """Re-define this to define an action to run after the reducer reaches the end of input. Yields one or more tuples of ``(out_key, out_value)``. By default, ``out_key`` and ``out_value`` must be JSON-encodable; re-define :py:attr:`INTERNAL_PROTOCOL` to change this. """ raise NotImplementedError def combiner_init(self): """Re-define this to define an action to run before the combiner processes any input. One use for this function is to initialize combiner-specific helper structures. Yields one or more tuples of ``(out_key, out_value)``. By default, ``out_key`` and ``out_value`` must be JSON-encodable; re-define :py:attr:`INTERNAL_PROTOCOL` to change this. """ raise NotImplementedError def combiner_final(self): """Re-define this to define an action to run after the combiner reaches the end of input. Yields one or more tuples of ``(out_key, out_value)``. By default, ``out_key`` and ``out_value`` must be JSON-encodable; re-define :py:attr:`INTERNAL_PROTOCOL` to change this. 
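# A short sketch of the *_init/*_final hooks documented above: reducer_init()
# sets up per-task state once, and reducer_final() emits a summary record
# after the last key. The stop-word list and class name are made up.
from mrjob.job import MRJob

class MRCountNonStopWords(MRJob):

    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    def reducer_init(self):
        self.stop_words = set(['the', 'a', 'an'])
        self.distinct_words = 0

    def reducer(self, word, counts):
        if word not in self.stop_words:
            self.distinct_words += 1
            yield word, sum(counts)

    def reducer_final(self):
        yield '*distinct non-stop words*', self.distinct_words

if __name__ == '__main__':
    MRCountNonStopWords.run()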
""" raise NotImplementedError ### Defining multi-step jobs ### def steps(self): """Re-define this to make a multi-step job. If you don't re-define this, we'll automatically create a one-step job using any of :py:meth:`mapper`, :py:meth:`mapper_init`, :py:meth:`mapper_final`, :py:meth:`reducer_init`, :py:meth:`reducer_final`, and :py:meth:`reducer` that you've re-defined. For example:: def steps(self): return [self.mr(mapper=self.transform_input, reducer=self.consolidate_1), self.mr(reducer_init=self.log_mapper_init, reducer=self.consolidate_2)] :return: a list of steps constructed with :py:meth:`mr` """ # Use mapper(), reducer() etc. only if they've been re-defined kwargs = dict((func_name, getattr(self, func_name)) for func_name in _JOB_STEP_PARAMS if (getattr(self, func_name).im_func is not getattr(MRJob, func_name).im_func)) return [self.mr(**kwargs)] @classmethod def mr(cls, mapper=None, reducer=None, _mapper_final=None, **kwargs): """Define a step (mapper, reducer, and/or any combination of mapper_init, reducer_final, etc.) for your job. Used by :py:meth:`steps`. (Don't re-define this, just call it!) Accepts the following keyword arguments. For convenience, you may specify *mapper* and *reducer* as positional arguments as well. :param mapper: function with same function signature as :py:meth:`mapper`, or ``None`` for an identity mapper. :param reducer: function with same function signature as :py:meth:`reducer`, or ``None`` for no reducer. :param combiner: function with same function signature as :py:meth:`combiner`, or ``None`` for no combiner. :param mapper_init: function with same function signature as :py:meth:`mapper_init`, or ``None`` for no initial mapper action. :param mapper_final: function with same function signature as :py:meth:`mapper_final`, or ``None`` for no final mapper action. :param reducer_init: function with same function signature as :py:meth:`reducer_init`, or ``None`` for no initial reducer action. :param reducer_final: function with same function signature as :py:meth:`reducer_final`, or ``None`` for no final reducer action. :param combiner_init: function with same function signature as :py:meth:`combiner_init`, or ``None`` for no initial combiner action. :param combiner_final: function with same function signature as :py:meth:`combiner_final`, or ``None`` for no final combiner action. Please consider the way we represent steps to be opaque, and expect it to change in future versions of ``mrjob``. """ # limit which keyword args can be specified bad_kwargs = sorted(set(kwargs) - set(_JOB_STEP_PARAMS)) if bad_kwargs: raise TypeError( 'mr() got an unexpected keyword argument %r' % bad_kwargs[0]) # handle incorrect usage of positional args. This was wrong in mrjob # v0.2 as well, but we didn't issue a warning. if _mapper_final is not None: if 'mapper_final' in kwargs: raise TypeError("mr() got multiple values for keyword argument" " 'mapper_final'") else: log.warn( 'mapper_final should be specified as a keyword argument to' ' mr(), not a positional argument. 
This will be required' ' in mrjob 0.4.') kwargs['mapper_final'] = _mapper_final step = dict((f, None) for f in _JOB_STEP_PARAMS) step['mapper'] = mapper step['reducer'] = reducer step.update(kwargs) if not any(step.itervalues()): raise Exception("Step has no mappers and no reducers") # Hadoop streaming requires a mapper, so patch in _IDENTITY_MAPPER step['mapper'] = step['mapper'] or _IDENTITY_MAPPER return step def increment_counter(self, group, counter, amount=1): """Increment a counter in Hadoop streaming by printing to stderr. :type group: str :param group: counter group :type counter: str :param counter: description of the counter :type amount: int :param amount: how much to increment the counter by Commas in ``counter`` or ``group`` will be automatically replaced with semicolons (commas confuse Hadoop streaming). """ # don't allow people to pass in floats if not isinstance(amount, (int, long)): raise TypeError('amount must be an integer, not %r' % (amount,)) # Extra commas screw up hadoop and there's no way to escape them. So # replace them with the next best thing: semicolons! # # cast to str() because sometimes people pass in exceptions or whatever # # The relevant Hadoop code is incrCounter(), here: # http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeMapRed.java?view=markup group = str(group).replace(',', ';') counter = str(counter).replace(',', ';') self.stderr.write('reporter:counter:%s,%s,%d\n' % (group, counter, amount)) self.stderr.flush() def set_status(self, msg): """Set the job status in hadoop streaming by printing to stderr. This is also a good way of doing a keepalive for a job that goes a long time between outputs; Hadoop streaming usually times out jobs that give no output for longer than 10 minutes. """ self.stderr.write('reporter:status:%s\n' % (msg,)) self.stderr.flush() ### Running the job ### @classmethod def run(cls): """Entry point for running job from the command-line. This is also the entry point when a mapper or reducer is run by Hadoop Streaming. Does one of: * Print step information (:option:`--steps`). See :py:meth:`show_steps` * Run a mapper (:option:`--mapper`). See :py:meth:`run_mapper` * Run a combiner (:option:`--combiner`). See :py:meth:`run_combiner` * Run a reducer (:option:`--reducer`). See :py:meth:`run_reducer` * Run the entire job. See :py:meth:`run_job` """ # load options from the command line mr_job = cls(args=_READ_ARGS_FROM_SYS_ARGV) mr_job.execute() def execute(self): if self.options.show_steps: self.show_steps() elif self.options.run_mapper: self.run_mapper(self.options.step_num) elif self.options.run_combiner: self.run_combiner(self.options.step_num) elif self.options.run_reducer: self.run_reducer(self.options.step_num) else: self.run_job() def make_runner(self): """Make a runner based on command-line arguments, so we can launch this job on EMR, on Hadoop, or locally. :rtype: :py:class:`mrjob.runner.MRJobRunner` """ bad_words = ( '--steps', '--mapper', '--reducer', '--combiner', '--step-num') for w in bad_words: if w in sys.argv: raise UsageError("make_runner() was called with %s. This" " probably means you tried to use it from" " __main__, which doesn't work." 
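# Typical use of increment_counter()/set_status() defined above, from inside
# a step of a hypothetical job:
from mrjob.job import MRJob

class MRCounterDemo(MRJob):

    def mapper(self, _, line):
        if not line.strip():
            # shows up under the 'profile' group in the job's counters
            self.increment_counter('profile', 'blank lines')
            return
        self.set_status('still processing input')  # doubles as a keepalive
        yield line.split()[0], 1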
% w) # have to import here so that we can still run the MRJob # without importing boto from mrjob.emr import EMRJobRunner from mrjob.hadoop import HadoopJobRunner from mrjob.local import LocalMRJobRunner from mrjob.inline import InlineMRJobRunner if self.options.runner == 'emr': return EMRJobRunner(**self.emr_job_runner_kwargs()) elif self.options.runner == 'hadoop': return HadoopJobRunner(**self.hadoop_job_runner_kwargs()) elif self.options.runner == 'inline': return InlineMRJobRunner( mrjob_cls=self.__class__, **self.inline_job_runner_kwargs()) else: # run locally by default return LocalMRJobRunner(**self.local_job_runner_kwargs()) @classmethod def set_up_logging(cls, quiet=False, verbose=False, stream=None): """Set up logging when running from the command line. This is also used by the various command-line utilities. :param bool quiet: If true, don't log. Overrides *verbose*. :param bool verbose: If true, set log level to ``DEBUG`` (default is ``INFO``) :param bool stream: Stream to log to (default is ``sys.stderr``) This will also set up a null log handler for boto, so we don't get warnings if boto tries to log about throttling and whatnot. """ if quiet: log_to_null(name='mrjob') else: log_to_stream(name='mrjob', debug=verbose, stream=stream) log_to_null(name='boto') def run_job(self): """Run the all steps of the job, logging errors (and debugging output if :option:`--verbose` is specified) to STDERR and streaming the output to STDOUT. Called from :py:meth:`run`. You'd probably only want to call this directly from automated tests. """ self.set_up_logging(quiet=self.options.quiet, verbose=self.options.verbose, stream=self.stderr) with self.make_runner() as runner: runner.run() if not self.options.no_output: for line in runner.stream_output(): self.stdout.write(line) self.stdout.flush() def run_mapper(self, step_num=0): """Run the mapper and final mapper action for the given step. :type step_num: int :param step_num: which step to run (0-indexed) If we encounter a line that can't be decoded by our input protocol, or a tuple that can't be encoded by our output protocol, we'll increment a counter rather than raising an exception. If --strict-protocols is set, then an exception is raised Called from :py:meth:`run`. You'd probably only want to call this directly from automated tests. """ steps = self.steps() if not 0 <= step_num < len(steps): raise ValueError('Out-of-range step: %d' % step_num) step = steps[step_num] mapper = step['mapper'] mapper_init = step['mapper_init'] mapper_final = step['mapper_final'] # pick input and output protocol read_lines, write_line = self._wrap_protocols(step_num, 'M') if mapper_init: for out_key, out_value in mapper_init() or (): write_line(out_key, out_value) # run the mapper on each line for key, value in read_lines(): for out_key, out_value in mapper(key, value) or (): write_line(out_key, out_value) if mapper_final: for out_key, out_value in mapper_final() or (): write_line(out_key, out_value) def run_reducer(self, step_num=0): """Run the reducer for the given step. :type step_num: int :param step_num: which step to run (0-indexed) If we encounter a line that can't be decoded by our input protocol, or a tuple that can't be encoded by our output protocol, we'll increment a counter rather than raising an exception. If --strict-protocols is set, then an exception is raised Called from :py:meth:`run`. You'd probably only want to call this directly from automated tests. 
""" steps = self.steps() if not 0 <= step_num < len(steps): raise ValueError('Out-of-range step: %d' % step_num) step = steps[step_num] reducer = step['reducer'] reducer_init = step['reducer_init'] reducer_final = step['reducer_final'] if reducer is None: raise ValueError('No reducer in step %d' % step_num) # pick input and output protocol read_lines, write_line = self._wrap_protocols(step_num, 'R') if reducer_init: for out_key, out_value in reducer_init() or (): write_line(out_key, out_value) # group all values of the same key together, and pass to the reducer # # be careful to use generators for everything, to allow for # very large groupings of values for key, kv_pairs in itertools.groupby(read_lines(), key=lambda(k, v): k): values = (v for k, v in kv_pairs) for out_key, out_value in reducer(key, values) or (): write_line(out_key, out_value) if reducer_final: for out_key, out_value in reducer_final() or (): write_line(out_key, out_value) def run_combiner(self, step_num=0): """Run the combiner for the given step. :type step_num: int :param step_num: which step to run (0-indexed) If we encounter a line that can't be decoded by our input protocol, or a tuple that can't be encoded by our output protocol, we'll increment a counter rather than raising an exception. If --strict-protocols is set, then an exception is raised Called from :py:meth:`run`. You'd probably only want to call this directly from automated tests. """ steps = self.steps() if not 0 <= step_num < len(steps): raise ValueError('Out-of-range step: %d' % step_num) step = steps[step_num] combiner = step['combiner'] combiner_init = step['combiner_init'] combiner_final = step['combiner_final'] if combiner is None: raise ValueError('No combiner in step %d' % step_num) # pick input and output protocol read_lines, write_line = self._wrap_protocols(step_num, 'C') if combiner_init: for out_key, out_value in combiner_init() or (): write_line(out_key, out_value) # group all values of the same key together, and pass to the combiner # # be careful to use generators for everything, to allow for # very large groupings of values for key, kv_pairs in itertools.groupby(read_lines(), key=lambda(k, v): k): values = (v for k, v in kv_pairs) for out_key, out_value in combiner(key, values) or (): write_line(out_key, out_value) if combiner_final: for out_key, out_value in combiner_final() or (): write_line(out_key, out_value) def show_steps(self): """Print information about how many steps there are, and whether they contain a mapper or reducer. Job runners (see :doc:`runners`) use this to determine how Hadoop should call this script. Called from :py:meth:`run`. You'd probably only want to call this directly from automated tests. We currently output something like ``MR M R``, but expect this to change! 
""" print >> self.stdout, ' '.join(self._steps_desc()) def _steps_desc(self): res = [] for step_num, step in enumerate(self.steps()): mapper_funcs = ('mapper_init', 'mapper_final') reducer_funcs = ('reducer', 'reducer_init', 'reducer_final') combiner_funcs = ('combiner', 'combiner_init', 'combiner_final') has_explicit_mapper = (step['mapper'] != _IDENTITY_MAPPER or any(step[k] for k in mapper_funcs)) has_explicit_reducer = any(step[k] for k in reducer_funcs) has_explicit_combiner = any(step[k] for k in combiner_funcs) func_strs = [] # Print a mapper if: # - The user specifies one # - Different input and output protocols are used (infer from # step number) # - We don't have anything else to print (excluding combiners) if has_explicit_mapper \ or step_num == 0 \ or not has_explicit_reducer: func_strs.append('M') if has_explicit_combiner: func_strs.append('C') if has_explicit_reducer: func_strs.append('R') res.append(''.join(func_strs)) return res @classmethod def mr_job_script(cls): """Path of this script. This returns the file containing this class.""" return inspect.getsourcefile(cls) ### Other useful utilities ### def _read_input(self): """Read from stdin, or one more files, or directories. Yield one line at time. - Resolve globs (``foo_*.gz``). - Decompress ``.gz`` and ``.bz2`` files. - If path is ``-``, read from STDIN. - Recursively read all files in a directory """ paths = self.args or ['-'] for path in paths: for line in read_input(path, stdin=self.stdin): yield line def _wrap_protocols(self, step_num, step_type): """Pick the protocol classes to use for reading and writing for the given step, and wrap them so that bad input and output trigger a counter rather than an exception unless --strict-protocols is set. Returns a tuple of read_lines, write_line read_lines() is a function that reads lines from input, decodes them, and yields key, value pairs write_line() is a function that takes key and value as args, encodes them, and writes a line to output. Args: step_num -- which step to run (e.g. 0) step_type -- 'M' for mapper, 'C' for combiner, 'R' for reducer """ read, write = self.pick_protocols(step_num, step_type) def read_lines(): for line in self._read_input(): try: key, value = read(line.rstrip('\r\n')) yield key, value except Exception, e: if self.options.strict_protocols: raise else: self.increment_counter('Undecodable input', e.__class__.__name__) def write_line(key, value): try: print >> self.stdout, write(key, value) except Exception, e: if self.options.strict_protocols: raise else: self.increment_counter('Unencodable output', e.__class__.__name__) return read_lines, write_line def pick_protocols(self, step_num, step_type): """Pick the protocol classes to use for reading and writing for the given step. :type step_num: int :param step_num: which step to run (e.g. ``0`` for the first step) :type step_type: str :param step_type: ``'M'`` for mapper, ``'C'`` for combiner, ``'R'`` for reducer :return: (read_function, write_function) By default, we use one protocol for reading input, one internal protocol for communication between steps, and one protocol for final output (which is usually the same as the internal protocol). Protocols can be controlled by setting :py:attr:`INPUT_PROTOCOL`, :py:attr:`INTERNAL_PROTOCOL`, and :py:attr:`OUTPUT_PROTOCOL`. Re-define this if you need fine control over which protocols are used by which steps. 
""" steps_desc = self._steps_desc() # pick input protocol if step_num == 0 and step_type == steps_desc[0][0]: read = self.input_protocol().read else: read = self.internal_protocol().read if step_num == len(steps_desc) - 1 and step_type == steps_desc[-1][-1]: write = self.output_protocol().write else: write = self.internal_protocol().write return read, write ### Command-line arguments ### def configure_options(self): """Define arguments for this script. Called from :py:meth:`__init__()`. Run ``python -m mrjob.job.MRJob --help`` to see all options. Re-define to define custom command-line arguments:: def configure_options(self): super(MRYourJob, self).configure_options self.add_passthrough_option(...) self.add_file_option(...) ... """ # To describe the steps self.option_parser.add_option( '--steps', dest='show_steps', action='store_true', default=False, help='show the steps of mappers and reducers') # To run mappers or reducers self.mux_opt_group = OptionGroup( self.option_parser, 'Running specific parts of the job') self.option_parser.add_option_group(self.mux_opt_group) self.mux_opt_group.add_option( '--mapper', dest='run_mapper', action='store_true', default=False, help='run a mapper') self.mux_opt_group.add_option( '--combiner', dest='run_combiner', action='store_true', default=False, help='run a combiner') self.mux_opt_group.add_option( '--reducer', dest='run_reducer', action='store_true', default=False, help='run a reducer') self.mux_opt_group.add_option( '--step-num', dest='step_num', type='int', default=0, help='which step to execute (default is 0)') # protocol stuff protocol_choices = sorted(self.protocols()) self.proto_opt_group = OptionGroup( self.option_parser, 'Protocols') self.option_parser.add_option_group(self.proto_opt_group) self.add_passthrough_option( '--input-protocol', dest='input_protocol', opt_group=self.proto_opt_group, default=None, choices=protocol_choices, help=('DEPRECATED: protocol to read input with (default:' ' raw_value)')) self.add_passthrough_option( '--output-protocol', dest='output_protocol', opt_group=self.proto_opt_group, default=self.DEFAULT_OUTPUT_PROTOCOL, choices=protocol_choices, help='DEPRECATED: protocol for final output (default: %s)' % ( 'same as --protocol' if self.DEFAULT_OUTPUT_PROTOCOL is None else '%default')) self.add_passthrough_option( '-p', '--protocol', dest='protocol', opt_group=self.proto_opt_group, default=None, choices=protocol_choices, help=('DEPRECATED: output protocol for mappers/reducers. Choices:' ' %s (default: json)' % ', '.join(protocol_choices))) self.add_passthrough_option( '--strict-protocols', dest='strict_protocols', default=None, opt_group=self.proto_opt_group, action='store_true', help='If something violates an input/output ' 'protocol then raise an exception') # options for running the entire job self.runner_opt_group = OptionGroup( self.option_parser, 'Running the entire job') self.option_parser.add_option_group(self.runner_opt_group) self.runner_opt_group.add_option( '--archive', dest='upload_archives', action='append', default=[], help=('Unpack archive in the working directory of this script. You' ' can use --archive multiple times.')) self.runner_opt_group.add_option( '--bootstrap-mrjob', dest='bootstrap_mrjob', action='store_true', default=None, help=("Automatically tar up the mrjob library and install it when" " we run the mrjob. This is the default. 
Use" " --no-bootstrap-mrjob if you've already installed mrjob on" " your Hadoop cluster.")) self.runner_opt_group.add_option( '-c', '--conf-path', dest='conf_path', default=None, help='Path to alternate mrjob.conf file to read from') self.runner_opt_group.add_option( '--cleanup', dest='cleanup', default=None, help=('Comma-separated list of which directories to delete when' ' a job succeeds, e.g. SCRATCH,LOGS. Choices:' ' %s (default: ALL)' % ', '.join(CLEANUP_CHOICES))) self.runner_opt_group.add_option( '--cleanup-on-failure', dest='cleanup_on_failure', default=None, help=('Comma-separated list of which directories to delete when' ' a job fails, e.g. SCRATCH,LOGS. Choices:' ' %s (default: NONE)' % ', '.join(CLEANUP_CHOICES))) self.runner_opt_group.add_option( '--cmdenv', dest='cmdenv', default=[], action='append', help='set an environment variable for your job inside Hadoop ' 'streaming. Must take the form KEY=VALUE. You can use --cmdenv ' 'multiple times.') self.runner_opt_group.add_option( '--file', dest='upload_files', action='append', default=[], help=('Copy file to the working directory of this script. You can' ' use --file multiple times.')) self.runner_opt_group.add_option( '--no-bootstrap-mrjob', dest='bootstrap_mrjob', action='store_false', default=None, help=("Don't automatically tar up the mrjob library and install it" " when we run this job. Use this if you've already installed" " mrjob on your Hadoop cluster.")) self.runner_opt_group.add_option( '--no-conf', dest='conf_path', action='store_false', default=None, help="Don't load mrjob.conf even if it's available") self.runner_opt_group.add_option( '--no-output', dest='no_output', default=None, action='store_true', help="Don't stream output after job completion") self.runner_opt_group.add_option( '-o', '--output-dir', dest='output_dir', default=None, help='Where to put final job output. This must be an s3:// URL ' + 'for EMR, an HDFS path for Hadoop, and a system path for local,' + 'and must be empty') self.runner_opt_group.add_option( '--partitioner', dest='partitioner', default=None, help=('Hadoop partitioner class to use to determine how mapper' ' output should be sorted and distributed to reducers. For' ' example: org.apache.hadoop.mapred.lib.HashPartitioner')) self.runner_opt_group.add_option( '--python-archive', dest='python_archives', default=[], action='append', help=('Archive to unpack and add to the PYTHONPATH of the mr_job' ' script when it runs. You can use --python-archives' ' multiple times.')) self.runner_opt_group.add_option( '--python-bin', dest='python_bin', default=None, help=("Name/path of alternate python binary for mappers/reducers." " You can include arguments, e.g. --python-bin 'python -v'")) self.runner_opt_group.add_option( '-q', '--quiet', dest='quiet', default=None, action='store_true', help="Don't print anything to stderr") self.runner_opt_group.add_option( '-r', '--runner', dest='runner', default='local', choices=('local', 'hadoop', 'emr', 'inline'), help=('Where to run the job: local to run locally, hadoop to run' ' on your Hadoop cluster, emr to run on Amazon' ' ElasticMapReduce, and inline for local debugging. Default' ' is local.')) self.runner_opt_group.add_option( '--setup-cmd', dest='setup_cmds', action='append', default=[], help=('A command to run before each mapper/reducer step in the' ' shell (e.g. "cd my-src-tree; make") specified as a string.' ' You can use --setup-cmd more than once. 
Use mrjob.conf to' ' specify arguments as a list to be run directly.')) self.runner_opt_group.add_option( '--setup-script', dest='setup_scripts', action='append', default=[], help=('Path to file to be copied into the local working directory' ' and then run. You can use --setup-script more than once.' ' These are run after setup_cmds.')) self.runner_opt_group.add_option( '--steps-python-bin', dest='steps_python_bin', default=None, help='Name/path of alternate python binary to use to query the ' 'job about its steps, if different from the current Python ' 'interpreter. Rarely needed.') self.runner_opt_group.add_option( '-v', '--verbose', dest='verbose', default=None, action='store_true', help='print more messages to stderr') self.hadoop_opts_opt_group = OptionGroup( self.option_parser, 'Configuring or emulating Hadoop (these apply when you set -r' ' hadoop, -r emr, or -r local)') self.option_parser.add_option_group(self.hadoop_opts_opt_group) self.hadoop_opts_opt_group.add_option( '--hadoop-version', dest='hadoop_version', default=None, help=('Version of Hadoop to specify to EMR or to emulate for -r' ' local. Default is 0.20.')) # for more info about jobconf: # http://hadoop.apache.org/mapreduce/docs/current/mapred-default.html self.hadoop_opts_opt_group.add_option( '--jobconf', dest='jobconf', default=[], action='append', help=('-jobconf arg to pass through to hadoop streaming; should' ' take the form KEY=VALUE. You can use --jobconf multiple' ' times.')) # options common to Hadoop and EMR self.hadoop_emr_opt_group = OptionGroup( self.option_parser, 'Running on Hadoop or EMR (these apply when you set -r hadoop or' ' -r emr)') self.option_parser.add_option_group(self.hadoop_emr_opt_group) self.hadoop_emr_opt_group.add_option( '--hadoop-arg', dest='hadoop_extra_args', default=[], action='append', help='Argument of any type to pass to hadoop ' 'streaming. You can use --hadoop-arg multiple times.') self.hadoop_emr_opt_group.add_option( '--hadoop-input-format', dest='hadoop_input_format', default=None, help=('DEPRECATED: the hadoop InputFormat class used by the first' ' step of your job to read data. Custom formats must be' ' included in your hadoop streaming jar (see' ' --hadoop-streaming-jar). Current best practice is to' ' redefine HADOOP_INPUT_FORMAT or hadoop_input_format()' ' in your job.')) self.hadoop_emr_opt_group.add_option( '--hadoop-output-format', dest='hadoop_output_format', default=None, help=('DEPRECATED: the hadoop OutputFormat class used by the first' ' step of your job to read data. Custom formats must be' ' included in your hadoop streaming jar (see' ' --hadoop-streaming-jar). Current best practice is to' ' redefine HADOOP_OUTPUT_FORMAT or hadoop_output_format()' ' in your job.')) self.hadoop_emr_opt_group.add_option( '--hadoop-streaming-jar', dest='hadoop_streaming_jar', default=None, help='Path of your hadoop streaming jar (locally, or on S3/HDFS)') self.hadoop_emr_opt_group.add_option( '--label', dest='label', default=None, help='custom prefix for job name, to help us identify the job') self.hadoop_emr_opt_group.add_option( '--owner', dest='owner', default=None, help='custom username to use, to help us identify who ran the job') # options for running the job on Hadoop self.hadoop_opt_group = OptionGroup( self.option_parser, 'Running on Hadoop (these apply when you set -r hadoop)') self.option_parser.add_option_group(self.hadoop_opt_group) self.hadoop_opt_group.add_option( '--hadoop-bin', dest='hadoop_bin', default=None, help='hadoop binary. 
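# The deprecated --hadoop-*-format switches above point at the current best
# practice: declare the format on the job class. A minimal sketch
# (hypothetical job; NLineInputFormat is a stock Hadoop class that hands each
# map task a fixed number of input lines):
from mrjob.job import MRJob

class MRNLineJob(MRJob):

    HADOOP_INPUT_FORMAT = 'org.apache.hadoop.mapred.lib.NLineInputFormat'

    def mapper(self, _, line):
        yield 'lines', 1

    def reducer(self, key, counts):
        yield key, sum(counts)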
Defaults to $HADOOP_HOME/bin/hadoop') self.hadoop_opt_group.add_option( '--hdfs-scratch-dir', dest='hdfs_scratch_dir', default=None, help='Scratch space on HDFS (default is tmp/)') # options for running the job on EMR self.emr_opt_group = OptionGroup( self.option_parser, 'Running on Amazon Elastic MapReduce (these apply when you set -r' ' emr)') self.option_parser.add_option_group(self.emr_opt_group) self.emr_opt_group.add_option( '--additional-emr-info', dest='additional_emr_info', default=None, help='A JSON string for selecting additional features on EMR') self.emr_opt_group.add_option( '--ami-version', dest='ami_version', default=None, help=( 'AMI Version to use (currently 1.0, 2.0, or latest).')) self.emr_opt_group.add_option( '--aws-availability-zone', dest='aws_availability_zone', default=None, help='Availability zone to run the job flow on') self.emr_opt_group.add_option( '--aws-region', dest='aws_region', default=None, help='Region to connect to S3 and EMR on (e.g. us-west-1).') self.emr_opt_group.add_option( '--bootstrap-action', dest='bootstrap_actions', action='append', default=[], help=('Raw bootstrap action scripts to run before any of the other' ' bootstrap steps. You can use --bootstrap-action more than' ' once. Local scripts will be automatically uploaded to S3.' ' To add arguments, just use quotes: "foo.sh arg1 arg2"')) self.emr_opt_group.add_option( '--bootstrap-cmd', dest='bootstrap_cmds', action='append', default=[], help=('Commands to run on the master node to set up libraries,' ' etc. You can use --bootstrap-cmd more than once. Use' ' mrjob.conf to specify arguments as a list to be run' ' directly.')) self.emr_opt_group.add_option( '--bootstrap-file', dest='bootstrap_files', action='append', default=[], help=('File to upload to the master node before running' ' bootstrap_cmds (for example, debian packages). These will' ' be made public on S3 due to a limitation of the bootstrap' ' feature. You can use --bootstrap-file more than once.')) self.emr_opt_group.add_option( '--bootstrap-python-package', dest='bootstrap_python_packages', action='append', default=[], help=('Path to a Python module to install on EMR. These should be' ' standard python module tarballs where you can cd into a' ' subdirectory and run ``sudo python setup.py install``. You' ' can use --bootstrap-python-package more than once.')) self.emr_opt_group.add_option( '--bootstrap-script', dest='bootstrap_scripts', action='append', default=[], help=('Script to upload and then run on the master node (a' ' combination of bootstrap_cmds and bootstrap_files). These' ' are run after the command from bootstrap_cmds. You can use' ' --bootstrap-script more than once.')) self.emr_opt_group.add_option( '--check-emr-status-every', dest='check_emr_status_every', default=None, type='int', help='How often (in seconds) to check status of your EMR job') self.emr_opt_group.add_option( '--ec2-instance-type', dest='ec2_instance_type', default=None, help=('Type of EC2 instance(s) to launch (e.g. m1.small,' ' c1.xlarge, m2.xlarge). 
See' ' http://aws.amazon.com/ec2/instance-types/ for the full' ' list.')) self.emr_opt_group.add_option( '--ec2-key-pair', dest='ec2_key_pair', default=None, help='Name of the SSH key pair you set up for EMR') self.emr_opt_group.add_option( '--ec2-key-pair-file', dest='ec2_key_pair_file', default=None, help='Path to file containing SSH key for EMR') # EMR instance types self.emr_opt_group.add_option( '--ec2-core-instance-type', '--ec2-slave-instance-type', dest='ec2_core_instance_type', default=None, help='Type of EC2 instance for core (or "slave") nodes only') self.emr_opt_group.add_option( '--ec2-master-instance-type', dest='ec2_master_instance_type', default=None, help='Type of EC2 instance for master node only') self.emr_opt_group.add_option( '--ec2-task-instance-type', dest='ec2_task_instance_type', default=None, help='Type of EC2 instance for task nodes only') # EMR instance bid prices self.emr_opt_group.add_option( '--ec2-core-instance-bid-price', dest='ec2_core_instance_bid_price', default=None, help=( 'Bid price to specify for core (or "slave") nodes when' ' setting them up as EC2 spot instances (you probably only' ' want to set a bid price for task instances).') ) self.emr_opt_group.add_option( '--ec2-master-instance-bid-price', dest='ec2_master_instance_bid_price', default=None, help=( 'Bid price to specify for the master node when setting it up ' 'as an EC2 spot instance (you probably only want to set ' 'a bid price for task instances).') ) self.emr_opt_group.add_option( '--ec2-task-instance-bid-price', dest='ec2_task_instance_bid_price', default=None, help=( 'Bid price to specify for task nodes when ' 'setting them up as EC2 spot instances.') ) self.emr_opt_group.add_option( '--emr-endpoint', dest='emr_endpoint', default=None, help=('Optional host to connect to when communicating with S3' ' (e.g. us-west-1.elasticmapreduce.amazonaws.com). Default' ' is to infer this from aws_region.')) self.emr_opt_group.add_option( '--emr-job-flow-id', dest='emr_job_flow_id', default=None, help='ID of an existing EMR job flow to use') self.emr_opt_group.add_option( '--enable-emr-debugging', dest='enable_emr_debugging', default=None, action='store_true', help='Enable storage of Hadoop logs in SimpleDB') self.emr_opt_group.add_option( '--disable-emr-debugging', dest='enable_emr_debugging', action='store_false', help='Enable storage of Hadoop logs in SimpleDB') self.emr_opt_group.add_option( '--hadoop-streaming-jar-on-emr', dest='hadoop_streaming_jar_on_emr', default=None, help=('Local path of the hadoop streaming jar on the EMR node.' ' Rarely necessary.')) self.emr_opt_group.add_option( '--no-pool-emr-job-flows', dest='pool_emr_job_flows', action='store_false', help="Don't try to run our job on a pooled job flow.") self.emr_opt_group.add_option( '--num-ec2-instances', dest='num_ec2_instances', default=None, type='int', help='Total number of EC2 instances to launch ') # NB: EMR instance counts are only applicable for slave/core and # task, since a master count > 1 causes the EMR API to return the # ValidationError "A master instance group must specify a single # instance". self.emr_opt_group.add_option( '--num-ec2-core-instances', dest='num_ec2_core_instances', default=None, type='int', help=('Number of EC2 instances to start as core (or "slave") ' 'nodes. Incompatible with --num-ec2-instances.')) self.emr_opt_group.add_option( '--num-ec2-task-instances', dest='num_ec2_task_instances', default=None, type='int', help=('Number of EC2 instances to start as task ' 'nodes. 
Incompatible with --num-ec2-instances.')) self.emr_opt_group.add_option( '--pool-emr-job-flows', dest='pool_emr_job_flows', action='store_true', help='Add to an existing job flow or create a new one that does' ' not terminate when the job completes. Overrides other job' ' flow-related options including EC2 instance configuration.' ' Joins pool "default" if emr_job_flow_pool_name is not' ' specified. WARNING: do not run this without' ' mrjob.tools.emr.terminate_idle_job_flows in your crontab;' ' job flows left idle can quickly become expensive!') self.emr_opt_group.add_option( '--pool-name', dest='emr_job_flow_pool_name', action='store', default=None, help=('Specify a pool name to join. Set to "default" if not' ' specified.')) self.emr_opt_group.add_option( '--s3-endpoint', dest='s3_endpoint', default=None, help=('Host to connect to when communicating with S3 (e.g.' ' s3-us-west-1.amazonaws.com). Default is to infer this from' ' region (see --aws-region).')) self.emr_opt_group.add_option( '--s3-log-uri', dest='s3_log_uri', default=None, help='URI on S3 to write logs into') self.emr_opt_group.add_option( '--s3-scratch-uri', dest='s3_scratch_uri', default=None, help='URI on S3 to use as our temp directory.') self.emr_opt_group.add_option( '--s3-sync-wait-time', dest='s3_sync_wait_time', default=None, type='float', help=('How long to wait for S3 to reach eventual consistency. This' ' is typically less than a second (zero in us-west) but the' ' default is 5.0 to be safe.')) self.emr_opt_group.add_option( '--ssh-bin', dest='ssh_bin', default=None, help=("Name/path of ssh binary. Arguments are allowed (e.g." " --ssh-bin 'ssh -v')")) self.emr_opt_group.add_option( '--ssh-bind-ports', dest='ssh_bind_ports', default=None, help=('A list of port ranges that are safe to listen on, delimited' ' by colons and commas, with syntax like' ' 2000[:2001][,2003,2005:2008,etc].' ' Defaults to 40001:40840.')) self.emr_opt_group.add_option( '--ssh-tunnel-is-closed', dest='ssh_tunnel_is_open', default=None, action='store_false', help='Make ssh tunnel accessible from localhost only') self.emr_opt_group.add_option( '--ssh-tunnel-is-open', dest='ssh_tunnel_is_open', default=None, action='store_true', help=('Make ssh tunnel accessible from remote hosts (not just' ' localhost).')) self.emr_opt_group.add_option( '--ssh-tunnel-to-job-tracker', dest='ssh_tunnel_to_job_tracker', default=None, action='store_true', help='Open up an SSH tunnel to the Hadoop job tracker') def all_option_groups(self): return (self.option_parser, self.mux_opt_group, self.proto_opt_group, self.runner_opt_group, self.hadoop_emr_opt_group, self.emr_opt_group, self.hadoop_opts_opt_group) def add_passthrough_option(self, *args, **kwargs): """Function to create options which both the job runner and the job itself respect (we use this for protocols, for example). Use it like you would use :py:func:`optparse.OptionParser.add_option`:: def configure_options(self): super(MRYourJob, self).configure_options() self.add_passthrough_option( '--max-ngram-size', type='int', default=4, help='...') Specify an *opt_group* keyword argument to add the option to that :py:class:`OptionGroup` rather than the top-level :py:class:`OptionParser`. If you want to pass files through to the mapper/reducer, use :py:meth:`add_file_option` instead. 
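# Sketch of combining the instance and bid-price switches above to push most
# of the work onto spot task instances (instance types and prices are
# illustrative only):
spot_args = [
    '-r', 'emr',
    '--num-ec2-core-instances', '2',
    '--num-ec2-task-instances', '8',
    '--ec2-task-instance-type', 'c1.xlarge',
    '--ec2-task-instance-bid-price', '0.25',
]
# e.g. MRWordFreqCount(args=spot_args + ['s3://my-bucket/input/']) -- the job
# class is hypothetical, as above.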
""" if 'opt_group' in kwargs: pass_opt = kwargs.pop('opt_group').add_option(*args, **kwargs) else: pass_opt = self.option_parser.add_option(*args, **kwargs) self._passthrough_options.append(pass_opt) def add_file_option(self, *args, **kwargs): """Add a command-line option that sends an external file (e.g. a SQLite DB) to Hadoop:: def configure_options(self): super(MRYourJob, self).configure_options() self.add_file_option('--scoring-db', help=...) This does the right thing: the file will be uploaded to the working dir of the script on Hadoop, and the script will be passed the same option, but with the local name of the file in the script's working directory. We suggest against sending Berkeley DBs to your job, as Berkeley DB is not forwards-compatible (so a Berkeley DB that you construct on your computer may not be readable from within Hadoop). Use SQLite databases instead. If all you need is an on-disk hash table, try out the :py:mod:`sqlite3dbm` module. """ pass_opt = self.option_parser.add_option(*args, **kwargs) if not pass_opt.type == 'string': raise OptionError( 'passthrough file options must take strings' % pass_opt.type) if not pass_opt.action in ('store', 'append'): raise OptionError("passthrough file options must use the options" " 'store' or 'append'") self._file_options.append(pass_opt) def load_options(self, args): """Load command-line options into ``self.options``. Called from :py:meth:`__init__()` after :py:meth:`configure_options`. :type args: list of str :param args: a list of command line arguments. ``None`` will be treated the same as ``[]``. Re-define if you want to post-process command-line arguments:: def load_options(self, args): super(MRYourJob, self).load_options(args) self.stop_words = self.options.stop_words.split(',') ... 
""" self.options, self.args = self.option_parser.parse_args(args) # parse custom options here to avoid setting a custom Option subclass # and confusing users if self.options.ssh_bind_ports: try: ports = parse_port_range_list(self.options.ssh_bind_ports) except ValueError, e: self.option_parser.error('invalid port range list "%s": \n%s' % (self.options.ssh_bind_ports, e.args[0])) self.options.ssh_bind_ports = ports cmdenv_err = 'cmdenv argument "%s" is not of the form KEY=VALUE' self.options.cmdenv = parse_key_value_list(self.options.cmdenv, cmdenv_err, self.option_parser.error) jobconf_err = 'jobconf argument "%s" is not of the form KEY=VALUE' self.options.jobconf = parse_key_value_list(self.options.jobconf, jobconf_err, self.option_parser.error) def parse_commas(cleanup_str): cleanup_error = ('cleanup option %s is not one of ' + ', '.join(CLEANUP_CHOICES)) new_cleanup_options = [] for choice in cleanup_str.split(','): if choice in CLEANUP_CHOICES: new_cleanup_options.append(choice) else: self.option_parser.error(cleanup_error % choice) if ('NONE' in new_cleanup_options and len(set(new_cleanup_options)) > 1): self.option_parser.error( 'Cannot clean up both nothing and something!') return new_cleanup_options if self.options.cleanup is not None: self.options.cleanup = parse_commas(self.options.cleanup) if self.options.cleanup_on_failure is not None: self.options.cleanup_on_failure = parse_commas( self.options.cleanup_on_failure) # DEPRECATED protocol stuff ignore_switches = ( self.INPUT_PROTOCOL != RawValueProtocol or self.INTERNAL_PROTOCOL != JSONProtocol or self.OUTPUT_PROTOCOL != JSONProtocol or any( (getattr(self, func_name).im_func is not getattr(MRJob, func_name).im_func) for func_name in ( 'input_protocol', 'internal_protocol', 'output_protocol', ) ) ) warn_deprecated = False if self.options.protocol is None: self.options.protocol = self.DEFAULT_PROTOCOL if self.DEFAULT_PROTOCOL != 'json': warn_deprecated = True else: warn_deprecated = True if self.options.input_protocol is None: self.options.input_protocol = self.DEFAULT_INPUT_PROTOCOL if self.DEFAULT_INPUT_PROTOCOL != 'raw_value': warn_deprecated = True else: warn_deprecated = True # output_protocol defaults to protocol if self.options.output_protocol is None: self.options.output_protocol = self.options.protocol else: warn_deprecated = True if warn_deprecated: if ignore_switches: log.warn('You have specified custom behavior in both' ' deprecated and non-deprecated ways.' ' The custom non-deprecated behavior will override' ' the deprecated behavior in all cases, including' ' command line switches.') self.options.input_protocol = None self.options.protocol = None self.options.output_protocol = None else: log.warn('Setting protocols via --input-protocol, --protocol,' ' --output-protocol, DEFAULT_INPUT_PROTOCOL,' ' DEFAULT_PROTOCOL, and DEFAULT_OUTPUT_PROTOCOL is' ' deprecated as of mrjob 0.3 and will no longer be' ' supported in mrjob 0.4.') def is_mapper_or_reducer(self): """True if this is a mapper/reducer. This is mostly useful inside :py:meth:`load_options`, to disable loading options when we aren't running inside Hadoop Streaming. """ return self.options.run_mapper \ or self.options.run_combiner \ or self.options.run_reducer def job_runner_kwargs(self): """Keyword arguments used to create runners when :py:meth:`make_runner` is called. :return: map from arg name to value Re-define this if you want finer control of runner initialization. 
You might find :py:meth:`mrjob.conf.combine_dicts` useful if you want to add or change lots of keyword arguments. """ return { 'bootstrap_mrjob': self.options.bootstrap_mrjob, 'cleanup': self.options.cleanup, 'cleanup_on_failure': self.options.cleanup_on_failure, 'cmdenv': self.options.cmdenv, 'conf_path': self.options.conf_path, 'extra_args': self.generate_passthrough_arguments(), 'file_upload_args': self.generate_file_upload_args(), 'hadoop_extra_args': self.options.hadoop_extra_args, 'hadoop_input_format': self.hadoop_input_format(), 'hadoop_output_format': self.hadoop_output_format(), 'hadoop_streaming_jar': self.options.hadoop_streaming_jar, 'hadoop_version': self.options.hadoop_version, 'input_paths': self.args, 'jobconf': self.jobconf(), 'mr_job_script': self.mr_job_script(), 'label': self.options.label, 'output_dir': self.options.output_dir, 'owner': self.options.owner, 'partitioner': self.partitioner(), 'python_archives': self.options.python_archives, 'python_bin': self.options.python_bin, 'setup_cmds': self.options.setup_cmds, 'setup_scripts': self.options.setup_scripts, 'stdin': self.stdin, 'steps_python_bin': self.options.steps_python_bin, 'upload_archives': self.options.upload_archives, 'upload_files': self.options.upload_files, } def inline_job_runner_kwargs(self): """Keyword arguments to create create runners when :py:meth:`make_runner` is called, when we run a job locally (``-r inline``). :return: map from arg name to value Re-define this if you want finer control when running jobs locally. """ return self.job_runner_kwargs() def local_job_runner_kwargs(self): """Keyword arguments to create create runners when :py:meth:`make_runner` is called, when we run a job locally (``-r local``). :return: map from arg name to value Re-define this if you want finer control when running jobs locally. """ return self.job_runner_kwargs() def emr_job_runner_kwargs(self): """Keyword arguments to create create runners when :py:meth:`make_runner` is called, when we run a job on EMR (``-r emr``). :return: map from arg name to value Re-define this if you want finer control when running jobs on EMR. """ return combine_dicts( self.job_runner_kwargs(), self._get_kwargs_from_opt_group(self.emr_opt_group)) def hadoop_job_runner_kwargs(self): """Keyword arguments to create create runners when :py:meth:`make_runner` is called, when we run a job on EMR (``-r hadoop``). :return: map from arg name to value Re-define this if you want finer control when running jobs on hadoop. """ return combine_dicts( self.job_runner_kwargs(), self._get_kwargs_from_opt_group(self.hadoop_opt_group)) def _get_kwargs_from_opt_group(self, opt_group): """Helper function that returns a dictionary of the values of options in the given options group (this works because the options and the keyword args we want to set have identical names). """ keys = set(opt.dest for opt in opt_group.option_list) return dict((key, getattr(self.options, key)) for key in keys) def generate_passthrough_arguments(self): """Returns a list of arguments to pass to subprocesses, either on hadoop or executed via subprocess. These are passed to :py:meth:`mrjob.runner.MRJobRunner.__init__` as *extra_args*. 
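# Sketch of the combine_dicts() pattern suggested above for adjusting runner
# kwargs without re-listing them all (hypothetical job class; note that the
# override takes precedence over mrjob.conf and command-line switches for
# this key):
from mrjob.conf import combine_dicts
from mrjob.job import MRJob

class MRNightlyReport(MRJob):

    def emr_job_runner_kwargs(self):
        return combine_dicts(
            super(MRNightlyReport, self).emr_job_runner_kwargs(),
            {'ec2_instance_type': 'c1.medium'})

    def mapper(self, _, line):
        yield 'lines', 1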
""" arg_map = parse_and_save_options(self.option_parser, self._cl_args) output_args = [] passthrough_dests = sorted(set(option.dest for option \ in self._passthrough_options)) for option_dest in passthrough_dests: output_args.extend(arg_map.get(option_dest, [])) return output_args def generate_file_upload_args(self): """Figure out file upload args to pass through to the job runner. Instead of generating a list of args, we're generating a list of tuples of ``('--argname', path)`` These are passed to :py:meth:`mrjob.runner.MRJobRunner.__init__` as ``file_upload_args``. """ file_upload_args = [] master_option_dict = self.options.__dict__ for opt in self._file_options: opt_prefix = opt.get_opt_string() opt_value = master_option_dict[opt.dest] if opt_value: paths = opt_value if opt.action == 'append' else [opt_value] for path in paths: file_upload_args.append((opt_prefix, path)) return file_upload_args ### protocols ### def input_protocol(self): """Instance of the protocol to use to convert input lines to Python objects. Default behavior is to return an instance of :py:attr:`INPUT_PROTOCOL`. """ if (self.options.input_protocol is not None and self.INPUT_PROTOCOL == RawValueProtocol): # deprecated protocol_name = self.options.input_protocol return self.protocols()[protocol_name]() else: # non-deprecated return self.INPUT_PROTOCOL() def internal_protocol(self): """Instance of the protocol to use to communicate between steps. Default behavior is to return an instance of :py:attr:`INTERNAL_PROTOCOL`. """ if (self.options.protocol is not None and self.INTERNAL_PROTOCOL == JSONProtocol): # deprecated protocol_name = self.options.protocol return self.protocols()[protocol_name] else: # non-deprecated return self.INTERNAL_PROTOCOL() def output_protocol(self): """Instance of the protocol to use to convert Python objects to output lines. Default behavior is to return an instance of :py:attr:`OUTPUT_PROTOCOL`. """ if (self.options.output_protocol is not None and self.OUTPUT_PROTOCOL == JSONProtocol): # deprecated return self.protocols()[self.options.output_protocol] else: # non-deprecated return self.OUTPUT_PROTOCOL() @classmethod def protocols(cls): """Deprecated in favor of :py:attr:`INPUT_PROTOCOL`, :py:attr:`OUTPUT_PROTOCOL`, and :py:attr:`INTERNAL_PROTOCOL`. Mapping from protocol name to the protocol class to use for parsing job input and writing job output. We give protocols names so that we can easily choose them from the command line. This returns :py:data:`mrjob.protocol.PROTOCOL_DICT` by default. To add a custom protocol, define a subclass of :py:class:`mrjob.protocol.HadoopStreamingProtocol`, and re-define this method:: @classmethod def protocols(cls): protocol_dict = super(MRYourJob, cls).protocols() protocol_dict['rot13'] = Rot13Protocol return protocol_dict DEFAULT_PROTOCOL = 'rot13' """ return PROTOCOL_DICT.copy() # copy to stop monkey-patching #: Protocol for reading input to the first mapper in your job. #: Default: :py:class:`RawValueProtocol`. #: #: For example you know your input data were in JSON format, you could #: set:: #: #: INPUT_PROTOCOL = JsonValueProtocol #: #: in your class, and your initial mapper would receive decoded JSONs #: rather than strings. #: #: See :py:data:`mrjob.protocol` for the full list of protocols. INPUT_PROTOCOL = RawValueProtocol #: Protocol for communication between steps and final output. #: Default: :py:class:`JSONProtocol`. 
#: #: For example if your step output weren't JSON-encodable, you could set:: #: #: INTERNAL_PROTOCOL = PickleProtocol #: #: and step output would be encoded as string-escaped pickles. #: #: See :py:data:`mrjob.protocol` for the full list of protocols. INTERNAL_PROTOCOL = JSONProtocol #: Protocol to use for writing output. Default: :py:class:`JSONProtocol`. #: #: For example, if you wanted the final output in repr, you could set:: #: #: OUTPUT_PROTOCOL = ReprProtocol #: #: See :py:data:`mrjob.protocol` for the full list of protocols. OUTPUT_PROTOCOL = JSONProtocol #: .. deprecated:: 0.3.0 #: #: Default protocol for reading input to the first mapper in your job #: specified by a string. #: #: Overridden by any changes to :py:attr:`.INPUT_PROTOCOL`. #: #: See :py:data:`mrjob.protocol.PROTOCOL_DICT` for the full list of #: protocol strings. Can be overridden by :option:`--input-protocol`. DEFAULT_INPUT_PROTOCOL = 'raw_value' #: .. deprecated:: 0.3.0 #: #: Default protocol for communication between steps and final output #: specified by a string. #: #: Overridden by any changes to :py:attr:`.INTERNAL_PROTOCOL`. #: #: See :py:data:`mrjob.protocol.PROTOCOL_DICT` for the full list of #: protocol strings. Can be overridden by :option:`--protocol`. DEFAULT_PROTOCOL = DEFAULT_PROTOCOL # i.e. the one from mrjob.protocols #: .. deprecated:: 0.3.0 #: #: Overridden by any changes to :py:attr:`.OUTPUT_PROTOCOL`. If #: :py:attr:`.OUTPUT_PROTOCOL` is not set, defaults to #: :py:attr:`.DEFAULT_PROTOCOL`. #: #: See :py:data:`mrjob.protocol.PROTOCOL_DICT` for the full list of #: protocol strings. Can be overridden by the :option:`--output-protocol`. DEFAULT_OUTPUT_PROTOCOL = None def parse_output_line(self, line): """ Parse a line from the final output of this MRJob into ``(key, value)``. Used extensively in tests like this:: runner.run() for line in runner.stream_output(): key, value = mr_job.parse_output_line(line) """ return self.output_protocol().read(line) ### Hadoop Input/Output Formats ### #: Optional name of an optional Hadoop ``InputFormat`` class, e.g. #: ``'org.apache.hadoop.mapred.lib.NLineInputFormat'``. #: #: Passed to Hadoop with the *first* step of this job with the #: ``-inputformat`` option. HADOOP_INPUT_FORMAT = None def hadoop_input_format(self): """Optional Hadoop ``InputFormat`` class to parse input for the first step of the job. Normally, setting :py:attr:`HADOOP_INPUT_FORMAT` is sufficient; redefining this method is only for when you want to get fancy. """ if self.options.hadoop_input_format: log.warn('--hadoop-input-format is deprecated as of mrjob 0.3 and' ' will no longer be supported in mrjob 0.4. Redefine' ' HADOOP_INPUT_FORMAT or hadoop_input_format() instead.') return self.options.hadoop_input_format else: return self.HADOOP_INPUT_FORMAT #: Optional name of an optional Hadoop ``OutputFormat`` class, e.g. #: ``'org.apache.hadoop.mapred.FileOutputFormat'``. #: #: Passed to Hadoop with the *last* step of this job with the #: ``-outputformat`` option. HADOOP_OUTPUT_FORMAT = None def hadoop_output_format(self): """Optional Hadoop ``OutputFormat`` class to write output for the last step of the job. Normally, setting :py:attr:`HADOOP_OUTPUT_FORMAT` is sufficient; redefining this method is only for when you want to get fancy. """ if self.options.hadoop_output_format: log.warn('--hadoop-output-format is deprecated as of mrjob 0.3 and' ' will no longer be supported in mrjob 0.4. Redefine ' ' HADOOP_OUTPUT_FORMAT or hadoop_output_format() instead.' 
) return self.options.hadoop_output_format else: return self.HADOOP_OUTPUT_FORMAT ### Partitioning ### #: Optional Hadoop partitioner class to use to determine how mapper #: output should be sorted and distributed to reducers. For example: #: ``'org.apache.hadoop.mapred.lib.HashPartitioner'``. PARTITIONER = None def partitioner(self): """Optional Hadoop partitioner class to use to determine how mapper output should be sorted and distributed to reducers. By default, returns whatever is passed to :option:`--partitioner`, of if that option isn't used, :py:attr:`PARTITIONER`. You probably don't need to re-define this; it's just here for completeness. """ return self.options.partitioner or self.PARTITIONER ### Jobconf ### #: Optional jobconf arguments we should always pass to Hadoop. This #: is a map from property name to value. e.g.: #: #: ``{'stream.num.map.output.key.fields': '4'}`` #: #: It's recommended that you only use this to hard-code things that #: affect the semantics of your job, and leave performance tweaks to #: the command line or whatever you use to launch your job. JOBCONF = {} def jobconf(self): """``-jobconf`` args to pass to hadoop streaming. This should be a map from property name to value. By default, this combines :option:`jobconf` options from the command lines with :py:attr:`JOBCONF`, with command line arguments taking precedence. If you want to re-define this, it's strongly recommended that do something like this, so as not to inadvertently disable :option:`jobconf`:: def jobconf(self): orig_jobconf = super(MyMRJobClass, self).jobconf() custom_jobconf = ... return mrjob.conf.combine_dicts(orig_jobconf, custom_jobconf) """ return combine_dicts(self.JOBCONF, self.options.jobconf) ### Testing ### def sandbox(self, stdin=None, stdout=None, stderr=None): """Redirect stdin, stdout, and stderr for automated testing. You can set stdin, stdout, and stderr to file objects. By default, they'll be set to empty ``StringIO`` objects. You can then access the job's file handles through ``self.stdin``, ``self.stdout``, and ``self.stderr``. See :ref:`testing` for more information about testing. You may call sandbox multiple times (this will essentially clear the file handles). ``stdin`` is empty by default. You can set it to anything that yields lines:: mr_job.sandbox(stdin=StringIO('some_data\\n')) or, equivalently:: mr_job.sandbox(stdin=['some_data\\n']) For convenience, this sandbox() returns self, so you can do:: mr_job = MRJobClassToTest().sandbox() Simple testing example:: mr_job = MRYourJob.sandbox() assert_equal(list(mr_job.reducer('foo', ['bar', 'baz'])), [...]) More complex testing example:: from StringIO import StringIO mr_job = MRYourJob(args=[...]) fake_input = '"foo"\\t"bar"\\n"foo"\\t"baz"\\n' mr_job.sandbox(stdin=StringIO(fake_input)) mr_job.run_reducer(link_num=0) assert_equal(mr_job.parse_output(), ...) assert_equal(mr_job.parse_counters(), ...) """ self.stdin = stdin or StringIO() self.stdout = stdout or StringIO() self.stderr = stderr or StringIO() return self def parse_counters(self, counters=None): """Convenience method for reading counters. This only works in sandbox mode. This does not clear ``self.stderr``. :return: a map from counter group to counter name to amount. To read everything from ``self.stderr`` (including status messages) use :py:meth:`mrjob.parse.parse_mr_job_stderr`. When writing unit tests, you may find :py:meth:`MRJobRunner.counters() ` more useful. 
""" if self.stderr == sys.stderr: raise AssertionError('You must call sandbox() first;' ' parse_counters() is for testing only.') stderr_results = parse_mr_job_stderr(self.stderr.getvalue(), counters) return stderr_results['counters'] def parse_output(self, protocol=None): """Convenience method for parsing output from any mapper or reducer, all at once. This helps you test individual mappers and reducers by calling run_mapper() or run_reducer(). For example:: mr_job.sandbox(stdin=your_input) mr_job.run_mapper(step_num=0) output = mrjob.parse_output() :type protocol: str :param protocol: A protocol instance to use (e.g. JSONProtocol()), Also accepts protocol names (e.g. ``'json'``), but this is deprecated. This only works in sandbox mode. This does not clear ``self.stdout``. """ if self.stdout == sys.stdout: raise AssertionError('You must call sandbox() first;' ' parse_output() is for testing only.') if protocol is None: protocol = JSONProtocol() elif isinstance(protocol, basestring): protocol = self.protocols()[protocol] lines = StringIO(self.stdout.getvalue()) return [protocol.read(line) for line in lines] if __name__ == '__main__': MRJob.run() mrjob-0.3.3.2/mrjob/local.py0000664€q(¼€tzÕß0000006526211740625117021403 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp and Contributors # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Run an MRJob locally by forking off a bunch of processes and piping them together. Useful for testing.""" from __future__ import with_statement import itertools import logging import os import shutil import stat from subprocess import Popen from subprocess import PIPE import sys from mrjob.compat import translate_jobconf from mrjob.conf import combine_dicts from mrjob.conf import combine_local_envs from mrjob.parse import find_python_traceback from mrjob.parse import parse_mr_job_stderr from mrjob.runner import MRJobRunner from mrjob.util import cmd_line from mrjob.util import read_input from mrjob.util import unarchive log = logging.getLogger('mrjob.local') DEFAULT_MAP_TASKS = 2 DEFAULT_REDUCE_TASKS = 2 class LocalMRJobRunner(MRJobRunner): """Runs an :py:class:`~mrjob.job.MRJob` locally, for testing purposes. This is the default way of running jobs; we assume you'll spend some time debugging your job before you're ready to run it on EMR or Hadoop. It's rare to need to instantiate this class directly (see :py:meth:`~LocalMRJobRunner.__init__` for details). 
:py:class:`LocalMRJobRunner` simulates the following jobconf variables: * ``mapreduce.job.cache.archives`` * ``mapreduce.job.cache.files`` * ``mapreduce.job.cache.local.archives`` * ``mapreduce.job.cache.local.files`` * ``mapreduce.job.id`` * ``mapreduce.job.local.dir`` * ``mapreduce.map.input.file`` * ``mapreduce.map.input.length`` * ``mapreduce.map.input.start`` * ``mapreduce.task.attempt.id`` * ``mapreduce.task.id`` * ``mapreduce.task.ismap`` * ``mapreduce.task.output.dir`` * ``mapreduce.task.partition`` :py:class:`LocalMRJobRunner` adds the current working directory to the subprocesses' :envvar:`PYTHONPATH`, so if you're using it to test an EMR job locally, be aware that it may see more Python modules than will actaully be uploaded. This behavior may change in the future. """ alias = 'local' def __init__(self, **kwargs): """Arguments to this constructor may also appear in :file:`mrjob.conf` under ``runners/local``. :py:class:`~mrjob.local.LocalMRJobRunner`'s constructor takes the same keyword args as :py:class:`~mrjob.runner.MRJobRunner`. However, please note: * *cmdenv* is combined with :py:func:`~mrjob.conf.combine_local_envs` * *python_bin* defaults to ``sys.executable`` (the current python interpreter) * *hadoop_extra_args*, *hadoop_input_format*, *hadoop_output_format*, *hadoop_streaming_jar*, and *partitioner* are ignored because they require Java. If you need to test these, consider starting up a standalone Hadoop instance and running your job with ``-r hadoop``. """ super(LocalMRJobRunner, self).__init__(**kwargs) self._working_dir = None self._prev_outfiles = [] self._counters = [] self._map_tasks = DEFAULT_MAP_TASKS self._reduce_tasks = DEFAULT_REDUCE_TASKS # jobconf variables set by our own job (e.g. files "uploaded") # # By convention, we use the Hadoop 0.21 (newer) versions of the # jobconf variables internally (they get auto-translated before # running the job) self._internal_jobconf = {} @classmethod def _default_opts(cls): """A dictionary giving the default value of options.""" return combine_dicts(super(LocalMRJobRunner, cls)._default_opts(), { # prefer whatever interpreter we're currently using 'python_bin': [sys.executable or 'python'], }) @classmethod def _opts_combiners(cls): # on windows, PYTHONPATH should use ;, not : return combine_dicts( super(LocalMRJobRunner, cls)._opts_combiners(), {'cmdenv': combine_local_envs}) # options that we ignore because they require real Hadoop IGNORED_HADOOP_OPTS = [ 'hadoop_extra_args', 'hadoop_streaming_jar', ] # keyword arguments that we ignore that are stored directly in # self._ because they aren't configurable from mrjob.conf # use the version with the underscore to better support grepping our code IGNORED_HADOOP_ATTRS = [ '_hadoop_input_format', '_hadoop_output_format', '_partitioner', ] def _run(self): if self._opts['bootstrap_mrjob']: self._add_python_archive(self._create_mrjob_tar_gz() + '#') for ignored_opt in self.IGNORED_HADOOP_OPTS: if self._opts[ignored_opt]: log.warning('ignoring %s option (requires real Hadoop): %r' % (ignored_opt, self._opts[ignored_opt])) for ignored_attr in self.IGNORED_HADOOP_ATTRS: value = getattr(self, ignored_attr) if value is not None: log.warning( 'ignoring %s keyword arg (requires real Hadoop): %r' % (ignored_attr[1:], value)) self._create_wrapper_script() self._setup_working_dir() self._setup_output_dir() # process jobconf arguments jobconf = self._opts['jobconf'] self._process_jobconf_args(jobconf) assert self._script # shouldn't be able to run if no script wrapper_args = 
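# Sketch of reading one of the simulated jobconf variables from inside a task
# (hypothetical job). The local runner exports each variable as an
# environment variable with dots replaced by underscores, after translating
# the name to the emulated Hadoop version; assuming the default 0.20
# emulation, 'mapreduce.map.input.file' shows up as 'map_input_file'.
# (mrjob.compat also has helpers that hide this translation.)
import os

from mrjob.job import MRJob

class MRTagWithInputFile(MRJob):

    def mapper(self, _, line):
        input_file = os.environ.get('map_input_file', 'unknown')
        yield input_file, 1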
self._opts['python_bin'] if self._wrapper_script: wrapper_args = (self._opts['python_bin'] + [self._wrapper_script['name']] + wrapper_args) # run mapper, combiner, sort, reducer for each step for i, step in enumerate(self._get_steps()): self._counters.append({}) # run the mapper mapper_args = (wrapper_args + [self._script['name'], '--step-num=%d' % i, '--mapper'] + self._mr_job_extra_args()) combiner_args = [] if 'C' in step: combiner_args = (wrapper_args + [self._script['name'], '--step-num=%d' % i, '--combiner'] + self._mr_job_extra_args()) self._invoke_step(mapper_args, 'step-%d-mapper' % i, step_num=i, step_type='M', num_tasks=self._map_tasks, combiner_args=combiner_args) if 'R' in step: # sort the output. Treat this as a mini-step for the purpose # of self._prev_outfiles sort_output_path = os.path.join( self._get_local_tmp_dir(), 'step-%d-mapper-sorted' % i) self._invoke_sort(self._step_input_paths(), sort_output_path) self._prev_outfiles = [sort_output_path] # run the reducer reducer_args = (wrapper_args + [self._script['name'], '--step-num=%d' % i, '--reducer'] + self._mr_job_extra_args()) self._invoke_step(reducer_args, 'step-%d-reducer' % i, step_num=i, step_type='R', num_tasks=self._reduce_tasks) # move final output to output directory for i, outfile in enumerate(self._prev_outfiles): final_outfile = os.path.join(self._output_dir, 'part-%05d' % i) log.info('Moving %s -> %s' % (outfile, final_outfile)) shutil.move(outfile, final_outfile) def _process_jobconf_args(self, jobconf): if jobconf: for (conf_arg, value) in jobconf.iteritems(): # Internally, use one canonical Hadoop version canon_arg = translate_jobconf(conf_arg, '0.21') if canon_arg == 'mapreduce.job.maps': self._map_tasks = int(value) if self._map_tasks < 1: raise ValueError( '%s should be at least 1' % conf_arg) elif canon_arg == 'mapreduce.job.reduces': self._reduce_tasks = int(value) if self._reduce_tasks < 1: raise ValueError('%s should be at least 1' % conf_arg) elif canon_arg == 'mapreduce.job.local.dir': # Hadoop supports multiple direcories. Sticking with only # one here if not os.path.isdir(value): raise IOError("Directory %s does not exist" % value) self._working_dir = value def _setup_working_dir(self): """Make a working directory with symlinks to our script and external files. Return name of the script""" # specify that we want to upload our script along with other files if self._script: self._script['upload'] = 'file' if self._wrapper_script: self._wrapper_script['upload'] = 'file' # create the working directory if not self._working_dir: self._working_dir = os.path.join( self._get_local_tmp_dir(), 'working_dir') self.mkdir(self._working_dir) # give all our files names, and symlink or unarchive them self._name_files() for file_dict in self._files: path = file_dict['path'] name = file_dict['name'] dest = os.path.join(self._working_dir, name) if file_dict.get('upload') == 'file': self._symlink_to_file_or_copy(path, dest) elif file_dict.get('upload') == 'archive': log.debug('unarchiving %s -> %s' % (path, dest)) unarchive(path, dest) def _setup_output_dir(self): if not self._output_dir: self._output_dir = os.path.join( self._get_local_tmp_dir(), 'output') if not os.path.isdir(self._output_dir): log.debug('Creating output directory %s' % self._output_dir) self.mkdir(self._output_dir) def _symlink_to_file_or_copy(self, path, dest): """Symlink from *dest* to the absolute version of *path*. 
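# Sketch of using the jobconf handling above to control local parallelism;
# the 0.21-style names are shown, but older spellings work too because
# everything is canonicalized through translate_jobconf():
local_args = [
    '-r', 'local',
    '--jobconf', 'mapreduce.job.maps=4',
    '--jobconf', 'mapreduce.job.reduces=1',
    'input.txt',  # hypothetical input file
]
# e.g. MRWordFreqCount(args=local_args).make_runner().run() would fork four
# mapper tasks and a single reducer task (job class hypothetical, as above).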
If symlinks aren't available, copy *path* to *dest* instead.""" if hasattr(os, 'symlink'): path = os.path.abspath(path) log.debug('creating symlink %s <- %s' % (path, dest)) os.symlink(path, dest) else: log.debug('copying %s -> %s' % (path, dest)) shutil.copyfile(path, dest) def _get_file_splits(self, input_paths, num_splits, keep_sorted=False): """ Split the input files into (roughly) *num_splits* files. Gzipped files are not split, but each gzipped file counts as one split. :param input_paths: Iterable of paths to be split :param num_splits: Number of splits to target :param keep_sorted: If True, group lines by key Returns a dictionary that maps split_file names to a dictionary of properties: * *orig_name*: the original name of the file whose data is in the split * *start*: where the split starts * *length*: the length of the split """ # sanity check: if keep_sorted is True, we should only have one file assert(not keep_sorted or len(input_paths) == 1) file_names = {} input_paths_to_split = [] for input_path in input_paths: for path in self.ls(input_path): if path.endswith('.gz'): # do not split compressed files file_names[path] = { 'orig_name': path, 'start': 0, 'length': os.stat(path)[stat.ST_SIZE], } # this counts as "one split" num_splits -= 1 else: # do split uncompressed files input_paths_to_split.append(path) # exit early if no uncompressed files given if not input_paths_to_split: return file_names # account for user giving fewer splits than there are compressed files num_splits = max(num_splits, 1) # determine the size of each file split total_size = 0 for input_path in input_paths_to_split: for path in self.ls(input_path): total_size += os.stat(path)[stat.ST_SIZE] split_size = total_size / num_splits # we want each file split to be as close to split_size as possible # we also want different input files to be in different splits tmp_directory = self._get_local_tmp_dir() # Helper functions: def create_outfile(orig_name='', start=''): # create a new output file and initialize its properties dict outfile_name = os.path.join(tmp_directory, 'input_part-%05d' % len(file_names)) new_file = { 'orig_name': orig_name, 'start': start, } file_names[outfile_name] = new_file return outfile_name def line_group_generator(input_path): # Generate lines from a given input_path, if keep_sorted is True, # group lines by key; otherwise have one line per group # concatenate all lines with the same key and yield them # together if keep_sorted: def reducer_key(line): return line.split('\t')[0] # assume that input is a collection of key value pairs # match all non-tab characters for _, lines in itertools.groupby( read_input(input_path), key=reducer_key): yield lines else: for line in read_input(input_path): yield (line,) for path in input_paths_to_split: # create a new split file for each new path # initialize file and accumulators outfile_name = create_outfile(path, 0) bytes_written = 0 total_bytes = 0 outfile = None try: outfile = open(outfile_name, 'w') # write each line to a file as long as we are within the limit # (split_size) for line_group in line_group_generator(path): if bytes_written >= split_size: # new split file if we exceeded the limit file_names[outfile_name]['length'] = bytes_written total_bytes += bytes_written outfile_name = create_outfile(path, total_bytes) outfile.close() outfile = open(outfile_name, 'w') bytes_written = 0 for line in line_group: outfile.write(line) bytes_written += len(line) file_names[outfile_name]['length'] = bytes_written finally: if not outfile is None: 
outfile.close() return file_names def _step_input_paths(self): """Decide where to get input for a step. Dump stdin to a temp file if need be.""" if self._prev_outfiles: return self._prev_outfiles else: input_paths = [] for path in self._input_paths: if path == '-': input_paths.append(self._dump_stdin_to_local_file()) else: input_paths.append(path) return input_paths def _invoke_step(self, args, outfile_name, step_num=0, num_tasks=1, step_type='M', combiner_args=None): """Run the given command, outputting into outfile, and reading from the previous outfile (or, for the first step, from our original output files). outfile is a path relative to our local tmp dir. commands are run inside self._working_dir We'll intelligently handle stderr from the process. :param combiner_args: If this mapper has a combiner, we need to do some extra shell wrangling, so pass the combiner arguments in separately. """ # get file splits for mappers and reducers keep_sorted = (step_type == 'R') file_splits = self._get_file_splits( self._step_input_paths(), num_tasks, keep_sorted=keep_sorted) # Start the tasks associated with the step: # if we need to sort, then just sort all input files into one file # otherwise, split the files needed for mappers and reducers # and setup the task environment for each all_proc_dicts = [] self._prev_outfiles = [] for task_num, file_name in enumerate(file_splits): # setup environment variables if step_type == 'M': env = self._subprocess_env( step_type, step_num, task_num, # mappers have extra file split info input_file=file_splits[file_name]['orig_name'], input_start=file_splits[file_name]['start'], input_length=file_splits[file_name]['length']) else: env = self._subprocess_env(step_type, step_num, task_num) task_outfile = outfile_name + '_part-%05d' % task_num proc_dicts = self._invoke_process(args + [file_name], task_outfile, env=env, combiner_args=combiner_args) all_proc_dicts.extend(proc_dicts) for proc_dict in all_proc_dicts: self._wait_for_process(proc_dict, step_num) self.print_counters([step_num + 1]) def _subprocess_env(self, step_type, step_num, task_num, input_file=None, input_start=None, input_length=None): """Set up environment variables for a subprocess (mapper, etc.) This combines, in decreasing order of priority: * environment variables set by the **cmdenv** option * **jobconf** environment variables set by our job (e.g. ``mapreduce.task.ismap`) * environment variables from **jobconf** options, translated to whatever version of Hadoop we're emulating * the current environment * PYTHONPATH set to current working directory We use :py:func:`~mrjob.conf.combine_local_envs`, so ``PATH`` environment variables are handled specially. 
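# Sketch of the grouping used by _get_file_splits(..., keep_sorted=True)
# above: lines sharing a tab-delimited key stay in the same split, so a
# reducer never sees one key torn across two tasks.
import itertools

def group_lines_by_key(lines):
    def reducer_key(line):
        return line.split('\t')[0]
    for _, group in itertools.groupby(lines, key=reducer_key):
        yield list(group)

# list(group_lines_by_key(['"a"\t1\n', '"a"\t2\n', '"b"\t3\n']))
# -> [['"a"\t1\n', '"a"\t2\n'], ['"b"\t3\n']]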
""" version = self.get_hadoop_version() jobconf_env = dict( (translate_jobconf(k, version).replace('.', '_'), str(v)) for (k, v) in self._opts['jobconf'].iteritems()) internal_jobconf = self._simulate_jobconf_for_step( step_type, step_num, task_num, input_file=input_file, input_start=input_start, input_length=input_length) internal_jobconf_env = dict( (translate_jobconf(k, version).replace('.', '_'), str(v)) for (k, v) in internal_jobconf.iteritems()) # keep the current environment because we need PATH to find binaries # and make PYTHONPATH work return combine_local_envs({'PYTHONPATH': os.getcwd()}, os.environ, jobconf_env, internal_jobconf_env, self._get_cmdenv()) def _simulate_jobconf_for_step(self, step_type, step_num, task_num, input_file=None, input_start=None, input_length=None): """Simulate jobconf variables set by Hadoop to indicate input files, files uploaded, working directory, etc. for a particular step. Returns a dictionary mapping jobconf variable name (e.g. ``'mapreduce.map.input.file'``) to its value, which is always a string. We use the newer (Hadoop 0.21+) jobconf names; these will be translated to the correct Hadoop version elsewhere. """ j = {} # our final results j['mapreduce.job.id'] = self._job_name j['mapreduce.job.local.dir'] = self._working_dir j['mapreduce.task.output.dir'] = self._output_dir # archives and files for jobconf cache_archives = [] cache_files = [] cache_local_archives = [] cache_local_files = [] for file_dict in self._files: path = file_dict['path'] name = file_dict['name'] dest = os.path.join(self._working_dir, name) if file_dict.get('upload') == 'file': cache_files.append('%s#%s' % (path, name)) cache_local_files.append(dest) elif file_dict.get('upload') == 'archive': cache_archives.append('%s#%s' % (path, name)) cache_local_archives.append(dest) # could add mtime info here too (e.g. # mapreduce.job.cache.archives.timestamps) here too, though we should # probably cache that in self._files j['mapreduce.job.cache.files'] = (','.join(cache_files)) j['mapreduce.job.cache.local.files'] = (','.join(cache_local_files)) j['mapreduce.job.cache.archives'] = (','.join(cache_archives)) j['mapreduce.job.cache.local.archives'] = ( ','.join(cache_local_archives)) # task and attempt IDs j['mapreduce.task.id'] = 'task_%s_%s_%05d%d' % ( self._job_name, step_type.lower(), step_num, task_num) # (we only have one attempt) j['mapreduce.task.attempt.id'] = 'attempt_%s_%s_%05d%d_0' % ( self._job_name, step_type.lower(), step_num, task_num) # not actually sure what's correct for combiners here. It'll definitely # be true if we're just using pipes to simulate a combiner though j['mapreduce.task.ismap'] = str(step_type in ('M', 'C')).lower() j['mapreduce.task.partition'] = str(task_num) if input_file is not None: j['mapreduce.map.input.file'] = input_file if input_start is not None: j['mapreduce.map.input.start'] = str(input_start) if input_length is not None: j['mapreduce.map.input.length'] = str(input_length) return j def _invoke_process(self, args, outfile_name, env, combiner_args=None): """invoke the process described by *args* and write to *outfile_name* :param combiner_args: If this mapper has a combiner, we need to do some extra shell wrangling, so pass the combiner arguments in separately. 
:return: dict(proc=Popen, args=[process args], write_to=file) """ if combiner_args: log.info('> %s | sort | %s' % (cmd_line(args), cmd_line(combiner_args))) else: log.info('> %s' % cmd_line(args)) # set up outfile outfile = os.path.join(self._get_local_tmp_dir(), outfile_name) log.info('writing to %s' % outfile) self._prev_outfiles.append(outfile) with open(outfile, 'w') as write_to: if combiner_args: # set up a pipeline: mapper | sort | combiner mapper_proc = Popen(args, stdout=PIPE, stderr=PIPE, cwd=self._working_dir, env=env) sort_proc = Popen(['sort'], stdin=mapper_proc.stdout, stdout=PIPE, stderr=PIPE, cwd=self._working_dir, env=env) combiner_proc = Popen(combiner_args, stdin=sort_proc.stdout, stdout=write_to, stderr=PIPE, cwd=self._working_dir, env=env) # this process shouldn't read from the pipes mapper_proc.stdout.close() sort_proc.stdout.close() return [ {'proc': mapper_proc, 'args': args}, {'proc': sort_proc, 'args': ['sort']}, {'proc': combiner_proc, 'args': combiner_args}, ] else: # just run the mapper process proc = Popen(args, stdout=write_to, stderr=PIPE, cwd=self._working_dir, env=env) return [{'proc': proc, 'args': args}] def _wait_for_process(self, proc_dict, step_num): # handle counters, status msgs, and other stuff on stderr stderr_lines = self._process_stderr_from_script( proc_dict['proc'].stderr, step_num=step_num) tb_lines = find_python_traceback(stderr_lines) returncode = proc_dict['proc'].wait() if returncode != 0: self.print_counters([step_num + 1]) # try to throw a useful exception if tb_lines: raise Exception( 'Command %r returned non-zero exit status %d:\n%s' % (proc_dict['args'], returncode, ''.join(tb_lines))) else: raise Exception( 'Command %r returned non-zero exit status %d' % (proc_dict['args'], returncode)) def _process_stderr_from_script(self, stderr, step_num=0): """Handle stderr a line at time: * for counter lines, store counters * for status message, log the status change * for all other lines, log an error, and yield the lines """ for line in stderr: # just pass one line at a time to parse_mr_job_stderr(), # so we can print error and status messages in realtime parsed = parse_mr_job_stderr( [line], counters=self._counters[step_num]) # in practice there's only going to be at most one line in # one of these lists, but the code is cleaner this way for status in parsed['statuses']: log.info('status: %s' % status) for line in parsed['other']: log.error('STDERR: %s' % line.rstrip('\r\n')) yield line def counters(self): return self._counters def get_hadoop_version(self): return self._opts['hadoop_version'] mrjob-0.3.3.2/mrjob/logparsers.py0000664€q(¼€tzÕß0000002400411740642733022462 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2011 Matthew Tai # Copyright 2011 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
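# Sketch of the stderr convention handled above: tasks write Hadoop streaming
# "reporter:" lines, and parse_mr_job_stderr() sorts them into counters,
# statuses, and everything else (expected results shown as comments).
from mrjob.parse import parse_mr_job_stderr

stderr_lines = [
    'reporter:counter:demo,lines,1\n',
    'reporter:status:still going\n',
    'some stray traceback text\n',
]
parsed = parse_mr_job_stderr(stderr_lines)
# parsed['counters'] -> {'demo': {'lines': 1}}
# parsed['statuses'] -> ['still going']
# parsed['other']    -> ['some stray traceback text\n']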
"""Parsing classes to find errors in Hadoop logs""" from __future__ import with_statement import logging import posixpath import re from mrjob.parse import find_hadoop_java_stack_trace from mrjob.parse import find_input_uri_for_mapper from mrjob.parse import find_interesting_hadoop_streaming_error from mrjob.parse import find_python_traceback from mrjob.parse import find_timeout_error from mrjob.parse import parse_hadoop_counters_from_line log = logging.getLogger('mrjob.logparser') # Constants used to distinguish between different kinds of logs TASK_ATTEMPT_LOGS = 'TASK_ATTEMPT_LOGS' STEP_LOGS = 'STEP_LOGS' JOB_LOGS = 'JOB_LOGS' NODE_LOGS = 'NODE_LOGS' # regex for matching task-attempts log URIs TASK_ATTEMPTS_LOG_URI_RE = re.compile( r'^.*/attempt_' #attempt_ r'(?P\d+)_' #201203222119_ r'(?P\d+)_' #0001_ r'(?P\w)_' #m_ r'(?P\d+)_' #000000_ r'(?P\d+)/' #3/ r'(?Pstderr|syslog)$') #stderr # regex for matching step log URIs STEP_LOG_URI_RE = re.compile( r'^.*/(?P\d+)/(?Psyslog|stderr)$') # regex for matching job log URIs. There is some variety in how these are # formatted, so this expression is pretty general. EMR_JOB_LOG_URI_RE = re.compile( r'^.*?' # sometimes there is a number at the beginning, and the # containing directory can be almost anything. r'job_(?P\d+)_(?P\d+)' # oh look, meaningful data! r'(_\d+)?' # sometimes there is a number here. r'_hadoop_streamjob(\d+).jar$') HADOOP_JOB_LOG_URI_RE = re.compile( r'^.*?/job_(?P\d+)_(?P\d+)_(?P\d+)' r'_(?P.*?)_streamjob(?P\d+).jar$') # regex for matching slave log URIs NODE_LOG_URI_RE = re.compile( r'^.*?/hadoop-hadoop-(jobtracker|namenode).*.out$') def scan_for_counters_in_files(log_file_uris, runner, hadoop_version): """Scan *log_file_uris* for counters, using *runner* for file system access """ counters = {} relevant_logs = [] # list of (sort key, URI) for log_file_uri in log_file_uris: match = EMR_JOB_LOG_URI_RE.match(log_file_uri) if match is None: match = HADOOP_JOB_LOG_URI_RE.match(log_file_uri) if not match: continue relevant_logs.append((match.group('step_num'), log_file_uri)) relevant_logs.sort() for _, log_file_uri in relevant_logs: log_lines = runner.cat(log_file_uri) if not log_lines: continue for line in log_lines: new_counters, step_num = parse_hadoop_counters_from_line( line, hadoop_version) if new_counters: counters[step_num] = new_counters return counters def scan_logs_in_order(task_attempt_logs, step_logs, job_logs, runner): """Use mapping and order from :py:func:`processing_order` to find errors in logs. 
Returns:: None (nothing found) or a dictionary containing: lines -- lines in the log file containing the error message log_file_uri -- the log file containing the error message input_uri -- if the error happened in a mapper in the first step, the URI of the input file that caused the error (otherwise None) """ log_type_to_uri_list = { TASK_ATTEMPT_LOGS: task_attempt_logs, STEP_LOGS: step_logs, # job logs may be scanned twice, so save the ls generator output JOB_LOGS: list(job_logs), } for log_type, sort_func, parsers in processing_order(): relevant_logs = sort_func(log_type_to_uri_list[log_type]) # unfortunately need to special case task attempts since later # attempts may have succeeded and we don't want those (issue #31) tasks_seen = set() for sort_key, info, log_file_uri in relevant_logs: log.debug('Parsing %s' % log_file_uri) if log_type == TASK_ATTEMPT_LOGS: task_info = (info['step_num'], info['node_type'], info['node_num'], info['stream']) if task_info in tasks_seen: continue tasks_seen.add(task_info) val = _apply_parsers_to_log(parsers, log_file_uri, runner) if val: if info.get('node_type', None) == 'm': val['input_uri'] = _scan_for_input_uri(log_file_uri, runner) return val return None def _apply_parsers_to_log(parsers, log_file_uri, runner): """Have each :py:class:`LogParser` in *parsers* try to find an error in the contents of *log_file_uri* """ for parser in parsers: if parser.LOG_NAME_RE.match(log_file_uri): log_lines = runner.cat(log_file_uri) if not log_lines: continue lines = parser.parse(log_lines) if lines is not None: return { 'lines': lines, 'log_file_uri': log_file_uri, 'input_uri': None, } return None def _scan_for_input_uri(log_file_uri, runner): """Scan the syslog file corresponding to log_file_uri for information about the input file. Helper function for :py:func:`scan_task_attempt_logs()` """ syslog_uri = posixpath.join( posixpath.dirname(log_file_uri), 'syslog') syslog_lines = runner.cat(syslog_uri) if syslog_lines: log.debug('scanning %s for input URI' % syslog_uri) return find_input_uri_for_mapper(syslog_lines) else: return None def _make_sorting_func(regexp, sort_key_func): def sorting_func(log_file_uris): """Sort *log_file_uris* matching *regexp* according to *sort_key_func* :return: [(sort_key, info, log_file_uri)] """ relevant_logs = [] # list of (sort key, info, URI) for log_file_uri in log_file_uris: match = regexp.match(log_file_uri) if not match: continue info = match.groupdict() sort_key = sort_key_func(info) relevant_logs.append((sort_key, info, log_file_uri)) relevant_logs.sort(reverse=True) return relevant_logs return sorting_func def processing_order(): """Define a mapping and order for the log parsers. Returns tuples of ``(LOG_TYPE, sort_function, [parser])``, where *sort_function* takes a list of log URIs and returns a list of tuples ``(sort_key, info, log_file_uri)``. :return: [(LOG_TYPE, sort_function, [LogParser])] """ task_attempt_sort = _make_sorting_func(TASK_ATTEMPTS_LOG_URI_RE, make_task_attempt_log_sort_key) step_sort = _make_sorting_func(STEP_LOG_URI_RE, make_step_log_sort_key) emr_job_sort = _make_sorting_func(EMR_JOB_LOG_URI_RE, make_job_log_sort_key) hadoop_job_sort = _make_sorting_func(HADOOP_JOB_LOG_URI_RE, make_job_log_sort_key) return [ # give priority to task-attempts/ logs as they contain more useful # error messages. this may take a while. 
(TASK_ATTEMPT_LOGS, task_attempt_sort, [ PythonTracebackLogParser(), HadoopJavaStackTraceLogParser() ]), (STEP_LOGS, step_sort, [ HadoopStreamingErrorLogParser() ]), (JOB_LOGS, emr_job_sort, [ TimeoutErrorLogParser() ]), (JOB_LOGS, hadoop_job_sort, [ TimeoutErrorLogParser() ]), ] ### SORT KEY FUNCTIONS ### # prefer stderr to syslog (Python exceptions are more # helpful than Java ones) def make_task_attempt_log_sort_key(info): return (info['step_num'], info['node_type'], info['attempt_num'], info['stream'] == 'stderr', info['node_num']) def make_step_log_sort_key(info): return (info['step_num'], info['stream'] == 'stderr') def make_job_log_sort_key(info): return (info['timestamp'], info['step_num']) ### LOG PARSERS ### class LogParser(object): """Methods for parsing information from Hadoop logs""" # Log type is sometimes too general, so allow parsers to limit their logs # further by name LOG_NAME_RE = re.compile(r'.*') def parse(self, lines): """Parse one kind of error from *lines*. Return list of lines or None. :type lines: iterable of str :param lines: lines to scan for information :return: [str] or None """ raise NotImplementedError class PythonTracebackLogParser(LogParser): LOG_NAME_RE = re.compile(r'.*stderr$') def parse(self, lines): return find_python_traceback(lines) class HadoopJavaStackTraceLogParser(LogParser): def parse(self, lines): return find_hadoop_java_stack_trace(lines) class HadoopStreamingErrorLogParser(LogParser): def parse(self, lines): msg = find_interesting_hadoop_streaming_error(lines) if msg: return [msg + '\n'] else: return None class TimeoutErrorLogParser(LogParser): def parse(self, lines): n = find_timeout_error(lines) if n is not None: return ['Timeout after %d seconds\n' % n] else: return None mrjob-0.3.3.2/mrjob/parse.py0000664€q(¼€tzÕß0000004502711740642733021423 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Utilities for parsing errors, counters, and status messages.""" from functools import wraps import logging import re from urlparse import ParseResult from urlparse import urlparse as urlparse_buggy try: from cStringIO import StringIO StringIO # quiet "redefinition of unused ..." 
warning from pyflakes except ImportError: from StringIO import StringIO from mrjob.compat import uses_020_counters # match the filename of a hadoop streaming jar HADOOP_STREAMING_JAR_RE = re.compile(r'^hadoop.*streaming.*\.jar$') # match an mrjob job name (these are used to name EMR job flows) JOB_NAME_RE = re.compile(r'^(.*)\.(.*)\.(\d+)\.(\d+)\.(\d+)$') # match an mrjob step name (these are used to name steps in EMR) STEP_NAME_RE = re.compile( r'^(.*)\.(.*)\.(\d+)\.(\d+)\.(\d+): Step (\d+) of (\d+)$') log = logging.getLogger('mrjob.parse') ### URI PARSING ### # Used to parse the real netloc out of a malformed path from Python 2.5 # urlparse() NETLOC_RE = re.compile(r'//(.*?)((/.*?)?)$') def is_uri(uri): """Return True if *uri* is any sort of URI.""" return bool(urlparse(uri).scheme) def is_s3_uri(uri): """Return True if *uri* can be parsed into an S3 URI, False otherwise.""" try: parse_s3_uri(uri) return True except ValueError: return False def parse_s3_uri(uri): """Parse an S3 URI into (bucket, key) >>> parse_s3_uri('s3://walrus/tmp/') ('walrus', 'tmp/') If ``uri`` is not an S3 URI, raise a ValueError """ components = urlparse(uri) if (components.scheme not in ('s3', 's3n') or '/' not in components.path): raise ValueError('Invalid S3 URI: %s' % uri) return components.netloc, components.path[1:] @wraps(urlparse_buggy) def urlparse(*args, **kwargs): """A wrapper for :py:func:`urlparse.urlparse` that handles buckets in S3 URIs correctly. (:py:func:`~urlparse.urlparse` does this correctly on its own in Python 2.6+; this is just a patch for Python 2.5.)""" components = urlparse_buggy(*args, **kwargs) if components.netloc == '' and components.path.startswith('//'): m = NETLOC_RE.match(components.path) return ParseResult(components.scheme, m.group(1), m.group(2), components.params, components.query, components.fragment) else: return components ### OPTION PARSING ### def parse_port_range_list(range_list_str): """Parse a port range list of the form (start[:end])(,(start[:end]))*""" all_ranges = [] for range_str in range_list_str.split(','): if ':' in range_str: a, b = [int(x) for x in range_str.split(':')] all_ranges.extend(xrange(a, b + 1)) else: all_ranges.append(int(range_str)) return all_ranges def parse_key_value_list(kv_string_list, error_fmt, error_func): """Parse a list of strings like ``KEY=VALUE`` into a dictionary. :param kv_string_list: Parse a list of strings like ``KEY=VALUE`` into a dictionary. :type kv_string_list: [str] :param error_fmt: Format string accepting one ``%s`` argument which is the malformed (i.e. not ``KEY=VALUE``) string :type error_fmt: str :param error_func: Function to call when a malformed string is encountered. :type error_func: function(str) """ ret = {} for value in kv_string_list: try: k, v = value.split('=', 1) ret[k] = v except ValueError: error_func(error_fmt % (value,)) return ret ### LOG PARSING ### _HADOOP_0_20_ESCAPED_CHARS_RE = re.compile(r'\\([.(){}[\]"\\])') def counter_unescape(escaped_string): """Fix names of counters and groups emitted by Hadoop 0.20+ logs, which use escape sequences for more characters than most decoders know about (e.g. ``().``). :param escaped_string: string from a counter log line :type escaped_string: str """ escaped_string = escaped_string.decode('string_escape') escaped_string = _HADOOP_0_20_ESCAPED_CHARS_RE.sub(r'\1', escaped_string) return escaped_string def find_python_traceback(lines): """Scan a log file or other iterable for a Python traceback, and return it as a list of lines. 
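For example (a made-up traceback), the returned lines would look like::

    Traceback (most recent call last):
      File "mr_your_job.py", line 10, in <module>
        run_job()
    ValueError: bad input line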
In logs from EMR, we find python tracebacks in ``task-attempts/*/stderr`` """ # Lines to pass back representing entire error found all_tb_lines = [] # This is used to store a working list of lines in a single traceback tb_lines = [] # This is used to store a working list of non-traceback lines between the # current traceback and the previous one non_tb_lines = [] # Track whether or not we are in a traceback rather than consuming the # iterator in_traceback = False for line in lines: if in_traceback: tb_lines.append(line) # If no indentation, this is the last line of the traceback if line.lstrip() == line: in_traceback = False if line.startswith('subprocess.CalledProcessError'): # CalledProcessError may mean that the subprocess printed # errors to stderr which we can show the user all_tb_lines += non_tb_lines all_tb_lines += tb_lines # Reset all working lists tb_lines = [] non_tb_lines = [] else: if line.startswith('Traceback (most recent call last):'): tb_lines.append(line) in_traceback = True else: non_tb_lines.append(line) if all_tb_lines: return all_tb_lines else: return None def find_hadoop_java_stack_trace(lines): """Scan a log file or other iterable for a java stack trace from Hadoop, and return it as a list of lines. In logs from EMR, we find java stack traces in ``task-attempts/*/syslog`` Sample stack trace:: 2010-07-27 18:25:48,397 WARN org.apache.hadoop.mapred.TaskTracker (main): Error running child java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:270) at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:332) at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:147) at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:238) at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:255) at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:86) at org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:377) at org.apache.hadoop.mapred.Merger.merge(Merger.java:58) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:277) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2216) (We omit the "Error running child" line from the results) """ for line in lines: if line.rstrip('\r\n').endswith("Error running child"): st_lines = [] for line in lines: st_lines.append(line) for line in lines: if not line.startswith(' at '): break st_lines.append(line) return st_lines else: return None _OPENING_FOR_READING_RE = re.compile("^.*: Opening '(.*)' for reading$") def find_input_uri_for_mapper(lines): """Scan a log file or other iterable for the path of an input file for the first mapper on Hadoop. Just returns the path, or None if no match. In logs from EMR, we find python tracebacks in ``task-attempts/*/syslog`` Matching log lines look like:: 2010-07-27 17:54:54,344 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Opening 's3://yourbucket/logs/2010/07/23/log2-00077.gz' for reading """ val = None for line in lines: match = _OPENING_FOR_READING_RE.match(line) if match: val = match.group(1) return val _HADOOP_STREAMING_ERROR_RE = re.compile( r'^.*ERROR org\.apache\.hadoop\.streaming\.StreamJob \(main\): (.*)$') _HADOOP_STREAMING_ERROR_RE_2 = re.compile(r'^(.*does not exist.*)$') def find_interesting_hadoop_streaming_error(lines): """Scan a log file or other iterable for a hadoop streaming error other than "Job not Successful!". Return the error as a string, or None if nothing found. 
In logs from EMR, we find java stack traces in ``steps/*/syslog`` Example line:: 2010-07-27 19:53:35,451 ERROR org.apache.hadoop.streaming.StreamJob (main): Error launching job , Output path already exists : Output directory s3://yourbucket/logs/2010/07/23/ already exists and is not empty """ for line in lines: match = _HADOOP_STREAMING_ERROR_RE.match(line) \ or _HADOOP_STREAMING_ERROR_RE_2.match(line) if match: msg = match.group(1) if msg != 'Job not Successful!': return msg return None _MULTILINE_JOB_LOG_ERROR_RE = re.compile( r'^\w+Attempt.*?TASK_STATUS="FAILED".*?ERROR="(?P[^"]*)$') def find_job_log_multiline_error(lines): """Scan a log file for an arbitrary multi-line error. Return it as a list of lines, or None of nothing was found. Here is an example error:: MapAttempt TASK_TYPE="MAP" TASKID="task_201106280040_0001_m_000218" TASK_ATTEMPT_ID="attempt_201106280040_0001_m_000218_5" TASK_STATUS="FAILED" FINISH_TIME="1309246900665" HOSTNAME="/default-rack/ip-10-166-239-133.us-west-1.compute.internal" ERROR="Error initializing attempt_201106280040_0001_m_000218_5: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory at java.lang.ProcessBuilder.start(ProcessBuilder.java:460) at org.apache.hadoop.util.Shell.runCommand(Shell.java:149) at org.apache.hadoop.util.Shell.run(Shell.java:134) at org.apache.hadoop.fs.DF.getAvailable(DF.java:73) at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:296) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124) at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:648) at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1320) at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:956) at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1357) at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2361) Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory at java.lang.UNIXProcess.(UNIXProcess.java:148) at java.lang.ProcessImpl.start(ProcessImpl.java:65) at java.lang.ProcessBuilder.start(ProcessBuilder.java:453) ... 10 more " The first line returned will only include the text after ``ERROR="``, and discard the final line with just ``"``. These errors are parsed from jobs/\*.jar. """ for line in lines: m = _MULTILINE_JOB_LOG_ERROR_RE.match(line) if m: st_lines = [] if m.group('first_line'): st_lines.append(m.group('first_line')) for line in lines: st_lines.append(line) for line in lines: if line.strip() == '"': break st_lines.append(line) return st_lines return None _TIMEOUT_ERROR_RE = re.compile( r'.*?TASK_STATUS="FAILED".*?ERROR=".*?failed to report status for (\d+)' r' seconds.*?"') def find_timeout_error(lines): """Scan a log file or other iterable for a timeout error from Hadoop. Return the number of seconds the job ran for before timing out, or None if nothing found. In logs from EMR, we find timeouterrors in ``jobs/*.jar`` Example line:: Task TASKID="task_201010202309_0001_m_000153" TASK_TYPE="MAP" TASK_STATUS="FAILED" FINISH_TIME="1287618918658" ERROR="Task attempt_201010202309_0001_m_000153_3 failed to report status for 602 seconds. Killing!" 
""" result = None for line in lines: match = _TIMEOUT_ERROR_RE.match(line) if match: result = match.group(1) if result is None: return None else: return int(result) # recognize hadoop streaming output _COUNTER_RE = re.compile(r'^reporter:counter:([^,]*),([^,]*),(-?\d+)$') _STATUS_RE = re.compile(r'^reporter:status:(.*)$') def parse_mr_job_stderr(stderr, counters=None): """Parse counters and status messages out of MRJob output. :param data: a filehandle, a list of lines, or a str containing data :type counters: Counters so far, to update; a map from group to counter name to count. Returns a dictionary with the keys *counters*, *statuses*, *other*: - *counters*: counters so far; same format as above - *statuses*: a list of status messages encountered - *other*: lines that aren't either counters or status messages """ # For the corresponding code in Hadoop Streaming, see ``incrCounter()`` in # http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/contrib/streaming/src/java/org/apache/hadoop/streaming/PipeMapRed.java?view=markup if isinstance(stderr, str): stderr = StringIO(stderr) if counters is None: counters = {} statuses = [] other = [] for line in stderr: m = _COUNTER_RE.match(line.rstrip('\r\n')) if m: group, counter, amount_str = m.groups() counters.setdefault(group, {}) counters[group].setdefault(counter, 0) counters[group][counter] += int(amount_str) continue m = _STATUS_RE.match(line.rstrip('\r\n')) if m: statuses.append(m.group(1)) continue other.append(line) return {'counters': counters, 'statuses': statuses, 'other': other} # Match a job output line containing counter data. # The line is of the form # "Job KEY="value" KEY2="value2" ... COUNTERS="" # We just want to pull out the counter string, which varies between # Hadoop versions. _KV_EXPR = r'\s+\w+=".*?"' # this matches KEY="VALUE" _COUNTER_LINE_EXPR = r'^.*?JOBID=".*?_%s".*?COUNTERS="%s".*?$' % \ ('(?P\d+)', r'(?P.*?)') _COUNTER_LINE_RE = re.compile(_COUNTER_LINE_EXPR) # 0.18-specific # see _parse_counters_0_18 for format # A counter looks like this: groupname.countername:countervalue _COUNTER_EXPR_0_18 = r'(,|^)(?P[^,]+?)[.](?P[^,]+):(?P\d+)' _COUNTER_RE_0_18 = re.compile(_COUNTER_EXPR_0_18) # 0.20-specific # capture one group including sub-counters # these look like: {(gid)(gname)[...][...][...]...} _COUNTER_LIST_EXPR = r'(?P\[.*?\])' _GROUP_RE_0_20 = re.compile(r'{\(%s\)\(%s\)%s}' % (r'(?P.*?)', r'(?P.*?)', _COUNTER_LIST_EXPR)) # capture a single counter from a group # this is what the ... is in _COUNTER_LIST_EXPR (incl. the brackets). # it looks like: [(cid)(cname)(value)] _COUNTER_0_20_EXPR = r'\[\(%s\)\(%s\)\(%s\)\]' % (r'(?P.*?)', r'(?P.*?)', r'(?P\d+)') _COUNTER_RE_0_20 = re.compile(_COUNTER_0_20_EXPR) def _parse_counters_0_18(counter_string): # 0.18 counters look like this: # GroupName.CounterName:Value,Group1.Crackers:3,Group2.Nerf:243,... 
groups = _COUNTER_RE_0_18.finditer(counter_string) if groups is None: log.warn('Cannot parse Hadoop counter string: %s' % counter_string) for m in groups: yield m.group('group'), m.group('name'), int(m.group('value')) def _parse_counters_0_20(counter_string): # 0.20 counters look like this: # {(groupid)(groupname)[(counterid)(countername)(countervalue)][...]...} groups = _GROUP_RE_0_20.findall(counter_string) if not groups: log.warn('Cannot parse Hadoop counter string: %s' % counter_string) for group_id, group_name, counter_str in groups: matches = _COUNTER_RE_0_20.findall(counter_str) for counter_id, counter_name, counter_value in matches: try: group_name = counter_unescape(group_name) except ValueError: log.warn("Could not decode group name %s" % group_name) try: counter_name = counter_unescape(counter_name) except ValueError: log.warn("Could not decode counter name %s" % counter_name) yield group_name, counter_name, int(counter_value) def parse_hadoop_counters_from_line(line, hadoop_version=None): """Parse Hadoop counter values from a log line. The counter log line format changed significantly between Hadoop 0.18 and 0.20, so this function switches between parsers for them. :param line: log line containing counter data :type line: str :return: (counter_dict, step_num) or (None, None) """ m = _COUNTER_LINE_RE.match(line) if not m: return None, None if hadoop_version is None: # try both if hadoop_version not specified counters_1, step_num_1 = parse_hadoop_counters_from_line(line, '0.20') if counters_1: return (counters_1, step_num_1) else: return parse_hadoop_counters_from_line(line, '0.18') if uses_020_counters(hadoop_version): parse_func = _parse_counters_0_20 else: parse_func = _parse_counters_0_18 counter_substring = m.group('counters') counters = {} for group, counter, value in parse_func(counter_substring): counters.setdefault(group, {}) counters[group].setdefault(counter, 0) counters[group][counter] += int(value) return counters, int(m.group('step_num')) mrjob-0.3.3.2/mrjob/pool.py0000664€q(¼€tzÕß0000000437411717277734021273 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2012 Yelp and Contributors # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Utilities related to job-flow-pooling. This code used to be in mrjob.emr. """ from datetime import datetime from datetime import timedelta try: import boto.utils boto # quiet "redefinition of unused ..." warning from pyflakes except ImportError: # don't require boto; MRJobs don't actually need it when running # inside hadoop streaming boto = None def est_time_to_hour(job_flow, now=None): """How long before job reaches the end of the next full hour since it began. This is important for billing purposes. If it happens to be exactly a whole number of hours, we return one hour, not zero. 
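    For example (an illustrative sketch; the timings are made up)::

        # job flow started 90 minutes ago: 30 minutes left in its second
        # billable hour
        est_time_to_hour(job_flow, now=now)  # -> timedelta(0, 1800)

        # job flow started exactly 2 hours ago: a full hour, never zero
        est_time_to_hour(job_flow, now=now)  # -> timedelta(0, 3600)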
""" if now is None: now = datetime.utcnow() creationdatetime = getattr(job_flow, 'creationdatetime', None) startdatetime = getattr(job_flow, 'startdatetime', None) if creationdatetime: if startdatetime: start = datetime.strptime(startdatetime, boto.utils.ISO8601) else: start = datetime.strptime(job_flow.creationdatetime, boto.utils.ISO8601) else: # do something reasonable if creationdatetime isn't set return timedelta(minutes=60) run_time = now - start return timedelta(seconds=((-run_time).seconds % 3600.0 or 3600.0)) def pool_hash_and_name(job_flow): """Return the hash and pool name for the given job flow, or ``(None, None)`` if it isn't pooled.""" bootstrap_actions = getattr(job_flow, 'bootstrapactions', None) if bootstrap_actions: args = [arg.value for arg in bootstrap_actions[-1].args] if len(args) == 2 and args[0].startswith('pool-'): return args[0][5:], args[1] return (None, None) mrjob-0.3.3.2/mrjob/protocol.py0000664€q(¼€tzÕß0000001746311734676607022167 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Protocols are what allow :py:class:`mrjob.job.MRJob` to input and output arbitrary values, rather than just strings. We use JSON as our default protocol rather than something more powerful because we want to encourage interoperability with other languages. If you need more power, you can represent values as reprs or pickles. Also, if know that your input will always be in JSON format, consider :py:class:`JSONValueProtocol` as an alternative to :py:class:`RawValueProtocol`. Custom Protocols ^^^^^^^^^^^^^^^^ A protocol is an object with methods ``read(self, line)`` and ``write(self, key, value)``. The ``read(line)`` method takes a string and returns a 2-tuple of decoded objects, and ``write(cls, key, value)`` takes the key and value and returns the line to be passed back to Hadoop Streaming or as output. The built-in protocols use class methods instead of instance methods for legacy reasons, but you should use instance methods. For more information on using alternate protocols in your job, see :ref:`job-protocols`. """ # don't add imports here that aren't part of the standard Python library, # since MRJobs need to run in Amazon's generic EMR environment import cPickle from mrjob.util import safeeval try: import simplejson as json # preferred because of C speedups json # quiet "redefinition of unused ..." warning from pyflakes except ImportError: import json # built in to Python 2.6 and later # DEPRECATED: Abstract base class for all protocols. Now just an alias for # ``object``. HadoopStreamingProtocol = object class _ClassBasedKeyCachingProtocol(object): """Protocol that caches the last decoded key and uses class methods instead of instance methods. Do not inherit from this. 
""" _last_key_encoded = None _last_key_decoded = None @classmethod def load_from_string(self, value): raise NotImplementedError @classmethod def dump_to_string(self, value): raise NotImplementedError @classmethod def read(cls, line): """Decode a line of input. :type line: str :param line: A line of raw input to the job, without trailing newline. :return: A tuple of ``(key, value)``.""" raw_key, raw_value = line.split('\t') if raw_key != cls._last_key_encoded: cls._last_key_encoded = raw_key cls._last_key_decoded = cls.load_from_string(raw_key) return (cls._last_key_decoded, cls.load_from_string(raw_value)) @classmethod def write(cls, key, value): """Encode a key and value. :param key: A key (of any type) yielded by a mapper/reducer :param value: A value (of any type) yielded by a mapper/reducer :rtype: str :return: A line, without trailing newline.""" return '%s\t%s' % (cls.dump_to_string(key), cls.dump_to_string(value)) class JSONProtocol(_ClassBasedKeyCachingProtocol): """Encode ``(key, value)`` as two JSONs separated by a tab. Note that JSON has some limitations; dictionary keys must be strings, and there's no distinction between lists and tuples.""" @classmethod def load_from_string(cls, value): return json.loads(value) @classmethod def dump_to_string(cls, value): return json.dumps(value) class JSONValueProtocol(object): """Encode ``value`` as a JSON and discard ``key`` (``key`` is read in as ``None``). """ @classmethod def read(cls, line): return (None, json.loads(line)) @classmethod def write(cls, key, value): return json.dumps(value) class PickleProtocol(_ClassBasedKeyCachingProtocol): """Encode ``(key, value)`` as two string-escaped pickles separated by a tab. We string-escape the pickles to avoid having to deal with stray ``\\t`` and ``\\n`` characters, which would confuse Hadoop Streaming. Ugly, but should work for any type. """ @classmethod def load_from_string(cls, value): return cPickle.loads(value.decode('string_escape')) @classmethod def dump_to_string(cls, value): return cPickle.dumps(value).encode('string_escape') class PickleValueProtocol(object): """Encode ``value`` as a string-escaped pickle and discard ``key`` (``key`` is read in as ``None``). """ @classmethod def read(cls, line): return (None, cPickle.loads(line.decode('string_escape'))) @classmethod def write(cls, key, value): return cPickle.dumps(value).encode('string_escape') # This was added in 0.3, so no @classmethod for backwards compatibility class RawProtocol(object): """Encode ``(key, value)`` as ``key`` and ``value`` separated by a tab (``key`` and ``value`` should be bytestrings). If ``key`` or ``value`` is ``None``, don't include a tab. When decoding a line with no tab in it, ``value`` will be ``None``. When reading from a line with multiple tabs, we break on the first one. Your key should probably not be ``None`` or have tab characters in it, but we don't check. """ def read(cls, line): key_value = line.split('\t', 1) if len(key_value) == 1: key_value.append(None) return tuple(key_value) def write(cls, key, value): return '\t'.join(x for x in (key, value) if x is not None) class RawValueProtocol(object): """Read in a line as ``(None, line)``. Write out ``(key, value)`` as ``value``. ``value`` must be a ``str``. The default way for a job to read its initial input. """ @classmethod def read(cls, line): return (None, line) @classmethod def write(cls, key, value): return value class ReprProtocol(_ClassBasedKeyCachingProtocol): """Encode ``(key, value)`` as two reprs separated by a tab. 
This only works for basic types (we use :py:func:`mrjob.util.safeeval`). """ @classmethod def load_from_string(cls, value): return safeeval(value) @classmethod def dump_to_string(cls, value): return repr(value) class ReprValueProtocol(object): """Encode ``value`` as a repr and discard ``key`` (``key`` is read in as None). This only works for basic types (we use :py:func:`mrjob.util.safeeval`). """ @classmethod def read(cls, line): return (None, safeeval(line)) @classmethod def write(cls, key, value): return repr(value) #: .. deprecated:: 0.3.0 #: #: Formerly the default protocol for all encoded input and output: ``'json'`` DEFAULT_PROTOCOL = 'json' #: .. deprecated:: 0.3.0 #: #: Default mapping from protocol name to class: #: #: ============ =============================== #: name class #: ============ =============================== #: json :py:class:`JSONProtocol` #: json_value :py:class:`JSONValueProtocol` #: pickle :py:class:`PickleProtocol` #: pickle_value :py:class:`PickleValueProtocol` #: raw_value :py:class:`RawValueProtocol` #: repr :py:class:`ReprProtocol` #: repr_value :py:class:`ReprValueProtocol` #: ============ =============================== PROTOCOL_DICT = { 'json': JSONProtocol, 'json_value': JSONValueProtocol, 'pickle': PickleProtocol, 'pickle_value': PickleValueProtocol, 'raw_value': RawValueProtocol, 'repr': ReprProtocol, 'repr_value': ReprValueProtocol, } mrjob-0.3.3.2/mrjob/retry.py0000664€q(¼€tzÕß0000000701411706610131021435 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """A simple wrapper for retrying when we get transient errors.""" import logging import time log = logging.getLogger('mrjob.retry') class RetryWrapper(object): """This class can wrap any object. The wrapped object will behave like the original one, except that if you call a function and it raises a retriable exception, we'll back off for a certain number of seconds and call the function again, until it succeeds or we get a non-retriable exception. """ def __init__(self, wrapped, retry_if, backoff=15, multiplier=1.5, max_tries=10): """ Wrap the given object :param wrapped: the object to wrap. this should work for functions as well as regular objects (essentially, we're wrapping ``__call__``) :param retry_if: a method that takes an exception, and returns whether we should retry :type backoff: float :param backoff: the number of seconds to wait the first time we get a retriable error. :type multiplier: float :param multiplier: if we retry multiple times, the amount to multiply the backoff time by every time we get an error :type max_tries: int :param max_tries: how many tries we get. 
``0`` means to keep trying forever """ self.__wrapped = wrapped self.__retry_if = retry_if self.__backoff = backoff if self.__backoff <= 0: raise ValueError('backoff must be positive') self.__multiplier = multiplier if self.__multiplier < 1: raise ValueError('multiplier must be at least one!') self.__max_tries = max_tries def __getattr__(self, name): """The glue that makes functions retriable, and returns other attributes from the wrapped object as-is.""" x = getattr(self.__wrapped, name) if hasattr(x, '__call__'): return self.__wrap_method_with_call_and_maybe_retry(x) else: return x def __wrap_method_with_call_and_maybe_retry(self, f): """Wrap method f in a retry loop.""" def call_and_maybe_retry(*args, **kwargs): backoff = self.__backoff tries = 0 while (not self.__max_tries or tries < self.__max_tries): try: return f(*args, **kwargs) except Exception, ex: if self.__retry_if(ex): log.info('Got retriable error: %r' % ex) log.info('Backing off for %.1f seconds' % backoff) time.sleep(backoff) tries += 1 backoff *= self.__multiplier else: raise # pretend to be the original function call_and_maybe_retry.__name__ == f.__name__ return call_and_maybe_retry mrjob-0.3.3.2/mrjob/runner.py0000664€q(¼€tzÕß0000014554211740642733021625 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp and Contributors # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. from __future__ import with_statement """Base class for all runners.""" import copy import datetime import getpass import glob import hashlib import logging import os import random import re import shutil import sys from subprocess import CalledProcessError from subprocess import Popen from subprocess import PIPE from subprocess import check_call import tempfile try: from cStringIO import StringIO StringIO # quiet "redefinition of unused ..." warning from pyflakes except ImportError: from StringIO import StringIO from mrjob import compat from mrjob.conf import calculate_opt_priority from mrjob.conf import combine_cmds from mrjob.conf import combine_dicts from mrjob.conf import combine_envs from mrjob.conf import combine_local_envs from mrjob.conf import combine_lists from mrjob.conf import combine_opts from mrjob.conf import combine_paths from mrjob.conf import combine_path_lists from mrjob.conf import load_opts_from_mrjob_conf from mrjob.util import cmd_line from mrjob.util import file_ext from mrjob.util import read_file from mrjob.util import tar_and_gzip log = logging.getLogger('mrjob.runner') # use to detect globs and break into the part before and after the glob GLOB_RE = re.compile(r'^(.*?)([\[\*\?].*)$') #: cleanup options: #: #: * ``'ALL'``: delete local scratch, remote scratch, and logs #: * ``'LOCAL_SCRATCH'``: delete local scratch only #: * ``'LOGS'``: delete logs only #: * ``'NONE'``: delete nothing #: * ``'REMOTE_SCRATCH'``: delete remote scratch only #: * ``'SCRATCH'``: delete local and remote scratch, but not logs #: * ``'IF_SUCCESSFUL'`` (deprecated): same as ``ALL``. Not supported for #: ``cleanup_on_failure``. 
CLEANUP_CHOICES = ['ALL', 'LOCAL_SCRATCH', 'LOGS', 'NONE', 'REMOTE_SCRATCH', 'SCRATCH', 'IF_SUCCESSFUL'] #: .. deprecated:: 0.3.0 #: #: the default cleanup-on-success option: ``'IF_SUCCESSFUL'`` CLEANUP_DEFAULT = 'IF_SUCCESSFUL' _STEP_RE = re.compile(r'^M?C?R?$') # buffer for piping files into sort on Windows _BUFFER_SIZE = 4096 class MRJobRunner(object): """Abstract base class for all runners. Runners are responsible for launching your job on Hadoop Streaming and fetching the results. Most of the time, you won't have any reason to construct a runner directly; it's more like a utility that allows an :py:class:`~mrjob.job.MRJob` to run itself. Normally things work something like this: * Get a runner by calling :py:meth:`~mrjob.job.MRJob.make_runner` on your job * Call :py:meth:`~mrjob.runner.MRJobRunner.run` on your runner. This will: * Run your job with :option:`--steps` to find out how many mappers/reducers to run * Copy your job and supporting files to Hadoop * Instruct Hadoop to run your job with the appropriate :option:`--mapper`, :option:`--combiner`, :option:`--reducer`, and :option:`--step-num` arguments Each runner runs a single job once; if you want to run a job multiple times, make multiple runners. Subclasses: :py:class:`~mrjob.emr.EMRJobRunner`, :py:class:`~mrjob.hadoop.HadoopJobRunner`, :py:class:`~mrjob.inline.InlineJobRunner`, :py:class:`~mrjob.local.LocalMRJobRunner` """ #: alias for this runner; used for picking section of #: :py:mod:``mrjob.conf`` to load one of ``'local'``, ``'emr'``, #: or ``'hadoop'`` alias = None ### methods to call from your batch script ### def __init__(self, mr_job_script=None, conf_path=None, extra_args=None, file_upload_args=None, hadoop_input_format=None, hadoop_output_format=None, input_paths=None, output_dir=None, partitioner=None, stdin=None, **opts): """All runners take the following keyword arguments: :type mr_job_script: str :param mr_job_script: the path of the ``.py`` file containing the :py:class:`~mrjob.job.MRJob`. If this is None, you won't actually be able to :py:meth:`run` the job, but other utilities (e.g. :py:meth:`ls`) will work. :type conf_path: str :param conf_path: Alternate path to read configs from, or ``False`` to ignore all config files. :type extra_args: list of str :param extra_args: a list of extra cmd-line arguments to pass to the mr_job script. This is a hook to allow jobs to take additional arguments. :param file_upload_args: a list of tuples of ``('--ARGNAME', path)``. The file at the given path will be uploaded to the local directory of the mr_job script when it runs, and then passed into the script with ``--ARGNAME``. Useful for passing in SQLite DBs and other configuration files to your job. :type hadoop_input_format: str :param hadoop_input_format: name of an optional Hadoop ``InputFormat`` class. Passed to Hadoop along with your first step with the ``-inputformat`` option. Note that if you write your own class, you'll need to include it in your own custom streaming jar (see *hadoop_streaming_jar*). :type hadoop_output_format: str :param hadoop_output_format: name of an optional Hadoop ``OutputFormat`` class. Passed to Hadoop along with your first step with the ``-outputformat`` option. Note that if you write your own class, you'll need to include it in your own custom streaming jar (see *hadoop_streaming_jar*). :type input_paths: list of str :param input_paths: Input files for your job. Supports globs and recursively walks directories (e.g. ``['data/common/', 'data/training/*.gz']``). 
If this is left blank, we'll read from stdin :type output_dir: str :param output_dir: an empty/non-existent directory where Hadoop streaming should put the final output from the job. If you don't specify an output directory, we'll output into a subdirectory of this job's temporary directory. You can control this from the command line with ``--output-dir``. :type partitioner: str :param partitioner: Optional name of a Hadoop partitoner class, e.g. ``'org.apache.hadoop.mapred.lib.HashPartitioner'``. Hadoop streaming will use this to determine how mapper output should be sorted and distributed to reducers. :param stdin: an iterable (can be a ``StringIO`` or even a list) to use as stdin. This is a hook for testing; if you set ``stdin`` via :py:meth:`~mrjob.job.MRJob.sandbox`, it'll get passed through to the runner. If for some reason your lines are missing newlines, we'll add them; this makes it easier to write automated tests. All runners also take the following options as keyword arguments. These can be defaulted in your :mod:`mrjob.conf` file: :type base_tmp_dir: str :param base_tmp_dir: path to put local temp dirs inside. By default we just call :py:func:`tempfile.gettempdir` :type bootstrap_mrjob: bool :param bootstrap_mrjob: should we automatically tar up the mrjob library and install it when we run the mrjob? Set this to ``False`` if you've already installed ``mrjob`` on your Hadoop cluster. :type cleanup: list :param cleanup: List of which kinds of directories to delete when a job succeeds. See :py:data:`.CLEANUP_CHOICES`. :type cleanup_on_failure: list :param cleanup_on_failure: Which kinds of directories to clean up when a job fails. See :py:data:`.CLEANUP_CHOICES`. :type cmdenv: dict :param cmdenv: environment variables to pass to the job inside Hadoop streaming :type hadoop_extra_args: list of str :param hadoop_extra_args: extra arguments to pass to hadoop streaming :type hadoop_streaming_jar: str :param hadoop_streaming_jar: path to a custom hadoop streaming jar. :type jobconf: dict :param jobconf: ``-jobconf`` args to pass to hadoop streaming. This should be a map from property name to value. Equivalent to passing ``['-jobconf', 'KEY1=VALUE1', '-jobconf', 'KEY2=VALUE2', ...]`` to *hadoop_extra_args*. :type label: str :param label: description of this job to use as the part of its name. By default, we use the script's module name, or ``no_script`` if there is none. :type owner: str :param owner: who is running this job. Used solely to set the job name. By default, we use :py:func:`getpass.getuser`, or ``no_user`` if it fails. :type python_archives: list of str :param python_archives: same as upload_archives, except they get added to the job's :envvar:`PYTHONPATH` :type python_bin: str :param python_bin: Name/path of alternate python binary for mappers/reducers (e.g. for use with :py:mod:`virtualenv`). Defaults to ``'python'``. :type setup_cmds: list :param setup_cmds: a list of commands to run before each mapper/reducer step (e.g. ``['cd my-src-tree; make', 'mkdir -p /tmp/foo']``). You can specify commands as strings, which will be run through the shell, or lists of args, which will be invoked directly. We'll use file locking to ensure that multiple mappers/reducers running on the same node won't run *setup_cmds* simultaneously (it's safe to run ``make``). :type setup_scripts: list of str :param setup_scripts: files that will be copied into the local working directory and then run. These are run after *setup_cmds*. 
Like with *setup_cmds*, we use file locking to keep multiple mappers/reducers on the same node from running *setup_scripts* simultaneously. :type steps_python_bin: str :param steps_python_bin: Name/path of alternate python binary to use to query the job about its steps (e.g. for use with :py:mod:`virtualenv`). Rarely needed. Defaults to ``sys.executable`` (the current Python interpreter). :type upload_archives: list of str :param upload_archives: a list of archives (e.g. tarballs) to unpack in the local directory of the mr_job script when it runs. You can set the local name of the dir we unpack into by appending ``#localname`` to the path; otherwise we just use the name of the archive file (e.g. ``foo.tar.gz``) :type upload_files: list of str :param upload_files: a list of files to copy to the local directory of the mr_job script when it runs. You can set the local name of the dir we unpack into by appending ``#localname`` to the path; otherwise we just use the name of the file """ self._set_opts(opts, conf_path) # we potentially have a lot of files to copy, so we keep track # of them as a list of dictionaries, with the following keys: # # 'path': the path to the file on the local system # 'name': a unique name for the file when we copy it into HDFS etc. # if this is blank, we'll pick one # 'cache': if 'file', copy into mr_job_script's working directory # on the Hadoop nodes. If 'archive', uncompress the file self._files = [] self._validate_cleanup() # add the script to our list of files (don't actually commit to # uploading it) if mr_job_script: self._script = {'path': mr_job_script} self._files.append(self._script) self._ran_job = False else: self._script = None self._ran_job = True # don't allow user to call run() # setup cmds and wrapper script self._setup_scripts = [] for path in self._opts['setup_scripts']: file_dict = self._add_file_for_upload(path) self._setup_scripts.append(file_dict) # we'll create the wrapper script later self._wrapper_script = None # extra args to our job self._extra_args = list(extra_args) if extra_args else [] # extra file arguments to our job self._file_upload_args = [] if file_upload_args: for arg, path in file_upload_args: file_dict = self._add_file_for_upload(path) self._file_upload_args.append((arg, file_dict)) # set up uploading for path in self._opts['upload_archives']: self._add_archive_for_upload(path) for path in self._opts['upload_files']: self._add_file_for_upload(path) # set up python archives self._python_archives = [] for path in self._opts['python_archives']: self._add_python_archive(path) # where to read input from (log files, etc.) self._input_paths = input_paths or ['-'] # by default read from stdin self._stdin = stdin or sys.stdin self._stdin_path = None # temp file containing dump from stdin # where a tarball of the mrjob library is stored locally self._mrjob_tar_gz_path = None # store output_dir self._output_dir = output_dir # store partitioner self._partitioner = partitioner # store hadoop input and output formats self._hadoop_input_format = hadoop_input_format self._hadoop_output_format = hadoop_output_format # give this job a unique name self._job_name = self._make_unique_job_name( label=self._opts['label'], owner=self._opts['owner']) # a local tmp directory that will be cleaned up when we're done # access/make this using self._get_local_tmp_dir() self._local_tmp_dir = None # info about our steps. 
this is basically a cache for self._get_steps() self._steps = None # if this is True, we have to pipe input into the sort command # rather than feed it multiple files self._sort_is_windows_sort = None def _set_opts(self, opts, conf_path): # enforce correct arguments allowed_opts = set(self._allowed_opts()) unrecognized_opts = set(opts) - allowed_opts if unrecognized_opts: log.warn('got unexpected keyword arguments: ' + ', '.join(sorted(unrecognized_opts))) opts = dict((k, v) for k, v in opts.iteritems() if k in allowed_opts) # issue a warning for unknown opts from mrjob.conf and filter them out unsanitized_opt_dicts = load_opts_from_mrjob_conf( self.alias, conf_path=conf_path) sanitized_opt_dicts = [] for path, mrjob_conf_opts in unsanitized_opt_dicts: unrecognized_opts = set(mrjob_conf_opts) - allowed_opts if unrecognized_opts: log.warn('got unexpected opts from %s: %s' % ( path, ', '.join(sorted(unrecognized_opts)))) new_opts = dict((k, v) for k, v in mrjob_conf_opts.iteritems() if k in allowed_opts) sanitized_opt_dicts.append(new_opts) else: sanitized_opt_dicts.append(mrjob_conf_opts) # make sure all opts are at least set to None blank_opts = dict((key, None) for key in allowed_opts) # combine all of these options # only __init__() methods should modify self._opts! opt_dicts = ( [blank_opts, self._default_opts()] + sanitized_opt_dicts + [opts] ) self._opts = self.combine_opts(*opt_dicts) self._opt_priority = calculate_opt_priority(self._opts, opt_dicts) def _validate_cleanup(self): # old API accepts strings for cleanup # new API wants lists for opt_key in ('cleanup', 'cleanup_on_failure'): if isinstance(self._opts[opt_key], basestring): self._opts[opt_key] = [self._opts[opt_key]] def validate_cleanup(error_str, opt_list): for choice in opt_list: if choice not in CLEANUP_CHOICES: raise ValueError(error_str % choice) if 'NONE' in opt_list and len(set(opt_list)) > 1: raise ValueError( 'Cannot clean up both nothing and something!') cleanup_error = ('cleanup must be one of %s, not %%s' % ', '.join(CLEANUP_CHOICES)) validate_cleanup(cleanup_error, self._opts['cleanup']) if 'IF_SUCCESSFUL' in self._opts['cleanup']: log.warning( 'IF_SUCCESSFUL is deprecated and will be removed in mrjob 0.4.' ' Use ALL instead.') cleanup_failure_error = ( 'cleanup_on_failure must be one of %s, not %%s' % ', '.join(CLEANUP_CHOICES)) validate_cleanup(cleanup_failure_error, self._opts['cleanup_on_failure']) if 'IF_SUCCESSFUL' in self._opts['cleanup_on_failure']: raise ValueError( 'IF_SUCCESSFUL is not supported for cleanup_on_failure.' 
' Use NONE instead.') @classmethod def _allowed_opts(cls): """A list of the options that can be passed to :py:meth:`__init__` *and* can be defaulted from :mod:`mrjob.conf`.""" return [ 'base_tmp_dir', 'bootstrap_mrjob', 'cleanup', 'cleanup_on_failure', 'cmdenv', 'hadoop_extra_args', 'hadoop_streaming_jar', 'hadoop_version', 'jobconf', 'label', 'owner', 'python_archives', 'python_bin', 'setup_cmds', 'setup_scripts', 'steps_python_bin', 'upload_archives', 'upload_files', ] @classmethod def _default_opts(cls): """A dictionary giving the default value of options.""" # getpass.getuser() isn't available on all systems, and may fail try: owner = getpass.getuser() except: owner = None return { 'base_tmp_dir': tempfile.gettempdir(), 'bootstrap_mrjob': True, 'cleanup': ['ALL'], 'cleanup_on_failure': ['NONE'], 'hadoop_version': '0.20', 'owner': owner, 'python_bin': ['python'], 'steps_python_bin': [sys.executable or 'python'], } @classmethod def _opts_combiners(cls): """Map from option name to a combine_*() function used to combine values for that option. This allows us to specify that some options are lists, or contain environment variables, or whatever.""" return { 'base_tmp_dir': combine_paths, 'cmdenv': combine_envs, 'hadoop_extra_args': combine_lists, 'jobconf': combine_dicts, 'python_archives': combine_path_lists, 'python_bin': combine_cmds, 'setup_cmds': combine_lists, 'setup_scripts': combine_path_lists, 'steps_python_bin': combine_cmds, 'upload_archives': combine_path_lists, 'upload_files': combine_path_lists, } @classmethod def combine_opts(cls, *opts_list): """Combine options from several sources (e.g. defaults, mrjob.conf, command line). Options later in the list take precedence. You don't need to re-implement this in a subclass """ return combine_opts(cls._opts_combiners(), *opts_list) ### Running the job and parsing output ### def run(self): """Run the job, and block until it finishes. Raise an exception if there are any problems. """ assert not self._ran_job self._run() self._ran_job = True def stream_output(self): """Stream raw lines from the job's output. You can parse these using the read() method of the appropriate HadoopStreamingProtocol class.""" assert self._ran_job output_dir = self.get_output_dir() log.info('Streaming final output from %s' % output_dir) def split_path(path): while True: base, name = os.path.split(path) # no more elements if not name: break yield name path = base for filename in self.ls(output_dir): subpath = filename[len(output_dir):] if not any(name.startswith('_') for name in split_path(subpath)): for line in self._cat_file(filename): yield line def _cleanup_local_scratch(self): """Cleanup any files/directories on the local machine we created while running this job. Should be safe to run this at any time, or multiple times. This particular function removes any local tmp directories added to the list self._local_tmp_dirs This won't remove output_dir if it's outside of our scratch dir. """ if self._local_tmp_dir: log.info('removing tmp directory %s' % self._local_tmp_dir) try: shutil.rmtree(self._local_tmp_dir) except OSError, e: log.exception(e) self._local_tmp_dir = None def _cleanup_remote_scratch(self): """Cleanup any files/directories on the remote machine (S3) we created while running this job. Should be safe to run this at any time, or multiple times. """ pass # this only happens on EMR def _cleanup_logs(self): """Cleanup any log files that are created as a side-effect of the job. 
""" pass # this only happens on EMR def _cleanup_jobs(self): """Stop any jobs that we created that are still running.""" pass # this only happens on EMR def cleanup(self, mode=None): """Clean up running jobs, scratch dirs, and logs, subject to the *cleanup* option passed to the constructor. If you create your runner in a :keyword:`with` block, :py:meth:`cleanup` will be called automatically:: with mr_job.make_runner() as runner: ... # cleanup() called automatically here :param mode: override *cleanup* passed into the constructor. Should be a list of strings from :py:data:`CLEANUP_CHOICES` """ if self._ran_job: mode = mode or self._opts['cleanup'] else: mode = mode or self._opts['cleanup_on_failure'] # always terminate running jobs self._cleanup_jobs() def mode_has(*args): return any((choice in mode) for choice in args) if mode_has('ALL', 'SCRATCH', 'LOCAL_SCRATCH', 'IF_SUCCESSFUL'): self._cleanup_local_scratch() if mode_has('ALL', 'SCRATCH', 'REMOTE_SCRATCH', 'IF_SUCCESSFUL'): self._cleanup_remote_scratch() if mode_has('ALL', 'LOGS', 'IF_SUCCESSFUL'): self._cleanup_logs() def counters(self): """Get counters associated with this run in this form:: [{'group name': {'counter1': 1, 'counter2': 2}}, {'group name': ...}] The list contains an entry for every step of the current job, ignoring earlier steps in the same job flow. """ raise NotImplementedError def print_counters(self, limit_to_steps=None): """Display this run's counters in a user-friendly way. :type first_step_num: int :param first_step_num: Display step number of the counters from the first step :type limit_to_steps: list of int :param limit_to_steps: List of step numbers *relative to this job* to print, indexed from 1 """ for step_num, step_counters in enumerate(self.counters()): step_num = step_num + 1 if limit_to_steps is None or step_num in limit_to_steps: log.info('Counters from step %d:' % step_num) if step_counters.keys(): for group_name in sorted(step_counters.keys()): log.info(' %s:' % group_name) group_counters = step_counters[group_name] for counter_name in sorted(group_counters.keys()): log.info(' %s: %d' % ( counter_name, group_counters[counter_name])) else: log.info(' (no counters found)') ### hooks for the with statement ### def __enter__(self): """Don't do anything special at start of with block""" return self def __exit__(self, type, value, traceback): """Call self.cleanup() at end of with block.""" self.cleanup() ### more runner information ### def get_opts(self): """Get options set for this runner, as a dict.""" return copy.deepcopy(self._opts) @classmethod def get_default_opts(self): """Get default options for this runner class, as a dict.""" blank_opts = dict((key, None) for key in self._allowed_opts()) return self.combine_opts(blank_opts, self._default_opts()) def get_job_name(self): """Get the unique name for the job run by this runner. This has the format ``label.owner.date.time.microseconds`` """ return self._job_name ### file management utilties ### # Some simple filesystem operations that work for all runners. # To access files on HDFS (when using # :py:class:``~mrjob.hadoop.HadoopJobRunner``) and S3 (when using # ``~mrjob.emr.EMRJobRunner``), use ``hdfs://...`` and ``s3://...``, # respectively. # We don't currently support ``mv()`` and ``cp()`` because S3 doesn't # really have directories, so the semantics get a little weird. # Some simple filesystem operations that are easy to implement. 
# We don't support mv() and cp() because they don't totally make sense # on S3, which doesn't really have moves or directories! def get_output_dir(self): """Find the directory containing the job output. If the job hasn't run yet, returns None""" if not self._ran_job: return None return self._output_dir def du(self, path_glob): """Get the total size of files matching ``path_glob`` Corresponds roughly to: ``hadoop fs -dus path_glob`` """ return sum(os.path.getsize(path) for path in self.ls(path_glob)) def ls(self, path_glob): """Recursively list all files in the given path. We don't return directories for compatibility with S3 (which has no concept of them) Corresponds roughly to: ``hadoop fs -lsr path_glob`` """ for path in glob.glob(path_glob): if os.path.isdir(path): for dirname, _, filenames in os.walk(path): for filename in filenames: yield os.path.join(dirname, filename) else: yield path def cat(self, path): """cat output from a given path. This would automatically decompress .gz and .bz2 files. Corresponds roughly to: ``hadoop fs -cat path`` """ for filename in self.ls(path): for line in self._cat_file(filename): yield line def mkdir(self, path): """Create the given dir and its subdirs (if they don't already exist). Corresponds roughly to: ``hadoop fs -mkdir path`` """ if not os.path.isdir(path): os.makedirs(path) def path_exists(self, path_glob): """Does the given path exist? Corresponds roughly to: ``hadoop fs -test -e path_glob`` """ return bool(glob.glob(path_glob)) def path_join(self, dirname, filename): """Join a directory name and filename.""" return os.path.join(dirname, filename) def rm(self, path_glob): """Recursively delete the given file/directory, if it exists Corresponds roughly to: ``hadoop fs -rmr path_glob`` """ for path in glob.glob(path_glob): if os.path.isdir(path): log.debug('Recursively deleting %s' % path) shutil.rmtree(path) else: log.debug('Deleting %s' % path) os.remove(path) def touchz(self, path): """Make an empty file in the given location. Raises an error if a non-zero length file already exists in that location. Correponds to: ``hadoop fs -touchz path`` """ if os.path.isfile(path) and os.path.getsize(path) != 0: raise OSError('Non-empty file %r already exists!' % (path,)) # zero out the file open(path, 'w').close() def _md5sum_file(self, fileobj, block_size=(512 ** 2)): # 256K default md5 = hashlib.md5() while True: data = fileobj.read(block_size) if not data: break md5.update(data) return md5.hexdigest() def md5sum(self, path): """Generate the md5 sum of the file at ``path``""" with open(path, 'rb') as f: return self._md5sum_file(f) ### other methods you need to implement in your subclass ### def get_hadoop_version(self): """Return the version number of the Hadoop environment as a string if Hadoop is being used or simulated. Return None if not applicable. :py:class:`~mrjob.emr.EMRJobRunner` infers this from the job flow. :py:class:`~mrjob.hadoop.HadoopJobRunner` gets this from ``hadoop version``. :py:class:`~mrjob.local.LocalMRJobRunner` has an additional `hadoop_version` option to specify which version it simulates, with a default of 0.20. :py:class:`~mrjob.inline.InlineMRJobRunner` does not simulate Hadoop at all. 
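    # Quick, hedged demo of the local implementations of the filesystem
    # helpers above (paths are examples; the Hadoop/EMR runners override these
    # for HDFS and S3). Assumes LocalMRJobRunner can be constructed standalone.
    #
    #     from mrjob.local import LocalMRJobRunner
    #
    #     runner = LocalMRJobRunner(conf_path=False)
    #     runner.mkdir('/tmp/mrjob_fs_demo/sub')
    #     runner.touchz('/tmp/mrjob_fs_demo/sub/empty')
    #
    #     print runner.path_exists('/tmp/mrjob_fs_demo/*')  # True
    #     print list(runner.ls('/tmp/mrjob_fs_demo'))  # [...] the one empty file
    #     print runner.du('/tmp/mrjob_fs_demo')        # 0 (one zero-length file)
    #     runner.rm('/tmp/mrjob_fs_demo')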
""" return None # you'll probably wan't to add your own __init__() and cleanup() as well def _run(self): """Run the job.""" raise NotImplementedError def _cat_file(self, filename): """cat a file, decompress if necessary.""" for line in read_file(filename): yield line ### internal utilities for implementing MRJobRunners ### def _split_path(self, path): """Split a path like /foo/bar.py#baz.py into (path, name) (in this case: '/foo/bar.py', 'baz.py'). It's valid to specify no name with something like '/foo/bar.py#' In practice this means that we'll pick a name. """ if '#' in path: path, name = path.split('#', 1) if '/' in name or '#' in name: raise ValueError('Bad name %r; must not contain # or /' % name) # empty names are okay else: name = os.path.basename(path) return name, path def _add_file(self, path): """Add a file that's uploaded, but not added to the working dir for *mr_job_script*. You probably want _add_for_upload() in most cases """ name, path = self._split_path(path) file_dict = {'path': path, 'name': name} self._files.append(file_dict) return file_dict def _add_for_upload(self, path, what): """Add a file to our list of files to copy into the working dir for *mr_job_script*. path -- path to the file on the local filesystem. Normally we just use the file's name as it's remote name. You can use a # character to pick a different name for the file: /foo/bar#baz -> upload /foo/bar as baz /foo/bar# -> upload /foo/bar, pick any name for it upload -- either 'file' (just copy) or 'archive' (uncompress) Returns: The internal dictionary representing the file (in case we want to point to it). """ name, path = self._split_path(path) file_dict = {'path': path, 'name': name, 'upload': what} self._files.append(file_dict) return file_dict def _add_file_for_upload(self, path): return self._add_for_upload(path, 'file') def _add_archive_for_upload(self, path): return self._add_for_upload(path, 'archive') def _add_python_archive(self, path): file_dict = self._add_archive_for_upload(path) self._python_archives.append(file_dict) def _get_cmdenv(self): """Get the environment variables to use inside Hadoop. These should be `self._opts['cmdenv']` combined with python archives added to :envvar:`PYTHONPATH`. This function calls :py:meth:`MRJobRunner._name_files` (since we need to know where each python archive ends up in the job's working dir) """ self._name_files() # on Windows, PYTHONPATH should be separated by ;, not : cmdenv_combiner = self._opts_combiners()['cmdenv'] envs_to_combine = ([{'PYTHONPATH': file_dict['name']} for file_dict in self._python_archives] + [self._opts['cmdenv']]) return cmdenv_combiner(*envs_to_combine) def _assign_unique_names_to_files(self, name_field, prefix='', match=None): """Go through self._files, and fill in name_field for all files where it's not already filled, so that every file has a unique value for name_field. We'll try to give the file the same name as its local path (and we'll definitely keep the extension the same). Args: name_field -- field to fill in (e.g. 'name', 's3_uri', hdfs_uri') prefix -- prefix to prepend to each name (e.g. a path to a tmp dir) match -- a function that returns a true value if the path should just be copied verbatim to the name (for example if we're assigning HDFS uris and the path starts with 'hdfs://'). """ # handle files that are already on S3, HDFS, etc. 
if match: for file_dict in self._files: path = file_dict['path'] if match(path) and not file_dict.get(name_field): file_dict[name_field] = path # check for name collisions name_to_path = {} for file_dict in self._files: name = file_dict.get(name_field) if name: path = file_dict['path'] if name in name_to_path and path != name_to_path[name]: raise ValueError("Can't copy both %s and %s to %s" % (path, name_to_path[name], name)) name_to_path[name] = path # give names to files that don't have them for file_dict in self._files: if not file_dict.get(name_field): path = file_dict['path'] basename = os.path.basename(path) name = prefix + basename # if name is taken, prepend some random stuff to it while name in name_to_path: name = prefix + '%08x-%s' % ( random.randint(0, 2 ** 32 - 1), basename) file_dict[name_field] = name name_to_path[name] = path # reserve this name def _name_files(self): """Fill in the 'name' field for every file in self._files so that they all have unique names. It's safe to run this method as many times as you want. """ self._assign_unique_names_to_files('name') def _get_local_tmp_dir(self): """Create a tmp directory on the local filesystem that will be cleaned up by self.cleanup()""" if not self._local_tmp_dir: path = os.path.join(self._opts['base_tmp_dir'], self._job_name) log.info('creating tmp directory %s' % path) os.makedirs(path) self._local_tmp_dir = path return self._local_tmp_dir def _make_unique_job_name(self, label=None, owner=None): """Come up with a useful unique ID for this job. We use this to choose the output directory, etc. for the job. """ # use the name of the script if one wasn't explicitly # specified if not label: if self._script: label = os.path.basename( self._script['path']).split('.')[0] else: label = 'no_script' if not owner: owner = 'no_user' now = datetime.datetime.utcnow() return '%s.%s.%s.%06d' % ( label, owner, now.strftime('%Y%m%d.%H%M%S'), now.microsecond) def _get_steps(self): """Call the mr_job to find out how many steps it has, and whether there are mappers and reducers for each step. Validate its output. Returns output like ['MR', 'M'] (two steps, second only has a mapper) We'll cache the result (so you can call _get_steps() as many times as you want) """ if self._steps is None: if not self._script: self._steps = [] else: args = (self._opts['steps_python_bin'] + [self._script['path'], '--steps'] + self._mr_job_extra_args(local=True)) log.debug('> %s' % cmd_line(args)) # add . to PYTHONPATH (in case mrjob isn't actually installed) env = combine_local_envs(os.environ, {'PYTHONPATH': os.path.abspath('.')}) steps_proc = Popen(args, stdout=PIPE, stderr=PIPE, env=env) stdout, stderr = steps_proc.communicate() if steps_proc.returncode != 0: raise Exception( 'error getting step information: %s', stderr) steps = stdout.strip().split(' ') # verify that this is a proper step description if not steps or not stdout: raise ValueError('step description is empty!') for step in steps: if len(step) < 1 or not _STEP_RE.match(step): raise ValueError( 'unexpected step type %r in steps %r' % (step, stdout)) self._steps = steps return self._steps def _mr_job_extra_args(self, local=False): """Return arguments to add to every invocation of MRJob. 
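    # Sketch of the job-name scheme produced by _make_unique_job_name() above
    # (a standalone re-implementation for illustration, not the method itself;
    # the label and owner are invented):
    #
    #     import datetime
    #
    #     def sketch_job_name(label='mr_word_count', owner='sjohnson'):
    #         now = datetime.datetime.utcnow()
    #         return '%s.%s.%s.%06d' % (
    #             label, owner, now.strftime('%Y%m%d.%H%M%S'), now.microsecond)
    #
    #     print sketch_job_name()
    #     # e.g. 'mr_word_count.sjohnson.20120410.231459.123456'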
:type local: boolean :param local: if this is True, use files' local paths rather than the path they'll have inside Hadoop streaming """ return self._get_file_upload_args(local=local) + self._extra_args def _get_file_upload_args(self, local=False): """Arguments used to pass through config files, etc from the job runner through to the local directory where the script is run. :type local: boolean :param local: if this is True, use files' local paths rather than the path they'll have inside Hadoop streaming """ args = [] for arg, file_dict in self._file_upload_args: args.append(arg) if local: args.append(file_dict['path']) else: args.append(file_dict['name']) return args def _wrapper_script_content(self): """Output a python script to the given file descriptor that runs setup_cmds and setup_scripts, and then runs its arguments. This will give names to our files if they don't already have names. """ self._name_files() out = StringIO() def writeln(line=''): out.write(line + '\n') # imports writeln('from fcntl import flock, LOCK_EX, LOCK_UN') writeln('from subprocess import check_call, PIPE') writeln('import sys') writeln() # make lock file and lock it writeln("lock_file = open('/tmp/wrapper.lock.%s', 'a')" % self._job_name) writeln('flock(lock_file, LOCK_EX)') writeln() # run setup cmds if self._opts['setup_cmds']: writeln('# run setup cmds:') for cmd in self._opts['setup_cmds']: # only use the shell for strings, not for lists of arguments # redir stdout to /dev/null so that it won't get confused # with the mapper/reducer's output writeln( "check_call(%r, shell=%r, stdout=open('/dev/null', 'w'))" % (cmd, bool(isinstance(cmd, basestring)))) writeln() # run setup scripts if self._setup_scripts: writeln('# run setup scripts:') for file_dict in self._setup_scripts: writeln("check_call(%r, stdout=open('/dev/null', 'w'))" % ( ['./' + file_dict['name']],)) writeln() # unlock the lock file writeln('flock(lock_file, LOCK_UN)') writeln() # run the real script writeln('# run the real mapper/reducer') writeln('check_call(sys.argv[1:])') return out.getvalue() def _create_wrapper_script(self, dest='wrapper.py'): """Create the wrapper script, and write it into our local temp directory (by default, to a file named wrapper.py). This will set self._wrapper_script, and append it to self._files This will do nothing if setup_cmds and setup_scripts are empty, or _create_wrapper_script() has already been called. """ if not (self._opts['setup_cmds'] or self._setup_scripts): return if self._wrapper_script: return path = os.path.join(self._get_local_tmp_dir(), dest) log.info('writing wrapper script to %s' % path) contents = self._wrapper_script_content() for line in StringIO(contents): log.debug('WRAPPER: ' + line.rstrip('\r\n')) f = open(path, 'w') f.write(contents) f.close() self._wrapper_script = {'path': path} self._files.append(self._wrapper_script) def _dump_stdin_to_local_file(self): """Dump STDIN to a file in our local dir, and set _stdin_path to point at it. You can safely call this multiple times; it'll only read from stdin once. 
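    # Approximately what _wrapper_script_content() above emits when
    # setup_cmds=['make -C deps'] and a single setup script 'prepare.sh' are
    # configured (hand-written approximation; the lock-file name embeds the
    # real job name, and this one is invented):
    #
    #     from fcntl import flock, LOCK_EX, LOCK_UN
    #     from subprocess import check_call, PIPE
    #     import sys
    #
    #     lock_file = open(
    #         '/tmp/wrapper.lock.mr_word_count.sjohnson.20120410.231459.123456',
    #         'a')
    #     flock(lock_file, LOCK_EX)
    #
    #     # run setup cmds:
    #     check_call('make -C deps', shell=True, stdout=open('/dev/null', 'w'))
    #
    #     # run setup scripts:
    #     check_call(['./prepare.sh'], stdout=open('/dev/null', 'w'))
    #
    #     flock(lock_file, LOCK_UN)
    #
    #     # run the real mapper/reducer
    #     check_call(sys.argv[1:])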
""" if self._stdin_path is None: # prompt user, so they don't think the process has stalled log.info('reading from STDIN') stdin_path = os.path.join(self._get_local_tmp_dir(), 'STDIN') log.debug('dumping stdin to local file %s' % stdin_path) with open(stdin_path, 'w') as stdin_file: for line in self._stdin: # catch missing newlines (this often happens with test data) if not line.endswith('\n'): line += '\n' stdin_file.write(line) self._stdin_path = stdin_path return self._stdin_path def _create_mrjob_tar_gz(self): """Make a tarball of the mrjob library, without .pyc or .pyo files, and return its path. This will also set self._mrjob_tar_gz_path It's safe to call this method multiple times (we'll only create the tarball once.) """ if self._mrjob_tar_gz_path is None: # find mrjob library import mrjob if not os.path.basename(mrjob.__file__).startswith('__init__.'): raise Exception( "Bad path for mrjob library: %s; can't bootstrap mrjob", mrjob.__file__) mrjob_dir = os.path.dirname(mrjob.__file__) or '.' tar_gz_path = os.path.join( self._get_local_tmp_dir(), 'mrjob.tar.gz') def filter_path(path): filename = os.path.basename(path) return not(file_ext(filename).lower() in ('.pyc', '.pyo') or # filter out emacs backup files filename.endswith('~') or # filter out emacs lock files filename.startswith('.#') or # filter out MacFuse resource forks filename.startswith('._')) log.debug('archiving %s -> %s as %s' % ( mrjob_dir, tar_gz_path, os.path.join('mrjob', ''))) tar_and_gzip( mrjob_dir, tar_gz_path, filter=filter_path, prefix='mrjob') self._mrjob_tar_gz_path = tar_gz_path return self._mrjob_tar_gz_path def _hadoop_conf_args(self, step_num, num_steps): """Build a list of extra arguments to the hadoop binary. This handles *cmdenv*, *hadoop_extra_args*, *hadoop_input_format*, *hadoop_output_format*, *jobconf*, and *partitioner*. This doesn't handle input, output, mappers, reducers, or uploading files. """ assert 0 <= step_num < num_steps args = [] # hadoop_extra_args args.extend(self._opts['hadoop_extra_args']) # new-style jobconf version = self.get_hadoop_version() if compat.uses_generic_jobconf(version): for key, value in sorted(self._opts['jobconf'].iteritems()): args.extend(['-D', '%s=%s' % (key, value)]) # partitioner if self._partitioner: args.extend(['-partitioner', self._partitioner]) # cmdenv for key, value in sorted(self._get_cmdenv().iteritems()): args.append('-cmdenv') args.append('%s=%s' % (key, value)) # hadoop_input_format if (step_num == 0 and self._hadoop_input_format): args.extend(['-inputformat', self._hadoop_input_format]) # hadoop_output_format if (step_num == num_steps - 1 and self._hadoop_output_format): args.extend(['-outputformat', self._hadoop_output_format]) # old-style jobconf if not compat.uses_generic_jobconf(version): for key, value in sorted(self._opts['jobconf'].iteritems()): args.extend(['-jobconf', '%s=%s' % (key, value)]) return args def _invoke_sort(self, input_paths, output_path): """Use the local sort command to sort one or more input files. Raise an exception if there is a problem. This is is just a wrapper to handle limitations of Windows sort (see Issue #288). 
:type input_paths: list of str :param input_paths: paths of one or more input files :type output_path: str :param output_path: where to pipe sorted output into """ if not input_paths: raise ValueError('Must specify at least one input path.') # ignore locale when sorting env = os.environ.copy() env['LC_ALL'] = 'C' log.info('writing to %s' % output_path) err_path = os.path.join(self._get_local_tmp_dir(), 'sort-stderr') # assume we're using UNIX sort unless we know otherwise if (not self._sort_is_windows_sort) or len(input_paths) == 1: with open(output_path, 'w') as output: with open(err_path, 'w') as err: args = ['sort'] + list(input_paths) log.info('> %s' % cmd_line(args)) try: check_call(args, stdout=output, stderr=err, env=env) return except CalledProcessError: pass # Looks like we're using Windows sort self._sort_is_windows_sort = True log.info('Piping files into sort for Windows compatibility') with open(output_path, 'w') as output: with open(err_path, 'w') as err: args = ['sort'] log.info('> %s' % cmd_line(args)) proc = Popen(args, stdin=PIPE, stdout=output, stderr=err, env=env) # shovel bytes into the sort process for input_path in input_paths: with open(input_path, 'r') as input: while True: buf = input.read(_BUFFER_SIZE) if not buf: break proc.stdin.write(buf) proc.stdin.close() proc.wait() if proc.returncode == 0: return # looks like there was a problem. log it and raise an error with open(err_path) as err: for line in err: log.error('STDERR: %s' % line.rstrip('\r\n')) raise CalledProcessError(proc.returncode, args) mrjob-0.3.3.2/mrjob/ssh.py0000664€q(¼€tzÕß0000001604211740642733021101 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Shortcuts for SSH operations""" # don't add imports here that aren't part of the standard Python library, # since MRJobs need to run in Amazon's generic EMR environment from __future__ import with_statement import logging import os import re from subprocess import Popen from subprocess import PIPE SSH_PREFIX = 'ssh://' SSH_LOG_ROOT = '/mnt/var/log/hadoop' SSH_URI_RE = re.compile( r'^%s(?P[^/]+)?(?P/.*)$' % (SSH_PREFIX,)) log = logging.getLogger('mrjob.ssh') class SSHException(Exception): pass def _ssh_args(ssh_bin, address, ec2_key_pair_file): """Helper method for :py:func:`ssh_run` to build an argument list for ``subprocess``. Specifies an identity, disables strict host key checking, and adds the ``hadoop`` username. """ return ssh_bin + [ '-i', ec2_key_pair_file, '-o', 'StrictHostKeyChecking=no', '-o', 'UserKnownHostsFile=/dev/null', 'hadoop@%s' % (address,), ] def check_output(out, err): if err: if 'No such file or directory' in err: raise IOError(err) elif 'Warning: Permanently added' not in err: raise SSHException(err) if 'Permission denied' in out: raise SSHException(out) return out def ssh_run(ssh_bin, address, ec2_key_pair_file, cmd_args, stdin=''): """Shortcut to call ssh on a Hadoop node via ``subprocess``. 
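# Hedged illustration of the argv that _ssh_args() above assembles (the
# hostname and key-pair path are invented); ssh_run() appends the remote
# command to it:
#
#     args = ['ssh',
#             '-i', '/home/sjohnson/EMR.pem',
#             '-o', 'StrictHostKeyChecking=no',
#             '-o', 'UserKnownHostsFile=/dev/null',
#             'hadoop@ec2-203-0-113-10.compute-1.amazonaws.com']
#     print args + ['cat', '/mnt/var/log/hadoop/steps/1/syslog']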
:param ssh_bin: Path to ``ssh`` binary :param address: Address of your job's master node (obtained via :py:meth:`boto.emr.EmrConnection.describe_jobflow`) :param ec2_key_pair_file: Path to the key pair file (argument to ``-i``) :param cmd_args: The command you want to run :param stdin: String to pass to the process's standard input :return: (stdout, stderr) """ args = _ssh_args(ssh_bin, address, ec2_key_pair_file) + list(cmd_args) log.debug('Run SSH command: %s' % args) p = Popen(args, stdout=PIPE, stderr=PIPE, stdin=PIPE) return p.communicate(stdin) def ssh_run_with_recursion(ssh_bin, address, ec2_key_pair_file, keyfile, cmd_args): """Some files exist on the master and can be accessed directly via SSH, but some files are on the slaves which can only be accessed via the master node. To differentiate between hosts, we adopt the UUCP "bang path" syntax to specify "SSH hops." Specifically, ``host1!host2`` forms the command to be run on ``host2``, then wraps that in a call to ``ssh`` from ``host``, and finally executes that ``ssh`` call on ``host1`` from ``localhost``. Confused yet? For bang paths to work, :py:func:`ssh_copy_key` must have been run, and the ``keyfile`` argument must be the same as was passed to that function. """ if '!' in address: host1, host2 = address.split('!') more_args = [ 'ssh', '-i', keyfile, '-o', 'StrictHostKeyChecking=no', '-o', 'UserKnownHostsFile=/dev/null', 'hadoop@%s' % host2, ] return ssh_run(ssh_bin, host1, ec2_key_pair_file, more_args + list(cmd_args)) else: return ssh_run(ssh_bin, address, ec2_key_pair_file, cmd_args) def ssh_copy_key(ssh_bin, master_address, ec2_key_pair_file, keyfile): """Prepare master to SSH to slaves by copying the EMR private key to the master node. This is done via ``cat`` to avoid having to store an ``scp_bin`` variable. :param ssh_bin: Path to ``ssh`` binary :param master_address: Address of node to copy keyfile to :param ec2_key_pair_file: Path to the key pair file (argument to ``-i``) :param keyfile: What to call the key file on the master """ with open(ec2_key_pair_file, 'rb') as f: args = ['bash -c "cat > %s" && chmod 600 %s' % (keyfile, keyfile)] check_output(*ssh_run(ssh_bin, master_address, ec2_key_pair_file, args, stdin=f.read())) def ssh_slave_addresses(ssh_bin, master_address, ec2_key_pair_file): """Get the IP addresses of the slave nodes. Fails silently because it makes testing easier and if things are broken they will fail before this function is called. """ if not ec2_key_pair_file or not os.path.exists(ec2_key_pair_file): return [] # this is a testing environment cmd = "hadoop dfsadmin -report | grep ^Name | cut -f2 -d: | cut -f2 -d' '" args = ['bash -c "%s"' % cmd] ips = check_output(*ssh_run(ssh_bin, master_address, ec2_key_pair_file, args)) return [ip for ip in ips.split('\n') if ip] def ssh_cat(ssh_bin, address, ec2_key_pair_file, path, keyfile=None): """Return the file at ``path`` as a string. Raises ``IOError`` if the file doesn't exist or ``SSHException if SSH access fails. 
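# Sketch (all addresses, key paths, and the log path are invented) of fetching
# a file that lives on a slave node by hopping through the master with the
# bang-path syntax described above; ssh_copy_key() must have been run first so
# that 'mrjob_key.pem' exists on the master.
#
#     out, err = ssh_run_with_recursion(
#         ['ssh'],                                               # ssh_bin
#         'ec2-203-0-113-10.compute-1.amazonaws.com!10.96.1.5',  # master!slave
#         '/home/sjohnson/EMR.pem',                              # ec2_key_pair_file
#         'mrjob_key.pem',                                       # keyfile on master
#         ['cat', '/mnt/var/log/hadoop/steps/1/syslog'])
#     print check_output(out, err)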
:param ssh_bin: Path to ``ssh`` binary :param address: Address of your job's master node (obtained via :py:meth:`boto.emr.EmrConnection.describe_jobflow`) :param ec2_key_pair_file: Path to the key pair file (argument to ``-i``) :param path: Path on the remote host to get :param keyfile: Name of the EMR private key file on the master node in case ``path`` exists on one of the slave nodes """ out = check_output(*ssh_run_with_recursion(ssh_bin, address, ec2_key_pair_file, keyfile, ['cat', path])) if 'No such file or directory' in out: raise IOError("File not found: %s" % path) return out def ssh_ls(ssh_bin, address, ec2_key_pair_file, path, keyfile=None): """Recursively list files under ``path`` on the specified SSH host. Return the file at ``path`` as a string. Raises ``IOError`` if the path doesn't exist or ``SSHException if SSH access fails. :param ssh_bin: Path to ``ssh`` binary :param address: Address of your job's master node (obtained via :py:meth:`boto.emr.EmrConnection.describe_jobflow`) :param ec2_key_pair_file: Path to the key pair file (argument to ``-i``) :param path: Path on the remote host to list :param keyfile: Name of the EMR private key file on the master node in case ``path`` exists on one of the slave nodes """ out = check_output(*ssh_run_with_recursion( ssh_bin, address, ec2_key_pair_file, keyfile, ['find', '-L', path, '-type', 'f'])) if 'No such file or directory' in out: raise IOError("No such file or directory: %s" % path) return out.split('\n') mrjob-0.3.3.2/mrjob/tools/0000775€q(¼€tzÕß0000000000011741151621021057 5ustar sjohnsonAD\Domain Users00000000000000mrjob-0.3.3.2/mrjob/tools/__init__.py0000664€q(¼€tzÕß0000000000011622561764023171 0ustar sjohnsonAD\Domain Users00000000000000mrjob-0.3.3.2/mrjob/tools/emr/0000775€q(¼€tzÕß0000000000011741151621021642 5ustar sjohnsonAD\Domain Users00000000000000mrjob-0.3.3.2/mrjob/tools/emr/__init__.py0000664€q(¼€tzÕß0000000000011714067341023746 0ustar sjohnsonAD\Domain Users00000000000000mrjob-0.3.3.2/mrjob/tools/emr/audit_usage.py0000664€q(¼€tzÕß0000007164411717277734024543 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Audit EMR usage over the past 2 weeks, sorted by job flow name and user. Usage:: python -m mrjob.tools.emr.audit_usage > report Options:: -h, --help show this help message and exit -v, --verbose print more messages to stderr -q, --quiet Don't log status messages; just print the report. -c CONF_PATH, --conf-path=CONF_PATH Path to alternate mrjob.conf file to read from --no-conf Don't load mrjob.conf even if it's available --max-days-ago=MAX_DAYS_AGO Max number of days ago to look at jobs. 
By default, we go back as far as EMR supports (currently about 2 months) """ from __future__ import with_statement import boto.utils from datetime import datetime from datetime import timedelta import math import logging from optparse import OptionParser import sys from mrjob.emr import EMRJobRunner from mrjob.emr import describe_all_job_flows from mrjob.job import MRJob from mrjob.parse import JOB_NAME_RE from mrjob.parse import STEP_NAME_RE from mrjob.util import strip_microseconds log = logging.getLogger('mrjob.tools.emr.audit_usage') def main(args): # parser command-line args option_parser = make_option_parser() options, args = option_parser.parse_args(args) if args: option_parser.error('takes no arguments') MRJob.set_up_logging(quiet=options.quiet, verbose=options.verbose) now = datetime.utcnow() log.info('getting job flow history...') job_flows = get_job_flows(options.conf_path, options.max_days_ago, now=now) log.info('compiling job flow stats...') stats = job_flows_to_stats(job_flows, now=now) print_report(stats, now=now) def make_option_parser(): usage = '%prog [options]' description = 'Print a giant report on EMR usage.' option_parser = OptionParser(usage=usage, description=description) option_parser.add_option( '-v', '--verbose', dest='verbose', default=False, action='store_true', help='print more messages to stderr') option_parser.add_option( '-q', '--quiet', dest='quiet', default=False, action='store_true', help="Don't log status messages; just print the report.") option_parser.add_option( '-c', '--conf-path', dest='conf_path', default=None, help='Path to alternate mrjob.conf file to read from') option_parser.add_option( '--no-conf', dest='conf_path', action='store_false', help="Don't load mrjob.conf even if it's available") option_parser.add_option( '--max-days-ago', dest='max_days_ago', type='float', default=None, help=('Max number of days ago to look at jobs. By default, we go back' ' as far as EMR supports (currently about 2 months)')) return option_parser def job_flows_to_stats(job_flows, now=None): """Aggregate statistics for several job flows into a dictionary. :param job_flows: a list of :py:class:`boto.emr.EmrObject` :param now: the current UTC time, as a :py:class:`datetime.datetime`. Defaults to the current time. Returns a dictionary with many keys, including: * *flows*: A list of dictionaries; the result of running :py:func:`job_flow_to_full_summary` on each job flow. total usage: * *nih_billed*: total normalized instances hours billed, for all job flows * *nih_used*: total normalized instance hours actually used for bootstrapping and running jobs. * *nih_bbnu*: total usage billed but not used (`nih_billed - nih_used`) further breakdown of total usage: * *bootstrap_nih_used*: total usage for bootstrapping * *end_nih_bbnu*: unused time at the end of job flows * *job_nih_used*: total usage for jobs (`nih_used - bootstrap_nih_used`) * *other_nih_bbnu*: other unused time (`nih_bbnu - end_nih_bbnu`) grouping by various keys: (There is a *_used*, *_billed*, and *_bbnu* version of all stats below) * *date_to_nih_\**: map from a :py:class:`datetime.date` to number of normalized instance hours on that date * *hour_to_nih_\**: map from a :py:class:`datetime.datetime` to number of normalized instance hours during the hour starting at that time * *label_to_nih_\**: map from jobs' labels (usually the module name of the job) to normalized instance hours, with ``None`` for non-:py:mod:`mrjob` jobs. This includes usage data for bootstrapping. 
* *job_step_to_nih_\**: map from jobs' labels and step number to normalized instance hours, using ``(None, None)`` for non-:py:mod:`mrjob` jobs. This does not include bootstrapping. * *job_step_to_nih_\*_no_pool*: Same as *job_step_to_nih_\**, but only including non-pooled job flows. * *owner_to_nih_\**: map from jobs' owners (usually the user who ran them) to normalized instance hours, with ``None`` for non-:py:mod:`mrjob` jobs. This includes usage data for bootstrapping. * *pool_to_nih_\**: Map from pool name to normalized instance hours, with ``None`` for non-pooled jobs and non-:py:mod:`mrjob` jobs. """ s = {} # stats for all job flows s['flows'] = [job_flow_to_full_summary(job_flow, now=now) for job_flow in job_flows] # total usage for nih_type in ('nih_billed', 'nih_used', 'nih_bbnu'): s[nih_type] = float(sum(jf[nih_type] for jf in s['flows'])) # break down by usage/waste s['bootstrap_nih_used'] = float(sum( jf['usage'][0]['nih_used'] for jf in s['flows'] if jf['usage'])) s['job_nih_used'] = s['nih_used'] - s['bootstrap_nih_used'] s['end_nih_bbnu'] = float(sum( jf['usage'][-1]['nih_bbnu'] for jf in s['flows'] if jf['usage'])) s['other_nih_bbnu'] = s['nih_bbnu'] - s['end_nih_bbnu'] # stats by date/hour for interval_type in ('date', 'hour'): for nih_type in ('nih_billed', 'nih_used', 'nih_bbnu'): key = '%s_to_%s' % (interval_type, nih_type) start_to_nih = {} for jf in s['flows']: for u in jf['usage']: for start, nih in u[key].iteritems(): start_to_nih.setdefault(start, 0.0) start_to_nih[start] += nih s[key] = start_to_nih # break down by label ("job name") and owner ("user") for key in ('label', 'owner'): for nih_type in ('nih_used', 'nih_billed', 'nih_bbnu'): key_to_nih = {} for jf in s['flows']: for u in jf['usage']: key_to_nih.setdefault(u[key], 0.0) key_to_nih[u[key]] += u[nih_type] s['%s_to_%s' % (key, nih_type)] = key_to_nih # break down by job step. separate out un-pooled jobs for nih_type in ('nih_used', 'nih_billed', 'nih_bbnu'): job_step_to_nih = {} job_step_to_nih_no_pool = {} for jf in s['flows']: for u in jf['usage'][1:]: job_step = (u['label'], u['step_num']) job_step_to_nih.setdefault(job_step, 0.0) job_step_to_nih[job_step] += u[nih_type] if not jf['pool']: job_step_to_nih_no_pool.setdefault(job_step, 0.0) job_step_to_nih_no_pool[job_step] += u[nih_type] s['job_step_to_%s' % nih_type] = job_step_to_nih s['job_step_to_%s_no_pool' % nih_type] = job_step_to_nih_no_pool # break down by pool for nih_type in ('nih_used', 'nih_billed', 'nih_bbnu'): pool_to_nih = {} for jf in s['flows']: pool_to_nih.setdefault(jf['pool'], 0.0) pool_to_nih[jf['pool']] += jf[nih_type] s['pool_to_%s' % nih_type] = pool_to_nih return s def job_flow_to_full_summary(job_flow, now=None): """Convert a job flow to a full summary for use in creating a report, including billing/usage information. :param job_flow: a :py:class:`boto.emr.EmrObject` :param now: the current UTC time, as a :py:class:`datetime.datetime`. Defaults to the current time. Returns a dictionary with the keys from :py:func:`job_flow_to_basic_summary` plus: * *nih_billed*: total normalized instances hours billed for this job flow * *nih_used*: total normalized instance hours actually used for bootstrapping and running jobs. * *nih_bbnu*: total usage billed but not used (`nih_billed - nih_used`) * *usage*: job-specific usage information, returned by :py:func:`job_flow_to_usage_data`. 
""" jf = job_flow_to_basic_summary(job_flow, now=now) jf['usage'] = job_flow_to_usage_data(job_flow, basic_summary=jf, now=now) # add up billing info if jf['end']: # avoid rounding errors if the job is done jf['nih_billed'] = jf['nih'] else: jf['nih_billed'] = float(sum(u['nih_billed'] for u in jf['usage'])) for nih_type in ('nih_used', 'nih_bbnu'): jf[nih_type] = float(sum(u[nih_type] for u in jf['usage'])) return jf def job_flow_to_basic_summary(job_flow, now=None): """Extract fields such as creation time, owner, etc. from the job flow, so we can safely reference them without using :py:func:`getattr`. :param job_flow: a :py:class:`boto.emr.EmrObject` :param now: the current UTC time, as a :py:class:`datetime.datetime`. Defaults to the current time. Returns a dictionary with the following keys. These will be ``None`` if the corresponding field in the job flow is unavailable. * *created*: UTC `datetime.datetime` that the job flow was created, or ``None`` * *end*: UTC `datetime.datetime` that the job flow finished, or ``None`` * *id*: job flow ID, or ``None`` (this should never happen) * *label*: The label for the job flow (usually the module name of the :py:class:`~mrjob.job.MRJob` script that started it), or ``None`` for non-:py:mod:`mrjob` job flows. * *name*: job flow name, or ``None`` (this should never happen) * *nih*: number of normalized instance hours used by the job flow. * *num_steps*: Number of steps in the job flow. * *owner*: The owner for the job flow (usually the user that started it), or ``None`` for non-:py:mod:`mrjob` job flows. * *pool*: pool name (e.g. ``'default'``) if the job flow is pooled, otherwise ``None``. * *ran*: How long the job flow ran, or has been running, as a :py:class:`datetime.timedelta`. This will be ``timedelta(0)`` if the job flow hasn't started. * *ready*: UTC `datetime.datetime` that the job flow finished bootstrapping, or ``None`` * *start*: UTC `datetime.datetime` that the job flow became available, or ``None`` * *state*: The job flow's state as a string (e.g. ``'RUNNING'``) """ if now is None: now = datetime.utcnow() jf = {} # summary to fill in jf['id'] = getattr(job_flow, 'jobflowid', None) jf['name'] = getattr(job_flow, 'name', None) jf['created'] = to_datetime(getattr(job_flow, 'creationdatetime', None)) jf['start'] = to_datetime(getattr(job_flow, 'startdatetime', None)) jf['ready'] = to_datetime(getattr(job_flow, 'readydatetime', None)) jf['end'] = to_datetime(getattr(job_flow, 'enddatetime', None)) if jf['start']: jf['ran'] = (jf['end'] or now) - jf['start'] else: jf['ran'] = timedelta(0) jf['state'] = getattr(job_flow, 'state', None) jf['num_steps'] = len(getattr(job_flow, 'steps', None) or ()) jf['pool'] = None bootstrap_actions = getattr(job_flow, 'bootstrapactions', None) if bootstrap_actions: args = [arg.value for arg in bootstrap_actions[-1].args] if len(args) == 2 and args[0].startswith('pool-'): jf['pool'] = args[1] m = JOB_NAME_RE.match(getattr(job_flow, 'name', '')) if m: jf['label'], jf['owner'] = m.group(1), m.group(2) else: jf['label'], jf['owner'] = None, None jf['nih'] = float(getattr(job_flow, 'normalizedinstancehours', '0')) return jf def job_flow_to_usage_data(job_flow, basic_summary=None, now=None): """Break billing/usage information for a job flow down by job. :param job_flow: a :py:class:`boto.emr.EmrObject` :param basic_summary: a basic summary of the job flow, returned by :py:func:`job_flow_to_basic_summary`. If this is ``None``, we'll call :py:func:`job_flow_to_basic_summary` ourselves. 
:param now: the current UTC time, as a :py:class:`datetime.datetime`. Defaults to the current time. Returns a list of dictionaries containing usage information, one for bootstrapping, and one for each step that ran or is currently running. If the job flow hasn't started yet, return ``[]``. Usage dictionaries have the following keys: * *end*: when the job finished running, or *now* if it's still running. * *end_billing*: the effective end of the job for billing purposes, either when the next job starts, the current time if the job is still running, or the end of the next full hour in the job flow. * *nih_billed*: normalized instances hours billed for this job or bootstrapping step * *nih_used*: normalized instance hours actually used for running the job or bootstrapping * *nih_bbnu*: usage billed but not used (`nih_billed - nih_used`) * *date_to_nih_\**: map from a :py:class:`datetime.date` to number of normalized instance hours billed/used/billed but not used on that date * *hour_to_nih_\**: map from a :py:class:`datetime.datetime` to number of normalized instance hours billed/used/billed but not used during the hour starting at that time * *label*: job's label (usually the module name of the job), or for the bootstrapping step, the label of the job flow * *owner*: job's owner (usually the user that started it), or for the bootstrapping step, the owner of the job flow * *start*: when the job or bootstrapping step started, as a :py:class:`datetime.datetime` """ jf = basic_summary or job_flow_to_basic_summary(job_flow) if now is None: now = datetime.utcnow() if not jf['start']: return [] # Figure out billing rate per second for the job, given that # normalizedinstancehours is how much we're charged up until # the next full hour. full_hours = math.ceil(to_secs(jf['ran']) / 60.0 / 60.0) nih_per_sec = jf['nih'] / (full_hours * 3600.0) # Don't actually count a step as billed for the full hour until # the job flow finishes. This means that our total "nih_billed" # will be less than normalizedinstancehours in the job flow, but it # also keeps stats stable for steps that have already finished. if jf['end']: jf_end_billing = jf['start'] + timedelta(hours=full_hours) else: jf_end_billing = now intervals = [] # add a fake step for the job that started the job flow, and credit # it for time spent bootstrapping. intervals.append({ 'label': jf['label'], 'owner': jf['owner'], 'start': jf['start'], 'end': jf['ready'] or now, 'step_num': None, }) for step in (getattr(job_flow, 'steps', None) or ()): # we've reached the last step that's actually run if not hasattr(step, 'startdatetime'): break step_start = to_datetime(step.startdatetime) step_end = to_datetime(getattr(step, 'enddatetime', None)) if step_end is None: # step started running and was cancelled. 
credit it for 0 usage if jf['end']: step_end = step_start # step is still running else: step_end = now m = STEP_NAME_RE.match(getattr(step, 'name', '')) if m: step_label = m.group(1) step_owner = m.group(2) step_num = int(m.group(6)) else: step_label, step_owner, step_num = None, None, None intervals.append({ 'label': step_label, 'owner': step_owner, 'start': step_start, 'end': step_end, 'step_num': step_num, }) # fill in end_billing for i in xrange(len(intervals) - 1): intervals[i]['end_billing'] = intervals[i + 1]['start'] intervals[-1]['end_billing'] = jf_end_billing # fill normalized usage information for interval in intervals: interval['nih_used'] = ( nih_per_sec * to_secs(interval['end'] - interval['start'])) interval['date_to_nih_used'] = dict( (d, nih_per_sec * secs) for d, secs in subdivide_interval_by_date(interval['start'], interval['end']).iteritems()) interval['hour_to_nih_used'] = dict( (d, nih_per_sec * secs) for d, secs in subdivide_interval_by_hour(interval['start'], interval['end']).iteritems()) interval['nih_billed'] = ( nih_per_sec * to_secs(interval['end_billing'] - interval['start'])) interval['date_to_nih_billed'] = dict( (d, nih_per_sec * secs) for d, secs in subdivide_interval_by_date(interval['start'], interval['end_billing']).iteritems()) interval['hour_to_nih_billed'] = dict( (d, nih_per_sec * secs) for d, secs in subdivide_interval_by_hour(interval['start'], interval['end_billing']).iteritems()) # time billed but not used interval['nih_bbnu'] = interval['nih_billed'] - interval['nih_used'] interval['date_to_nih_bbnu'] = {} for d, nih_billed in interval['date_to_nih_billed'].iteritems(): nih_bbnu = nih_billed - interval['date_to_nih_used'].get(d, 0.0) if nih_bbnu: interval['date_to_nih_bbnu'][d] = nih_bbnu interval['hour_to_nih_bbnu'] = {} for d, nih_billed in interval['hour_to_nih_billed'].iteritems(): nih_bbnu = nih_billed - interval['hour_to_nih_used'].get(d, 0.0) if nih_bbnu: interval['hour_to_nih_bbnu'][d] = nih_bbnu return intervals def subdivide_interval_by_date(start, end): """Convert a time interval to a map from :py:class:`datetime.date` to the number of seconds within the interval on that date. *start* and *end* are :py:class:`datetime.datetime` objects. """ if start.date() == end.date(): date_to_secs = {start.date(): to_secs(end - start)} else: date_to_secs = {} date_to_secs[start.date()] = to_secs( datetime(start.year, start.month, start.day) + timedelta(days=1) - start) date_to_secs[end.date()] = to_secs( end - datetime(end.year, end.month, end.day)) # fill in dates in the middle cur_date = start.date() + timedelta(days=1) while cur_date < end.date(): date_to_secs[cur_date] = to_secs(timedelta(days=1)) cur_date += timedelta(days=1) # remove zeros date_to_secs = dict( (d, secs) for d, secs in date_to_secs.iteritems() if secs) return date_to_secs def subdivide_interval_by_hour(start, end): """Convert a time interval to a map from hours (represented as :py:class:`datetime.datetime` for the start of the hour) to the number of seconds during that hour that are within the interval *start* and *end* are :py:class:`datetime.datetime` objects. 
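# Runnable check (e.g. from an interactive session) of the date subdivision
# implemented above: an interval spanning three calendar days is split into
# per-day second counts.
#
#     from datetime import datetime
#     from mrjob.tools.emr.audit_usage import subdivide_interval_by_date
#
#     start = datetime(2012, 4, 9, 22, 30)
#     end = datetime(2012, 4, 11, 1, 0)
#     print subdivide_interval_by_date(start, end)
#     # {datetime.date(2012, 4, 9): 5400.0,
#     #  datetime.date(2012, 4, 10): 86400.0,
#     #  datetime.date(2012, 4, 11): 3600.0}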
""" start_hour = start.replace(minute=0, second=0, microsecond=0) end_hour = end.replace(minute=0, second=0, microsecond=0) if start_hour == end_hour: hour_to_secs = {start_hour: to_secs(end - start)} else: hour_to_secs = {} hour_to_secs[start_hour] = to_secs( start_hour + timedelta(hours=1) - start) hour_to_secs[end_hour] = to_secs(end - end_hour) # fill in dates in the middle cur_hour = start_hour + timedelta(hours=1) while cur_hour < end_hour: hour_to_secs[cur_hour] = to_secs(timedelta(hours=1)) cur_hour += timedelta(hours=1) # remove zeros hour_to_secs = dict( (h, secs) for h, secs in hour_to_secs.iteritems() if secs) return hour_to_secs def get_job_flows(conf_path, max_days_ago=None, now=None): """Get relevant job flow information from EMR. :param str conf_path: Alternate path to read :py:mod:`mrjob.conf` from, or ``False`` to ignore all config files. :param float max_days_ago: If set, don't fetch job flows created longer than this many days ago. :param now: the current UTC time, as a :py:class:`datetime.datetime`. Defaults to the current time. """ if now is None: now = datetime.utcnow() emr_conn = EMRJobRunner(conf_path=conf_path).make_emr_conn() # if --max-days-ago is set, only look at recent jobs created_after = None if max_days_ago is not None: created_after = now - timedelta(days=max_days_ago) return describe_all_job_flows(emr_conn, created_after=created_after) def print_report(stats, now=None): """Print final report. :param stats: a dictionary returned by :py:func:`job_flows_to_stats` :param now: the current UTC time, as a :py:class:`datetime.datetime`. Defaults to the current time. """ if now is None: now = datetime.utcnow() s = stats if not s['flows']: print 'No job flows created in the past two months!' return print 'Total # of Job Flows: %d' % len(s['flows']) print print '* All times are in UTC.' print print 'Min create time: %s' % min(jf['created'] for jf in s['flows']) print 'Max create time: %s' % max(jf['created'] for jf in s['flows']) print ' Current time: %s' % now.replace(microsecond=0) print print '* All usage is measured in Normalized Instance Hours, which are' print ' roughly equivalent to running an m1.small instance for an hour.' print " Billing is estimated, and may not match Amazon's system exactly." 
print # total compute-unit hours used def with_pct(usage): return (usage, percent(usage, s['nih_billed'])) print 'Total billed: %9.2f %5.1f%%' % with_pct(s['nih_billed']) print ' Total used: %9.2f %5.1f%%' % with_pct(s['nih_used']) print ' bootstrap: %9.2f %5.1f%%' % with_pct(s['bootstrap_nih_used']) print ' jobs: %9.2f %5.1f%%' % with_pct(s['job_nih_used']) print ' Total waste: %9.2f %5.1f%%' % with_pct(s['nih_bbnu']) print ' at end: %9.2f %5.1f%%' % with_pct(s['end_nih_bbnu']) print ' other: %9.2f %5.1f%%' % with_pct(s['other_nih_bbnu']) print if s['date_to_nih_billed']: print 'Daily statistics:' print print ' date billed used waste % waste' d = max(s['date_to_nih_billed']) while d >= min(s['date_to_nih_billed']): print ' %10s %9.2f %9.2f %9.2f %5.1f' % ( d, s['date_to_nih_billed'].get(d, 0.0), s['date_to_nih_used'].get(d, 0.0), s['date_to_nih_bbnu'].get(d, 0.0), percent(s['date_to_nih_bbnu'].get(d, 0.0), s['date_to_nih_billed'].get(d, 0.0))) d -= timedelta(days=1) print if s['hour_to_nih_billed']: print 'Hourly statistics:' print print ' hour billed used waste % waste' h = max(s['hour_to_nih_billed']) while h >= min(s['hour_to_nih_billed']): print ' %13s %9.2f %9.2f %9.2f %5.1f' % ( h.strftime('%Y-%m-%d %H'), s['hour_to_nih_billed'].get(h, 0.0), s['hour_to_nih_used'].get(h, 0.0), s['hour_to_nih_bbnu'].get(h, 0.0), percent(s['hour_to_nih_bbnu'].get(h, 0.0), s['hour_to_nih_billed'].get(h, 0.0))) h -= timedelta(hours=1) print print '* Job flows are considered to belong to the user and job that' print ' started them or last ran on them.' print # Top jobs print 'Top jobs, by total time used:' for label, nih_used in sorted(s['label_to_nih_used'].iteritems(), key=lambda (lb, nih): (-nih, lb)): print ' %9.2f %s' % (nih_used, label) print print 'Top jobs, by time billed but not used:' for label, nih_bbnu in sorted(s['label_to_nih_bbnu'].iteritems(), key=lambda (lb, nih): (-nih, lb)): print ' %9.2f %s' % (nih_bbnu, label) print # Top users print 'Top users, by total time used:' for owner, nih_used in sorted(s['owner_to_nih_used'].iteritems(), key=lambda (o, nih): (-nih, o)): print ' %9.2f %s' % (nih_used, owner) print print 'Top users, by time billed but not used:' for owner, nih_bbnu in sorted(s['owner_to_nih_bbnu'].iteritems(), key=lambda (o, nih): (-nih, o)): print ' %9.2f %s' % (nih_bbnu, owner) print # Top job steps print 'Top job steps, by total time used (step number first):' for (label, step_num), nih_used in sorted( s['job_step_to_nih_used'].iteritems(), key=lambda (k, nih): (-nih, k)): if label: print ' %9.2f %3d %s' % (nih_used, step_num, label) else: print ' %9.2f (non-mrjob step)' % (nih_used,) print print 'Top job steps, by total time billed but not used (un-pooled only):' for (label, step_num), nih_bbnu in sorted( s['job_step_to_nih_bbnu_no_pool'].iteritems(), key=lambda (k, nih): (-nih, k)): if label: print ' %9.2f %3d %s' % (nih_bbnu, step_num, label) else: print ' %9.2f (non-mrjob step)' % (nih_bbnu,) print # Top pools print 'All pools, by total time billed:' for pool, nih_billed in sorted(s['pool_to_nih_billed'].iteritems(), key=lambda (p, nih): (-nih, p)): print ' %9.2f %s' % (nih_billed, pool or '(not pooled)') print print 'All pools, by total time billed but not used:' for pool, nih_bbnu in sorted(s['pool_to_nih_bbnu'].iteritems(), key=lambda (p, nih): (-nih, p)): print ' %9.2f %s' % (nih_bbnu, pool or '(not pooled)') print # Top job flows print 'All job flows, by total time billed:' top_job_flows = sorted(s['flows'], key=lambda jf: (-jf['nih_billed'], jf['name'])) for jf in 
top_job_flows: print ' %9.2f %-15s %s' % ( jf['nih_billed'], jf['id'], jf['name']) print print 'All job flows, by time billed but not used:' top_job_flows_bbnu = sorted(s['flows'], key=lambda jf: (-jf['nih_bbnu'], jf['name'])) for jf in top_job_flows_bbnu: print ' %9.2f %-15s %s' % ( jf['nih_bbnu'], jf['id'], jf['name']) print # Details print 'Details for all job flows:' print print (' id state created steps' ' time ran billed waste user name') all_job_flows = sorted(s['flows'], key=lambda jf: jf['created'], reverse=True) for jf in all_job_flows: print ' %-15s %-13s %19s %3d %17s %9.2f %9.2f %8s %s' % ( jf['id'], jf['state'], jf['created'], jf['num_steps'], strip_microseconds(jf['ran']), jf['nih_used'], jf['nih_bbnu'], (jf['owner'] or ''), (jf['label'] or ('not started by mrjob'))) def to_secs(delta): """Convert a :py:class:`datetime.timedelta` to a number of seconds. (This is basically a backport of :py:meth:`datetime.timedelta.total_seconds`.) """ return (delta.days * 86400.0 + delta.seconds + delta.microseconds / 1000000.0) def to_datetime(iso8601_time): """Convert a ISO8601-formatted datetime (from :py:mod:`boto`) to a :py:class:`datetime.datetime`.""" if iso8601_time is None: return None return datetime.strptime(iso8601_time, boto.utils.ISO8601) def percent(x, total, default=0.0): """Return what percentage *x* is of *total*, or *default* if *total* is zero.""" if total: return 100.0 * x / total else: return default if __name__ == '__main__': main(sys.argv[1:]) mrjob-0.3.3.2/mrjob/tools/emr/create_job_flow.py0000664€q(¼€tzÕß0000001026711740642733025356 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp and Contributors # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Create a persistent EMR job flow, using bootstrap scripts and other configs from :py:mod:`mrjob.conf`, and print the job flow ID to stdout. Usage:: python -m mrjob.tools.emr.create_job_flow **WARNING**: do not run this without having :py:mod:`mrjob.tools.emr.terminate_idle_job_flows` in your crontab; job flows left idle can quickly become expensive! 
""" from __future__ import with_statement from optparse import OptionParser from optparse import OptionGroup from mrjob.emr import EMRJobRunner from mrjob.job import MRJob from mrjob.util import scrape_options_into_new_groups def main(): """Run the create_job_flow tool with arguments from ``sys.argv`` and printing to ``sys.stdout``.""" runner = EMRJobRunner(**runner_kwargs()) emr_job_flow_id = runner.make_persistent_job_flow() print emr_job_flow_id def runner_kwargs(): """Parse command line arguments into arguments for :py:class:`EMRJobRunner` """ # parser command-line args option_parser = make_option_parser() options, args = option_parser.parse_args() if args: option_parser.error('takes no arguments') MRJob.set_up_logging(quiet=options.quiet, verbose=options.verbose) # create the persistent job kwargs = options.__dict__.copy() del kwargs['quiet'] del kwargs['verbose'] return kwargs def make_option_parser(): usage = '%prog [options]' description = ( 'Create a persistent EMR job flow to run jobs in. WARNING: do not run' ' this without mrjob.tools.emr.terminate_idle_job_flows in your' ' crontab; job flows left idle can quickly become expensive!') option_parser = OptionParser(usage=usage, description=description) def make_option_group(halp): g = OptionGroup(option_parser, halp) option_parser.add_option_group(g) return g runner_group = make_option_group('Running the entire job') hadoop_emr_opt_group = make_option_group( 'Running on Hadoop or EMR (these apply when you set -r hadoop or -r' ' emr)') emr_opt_group = make_option_group( 'Running on Amazon Elastic MapReduce (these apply when you set -r' ' emr)') assignments = { runner_group: ( 'bootstrap_mrjob', 'conf_path', 'quiet', 'verbose' ), hadoop_emr_opt_group: ( 'label', 'owner', ), emr_opt_group: ( 'additional_emr_info', 'ami_version', 'aws_availability_zone', 'aws_region', 'bootstrap_actions', 'bootstrap_cmds', 'bootstrap_files', 'bootstrap_python_packages', 'ec2_core_instance_bid_price', 'ec2_core_instance_type', 'ec2_instance_type', 'ec2_key_pair', 'ec2_master_instance_bid_price', 'ec2_master_instance_type', 'ec2_task_instance_bid_price', 'ec2_task_instance_type', 'emr_endpoint', 'emr_job_flow_pool_name', 'enable_emr_debugging', 'hadoop_version', 'num_ec2_core_instances', 'num_ec2_instances', 'num_ec2_task_instances', 'pool_emr_job_flows', 's3_endpoint', 's3_log_uri', 's3_scratch_uri', 's3_sync_wait_time', ), } # Scrape options from MRJob and index them by dest mr_job = MRJob() job_option_groups = mr_job.all_option_groups() scrape_options_into_new_groups(job_option_groups, assignments) return option_parser if __name__ == '__main__': main() mrjob-0.3.3.2/mrjob/tools/emr/fetch_logs.py0000664€q(¼€tzÕß0000002434311717277734024360 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """List, display, and parse Hadoop logs associated with EMR job flows. Useful for debugging failed jobs for which mrjob did not display a useful error message or for inspecting jobs whose output has been lost. 
Usage:: python -m mrjob.tools.emr.fetch_logs -[l|L|a|A|--counters] [-s STEP_NUM]\ JOB_FLOW_ID Options:: -a, --cat Cat log files MRJob finds relevant -A, --cat-all Cat all log files to JOB_FLOW_ID/ -c CONF_PATH, --conf-path=CONF_PATH Path to alternate mrjob.conf file to read from --counters Show counters from the job flow --ec2-key-pair-file=EC2_KEY_PAIR_FILE Path to file containing SSH key for EMR -h, --help show this help message and exit -l, --list List log files MRJob finds relevant -L, --list-all List all log files --no-conf Don't load mrjob.conf even if it's available -q, --quiet Don't print anything to stderr -s STEP_NUM, --step-num=STEP_NUM Limit results to a single step. To be used with --list and --cat. -v, --verbose print more messages to stderr """ from __future__ import with_statement from optparse import OptionError from optparse import OptionParser import sys from mrjob.emr import EMRJobRunner from mrjob.emr import LogFetchError from mrjob.job import MRJob from mrjob.logparsers import TASK_ATTEMPT_LOGS from mrjob.logparsers import STEP_LOGS from mrjob.logparsers import JOB_LOGS from mrjob.logparsers import NODE_LOGS from mrjob.util import scrape_options_into_new_groups def main(): option_parser = make_option_parser() try: options = parse_args(option_parser) except OptionError: option_parser.error('This tool takes exactly one argument.') MRJob.set_up_logging(quiet=options.quiet, verbose=options.verbose) with EMRJobRunner(**runner_kwargs(options)) as runner: perform_actions(options, runner) def perform_actions(options, runner): """Given the command line arguments and an :py:class:`EMRJobRunner`, perform various actions for this tool. """ if options.step_num: step_nums = [options.step_num] else: step_nums = None if options.list_relevant: list_relevant(runner, step_nums) if options.list_all: list_all(runner) if options.cat_relevant: cat_relevant(runner, step_nums) if options.cat_all: cat_all(runner) if options.get_counters: desc = runner._describe_jobflow() runner._set_s3_job_log_uri(desc) runner._fetch_counters( xrange(1, len(desc.steps) + 1), skip_s3_wait=True) runner.print_counters() if options.find_failure: find_failure(runner, options.step_num) def parse_args(option_parser): option_parser = make_option_parser() options, args = option_parser.parse_args() # should be one argument, the job flow ID if len(args) != 1: raise OptionError('Must supply one positional argument as the job' ' flow ID', option_parser) options.emr_job_flow_id = args[0] return options def runner_kwargs(options): """Given the command line options, return the arguments to :py:class:`EMRJobRunner` """ kwargs = options.__dict__.copy() for unused_arg in ('quiet', 'verbose', 'list_relevant', 'list_all', 'cat_relevant', 'cat_all', 'get_counters', 'step_num', 'find_failure'): del kwargs[unused_arg] return kwargs def make_option_parser(): usage = 'usage: %prog [options] JOB_FLOW_ID' description = ( 'List, display, and parse Hadoop logs associated with EMR job flows.' 
' Useful for debugging failed jobs for which mrjob did not display a' ' useful error message or for inspecting jobs whose output has been' ' lost.') option_parser = OptionParser(usage=usage, description=description) option_parser.add_option('-f', '--find-failure', dest='find_failure', action='store_true', default=False, help=('Search the logs for information about why' ' the job failed')) option_parser.add_option('-l', '--list', dest='list_relevant', action="store_true", default=False, help='List log files MRJob finds relevant') option_parser.add_option('-L', '--list-all', dest='list_all', action="store_true", default=False, help='List all log files') option_parser.add_option('-a', '--cat', dest='cat_relevant', action="store_true", default=False, help='Cat log files MRJob finds relevant') option_parser.add_option('-A', '--cat-all', dest='cat_all', action="store_true", default=False, help='Cat all log files to JOB_FLOW_ID/') option_parser.add_option('-s', '--step-num', dest='step_num', action='store', type='int', default=None, help=('Limit results to a single step. To be used' ' with --list and --cat.')) option_parser.add_option('--counters', dest='get_counters', action='store_true', default=False, help='Show counters from the job flow') assignments = { option_parser: ('conf_path', 'quiet', 'verbose', 'ec2_key_pair_file') } mr_job = MRJob() job_option_groups = (mr_job.option_parser, mr_job.mux_opt_group, mr_job.proto_opt_group, mr_job.runner_opt_group, mr_job.hadoop_emr_opt_group, mr_job.emr_opt_group, mr_job.hadoop_opts_opt_group) scrape_options_into_new_groups(job_option_groups, assignments) return option_parser def prettyprint_paths(paths): for path in paths: print path print def _prettyprint_relevant(log_type_to_uri_list): print 'Task attempts:' prettyprint_paths(log_type_to_uri_list[TASK_ATTEMPT_LOGS]) print 'Steps:' prettyprint_paths(log_type_to_uri_list[STEP_LOGS]) print 'Jobs:' prettyprint_paths(log_type_to_uri_list[JOB_LOGS]) print 'Nodes:' prettyprint_paths(log_type_to_uri_list[NODE_LOGS]) def list_relevant(runner, step_nums): try: logs = { TASK_ATTEMPT_LOGS: runner.ls_task_attempt_logs_ssh(step_nums), STEP_LOGS: runner.ls_step_logs_ssh(step_nums), JOB_LOGS: runner.ls_job_logs_ssh(step_nums), NODE_LOGS: runner.ls_node_logs_ssh(), } _prettyprint_relevant(logs) except LogFetchError, e: print 'SSH error:', e logs = { TASK_ATTEMPT_LOGS: runner.ls_task_attempt_logs_s3(step_nums), STEP_LOGS: runner.ls_step_logs_s3(step_nums), JOB_LOGS: runner.ls_job_logs_s3(step_nums), NODE_LOGS: runner.ls_node_logs_s3(), } _prettyprint_relevant(logs) def list_all(runner): try: prettyprint_paths(runner.ls_all_logs_ssh()) except LogFetchError, e: print 'SSH error:', e prettyprint_paths(runner.ls_all_logs_s3()) def cat_from_list(runner, path_list): for path in path_list: print '===', path, '===' for line in runner.cat(path): print line.rstrip() print def _cat_from_relevant(runner, log_type_to_uri_list): print 'Task attempts:' cat_from_list(runner, log_type_to_uri_list[TASK_ATTEMPT_LOGS]) print 'Steps:' cat_from_list(runner, log_type_to_uri_list[STEP_LOGS]) print 'Jobs:' cat_from_list(runner, log_type_to_uri_list[JOB_LOGS]) print 'Slaves:' cat_from_list(runner, log_type_to_uri_list[NODE_LOGS]) def cat_relevant(runner, step_nums): try: logs = { TASK_ATTEMPT_LOGS: runner.ls_task_attempt_logs_ssh(step_nums), STEP_LOGS: runner.ls_step_logs_ssh(step_nums), JOB_LOGS: runner.ls_job_logs_ssh(step_nums), NODE_LOGS: runner.ls_node_logs_ssh(), } _cat_from_relevant(runner, logs) except LogFetchError, e: print 'SSH 
error:', e logs = { TASK_ATTEMPT_LOGS: runner.ls_task_attempt_logs_s3(step_nums), STEP_LOGS: runner.ls_step_logs_s3(step_nums), JOB_LOGS: runner.ls_job_logs_s3(step_nums), NODE_LOGS: runner.ls_node_logs_s3(), } _cat_from_relevant(runner, logs) def cat_all(runner): try: cat_from_list(runner, runner.ls_all_logs_ssh()) except LogFetchError, e: print 'SSH error:', e cat_from_list(runner, runner.ls_all_logs_s3()) def find_failure(runner, step_num): if step_num: step_nums = [step_num] else: job_flow = runner._describe_jobflow() if job_flow: step_nums = range(1, len(job_flow.steps) + 1) else: print 'You do not have access to that job flow.' sys.exit(1) cause = runner._find_probable_cause_of_failure(step_nums) if cause: # log cause, and put it in exception cause_msg = [] # lines to log and put in exception cause_msg.append('Probable cause of failure (from %s):' % cause['log_file_uri']) cause_msg.extend(line.strip('\n') for line in cause['lines']) if cause['input_uri']: cause_msg.append('(while reading from %s)' % cause['input_uri']) print '\n'.join(cause_msg) else: print 'No probable cause of failure found.' if __name__ == '__main__': main() mrjob-0.3.3.2/mrjob/tools/emr/job_flow_pool.py0000664€q(¼€tzÕß0000001606711717277734025101 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp and Contributors # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Inspect available job flow pools or identify job flows suitable for running a job with the specified options. 
Usage:: python -m mrjob.tools.emr.job_flow_pool """ from __future__ import with_statement from optparse import OptionError from optparse import OptionGroup from optparse import OptionParser from mrjob.emr import EMRJobRunner from mrjob.emr import est_time_to_hour from mrjob.job import MRJob from mrjob.util import scrape_options_into_new_groups from mrjob.util import strip_microseconds def get_pools(emr_conn): pools = {} for job_flow in emr_conn.describe_jobflows(): if job_flow.state in ('TERMINATED', 'FAILED', 'COMPLETED', 'SHUTTING_DOWN'): continue if not job_flow.bootstrapactions: continue args = [arg.value for arg in job_flow.bootstrapactions[-1].args] if len(args) != 2: continue pools.setdefault(args[1], list()).append(job_flow) return pools def pprint_job_flow(jf): """Print a job flow to stdout in this form:: job.flow.name j-JOB_FLOW_ID: 2 instances (master=m1.small, slaves=m1.small, 20 \ minutes to the hour) """ instance_count = int(jf.instancecount) nosep_segments = [ '%d instance' % instance_count, ] if instance_count > 1: nosep_segments.append('s') comma_segments = [ 'master=%s' % jf.masterinstancetype, ] if instance_count > 1: comma_segments.append('slaves=%s' % jf.slaveinstancetype) comma_segments.append('%s to end of hour' % strip_microseconds(est_time_to_hour(jf))) nosep_segments += [ ' (', ', '.join(comma_segments), ')', ] print '%s: %s' % (jf.jobflowid, jf.name) print ''.join(nosep_segments) print jf.state print def pprint_pools(runner): pools = get_pools(runner.make_emr_conn()) for pool_name, job_flows in pools.iteritems(): print '-' * len(pool_name) print pool_name print '-' * len(pool_name) for job_flow in job_flows: pprint_job_flow(job_flow) def terminate(runner, pool_name): emr_conn = runner.make_emr_conn() pools = get_pools(emr_conn) try: for job_flow in pools[pool_name]: emr_conn.terminate_jobflow(job_flow.jobflowid) print 'terminated %s' % job_flow.jobflowid except KeyError: print 'No job flows match pool name "%s"' % pool_name def main(): option_parser = make_option_parser() try: options = parse_args(option_parser) except OptionError: option_parser.error('This tool takes no arguments.') MRJob.set_up_logging(quiet=options.quiet, verbose=options.verbose) with EMRJobRunner(**runner_kwargs(options)) as runner: perform_actions(options, runner) def make_option_parser(): usage = '%prog [options]' description = ( 'Inspect available job flow pools or identify job flows suitable for' ' running a job with the specified options.') option_parser = OptionParser(usage=usage, description=description) def make_option_group(halp): g = OptionGroup(option_parser, halp) option_parser.add_option_group(g) return g ec2_opt_group = make_option_group('EC2 instance configuration') hadoop_opt_group = make_option_group('Hadoop configuration') job_opt_group = make_option_group('Job flow configuration') assignments = { option_parser: ( 'conf_path', 'emr_job_flow_pool_name', 'quiet', 'verbose', ), ec2_opt_group: ( 'aws_availability_zone', 'ec2_instance_type', 'ec2_key_pair', 'ec2_key_pair_file', 'ec2_master_instance_type', 'ec2_core_instance_type', 'emr_endpoint', 'num_ec2_instances', ), hadoop_opt_group: ( 'hadoop_version', 'label', 'owner', ), job_opt_group: ( 'bootstrap_actions', 'bootstrap_cmds', 'bootstrap_files', 'bootstrap_mrjob', 'bootstrap_python_packages', ), } option_parser.add_option('-a', '--all', action='store_true', default=False, dest='list_all', help=('List all available job flows without' ' filtering by configuration')) option_parser.add_option('-f', '--find', 
action='store_true', default=False, dest='find', help=('Find a job flow matching the pool name,' ' bootstrap configuration, and instance' ' number/type as specified on the command' ' line and in the configuration files')) option_parser.add_option('-t', '--terminate', action='store', default=None, dest='terminate', metavar='JOB_FLOW_ID', help=('Terminate all job flows in the given pool' ' (defaults to pool "default")')) # Scrape options from MRJob and index them by dest mr_job = MRJob() scrape_options_into_new_groups(mr_job.all_option_groups(), assignments) return option_parser def parse_args(option_parser): options, args = option_parser.parse_args() if len(args) != 0: raise OptionError('This program takes no arguments', option_parser) return options def runner_kwargs(options): """Given the command line options, return the arguments to :py:class:`EMRJobRunner` """ kwargs = options.__dict__.copy() for non_runner_kwarg in ('quiet', 'verbose', 'list_all', 'find', 'terminate'): del kwargs[non_runner_kwarg] return kwargs def perform_actions(options, runner): """Given the command line arguments and an :py:class:`EMRJobRunner`, perform various actions for this tool. """ if options.list_all: pprint_pools(runner) if options.find: sorted_job_flows = runner.usable_job_flows() if sorted_job_flows: jf = sorted_job_flows[-1] print 'You should use this one:' pprint_job_flow(jf) else: print 'No idle job flows match criteria' if options.terminate: terminate(runner, options.terminate) if __name__ == '__main__': main() mrjob-0.3.3.2/mrjob/tools/emr/mrboss.py0000664€q(¼€tzÕß0000001052311717277734023543 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Run a command on the master and all slaves. Store stdout and stderr for results in OUTPUT_DIR. Usage:: python -m mrjob.tools.emr.mrboss JOB_FLOW_ID [options] "command string" Options:: -c CONF_PATH, --conf-path=CONF_PATH --ec2-key-pair-file=EC2_KEY_PAIR_FILE Path to file containing SSH key for EMR -h, --help show this help message and exit --no-conf Don't load mrjob.conf even if it's available -o, --output-dir Specify an output directory (default: JOB_FLOW_ID) -q, --quiet Don't print anything to stderr -v, --verbose print more messages to stderr """ from __future__ import with_statement from optparse import OptionParser import os import shlex import sys from mrjob.emr import EMRJobRunner from mrjob.job import MRJob from mrjob.ssh import ssh_run_with_recursion from mrjob.util import scrape_options_into_new_groups def main(): usage = 'usage: %prog JOB_FLOW_ID OUTPUT_DIR [options] "command string"' description = ('Run a command on the master and all slaves of an EMR job' ' flow. 
Store stdout and stderr for results in OUTPUT_DIR.') option_parser = OptionParser(usage=usage, description=description) assignments = { option_parser: ('conf_path', 'quiet', 'verbose', 'ec2_key_pair_file') } option_parser.add_option('-o', '--output-dir', dest='output_dir', default=None, help="Specify an output directory (default:" " JOB_FLOW_ID)") mr_job = MRJob() scrape_options_into_new_groups(mr_job.all_option_groups(), assignments) options, args = option_parser.parse_args() MRJob.set_up_logging(quiet=options.quiet, verbose=options.verbose) runner_kwargs = options.__dict__.copy() for unused_arg in ('output_dir', 'quiet', 'verbose'): del runner_kwargs[unused_arg] if len(args) < 2: option_parser.print_help() sys.exit(1) job_flow_id, cmd_string = args[:2] cmd_args = shlex.split(cmd_string) output_dir = os.path.abspath(options.output_dir or job_flow_id) with EMRJobRunner(emr_job_flow_id=job_flow_id, **runner_kwargs) as runner: runner._enable_slave_ssh_access() run_on_all_nodes(runner, output_dir, cmd_args) def run_on_all_nodes(runner, output_dir, cmd_args, print_stderr=True): """Given an :py:class:`EMRJobRunner`, run the command specified by *cmd_args* on all nodes in the job flow and save the stdout and stderr of each run to subdirectories of *output_dir*. You should probably have run :py:meth:`_enable_slave_ssh_access()` on the runner before calling this function. """ master_addr = runner._address_of_master() addresses = [master_addr] if runner._opts['num_ec2_instances'] > 1: addresses += ['%s!%s' % (master_addr, slave_addr) for slave_addr in runner._addresses_of_slaves()] for addr in addresses: stdout, stderr = ssh_run_with_recursion( runner._opts['ssh_bin'], addr, runner._opts['ec2_key_pair_file'], runner._ssh_key_name, cmd_args, ) if print_stderr: print '---' print 'Command completed on %s.' % addr print stderr, if '!' in addr: base_dir = os.path.join(output_dir, 'slave ' + addr.split('!')[1]) else: base_dir = os.path.join(output_dir, 'master') if not os.path.exists(base_dir): os.makedirs(base_dir) with open(os.path.join(base_dir, 'stdout'), 'w') as f: f.write(stdout) with open(os.path.join(base_dir, 'stderr'), 'w') as f: f.write(stderr) if __name__ == '__main__': main() mrjob-0.3.3.2/mrjob/tools/emr/report_long_jobs.py0000664€q(¼€tzÕß0000001770111734676607025613 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2012 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Report jobs running for more than a certain number of hours (by default, 24.0). This can help catch buggy jobs and Hadoop/EMR operational issues. Suggested usage: run this as a daily cron job with the ``-q`` option:: 0 0 * * * python -m mrjob.tools.emr.report_long_jobs -q Options:: -h, --help show this help message and exit -v, --verbose print more messages to stderr -q, --quiet Don't log status messages; just print the report. 
-c CONF_PATH, --conf-path=CONF_PATH Path to alternate mrjob.conf file to read from --no-conf Don't load mrjob.conf even if it's available --min-hours=MIN_HOURS Minimum number of hours a job can run before we report it. Default: 24.0 """ from __future__ import with_statement from datetime import datetime from datetime import timedelta import logging from optparse import OptionParser import sys import boto.utils from mrjob.emr import EMRJobRunner from mrjob.emr import describe_all_job_flows from mrjob.job import MRJob from mrjob.util import strip_microseconds # default minimum number of hours a job can run before we report it. DEFAULT_MIN_HOURS = 24.0 log = logging.getLogger('mrjob.tools.emr.report_long_jobs') def main(args, now=None): if now is None: now = datetime.utcnow() option_parser = make_option_parser() options, args = option_parser.parse_args(args) if args: option_parser.error('takes no arguments') MRJob.set_up_logging(quiet=options.quiet, verbose=options.verbose) log.info('getting information about running jobs') emr_conn = EMRJobRunner(conf_path=options.conf_path).make_emr_conn() job_flows = describe_all_job_flows( emr_conn, states=['BOOTSTRAPPING', 'RUNNING']) min_time = timedelta(hours=options.min_hours) job_info = find_long_running_jobs(job_flows, min_time, now=now) print_report(job_info) def find_long_running_jobs(job_flows, min_time, now=None): """Identify jobs that have been running or pending for a long time. :param job_flows: a list of :py:class:`boto.emr.emrobject.JobFlow` objects to inspect. :param min_time: a :py:class:`datetime.timedelta`: report jobs running or pending longer than this :param now: the current UTC time, as a :py:class:`datetime.datetime`. Defaults to the current time. For each job that is running or pending longer than *min_time*, yields a dictionary with the following keys: * *job_flow_id*: the job flow's unique ID (e.g. ``j-SOMEJOBFLOW``) * *name*: name of the step, or the job flow when bootstrapping * *step_state*: state of the step, either ``'RUNNING'`` or ``'PENDING'`` * *time*: amount of time step was running or pending, as a :py:class:`datetime.timedelta` """ if now is None: now = datetime.utcnow() for jf in job_flows: # special case for jobs that are taking a long time to bootstrap if jf.state == 'BOOTSTRAPPING': start_timestamp = jf.startdatetime start = datetime.strptime(start_timestamp, boto.utils.ISO8601) time_running = now - start if time_running >= min_time: # we tell bootstrapping info by step_state being empty, # and only use job_flow_id and time in the report yield({'job_flow_id': jf.jobflowid, 'name': jf.name, 'step_state': '', 'time': time_running}) # the default case: running job flows if jf.state != 'RUNNING': continue running_steps = [step for step in jf.steps if step.state == 'RUNNING'] pending_steps = [step for step in jf.steps if step.state == 'PENDING'] if running_steps: # should be only one, but if not, we should know for step in running_steps: start_timestamp = step.startdatetime start = datetime.strptime(start_timestamp, boto.utils.ISO8601) time_running = now - start if time_running >= min_time: yield({'job_flow_id': jf.jobflowid, 'name': step.name, 'step_state': step.state, 'time': time_running}) # sometimes EMR says it's "RUNNING" but doesn't actually run steps! 
elif pending_steps: step = pending_steps[0] # PENDING job should have run starting when the job flow # became ready, or the previous step completed start_timestamp = jf.readydatetime for step in jf.steps: if step.state == 'COMPLETED': start_timestamp = step.enddatetime start = datetime.strptime(start_timestamp, boto.utils.ISO8601) time_pending = now - start if time_pending >= min_time: yield({'job_flow_id': jf.jobflowid, 'name': step.name, 'step_state': step.state, 'time': time_pending}) def print_report(job_info): """Takes in a dictionary of info about a long-running job (see :py:func:`find_long_running_jobs`), and prints information about it on a single (long) line. """ for ji in job_info: # BOOTSTRAPPING case if not ji['step_state']: print '%-15s BOOTSTRAPPING for %17s (%s)' % ( ji['job_flow_id'], format_timedelta(ji['time']), ji['name']) else: print '%-15s %7s for %17s (%s)' % ( ji['job_flow_id'], ji['step_state'], format_timedelta(ji['time']), ji['name']) def format_timedelta(time): """Format a timedelta for use in a columnar format. This just tweaks stuff like ``'3 days, 9:00:00'`` to line up with ``'3 days, 10:00:00'`` """ result = str(strip_microseconds(time)) parts = result.split() if len(parts) == 3 and len(parts[-1]) == 7: return '%s %s %s' % tuple(parts) else: return result def make_option_parser(): usage = '%prog [options]' description = ('Report jobs running for more than a certain number of' ' hours (by default, %.1f). This can help catch buggy jobs' ' and Hadoop/EMR operational issues.' % DEFAULT_MIN_HOURS) option_parser = OptionParser(usage=usage, description=description) option_parser.add_option( '-v', '--verbose', dest='verbose', default=False, action='store_true', help='print more messages to stderr') option_parser.add_option( '-q', '--quiet', dest='quiet', default=False, action='store_true', help="Don't log status messages; just print the report.") option_parser.add_option( '-c', '--conf-path', dest='conf_path', default=None, help='Path to alternate mrjob.conf file to read from') option_parser.add_option( '--no-conf', dest='conf_path', action='store_false', help="Don't load mrjob.conf even if it's available") option_parser.add_option( '--min-hours', dest='min_hours', type='float', default=DEFAULT_MIN_HOURS, help=('Minimum number of hours a job can run before we report it.' ' Default: %default')) return option_parser if __name__ == '__main__': main(sys.argv[1:]) mrjob-0.3.3.2/mrjob/tools/emr/s3_tmpwatch.py0000664€q(¼€tzÕß0000001224511717277734024475 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2010-2011 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Delete all files in a given URI that are older than a specified time. The time parameter defines the threshold for removing files. If the file has not been accessed for *time*, the file is removed. The time argument is a number with an optional single-character suffix specifying the units: m for minutes, h for hours, d for days. If no suffix is specified, time is in hours. 
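For instance, the suffix handling described above (implemented by the
``process_time()`` helper later in this module) works out to::

    >>> process_time('30m')
    datetime.timedelta(0, 1800)
    >>> process_time('12h')
    datetime.timedelta(0, 43200)
    >>> process_time('2d')
    datetime.timedelta(2)
    >>> process_time('24')  # no suffix, so hours
    datetime.timedelta(1)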
Suggested usage: run this as a cron job with the -q option:: 0 0 * * * python -m mrjob.tools.emr.s3_tmpwatch -q 30d \ s3://your-bucket/tmp/ Usage:: python -m mrjob.tools.emr.s3_tmpwatch [options] Options:: -h, --help show this help message and exit -v, --verbose Print more messages -q, --quiet Report only fatal errors. -c CONF_PATH, --conf-path=CONF_PATH Path to alternate mrjob.conf file to read from --no-conf Don't load mrjob.conf even if it's available -t, --test Don't actually delete any files; just log that we would """ from datetime import datetime from datetime import timedelta import logging from optparse import OptionParser try: import boto.utils boto # quiet "redefinition of unused ..." warning from pyflakes except ImportError: boto = None from mrjob.emr import EMRJobRunner from mrjob.emr import iso8601_to_datetime from mrjob.job import MRJob from mrjob.parse import parse_s3_uri log = logging.getLogger('mrjob.tools.emr.s3_tmpwatch') def main(): option_parser = make_option_parser() options, args = option_parser.parse_args() # make sure time and uris are given if not args or len(args) < 2: option_parser.error('Please specify time and one or more URIs') MRJob.set_up_logging(quiet=options.quiet, verbose=options.verbose) time_old = process_time(args[0]) for path in args[1:]: s3_cleanup(path, time_old, conf_path=options.conf_path, dry_run=options.test) def s3_cleanup(glob_path, time_old, dry_run=False, conf_path=None): """Delete all files older than *time_old* in *path*. If *dry_run* is ``True``, then just log the files that need to be deleted without actually deleting them """ runner = EMRJobRunner(conf_path=conf_path) s3_conn = runner.make_s3_conn() log.info('Deleting all files in %s that are older than %s' % (glob_path, time_old)) for path in runner.ls(glob_path): bucket_name, key_name = parse_s3_uri(path) bucket = s3_conn.get_bucket(bucket_name) for key in bucket.list(key_name): last_modified = iso8601_to_datetime(key.last_modified) age = datetime.utcnow() - last_modified if age > time_old: # Delete it log.info('Deleting %s; is %s old' % (key.name, age)) if not dry_run: key.delete() def process_time(time): if time[-1] == 'm': return timedelta(minutes=int(time[:-1])) elif time[-1] == 'h': return timedelta(hours=int(time[:-1])) elif time[-1] == 'd': return timedelta(days=int(time[:-1])) else: return timedelta(hours=int(time)) def make_option_parser(): usage = '%prog [options] ' description = ( 'Delete all files in a given URI that are older than a specified' ' time.\n\nThe time parameter defines the threshold for removing' ' files. If the file has not been accessed for *time*, the file is' ' removed. The time argument is a number with an optional' ' single-character suffix specifying the units: m for minutes, h for' ' hours, d for days. 
If no suffix is specified, time is in hours.') option_parser = OptionParser(usage=usage, description=description) option_parser.add_option( '-v', '--verbose', dest='verbose', default=False, action='store_true', help='Print more messages') option_parser.add_option( '-q', '--quiet', dest='quiet', default=False, action='store_true', help='Report only fatal errors.') option_parser.add_option( '-c', '--conf-path', dest='conf_path', default=None, help='Path to alternate mrjob.conf file to read from') option_parser.add_option( '--no-conf', dest='conf_path', action='store_false', help="Don't load mrjob.conf even if it's available") option_parser.add_option( '-t', '--test', dest='test', default=False, action='store_true', help="Don't actually delete any files; just log that we would") return option_parser if __name__ == '__main__': main() mrjob-0.3.3.2/mrjob/tools/emr/terminate_idle_job_flows.py0000664€q(¼€tzÕß0000003425411741151504027255 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License """Terminate idle EMR job flows that meet the criteria passed in on the command line (or, by default, job flows that have been idle for one hour). Suggested usage: run this as a cron job with the ``-q`` option:: */30 * * * * python -m mrjob.tools.emr.terminate_idle_job_flows -q Options:: -h, --help show this help message and exit -v, --verbose Print more messages -q, --quiet Don't print anything to stderr; just print IDs of terminated job flows and idle time information to stdout. Use twice to print absolutely nothing. -c CONF_PATH, --conf-path=CONF_PATH Path to alternate mrjob.conf file to read from --no-conf Don't load mrjob.conf even if it's available --max-hours-idle=MAX_HOURS_IDLE Max number of hours a job flow can go without bootstrapping, running a step, or having a new step created. This will fire even if there are pending steps which EMR has failed to start. Make sure you set this higher than the amount of time your jobs can take to start instances and bootstrap. --mins-to-end-of-hour=MINS_TO_END_OF_HOUR Terminate job flows that are within this many minutes of the end of a full hour since the job started running AND have no pending steps. --unpooled-only Only terminate un-pooled job flows --pooled-only Only terminate pooled job flows --pool-name=POOL_NAME Only terminate job flows in the given named pool. --dry-run Don't actually kill idle jobs; just log that we would """ from datetime import datetime from datetime import timedelta import logging from optparse import OptionParser import re try: import boto.utils boto # quiet "redefinition of unused ..." 
warning from pyflakes except ImportError: boto = None from mrjob.emr import attempt_to_acquire_lock from mrjob.emr import EMRJobRunner from mrjob.emr import describe_all_job_flows from mrjob.job import MRJob from mrjob.pool import est_time_to_hour from mrjob.pool import pool_hash_and_name from mrjob.util import strip_microseconds log = logging.getLogger('mrjob.tools.emr.terminate_idle_job_flows') DEFAULT_MAX_HOURS_IDLE = 1 DEFAULT_MAX_MINUTES_LOCKED = 1 DEBUG_JAR_RE = re.compile( r's3n://.*\.elasticmapreduce/libs/state-pusher/[^/]+/fetch') def main(): option_parser = make_option_parser() options, args = option_parser.parse_args() if args: option_parser.error('takes no arguments') MRJob.set_up_logging(quiet=options.quiet, verbose=options.verbose) inspect_and_maybe_terminate_job_flows( conf_path=options.conf_path, dry_run=options.dry_run, max_hours_idle=options.max_hours_idle, mins_to_end_of_hour=options.mins_to_end_of_hour, unpooled_only=options.unpooled_only, now=datetime.utcnow(), pool_name=options.pool_name, pooled_only=options.pooled_only, max_mins_locked=options.max_mins_locked, quiet=(options.quiet > 1), ) def inspect_and_maybe_terminate_job_flows( conf_path=None, dry_run=False, max_hours_idle=None, mins_to_end_of_hour=None, now=None, pool_name=None, pooled_only=False, unpooled_only=False, max_mins_locked=None, quiet=False, **kwargs ): if now is None: now = datetime.utcnow() # old default behavior if max_hours_idle is None and mins_to_end_of_hour is None: max_hours_idle = DEFAULT_MAX_HOURS_IDLE runner = EMRJobRunner(conf_path=conf_path, **kwargs) emr_conn = runner.make_emr_conn() log.info( 'getting info about all job flows (this goes back about 2 months)') # We don't filter by job flow state because we want this to work even # if Amazon adds another kind of idle state. 
job_flows = describe_all_job_flows(emr_conn) num_bootstrapping = 0 num_done = 0 num_idle = 0 num_non_streaming = 0 num_pending = 0 num_running = 0 # a list of tuples of job flow id, name, idle time (as a timedelta) to_terminate = [] for jf in job_flows: # check if job flow is done if is_job_flow_done(jf): num_done += 1 # check if job flow is bootstrapping elif is_job_flow_bootstrapping(jf): num_bootstrapping += 1 # we can't really tell if non-streaming jobs are idle or not, so # let them be (see Issue #60) elif not is_job_flow_streaming(jf): num_non_streaming += 1 elif is_job_flow_running(jf): num_running += 1 else: time_idle = now - time_last_active(jf) time_to_end_of_hour = est_time_to_hour(jf, now=now) _, pool = pool_hash_and_name(jf) pending = job_flow_has_pending_steps(jf) if pending: num_pending += 1 else: num_idle += 1 log.debug( 'Job flow %s %s for %s, %s to end of hour, %s (%s)' % (jf.jobflowid, 'pending' if pending else 'idle', strip_microseconds(time_idle), strip_microseconds(time_to_end_of_hour), ('unpooled' if pool is None else 'in %s pool' % pool), jf.name)) # filter out job flows that don't meet our criteria if (max_hours_idle is not None and time_idle <= timedelta(hours=max_hours_idle)): continue # mins_to_end_of_hour doesn't apply to jobs with pending steps if (mins_to_end_of_hour is not None and (pending or time_to_end_of_hour >= timedelta( minutes=mins_to_end_of_hour))): continue if (pooled_only and pool is None): continue if (unpooled_only and pool is not None): continue if (pool_name is not None and pool != pool_name): continue to_terminate.append((jf, pending, time_idle, time_to_end_of_hour)) log.info( 'Job flow statuses: %d bootstrapping, %d running, %d pending, %d idle,' ' %d active non-streaming, %d done' % ( num_running, num_bootstrapping, num_pending, num_idle, num_non_streaming, num_done)) terminate_and_notify(runner, to_terminate, dry_run=dry_run, max_mins_locked=max_mins_locked, quiet=quiet) def is_job_flow_done(job_flow): """Return True if the given job flow is done running.""" return hasattr(job_flow, 'enddatetime') def is_job_flow_streaming(job_flow): """Return ``False`` if the give job flow has steps, but none of them are Hadoop streaming steps (for example, if the job flow is running Hive). """ steps = getattr(job_flow, 'steps', None) if not steps: return True for step in steps: args = [a.value for a in step.args] for arg in args: # This is hadoop streaming if arg == '-mapper': return True # This is a debug jar associated with hadoop streaming if DEBUG_JAR_RE.match(arg): return True # job has at least one step, and none are streaming steps return False def is_job_flow_running(job_flow): """Return ``True`` if *job_flow* has any steps which are currently running.""" steps = getattr(job_flow, 'steps', None) or [] return any(is_step_running(step) for step in steps) def is_job_flow_bootstrapping(job_flow): """Return ``True`` if *job_flow* is currently bootstrapping.""" return bool(getattr(job_flow, 'startdatetime', None) and not getattr(job_flow, 'readydatetime', None) and not getattr(job_flow, 'enddatetime', None)) def is_step_running(step): """Return true if the given job flow step is currently running.""" return bool(getattr(step, 'state', None) != 'CANCELLED' and getattr(step, 'startdatetime', None) and not getattr(step, 'enddatetime', None)) def time_last_active(job_flow): """When did something last happen with the given job flow? 
Things we look at: * ``job_flow.creationdatetime`` (always set) * ``job_flow.startdatetime`` * ``job_flow.readydatetime`` (i.e. when bootstrapping finished) * ``step.creationdatetime`` for any step * ``step.startdatetime`` for any step * ``step.enddatetime`` for any step This is not really meant to be run on job flows which are currently running, or done. """ timestamps = [] for key in 'creationdatetime', 'startdatetime', 'readydatetime': value = getattr(job_flow, key, None) if value: timestamps.append(value) steps = getattr(job_flow, 'steps', None) or [] for step in steps: for key in 'creationdatetime', 'startdatetime', 'enddatetime': value = getattr(step, key, None) if value: timestamps.append(value) # for ISO8601 timestamps, alpha order == chronological order last_timestamp = max(timestamps) return datetime.strptime(last_timestamp, boto.utils.ISO8601) def job_flow_has_pending_steps(job_flow): """Return ``True`` if *job_flow* has any steps in the ``PENDING`` state.""" steps = getattr(job_flow, 'steps', None) or [] return any(getattr(step, 'state', None) == 'PENDING' for step in steps) def terminate_and_notify(runner, to_terminate, dry_run=False, max_mins_locked=None, quiet=False): if not to_terminate: return for jf, pending, time_idle, time_to_end_of_hour in to_terminate: did_terminate = False if not dry_run: status = attempt_to_acquire_lock( runner.make_s3_conn(), runner._lock_uri(jf), runner._opts['s3_sync_wait_time'], runner._make_unique_job_name(label='terminate'), mins_to_expiration=max_mins_locked, ) if status: runner.make_emr_conn().terminate_jobflow(jf.jobflowid) did_terminate = True elif not quiet: log.info('%s was locked between getting job flow info and' ' trying to terminate it; skipping' % jf.jobflowid) if did_terminate and not quiet: fmt = ('Terminated job flow %s (%s); was %s for %s, %s to end of' ' hour') print fmt % ( jf.jobflowid, jf.name, 'pending' if pending else 'idle', strip_microseconds(time_idle), strip_microseconds(time_to_end_of_hour)) def make_option_parser(): usage = '%prog [options]' description = ('Terminate idle EMR job flows that meet the criteria' ' passed in on the command line (or, by default,' ' job flows that have been idle for one hour).') option_parser = OptionParser(usage=usage, description=description) option_parser.add_option( '-v', '--verbose', dest='verbose', default=False, action='store_true', help='Print more messages') option_parser.add_option( '-q', '--quiet', dest='quiet', action='count', help=("Don't print anything to stderr; just print IDs of terminated" " job flows and idle time information to stdout. Use twice" " to print absolutely nothing.")) option_parser.add_option( '-c', '--conf-path', dest='conf_path', default=None, help='Path to alternate mrjob.conf file to read from') option_parser.add_option( '--no-conf', dest='conf_path', action='store_false', help="Don't load mrjob.conf even if it's available") option_parser.add_option( '--max-hours-idle', dest='max_hours_idle', default=None, type='float', help=('Max number of hours a job flow can go without bootstrapping,' ' running a step, or having a new step created. This will fire' ' even if there are pending steps which EMR has failed to' ' start. 
Make sure you set this higher than the amount of time' ' your jobs can take to start instances and bootstrap.')) option_parser.add_option( '--max-mins-locked', dest='max_mins_locked', default=DEFAULT_MAX_MINUTES_LOCKED, type='float', help='Max number of minutes a job flow can be locked while idle.') option_parser.add_option( '--mins-to-end-of-hour', dest='mins_to_end_of_hour', default=None, type='float', help=('Terminate job flows that are within this many minutes of' ' the end of a full hour since the job started running' ' AND have no pending steps.')) option_parser.add_option( '--unpooled-only', dest='unpooled_only', action='store_true', default=False, help='Only terminate un-pooled job flows') option_parser.add_option( '--pooled-only', dest='pooled_only', action='store_true', default=False, help='Only terminate pooled job flows') option_parser.add_option( '--pool-name', dest='pool_name', default=None, help='Only terminate job flows in the given named pool.') option_parser.add_option( '--dry-run', dest='dry_run', default=False, action='store_true', help="Don't actually kill idle jobs; just log that we would") return option_parser if __name__ == '__main__': main() mrjob-0.3.3.2/mrjob/tools/emr/terminate_job_flow.py0000664€q(¼€tzÕß0000000531511717277734026112 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2010 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Terminate an existing EMR job flow. Usage:: python -m mrjob.tools.emr.terminate_job_flow [options] j-JOBFLOWID Terminate an existing EMR job flow. Options:: -h, --help show this help message and exit -v, --verbose print more messages to stderr -q, --quiet don't print anything -c CONF_PATH, --conf-path=CONF_PATH Path to alternate mrjob.conf file to read from --no-conf Don't load mrjob.conf even if it's available """ from __future__ import with_statement import logging from optparse import OptionParser from mrjob.emr import EMRJobRunner from mrjob.job import MRJob log = logging.getLogger('mrjob.tools.emr.terminate_job_flow') def main(): # parser command-line args option_parser = make_option_parser() options, args = option_parser.parse_args() if len(args) != 1: option_parser.error('This tool takes exactly one argument.') emr_job_flow_id = args[0] MRJob.set_up_logging(quiet=options.quiet, verbose=options.verbose) # create the persistent job runner = EMRJobRunner(conf_path=options.conf_path) log.debug('Terminating job flow %s' % emr_job_flow_id) runner.make_emr_conn().terminate_jobflow(emr_job_flow_id) log.info('Terminated job flow %s' % emr_job_flow_id) def make_option_parser(): usage = '%prog [options] jobflowid' description = 'Terminate an existing EMR job flow.' 
option_parser = OptionParser(usage=usage, description=description) option_parser.add_option( '-v', '--verbose', dest='verbose', default=False, action='store_true', help='print more messages to stderr') option_parser.add_option( '-q', '--quiet', dest='quiet', default=False, action='store_true', help="don't print anything") option_parser.add_option( '-c', '--conf-path', dest='conf_path', default=None, help='Path to alternate mrjob.conf file to read from') option_parser.add_option( '--no-conf', dest='conf_path', action='store_false', help="Don't load mrjob.conf even if it's available") return option_parser if __name__ == '__main__': main() mrjob-0.3.3.2/mrjob/util.py0000664€q(¼€tzÕß0000004405111741151504021252 0ustar sjohnsonAD\Domain Users00000000000000# Copyright 2009-2012 Yelp # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Utility functions for MRJob that have no external dependencies.""" # don't add imports here that aren't part of the standard Python library, # since MRJobs need to run in Amazon's generic EMR environment from __future__ import with_statement from collections import defaultdict import contextlib from copy import deepcopy from datetime import timedelta import glob import gzip import hashlib import itertools import logging import os import pipes import sys import tarfile import zipfile try: import bz2 except ImportError: bz2 = None class NullHandler(logging.Handler): def emit(self, record): pass def buffer_iterator_to_line_iterator(iterator): """boto's file iterator splits by buffer size instead of by newline. This wrapper puts them back into lines. """ buf = iterator.next() # might raise StopIteration, but that's okay while True: if '\n' in buf: (line, buf) = buf.split('\n', 1) yield line + '\n' else: try: more = iterator.next() buf += more except StopIteration: if buf: yield buf + '\n' return def cmd_line(args): """build a command line that works in a shell. """ args = [str(x) for x in args] return ' '.join(pipes.quote(x) for x in args) def extract_dir_for_tar(archive_path, compression='gz'): """Get the name of the directory the tar at *archive_path* extracts into. :type archive_path: str :param archive_path: path to archive file :type compression: str :param compression: Compression type to use. This can be one of ``''``, ``bz2``, or ``gz``. """ # Open the file for read-only streaming (no random seeks) tar = tarfile.open(archive_path, mode='r|%s' % compression) # Grab the first item first_member = tar.next() tar.close() # Return the first path component of the item's name return first_member.name.split('/')[0] def expand_path(path): """Resolve ``~`` (home dir) and environment variables in *path*. If *path* is ``None``, return ``None``. 
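    For example (the environment variable below is set purely for
    illustration; ``~`` expansion depends on your own home directory)::

        >>> os.environ['DATA_DIR'] = '/tmp/data'
        >>> expand_path('$DATA_DIR/logs')
        '/tmp/data/logs'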
""" if path is None: return None else: return os.path.expanduser(os.path.expandvars(path)) def file_ext(path): """return the file extension, including the ``.`` >>> file_ext('foo.tar.gz') '.tar.gz' """ filename = os.path.basename(path) dot_index = filename.find('.') if dot_index == -1: return '' return filename[dot_index:] def hash_object(obj): """Generate a hash (currently md5) of the ``repr`` of the object""" m = hashlib.md5() m.update(repr(obj)) return m.hexdigest() def log_to_null(name=None): """Set up a null handler for the given stream, to suppress "no handlers could be found" warnings.""" logger = logging.getLogger(name) logger.addHandler(NullHandler()) def log_to_stream(name=None, stream=None, format=None, level=None, debug=False): """Set up logging. :type name: str :param name: name of the logger, or ``None`` for the root logger :type stderr: file object :param stderr: stream to log to (default is ``sys.stderr``) :type format: str :param format: log message format (default is '%(message)s') :param level: log level to use :type debug: bool :param debug: quick way of setting the log level: if true, use ``logging.DEBUG``, otherwise use ``logging.INFO`` """ if level is None: level = logging.DEBUG if debug else logging.INFO if format is None: format = '%(message)s' if stream is None: stream = sys.stderr handler = logging.StreamHandler(stream) handler.setLevel(level) handler.setFormatter(logging.Formatter(format)) logger = logging.getLogger(name) logger.setLevel(level) logger.addHandler(handler) def _process_long_opt(option_parser, arg_map, rargs, values): """Mimic function of the same name in ``OptionParser``, capturing the arguments consumed in *arg_map* """ arg = rargs.pop(0) # Value explicitly attached to arg? Pretend it's the next # argument. if "=" in arg: (opt, next_arg) = arg.split("=", 1) rargs.insert(0, next_arg) else: opt = arg opt = option_parser._match_long_opt(opt) option = option_parser._long_opt[opt] # Store the 'before' value of *rargs* rargs_before_processing = [x for x in rargs] if option.takes_value(): nargs = option.nargs if nargs == 1: value = rargs.pop(0) else: value = tuple(rargs[0:nargs]) del rargs[0:nargs] else: value = None option.process(opt, value, values, option_parser) # Measure rargs before and after processing. Store difference in arg_map. length_difference = len(rargs_before_processing) - len(rargs) list_difference = [opt] + rargs_before_processing[:length_difference] arg_map[option.dest].extend(list_difference) def _process_short_opts(option_parser, arg_map, rargs, values): """Mimic function of the same name in ``OptionParser``, capturing the arguments consumed in *arg_map* """ arg = rargs.pop(0) stop = False i = 1 for ch in arg[1:]: opt = "-" + ch option = option_parser._short_opt.get(opt) i += 1 # we have consumed a character # Store the 'before' value of *rargs* rargs_before_processing = [x for x in rargs] # We won't see a difference in rargs for things like '-pJSON', so # handle that edge case explicitly. args_from_smashed_short_opt = [] if option.takes_value(): # Any characters left in arg? Pretend they're the # next arg, and stop consuming characters of arg. if i < len(arg): rargs.insert(0, arg[i:]) args_from_smashed_short_opt.append(arg[i:]) stop = True nargs = option.nargs if nargs == 1: value = rargs.pop(0) else: value = tuple(rargs[0:nargs]) del rargs[0:nargs] else: # option doesn't take a value value = None option.process(opt, value, values, option_parser) # Measure rargs before and after processing. Store difference in # arg_map. 
length_difference = len(rargs_before_processing) - len(rargs) list_difference = ([opt] + args_from_smashed_short_opt + rargs_before_processing[:length_difference]) arg_map[option.dest].extend(list_difference) if stop: break def parse_and_save_options(option_parser, args): """Duplicate behavior of OptionParser, but capture the strings required to reproduce the same values. Ref. optparse.py lines 1414-1548 (python 2.6.5) """ arg_map = defaultdict(list) values = deepcopy(option_parser.get_default_values()) rargs = [x for x in args] option_parser.rargs = rargs while rargs: arg = rargs[0] if arg == '--': del rargs[0] return arg_map elif arg[0:2] == '--': _process_long_opt(option_parser, arg_map, rargs, values) elif arg[:1] == '-' and len(arg) > 1: _process_short_opts(option_parser, arg_map, rargs, values) else: del rargs[0] return arg_map def populate_option_groups_with_options(assignments, indexed_options): """Given a dictionary mapping :py:class:`OptionGroup` and :py:class:`OptionParser` objects to a list of strings represention option dests, populate the objects with options from ``indexed_options`` (generated by :py:func:`scrape_options_and_index_by_dest`) in alphabetical order by long option name. This function primarily exists to serve :py:func:`scrape_options_into_new_groups`. :type assignments: dict of the form ``{my_option_parser: ('verbose', 'help', ...), my_option_group: (...)}`` :param assignments: specification of which parsers/groups should get which options :type indexed_options: dict generated by :py:func:`util.scrape_options_and_index_by_dest` :param indexed_options: options to use when populating the parsers/groups """ for opt_group, opt_dest_list in assignments.iteritems(): new_options = [] for option_dest in assignments[opt_group]: for option in indexed_options[option_dest]: new_options.append(option) # New options must be added using add_options() or they will not be # allowed by the parser on the command line opt_group.add_options(new_options) # Sort alphabetically for help opt_group.option_list = sorted(opt_group.option_list, key=lambda item: item.get_opt_string()) def read_input(path, stdin=None): """Stream input the way Hadoop would. - Resolve globs (``foo_*.gz``). - Decompress ``.gz`` and ``.bz2`` files. - If path is ``'-'``, read from stdin - If path is a directory, recursively read its contents. You can redefine *stdin* for ease of testing. *stdin* can actually be any iterable that yields lines (e.g. a list). """ if stdin is None: stdin = sys.stdin # handle '-' (special case) if path == '-': for line in stdin: yield line return # resolve globs paths = glob.glob(path) if not paths: raise IOError(2, 'No such file or directory: %r' % path) elif len(paths) > 1: for path in paths: for line in read_input(path, stdin=stdin): yield line return else: path = paths[0] # recurse through directories if os.path.isdir(path): for dirname, _, filenames in os.walk(path): for filename in filenames: for line in read_input(os.path.join(dirname, filename), stdin=stdin): yield line return # read from files for line in read_file(path): yield line def read_file(path, fileobj=None): """Reads a file. - Decompress ``.gz`` and ``.bz2`` files. 
- If *fileobj* is not ``None``, stream lines from the *fileobj* """ try: if path.endswith('.gz'): f = gzip.GzipFile(path, fileobj=fileobj) elif path.endswith('.bz2'): if bz2 is None: raise Exception('bz2 module was not successfully imported (likely not installed).') elif fileobj is None: f = bz2.BZ2File(path) else: f = bunzip2_stream(fileobj) elif fileobj is None: f = open(path) else: f = fileobj for line in f: yield line finally: if fileobj is None and not f is None: f.close() def bunzip2_stream(fileobj): """Return an uncompressed bz2 stream from a file object """ # decompress chunks into a buffer, then stream from the buffer buffer = '' if bz2 is None: raise Exception('bz2 module was not successfully imported (likely not installed).') decomp = bz2.BZ2Decompressor() for part in fileobj: buffer = buffer.join(decomp.decompress(part)) f = buffer.splitlines(True) return f @contextlib.contextmanager def save_current_environment(): """ Context manager that saves os.environ and loads it back again after execution """ original_environ = os.environ.copy() yield os.environ.clear() os.environ.update(original_environ) def scrape_options_and_index_by_dest(*parsers_and_groups): """Scrapes ``optparse`` options from :py:class:`OptionParser` and :py:class:`OptionGroup` objects and builds a dictionary of ``dest_var: [option1, option2, ...]``. This function primarily exists to serve :py:func:`scrape_options_into_new_groups`. An example return value: ``{'verbose': [, ], 'files': []}`` :type parsers_and_groups: :py:class:`OptionParser` or :py:class:`OptionGroup` :param parsers_and_groups: Parsers and groups to scrape option objects from :return: dict of the form ``{dest_var: [option1, option2, ...], ...}`` """ # Scrape options from MRJob and index them by dest all_options = {} job_option_lists = [g.option_list for g in parsers_and_groups] for option in itertools.chain(*job_option_lists): other_options = all_options.get(option.dest, []) other_options.append(option) all_options[option.dest] = other_options return all_options def scrape_options_into_new_groups(source_groups, assignments): """Puts options from the :py:class:`OptionParser` and :py:class:`OptionGroup` objects in `source_groups` into the keys of `assignments` according to the values of `assignments`. An example: :type source_groups: list of :py:class:`OptionParser` and :py:class:`OptionGroup` objects :param source_groups: parsers/groups to scrape options from :type assignments: dict with keys that are :py:class:`OptionParser` and :py:class:`OptionGroup` objects and values that are lists of strings :param assignments: map empty parsers/groups to lists of destination names that they should contain options for """ all_options = scrape_options_and_index_by_dest(*source_groups) return populate_option_groups_with_options(assignments, all_options) # Thanks to http://lybniz2.sourceforge.net/safeeval.html for # explaining how to do this! def safeeval(expr, globals=None, locals=None): """Like eval, but with nearly everything in the environment blanked out, so that it's difficult to cause mischief. *globals* and *locals* are optional dictionaries mapping names to values for those names (just like in :py:func:`eval`). 
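    A couple of illustrative examples, relying only on the whitelisted
    builtins described above::

        >>> safeeval('set(xrange(3))')
        set([0, 1, 2])
        >>> safeeval('x + 1', locals={'x': 41})
        42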
""" # blank out builtins, but keep None, True, and False safe_globals = {'__builtins__': None, 'True': True, 'False': False, 'None': None, 'set': set, 'xrange': xrange} # add the user-specified global variables if globals: safe_globals.update(globals) return eval(expr, safe_globals, locals) def strip_microseconds(delta): """Return the given :py:class:`datetime.timedelta`, without microseconds. Useful for printing :py:class:`datetime.timedelta` objects. """ return timedelta(delta.days, delta.seconds) def tar_and_gzip(dir, out_path, filter=None, prefix=''): """Tar and gzip the given *dir* to a tarball at *out_path*. If we encounter symlinks, include the actual file, not the symlink. :type dir: str :param dir: dir to tar up :type out_path: str :param out_path: where to write the tarball too :param filter: if defined, a function that takes paths (relative to *dir* and returns ``True`` if we should keep them :type prefix: str :param prefix: subdirectory inside the tarball to put everything into (e.g. ``'mrjob'``) """ if not os.path.isdir(dir): raise IOError('Not a directory: %r' % (dir,)) if not filter: filter = lambda path: True # supposedly you can also call tarfile.TarFile(), but I couldn't # get this to work in Python 2.5.1. Please leave as-is. tar_gz = tarfile.open(out_path, mode='w:gz') for dirpath, dirnames, filenames in os.walk(dir): for filename in filenames: path = os.path.join(dirpath, filename) # janky version of os.path.relpath() (Python 2.6): rel_path = path[len(os.path.join(dir, '')):] if filter(rel_path): # copy over real files, not symlinks real_path = os.path.realpath(path) path_in_tar_gz = os.path.join(prefix, rel_path) tar_gz.add(real_path, arcname=path_in_tar_gz, recursive=False) tar_gz.close() def unarchive(archive_path, dest): """Extract the contents of a tar or zip file at *archive_path* into the directory *dest*. :type archive_path: str :param archive_path: path to archive file :type dest: str :param dest: path to directory where archive will be extracted *dest* will be created if it doesn't already exist. tar files can be gzip compressed, bzip2 compressed, or uncompressed. Files within zip files can be deflated or stored. 
""" if tarfile.is_tarfile(archive_path): with contextlib.closing(tarfile.open(archive_path, 'r')) as archive: archive.extractall(dest) elif zipfile.is_zipfile(archive_path): with contextlib.closing(zipfile.ZipFile(archive_path, 'r')) as archive: for name in archive.namelist(): # the zip spec specifies that front slashes are always # used as directory separators dest_path = os.path.join(dest, *name.split('/')) # now, split out any dirname and filename and create # one and/or the other dirname, filename = os.path.split(dest_path) if dirname and not os.path.exists(dirname): os.makedirs(dirname) if filename: with open(dest_path, 'wb') as dest_file: dest_file.write(archive.read(name)) else: raise IOError('Unknown archive type: %s' % (archive_path,)) mrjob-0.3.3.2/mrjob.egg-info/0000775€q(¼€tzÕß0000000000011741151621021411 5ustar sjohnsonAD\Domain Users00000000000000mrjob-0.3.3.2/mrjob.egg-info/dependency_links.txt0000664€q(¼€tzÕß0000000000111741151621025457 0ustar sjohnsonAD\Domain Users00000000000000 mrjob-0.3.3.2/mrjob.egg-info/not-zip-safe0000664€q(¼€tzÕß0000000000111721024436023640 0ustar sjohnsonAD\Domain Users00000000000000 mrjob-0.3.3.2/mrjob.egg-info/PKG-INFO0000664€q(¼€tzÕß0000001236011741151621022510 0ustar sjohnsonAD\Domain Users00000000000000Metadata-Version: 1.1 Name: mrjob Version: 0.3.3.2 Summary: Python MapReduce framework Home-page: http://github.com/Yelp/mrjob Author: David Marin Author-email: dave@yelp.com License: Apache Description: mrjob ===== .. image:: http://github.com/yelp/mrjob/raw/master/docs/logos/logo_medium.png mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs. `Main documentation `_ mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. It also works with your own Hadoop cluster. Some important features: * Run jobs on EMR, your own Hadoop cluster, or locally (for testing). * Write multi-step jobs (one map-reduce step feeds into the next) * Duplicate your production environment inside Hadoop * Upload your source tree and put it in your job's ``$PYTHONPATH`` * Run make and other setup scripts * Set environment variables (e.g. ``$TZ``) * Easily install python packages from tarballs (EMR only) * Setup handled transparently by ``mrjob.conf`` config file * Automatically interpret error logs from EMR * SSH tunnel to hadoop job tracker on EMR * Minimal setup * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY`` * To run on your Hadoop cluster, install ``simplejson`` and make sure ``$HADOOP_HOME`` is set. Installation ------------ From PyPI: ``pip install mrjob`` From source: ``python setup.py install`` A Simple Map Reduce Job ----------------------- Code for this example and more live in ``mrjob/examples``. :: """The classic MapReduce job: count the frequency of words. """ from mrjob.job import MRJob import re WORD_RE = re.compile(r"[\w']+") class MRWordFreqCount(MRJob): def mapper(self, _, line): for word in WORD_RE.findall(line): yield (word.lower(), 1) def combiner(self, word, counts): yield (word, sum(counts)) def reducer(self, word, counts): yield (word, sum(counts)) if __name__ == '__main__': MRWordFreqCount.run() Try It Out! 

Try It Out!
-----------

::

    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts

Setting up EMR on Amazon
------------------------

* create an `Amazon Web Services account `_
* sign up for `Elastic MapReduce `_
* Get your access and secret keys (click "Security Credentials" on
  `your account page `_)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly

Advanced Configuration
----------------------

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob
looks for its conf file in:

* The contents of ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation `_ for more information.

Links
-----

* source:
* documentation:
* discussion group:
* Hadoop MapReduce:
* Elastic MapReduce:
* PyCon 2011 mrjob overview:

Thanks to `Greg Killion `_ (`blind-works.net `_) for the logo.

Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2.5
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Topic :: System :: Distributed Computing
Provides: mrjob

mrjob-0.3.3.2/mrjob.egg-info/requires.txt
boto>=2.0
PyYAML
simplejson>=2.0.9

mrjob-0.3.3.2/mrjob.egg-info/SOURCES.txt
CHANGES.txt
LICENSE.txt
MANIFEST.in
README.rst
setup.cfg
setup.py
mrjob/__init__.py
mrjob/boto_2_1_1_83aae37b.py
mrjob/compat.py
mrjob/conf.py
mrjob/emr.py
mrjob/hadoop.py
mrjob/inline.py
mrjob/job.py
mrjob/local.py
mrjob/logparsers.py
mrjob/parse.py
mrjob/pool.py
mrjob/protocol.py
mrjob/retry.py
mrjob/runner.py
mrjob/ssh.py
mrjob/util.py
mrjob.egg-info/PKG-INFO
mrjob.egg-info/SOURCES.txt
mrjob.egg-info/dependency_links.txt
mrjob.egg-info/not-zip-safe
mrjob.egg-info/requires.txt
mrjob.egg-info/top_level.txt
mrjob/examples/__init__.py
mrjob/examples/mr_log_sampler.py
mrjob/examples/mr_page_rank.py
mrjob/examples/mr_text_classifier.py
mrjob/examples/mr_wc.py
mrjob/examples/mr_word_freq_count.py
mrjob/tools/__init__.py
mrjob/tools/emr/__init__.py
mrjob/tools/emr/audit_usage.py
mrjob/tools/emr/create_job_flow.py
mrjob/tools/emr/fetch_logs.py
mrjob/tools/emr/job_flow_pool.py
mrjob/tools/emr/mrboss.py
mrjob/tools/emr/report_long_jobs.py
mrjob/tools/emr/s3_tmpwatch.py
mrjob/tools/emr/terminate_idle_job_flows.py
mrjob/tools/emr/terminate_job_flow.py

mrjob-0.3.3.2/mrjob.egg-info/top_level.txt
mrjob

mrjob-0.3.3.2/PKG-INFO (identical to mrjob.egg-info/PKG-INFO above)

mrjob-0.3.3.2/README.rst
mrjob
=====

.. image:: http://github.com/yelp/mrjob/raw/master/docs/logos/logo_medium.png

mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming
jobs.

`Main documentation `_

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows
you to buy time on a Hadoop cluster on an hourly basis. It also works with
your own Hadoop cluster.

Some important features:

* Run jobs on EMR, your own Hadoop cluster, or locally (for testing).
* Write multi-step jobs (one map-reduce step feeds into the next)
* Duplicate your production environment inside Hadoop

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Run make and other setup scripts
  * Set environment variables (e.g. ``$TZ``)
  * Easily install python packages from tarballs (EMR only)
  * Setup handled transparently by ``mrjob.conf`` config file

* Automatically interpret error logs from EMR
* SSH tunnel to hadoop job tracker on EMR
* Minimal setup

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
  * To run on your Hadoop cluster, install ``simplejson`` and make sure
    ``$HADOOP_HOME`` is set.

Installation
------------

From PyPI: ``pip install mrjob``

From source: ``python setup.py install``

A Simple Map Reduce Job
-----------------------

Code for this example and more live in ``mrjob/examples``.

::

    """The classic MapReduce job: count the frequency of words.
    """
    from mrjob.job import MRJob
    import re

    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner(self, word, counts):
            yield (word, sum(counts))

        def reducer(self, word, counts):
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordFreqCount.run()

Try It Out!
-----------

::

    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts

Setting up EMR on Amazon
------------------------

* create an `Amazon Web Services account `_
* sign up for `Elastic MapReduce `_
* Get your access and secret keys (click "Security Credentials" on
  `your account page `_)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly

Advanced Configuration
----------------------

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob
looks for its conf file in:

* The contents of ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation `_ for more information.
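
As a sketch, a minimal ``mrjob.conf`` for EMR might look something like this
(the file is YAML; the option values below are purely illustrative)::

    runners:
      emr:
        aws_region: us-west-1
        num_ec2_instances: 4
        ec2_instance_type: m1.small
        cmdenv:
          TZ: America/Los_Angeles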

Links
-----

* source:
* documentation:
* discussion group:
* Hadoop MapReduce:
* Elastic MapReduce:
* PyCon 2011 mrjob overview:

Thanks to `Greg Killion `_ (`blind-works.net `_) for the logo.

mrjob-0.3.3.2/setup.cfg
[build_sphinx]
source-dir = docs/
build-dir = docs/_build
all_files = 1

[upload_sphinx]
upload-dir = docs/_build/html

[egg_info]
tag_build =
tag_date = 0
tag_svn_revision = 0

mrjob-0.3.3.2/setup.py
try:
    from setuptools import setup
    setup  # quiet "redefinition of unused ..." warning from pyflakes
    # arguments that distutils doesn't understand
    setuptools_kwargs = {
        'install_requires': [
            'boto>=2.0',
            'PyYAML',
            'simplejson>=2.0.9',
        ],
        'provides': ['mrjob'],
        'test_suite': 'tests.suite.load_tests',
        'tests_require': ['unittest2'],
        'zip_safe': False,  # so that we can bootstrap mrjob
    }
except ImportError:
    from distutils.core import setup
    setuptools_kwargs = {}

import mrjob

setup(
    author='David Marin',
    author_email='dave@yelp.com',
    classifiers=[
        'Development Status :: 4 - Beta',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: Apache Software License',
        'Natural Language :: English',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2.5',
        'Programming Language :: Python :: 2.6',
        'Programming Language :: Python :: 2.7',
        'Topic :: System :: Distributed Computing',
    ],
    description='Python MapReduce framework',
    license='Apache',
    long_description=open('README.rst').read(),
    name='mrjob',
    packages=['mrjob', 'mrjob.examples', 'mrjob.tools', 'mrjob.tools.emr'],
    url='http://github.com/Yelp/mrjob',
    version=mrjob.__version__,
    **setuptools_kwargs
)
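
# A quick sketch of how this setup script is typically exercised from the
# root of an unpacked source tree (``python setup.py test`` uses the
# test_suite/tests_require settings above, so unittest2 should be pulled in
# automatically on Python versions that need it):
#
#     python setup.py install
#     python setup.py test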