Discussion:
[PATCH RFC 0/1] cpufreq/x86: Add P-state driver for sandy bridge.
d***@gmail.com
2012-12-05 19:01:30 UTC
From: Dirk Brandewie <***@gmail.com>

This driver provides a P-state driver for Sandybridge and Ivybridge
processors.

Motivation:
The goal of this driver is to improve the power efficiency of
Sandybridge/Ivybridge based systems. As the investigation into how to
achieve this goal progressed it became apparent (to me) that some of the
design assumptions of the cpufreq subsystem are no longer valid and
that a micro-architecture specific P-state driver would be less complex
and potentially more efficient. As Intel continues to innovate in the
area of frequency/power control this will become more true IMHO.

General info:
The driver uses a PID controller to adjust the core frequency based on
the presented load. The driver exposes the tuning parameters for the
controller in the /sys/devices/system/cpu/cpufreq/snb directory. The
controller code is being used in PI mode with the default tuning
parameters.

Tuning parameters:
setpoint - load in percent the controller will attempt to maintain on the core.
sample_rate_ms - rate at which the driver will sample the load on the core.
deadband - percent ± around the setpoint the controller will
consider zero error.
p_gain_pct - Proportional gain in percent.
i_gain_pct - Integral gain in percent.
d_gain_pct - Derivative gain in percent.

To use the driver, run the following shell script as root:
#!/bin/sh
for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
echo snb > $file
done
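
The same sysfs directory can be used to adjust the tuning parameters listed
above at runtime. A minimal sketch (the values are only illustrative, not
recommended settings; per the sysfs store hooks in the patch, writing any of
these files resets the PID state on all online CPUs):

#!/bin/sh
# Illustrative values only: raise the load setpoint slightly and allow a
# small deadband around it, then show all current tunables.
cd /sys/devices/system/cpu/cpufreq/snb
echo 105 > setpoint
echo 2 > deadband
grep . sample_rate_ms deadband setpoint p_gain_pct i_gain_pct d_gain_pct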

Limitations:

ATM this driver will only run on SandyBridge systems; testing on
Ivybridge systems is not complete.

Open Questions:

What is the correct way to integrate this driver into the system? The
current implementation registers as a cpufreq frequency governor; this
was done to streamline testing by using cpufreq to load/unload governors.

What tuning parameters should be exposed via sysfs (if any)? ATM all
the PID parameters are exposed to enable tuning of the driver.


Performance information:

--- Kernel build ---
The following is data collected for a bzImage kernel build. The
commands used were:
make -j8 clean
sysctl -w vm.drop_caches=3
/usr/bin/time -f "%E %c" make -j8 bzImage

Time and context switches measured with /usr/bin/time -f "%E %c"

Energy measured with the package energy status MSR described in section
14.7 of the Intel® 64 and IA-32 Architectures Software Developer's
Manual Volume 3.
http://download.intel.com/products/processor/manual/325384.pdf

Average watts calculated with energy/time in seconds
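
For reference, the package energy counter can also be sampled from user
space with msr-tools; a rough sketch is below (the MSR addresses 0x606 and
0x611 and the unit encoding are taken from the SDM section cited above, not
from this message, and the exact procedure used for the numbers below is
not stated here). As a sanity check, 6660 J over 2:24.49 (144.49 s) works
out to the 46.09 average watts shown below for perf.

#!/bin/sh
# Sketch only: needs the msr module (modprobe msr) and rdmsr from msr-tools.
# MSR_RAPL_POWER_UNIT (0x606) bits 12:8 give the energy unit as 1/2^esu J;
# MSR_PKG_ENERGY_STATUS (0x611) is a 32-bit accumulating energy counter
# (wraparound ignored here, which is fine for short runs).
units=$(rdmsr -u -p 0 0x606)
esu=$(( (units >> 8) & 0x1f ))
e1=$(rdmsr -u -p 0 0x611)
sleep 10                  # ... run the workload to be measured here ...
e2=$(rdmsr -u -p 0 0x611)
joules=$(( (e2 - e1) >> esu ))
echo "energy: ${joules} J  average watts: $(( joules / 10 ))"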

             time       ctx sw   energy (J)   avg watts
perf         02:24.49   116721      6660        46.09
snb          02:27.03   114940      6591        44.83
ondemand     02:26.83   190948      6683        45.51

A graph of the power usage during the kernel build for each governor
is available here:

--- Power benchmark ---
I used an industry standard power benchmark suite to compare the performance
and ondemand governors against the Sandybridge governor.

Governor    | ssj_ops/watt
------------+-------------
performance |     1855
ondemand    |     1839
snb         |     2016

A graph of the power usage for each governor is available here:

A graph showing the results of the cpufreq-bench tool shipped with the
kernel, collected with
cpufreq-bench -l 6000 -s 6000 -x 2000 -y 2000 -c 0 \
    -g {ondemand | snb} -n 40 -r 40
is available here:

Dirk Brandewie (1):
cpufreq/x86: Add P-state driver for sandy bridge.

drivers/cpufreq/Kconfig.x86 | 8 +
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/cpufreq_snb.c | 727 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 736 insertions(+), 0 deletions(-)
create mode 100644 drivers/cpufreq/cpufreq_snb.c

--
1.7.7.6

d***@gmail.com
2012-12-05 19:01:31 UTC
From: Dirk Brandewie <***@gmail.com>

Add a P-state driver for the Sandy bridge processor.

This driver provides better power efficiency than the current
governors of the Intel architecture. The driver masquerades as a
frequency governor to the cpufreq subsystem but does not use cpufreq
to change frequency.

Issues:
Does not report the current frequency via the cpufreq subsystem, so this
confuses some tools.
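
(A minimal illustration of the kind of confusion meant here, assuming an
underlying cpufreq driver such as acpi-cpufreq is still registered -- an
assumption, not something stated in this patch:)

# The standard cpufreq view of the current frequency; with the snb
# governor in control it does not track the P-states this driver requests.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq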


Signed-off-by: Dirk Brandewie <***@gmail.com>
---
drivers/cpufreq/Kconfig.x86 | 8 +
drivers/cpufreq/Makefile | 1 +
drivers/cpufreq/cpufreq_snb.c | 727 +++++++++++++++++++++++++++++++++++++++++
3 files changed, 736 insertions(+), 0 deletions(-)
create mode 100644 drivers/cpufreq/cpufreq_snb.c

diff --git a/drivers/cpufreq/Kconfig.x86 b/drivers/cpufreq/Kconfig.x86
index 934854a..8c8acd3 100644
--- a/drivers/cpufreq/Kconfig.x86
+++ b/drivers/cpufreq/Kconfig.x86
@@ -2,6 +2,14 @@
# x86 CPU Frequency scaling drivers
#

+config X86_SNB_CPUFREQ
+ tristate "SandyBridge frequency governor"
+ help
+ This driver will override the CPU_FREQ subsystem's frequency
+ selection when the system has a SandyBridge processor.
+
+ If in doubt, say N.
+
config X86_PCC_CPUFREQ
tristate "Processor Clocking Control interface driver"
depends on ACPI && ACPI_PROCESSOR
diff --git a/drivers/cpufreq/Makefile b/drivers/cpufreq/Makefile
index 1bc90e1..71ad49e 100644
--- a/drivers/cpufreq/Makefile
+++ b/drivers/cpufreq/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_X86_SPEEDSTEP_SMI) += speedstep-smi.o
obj-$(CONFIG_X86_SPEEDSTEP_CENTRINO) += speedstep-centrino.o
obj-$(CONFIG_X86_P4_CLOCKMOD) += p4-clockmod.o
obj-$(CONFIG_X86_CPUFREQ_NFORCE2) += cpufreq-nforce2.o
+obj-$(CONFIG_X86_SNB_CPUFREQ) += cpufreq_snb.o

##################################################################################
# ARM SoC drivers
diff --git a/drivers/cpufreq/cpufreq_snb.c b/drivers/cpufreq/cpufreq_snb.c
new file mode 100644
index 0000000..0d46862
--- /dev/null
+++ b/drivers/cpufreq/cpufreq_snb.c
@@ -0,0 +1,727 @@
+/*
+ * cpufreq_snb.c: Native P state management for Intel processors
+ *
+ * (C) Copyright 2012 Intel Corporation
+ * Author: Dirk Brandewie <***@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+
+
+#include <linux/kernel.h>
+#include <linux/kernel_stat.h>
+#include <linux/module.h>
+#include <linux/ktime.h>
+#include <linux/hrtimer.h>
+#include <linux/tick.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/list.h>
+#include <linux/cpufreq.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+
+#include <trace/events/power.h>
+
+#include <asm/div64.h>
+#include <asm/msr.h>
+#include <asm/cpu_device_id.h>
+
+#define SAMPLE_COUNT 3
+
+struct sampling_state {
+ int idle_mode;
+ int first_sample;
+};
+
+struct sample {
+ ktime_t start_time;
+ ktime_t end_time;
+ int core_pct_busy;
+ int freq_pct_busy;
+ u64 duration_us;
+ u64 idletime_us;
+ u64 aperf;
+ u64 mperf;
+};
+
+struct freqdata {
+ int current_freq;
+ int min_freq;
+ int max_freq;
+ int turbo_freq;
+};
+
+struct _pid {
+ int setpoint;
+ int32_t integral;
+ int32_t p_gain;
+ int32_t i_gain;
+ int32_t d_gain;
+ int deadband;
+ int last_err;
+};
+
+struct cpudata {
+ int cpu;
+
+ char name[64];
+
+ struct timer_list timer;
+
+ struct freq_adjust_policy *freq_policy;
+ struct freqdata clock;
+ struct sampling_state sampling_state;
+ struct _pid pid;
+ struct _pid idle_pid;
+
+ int min_freq_count;
+
+ ktime_t prev_sample;
+ u64 prev_idle_time_us;
+ u64 prev_aperf;
+ u64 prev_mperf;
+ int sample_ptr;
+ struct sample samples[SAMPLE_COUNT];
+};
+
+static unsigned int snb_usage;
+static DEFINE_MUTEX(snb_mutex);
+
+struct cpudata **all_cpu_data;
+struct freq_adjust_policy {
+ int sample_rate_ms; /* load sampling rate in milliseconds */
+ int deadband; /* +/- percent around setpoint treated as zero error */
+ int setpoint; /* busy percentage the controller tries to maintain */
+ int p_gain_pct;
+ int d_gain_pct;
+ int i_gain_pct;
+};
+
+struct freq_adjust_policy default_policy = {
+ .sample_rate_ms = 10,
+ .deadband = 0,
+ .setpoint = 109,
+ .p_gain_pct = 17,
+ .d_gain_pct = 0,
+ .i_gain_pct = 4,
+};
+
+#define FRAC_BITS 8
+#define int_tofp(X) ((int64_t)(X) << FRAC_BITS)
+#define fp_toint(X) ((X) >> FRAC_BITS)
+
+static inline int32_t mul_fp(int32_t x, int32_t y)
+{
+ return ((int64_t)x * (int64_t)y) >> FRAC_BITS;
+}
+
+static inline int32_t div_fp(int32_t x, int32_t y)
+{
+ return div_s64((int64_t)x << FRAC_BITS, (int64_t)y);
+}
+
+
+static inline void pid_reset(struct _pid *pid, int setpoint, int busy,
+ int deadband, int integral) {
+ pid->setpoint = setpoint;
+ pid->deadband = deadband;
+ pid->integral = int_tofp(integral);
+ pid->last_err = setpoint - busy;
+}
+
+static inline void pid_p_gain_set(struct _pid *pid, int percent)
+{
+ pid->p_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+
+static inline void pid_i_gain_set(struct _pid *pid, int percent)
+{
+ pid->i_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+
+static inline void pid_d_gain_set(struct _pid *pid, int percent)
+{
+
+ pid->d_gain = div_fp(int_tofp(percent), int_tofp(100));
+}
+
+static inline int pid_calc(struct _pid *pid, int busy)
+{
+ int err, result;
+ int32_t pterm, dterm, fp_error;
+ int32_t integral_limit;
+
+ integral_limit = int_tofp(30);
+ err = pid->setpoint - busy;
+
+ if (abs(err) <= pid->deadband)
+ return 0;
+
+ fp_error = int_tofp(err);
+ pterm = mul_fp(pid->p_gain, fp_error);
+ pid->integral += mul_fp(pid->i_gain, fp_error);
+
+ /* limit the integral term */
+ if (pid->integral > integral_limit)
+ pid->integral = integral_limit;
+ if (pid->integral < -integral_limit)
+ pid->integral = -integral_limit;
+
+ dterm = mul_fp(pid->d_gain, (err - pid->last_err));
+ result = pterm + pid->integral + dterm;
+
+ pid->last_err = err;
+ return fp_toint(result);
+}
+
+
+static inline void snb_busy_pid_reset(struct cpudata *cpu)
+{
+ pid_reset(&cpu->pid,
+ cpu->freq_policy->setpoint,
+ 100,
+ cpu->freq_policy->deadband,
+ 0);
+
+ pid_p_gain_set(&cpu->pid, cpu->freq_policy->p_gain_pct);
+ pid_d_gain_set(&cpu->pid, cpu->freq_policy->d_gain_pct);
+ pid_i_gain_set(&cpu->pid, cpu->freq_policy->i_gain_pct);
+}
+
+static inline void snb_idle_pid_reset(struct cpudata *cpu)
+{
+ pid_reset(&cpu->idle_pid,
+ 75,
+ 50,
+ cpu->freq_policy->deadband,
+ 0);
+
+ pid_p_gain_set(&cpu->idle_pid, cpu->freq_policy->p_gain_pct);
+ pid_d_gain_set(&cpu->idle_pid, cpu->freq_policy->d_gain_pct);
+ pid_i_gain_set(&cpu->idle_pid, cpu->freq_policy->i_gain_pct);
+}
+
+static inline void snb_reset_all_pid(void)
+{
+ unsigned int cpu;
+ for_each_online_cpu(cpu) {
+ if (all_cpu_data[cpu])
+ snb_busy_pid_reset(all_cpu_data[cpu]);
+ }
+}
+
+/************************** sysfs begin ************************/
+#define show_one(file_name, object) \
+ static ssize_t show_##file_name \
+ (struct kobject *kobj, struct attribute *attr, char *buf) \
+ { \
+ return sprintf(buf, "%u\n", default_policy.object); \
+ }
+
+static ssize_t store_sample_rate_ms(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+ if (ret != 1)
+ return -EINVAL;
+ default_policy.sample_rate_ms = input;
+ snb_reset_all_pid();
+ return count;
+}
+
+static ssize_t store_d_gain_pct(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+ if (ret != 1)
+ return -EINVAL;
+ default_policy.d_gain_pct = input;
+ snb_reset_all_pid();
+
+ return count;
+}
+
+static ssize_t store_i_gain_pct(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+ if (ret != 1)
+ return -EINVAL;
+ default_policy.i_gain_pct = input;
+ snb_reset_all_pid();
+
+ return count;
+}
+
+static ssize_t store_deadband(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+ if (ret != 1)
+ return -EINVAL;
+ default_policy.deadband = input;
+ snb_reset_all_pid();
+
+ return count;
+}
+
+static ssize_t store_setpoint(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+ if (ret != 1)
+ return -EINVAL;
+ default_policy.setpoint = input;
+ snb_reset_all_pid();
+
+ return count;
+}
+
+static ssize_t store_p_gain_pct(struct kobject *a, struct attribute *b,
+ const char *buf, size_t count)
+{
+ unsigned int input;
+ int ret;
+ ret = sscanf(buf, "%u", &input);
+ if (ret != 1)
+ return -EINVAL;
+ default_policy.p_gain_pct = input;
+ snb_reset_all_pid();
+
+ return count;
+}
+
+show_one(sample_rate_ms, sample_rate_ms);
+show_one(d_gain_pct, d_gain_pct);
+show_one(i_gain_pct, i_gain_pct);
+show_one(deadband, deadband);
+show_one(setpoint, setpoint);
+show_one(p_gain_pct, p_gain_pct);
+
+
+define_one_global_rw(sample_rate_ms);
+define_one_global_rw(d_gain_pct);
+define_one_global_rw(i_gain_pct);
+define_one_global_rw(deadband);
+define_one_global_rw(setpoint);
+define_one_global_rw(p_gain_pct);
+
+
+static struct attribute *snb_attributes[] = {
+ &sample_rate_ms.attr,
+ &d_gain_pct.attr,
+ &i_gain_pct.attr,
+ &deadband.attr,
+ &setpoint.attr,
+ &p_gain_pct.attr,
+ NULL
+};
+
+static struct attribute_group snb_attr_group = {
+ .attrs = snb_attributes,
+ .name = "snb",
+};
+
+/************************** sysfs end ************************/
+
+static int snb_get_min_freq(void)
+{
+ u64 value;
+ rdmsrl(0xCE, value);
+ return (value >> 40) & 0xFF;
+}
+
+static int snb_get_max_freq(void)
+{
+ u64 value;
+ rdmsrl(0xCE, value);
+ return (value >> 8) & 0xFF;
+}
+
+static int snb_get_turbo_freq(void)
+{
+ u64 value;
+ int nont, ret;
+ rdmsrl(0x1AD, value);
+ nont = snb_get_max_freq();
+ ret = ((value) & 255);
+ if (ret <= nont)
+ ret = nont;
+ return ret;
+}
+
+
+static void snb_set_freq(struct cpudata *cpu, int clock)
+{
+ clock = clamp_t(int, clock, cpu->clock.min_freq, cpu->clock.turbo_freq);
+
+ if (clock == cpu->clock.current_freq)
+ return;
+
+#ifndef MODULE
+ trace_cpu_frequency(clock * 100000, cpu->cpu);
+ trace_power_frequency(POWER_PSTATE, clock * 100000, cpu->cpu);
+#endif
+
+ cpu->clock.current_freq = clock;
+ wrmsrl(MSR_IA32_PERF_CTL, clock << 8);
+}
+
+static inline void snb_freq_increase(struct cpudata *cpu, int steps)
+{
+ int target;
+ target = cpu->clock.current_freq + steps;
+
+ snb_set_freq(cpu, target);
+}
+
+static inline void snb_freq_decrease(struct cpudata *cpu, int steps)
+{
+ int target;
+ target = cpu->clock.current_freq - steps;
+ snb_set_freq(cpu, target);
+}
+
+static void snb_get_cpu_freqs(struct cpudata *cpu)
+{
+ sprintf(cpu->name, "Intel 2nd generation core");
+
+ cpu->clock.min_freq = snb_get_min_freq();
+ cpu->clock.max_freq = snb_get_max_freq();
+ cpu->clock.turbo_freq = snb_get_turbo_freq();
+
+ /* Go to max clock so we don't slow up boot if we are built in;
+ * if we are a module we will take care of it during normal
+ * operation.
+ */
+ snb_set_freq(cpu, cpu->clock.max_freq);
+}
+
+
+static inline void snb_calc_busy(struct cpudata *cpu, struct sample *sample)
+{
+ u64 core_pct;
+
+ sample->freq_pct_busy = 100 - div64_u64(
+ sample->idletime_us * 100,
+ sample->duration_us);
+ core_pct = div64_u64(sample->aperf * 100, sample->mperf);
+ sample->core_pct_busy = sample->freq_pct_busy * core_pct / 100;
+}
+
+static inline int snb_sample(struct cpudata *cpu)
+{
+ ktime_t now;
+ u64 idle_time_us;
+ u64 aperf, mperf;
+
+ now = ktime_get();
+ idle_time_us = get_cpu_idle_time_us(cpu->cpu, NULL);
+
+ rdmsrl(MSR_IA32_APERF, aperf);
+ rdmsrl(MSR_IA32_MPERF, mperf);
+ /* for the first sample, don't actually record a sample, just
+ * set the baseline */
+ if (cpu->prev_idle_time_us > 0) {
+ cpu->sample_ptr = (cpu->sample_ptr + 1) % SAMPLE_COUNT;
+ cpu->samples[cpu->sample_ptr].start_time = cpu->prev_sample;
+ cpu->samples[cpu->sample_ptr].end_time = now;
+ cpu->samples[cpu->sample_ptr].duration_us =
+ ktime_us_delta(now, cpu->prev_sample);
+ cpu->samples[cpu->sample_ptr].idletime_us =
+ idle_time_us - cpu->prev_idle_time_us;
+
+ cpu->samples[cpu->sample_ptr].aperf = aperf;
+ cpu->samples[cpu->sample_ptr].mperf = mperf;
+ cpu->samples[cpu->sample_ptr].aperf -= cpu->prev_aperf;
+ cpu->samples[cpu->sample_ptr].mperf -= cpu->prev_mperf;
+
+ snb_calc_busy(cpu, &cpu->samples[cpu->sample_ptr]);
+ }
+
+ cpu->prev_sample = now;
+ cpu->prev_idle_time_us = idle_time_us;
+ cpu->prev_aperf = aperf;
+ cpu->prev_mperf = mperf;
+ return cpu->sample_ptr;
+}
+
+static inline void snb_set_sample_time(struct cpudata *cpu)
+{
+ int sample_time;
+ int delay;
+
+ sample_time = cpu->freq_policy->sample_rate_ms;
+ delay = msecs_to_jiffies(sample_time);
+ delay -= jiffies % delay;
+ mod_timer(&cpu->timer, jiffies + delay);
+}
+
+static inline void snb_idle_mode(struct cpudata *cpu)
+{
+ cpu->sampling_state.idle_mode = 1;
+}
+
+static inline void snb_normal_mode(struct cpudata *cpu)
+{
+ cpu->sampling_state.idle_mode = 0;
+}
+
+static inline int snb_get_scaled_busy(struct cpudata *cpu)
+{
+ int32_t busy_scaled;
+ int32_t core_busy, turbo_freq, current_freq;
+
+ core_busy = int_tofp(cpu->samples[cpu->sample_ptr].core_pct_busy);
+ turbo_freq = int_tofp(cpu->clock.turbo_freq);
+ current_freq = int_tofp(cpu->clock.current_freq);
+ busy_scaled = mul_fp(core_busy, div_fp(turbo_freq, current_freq));
+
+ return fp_toint(busy_scaled);
+}
+
+static inline void snb_adjust_busy_freq(struct cpudata *cpu)
+{
+ int busy_scaled;
+ struct _pid *pid;
+ int ctl = 0;
+ int steps;
+
+ pid = &cpu->pid;
+
+ busy_scaled = snb_get_scaled_busy(cpu);
+
+ ctl = pid_calc(pid, busy_scaled);
+
+ steps = abs(ctl);
+ if (ctl < 0)
+ snb_freq_increase(cpu, steps);
+ else
+ snb_freq_decrease(cpu, steps);
+}
+
+static inline void snb_adjust_idle_freq(struct cpudata *cpu)
+{
+ int busy_scaled;
+ struct _pid *pid;
+ int ctl = 0;
+ int steps;
+
+ pid = &cpu->idle_pid;
+
+ busy_scaled = snb_get_scaled_busy(cpu);
+
+ ctl = pid_calc(pid, 100 - busy_scaled);
+
+ steps = abs(ctl);
+ if (ctl < 0)
+ snb_freq_decrease(cpu, steps);
+ else
+ snb_freq_increase(cpu, steps);
+
+ if (cpu->clock.current_freq == cpu->clock.min_freq)
+ snb_normal_mode(cpu);
+}
+static inline int snb_valid_sample(struct cpudata *cpu, int idx)
+{
+ struct sample *sample = &cpu->samples[idx];
+
+ return sample->duration_us <
+ (cpu->freq_policy->sample_rate_ms * USEC_PER_MSEC * 2);
+}
+
+static void snb_timer_func(unsigned long __data)
+{
+ struct cpudata *cpu = (struct cpudata *) __data;
+ struct freq_adjust_policy *policy;
+ int idx;
+
+ policy = cpu->freq_policy;
+
+ idx = snb_sample(cpu);
+
+ if (snb_valid_sample(cpu, idx)) {
+ if (!cpu->sampling_state.idle_mode)
+ snb_adjust_busy_freq(cpu);
+ else
+ snb_adjust_idle_freq(cpu);
+ }
+
+#if defined(XPERF_FIX)
+ if (cpu->clock.current_freq == cpu->clock.min_freq) {
+ cpu->min_freq_count++;
+ if (!(cpu->min_freq_count % 5)) {
+ snb_set_freq(cpu, cpu->clock.max_freq);
+ snb_idle_mode(cpu);
+ }
+ } else
+ cpu->min_freq_count = 0;
+#endif
+ snb_set_sample_time(cpu);
+}
+
+static void snb_exit(unsigned int cpu)
+{
+ if (!all_cpu_data)
+ return;
+ pr_info("snb: disabling %d\n", cpu);
+ if (all_cpu_data[cpu]) {
+ del_timer_sync(&all_cpu_data[cpu]->timer);
+ kfree(all_cpu_data[cpu]);
+ }
+}
+
+#define ICPU(model, policy) \
+ { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&policy }
+
+static const struct x86_cpu_id intel_cpufreq_ids[] = {
+ ICPU(0x2a, default_policy),
+ ICPU(0x2d, default_policy),
+ {}
+};
+MODULE_DEVICE_TABLE(x86cpu, intel_cpufreq_ids);
+
+static int snb_init(unsigned int cpu)
+{
+ int rc;
+ const struct x86_cpu_id *id;
+
+ id = x86_match_cpu(intel_cpufreq_ids);
+ if (!id)
+ return -ENODEV;
+
+ all_cpu_data[cpu] = kzalloc(sizeof(struct cpudata), GFP_KERNEL);
+ if (!all_cpu_data[cpu]) {
+ rc = -ENOMEM;
+ goto unwind;
+ }
+
+ snb_get_cpu_freqs(all_cpu_data[cpu]);
+
+ all_cpu_data[cpu]->cpu = cpu;
+ all_cpu_data[cpu]->freq_policy =
+ (struct freq_adjust_policy *)id->driver_data;
+ init_timer_deferrable(&all_cpu_data[cpu]->timer);
+ all_cpu_data[cpu]->timer.function = snb_timer_func;
+ all_cpu_data[cpu]->timer.data =
+ (unsigned long)all_cpu_data[cpu];
+ all_cpu_data[cpu]->timer.expires = jiffies + HZ/100;
+ snb_busy_pid_reset(all_cpu_data[cpu]);
+ snb_idle_pid_reset(all_cpu_data[cpu]);
+ pr_info("snb: enabling %d\n", cpu);
+ add_timer_on(&all_cpu_data[cpu]->timer, cpu);
+ return 0;
+
+unwind:
+ snb_exit(cpu);
+ return -ENODEV;
+}
+
+/**
+ * cpufreq_snb_set - cpufreq setspeed hook
+ * @policy: pointer to policy struct where freq is being set
+ * @freq: target frequency in kHz
+ *
+ * The request is ignored; P-state selection is handled by the driver.
+ */
+static int cpufreq_snb_set(struct cpufreq_policy *policy, unsigned int freq)
+{
+ int ret = 0;
+ return ret;
+}
+
+
+static ssize_t cpufreq_snb_show_speed(struct cpufreq_policy *policy, char *buf)
+{
+ return 0;
+}
+
+static int cpufreq_snb(struct cpufreq_policy *policy,
+ unsigned int event)
+{
+ unsigned int cpu = policy->cpu;
+ int rc = 0;
+
+ switch (event) {
+ case CPUFREQ_GOV_START:
+ if (!cpu_online(cpu))
+ return -EINVAL;
+ mutex_lock(&snb_mutex);
+ snb_usage++;
+ rc = snb_init(cpu);
+
+ if (snb_usage == 1)
+ rc = sysfs_create_group(cpufreq_global_kobject,
+ &snb_attr_group);
+
+ mutex_unlock(&snb_mutex);
+ break;
+ case CPUFREQ_GOV_STOP:
+ mutex_lock(&snb_mutex);
+ snb_usage--;
+ snb_exit(cpu);
+ if (!snb_usage)
+ sysfs_remove_group(cpufreq_global_kobject,
+ &snb_attr_group);
+
+ mutex_unlock(&snb_mutex);
+ break;
+ case CPUFREQ_GOV_LIMITS:
+ mutex_lock(&snb_mutex);
+ mutex_unlock(&snb_mutex);
+ break;
+ }
+ return rc;
+}
+
+static struct cpufreq_governor cpufreq_gov_snb = {
+ .name = "snb",
+ .governor = cpufreq_snb,
+ .store_setspeed = cpufreq_snb_set,
+ .show_setspeed = cpufreq_snb_show_speed,
+ .owner = THIS_MODULE,
+};
+
+static int __init cpufreq_gov_snb_init(void)
+{
+ pr_info("Sandybridge frequency driver initializing.\n");
+
+ all_cpu_data = vmalloc(sizeof(void *) * num_possible_cpus());
+ if (!all_cpu_data)
+ return -ENOMEM;
+ memset(all_cpu_data, 0, sizeof(void *) * num_possible_cpus());
+
+
+ return cpufreq_register_governor(&cpufreq_gov_snb);
+}
+
+static void __exit cpufreq_gov_snb_exit(void)
+{
+ vfree(all_cpu_data);
+ all_cpu_data = NULL;
+ cpufreq_unregister_governor(&cpufreq_gov_snb);
+}
+
+
+MODULE_AUTHOR("Dirk Brandewie <***@intel.com>");
+MODULE_DESCRIPTION("'cpufreq_snb' - cpufreq governor for Sandy Bridge");
+MODULE_LICENSE("GPL");
+
+
+fs_initcall(cpufreq_gov_snb_init);
+module_exit(cpufreq_gov_snb_exit);
--
1.7.7.6

David C Niemi
2012-12-05 20:28:10 UTC
Dirk,

I applaud the work you are doing. In general I believe it is important
to separate policy (governor and its settings) from the driver,
particularly so as different end-users have very different goals for
power management. Not everyone is trying to maximize performance per
watt per se (in fact probably rather few end users are doing so
literally). In server applications, for example, the first priority is
typically maximum performance when under heavy load, and the second
priority is minimum power consumption at idle. There may not ever be a
benefit for choosing one of the middle clock states. The OnDemand
governor with the sampling_down_factor set to ~100 can do quite well at
this, at least compared to implementations prior to yours. Another
consideration is that just blindly trying to run flat out all the time
(e.g. the old performance governor approach) bumps you up against your
thermal limits and can actually slow you down, vs. intelligently
powersaving idle hardware threads -- so a user who totally aims for
performance with no regard for power savings cannot avoid paying some
attention to power management.

So what you have sounds like both a new driver (very important) and a
new governor (also potentially very useful), with some of the dynamic
portions of power management handled by the hardware itself. Ideally
the new driver would be separated from the new governor in a somewhat
modular way (so that implementation and policy can be separated). And
ideally it would be nice if the new driver can be compatible with the
existing governor by exposing an ability to set and report current
frequencies. But if this is impractical or pointless for Sandy Bridge,
so be it. I expect your new governor probably could not sit on top of
any of the existing drivers, but some of the existing drivers could
perhaps be enhanced to provide the necessary hooks, and it would be bad
to have to implement the same policy framework over and over for all
past and future hardware drivers that want to benefit from your work.
So outside of a research kernel, I don't think having a "cpufreq/snb"
directory is a good place to expose tuning parameters; the exposed
interface should be generalized as much as possible and not be so
implementation-specific. In the long run both integrators and
maintainers of Linux distributions are going to insist on a generic
interface that can work across the vast majority of modern hardware,
rather than cater to a special case that only works on one or two CPU
families, even if those families are particularly important ones.

David C Niemi
Post by d***@gmail.com
This driver provides a P state driver for Sandybridge and Ivybridge
processors.

The goal of this driver is to improve the power efficiency of
Sandybridge/Ivybridge based systems. As the investigation into how to
achieve this goal progressed it became apparent (to me) that some of
the design assumptions of the cpufreq subsystem are no longer valid and
that a micro-architecure specific P state driver would be less complex
and potentially more effiecent. As Intel continues to innovate in the
area of freqency/power control this will become more true IMHO.
The driver uses a PID controller to adjust the core frequency based on
the presented load. The driver exposes the tuning parameters for the
controller in the /sys/devices/system/cpu/cpufreq/snb directory. The
controller code is being used in PI mode with the default tuning
parmeters.
setpoint - load in percent on the core will attempt to maintain.
sample_rate_ms - rate at which the driver will sample the load on the core.
deadband - percent ± around the setpoint the controller will
consider zero error.
p_gain_pct - Proportional gain in percent.
i_gain_pct - Integral gain in percent.
d_gain_pct - Derivative gain in percent
#!/bin/sh
for file in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
do
echo snb > $file
done
ATM this driver will only run on SandyBridge systems testing on
Ivybridge systems is not complete.

What is the correct way to integrate this driver into the system? The
current implementation registers as a cpufreq frequency governor, this
was done to streamline testing using cpufreq to load/unload governors.
What tuning parameters should be exposed via sysfs (if any)? ATM all
the PID parameters are exposed to enable tuning of the driver.
....

Arjan van de Ven
2012-12-05 21:01:27 UTC
Post by David C Niemi
Dirk,
I applaud the work you are doing. In general I believe it is important
to separate policy (governor and its settings) from the driver,
particularly so as different end-users have very different goals for
power management. Not everyone is trying to maximize performance per
watt per se (in fact probably rather few end users are doing so
literally). In server applications, for example, the first priority is
typically maximum performance when under heavy load, and the second
priority is minimum power consumption at idle. There may not ever be a
benefit for choosing one of the middle clock states. The OnDemand
governor with the sampling_down_factor set to ~100 can do quite well at
this, at least compared to implementations prior to yours. Another
consideration is that just blindly trying to run flat out all the time
(e.g. the old performance governor approach) bumps you up against your
thermal limits and can actually slow you down, vs. intelligently
powersaving idle hardware threads -- so a user who totally aims for
performance with no regard for power savings cannot avoid must paying
some attention to power management.
the idea that you can have separate policy and hardware is a big fallacy though.
A good policy ends up very hardware specific, and policies of the past work poorly on today's hardware
("ondemand" is one of the worst case behaviors you can have on modern Intel cpus for example).

While I appreciate the desire for some level of "preference" control, the split of policy and hardware in the
way cpufreq did that really isn't the way to go forward...

David C Niemi
2012-12-05 21:40:04 UTC
Post by Arjan van de Ven
the idea that you can have separate policy and hardware is a big fallacy though.
A good policy ends up very hardware specific, and policies of the past work poorly on todays hardware
("ondemand" is one of the worst case behaviors you can have on modern Intel cpus for example).
While I appreciate the desire for some level of "preference" control, the split of policy and hardware in the
way cpufreq did that really isn't the way to go forward...
I don't think separating policy from implementation is a fallacy at all; it is good design practice. Policy is a distillation of the "priorities and intent of the end user". It can be very high level, saying whether to prioritize single-thread performance, multi-thread performance, power savings, responsiveness coming out of idle, performance per watt on mid-level loads, etc. Maybe you can have one governor try to cater to all those things, or you have separate governors each targeting a subset of the use cases.

The problem with the existing governor configuration interfaces is that they are too detailed and too implementation-specific, as they grew out of an environment and mindset in which changing frequency was the only thing you could really control. The nice thing about the P-state driver is that it breaks new ground and can save power in new ways. But it should not repeat the mistake of just exposing implementation-specific knobs to tweak. That might be good for experimentation but it won't be good for widespread use. We should have a generalized interface between drivers and governors, for backwards and forwards compatibility reasons, and obviously that interface needs work to support more modern power saving approaches like this one. What is exposed to the end user is then up to each governor.

I would also say that the ondemand and performance governors are very widely used, and people will expect them to still work with any new driver. But maybe their attempts at changing frequency could be reinterpreted in new ways. All "performance" ever says is "run at frequency XXXX", and ondemand just wants to "run hardware thread x at max performance until further notice" and "now run hardware thread x at minimum power consumption". The latter would probably be easy to interpret. The former, depends on being able to set an explicit frequency any time you feel like it and have the hardware thread stay there, not sure how realistic that is.

DCN
Arjan van de Ven
2012-12-05 21:54:20 UTC
Post by David C Niemi
Post by Arjan van de Ven
the idea that you can have separate policy and hardware is a big fallacy though.
A good policy ends up very hardware specific, and policies of the past work poorly on todays hardware
("ondemand" is one of the worst case behaviors you can have on modern Intel cpus for example).
While I appreciate the desire for some level of "preference" control, the split of policy and hardware in the
way cpufreq did that really isn't the way to go forward...
I don't think separating policy from implementation is a fallacy at all, it is good design practice.
thinking that policy is independent of the hardware is a fallacy.
Preference is what the user wants, sure. But a policy agent (governor) that implements that preference is very hardware
dependent.
Post by David C Niemi
I would also say that the ondemand and performance governors are very widely used,
here's where things go wrong. "ondemand" does not indicate a power-versus-performance preference.
It indicates a certain very specific behavior of frequency selection.
A behavior that is really bad on current Intel hardware, and hurting generally in BOTH power AND performance... at the same time.

I am by no means suggesting to take away a users ability to decide where he wants to live in the
performance-versus-power scale.... but what I am suggesting is that implementing that preference is
cpu dependent; it seems to be that, at least on the past Intel roadmap, there are very fundamental changes
every 2 years that mean throwing away the actual algorithm and starting over... and I don't see that changing;
if anything it might be yearly instead of every 2 years.

something like "ondemand" got designed 10 years ago, for hardware from back then... and SandyBridge ^W"2nd generation core"
is at least 2 if not 3 fundamental technology steps ahead of that, and the assumptions behind "ondemand" are
outright not true anymore.
(ondemand design still assumes for example that frequency selection matters for when the CPU is idle.. something that's not been
true for quite some time now.. in idle the frequency and voltage are both 0.)


David C Niemi
2012-12-06 15:01:33 UTC
My point is that performance vs. power is not just a linear continuum of preferences. How idle and full speed are handled are of particular importance in many applications.

I think we both agree the existing governors are obsolete and do things the wrong way. But we attach different meanings to "policy" and may have different ideas of what should be.

I think of policy as very high level and totally compatible with a variety of very different hardware implementations. The minimum a true high-level policy "governor" would need to do is this:
a) determine what the hardware's capabilities are (init)
b) provide a configuration interface analogous to what we have now but much higher-level and less frequency-centric
c) assess system load on an ongoing basis.
d) control the power management driver based on the user preferences and the system load pattern.

The lines between governor and driver could be drawn in various places, but the point of having some sort of governor is to not have to reimplement the whole stack for every driver.

The exposed configuration interface might be as simple as choosing one of several discrete settings:
- max single-threaded performance
- max multi-threaded performance
- "server" setting -- save power but only in ways that do not affect performance
- "default" -- a good general-purpose middle of the road setting that performs pretty well and also saves power
- "on battery" setting -- provide good interactive responsiveness but aggressively save power, potentially making long-running tasks take longer
- "min power"

The above is what I think of as policy. There is nothing hardware-specific about these. These say nothing directly about what frequency to run or whether to use P-States. On some hardware some of these settings might be equivalent to each other, but then again there is some hardware that can only run one way. The driver could expose lower-level implementation-specific controls in its own area, but there should be a higher level interface that separates that from what users normally have to deal with.

The interface between the governor and the driver needs to include some combination of current load conditions and user preferences. It does not have to talk about frequency or anything hardware-specific, but it needs to encompass both dynamic information (based on load) and fairly static information (user preferences).

If the CPU and chipset can assess load well enough by themselves and carry out governor-like decisions in hardware, we can regard the need to have the governor assess load and communicate it to the driver as optional. In that case user priorities are the only thing left above the driver level.

DCN
Post by Arjan van de Ven
...
thinking that policy is independent of the hardware is a fallacy.
Preference is what the user wants, sure. But a policy agent (governor) that implements that preference is very hardware
dependent.
...
Post by Arjan van de Ven
here's where things go wrong. "ondemand" does not indicate a power-versus-performance preference.
It indicates a certain very specific behavior of frequency selection.
A behavior that is really bad on current Intel hardware, and hurting generally in BOTH power AND performance... at the same time.
I am by no means suggesting to take away a users ability to decide where he wants to live in the
performance-versus-power scale.... but what I am suggesting is that implementing that preference is
cpu dependent; it seems to be that, at least on the past Intel roadmap, there are very fundamental changes
every 2 years that mean throwing away the actual algorithm and starting over... and I don't see that changing;
if anything it might be yearly instead of every 2 years.
something like "ondemand" got designed 10 years ago, for hardware from back then... and SandyBridge ^W"2nd generation core"
is at least 2 if not 3 fundamental technology steps ahead of that, and the assumptions behind "ondemand" are
outright not true anymore.
(ondemand design still assumes for example that frequency selection matters for when the CPU is idle.. something that's not been
true for quite some time now.. in idle the frequency and voltage are both 0.)
Arjan van de Ven
2012-12-06 16:27:13 UTC
Post by David C Niemi
My point is that performance vs. power is not just a linear continuum of preferences. How idle and full speed are handled are of particular importance in many applications.
I think we both agree the existing governors are obsolete and do things the wrong way. But we attach different meanings to "policy" and may have different ideas of what should be.
a) determine what the hardware's capabilities are (init)
b) provide a configuration interface analogous to what we have now but much higher-level and less frequency-centric
c) assess system load on an ongoing basis.
d) control the power management driver based on the user preferences and the system load pattern.
The lines between governor and driver could be drawn in various places, but the point of having some sort of governor is to not have to reimplement the whole stack for every driver.
the sad part is that this is where reality has caught up with the nice theory.
hardware keeps innovating/changing around power behavior... very very fundamentally.
When we started CPUFREQ (yes I was there ;-) ) we had the assumption that a clean split between hardware and governor
was possible. Even back then, Linus balked at that and made us change it at least somewhat.... the Transmeta
CPUs at the time showed enough differences already to break. We made, at the time, the minimal changes possible.
But really the whole idea does not work out.
Post by David C Niemi
- max single-threaded performance
- max multi-threaded performance
these are identical on today's silicon btw; or rather, this is not a P state choice item, but a task scheduler policy item.
Post by David C Niemi
- "server" setting -- save power but only in ways that do not affect performance
this is a fiction btw... if there was a way to reduce power and not affect performance, that's your "max performance" setting.
anything else will sacrifice SOME performance from max...
Post by David C Niemi
- "default" -- a good general-purpose middle of the road setting that performs pretty well and also saves power
... so you end up at this one.
Post by David C Niemi
- "on battery" setting -- provide good interactive responsiveness but aggressively save power, potentially making long-running tasks take longer
battery has nothing to do with power preference. Just ask any data center operator.
Post by David C Niemi
The above is what I think of as policy. There is nothing hardware-specific about these.
These say nothing directly about what frequency to run or whether to use P-States.
and defining a common policy interface I'm quite fine with (not quite in the way you defined it, but ok...)
But that's not going to lead to a common implementation as a "governor" ;-(

My idea for a policy "dial" is mostly

* Uncompromised performance
* Balanced - biased towards performance (say, defined to be lowest power at most a 2 1/2% perf hit)
* Balanced (say, at most a 5% perf hit)
* Balanced - biased towards lower power (say, at most a 10% perf hit)
* Uncompromised lowest power

we can argue about the exact %ages, but the idea is to give at least some reasonable definition that people can understand,
but that also can be measured
David C Niemi
2012-12-06 17:30:54 UTC
Post by Arjan van de Ven
...
Post by David C Niemi
- max single-threaded performance
- max multi-threaded performance
these are identical on todays silicon btw; or rather, this is not a P state choice item, but a task scheduler policy item.
Here's where there is a difference in power management: if you want to maximize single-thread performance, you're willing to enable power-expensive boost modes on behalf of a thread. You don't want to do that for multithreaded performance because your thermal envelope may not let you boost them all at once. Or at least that is what I was thinking.

Also some people will be all about I/O throughput, and others will care more about latency than anything else, and percentages for those people may be wildly different than for general computation. So we can't guarantee any particular percentage outside some well-defined benchmarks. But we could try to lump them all together as best we can and have a couple of knobs on the side like the current "io_is_busy", perhaps.
Post by Arjan van de Ven
Post by David C Niemi
- "server" setting -- save power but only in ways that do not affect performance
this is a fiction btw... if there was a way to reduce power and not affect performance, that's your "max performance" setting.
anything else will sacrifice SOME performance from max...
I know people who don't pay for electricity or cooling and think max performance == run every thread at maximum possible speed all the time, even if it is idle. But boost modes mean "maximum possible speed" is a fluid concept.
...
Post by Arjan van de Ven
and defining a common policy interface I'm quite fine with (not quite in the way you defined it, but ok...)
But that's not going to lead to a common implementation as a "governor" ;-(
My idea for a policy "dial" is mostly
* Uncompromised performance
* Balanced - biased towards performance (say, defined to be lowest power at most a 2 1/2% perf hit)
* Balanced (say, at most a 5% perf hit)
* Balanced - biased towards lower power (sat, at most a 10% perf hit)
* Uncompromised lowest power
we can argue about the exact %ages, but the idea is to give at least some reasonably definition that people can understand,
but that also can be measured
I am quite happy with your definitions above. It is the same in spirit as what I was trying for, just better stated.

I expect the performance degradation percentages are going to vary a lot depending on what techniques are available in the hardware. If we want to generalize this to encompass older hardware too (which I think is a good idea), I could see percentages being, say, <3% <10% <20% to give more room to work with, and nicer newer hardware being able to do better as your percentages indicate.

On reporting frequency: would it be practical to report some sort of medium-term average frequency, or if that is not available, to just report the max freq that the hardware thread is currently eligible to use?

DCN
Arjan van de Ven
2012-12-06 17:41:12 UTC
Post by David C Niemi
Post by Arjan van de Ven
...
Post by David C Niemi
- max single-threaded performance
- max multi-threaded performance
these are identical on todays silicon btw; or rather, this is not a P state choice item, but a task scheduler policy item.
Post by David C Niemi
if you want to maximize single-thread performance, you're willing to enable power-expensive boost
modes on behalf of a thread.
sure
Post by David C Niemi
You don't want to do that for multithreaded performance because your thermal envelope may not let
you boost them all at once. Or at least that is what I was thinking.
this part I don't buy, at least on current hw... the boost code will deal with this quite well;
there's no knob that can do better than that.
Post by David C Niemi
Also some people will be all about I/O throughput, and others will care more about latency than anything else, and percentages for those people may be wildly different than for general computation. So we can't guarantee any particular percentage outside some well-defined benchmarks. But we could try to lump them all together as best we can and have a couple of knobs on the side like the current "io_is_busy", perhaps.
Post by Arjan van de Ven
Post by David C Niemi
- "server" setting -- save power but only in ways that do not affect performance
this is a fiction btw... if there was a way to reduce power and not affect performance, that's your "max performance" setting.
anything else will sacrifice SOME performance from max...
Post by David C Niemi
I know people who don't pay for electricity or cooling and think max performance == run every thread at maximum possible speed all the time, even if it is idle.
But boost modes mean "maximum possible speed" is a fluid concept.
my point was that this is no different than "max single/multi performance" above.. unless you can make tradeoffs
(which means performance impact).
Post by David C Niemi
Post by Arjan van de Ven
and defining a common policy interface I'm quite fine with (not quite in the way you defined it, but ok...)
But that's not going to lead to a common implementation as a "governor" ;-(
My idea for a policy "dial" is mostly
* Uncompromised performance
* Balanced - biased towards performance (say, defined to be lowest power at most a 2 1/2% perf hit)
* Balanced (say, at most a 5% perf hit)
* Balanced - biased towards lower power (sat, at most a 10% perf hit)
* Uncompromised lowest power
we can argue about the exact %ages, but the idea is to give at least some reasonably definition that people can understand,
but that also can be measured
I am quite happy with your definitions above. It is the same in spirit as what I was trying for, just better stated.
I expect the performance degradation percentages are going to vary a lot depending on what
techniques are available in the hardware. If we want to generalize this to encompass older
hardware too (which I think is a good idea), I could see percentages being, say, <3% <10% <20% to
give more room to work with, and nicer newer hardware being able to do better as your percentages indicate.
I'm quite ok to add other steps... my point was to get an explicit/clear expectation of what a setting means
in a way that you can measure (and thus validate/etc)
Post by David C Niemi
On reporting frequency: would it be practical to report some sort of medium-term average frequency,
so there are counters in the cpus for what we actually ran at, and you do a delta over a time that you pick to get
an average. (if you pick too short a time, say, 100 cycles, obviously the division gives you a mostly noise number due
to quantization and then dividing a small number by a small noisy number)
so reporting in hindsight over a reasonable time (say a few dozen milliseconds) is not too hard as
long as you could define a time in the past where you did a measurement
to start the delta point... ideally we don't wake up the cpu to do this.. because then we're wasting power for it :-(
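
(For illustration, a rough user-space sketch of that delta approach, using
the same counters the driver already reads; IA32_MPERF is MSR 0xe7 and
IA32_APERF is MSR 0xe8 per the SDM, the nominal frequency is an assumed
input here, and the ratio only covers time the CPU was not halted:)

#!/bin/sh
# Sketch: average effective frequency of cpu0 over one second, computed
# as nominal * delta(APERF)/delta(MPERF). Needs msr-tools and the msr module.
base_khz=3400000     # nominal (stamped) frequency of the part, assumed known
m1=$(rdmsr -u -p 0 0xe7); a1=$(rdmsr -u -p 0 0xe8)
sleep 1
m2=$(rdmsr -u -p 0 0xe7); a2=$(rdmsr -u -p 0 0xe8)
echo "avg kHz: $(( base_khz * (a2 - a1) / (m2 - m1) ))"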
Post by David C Niemi
or if that is not available, to just report the max freq that the hardware thread is currently eligible to use?
this part is not available at all..... so no we cannot do this.
(well, we do have the maximum the chip can do... but that's a constant number.. might as well report "42")

Dirk Brandewie
2012-12-06 18:25:15 UTC
Post by David C Niemi
Post by Arjan van de Ven
...
Post by David C Niemi
- max single-threaded performance
- max multi-threaded performance
these are identical on todays silicon btw; or rather, this is not a P state choice item, but a task scheduler policy item.
Post by David C Niemi
Here's where there is a difference in power management: if you want to
maximize single-thread performance, you're willing to enable
power-expensive boost modes on behalf of a thread. You don't want to do
that for multithreaded performance because your thermal envelope may
not let you boost them all at once. Or at least that is what I was
thinking.
Without being VERY intimate with the scheduler it is not clear how you could
get here. How can the governor know which core should get the most performance?

When we request a frequency greater than the frequency stamped on the part
(turbo frequency) the processor opportunistically runs at a higher frequency,
up to the requested frequency.
Post by David C Niemi
Also some people will be all about I/O throughput, and others will care
more about latency than anything else, and percentages for those people
may be wildly different than for general computation. So we can't
guarantee any particular percentage outside some well-defined
benchmarks. But we could try to lump them all together as best we can
and have a couple of knobs on the side like the current "io_is_busy",
perhaps.
io_is_busy is a hint for ondemand to not move to the "idle frequency" while an
I/O is outstanding. It is not useful if you are not actively managing the
frequency at idle.
Post by David C Niemi
On reporting frequency: would it be practical to report some sort of medium-term
average frequency, or if that is not available,
Keeping an average over time is clearly possible in the driver but it is not
clear how it would be useful. In most situations other than proving that the
frequency changes over time there is little useful information provided
by knowing the current operating frequency.
Post by David C Niemi
to just report the max freq that the hardware thread is currently
eligible to use?
In Sandybridge you can request any turbo frequency at any time; what frequency
you actually get is up to the hardware, and you can't tell what frequency you
actually got. AFAIK there is no way to tell what you are going to get
when you request a frequency higher than the frequency stamped on the part.

--Dirk
David C Niemi
2012-12-06 18:41:07 UTC
Post by Dirk Brandewie
Without being VERY intimate with scheduler it is not clear how you could get
here. How can the governor know which core should get the most performance?
When we request a frequency greater than the frequency stamped on the part
(turbo frequency) the processor opportunistically run at a higher frequency
upto the requested frequency.
I think being that intimate between the scheduler and the driver is probably not worth the considerable complexity it would introduce.

One bit of general user input I can see being relevant is whether to even consider using power-expensive modes like turbo frequencies at all. In the first 2 or 3 of Arjan's settings, yes, you would. In the last two you would never use them. If you were optimizing for multithreaded performance you might also not want to use them, but perhaps the CPU would be happy to handle that decision for you, as Arjan suggests, so it may not be worth trying to control it top-down.
...
Post by Dirk Brandewie
Keeping an average over time is clearly possible in the driver but it is not clear how it would be useful. In most situations other that proving that the
frequency changes over time there is little useful information provided
by knowing the current operating frequency.
I don't think knowing a precise frequency at a precise time is very critical, but finding something to report would help make users feel the driver is working well. Having low overhead in collecting the information (like an average speed over a period of time) is more important than its precision or timeliness, as the most common use case is just a GUI feature.
Post by Dirk Brandewie
Post by David C Niemi
to just report the max freq that the hardware thread is currently
eligible to use?
In Sandybridge you can request any turbo frequency at any time, what frequency
you actually get is up to the hardware and you can't tell what frequency you actually got. AFAIK there is no way to tell what you are going to get
when you request a frequency higher than the frequency stamped on the part.
So if it is not practical to get an average, reporting the frequency stamped on the part is better than nothing. It is boring, but less boring than reporting "0".
Post by Dirk Brandewie
--Dirk
DCN
Dirk Brandewie
2012-12-06 21:35:15 UTC
Post by David C Niemi
Post by Dirk Brandewie
Without being VERY intimate with scheduler it is not clear how you could get
here. How can the governor know which core should get the most performance?
When we request a frequency greater than the frequency stamped on the part
(turbo frequency) the processor opportunistically run at a higher frequency
upto the requested frequency.
I think being that intimate between the scheduler and the driver is probably not worth the considerable complexity it would introduce.
One bit of general user input I can see being relevant is whether to
even consider using power-expensive modes like turbo frequencies at
all. In the first 2 or 3 of Arjan's settings, yes, you would. In the
last two you would never use them. If you were optimizing for
multithreaded performance you might also not want to use them, but
perhaps the CPU would be happy to handle that decision for you, as
Arjan suggests, so it may not be worth trying to control it top-down.
Yep there are a bunch of ways to skin the cat when it comes to trading peak
performance for saving power. The driver code is set up to allow for having
multiple sets of tuning parameters that could be selected by the
user/system admin/integrator.

The current driver is tuned to have the same or better peak performance
than the ondemand governor while having better power efficiency.

The performance and power efficiency gains depend on the type of workload.

The thorny question in my mind, if people agree that having a per-architecture
P-state driver is a valid approach, is how the per-architecture drivers should
be integrated into a system that allows distributions to build generic kernels
with reasonable default behaviour.
Post by David C Niemi
...
Post by Dirk Brandewie
Keeping an average over time is clearly possible in the driver but it is not clear how it would be useful. In most situations, other than proving that the frequency changes over time, there is little useful information provided by knowing the current operating frequency.
I don't think knowing a precise frequency at a precise time is very critical, but finding something to report would help make users feel the driver is working well. Having low overhead in collecting the information (like an average speed over a period of time) is more important than its precision or timeliness, as the most common use case is just a GUI feature.
Providing an interface to retrieve the currently requested (operating) frequency is trivial. Giving the user a warm fuzzy that things are changing, which is what this type of utility is good for, is IMHO not an unreasonable desire.

The real question is whether it needs to be reported via the cpufreq subsystem and, if not, where this driver and others like it should report the frequency.
Post by David C Niemi
Post by Dirk Brandewie
Post by David C Niemi
to just report the max freq that the hardware thread is currently
eligible to use?
In Sandybridge you can request any turbo frequency at any time; what frequency you actually get is up to the hardware, and you can't tell what frequency you actually got. AFAIK there is no way to tell what you are going to get when you request a frequency higher than the frequency stamped on the part.
So if it is not practical to get an average, reporting the frequency stamped on the part
is better than nothing. It is boring, but less boring than reporting "0".
What would you use this number for in userspace? I guess I might not
understand exactly what you are asking for here.

--Dirk
Post by David C Niemi
Post by Dirk Brandewie
--Dirk
DCN
David C Niemi
2012-12-06 22:23:14 UTC
Permalink
On 12/06/12 16:35, Dirk Brandewie wrote:
...
Post by Dirk Brandewie
Yep there are a bunch of ways to skin the cat when it comes to trading peak performance for saving power. The driver code is set up to allow for having multiple sets of tuning parameters that could be selected by the user/system admin/integrator.
The current driver is tuned to have the same or better peak performance
than the ondemand governor while having better power efficiency.
An untuned ondemand governor performs very poorly, as it is constantly trying to switch the frequency down even when it is busy. Did you try it with sampling_down_factor set to, say, 100? This would tend to make it consume more power but perform substantially better, and would be a more reasonable comparison than with sampling_down_factor set to 1 (the default).
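Concretely, something along these lines (assuming the global ondemand sysfs layout of this era; the exact path varies between kernel versions):

    # Re-evaluate switching down 100x less often while running at the highest frequency:
    echo 100 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor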
Post by Dirk Brandewie
The performance and power efficiency gains depend on the type of workload.
The thorny question in my mind, assuming people agree that having a per-architecture P-state driver is a valid approach, is how the per-architecture drivers should be integrated into a system that allows distributions to build generic kernels with reasonable default behaviour.
There is nothing wrong with having a bunch of different architecture-specific drivers; there is no way around that. But they need some kind of abstraction layer over the top or distribution creators will bypass them, even if they are clearly superior. If the existing layers are unusable, then we need a new abstraction layer; the most important feature is that a single configuration file needs to be able to do some basic reasonable settings across a wide variety of hardware types, or at a bare minimum come up in a sensible default mode. Again, that could be a new config file or an existing one, but some existing ones (e.g. /etc/sysconfig/cpuspeed) are totally focused on the wrong things and I can fully understand wanting to ditch them and do something new.

The cpupower utility from kernel-tools is a much better framework and could probably be extended to control the pstate driver/governor, so maybe that is a good place to look; it also tweaks scheduler settings. It's already used on Fedora 16 and later but not RHEL 6.x. You might want to talk to whoever is maintaining it.
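For example, cpupower already covers governor selection and the x86 energy/perf bias hint, so growing a policy knob for a pstate driver there seems plausible:

    cpupower frequency-info               # report driver, current governor and limits
    cpupower frequency-set -g ondemand    # select a governor on all CPUs
    cpupower set -b 8                     # x86 energy/perf bias hint (0 = performance, 15 = powersave)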
Post by Dirk Brandewie
Providing an interface to retrieve the currently requested (operating) frequency is trivial. Giving the user a warm fuzzy that things are changing, which is what this type of utility is good for, is IMHO not an unreasonable desire.
Agreed, but an average frequency over the last second (or 100 msec) is probably more interesting than the requested frequency, if they are both easy to find out.
Post by Dirk Brandewie
The real question is whether it needs to be reported via the cpufreq subsystem and, if not, where this driver and others like it should report the frequency.
I'm not sure there really is anywhere you CAN report it other than responding to inquiries about it via /sys.
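Right, and for reference these are the /sys files the existing tools already poll (paths assume the current cpufreq layout):

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq   # last value requested/cached by the governor
    cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq   # driver's notion of the current frequency (usually root-only)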
Post by Dirk Brandewie
Post by David C Niemi
So if it is not practical to get an average, reporting the frequency stamped on the part
is better than nothing. It is boring, but less boring than reporting "0".
What would you use this number for in userspace? I guess I might not
understand exactly what you are asking for here.
--Dirk
Mostly applets that show current CPU speed, possibly other performance monitoring tools like i7z.

DCN
Rafael J. Wysocki
2012-12-06 20:45:30 UTC
Permalink
Post by Arjan van de Ven
Post by David C Niemi
My point is that performance vs. power is not just a linear continuum of preferences. How idle and full speed are handled are of particular importance in many applications.
I think we both agree the existing governors are obsolete and do things the wrong way. But we attach different meanings to "policy" and may have different ideas of what should be.
a) determine what the hardware's capabilities are (init)
b) provide a configuration interface analogous to what we have now but much higher-level and less frequency-centric
c) assess system load on an ongoing basis.
d) control the power management driver based on the user preferences and the system load pattern.
The lines between governor and driver could be drawn in various places, but the point of having some sort of governor is to not have to reimplement the whole stack for every driver.
the sad part is that this is where reality has caught up with the nice theory.
hardware keeps innovating/changing around power behavior... very very fundamentally.
When we started CPUFREQ (yes I was there ;-) ) we had the assumption that a clean split between hardware and governor
was possible. Even back then, Linus balked at that and made us change it at least somewhat.... the Transmeta
CPUs at the time showed enough differences already to break that model. We made, at the time, the minimal changes possible.
But really the whole idea does not work out.
Post by David C Niemi
- max single-threaded performance
- max multi-threaded performance
these are identical on today's silicon btw; or rather, this is not a P state choice item, but a task scheduler policy item.
Post by David C Niemi
- "server" setting -- save power but only in ways that do not affect performance
this is a fiction btw... if there was a way to reduce power and not affect performance, that's your "max performance" setting.
anything else will sacrifice SOME performance from max...
Post by David C Niemi
- "default" -- a good general-purpose middle of the road setting that performs pretty well and also saves power
... so you end up at this one.
Post by David C Niemi
- "on battery" setting -- provide good interactive responsiveness but aggressively save power, potentially making long-running tasks take longer
battery has nothing to do with power preference. Just ask any data center operator.
Post by David C Niemi
The above is what I think of as policy. There is nothing hardware-specific about these.
These say nothing directly about what frequency to run or whether to use P-States.
and defining a common policy interface I'm quite fine with (not quite in the way you defined it, but ok...)
But that's not going to lead to a common implementation as a "governor" ;-(
My idea for a policy "dial" is mostly
* Uncompromised performance
* Balanced - biased towards performance (say, defined to be lowest power at most a 2 1/2% perf hit)
* Balanced (say, at most a 5% perf hit)
* Balanced - biased towards lower power (say, at most a 10% perf hit)
* Uncompromised lowest power
we can argue about the exact %ages, but the idea is to give at least some reasonable definition that people can understand,
but that also can be measured
It looks like you'd like a tunable setting the maximum allowed performance hit
due to power management. Is that correct?

Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Arjan van de Ven
2012-12-06 21:15:13 UTC
Permalink
Post by Rafael J. Wysocki
Post by Arjan van de Ven
My idea for a policy "dial" is mostly
* Uncompromised performance
* Balanced - biased towards performance (say, defined to be lowest power at most a 2 1/2% perf hit)
* Balanced (say, at most a 5% perf hit)
* Balanced - biased towards lower power (say, at most a 10% perf hit)
* Uncompromised lowest power
we can argue about the exact %ages, but the idea is to give at least some reasonable definition that people can understand,
but that also can be measured
It looks like you'd like a tunable setting the maximum allowed performance hit
due to power management. Is that correct?
basically yes, but not as a continuous dial (that's not practical), but as a
certain number of sensible steps.... I'm not sure it makes sense to have more than 5 steps.

My key interest is to have something that is both understandable by the sysadmin as to what it means,
but also something that you can actually measure (and thus test)...
while still describing a desire/preference, not a specific implementation of an algorithm.

A sysadmin understands "willing to give up at most 5% of max performance to save power"... and he can reason about it,
and think about it, take his own situation into account and then make a decision on it as a result.

The policy side can then measure algorithms and tunables to make sure they stay within that 5%
(of course on "reasonable" or "realistic" workloads.. not theoretical foofoo stuff).

You must put something like this in place, because if you just call it "balanced", that means 101 different things
to 100 people.... and as a result neither the algorithm side can ever do anything (SOMEONE somewhere will regress),
nor can the sysadmin side make a reasonable decision, since it means something else on each machine as well.
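To make the dial concrete, I imagine something like the following, where the knob names and level names are entirely made up for discussion (no such interface exists today):

    cat /sys/devices/system/cpu/power_policy_available_levels
    #   performance balanced-2pct balanced-5pct balanced-10pct powersave
    echo balanced-5pct > /sys/devices/system/cpu/power_policy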




Rafael J. Wysocki
2012-12-06 21:26:48 UTC
Permalink
Post by Arjan van de Ven
Post by Rafael J. Wysocki
Post by Arjan van de Ven
My idea for a policy "dial" is mostly
* Uncompromised performance
* Balanced - biased towards performance (say, defined to be lowest power at most a 2 1/2% perf hit)
* Balanced (say, at most a 5% perf hit)
* Balanced - biased towards lower power (say, at most a 10% perf hit)
* Uncompromised lowest power
we can argue about the exact %ages, but the idea is to give at least some reasonable definition that people can understand,
but that also can be measured
It looks like you'd like a tunable setting the maximum allowed performance hit
due to power management. Is that correct?
basically yes, but not as a continuous dial (that's not practical), but as a
certain number of sensible steps.... I'm not sure it makes sense to have more than 5 steps.
Then you need to get the people to agree on what the "sensible steps" are. :-)

I agree that continuous is not practical, however.
Post by Arjan van de Ven
My key interest is to have something that is both understandable by the sysadmin as to what it means,
but also something that you can actually measure (and thus test)...
while still describing a desire/preference, not a specific implementation of an algorithm.
A sysadmin understands "willing to give up at most 5% of max performance to save power"... and he can reason about it,
and think about it, take his own situation into account and then make a decision on it as a result.
The policy side can then measure algorithms and tunables to make sure they stay within that 5%
(of course on "reasonable" or "realistic" workloads.. not theoretical foofoo stuff).
You must put something like this in place, because if you just call it "balanced", that means 101 different things
to 100 people.... and as a result neither the algorithm side can ever do anything (SOMEONE somewhere will regress),
nor can the sysadmin side make a reasonable decision, since it means something else on each machine as well.
Yes, I know that.

Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Rafael J. Wysocki
2012-12-06 21:34:55 UTC
Permalink
Post by Rafael J. Wysocki
Post by Arjan van de Ven
Post by Rafael J. Wysocki
Post by Arjan van de Ven
My idea for a policy "dial" is mostly
* Uncompromised performance
* Balanced - biased towards performance (say, defined to be lowest power at most a 2 1/2% perf hit)
* Balanced (say, at most a 5% perf hit)
* Balanced - biased towards lower power (say, at most a 10% perf hit)
* Uncompromised lowest power
we can argue about the exact %ages, but the idea is to give at least some reasonable definition that people can understand,
but that also can be measured
It looks like you'd like a tunable setting the maximum allowed performance hit
due to power management. Is that correct?
basically yes, but not as a continuous dial (that's not practical), but as a
certain number of sensible steps.... I'm not sure it makes sense to have more than 5 steps.
Then you need to get the people to agree on what the "sensible steps" are. :-)
That said starting with a small value and going up exponentially, like
(1->)2->4->8->16->32->64, sounds like a good idea.

But the sysadmin will also need to know how much power s/he is going to save
by sacrificing that much performance, ie. if the result is worth the effort.

Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Arjan van de Ven
2012-12-06 22:08:31 UTC
Permalink
Post by Rafael J. Wysocki
Post by Arjan van de Ven
* Uncompromised performance
* Balanced - biased towards performance (say, defined to be lowest power at most a 2 1/2% perf hit)
* Balanced (say, at most a 5% perf hit)
* Balanced - biased towards lower power (say, at most a 10% perf hit)
* Uncompromised lowest power
That said starting with a small value and going up exponentially, like
(1->)2->4->8->16->32->64, sounds like a good idea.
... like 2 1/2, 5 and 10 ? ;-)


Rafael J. Wysocki
2012-12-06 22:53:21 UTC
Permalink
Post by Arjan van de Ven
Post by Rafael J. Wysocki
Post by Arjan van de Ven
* Uncompromised performance
* Balanced - biased towards performance (say, defined to be lowest power at most a 2 1/2% perf hit)
* Balanced (say, at most a 5% perf hit)
* Balanced - biased towards lower power (say, at most a 10% perf hit)
* Uncompromised lowest power
That said starting with a small value and going up exponentially, like
(1->)2->4->8->16->32->64, sounds like a good idea.
... like 2 1/2, 5 and 10 ? ;-)
In that case I'd prefer 2, 5, 10, 25, 50. I don't think we can be as precise
here as to go into fractions realistically.

Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
Dirk Brandewie
2012-12-06 16:35:28 UTC
Permalink
Post by David C Niemi
Dirk,
I applaud the work you are doing.
Thanks :-)
Post by David C Niemi
In general I believe it is important to
separate policy (governor and its settings) from the driver,
particularly so as different end-users have very different goals for
power management.
I agree that as a general rule separating mechanism from policy is the correct thing to do. As Arjan pointed out in his replies, the "correct" policy decisions are processor architecture / micro-architecture dependent.

For example, in Sandybridge and later processors the requested frequency during idle has no effect on power consumption; the processor will go to a minimum power state while the core is idle. So it is useless to worry about setting the idle frequency, and doing so adds a fair amount of processing and complexity for no benefit in power or performance.

A generic governor has no hope of getting this type of decision right and taking advantage of the power features of the processor, whether it is an IA processor or some other architecture.
Post by David C Niemi
Not everyone is trying to maximize performance per
watt per se (in fact probably rather few end users are doing so
literally). In server applications, for example, the first priority
is typically maximum performance when under heavy load, and the second
priority is minimum power consumption at idle. There may not ever be
a benefit for choosing one of the middle clock states.
I disagree: the server/data center user cares deeply about performance per watt. They are selling performance, and watts are a cost. Power consumption and required cooling are big issues for the data center.

The data center does not want to leave a lot of performance on the table, which would force them to over-provision servers to satisfy their SLAs.

I believe that servers spend most of their time somewhere between idle and max performance, where selecting an appropriate intermediate operating frequency will have a significant benefit.

The laptop/mobile user cares about performance/watt as well, maybe not
explicitly but they want their shiny new device to show the
performance they paid for with the greatest battery life possible.

The desktop user is likely the most immune to thinking/caring about performance/watt, since most users don't care about (or have a way to measure) the power consumption of the system.
Post by David C Niemi
It would be nice if the new driver could be compatible with the existing governors by exposing an ability to set and report current frequencies. But if this is impractical or pointless for Sandy Bridge, so be it.
I agree that reporting the current frequency is important to some utilities. To make this work with the current cpufreq subsystem will take some amount of refactoring of cpufreq. I have not taken on this work yet and was hoping to get some advice from the list on the correct way to do this.
Post by David C Niemi
So outside of a research kernel, I don't think having a "cpufreq/snb"
directory is a good place to expose tuning parameters,
I agree most of the tunables should NOT be exposed to the user. The place for the tunables was chosen to make it obvious to people that snb had replaced ondemand.
Post by David C Niemi
In the long run both integrators and
maintainers of Linux distributions are going to insist on a generic
interface that can work across the vast majority of modern hardware,
rather than cater to a special case that only works on one or two CPU families, even if those families are particularly important ones.
How this driver gets integrated into a system is still an open question. I can think of more than a few "reasonable" ways to integrate this into a system. Before I launched into creating a solution I wanted feedback/guidance from the list.

--Dirk

Arjan van de Ven
2012-12-06 16:49:23 UTC
Permalink
Post by Dirk Brandewie
Post by David C Niemi
It would be nice if the new driver could be compatible with the existing governors by exposing an ability to set and report current frequencies. But if this is impractical or pointless for Sandy Bridge, so be it.
I agree that reporting the current frequency is important to some
utilities.
this is a problem btw, for two reasons
1) the answer you get from the sysfs file is only valid for a very very short time (10 msec at most)
2) the answer you get from the sysfs file is a lie... so it's not really valid for even 0.01 msec.

Even today, the frequency that is reported is largely fictional, and typically not the frequency the core/cpu is actually running at.
10 years ago we could report something reasonable.
Hardware just doesn't work that way anymore...
We can report what you ran at (past tense); that is something Intel and AMD hardware exposes.
(and if you run powertop, it'll report this as well)
But to know what you are currently running at? No chance in hell ;-(
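(For reference, the "what you ran at" counters are the IA32_MPERF/IA32_APERF MSRs. A rough sketch of deriving an average non-idle frequency from them, assuming msr-tools is installed, the msr module is loaded, and using a made-up base frequency:)

    #!/bin/sh
    # Average effective frequency of CPU0 over ~100 ms from APERF/MPERF deltas.
    BASE_KHZ=3400000                          # frequency stamped on the part (example value)
    a0=$(rdmsr -p 0 -u 0xe8); m0=$(rdmsr -p 0 -u 0xe7)
    sleep 0.1
    a1=$(rdmsr -p 0 -u 0xe8); m1=$(rdmsr -p 0 -u 0xe7)
    echo "$(( BASE_KHZ * (a1 - a0) / (m1 - m0) )) kHz average while not halted"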




David C Niemi
2012-12-06 18:16:10 UTC
Permalink
Post by Dirk Brandewie
...
I disagree: the server/data center user cares deeply about performance per watt. They are selling performance, and watts are a cost. Power consumption and required cooling are big issues for the data center.
The data center does not want to leave a lot of performance on the table, which would force them to over-provision servers to satisfy their SLAs.
So the way many data centers work is that each rack is provisioned for a maximum amount of peak power, and both the people running the data center and those putting equipment in them (who are often different entities) want to be very sure the maximum peak power is never exceeded, as that would cause downtime for the whole rack. But beyond that, many data centers do not charge for actual power consumption, just for provisioned peak power, giving the equipment operators no incentive to conserve power when idle. It is for this sort of situation that having a setting like "< 3% degradation" is useful, if the equipment owners perceive they can use it with a performance loss that is small enough to ignore.

There are other issues in this situation too -- the driver/governor should not spend much effort reevaluating load when already running as fast as possible, for two reasons: (1) if you are busy you cannot afford to waste much CPU frequently reevaluating load; and (2) if you are generally busy it is counterproductive to frequently blip down to a lower-performance state even if you think you could based on instantaneous load data. But when in a more idle state, load must be reevaluated very often to see if a load spike has occurred. With frequency shifting, this would mean you ramp up fast and ramp down slowly. I'm not sure how applicable this issue is to your driver, but the same general issue probably applies unless load evaluation and power state switching are very nearly completely free and instantaneous.
Post by Dirk Brandewie
I believe that servers spend most of their time somewhere between idle and max performance, where selecting an appropriate intermediate operating frequency will have a significant benefit.
There certainly is a lot of time spent with small loads, but for many network applications average loads are so light as to leave most hardware threads idle nearly all the time. But on the rare occasions when they get busy, they get REALLY busy and performance is critical. Nobody really cares whether you have a 20% CPU performance degradation under light to medium loads, because the network stack is going to perform great and give you such low latency in those circumstances nobody will notice. It's when you exceed 50% of your max throughput (which means you really are very, very busy) that latency goes through the roof and performance matters.
...
Post by Dirk Brandewie
I agree that reporting the current frequency is important to some
utilities. To make this work with the current cpufreq subsystem will
take some amount of refactoring of cpufreq. I have not taken on this work yet and was hoping to get some advice from the list on the correct way to do this.
Per the other thread, reporting the average speed over the last, say, 100 msec would be plenty fast. The gauges and such that people have on their desktops cannot respond faster than that. And if it has too much cost at 100 msec, make it slower.
Post by Dirk Brandewie
Post by David C Niemi
So outside of a research kernel, I don't think having a "cpufreq/snb"
directory is a good place to expose tuning parameters,
I agree most of the tunables should NOT be exposed to the user. The place for the tunables was chosen to make it obvious to people that snb had replaced ondemand.
I think cpufreq itself is a bad name and should turn into something else. It is reasonable to expose snb-specific tunables under the driver, but I don't think it should be under cpufreq.
Post by Dirk Brandewie
Post by David C Niemi
In the long run both integrators and
maintainers of Linux distributions are going to insist on a generic
interface that can work across the vast majority of modern hardware,
rather than cater to a special case that only works on one or two CPU families, even if those families are particularly important ones.
How this driver gets integrated into a system is still an open question. I can think of more than a few "reasonable" ways to integrate this into a system. Before I launched into creating a solution I wanted feedback/guidance from the list.
Good, and unfortunately the short-term and long-term answers are rather different.

I like the idea of exposing a very high-level interface for users like the one Arjan and I have been talking about. It is probably possible to have a "thin" governor, maybe called "pstate", that just handles this with the /sys interface and sits parallel to "ondemand"; that would be the quickest thing to do in the short term, while fitting within the cpufreq ecosystem, which expects drivers and governors to be separate entities. The pstate governor <-> snb driver interface would be the main additional work over what you've already done, I expect. Not sure if it would make any sense to try to make the snb driver work with any of the existing governors.

In the longer term, cpufreq/ should be ditched and the whole thing rethought, probably, and maybe the governor/driver distinction would go away, and instead you'd have drivers for specific hardware and some shared services they can use to handle /sys. Or you could keep using a "thin governor" to handle the non-driver-specific /sys interface. But this requires distributions to change all their config files that are all oriented around switching frequency based on kernel-assessed load.

DCN
