Ubiquitous Computing: Architecture

Topic 4
What is Computer Architecture

CPU

- PC
- MAR
- MBR
- I/O AR
- I/O BR
- ALU

Memory

- Instructions
- Data
- Data
- Data

Buffers

(Persistent Store and interface devices)

Peripherals

CSI 660, William A. Maniatty, Dept. of Computer Science, University at Albany
Computer Architecture Components

Processor(s) - Do computation
Memory - Stores values
Peripherals - I/O and data communication
  ▶ Human Readable Output/Displays/Speakers
  ▶ Input devices/Sensors
  ▶ Persistent Storage
  ▶ Networking (wireless)

Power (needed for mobility)
Software interface
Processor Trends

Mike Flynn [4]’s FCRC 2003 talk was interesting.

Processor design is constrained by $T^3P = k$ and $AT = c$, where

- $P$, Power
- $A$, Area (space)
- $T$, Time (speed)
- $k$, $c$ constants
- Power is proportional to the CUBE of the processor speed (clock frequency).

Wire length (packaging), not gates, dominates delay

High Speed Clocking due to pipelining, NOT device technology

- Now have many very short pipeline segments
- Rate of clock speed increase is a big surprise
- Clock skew more pronounced with high clock rates
- But fast clocks don’t guarantee fast systems
Memory and a Fast Clock

Processor Clocks speeding up faster than memory clocks

The Memory Wall avoided by smart architecture

- Speculative Execution (e.g. branch prediction)
- Cache
- Out of order execution
- Hyperthreading

Perhaps what we really have is a frequency wall
Power and a Fast Clock

Power is true cost of Fast clock

\[ P_{\text{total}} = \text{Active Power} + \text{Static Power} \]
\[ = \frac{C \times V^2 \times \text{freq}}{2} + (I_{\text{leakage}} + I_{SC}) \times V \]

Where

- \( V \) is voltage
- \( C \) is (gate?) capacitance
- \( I_{\text{leakage}} \) is (static power) leakage rate
  - Function of number of transistors, transistor type and temperature
- \( I_{SC} \) most likely refers to the short-circuit current that flows briefly while a CMOS gate switches.

As device size shrinks

- \( C \) and the voltage needed per transistor both decrease
- However, the number of transistors and \( f \) both increase
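The power equation on this slide can be evaluated directly. The sketch below is only an illustration: the function name and every numeric value are assumptions chosen for the example, not figures from the talk.

```python
def total_power(c_farads, v_volts, freq_hz, i_leakage_amps, i_sc_amps):
    """Return (active, static, total) power in watts using the slide's formula.

    active = C * V^2 * f / 2
    static = (I_leakage + I_SC) * V
    """
    active = c_farads * v_volts ** 2 * freq_hz / 2
    static = (i_leakage_amps + i_sc_amps) * v_volts
    return active, static, active + static


# Hypothetical operating point: 1 nF switched capacitance, 1.2 V, 1 GHz,
# 50 mA leakage current and 10 mA for the slide's I_SC term.
active, static, total = total_power(1e-9, 1.2, 1e9, 0.05, 0.01)
print(f"active={active:.3f} W  static={static:.3f} W  total={total:.3f} W")
```

Halving the voltage in this model cuts the active term by a factor of four, which is why voltage scaling matters more than capacitance scaling.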
Battery Size, Power and Life

Big rechargeable batteries hold $250\times$ more energy than watch batteries

- But button batteries are always on (up to 5 years)
- Supplying $1\ \mu W$ vs $400\ mW$ to $4W$
- Can we make low power devices to exploit this?
- Running at $10^{-3}$ of the clock rate needs only $10^{-9}$ of the power (since $P \propto f^3$)!
- Flynn predicts 100 MHz processors operating at $10^{-3}$ watts
- Wants $O(10^{-6})$ watts, about 10 MHz clock needed
- Increase area at the expense of speed

▷ So How to get Performance? Make very efficient:
  ▷ Architecture and segmented power
  ▷ O/S and Software (Possibly hard?)
  ▷ Arithmetic and Signal Processing
  ▷ Parallel Processing (this is hard!) Get performance from Area NOT frequency
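The cube-law arithmetic behind these numbers can be checked with a few lines. This is a sketch that assumes only the $T^3P = k$ constraint stated earlier; the function names are made up for the illustration.

```python
def relative_power(clock_ratio):
    """Power ratio when the clock is scaled by clock_ratio, assuming T^3 * P = k."""
    return clock_ratio ** 3


def scaled_power(base_power_w, base_clock_hz, new_clock_hz):
    """Power at new_clock_hz, given a known (power, clock) operating point."""
    return base_power_w * relative_power(new_clock_hz / base_clock_hz)


# A clock 1000x slower needs roughly a billionth of the power.
print(relative_power(1e-3))

# Flynn's 100 MHz / 1 mW point scaled down to 10 MHz lands near 1 microwatt.
print(scaled_power(1e-3, 100e6, 10e6))
```

This matches the slide: going from 100 MHz at $10^{-3}$ W to a 10 MHz clock gives power on the order of $10^{-6}$ W.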
Fabs, Some Facts of life

Dies per Wafer = \( \frac{\pi d^2}{4 \text{ Area per Die}} \)

Yield is the fraction of good chips from the wafer
Assume defects are uniformly distributed (worst case)

\[ \text{Yield} = e^{-\text{defect density} \times \text{Die Area}} \]

30 cm wafers with a defect density of 0.2 (presumably defects per cm\(^2\)) appear to be state of the art

A fab costs only about \( \$3 \times 10^9 \)

- What a bargain!
- Needs \( \$5 \times 10^9 \) in sales per year or it goes bankrupt!
- Need to sell \( 5 \times 10^6 \) units at 1 cm\(^2\) per year
- But at very small device sizes (90 nm) there are about \( 10^8 \) to \( 2 \times 10^8 \) devices in a square cm core.

At \( O(\$1000) \) per wafer, that's about \$1 per cm\(^2\) die and \( \$10^{-8} \) per 90 nm device.
But what do we do with them?
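The wafer arithmetic on this slide can be reproduced with a short calculation. The sketch assumes the usual Poisson yield model, Yield = exp(-defect density × die area), with defect density in defects per cm²; the function names and the \$1000 wafer cost used in the example are illustrative.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Upper bound on dies per wafer: wafer area / die area (ignores edge loss)."""
    return math.pi * wafer_diameter_cm ** 2 / (4 * die_area_cm2)


def poisson_yield(defect_density_per_cm2, die_area_cm2):
    """Fraction of dies with zero defects, assuming randomly scattered defects."""
    return math.exp(-defect_density_per_cm2 * die_area_cm2)


def cost_per_good_die(wafer_cost, wafer_diameter_cm, die_area_cm2,
                      defect_density_per_cm2):
    """Wafer cost divided by the expected number of working dies."""
    good_dies = (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                 * poisson_yield(defect_density_per_cm2, die_area_cm2))
    return wafer_cost / good_dies


# 30 cm wafer, 1 cm^2 dies, defect density 0.2 per cm^2, $1000 per wafer.
print(f"dies per wafer: {dies_per_wafer(30, 1.0):.0f}")
print(f"yield:          {poisson_yield(0.2, 1.0):.2f}")
print(f"cost per die:   ${cost_per_good_die(1000, 30, 1.0, 0.2):.2f}")
```

With these inputs the cost per good 1 cm² die comes out between one and two dollars, in line with the order-of-magnitude \$1 figure above.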
How much does Cost and Efficiency Matter

Depends on application

- Server users will pay extra for special purpose processors, must have speed!
- Clients need modest price, efficiency matters to some extent.
- Really small devices will have small costs, and are package limited (hence limited efficiency).

System on a chip models

Very low power, just support limited wireless and crypto

I/O uses higher power (different processors perhaps)

Secure wireless needed, but draws much more power.
What might be different?

Need reliability, testability

When things break need serviceability, recoverability and fail-safe support
Memory Trends, a few of General Interest 1 of 2

Slow relative to processor speed

Dynamic RAM (DRAM) vs. Static RAM (SRAM)

- SRAM fast, expensive, doesn’t “leak” the charge
- DRAM slower, cheaper, needs refresh since it “leaks” charge (so it forgets stored values).

Packaging/Bit Slicing - One bit of each word is stored on each chip (extra chips needed for parity/ECC).

Banks/Interleaving - Addresses can be interleaved across chips to reduce latency (allow pipelined access).

DDR RAM - Transfers data on both clock edges (doubling the data rate at the same clock rate)
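Bank interleaving, mentioned above, is usually done by using the low-order address bits to pick the bank. A minimal sketch, assuming simple low-order interleaving (the function name is hypothetical):

```python
def interleaved_location(address, num_banks):
    """Map a word address to (bank, row) with low-order interleaving.

    Consecutive addresses land in consecutive banks, so a sequential
    access stream can overlap accesses to different banks.
    """
    return address % num_banks, address // num_banks


for addr in range(8):
    bank, row = interleaved_location(addr, 4)
    print(f"address {addr} -> bank {bank}, row {row}")
```

The same mapping is what a power-aware design may want to avoid: with interleaving, even a small working set touches every chip, so none of them can be put to sleep.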
Memory Trends, a few of General Interest 2 of 2

Registered RAM — RAM uses registers (buffers) for fast access

Persistence - Memory values preserved when powered down
  - Flash/EEPROM, USB Key Chain storage — Writes are slow (burns into memory)
  - Nonvolatile RAM (NVRAM) — Doesn’t require special burn step and fast (not currently practical)

Intelligent RAM (IRAM), Processor In Memory (PIM) - Move processing into memory
  - Done by Patterson at UCB
  - Attempt to reduce time to move data to processor
  - Processing in memory is limited
  - Follows MIMD (Multiple Instruction, Multiple Data) parallel model
  - Significant Barrier to Entry - Benefit requires code modification

In Ubicomp devices Memory can be up to 1/2 the power cost
  - Often lack the power hogs (e.g. display)
  - Use low power CPUs
RAMBus QRSL and RSL Signalling

**Logic Levels and Voltage Levels**

- **Logic 00**: $V_{00} = V_{term} \approx 1.8\, \text{V}$
- **Logic 01**: $V_{01} \approx 1.53\, \text{V}$
- **Logic 11**: $V_{11} \approx 1.27\, \text{V}$
- **Logic 10**: $V_{10} \approx 1.00\, \text{V}$

- **Nominal Voltage** $\approx 0.8\, \text{V}$

[Figure: waveform diagram comparing the Clock, RSL, Value and QRSL signals]

**RAMBus Signalling Level (RSL)** — Originally binary now uses 4 level PCM.
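The four logic levels listed above carry two bits per symbol, and neighbouring voltage levels differ in only one bit (a Gray-code ordering). The sketch below illustrates that mapping using the voltages from the slide; the nearest-level decoder is an assumption made for the example, not a description of the actual receiver circuit.

```python
# Voltage levels from the slide (volts), keyed by the 2-bit symbol.
QRSL_LEVELS = {"00": 1.80, "01": 1.53, "11": 1.27, "10": 1.00}


def encode(bits):
    """Turn a bit string such as '0110' into one voltage per 2-bit symbol."""
    if len(bits) % 2:
        raise ValueError("need an even number of bits")
    return [QRSL_LEVELS[bits[i:i + 2]] for i in range(0, len(bits), 2)]


def decode(voltages):
    """Pick the nearest nominal level for each sampled voltage."""
    symbols = []
    for v in voltages:
        symbols.append(min(QRSL_LEVELS, key=lambda s: abs(QRSL_LEVELS[s] - v)))
    return "".join(symbols)


print(encode("0110"))
print(decode([1.50, 1.05]))
```

Because of the Gray ordering, a sample that is misread as an adjacent level corrupts only one of the two bits.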
Work of Fan, Ellis and Lebeck [3] for RAMBus RDRAM [7].

- High bandwidth (1.6 GB/sec per device)
- Narrow bus topology (in terms of wires) — Allows independent chip access/control
- Chips have power management states (Highest Power Consumption/Speed first).
  - Active (or ATTN) — Needed for read/write access
  - Standby — Low Power, but monitors channel for packets directed at it
  - Nap — A Timed sleep state, only gets clock and refresh signals, wakes up after a predetermined number of cycles.
  - Powerdown — Only refreshes the memory
Place memory into low power consumption states when idle for too long.

When is too long?

- Measure time between requests (gap)
- If gap > threshold it's too long
- Naps if gap exceeded
- Requests arriving during nap wait till end of nap
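The threshold policy above can be sketched as a tiny simulation. It makes several simplifying assumptions that are not in the slides: requests are serviced instantly, the chip takes fixed-length naps back to back once the threshold has passed, and a request that arrives mid-nap is serviced when that nap ends. All names are hypothetical.

```python
import math


def simulate_nap_policy(arrivals, threshold, nap_length):
    """Return (service_times, total_nap_time) for sorted request arrival times.

    The chip starts napping once it has been idle for `threshold` time units
    and wakes only at the end of a nap of `nap_length` time units.
    """
    service_times = []
    total_nap = 0.0
    last_done = 0
    for t in arrivals:
        start = max(t, last_done)            # requests are served in order
        gap = start - last_done
        if gap > threshold:
            nap_start = last_done + threshold
            naps = math.ceil((start - nap_start) / nap_length)
            total_nap += naps * nap_length
            start = nap_start + naps * nap_length   # wait for the nap to end
        service_times.append(start)
        last_done = start
    return service_times, total_nap


times, napped = simulate_nap_policy([1, 2, 10, 11], threshold=3, nap_length=4)
print(times, napped)
```

In this run the request arriving at time 10 is delayed until the nap ends, which shows the trade-off: longer naps save more energy but add latency to the first request after an idle period.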
Processor Effects

Overall energy usage is $E_{total} = E_{CPU} + E_{mem}$.

Recall that processor power-saving modes reduce the clock rate

- And $E_{CPU} \propto \text{ClockRate}^2$

Suppose processor leaves memory in active state

- Mem Energy = Num Chips $\times$ Chip Power $\times$ Time
- Low Power comes from slowest processor clock, why?
- Can only conserve about 15-17%, but increases run time

Instead try Dynamic Power Awareness

- Try immediate nap
  - Works for small memory footprints.
  - Don’t interleave memory, can turn off an entire chip(s)!
  - If miss rate is low may also work well
  - Out of order execution did not impact memory energy costs much.
Nap Policy Impact on Run Time and Energy Usage

\[ T_{\text{active}} = t_{\text{access}} \times N_{\text{misses}} \]

\[ T_{\text{nap} \rightarrow \text{active}} = t_{\text{nap} \rightarrow \text{active}} \times N_{\text{misses}} \]

\[ T_{\text{nap}} = \frac{1}{f} \times (N_{\text{instructions}} - N_{\text{misses}}) \]

\[ T_{\text{exec}} = T_{\text{active}} + T_{\text{nap} \rightarrow \text{active}} + T_{\text{nap}} \]

CPU energy consumption relies on processor frequency \( f \)

CPI is assumed to be 1

\[ E_{\text{cpu}} = T_{\text{exec}} \times P_{\text{cpu}}(f) + T_{\text{residual}} \times P_{\text{leakage}} \]

\[ \begin{aligned} E_{\text{mem}} = {} & T_{\text{active}} \times P_{\text{active}} + T_{\text{nap}} \times P_{\text{nap}} \\ & + T_{\text{nap} \rightarrow \text{active}} \times P_{\text{nap} \rightarrow \text{active}} + T_{\text{residual}} \times P_{\text{PowerDown}} \end{aligned} \]
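The timing and energy equations above translate almost line for line into code. In this sketch the residual time is taken as an input because the slides do not define it, the CPU power at the chosen frequency is passed in as a plain number, and all example values are invented for illustration.

```python
def nap_policy_model(n_instructions, n_misses, f_hz, t_access, t_wake,
                     p_cpu, p_active, p_nap, p_wake,
                     t_residual=0.0, p_leakage=0.0, p_powerdown=0.0):
    """Evaluate the immediate-nap model (CPI assumed to be 1).

    Times are in seconds, powers in watts, energies in joules.
    p_cpu is the CPU power at frequency f_hz, supplied by the caller.
    """
    t_active = t_access * n_misses                 # memory busy serving misses
    t_wake_total = t_wake * n_misses               # nap -> active transitions
    t_nap = (n_instructions - n_misses) / f_hz     # memory naps otherwise
    t_exec = t_active + t_wake_total + t_nap

    e_cpu = t_exec * p_cpu + t_residual * p_leakage
    e_mem = (t_active * p_active + t_nap * p_nap
             + t_wake_total * p_wake + t_residual * p_powerdown)
    return {"t_exec": t_exec, "e_cpu": e_cpu, "e_mem": e_mem,
            "e_total": e_cpu + e_mem}


# Hypothetical workload: 1M instructions, 10k misses, 100 MHz clock.
result = nap_policy_model(1_000_000, 10_000, 100e6,
                          t_access=60e-9, t_wake=50e-9,
                          p_cpu=0.2, p_active=0.3, p_nap=0.03, p_wake=0.16)
for name, value in result.items():
    print(f"{name}: {value:.6f}")
```

Re-running the model over a range of frequencies (with a matching `p_cpu` for each) shows where slowing the clock stops paying off because the memory terms grow with run time.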
Power Aware Memory Scheduling

Slowing clock speed reduces power consumption to a point

- Too slow and memory costs creep up due to run time stretch
- Leaving memory in Active state is expensive!
Perhaps we can do better than just using nap

- Which state should we use, generate a hint (prediction)
  - Negligible gap - stay active
  - Short Gap Not Confident on Duration - Standby
  - Short Gap Confident on Duration - Nap
  - Long Gap Confident on Duration - Powerdown

- Hard problem
  - Don’t have much time to make a prediction
  - Hasty predictions are seldom good
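The hint table above amounts to a small decision function. This sketch assumes a predictor has already produced a gap estimate and a confidence flag; the two cut-off parameters are hypothetical tuning knobs, and the slide does not say what to do with a long gap of uncertain duration, so the sketch conservatively falls back to standby in that case.

```python
from enum import Enum


class MemState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    NAP = "nap"
    POWERDOWN = "powerdown"


def choose_state(predicted_gap, confident, negligible_gap, long_gap):
    """Pick a memory power state from a predicted idle gap and a confidence flag."""
    if predicted_gap <= negligible_gap:
        return MemState.ACTIVE          # not worth leaving the active state
    if not confident:
        return MemState.STANDBY         # cheap to wake if the guess is wrong
    if predicted_gap >= long_gap:
        return MemState.POWERDOWN       # long, trusted gap: deepest state
    return MemState.NAP                 # short, trusted gap


print(choose_state(5, True, negligible_gap=10, long_gap=1000))
print(choose_state(200, False, negligible_gap=10, long_gap=1000))
print(choose_state(200, True, negligible_gap=10, long_gap=1000))
print(choose_state(5000, True, negligible_gap=10, long_gap=1000))
```

The hard part is not this table but producing `predicted_gap` and `confident` quickly enough to be useful.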
Disk Drives Capacity growing by about 100% per year, why?

- Improved materials
- Smart signal encoding of data allows for compressed storage
- Improved controller
- Some of these methods have developed problems in the field
  - IDE drives appear to be most hit or miss
  - IDE drive users expect low cost
  - IBM Deathstar/Deskstar

But users want more storage

- So they tend not to reduce hard drive size

Micro-Electro-Mechanical Systems (MEMS) and Nano Tech

- At small size physical constraints differ
MEMS-Based IC Mass Storage

Idea: Have many small devices fabricated

Several Approaches have been tried

- Micro disk drives (Hard drives)
- Phonograph like disks (melt media, creating pits) — IBM Millipede
  - Probes contact the media
  - Probes and Media likely to degrade quickly
- Phase Change Media (Amorphous to Crystalline) - HP Approach
  - Used in CD RW (Magneto Optical Drives).
  - Has slow write speeds, read speed not the best.
- Nonrotating magnetic media read by probes (CMU Approach)

Carley et al. [1] took the nonrotating magnetic media approach

- Stiction (static friction) dominant
- Heat dissipation and tolerances make high rotation rates hard for disk drives.
- Can reduce speed of moving parts
Parallelizing probe accesses (Striping) and use of cache adds speed. Probes must move across media when depositing or reading charge. Which component should move?

- **Probes** - fast but waste storage
  - Each Probe Has Height Control ($z$ axis) to access media/retract
  - Probes are small and light, and travel short distances
  - However actuators are large relative to range of motion
  - Hence much of the media cannot be reached.

- **Media** - Slower but storage efficient
  - Need springs to recenter the sled and prevent chatter in $x$ and $y$ directions
  - Have 2 pairs of actuators in $x$, $y$ directions
  - Sleds don’t need to move fast
  - However inertia dominates
MEMS Probes and Sled Characteristics

Figure 1. Conceptual view of MEMS-actuated data storage devices with X-Y motion of the media and Z motion of a 4x5 array of probe tips.
MEMS Sled-Based Design, Actuators and Springs
Data Layout and Impact on File Systems

Random access done by positioning in \((x, y)\)

- Don’t want to do this too much, it’s expensive
- Keep adjacent addresses contiguous on a track (like hard disks)

Most data is read as a sequence of bits; how should contiguous data be handled?

- Expensive operation is to reverse or change sled direction
- Sled travels slowly (especially relative to platter rotation speeds)
- Want to minimize number of reversals
- Layout data so that 2 adjacent tracks are anti-parallel
- When sled reaches end of a track in say \(x\) direction
  - Stops motion
  - Moves 1 track over in \(y\) direction
  - Begin motion in \(-x\) direction
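The anti-parallel track layout described above is a serpentine (boustrophedon) mapping. A minimal sketch, assuming fixed-size blocks and a fixed number of blocks per track (both names are hypothetical):

```python
def serpentine_position(block, blocks_per_track):
    """Map a logical block number to (track, x_position, direction).

    Even tracks are read in the +x direction and odd tracks in the -x
    direction, so the sled only steps one track in y at each track end
    instead of travelling back across the whole track.
    """
    track, offset = divmod(block, blocks_per_track)
    if track % 2 == 0:
        return track, offset, +1
    return track, blocks_per_track - 1 - offset, -1


for block in range(8):
    print(block, serpentine_position(block, 4))
```

Note that the last block of one track and the first block of the next share the same x position, which is exactly what keeps the turnaround cheap.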

Media and probes may fail, what to do?

- Exploit redundancy, trade space off for fault tolerance
- Reserve some sectors for ECC/parity bits (like RAID)
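As an illustration of trading space for fault tolerance, the sketch below uses simple XOR parity in the style of RAID: one extra block per stripe lets the data of any single failed probe be rebuilt. This is a generic technique chosen for the example, not the specific code used in the MEMS designs, and it assumes all blocks in a stripe have the same length.

```python
from functools import reduce


def parity_block(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return parity_block(list(surviving_blocks) + [parity])


# One stripe spread over three probes plus a parity probe.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = parity_block(stripe)

# Suppose the probe holding stripe[1] fails.
rebuilt = recover([stripe[0], stripe[2]], parity)
print(rebuilt == stripe[1])
```

Parity of this kind tolerates one failure per stripe; handling more failed probes or media defects would need a stronger code.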
Data Layout and File Systems Design

[Diagram of data layout and file systems design with annotations]

Sled Design

Impact on storage

- Expect that it will be a layer between disk and DRAM
- Might be used in hand held devices (I think target density was about 2GB/chip, but I could not find the reference).
- Disk drives likely to continue to be cheap DASD, so they may still be popular for slower large stores.
- If energy costs, speed, reliability and space efficiency are good, might replace disks.
Peripherals/Displays

Display technology is one of the biggest power draws

- Low power LCDs - Bistable Nematic (BiNem) Devices [6]
  - Active Matrix - High Power, Sharper Display, 1 transistor per pixel
  - Passive Matrix - Low Power, Less Sharp, Transmit Power along Transparent Wires along rows/columns

- Electronic Paper — More like flexible sheets of plastic
  - Monochrome Microball based approaches (Xerox)
  - Colored Electrowetting (moving ink) based approaches (Philips)
Twisted Nematic and BiNem Displays

Nematic crystals are used in Liquid Crystal Display (LCD) technology
Their molecules are long and thread like
They exploit polarization characteristics of light
The molecules untwist when voltage is applied, changing the polarization of light passing through

- Recall that a polarizing filter blocks light waves whose oscillation direction differs from the orientation the filter accepts.
BiStable Nematic LCDs

Twisted Nematic devices need energy to “untwist”

- i.e. Twisted is the low energy state
- Untwisting whitens pixels in transmissive (backlit) devices
- Untwisting blackens pixels in reflective devices
- Reflective devices white when powered down, transmissive devices dark.

Bistable Nematic LCDs only need voltage to change state
State Selection in Bistable Nematic LCDs

High Leading edge breaks stable configuration
High Trailing Edge specifies twisted state (Write)
Low Trailing Edge specifies untwisted state (Read)
Controlling LCDs/Multiplexing

Buses run along columns and rows
Signal on both buses selects pixel (voltages add)
Low power signals important (avoid crosstalk)
BiNem Tradeoffs/Limitations

Limitations on Refresh Rate (25 Hz)
Good for e-books, speed improving
Not currently mass produced (fab costs prohibitive)
Image quality reported to be good
Backlit displays more readable but draw illumination power
BiNem Tradeoffs/Limitations

Limitations on Refresh Rate (25 Hz)

- Good for e-books, can speed improve?

Has a wide viewing angle and good contrast

Has low cost

Backlit displays more readable but draw illumination power

Might be possible to construct using existing facilities
Sheridon’s Electronic Paper [2] is monochrome

- Uses Gyricons, Small balls, one side white the other black
- Each pixel controlled by ball orientation
- Apparently manufacture of small balls was hard
- Used rapidly spinning plate and jets of black/white plastic
- Balls have charge differential, oriented using electromagnetic field
Hayes and Feenstra’s Electrowetting [5] is color and capable of video speeds

- Ink caught between surface and water
- When charge is applied ink retracts exposing surface (could be white)
- Pixels must be small (so interface forces overcome gravity)
- Retracted ink small enough, effectively invisible to naked eye
Electrowetting Color Management

Colors generated using standard CMY (cyan, magenta, yellow) color
Partially retracted ink gives range of intensities/color
Color filters and “subtractive dyes” permit grayscale
Electrowetting Advantages

Has strong color resolution compared to LCDs
Has 100 DPI Resolution, 10 msec switching time
Partially retracted ink gives range of intensities/colors
Has low power consumption
Goal - To be able to run CPU intensive Apps on a hand held
  - e.g. MPEG and Voice Recognition

Limiting factors - Display and Battery size and weight
Adding wireless using a separate card increased size 70%
Itsy - Architecture

[Figure: Itsy block diagram; the labels recovered from it are listed below]

- Processor and memory: StrongARM SA-1100 processor, flash memory, DRAM
- User interface: LCD with backlight, touch screen, LED, speaker, microphone, encoder, buttons
- Other on-board blocks: codec, analog interface, two-axis accelerometer, IrDA, USB, RS-232, power (charger, supply, monitor)
- Docking connector: USB, RS-232, power (3.5 V-13 V)
- Available daughtercard functionality: software modem, A/D input, GPIOs, SSP, SDLC, 2 UARTs, memory bus (static memory, DRAM, two PCMCIA sockets)

Abbreviations used in the figure:

- Codec: Coder/decoder
- GPIO: General-purpose I/O pin
- IrDA: Infrared Data Association standard port
- LCD: Liquid crystal display
- LED: Light-emitting diode
- Li-Ion: Lithium-ion
- PCMCIA: Personal Computer Memory Card Int’l Assoc.
- RS-232: Serial interface
- SDLC: Synchronous data link controller
- SSP: Synchronous serial port
- UART: Universal asynchronous receiver-transmitter
- USB: Universal serial bus

Itsy - Component Features

Itsy uses the StrongARM SA-1100 processor (low power ARM)

- ARM is Acorn RISC Machine (defunct U.K. vendor)
- Has Programmable Clock Speed
- Has 3 modes of operation
  - Active - Normal mode
  - Idle - Clock to processor core is off, peripherals and rest of chip stay on
  - Sleep - Only real time clock and wakeup circuit in processor are on, optionally clock can be enabled for fast wakeup.

Display - Passive Matrix LCD used

- Uses PLD with 1 bit per pixel
- Static monochrome images stored while processor sleeps

Memory - Persistent and Volatile

- 32 MB DRAM and 32 MB Flash
  - Used MicroBall Grid Array (MBGA) packaging — costs more, but more dense
  - Allowed 2x DRAM and 8X Flash increase
Itsy - Power Management

Itsy has 2 switched power supplies and 1 linear low noise supply

- Low noise needed for clear audio
- 3.3V lines with ±0.3V tolerance driven at 3.05V, saving 15% power.

Power Saving Mechanisms were Developed

- Peripherals can be selectively soft-powered down during CPU Sleep
- CPU core voltage can be soft selected to be 1.5V or 1.23V (a bit below the manufacturer's spec, but o.k. at low clock speeds).
- Employs voltage monitoring equipment
- Uses battery charge monitors, but getting an accurate reading is hard due to dynamic power requirements.
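The savings from running the supply lines and the core at lower voltages can be estimated with the active-power term $C \times V^2 \times \text{freq} / 2$ from earlier in these notes. The sketch assumes the clock rate and capacitance stay fixed, so power scales with $V^2$; the function name is made up for the example.

```python
def dynamic_power_saving(v_nominal, v_reduced):
    """Fractional power saved by lowering the voltage, assuming P ~ V^2."""
    return 1 - (v_reduced / v_nominal) ** 2


# 3.3 V supply lines driven at 3.05 V.
print(f"{dynamic_power_saving(3.3, 3.05):.1%}")

# Core voltage lowered from 1.5 V to 1.23 V.
print(f"{dynamic_power_saving(1.5, 1.23):.1%}")
```

The first case comes out at a little under 15%, consistent with the figure quoted above; the core voltage reduction saves roughly a third under the same assumption.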
Persistent data and O/S boot image written to flash memory

- Uses ext2fs file system (before journalling ext3) using Flash Translation Layer (FTL) driver
- To write flash you must erase large regions first (128KB or 256KB)
- Each erase takes about 700 msec (slow!)
- Low level driver maps (virtual) disk blocks to (physical) flash memory addresses
- How should file system update a block
  - Logical disk manager needs to track physical erased blocks
  - Worst Case - Only one free block, whole block must be erased and overwritten (SLOW!)
  - Instead reserve 5% of blocks to prevent this scenario
  - Reclaim unused blocks (schedule them for erasure)
- Virtual block manager informs physical block manager when it releases a block (to avoid extra writes)
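The block-remapping idea can be sketched as a toy translation layer. This is not the Itsy driver: it is a simplified model that assumes writes and erases happen at the same block granularity, keeps everything in Python dictionaries, and uses hypothetical names throughout. It does show the three ideas listed above: logical-to-physical mapping, a reserve of spare blocks, and deferred erasure of released blocks.

```python
class SimpleFTL:
    """Toy flash translation layer: remap on write, erase stale blocks lazily."""

    def __init__(self, num_physical, reserve_fraction=0.05):
        reserved = max(1, int(num_physical * reserve_fraction))
        self.num_logical = num_physical - reserved   # capacity shown to the FS
        self.free = list(range(num_physical))        # erased, ready to program
        self.dirty = []                              # stale, awaiting erase
        self.mapping = {}                            # logical -> physical
        self.storage = {}                            # physical -> data
        self.erase_count = 0

    def _reclaim(self):
        """Erase every stale block (the slow step) and return it to the pool."""
        for phys in self.dirty:
            self.storage.pop(phys, None)
            self.erase_count += 1
        self.free.extend(self.dirty)
        self.dirty = []

    def write(self, logical, data):
        if not 0 <= logical < self.num_logical:
            raise IndexError("logical block out of range")
        if not self.free:
            self._reclaim()
        phys = self.free.pop(0)          # never overwrite in place
        self.storage[phys] = data
        old = self.mapping.get(logical)
        if old is not None:
            self.dirty.append(old)       # old copy becomes stale
        self.mapping[logical] = phys

    def read(self, logical):
        return self.storage[self.mapping[logical]]

    def release(self, logical):
        """File system tells the FTL a block is unused, so it can be erased."""
        phys = self.mapping.pop(logical, None)
        if phys is not None:
            self.dirty.append(phys)


ftl = SimpleFTL(num_physical=20)
for version in range(30):
    ftl.write(0, f"v{version}".encode())
print(ftl.read(0), ftl.erase_count)
```

Because the logical capacity is smaller than the physical one, there is always at least one stale or free block when a write arrives, so the worst case of erasing and rewriting in place never occurs in this model.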

OS creates a Ramdisk file system in DRAM

- Allows for use of preexisting linux tools
- Ramdisk does not use kernel cache so is memory efficient
- Special trick - their ramdisk doesn’t store zero filled blocks
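The zero-block trick can be illustrated with a sparse block store. This is a sketch of the general idea only, with a hypothetical class name and a dictionary standing in for kernel memory: blocks that are entirely zero are never stored, and reads of missing blocks return zeros.

```python
class SparseRamdisk:
    """Block store that keeps only blocks containing non-zero data."""

    def __init__(self, num_blocks, block_size=4096):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self._zero = bytes(block_size)
        self._blocks = {}                 # block index -> data

    def write(self, index, data):
        if not 0 <= index < self.num_blocks:
            raise IndexError("block index out of range")
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        if data == self._zero:
            self._blocks.pop(index, None)     # zero block: store nothing
        else:
            self._blocks[index] = bytes(data)

    def read(self, index):
        if not 0 <= index < self.num_blocks:
            raise IndexError("block index out of range")
        return self._blocks.get(index, self._zero)

    def blocks_stored(self):
        return len(self._blocks)


disk = SparseRamdisk(num_blocks=1024, block_size=16)
disk.write(0, b"A" * 16)
disk.write(1, bytes(16))          # all zeros, costs no memory
print(disk.blocks_stored(), disk.read(1) == bytes(16))
```

A freshly formatted file system is mostly zero-filled blocks, so on a device with 32 MB of DRAM this kind of saving can be significant.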
Itsy - Devices and UI

Uses Virtual devices

- Supports device sharing via Squeak, Java, X
- Session management used (one or more processes)
- Raw Interface allowed (e.g. for background speech recognition)

Traditional desktop interfaces fall short

- Most apps don’t monopolize user (except games)
- Stylus/Mouse based pointing devices did not work well
- Speech — Recognition (Dragon Systems) and Generation (DECTalk) effective (30 K vocabulary)
- Gestures — Don’t move mouse or stylus, move the whole device
  ▶ e.g. Scrolling by rotating device (Rock n’ scroll)
Performance was roughly like Intel Pentium P5 90-110 MHz on SPECINT 92 and Dhrystone.

In sleep mode, Itsy can hold memory contents for 1.3 days on a single charge.

Power awareness key to power management!
Bibliography

References


