
How Can a Reliability Engineer Reduce Downtime?


Downtime is expensive. When production machinery fails, the result can be lower productivity, increased safety risks for workers, and lost revenue.

Reliability engineers help companies reduce downtime by identifying potential risks, improving asset longevity, and implementing preventative strategies that help keep operations running smoothly. Let’s explore the tools, data, systems, and strategies reliability engineers use to accomplish these goals.  

How reliability engineers analyze and reduce downtime

Reliability engineers take a methodical approach to analyzing and reducing downtime. These professionals typically focus on four operational areas: risk management, safety, loss elimination, and life cycle asset management (LCAM).

In addition to reducing downtime, reliability engineers can also help companies improve regulatory compliance, enhance the impact of equipment maintenance, and extend the life of key assets. 

Given the complexity of modern manufacturing tools and technologies, however, there’s no one-size-fits-all solution for improved uptime and reduced risk. To create comprehensive reliability frameworks, engineers combine big-picture analysis with targeted investigations and evaluations. Here are seven ways reliability engineers help reduce risk and decrease downtime.

1. Root cause failure analysis (RCFA)

Treating the symptoms won’t cure the disease. It’s true in healthcare, and it’s true in manufacturing. Consider a piece of hydraulic equipment that keeps leaking fluid. Maintenance teams replace seals, valves, and lines, but the problem persists.

The leak is a symptom of a larger problem: too much pressure. Until pressure is reduced at the source, other maintenance efforts are only temporary fixes. Root cause failure analysis exists to solve this kind of problem: it traces failures back to their initial causes so that repeat failures can be eliminated. Reliability engineers use several tools in RCFA:

  • The “5 whys” — The idea behind the “5 whys” is simple: Keep asking why until you get to the root cause. On average, 5 whys gets you from symptom to cause, but this isn’t a fixed number. Engineers keep asking why until they’re sure they’ve found the answer.  
  • Fault tree analysis — Fault tree analysis is a diagram-based approach that puts the undesired event at the top and then connects contributing factors as it flows downward, creating a tree-like shape. The result is a map of the dependencies and relationships between the initial fault and its potential causes.  
  • Fishbone diagrams – Fishbone diagrams display the problem as the “head” of the fish, with categories of possible causes acting as the connecting bones. Common categories include equipment, materials, people, and processes. 
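
As a rough illustration of fault tree analysis, the sketch below models the leak example as a small tree of AND/OR gates and computes the probability of the top event. The event names and probabilities are invented for illustration, and the calculation assumes the basic events are independent, which may not hold on real equipment.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Event:
    """A basic event (leaf) with an assumed probability of occurring."""
    name: str
    probability: float


@dataclass
class Gate:
    """An AND/OR gate that combines child events or other gates."""
    name: str
    kind: str  # "AND" or "OR"
    children: list = field(default_factory=list)


def probability(node):
    """Return the probability of a node, assuming independent basic events."""
    if isinstance(node, Event):
        return node.probability
    child_probs = [probability(child) for child in node.children]
    if node.kind == "AND":
        # Every child must occur for the gate to fire.
        return prod(child_probs)
    if node.kind == "OR":
        # At least one child occurs: 1 - P(no child occurs).
        return 1 - prod(1 - p for p in child_probs)
    raise ValueError(f"Unknown gate kind: {node.kind}")


# Hypothetical tree for the leak example; all numbers are made up.
overpressure = Gate("Excess pressure", "AND", [
    Event("Relief valve stuck", 0.02),
    Event("Pump regulator drifts high", 0.10),
])
leak = Gate("Hydraulic leak", "OR", [
    overpressure,
    Event("Seal worn out", 0.05),
])

top_event_probability = probability(leak)
print(f"P(leak) = {top_event_probability:.4f}")
```

Walking the tree from the top event down mirrors how the diagram is read: each gate shows which combination of lower-level causes is needed to trigger the event above it.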

2. Predictive maintenance using condition monitoring

Manufacturing firms traditionally rely on reactive equipment maintenance: When equipment fails, teams are sent in to find and fix the problem. 

The problem? While issues are identified and addressed, companies suffer downtime. The longer it takes teams to find the root cause, the more time and money companies lose. In a best-case scenario, fixes are familiar and take only a few hours. In a worst-case situation, teams can’t find root causes and instead apply temporary solutions to get equipment back up and running. 

Predictive maintenance takes the opposite approach. Instead of waiting for machines to break, reliability engineers use what’s known as condition monitoring systems to track system performance and carry out repairs before failure occurs. Common condition monitoring approaches include: 

  • Vibration analysis — Vibration analysis measures the vibrational movement of pipes, tubes, or moving parts and compares it to baseline values.  
  • Thermal imaging — Thermal imaging reports the internal or external temperature of machinery. Sudden spikes or falls can indicate potential failure. 
  • Ultrasonic testing — Ultrasonic reliability testing tools measure sound. These sensors may be physically attached to devices or connected wirelessly. Abrupt changes in pitch may act as precursors to component seizures or sudden stoppages. 
  • Oil level measurement — Oil levels can indicate possible problems. If levels drop, machines may be leaking. If they rise, this may indicate temperature spikes. 

These approaches are more effective when combined with IIoT sensors connected to centralized reporting platforms, which give reliability engineers access to equipment data in real time.

Taking a predictive approach to system reliability, informed by condition monitoring, enables engineers to better control maintenance costs, reduce the number of emergency stops, and optimize maintenance schedules. 
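
The baseline comparison behind condition monitoring can be sketched in a few lines. The example below flags readings that sit more than a chosen number of standard deviations away from a baseline mean. The function name, the threshold of 3, and the vibration values are assumptions for illustration; real monitoring systems typically use more sophisticated models and vendor-specific alarm limits.

```python
from statistics import mean, stdev


def find_anomalies(baseline, readings, threshold=3.0):
    """Return (index, value) pairs for readings that deviate from the
    baseline mean by more than `threshold` baseline standard deviations."""
    base_mean = mean(baseline)
    base_std = stdev(baseline)
    if base_std == 0:
        raise ValueError("Baseline has no variation; cannot set a threshold")
    return [
        (i, value)
        for i, value in enumerate(readings)
        if abs(value - base_mean) / base_std > threshold
    ]


# Hypothetical vibration velocity readings in mm/s; values are made up.
baseline_mm_s = [2.1, 2.0, 2.2, 1.9, 2.1, 2.0, 2.2, 2.1]
new_readings_mm_s = [2.0, 2.2, 2.1, 3.4, 2.1]

anomalies = find_anomalies(baseline_mm_s, new_readings_mm_s)
print(anomalies)  # [(3, 3.4)]
```

Only the fourth reading is flagged, which is the kind of early signal that lets a team schedule an inspection before the component fails.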

3. Asset performance management (APM) & data analytics

APM and data analysis provide benchmarks to help engineers track asset reliability over time and make predictions about future trends and potential failures. Central to APM platforms are key performance indicators, such as: 

  • Overall equipment effectiveness (OEE) — OEE is a measure of how much planned production time results in productive outputs. The closer this value is to 100%, the more efficient your processes or equipment. 
  • Mean time between failures (MTBF) — MTBF represents the average time a piece of equipment or a system will operate before failure. The longer the MTBF, the more reliable your operations.  
  • Mean time to repair (MTTR) — MTTR measures how long it takes, on average, to repair equipment after a failure. Lower MTTR means less downtime per failure. A proactive maintenance strategy can reduce MTTR by catching problems early, while repairs are still small and can be planned.
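
For readers who want to see the arithmetic, here is a minimal sketch of how these three KPIs are commonly calculated. It uses the widely used definition of OEE as availability × performance × quality. The function names and all input numbers are assumptions for illustration, not output from any particular APM platform.

```python
def oee(planned_minutes, downtime_minutes, ideal_cycle_minutes,
        total_units, good_units):
    """Overall equipment effectiveness as availability x performance x quality."""
    run_minutes = planned_minutes - downtime_minutes
    availability = run_minutes / planned_minutes
    performance = (ideal_cycle_minutes * total_units) / run_minutes
    quality = good_units / total_units
    return availability * performance * quality


def mtbf(operating_hours, failure_count):
    """Mean time between failures: operating time divided by number of failures."""
    return operating_hours / failure_count


def mttr(total_repair_hours, repair_count):
    """Mean time to repair: total repair time divided by number of repairs."""
    return total_repair_hours / repair_count


# Hypothetical shift: 500 planned minutes, 100 minutes of downtime,
# 1.0 minute ideal cycle time, 360 units produced, 342 of them good.
shift_oee = oee(500, 100, 1.0, 360, 342)
print(f"OEE:  {shift_oee:.1%}")    # 0.8 * 0.9 * 0.95 = 68.4%

# Hypothetical quarter: 1,200 operating hours, 4 failures, 10 repair hours.
asset_mtbf = mtbf(1200, 4)
asset_mttr = mttr(10, 4)
print(f"MTBF: {asset_mtbf:.1f} h")  # 300.0 h
print(f"MTTR: {asset_mttr:.1f} h")  # 2.5 h
```

Breaking OEE into its three factors shows where the losses come from: in this made-up shift, availability (80%) is the biggest drag, which points to downtime rather than speed or quality as the first thing to investigate.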

APM and analytics tools are often integrated with computerized maintenance management systems (CMMS) and enterprise asset management (EAM) systems to provide centralized data access and enable asset performance optimization. 

4. Reliability-centered maintenance (RCM)

Reliability-centered maintenance is focused on keeping critical assets functioning and critical resources available.  

It differs from both reactive and scheduled maintenance. In a reactive maintenance approach, fixes happen only after failures. In a scheduled maintenance framework, repairs happen at set time intervals. 

In the case of RCM, maintenance is tied to function and cost. Equipment that is critical for production lines to keep running, and tools that are expensive to maintain or repair, are at the top of RCM priority lists. The goal of RCM is to balance machine reliability and maintenance costs. RCM may be based on several factors: 

  • Function — What function does the equipment serve? For example, a piece of machinery that is essential to your production output should top the RCM list. Tools used for post-processing, meanwhile, can be maintained at a reduced frequency.  
  • Failure modes — Some machinery can partially operate during failure. Other equipment comes to a hard stop. Tools with multiple failure modes can be placed in the middle of RCM lists, while those with only on/off failure modes should be placed near the top. 
  • Expense — How expensive is equipment failure? The more you stand to lose if a piece of machinery fails, the higher its RCM priority. 
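
One simple way to turn these factors into a priority list is a weighted score. The sketch below is only an illustration: the 1 to 5 criticality scale, the weights, the asset names, and the costs are all assumptions, and a real RCM program would derive them from a formal criticality analysis.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    function_criticality: int   # 1 (post-processing) to 5 (essential to output)
    hard_stop_on_failure: bool  # True if the asset cannot run in a degraded state
    failure_cost: float         # estimated cost of one failure


def rcm_score(asset, max_cost):
    """Combine function, failure mode, and expense into a score from 0 to 1."""
    function_part = asset.function_criticality / 5
    failure_mode_part = 1.0 if asset.hard_stop_on_failure else 0.5
    cost_part = asset.failure_cost / max_cost if max_cost > 0 else 0.0
    # Illustrative weights; tune them to your own operation.
    return 0.4 * function_part + 0.3 * failure_mode_part + 0.3 * cost_part


def rank_assets(assets):
    """Return assets sorted from highest to lowest RCM priority."""
    max_cost = max(a.failure_cost for a in assets)
    return sorted(assets, key=lambda a: rcm_score(a, max_cost), reverse=True)


# Hypothetical assets; names and figures are made up.
assets = [
    Asset("Packaging labeler", 2, False, 5_000),
    Asset("Main extruder", 5, True, 80_000),
    Asset("Conveyor drive", 4, False, 20_000),
]

ranked = rank_assets(assets)
for asset in ranked:
    print(asset.name)
# Main extruder
# Conveyor drive
# Packaging labeler
```

The production-critical, hard-stop, high-cost asset lands at the top, while the post-processing equipment falls to the bottom, matching the prioritization described above.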

5. Continuous improvement & downtime metrics

Two key components of reliability engineering are continuous improvement and downtime metrics. The more data available to engineers, the more they can learn from failure incidents, and the better prepared they are to handle new challenges. Continuous improvement is also tied to measuring success via KPIs, including:

  • Unplanned downtime rates — If machine downtime rates are falling, maintenance strategies are working. If they are increasing, engineers need a new approach. 
  • Asset utilization values — Higher utilization values mean more uptime, which means scheduled maintenance is doing its job. If specific assets continue to show high failure rates, further RCFA may be required. 
  • Maintenance backlogs — The better your maintenance program, the fewer backlogs you have. More backlogs mean key issues remain unaddressed. 

Tracking these values over time helps reliability engineers link operational improvements to ROI opportunities and operational goals. 
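
To make "tracking over time" concrete, the sketch below computes a monthly unplanned downtime rate and fits a simple least-squares trend line to it. The monthly figures are invented, and a straight-line slope is only one of several reasonable ways to judge whether a KPI is improving.

```python
def downtime_rate(unplanned_hours, scheduled_hours):
    """Unplanned downtime as a fraction of scheduled production time."""
    return unplanned_hours / scheduled_hours


def trend_slope(values):
    """Least-squares slope of values against their position (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("Need at least two data points to compute a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical (unplanned hours, scheduled hours) per month; values are made up.
monthly_hours = [(36, 600), (30, 600), (27, 600), (21, 600)]

rates = [downtime_rate(unplanned, scheduled) for unplanned, scheduled in monthly_hours]
slope = trend_slope(rates)

print([f"{rate:.1%}" for rate in rates])  # ['6.0%', '5.0%', '4.5%', '3.5%']
print("improving" if slope < 0 else "not improving")  # improving
```

A negative slope means the unplanned downtime rate is falling month over month, which is the signal that the current maintenance strategy is working.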

6. Leveraging modern technology

Industry 4.0 has revolutionized manufacturing by creating interconnected and intelligent frameworks. This fourth industrial revolution is underpinned by manufacturing technology solutions such as CMMS, digital twins, and artificial intelligence (AI). Here’s a look at how each contributes to improved reliability: 

  • CMMS — CMMS solutions help track assets, schedule maintenance, and manage work orders. This improves efficiency and reduces downtime. 
  • Digital twins — Digital twins let reliability engineers simulate real-world conditions on digital duplicates of physical assets without risking downtime. They help identify potential issues and optimize maintenance strategies.
  • AI (and IIoT) — AI and IIoT technologies can provide real-time insights into equipment reliability to inform proactive maintenance strategies. 

7. Reliability-focused culture and cross-functional collaboration

Reliability engineers also recognize the role of people in creating safe and consistent maintenance practices. As a result, they prioritize the creation of a reliability-focused culture across the organization, from front-line staff to specialized maintenance teams, managers, and C-suites. Best practices to build this maintenance culture include: 

  • Ongoing education — This includes training for both the operations teams and maintenance staff. 
  • Standardizing procedures — Standardized procedures ensure that all staff complete maintenance and repair tasks the same way, reducing the risk of potential errors or omissions. 
  • Encouraging cross-functional teamwork — Siloed maintenance efforts can lead to gaps in analysis or repair that cause downtime or safety issues. Reliability engineers should prioritize cross-functional teamwork between equipment specialists, process engineers, technicians, and operators. 
  • Laying the groundwork for a proactive response — Firefighting only gets you so far. While you may stay ahead of the worst issues, you’ll inevitably encounter problems that lead to significant downtime. Using condition monitoring, AI, analytics, and RCFA sets the stage for proactive analysis that helps reduce the source of downtime. 

Investing in reliability engineering

Reliability engineering can help reduce downtime across the entire manufacturing maintenance cycle. Engineers enable companies to realize long-term savings, create safer work environments, and extend asset life.

No matter the size or scope of your manufacturing organization, you can benefit from the expertise of a certified reliability engineer. Depending on your current budget and operational goals, you may choose to invest in full-time engineering staff or partner with expert outsourced support to help implement new strategies and keep systems running. 

The best time to invest in reliability engineering? ASAP. As modern manufacturing processes evolve and downtime becomes the difference between profit and loss, reliability is the key to unlocking consistent performance and cost-effective maintenance strategies. Let’s talk. 
