代写CS-E4780 Scalable Systems and Data Management Course Project Efficient Pattern Detection over Data

CS-E4780 Scalable Systems and Data Management Course Project Efficient Pattern Detection over Data Streams

1 Background

Systems for event stream processing continuously evaluate queries over high-velocity event streams to detect user-specified patterns with low latency. Since the patterns become less important over time, it is crucial to detect them as quickly as possible. However, low-latency pattern detection is challenging, because query processing is stateful and the set of partial matches maintained by common algorithms for query evaluation grows exponentially in the size of the processed event data.

Handling intermediate states during query evaluation is particu-larly challenging under dynamic stream conditions. Event sources frequently exhibit variable arrival rates and diverse data distribu-tions, which in turn lead to highly fluctuating query selectivities. During periods of elevated load, the number of events to be pro-cessed may exceed the system’s computational capacity, rendering exhaustive query evaluation impractical or even infeasible. In such situations, maintaining responsiveness requires the system to apply load-shedding techniques. Rather than striving for all the results at the cost of unbounded delays, the system performs best-effort query evaluation that processes a subset of the intermediate com-putational results, aiming to maximize the quality of the reported results while adhering to strict latency constraints. This trade-off is fundamental in stream processing, where timely insights are often more critical than full completeness [8, 9].

2 Complex Event Processing

Complex Event Processing (CEP) is a prominent technology for robust and high-performance real-time detection of arbitrarily com-plex patterns in massive data streams[4]. It is widely employed in many areas where extremely large amounts of streaming data are continuously generated and need to be promptly and efficiently analyzed on-the-fly. Online finance, network security monitoring, credit card fraud detection, sensor networks, traffic monitoring, healthcare industry, and IoT applications are among the many ex-amples.

Complex Event Processing (CEP) has become integral to mod-ern data-driven environments because it allows organizations to identify complex temporal or semantic patterns in streaming data in real time [4]. Beyond just financial trading and fraud detection, CEP frameworks like FlinkCEP enable engineers to specify event patterns and react immediately when patterns are recognized [2]. Industry guides and glossaries emphasize that CEP matches incom-ing events against sophisticated patterns to recognize relationships and trigger real-time responses [5, 7]. Practical guides also highlight the architectural implications of CEP, noting that robust implemen-tations must handle streaming data at scale, maintain low latency, and support intelligent event sourcing [7]. In short, CEP systems are not only crucial for monitoring and security but also form. the backbone of modern applications that must react swiftly to complex sequences of events across many sectors [5].

The patterns detected by CEP engines are often of exceedingly high complexity and nesting level, and may include multiple com-plex operators and conditions on the data items. Moreover, these systems are typically required to operate under tight constraints on response time and detection precision, and to process multiple patterns and streams in parallel. Therefore, advanced algorithmic solutions and sophisticated optimizations must be utilized by CEP implementations to achieve an acceptable level of service quality.

The Figure 1 above presents an overview of OpenCEP structure[6]. Incoming data streams are analyzed on-the-fly and useful statistics and data characteristics are extracted to facilitate the optimizer in applying the aforementioned optimization techniques and maxi-mize the performance of the evaluation mechanism – a component in charge of the actual pattern matching.

By incorporating a multitude of state-of-the-art methods and algorithms for scalable event processing, OpenCEP can adequately satisfy the requirements of modern event-driven domains and out-perform. existing alternatives, both in terms of the actual perfor-mance and the provided functionality.

OpenCEP features a generic and intuitive API, making it eas-ily applicable to any domain where event-based streaming data is present.

3 Data Set: Citibike

The Citi Bike system dataset [3] provides a comprehensive record of bicycle-sharing trips in New York City, capturing millions of rides each month. The dataset is released as monthly compressed CSV files, with one row per trip, enabling fine-grained analysis of user mobility patterns. Each record includes the ride identifier, bi-cycle type (classic or electric), start and end timestamps, origin and destination stations (with names, identifiers, and geographic coordi-nates), and rider classification as either a subscriber (member) or oc-casional (casual) customer. When monthly data exceed one million trips, the dataset is split across multiple files within a single archive

Fig. 1: Working Flow of a CEP System

to maintain accessibility. These attributes make the dataset partic-ularly suitable for studying urban mobility, transportation demand forecasting, and the design of sustainable transportation systems.

3.1 Attributes and Format

Each trip record is structured with a consistent set of attributes: Ride ID (a unique trip identifier), Rideable type (e.g., classic or electric bike), Started at and Ended at timestamps, Start station name and Start station ID, End station name and End station ID, Start latitude and Start longitude, End latitude and End longitude, and finally, a categorical label indicating whether the trip was made by a member (annual subscriber) or a casual rider. This schema balances the need for detailed trip-level information with rider privacy, while still allowing researchers to study spatiotemporal demand dynamics, trip flows, and user behaviors.

4 Project Requirements and Problem Definition

This course project requires students to implement a basic load shed-ding strategy for a CEP system to handle bursty workloads while ensuring low-latency processing based on codebase OpenCEP[6]. Students will work with the Citi Bike dataset and evaluate the following query.

Query: detect hot paths. ‘Hot paths’ consist of stations where bicycles are accumulated faster than at other stations. They indicate the trend of movement for the bike fleet. Since bikes quickly accu-mulate in certain areas, the operator moves more than six thousand bicycles per day among stations. Hence, the detection of ‘hot paths’ of bike trips promises to improve operational efficiency.

PATTERN SEQ (BikeTrip+ a[], BikeTrip b)

WHERE a[i+1].bike = a[i].bike AND b.end in {7,8,9}

AND a[last].bike = b.bike AND a[i+1].start = a[i].end

WITHIN 1h

RETURN (a[1].start, a[i].end, b.end)

Listing 1: Citi Bike sequence pattern

Listing 1 shows a pattern detection query to detect such ‘hot paths’, using the SASE query language [1]: Within an hour time window, a bike is used in several subsequent trips, ending at particular stations, No. 7, No.8, and No.9 (7,8,9). Here, the Kleene closure operator detects arbitrary lengths of paths.

Project Requirement. Students are required to implement the state management of partial matches for the above query, including load shedding strategies to handle bursty workloads while ensuring low-latency processing. The implementation should be based on the OpenCEP codebase [6]. The measure of success will be the ability to process incoming events with (1) low latency, even under high load conditions, while (2) maintaining the recall of pattern detection.

4.1 Definitions

We summarize key Complex Event Processing (CEP) concepts used in this project and needed to reason about the query in Listing 1 and the required load shedding logic:

Primitive (raw) event A single input record arriving on a stream (e.g., one bike trip). It has a unique timestamp and a set of typed attributes.

Attribute Named field inside an event (e.g., start station id, end station id, start time).

Event time The timestamp embedded in the event payload de-scribing when the underlying real-world action occurred.

Processing time The wall-clock time at which the CEP engine observes/processes the event. Event time and processing time may differ under delay or disorder.

Complex event / Match A higher-level event produced when a set (typically an ordered sequence) of primitive events satis-fies a pattern.

Pattern A declarative specification (here SASE) describing struc-tural (ordering, repetition), temporal (window), and predicate constraints over events.

Sequence (SEQ) An operator requiring a temporal / positional order of constituent events.

Kleene closure (+) Operator allowing one-or-more repetitions of a sub-pattern, creating potentially unbounded numbers of partial matches.

Partial match / Partial state Data structure holding the events selected so far for a pattern instance that is not yet complete. Managing the population of partial matches dominates mem-ory and CPU cost.

State store Logical repository (in-memory tables, lists, indexes) the engine uses to retain partial matches and auxiliary meta-data during evaluation.

Correlation predicate A condition relating attributes across dif-ferent events in a partial match (e.g., same bike id, station chaining a[i+1].start = a[i].end).

Selection predicate / Filter A local condition on a single event (e.g., b.end in 7,8,9).

Window (WITHIN T) Temporal constraint limiting the maximal span (end time minus first event time) of a match; enables pruning of stale partial matches.

Selectivity Fraction of incoming events (or event combinations) that survive predicates or advance partial matches; impacts growth of state.

Load shedding Intentional dropping or early termination of pro-cessing for some events or partial matches under resource pressure to keep latency bounded.

Shedding unit The granularity at which the system discards work: event-level (skip ingest), partial-match-level (prune oldest / lowest-utility states), or predicate-level (relax a constraint temporarily).

Utility / Score Heuristic value estimating the expected contribu-tion of a partial match/event toward producing a future complete match (e.g., based on remaining window time, path length so far, station popularity).

Latency Time between arrival (or event time) of the last contribut-ing primitive event and emission of the complex event result.

Throughput Number of primitive events processed per second.

Recall Fraction of true pattern matches still reported after any shedding (measures correctness under best-effort mode).

Pruning Safe elimination of partial matches proven impossible to complete (e.g., window expired, violated chaining constraint) without loss of recall. Note this is distinct from load shedding.

Expiration Event or partial match removal triggered by window boundaries or watermarks.

Policy examples to shed (i) Probabilistic sampling of new partial matchs/events, (ii) Lowest-utility partial matchs/events drop, etc.

For this project, focus on: (1) measure the overload detection latency and throughput under bursty workloads, (2) implement a load shed-ding strategy that balances latency and recall under bursts (e.g., prioritizing longer chains nearing completion at target stations 7,8,9 within 1h), and (3) evaluate the effectiveness of the shedding strategy in maintaining low latency while maximizing recall.

5 Submission Requirements

Students must implement a system to solve the problems described in Section 4. Each student is required to submit a link to the system’s GitHub repository (ensure the repository is set to public visibil-ity) along with a 4 page report detailing the system. The detailed requirements for the project report are as follows:

5.1 Project Report Requirements

Template. Use the ACM Proceeding Template: LATEX or Word.

Report Organization. The report should be 4 pages and include the following sections. Students may use their own section titles, but the content should align with the following structure:

(1) Abstract.

(2) Introduction. Provide a brief background and overview of the project.

(3) System Architecture. Describe the design choices and mo-tivations in detail.

(4) Implementation. Describe how the system was imple-mented, including the algorithmic design, key data struc-tures, and the estimated time and space complexity of the core components.

(5) Performance Evaluation. Evaluate the system’s per-formance. Measure recall under latency bounds set to 10%, 30%, 50%, 70%, 90% of the original latency without load shedding. As an advanced requirement, assess scalabil-ity—how performance varies with resources (e.g., CPU cores) and workloads (e.g., events per second)—and report resource utilization (e.g., CPU and GPU usage).

(6) Conclusion.

References

[1] Jagrati Agrawal, Yanlei Diao, Daniel Gyllstrom, and Neil Immerman. 2008. Efficient pattern matching over event streams. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 147–160.

[2] Apache Software Foundation. 2025. FlinkCEP-Complex event processing for Flink. https://nightlies.apache.org/flink/flink-docs-master/docs/libs/cep/ Documenta-tion for an unreleased version of Apache Flink.

[3] Citi Bike. 2025. Citi Bike System Data. https://citibikenyc.com/system-data.

[4] Gianpaolo Cugola and Alessandro Margara. 2012. Processing flows of information: From data stream to complex event processing. ACM Computing Surveys (CSUR) 44, 3 (2012), 1–62.

[5] Databricks. 2024. What is Complex Event Processing [CEP]? https://www.databricks. com/glossary/complex-event-processing

[6] Ilya Kolchinsky. 2025. OpenCEP: Complex Event Processing Engine. https://github.com/ilya-kolchinsky/OpenCEP. Accessed: 2025-08-25.

[7] Redpanda Data. 2024. Complex event processing—Architecture and other practi-cal considerations. https://www.redpanda.com/guides/event-stream-processing-complex-event-processing

[8] Cong Yu, Tuo Shi, Matthias Weidlich, and Bo Zhao. 2025. SHARP: Shared State Reduction for Efficient Matching of Sequential Patterns. arXiv preprint arXiv:2507.04872 (2025).

[9] Bo Zhao, Nguyen Quoc Viet Hung, and Matthias Weidlich. 2020. Load shedding for complex event processing: Input-based and state-based techniques. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 1093–1104.




热门主题

课程名

mktg2509 csci 2600 38170 lng302 csse3010 phas3226 77938 arch1162 engn4536/engn6536 acx5903 comp151101 phl245 cse12 comp9312 stat3016/6016 phas0038 comp2140 6qqmb312 xjco3011 rest0005 ematm0051 5qqmn219 lubs5062m eee8155 cege0100 eap033 artd1109 mat246 etc3430 ecmm462 mis102 inft6800 ddes9903 comp6521 comp9517 comp3331/9331 comp4337 comp6008 comp9414 bu.231.790.81 man00150m csb352h math1041 eengm4100 isys1002 08 6057cem mktg3504 mthm036 mtrx1701 mth3241 eeee3086 cmp-7038b cmp-7000a ints4010 econ2151 infs5710 fins5516 fin3309 fins5510 gsoe9340 math2007 math2036 soee5010 mark3088 infs3605 elec9714 comp2271 ma214 comp2211 infs3604 600426 sit254 acct3091 bbt405 msin0116 com107/com113 mark5826 sit120 comp9021 eco2101 eeen40700 cs253 ece3114 ecmm447 chns3000 math377 itd102 comp9444 comp(2041|9044) econ0060 econ7230 mgt001371 ecs-323 cs6250 mgdi60012 mdia2012 comm221001 comm5000 ma1008 engl642 econ241 com333 math367 mis201 nbs-7041x meek16104 econ2003 comm1190 mbas902 comp-1027 dpst1091 comp7315 eppd1033 m06 ee3025 msci231 bb113/bbs1063 fc709 comp3425 comp9417 econ42915 cb9101 math1102e chme0017 fc307 mkt60104 5522usst litr1-uc6201.200 ee1102 cosc2803 math39512 omp9727 int2067/int5051 bsb151 mgt253 fc021 babs2202 mis2002s phya21 18-213 cege0012 mdia1002 math38032 mech5125 07 cisc102 mgx3110 cs240 11175 fin3020s eco3420 ictten622 comp9727 cpt111 de114102d mgm320h5s bafi1019 math21112 efim20036 mn-3503 fins5568 110.807 bcpm000028 info6030 bma0092 bcpm0054 math20212 ce335 cs365 cenv6141 ftec5580 math2010 ec3450 comm1170 ecmt1010 csci-ua.0480-003 econ12-200 ib3960 ectb60h3f cs247—assignment tk3163 ics3u ib3j80 comp20008 comp9334 eppd1063 acct2343 cct109 isys1055/3412 math350-real math2014 eec180 stat141b econ2101 msinm014/msing014/msing014b fit2004 comp643 bu1002 cm2030
联系我们
EMail: 99515681@qq.com
QQ: 99515681
留学生作业帮-留学生的知心伴侣!
工作时间:08:00-21:00
python代写
微信客服:codinghelp
站长地图